# Detecting Disaster Tweets
In this project, we will examine a dataset of Tweets to identify which Tweets are about a real disaster. The dataset is hosted on Kaggle as an introduction to Natural Language Processing:

### Real or Not? NLP with Disaster Tweets
#### Predict which Tweets are about real disasters and which ones are not
https://www.kaggle.com/c/nlp-getting-started

The data are downloaded and stored in this project folder in the `../downloads` directory. As a first step, let's examine the dataset:

In [1]:
import pandas as pd

train = pd.read_csv('../downloads/train.csv')
test = pd.read_csv('../downloads/test.csv')
sample_submission = pd.read_csv('../downloads/sample_submission.csv')

## Text cleaning
After some analysis (not shown here), a few things were discovered:

- NaN values in the keyword and location columns
- the text is stored as "escaped html" that appears as strange codes within text
- there are some unicode character errors
- the text contains URLs
- lots of spelling errors!

Let's clean them up...

In [2]:
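# assign the result back; in-place fillna on a selected column is unreliable in recent pandas
train['keyword'] = train['keyword'].fillna('')
test['keyword'] = test['keyword'].fillna('')
train['location'] = train['location'].fillna('')
test['location'] = test['location'].fillna('')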

import html
import re

def clean_text(text):
    text = html.unescape(text)                             # decode HTML escape codes
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
    text = re.sub(r'http\S+', '', text)                    # delete URLs
    text = re.sub(r'%20', '_', text)                       # replace URL-encoded spaces in keywords with underscores
    text = re.sub(r'\n', ' ', text)                        # replace newlines with spaces
    text = re.sub(r'#\S+', '', text)                       # delete hashtags
    text = re.sub(r'@\S+', '', text)                       # delete @mentions
    return text
    
train['text'] = train['text'].apply(clean_text)
train['keyword'] = train['keyword'].apply(clean_text)
train['location'] = train['location'].apply(clean_text)

test['text'] = test['text'].apply(clean_text)
test['keyword'] = test['keyword'].apply(clean_text)
test['location'] = test['location'].apply(clean_text)

In [3]:
train.sample(5, random_state = 1)

Unnamed: 0,id,keyword,location,text,target
3228,4632,emergency_services,"Sydney, New South Wales",Goulburn man Henry Van Bilsen missing: Emergen...,1
3706,5271,fear,,The things we fear most in organizations--fluc...,0
6957,9982,tsunami,Land Of The Kings,?? hey Esh,0
2887,4149,drown,,you until you drown by water entering the lun...,0
7464,10680,wounds,"cody, austin follows ?*?",Crawling in my skin These wounds they will not...,1


In [4]:
test.sample(5, random_state = 1)

Unnamed: 0,id,keyword,location,text
1787,6035,heat_wave,"Brooklyn, NY",I added some dumb ideas to beat the heat chea...
666,2168,catastrophic,,If a 1 rise in wages is going to have such a c...
93,317,annihilated,,How do I step outside for 5 seconds and get an...
2924,9682,tornado,Canada,Tornado warnings issued for southern Alberta
1735,5857,hailstorm,"Calgary, Alberta, Canada",Get out of the hailstorm and come down to 2ni...


In [5]:
sample_submission.sample(5, random_state = 1)

Unnamed: 0,id,target
1787,6035,0
666,2168,0
93,317,0
2924,9682,0
1735,5857,0


The train.csv file contains a table with the following columns:
- id
- keyword
- location
- text
- target

The test.csv file contains the same columns as train.csv except target, which is missing. Our objective is to predict these values. The sample_submission.csv file contains only the id and target columns. To take part in the Kaggle competition, we will submit a CSV file of id and target values to see how our model compares with those of other competitors.
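
As a rough illustration of the submission format described above, the sketch below builds a two-column table and writes it as CSV text. The ids and predictions are hypothetical placeholders, not model output, and the names `test_ids`, `predictions`, and `submission` are assumptions made for this example only.

```python
import io
import pandas as pd

# Hypothetical stand-ins: three test ids and placeholder predictions
test_ids = pd.Series([0, 2, 3], name='id')
predictions = [0, 1, 0]  # placeholder values, not real model output

submission = pd.DataFrame({'id': test_ids, 'target': predictions})

# Write without the DataFrame index so the output has exactly two columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the in-memory buffer would write the same content to disk.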

## Exploratory Data Analysis (EDA)
Let's ask a few questions that we can answer by exploring the data:

- Do the test.csv and sample_submission.csv files have identical id values?
- How large are the test and train sets? 
- How many unique keywords are there?
- Are there any keywords not common among train and test sets?
- How many locations are there?
- Are there any locations not common among train and test sets?
- Are there any duplicated values in either set?
- How is the train target distributed?

In [6]:
# Do the test.csv and sample_submission.csv files have identical id values?

import numpy as np

assert np.all(test.id == sample_submission.id)

In [7]:
# How large are the test and train sets? 

print('Train size:', len(train.index))
print('Test size: ', len(test.index))

Train size: 7613
Test size:  3263


In [8]:
# How many unique keywords are there?
print('Number of train set keywords:', train.keyword.nunique())
print('Number of test set keywords: ', test.keyword.nunique())

Number of train set keywords: 222
Number of test set keywords:  222


In [9]:
# Are there any keywords not common among train and test sets?

train_keywords = set(train.keyword)
test_keywords = set(test.keyword)
# use the symmetric difference so keywords missing from either set are counted
print('Number of keywords not common to both sets:', len(test_keywords.symmetric_difference(train_keywords)))

Number of keywords not common to both sets: 0


In [10]:
# How many unique locations are there?

print('Number of train set locations:', train.location.nunique())
print('Number of test set locations: ', test.location.nunique())

Number of train set locations: 3313
Number of test set locations:  1584


In [11]:
# Are there any locations not common among train and test sets?

train_locations = set(train.location)
test_locations = set(test.location)
print('Number of locations not common to both sets:', len(test_locations.symmetric_difference(train_locations)))
print('Number of locations common to both sets:', len(test_locations.intersection(train_locations)))

Number of locations not common to both sets: 4053
Number of locations common to both sets: 422


In [12]:
# Are there any duplicated values in either set?

train_duplicates = train.duplicated(subset=['text'],keep=False)
test_duplicates = test.duplicated(subset=['text'],keep=False)

print('Duplicated train items:', np.sum(train_duplicates))
print('Duplicated test items: ', np.sum(test_duplicates))

Duplicated train items: 983
Duplicated test items:  261


In [13]:
# How is the train target distributed?

import matplotlib.pyplot as plt

train.target.value_counts().plot(kind='bar')
plt.show()

<Figure size 640x480 with 1 Axes>

## Duplicates
Both the test and training sets have duplicated values. Let's explore these a bit more: 

In [14]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = 1000
train[train_duplicates].sort_values(['text','location'])

Unnamed: 0,id,keyword,location,text,target
2165,3106,debris,,MH370: Aircraft debris found on La Reunion...,1
2168,3109,debris,,MH370: Aircraft debris found on La Reunion...,1
2170,3112,debris,,MH370: Aircraft debris found on La Reunion...,1
2175,3118,debris,,MH370: Aircraft debris found on La Reunion...,1
2182,3126,debris,,MH370: Aircraft debris found on La Reunion...,1
6413,9171,suicide_bomber,,Suicide bomber kills 15 in Saudi security ...,1
6416,9174,suicide_bomber,,Suicide bomber kills 15 in Saudi security ...,1
6426,9191,suicide_bomber,,Suicide bomber kills 15 in Saudi security ...,1
4221,5996,hazardous,,slips into loss after unsafe and hazardou...,1
4239,6023,hazardous,"Mysore, Karnataka",slips into loss after unsafe and hazardou...,1


## Dealing with the duplicates
Duplicated data may cause problems during model training: every copy contributes to the cost function, so duplicated tweets are weighted more heavily than unique ones. Moreover, several of the duplicates have different target values, and these contradictory cases need special treatment. Here are several ways to deal with them:

- Remove all duplicates from the set
- Select only the first occurrence of each duplicate
- For contradictory target values, select the majority case
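
The third option can be sketched roughly as follows. This is a minimal illustration on hypothetical toy data, not the approach taken in this notebook. The names `df`, `conflicting`, `majority`, and `deduped` are made up for the example, and resolving ties toward 1 is an arbitrary assumption.

```python
import pandas as pd

# Toy data (hypothetical): 'a' has conflicting labels, 'b' and 'c' are consistent
df = pd.DataFrame({
    'text':   ['a', 'a', 'a', 'b', 'b', 'c'],
    'target': [1,   1,   0,   0,   0,   1],
})

# Texts whose copies disagree on the target
labels_per_text = df.groupby('text')['target'].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# Majority vote: for a 0/1 target, a mean >= 0.5 means at least half the copies are 1
# (ties resolve to 1 here, which is an arbitrary choice)
majority = (df.groupby('text')['target'].mean() >= 0.5).astype(int)

# Keep one row per text and overwrite its target with the majority label
deduped = df.drop_duplicates(subset='text', keep='first').copy()
deduped['target'] = deduped['text'].map(majority)

print(conflicting)
print(deduped)
```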

For this analysis, let's select the first occurrence:

In [15]:
train_duplicates = train.duplicated(subset=['text'],keep='first')
train = train[~train_duplicates]
assert 0 == np.sum(train.duplicated(subset=['text']))

## Locations and keywords as features
While the train and test sets have 422 locations in common, they also have 4053 locations that appear in only one of the two sets. Such locations are unlikely to help a model generalize. For this reason, we may consider dropping them as features. We can do this by one-hot encoding the common locations while reserving an 'unknown' state for the uncommon ones.
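
A minimal sketch of that encoding, on hypothetical toy frames, might look like the following. The helper `encode_locations`, the `loc` column prefix, and the toy location values are assumptions for illustration. Fixing the category list is one way to make sure both sets end up with identical columns.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values)
train_toy = pd.DataFrame({'location': ['USA', 'Canada', 'Mars', '']})
test_toy = pd.DataFrame({'location': ['Canada', 'USA', 'Atlantis']})

# Locations that appear in both sets
common = set(train_toy['location']) & set(test_toy['location'])

def encode_locations(frame, common_locations):
    # Collapse every location outside the common set into 'unknown'
    is_common = frame['location'].isin(common_locations)
    collapsed = frame['location'].where(is_common, 'unknown')
    # Fix the category list so every frame gets the same columns in the same order
    categories = sorted(common_locations) + ['unknown']
    collapsed = pd.Series(pd.Categorical(collapsed, categories=categories),
                          index=frame.index)
    return pd.get_dummies(collapsed, prefix='loc')

train_encoded = encode_locations(train_toy, common)
test_encoded = encode_locations(test_toy, common)
print(train_encoded)
```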

The keywords form a set that is common among both train and test sets. It makes sense to include these as one-hot encoded features. Let's identify and save the common keywords and locations.

In [16]:
keywords = set(train.keyword).intersection(set(test.keyword))
print('Number of keywords',len(keywords))

locations = set(train.location).intersection(set(test.location))

with open('../data/keywords.txt','w') as file:
    for kw in sorted(keywords):
        file.write(kw + '\n')

import html

# write one tweet per line, unescaped and reduced to ASCII
with open('../data/tweets.txt','w') as file:
    for tweet in train.text:
        text = html.unescape(tweet)
        text = str(text.encode('ascii','ignore'),'utf8') + '\n'
        file.write(text)

Number of keywords 222


In [17]:
train.to_csv('../data/train_cleaned.csv')
test.to_csv('../data/test_cleaned.csv')

In [18]:
s = 'Twelve feared killed in Pakistani air ambulance helicopter crash '
for tweet in train[train['text'].str.contains(s)]['text']:
    print(tweet.split())

['Twelve', 'feared', 'killed', 'in', 'Pakistani', 'air', 'ambulance', 'helicopter', 'crash']
['Twelve', 'feared', 'killed', 'in', 'Pakistani', 'air', 'ambulance', 'helicopter', 'crash', '-', 'Reuters']
['Twelve', 'feared', 'killed', 'in', 'Pakistani', 'air', 'ambulance', 'helicopter', 'crash']
['Twelve', 'feared', 'killed', 'in', 'Pakistani', 'air', 'ambulance', 'helicopter', 'crash']
['Twelve', 'feared', 'killed', 'in', 'Pakistani', 'air', 'ambulance', 'helicopter', 'crash']


In [19]:
pakistan_duplicates = train['text'].str.contains(s)
pak_dups = train[pakistan_duplicates].copy()
for index, row in pak_dups.iterrows():
    print(row.id, row.keyword, row.location, row.text)

247 ambulance Jackson  Twelve feared killed in Pakistani air ambulance helicopter crash 
249 ambulance  Twelve feared killed in Pakistani air ambulance helicopter crash - Reuters  
253 ambulance  Twelve feared killed in Pakistani air ambulance helicopter crash 
261 ambulance   Twelve feared killed in Pakistani air ambulance helicopter crash   
287 ambulance USA Twelve feared killed in Pakistani air ambulance helicopter crash  


In [20]:
#pak_dups = pak_dups.drop(['location'],inplace=True,axis=1)
pak_dups['text'] = pak_dups['text'].str.split()
pak_dups

Unnamed: 0,id,keyword,location,text,target
172,247,ambulance,Jackson,"[Twelve, feared, killed, in, Pakistani, air, a...",1
174,249,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
177,253,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
182,261,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
203,287,ambulance,USA,"[Twelve, feared, killed, in, Pakistani, air, a...",1


Looking offline at the saved files, it's apparent that there are still lots of duplicates. Some of these have different locations, others have minor differences in the text such as including different hashtags and punctuation. Further cleaning would be ideal, but it's not clear how deep it should go.
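
One possible way to flag such near-duplicates, sketched under assumptions, is to compare tweets on a normalized key rather than on the raw text. The helper `normalize_for_matching`, the `match_key` column, and the toy tweets below are hypothetical. This only handles differences in case, punctuation, and whitespace, so tweets that differ by extra words would still be missed.

```python
import re
import pandas as pd

def normalize_for_matching(text):
    """Return a simplified copy of `text`, used only for comparing tweets."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # replace punctuation and symbols with spaces
    return ' '.join(text.split())             # collapse repeated whitespace

# Toy examples (hypothetical), modeled on the kind of variation described above
tweets = pd.DataFrame({'text': [
    'Twelve feared killed in helicopter crash',
    'Twelve feared killed in helicopter crash!!  ',
    'twelve feared killed in   helicopter crash',
    'Tornado warnings issued for southern Alberta',
]})

tweets['match_key'] = tweets['text'].apply(normalize_for_matching)

# True for every row whose normalized text already appeared earlier
near_duplicates = tweets.duplicated(subset=['match_key'], keep='first')
print(tweets[~near_duplicates]['text'])
```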

Let's put together a function to detect duplicates. The challenge is that many of them are not exactly identical, so the pandas `.duplicated()` method misses them even though they read as duplicates. Some have different keywords. These can be eliminated by 

In [21]:
import dill
dill.dump_session('../session/DataWrangling_EDA.db')

In [22]:
import dill
dill.load_session('../session/DataWrangling_EDA.db')

In [23]:
pak_dups

Unnamed: 0,id,keyword,location,text,target
172,247,ambulance,Jackson,"[Twelve, feared, killed, in, Pakistani, air, a...",1
174,249,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
177,253,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
182,261,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
203,287,ambulance,USA,"[Twelve, feared, killed, in, Pakistani, air, a...",1


In [24]:
pak_dups.sample(2)

Unnamed: 0,id,keyword,location,text,target
177,253,ambulance,,"[Twelve, feared, killed, in, Pakistani, air, a...",1
172,247,ambulance,Jackson,"[Twelve, feared, killed, in, Pakistani, air, a...",1
