# Exploratory Data Analysis

In [2]:
import pandas as pd
pd.set_option('display.max_rows', None)

In [3]:
train = pd.read_csv('data/train.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


# Summary Statistics

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


The id, text and target columns have no missing values. keyword is missing in 61 rows and location in 2,533 rows.
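
The missing counts can also be read off directly instead of subtracting from the row count. A minimal sketch on a made-up toy frame (the `toy` data and the `missing` name are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "ablaze", "ablaze", "accident"],
    "location": [None, None, "USA", "London"],
    "text": ["a", "b", "c", "d"],
    "target": [1, 0, 0, 1],
})

# isnull() gives a boolean frame: sum() counts missing cells per column,
# mean() gives the missing share per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
})
print(missing)
```

Applied to the real data, the same two calls on train should reproduce the gaps implied by train.info().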

In [5]:
train.describe(include='all')

Unnamed: 0,id,keyword,location,text,target
count,7613.0,7552,5080,7613,7613.0
unique,,221,3341,7503,
top,,fatalities,USA,11-Year-Old Boy Charged With Manslaughter of T...,
freq,,45,104,10,
mean,5441.934848,,,,0.42966
std,3137.11609,,,,0.49506
min,1.0,,,,0.0
25%,2734.0,,,,0.0
50%,5408.0,,,,0.0
75%,8146.0,,,,1.0


Some texts are duplicated (7,503 unique values in 7,613 rows), possibly retweets. Target is binary (0 or 1); around 43% of the tweets in the training set are about disasters.
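
The class balance can be checked explicitly rather than inferred from the mean. A small sketch with made-up labels (in the notebook this would presumably be train['target']):

```python
import pandas as pd

# Toy labels, made up for illustration.
target = pd.Series([1, 0, 0, 1, 0], name="target")

# Relative frequency of each class.
share = target.value_counts(normalize=True)
print(share)

# For 0/1 labels, the mean equals the share of the positive class.
positive_share = target.mean()
```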

# Duplicate text

In [6]:
train[train['text'].duplicated(keep=False)]

Unnamed: 0,id,keyword,location,text,target
40,59,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
48,68,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
106,156,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
115,165,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
118,171,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
119,172,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
147,211,airplane%20accident,,Experts in France begin examining airplane deb...,1
164,238,airplane%20accident,,Experts in France begin examining airplane deb...,1
610,881,bioterrorism,,To fight bioterrorism sir.,1
624,898,bioterrorism,,To fight bioterrorism sir.,0


In [7]:
train[train[['keyword','location','text','target']].duplicated(keep=False)]

Unnamed: 0,id,keyword,location,text,target
40,59,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
48,68,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
106,156,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
115,165,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
118,171,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
119,172,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
147,211,airplane%20accident,,Experts in France begin examining airplane deb...,1
164,238,airplane%20accident,,Experts in France begin examining airplane deb...,1
610,881,bioterrorism,,To fight bioterrorism sir.,1
624,898,bioterrorism,,To fight bioterrorism sir.,0


87 entries are complete duplicates (all columns except id); 179 have duplicated text.
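
Counts like these can be obtained by summing the boolean mask that duplicated() returns. The sketch below uses a made-up toy frame, and it is an assumption that the figures above were counted with keep=False (every member of a duplicate group counted):

```python
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": ["ablaze", "ablaze", "ablaze", "fire", "fire"],
    "location": ["US", "US", "UK", None, None],
    "text": ["same", "same", "same", "x", "y"],
    "target": [0, 0, 0, 1, 1],
})

content_cols = ["keyword", "location", "text", "target"]

# keep=False marks every row that has at least one twin.
n_full = toy.duplicated(subset=content_cols, keep=False).sum()
n_text = toy.duplicated(subset=["text"], keep=False).sum()
print(n_full, n_text)
```

With keep='first' instead, only the surplus copies would be counted, so the two conventions give different totals.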

In [8]:
train[train[['text','target']].duplicated(keep=False)]

Unnamed: 0,id,keyword,location,text,target
40,59,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
48,68,ablaze,Live On Webcam,Check these out: http://t.co/rOI2NSmEJJ http:/...,0
106,156,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
115,165,aftershock,US,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/vA...,0
118,171,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
119,172,aftershock,Switzerland,320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/TH...,0
147,211,airplane%20accident,,Experts in France begin examining airplane deb...,1
164,238,airplane%20accident,,Experts in France begin examining airplane deb...,1
610,881,bioterrorism,,To fight bioterrorism sir.,1
624,898,bioterrorism,,To fight bioterrorism sir.,0


Some tweets have the same text but a different target, though only in 20 cases.
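
One way to isolate the conflicting cases is to count distinct labels per text. A minimal sketch on made-up rows (the variable names are illustrative assumptions):

```python
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "text": [
        "To fight bioterrorism sir.",
        "To fight bioterrorism sir.",
        "Check these out",
        "Check these out",
        "unique",
    ],
    "target": [1, 0, 0, 0, 1],
})

# Number of distinct labels attached to each text.
labels_per_text = toy.groupby("text")["target"].nunique()

# Texts that carry more than one label.
conflicting_texts = labels_per_text[labels_per_text > 1].index
conflicts = toy[toy["text"].isin(conflicting_texts)]
print(conflicts)
```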

# Keyword

## Missing data

In [16]:
train[train['keyword'].isnull()].head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [17]:
train['keyword'].isnull().sum()/len(train)

0.008012610009194798

Less than 1% of keyword values are missing (about 0.8%).

## Univariate Analysis

In [9]:
df = train.groupby('keyword').size().reset_index(name='count')
df.sort_values(by='keyword')

Unnamed: 0,keyword,count
0,ablaze,36
1,accident,35
2,aftershock,34
3,airplane%20accident,35
4,ambulance,38
5,annihilated,34
6,annihilation,29
7,apocalypse,32
8,armageddon,42
9,army,34


In [10]:
df = train.groupby('keyword').size().reset_index(name='count')
df.sort_values(by='count')

Unnamed: 0,keyword,count
160,radiation%20emergency,9
134,inundation,10
194,threat,11
94,epicentre,12
115,forest%20fire,19
164,rescue,22
209,war%20zone,24
39,bush%20fires,25
15,battle,26
208,volcano,27


221 different keywords, each used between 9 and 45 times. Many keywords are similar: synonyms, or verbs and nouns with similar meaning (e.g. body bagging, body bag). Multiword keywords are separated by %20.
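
Since %20 is the URL encoding of a space, the keywords can be decoded with the standard library. A small sketch, assuming the keywords are plain percent-encoded strings (the sample values are taken from the outputs above, the None stands in for a missing keyword):

```python
from urllib.parse import unquote

import pandas as pd

keywords = pd.Series(["airplane%20accident", "forest%20fire", "ablaze", None])

# na_action="ignore" leaves missing values untouched instead of passing them to unquote.
decoded = keywords.map(unquote, na_action="ignore")
print(decoded)
```

A plain string replacement of %20 would work as well; unquote additionally handles any other percent-encoded characters, should there be any.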

## Keyword and target

In [12]:
df = train.groupby(['keyword','target']).size().reset_index(name='count')
df.sort_values(by=['keyword','target'])

Unnamed: 0,keyword,target,count
0,ablaze,0,23
1,ablaze,1,13
2,accident,0,11
3,accident,1,24
4,aftershock,0,34
5,airplane%20accident,0,5
6,airplane%20accident,1,30
7,ambulance,0,18
8,ambulance,1,20
9,annihilated,0,23


## Keyword ideas

Keyword may be a good feature, but there are many different ones and they are all used fairly evenly (9-45 times each).
Many keywords are similar. Stem them, group them manually, or convert them to word vectors?
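
To judge how much signal a keyword carries, the per-keyword counts can be turned into a share of disaster tweets. The sketch below reuses the first rows of the keyword/target output; the `disaster_share` column name is an illustrative assumption:

```python
import pandas as pd

# First rows of the keyword/target counts shown in the notebook output.
counts = pd.DataFrame({
    "keyword": ["ablaze", "ablaze", "accident", "accident", "aftershock",
                "airplane%20accident", "airplane%20accident"],
    "target": [0, 1, 0, 1, 0, 0, 1],
    "count": [23, 13, 11, 24, 34, 5, 30],
})

# One row per keyword, one column per target value; absent combinations become 0.
wide = counts.pivot(index="keyword", columns="target", values="count").fillna(0)

# Share of tweets per keyword that are labelled as disasters.
wide["disaster_share"] = wide[1] / (wide[0] + wide[1])
print(wide.sort_values("disaster_share", ascending=False))
```

Keywords with a share close to 0 or 1 (aftershock and airplane%20accident in this excerpt) separate the classes much better than those near 0.5.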

# Location

In [18]:
df = train.groupby('location').size().reset_index(name ='count')
df = df.sort_values(by='count',ascending=False)
df

Unnamed: 0,location,count
2643,USA,104
1826,New York,71
2662,United States,50
1506,London,45
587,Canada,29
1860,Nigeria,28
2632,UK,27
1534,"Los Angeles, CA",26
1262,India,24
1719,Mumbai,22


Many locations only occur once. Check how much is covered by the top X locations.

In [19]:
df['cum percent'] = round(100.0*df['count'].cumsum()/df['count'].sum())
df['nr'] = range(len(df))
df

Unnamed: 0,location,count,cum percent,nr
2643,USA,104,2.0,0
1826,New York,71,3.0,1
2662,United States,50,4.0,2
1506,London,45,5.0,3
587,Canada,29,6.0,4
1860,Nigeria,28,6.0,5
2632,UK,27,7.0,6
1534,"Los Angeles, CA",26,7.0,7
1262,India,24,8.0,8
1719,Mumbai,22,8.0,9
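
The coverage of the top X locations can also be computed for a single X. A sketch using the ten counts from the table above and the 5080 non-null locations reported by train.info(); the helper name `top_x_coverage` is hypothetical:

```python
import pandas as pd

# Counts of the ten most frequent locations, copied from the table above.
top_counts = pd.Series([104, 71, 50, 45, 29, 28, 27, 26, 24, 22])
total = 5080  # non-null location count from train.info()


def top_x_coverage(counts_desc, total, x):
    """Percent of rows covered by the x most frequent values.

    counts_desc must already be sorted in descending order.
    """
    return 100.0 * counts_desc.iloc[:x].sum() / total


cov = top_x_coverage(top_counts, total, 10)
print(round(cov, 1))
```

The ten most frequent locations together cover only about 8% of the rows that have a location at all.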


## Location ideas

There are too many locations with too few data points to use the column as is. Many locations do not seem to be real-world locations (e.g. 'Where I Need To Be'). Use a geocoder to find the longitude/latitude and validity of locations?

# Conclusion

## Cleanup
* Remove text duplications
* Keyword and location have missing data that might need to be taken care of
* Remove %20 in keywords, replace by whitespace?
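
A possible shape for the first two cleanup steps, sketched on a made-up toy frame. Keeping the first copy of each text and filling gaps with the placeholder 'unknown' are assumptions, not decisions taken in this notebook:

```python
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": ["ablaze", "ablaze", None, "accident"],
    "location": [None, None, "USA", "London"],
    "text": ["dup", "dup", "b", "c"],
    "target": [1, 1, 0, 0],
})

clean = (
    toy.drop_duplicates(subset="text", keep="first")  # one row per distinct text
       .fillna({"keyword": "unknown", "location": "unknown"})  # placeholder is an assumption
       .reset_index(drop=True)
)
print(clean)
```

Note that keep='first' silently picks one label when duplicated texts disagree on the target, so those cases may deserve separate handling before deduplication.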

## Feature Engineering Ideas

### Keywords
Convert keywords into something more useful for classification?
* Stem/lemmatize them
* Convert to word vecs?
* Cluster?
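
As a cheap first look before stemming or clustering, near-identical keywords can be surfaced with plain string similarity. This is a swapped-in heuristic, not a stemmer; the helper `similar_pairs` and the 0.75 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

# A handful of keywords mentioned in this notebook (with %20 already decoded).
keywords = ["annihilated", "annihilation", "body bag", "body bagging",
            "ablaze", "ambulance"]


def similar_pairs(words, threshold=0.75):
    """Return (a, b, ratio) for every pair whose similarity ratio reaches the threshold."""
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            ratio = SequenceMatcher(None, a, b).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs


pairs = similar_pairs(keywords)
for a, b, ratio in pairs:
    print(a, "~", b, ratio)
```

Such pairs are only candidates for manual grouping; character overlap does not catch synonyms, which is where word vectors would come in.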

### Location
Convert location into something more useful? Use a geocoder to convert to longitude/latitude and to find out if the location is a valid position in the world.