In [6]:
import os
import numpy as np
import pandas as pd
from pathlib import Path

In [11]:
data_folder = Path(os.getcwd()).parents[1].joinpath('data')

In [15]:
train_df = pd.read_csv(data_folder.joinpath('train.csv'))
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Check skewness of target data

In [20]:
train_df['target'].value_counts()/len(train_df['target'])

0    0.57034
1    0.42966
Name: target, dtype: float64

About 43% of the tweets are labeled as related to real disasters, so class imbalance does not look like a big issue in this dataset.  
However, a real-life sample of tweets might well show a more significant class imbalance.
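The same class shares can be obtained without the manual division by passing `normalize=True` to `value_counts`. A minimal sketch on made-up labels (the `toy_target` series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy labels (made up for illustration): 1 = disaster, 0 = not a disaster
toy_target = pd.Series([0, 0, 0, 1, 1], name="target")

# normalize=True returns proportions directly instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data this would be `train_df['target'].value_counts(normalize=True)`.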

## Significance of the variable location 

In [25]:
train_df['location'].isna().value_counts()/len(train_df)

False    0.66728
True     0.33272
Name: location, dtype: float64

About 67% of the tweets have a non-null location.

In [26]:
train_df['valid_location'] = np.where(train_df['location'].isna(), 0, 1)

In [38]:
train_agg = train_df.groupby('valid_location').agg(sum_target=('target', 'sum'),
                                                   length=('target', 'size')).reset_index()
train_agg['prop_target'] = train_agg['sum_target']/train_agg['length']
train_agg

Unnamed: 0,valid_location,sum_target,length,prop_target
0,0,1075,2533,0.424398
1,1,2196,5080,0.432283


The presence of a location does not, in and of itself, seem to be a good indicator of whether a tweet is associated with an actual disaster: the disaster rate is about 42% without a location and about 43% with one.
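One way to put a number on how small this difference is would be a two-proportion z-test on the counts from the aggregated table above. This is only a sketch, under the usual assumptions of that test (independent observations, large samples); it is not part of the original analysis, and the variable names are illustrative.

```python
import math

# Counts taken from the aggregated table above
disaster_no_loc, total_no_loc = 1075, 2533
disaster_with_loc, total_with_loc = 2196, 5080

p_no_loc = disaster_no_loc / total_no_loc
p_with_loc = disaster_with_loc / total_with_loc

# Pooled proportion under the null hypothesis that both groups share one rate
pooled = (disaster_no_loc + disaster_with_loc) / (total_no_loc + total_with_loc)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_no_loc + 1 / total_with_loc))

z_score = (p_with_loc - p_no_loc) / std_err

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print(f"z = {z_score:.2f}, p = {p_value:.2f}")
```

The z-score lands well inside the conventional ±1.96 band, which is consistent with the reading that the mere presence of a location carries little signal.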

## Keyword Variable
In comparison, the keyword variable is populated for almost all of the tweets, with a large majority of the keywords (seemingly) related to disasters.

In [48]:
train_df['keyword'].isna().value_counts()/len(train_df)

False    0.991987
True     0.008013
Name: keyword, dtype: float64

In [47]:
train_df['keyword'].nunique()

221

Many of the keywords share the same root word ('wreck' in the example below). Lemmatizing the keywords could collapse these variants into fewer distinct values and make it easier to assess their relevance.

In [53]:
train_df.loc[train_df['keyword'].str.contains('wreck', na=False), 'keyword'].value_counts()

wreckage    39
wrecked     39
wreck       37
Name: keyword, dtype: int64
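To illustrate what collapsing these variants could look like, here is a deliberately naive sketch that strips a few suffixes. The `SUFFIXES` list and the `naive_root` helper are hypothetical stand-ins; a real pipeline would more likely use a proper stemmer or lemmatizer from an NLP library.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list (not a real stemming algorithm)
SUFFIXES = ("age", "ed", "ing", "s")

def naive_root(keyword: str) -> str:
    """Strip the first matching suffix, keeping at least four characters of stem."""
    for suffix in SUFFIXES:
        if keyword.endswith(suffix) and len(keyword) - len(suffix) >= 4:
            return keyword[: -len(suffix)]
    return keyword

# Keyword counts copied from the output above
keyword_counts = {"wreckage": 39, "wrecked": 39, "wreck": 37}

# Sum the counts of all keywords that reduce to the same root
root_counts = Counter()
for keyword, count in keyword_counts.items():
    root_counts[naive_root(keyword)] += count

print(dict(root_counts))
```

Under this toy rule all three keywords fall under the single root `wreck`, which is the kind of grouping lemmatization is expected to provide.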

## Unique Tweet Contents

We also observe that 69 distinct tweet texts appear more than once in the dataset.

In [63]:
train_text = train_df.groupby('text').agg(count = ('target',len)).reset_index().sort_values('count', ascending=False)
train_text.head()

Unnamed: 0,text,count
646,11-Year-Old Boy Charged With Manslaughter of T...,10
45,#Bestnaijamade: 16yr old PKK suicide bomber wh...,6
6131,The Prophet (peace be upon him) said 'Save you...,6
3589,He came to a land which was engulfed in tribal...,6
4589,Madhya Pradesh Train Derailment: Village Youth...,5


In [68]:
train_text[train_text['count']>1].shape[0]

69
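The number of repeated texts and the number of surplus rows are two different quantities. The sketch below shows both on made-up data (`toy_df` is hypothetical), using `value_counts` and `DataFrame.duplicated` as an alternative to the `groupby` above.

```python
import pandas as pd

# Toy data (made up): "fire downtown" appears three times, "flood warning" twice
toy_df = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "flood warning",
             "nice sunset", "flood warning", "fire downtown"],
    "target": [1, 1, 1, 0, 0, 1],
})

# Number of rows per distinct text
text_counts = toy_df["text"].value_counts()

# Distinct texts that occur more than once
n_repeated_texts = int((text_counts > 1).sum())

# Rows that are surplus copies (every occurrence after the first)
n_extra_rows = int(toy_df.duplicated(subset="text", keep="first").sum())

print(n_repeated_texts, n_extra_rows)
```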

In [69]:
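# Note: this keeps every distinct text, not only the repeated ones;
# the nunique() check below still isolates the texts with conflicting labels
dup_texts = train_text['text'].unique()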

In some of these cases the target values also disagree, i.e. different instances of the same tweet carry different labels.

In [79]:
target_unq_cnt = train_df.loc[train_df['text'].isin(dup_texts)].groupby('text')['target'].nunique().reset_index().sort_values('target', ascending=False)
target_unq_cnt

Unnamed: 0,text,target
7265,like for the music video I want some real acti...,2
3618,Hellfire! We donÛªt even want to think about ...,2
6131,The Prophet (peace be upon him) said 'Save you...,2
4193,In #islam saving a person is equal in reward t...,2
6353,To fight bioterrorism sir.,2
...,...,...
2496,Back from Seattle Tacoma and Portland. Whirlwi...,1
2495,Baby elephant dies just days after surviving m...,1
2494,BUT I will be uploading these videos ASAP so y...,1
2493,BREAKING: Terror Attack On\nPolice Post #Udhampur,1


Texts with conflicting labels are removed from the training set.

In [82]:
mismatch_target_list = target_unq_cnt.loc[target_unq_cnt['target']>1, 'text'].unique()
train_df_clean = train_df.loc[~train_df['text'].isin(mismatch_target_list)]
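The same filtering can be written in a single step with `groupby(...).transform('nunique')`. A minimal sketch on made-up data (`toy_df` and its contents are hypothetical):

```python
import pandas as pd

# Toy data (made up): "flood warning" is labeled both 1 and 0
toy_df = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "flood warning",
             "flood warning", "nice sunset"],
    "target": [1, 1, 1, 0, 0],
})

# For every row, count the distinct labels attached to its text
labels_per_text = toy_df.groupby("text")["target"].transform("nunique")

# Keep only rows whose text carries a single, consistent label
consistent_df = toy_df[labels_per_text == 1]
print(consistent_df)
```

Here both "flood warning" rows are dropped, while the consistently labeled repeats of "fire downtown" are kept.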

We also eliminate any duplicated rows.

In [85]:
train_df_clean = train_df_clean.drop_duplicates(keep='first')
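One caveat: `drop_duplicates` compares every column by default. If the `id` column is unique per row (an assumption here, not something checked above), repeated tweets with different ids would survive this step. The sketch below, on made-up data, contrasts the default with a `subset` restricted to `text` and `target`.

```python
import pandas as pd

# Toy data (made up): the same tweet appears twice under different ids
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["fire downtown", "fire downtown", "nice sunset"],
    "target": [1, 1, 0],
})

# Comparing all columns: the unique id makes every row distinct
all_columns = toy_df.drop_duplicates(keep="first")

# Comparing text and target only: the repeated tweet collapses to one row
text_only = toy_df.drop_duplicates(subset=["text", "target"], keep="first")

print(len(all_columns), len(text_only))
```

Whether to deduplicate on `text` alone or on `text` plus `target` depends on how the remaining repeats should be treated; after the conflicting labels are removed, the two choices give the same result.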

In [86]:
train_df_clean.to_csv(data_folder.joinpath('train_clean.csv'))