# NLP Disaster Tweets
The sample submission, test, and train data are loaded from CSV files.

In [1]:
import pandas as pd

sample_submission = pd.read_csv('./Data/sample_submission.csv')
test = pd.read_csv('./Data/test.csv')
train = pd.read_csv('./Data/train.csv')

The shape and first five rows of each data set are displayed.

## Train Data

In [2]:
print(train.shape)
train.head()

(7613, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Test Data

In [3]:
print(test.shape)
test.head()

(3263, 4)


Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## Sample Submission

In [4]:
print(sample_submission.shape)
sample_submission.head()

(3263, 2)


Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


## Preprocessing
The text data must be processed before we can train any models on it. First we count the occurrences of words in the first five tweets.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

train_text = train['text'].values

# Just see what it does on the first 5 tweets
train_text_5 = train_text[:5]
train_text_5

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_text_5)

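# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print(vectorizer.get_feature_names())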
print(X.shape)

['000', '13', 'alaska', 'all', 'allah', 'are', 'as', 'asked', 'being', 'by', 'california', 'canada', 'deeds', 'earthquake', 'evacuation', 'expected', 'fire', 'forest', 'forgive', 'from', 'got', 'in', 'into', 'just', 'la', 'may', 'near', 'no', 'notified', 'of', 'officers', 'or', 'orders', 'other', 'our', 'people', 'photo', 'place', 'pours', 'reason', 'receive', 'residents', 'ronge', 'ruby', 'sask', 'school', 'sent', 'shelter', 'smoke', 'the', 'this', 'to', 'us', 'wildfires']
(5, 54)
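To make the vocabulary above less mysterious, here is a minimal plain-Python sketch of what a bag-of-words count does. It only approximates `CountVectorizer`'s default behaviour (lowercasing, and keeping runs of two or more word characters as tokens); it is not the library's implementation. The helper names `tokenize` and `count_matrix` are made up for illustration, and the second document is invented. It suggests why `'13'` and `'000'` show up as separate tokens (the comma in "13,000" splits them) and why one-letter words and `#` never appear.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenisation:
# lowercase the text, then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split one document into lowercase tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def count_matrix(docs):
    """Return (sorted vocabulary, one row of token counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[token] for token in vocab] for c in counts]
    return vocab, rows


docs = [
    "Forest fire near La Ronge Sask. Canada",  # tweet from the train data
    "13,000 people fled a fire",               # invented example text
]
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row has one column per vocabulary entry, which is the same layout as the `(5, 54)` matrix above, just dense instead of sparse.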


So the `CountVectorizer` found 54 distinct tokens in the first five tweets. Now we will use a naive, hand-picked vocabulary as tokens and explore the resulting data.

In [6]:
vocab = ['disaster', 'fire', 'storm', 'hurricane', 'ablaze', 'rumble', 'earthquake']
vectorizer = CountVectorizer(vocabulary=vocab)

X = vectorizer.fit_transform(train_text)

print(vectorizer.get_feature_names())
X.shape

['disaster', 'fire', 'storm', 'hurricane', 'ablaze', 'rumble', 'earthquake']


(7613, 7)

Now we explore the data.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df['target'] = train['target']

perc = np.zeros(df.shape[1] - 1)

# Calculate percentage of tweets that are disasters for each word
cnt = 0
for col_nm, col_data in df.iloc[:, :-1].items():  # iteritems() was removed in pandas 2.0
    targets = df['target'][col_data > 0]
    if targets.values.size == 0:
        perc[cnt] = 0
    else:
        perc[cnt] = np.sum(targets.values)/targets.values.size * 100
    cnt += 1

# Plot the percentages
plt.figure(figsize=(15, 8))
plt.bar(df.columns[:-1], perc)
plt.title('Percentage of Tweets Containing Token that are a Disaster')
plt.xlabel('Tokens')
plt.ylabel('%');

Now we create a baseline model that labels a tweet as a disaster if it contains any of the words disaster, fire, storm, hurricane, or earthquake.

In [8]:
from sklearn.metrics import f1_score

disaster_vocab_df = df[['disaster', 'fire', 'storm', 'hurricane', 'earthquake']]
pred = (disaster_vocab_df > 0).any(axis=1)

f1 = f1_score(train['target'], pred)
print(f'The f1 score using the common sense baseline on the train data is {f1}.')

The f1 score using the common sense baseline on the train data is 0.21422986708365907.
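For readers who want to see what `f1_score` measures, here is a small sketch that computes F1 for the positive class from raw counts. The function name `f1_from_labels` and the toy labels are made up for illustration. A keyword rule can only flag tweets that contain one of five words, so one plausible reason for the low score above is low recall; that has not been checked against the real predictions here.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels (invented): the rule flags very few tweets, so recall is low.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(f1_from_labels(y_true, y_pred))
```

In this toy case precision is 1/2 and recall is 1/5, and the harmonic mean sits much closer to the smaller of the two.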


## Preprocessing for Modeling
Now, we preprocess the data for actual predictive modeling. We will use `sklearn`'s default English stop words in its `TfidfVectorizer`. This could perhaps be problematic for reasons outlined [here](https://www.aclweb.org/anthology/W18-2502/).
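As a rough illustration of the weighting a TF-IDF vectorizer applies, the sketch below computes TF-IDF by hand with a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which to my understanding matches scikit-learn's default settings. It skips stop-word removal and tokenisation, the function name `tfidf` is hypothetical, and the token lists are invented.

```python
import math
from collections import Counter


def tfidf(docs_tokens):
    """TF-IDF with smoothed idf and L2-normalised rows (a sketch, not sklearn)."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents does each token appear?
    df = Counter(token for tokens in docs_tokens for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


# Invented token lists: 'http' appears in every document.
docs_tokens = [
    ["http", "forest", "fire"],
    ["http", "flood", "warning"],
    ["http", "fire", "crews"],
]
idf, rows = tfidf(docs_tokens)
print(idf)
print(rows[0])
```

A token that appears in every document, like `'http'` here, gets the smallest possible idf, so it ends up with less weight than rarer tokens in the same row.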

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')

X = vectorizer.fit_transform(train_text)
X.shape

(7613, 21363)

Wow! That is a lot of features. My guess is that we do not need this many. We will first train a simple model with all the features and then go on to tune the `TfidfVectorizer` to improve the preprocessing.
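The models below are scored with 5-fold cross-validation. As a sketch of the idea, the generator below splits sample indices into contiguous, unshuffled folds, so each sample is used for testing exactly once. It is a simplification: for classifiers, `cross_val_score` stratifies the folds by class by default, which this sketch does not do. The function name `k_fold_indices` is made up for illustration.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# 7613 is the number of rows in the train data.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(7613)):
    print(fold, len(train_idx), len(test_idx))
```

Each of the five scores printed by the cells below comes from fitting on one such training part and scoring on the held-out part.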

In [10]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
import pickle

# # Get data in standard format in standard variables
# X = X.toarray()
# y = train['target'].values

# # Do cross validation on model
# gauss_NB = GaussianNB()
# cvs = cross_val_score(gauss_NB, X, y, scoring='f1', cv=5)

# # Save Results
# with open ('./Models/GaussianNB_cvs.pkl', 'wb') as f:
#     pickle.dump(cvs, f)
    
# Load in cross validation results
with open('./Models/GaussianNB_cvs.pkl', 'rb') as f:
    cvs = pickle.load(f)
    
print(cvs)
np.mean(cvs)

[0.58432935 0.57786614 0.56718346 0.59760956 0.62060457]


0.5895186150304811

Well, this is much better than the common sense baseline model. Let's try some more models! We will start with random forests.

In [11]:
from sklearn.ensemble import RandomForestClassifier

#rs = np.random.RandomState(42)

#rfc = RandomForestClassifier(n_jobs=-1, random_state=rs, verbose=2)
# cvs = cross_val_score(rfc, X, y, scoring='f1', verbose=2, 
#                       n_jobs=-1)

# Save Results
#with open ('./Models/RandomForestClassifer_cvs.pkl', 'wb') as f:
#    pickle.dump(cvs, f)

# Load in results
with open ('./Models/RandomForestClassifer_cvs.pkl', 'rb') as f:
    cvs = pickle.load(f)
    
print(cvs)
np.mean(cvs)

[0.50980392 0.3957323  0.41642789 0.4143167  0.64204046]


0.4756642539357637

Well, that is clearly worse than the Gaussian naive Bayes model (mean F1 of about 0.48 versus 0.59). Yikes. Now we will try support vector machines. We will use a linear kernel so that it doesn't take forever.

In [12]:
from sklearn.svm import LinearSVC

# SVC = LinearSVC(verbose=10)
# cvs = cross_val_score(SVC, X, y, scoring='f1', verbose=10, n_jobs=-1, cv=5)

# # Save Results
# with open('./Models/LinearSVC_cvs.pkl', 'wb') as f:
#     pickle.dump(cvs, f)
    
# Load in results
with open ('./Models/LinearSVC_cvs.pkl', 'rb') as f:
    cvs = pickle.load(f)
    
print(cvs)
np.mean(cvs)

[0.58403756 0.51921435 0.58881579 0.58475336 0.69892473]


0.5951491578565674

Wow! This is the best result yet, if only slightly ahead of the Gaussian naive Bayes model (mean F1 of about 0.595 versus 0.590). I think there is great promise in using a support vector machine. Let's go back and preprocess the data a little better and then try modeling again. We will also tune our models' hyperparameters on the second try.

In [13]:
# First let's find the most commonly occurring words

# Count words
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_text)

# Get the total counts of each word
cnt_tot = X.toarray().sum(axis=0)

# Print the 20 most frequently occurring words
vocab = vectorizer.get_feature_names()
vocab = np.array(vocab)
vocab = vocab[np.argsort(cnt_tot)]
vocab_top20 = vocab[-20:]

vocab_top20

array(['crash', 'suicide', 'california', 'burning', 'storm', 'body',
       'police', 'disaster', 'emergency', 'video', 'don', 'news',
       'people', 'new', 'just', 'amp', 'û_', 'like', 'https', 'http'],
      dtype='<U32')

It looks like the most commonly occurring vocab is fairly relevant. However, there are some tokens like 'https' which are not so relevant. Hopefully inverse document frequency will take care of such words. We continue by lemmatizing the vocab.

In [14]:
import nltk
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
vocab_lemmatized = [lemmatizer.lemmatize(word) for word in vocab]
vocab_lemmatized = np.array(vocab_lemmatized)
vocab_lemmatized = np.unique(vocab_lemmatized)

print(f'The full vocab has {vocab.size} words, while the lemmatized vocab has {vocab_lemmatized.size} words.')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mboli\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The full vocab has 21363 words, while the lemmatized vocab has 20173 words.
