# Disaster Tweets

This notebook builds a model to predict which tweets are about real disasters and which are not, using Kaggle's [NLP with Disaster Tweets dataset](https://www.kaggle.com/c/nlp-getting-started).

In [20]:
# Load dependencies.
import pandas as pd
import numpy as np

from sklearn import linear_model, model_selection, metrics, naive_bayes, neighbors, svm
from sklearn.feature_extraction.text import CountVectorizer

## 1. Load and clean data

In [21]:
# Load train and test data.
train_data = pd.read_csv('./data/train.csv', sep=',')
test_data = pd.read_csv('./data/test.csv', sep=',')

# Take a quick look at the first rows of the training data.
train_data.head(8)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | 8 | | | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | 10 | | | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | 13 | | | I'm on top of the hill and I can see a fire in... | 1 |


## 2. Explore data

In [22]:
# Tokenize the tweet texts into a bag-of-words count matrix.
vectorizer = CountVectorizer(stop_words='english')
train_vectors = vectorizer.fit_transform(train_data["text"])

# vocabulary_ maps each term to its column index in the count matrix.
print('The length of vocabulary', len(vectorizer.vocabulary_))
print('The shape is', train_vectors.shape)

The length of vocabulary 21363
The shape is (7613, 21363)
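
To make the vectorization step more concrete, here is a minimal pure-Python sketch of what a bag-of-words count matrix contains. It is not scikit-learn's implementation: the `STOP_WORDS` set is a tiny hypothetical stand-in for the built-in English list, the sample tweets are made up, and `tokenize` and `bag_of_words` are illustrative helper names. The token pattern follows the same idea as `CountVectorizer`'s default, which lowercases the text and keeps tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"the", "is", "in", "a", "of", "and", "near"}

# Same idea as CountVectorizer's default token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, split it into tokens, and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]


def bag_of_words(texts):
    """Return a sorted vocabulary and one row of token counts per text."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Made-up sample tweets, for illustration only.
sample = ["Forest fire near the lake", "Fire fire in the hills"]
vocabulary, matrix = bag_of_words(sample)
print(vocabulary)
print(matrix)
```

Each row corresponds to one text and each column to one vocabulary term, which is why the real matrix has one row per tweet and one column per distinct token.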


## 3. Build models

### 3.1 Fast models comparison

In [23]:
# Compare several fast classifiers using 3-fold cross-validated F1.
classifiers = {
    "Ridge Classifier": linear_model.RidgeClassifier(),
    "SGD Classifier": linear_model.SGDClassifier(),
    "Multinomial Naive Bayes": naive_bayes.MultinomialNB(alpha=1.0),
    "K Neighbors Classifier": neighbors.KNeighborsClassifier(),
    "Linear SVC": svm.LinearSVC(penalty='l1', dual=False, loss='squared_hinge'),
}

for name, clf in classifiers.items():
    scores = model_selection.cross_val_score(clf, train_vectors, train_data["target"], cv=3, scoring="f1")
    print(name + ":", scores)

Ridge Classifier: [0.57814208 0.53562405 0.62392731]
SGD Classifier: [0.58177827 0.54351145 0.61137693]
Multinomial Naive Bayes: [0.67225326 0.64570416 0.71125883]
K Neighbors Classifier: [0.07329843 0.03536693 0.07965368]
Linear SVC: [0.57775318 0.5076242  0.55662773]
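
One caveat about this comparison: `train_vectors` was fitted on the full training set, so the vocabulary used in each cross-validation fold already includes words from that fold's held-out tweets. A possible variant, sketched here on a made-up toy corpus rather than the Kaggle data, wraps the vectorizer and the classifier in a pipeline so the vocabulary is rebuilt from the training part of each fold. The `texts` and `labels` values are hypothetical, and the scores they produce say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (1 = disaster, 0 = not a disaster), for illustration only.
texts = [
    "forest fire spreading near the town",
    "flood waters rising after heavy rain",
    "earthquake damages buildings downtown",
    "wildfire evacuation ordered for residents",
    "storm floods the main highway",
    "fire crews battle the blaze overnight",
    "i love this sunny weekend",
    "my new phone is on fire sale",
    "great concert last night",
    "coffee with friends this morning",
    "this song is a total banger",
    "watching a movie at home tonight",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refitted inside every fold, on that fold's training texts only.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB(alpha=1.0))
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print("Multinomial Naive Bayes (pipeline):", scores)
```

On the real data, the same pattern would pass `train_data["text"]` and `train_data["target"]` in place of the toy lists.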


### 3.2 Multinomial Naive Bayes

In [24]:
X = train_data.drop('target', axis=1)
y = train_data['target'].copy()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33, random_state=25)

# Get the training vectors
X_train_vectors = vectorizer.fit_transform(X_train['text'])

# Build the classifier
clf = naive_bayes.MultinomialNB(alpha=.01)

# Train the classifier
clf.fit(X_train_vectors, y_train)

# Get the test vectors
vectors_test = vectorizer.transform(X_test['text'])

# Predict on the held-out split and score the predictions
pred = clf.predict(vectors_test)
acc_score = metrics.accuracy_score(y_test, pred)
# Macro-averaged F1: the unweighted mean of the per-class F1 scores
f1_score = metrics.f1_score(y_test, pred, average='macro')

print('Total accuracy classification score: {}'.format(acc_score))
print('Total F1 classification score: {}'.format(f1_score))

Total accuracy classification score: 0.7715877437325905
Total F1 classification score: 0.7645322908599883
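
Note that section 3.1 scores with `scoring="f1"`, which is the F1 of the positive class only, while this cell uses `average='macro'`, the unweighted mean of the per-class F1 scores, so the two sets of numbers are not directly comparable. The pure-Python sketch below shows the difference on made-up labels; `f1_for_class` and `macro_f1` are illustrative helpers, not scikit-learn functions.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

binary = f1_for_class(y_true, y_pred, positive=1)
macro = macro_f1(y_true, y_pred)
print("binary F1:", binary)
print("macro F1:", macro)
```

Here the macro score is higher than the binary one because the majority class is predicted well, which illustrates why a macro F1 can look better than a positive-class F1 for the same model.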


## 4. Make predictions

In [25]:
# Refit the vectorizer on the full training set.
train_vectors = vectorizer.fit_transform(train_data["text"])

# Build and train classifier.
clf = naive_bayes.MultinomialNB(alpha=.01)
clf.fit(train_vectors, train_data['target'])

# Make the predictions.
vectors_test = vectorizer.transform(test_data['text'])
test_predictions = clf.predict(vectors_test)

# Generate the submission file (to be uploaded to Kaggle).
output = pd.DataFrame({'id': test_data.id, 'target': test_predictions})
output.to_csv('my_submission.csv', index=False)
print("The submission was successfully saved!")

The submission was successfully saved!
