# Overview

Smartphones are now so widespread that people can announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it’s not always clear whether a person’s words are actually announcing a disaster. This dataset was created by the company Figure Eight and originally shared on their ‘Data For Everyone’ website.

# Prepare Notebook

In [17]:
import numpy as np
import pandas as pd

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Load Data

In [3]:
path = 'data/train.csv'
tweets = pd.read_csv(path)
tweets.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test_path = 'data/test.csv'
test = pd.read_csv(test_path)
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [10]:
print(tweets.shape, "", test.shape) 

(7613, 5)  (3263, 4)


In [13]:
tweets['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
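
The counts above give a useful reference point for the scores reported later: a classifier that always predicts the majority class would already be right a little over half the time. A minimal sketch of that arithmetic, using only the counts shown above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 2))
```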

# Data Preprocessing

In [5]:
tweets['text'] = tweets['text'].apply(lambda x: "".join([i for i in x if not i.isdigit()]))

In [7]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
tweets['text'] = tweets['text'].apply(lambda x: "".join([i for i in x if i not in string.punctuation]))

In [9]:
tweets['text'] = tweets['text'].str.lower()

In [10]:
tweets.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake ma...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to shelter in place are be...,1
3,6,,,people receive wildfires evacuation orders in...,1
4,7,,,just got sent this photo from ruby alaska as s...,1
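
The three cleaning steps above are repeated later for the test set. One way to keep the two in sync is a single helper that applies the same steps in the same order. This is a sketch only; `clean_text` is a hypothetical name, not something defined in the notebook:

```python
import string

def clean_text(text):
    """Drop digits and punctuation, then lowercase (same steps as the cells above)."""
    no_digits = "".join(ch for ch in text if not ch.isdigit())
    no_punct = "".join(ch for ch in no_digits if ch not in string.punctuation)
    return no_punct.lower()

print(clean_text("Forest fire near La Ronge Sask. Canada"))
```

In the notebook it could be applied as `tweets['text'].apply(clean_text)` and `test['text'].apply(clean_text)`.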


# Data Preparation

Separate the target column from the dataframe

In [14]:
Y = tweets['target']
tweets.drop(columns=['target'], inplace=True)

## Tokenization 

In [15]:
tweets['text'] = tweets['text'].str.split()

In [19]:
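# Note: this rebinds the name `stopwords` (imported above as the corpus module) to a plain list of words
stopwords = nltk.corpus.stopwords.words('english')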

In [20]:
tweets['text'] = tweets['text'].apply(lambda x: [i for i in x if i not in stopwords])

In [21]:
tweets.head()

Unnamed: 0,id,keyword,location,text
0,1,,,"[deeds, reason, earthquake, may, allah, forgiv..."
1,4,,,"[forest, fire, near, la, ronge, sask, canada]"
2,5,,,"[residents, asked, shelter, place, notified, o..."
3,6,,,"[people, receive, wildfires, evacuation, order..."
4,7,,,"[got, sent, photo, ruby, alaska, smoke, wildfi..."


## Vectorization / Count Vectorizer

In [22]:
tweets['text'] = tweets['text'].apply(lambda x: " ".join([str(i) for i in x]))

In [23]:
train = tweets['text']

In [24]:
vectorize = CountVectorizer()
X_train = vectorize.fit_transform(train)

In [26]:
X_train

<7613x20459 sparse matrix of type '<class 'numpy.int64'>'
	with 69237 stored elements in Compressed Sparse Row format>
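
To make the shape above concrete: each row is a tweet, each column is a vocabulary token, and each stored element is a nonzero count. A small sketch on a made-up two-document corpus (the documents are illustrative, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only
docs = ["forest fire near canada", "fire fire evacuation"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index (columns are in sorted token order)
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
print(X.toarray())
```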

Convert sparse matrix into dense array

In [29]:
X_train = X_train.toarray()

In [30]:
# toarray() already returns a NumPy ndarray, so this call only makes a redundant copy
X_train = np.array(X_train)

In [31]:
X_train.shape

(7613, 20459)
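
The dense array holds 7613 × 20459 = 155,754,367 int64 entries, roughly 1.25 GB at 8 bytes each, although only 69,237 of them are nonzero. Both `MultinomialNB` and `LogisticRegression` accept SciPy sparse input, so the conversion may not be needed at all. A sketch on a made-up corpus, checking that the sparse and dense fits agree:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus and labels, for illustration only
docs = ["forest fire near town", "earthquake hits city",
        "love this sunny day", "great movie tonight"]
labels = [1, 1, 0, 0]

X_sparse = CountVectorizer().fit_transform(docs)

# Fit once on the sparse matrix and once on its dense copy
nb_sparse = MultinomialNB().fit(X_sparse, labels)
nb_dense = MultinomialNB().fit(X_sparse.toarray(), labels)

# Compare the predictions of the two models
same = np.array_equal(nb_sparse.predict(X_sparse),
                      nb_dense.predict(X_sparse.toarray()))
print(same)
```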

# Train Model

In [36]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

In [50]:
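def metrics_score(real, pred):
    """Print accuracy, F1 score, and the confusion matrix for the given labels."""
    accuracy = np.round(accuracy_score(real, pred), 2)
    Fscore = np.round(f1_score(real, pred), 2)  # F1 for the positive class (target == 1)
    cf_matrix = confusion_matrix(real, pred)  # rows: true class, columns: predicted class
    print(accuracy, "", Fscore, "", cf_matrix)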

## Multinomial Naive Bayes

In [34]:
nb = MultinomialNB()
nb.fit(X_train, Y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
# Predict on the same data the model was fitted on, so the scores below are training scores
y_pred = nb.predict(X_train)

In [40]:
metrics_score(Y, y_pred)

0.91  0.89  [[4185  157]
 [ 550 2721]]
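
The accuracy and F1 values printed above can be recovered by hand from the confusion matrix, which helps in reading it. A short worked sketch using the four numbers from the output (variable names are illustrative):

```python
# Confusion matrix from the Naive Bayes output above:
# rows are true classes, columns are predicted classes
tn, fp = 4185, 157
fn, tp = 550, 2721

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(f1, 2))
```

Precision comes out at about 0.95 and recall at about 0.83, so on the training data the model misses real disaster tweets more often than it raises false alarms.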


## Logistic Regression

In [45]:
lr = LogisticRegression()
lr.fit(X_train,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [46]:
# Again predicting on the training data, so this measures fit rather than generalization
lr_pred = lr.predict(X_train)

In [51]:
metrics_score(Y,lr_pred)

0.95  0.94  [[4301   41]
 [ 324 2947]]
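
Both models above are scored on the rows they were trained on, which tends to overstate how well they generalize. `train_test_split` is imported at the top of the notebook but never used; a hold-out split is one way to get a less optimistic estimate. A sketch on a made-up corpus, where the split size and `random_state` are arbitrary assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny made-up corpus and labels, for illustration only
texts = ["forest fire near town", "earthquake hits city", "flood destroys homes",
         "wildfire evacuation ordered", "love this sunny day", "great movie tonight",
         "coffee with friends", "new song out now"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the rows; stratify keeps the class ratio in both parts
X_tr_text, X_val_text, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the training part only, then reuse it for validation
vec = CountVectorizer()
X_tr = vec.fit_transform(X_tr_text)
X_val = vec.transform(X_val_text)

model = LogisticRegression().fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(len(y_tr), len(y_val), val_accuracy)
```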


# Make Prediction on Test set

## Preprocess test data

In [53]:
test['text'] = test['text'].apply(lambda x: "".join([i for i in x if not i.isdigit()]))

test['text'] = test['text'].apply(lambda x: "".join([i for i in x if i not in string.punctuation]))

test['text'] = test['text'].str.lower()

## Prepare Test set

In [54]:
test['text'] = test['text'].str.split()
test['text'] = test['text'].apply(lambda x: [i for i in x if i not in stopwords])

In [58]:
x_test = test['text']

In [56]:
submission = pd.read_csv("data/sample_submission.csv")

In [57]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [59]:
submission['target'] = nb.predict(x_test)

ValueError: setting an array element with a sequence.
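
The `ValueError` above arises because `x_test` still holds lists of tokens, while `nb` was fitted on count vectors. A hedged sketch of the missing steps on a made-up corpus: join each token list back into a string, then call `transform` (not `fit_transform`) on the vectorizer that was fitted on the training text, so the test columns line up with the training vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training data, for illustration only
train_text = pd.Series(["forest fire near town", "earthquake hits city",
                        "love sunny day", "great movie tonight"])
y = [1, 1, 0, 0]

vectorize = CountVectorizer()
X_train = vectorize.fit_transform(train_text)
nb = MultinomialNB().fit(X_train, y)

# Test rows as token lists, the state x_test is in after stopword removal
x_test = pd.Series([["fire", "near", "city"], ["great", "sunny", "day"]])

# Join tokens back into strings, then reuse the fitted vocabulary
x_test_joined = x_test.apply(" ".join)
X_test = vectorize.transform(x_test_joined)

print(nb.predict(X_test))
```

In the notebook the same idea would read roughly as `X_test = vectorize.transform(test['text'].apply(" ".join))` followed by `submission['target'] = nb.predict(X_test)`.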