We will import the libraries needed throughout this notebook.

In [149]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

We load the customized stopword list that was pickled in the previous notebook. Note that assigning it to the name 'stopwords' shadows the nltk.corpus stopwords import above.

In [150]:
with open('./assets/stopwords.pkl','rb') as f:
    stopwords = pickle.load(f)

This is the combined dataframe containing both the pre-disaster and post-disaster tweets.

In [151]:
combined_df = pd.read_csv('../project_4/assets/combined_df.csv')

In [152]:
combined_df.shape

(98550, 3)

Once again we will use TF-IDF to turn the words into numeric features (TF-IDF weights rather than raw integer counts) so that they can be used by the Logistic Regression model. We will also split our data into 'X' and 'y'. X holds the predictor column, in this case the text of the tweets. y holds the binary class label: whether a tweet belongs to the disaster class or the non-disaster class.
We will randomly split the data into training and test sets. This is done so we can train our model on the training set and then evaluate its performance on unseen data (the test set).

In [153]:
y = combined_df['disaster']

# Set X as text column.
X = combined_df['text']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.30, random_state=42, stratify=y)

tfidf = TfidfVectorizer(stop_words = stopwords, 
                        max_df=0.95, 
                        min_df=5, max_features=10_000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
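
To illustrate the difference between raw counts and TF-IDF weights, here is a minimal sketch on a made-up toy corpus. The documents and the variable names (toy_docs, count_vec, tfidf_vec) are assumptions for illustration only and are not part of the project data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the project data.
toy_docs = [
    "flood water rising fast",
    "flood warning issued",
    "great game tonight",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(toy_docs)    # sparse matrix of integer counts
weights = tfidf_vec.fit_transform(toy_docs)   # sparse matrix of float weights

print(counts.dtype.kind)    # 'i' for integer
print(weights.dtype.kind)   # 'f' for floating point
print(weights.toarray().round(2))
```

Both vectorizers build the same vocabulary here, so the two matrices have the same shape. Only the cell values differ.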

In [154]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
model = lr.fit(X_train_tfidf, y_train)



In [155]:
print(f'LogReg Training score: {model.score(X_train_tfidf, y_train)}')
print(f'LogReg Testing score: {model.score(X_test_tfidf, y_test)}')

LogReg Training score: 0.9591940276871783
LogReg Testing score: 0.9512599357348216


-------

## LDA:

Latent Dirichlet Allocation (LDA) is a topic model that can also serve as a form of feature extraction. We will apply LDA to a corpus of documents (tweets) and extract from it additive models of the topic structure of the corpus. The outcome will be a set of topics, each of which is represented as a list of terms.

In [130]:
X_train_df_tfidf = pd.SparseDataFrame(X_train_tfidf,
                             columns=tfidf.get_feature_names())

In [131]:
X_train_df_tfidf.shape

(68915, 10000)

In [132]:
X_train_df_tfidf.head()

Unnamed: 0,__,____,_hustle_junky,aa,aaa,aan,aaron,aaroncarter,ab,abandoned,...,zswaggers,zswagtour,zt,zu,zumba,zv,zw,zx,zy,zz
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [133]:
X_train_df_tfidf.fillna(0, inplace=True)

In [134]:
X_test_df_tfidf = pd.SparseDataFrame(X_test_tfidf,
                                    columns = tfidf.get_feature_names())

In [135]:
X_test_df_tfidf.fillna(0, inplace=True)
print(X_test_df_tfidf.shape)

(29535, 10000)


LDA is a method of discovering topics in a collection of documents. It is applicable to our project because it helps us understand and identify the different topics of the tweets in our dataset. Due to time constraints and our limited in-depth knowledge of LDA, we did not experiment with all of its parameters.
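
As a minimal sketch of what LDA produces, the example below fits a two-topic model on a made-up toy corpus. The documents, the number of topics, and all toy_* names are assumptions for illustration. The sketch uses CountVectorizer because LDA's generative model is defined over word counts.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the project data.
toy_docs = [
    "hurricane flood water damage",
    "flood water rescue hurricane",
    "hurricane damage homes flood",
    "football game score team",
    "team wins game tonight",
    "score game football team",
]

toy_counts_vec = CountVectorizer()
toy_counts = toy_counts_vec.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(toy_counts)

# components_ has one row per topic and one column per vocabulary term.
print(toy_lda.components_.shape)

# Each row of doc_topics is a distribution over the topics and sums to 1.
print(doc_topics.sum(axis=1))
```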


In [143]:

n_features = 1000
n_components = 10
n_top_words = 20


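def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort() is ascending, so this slice walks backwards from the
        # largest weight and stops after n_top_words terms.
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()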


# Note: despite the 'tf' naming, TfidfVectorizer produces TF-IDF weights,
# not raw term counts (CountVectorizer would give raw counts).
print("Extracting tf features for LDA...")
tf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5,
                                max_features=n_features,
                                stop_words=stopwords)

X_train_tf = tf_vectorizer.fit_transform(X_train)
X_test_tf = tf_vectorizer.transform(X_test)


lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                random_state=42)

lda_train = lda.fit_transform(X_train_tf)
lda_test = lda.transform(X_test_tf)


print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf features for LDA...

Topics in LDA model:
Topic #0: first tonight time south help look made way never big abc near pm guys pic would making high miles tomorrow
Topic #1: homes residents home destroyed pic friends fema says return un power cnn hurricane photo show little thousand destruction next evacuate
Topic #2: tx back going prayers still go update devastation traffic ever heart center may stop every water force families hate thoughts
Topic #3: thousand mass people killed night one last dead least day borderline victims year pic say grill week sunday everything another
Topic #4: new today watch latest ready state lol work pic begins check usa years beach taking read hurricane awesome crisis houstonstrong
Topic #5: like know life got us god think real thanks fire long park already ahead pic things evacuations summer im looks
Topic #6: get right repost safe want stay need see open let take days please someone even hope wait better team girl
Topic #7: video great happy mu

According to this output, almost all of the 10 topics seem to be related in some way to the hurricanes. The exception is topic 3, which seems to relate more to the mass shooting that took place in the city of Thousand Oaks.

LDA is a popular method in text analysis and natural language processing, used primarily to find latent (hidden) topics in documents. Unfortunately, because LDA assigns a topic label to each word and so treats every document as a mixture of topics, it may not work well with Twitter: tweets are short, and a single tweet is more likely to talk about one topic.
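
One way to check how strongly each tweet is tied to a single topic is to look at the largest entry in its document-topic distribution. The sketch below uses a hypothetical document-topic matrix with made-up values, shaped like the output of lda.transform. It is an illustration, not a result from our data.

```python
import numpy as np

# Hypothetical document-topic matrix: one row per document, one column per
# topic, each row summing to 1 (made-up values for illustration).
doc_topics = np.array([
    [0.85, 0.05, 0.10],
    [0.34, 0.33, 0.33],
    [0.10, 0.80, 0.10],
])

# Index of the most probable topic for each document.
dominant_topic = doc_topics.argmax(axis=1)

# Share of the document assigned to that topic. Values close to 1 suggest
# the document is mostly about one topic.
dominant_share = doc_topics.max(axis=1)

print(dominant_topic)
print(dominant_share)
```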

In [138]:
feature_loadings = pd.DataFrame(lda.components_, 
                                columns = tf_vectorizer.get_feature_names(),
                                index = [f'topic_{x}' for x in range(lda.components_.shape[0])]).T

In [139]:
feature_loadings.shape

(1000, 10)

We created a dataframe holding the loading of each term on each topic. Sorting it by topic_0 shows the 10 terms that load most heavily on that topic.

In [148]:
feature_loadings.sort_values('topic_0', ascending=False).head(10)

| term | topic_0 | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 | topic_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| tonight | 385.136008 | 0.100005 | 0.100007 | 0.100272 | 0.100018 | 0.100008 | 0.100013 | 0.10001 | 0.100017 | 0.100005 |
| first | 354.815759 | 0.10001 | 0.100012 | 0.100019 | 0.100011 | 0.100015 | 0.100017 | 0.10002 | 85.362585 | 0.100032 |
| south | 353.936864 | 0.100011 | 0.100009 | 0.100008 | 0.100009 | 0.100006 | 0.10001 | 0.100008 | 0.100009 | 0.100026 |
| help | 335.152664 | 0.100006 | 0.100013 | 0.10001 | 0.100008 | 0.100012 | 0.10002 | 0.100009 | 0.100012 | 0.100005 |
| one | 315.891284 | 14.35654 | 58.066294 | 126.538391 | 0.100017 | 122.14978 | 0.100028 | 0.100032 | 0.10003 | 0.10001 |
| way | 264.552605 | 0.100005 | 0.10001 | 0.100012 | 0.100009 | 0.100011 | 0.10001 | 0.100039 | 0.100012 | 0.100011 |
| us | 244.552141 | 0.100036 | 102.533737 | 0.10005 | 0.100022 | 69.504985 | 0.100024 | 58.825385 | 0.100027 | 56.552466 |
| never | 238.234392 | 0.100004 | 0.100011 | 0.100013 | 0.10001 | 0.100017 | 0.100013 | 0.100015 | 0.10001 | 0.100003 |
| pm | 234.46691 | 0.100012 | 0.100266 | 0.100018 | 0.100012 | 0.100011 | 0.100012 | 0.100015 | 0.100006 | 0.100033 |
| devastation | 227.402077 | 0.100029 | 40.700268 | 0.100005 | 0.100011 | 0.100007 | 0.100006 | 0.102899 | 0.100004 | 24.754948 |
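
The values in lda.components_ can be read as pseudo-counts rather than probabilities. The many entries near 0.1 are consistent with the default topic-word prior of 1 / n_components. To compare terms across topics on a common scale, each topic's row can be normalized to sum to 1. The sketch below does this on a small hypothetical matrix. The numbers and names (toy_components, toy_terms) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pseudo-counts: one row per topic, one column per term.
toy_components = np.array([
    [385.1, 0.1, 0.1],
    [0.1, 126.5, 14.4],
])
toy_terms = ["tonight", "killed", "homes"]

# Divide each topic's row by its total so it becomes a distribution over terms.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

# Transpose so that terms are rows and topics are columns, matching the
# layout of feature_loadings.
toy_loadings = pd.DataFrame(
    topic_word_probs, index=["topic_0", "topic_1"], columns=toy_terms
).T

print(toy_loadings.round(4))
```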


In [141]:
# Instantiate logistic regression model.
logreg = LogisticRegression()

# Fit on the LDA-transformed training data.
logreg.fit(lda_train, y_train)

# Score on training and testing sets.
print(f'Training Score: {round(logreg.score(lda_train, y_train),4)}.')
print(f'Testing Score: {round(logreg.score(lda_test, y_test),4)}.')



Training Score: 0.8006.
Testing Score: 0.8008.
