# Extra Cleaning

To visualize and try to understand some of the trends in our data, we cleaned a little more and filled in some nulls.

In [1]:
import pandas as pd
df = pd.read_csv('./datasets/finaldata_label.csv')

FileNotFoundError: [Errno 2] No such file or directory: './datasets/finaldata_label.csv'

Dropped the two extra index columns:

In [None]:
df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

After looking over the dataset, we found that a large share of the tweets came from California, so we filled the null 'user_location' values with 'California, USA'. The 'CA' and 'California' values were also renamed to 'California, USA', since all three refer to the same place and should not be counted as separate values.

In [None]:
df[['user_location']] = df[['user_location']].fillna(value='CA')

In [None]:
df['user_location'] = df['user_location'].replace('California', 'CA')

In [None]:
df['user_location'] = df['user_location'].replace('CA', 'California, USA')

All of the 'San Francisco' values were lumped into the 'San Francisco, CA' values.

In [None]:
df['user_location'] = df['user_location'].replace('San Francisco', 'San Francisco, CA')
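
The renaming steps above can also be expressed as a single mapping. The following is a minimal sketch on a toy Series (the values and the `aliases` name are illustrative assumptions, not the real dataset), showing how `fillna` and a dict passed to `replace` produce the same labels in one pass:

```python
import pandas as pd

# Toy stand-in for the 'user_location' column (illustrative values only).
locations = pd.Series(['CA', 'California', None, 'San Francisco', 'Oakland, CA'])

# Hypothetical alias table: every spelling on the left maps to one canonical label.
aliases = {
    'CA': 'California, USA',
    'California': 'California, USA',
    'San Francisco': 'San Francisco, CA',
}

# Fill missing locations first, then replace exact matches using the alias table.
cleaned = locations.fillna('California, USA').replace(aliases)
print(cleaned.tolist())
```

Because `replace` with a dict matches whole values rather than substrings, 'Oakland, CA' is left untouched even though it contains 'CA'.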

In [None]:
df[['verified']] = df[['verified']].fillna(value=0)

Checking dimensions and first five rows:

In [None]:
df.shape

In [None]:
df.head()

# EDA and Visualizations

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (12, 8))
sns.set_palette('crest', 9)
sns.barplot(x = df['keyword'], y = df['relevant'])

plt.title('Relevance of Tweets by Keyword', size = 14)
plt.xlabel('Keyword', size = 11)
plt.ylabel('Relevant Tweets out of Total', size = 11);

This bar chart (above) shows, for each keyword, the proportion of its tweets that were labeled 'relevant'.
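
Since 'relevant' is a 0/1 label, the bar heights are per-keyword means. A small sketch of the same calculation with `groupby`, using made-up keywords and labels (assumptions for illustration only):

```python
import pandas as pd

# Toy data: the keywords and labels are hypothetical, not taken from the dataset.
toy = pd.DataFrame({
    'keyword': ['fire', 'fire', 'smoke', 'smoke', 'smoke', 'evacuation'],
    'relevant': [1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of rows equal to 1.
ratio = toy.groupby('keyword')['relevant'].mean()
print(ratio)
```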

In [None]:
# Top 10 locations and their tweet counts (value_counts sorts in descending order)
loc_counts = df['user_location'].value_counts()
top_loc = loc_counts.index.tolist()[:10]
top_loc_counts = loc_counts.tolist()[:10]
top_loc, top_loc_counts

Creating a dataframe with the top 10 locations and the number of tweets from each location:

In [None]:
list1 = ['California, USA', 17629]
list2 = ['San Francisco, CA', 13775]
list3 = ['Sacramento, CA', 4746]
list4 = ['Oakland, CA', 4176]
list5 = ['San Jose, CA', 2585]
list6 = ['United States', 1102]
list7 = ['Los Angeles, CA', 1030]
list8 = ['Bay Area', 940]
list9 = ['Northern California', 839]
list10 = ['Stockton, CA', 823]

df2 = pd.DataFrame(data = [list1, list2, list3, list4, list5, list6, list7, list8, list9, list10], columns = ["location", "tweets"])
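
As an alternative to typing the counts by hand, the same kind of table can be derived directly from `value_counts`. This is a sketch on toy data (the location values and the `top` name are assumptions), not the code used to produce the chart:

```python
import pandas as pd

# Toy column with three distinct locations and no ties in frequency.
toy = pd.DataFrame({
    'user_location': ['San Francisco, CA'] * 3 + ['Oakland, CA'] * 2 + ['Bay Area'],
})

# Count each location, keep the two most common, and turn the result into a DataFrame.
top = (
    toy['user_location']
    .value_counts()
    .head(2)
    .rename_axis('location')
    .reset_index(name='tweets')
)
print(top)
```

Deriving the table this way keeps it in sync with the data if the cleaning steps change.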

In [None]:
plt.figure(figsize = (12, 8))
plt.bar(x = df2['location'], height = df2['tweets'], color = 'lightgray')
plt.xticks(rotation = -60)

plt.title('Locations with the Most Tweets', size = 14)
plt.xlabel('Location', size = 11)
plt.ylabel('Tweet Count', size = 11);

(Above) This chart shows the ten locations that produced the most tweets and how many tweets came from each. Because we are interested in focusing our search efforts within a certain area around the fires, this is a way to check that our tweets are coming from where we want them to.

In [None]:
plt.figure(figsize = (12, 8))
sns.countplot(x="duplicate", hue="relevant", data=df, palette='crest');

(Above) Of the tweets that appear more than once in the dataset (indicating that they are associated with more than one keyword), just about half are labeled 'relevant'.

In [None]:
# Keep only tweets from the ten most common locations
top_locations = ['California, USA', 'San Francisco, CA', 'Sacramento, CA', 'Oakland, CA', 'San Jose, CA',
                 'United States', 'Los Angeles, CA', 'Bay Area', 'Northern California', 'Stockton, CA']
df_top = df[df['user_location'].isin(top_locations)]
df_top.head()

In [None]:
h = sns.catplot(x="user_location", hue="relevant", col="duplicate",
                data=df_top, kind="count",
                height=8, aspect=1, palette = 'crest')

h.set_xticklabels(rotation = 90)
h.set_xlabels('Location');

(Above) The same comparison as the previous chart, broken down by location.

In [None]:
plt.figure(figsize = (15, 8))
sns.countplot(x="user_location", hue="relevant", data=df_top, palette='crest')
plt.xticks(rotation = -45)

plt.title('Tweet Relevancy by Location', size = 14)
plt.xlabel('Location', size = 11)
plt.ylabel('Tweet Count', size = 11);

(Above) Does one location produce more relevant tweets than the others? Judging by proportions, no.

# Preprocessing

Before training models to classify tweets as relevant or irrelevant, the training input needs to be transformed into something that can be interpreted mathematically. Unstructured, unlabelled text does not mean much on its own.
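
To make "interpreted mathematically" concrete, here is a simplified bag-of-words sketch in plain Python. It is only an illustration of the idea: the tweets, the `tokenize` helper, and the tokenization rule are assumptions, and scikit-learn's vectorizers apply their own tokenization rules.

```python
import re
from collections import Counter

# Two made-up tweets (illustrative only).
tweets = ["Fire near the highway", "Smoke and fire downtown"]

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary is every distinct token, in a fixed (sorted) order.
vocab = sorted({token for tweet in tweets for token in tokenize(tweet)})

def to_counts(text):
    """Represent a text as a vector of token counts aligned with vocab."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

matrix = [to_counts(tweet) for tweet in tweets]
print(vocab)
print(matrix)
```

Each tweet becomes a row of numbers with one column per vocabulary word, which is the kind of input a classifier can work with.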

In [None]:
# imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, plot_roc_curve
import numpy as np
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from scipy.stats import uniform, loguniform
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

The feature used for predictions is the "content" column. The target is the "relevant" column.

In [None]:
X = df['content']
y = df['relevant']

In [None]:
y.value_counts(normalize = True)

Split with stratify because the classes are heavily imbalanced:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)
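
A small sketch of what `stratify` does, using synthetic labels with a 20/80 class split (the data and the `test_size=0.25` value are assumptions chosen so the counts divide evenly):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 20 positive and 80 negative examples.
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# With stratification, both splits keep the original 20% / 80% class ratio.
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, a random split could leave the minority class under-represented in one of the two sets.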

## Vectorizing

We wanted to compare model performances with both CountVectorizer and TF-IDF Vectorizer, so we processed our data both ways.

#### CVEC Transformation

In [None]:
cvec = CountVectorizer()
Xc_train = cvec.fit_transform(X_train)
Xc_test = cvec.transform(X_test)

#### TF-IDF Transformation

In [None]:
tvec = TfidfVectorizer()
Xt_train = tvec.fit_transform(X_train)
Xt_test = tvec.transform(X_test)
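
The main difference between the two vectorizers is that TF-IDF down-weights words that appear in many documents. The sketch below computes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the form scikit-learn documents for its default `smooth_idf=True` setting. The documents are made up, and the final L2 normalization that `TfidfVectorizer` applies is left out.

```python
import math

# Three toy documents, already tokenized (contents are illustrative).
docs = [
    ['fire', 'near', 'highway'],
    ['fire', 'smoke', 'downtown'],
    ['fire', 'traffic', 'downtown'],
]

def smoothed_idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = sum(term in doc for doc in documents)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf = {term: smoothed_idf(term, docs) for term in ('fire', 'downtown', 'traffic')}
print(idf)
```

A word that appears in every document ('fire' here) gets the lowest weight, while rarer words get higher ones.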

# Modeling

Our model needed to be a binary classifier, so we ran a number of classification models.

### Logistic Regression, CountVectorizer

In [None]:
logit = Pipeline([
    ('cvec', CountVectorizer() ),
    ('logit', LogisticRegression(penalty='none',
               C = 1.0,
               solver='lbfgs',
               max_iter=1000))
])

In [None]:
logit.fit(X_train, y_train)

In [None]:
train_score = logit.score(X_train, y_train)
test_score = logit.score(X_test, y_test)

print(f'Logistic Regression model with CountVectorizer training score: {train_score}')
print(f'Logistic Regression model with CountVectorizer testing score: {test_score}')

In [None]:
logit_preds = logit.predict(X_test)
Accuracy = accuracy_score(y_test, logit_preds)
Recall = recall_score(y_test, logit_preds)
Precision = precision_score(y_test, logit_preds)
F1_score = f1_score(y_test, logit_preds)
ROC_AUC_score = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])

print(f'Logistic Regression model with CountVectorizer accuracy: {Accuracy}')
print(f'Logistic Regression model with CountVectorizer recall: {Recall}')
print(f'Logistic Regression model with CountVectorizer precision: {Precision}')
print(f'Logistic Regression model with CountVectorizer F1 score: {F1_score}')
print(f'Logistic Regression model with CountVectorizer AUC Score: {ROC_AUC_score}')
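
For reference, the accuracy, precision, recall, and F1 figures reported for each model all come from the same four confusion-matrix counts. A plain-Python sketch on made-up labels (the `binary_metrics` helper is hypothetical, written only for this illustration):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, flagged relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, flagged relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # irrelevant, correctly ignored
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
    }

# Made-up labels and predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```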

### KNN with CountVectorizer

In [None]:
pipe = Pipeline([('cvec', CountVectorizer(
    max_df=.325,
    max_features=2000,
    min_df=5,
    ngram_range=(1, 2),
   )),
    ('knn', KNeighborsClassifier())])

knn_params = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}

In [None]:
knn =  RandomizedSearchCV(
    pipe,
    knn_params,
    n_iter=12,
    n_jobs=-1,
    cv=5,
    random_state=42,
)

knn.fit(X_train, y_train)
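
One way to check how much of the search space a randomized search covers is to count the combinations in the grid. The sketch below uses scikit-learn's `ParameterGrid` with the same KNN parameters; `n_combinations` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

knn_params = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

# Number of distinct parameter combinations: 3 * 2 * 2.
n_combinations = len(ParameterGrid(knn_params))
print(n_combinations)
```

Since the grid holds 12 combinations and `n_iter=12`, this particular randomized search evaluates every combination, just as a grid search would.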

In [None]:
ktrain_score = knn.score(X_train, y_train)
ktest_score = knn.score(X_test, y_test)

print(f'KNN model with CountVectorizer training score: {ktrain_score}')
print(f'KNN model with CountVectorizer testing score: {ktest_score}')

In [None]:
knn_preds = knn.predict(X_test)
kAccuracy = accuracy_score(y_test, knn_preds)
kRecall = recall_score(y_test, knn_preds)
kPrecision = precision_score(y_test, knn_preds)
kF1_score = f1_score(y_test, knn_preds)
kROC_AUC_score = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1])

print(f'KNN model with CountVectorizer accuracy: {kAccuracy}')
print(f'KNN model with CountVectorizer recall: {kRecall}')
print(f'KNN model with CountVectorizer precision: {kPrecision}')
print(f'KNN model with CountVectorizer F1 score: {kF1_score}')
print(f'KNN model with CountVectorizer AUC Score: {kROC_AUC_score}')

### KNN with TF-IDF

In [None]:
tknn = KNeighborsClassifier()

tknn.fit(Xt_train, y_train)

In [None]:
kttrain_score = tknn.score(Xt_train, y_train)
kttest_score = tknn.score(Xt_test, y_test)

print(f'KNN model with TF-IDF Vectorizer training score: {kttrain_score}')
print(f'KNN model with TF-IDF Vectorizer testing score: {kttest_score}')

In [None]:
# Predict on the TF-IDF-transformed test set, since tknn was fit on Xt_train
tknn_preds = tknn.predict(Xt_test)
tkAccuracy = accuracy_score(y_test, tknn_preds)
tkRecall = recall_score(y_test, tknn_preds)
tkPrecision = precision_score(y_test, tknn_preds)
tkF1_score = f1_score(y_test, tknn_preds)
tkROC_AUC_score = roc_auc_score(y_test, tknn.predict_proba(Xt_test)[:, 1])

print(f'KNN model with TF-IDF Vectorizer accuracy: {tkAccuracy}')
print(f'KNN model with TF-IDF Vectorizer recall: {tkRecall}')
print(f'KNN model with TF-IDF Vectorizer precision: {tkPrecision}')
print(f'KNN model with TF-IDF Vectorizer F1 score: {tkF1_score}')
print(f'KNN model with TF-IDF Vectorizer AUC Score: {tkROC_AUC_score}')

### Random Forest with CountVectorizer

In [None]:
pipe = Pipeline([('cvec', CountVectorizer(
    max_df=.325,
    max_features=2000,
    min_df=5,
    ngram_range=(1, 2),
   )),
     ('rf', RandomForestClassifier(random_state = 42))])

rf_params = {
             'rf__n_estimators': [200,300,500],
          'rf__max_depth': [20,30,50],
          'rf__min_samples_split': [20,30,40,60],
          'rf__min_samples_leaf': [2,4,10,20],
          'rf__max_features': ['auto', 'sqrt']
              }

rf = RandomizedSearchCV(estimator=pipe, 
                        param_distributions = rf_params,
                        random_state=42,
                        cv=5)

rf.fit(X_train, y_train)

In [None]:
rftrain_score = rf.score(X_train, y_train)
rftest_score = rf.score(X_test, y_test)

print(f'Random Forest model with CountVectorizer training score: {rftrain_score}')
print(f'Random Forest model with CountVectorizer testing score: {rftest_score}')

In [None]:
rf_preds = rf.predict(X_test)
rfAccuracy = accuracy_score(y_test, rf_preds)
rfRecall = recall_score(y_test, rf_preds)
rfPrecision = precision_score(y_test, rf_preds)
rfF1_score = f1_score(y_test, rf_preds)
rfROC_AUC_score = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

print(f'Random Forest model with CountVectorizer accuracy: {rfAccuracy}')
print(f'Random Forest model with CountVectorizer recall: {rfRecall}')
print(f'Random Forest model with CountVectorizer precision: {rfPrecision}')
print(f'Random Forest model with CountVectorizer F1 score: {rfF1_score}')
print(f'Random Forest model with CountVectorizer AUC Score: {rfROC_AUC_score}')

### LinearSVM with CountVectorizer

In [None]:
svc = LinearSVC(max_iter = 2000)
pgrid = {"C": np.linspace(0.0001, 1, 20)}
gcv = GridSearchCV(svc,
                  pgrid,
                  cv=5,
                  n_jobs =-1)

gcv.fit(Xc_train, y_train)

In [None]:
gcvtrain_score = gcv.score(Xc_train, y_train)
gcvtest_score = gcv.score(Xc_test, y_test)

print(f'SVM model with CountVectorizer training score: {gcvtrain_score}')
print(f'SVM model with CountVectorizer testing score: {gcvtest_score}')

In [None]:
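# Predict on the count-vectorized test set, since gcv was fit on Xc_train
gcv_preds = gcv.predict(Xc_test)
gcvAccuracy = accuracy_score(y_test, gcv_preds)
gcvRecall = recall_score(y_test, gcv_preds)
gcvPrecision = precision_score(y_test, gcv_preds)
gcvF1_score = f1_score(y_test, gcv_preds)
# LinearSVC has no predict_proba, so use decision_function scores for ROC AUC
gcvROC_AUC_score = roc_auc_score(y_test, gcv.decision_function(Xc_test))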

print(f'SVM model with CountVectorizer accuracy: {gcvAccuracy}')
print(f'SVM model with CountVectorizer recall: {gcvRecall}')
print(f'SVM model with CountVectorizer precision: {gcvPrecision}')
print(f'SVM model with CountVectorizer F1 score: {gcvF1_score}')
print(f'SVM model with CountVectorizer AUC Score: {gcvROC_AUC_score}')
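
`LinearSVC` does not provide `predict_proba`, but ROC AUC only needs a score that ranks examples, so the signed distances from `decision_function` can be used instead. A minimal sketch on made-up tweets (the texts and labels are assumptions, and the AUC here is computed on the training data purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

# Made-up tweets: 1 = relevant to a wildfire, 0 = not relevant.
texts = [
    "wildfire spreading near the highway",
    "evacuation ordered for the county",
    "smoke closing the airport",
    "my mixtape is fire",
    "this party is smoke",
    "great burger place downtown",
]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)
clf = LinearSVC(random_state=42).fit(X, labels)

# decision_function returns one signed score per example; higher means "more positive".
scores = clf.decision_function(X)
auc = roc_auc_score(labels, scores)
print(auc)
```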

We attempted to run a RandomizedSearch over our SVM model, but it failed because the machine ran out of memory.

### LinearSVM with TF-IDF Vectorizer

In [None]:
lsvc = LinearSVC(random_state=42)
lsvc.fit(Xt_train, y_train)

In [None]:
lsvctrain_score = lsvc.score(Xt_train, y_train)
lsvctest_score = lsvc.score(Xt_test, y_test)

print(f'LinearSVC model with TF-IDF Vectorizer training score: {lsvctrain_score}')
print(f'LinearSVC model with TF-IDF Vectorizer testing score: {lsvctest_score}')

In [None]:
lsvc_preds = lsvc.predict(Xt_test)
lsvcAccuracy = accuracy_score(y_test, lsvc_preds)
lsvcRecall = recall_score(y_test, lsvc_preds)
lsvcPrecision = precision_score(y_test, lsvc_preds)
lsvcF1_score = f1_score(y_test, lsvc_preds)
# LinearSVC has no predict_proba, so use decision_function scores for ROC AUC
lsvcROC_AUC_score = roc_auc_score(y_test, lsvc.decision_function(Xt_test))

print(f'LinearSVC model with TF-IDF Vectorizer accuracy: {lsvcAccuracy}')
print(f'LinearSVC model with TF-IDF Vectorizer recall: {lsvcRecall}')
print(f'LinearSVC model with TF-IDF Vectorizer precision: {lsvcPrecision}')
print(f'LinearSVC model with TF-IDF Vectorizer F1 score: {lsvcF1_score}')
print(f'LinearSVC model with TF-IDF Vectorizer AUC Score: {lsvcROC_AUC_score}')

### SVC with TF-IDF Vectorizer

In [None]:
tsvc = SVC()
tsvc.fit(Xt_train, y_train)

In [None]:
# Score the SVC fit above (tsvc), not the earlier grid search (gcv)
tstrain_score = tsvc.score(Xt_train, y_train)
tstest_score = tsvc.score(Xt_test, y_test)

print(f'SVM model with TF-IDF Vectorizer training score: {tstrain_score}')
print(f'SVM model with TF-IDF Vectorizer testing score: {tstest_score}')

In [None]:
tsvc_preds = tsvc.predict(Xt_test)
tsAccuracy = accuracy_score(y_test, tsvc_preds)
tsRecall = recall_score(y_test, tsvc_preds)
tsPrecision = precision_score(y_test, tsvc_preds)
tsF1_score = f1_score(y_test, tsvc_preds)
# SVC() without probability=True has no predict_proba, so use decision_function scores
tsROC_AUC_score = roc_auc_score(y_test, tsvc.decision_function(Xt_test))

print(f'SVM model with TF-IDF Vectorizer accuracy: {tsAccuracy}')
print(f'SVM model with TF-IDF Vectorizer recall: {tsRecall}')
print(f'SVM model with TF-IDF Vectorizer precision: {tsPrecision}')
print(f'SVM model with TF-IDF Vectorizer F1 score: {tsF1_score}')
print(f'SVM model with TF-IDF Vectorizer AUC Score: {tsROC_AUC_score}')

### Decision Tree with CountVectorizer

In [None]:
dt = DecisionTreeClassifier(max_depth=10,
                            min_samples_split =7,
                            min_samples_leaf = 3,
                            ccp_alpha=0.01,
                            random_state = 42)

dt.fit(Xc_train,y_train)

In [None]:
dttrain_score = dt.score(Xc_train, y_train)
dttest_score = dt.score(Xc_test, y_test)

print(f'Decision Tree model with CountVectorizer training score: {dttrain_score}')
print(f'Decision Tree model with CountVectorizer testing score: {dttest_score}')

In [None]:
# Predict on the count-vectorized test set, since dt was fit on Xc_train
dt_preds = dt.predict(Xc_test)
dtAccuracy = accuracy_score(y_test, dt_preds)
dtRecall = recall_score(y_test, dt_preds)
dtPrecision = precision_score(y_test, dt_preds)
dtF1_score = f1_score(y_test, dt_preds)
dtROC_AUC_score = roc_auc_score(y_test, dt.predict_proba(Xc_test)[:, 1])

print(f'Decision Tree model with CountVectorizer accuracy: {dtAccuracy}')
print(f'Decision Tree model with CountVectorizer recall: {dtRecall}')
print(f'Decision Tree model with CountVectorizer precision: {dtPrecision}')
print(f'Decision Tree model with CountVectorizer F1 score: {dtF1_score}')
print(f'Decision Tree model with CountVectorizer AUC Score: {dtROC_AUC_score}')

This model was one of our top performers; it had the highest precision score. Because of this, we plotted its ROC curve.

In [None]:
plt.style.use('fivethirtyeight')
plot_roc_curve(dt, Xc_test, y_test)
plt.plot([0,1], [0,1],label = 'baseline', linestyle = '--')
plt.title('Decision Tree ROC')
plt.legend();
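
The area under the ROC curve can be read as the probability that a randomly chosen relevant tweet receives a higher score than a randomly chosen irrelevant one. The sketch below computes it that way on four made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration, not a library function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and model scores (illustrative only).
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75. The dashed baseline in the plot corresponds to 0.5, meaning random ranking.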

### Decision Tree with TF-IDF Vectorizer

In [None]:
tdt = DecisionTreeClassifier(random_state = 42)
tdt.fit(Xt_train, y_train)

In [None]:
# Score the TF-IDF tree (tdt), not the CountVectorizer tree (dt)
tdttrain_score = tdt.score(Xt_train, y_train)
tdttest_score = tdt.score(Xt_test, y_test)

print(f'Decision Tree model with TF-IDF Vectorizer training score: {tdttrain_score}')
print(f'Decision Tree model with TF-IDF Vectorizer testing score: {tdttest_score}')

In [None]:
tdt_preds = tdt.predict(Xt_test)
tdtAccuracy = accuracy_score(y_test, tdt_preds)
tdtRecall = recall_score(y_test, tdt_preds)
tdtPrecision = precision_score(y_test, tdt_preds)
tdtF1_score = f1_score(y_test, tdt_preds)
tdtROC_AUC_score = roc_auc_score(y_test, tdt.predict_proba(Xt_test)[:, 1])

print(f'Decision Tree model with TF-IDF Vectorizer accuracy: {tdtAccuracy}')
print(f'Decision Tree model with TF-IDF Vectorizer recall: {tdtRecall}')
print(f'Decision Tree model with TF-IDF Vectorizer precision: {tdtPrecision}')
print(f'Decision Tree model with TF-IDF Vectorizer F1 score: {tdtF1_score}')
print(f'Decision Tree model with TF-IDF Vectorizer AUC Score: {tdtROC_AUC_score}')

### Multinomial Naive-Bayes with CountVectorizer

Multinomial Naive-Bayes is one of the few models we were able to run a GridSearch on.

In [None]:
nb = Pipeline([
    ('cvec', CountVectorizer() ),
    ('nb', MultinomialNB()),
])

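pipe_params = {
    'cvec__max_features': [1000, 2000, 3000, 4000, 5000],
    'cvec__min_df': [1, 2, 3, 4, 5],
    # 1.0 (float) means "all documents"; the integer 1 would mean "at most one document"
    'cvec__max_df': [0.95, 0.9, 1.0],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}

# The grid has 5 * 5 * 3 * 2 = 150 combinations, fewer than n_iter,
# so this randomized search ends up trying every combination.
mnb = RandomizedSearchCV(nb,
                         pipe_params,
                         n_iter=200,
                         cv=5,
                         n_jobs=-1,
                         random_state=42)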

mnb.fit(X_train, y_train)

cmnb_train = mnb.score(X_train, y_train)
cmnb_test = mnb.score(X_test, y_test)

print(f'Multinomial Naive-Bayes model with CountVectorizer training score: {cmnb_train}')
print(f'Multinomial Naive-Bayes model with CountVectorizer testing score: {cmnb_test}')

These are the parameters that produced the best scores from the Multinomial Naive-Bayes (CV) model:

In [None]:
mnb.best_params_

In [None]:
mtrain_score = mnb.score(X_train, y_train)
mtest_score = mnb.score(X_test, y_test)

print(f'Multinomial Naive-Bayes model with CountVectorizer training score: {mtrain_score}')
print(f'Multinomial Naive-Bayes model with CountVectorizer testing score: {mtest_score}')

In [None]:
mnb_preds = mnb.predict(X_test)
mAccuracy = accuracy_score(y_test, mnb_preds)
mRecall = recall_score(y_test, mnb_preds)
mPrecision = precision_score(y_test, mnb_preds)
mF1_score = f1_score(y_test, mnb_preds)
mROC_AUC_score = roc_auc_score(y_test, mnb.predict_proba(X_test)[:, 1])

print(f'Multinomial Naive-Bayes model with CountVectorizer accuracy: {mAccuracy}')
print(f'Multinomial Naive-Bayes model with CountVectorizer recall: {mRecall}')
print(f'Multinomial Naive-Bayes model with CountVectorizer precision: {mPrecision}')
print(f'Multinomial Naive-Bayes model with CountVectorizer F1 score: {mF1_score}')
print(f'Multinomial Naive-Bayes model with CountVectorizer AUC Score: {mROC_AUC_score}')

Multinomial Naive-Bayes was another top-performing model; it showed the least overfitting. We created an ROC curve for this model as well.

In [None]:
plot_roc_curve(mnb, X_test, y_test)
plt.plot([0,1], [0,1],label = 'baseline', linestyle = '--')
plt.title('MNB ROC')
plt.legend();

# Analysis and Conclusions

Of the models we tested, our Multinomial Naive-Bayes and Decision Tree models performed best. MNB showed very little overfitting with both types of vectorizers, and DT scored very high on the metrics we chose to focus on (accuracy, precision, and recall), which makes it the best model for our problem statement.

Which metric to optimize depends on the situation. Accuracy is a useful metric for non-technical audiences and gives a big-picture sense of how well the model is working. Precision can be optimized when it is acceptable to miss a few relevant tweets as long as most of the tweets that come through are relevant, as with the 2020 CA wildfires. That was not really a life-or-death situation: people had time to prepare and decide what their next move would be, so we believe precision is appropriate here. Recall should be optimized when a situation is dire and missing a relevant tweet is costly, even if a few irrelevant tweets come through as well.
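
The trade-off between precision and recall can be sketched by moving the decision threshold applied to a model's predicted probabilities. The labels, probabilities, and the `precision_recall_at` helper below are made up for illustration and are not results from the models in this notebook.

```python
def precision_recall_at(labels, probs, threshold):
    """Precision and recall when every probability >= threshold is flagged relevant."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities of relevance.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]

strict = precision_recall_at(labels, probs, threshold=0.5)
lenient = precision_recall_at(labels, probs, threshold=0.3)
print(strict, lenient)
```

With the stricter threshold every flagged tweet is relevant but one relevant tweet is missed; with the lower threshold no relevant tweet is missed but one irrelevant tweet comes through.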