# Natural Language Processing with Disaster Tweets (Kaggle Competition)

## **Outline**

**Feature extraction techniques**
1. TF-IDF
2. spaCy word embeddings
3. Gensim word vectors
4. fastText (not currently working; saved in a separate notebook)


**Data processing techniques**
- Class balancing (nondisaster tweets: 57%)
    - Class_weight parameter
    - Undersampling
    - Oversampling
    - SMOTE
    - Combination
- Preprocessing
    - Stopwords
    - Stemming and/or lemmatization
    - Punctuation


**Models**
- K-Nearest Neighbors (KNN)
- Multinomial Naive Bayes (MNB)
- Random Forest (RF)
- Gradient Boosting Classifier (GBC)
- Extreme Gradient Boosting (XGB)
- Light Gradient Boosting Machine (LGBM)
- Neural networks
    
    
**Other ideas?**
- Categorical Naive Bayes?
- Add in the keyword and/or location to the text of the tweet?

## Preliminary Steps

In [1]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
# helper function to save scores
from sklearn.metrics import f1_score

scores_df = pd.DataFrame()

def save_scores(model_pipe, X_train, X_test, y_train, y_test, name):
    """Store per-class and macro F1 scores of a fitted model in scores_df."""

    # calculate predictions
    train_pred = model_pipe.predict(X_train)
    test_pred = model_pipe.predict(X_test)

    # per-class f1 scores (average=None returns one score per class, in label order)
    f1_0_train, f1_1_train = f1_score(y_train, train_pred, average=None)
    f1_0_test, f1_1_test = f1_score(y_test, test_pred, average=None)

    # store scores
    scores_df.at[name, 'F1_0_Train'] = f1_0_train
    scores_df.at[name, 'F1_1_Train'] = f1_1_train
    scores_df.at[name, 'F1_Avg_Train'] = f1_score(y_train, train_pred, average='macro')
    scores_df.at[name, 'F1_0_Test'] = f1_0_test
    scores_df.at[name, 'F1_1_Test'] = f1_1_test
    scores_df.at[name, 'F1_Avg_Test'] = f1_score(y_test, test_pred, average='macro')

    # show scores for this model only (can call scores_df to see all scores)
    print(scores_df.loc[name, :])
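
The `F1_0`, `F1_1` and `F1_Avg` values stored by `save_scores` are per-class F1 scores and their unweighted (macro) mean. The following is a rough, pure-Python sketch of what those numbers measure. The helpers `f1_per_class` and `macro_f1` and the toy labels are illustrative and not part of the notebook, which relies on scikit-learn's `f1_score`.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of that class's precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)


# toy labels (hypothetical, only to exercise the helpers)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f1_per_class(y_true, y_pred, 0))
print(f1_per_class(y_true, y_pred, 1))
print(macro_f1(y_true, y_pred))
```

Because the macro average ignores class sizes, a weak score on the smaller disaster class pulls `F1_Avg` down as much as a weak score on the larger class would.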

In [3]:
# read in data
train_df = pd.read_csv("Data/train.csv")
test_df = pd.read_csv("Data/test.csv")

# check
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


In [4]:
# check for duplicates
train_df.duplicated().sum()

0

In [5]:
# check class balance
train_df['target'].value_counts()

# the classes are a bit unbalanced (57% to 43%),
# so we may want to try some class balancing techniques

0    4342
1    3271
Name: target, dtype: int64
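
As a reference point for the scores in the rest of the notebook, here is a back-of-the-envelope baseline: a model that always predicts the majority class (0). The counts are taken from the `value_counts()` output above; the variable names are illustrative, and the calculation uses the full training file rather than a held-out split.

```python
# class counts from train_df['target'].value_counts()
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

share_0 = counts[0] / total

# always predicting class 0:
precision_0 = counts[0] / total   # every row is predicted 0, only the true 0s are right
recall_0 = 1.0                    # every true 0 is found
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0                        # class 1 is never predicted, so it has no true positives

baseline_macro_f1 = (f1_0 + f1_1) / 2

print(f"majority class share: {share_0:.1%}")
print(f"macro F1 when always predicting 0: {baseline_macro_f1:.3f}")
```

Any model worth keeping should clear this macro F1 of roughly 0.36 by a wide margin.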

In [6]:
# check uniqueness of id column
train_df['id'].nunique()

7613

In [7]:
# check examples of non-disaster tweets (target = 0)
nondisaster_df = train_df[train_df['target'] == 0]

for tweet in nondisaster_df['text'].values[1:10]:
    print(tweet)

I love fruits
Summer is lovely
My car is so fast
What a goooooooaaaaaal!!!!!!
this is ridiculous....
London is cool ;)
Love skiing
What a wonderful day!
LOOOOOOL


In [8]:
# check examples of disaster tweets (target = 1)
disaster_df = train_df[train_df['target'] == 1]

for tweet in disaster_df['text'].values[1:10]:
    print(tweet)

Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...


## Class-balanced data

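In [None]:
# helper function to run model with balancing techniques
# (unfinished stub; the balanced runs below build their pipelines inline)
def run_with_balanced_data(model):
    raise NotImplementedError("balancing helper has not been written yet")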


# 1. TF-IDF

## No class balancing, no preprocessing (KNN, MNB, RF)

In [9]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [10]:
# tts on unprocessed data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.text,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [11]:
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81       869
           1       0.77      0.65      0.71       654

    accuracy                           0.77      1523
   macro avg       0.77      0.75      0.76      1523
weighted avg       0.77      0.77      0.76      1523



In [12]:
save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-unb")

F1_0_Train      0.865827
F1_1_Train      0.799016
F1_Avg_Train    0.832421
F1_0_Test       0.808234
F1_1_Test       0.705000
F1_Avg_Test     0.756617
Name: knn-tfidf-unb, dtype: float64


In [13]:
scores_df

| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |


In [14]:
# multinomial naive bayes

from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.93      0.84       869
           1       0.87      0.61      0.72       654

    accuracy                           0.79      1523
   macro avg       0.81      0.77      0.78      1523
weighted avg       0.81      0.79      0.78      1523



In [15]:
save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-unb")

F1_0_Train      0.910580
F1_1_Train      0.859256
F1_Avg_Train    0.884918
F1_0_Test       0.835836
F1_1_Test       0.715695
F1_Avg_Test     0.775766
Name: mnb-tfidf-unb, dtype: float64


In [16]:
# random forest

from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Random Forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.93      0.82       869
           1       0.86      0.55      0.67       654

    accuracy                           0.77      1523
   macro avg       0.80      0.74      0.75      1523
weighted avg       0.79      0.77      0.76      1523



In [17]:
save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-unb")

F1_0_Train      0.997411
F1_1_Train      0.996557
F1_Avg_Train    0.996984
F1_0_Test       0.820487
F1_1_Test       0.670391
F1_Avg_Test     0.745439
Name: rf-tfidf-unb, dtype: float64


## No class balancing, minimal preprocessing (KNN, MNB, RF)

In [18]:
# preprocessing: removing spacy stopwords and punctuation, lemmatizing

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    
    filtered_tokens = []
    
    # take out stopwords and punctuation
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        
        # convert to lemmas
        filtered_tokens.append(token.lemma_)
            
    return " ".join(filtered_tokens)

In [19]:
train_df['preprocessed_txt'] = train_df['text'].apply(preprocess)

In [20]:
# check
train_df.head()

| | id | keyword | location | text | target | preprocessed_txt |
|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | deed Reason earthquake ALLAH forgive |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | forest fire near La Ronge Sask Canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | resident ask shelter place notify officer evac... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | 13,000 people receive wildfire evacuation orde... |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | got send photo Ruby Alaska smoke wildfire pour... |


In [21]:
# tts on processed data

X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [22]:
# knn on preprocessed data
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81       869
           1       0.77      0.63      0.70       654

    accuracy                           0.76      1523
   macro avg       0.77      0.75      0.75      1523
weighted avg       0.76      0.76      0.76      1523



In [23]:
save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-unb-prep")

F1_0_Train      0.865442
F1_1_Train      0.791545
F1_Avg_Train    0.828493
F1_0_Test       0.805391
F1_1_Test       0.696893
F1_Avg_Test     0.751142
Name: knn-tfidf-unb-prep, dtype: float64


In [24]:
# multinomial naive bayes on preprocessed text

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.91      0.84       869
           1       0.84      0.65      0.73       654

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.78      1523
weighted avg       0.80      0.80      0.79      1523



In [25]:
save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-unb-prep")

F1_0_Train      0.926736
F1_1_Train      0.889803
F1_Avg_Train    0.908269
F1_0_Test       0.836074
F1_1_Test       0.733850
F1_Avg_Test     0.784962
Name: mnb-tfidf-unb-prep, dtype: float64


In [26]:
# random forest on preprocessed text

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Random Forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.91      0.82       869
           1       0.84      0.58      0.69       654

    accuracy                           0.77      1523
   macro avg       0.79      0.75      0.75      1523
weighted avg       0.78      0.77      0.76      1523



In [27]:
save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-unb-prep")

F1_0_Train      0.997554
F1_1_Train      0.996749
F1_Avg_Train    0.997152
F1_0_Test       0.819824
F1_1_Test       0.685302
F1_Avg_Test     0.752563
Name: rf-tfidf-unb-prep, dtype: float64


In [28]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |
| rf-tfidf-unb-prep | 0.997554 | 0.996749 | 0.997152 | 0.819824 | 0.685302 | 0.752563 |
| knn-tfidf-unb-prep | 0.865442 | 0.791545 | 0.828493 | 0.805391 | 0.696893 | 0.751142 |
| rf-tfidf-unb | 0.997411 | 0.996557 | 0.996984 | 0.820487 | 0.670391 | 0.745439 |


## Class balancing, minimal preprocessing (KNN, MNB, RF)

In [30]:
# class_weight param
# knn: none
# mnb: fit_prior = False
# rf: class_weight = 'balanced'
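
To make the `class_weight = 'balanced'` option concrete, here is a small sketch of the weighting formula that scikit-learn documents for it, `n_samples / (n_classes * count_of_class)`. The helper name `balanced_class_weights` is hypothetical. The training-split counts are derived from figures shown elsewhere in the notebook (4342 and 3271 rows in total, minus the 869 and 654 test rows reported in the classification reports).

```python
def balanced_class_weights(counts):
    """Weights inversely proportional to class frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# 80% training split: 4342 - 869 and 3271 - 654
train_counts = {0: 3473, 1: 2617}
weights = balanced_class_weights(train_counts)
print({label: round(w, 4) for label, w in weights.items()})

# after resampling to equal class sizes, the same formula gives 1.0 everywhere
resampled_counts = {0: 2617, 1: 2617}
print(balanced_class_weights(resampled_counts))
```

One consequence worth keeping in mind for the runs below: when a sampler has already equalised the classes, `'balanced'` weights computed on the resampled labels all come out as 1.0, so the parameter is unlikely to add anything beyond the resampling itself.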

In [31]:
# use class_weight param (if avail) WITH other
# sampling techniques

### Undersampling and class_weight

In [32]:
from imblearn.pipeline import make_pipeline as resample_pipeline
from imblearn.under_sampling import RandomUnderSampler

In [33]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [34]:
# undersampled knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(),
                        KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-under-prep")

F1_0_Train      0.850883
F1_1_Train      0.793340
F1_Avg_Train    0.822112
F1_0_Test       0.786238
F1_1_Test       0.702278
F1_Avg_Test     0.744258
Name: knn-tfidf-under-prep, dtype: float64
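
As an illustration of what the `RandomUnderSampler` step does to the training data, the following pure-Python sketch drops randomly chosen rows until every class is as small as the smallest one. The function `random_undersample` and the toy data are hypothetical stand-ins; the notebook itself uses the imbalanced-learn implementation, whose pipeline applies samplers while fitting and not when predicting.

```python
import random
from collections import defaultdict


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)

    # group row indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    n_min = min(len(idx) for idx in by_class.values())

    # sample n_min rows per class without replacement
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original row order

    return [X[i] for i in keep], [y[i] for i in keep]


# toy data: 7 rows of class 0, 3 rows of class 1
X = [f"tweet {i}" for i in range(10)]
y = [0] * 7 + [1] * 3

X_res, y_res = random_undersample(X, y)
print(len(X_res), y_res.count(0), y_res.count(1))
```

The cost of this approach is that majority-class rows are thrown away, which may partly explain why the undersampled runs trade some class 0 F1 for class 1 F1.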


In [35]:
# undersampled, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-under-prep")

F1_0_Train      0.919994
F1_1_Train      0.895900
F1_Avg_Train    0.907947
F1_0_Test       0.793184
F1_1_Test       0.738095
F1_Avg_Test     0.765640
Name: mnb-tfidf-under-prep, dtype: float64


In [36]:
# undersampled, balanced random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-under-prep")

F1_0_Train      0.979029
F1_1_Train      0.973326
F1_Avg_Train    0.976178
F1_0_Test       0.815176
F1_1_Test       0.716070
F1_Avg_Test     0.765623
Name: rf-tfidf-under-prep, dtype: float64


In [37]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |
| rf-tfidf-under-prep | 0.979029 | 0.973326 | 0.976178 | 0.815176 | 0.71607 | 0.765623 |
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |
| rf-tfidf-unb-prep | 0.997554 | 0.996749 | 0.997152 | 0.819824 | 0.685302 | 0.752563 |
| knn-tfidf-unb-prep | 0.865442 | 0.791545 | 0.828493 | 0.805391 | 0.696893 | 0.751142 |
| rf-tfidf-unb | 0.997411 | 0.996557 | 0.996984 | 0.820487 | 0.670391 | 0.745439 |
| knn-tfidf-under-prep | 0.850883 | 0.79334 | 0.822112 | 0.786238 | 0.702278 | 0.744258 |


### Oversampling and class_weight

In [38]:
from imblearn.over_sampling import RandomOverSampler

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [40]:
# oversampled knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-over-prep")

F1_0_Train      0.857910
F1_1_Train      0.803211
F1_Avg_Train    0.830561
F1_0_Test       0.783982
F1_1_Test       0.699136
F1_Avg_Test     0.741559
Name: knn-tfidf-over-prep, dtype: float64


In [41]:
# oversampled, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-over-prep")

F1_0_Train      0.935604
F1_1_Train      0.913328
F1_Avg_Train    0.924466
F1_0_Test       0.792803
F1_1_Test       0.730159
F1_Avg_Test     0.761481
Name: mnb-tfidf-over-prep, dtype: float64


In [42]:
# oversampled random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-over-prep")

F1_0_Train      0.997553
F1_1_Train      0.996751
F1_Avg_Train    0.997152
F1_0_Test       0.818182
F1_1_Test       0.692580
F1_Avg_Test     0.755381
Name: rf-tfidf-over-prep, dtype: float64


In [43]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |
| rf-tfidf-under-prep | 0.979029 | 0.973326 | 0.976178 | 0.815176 | 0.71607 | 0.765623 |
| mnb-tfidf-over-prep | 0.935604 | 0.913328 | 0.924466 | 0.792803 | 0.730159 | 0.761481 |
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |
| rf-tfidf-over-prep | 0.997553 | 0.996751 | 0.997152 | 0.818182 | 0.69258 | 0.755381 |
| rf-tfidf-unb-prep | 0.997554 | 0.996749 | 0.997152 | 0.819824 | 0.685302 | 0.752563 |
| knn-tfidf-unb-prep | 0.865442 | 0.791545 | 0.828493 | 0.805391 | 0.696893 | 0.751142 |
| rf-tfidf-unb | 0.997411 | 0.996557 | 0.996984 | 0.820487 | 0.670391 | 0.745439 |


### SMOTE and class_weight

In [44]:
from imblearn.over_sampling import SMOTE

In [45]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [46]:
# smote knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-smote-prep")

F1_0_Train      0.448329
F1_1_Train      0.667894
F1_Avg_Train    0.558111
F1_0_Test       0.413913
F1_1_Test       0.644515
F1_Avg_Test     0.529214
Name: knn-tfidf-smote-prep, dtype: float64
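
Unlike random oversampling, which repeats existing rows, SMOTE creates new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core step in pure Python. The function `smote_like_sample`, its parameters and the toy points are hypothetical, and the imbalanced-learn class used in the notebook handles many details omitted here.

```python
import random


def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)

    # the k nearest other minority points, by squared Euclidean distance
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])

    # move a random fraction of the way from base towards the neighbour
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# toy minority-class points in two dimensions
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_like_sample(minority, seed=s) for s in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Each synthetic point is a blend of two real ones, so with sparse TF-IDF rows it generally has more non-zero terms than either parent. Whether that contributes to the sharp drop in the KNN scores above has not been verified here.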


In [47]:
# smote, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-smote-prep")

F1_0_Train      0.932492
F1_1_Train      0.909475
F1_Avg_Train    0.920984
F1_0_Test       0.799768
F1_1_Test       0.739229
F1_Avg_Test     0.769498
Name: mnb-tfidf-smote-prep, dtype: float64


In [48]:
# smote, default mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        MultinomialNB())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-smote-def-prep")

F1_0_Train      0.934593
F1_1_Train      0.911997
F1_Avg_Train    0.923295
F1_0_Test       0.803458
F1_1_Test       0.739893
F1_Avg_Test     0.771676
Name: mnb-tfidf-smote-def-prep, dtype: float64


In [49]:
# smote random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-smote-prep")

F1_0_Train      0.997554
F1_1_Train      0.996749
F1_Avg_Train    0.997152
F1_0_Test       0.822917
F1_1_Test       0.698046
F1_Avg_Test     0.760481
Name: rf-tfidf-smote-prep, dtype: float64


In [50]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |


# 2. spaCy Word Embeddings

## MNB, no class balancing

In [51]:
train_df.head()

| | id | keyword | location | text | target | preprocessed_txt |
|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | deed Reason earthquake ALLAH forgive |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | forest fire near La Ronge Sask Canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | resident ask shelter place notify officer evac... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | 13,000 people receive wildfire evacuation orde... |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | got send photo Ruby Alaska smoke wildfire pour... |


In [52]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [53]:
# make spacy document vectors from the raw tweet text (takes a while!)
train_df['spacy_vector'] = train_df['text'].apply(lambda x: nlp(x).vector)

In [54]:
# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [55]:
# X_train / X_test are 1-D object arrays whose elements are vectors;
# stack them into plain 2-D arrays (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
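
The reshaping step above can be seen on a toy example. This sketch assumes NumPy is installed and uses made-up four-dimensional vectors in place of the spaCy document vectors; the names `vectors` and `stacked` are illustrative.

```python
import numpy as np

# a 1-D object array holding three separate vectors, which is
# the shape a DataFrame column of arrays has after calling .values
vectors = np.empty(3, dtype=object)
for i in range(3):
    vectors[i] = np.full(4, float(i))

print(vectors.shape, vectors.dtype)   # one entry per row, dtype object

# stack the row vectors along a new first axis
stacked = np.stack(vectors)
print(stacked.shape, stacked.dtype)   # rows x features, numeric dtype
```

`np.stack` requires every vector to have the same length, which should hold here because each document vector comes from the same spaCy model.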

In [56]:
# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
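
The scaling above fits the minimum and maximum on the training vectors only and reuses them for the test vectors. The following pure-Python sketch of the per-column formula `(x - min) / (max - min)` shows one consequence. The helper names and toy numbers are hypothetical, and constant columns are simply mapped to 0.0 here.

```python
def fit_min_max(rows):
    """Per-column minimum and maximum of the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def transform_min_max(rows, mins, maxs):
    """Apply (x - min) / (max - min) column by column."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[-1.0, 2.0], [0.0, 4.0], [1.0, 6.0]]
test = [[-2.0, 5.0]]   # first value lies below the training minimum

mins, maxs = fit_min_max(train)
scaled_train = transform_min_max(train, mins, maxs)
scaled_test = transform_min_max(test, mins, maxs)

print(scaled_train)
print(scaled_test)
```

Training values always land in [0, 1], but a test value below the training minimum scales to a negative number, which `MultinomialNB` does not accept. That evidently did not cause a problem for this split, but `MinMaxScaler(clip=True)` is one option to consider if it ever does.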

In [57]:
# mnb, spacy word vectors
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

F1_0_Train      0.707195
F1_1_Train      0.628678
F1_Avg_Train    0.667937
F1_0_Test       0.704639
F1_1_Test       0.625465
F1_Avg_Test     0.665052
Name: mnb-spacyvec-unb, dtype: float64


| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |
| rf-tfidf-under-prep | 0.979029 | 0.973326 | 0.976178 | 0.815176 | 0.71607 | 0.765623 |
| mnb-tfidf-over-prep | 0.935604 | 0.913328 | 0.924466 | 0.792803 | 0.730159 | 0.761481 |
| rf-tfidf-smote-prep | 0.997554 | 0.996749 | 0.997152 | 0.822917 | 0.698046 | 0.760481 |
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |
| rf-tfidf-over-prep | 0.997553 | 0.996751 | 0.997152 | 0.818182 | 0.69258 | 0.755381 |


## KNN, no class balancing

In [58]:
# knn

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

F1_0_Train      0.842459
F1_1_Train      0.776741
F1_Avg_Train    0.809600
F1_0_Test       0.754886
F1_1_Test       0.650199
F1_Avg_Test     0.702542
Name: knn-spacyvec-unb, dtype: float64


| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |
| rf-tfidf-under-prep | 0.979029 | 0.973326 | 0.976178 | 0.815176 | 0.71607 | 0.765623 |
| mnb-tfidf-over-prep | 0.935604 | 0.913328 | 0.924466 | 0.792803 | 0.730159 | 0.761481 |
| rf-tfidf-smote-prep | 0.997554 | 0.996749 | 0.997152 | 0.822917 | 0.698046 | 0.760481 |
| knn-tfidf-unb | 0.865827 | 0.799016 | 0.832421 | 0.808234 | 0.705 | 0.756617 |
| rf-tfidf-over-prep | 0.997553 | 0.996751 | 0.997152 | 0.818182 | 0.69258 | 0.755381 |


## RF, no class balancing

In [59]:
# rf

clf = RandomForestClassifier()

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.990834
F1_1_Train      0.987688
F1_Avg_Train    0.989261
F1_0_Test       0.800000
F1_1_Test       0.679761
F1_Avg_Test     0.739880
Name: rf-spacyvec-unb, dtype: float64


| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |


## MNB, class balancing

In [60]:
# undersampled mnb with uniform class prior (fit_prior = False) on spaCy vectors of the raw text

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# X_train / X_test are 1-D object arrays whose elements are vectors;
# stack them into plain 2-D arrays (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(RandomUnderSampler(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-under")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.675980
F1_1_Train      0.652950
F1_Avg_Train    0.664465
F1_0_Test       0.666240
F1_1_Test       0.647773
F1_Avg_Test     0.657007
Name: mnb-spacyvec-under, dtype: float64


| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |


In [61]:
# oversampled mnb with uniform class prior (fit_prior = False) on spaCy vectors of the raw text

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# X_train / X_test are 1-D object arrays whose elements are vectors;
# stack them into plain 2-D arrays (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(RandomOverSampler(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-over")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.676186
F1_1_Train      0.652714
F1_Avg_Train    0.664450
F1_0_Test       0.665387
F1_1_Test       0.647336
F1_Avg_Test     0.656362
Name: mnb-spacyvec-over, dtype: float64


| | F1_0_Train | F1_1_Train | F1_Avg_Train | F1_0_Test | F1_1_Test | F1_Avg_Test |
|---|---|---|---|---|---|---|
| mnb-tfidf-unb-prep | 0.926736 | 0.889803 | 0.908269 | 0.836074 | 0.73385 | 0.784962 |
| mnb-tfidf-unb | 0.91058 | 0.859256 | 0.884918 | 0.835836 | 0.715695 | 0.775766 |
| mnb-tfidf-smote-def-prep | 0.934593 | 0.911997 | 0.923295 | 0.803458 | 0.739893 | 0.771676 |
| mnb-tfidf-smote-prep | 0.932492 | 0.909475 | 0.920984 | 0.799768 | 0.739229 | 0.769498 |
| mnb-tfidf-under-prep | 0.919994 | 0.8959 | 0.907947 | 0.793184 | 0.738095 | 0.76564 |


In [62]:
# SMOTE mnb with uniform class prior (fit_prior = False) on spaCy vectors of the raw text

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# X_train / X_test are 1-D object arrays whose elements are vectors;
# stack them into plain 2-D arrays (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(SMOTE(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-smote")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.676517
F1_1_Train      0.654094
F1_Avg_Train    0.665306
F1_0_Test       0.666240
F1_1_Test       0.647773
F1_Avg_Test     0.657007
Name: mnb-spacyvec-smote, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-def-prep,0.934593,0.911997,0.923295,0.803458,0.739893,0.771676
mnb-tfidf-smote-prep,0.932492,0.909475,0.920984,0.799768,0.739229,0.769498
mnb-tfidf-under-prep,0.919994,0.8959,0.907947,0.793184,0.738095,0.76564


In [63]:
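# smoted mnb with default (learned) class priors on scaled spaCy vectors
# note: the scores below are saved under the same name as the previous cell
# ("mnb-spacyvec-smote"), so the two runs are not distinguishable in scores_df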

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# each set is a 1d numpy array whose elements are numpy arrays
# stack them into a single 2d numpy array, because clf is
# expecting a 2d array of shape (n_samples, n_features)

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(SMOTE(),
                        MultinomialNB())
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-smote")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.675564
F1_1_Train      0.653075
F1_Avg_Train    0.664320
F1_0_Test       0.666240
F1_1_Test       0.647773
F1_Avg_Test     0.657007
Name: mnb-spacyvec-smote, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-def-prep,0.934593,0.911997,0.923295,0.803458,0.739893,0.771676
mnb-tfidf-smote-prep,0.932492,0.909475,0.920984,0.799768,0.739229,0.769498
mnb-tfidf-under-prep,0.919994,0.8959,0.907947,0.793184,0.738095,0.76564
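
The MNB cells above rescale the spaCy vectors with `MinMaxScaler` before fitting, because `MultinomialNB` rejects negative feature values. The following is a minimal sketch of that constraint on synthetic data. The random array stands in for the embeddings and is not the competition data, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# synthetic stand-in for dense embeddings: values on both sides of zero
X_demo = rng.normal(size=(40, 8))
y_demo = np.array([0, 1] * 20)

# MultinomialNB raises ValueError when it is fit on negative features
try:
    MultinomialNB().fit(X_demo, y_demo)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# fit the scaler on the training rows, then reuse it for any new rows
scaler_demo = MinMaxScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

clf_demo = MultinomialNB(fit_prior=False).fit(X_demo_scaled, y_demo)
print(raw_fit_failed, X_demo_scaled.min(), X_demo_scaled.max())
```

Rows transformed with a scaler fit on other data can fall outside the [0, 1] range, since `MinMaxScaler` does not clip by default.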


## KNN, class balancing

In [64]:
# undersampled knn

clf = resample_pipeline(RandomUnderSampler(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-under")

F1_0_Train      0.809453
F1_1_Train      0.764857
F1_Avg_Train    0.787155
F1_0_Test       0.714370
F1_1_Test       0.643542
F1_Avg_Test     0.678956
Name: knn-spacyvec-under, dtype: float64


In [65]:
# oversampled knn

clf = resample_pipeline(RandomOverSampler(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-over")

F1_0_Train      0.827688
F1_1_Train      0.785016
F1_Avg_Train    0.806352
F1_0_Test       0.720614
F1_1_Test       0.650407
F1_Avg_Test     0.685510
Name: knn-spacyvec-over, dtype: float64


In [66]:
# smoted knn

clf = resample_pipeline(SMOTE(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-smote")

F1_0_Train      0.780480
F1_1_Train      0.768634
F1_Avg_Train    0.774557
F1_0_Test       0.652687
F1_1_Test       0.651316
F1_Avg_Test     0.652001
Name: knn-spacyvec-smote, dtype: float64


In [67]:
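# undersampled default knn
# (sklearn defaults: n_neighbors = 5, metric = 'minkowski' with p = 2,
# which is Euclidean distance, i.e. the same settings as the explicit knn above)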

clf = resample_pipeline(RandomUnderSampler(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-under")

F1_0_Train      0.817340
F1_1_Train      0.768744
F1_Avg_Train    0.793042
F1_0_Test       0.725444
F1_1_Test       0.657817
F1_Avg_Test     0.691630
Name: knndef-spacyvec-under, dtype: float64


In [68]:
# oversampled default knn

clf = resample_pipeline(RandomOverSampler(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-over")

F1_0_Train      0.829743
F1_1_Train      0.786375
F1_Avg_Train    0.808059
F1_0_Test       0.720475
F1_1_Test       0.653931
F1_Avg_Test     0.687203
Name: knndef-spacyvec-over, dtype: float64


In [69]:
# smoted default knn

clf = resample_pipeline(SMOTE(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-smote")

F1_0_Train      0.784936
F1_1_Train      0.774074
F1_Avg_Train    0.779505
F1_0_Test       0.652259
F1_1_Test       0.650428
F1_Avg_Test     0.651344
Name: knndef-spacyvec-smote, dtype: float64
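
The "default" KNN cells call `KNeighborsClassifier()` with no arguments. In scikit-learn the defaults are `n_neighbors = 5` and `metric = 'minkowski'` with `p = 2`, which is Euclidean distance, so the `knn-*` and `knndef-*` runs use the same classifier settings. A small check on synthetic data (hypothetical `_demo` names, not the tweet vectors) is sketched below.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# synthetic data: 60 training rows, 20 new rows, 5 features
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_new = rng.normal(size=(20, 5))

explicit_knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X_demo, y_demo)
default_knn = KNeighborsClassifier().fit(X_demo, y_demo)

# both models should pick the same neighbors and hence the same labels
same_predictions = np.array_equal(explicit_knn.predict(X_new),
                                  default_knn.predict(X_new))

default_params = default_knn.get_params()
print(default_params['metric'], default_params['p'], same_predictions)
```

If that holds, the small score differences between pairs such as `knn-spacyvec-under` and `knndef-spacyvec-under` are more likely to come from the randomness of the resampler than from the classifier, assuming the samplers are not given a fixed `random_state`.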


## RF, class balancing

In [70]:
# rf under

clf = resample_pipeline(RandomUnderSampler(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-under")

F1_0_Train      0.964317
F1_1_Train      0.955169
F1_Avg_Train    0.959743
F1_0_Test       0.777149
F1_1_Test       0.691706
F1_Avg_Test     0.734428
Name: rf-spacyvec-under, dtype: float64


In [71]:
# rf under bal

clf = resample_pipeline(RandomUnderSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-under")

F1_0_Train      0.966441
F1_1_Train      0.957668
F1_Avg_Train    0.962055
F1_0_Test       0.774527
F1_1_Test       0.698388
F1_Avg_Test     0.736458
Name: rfbal-spacyvec-under, dtype: float64


In [72]:
# rf over

clf = resample_pipeline(RandomOverSampler(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-over")

F1_0_Train      0.990378
F1_1_Train      0.987157
F1_Avg_Train    0.988768
F1_0_Test       0.798683
F1_1_Test       0.699918
F1_Avg_Test     0.749301
Name: rf-spacyvec-over, dtype: float64


In [73]:
# rf over bal

clf = resample_pipeline(RandomOverSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-over")

F1_0_Train      0.990668
F1_1_Train      0.987536
F1_Avg_Train    0.989102
F1_0_Test       0.794150
F1_1_Test       0.683333
F1_Avg_Test     0.738741
Name: rfbal-spacyvec-over, dtype: float64


In [74]:
# rf smote

clf = resample_pipeline(SMOTE(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-smote")

F1_0_Train      0.990820
F1_1_Train      0.987711
F1_Avg_Train    0.989266
F1_0_Test       0.794085
F1_1_Test       0.691803
F1_Avg_Test     0.742944
Name: rf-spacyvec-smote, dtype: float64


In [75]:
# rf smote with balancing

clf = resample_pipeline(SMOTE(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-smote")

F1_0_Train      0.990528
F1_1_Train      0.987337
F1_Avg_Train    0.988933
F1_0_Test       0.789732
F1_1_Test       0.683128
F1_Avg_Test     0.736430
Name: rfbal-spacyvec-smote, dtype: float64


In [76]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-def-prep,0.934593,0.911997,0.923295,0.803458,0.739893,0.771676
mnb-tfidf-smote-prep,0.932492,0.909475,0.920984,0.799768,0.739229,0.769498
mnb-tfidf-under-prep,0.919994,0.8959,0.907947,0.793184,0.738095,0.76564
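
None of the spaCy-vector runs reaches the top five. One way to compare them is the gap between train and test F1, sketched below with the average F1 values copied from the cell outputs above (the best test score per model family; the MNB row is from cell 62). The `Train_Test_Gap` column name is a hypothetical addition, not part of `scores_df`.

```python
import pandas as pd

# F1_Avg values copied from the recorded cell outputs
recorded = pd.DataFrame(
    {
        'F1_Avg_Train': [0.665306, 0.793042, 0.988768],
        'F1_Avg_Test': [0.657007, 0.691630, 0.749301],
    },
    index=['mnb-spacyvec-smote', 'knndef-spacyvec-under', 'rf-spacyvec-over'],
)

# a large positive gap suggests the model fits the training set
# much more closely than it generalizes
recorded['Train_Test_Gap'] = recorded['F1_Avg_Train'] - recorded['F1_Avg_Test']

print(recorded.sort_values(by='Train_Test_Gap', ascending=False))
```

On these numbers the random forest has the highest test score but also the largest gap (about 0.24), while MNB barely overfits (gap below 0.01) but scores lowest.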


# 3. Gensim word vectors

In [77]:
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

In [78]:
import pandas as pd
df = pd.read_csv('Data/train.csv')

In [79]:
print(df.shape)
df.head()

(7613, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [80]:
# balance classes?

In [81]:
# preprocess and get gensim doc vector
import spacy

nlp = spacy.load("en_core_web_lg")

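def preprocess_and_vectorize(text):
    """Return the mean word2vec vector of the tweet's lemmas."""
    doc = nlp(text)
    
    filtered_tokens = []
    
    for token in doc:
        # skip punctuation and stop words
        if token.is_punct or token.is_stop:
            continue
        filtered_tokens.append(token.lemma_)
    
    # average the word vectors of the remaining lemmas
    return wv.get_mean_vector(filtered_tokens)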

In [82]:
# convert text into gensim word embeddings

df['gensim_vector'] = df['text'].apply(preprocess_and_vectorize)

In [83]:
df.head()

Unnamed: 0,id,keyword,location,text,target,gensim_vector
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[0.05016107, 0.00387215, 0.047061782, 0.028958..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[0.03064329, 0.0030595234, 0.0369662, 0.020602..."
2,5,,,All residents asked to 'shelter in place' are ...,1,"[-0.0048536863, 0.011481234, 0.016771162, -0.0..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[0.060398173, -0.012511074, -0.0018801317, 0.0..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[0.021673834, 0.0012636562, -0.031610973, 0.03..."
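
Each `gensim_vector` is a single document vector obtained by averaging the word vectors of the tweet's remaining lemmas. A minimal sketch of that idea with a hypothetical three-word vocabulary is shown below. It is a plain average in NumPy, and gensim's `get_mean_vector` has its own options for normalization and missing keys that this sketch does not reproduce.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings" for a tiny vocabulary
toy_vectors = {
    'forest': np.array([1.0, 0.0, 0.0]),
    'fire': np.array([0.0, 1.0, 0.0]),
    'near': np.array([0.0, 0.0, 1.0]),
}

def mean_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'sask' is out of vocabulary and is ignored
doc_vec = mean_vector(['forest', 'fire', 'sask'], toy_vectors)
empty_vec = mean_vector([], toy_vectors)
print(doc_vec, empty_vec)
```

Averaging discards word order, so two tweets with the same words in a different order get the same vector.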


In [84]:
# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.gensim_vector.values,
    df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.target)

In [85]:
# create 2d np arrays for X train and test sets
import numpy as np
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [86]:
# gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "gbc-gensim")

F1_0_Train      0.891286
F1_1_Train      0.841212
F1_Avg_Train    0.866249
F1_0_Test       0.816304
F1_1_Test       0.737849
F1_Avg_Test     0.777076
Name: gbc-gensim, dtype: float64


In [87]:
scores_df.sort_values(by = "F1_Avg_Test", ascending = False).head()

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
gbc-gensim,0.891286,0.841212,0.866249,0.816304,0.737849,0.777076
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-def-prep,0.934593,0.911997,0.923295,0.803458,0.739893,0.771676
mnb-tfidf-smote-prep,0.932492,0.909475,0.920984,0.799768,0.739229,0.769498


More Gensim options. Other pretrained models are available:
- trained on Twitter or Wikipedia text
- GloVe or fastText vectors

Gensim models to consider:
- glove-twitter-200
- word2vec-google-news-300 (used above)
- glove-wiki-gigaword-300

# Scores

In [89]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
gbc-gensim,0.891286,0.841212,0.866249,0.816304,0.737849,0.777076
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-def-prep,0.934593,0.911997,0.923295,0.803458,0.739893,0.771676
mnb-tfidf-smote-prep,0.932492,0.909475,0.920984,0.799768,0.739229,0.769498
mnb-tfidf-under-prep,0.919994,0.8959,0.907947,0.793184,0.738095,0.76564
rf-tfidf-under-prep,0.979029,0.973326,0.976178,0.815176,0.71607,0.765623
mnb-tfidf-over-prep,0.935604,0.913328,0.924466,0.792803,0.730159,0.761481
rf-tfidf-smote-prep,0.997554,0.996749,0.997152,0.822917,0.698046,0.760481
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617
