# Natural Language Processing with Disaster Tweets (Kaggle Competition)

## **Outline**

**Feature extraction techniques**
1. TF-IDF
2. spaCy word embeddings
3. Gensim word vectors
4. fastText (not currently working -- saved in separate notebook)


**Data processing techniques**
- Class balancing (nondisaster tweets: 57%)
    - Class_weight parameter
    - Undersampling
    - Oversampling
    - SMOTE
    - Combination
- Preprocessing
    - Stopwords
    - Stemming and/or lemmatization
    - Punctuation


**Models**
- K-Nearest Neighbors (KNN)
- Multinomial Naive Bayes (MNB)
- Random Forest (RF)
- Gradient Boosting Classifier (GBC)
- Extreme Gradient Boosting (XGB)
- Light Gradient Boosting Machine (LGBM)
- Neural networks
    
    
**Other ideas?**
- Categorical Naive Bayes?

In [1]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
# helper function to save scores
from sklearn.metrics import f1_score

scores_df = pd.DataFrame()

def save_scores(model_pipe, X_train, X_test, y_train, y_test, name):
  
    # calculate predictions
    train_pred = model_pipe.predict(X_train)
    test_pred = model_pipe.predict(X_test)
    
    # per-class F1 scores in label order (0 = non-disaster, 1 = disaster);
    # passing labels explicitly guarantees two values even if a class is missing
    f1_0_train, f1_1_train = f1_score(
        y_train, train_pred, average = None, labels = [0, 1])
    f1_0_test, f1_1_test = f1_score(
        y_test, test_pred, average = None, labels = [0, 1])
        
    # store scores
    scores_df.at[name, 'F1_0_Train'] = f1_0_train
    scores_df.at[name, 'F1_1_Train'] = f1_1_train
    scores_df.at[name, 'F1_Avg_Train'] = f1_score(y_train, train_pred, average = 'macro')
    scores_df.at[name, 'F1_0_Test'] = f1_0_test
    scores_df.at[name, 'F1_1_Test'] = f1_1_test
    scores_df.at[name, 'F1_Avg_Test'] = f1_score(y_test, test_pred, average = 'macro')
    
    # show scores for this model only (can call scores_df to see all scores)
    print(scores_df.loc[name, :])
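
The `F1_Avg_*` columns use `average = 'macro'`, which is the unweighted mean of the per-class F1 scores. Below is a minimal pure-Python sketch of that calculation. The helper names `f1_for_class` and `macro_f1` are made up for illustration, and the labels are toy data rather than competition tweets.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, from true positives, false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels = (0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# toy labels (assumption: not real data)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

f1_0 = f1_for_class(y_true, y_pred, 0)
f1_1 = f1_for_class(y_true, y_pred, 1)
macro = macro_f1(y_true, y_pred)
print(f1_0, f1_1, macro)
```

The recorded scores follow the same rule: for `knn-tfidf-unb`, (0.808234 + 0.705000) / 2 = 0.756617, which is the reported `F1_Avg_Test`.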

In [None]:
# read in data
train_df = pd.read_csv("Data/train.csv")
test_df = pd.read_csv("Data/test.csv")

# check
train_df.info()
test_df.info()

In [None]:
# check for duplicates
train_df.duplicated().sum()

In [None]:
# check class balance
train_df['target'].value_counts()

# the classes are somewhat imbalanced (57% non-disaster to 43% disaster),
# so we may want to try some class balancing techniques
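
The class shares can be read directly from normalized value counts. Here is a small sketch on a made-up `target` column. The toy frame `toy_df` stands in for `train_df`, and its proportions are chosen to resemble the split noted above, not computed from the real data.

```python
import pandas as pd

# toy stand-in for train_df: 57 non-disaster rows, 43 disaster rows
toy_df = pd.DataFrame({"target": [0] * 57 + [1] * 43})

# normalize = True returns proportions instead of raw counts
class_share = toy_df["target"].value_counts(normalize = True)
print(class_share)
```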

In [None]:
# check uniqueness of id column
train_df['id'].nunique()

In [None]:
# check examples of non-disaster tweets (target = 0)
nondisaster_df = train_df[train_df['target'] == 0]

for tweet in nondisaster_df['text'].values[:10]:
    print(tweet)

In [None]:
# check examples of disaster tweets (target = 1)
disaster_df = train_df[train_df['target'] == 1]

for tweet in disaster_df['text'].values[:10]:
    print(tweet)

# 1. TF-IDF

## No class balancing, no preprocessing (KNN, MNB, RF)

In [15]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [16]:
# tts on unprocessed data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.text,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [17]:
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81       869
           1       0.77      0.65      0.71       654

    accuracy                           0.77      1523
   macro avg       0.77      0.75      0.76      1523
weighted avg       0.77      0.77      0.76      1523



In [18]:
save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-unb")

F1_0_Train      0.865827
F1_1_Train      0.799016
F1_Avg_Train    0.832421
F1_0_Test       0.808234
F1_1_Test       0.705000
F1_Avg_Test     0.756617
Name: knn-tfidf-unb, dtype: float64


In [19]:
scores_df

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617


In [20]:
# multinomial naive bayes

from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.93      0.84       869
           1       0.87      0.61      0.72       654

    accuracy                           0.79      1523
   macro avg       0.81      0.77      0.78      1523
weighted avg       0.81      0.79      0.78      1523



In [21]:
save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-unb")

F1_0_Train      0.910580
F1_1_Train      0.859256
F1_Avg_Train    0.884918
F1_0_Test       0.835836
F1_1_Test       0.715695
F1_Avg_Test     0.775766
Name: mnb-tfidf-unb, dtype: float64


In [22]:
# random forest

from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Random Forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.92      0.82       869
           1       0.84      0.58      0.68       654

    accuracy                           0.77      1523
   macro avg       0.79      0.75      0.75      1523
weighted avg       0.78      0.77      0.76      1523



In [23]:
save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-unb")

F1_0_Train      0.997412
F1_1_Train      0.996556
F1_Avg_Train    0.996984
F1_0_Test       0.820803
F1_1_Test       0.684783
F1_Avg_Test     0.752793
Name: rf-tfidf-unb, dtype: float64


## No class balancing, minimal preprocessing (KNN, MNB, RF)

In [24]:
# preprocessing: removing spacy stopwords and punctuation, lemmatizing

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    
    filtered_tokens = []
    
    # take out stopwords and punctuation
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        
        # convert to lemmas
        filtered_tokens.append(token.lemma_)
            
    return " ".join(filtered_tokens)

In [25]:
train_df['preprocessed_txt'] = train_df['text'].apply(preprocess)

In [26]:
# check
train_df.head()

Unnamed: 0,id,keyword,location,text,target,preprocessed_txt
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deed Reason earthquake ALLAH forgive
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near La Ronge Sask Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfire evacuation orde..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got send photo Ruby Alaska smoke wildfire pour...


In [27]:
# tts on processed data

X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [28]:
# knn on preprocessed data
clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81       869
           1       0.77      0.63      0.70       654

    accuracy                           0.76      1523
   macro avg       0.77      0.75      0.75      1523
weighted avg       0.76      0.76      0.76      1523



In [29]:
save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-unb-prep")

F1_0_Train      0.865442
F1_1_Train      0.791545
F1_Avg_Train    0.828493
F1_0_Test       0.805391
F1_1_Test       0.696893
F1_Avg_Test     0.751142
Name: knn-tfidf-unb-prep, dtype: float64


In [30]:
# multinomial naive bayes on preprocessed text

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.91      0.84       869
           1       0.84      0.65      0.73       654

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.78      1523
weighted avg       0.80      0.80      0.79      1523



In [31]:
save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-unb-prep")

F1_0_Train      0.926736
F1_1_Train      0.889803
F1_Avg_Train    0.908269
F1_0_Test       0.836074
F1_1_Test       0.733850
F1_Avg_Test     0.784962
Name: mnb-tfidf-unb-prep, dtype: float64


In [32]:
# random forest on preprocessed text

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('Random Forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.92      0.82       869
           1       0.84      0.59      0.69       654

    accuracy                           0.77      1523
   macro avg       0.79      0.75      0.76      1523
weighted avg       0.79      0.77      0.77      1523



In [33]:
save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-unb-prep")

F1_0_Train      0.997556
F1_1_Train      0.996746
F1_Avg_Train    0.997151
F1_0_Test       0.822922
F1_1_Test       0.690712
F1_Avg_Test     0.756817
Name: rf-tfidf-unb-prep, dtype: float64


In [34]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
rf-tfidf-unb-prep,0.997556,0.996746,0.997151,0.822922,0.690712,0.756817
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617
rf-tfidf-unb,0.997412,0.996556,0.996984,0.820803,0.684783,0.752793
knn-tfidf-unb-prep,0.865442,0.791545,0.828493,0.805391,0.696893,0.751142


## Class balancing, minimal preprocessing (KNN, MNB, RF)

In [35]:
# class_weight, undersampling, oversampling, smote

In [36]:
# class-weighting option per model
# knn: none available
# mnb: no class_weight; fit_prior = False uses a uniform class prior instead
# rf: class_weight = 'balanced'
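
For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on toy labels. The 57/43 split is an assumption chosen to mirror this dataset.

```python
import numpy as np

# toy labels with roughly the 57/43 split seen in this dataset
y = np.array([0] * 57 + [1] * 43)

counts = np.bincount(y)                     # samples per class
n_samples, n_classes = len(y), len(counts)

# weights inversely proportional to class frequency
balanced_weights = n_samples / (n_classes * counts)
print(dict(enumerate(balanced_weights)))
```

With these weights, each class contributes the same total weight (`weight * count`), so the minority class is not drowned out during training.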

In [37]:
# use the class_weight param (if available) together with
# the sampling techniques below

### Undersampling and class_weight

In [38]:
from imblearn.pipeline import make_pipeline as resample_pipeline
from imblearn.under_sampling import RandomUnderSampler
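
`RandomUnderSampler` balances the classes by randomly discarding majority-class rows. The sketch below imitates that idea with plain NumPy on toy data. It illustrates the concept and is not imblearn's implementation; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2022)

# toy data: 6 majority-class rows (0) and 4 minority-class rows (1)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

minority_size = np.bincount(y).min()

# keep a random subset of each class, cut down to the minority size
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == label), size = minority_size, replace = False)
    for label in np.unique(y)
])

X_under, y_under = X[keep_idx], y[keep_idx]
print(np.bincount(y_under))
```

Inside an imblearn pipeline the sampler is applied only during `fit`, so predictions on the test split are made on untouched data. Because the samplers here are created without a `random_state`, the recorded scores can shift slightly between runs.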

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [40]:
# undersampled knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                              RandomUnderSampler(),
                              KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-under-prep")

F1_0_Train      0.851106
F1_1_Train      0.791970
F1_Avg_Train    0.821538
F1_0_Test       0.785553
F1_1_Test       0.701727
F1_Avg_Test     0.743640
Name: knn-tfidf-under-prep, dtype: float64


In [41]:
# undersampled, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-under-prep")

F1_0_Train      0.916084
F1_1_Train      0.891648
F1_Avg_Train    0.903866
F1_0_Test       0.794376
F1_1_Test       0.737864
F1_Avg_Test     0.766120
Name: mnb-tfidf-under-prep, dtype: float64


In [42]:
# undersampled, balanced random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-under-prep")

F1_0_Train      0.980363
F1_1_Train      0.974981
F1_Avg_Train    0.977672
F1_0_Test       0.804090
F1_1_Test       0.693603
F1_Avg_Test     0.748847
Name: rf-tfidf-under-prep, dtype: float64


In [43]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-under-prep,0.916084,0.891648,0.903866,0.794376,0.737864,0.76612
rf-tfidf-unb-prep,0.997556,0.996746,0.997151,0.822922,0.690712,0.756817
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617
rf-tfidf-unb,0.997412,0.996556,0.996984,0.820803,0.684783,0.752793
knn-tfidf-unb-prep,0.865442,0.791545,0.828493,0.805391,0.696893,0.751142
rf-tfidf-under-prep,0.980363,0.974981,0.977672,0.80409,0.693603,0.748847
knn-tfidf-under-prep,0.851106,0.79197,0.821538,0.785553,0.701727,0.74364


### Oversampling and class_weight

In [44]:
from imblearn.over_sampling import RandomOverSampler
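
Random oversampling goes the other way: it draws minority-class rows with replacement until the classes are the same size, so some rows appear more than once. This is a rough NumPy sketch of the idea on toy labels, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(2022)

# toy labels: 6 majority-class rows (0) and 4 minority-class rows (1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
majority_size = np.bincount(y).max()

minority_idx = np.flatnonzero(y == 1)
n_extra = majority_size - len(minority_idx)

# draw extra minority rows with replacement, so rows may repeat
extra_idx = rng.choice(minority_idx, size = n_extra, replace = True)
over_idx = np.concatenate([np.arange(len(y)), extra_idx])

y_over = y[over_idx]
print(np.bincount(y_over))
```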

In [45]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [46]:
# oversampled knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-over-prep")

F1_0_Train      0.861095
F1_1_Train      0.806907
F1_Avg_Train    0.834001
F1_0_Test       0.784225
F1_1_Test       0.698662
F1_Avg_Test     0.741444
Name: knn-tfidf-over-prep, dtype: float64


In [47]:
# oversampled, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-over-prep")

F1_0_Train      0.932305
F1_1_Train      0.908459
F1_Avg_Train    0.920382
F1_0_Test       0.797211
F1_1_Test       0.736604
F1_Avg_Test     0.766907
Name: mnb-tfidf-over-prep, dtype: float64


In [48]:
# oversampled, balanced random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        RandomOverSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-over-prep")

F1_0_Train      0.997409
F1_1_Train      0.996561
F1_Avg_Train    0.996985
F1_0_Test       0.816649
F1_1_Test       0.696864
F1_Avg_Test     0.756757
Name: rf-tfidf-over-prep, dtype: float64


In [49]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907
mnb-tfidf-under-prep,0.916084,0.891648,0.903866,0.794376,0.737864,0.76612
rf-tfidf-unb-prep,0.997556,0.996746,0.997151,0.822922,0.690712,0.756817
rf-tfidf-over-prep,0.997409,0.996561,0.996985,0.816649,0.696864,0.756757
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617
rf-tfidf-unb,0.997412,0.996556,0.996984,0.820803,0.684783,0.752793
knn-tfidf-unb-prep,0.865442,0.791545,0.828493,0.805391,0.696893,0.751142
rf-tfidf-under-prep,0.980363,0.974981,0.977672,0.80409,0.693603,0.748847


### SMOTE and class_weight

In [50]:
from imblearn.over_sampling import SMOTE
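
Unlike random oversampling, SMOTE does not duplicate rows. It creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step. The points are toy values, the neighbor search is omitted, and this is not imblearn's code.

```python
import numpy as np

rng = np.random.default_rng(2022)

# two minority-class points in a toy 2-D feature space
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 6.0])

# synthetic point: a random position on the segment between the two points
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(lam, x_synthetic)
```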

In [51]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df.preprocessed_txt,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [52]:
# smote knn on preprocessed data
clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        KNeighborsClassifier())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "knn-tfidf-smote-prep")

F1_0_Train      0.447028
F1_1_Train      0.669204
F1_Avg_Train    0.558116
F1_0_Test       0.421144
F1_1_Test       0.646934
F1_Avg_Test     0.534039
Name: knn-tfidf-smote-prep, dtype: float64


In [53]:
# smote, balanced mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        MultinomialNB(fit_prior = False))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-smote-prep")

F1_0_Train      0.933582
F1_1_Train      0.911115
F1_Avg_Train    0.922349
F1_0_Test       0.802768
F1_1_Test       0.739329
F1_Avg_Test     0.771049
Name: mnb-tfidf-smote-prep, dtype: float64


In [54]:
# smote, default mnb on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        MultinomialNB())

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "mnb-tfidf-smote-def-prep")

F1_0_Train      0.934534
F1_1_Train      0.911651
F1_Avg_Train    0.923093
F1_0_Test       0.799306
F1_1_Test       0.736522
F1_Avg_Test     0.767914
Name: mnb-tfidf-smote-def-prep, dtype: float64


In [55]:
# smote, balanced random forest on preprocessed text

clf = resample_pipeline(TfidfVectorizer(),
                        SMOTE(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train, y_train)

save_scores(clf, X_train, X_test, y_train, y_test, "rf-tfidf-smote-prep")

F1_0_Train      0.997554
F1_1_Train      0.996749
F1_Avg_Train    0.997152
F1_0_Test       0.824753
F1_1_Test       0.699911
F1_Avg_Test     0.762332
Name: rf-tfidf-smote-prep, dtype: float64


In [56]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


# 2. spaCy Word Embeddings

## MNB, no class balancing

In [57]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target,preprocessed_txt
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deed Reason earthquake ALLAH forgive
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near La Ronge Sask Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfire evacuation orde..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got send photo Ruby Alaska smoke wildfire pour...


In [58]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [59]:
# make spacy vectors from the raw text (takes a while!)
train_df['spacy_vector'] = train_df['text'].apply(lambda x: nlp(x).vector)
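
By default, spaCy's `Doc.vector` is the average of the token vectors, so each tweet collapses to a single fixed-length vector regardless of how many tokens it has. This is a toy NumPy sketch of that averaging. The 3-dimensional vectors are made up and far smaller than real embeddings.

```python
import numpy as np

# hypothetical 3-dimensional token vectors for a 4-token tweet
token_vectors = np.array([
    [ 0.2, -0.1,  0.5],
    [ 0.0,  0.3,  0.1],
    [ 0.4,  0.1, -0.2],
    [-0.2,  0.1,  0.0],
])

# document vector = mean over tokens, one value per embedding dimension
doc_vector = token_vectors.mean(axis = 0)
print(doc_vector)
```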

In [60]:
# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

In [61]:
# each set is a 1-D object array whose elements are 1-D vectors;
# stack them into a single 2-D array (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
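
A small sketch of what `np.stack` does here, using a toy object array that mimics the layout of the `spacy_vector` column (the sizes are arbitrary):

```python
import numpy as np

# object array holding three separate 1-D vectors
vectors = np.empty(3, dtype = object)
for i in range(3):
    vectors[i] = np.full(4, float(i))

print(vectors.shape)    # (3,)

# stacking turns the three length-4 vectors into one 2-D float array
stacked = np.stack(vectors)
print(stacked.shape)    # (3, 4)
```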

In [62]:
# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
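
One caveat with this approach: the scaler learns its minimum and maximum from the training split only, so a test value below the training minimum still maps to a negative number. The NumPy sketch below reproduces the min-max formula on toy data and shows one possible guard (clipping). The arrays are illustrative, and whether clipping is appropriate here is a judgment call.

```python
import numpy as np

train = np.array([[-1.0, 0.5],
                  [ 0.0, 1.5],
                  [ 1.0, 2.5]])
test = np.array([[-2.0, 1.0]])   # first feature is below the training minimum

# statistics come from the training set only
col_min = train.min(axis = 0)
col_range = train.max(axis = 0) - col_min

scaled_train = (train - col_min) / col_range
scaled_test = (test - col_min) / col_range

print(scaled_train.min(), scaled_train.max())
print(scaled_test)

# one possible guard: clip the test features back into [0, 1]
clipped_test = np.clip(scaled_test, 0.0, 1.0)
print(clipped_test)
```

Recent scikit-learn versions also expose a `clip` option on `MinMaxScaler` that applies the same guard during `transform`.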

In [63]:
# mnb, spacy word vectors
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

F1_0_Train      0.707195
F1_1_Train      0.628678
F1_Avg_Train    0.667937
F1_0_Test       0.704639
F1_1_Test       0.625465
F1_Avg_Test     0.665052
Name: mnb-spacyvec-unb, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907
mnb-tfidf-under-prep,0.916084,0.891648,0.903866,0.794376,0.737864,0.76612
rf-tfidf-smote-prep,0.997554,0.996749,0.997152,0.824753,0.699911,0.762332
rf-tfidf-unb-prep,0.997556,0.996746,0.997151,0.822922,0.690712,0.756817
rf-tfidf-over-prep,0.997409,0.996561,0.996985,0.816649,0.696864,0.756757
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617


## KNN, no class balancing

In [64]:
# knn

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False)

F1_0_Train      0.842459
F1_1_Train      0.776741
F1_Avg_Train    0.809600
F1_0_Test       0.754886
F1_1_Test       0.650199
F1_Avg_Test     0.702542
Name: knn-spacyvec-unb, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907
mnb-tfidf-under-prep,0.916084,0.891648,0.903866,0.794376,0.737864,0.76612
rf-tfidf-smote-prep,0.997554,0.996749,0.997152,0.824753,0.699911,0.762332
rf-tfidf-unb-prep,0.997556,0.996746,0.997151,0.822922,0.690712,0.756817
rf-tfidf-over-prep,0.997409,0.996561,0.996985,0.816649,0.696864,0.756757
knn-tfidf-unb,0.865827,0.799016,0.832421,0.808234,0.705,0.756617


## RF, no class balancing

In [65]:
# rf

clf = RandomForestClassifier()

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-unb")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.990826
F1_1_Train      0.987702
F1_Avg_Train    0.989264
F1_0_Test       0.796992
F1_1_Test       0.680743
F1_Avg_Test     0.738868
Name: rf-spacyvec-unb, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


## MNB, class balancing

In [66]:
# undersampled, balanced mnb on spaCy vectors

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# each set is a 1-D object array whose elements are 1-D vectors;
# stack them into a single 2-D array (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(RandomUnderSampler(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-under")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.678492
F1_1_Train      0.653938
F1_Avg_Train    0.666215
F1_0_Test       0.667093
F1_1_Test       0.648211
F1_Avg_Test     0.657652
Name: mnb-spacyvec-under, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


In [67]:
# oversampled, balanced mnb on spaCy vectors

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# each set is a 1-D object array whose elements are 1-D vectors;
# stack them into a single 2-D array (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(RandomOverSampler(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-over")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.676186
F1_1_Train      0.652714
F1_Avg_Train    0.664450
F1_0_Test       0.667093
F1_1_Test       0.648211
F1_Avg_Test     0.657652
Name: mnb-spacyvec-over, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


In [68]:
# smote, balanced mnb on spaCy vectors

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# each set is a 1-D object array whose elements are 1-D vectors;
# stack them into a single 2-D array (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(SMOTE(),
                        MultinomialNB(fit_prior = False))
clf.fit(scaled_train_embed, y_train)

save_scores(clf, 
            scaled_train_embed, 
            scaled_test_embed, 
            y_train, 
            y_test, 
            "mnb-spacyvec-smote")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.675560
F1_1_Train      0.652728
F1_Avg_Train    0.664144
F1_0_Test       0.666240
F1_1_Test       0.647773
F1_Avg_Test     0.657007
Name: mnb-spacyvec-smote, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


In [69]:
# smote, default mnb on spaCy vectors

# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.spacy_vector.values,
    train_df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = train_df.target)

# each set is a 1-D object array whose elements are 1-D vectors;
# stack them into a single 2-D array (n_samples, n_features),
# which is what the classifiers expect

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = resample_pipeline(SMOTE(),
                        MultinomialNB())
clf.fit(scaled_train_embed, y_train)

# distinct name so this run does not overwrite the fit_prior = False row
save_scores(clf,
            scaled_train_embed,
            scaled_test_embed,
            y_train,
            y_test,
            "mnb-spacyvec-smote-def")

scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

F1_0_Train      0.676933
F1_1_Train      0.653970
F1_Avg_Train    0.665452
F1_0_Test       0.666240
F1_1_Test       0.647773
F1_Avg_Test     0.657007
Name: mnb-spacyvec-smote-def, dtype: float64


Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


## KNN, class balancing

In [70]:
# undersampled knn

clf = resample_pipeline(RandomUnderSampler(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-under")

F1_0_Train      0.814946
F1_1_Train      0.768349
F1_Avg_Train    0.791648
F1_0_Test       0.730905
F1_1_Test       0.659226
F1_Avg_Test     0.695066
Name: knn-spacyvec-under, dtype: float64


In [71]:
# oversampled knn

clf = resample_pipeline(RandomOverSampler(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-over")

F1_0_Train      0.828449
F1_1_Train      0.783819
F1_Avg_Train    0.806134
F1_0_Test       0.717404
F1_1_Test       0.645448
F1_Avg_Test     0.681426
Name: knn-spacyvec-over, dtype: float64


In [72]:
# smoted knn

clf = resample_pipeline(SMOTE(),
                        KNeighborsClassifier(n_neighbors = 5,
                                            metric = 'euclidean'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knn-spacyvec-smote")

F1_0_Train      0.783545
F1_1_Train      0.773879
F1_Avg_Train    0.778712
F1_0_Test       0.664047
F1_1_Test       0.662278
F1_Avg_Test     0.663162
Name: knn-spacyvec-smote, dtype: float64


In [73]:
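# undersampled default knn
# (the defaults are n_neighbors = 5 with Euclidean distance, so this is the
# same model as knn-spacyvec-under; only the random resampling differs)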

clf = resample_pipeline(RandomUnderSampler(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-under")

F1_0_Train      0.809354
F1_1_Train      0.765825
F1_Avg_Train    0.787589
F1_0_Test       0.717094
F1_1_Test       0.652524
F1_Avg_Test     0.684809
Name: knndef-spacyvec-under, dtype: float64


In [74]:
# oversampled default knn

clf = resample_pipeline(RandomOverSampler(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-over")

F1_0_Train      0.826299
F1_1_Train      0.783858
F1_Avg_Train    0.805079
F1_0_Test       0.712426
F1_1_Test       0.641593
F1_Avg_Test     0.677009
Name: knndef-spacyvec-over, dtype: float64


In [75]:
# smoted default knn

clf = resample_pipeline(SMOTE(),
                        KNeighborsClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "knndef-spacyvec-smote")

F1_0_Train      0.784615
F1_1_Train      0.773737
F1_Avg_Train    0.779176
F1_0_Test       0.660040
F1_1_Test       0.666233
F1_Avg_Test     0.663136
Name: knndef-spacyvec-smote, dtype: float64
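
The `knn-*` and `knndef-*` runs above configure the same model: `KNeighborsClassifier()` defaults to `n_neighbors = 5` with the Minkowski metric and `p = 2`, which is the Euclidean distance. A small sketch on synthetic data (the dataset and variable names below are illustrative assumptions) checks that the two configurations predict identically. That suggests the score differences come from the unseeded samplers rather than from the classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data; not the tweet embeddings
X_toy, y_toy = make_classification(n_samples = 200,
                                   n_features = 10,
                                   random_state = 2022)

knn_explicit = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')
knn_default = KNeighborsClassifier()

# inspect the defaults that sklearn fills in
default_params = knn_default.get_params()
print(default_params['n_neighbors'],
      default_params['metric'],
      default_params['p'])

# fit both on the same training slice
knn_explicit.fit(X_toy[:150], y_toy[:150])
knn_default.fit(X_toy[:150], y_toy[:150])

# compare predictions on the held-out slice
same_predictions = np.array_equal(knn_explicit.predict(X_toy[150:]),
                                  knn_default.predict(X_toy[150:]))
print(same_predictions)
```

Passing a fixed seed to the sampler, for example `RandomUnderSampler(random_state = 2022)`, would make such runs repeatable.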


## RF, class balancing

In [76]:
# rf under

clf = resample_pipeline(RandomUnderSampler(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-under")

F1_0_Train      0.964186
F1_1_Train      0.954958
F1_Avg_Train    0.959572
F1_0_Test       0.775557
F1_1_Test       0.696525
F1_Avg_Test     0.736041
Name: rf-spacyvec-under, dtype: float64


In [77]:
# rf under with balancing

clf = resample_pipeline(RandomUnderSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-under")

F1_0_Train      0.962908
F1_1_Train      0.953630
F1_Avg_Train    0.958269
F1_0_Test       0.772232
F1_1_Test       0.695318
F1_Avg_Test     0.733775
Name: rfbal-spacyvec-under, dtype: float64


In [78]:
# rf over

clf = resample_pipeline(RandomOverSampler(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-over")

F1_0_Train      0.989625
F1_1_Train      0.986260
F1_Avg_Train    0.987942
F1_0_Test       0.794760
F1_1_Test       0.690280
F1_Avg_Test     0.742520
Name: rf-spacyvec-over, dtype: float64


In [79]:
# rf over with balancing

clf = resample_pipeline(RandomOverSampler(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-over")

F1_0_Train      0.990375
F1_1_Train      0.987162
F1_Avg_Train    0.988769
F1_0_Test       0.795417
F1_1_Test       0.690849
F1_Avg_Test     0.743133
Name: rfbal-spacyvec-over, dtype: float64


In [80]:
# rf smote

clf = resample_pipeline(SMOTE(),
                        RandomForestClassifier())

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rf-spacyvec-smote")

F1_0_Train      0.990820
F1_1_Train      0.987711
F1_Avg_Train    0.989266
F1_0_Test       0.796943
F1_1_Test       0.693575
F1_Avg_Test     0.745259
Name: rf-spacyvec-smote, dtype: float64


In [81]:
# rf smote with balancing

clf = resample_pipeline(SMOTE(),
                        RandomForestClassifier(class_weight = 'balanced'))

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "rfbal-spacyvec-smote")

F1_0_Train      0.990818
F1_1_Train      0.987716
F1_Avg_Train    0.989267
F1_0_Test       0.791621
F1_1_Test       0.693182
F1_Avg_Test     0.742401
Name: rfbal-spacyvec-smote, dtype: float64
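
The `rf-*` and `rfbal-*` scores above are nearly identical. One likely reason is that `class_weight = 'balanced'` weights each class by `n_samples / (n_classes * class_count)`, and after resampling the class counts are already equal, so every weight is 1. Below is a quick check with sklearn's `compute_class_weight` on made-up label arrays (the 57/43 split mirrors the class ratio noted in the outline; the arrays themselves are illustrative).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([0, 1])

# illustrative labels with the original imbalance (57% class 0)
y_imbalanced = np.array([0] * 57 + [1] * 43)
# illustrative labels after resampling to equal class counts
y_resampled = np.array([0] * 57 + [1] * 57)

w_imbalanced = compute_class_weight(class_weight = 'balanced',
                                    classes = classes,
                                    y = y_imbalanced)
w_resampled = compute_class_weight(class_weight = 'balanced',
                                   classes = classes,
                                   y = y_resampled)

print(w_imbalanced)  # the minority class gets the larger weight
print(w_resampled)   # equal counts give equal weights
```

If this reasoning holds, `class_weight` matters mainly as an alternative to resampling rather than as an addition to it.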


In [82]:
scores_df.sort_values(by = 'F1_Avg_Test', ascending = False).head()

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914
mnb-tfidf-over-prep,0.932305,0.908459,0.920382,0.797211,0.736604,0.766907


# 3. Gensim word vectors

In [83]:
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

In [84]:
import pandas as pd
df = pd.read_csv('Data/train.csv')

In [85]:
print(df.shape)
df.head()

(7613, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [86]:
# balance classes?

In [87]:
# preprocess and get gensim doc vector
import spacy

nlp = spacy.load("en_core_web_lg")

def preprocess_and_vectorize(text):
    # tokenize with spaCy, drop punctuation and stopwords, keep the lemmas
    doc = nlp(text)
    filtered_tokens = [token.lemma_ for token in doc
                       if not (token.is_punct or token.is_stop)]

    # average the word2vec vectors of the remaining lemmas
    return wv.get_mean_vector(filtered_tokens)

In [88]:
# convert text into gensim word embeddings

df['gensim_vector'] = df['text'].apply(preprocess_and_vectorize)

In [89]:
df.head()

Unnamed: 0,id,keyword,location,text,target,gensim_vector
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[0.05016107, 0.00387215, 0.047061782, 0.028958..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[0.03064329, 0.0030595234, 0.0369662, 0.020602..."
2,5,,,All residents asked to 'shelter in place' are ...,1,"[-0.0048536863, 0.011481234, 0.016771162, -0.0..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[0.060398173, -0.012511074, -0.0018801317, 0.0..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[0.021673834, 0.0012636562, -0.031610973, 0.03..."
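
Each `gensim_vector` above is the average of the word vectors of a tweet's remaining lemmas. The sketch below reproduces that idea with NumPy on a made-up three-word vocabulary (the `toy_wv` dictionary and `mean_vector` helper are hypothetical, not gensim's implementation). It assumes gensim's documented defaults for `get_mean_vector`, namely that each word vector is scaled to unit length before averaging and that out-of-vocabulary tokens are skipped. The zero-vector fallback for an empty token list is a choice made for this sketch only.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings" for illustration only
toy_wv = {
    "forest": np.array([3.0, 0.0, 0.0]),
    "fire": np.array([0.0, 4.0, 0.0]),
    "near": np.array([0.0, 0.0, 2.0]),
}

def mean_vector(tokens, vectors, dim = 3):
    # keep only tokens present in the vocabulary
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # sketch-only fallback: no usable tokens gives a zero vector
        return np.zeros(dim)
    # scale each vector to unit length, then average
    unit = [v / np.linalg.norm(v) for v in known]
    return np.mean(unit, axis = 0)

# "ronge" is out of vocabulary here and is skipped
doc_vec = mean_vector(["forest", "fire", "ronge"], toy_wv)
empty_vec = mean_vector([], toy_wv)
print(doc_vec, empty_vec)
```

Averaging discards word order, so two tweets with the same lemmas in a different order get the same vector.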


In [90]:
# tts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.gensim_vector.values,
    df.target,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.target)

In [91]:
# create 2d np arrays for X train and test sets
import numpy as np
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [92]:
# gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()

clf.fit(X_train_2d, y_train)

save_scores(clf, 
            X_train_2d, 
            X_test_2d, 
            y_train, 
            y_test, 
            "gbc-gensim")

F1_0_Train      0.891286
F1_1_Train      0.841212
F1_Avg_Train    0.866249
F1_0_Test       0.816304
F1_1_Test       0.737849
F1_Avg_Test     0.777076
Name: gbc-gensim, dtype: float64


In [93]:
scores_df.sort_values(by = "F1_Avg_Test", ascending = False).head()

Unnamed: 0,F1_0_Train,F1_1_Train,F1_Avg_Train,F1_0_Test,F1_1_Test,F1_Avg_Test
mnb-tfidf-unb-prep,0.926736,0.889803,0.908269,0.836074,0.73385,0.784962
gbc-gensim,0.891286,0.841212,0.866249,0.816304,0.737849,0.777076
mnb-tfidf-unb,0.91058,0.859256,0.884918,0.835836,0.715695,0.775766
mnb-tfidf-smote-prep,0.933582,0.911115,0.922349,0.802768,0.739329,0.771049
mnb-tfidf-smote-def-prep,0.934534,0.911651,0.923093,0.799306,0.736522,0.767914


**More gensim**

Other pretrained models available through `gensim.downloader`:
- corpora: Twitter, Wikipedia
- algorithms: GloVe, fastText

Gensim models to consider:
- glove-twitter-200
- word2vec-google-news-300 (the model loaded above)
- glove-wiki-gigaword-300

# Scores