# Tweet Binary Classification
## Does a tweet indicate a real disaster?

#### Methodology
* create a vector representation of each word (using pre-trained [gensim word2vec](https://radimrehurek.com/gensim/models/word2vec.html))
    * We will use the embeddings of a pre-trained model because our corpus is small.
* average the word vectors for each tweet into a single "sentence vector"
* train a binary classification model on the sentence vectors
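
The averaging step above can be sketched with a toy embedding table. The three-dimensional vectors, the `toy_vectors` dictionary, and the `average_vector` helper are invented for illustration and stand in for the pre-trained model; in this sketch, out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (assumption); a real model has more dimensions.
toy_vectors = {
    "fire": np.array([1.0, 0.0, 2.0]),
    "near": np.array([0.0, 1.0, 0.0]),
    "town": np.array([2.0, 2.0, 1.0]),
}


def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "unknownword" is not in the table, so only three vectors are averaged.
sentence_vec = average_vector(["fire", "near", "town", "unknownword"], toy_vectors)
print(sentence_vec)  # [1. 1. 1.]
```

Dividing by the number of known tokens, rather than by the tweet length, keeps rare or misspelled words from shrinking the average toward zero.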


#### Reading
* [This blogpost](https://medium.com/@dilip.voleti/classification-using-word2vec-b1d79d375381) discusses how to train a binary classification model using pretrained word vectors (from word2vec)
* [This blogpost](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) provides a gentle introduction to the doc2vec model

In [1]:
import pandas as pd
import numpy as np
from gensim import downloader
from gensim.utils import simple_preprocess

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Helper functions

In [2]:
def make_tweet_vector(tweet_tokens, model):
    """Average the word vectors of a tweet's tokens into a (1, vec_len) array."""
    # create a placeholder vector for aggregating our sentence
    vec_len = model.vector_size
    tweet_vec = np.zeros((1, vec_len))

    n = 0  # number of in-vocabulary words used
    for word in tweet_tokens:
        try:
            word_vec = model.get_vector(word)
        except KeyError:
            # skip words that aren't in our vocab (same as adding zeros)
            continue
        tweet_vec += word_vec
        n += 1

    # make our vector sentence-length invariant
    # by converting our sum into an average;
    # if no word was found, keep the zero vector instead of dividing by zero
    if n > 0:
        tweet_vec /= n

    return tweet_vec

Read and pre-process our data

In [3]:
# Pre-trained word vectors (25-dimensional GloVe trained on Twitter data), loaded via gensim
model_gt25 = downloader.load("glove-twitter-25")

# data
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")


# ---- Pre-process our documents ---- 
# tokenize tweets
train_df["tweet_tokens"] = train_df["text"].apply(simple_preprocess)
test_df["tweet_tokens"] = test_df["text"].apply(simple_preprocess)


# create document vectors
train_df["tweet_vector"] = train_df["tweet_tokens"].apply(lambda tt: make_tweet_vector(tt, model=model_gt25))
test_df["tweet_vector"] = test_df["tweet_tokens"].apply(lambda tt: make_tweet_vector(tt, model=model_gt25))


# create numpy arrays for model training and testing
# For now, let's just use text. We can add keywords later
X_train = np.concatenate(train_df["tweet_vector"].values)
X_test = np.concatenate(test_df["tweet_vector"].values)

y_train = train_df["target"].to_numpy()

# make sure that our data dimensions make sense
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (7613, 25)
y_train shape: (7613,)
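
Besides the shapes, it can be worth checking the feature matrix for non-finite values, which can arise whenever an average is taken over zero tokens. The sketch below uses a small made-up matrix (`X_demo` is hypothetical, not the notebook's data), and the zero-filling fallback is only one possible choice.

```python
import numpy as np

# Hypothetical feature matrix (assumption): the second row mimics a tweet
# whose vector sum was divided by a token count of zero.
X_demo = np.array([
    [0.1, 0.2, 0.3],
    [np.nan, np.nan, np.nan],
    [0.4, 0.5, 0.6],
])

# Flag every row that contains at least one NaN.
bad_rows = np.isnan(X_demo).any(axis=1)
print("rows with NaN:", np.flatnonzero(bad_rows))

# One possible fallback (an assumption, not the notebook's choice): use zeros.
X_clean = np.where(np.isnan(X_demo), 0.0, X_demo)
print("any NaN left:", bool(np.isnan(X_clean).any()))
```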


#### Use `GridSearchCV` to find the best hyperparameters

In [4]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 15],
    'max_features': [5, 10],
    'min_samples_leaf': [2, 5, 15, 20],
    'n_estimators': [500]}

n_jobs = -2

# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           scoring="f1", cv=3, n_jobs=n_jobs, verbose=100)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-2,
             param_grid={'bootstrap': [True], 'max_depth': [5, 10, 15],
                         'max_features': [5, 10],
                         'min_samples_leaf': [2, 5, 15, 20],
                         'n_estimators': [500]},
             scoring='f1', verbose=100)
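
Once a search has been fitted, its results can be read from the `best_params_`, `best_score_`, and `best_estimator_` attributes. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and the reduced `small_grid` are assumptions chosen so that it runs quickly, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 samples, 10 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A deliberately small grid so the example finishes quickly.
small_grid = {"max_depth": [3, 5], "min_samples_leaf": [2, 5]}
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=small_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best mean CV F1:", search.best_score_)
# With the default refit=True, best_estimator_ has already been refit
# on all of the data passed to fit().
print(type(search.best_estimator_).__name__)
```

The mean cross-validated F1 in `best_score_` is usually a more honest estimate of generalization than any score computed on the training data.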

#### Generate Test Set Predictions

Retrain `RandomForestClassifier` using the best hyperparameters but with the _full_ training data

In [5]:
rf_full_input = RandomForestClassifier(**grid_search.best_params_)
rf_full_input.fit(X_train, y_train)
training_score = rf_full_input.score(X_train, y_train)
print("rf_full_input training score: {}".format(training_score))

# get predictions
y_pred = rf_full_input.predict(X_test)

rf_full_input training score: 0.9235518192565348
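
Note that `score` on a classifier reports mean accuracy, and here it is computed on the same data the model was fitted on, so it tends to be optimistic. A hedged way to compare it with a held-out estimate is sketched below on synthetic data; `X_demo`, `y_demo`, and the forest settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 samples, 10 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# F1 on held-out folds: each fold is scored by a model that never saw it.
cv_f1 = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="f1")

# Accuracy on the training data itself, as in the cell above.
clf.fit(X_demo, y_demo)
train_acc = clf.score(X_demo, y_demo)

print("training accuracy:", train_acc)
print("mean cross-validated F1:", cv_f1.mean())
```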


#### Save output

In [6]:
# Format data to required output
preds_df = pd.DataFrame()
preds_df["id"] = test_df["id"]
preds_df["target"] = y_pred

# Save as csv
submission_path = "/kaggle/working/submission.csv"
preds_df.to_csv(submission_path,
                header=True, 
                index=False)
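
Before uploading, the saved file can be read back to confirm its columns and row count. The sketch below round-trips a tiny made-up frame through a temporary directory; `demo_df` and its values are hypothetical, and the expected column names are taken from the cell above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical ids and predictions (assumption), mirroring the id/target layout.
demo_df = pd.DataFrame({"id": [0, 2, 3], "target": np.array([1, 0, 1])})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    demo_df.to_csv(path, header=True, index=False)
    # Read the file back while the temporary directory still exists.
    round_trip = pd.read_csv(path)

print(list(round_trip.columns))
print(len(round_trip) == len(demo_df))
```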