<font size="+3">Inference Notebook</font>

Explore simple ML models, different text embeddings, hyperparameter tuning and ensemble models.

The BERT model reached an accuracy of about 0.83 without any feature engineering. It will be interesting to see whether traditional ML models can be tuned to perform as well as, or better than, BERT.

# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import f1_score, log_loss

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression


# Load Data

In [2]:
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

In [3]:
features = [col for col in train_df.columns if col not in ['target','id']]

In [4]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train_df[features], train_df['target'], test_size=0.1,
                                                 shuffle=True, random_state = 42)

In [5]:
xtrain.sample(10)

Unnamed: 0,keyword,location,text
2992,dust%20storm,"Beirut, Lebanon",Some poor sods arriving in Amman during yester...
4463,hostages,,Natalie Stavola our co-star explains her role ...
4352,hijack,NIGERIA,Bayelsa poll: Tension in Bayelsa as Patience J...
675,blaze,Gotham City,?? Yes I do have 2 guns ?? ??
1486,catastrophe,Lytham St Anne's,Oh my god thatÛªs the biggest #gbbo catastrop...
1674,collide,,i just remember us driving and singing collide...
4933,mayhem,"PG County, MD",Tonight It's Going To Be Mayhem @ #4PlayThursd...
1470,catastrophe,,@peterjukes But there are good grounds to beli...
2238,deluge,,#MeditationByMSG 45600 ppl got method of medit...
6172,sirens,they/them,my dad said I look thinner than usual but real...


# Metric

As per the competition rules, we will optimise for the F1 score. We will also use log loss to evaluate each algorithm's performance.
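As a quick illustration of how the two metrics differ, the sketch below scores a small made-up set of labels and probabilities (the values are invented for illustration and are not taken from the competition data). F1 only looks at the hard class predictions, whereas log loss also penalises confident but wrong probabilities.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

# Toy labels and predicted probabilities of the positive class (illustrative values only)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])

# F1 needs hard class predictions, here obtained with a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

toy_f1 = f1_score(y_true, y_pred)
toy_log_loss = log_loss(y_true, y_prob)

print(f"F1 score: {toy_f1:0.4f}")
print(f"Log loss: {toy_log_loss:0.4f}")
```

Two models with the same F1 score can have quite different log loss values, which is why it is useful to track both.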

# Text Preprocessing

As we will evaluate different levels of text preprocessing, the processing steps will be kept in a separate file so they can easily be extended as we evaluate each model. This will enable us to add different levels of preprocessing to the pipelines.
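The contents of `text_preprocessing` are not shown in this notebook. Purely as an illustration of the kind of step such a module might hold, here is a minimal sketch of a hypothetical `clean_text` helper (the function name, the regular expressions and the example string are all assumptions, not the actual module).

```python
import re

# Hypothetical patterns; the real module may use different rules
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Hypothetical helper: lowercase, drop URLs and @mentions, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up example string, for illustration only
cleaned = clean_text("@user Storm warning  http://example.com/alert #weather")
print(cleaned)
```

A function like this could be passed to a vectorizer through its `preprocessor` argument, keeping in mind that doing so replaces the vectorizer's built-in lowercasing and accent stripping.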

In [6]:
%load_ext autoreload
%autoreload 2

from text_preprocessing import *

# Model Evaluation

To begin with we will just evaluate the `text` column and create features from that.

**Create TF-IDF Embeddings**

TF-IDF is normally a good place to start with NLP problems. We will first try it without any custom preprocessing of the data.

## Create features from 'text'

There are a number of different methods for generating numerical features from text. We can try the following:
- TF-IDF
- CountVectorizer

We will create features using each method and evaluate which works better on this dataset.
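To make the difference between the two methods concrete, here is a minimal sketch on a tiny made-up corpus (the sentences are invented for illustration). CountVectorizer stores raw term counts, while TF-IDF down-weights terms that appear in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, for illustration only
corpus = [
    "fire in the forest",
    "forest fire spreading fast",
    "my mixtape is fire",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

count_matrix = count_vec.fit_transform(corpus)
tfidf_matrix = tfidf_vec.fit_transform(corpus)

# Compare a common word ("fire") with a rare one ("mixtape") in the third document
doc = 2
count_fire = count_matrix[doc, count_vec.vocabulary_["fire"]]
count_mixtape = count_matrix[doc, count_vec.vocabulary_["mixtape"]]
tfidf_fire = tfidf_matrix[doc, tfidf_vec.vocabulary_["fire"]]
tfidf_mixtape = tfidf_matrix[doc, tfidf_vec.vocabulary_["mixtape"]]

print(f"Counts -> fire: {count_fire}, mixtape: {count_mixtape}")
print(f"TF-IDF -> fire: {tfidf_fire:0.3f}, mixtape: {tfidf_mixtape:0.3f}")
```

Both words occur once in the third document, so their counts are equal, but "fire" appears in every document and therefore receives a lower TF-IDF weight than "mixtape".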

In [7]:
# TF-IDF embeddings
tfv = TfidfVectorizer(min_df=3, max_features=None,
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words='english')

# Fit TF-IDF on the training and validation text together (no labels are used).
# Note: the vocabulary and IDF weights therefore also see the validation text.
tfv.fit(list(xtrain['text']) + list(xvalid['text']))
xtrain_tfv = tfv.transform(xtrain['text'])
xvalid_tfv = tfv.transform(xvalid['text'])

In [8]:
# Count Vectorizer
ctv = CountVectorizer(stop_words='english')

# Fit the Count Vectorizer on the training and validation text together (no labels are used)
ctv.fit(list(xtrain['text']) + list(xvalid['text']))
xtrain_ctv = ctv.transform(xtrain['text'])
xvalid_ctv = ctv.transform(xvalid['text'])

## Logistic Regression

The first model to try is a simple logistic regression. We will use sklearn pipelines and gridsearch to find the best possible logistic regression model.
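One alternative to vectorising the text up front is to put the vectorizer inside the pipeline, so that each cross-validation fold fits it only on that fold's training portion and the vectorizer settings can be tuned alongside the model. The sketch below shows the idea on a tiny made-up corpus; the texts, labels, step names and grid values are assumptions for illustration and are not the setup used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus and labels, for illustration only
texts = [
    "forest fire spreading near the town",
    "flood waters rising downtown",
    "earthquake damages several buildings",
    "storm leaves thousands without power",
    "this new album is fire",
    "my phone is flooded with messages",
    "that concert was an absolute blast",
    "traffic was a disaster this morning lol",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # liblinear is used here because it supports both the l1 and l2 penalties
    ("lr", LogisticRegression(solver="liblinear")),
])

# Step names prefix the parameter names, e.g. "lr__C" sets C on the "lr" step
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 1.0, 10],
    "lr__penalty": ["l1", "l2"],
}

search = GridSearchCV(text_pipeline, param_grid=param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

With so few documents the scores themselves are meaningless (scikit-learn may also warn about ill-defined F1 scores on such small folds); the point is only the wiring of the pipeline and the parameter grid.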

In [9]:
#create dictionary to store results of each model
model_results = {}

In [10]:
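#standard logistic regression
lr_pipeline = Pipeline([('lr',LogisticRegression())])

#gridsearch parameters
#note: the default 'lbfgs' solver does not support the 'l1' penalty, so the 'l1'
#candidates fail to fit and are scored as NaN; pass solver='liblinear' to
#LogisticRegression to actually evaluate them
lr_param_grid = {'lr__C': [0.1, 1.0, 10],
                'lr__penalty': ['l1', 'l2']}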


#pipeline with svd and scaling
lr_pipeline_svd = Pipeline([('svd',TruncatedSVD()),
                          ('scl',StandardScaler()),
                          ('lr',LogisticRegression())])

lr_svd_param_grid = {'svd__n_components' : [120, 180],
                 'lr__C': [0.1, 1.0, 10], 
                 'lr__penalty': ['l1', 'l2']}

In [11]:
lr_gs_tfv = GridSearchCV(lr_pipeline, param_grid=lr_param_grid, scoring='f1',
                     verbose=5, n_jobs=-1, refit=True, cv=4)

lr_gs_ctv = GridSearchCV(lr_pipeline, param_grid=lr_param_grid, scoring='f1',
                     verbose=5, n_jobs=-1, refit=True, cv=4)

lr_svd_gs_tfv = GridSearchCV(lr_pipeline_svd, param_grid=lr_svd_param_grid, scoring='f1',
                     verbose=5, n_jobs=-1, refit=True, cv=4)

lr_svd_gs_ctv = GridSearchCV(lr_pipeline_svd, param_grid=lr_svd_param_grid, scoring='f1',
                     verbose=5, n_jobs=-1, refit=True, cv=4)


In [20]:
lr_gs_tfv.fit(xtrain_tfv, ytrain)
lr_gs_ctv.fit(xtrain_ctv, ytrain)
lr_svd_gs_tfv.fit(xtrain_tfv, ytrain)
lr_svd_gs_ctv.fit(xtrain_ctv, ytrain)

print("Hyper-parameter tuning done")

Fitting 4 folds for each of 6 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  24 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  22 out of  24 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 4 folds for each of 6 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Done  22 out of  24 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    1.0s finished


Fitting 4 folds for each of 12 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   19.2s finished


Fitting 4 folds for each of 12 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   33.1s finished


Hyper-parameter tuning done


In [33]:
models = [lr_gs_tfv, lr_gs_ctv, lr_svd_gs_tfv, lr_svd_gs_ctv]
names = ['lr-tfv','lr-ctv','lr-svd-tfv','lr-svd-ctv']

val_scores = {}
for model, name in zip(models, names):
    print(f"{name}: \n\tBest GridSearch score: {model.best_score_: 0.4f}")
    #use the validation features that match the vectorizer the model was trained on
    xvalid_features = xvalid_tfv if 'tfv' in name else xvalid_ctv
    predictions = model.predict(xvalid_features)
    val_score = f1_score(yvalid, predictions)
    print(f"\tScore on validation set: {val_score: 0.4f}")
    val_scores[name] = val_score

lr-tfv: 
	Best GridSearch score:  0.7357
	Score on validation set:  0.7231
lr-ctv: 
	Best GridSearch score:  0.7423
	Score on validation set:  0.7346
lr-svd-tfv: 
	Best GridSearch score:  0.7016
	Score on validation set:  0.6831
lr-svd-ctv: 
	Best GridSearch score:  0.7102
	Score on validation set:  0.6975


The model with the best performance on the held-out validation set is the Logistic Regression with CountVectorizer features (F1 score of 0.7346).
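To see which hyperparameter combinations sit behind a best score, the `cv_results_` attribute of a fitted grid search can be turned into a table. The following is a minimal sketch on synthetic data; `make_classification` stands in for the vectorised tweets and the grid values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both l1 and l2
    param_grid={"C": [0.1, 1.0, 10], "penalty": ["l1", "l2"]},
    scoring="f1",
    cv=4,
)
search.fit(X, y)

# One row per candidate, sorted so the best-ranked combination comes first
columns = ["param_C", "param_penalty", "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(summary)
```

Candidates that failed to fit show up with a NaN `mean_test_score`, which makes this table a quick way to check that every combination in the grid was actually evaluated.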

In [34]:
#save best model to dictionary for comparison with other models later
model_results['lr-ctv'] = {'model': lr_gs_ctv, 'score': val_scores['lr-ctv']}

In [35]:
model_results

{'lr-ctv': {'model': GridSearchCV(cv=4, error_score=nan,
               estimator=Pipeline(memory=None,
                                  steps=[('lr',
                                          LogisticRegression(C=1.0,
                                                             class_weight=None,
                                                             dual=False,
                                                             fit_intercept=True,
                                                             intercept_scaling=1,
                                                             l1_ratio=None,
                                                             max_iter=100,
                                                             multi_class='auto',
                                                             n_jobs=None,
                                                             penalty='l2',
                                                             random_state=None,
   

In [None]:
#log loss of the tuned TF-IDF model on the validation set (`lr_gs` was never defined)
log_loss(yvalid, lr_gs_tfv.predict_proba(xvalid_tfv))

In [None]:
# Fitting a simple Logistic Regression on TFIDF
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions_prob = clf.predict_proba(xvalid_tfv)
predictions_classes = clf.predict(xvalid_tfv)

print("Log loss: %0.3f " % log_loss(yvalid, predictions_prob))
print("F1 score: %0.3f " % f1_score(yvalid, predictions_classes))
print(classification_report(yvalid, predictions_classes))

sns.heatmap(confusion_matrix(yvalid,predictions_classes),annot=True,fmt='.0f',cmap='Blues', cbar=False)
plt.title("Confusion Matrix")
plt.show()

## Naive Bayes

# Resources

Many ideas for this notebook came from this [notebook](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) by Kaggle Grandmaster Abhishek Thakur. I would recommend checking out more of his [content](https://www.youtube.com/user/abhisheksvnit); he has great tutorials, particularly on using BERT with PyTorch.