This notebook looks at whether an ensemble of traditional NLP and neural NLP produces good results. This may well be overkill, but it is interesting to explore nonetheless.

In summary, traditional NLP with hand-crafted features and a random forest produced an F1 score of 0.88, whereas the fine-tuned pre-trained BERT model produced an accuracy of around 94%. The intuition behind combining the two models is that hand-crafted features mostly capture the frequencies of certain words or types of words, while the pre-trained BERT model has a better understanding of the language as a whole, which could help identify different writing styles.
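To make the idea of combining the two models concrete, here is a minimal sketch of soft voting: average the class probabilities of both models and take the most likely class. This is only an illustration. The `NLPEnsemble` class used later is not shown in this notebook, so the function name `soft_vote`, the `weight_a` parameter and the toy numbers are all assumptions, not the actual implementation.

```python
import numpy as np

def soft_vote(proba_a, proba_b, weight_a=0.5):
    """Weighted average of two (n_samples, n_classes) probability arrays.

    Hypothetical helper: both arrays must list the classes in the same
    column order, otherwise the average mixes up different classes.
    """
    proba_a = np.asarray(proba_a, dtype=float)
    proba_b = np.asarray(proba_b, dtype=float)
    if proba_a.shape != proba_b.shape:
        raise ValueError("both models must score the same samples and classes")
    combined = weight_a * proba_a + (1.0 - weight_a) * proba_b
    # Predicted class index per sample, plus the averaged probabilities.
    return combined.argmax(axis=1), combined

# Toy probabilities for two samples and two classes (made-up numbers).
p_rf = np.array([[0.6, 0.4], [0.3, 0.7]])
p_bert = np.array([[0.1, 0.9], [0.2, 0.8]])
pred_idx, combined = soft_vote(p_rf, p_bert)
print(pred_idx)
```

In the first toy sample the two models disagree, and the more confident one wins the vote, which is the behaviour an ensemble like this relies on.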

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pickle
from prep_bert import *
from ensemble import *
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
import torch  # used below; imported explicitly rather than relying on the star imports

In [2]:
with open('tokenised-data.pickle', 'rb') as handle:
    df = pickle.load(handle)

In [3]:
df.head()

Unnamed: 0,url,class_label,class_id,text,ents_rep,vocab,ppo_rep,no_ents_text,verb_present,verb_past,...,ad,prob,lem_text,len,org_count,place_count,time_count,person_count,num_count,ne_count
0,https://www.nytimes.com/2020/08/17/world/afric...,Incident Report,0,advertisement supported by the extremist group...,1.52,0.570312,1.71875,advertisement support extremist group ORG esca...,0.015625,0.0703125,...,0.119792,0.0,"[advertisement, support, extremist, group, ORG...",384,0.059896,0.057292,0.039062,0.026042,0.015625,0.143229
1,https://www.bbc.com/news/world-europe-54500555,Incident Report,0,belarusian riot police have used water cannon ...,1.41333,0.57716,1.54902,NORP riot police water cannon stun grenade bre...,0.037037,0.0432099,...,0.125,0.00154321,"[NORP, riot, police, water, cannon, stun, gren...",648,0.030864,0.026235,0.024691,0.064815,0.013889,0.121914
2,https://www.aljazeera.com/news/2020/10/12/nago...,Incident Report,0,russian foreign minister calls on armenia and ...,2.2518,0.372881,2.62887,NORP foreign minister call GPE GPE adhere agre...,0.0209847,0.0524617,...,0.0992736,0.00161421,"[NORP, foreign, minister, call, GPE, GPE, adhe...",1239,0.066182,0.09201,0.033091,0.047619,0.012914,0.205811
3,https://counteriedreport.com/roadside-bomb-bla...,Incident Report,0,alshabaab ended 2019 with a truckborne improvi...,1.25,0.75,1.5,GPE end DATE truckborne improvise explosive de...,0.0277778,0.037037,...,0.212963,0.0,"[GPE, end, DATE, truckborne, improvise, explos...",108,0.027778,0.027778,0.018519,0.0,0.018519,0.055556
4,https://counteriedreport.com/al-shabaabs-impro...,Incident Report,0,alshabaab ended 2019 with a truckborne improvi...,1.25,0.75,1.5,GPE end DATE truckborne improvise explosive de...,0.0277778,0.037037,...,0.212963,0.0,"[GPE, end, DATE, truckborne, improvise, explos...",108,0.027778,0.027778,0.018519,0.0,0.018519,0.055556


## I. Individual model performances

In [4]:
X = df[['text','len', 'org_count', 'place_count', 'time_count', 'person_count', 'num_count', 'ents_rep', 'vocab', 'ppo_rep', 'verb_present', 'verb_past', 'verb', 'modal', 'ad', 'prob']]
y = df[['class_label', 'class_id']]
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=2020)

In [5]:
X_rf = X_test[['len', 'org_count', 'place_count', 'time_count', 'person_count', 'num_count', 'ents_rep', 'vocab', 'ppo_rep', 'verb_present', 'verb_past', 'verb', 'modal', 'ad', 'prob']]
y_rf = y_test['class_label']


texts = X_test['text'].to_list()
labels = y_test['class_id'].astype(int)

dataset = BertEncoder(
    tokenizer=BertTokenizer.from_pretrained(
        'bert-base-cased',
        do_lower_case=False),
    input_data=texts
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.Tensor(labels.to_list()).long()

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

In [6]:
with open('./model/rf_model.sav', 'rb') as handle:
    rf = pickle.load(handle)

In [7]:
# Note: the scores here differ from those in the traditional NLP notebook.
# I accidentally re-ran the random search CV in that notebook,
# so the model there is not the same as the one saved here.
y_pred = rf.predict(X_rf)
print(classification_report(y_rf, y_pred))
print(confusion_matrix(y_rf, y_pred))

                   precision    recall  f1-score   support

Analytical Report       0.88      0.93      0.90        15
  Incident Report       0.94      0.83      0.88        18
   Profile Report       0.78      0.90      0.84        20
 Situation Report       0.94      0.85      0.89        20

         accuracy                           0.88        73
        macro avg       0.88      0.88      0.88        73
     weighted avg       0.88      0.88      0.88        73

[[14  0  0  1]
 [ 0 15  3  0]
 [ 1  1 18  0]
 [ 1  0  2 17]]


In [15]:
bert_state_dict = torch.load('./model/bfs_trained_model.pt')
bert = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    state_dict=bert_state_dict,
    num_labels = 4,
    output_attentions=False,
    output_hidden_states=False
)

In [16]:
y_b_pred, _ = bert_predict(bert, dataloader)

In [17]:
cls_names = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']
print(classification_report(labels, y_b_pred, target_names=cls_names))
print(confusion_matrix(labels, y_b_pred))

                   precision    recall  f1-score   support

  Incident Report       0.77      0.94      0.85        18
 Situation Report       1.00      0.95      0.97        20
   Profile report       1.00      0.85      0.92        20
Analytical report       0.93      0.93      0.93        15

         accuracy                           0.92        73
        macro avg       0.93      0.92      0.92        73
     weighted avg       0.93      0.92      0.92        73

[[17  0  0  1]
 [ 1 19  0  0]
 [ 3  0 17  0]
 [ 1  0  0 14]]


## II. Ensemble
Considering that the random forest took little time to train, its result is really good. However, the purpose of this notebook is to see whether an ensemble of both models produces a better result.

First, let's check whether both models predict the same labels.

In [18]:
y_b_pred = [int(x) for x in y_b_pred]

In [19]:
print('Random forest misclassified ground truth:', y_rf[y_rf!=y_pred])
print('Random forest misclassified predicted:', y_pred[y_rf!=y_pred])
print()
y_nn = labels.numpy()
print('BERT model misclassified ground truth', y_test['class_label'][y_nn!=y_b_pred])
print('BERT model misclassified predicted', np.array(y_b_pred)[y_nn!=y_b_pred])

Random forest misclassified ground truth: 96        Profile Report
81      Situation Report
86       Incident Report
48     Analytical Report
148       Profile Report
83       Incident Report
323     Situation Report
281     Situation Report
259      Incident Report
Name: class_label, dtype: object
Random forest misclassified predicted: ['Analytical Report' 'Analytical Report' 'Profile Report'
 'Situation Report' 'Incident Report' 'Profile Report' 'Profile Report'
 'Profile Report' 'Profile Report']

BERT model misclassified ground truth 140       Profile Report
96        Profile Report
156       Profile Report
48     Analytical Report
273     Situation Report
259      Incident Report
Name: class_label, dtype: object
BERT model misclassified predicted [0 0 0 0 0 3]


There is some overlap (96, 48, 259), but each model also misclassifies samples that the other gets right. Even for the overlapping misclassifications, the two models predict different labels. This gives some intuition for an ensemble.
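The overlap can be checked with plain set operations on the row indices printed above. This is a small sketch that copies the indices by hand from the output; the variable names are made up for illustration.

```python
# Row indices of misclassified test samples, copied from the output above.
rf_errors = {96, 81, 86, 48, 148, 83, 323, 281, 259}
bert_errors = {140, 96, 156, 48, 273, 259}

both_wrong = rf_errors & bert_errors      # samples neither model gets right
only_rf_wrong = rf_errors - bert_errors   # BERT could correct these
only_bert_wrong = bert_errors - rf_errors # the random forest could correct these

print(sorted(both_wrong))
print(len(only_rf_wrong), len(only_bert_wrong))
```

With 73 test samples and only three that both models get wrong, at least one of the two models is correct on 70 of them, so a well-behaved ensemble has room to improve on either model alone.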

In [22]:
ensemble = NLPEnsemble(
    traditional_nlp=rf,
    nn_nlp=bert
)

pred = ensemble.predict(X_rf, dataloader)

[[0.05262963 0.32701272 0.59212214 0.02823551]
 [0.02722499 0.44456294 0.43557154 0.09264052]
 [0.04503624 0.16159037 0.73389024 0.05948315]
 [0.031464   0.01341887 0.91054833 0.0445688 ]
 [0.00994902 0.08118055 0.87134197 0.03752845]
 [0.03349206 0.68068452 0.21540675 0.07041667]
 [0.04683761 0.03008547 0.04299145 0.88008547]
 [0.08333333 0.67933333 0.11066667 0.12666667]
 [0.34037334 0.04787904 0.44442827 0.16731934]
 [0.05168039 0.22945182 0.63154596 0.08732183]
 [0.01111112 0.05434883 0.92635551 0.00818454]
 [0.47369915 0.0765931  0.26912649 0.18058126]
 [0.83333333 0.03666667 0.06       0.07      ]
 [0.5        0.21380952 0.115      0.17119048]
 [0.00666667 0.85238889 0.10594444 0.035     ]
 [0.04058761 0.01384567 0.14968851 0.79587821]
 [0.55899034 0.15534515 0.17420045 0.11146406]
 [0.01094623 0.13766677 0.80936188 0.04202512]
 [0.40438645 0.08882239 0.26493469 0.24185647]
 [0.01344444 0.56214291 0.36917808 0.05523457]
 [0.01666667 0.09577872 0.02020801 0.8673466 ]
 [0.5154058  

In [23]:
print(classification_report(labels, pred, target_names=cls_names))
print(confusion_matrix(labels, pred))

                   precision    recall  f1-score   support

  Incident Report       0.71      0.67      0.69        18
 Situation Report       0.65      0.55      0.59        20
   Profile report       1.00      0.90      0.95        20
Analytical report       0.57      0.80      0.67        15

         accuracy                           0.73        73
        macro avg       0.73      0.73      0.72        73
     weighted avg       0.74      0.73      0.73        73

[[12  5  0  1]
 [ 1 11  0  8]
 [ 2  0 18  0]
 [ 2  1  0 12]]


Oh dear! That's not what we had hoped for. There is probably a mismatch somewhere, but I am not sure where or how!
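One possible culprit, offered as a guess and not verified against `ensemble.py`, is the column order of the two probability arrays. The random forest here predicts string labels, and scikit-learn stores `classes_` in sorted order, so its probability columns would be alphabetical (Analytical, Incident, Profile, Situation). BERT follows `class_id`, i.e. the order in `cls_names` (Incident, Situation, Profile, Analytical). Only Profile sits at the same position in both orders, and it is also the only class whose ensemble score held up. The sketch below shows how columns could be reordered before averaging; `align_columns` and the toy probabilities are hypothetical.

```python
import numpy as np

# Assumed column order of the random forest probabilities (sorted class labels).
rf_classes = ['Analytical Report', 'Incident Report', 'Profile Report', 'Situation Report']
# Column order of the BERT outputs, i.e. class_id 0..3 as listed in cls_names.
bert_classes = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']

def align_columns(proba, src_classes, dst_classes):
    """Reorder the columns of `proba` from src_classes order to dst_classes order.

    Names are compared case-insensitively because the two lists differ in case.
    """
    src = [c.lower() for c in src_classes]
    order = [src.index(c.lower()) for c in dst_classes]
    return np.asarray(proba)[:, order]

# Toy random forest output: one sample that is confidently a Situation Report,
# which is column 3 in the random forest's order.
rf_proba = np.array([[0.05, 0.05, 0.10, 0.80]])
aligned = align_columns(rf_proba, rf_classes, bert_classes)
print(aligned.argmax(axis=1))  # index of the top class in BERT's order
```

Without the reordering, the 0.80 in column 3 would be read as Analytical Report in BERT's order. That matches the 8 Situation Reports predicted as Analytical in the confusion matrix above. Comparing `rf.classes_` with `cls_names` would confirm or rule this out.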