This notebook looks at whether an ensemble of traditional NLP and neural NLP models produces good results. This may well be overkill, but it is interesting to explore nonetheless.

In summary, traditional NLP with hand-crafted features and a random forest produced an F1 score of 0.88, whereas the fine-tuned pre-trained BERT model produced an accuracy of around 94%. The intuition behind combining the two models is that the hand-crafted features mostly capture the frequencies of certain words or types of words, while the pre-trained BERT model has a better understanding of the language as a whole, which could help identify different writing styles.
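
One simple way to combine two classifiers is soft voting: blend the class probabilities from each model and pick the class with the highest blended score. The sketch below illustrates that idea only; it is not necessarily what `NLPEnsemble` does. The function `soft_vote`, the parameter `weight_a`, and the probability values are hypothetical.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two models' class probabilities and return the top class per row."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same rows")
    weight_b = 1.0 - weight_a
    predictions = []
    for row_a, row_b in zip(probs_a, probs_b):
        # Weighted average of the two probability vectors for this document
        blended = [weight_a * pa + weight_b * pb for pa, pb in zip(row_a, row_b)]
        predictions.append(blended.index(max(blended)))
    return predictions


# Hypothetical probabilities for two documents over four classes
rf_probs = [[0.40, 0.35, 0.15, 0.10], [0.10, 0.20, 0.60, 0.10]]
bert_probs = [[0.20, 0.70, 0.05, 0.05], [0.05, 0.15, 0.70, 0.10]]

ensemble_pred = soft_vote(rf_probs, bert_probs)
print(ensemble_pred)
```

In this made-up example the first model alone would pick class 0 for the first document, but the second model's confidence in class 1 outweighs it once the probabilities are averaged. Setting `weight_a` closer to 1 trusts the first model more.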

In [20]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
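import pickle
import torch  # used below for tensors and model loading; imported explicitly rather than via the star import
from prep_bert import *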
from ensemble import NLPEnsemble
from transformers import BertTokenizer, BertForSequenceClassification

In [2]:
with open('tokenised-data.pickle', 'rb') as handle:
    df = pickle.load(handle)

In [3]:
df.head()

Unnamed: 0,url,class_label,class_id,text,ents_rep,vocab,ppo_rep,no_ents_text,verb_present,verb_past,...,ad,prob,lem_text,len,org_count,place_count,time_count,person_count,num_count,ne_count
0,https://www.nytimes.com/2020/08/17/world/afric...,Incident Report,0,advertisement supported by the extremist group...,1.52,0.570312,1.71875,advertisement support extremist group ORG esca...,0.015625,0.0703125,...,0.119792,0.0,"[advertisement, support, extremist, group, ORG...",384,0.059896,0.057292,0.039062,0.026042,0.015625,0.143229
1,https://www.bbc.com/news/world-europe-54500555,Incident Report,0,belarusian riot police have used water cannon ...,1.41333,0.57716,1.54902,NORP riot police water cannon stun grenade bre...,0.037037,0.0432099,...,0.125,0.00154321,"[NORP, riot, police, water, cannon, stun, gren...",648,0.030864,0.026235,0.024691,0.064815,0.013889,0.121914
2,https://www.aljazeera.com/news/2020/10/12/nago...,Incident Report,0,russian foreign minister calls on armenia and ...,2.2518,0.372881,2.62887,NORP foreign minister call GPE GPE adhere agre...,0.0209847,0.0524617,...,0.0992736,0.00161421,"[NORP, foreign, minister, call, GPE, GPE, adhe...",1239,0.066182,0.09201,0.033091,0.047619,0.012914,0.205811
3,https://counteriedreport.com/roadside-bomb-bla...,Incident Report,0,alshabaab ended 2019 with a truckborne improvi...,1.25,0.75,1.5,GPE end DATE truckborne improvise explosive de...,0.0277778,0.037037,...,0.212963,0.0,"[GPE, end, DATE, truckborne, improvise, explos...",108,0.027778,0.027778,0.018519,0.0,0.018519,0.055556
4,https://counteriedreport.com/al-shabaabs-impro...,Incident Report,0,alshabaab ended 2019 with a truckborne improvi...,1.25,0.75,1.5,GPE end DATE truckborne improvise explosive de...,0.0277778,0.037037,...,0.212963,0.0,"[GPE, end, DATE, truckborne, improvise, explos...",108,0.027778,0.027778,0.018519,0.0,0.018519,0.055556


## I. Individual model performance

In [None]:
X = df[['text', 'len', 'org_count', 'place_count', 'time_count', 'person_count', 'num_count', 'ents_rep', 'vocab', 'ppo_rep', 'verb_present', 'verb_past', 'verb', 'modal', 'ad', 'prob']]
y = df[['class_id', 'class_label']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2020)

In [6]:
X = X_test[['len', 'org_count', 'place_count', 'time_count', 'person_count', 'num_count', 'ents_rep', 'vocab', 'ppo_rep', 'verb_present', 'verb_past', 'verb', 'modal', 'ad', 'prob']]
y = y_test['class_label']  # labels must come from the same split as X


# Use the same held-out rows for BERT so both models score the same documents
texts = X_test['text'].to_list()
labels = y_test['class_id'].astype(int)

dataset = BertEncoder(
    tokenizer=BertTokenizer.from_pretrained(
        'bert-base-cased',
        do_lower_case=False),
    input_data=texts
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.tensor(labels.to_list(), dtype=torch.long)

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

In [18]:
with open('./model/rf_model.sav', 'rb') as handle:
    rf = pickle.load(handle)
bert = torch.load('./model/trained_model.pt')

In [21]:
y_pred = rf.predict(X)
print(classification_report(y, y_pred))
print(confusion_matrix(y, y_pred))

                   precision    recall  f1-score   support

Analytical Report       0.97      0.99      0.98        74
  Incident Report       0.99      0.97      0.98        88
   Profile Report       0.95      0.98      0.97       100
 Situation Report       0.99      0.97      0.98        99

         accuracy                           0.98       361
        macro avg       0.98      0.98      0.98       361
     weighted avg       0.98      0.98      0.98       361

[[73  0  0  1]
 [ 0 85  3  0]
 [ 1  1 98  0]
 [ 1  0  2 96]]
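
The classification report and the confusion matrix above are two views of the same counts. As a quick cross-check, the sketch below recomputes per-class precision, recall, and overall accuracy directly from the printed matrix, using scikit-learn's convention that rows are true classes and columns are predicted classes. The helper `per_class_metrics` is a hypothetical name introduced for illustration.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
cm = [
    [73, 0, 0, 1],
    [0, 85, 3, 0],
    [1, 1, 98, 0],
    [1, 0, 2, 96],
]
class_names = ["Analytical Report", "Incident Report",
               "Profile Report", "Situation Report"]


def per_class_metrics(cm):
    """Return a (precision, recall) pair for each class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum (support)
        precision = true_positives / predicted_k if predicted_k else 0.0
        recall = true_positives / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics


metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for name, (precision, recall) in zip(class_names, metrics):
    print(f"{name:>17}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row sums (74, 88, 100, 99) reproduce the support column of the report. Most of the remaining errors sit in the Profile Report column, which is why that class has the lowest precision.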
