## Classification with BERT
Before diving into the details: training and validation were not run from this notebook but on an AWS EC2 instance. To replicate the results, please run BERT.py on a machine with a GPU.

After experimenting with traditional NLP methods, I suspect that their inability to capture the meaning of words and the style of writing has a big impact on accuracy. Transformers have a deeper understanding of words and sequences, beyond frequencies. In an attempt to improve accuracy, I propose the following strategies. Each has several variants and tries to solve a different challenge of BERT.

1. **BERT sequence classification model with pretrained weights and embeddings:**
- Simple truncation: keep the head only, the tail only, or the head and tail of the document.
- Hierarchical method: divide a document of length L into L/510 fragments, then combine the [CLS] hidden states of the fragments with mean pooling, max pooling or self-attention. There is a [suggestion](https://arxiv.org/pdf/1905.05583.pdf) that head-and-tail truncation produces better results than hierarchical methods on certain corpora.
- [Longformer](https://arxiv.org/pdf/2004.05150.pdf): a transformer that handles long documents. [Code](https://github.com/allenai/longformer)

2. **Fine-tune BERT MLM on a domain-specific corpus and use the updated weights (and embeddings) for BERT sequence classification.**

3. **Ensemble of traditional NLP with BERT.**

4. **Knowledge distillation.**

5. **Make more data by dividing documents into segments, each labelled accordingly.**

Options 1 and 4 try to address the problem of long sequences, option 2 the lack of domain-specific training, and option 3 tackles the problem from different angles (traditional NLP: frequencies; BERT: word meaning and style).
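
The truncation variants and the fragment split described under option 1 can be sketched on plain lists of token ids. This is a minimal illustration, not the code in `prep_bert`: the function names, the `mode` argument and the 128-token head used in head+tail mode are all assumptions made for the example.

```python
from typing import List

MAX_TOKENS = 510  # 512 positions minus the [CLS] and [SEP] special tokens


def truncate(tokens: List[int], mode: str = "head",
             max_tokens: int = MAX_TOKENS, head_len: int = 128) -> List[int]:
    """Shorten a token sequence to at most max_tokens items.

    head_len is only used in "head+tail" mode (hypothetical default).
    """
    if len(tokens) <= max_tokens:
        return list(tokens)
    if mode == "head":
        return tokens[:max_tokens]
    if mode == "tail":
        return tokens[-max_tokens:]
    if mode == "head+tail":
        if not 0 < head_len < max_tokens:
            raise ValueError("head_len must be between 1 and max_tokens - 1")
        tail_len = max_tokens - head_len
        return tokens[:head_len] + tokens[-tail_len:]
    raise ValueError(f"unknown mode: {mode}")


def split_into_fragments(tokens: List[int],
                         max_tokens: int = MAX_TOKENS) -> List[List[int]]:
    """Cut a long document into consecutive fragments of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Stand-in for a tokenised document of 1200 tokens
doc = list(range(1200))
print(len(truncate(doc, "head")), len(truncate(doc, "head+tail")))
print([len(fragment) for fragment in split_into_fragments(doc)])
```

In the hierarchical method each fragment would be passed through BERT separately and the per-fragment [CLS] states pooled afterwards; only the splitting step is shown here.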

Because of time and resource constraints, we are going to combine option 1 (with head truncation) and option 3. Traditional NLP and neural NLP may provide insights from different aspects, which would be interesting to explore here. Given more time and resources, options 2 and 4 would be very interesting to explore too!
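
One common way to combine a traditional classifier with BERT is soft voting: average the class probabilities of the two models and pick the highest. The sketch below is an assumption about how such an ensemble could look, not the contents of the `ensemble` module. The function names, the `bert_weight` parameter and the toy numbers are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def soft_vote(traditional_probs, bert_logits, bert_weight: float = 0.5):
    """Blend class probabilities from a traditional model with BERT logits.

    bert_weight is a hypothetical mixing weight in [0, 1].
    Returns the predicted class ids and the blended probabilities.
    """
    traditional_probs = np.asarray(traditional_probs, dtype=float)
    bert_probs = softmax(np.asarray(bert_logits, dtype=float))
    combined = (1.0 - bert_weight) * traditional_probs + bert_weight * bert_probs
    return combined.argmax(axis=1), combined


# Toy inputs for two documents and four classes (made-up numbers)
traditional_probs = [[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.2, 0.6, 0.1]]
bert_logits = [[0.0, 3.0, 0.0, 0.0],
               [0.0, 0.0, 3.0, 0.0]]
predictions, blended = soft_vote(traditional_probs, bert_logits)
print(predictions)
```

For the first toy document the two models disagree, and the blended probabilities decide which one wins; the mixing weight could be tuned on the validation set.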


In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from prep_bert import BertEncoder, build_dataloaders
from fine_tune_bert import fine_tune_bert
import pickle5 as pickle

if __name__ == '__main__':
    with open('data.pickle', 'rb') as handle:
        df = pickle.load(handle)
    texts = df['text'].tolist()
    label = df['class_id'].apply(lambda x: int(x))
    
    texts, _, label, _ = train_test_split(texts, label, test_size=0.2, stratify=label, random_state=2020)
    dataset = BertEncoder(
        tokenizer=BertTokenizer.from_pretrained(
            'bert-base-cased', 
            do_lower_case=False), 
        input_data=texts
    )

    data = dataset.tokenize(max_len=510)
    input_ids, attention_masks = data
    label = torch.Tensor(label.to_list()).long()
    train_dataloader, val_dataloader = build_dataloaders(
        input_ids=input_ids,
        attention_masks=attention_masks,
        labels=label,
        batch_size=(16, 16),
        train_ratio=0.8
    )

    bert_model = BertForSequenceClassification.from_pretrained(
        'bert-base-cased',
        output_attentions=False,
        output_hidden_states=False,
        num_labels = 4
    )

    optimizer = AdamW(bert_model.parameters(), lr=2e-5, eps=1e-8)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    trained_model, stats = fine_tune_bert(
        train_dataloader=train_dataloader, 
        valid_dataloader=val_dataloader, 
        model=bert_model,
        optimizer=optimizer,
        save_model_path='model/trained_model.pt',
        save_stats_dict_path='logs/statistics.json',
        device = device,
        epochs = 30
    )

The results after 21 epochs of training, at which point training was stopped because validation accuracy had not changed for a few epochs:

![Screenshot](training_result.png)
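
Stopping when validation accuracy stops improving can be automated with a small patience counter. The class below is a sketch under assumptions: it is not part of `fine_tune_bert`, and the names, the `patience` value and the toy accuracy history are hypothetical.

```python
class EarlyStopping:
    """Signal a stop once validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.stale_epochs = 0

    def step(self, val_accuracy: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if self.best is None or val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def epochs_until_stop(val_accuracies, patience: int = 3) -> int:
    """Return the 1-based epoch at which training would stop."""
    stopper = EarlyStopping(patience=patience)
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if stopper.step(accuracy):
            return epoch
    return len(val_accuracies)


# Toy validation accuracies (made up, not the values from the screenshot)
history = [0.70, 0.80, 0.85, 0.85, 0.85, 0.85, 0.90]
print(epochs_until_stop(history, patience=3))
```

Inside a training loop, `step` would be called once per epoch with that epoch's validation accuracy, and the loop would break when it returns True.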


This model is saved and ready for evaluation.

In [5]:
import pickle5 as pickle
from sklearn.model_selection import train_test_split
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from ensemble import bert_predict
from prep_bert import *
from sklearn.metrics import classification_report, confusion_matrix

with open('tokenised-data.pickle', 'rb') as handle:
    df = pickle.load(handle)
    
X = df['text']
y = df['class_id']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=2020)
texts = X_test.to_list()
labels = y_test.apply(lambda x:int(x))

dataset = BertEncoder(
    tokenizer=BertTokenizer.from_pretrained(
        'bert-base-cased',
        do_lower_case=False),
    input_data=texts
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.Tensor(labels.to_list()).long()

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

bert_state_dict = torch.load('./model/bfs_trained_model.pt')
bert = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    state_dict=bert_state_dict,
    num_labels = 4,
    output_attentions=False,
    output_hidden_states=False
)
y_b_pred, _ = bert_predict(bert, dataloader)
cls_names = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']
print(classification_report(labels, y_b_pred, target_names=cls_names))
print(confusion_matrix(labels, y_b_pred))

                   precision    recall  f1-score   support

  Incident Report       0.77      0.94      0.85        18
 Situation Report       0.94      0.85      0.89        20
   Profile report       1.00      0.65      0.79        20
Analytical report       0.75      1.00      0.86        15

         accuracy                           0.85        73
        macro avg       0.87      0.86      0.85        73
     weighted avg       0.88      0.85      0.85        73

[[17  1  0  0]
 [ 1 17  0  2]
 [ 4  0 13  3]
 [ 0  0  0 15]]
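
The numbers in the classification report can be recomputed from the confusion matrix above, where rows are true classes and columns are predicted classes. A short check with NumPy, with illustrative variable names:

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([
    [17, 1, 0, 0],   # Incident Report
    [1, 17, 0, 2],   # Situation Report
    [4, 0, 13, 3],   # Profile report
    [0, 0, 0, 15],   # Analytical report
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(float(accuracy), 2))
```

The third row shows where most of the errors come from: 7 of the 20 profile reports are misclassified, 4 as incident reports and 3 as analytical reports, which explains the recall of 0.65 for that class.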


There is one drawback of the vanilla BERT model: the tokeniser and the pre-trained BERT model were trained on Wikipedia and BookCorpus. Although it might have a good 'understanding' of Wikipedia and book English, it has not seen the news and analytical reports that are closer to our training data.

Enter [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). RoBERTa has two advantages here. First, it was pre-trained on Wikipedia and BookCorpus plus CC-NEWS, OpenWebText and STORIES; the latter three are all texts from the internet, much closer to the kind of text we see in our corpus. Second, RoBERTa employs different training strategies, arguably more effective for downstream tasks.

However, we reached zero loss in the first epoch (I am not sure whether that is a good thing). Either our training data is too small, or RoBERTa is too good?

I evaluate both the BERT model and the RoBERTa model on the same test set:

In [12]:
import pandas as pd  # needed for pd.concat below

with open('tokenised-data.pickle', 'rb') as handle:
    df = pickle.load(handle)
    
X = df['text']
y = df['class_id']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=2020)

df_train = pd.concat([X_train, y_train], axis=1)
df_train.to_csv('data_train.csv')
df_test = pd.concat([X_test, y_test], axis=1)
df_test.to_csv('data_test.csv')