## Classification with BERT
Before diving into the details: the training and validation were not performed from this notebook but a AWS EC2 instance. If you want to replicate the result, please run BERT.py on a machine with GPU. </n>

After playing around with traditional NLP methods, I think  them not being able to capturing the meaning of words and styles of writing may have a big impact on the accuracy. Transformers have a deeper understanding of words and sequences beyond frequencies. In attempt to improve accuracy, I propose the following strategies to try. Each may have variances within and trying to solve different challenges of BERT. </n>

1. **BERT for sequence classification model with pretrained weights and embeddings:** </n>
- Simple truncate: truncate the document head only, tail only, and head and tail; Hierachical method: divide the document into L/510 fragments and then use mean pooling, max pooling and self attention to combine hidden states of [CLS] for each fragment. There is [suggestion](https://arxiv.org/pdf/1905.05583.pdf) that head and tail produces better results on certain corpus than hierachical methods. </n>
- [Longformer](https://arxiv.org/pdf/2004.05150.pdf): A transformer that handles long documents. [Code](https://github.com/allenai/longformer)

2. **Fine tune BERT MLM on a domain specific corpus and use the updated weights (and embeddings) for BERT sequence classification**

3. **Ensemble of traditional NLP with BERT.**

4. **Knowledge distillation** 
5. **Make more data dividing documents into segments, each labelled accordingly.**
6. **Back translation to create more data**

Options 1 and 4, 5 are trying to address the problem of long sequence, 2 lack of domain specific training, and 3 tackling the problem from different angles (tranditional NLP: frequencies, BERT: word meaning and style). Option 6 is trying to tackle the problem of small training set. </n>

Because of time and resource contraint I have tried combining 1 with head truncation, 2, 5 and 6 with varied sucess. It's possible that traditional NLP and neural NLP could provide insights from different aspects and it would be interesting to explore here. Given time and resouce, other ideas would be good to explore too!


## I.a Vanilla BERT

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from prep_bert import BertEncoder, build_dataloaders
from fine_tune_bert import fine_tune_bert
import pickle5 as pickle

if __name__ == '__main__':
    with open('data.pickle', 'rb') as handle:
        df = pickle.load(handle)
    texts = df['text'].tolist()
    label = df['class_id'].apply(lambda x: int(x))
    
    texts, _, label, _ = train_test_split(texts, label, test_size=0.2, stratify=label, random_state=2020)
    dataset = BertEncoder(
        tokenizer=BertTokenizer.from_pretrained(
            'bert-base-cased', 
            do_lower_case=False), 
        input_data=texts
    )

    data = dataset.tokenize(max_len=510)
    input_ids, attention_masks = data
    label = torch.Tensor(label.to_list()).long()
    train_dataloader, val_dataloader = build_dataloaders(
        input_ids=input_ids,
        attention_masks=attention_masks,
        labels=label,
        batch_size=(16, 16),
        train_ratio=0.8
    )

    bert_model = BertForSequenceClassification.from_pretrained(
        'bert-base-cased',
        output_attentions=False,
        output_hidden_states=False,
        num_labels = 4
    )

    optimizer = AdamW(bert_model.parameters(), lr=2e-5, eps=1e-8)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    trained_model, stats = fine_tune_bert(
        train_dataloader=train_dataloader, 
        valid_dataloader=val_dataloader, 
        model=bert_model,
        optimizer=optimizer,
        save_model_path='model/trained_model.pt',
        save_stats_dict_path='logs/statistics.json',
        device = device,
        epochs = 30
    )

The result from 21 epochs of training, before it was stopped because validation accuracy had not changed for a few epochs: </n>
![Screenshot](training_result.png) </n>


This model is saved and ready for evaluation.

In [5]:
import pickle5 as pickle
from sklearn.model_selection import train_test_split
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from ensemble import bert_predict
from prep_bert import *
from sklearn.metrics import classification_report, confusion_matrix

with open('tokenised-data.pickle', 'rb') as handle:
    df = pickle.load(handle)
    
X = df['text']
y = df['class_id']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=2020)
texts = X_test.to_list()
labels = y_test.apply(lambda x:int(x))

dataset = BertEncoder(
    tokenizer=BertTokenizer.from_pretrained(
        'bert-base-cased',
        do_lower_case=False),
    input_data=texts
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.Tensor(labels.to_list()).long()

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

bert_state_dict = torch.load('./model/bfs_trained_model.pt')
bert = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    state_dict=bert_state_dict,
    num_labels = 4,
    output_attentions=False,
    output_hidden_states=False
)
y_b_pred, _ = bert_predict(bert, dataloader)
cls_names = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']
print(classification_report(labels, y_b_pred, target_names=cls_names))
print(confusion_matrix(labels, y_b_pred))

                   precision    recall  f1-score   support

  Incident Report       0.77      0.94      0.85        18
 Situation Report       0.94      0.85      0.89        20
   Profile report       1.00      0.65      0.79        20
Analytical report       0.75      1.00      0.86        15

         accuracy                           0.85        73
        macro avg       0.87      0.86      0.85        73
     weighted avg       0.88      0.85      0.85        73

[[17  1  0  0]
 [ 1 17  0  2]
 [ 4  0 13  3]
 [ 0  0  0 15]]


## I.b RoBERTa

There is a drawback of vanilla BERT model: the tokeniser and pre-trained BERT model were trained on Wikipedia and BookCorpus. Although it might have a good 'understanding' of Wiki and book English, it has not seen news and analytical reports that are closer to our training data. </n>
Here enters [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). The advantage of RoBERTa here are two. First, it was pre-trained on Wikipedia, BookCorpus, plus CC-NEWS, OpenWebText and STORIES, the later three are all texts from the internet, much closer to the kind of texts we see in our corpus. Two, RoBERTa also employs different training strategies, arguably more effective for downstream tasks. </n>
However, we managed to achieve no loss in the first epoch (not sure it is a good thing or not?). Either out training data is too small, or RoBERTa is too great?</n>
I evaluate both BERT model and RoBERTa model using the same testset:

In [None]:
# Warning: do not run on CPU.

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from prep_bert import *
from fine_tune_bert_sc import *
import pickle5 as pickle
import pandas as pd
from torch.utils.data import TensorDataset

if __name__ == '__main__':
    with open('data.pickle', 'rb') as handle:
        df = pickle.load(handle)
    texts = df['text'].tolist()
    labels = df['class_id'].apply(lambda x: int(x))
    train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=2020)

    train_dataset = BertEncoder(
        tokenizer=RobertaTokenizer.from_pretrained('roberta-base'), 
        input_data=train_texts
    )
    test_dataset = BertEncoder(
        tokenizer=RobertaTokenizer.from_pretrained('roberta-base'), 
        input_data=test_texts
    )

    train_data = train_dataset.tokenize(max_len=510)
    test_data = test_dataset.tokenize(max_len=510)

    train_input_ids, train_attention_masks = train_data
    test_input_ids, test_attention_masks = test_data

    train_labels = torch.Tensor(train_labels.to_list()).long()
    test_labels = torch.Tensor(test_labels.to_list()).long()

    train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
    test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_labels)

    model = RobertaForSequenceClassification.from_pretrained(
        'roberta-base',
        output_attentions=False,
        output_hidden_states=False,
        num_labels = 4
    )

    training_args = TrainingArguments(
        output_dir='.results/',
        num_train_epochs=8,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=10,
        weight_decay=0.01,
        logging_dir='./logs',
        load_best_model_at_end=True
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
        data_collator=dummy_data_collector
    )

    trainer.train()
    trainer.save_model('.model/')
    trainer.evaluate()
    

In [16]:
import pickle5 as pickle
from sklearn.model_selection import train_test_split
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer
from ensemble import bert_predict
from prep_bert import *
from sklearn.metrics import classification_report, confusion_matrix

with open('data.pickle', 'rb') as handle:
    df = pickle.load(handle)
    
X = df['text']
y = df['class_id']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=2020)
texts = X_test.to_list()
labels = y_test.apply(lambda x:int(x))

dataset = BertEncoder(
    tokenizer=RobertaTokenizer.from_pretrained('roberta-base'),
    input_data=texts
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.Tensor(labels.to_list()).long()

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

roberta = RobertaForSequenceClassification.from_pretrained(
    'model/roberta',
    local_files_only=True
)
y_b_pred, _ = bert_predict(roberta, dataloader)
cls_names = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']
print(classification_report(labels, y_b_pred, target_names=cls_names))
print(confusion_matrix(labels, y_b_pred))

                   precision    recall  f1-score   support

  Incident Report       0.95      1.00      0.97        18
 Situation Report       1.00      0.85      0.92        20
   Profile report       1.00      1.00      1.00        20
Analytical report       0.88      1.00      0.94        15

         accuracy                           0.96        73
        macro avg       0.96      0.96      0.96        73
     weighted avg       0.96      0.96      0.96        73

[[18  0  0  0]
 [ 1 17  0  2]
 [ 0  0 20  0]
 [ 0  0  0 15]]


It look like RoBERTa works a lot better than vanilla BERT model. Notice the model only misclassified 3 situation reports in test set. Let's run the model on the full data (train and test) to see which ones are misclassified.

In [27]:
dataset = BertEncoder(
    tokenizer=RobertaTokenizer.from_pretrained('roberta-base'),
    input_data=X.to_list()
)

tokenized_data = dataset.tokenize(max_len=510)

input_ids, attention_masks = tokenized_data
labels = torch.Tensor(y.apply(lambda x:int(x)).to_list()).long()

dataloader = build_test_dataloaders(
    input_ids=input_ids,
    attention_masks=attention_masks,
    labels=labels
)

y_pred, _ = bert_predict(roberta, dataloader)
cls_names = ['Incident Report', 'Situation Report', 'Profile report', 'Analytical report']
print(classification_report(labels, y_pred, target_names=cls_names))
print(confusion_matrix(labels, y_pred))

                   precision    recall  f1-score   support

  Incident Report       0.97      1.00      0.98        88
 Situation Report       1.00      0.95      0.97        99
   Profile report       1.00      1.00      1.00       100
Analytical report       0.96      0.99      0.97        74

         accuracy                           0.98       361
        macro avg       0.98      0.98      0.98       361
     weighted avg       0.98      0.98      0.98       361

[[ 88   0   0   0]
 [  2  94   0   3]
 [  0   0 100   0]
 [  1   0   0  73]]


In [25]:
import numpy as np
y_pred = np.array(y_pred).flatten()

In [26]:
print('BERT model misclassified ground truth', y[y!=y_pred])
print('BERT model misclassified predicted', np.array(y_pred)[y!=y_pred])

BERT model misclassified ground truth 26     1
68     1
75     1
77     1
336    1
347    3
Name: class_id, dtype: int64
BERT model misclassified predicted [0 0 3 3 3 0]


Misclassified cases are: </n>

26 and 86 ground truth: situation report, prediction: incident report </n>

74, 77 and 336 ground truth: situation report, prediction: profile report </n>

347 ground truth: profile report, prediction: incident report </n>




## II Finetune the BERT model with a MLM task and finetune again with classification task

The intuition behind this is the general idea pretrained models improve performance on downstream task. Because BERT might not have seen the kind of training data here, fine tuning BERT weights with the texts similar to the training data would be benefitial. </n>

Because labelled data is hard to come by, BERT Masked Language Model would be very useful. </n>

The original proposal is to obtain more documents of the similar kind, however, we weren't able to because of the time constraint. I could fine tune BERT with our training data on hand, however, I have little confidence it would work well because: first, less than 400 data points is too small to fine tune BERT weights effectively; second, if the model has seen the training data, would it constitute data leakage? </n>

Maybe this is a good avenue to explore given additional data (how much?) </n>

Given how well RoBERTa performs, I think it's fair to say any improvement on it with further MLM fine tuning would be quite amazing and proves how adaptable BERT is. The idea is not only for this classification task, but all downstream tasks such as entity recognition. It is probable that semi-supervised learning techniques would bring overall improvements on all tasks using similar kind of data. </n>

This has a very exciting practical implication. Instead of updating the model with labelled data (expensive and not easy to get), update BERT weights with all new data as it comes in. Like making a master sour dough, the BERT weights will get better and better and evolve with change (new vocabulary, changes in document types) </n>

Traditional NLP although also very effective and much quicker, will have to rely on retraining on labelled data to maintain. In addition, it will not be able to help down stream tasks such as entity recognition, sentiment analysis etc. Finally, even though training and inference takes very little time, particularly for randam forest, the time to tokenize is not particularly short if you have a large amount of data. Although, NN NLP has the same problem as well. </n>

In summary, as the original [RoBERTa paper](https://arxiv.org/pdf/1907.11692.pdf) pointed out, it achieved state-of-the-art performance from bigger batches, longer sequences, longer training time and more data! I think it points to the fact that ceiling probably has not been achieved and further training and fine tuning will give better results.

## III Backtranslate

For this part, I utilise [nlpaug](https://github.com/makcedward/nlpaug) -- a NLP augmentation API with a backgtranslation class that uses [pointers to the models of Facebook-FAIR's WMT'19 news translation task submission](https://github.com/pytorch/fairseq/tree/master/examples/wmt19) <br>

The fair wmt19 model contrains pretrained model for two languages, German and Russian.

In [25]:
en2de_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.en-de',
    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
    tokenizer='moses', bpe='subword_nmt')
de2en_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.de-en',
    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
    tokenizer='moses', bpe='subword_nmt')

en2de_ensemble.translate('Hello world!')

Using cache found in /Users/luluo/.cache/torch/hub/pytorch_fairseq_master
100%|██████████| 13605842351/13605842351 [17:31<00:00, 12939043.29B/s]
Error when composing. Overrides: ['common.no_progress_bar=True', 'common.log_interval=100', "common.log_format='simple'", 'common.tensorboard_logdir=null', 'common.wandb_project=null', 'common.azureml_logging=False', 'common.seed=1', 'common.cpu=False', 'common.tpu=False', 'common.bf16=False', 'common.memory_efficient_bf16=False', 'common.fp16=True', 'common.memory_efficient_fp16=False', 'common.fp16_no_flatten_grads=False', 'common.fp16_init_scale=128', 'common.fp16_scale_window=null', 'common.fp16_scale_tolerance=0.0', 'common.min_loss_scale=0.0001', 'common.threshold_loss_scale=null', 'common.user_dir=null', 'common.empty_cache_freq=0', 'common.all_gather_list_size=16384', 'common.model_parallel_size=1', 'common.quantization_config_path=null', 'common.profile=False', 'common.reset_logging=True', 'common_eval.path=null', 'common_eval.post_pr

ConfigCompositionException: Error merging override checkpoint.save_interval_updates=null

In [34]:
text = df['text'][0]

In [42]:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
encoded_dict = tokenizer(text=text, add_special_tokens=True, padding='max_length', max_length=512, return_attention_mask=True, return_tensors='pt')
input_ids = encoded_dict['input_ids']

In [65]:
import torch
# def get_masked_input_and_labels(input_ids):
mask_token_id = torch.Tensor(50264)
size = input_ids.size()
# 15% BERT masking
inp_mask = torch.randn(size) < 0.15
# Do not mask special tokens
inp_mask[input_ids<=2] = False
# # Set targets to -1 by default, it means ignore
# labels = -1 * np.ones(size, dtype=int)
# # Set labels for masked tokens
# labels[inp_mask] = input_ids[inp_mask]

# Prepare input
input_ids_masked = torch.clone(input_ids)
# Set input to [MASK] which is the last token for the 90% of tokens
# This means leaving 10% unchanged
inp_mask_2mask = inp_mask & (torch.randn(size) < 0.90)
input_ids_masked[inp_mask_2mask] = mask_token_id  # mask token is the last in the dict

# Set 10% to a random token
inp_mask_2random = inp_mask_2mask & (torch.randn(size) < 1 / 9)
input_ids_masked[inp_mask_2random] = torch.randint(
    low=3, high=50262, size=torch.sum(inp_mask_2random)
)

# # Prepare sample_weights to pass to .fit() method
# sample_weights = np.ones(labels.shape)
# sample_weights[labels == -1] = 0

# y_labels would be same as input_ids i.e input tokens
y_labels = torch.clone(input_ids)

#     return input_ids_masked, y_labels

RuntimeError: shape mismatch: value tensor of shape [50264] cannot be broadcast to indexing result of shape [412]

In [75]:
input_ids.shape

torch.Size([1, 912])

In [54]:
input_id_masked, y_labels = get_masked_input_and_labels(input_ids)

RuntimeError: "normal_kernel_cpu" not implemented for 'Long'

In [31]:
text = 'The quick brown fox jumped over the lazy dog'
back_translation_aug = naw.BackTranslationAug(
    from_model_name='transformer.wmt19.en-de',
    to_model_name='transformer.wmt19.de-en')
back_translation_aug.augment(text)

Using cache found in /Users/luluo/.cache/torch/hub/pytorch_fairseq_master
 69%|██████▉   | 8223920128/11958904958 [3:48:23<1:43:43, 600135.83B/s]
Using cache found in /Users/luluo/.cache/torch/hub/pytorch_fairseq_master
100%|██████████| 11958904958/11958904958 [15:37<00:00, 12750993.89B/s]


'The speedy brown fox jumped over the lazy dog'

In [23]:
len(df['text'][0])

4480

In [19]:
import nlpaug.augmenter.word as naw
import pandas as pd
import pickle



def back_translation(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """
    This function takes in a dataframe, backtranslate the text column and append to the original dataframe with labels 

    Args:
        df (pd.DataFrame): Dataframe containing text and labels
        col_name (str): Name of the text column

    Returns:
        pd.DataFrame: Dataframe containing original text and backtranslated text and labels
    """

    MODELS = {
        'GERMAN': ('transformer.wmt19.en-de', 'transformer.wmt19.de-en'),
        'RUSSIAN': ('transformer.wmt19.en-ru', 'transformer.wmt19.ru-en')
    }

    for _, model in MODELS.items():
        back_translation_aug = naw.BackTranslationAug(
            from_model_name=model[0],
            to_model_name=model[1]
        )
        translation = df[col_name].apply(back_translation_aug.augment)
        print(translation)
        labels = df['class_id']
        trans_df = pd.DataFrame({
            'text': translation,
            'class_id': labels
        })
        df = pd.concat([df, trans_df], axis=0, ignore_index=True)
    
    return df

In [20]:
with open('data.pickle', 'rb') as handle:
    df = pickle.load(handle)
df = back_translation(df, 'text')

Using cache found in /Users/luluo/.cache/torch/hub/pytorch_fairseq_master
Using cache found in /Users/luluo/.cache/torch/hub/pytorch_fairseq_master


Exception: Size of sample #0 is invalid (=(1616, 0)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test

## IV Split texts into segments to build more training data
Because the overall lengths of the training texts are much longer than 510, here I experiment on creating more training data by dividing the documents into chunks of lengths between 300 and 510. I don't want anything too short that is hard to classify.

In [12]:
import pickle
with open('data.pickle', 'rb') as handle:
    df = pickle.load(handle)

train_texts, test_texts, train_labels, test_labels = train_test_split(df['text'], df['class_id'], test_size=0.2, random_state=2020)
train_dataset = pd.DataFrame(list(zip(train_texts, train_labels)), columns=['text', 'class_id'])    
test_dataset = pd.DataFrame(list(zip(test_texts, test_labels)), columns=['text', 'class_id'])    

In [15]:
from prep_bert import *
split_train_df = split_training_data(train_dataset, 510)

In [16]:
with open('split_train_data.pickle', 'wb') as handle:
    pickle.dump(split_train_df, handle)
with open('test_data.pickle', 'wb') as handle:
    pickle.dump(test_dataset, handle)

In [18]:
split_train_df

Unnamed: 0,text,label
0,The Friday Cover is POLITICO Magazine's ...,3
1,cracking down on religion. Poland’s fierce...,3
2,Polish leaders sometimes talk of military ...,3
3,Al-Qa‘ida in the Arabian Peninsula is a ...,2
4,"Published Nov. 16, 2020 4:54AM ET Photo ...",3
...,...,...
604,until “the Mujahideen has irreversibly com...,2
605,Stories being covered today by BBC Monit...,1
606,last month after negotiations with the fe...,1
607,With communication lines cut and roads b...,3
