# Final Project

## TRAC2 - Transformer Models (BERT base uncased) - Task A Binary Classification

The purpose of this notebook is to build a classification model for Task A with only two classes: Aggressive (AG) and Non-aggressive (NAG). The original classes OAG and CAG are combined into the single class AG.

The notebook `TRAC2-Data_2_classes_Task_A.ipynb` combines the classes and creates the dataset.
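
The relabeling itself happens in that other notebook and is not shown here. As a rough sketch of what the step might look like, assuming the raw labels sit in the `Sub-task A` column and the merged label is written to `label_a` (both column names appear later in this notebook; the toy rows and the `merge_map` name are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the raw data; the real rows come from the TRAC2 CSV files.
raw = pd.DataFrame({
    'Text': ['post one', 'post two', 'post three', 'post four'],
    'Sub-task A': ['NAG', 'OAG', 'CAG', 'NAG'],
})

# Hypothetical mapping: OAG and CAG both collapse into AG, NAG is unchanged.
merge_map = {'OAG': 'AG', 'CAG': 'AG', 'NAG': 'NAG'}
raw['label_a'] = raw['Sub-task A'].map(merge_map)

print(raw['label_a'].tolist())  # ['NAG', 'AG', 'AG', 'NAG']
```

Using an explicit dictionary means any unexpected label turns into `NaN` instead of being silently kept, which makes data problems easier to spot.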

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Package imports

In [None]:
!pip install transformers



In [None]:
!pip install datasets



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import tensorflow as tf
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification
from datasets import load_dataset

from sklearn.preprocessing import label_binarize
from sklearn import metrics

import statistics

## Helper functions

In [None]:
def from_logits_to_labels(pred, task):
    '''
    Returns labels based on predicted logits on labels [AG,NAG] for task A (binary in this notebook).
    Task B is also binary, and 'GEN' represents the positive class.
    Parameters:
    pred: array with model prediction
    task: either 'A' or 'B'
    '''
    index_a = {0:'AG', 1:'NAG'}
    index_b = {0:'GEN', 1:'NGEN'}
    
    if task == 'A':
        highest_prob_class = np.argmax(pred, axis=1)
        labels = np.vectorize(index_a.get)(highest_prob_class.astype(int))
        
    elif task == 'B':
        highest_prob_class = np.argmax(pred, axis=1)
        labels = np.vectorize(index_b.get)(highest_prob_class.astype(int))
    else:
        labels = []
        
    return labels  
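
The helper above applies `argmax` directly to the logits without converting them to probabilities first. The sketch below, on made-up logits, suggests why that is enough: softmax preserves the ordering within each row, so the winning class is the same either way. The column order [AG, NAG] is taken from `index_a`; everything else is illustrative.

```python
import numpy as np

# Toy logits for three posts; columns follow the order [AG, NAG].
logits = np.array([[ 2.0, -1.0],
                   [-0.5,  0.3],
                   [ 0.1,  0.0]])

# Softmax turns logits into probabilities (row max subtracted for numerical stability).
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

index_a = {0: 'AG', 1: 'NAG'}
labels_from_logits = [index_a[int(i)] for i in np.argmax(logits, axis=1)]
labels_from_probs = [index_a[int(i)] for i in np.argmax(probs, axis=1)]

print(labels_from_logits)  # ['AG', 'NAG', 'AG']
print(labels_from_probs)   # ['AG', 'NAG', 'AG']
```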

In [None]:
def to_one_hot_labels(string_labels):
    '''
    Returns one-hot encoded labels from a multi-class label vector e.g. ['cat', 'dog', 'dog', 'lion', 'cat', ...] 
    Parameters:
    string_labels: list or pandas Series with one string class label per example
    '''
    labels = pd.get_dummies(string_labels)
    labels = labels.to_numpy()
    
    return labels
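
A quick check on toy labels may help confirm that the one-hot column order matches the `index_a` mapping used when decoding predictions (0 for AG, 1 for NAG). The explicit cast to `float32` is an addition, not part of the helper above: recent pandas versions return boolean dummies, and a numeric array is a safer input for a Keras loss.

```python
import pandas as pd

labels = ['NAG', 'AG', 'NAG', 'AG', 'AG']
dummies = pd.get_dummies(labels)

# Columns come out sorted, so AG is index 0 and NAG is index 1.
print(list(dummies.columns))  # ['AG', 'NAG']

# Cast explicitly: newer pandas returns booleans rather than 0/1 integers.
one_hot = dummies.to_numpy().astype('float32')
print(one_hot[:2])
```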

## Load data
Load training, development and test datasets.

In [None]:
# Load labels using pandas dataframes

train_labels = pd.read_csv('drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/train.csv')['label_a']
dev_labels = pd.read_csv('drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/dev.csv')['label_a']
test_labels = pd.read_csv('drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/test.csv')['label_a']

In [None]:
# Load text data using Hugging Face datasets
# The split argument is required even though we are not splitting the data:
# without it, load_dataset returns a DatasetDict instead of a single Dataset.
train_dataset = load_dataset('csv', data_files='drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/train.csv', split = 'train[:4263]')
dev_dataset = load_dataset('csv', data_files='drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/dev.csv', split = 'train[:1066]')
test_dataset = load_dataset('csv', data_files='drive/MyDrive/w266/pet_files/all_data_task_A_two_classes/test.csv', split = 'train[:1200]')

Using custom data configuration default-5f77e630ad205ebf
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-5f77e630ad205ebf/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)
Using custom data configuration default-9f5c4d2d22d4cdf0
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-9f5c4d2d22d4cdf0/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)
Using custom data configuration default-6807051e4db05c0e
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-6807051e4db05c0e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


## Encode labels

In [None]:
# one-hot encode labels for Task A: [AG, NAG]
train_labels_enc = to_one_hot_labels(train_labels)
dev_labels_enc = to_one_hot_labels(dev_labels)
test_labels_enc = to_one_hot_labels(test_labels)


## Prepare TensorFlow datasets for BERT

In [None]:
# remove all other columns and keep only 'Text', the column with the posts
train_dataset = train_dataset.remove_columns(['ID', 'Sub-task A', 'Sub-task B', 'label_a'])
dev_dataset = dev_dataset.remove_columns(['ID', 'Sub-task A', 'Sub-task B', 'label_a'])
test_dataset = test_dataset.remove_columns(['ID', 'Sub-task A', 'Sub-task B', 'label_a'])

In [None]:
# define a BERT tokenizer
# use the bert-base-uncased tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# tokenize the train, development and test data
# Tried to increase the maximum sequence length but got an error. Based on the EDA results, 150 should be enough.

train_dataset_tok = train_dataset.map(lambda x: tokenizer(x['Text'], truncation=True, padding=True, max_length=150), batched=True)
dev_dataset_tok = dev_dataset.map(lambda x: tokenizer(x['Text'], truncation=True, padding=True, max_length=150), batched=True)
test_dataset_tok = test_dataset.map(lambda x: tokenizer(x['Text'], truncation=True, padding=True, max_length=150), batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-5f77e630ad205ebf/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-d8af7848e47fc82e.arrow


  0%|          | 0/2 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6807051e4db05c0e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-1b31fc717748ad01.arrow
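
To make the effect of `truncation` and `max_length` concrete, here is a simplified pure-Python sketch of cutting or padding a token sequence to a fixed length and building the matching attention mask. This is not the tokenizer's actual implementation: the `pad_or_truncate` name, the token ids, and the pad id are made up, and a real BERT tokenizer keeps the closing special token when truncating. Note also that `padding=True` pads to the longest sequence in each mapped batch, whereas `padding='max_length'` always pads to `max_length`; the sketch mimics the latter.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # truncation
    n_pad = max_length - len(kept)             # how much padding is needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 22, 33, 44, 55, 66, 77], max_length=5)

print(short_ids, short_mask)  # [11, 22, 33, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [11, 22, 33, 44, 55] [1, 1, 1, 1, 1]
```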


In [None]:
# now we can remove the column with the original post from the dataset. We are going to use the result of tokenization for modeling
train_dataset_tok = train_dataset_tok.remove_columns(['Text']).with_format('tensorflow')
dev_dataset_tok = dev_dataset_tok.remove_columns(['Text']).with_format('tensorflow')
test_dataset_tok = test_dataset_tok.remove_columns(['Text']).with_format('tensorflow')

In [None]:
# extract features from tokenizer output: 'input_ids', 'token_type_ids', 'attention_mask'
train_features = {x: train_dataset_tok[x] for x in tokenizer.model_input_names}
dev_features = {x: dev_dataset_tok[x] for x in tokenizer.model_input_names}
test_features = {x: test_dataset_tok[x] for x in tokenizer.model_input_names}

In [None]:
# batch data

batch_size = 16
buffer = len(train_dataset_tok)

# Task A
train_tf_dataset_a = tf.data.Dataset.from_tensor_slices((train_features, train_labels_enc)).shuffle(buffer).batch(batch_size)
dev_tf_dataset_a = tf.data.Dataset.from_tensor_slices((dev_features, dev_labels_enc)).batch(batch_size)
test_tf_dataset_a = tf.data.Dataset.from_tensor_slices((test_features, test_labels_enc)).batch(batch_size)
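
As a sanity check on the batching, the number of batches per split can be worked out by hand. The split sizes below are an assumption taken from the slice bounds used when loading the data (4263, 1066 and 1200 rows); `.batch()` keeps the final partial batch by default, so the count is rounded up.

```python
import math

batch_size = 16
# Assumed split sizes, taken from the slice bounds used in load_dataset.
split_sizes = {'train': 4263, 'dev': 1066, 'test': 1200}

summary = {}
for name, n_rows in split_sizes.items():
    n_batches = math.ceil(n_rows / batch_size)
    last_batch = n_rows - (n_batches - 1) * batch_size
    summary[name] = (n_batches, last_batch)
    print(f'{name}: {n_batches} batches, last batch has {last_batch} rows')

# train: 267 batches, last batch has 7 rows
# dev: 67 batches, last batch has 10 rows
# test: 75 batches, last batch has 16 rows
```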


## Model Task A

In [None]:
# initialize lists to keep statistics of all runs
f1_NAG = []
f1_AG = []
f1_macro = []
f1_weighted = []
accuracy = []

# run the model 15 times to get an idea of variability
for i in range(15):

  # delete the model from the previous iteration, if it exists
  try:
    del BERT_model_A
  except NameError:
    pass
  
  # define the model. Task A is a binary classification task here, so num_labels=2
  BERT_model_A = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

  # compile model
  BERT_model_A.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
                       loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                       metrics=tf.metrics.CategoricalAccuracy()
                       )
  # fit model
  training_history = BERT_model_A.fit(train_tf_dataset_a, validation_data=dev_tf_dataset_a, epochs=2)

  print(f'---------------------------Iteration {i} ---------------------------\n')
  # Evaluate model on TEST data
  # predict using model. Returns logits
  pred_labels_test = BERT_model_A.predict(test_features)[0]
  # convert logits to labels
  pred_labels_test = from_logits_to_labels(pred_labels_test, 'A')

  # get f1-score for all classes, macro and weighted
  x = metrics.classification_report(test_labels, pred_labels_test, digits=3, output_dict=True)
  # append values to keep scores
  f1_NAG.append(x['NAG']['f1-score'])
  f1_AG.append(x['AG']['f1-score'])
  f1_macro.append(x['macro avg']['f1-score'])
  f1_weighted.append(x['weighted avg']['f1-score'])
  accuracy.append(x['accuracy'])

# calculate mean
f1_NAG_mean = round(statistics.mean(f1_NAG), 3)
f1_AG_mean = round(statistics.mean(f1_AG), 3)
f1_macro_mean = round(statistics.mean(f1_macro), 3)
f1_weighted_mean = round(statistics.mean(f1_weighted), 3)
accuracy_mean = round(statistics.mean(accuracy), 3)

# calculate standard deviation
f1_NAG_std = round(statistics.stdev(f1_NAG), 3)
f1_AG_std = round(statistics.stdev(f1_AG), 3)
f1_macro_std = round(statistics.stdev(f1_macro), 3)
f1_weighted_std = round(statistics.stdev(f1_weighted), 3)
accuracy_std = round(statistics.stdev(accuracy), 3)

print('Class NAG')
print(f'Mean f1-score = {f1_NAG_mean}')
print(f'Standard deviation f1-score = {f1_NAG_std}\n')

print('Class AG')
print(f'Mean f1-score = {f1_AG_mean}')
print(f'Standard deviation f1-score = {f1_AG_std}\n')

print('Class Macro')
print(f'Mean f1-score = {f1_macro_mean}')
print(f'Standard deviation f1-score = {f1_macro_std}\n')

print('Class Weighted')
print(f'Mean f1-score = {f1_weighted_mean}')
print(f'Standard deviation f1-score = {f1_weighted_std}\n')

print('Accuracy')
print(f'Mean = {accuracy_mean}')
print(f'Standard deviation = {accuracy_std}\n')
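
The loop above reads its scores out of the dictionary returned by `classification_report(..., output_dict=True)`. A small sketch on made-up labels shows the structure of that dictionary and where the keys used above come from; the toy `y_true` and `y_pred` values are purely illustrative.

```python
from sklearn import metrics

# Toy ground truth and predictions, for illustration only.
y_true = ['AG', 'NAG', 'NAG', 'AG']
y_pred = ['AG', 'NAG', 'AG', 'AG']

report = metrics.classification_report(y_true, y_pred, digits=3, output_dict=True)

# One entry per class, plus overall accuracy and the two averages.
print(sorted(report.keys()))
print(report['AG']['f1-score'])
print(report['macro avg']['f1-score'])
print(report['accuracy'])
```

The per-class entries are dictionaries with `precision`, `recall`, `f1-score` and `support`, while `accuracy` is a single float.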

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2
Epoch 2/2
---------------------------Iteration 0 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 1 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 2 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 3 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 4 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 5 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 6 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 7 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 8 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 9 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 10 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 11 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 12 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 13 ---------------------------



Epoch 1/2
Epoch 2/2
---------------------------Iteration 14 ---------------------------

Class NAG
Mean f1-score = 0.867
Standard deviation f1-score = 0.035

Class AG
Mean f1-score = 0.787
Standard deviation f1-score = 0.114

Class Macro
Mean f1-score = 0.827
Standard deviation f1-score = 0.074

Class Weighted
Mean f1-score = 0.833
Standard deviation f1-score = 0.068

Accuracy
Mean = 0.838
Standard deviation = 0.057
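
One way to read these numbers is to turn each standard deviation into a standard error of the mean over the 15 runs. The sketch below does that for the reported F1 scores; it assumes the runs are independent and relies on `statistics.stdev` being the sample standard deviation. It also checks that the mean macro F1 agrees with the average of the two class means, which should hold because macro F1 is the unweighted mean of the per-class scores in every run.

```python
import math

n_runs = 15
# Reported (mean, sample standard deviation) of the F1 scores across runs.
reported = {
    'NAG': (0.867, 0.035),
    'AG': (0.787, 0.114),
    'macro': (0.827, 0.074),
}

sems = {}
for name, (mean, std) in reported.items():
    sems[name] = std / math.sqrt(n_runs)
    print(f'{name}: {mean:.3f} +/- {sems[name]:.3f} (standard error over {n_runs} runs)')

# Consistency check: average of the two class means vs. the reported macro mean.
macro_check = (reported['NAG'][0] + reported['AG'][0]) / 2
print(round(macro_check, 3))  # 0.827
```

The AG class shows roughly three times the spread of the NAG class, so most of the run-to-run variability in the macro score comes from AG.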



## References

- Pre-processing data: https://huggingface.co/transformers/preprocessing.html

- Fine-tuning a pre-trained model: https://huggingface.co/transformers/training.html

- BERT: https://huggingface.co/transformers/model_doc/bert.html
