<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Language_model_and_visualization_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language models

## BERT 

Pre-training of Deep Bidirectional Transformers for Language Understanding

Link : https://arxiv.org/pdf/1810.04805.pdf

BERT - **B**idirectional **E**ncoder **R**epresentations from **T**ransformers

Architecture :
  * Multi-layer, **bidirectional**, encoder-based Transformer 
  * BERT-base - 12 encoder layers, hidden size 768, 12 self-attention heads

Framework :

  * BERT-tokenizer :
    * **WordPiece** embeddings 
    * Vocab size : 30,000
    * The first token is always `[CLS]`; its final hidden state (the output of the 12th encoder layer) serves as the aggregate sequence representation for classification
    * Two sequences are separated by `[SEP]` 
    * The end of each sequence is also marked by `[SEP]` (BERT has no `[EOS]` token)

    * Input : Input sentence
    * Output : Token embedding + segment embedding + position embedding (summed to form the model's input representation) 
    * "BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token." [RoBERTa paper]
  
  * Pre-training
    * Task1 : Masked language model (MLM)
      * For deep bidirectional representation, mask 15% of all WordPiece tokens in each sequence at random and predict the masked tokens.
    * Task2 : Next sentence prediction (NSP)
      * Trained to understand the relationship between two sentences, which matters for downstream tasks such as question answering (QA) and natural language inference (NLI)
    * Data
      * BookCorpus (800M words)
      * English Wikipedia (2500M words)
  * Fine-tuning
    * Plug in a task-specific output layer and fine-tune all the parameters end to end.
    * At the output, the representation of the first token, i.e. `[CLS]`, is fed into the output layer for classification.
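
The masking rule quoted above (select 15% of tokens; of those, 80% become `[MASK]`, 10% stay unchanged, 10% become a random vocabulary token) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `bert_mask`, the toy vocabulary and the choice to skip `[CLS]`/`[SEP]` are hypothetical and not taken from the BERT code base.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}  # assumption: special tokens are never masked


def bert_mask(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the token unchanged
    return masked, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = bert_mask(tokens, vocab, rng)
print(masked)
print(labels)
```

Positions with a non-`None` label are the only ones that contribute to the MLM loss; all other positions are passed through untouched.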

## GPT-2

Links :

1. http://www.persagen.com/files/misc/radford2019language.pdf
2. https://youtu.be/Ck9-0YkJD_Q?t=936
3. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf 

Model specification :
  * 12-layer decoder-only Transformer 
  * Masked self-attention heads (768-dimensional states and 12 attention heads)

Basic Blocks of GPT architecture [2,3] :
  1. GPT2 Tokenizer 
    * **Input** : Sentence 
    * **Output** : input_ids, attention_mask, labels 
    * **Vocab size** : 50,257 (about 50k) 
    * **Encoding** : Byte pair encoding (BPE) - "Middle ground between character level encoding and word level encoding" [1] 
  2. GPT-2 embedding block [2] :
      * Consists of 
        * Word embedding layer 
        * Position embedding layer
      * **Input** : (input_ids, attention_mask, labels)- size (1,8) [*sentence, tokens*]
      * **Output** : size (1,8,768) - [*sentence, tokens, embeddings per token*]
  3. GPT-2 Decoder Block (x12):
    * Consists of 
      * Attention block [2]:
        * Consists of 
          * Self attention mechanism  ( generating Query, key, value pairs etc. of transformer) 
      * Multi layer perceptron block/ feedforward layer (MLP - Block)
          * Consists of 
            * layer norm, convolution (`Conv1D`, which acts as a linear layer), activation function, dropout 
      * Layer Normalization 
  4. LM head layer 
    * Consists of a linear layer that projects the hidden states to the vocabulary size  

Training data :
  * WebText (web pages curated and filtered by humans; 45 million links)
    * Starting point: outbound links scraped from Reddit that received at least 3 karma

GPT2 for text classification :
  * Remove the LM head and attach a classification layer with the output dimension equal to the number of labels. 
  * Grab the output embedding of the last token in the sequence, because in a left-to-right LM it is the only position whose context covers the whole input sequence.
    * transformer_output[0] (https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2ForSequenceClassification) 
    * Sequence classification architecture:
      * Pooled output (last output as mentioned above)
      * Dropout(0.1)
      * Linear classifier layer with size (input_dim, num_class_labels)
      * Sigmoid Layer + BCE loss function 
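
The classification head described above can be sketched without any deep-learning library. The snippet is a minimal illustration, not the Hugging Face implementation: the helper names are hypothetical, the last token is located through a right-padded `attention_mask` (an assumption), and dropout is left out because it is inactive at inference time.

```python
import math


def last_token_index(attention_mask):
    """Index of the last non-padding position, assuming right padding."""
    return sum(attention_mask) - 1


def classify(hidden_states, attention_mask, weight, bias):
    """hidden_states: [tokens][hidden]; weight: [num_labels][hidden]; returns per-label probabilities."""
    pooled = hidden_states[last_token_index(attention_mask)]  # "pooled output"
    logits = [sum(w * h for w, h in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid, one per label


def bce_loss(probs, targets):
    """Mean binary cross-entropy over the labels of one example."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(probs, targets)]
    return -sum(terms) / len(terms)


# Toy example: 3 positions (the last one is padding), hidden size 2, 2 labels.
hidden_states = [[0.1, 0.2], [1.0, -1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

probs = classify(hidden_states, attention_mask, weight, bias)
loss = bce_loss(probs, targets=[1, 0])
print(probs, loss)
```

Using a sigmoid per label rather than a softmax over labels is what makes the head suitable for multi-label targets.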


   



## XLNet (TBD)

Link : https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf 



## RoBERTa 

Link : https://arxiv.org/abs/1907.11692

Optimized version of BERT with the following modifications:
  * Removing next sentence prediction (NSP) 
  * Training on more data and bigger batches
  * Training on longer sequences 
  * Dynamically changing the masking pattern applied to training data.

Data:
  * Increase data size to improve end-task performance 
  * BookCorpus + Wikipedia : Original data used by BERT 
  * CC-News : English portion of the CommonCrawl News dataset ("63 million articles crawled between September 2016 and February 2019 - 76 GB after filtering")
  * OPENWEBTEXT - "The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB)"
  * "STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB)".

Training procedure :
  * Static (BERT) vs Dynamic (RoBERTa) masking
    * Dynamic masking is a strategy where a new masking pattern is generated every time a sequence is fed into the model, rather than masking once during data preprocessing and reusing that pattern.
  * Model input format and Next sentence prediction 
    * DOC-SENTENCES : Inputs to the model are packed with full sentences which do not cross document boundaries. The NSP loss is removed. 
  * Training with larger batch sizes 
    * Batch size is increased to 8k sequences, compared with 256 in BERT.
  * Text encoding :
    * Byte Pair Encoding [BPE], which relies on subword units extracted from the training corpus (byte-level, 50K units), compared with BERT's character-level BPE vocabulary (30K) 
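
The difference between static and dynamic masking can be illustrated with a small, hypothetical sketch. `mask_positions` and the epoch loop are assumptions made for illustration; they are not taken from the RoBERTa code.

```python
import random


def mask_positions(num_tokens, rng, select_prob=0.15):
    """Draw the set of positions that will be masked in one pass over a sequence."""
    return [i for i in range(num_tokens) if rng.random() < select_prob]


num_tokens, num_epochs = 200, 4

# Static masking: the pattern is drawn once during preprocessing and then reused.
static_pattern = mask_positions(num_tokens, random.Random(0))
static_epochs = [static_pattern for _ in range(num_epochs)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_positions(num_tokens, rng) for _ in range(num_epochs)]

print("distinct static patterns :", len({tuple(p) for p in static_epochs}))
print("distinct dynamic patterns:", len({tuple(p) for p in dynamic_epochs}))
```

With dynamic masking the model sees different prediction targets for the same sequence across epochs, which matters more as the number of training steps grows.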

## Visualization (TBD)

Link: https://towardsdatascience.com/visualize-bert-sequence-embeddings-an-unseen-way-1d6a351e4568


1. https://jalammar.github.io/explaining-transformers/
2. https://jalammar.github.io/hidden-states/
3. BertViz : https://github.com/jessevig/bertviz

Bert-base:
  * 12 encoder-layer stack for building contextualized embeddings.
  * About 110 million trainable parameters.
  * Because each BERT layer produces its own embeddings of the input, it is useful to visualize the layers to analyze the patterns learned on unseen data.


Why?
  * To visualize, after training, how well each layer separates the classes over epochs.

How?
  * BertForSequenceClassification consists of :
    * 1 BertEmbeddings layer -> 12 BertLayer -> 1 BertPooler (dense + Tanh activation) -> Dropout layer -> Linear classifier

# Compilation of results 

Reference : https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score 

### Hyper-parameter search 

In [None]:
import pandas as pd

h_params_compiled = pd.read_csv('/content/h_params_compiled.csv', index_col=0)

In [None]:
h_params_compiled

| | model_name | learning_rate | num_train_epochs | seed | per_device_train_batch_size |
|---|---|---|---|---|---|
| 0 | roberta-base | 3.4e-05 | 5 | 22 | 8 |
| 1 | xlnet-base-cased | 2.5e-05 | 2 | 15 | 32 |
| 2 | bert-base-uncased | 2.5e-05 | 2 | 15 | 32 |
| 3 | gpt2 | 2.5e-05 | 2 | 15 | 32 |


### Loading the metrics

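In [82]:
import json
import pandas as pd

# Evaluation-result files, one per model.
result_files = {
    'bert': '/content/eval_results_BERT_0.5_.json',
    'roberta': '/content/eval_results_RoBERTa_0.5_.json',
    'gpt2': '/content/eval_results_gpt-2_0.5_.json',
    'xlnet': '/content/eval_results_xlnet_0.5_.json',
}

In [83]:
def load_metrics(path):
    # Read one JSON results file; the with-block closes the file afterwards.
    with open(path, 'r') as f:
        return json.load(f)

bert_metrics = load_metrics(result_files['bert'])
roberta_metrics = load_metrics(result_files['roberta'])
gpt2_metrics = load_metrics(result_files['gpt2'])
xlnet_metrics = load_metrics(result_files['xlnet'])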

### Sample_average precision, recall, f1-score (TBD)

#### Definition

'Sample_average':

* Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
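
As a rough illustration of the definition above, the following sketch computes sample-averaged precision, recall and F1 on a toy multilabel example. The function name and the convention of scoring 0.0 when a denominator is zero are assumptions made for this example.

```python
def sample_average_scores(y_true, y_pred):
    """y_true, y_pred: lists of 0/1 label rows, one row per instance."""
    precisions, recalls, f1s = [], [], []
    for true_row, pred_row in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(true_row, pred_row) if t == 1 and p == 1)
        n_pred, n_true = sum(pred_row), sum(true_row)
        # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
        f1s.append(2 * tp / (n_pred + n_true) if (n_pred + n_true) else 0.0)
    n = len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
precision, recall, f1 = sample_average_scores(y_true, y_pred)
print(precision, recall, f1)
```

Each instance gets its own precision, recall and F1 first; only then are the per-instance values averaged, so the averaged F1 is generally not the harmonic mean of the averaged precision and recall.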

#### Results

In [None]:
bert_metrics['sample_average_f1']

In [6]:
metrics = ['sample_average_precision','sample_average_recall','sample_average_f1']
results = [bert_metrics,roberta_metrics,gpt2_metrics,xlnet_metrics]
model_name = ['bert','roberta','gpt2','xlnet']

sample_avg_scores = {}

for index,result in enumerate(results):
  for metric in metrics:
    sample_avg_scores[model_name[index]+ "_"+ metric] = result[metric]

In [37]:
import pandas as pd

df = pd.DataFrame.from_records(sample_avg_scores, index = [0]).T
df.reset_index(drop = False, inplace = True )
df.columns = ['Model_name','sample_average_score']

In [76]:
# DataFrame.from_records sorts the dict keys alphabetically, so each model's
# three rows arrive in the order f1, precision, recall.
metric_order = ['fmeasure','precision','recall']

df_bert  = df.iloc[0:3]
df_bert_metrics = df_bert.copy()
df_bert_metrics['Model_name'] = 'bert_base_uncased'
df_bert_metrics.index = metric_order
df_bert_metrics = df_bert_metrics.set_index(['Model_name',df_bert_metrics.index])

In [77]:
df_roberta = df.iloc[6:9]
df_roberta_metrics = df_roberta.copy()
df_roberta_metrics['Model_name'] = 'roberta-base'
df_roberta_metrics.index = metric_order
df_roberta_metrics = df_roberta_metrics.set_index(['Model_name',df_roberta_metrics.index])

In [78]:
df_GPT2 = df.iloc[3:6]
df_GPT2_metrics = df_GPT2.copy()
df_GPT2_metrics['Model_name'] = 'gpt2'
df_GPT2_metrics.index = metric_order
df_GPT2_metrics = df_GPT2_metrics.set_index(['Model_name',df_GPT2_metrics.index])

In [79]:
df_XLNet = df.iloc[9:]
df_XLNet_metrics = df_XLNet.copy()
df_XLNet_metrics['Model_name'] = 'xlnet-base-cased'
df_XLNet_metrics.index = metric_order
df_XLNet_metrics = df_XLNet_metrics.set_index(['Model_name',df_XLNet_metrics.index])

In [80]:
df_per_label = pd.concat([df_bert_metrics,df_roberta_metrics,df_XLNet_metrics,df_GPT2_metrics])

In [81]:
df_per_label

| Model_name | Metric | sample_average_score |
|---|---|---|
| bert_base_uncased | fmeasure | 0.819245 |
| bert_base_uncased | precision | 0.843137 |
| bert_base_uncased | recall | 0.813054 |
| roberta-base | fmeasure | 0.87643 |
| roberta-base | precision | 0.880338 |
| roberta-base | recall | 0.875302 |
| xlnet-base-cased | fmeasure | 0.784985 |
| xlnet-base-cased | precision | 0.869057 |
| xlnet-base-cased | recall | 0.742949 |
| gpt2 | fmeasure | 0.582487 |


#### Analysis

* Sample-average scores of RoBERTa are higher than those of BERT on all three metrics, by roughly 0.04 to 0.06

### Micro_avg_scores

#### Definitions

Quality of overall classification is defined by 
  * Micro-average :  
    * Sums the true positives, false positives and false negatives over all classes and computes the metric from these pooled counts.
    * Favours bigger classes, so it can hide poor performance on rare classes when the data is skewed.
    * Average per-text decision. (Based on total number of text examples)
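
A minimal sketch of micro-averaging on toy data, assuming 0/1 label rows; the function name and the data are made up for illustration. The second class has a single example, which the classifier misses entirely.

```python
def micro_average(y_true, y_pred):
    """Pool TP/FP/FN over every (instance, class) decision, then compute the metrics once."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Class 0 has three examples (all correct); class 1 has one example (missed).
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro_p, micro_r, micro_f1 = micro_average(y_true, y_pred)
print(micro_p, micro_r, micro_f1)
```

Here the micro-averaged F1 stays at 6/7 (about 0.857) although the rare class is never detected, which is the sense in which micro-averaging favours bigger classes.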

#### Results

In [None]:
micro_avg_lms = pd.DataFrame([bert_metrics['Classification_report']['micro avg'],roberta_metrics['Classification_report']['micro avg'],gpt2_metrics['Classification_report']['micro avg'],xlnet_metrics['Classification_report']['micro avg']],index=['bert-base-uncased','roberta-base','gpt2','xlnet-base-cased'])

In [None]:
micro_avg_lms.columns = pd.MultiIndex.from_product([micro_avg_lms.columns, ['micro avg']])

In [None]:
micro_avg_lms

| | precision (micro avg) | recall (micro avg) | f1-score (micro avg) | support (micro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.824537 | 0.822633 | 0.823584 | 4330 |
| roberta-base | 0.864859 | 0.869053 | 0.866951 | 4330 |
| gpt2 | 0.849421 | 0.457275 | 0.594505 | 4330 |
| xlnet-base-cased | 0.880033 | 0.736952 | 0.802162 | 4330 |


#### Analysis

* **roberta-base** has the highest f1-score (0.867) and recall (0.869); **XLNet** has the highest precision (0.880)

### Macro_avg_score

#### Definition

Quality of overall classification is defined by 
  * Macro-average :
    * The measure is computed separately for each of the C classes and the C values are then averaged.
    * Macro-averaging treats all classes equally, regardless of how many examples each class has.
    * Average per-class measure (the dataset's class counts do not weight the result).
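
A minimal sketch of macro-averaging on toy data, assuming 0/1 label rows and scoring undefined ratios as 0.0; the function name and the data are made up for illustration.

```python
def macro_average(y_true, y_pred):
    """Compute precision, recall and F1 per class, then take the unweighted mean."""
    num_classes = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return (sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1s) / num_classes)


# Class 0 has three examples (all correct); class 1 has one example (missed).
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
macro_p, macro_r, macro_f1 = macro_average(y_true, y_pred)
print(macro_p, macro_r, macro_f1)
```

The missed rare class pulls every macro-averaged score down to 0.5, because it counts as much as the large class.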

#### Results

In [None]:
macro_avg_lms = pd.DataFrame([bert_metrics['Classification_report']['macro avg'],roberta_metrics['Classification_report']['macro avg'],gpt2_metrics['Classification_report']['macro avg'],xlnet_metrics['Classification_report']['macro avg']],index=['bert-base-uncased','roberta-base','gpt2','xlnet-base-cased'])

In [None]:
macro_avg_lms.columns = pd.MultiIndex.from_product([macro_avg_lms.columns, ['macro avg']])

In [None]:
macro_avg_lms

| | precision (macro avg) | recall (macro avg) | f1-score (macro avg) | support (macro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.854382 | 0.842521 | 0.84751 | 4330 |
| roberta-base | 0.880783 | 0.884021 | 0.882198 | 4330 |
| gpt2 | 0.792463 | 0.49603 | 0.563569 | 4330 |
| xlnet-base-cased | 0.885105 | 0.77277 | 0.815378 | 4330 |


#### Analysis

* XLNet has the highest precision (0.885) of the four LMs, but its recall (0.773) is lower than that of roberta-base and bert-base-uncased
* Overall, roberta-base scores best on f1-score (0.882)
* bert-base-uncased comes second (f1-score 0.848) with a good balance of precision and recall 


### Hamming_loss, subset_accuracy

References : 

  1. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
  2. https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics
  3. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
  4. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc


#### Definition

* Hamming loss : 
  * Fraction of labels incorrectly predicted (lower is better)
* Exact match ratio (aka subset accuracy):
    * **The average per-text exact classification**
    * It is a harsh measure, as a prediction only counts when the predicted label set exactly matches the true label set. 
      * Why?
        * In multi-label/class classification, partially correct predictions receive no credit. 
    * It depends on the chosen threshold, as output probabilities are rounded to 0 if below the threshold and to 1 otherwise.
* Hamming_score / Accuracy :
  * "Accuracy for each instance is defined as the proportion of the predicted correct labels to the total number (predicted and actual) of labels for that instance. Overall accuracy is the average across all instances. It is less ambiguously referred to as the Hamming score." [2]
* AUC_ROC curve and score:
  * Why?
    * Compare the ROC curves (behaviour of the predicted probabilities) and AUC scores (model performance) of two or three models, for binary classification or per class in the multi-class case.
    * The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. 
    * ROC is a probability curve 
      * TPR vs FPR plotted at different thresholds
    * AUC represents the degree or measure of separability.
      * AUC value ranges from 0 to 1 (perfect)
      * AUC is scale-invariant 
      * AUC is classification-threshold-invariant.
      * For multi-label/class, extend using one-vs-rest (class 0 vs rest, ...): one ROC curve per label.
  * Sensitivity / True positive rate:
    * Proportion of correctly classified positive samples (TP) out of the total number of positive samples (TP + FN).
    * **Higher the better**
  * Specificity / True negative rate:
    * Proportion of correctly classified negative samples (TN) out of the total number of negative samples (TN + FP).
    * **Higher the better**
  * False negative rate (1 - Sensitivity):
    * Proportion of positive samples incorrectly classified as negative (FN) out of the total number of positive samples (TP + FN).
    * **Lower the better**
  * False positive rate (1 - Specificity) :
    * Proportion of negative samples incorrectly classified as positive (FP) out of the total number of negative samples (TN + FP).
    * **Lower the better**.
  
  * Probability of predictions
    * `predict_proba` gives the probability distribution of the prediction across the different classes. 
    * The prediction is then converted into a class label by applying a decision threshold.
    * Different thresholds give different results and confusion matrices, which affect sensitivity and specificity.
    * Default value of threshold is 0.5 for prediction scores ranging from 0 to 1.
      * Prediction < 0.5 - class 0
      * Prediction >= 0.5 - class 1
    * The default might not always work.
    * The AUC-ROC curve gives a cumulative view of the results for different thresholds, and thus provides a visualization for choosing the best threshold.
  * Receiver Operating Characteristic (ROC) curve :
    * Curve that plots TPR (sensitivity) against FPR (1 - Specificity) for different thresholds.
      * X-axis - False positive rate
      * Y-axis - True positive rate
      * Ideal : 
        * Higher value on y-axis and lower value on x-axis.
          * Higher true positive rate than false positive rate.
            
    * Area under the curve (AUC) is the area under the ROC curve.
      * "The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes."
      1. AUC = 1 - Perfect classifier
      2. `0.5<AUC<1` - Better than random guessing
      3. AUC = 0.5 - Random guess
      4. AUC = 0 - Worst classifier (Neg as pos and pos as neg)
      * `from sklearn.metrics import roc_auc_score` 
      * `roc_auc_score(true,pred)`
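
The definitions above can be checked on toy data with a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, the Hamming score is taken as intersection over union of the label sets (with an empty union scored as 1.0), and AUC is computed through its ranking interpretation (the probability that a random positive is scored above a random negative) rather than by integrating the curve.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose full label row matches exactly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)


def hamming_score(y_true, y_pred):
    """Per-instance |true AND pred| / |true OR pred|, averaged over instances."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        inter = sum(t == 1 and p == 1 for t, p in zip(tr, pr))
        union = sum(t == 1 or p == 1 for t, p in zip(tr, pr))
        scores.append(inter / union if union else 1.0)  # assumption for empty rows
    return sum(scores) / len(scores)


def roc_auc(labels, scores):
    """Binary AUC via pairwise ranking; ties between a positive and a negative count 0.5."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(hamming_loss(y_true, y_pred))     # 1 wrong decision out of 6
print(subset_accuracy(y_true, y_pred))  # 1 exact match out of 2
print(hamming_score(y_true, y_pred))    # (1/2 + 1/1) / 2

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)                              # 3 of 4 positive/negative pairs ranked correctly
```

The first instance shows why subset accuracy is harsh: one of its two true labels is predicted correctly, yet it contributes nothing to subset accuracy while still earning 0.5 in the Hamming score.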




#### Results

In [85]:
bert = bert_metrics['hamming_loss'],bert_metrics['subset_accuracy'],bert_metrics['hamming_score'],bert_metrics['AUC_ROC_score']
roberta = roberta_metrics['hamming_loss'],roberta_metrics['subset_accuracy'],roberta_metrics['hamming_score'],roberta_metrics['AUC_ROC_score']
gpt2 = gpt2_metrics['hamming_loss'],gpt2_metrics['subset_accuracy'],gpt2_metrics['hamming_score'],gpt2_metrics['AUC_ROC_score']
xlnet = xlnet_metrics['hamming_loss'],xlnet_metrics['subset_accuracy'],xlnet_metrics['hamming_score'],xlnet_metrics['AUC_ROC_score']

In [86]:
hs_metrics = pd.DataFrame([bert,roberta,gpt2,xlnet],index=['bert-base-uncased','roberta-base','gpt2','xlnet-base-cased'],columns=['hamming_loss','subset_accuracy','hamming_score/Accuracy','AUC_ROC_score'])

In [87]:
hs_metrics

| | hamming_loss | subset_accuracy | hamming_score/Accuracy | AUC_ROC_score |
|---|---|---|---|---|
| bert-base-uncased | 0.09048 | 0.649879 | 0.772663 | 0.950945 |
| roberta-base | 0.064694 | 0.789283 | 0.848811 | 0.968983 |
| gpt2 | 0.152124 | 0.33199 | 0.518466 | 0.892467 |
| xlnet-base-cased | 0.091516 | 0.576954 | 0.729654 | 0.956161 |


#### Analysis

* Hamming loss (the main metric for multi-label classification) is lowest for roberta-base (0.065).
* The AUC_ROC score of roberta-base (0.969) has a slight margin over XLNet (0.956) and BERT (0.951) 

### per_class_precision_recall_fmeasure

#### Definition

References : http://rali.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf 

Precision : 
  * The number of correctly classified positive examples divided by the number of examples labeled by the model as positive.
  * Percentage of correctly **predicted** positive labels out of all labels predicted as positive by the model: `TP / (TP + FP)`.
  * Agreement of the data class labels with those of a classifier, if calculated from sums of per-text decisions. 


Recall : 
  * The number of correctly classified positive examples divided by the number of positive examples in the data.
  * Percentage of **identified positive samples** out of the total positive labels in the data: `TP / (TP + FN)`.
  * Effectiveness of a classifier to identify class labels, if calculated from sums of per-text decisions.

F1-Measure:
  * Harmonic mean of precision and recall: `2PR / (P + R)`
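
A small worked example of the three formulas. The counts used here (52 true positives, 9 false positives, 252 false negatives) are not stated anywhere in these notes; they are an assumption, back-calculated from the `gpt2_gender` column of the results (precision 0.852459, recall 0.171053, support 304), and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumed counts for the gender label of GPT-2 (support = tp + fn = 304).
tp, fp, fn = 52, 9, 252
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 6), round(recall, 6), round(f1, 6))
# 0.852459 0.171053 0.284932
```

The example shows how a model can look precise while missing most positives: few of its positive predictions are wrong, but it makes very few of them, and the harmonic mean stays close to the lower of the two values.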


#### Results

In [None]:
model = {}
Labels = ['Ethnicity','gender','profession','religion','Anti-stereotype','stereotype','unrelated']
# model = ['bert_base_uncased','roberta-base','gpt2','xlnet-base-cased']
metrics = [bert_metrics,roberta_metrics,gpt2_metrics,xlnet_metrics]
model_name = ['bert','roberta','gpt2','xlnet']

In [None]:
model.clear()
for index,metric in enumerate(metrics):
  for label in Labels:
    model[model_name[index] + "_"+ label] = metric['Classification_report'][label]

In [None]:
df = pd.DataFrame(model)

In [None]:
df.iloc[:,14:21]

| | gpt2_Ethnicity | gpt2_gender | gpt2_profession | gpt2_religion | gpt2_Anti-stereotype | gpt2_stereotype | gpt2_unrelated |
|---|---|---|---|---|---|---|---|
| precision | 0.87464 | 0.852459 | 0.732468 | 0.924915 | 0.404762 | 0.89011 | 0.867886 |
| recall | 0.774235 | 0.171053 | 0.603854 | 0.924915 | 0.021851 | 0.302804 | 0.673502 |
| f1-score | 0.82138 | 0.284932 | 0.661972 | 0.924915 | 0.041463 | 0.451883 | 0.758437 |
| support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |


In [None]:
df_bert  = df.iloc[:,:7]
df_bert_metrics = df_bert.copy()
df_bert_metrics['Model_name'] = 'bert_base_uncased'
df_bert_metrics =df_bert_metrics.set_index(['Model_name',df_bert.index])
df_bert_metrics.columns = Labels

In [None]:
df_roberta = df.iloc[:,7:14]
df_roberta_metrics = df_roberta.copy()
df_roberta_metrics['Model_name'] = 'roberta-base'
df_roberta_metrics = df_roberta_metrics.set_index(['Model_name',df_bert.index])
df_roberta_metrics.columns = Labels

In [None]:
df_GPT2 = df.iloc[:,14:21]
df_GPT2_metrics = df_GPT2.copy()
df_GPT2_metrics['Model_name'] = 'gpt2'
df_GPT2_metrics = df_GPT2_metrics.set_index(['Model_name',df_bert.index])
df_GPT2_metrics.columns = Labels

In [None]:
df_XLNet = df.iloc[:,21:]
df_XLNet_metrics = df_XLNet.copy()
df_XLNet_metrics['Model_name'] = 'xlnet-base-cased'
df_XLNet_metrics = df_XLNet_metrics.set_index(['Model_name',df_bert.index])
df_XLNet_metrics.columns = Labels

In [None]:
df_per_label = pd.concat([df_bert_metrics,df_roberta_metrics,df_XLNet_metrics,df_GPT2_metrics])

In [None]:
df_per_label

| Model_name | Metric | Ethnicity | gender | profession | religion | Anti-stereotype | stereotype | unrelated |
|---|---|---|---|---|---|---|---|---|
| bert_base_uncased | precision | 0.92725 | 0.870504 | 0.864583 | 0.97619 | 0.613208 | 0.745079 | 0.957555 |
| bert_base_uncased | recall | 0.959184 | 0.796053 | 0.888651 | 0.979522 | 0.66838 | 0.707477 | 0.88959 |
| bert_base_uncased | f1-score | 0.942947 | 0.831615 | 0.876452 | 0.977853 | 0.639606 | 0.725791 | 0.922322 |
| bert_base_uncased | support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |
| roberta-base | precision | 0.94403 | 0.833333 | 0.896186 | 0.983051 | 0.726753 | 0.773276 | 0.965 |
| roberta-base | recall | 0.968112 | 0.888158 | 0.905782 | 0.989761 | 0.652956 | 0.838318 | 0.913249 |
| roberta-base | f1-score | 0.955919 | 0.859873 | 0.900958 | 0.986395 | 0.687881 | 0.804484 | 0.938412 |
| roberta-base | support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |
| xlnet-base-cased | precision | 0.910843 | 0.850909 | 0.870536 | 0.986207 | 0.789916 | 0.804054 | 0.983271 |
| xlnet-base-cased | recall | 0.964286 | 0.769737 | 0.835118 | 0.976109 | 0.362468 | 0.66729 | 0.834385 |
