<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Language_model_and_visualization_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Link : https://arxiv.org/pdf/1810.04805.pdf

BERT - **B**idirectional **E**ncoder **R**epresentations from **T**ransformers

Architecture :
  * Multi-layer, **bidirectional**, encoder-based transformer
  * BERT-base: 12 encoder layers, hidden size 768, 12 self-attention heads

Framework :

  * BERT-tokenizer :
    * **WordPiece** embeddings 
    * Vocab size : 30,000
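    * First token is always `[CLS]`; its final hidden state (the output of the 12th encoder layer) serves as the aggregate sequence representation for classification
    * Two sequences are separated by `[SEP]`
    * Each sequence also ends with `[SEP]`; BERT has no `[EOS]` token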

    * Input : Input sentence
    * Output : token embedding + segment embedding + position embedding (summed to form the input representation)
    * "BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token." [RoBERTa paper]
  
  * Pre-training
    * Task1 : Masked language model (MLM)
      * For deep bidirectional representation, mask 15% of all WordPiece tokens in each sequence at random and predict the masked tokens.
    * Task2 : Next sentence prediction (NSP)
      * Trained to understand relationships between two sentences (Q&A, NLI)
    * Data
      * BookCorpus (800M words)
      * English Wikipedia (2500M words)
  * Fine-tuning
    * Plug in a task-specific output layer and fine-tune all the parameters end to end.
    * At the output, the representation of the first token, i.e. `[CLS]`, is fed into the output layer for classification.
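
The 80/10/10 replacement rule quoted above can be sketched in plain Python. This is a simplified illustration, not the original BERT preprocessing code: it selects each token independently with probability 0.15 (the original samples a fixed number of positions), and the names `mask_tokens`, `vocab` and `select_prob` are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss is only computed on the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        # Special tokens are never selected; other tokens are selected with select_prob.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(token)              # 10%: leave unchanged
        else:
            corrupted.append(rng.choice(vocab))  # 10%: random vocabulary token
    return corrupted, labels


# Toy vocabulary and sentence (assumptions for illustration only)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
print(corrupted)
print(labels)
```

Keeping 10% of the selected tokens unchanged and swapping 10% for random tokens means the model cannot rely on `[MASK]` always marking the positions it has to predict, which matters because `[MASK]` never appears during fine-tuning.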

# GPT-2

Links :

1. http://www.persagen.com/files/misc/radford2019language.pdf
2. https://youtu.be/Ck9-0YkJD_Q?t=936
3. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf 

Model specification :
  * 12-layer decoder-only transformer
  * Masked self-attention heads (768-dimensional states and 12 attention heads)

Basic Blocks of GPT architecture [2,3] :
  1. GPT-2 tokenizer
      * **Input** : Sentence
      * **Output** : input_ids, attention_mask, labels
      * **Vocab size** : ~50k (50,257)
      * **Encoding** : Byte pair encoding (BPE) - "Middle ground between character level encoding and word level encoding" [1]
  2. GPT-2 embedding block [2] :
      * Consists of
        * Word embedding layer
        * Position embedding layer
      * **Input** : (input_ids, attention_mask, labels) - size (1,8) [*sentence, tokens*]
      * **Output** : size (1,8,768) - [*sentence, tokens, embeddings per token*]
  3. GPT-2 decoder block (x12):
      * Consists of
        * Attention block [2]:
          * Consists of
            * Self-attention mechanism (generating the query, key and value vectors, etc. of the transformer)
        * Multi-layer perceptron block / feed-forward layer (MLP block)
          * Consists of
            * Layer norm, convolution (Conv1D, which acts as a linear layer in the Hugging Face implementation), activation function, dropout
        * Layer normalization
  4. LM head layer
      * Consists of a linear layer projecting to the vocabulary
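
The "masked" part of the self-attention in the decoder blocks can be illustrated with a small sketch: each position may only attend to itself and to earlier positions. This is a single-head, pure-Python illustration under simplifying assumptions (raw scores are given directly, with no query/key/value projections), and `causal_attention_weights` is a hypothetical helper, not part of any library.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix and softmax each row.

    scores[i][j] is the raw attention score of query position i on key position j.
    Future positions (j > i) always receive zero weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]                  # positions 0..i only
        peak = max(visible)                           # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# With uniform scores, position i spreads its attention evenly over positions 0..i.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Because the last position is the only one that can attend to the whole input, its output carries the most context in a left-to-right model.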

Training data :
  * WebText (web pages curated and filtered by humans; text from 45 million links)
    * Starting point: outbound links scraped from Reddit that received at least 3 karma

GPT2 for text classification :
  * Remove the LM head and attach a classification layer with output dimension equal to the number of labels.
  * Take the output embedding of the last token in the sequence, because a left-to-right LM gives it context information from every preceding token in the input sequence.
    * transformer_output[0] (https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2ForSequenceClassification) 
    * Sequence classification architecture:
      * Pooled output (last output as mentioned above)
      * Dropout(0.1)
      * Linear classifier layer with size (input_dim, number of class labels)
      * Sigmoid Layer + BCE loss function 
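
The classification head described above can be sketched without any deep learning library. This is a minimal illustration, assuming right-padded inputs and an attention mask of 1s (real tokens) and 0s (padding). Dropout is omitted, as it would be at inference time, and all function names and toy values are hypothetical.

```python
import math


def last_token_index(attention_mask):
    """Index of the last real (non-padding) token, assuming right padding."""
    return sum(attention_mask) - 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(hidden_states, attention_mask, weight, bias):
    """Pool the last real token, apply a linear layer, then a sigmoid per label."""
    pooled = hidden_states[last_token_index(attention_mask)]
    logits = [sum(w * h for w, h in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    return [sigmoid(z) for z in logits]


def bce_loss(probs, targets, eps=1e-12):
    """Binary cross-entropy averaged over labels (multi-label setting)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(probs)


# Toy example: 4 tokens with 3-dimensional hidden states; the last token is padding.
hidden_states = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [1.0, 0.0, -1.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 2 labels x 3 hidden dimensions
bias = [0.0, 0.0]

probs = classify(hidden_states, attention_mask, weight, bias)
loss = bce_loss(probs, [1, 0])
print(probs, loss)
```

A sigmoid per label (instead of a softmax over labels) lets several labels be active for the same input, which is what a multi-label task needs.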


   



# XLNet

Link : https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf 



# RoBERTa 

Link : https://arxiv.org/abs/1907.11692








Optimized version of BERT with the following modifications:
  * Removing next sentence prediction (NSP) 
  * Training on more data and bigger batches
  * Training on longer sequences 
  * Dynamically changing the masking pattern applied to training data.

Data:
  * Increase data size to improve end-task performance
  * BookCorpus + Wikipedia : Original data used by BERT
  * CC-News : English portion of the CommonCrawl News dataset ("63 million articles crawled between September 2016 and February 2019 - 76 GB after filtering")
  * OPENWEBTEXT - "The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB)"
  * "STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB)".

Training procedure :
  * Static (BERT) vs dynamic (RoBERTa) masking
    * Static masking is applied once during data preprocessing, so the model sees the same mask for a sequence in every epoch. Dynamic masking generates a new masking pattern every time a sequence is fed into the model.
  * Model input format and next sentence prediction
    * DOC-SENTENCES : Inputs to the model are packed with full sentences which do not cross document boundaries. The NSP loss is removed.
  * Training with larger batch sizes
    * Batch size increased to 8k sequences, compared with 256 in BERT.
  * Text encoding :
    * Byte-level byte pair encoding (BPE) with a vocabulary of 50k subword units, compared with BERT's character-level BPE vocabulary of 30k
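
The difference between static and dynamic masking can be sketched in a few lines. This is a toy illustration, not RoBERTa's data pipeline: it only tracks which positions get selected in each epoch, and the helper names are hypothetical.

```python
import random


def choose_mask_positions(num_tokens, rng, select_prob=0.15):
    """Select each position independently with probability select_prob."""
    return [i for i in range(num_tokens) if rng.random() < select_prob]


def static_masks(num_tokens, epochs, seed=0):
    """Mask once during preprocessing and reuse the same positions every epoch."""
    positions = choose_mask_positions(num_tokens, random.Random(seed))
    return [positions for _ in range(epochs)]


def dynamic_masks(num_tokens, epochs, seed=0):
    """Draw a fresh masking pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [choose_mask_positions(num_tokens, rng) for _ in range(epochs)]


static = static_masks(200, epochs=4)
dynamic = dynamic_masks(200, epochs=4)
print("distinct static patterns :", len(set(map(tuple, static))))
print("distinct dynamic patterns:", len(set(map(tuple, dynamic))))
```

With dynamic masking the model is asked to predict different tokens of the same sequence across epochs, which becomes more useful as the number of training steps grows.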

# Visualization 

Link: https://towardsdatascience.com/visualize-bert-sequence-embeddings-an-unseen-way-1d6a351e4568


1. https://jalammar.github.io/explaining-transformers/
2. https://jalammar.github.io/hidden-states/
3. BertViz : https://github.com/jessevig/bertviz

BERT-base:
  * 12 encoder-layer stack for building contextualized embeddings.
  * About 110 million tunable parameters.
  * Because BERT produces contextual embeddings of its input at every layer, it is useful to visualize the layers to analyze the patterns learned on unseen data.


Why?
  * After training, visualize how well each layer separates the classes over epochs.

How?
  * BertForSequenceClassification consists of :
    * 1 BertEmbeddings layer -> 12 BertLayer blocks -> 1 BertPooler (dense layer + Tanh activation) -> Dropout layer -> linear classifier

## Compilation of results 

In [36]:
import json
import pandas as pd

In [37]:
def load_metrics(path):
    """Load one model's evaluation results and close the file afterwards."""
    with open(path, 'r') as f:
        return json.load(f)

# Evaluation results of each fine-tuned model
bert_metrics = load_metrics('/content/eval_results_BERT_0.5_.json')
roberta_metrics = load_metrics('/content/eval_results_RoBERTa_0.5_.json')
gpt2_metrics = load_metrics('/content/eval_results_gpt-2_0.5_.json')
xlnet_metrics = load_metrics('/content/eval_results_xlnet_0.5_.json')

### Micro_avg_scores

In [None]:
# One row per model, built from the 'micro avg' entry of each classification report
micro_avg_lms = pd.DataFrame(
    [
        bert_metrics['Classification_report']['micro avg'],
        roberta_metrics['Classification_report']['micro avg'],
        gpt2_metrics['Classification_report']['micro avg'],
        xlnet_metrics['Classification_report']['micro avg'],
    ],
    index=['bert-base-uncased', 'roberta-base', 'gpt2', 'xlnet-base-cased'],
)

In [None]:
micro_avg_lms.columns = pd.MultiIndex.from_product([micro_avg_lms.columns, ['micro avg']])

In [None]:
micro_avg_lms

| Model | precision (micro avg) | recall (micro avg) | f1-score (micro avg) | support (micro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.824537 | 0.822633 | 0.823584 | 4330 |
| roberta-base | 0.864859 | 0.869053 | 0.866951 | 4330 |
| gpt2 | 0.849421 | 0.457275 | 0.594505 | 4330 |
| xlnet-base-cased | 0.880033 | 0.736952 | 0.802162 | 4330 |


### Macro_avg_score

In [None]:
# One row per model, built from the 'macro avg' entry of each classification report
macro_avg_lms = pd.DataFrame(
    [
        bert_metrics['Classification_report']['macro avg'],
        roberta_metrics['Classification_report']['macro avg'],
        gpt2_metrics['Classification_report']['macro avg'],
        xlnet_metrics['Classification_report']['macro avg'],
    ],
    index=['bert-base-uncased', 'roberta-base', 'gpt2', 'xlnet-base-cased'],
)

In [None]:
macro_avg_lms.columns = pd.MultiIndex.from_product([macro_avg_lms.columns, ['macro avg']])

In [None]:
macro_avg_lms

| Model | precision (macro avg) | recall (macro avg) | f1-score (macro avg) | support (macro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.854382 | 0.842521 | 0.84751 | 4330 |
| roberta-base | 0.880783 | 0.884021 | 0.882198 | 4330 |
| gpt2 | 0.792463 | 0.49603 | 0.563569 | 4330 |
| xlnet-base-cased | 0.885105 | 0.77277 | 0.815378 | 4330 |
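
How micro and macro averages can diverge is easiest to see on a toy example. The sketch below assumes the usual definitions (micro: pool the counts over all labels, then compute the metric; macro: compute the metric per label, then take the unweighted mean). The counts and helper names are made up for illustration and are not taken from the evaluation files.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(counts):
    """Pool true/false positives and false negatives over all labels first."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return precision_recall_f1(tp, fp, fn)


def macro_average(counts):
    """Compute the metrics per label, then take the unweighted mean."""
    per_label = [precision_recall_f1(**c) for c in counts.values()]
    return tuple(sum(m[i] for m in per_label) / len(per_label) for i in range(3))


# Hypothetical counts: one frequent, well-predicted label and one rare, poorly predicted label
counts = {
    "frequent": {"tp": 90, "fp": 10, "fn": 10},
    "rare": {"tp": 1, "fp": 1, "fn": 9},
}
micro = micro_average(counts)
macro = macro_average(counts)
print("micro (P, R, F1):", micro)
print("macro (P, R, F1):", macro)
```

The micro average is dominated by the frequent label, while the macro average weights every label equally, so a weak rare label pulls it down much more.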


### Hamming_loss, subset_accuracy

In [None]:
bert = bert_metrics['hamming_loss'],bert_metrics['subset_accuracy'],bert_metrics['AUC_ROC_score']
roberta = roberta_metrics['hamming_loss'],roberta_metrics['subset_accuracy'],roberta_metrics['AUC_ROC_score']
gpt2 = gpt2_metrics['hamming_loss'],gpt2_metrics['subset_accuracy'],gpt2_metrics['AUC_ROC_score']
xlnet = xlnet_metrics['hamming_loss'],xlnet_metrics['subset_accuracy'],xlnet_metrics['AUC_ROC_score']

In [None]:
hs_metrics = pd.DataFrame([bert,roberta,gpt2,xlnet],index=['bert-base-uncased','roberta-base','gpt2','xlnet-base-cased'],columns=['hamming_loss','subset_accuracy','AUC_ROC_score'])

In [None]:
hs_metrics

| Model | hamming_loss | subset_accuracy | AUC_ROC_score |
|---|---|---|---|
| bert-base-uncased | 0.087832 | 0.666398 | 0.952434 |
| roberta-base | 0.070911 | 0.771555 | 0.966717 |
| gpt2 | 0.155462 | 0.293312 | 0.891429 |
| xlnet-base-cased | 0.090595 | 0.588235 | 0.954108 |
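
For reference, the two multi-label metrics in this table can be computed by hand as follows. This is a minimal sketch using the standard definitions (Hamming loss: fraction of individual label decisions that are wrong; subset accuracy: fraction of samples whose full label vector is predicted exactly). The toy label matrices are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong (lower is better)."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(list(true_row) == list(pred_row)
                for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# 4 samples x 3 labels; samples 2 and 3 each have one wrong label
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print("hamming loss   :", hamming_loss(y_true, y_pred))
print("subset accuracy:", subset_accuracy(y_true, y_pred))
```

Subset accuracy is the stricter of the two: a single wrong label makes the whole sample count as an error, which is why it can be low even when the Hamming loss is small.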


### per_class_precision_recall_fmeasure

In [247]:
Labels = ['Ethnicity','gender','profession','religion','Anti-stereotype','stereotype','unrelated']
# model = ['bert_base_uncased','roberta-base','gpt2','xlnet-base-cased']
metrics = [bert_metrics,roberta_metrics,gpt2_metrics,xlnet_metrics]
model_name = ['bert','roberta','gpt2','xlnet']
model = {}

In [248]:
model.clear()

In [272]:
for index,metric in enumerate(metrics):
  print(metric)

{'AUC_ROC_score': 0.9515636613746575, 'subset_accuracy': 0.6700241740531829, 'hamming_loss': 0.09048002762748936, 'hammingsScore/Accuracy': 0.7811912436207331, 'Classification_report': {'Ethnicity': {'precision': 0.9272503082614056, 'recall': 0.9591836734693877, 'f1-score': 0.9429467084639498, 'support': 784}, 'gender': {'precision': 0.8705035971223022, 'recall': 0.7960526315789473, 'f1-score': 0.8316151202749142, 'support': 304}, 'profession': {'precision': 0.8645833333333334, 'recall': 0.8886509635974305, 'f1-score': 0.8764519535374868, 'support': 467}, 'religion': {'precision': 0.9761904761904762, 'recall': 0.9795221843003413, 'f1-score': 0.9778534923339012, 'support': 293}, 'Anti-stereotype': {'precision': 0.6132075471698113, 'recall': 0.6683804627249358, 'f1-score': 0.6396063960639606, 'support': 778}, 'stereotype': {'precision': 0.7450787401574803, 'recall': 0.7074766355140187, 'f1-score': 0.725790987535954, 'support': 1070}, 'unrelated': {'precision': 0.9575551782682513, 'recall

In [250]:
for index,metric in enumerate(metrics):
  for label in Labels:
    model[model_name[index] + "_"+ label] = metric['Classification_report'][label]

In [269]:
df = pd.DataFrame(model)

In [274]:
df.iloc[:,14:21]

| | gpt2_Ethnicity | gpt2_gender | gpt2_profession | gpt2_religion | gpt2_Anti-stereotype | gpt2_stereotype | gpt2_unrelated |
|---|---|---|---|---|---|---|---|
| precision | 0.87464 | 0.852459 | 0.732468 | 0.924915 | 0.404762 | 0.89011 | 0.867886 |
| recall | 0.774235 | 0.171053 | 0.603854 | 0.924915 | 0.021851 | 0.302804 | 0.673502 |
| f1-score | 0.82138 | 0.284932 | 0.661972 | 0.924915 | 0.041463 | 0.451883 | 0.758437 |
| support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |


In [294]:
df_bert  = df.iloc[:,:7]
df_bert_metrics = df_bert.copy()
df_bert_metrics['Model_name'] = 'bert_base_uncased'
df_bert_metrics =df_bert_metrics.set_index(['Model_name',df_bert.index])
df_bert_metrics.columns = Labels

In [295]:
df_roberta = df.iloc[:,7:14]
df_roberta_metrics = df_roberta.copy()
df_roberta_metrics['Model_name'] = 'roberta-base'
df_roberta_metrics = df_roberta_metrics.set_index(['Model_name',df_roberta.index])
df_roberta_metrics.columns = Labels

In [296]:
df_GPT2 = df.iloc[:,14:21]
df_GPT2_metrics = df_GPT2.copy()
df_GPT2_metrics['Model_name'] = 'gpt2'
df_GPT2_metrics = df_GPT2_metrics.set_index(['Model_name',df_GPT2.index])
df_GPT2_metrics.columns = Labels

In [297]:
df_XLNet = df.iloc[:,21:]
df_XLNet_metrics = df_XLNet.copy()
df_XLNet_metrics['Model_name'] = 'xlnet-base-cased'
df_XLNet_metrics = df_XLNet_metrics.set_index(['Model_name',df_XLNet.index])
df_XLNet_metrics.columns = Labels

In [301]:
df_per_label = pd.concat([df_bert_metrics,df_roberta_metrics,df_XLNet_metrics,df_GPT2_metrics])

In [302]:
df_per_label

| Model_name | Metric | Ethnicity | gender | profession | religion | Anti-stereotype | stereotype | unrelated |
|---|---|---|---|---|---|---|---|---|
| bert_base_uncased | precision | 0.92725 | 0.870504 | 0.864583 | 0.97619 | 0.613208 | 0.745079 | 0.957555 |
| bert_base_uncased | recall | 0.959184 | 0.796053 | 0.888651 | 0.979522 | 0.66838 | 0.707477 | 0.88959 |
| bert_base_uncased | f1-score | 0.942947 | 0.831615 | 0.876452 | 0.977853 | 0.639606 | 0.725791 | 0.922322 |
| bert_base_uncased | support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |
| roberta-base | precision | 0.94403 | 0.833333 | 0.896186 | 0.983051 | 0.726753 | 0.773276 | 0.965 |
| roberta-base | recall | 0.968112 | 0.888158 | 0.905782 | 0.989761 | 0.652956 | 0.838318 | 0.913249 |
| roberta-base | f1-score | 0.955919 | 0.859873 | 0.900958 | 0.986395 | 0.687881 | 0.804484 | 0.938412 |
| roberta-base | support | 784.0 | 304.0 | 467.0 | 293.0 | 778.0 | 1070.0 | 634.0 |
| xlnet-base-cased | precision | 0.910843 | 0.850909 | 0.870536 | 0.986207 | 0.789916 | 0.804054 | 0.983271 |
| xlnet-base-cased | recall | 0.964286 | 0.769737 | 0.835118 | 0.976109 | 0.362468 | 0.66729 | 0.834385 |
