<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Language_model_and_visualization_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Link : https://arxiv.org/pdf/1810.04805.pdf

BERT - **B**idirectional **E**ncoder **R**epresentations from **T**ransformers

Architecture :
  * Multi-layer, **bidirectional**, encoder-based transformer
  * BERT-base: 12 encoder layers, hidden size 768, 12 self-attention heads

Framework :

  * BERT-tokenizer :
    * **WordPiece** embeddings 
    * Vocab size : 30,000
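    * First token is always `[CLS]`; its final hidden state (the output of the 12th encoder layer) serves as the aggregate sequence representation
    * Two sequences are separated by `[SEP]`
    * The end of the input is also marked with `[SEP]`; BERT has no `[EOS]` token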

    * Input : Input sentence
    * Output : Token embedding + segment embedding + position embedding, summed to form the model's input representation
    * "BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token." [RoBERTa paper]
  
  * Pre-training
    * Task1 : Masked language model (MLM)
      * To learn deep bidirectional representations, mask 15% of all WordPiece tokens in each sequence at random and predict the masked tokens.
    * Task2 : Next sentence prediction (NSP)
      * Predict whether the second sentence actually follows the first; this teaches the model relationships between two sentences, which matter for tasks such as Q&A and NLI.
    * Data
      * BookCorpus (800M words)
      * English Wikipedia (2,500M words)
  * Fine-tuning
    * Plug in a task-specific output layer and fine-tune all parameters end to end.
    * For classification, the final hidden state of the first token (`[CLS]`) is fed into the output layer.
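
The 80/10/10 replacement rule quoted above can be sketched in a few lines. This is a minimal illustration and not BERT's actual preprocessing code: the toy vocabulary, the `mask_tokens` helper, and the per-token coin flip (instead of selecting an exact 15% of positions) are all assumptions made for brevity.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # special tokens are never selected for masking


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed on selected positions.
    Assumption: each token is selected independently with select_prob.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: leave the token unchanged
    return masked, labels


# Hypothetical toy vocabulary and sentence pair
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
print(masked)
print(labels)
```

Positions where `labels` is `None` are passed through untouched, which is why the model must keep a useful representation for every token rather than only for `[MASK]`.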

# GPT-2

Links :

1. http://www.persagen.com/files/misc/radford2019language.pdf
2. https://youtu.be/Ck9-0YkJD_Q?t=936
3. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf 

Model specification :
  * 12-layer decoder-only transformer
  * Masked self-attention heads (768-dimensional states and 12 attention heads)

Basic Blocks of GPT architecture [2,3] :
  1. GPT2 Tokenizer 
    * **Input** : Sentence 
    * **Output** : input_ids, attention_mask, labels 
    * **Vocab size** : 50,257 (about 50k)
    * **Encoding** : Byte pair encoding (BPE) - "Middle ground between character level encoding and word level encoding" [1]
  2. GPT-2 embedding block [2] :
      * Consists of 
        * Word embedding layer 
        * Position embedding layer
      * **Input** : (input_ids, attention_mask, labels)- size (1,8) [*sentence, tokens*]
      * **Output** : size (1,8,768) - [*sentence, tokens, embeddings per token*]
  3. GPT-2 Decoder Block (x12):
    * Consists of 
      * Attention block [2]:
        * Consists of 
          * Masked self-attention mechanism (generates the query, key, and value vectors of the transformer)
      * Multi-layer perceptron block / feed-forward layer (MLP block)
          * Consists of 
            * Layer norm, convolution, activation function, dropout
      * Layer normalization
  4. LM head layer 
    * Consists of a linear layer that projects the hidden states to the vocabulary
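
The masked self-attention inside each decoder block can be illustrated with a small pure-Python sketch. It is a simplified single-head version under stated assumptions: the toy vectors are used directly as queries, keys, and values, whereas the real model derives them from learned projections and runs 12 heads over 768-dimensional states.

```python
import math


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors. Position t may only attend to
    positions 0..t, which is what makes the model left-to-right.
    Returns (outputs, weights).
    """
    d = len(q[0])
    n = len(q)
    outputs, weights = [], []
    for t in range(n):
        # Scores are computed only for positions <= t; later ones are masked.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        top = max(scores)
        exps = [math.exp(x - top) for x in scores]  # numerically stable softmax
        total = sum(exps)
        w = [e / total for e in exps] + [0.0] * (n - t - 1)
        weights.append(w)
        outputs.append(
            [sum(w[s] * v[s][j] for s in range(n)) for j in range(len(v[0]))]
        )
    return outputs, weights


# Hypothetical toy input: 4 tokens with 2-dimensional states
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, w = causal_self_attention(x, x, x)
for row in w:
    print([round(p, 3) for p in row])
```

The printed weight matrix is lower triangular: the first token can only attend to itself, and the last token attends to the whole sequence.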

Training data :
  * WebText (web pages curated and filtered by humans; 45 million links)
    * Starting point: outbound links scraped from Reddit that received at least 3 karma

GPT2 for text classification :
  * Remove the LM head and attach a classification layer whose output dimension equals the number of labels.
  * Take the output embedding of the last token in the sequence, because in a left-to-right LM it carries context from every preceding token in the input sequence.
    * transformer_output[0] (https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2ForSequenceClassification) 
    * Sequence classification architecture:
      * Pooled output (last output as mentioned above)
      * Dropout(0.1)
      * Linear classifier layer with size (input_dim, num_class_labels)
      * Sigmoid Layer + BCE loss function 
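
The classification head described above can be sketched without any deep-learning library. This is a rough illustration, not the Hugging Face implementation: it assumes right padding, locates the last real token through the attention mask, omits dropout (as at inference time), and uses hypothetical toy weights.

```python
import math


def last_token_index(attention_mask):
    """Index of the last non-padding token (assumes right padding)."""
    return sum(attention_mask) - 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(hidden_states, attention_mask, weight, bias):
    """Pool the last real token's hidden state and apply a linear layer."""
    h = hidden_states[last_token_index(attention_mask)]
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
            for row, b in zip(weight, bias)]


def bce_loss(logits, targets):
    """Mean binary cross-entropy over the labels of one sample."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)


# Hypothetical toy values: 4 positions (the last one is padding), hidden size 2
hidden_states = [[0.1, 0.2], [0.3, -0.1], [1.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
weight = [[1.0, 0.0], [0.0, -1.0]]  # 2 labels x 2 hidden dimensions
bias = [0.0, 0.0]

logits = classify(hidden_states, attention_mask, weight, bias)
probs = [sigmoid(z) for z in logits]
loss = bce_loss(logits, [1, 0])
print(logits, probs, loss)
```

Using an independent sigmoid per label rather than a softmax lets several labels be active for the same sentence, which is what a multi-label setup needs.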


   



# XL-Net

Link : https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf 



# RoBERTa 

Link : https://arxiv.org/abs/1907.11692








An optimized version of BERT with the following modifications:
  * Removing next sentence prediction (NSP) 
  * Training on more data and bigger batches
  * Training on longer sequences 
  * Dynamically changing the masking pattern applied to training data.

Data:
  * Increase data size to improve end-task performance 
  * BookCorpus + Wikipedia : Original data used by BERT 
  * CC-News : English portion of the CommonCrawl News dataset ("63 million articles crawled between September 2016 and February 2019 - 76 GB after filtering")
  * OPENWEBTEXT - "The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB)"
  * "STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB)".

Training procedure :
  * Static (BERT) vs dynamic (RoBERTa) masking
    * Static masking is applied once during data preprocessing, so the same masked positions recur across epochs. Dynamic masking generates a new masking pattern every time a sequence is fed into the model.
  * Model input format and next sentence prediction 
    * DOC-SENTENCES : Inputs to the model are packed with full sentences that do not cross document boundaries. The NSP loss is removed.
  * Training with larger batch sizes 
    * Batch size increased to 8K sequences, compared with 256 in BERT.
  * Text encoding :
    * Byte-level byte pair encoding (BPE), which relies on subword units extracted from the training corpus (about 50K units), compared with BERT's character-level BPE vocabulary of 30K.
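
The difference between static and dynamic masking comes down to when the masked positions are drawn. The sketch below is a simplified illustration: the helper names are hypothetical, only the choice of positions is modelled (not the token replacement itself), and the static variant ignores any duplication of the training data with different masks.

```python
import random


def choose_mask_positions(seq_len, rng, mask_prob=0.15):
    """Pick the positions to mask in one sequence (at least one)."""
    n = max(1, round(seq_len * mask_prob))
    return sorted(rng.sample(range(seq_len), n))


def static_masking(seq_len, epochs, seed=0):
    """Mask once during preprocessing and reuse the pattern every epoch."""
    pattern = choose_mask_positions(seq_len, random.Random(seed))
    return [pattern for _ in range(epochs)]


def dynamic_masking(seq_len, epochs, seed=0):
    """Draw a fresh pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [choose_mask_positions(seq_len, rng) for _ in range(epochs)]


static = static_masking(seq_len=40, epochs=5)
dynamic = dynamic_masking(seq_len=40, epochs=5)
print("static :", static)
print("dynamic:", dynamic)
```

With static masking the model is asked to predict the same positions every epoch, while dynamic masking exposes different positions of the same sequence over the course of training.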

# Visualization 

Link: https://towardsdatascience.com/visualize-bert-sequence-embeddings-an-unseen-way-1d6a351e4568


1. https://jalammar.github.io/explaining-transformers/
2. https://jalammar.github.io/hidden-states/
3. BertViz : https://github.com/jessevig/bertviz

BERT-base:
  * 12-layer encoder stack for building contextualized embeddings.
  * About 110 million tunable parameters.
  * Because every BERT layer produces embeddings for the input, it is useful to visualize the layers to analyze the patterns learned on unseen data.


Why?
  * After training, visualize how well each layer separates the classes over epochs.

How?
  * BertForSequenceClassification consists of :
    * 1 BertEmbeddings layer -> 12 BertLayer -> 1 BertPooler -> Tanh activation -> Dropout layer -> Linear classifier

## Compilation of results 

In [107]:
import json
import pandas as pd

# Open the evaluation results (decision threshold 0.5) of each fine-tuned model.
bert = open('/content/eval_results_BERT_0.5_.json','r')
roberta = open('/content/eval_results_RoBERTa_0.5_.json','r')
gpt2 = open('/content/eval_results_gpt-2_0.5_.json','r')
xlnet = open('/content/eval_results_xlnet_0.5_.json','r')

In [108]:
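# Parse each results file into a dict, then close the file handles.
bert_metrics = json.load(bert)
roberta_metrics = json.load(roberta)
gpt2_metrics = json.load(gpt2)
xlnet_metrics = json.load(xlnet)
for handle in (bert, roberta, gpt2, xlnet):
    handle.close()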

### Micro_avg_scores

In [84]:
# Micro-averaged precision, recall, F1 and support, one row per model.
micro_avg_lms = pd.DataFrame(
    [
        bert_metrics['Classification_report']['micro avg'],
        roberta_metrics['Classification_report']['micro avg'],
        gpt2_metrics['Classification_report']['micro avg'],
        xlnet_metrics['Classification_report']['micro avg'],
    ],
    index=['bert-base-uncased', 'roberta-base', 'gpt2', 'xlnet-base-cased'],
)

In [74]:
micro_avg_lms.columns = pd.MultiIndex.from_product([micro_avg_lms.columns, ['micro avg']])

In [75]:
micro_avg_lms

| model | precision (micro avg) | recall (micro avg) | f1-score (micro avg) | support (micro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.824537 | 0.822633 | 0.823584 | 4330 |
| roberta-base | 0.864859 | 0.869053 | 0.866951 | 4330 |
| gpt2 | 0.849421 | 0.457275 | 0.594505 | 4330 |
| xlnet-base-cased | 0.880033 | 0.736952 | 0.802162 | 4330 |
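
As a quick sanity check, the micro-averaged F1 is the harmonic mean of the micro-averaged precision and recall, so the f1-score column can be recomputed from the other two. The numbers below are copied from the output above; the `f1` helper is only an illustration.

```python
# (precision, recall, reported f1-score) copied from the micro-average output above
micro = {
    "bert-base-uncased": (0.824537, 0.822633, 0.823584),
    "roberta-base": (0.864859, 0.869053, 0.866951),
    "gpt2": (0.849421, 0.457275, 0.594505),
    "xlnet-base-cased": (0.880033, 0.736952, 0.802162),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1(p, r) for name, (p, r, _) in micro.items()}
for name, (p, r, reported) in micro.items():
    print(f"{name:18s} recomputed={recomputed[name]:.6f} reported={reported:.6f}")
```

The gpt2 row shows how the harmonic mean punishes imbalance: a precision of about 0.85 cannot compensate for a recall of about 0.46.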


### Macro_avg_score

In [85]:
# Macro-averaged precision, recall, F1 and support, one row per model.
macro_avg_lms = pd.DataFrame(
    [
        bert_metrics['Classification_report']['macro avg'],
        roberta_metrics['Classification_report']['macro avg'],
        gpt2_metrics['Classification_report']['macro avg'],
        xlnet_metrics['Classification_report']['macro avg'],
    ],
    index=['bert-base-uncased', 'roberta-base', 'gpt2', 'xlnet-base-cased'],
)

In [79]:
macro_avg_lms.columns = pd.MultiIndex.from_product([macro_avg_lms.columns, ['macro avg']])

In [80]:
macro_avg_lms

| model | precision (macro avg) | recall (macro avg) | f1-score (macro avg) | support (macro avg) |
|---|---|---|---|---|
| bert-base-uncased | 0.854382 | 0.842521 | 0.84751 | 4330 |
| roberta-base | 0.880783 | 0.884021 | 0.882198 | 4330 |
| gpt2 | 0.792463 | 0.49603 | 0.563569 | 4330 |
| xlnet-base-cased | 0.885105 | 0.77277 | 0.815378 | 4330 |


### Hamming_loss, subset_accuracy

In [109]:
bert = bert_metrics['hamming_loss'],bert_metrics['subset_accuracy'],bert_metrics['AUC_ROC_score']
roberta = roberta_metrics['hamming_loss'],roberta_metrics['subset_accuracy'],roberta_metrics['AUC_ROC_score']
gpt2 = gpt2_metrics['hamming_loss'],gpt2_metrics['subset_accuracy'],gpt2_metrics['AUC_ROC_score']
xlnet = xlnet_metrics['hamming_loss'],xlnet_metrics['subset_accuracy'],xlnet_metrics['AUC_ROC_score']

In [111]:
hs_metrics = pd.DataFrame(
    [bert, roberta, gpt2, xlnet],
    index=['bert-base-uncased', 'roberta-base', 'gpt2', 'xlnet-base-cased'],
    columns=['hamming_loss', 'subset_accuracy', 'AUC_ROC_score'],
)

In [112]:
hs_metrics

| model | hamming_loss | subset_accuracy | AUC_ROC_score |
|---|---|---|---|
| bert-base-uncased | 0.087832 | 0.666398 | 0.952434 |
| roberta-base | 0.070911 | 0.771555 | 0.966717 |
| gpt2 | 0.155462 | 0.293312 | 0.891429 |
| xlnet-base-cased | 0.090595 | 0.588235 | 0.954108 |
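
Hamming loss and subset accuracy measure multi-label errors at different granularity, which is why a model can have a low Hamming loss and still a modest subset accuracy. The sketch below implements both definitions in plain Python on hypothetical toy labels (4 samples, 3 labels); it is not the evaluation code that produced the numbers above.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(row_true == row_pred for row_true, row_pred in zip(y_true, y_pred))
    return exact / len(y_true)


# Hypothetical toy data
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

hl = hamming_loss(y_true, y_pred)
sa = subset_accuracy(y_true, y_pred)
print(f"hamming_loss={hl:.4f} subset_accuracy={sa:.2f}")
```

Here only 2 of 12 label slots are wrong, yet half of the samples fail the exact-match test, because a single wrong label is enough to make the whole sample count as an error.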


### per_class_precision_recall_fmeasure

In [114]:
Labels = ['Ethnicity','gender','profession','religion','Anti-stereotype','stereotype']
model = ['bert-base-uncased','roberta-base','gpt2','xlnet-base-cased']
metrics = ['bert_metrics','roberta_metrics','gpt2_metrics','xlnet_metrics']

In [106]:
# Per-class precision, recall, F1 and support for BERT, in label order.
classes = ['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype', 'stereotype', 'unrelated']
bert = tuple(bert_metrics['Classification_report'][c] for c in classes)
bert

({'precision': 0.9317617866004962, 'recall': 0.9579081632653061, 'f1-score': 0.9446540880503144, 'support': 784}, {'precision': 0.8661971830985915, 'recall': 0.8092105263157895, 'f1-score': 0.8367346938775511, 'support': 304}, {'precision': 0.8682008368200836, 'recall': 0.8886509635974305, 'f1-score': 0.8783068783068781, 'support': 467}, {'precision': 0.9795918367346939, 'recall': 0.9829351535836177, 'f1-score': 0.9812606473594548, 'support': 293}, {'precision': 0.6590584878744651, 'recall': 0.5938303341902313, 'f1-score': 0.6247464503042597, 'support': 778}, {'precision': 0.7173174872665535, 'recall': 0.7897196261682243, 'f1-score': 0.7517793594306049, 'support': 1070}, {'precision': 0.9585492227979274, 'recall': 0.8753943217665615, 'f1-score': 0.9150865622423742, 'support': 634})

### Macro_average_per_class

In [63]:
import pandas as pd

bert = pd.read_json('/content/eval_results_BERT_0.5_.json')
roberta = pd.read_json('/content/eval_results_RoBERTa_0.5_.json')
gpt2 = pd.read_json('/content/eval_results_gpt-2_0.5_.json')
xlnet = pd.read_json('/content/eval_results_xlnet_0.5_.json')

| | AUC_ROC_score | subset_accuracy | hamming_loss | Classification_report | confusion_matrix_Ethnicity | confusion_matrix_gender | confusion_matrix_profession | confusion_matrix_religion | confusion_matrix_Anti-stereotype | confusion_matrix_stereotype | confusion_matrix_unrelated | y_pred | y_labels | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anti-stereotype | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.6590584878744651, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| Ethnicity | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.931761786600496, 'recall': 0.9...` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| gender | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8661971830985911, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| macro avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8543824058846871, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| micro avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8245370370370371, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| profession | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8682008368200831, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| religion | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.9795918367346931, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| samples avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8385710448562981, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| stereotype | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.7173174872665531, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| unrelated | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.9585492227979271, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
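
The per-class confusion matrices in the BERT results above are enough to recompute the micro- and macro-averaged scores and the Hamming loss reported earlier, which makes the difference between the two averages concrete. The sketch assumes each flattened matrix is ordered `[tn, fp, fn, tp]`; that order is not stated in the notebook, but it reproduces the reported per-class precision and recall.

```python
# Flattened per-class confusion matrices for BERT, copied from the output above.
# Assumed order of each tuple: (tn, fp, fn, tp).
cm = {
    "Ethnicity": (1643, 55, 33, 751),
    "gender": (2140, 38, 58, 246),
    "profession": (1952, 63, 52, 415),
    "religion": (2183, 6, 5, 288),
    "Anti-stereotype": (1465, 239, 316, 462),
    "stereotype": (1079, 333, 225, 845),
    "unrelated": (1824, 24, 79, 555),
}

# Micro average: pool the counts of all classes, then compute one score.
tp = sum(c[3] for c in cm.values())
fp = sum(c[1] for c in cm.values())
fn = sum(c[2] for c in cm.values())
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

# Macro average: compute one score per class, then take the unweighted mean.
per_class_precision = {k: c[3] / (c[3] + c[1]) for k, c in cm.items()}
per_class_recall = {k: c[3] / (c[3] + c[2]) for k, c in cm.items()}
macro_precision = sum(per_class_precision.values()) / len(cm)
macro_recall = sum(per_class_recall.values()) / len(cm)

# Hamming loss: wrong label slots (fp + fn) over all label slots.
n_samples = sum(cm["Ethnicity"])
hamming = (fp + fn) / (n_samples * len(cm))

print(f"micro P={micro_precision:.6f} R={micro_recall:.6f}")
print(f"macro P={macro_precision:.6f} R={macro_recall:.6f}")
print(f"hamming_loss={hamming:.6f}")
```

Because the macro average weights every class equally, the strong scores on small classes such as religion lift it above the micro average, which is dominated by the larger and harder stereotype and Anti-stereotype classes.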
