
# BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Link : https://arxiv.org/pdf/1810.04805.pdf

BERT - **B**idirectional **E**ncoder **R**epresentations from **T**ransformers

Architecture :
  * Multi-layer, **bidirectional**, encoder-based Transformer
  * BERT-base - 12 encoder layers, hidden size 768, 12 self-attention heads

Framework :

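  * BERT tokenizer :
    * **WordPiece** embeddings
    * Vocab size : 30,000
    * The first token of every sequence is `[CLS]`; its final hidden state (the output of the 12th encoder layer) serves as the aggregate sequence representation for classification
    * Two sequences are separated by `[SEP]`

    * Input : Input sentence (or sentence pair)
    * Output : Token embedding + segment embedding + position embedding, summed per token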
  
  * Pre-training
    * Task1 : Masked language model (MLM)
      * For a deep bidirectional representation, mask 15% of all WordPiece tokens in each sequence at random and predict the masked tokens.
    * Task2 : Next sentence prediction (NSP)
      * Trains the model to understand the relationship between two sentences, which matters for tasks such as Q&A and NLI.
    * Data
      * BooksCorpus (800M words)
      * English Wikipedia (2,500M words)
  * Fine-tuning
    * Plug in a task-specific output layer and fine-tune all parameters end to end.
    * At the output, the representation of the first token, i.e. `[CLS]`, is fed into the output layer for classification.
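
The input packing and the MLM masking step described above can be sketched in plain Python. This is a minimal illustration, not the original pre-training code: the helper names `pack_pair` and `mask_tokens` and the toy vocabulary are hypothetical, and the 80/10/10 replacement rule (`[MASK]` / random token / unchanged) is taken from the BERT paper rather than from these notes.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def pack_pair(tokens_a, tokens_b):
    """Build `[CLS] A [SEP] B [SEP]` with segment and position ids (hypothetical helper)."""
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Pick ~15% of the non-special tokens as MLM prediction targets."""
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_mask = max(1, round(len(candidates) * mask_prob))
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not predicted
    for i in rng.sample(candidates, n_mask):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

toy_vocab = ["my", "dog", "he", "barks", "cat"]  # toy vocabulary (assumption)
tokens, segment_ids, position_ids = pack_pair(["my", "dog"], ["he", "barks"])
masked, labels = mask_tokens(tokens, toy_vocab, random.Random(0))
print(tokens, segment_ids, position_ids)
print(masked, labels)
```

Only the positions with a non-`None` label contribute to the MLM loss; all other positions are left untouched.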

# GPT-2

Links :

1. http://www.persagen.com/files/misc/radford2019language.pdf
2. https://youtu.be/Ck9-0YkJD_Q?t=936
3. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf 

Model specification :
  * 12-layer decoder-only Transformer
  * Masked self-attention heads (768-dimensional states and 12 attention heads)
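
To make "masked self-attention" concrete, here is a small single-head sketch in plain Python. It assumes the standard scaled dot-product formulation with a causal (left-to-right) mask; the function name `causal_self_attention` and the toy 2-dimensional vectors are illustrative, and a real GPT-2 layer adds learned projections, 12 heads and 768-dimensional states.

```python
import math

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention where position i only sees positions <= i."""
    d = len(q[0])
    n = len(q)
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Scores are computed only for the current and earlier positions (causal mask).
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        w = [e / total for e in exps] + [0.0] * (n - i - 1)  # future positions get weight 0
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))])
    return outputs, weights

# Toy example: 3 tokens with 2-dimensional query/key/value vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
for row in attn:
    print([round(a, 3) for a in row])
```

Each row of `attn` sums to 1 and is zero above the diagonal, which is what lets the model be trained as a left-to-right language model.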

Basic blocks of the GPT architecture [2,3] :
  1. GPT-2 tokenizer
      * **Input** : Sentence
      * **Output** : input_ids, attention_mask, labels
      * **Vocab size** : 50,257 (~50k)
      * **Encoding** : Byte pair encoding (BPE) - "Middle ground between character level encoding and word level encoding" [1]
  2. GPT-2 embedding block [2] :
      * Consists of
        * Word embedding layer
        * Position embedding layer
      * **Input** : (input_ids, attention_mask, labels) - size (1,8) [*sentence, tokens*]
      * **Output** : size (1,8,768) - [*sentence, tokens, embeddings per token*]
  3. GPT-2 decoder block (x12) :
      * Consists of
        * Attention block [2] :
          * Self-attention mechanism (generating the query, key and value vectors of the transformer, etc.)
        * Multi-layer perceptron block / feed-forward layer (MLP block)
          * Consists of layer norm, convolution, activation function, dropout
        * Layer normalization
  4. LM head layer
      * Consists of a linear layer that projects the hidden states to the vocabulary
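
The shape flow through the embedding block and the LM head can be traced with a toy sketch. The sizes below (vocabulary 10, 8 positions, 4 dimensions) are assumptions chosen for readability in place of the real ~50k / 768, the 12 decoder blocks in between are omitted, and reusing the word-embedding matrix for the LM head (weight tying) reflects how the Hugging Face GPT-2 implementation sets it up rather than anything stated in these notes.

```python
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def embed(input_ids, wte, wpe):
    """Embedding block: word embedding + position embedding for every token."""
    return [[t + p for t, p in zip(wte[tok], wpe[pos])]
            for pos, tok in enumerate(input_ids)]

def lm_head(hidden, wte):
    """LM head: project every hidden vector onto the vocabulary (one logit per vocab entry)."""
    return [[sum(h * w for h, w in zip(vec, row)) for row in wte] for vec in hidden]

rng = random.Random(0)
vocab_size, n_positions, d_model = 10, 8, 4   # toy sizes (assumption)
wte = make_matrix(vocab_size, d_model, rng)   # word embedding table
wpe = make_matrix(n_positions, d_model, rng)  # position embedding table

input_ids = [3, 1, 4, 1, 5, 9, 2, 6]          # one sentence, 8 tokens -> size (1, 8)
hidden = embed(input_ids, wte, wpe)           # (8, d_model), i.e. (1, 8, 768) in GPT-2
logits = lm_head(hidden, wte)                 # (8, vocab_size)
print(len(hidden), len(hidden[0]), len(logits), len(logits[0]))
```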

Training data :
  * WebText (web pages curated and filtered by humans, about 45 million links)
    * Starting point: outbound links scraped from Reddit that received at least 3 karma

GPT-2 for text classification :
  * Remove the LM head and attach a classification layer whose output dimension equals the number of labels.
  * Take the output embedding of the last token in the sequence: in a left-to-right LM it is the only position that carries context from the whole input sequence.
    * transformer_output[0] (https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2ForSequenceClassification)
    * Sequence classification architecture:
      * Pooled output (last output as mentioned above)
      * Dropout(0.1)
      * Linear classifier layer with size (input_dim, number of class labels)
      * Sigmoid layer + BCE loss function
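
The classification head listed above can be sketched as follows. This is a simplified, inference-time illustration in plain Python (dropout is skipped): the helper names, the toy hidden states and the identity weight matrix are hypothetical, and locating the last real token via the attention mask assumes right-padded inputs.

```python
import math

def last_token_index(attention_mask):
    """Index of the last non-padding token, assuming right padding."""
    return sum(attention_mask) - 1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(hidden_states, attention_mask, weight, bias):
    """Pool the last token's hidden state, apply a linear layer, then a per-label sigmoid."""
    pooled = hidden_states[last_token_index(attention_mask)]
    logits = [sum(w * h for w, h in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    return [sigmoid(z) for z in logits]

def bce_loss(probs, targets, eps=1e-12):
    """Binary cross-entropy averaged over labels (multi-label setting)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(probs)

# Toy input: 3 positions, the last one is padding; 2 labels.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0]]  # (num_labels, input_dim), illustrative values
bias = [0.0, 0.0]

probs = classify(hidden_states, attention_mask, weight, bias)
loss = bce_loss(probs, targets=[0, 1])
print(probs, loss)
```

Because each label gets its own sigmoid, several labels can be active for the same input, which is why BCE is used instead of a softmax with cross-entropy.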


   



# XLNet

Link : https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf 



# RoBERTa 

Link : https://arxiv.org/abs/1907.11692




# Visualization 

Link: https://towardsdatascience.com/visualize-bert-sequence-embeddings-an-unseen-way-1d6a351e4568


1. https://jalammar.github.io/explaining-transformers/
2. https://jalammar.github.io/hidden-states/
3. BertViz : https://github.com/jessevig/bertviz

Bert-base:
  * 12-layer encoder stack for building contextualized embeddings.
  * About 110 million tunable parameters.
  * Since each BERT layer exposes its embeddings of the input, it is useful to visualize the layers to analyze the patterns learned on unseen data.
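
The parameter count can be checked with a little arithmetic. The sketch below assumes the configuration of the released `bert-base-uncased` checkpoint (vocabulary of 30,522 WordPiece tokens, 512 positions, feed-forward size 3072), which is not spelled out in these notes, and counts only the encoder plus pooler, without any task-specific head.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count the weights and biases of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden  # scale + shift
    # Token, position and segment embedding tables followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases) plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

total = bert_param_count()
print(f"{total:,}")  # 109,482,240, i.e. roughly 110 million
```

The number of attention heads does not appear in the formula: splitting the 768-dimensional projections into 12 heads changes how they are used, not how many weights they hold.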


Why?
  * After training, visualize how well each layer separates the classes over the epochs.

How?
  * BertForSequenceClassification consists of :
    * 1 BertEmbeddings layer -> 12 BertLayer blocks -> 1 BertPooler (dense layer + Tanh activation) -> Dropout layer -> linear classifier

## Compilation of results 

In [2]:
import pandas as pd

# Load the evaluation results of each model (classification threshold 0.5).
bert = pd.read_json('/content/eval_results_BERT_0.5_.json')
roberta = pd.read_json('/content/eval_results_RoBERTa_0.5_.json')
gpt2 = pd.read_json('/content/eval_results_gpt-2_0.5_.json')
xlnet = pd.read_json('/content/eval_results_xlnet_0.5_.json')

In [5]:
bert

| | AUC_ROC_score | subset_accuracy | hamming_loss | Classification_report | confusion_matrix_Ethnicity | confusion_matrix_gender | confusion_matrix_profession | confusion_matrix_religion | confusion_matrix_Anti-stereotype | confusion_matrix_stereotype | confusion_matrix_unrelated | y_pred | y_labels | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anti-stereotype | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.6590584878744651, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| Ethnicity | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.931761786600496, 'recall': 0.9...` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| gender | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8661971830985911, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| macro avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8543824058846871, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| micro avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8245370370370371, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| profession | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8682008368200831, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| religion | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.9795918367346931, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| samples avg | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.8385710448562981, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| stereotype | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.7173174872665531, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
| unrelated | 0.952434 | 0.666398 | 0.087832 | `{'precision': 0.9585492227979271, 'recall': 0....` | [1643 55 33 751] | [2140 38 58 246] | [1952 63 52 415] | [2183 6 5 288] | [1465 239 316 462] | [1079 333 225 845] | [1824 24 79 555] | `[[0 0 0 ... 0 0 1]\n [1 0 0 ... 1 0 0]\n [0 0 ...` | `[[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0....` | 0.5 |
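
The summary metrics in this output can be recomputed from the confusion-matrix columns alone. The sketch below assumes each flattened matrix is ordered `[TN, FP, FN, TP]`; that ordering is an inference, but it reproduces the precision values shown in `Classification_report` as well as the reported `hamming_loss`.

```python
# Flattened per-label confusion matrices copied from the BERT results above,
# read as (TN, FP, FN, TP); the ordering is an assumption checked against the report.
confusion = {
    "Ethnicity": (1643, 55, 33, 751),
    "gender": (2140, 38, 58, 246),
    "profession": (1952, 63, 52, 415),
    "religion": (2183, 6, 5, 288),
    "Anti-stereotype": (1465, 239, 316, 462),
    "stereotype": (1079, 333, 225, 845),
    "unrelated": (1824, 24, 79, 555),
}

# Per-label precision: TP / (TP + FP).
precision = {label: tp / (tp + fp) for label, (tn, fp, fn, tp) in confusion.items()}

# Macro average: unweighted mean of the per-label precisions.
macro_precision = sum(precision.values()) / len(precision)

# Micro average: pool TP and FP over all labels before dividing.
total_tp = sum(tp for _, _, _, tp in confusion.values())
total_fp = sum(fp for _, fp, _, _ in confusion.values())
micro_precision = total_tp / (total_tp + total_fp)

# Hamming loss: fraction of wrong label decisions over all samples and labels.
n_samples = sum(confusion["Ethnicity"])  # every matrix sums to the same sample count
wrong = sum(fp + fn for _, fp, fn, _ in confusion.values())
hamming_loss = wrong / (n_samples * len(confusion))

print(round(precision["Ethnicity"], 6))  # 0.931762
print(round(macro_precision, 6))         # 0.854382
print(round(micro_precision, 6))         # 0.824537
print(round(hamming_loss, 6))            # 0.087832
```

The gap between the per-label precisions (for example religion versus Anti-stereotype) is hidden by the single AUC and Hamming-loss figures, so the per-label breakdown is the more informative view when comparing models.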
