<a href="https://colab.research.google.com/github/mvdheram/Social-bias-Detection/blob/main/Transformer_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN seq2seq model with Attention

Resources : 


*  Seq2seq with attention : https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/



## RNN sequence-to-sequence models  :

Why and What RNN?

* Useful when dealing with sequential data [current data depends on previous data].
* A regular NN with a loop (recurrent unit; the hidden state is passed on at every time step). Unrolling an RNN over time yields a very deep feed-forward neural network.

RNN-Seq2Seq :

  *   Takes a sequence of items (words, letters, features of an image, etc.) as input and outputs a sequence of items for the task at hand (machine translation, text summarization, image captioning, etc.).

* Elements :

    Inputs:
      1. Hidden state - Previous information or context from previous inputs 
      2. Input Vector - Element in the sequence

    Output: 
      3. Output Vector - Output at the current time step (intermediate at earlier steps, final at the last step of the sequence) 

RNN :

      for each time step (word_embedding in sequence):

        Hidden state #0 + Input vector #1 => Hidden state #1 + Output vector #1; Hidden state #1 + Input vector #2 => Hidden state #2 + Output vector #2; ... (the hidden state, i.e. the context, is passed along with each input to the RNN)
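
The loop above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `tanh` activation, the weight names `W_h`, `W_x`, `b`, and all toy values are assumptions made for the example.

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

# Toy sizes (hypothetical): hidden size 2, input (embedding) size 3.
W_h = [[0.5, -0.1], [0.2, 0.3]]
W_x = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three toy "word embeddings"
h = [0.0, 0.0]                 # hidden state #0
hidden_states = []
for x in sequence:             # one time step per element of the sequence
    h = rnn_step(h, x, W_h, W_x, b)
    hidden_states.append(h)    # hidden state #1, #2, #3
```

The same weights are reused at every time step; only the hidden state changes as it carries context forward.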


Problem with vanilla RNN:

* Short-term memory (inability to retain information from much earlier steps)

  Why ??
 
* Vanishing gradients due to backpropagation.   

  How??

* The weight update (to reduce the loss) of a layer in a deep feed-forward network depends on the gradients propagated back through the layers after it. By the chain rule these gradients are multiplied together, so if they are small the product shrinks layer by layer (the gradients of the first layers become very small, so those layers learn little or nothing).
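
A small numeric sketch of this effect, assuming a scalar RNN `h_t = tanh(w * h_{t-1})` with a hypothetical weight `w = 0.9`: the gradient of the last hidden state with respect to the first is a product of per-step derivatives, and it shrinks as the number of steps grows.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule: local derivative of this step
    return abs(grad)

# Gradient magnitude after 1, 10 and 50 time steps.
grads = {steps: gradient_through_time(w=0.9, steps=steps) for steps in (1, 10, 50)}
```

Each factor is below 1 here, so the product decays roughly geometrically with sequence length.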

  Partial Solution??

* Variants of the RNN architecture 
  1. GRU ( Gated Recurrent Unit )
  2. LSTM ( Long Short-Term Memory )


Eg. 

Input Vector : Word in a sequence


## Illustrated example: Neural Machine translation 

Seq2seq Machine translation:



Encoder-Decoder architecture (RNN) :

*   Encoder transforms input sequence into vector ( Context )
*   Context (last hidden state of the sequence) sent to decoder which decodes vector sequence element by element.

        Input_sequence -> Encoder -> Context -> Decoder -> Translation 

* Size of context vector set by the number of hidden units in the encoder RNN.
* The last hidden state of the Encoder is the context passed along to the decoder to decode the sequence of words.



### Attention

Why??

* Dealing with long term dependencies with long sentences.
* Dealing with different words (context) which contribute to single word in translation .

Autoregressive CNNs (WaveNet, ByteNet) were used as a solution, but convolution layers capture history by position rather than by content.  

What??

* Allows the model to focus on the relevant parts (words) of the input sequence, with weights representing the relative importance of different words (keys) for the particular word (query) being translated. 

Contribution ??

* Encoder :
  
  All the hidden states of encoder passed to decoder rather than the last hidden state of the sequence. 

* Decoder :

  Attention scores and context vector :

  for each time step:

    1. Score each hidden state, associated with words in the sequence.
    2. Multiply each hidden state (vector) by its softmaxed (0-1) attention score, thus giving attention to words with a high score.
    3. Sum up the weighted vectors; the result is the context vector.
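
The three steps above can be sketched as follows. The notes do not say which scoring function is used, so a plain dot product between the decoder state and each encoder state is assumed here; the vectors are toy values.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    # 1. Score each encoder hidden state against the decoder state (dot product, assumed).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Softmax the scores into weights in [0, 1] that sum to 1.
    weights = softmax(scores)
    # 3. Weighted sum of the encoder states gives the context vector.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h1, h2, h3 (toy values)
decoder_state = [2.0, 0.0]                                # h4 (toy value)
weights, context = attention_context(decoder_state, encoder_states)
```

With these toy values h1 and h3 receive equal, high weights and h2 a low one, so the context vector is dominated by h1 and h3.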



#### Scoring with attention  

Decoder RNN :

Time step #1 :

1. Takes the `<END>` embedding as input vector and initial decoder hidden state to feed into the RNN unit.   
2. Decoder produces output and hidden state vector (h4); output vector discarded.
3. Scoring: 
  1. Use the encoder hidden states (h1,h2,h3) and the h4 hidden state vector of the decoder to calculate attention scores and thus form the context vector (c4). 
4. Concatenate h4 and c4 into single vector.
5. Pass through feedforward neural network 
6. Output vector (O1) of feedforward neural network is the translated word for time step#1.
7. Hidden unit (h4) and output vector (O1) of previous step passed as input to the next time step #2 and repeated. 


Advantages:

1. With attention and scoring the text alignment is learned while training. 

Eg. French to English 

French : je suis etudiant

1. French encoder hidden state vectors (h): [h1,h2,h3]
  2. Attention weights over [h1,h2,h3] at decoder step #1: [1,0,0] - I 
  3. Attention weights over [h1,h2,h3] at decoder step #2: [0,1,0] - am 
  4. Attention weights over [h1,h2,h3] at decoder step #3: [0,0.5,0.5] - a 
  5. Attention weights over [h1,h2,h3] at decoder step #4: [0,0,1] - student



# Transformers 

Transformers Paper : https://arxiv.org/abs/1706.03762

Lecture video : https://www.youtube.com/watch?v=OyFJWRnt_AY&ab_channel=PascalPoupart

Annotated Transformer : http://nlp.seas.harvard.edu/2018/04/03/attention.html

Transformers : A model that uses attention to boost speed of training. 

Contributions as compared to RNNs :

* Facilitate long range dependencies with attention mechanism
* Mitigate vanishing and exploding gradients, since the transformer computes over the **entire sequence simultaneously** rather than step by step as in an RNN.
* Fewer training steps, since the whole sequence is processed at once rather than one word per time step.
* No recurrence ( sequential computation like RNN) which facilitates parallel computation (GPU)




## Attention Mechanism in Neural architecture 

### Transformer Neural Architecture

Building blocks of Transformer - Illustrated with machine translation :

    Input : Word embeddings; vector_size - 512; (bottom-most encoder)

    output : vector (floats) for the words in the sequence -> Fully connected layer -> logits vector ( based on vocab size learned during training) -> softmax layer (all positive, all add up to 1.0) -> argMax(softmax layer)  -> Word

**Iteration w.r.t a single word** : 

1.  Encoders - stack

        Input : Input word - embedding (Eng)
        Output : Encoded input word - embedding with context ( information of other relevant words attended to throughout the stack )

  1. Attention ( self / multi-head ) 
  2. Add & Layer - normalization 
  3. Feed forward neural network 

2.   Decoders - stack

          Input : Encoded input Word - embedding (Eng) and masked output word embedding(Ger; for having information of previous words until the predicting word and mask future words)
          Output : Probability distribution over the words in the vocabulary 

  1. Attention ( self / multi-head ) 
  2. Add & Layer - normalization 
  3. Encoder - Decoder attention 
  4. Feed forward neural network 


#### Encoder - stack

High level description of steps in encoder stack:

1.  Inputs : Entire sequence of words as input.
2.  Input  Embedding : Word embedding of input.
3.  Positional encoding : Add positional encoding (vector) for words in the sequence which provide meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
4.  Multi-head attention : Used to compute the attention of each word in the sequence against the other words, with multiple heads (multiple combinations of contributing words). 
  * Idea: Treat each word as query, find keys (other words) in the sentence based on similarity and take the dot product of query vector and key vector to compute attention score of each word against other words. 

  * Having several stacks of encoders (attention) leads to finding attention of not only single words but of pairs; pair of pairs .. 

  * Multi-head attention leads to finding multiple attention ( multiple weighted combination of other words which contribute)  w.r.t the query word.
5. Add & Norm : Add the original input (residual connection) to the output of multi-head attention and normalize the layer ( mean - 0 and variance - 1).
        hidden-unit(h) = g * (h - mean(h)) / standard-deviation(h)
          g - learnable gain (scale) parameter 
          
        Why ??
          Reduces covariate shift (the input distribution of a layer changing as earlier layers are updated), which stabilizes training.  

6. Feed Forward layer : A layer with no cycles/ recurrence and used to process or fit output from one layer to next.
7. Add & Norm 
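
The Add & Norm step (steps 5 and 7 above) can be sketched as a residual addition followed by layer normalization. This is a minimal sketch: `gain` corresponds to `g` in the formula above, while `bias` and `eps` are additions commonly found in layer-norm implementations and are assumptions here.

```python
import math

def layer_norm(h, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one vector to mean 0 and variance 1, then scale and shift."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [gain * (v - mean) / math.sqrt(var + eps) + bias for v in h]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual)

x = [1.0, 2.0, 3.0, 4.0]                 # input to the sub-layer (toy values)
sublayer_out = [0.5, -0.5, 0.5, -0.5]    # e.g. output of multi-head attention (toy values)
y = add_and_norm(x, sublayer_out)
```

The normalization is applied per position (per word vector), not across the batch.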

Key points:

*  All Encoders receive vectors of size 512. 
  * Bottom_most : word embedding vector
  * Others : output of the previous layer encoder.
*  Each word flows through its own path in each encoder (key factor for parallel execution on a GPU), and hence the entire sequence of words is fed as input.
  * Dependencies exists between words in the "attention layer".
* Pipeline for a word in sequence :  

      Input ( single word ) -> embedding -> ( query, key, value ) -> attention score against the other words in the sequence (q·k) -> scale (divide by the square root of the key dimension) -> softmax (0-1) -> softmax * value  -> summation -> encoding of the word with context.









#### Decoder-stack 

Input: Key and value vectors computed from the output of the top encoder (used by the encoder-decoder attention) and the embeddings of the output words (translation) generated by the decoder so far.

Output : Probability distribution over the vocab for each position of word. (translation)


High level description of steps in decoder stack:

1. Output Embedding : word embedding of outputs
2. Positional encoding : Add positional encoding to the decoder's input to indicate the position of each word.
3. Masked Multi-head attention :  Mask (-infinity) the positions of the future words, so that self-attention can only focus on earlier positions (words) in the output rather than future positions while training. 
4. Add and norm : Add the residual and normalize the output of the multi-head attention.
5. Multi-head attention (encoder-decoder attention) : Takes keys and values from the encoder output and queries from the output of the masked multi-head attention to calculate the attention between the input and the output ( generated up to that point).
6. Feed forward layer ; add and norm : Apply a position-wise feed-forward network, then add the residual and do a layer normalization ( mean - 0 and variance - 1). 
7. Linear layer : Project the decoder output vector into the logits vector, whose dimensionality is the size of the vocab (unique words) learned during training. 
8. Softmax layer : Turns the scores into probabilities ( all positive; all add up to 1.0); the cell with the highest probability is chosen and the word associated with it (vocab) is produced as the output for the time step, which then goes back into the decoder stack as input. 
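
Steps 7 and 8 (logits, softmax, pick the most probable word) can be sketched as below. The vocabulary and logit values are hypothetical; in a real model the logits come from the linear layer.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<eos>", "I", "am", "a", "student"]     # hypothetical toy vocabulary
logits = [0.1, 2.5, 0.3, -1.0, 0.7]              # hypothetical output of the linear layer

probs = softmax(logits)                          # all positive, sum to 1.0
best_index = max(range(len(probs)), key=probs.__getitem__)   # argmax
predicted_word = vocab[best_index]
```

Always taking the argmax is greedy decoding; other strategies (for example beam search) keep several candidates instead.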


Note :

* Feeding the output of the decoder back in as input might look like recurrence, but during training this can be avoided with the teacher forcing technique.

* Teacher forcing is a method used while training wherein the correct translation (ground-truth previous words) is fed directly into the decoder input, instead of waiting for the decoder's own prediction at each time step. All positions can then be trained in parallel. At inference time the decoder still generates one word at a time.
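
A minimal sketch of how teacher forcing builds the decoder's training input and target from one ground-truth translation. The start and end markers `<s>` and `</s>` are hypothetical token names chosen for the example.

```python
def teacher_forcing_pairs(target_tokens, bos="<s>", eos="</s>"):
    """Decoder input is the ground truth shifted right by one position."""
    decoder_input = [bos] + target_tokens      # what the decoder sees
    decoder_target = target_tokens + [eos]     # what it must predict at each position
    return decoder_input, decoder_target

decoder_input, decoder_target = teacher_forcing_pairs(["I", "am", "a", "student"])
```

At position `i` the decoder sees `decoder_input[:i + 1]` (the future is masked) and is trained to predict `decoder_target[i]`.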


##### Loss Function 

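Compare the model output vector ( probability distribution over vocab ) with the desired output vector ( one-hot probability distribution over vocab ) for every position in the sequence of words until the `<'end of sentence'>` is reached. In practice the comparison is usually a cross-entropy (or KL divergence) between the two distributions rather than a plain subtraction. 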
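
A minimal sketch of a per-position cross-entropy loss, assuming the desired output is one-hot (so only the probability assigned to the correct word matters). The probabilities are toy values.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy against a one-hot target: -log p(correct word)."""
    return -math.log(predicted_probs[target_index])

def sequence_loss(predicted, targets):
    """Average cross-entropy over all positions in the sequence."""
    return sum(cross_entropy(p, t) for p, t in zip(predicted, targets)) / len(targets)

# One predicted distribution over a 3-word vocab per position (toy values).
predicted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
targets = [0, 1, 2]                     # index of the correct word at each position
loss = sequence_loss(predicted, targets)
```

The loss is 0 only when the model puts probability 1.0 on the correct word at every position.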

### Attention Mechanism

Attention Mechanism 

* **Mimics the retrieval of a value `v` for a query `q` based on a key `k` in the database**.

Elements:

* Query vector - q
* Key vectors -  k
* Value vectors - v

      attention(q, K, V) = Sum over i of ( weight(similarity(q, k_i)) * v_i ) 

Steps :

1. Compute similarity Measure f(Query, keys):

        Input : Vectors ( Query, keys)
        Output : similarity measure (Scalar) 
  options :
  1. Dot product : q · k   
  2. Scaled dot product : (q · k) / sqrt(dimensionality of each key) 
  3. General dot product : q^T W k; W (weights) - learned matrix that projects the query into a new space before the similarity is measured.
  

2. Compute weights (a):

        Input : similarity measure (Scalar)
        Output : Weights (Vector) 

        Formula : Softmax over the similarity measures 
        a(i) = exp(similarity_measure(i)) / sum over j of exp(similarity_measure(j)) 

3. Weighted combination (weight, value):

        Input : Weights (Vector) 
        Output : Attention value (vector)  

        Formula : sum over i of ( a(i) * value(i) ) 
        
        Paper (scaled dot-product attention): Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V 
        d_k - dimensionality of the keys 
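
The three steps (similarity, weights, weighted combination) can be sketched for a single query as a "soft" lookup. This is an illustrative sketch with toy vectors; the `scaled` flag switches between the plain and the scaled dot product.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, scaled=True):
    d_k = len(query)
    # Step 1: similarity of the query with every key (dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scaled:
        scores = [s / math.sqrt(d_k) for s in scores]
    # Step 2: softmax turns the similarities into weights.
    weights = softmax(scores)
    # Step 3: weighted combination of the values.
    dim_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention([4.0, 0.0], keys, values)
```

The query matches the first key far better than the second, so the output is close to the first value: a soft version of a database lookup.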

* Query, key and values are matrices ( formed by stacking multiple (q,k,v) vectors; described in the paper to speed up computation for multiple queries), obtained by multiplying the word embeddings (x) with projection matrices (WQ,WK,WV) learned during training. 
  * Why projection matrices?
    * Reduce the embedding (x) dimension via a weight transformation / projection.
* In self-attention (intra-sequence attention), queries, keys and values are all derived from the same sequence. When attention is used to relate two different sequences, the queries come from one sequence and the keys and values from the other.  
* Key, value - projections of the words being attended to (the projection matrices are learned during training).
* Query - word for which the attention score is being calculated.

### Multi-head attention in Transformer

Multi-head attention : Compute multiple attentions per query ( word ), each with its own learned projection weights.

* Multiple sets of (Query,key,value) 

  * Why ??
    * The weighted combination in a single attention blends every word into one vector, which can end up dominated by a single word (often the word itself) and hide other relevant words.

  * What ??
    * Multiple sets are like multiple feature maps / filters in CNN which gives multiple contexts or "representation subspaces" for each word.

  * How ??
      * Extract different features w.r.t word by focusing on different positions and combination of different positions in sentence.

 Eg.  heads = 2 (q,k,v)

          Sentence :  " The animal didn't cross the street because it was too tired".

          Coreference resolution of "it" :  one head attends to "animal", the other to "tired" 


Pipeline :

    Multiple-sets( # of heads) of ( q,k,v ) via W projection matrices -> Scaled dot product attention( # of heads) -> concat (head1,head2,...headh) -> linear -> multi-head attention
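
The pipeline above can be sketched in plain Python with nested lists as matrices. Everything here is a toy assumption: 3 tokens, model size 4, 2 heads of size 2, hand-picked projection matrices and an identity output projection `W_O`.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) projection triples, one per head."""
    head_outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
        head_outputs.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the heads along the feature dimension, position by position.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)      # final linear layer

X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
P1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # head 1 looks at features 0-1
P2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # head 2 looks at features 2-3
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = multi_head_attention(X, [(P1, P1, P1), (P2, P2, P2)], W_O)
```

Each head attends over a different projection of the same input, which is what gives the different "representation subspaces".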


Masked Multi - head attention: 

* Multi-head where some values are masked ( i.e probabilities of masked values ( -inf) are nullified ( with softmax operation ) to prevent them from being selected ).
* In decoder, the output values should depend on previous output rather than future outputs which is done through masked multi-head attention.

        MaskedAttention(Q, K, V) = softmax( (Q K^T + M) / sqrt(d) ) V

        where M is a mask matrix of 0's (allowed positions) and -inf's (masked future positions)
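
A minimal sketch of the masking step, assuming a causal (lower-triangular) mask and toy raw scores standing in for `Q K^T`. Only the attention weights are computed, to show that masked positions end up with probability 0.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if position j is visible from position i (j <= i), else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, mask, d_k):
    return [softmax([(s + m) / math.sqrt(d_k) for s, m in zip(s_row, m_row)])
            for s_row, m_row in zip(scores, mask)]

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]                  # toy raw scores
weights = masked_attention_weights(scores, causal_mask(3), d_k=4)
```

The first position can only attend to itself, the second to the first two positions, and so on.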


# BERT : Bidirectional Encoder Representations from Transformers

Source : http://jalammar.github.io/illustrated-word2vec/

### Word Embeddings

Source : https://www.tensorflow.org/tutorials/text/word_embeddings

What ??

* Representing text as numbers 

Why??

* Machine learning model takes vectors ( arrays of numbers ) as input. 
* Using a better way to represent syntax and semantics of the underlying text being represented.

How??

Three strategies :

1. one-hot encoding : Representing vocab (unique words) as columns and each word sequence as rows. The presence of words represented with 1 and absence as 0.

  Eg. sentence :  "The cat sat on the mat"
                        vocab      
                  cat mat on sat the
            the => 0   0   0  0   1
            cat => 1   0   0  0   0
            sat => 0   0   0  1   0
            ...         ...

      Combine the vectors of each word to form the encoding for the sentence 

      Disadvantage: 
      
        Sparse (mostly 0) vector representation of words


2.  Encode each word with a unique number :  A dense representation compared to one-hot encoding. 
  
        Eg. 
            sentence :  "The cat sat on the mat"

              vector :   [5,  1,  4, 3,  5,  2]

          "the" => 5
        
          "cat" => 1

          ...

  Disadvantage: 
        
        1. Integer-encoding is arbitrary (Relationship between words not captured)
        2. Not trainable 

3. Word embeddings: 

  * A word embedding is a trainable dense vector representation of floating point values. 
  * Values for word embedding are learned during training.
  * Dimensionality or length of the vector is a hyper-parameter.
  * Similar words have a similar encoding in a word embedding, in which semantic and syntactic relations are captured. 

          Eg. Analogies :  vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)

          Eg. sentence :  "The cat sat on the mat"
                           embedding dimensions (learned)
                           d1     d2      d3      d4       d5
                  the =>   0.5    0.56  -0.23    0.122    -0.111
                  ...         ...
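
The three strategies can be compared side by side on the example sentence. This is a sketch: the embedding table is filled with random numbers to stand in for learned values, and the embedding size of 4 is an arbitrary choice.

```python
import random

sentence = "The cat sat on the mat".lower().split()
vocab = sorted(set(sentence))                       # ['cat', 'mat', 'on', 'sat', 'the']
word_to_index = {w: i for i, w in enumerate(vocab)}

# 1. One-hot encoding: one row per word, one column per vocab entry.
one_hot = [[1 if i == word_to_index[w] else 0 for i in range(len(vocab))]
           for w in sentence]

# 2. Integer encoding (1-based, to match the example above).
integer_encoded = [word_to_index[w] + 1 for w in sentence]

# 3. Embedding lookup: each word indexes a row of a dense table.
#    Random values stand in for weights that would be learned during training.
rng = random.Random(0)
embedding_dim = 4
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
embedded = [embedding_table[word_to_index[w]] for w in sentence]
```

Both occurrences of "the" map to the same row of the table, which is exactly the static behaviour that contextual embeddings later address.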




#### Non-contextual word embeddings

Word2Vec : http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/



Pre- trained non - contextual word representation :  

"Word embeddings (word2vec, GloVe) are often
pre-trained on text corpus from co-occurrence
statistics".

* Word2Vec ( Window - based model ) :

  Technique of training : 
    1.   Continuous Bag-of-words (CBOW) : Predict center word in the window based on surrounding words ( context before and after).
    2.   Skip-gram (SG) : Predict surrounding words based on the central word in the window.

  Disadvantages :
    1. Window-based model; does not benefit from the information in the whole document.
    2. Cannot handle OutOfVocabulary words (OOV) 

* Glove (Global Vectors for Word Representation) :
  
  * Construct matrix of (word x context) in a large corpus.
  * Takes ratios of co-occurrence probabilities.
  * The embeddings are optimized so that the dot product of two word vectors approximates the log of the number of times the two words occur near each other.

  Disadvantages :
  
  1. Cannot handle OutOfVocabulary words (OOV) 

* Fasttext ( Fast and efficient representation ):

  * Relies on skip-gram (SG) model.
  * Benefits from sub-word information. 
  * Can handle OutOfVocabulary words (OOV).
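
The two Word2Vec training setups described above differ only in what is input and what is predicted. A minimal sketch of how training examples could be generated from a token list (the window size and sentence are toy choices):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center word -> one surrounding word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (surrounding words -> center word) examples."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "cat", "sat", "on", "the", "mat"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Only words inside the window ever form a training example, which is why the model does not see document-level information.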

**Disadvantage**  :

* Static word embeddings fail to capture polysemy (same embedding for different contexts).

eg. "bank"

*  “I deposited 100 EUR in the **bank**.”
* “She was enjoying the sunset on the left **bank** of the river.”

### Language model

LM : http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture06-rnnlm.pdf

What ??

* A language model takes a list of words and attempts to predict the probability of the next word over the vocab ( unique words learned ) and *outputs the word with the highest probability*.

  * Computes the probability distribution of the next word (xt+1) over vocab V based on the sequence x1,x2,....,xt.
        P(xt+1 | xt, ....., x1)

  * "You can also think of a Language Model as a system that
assigns probability to a piece of text."
      
        P(x1, ...., xT) = product over t = 1..T of P(xt | xt-1, ....., x1)

Why ??

* Helps in understanding language, with NLP applications such as text generation (speech assistants), speech recognition and machine translation. 

How ??

* Basic architecture :

  1. Look up embeddings 
  2. Calculate prediction with model
  3. Project to output vocabulary and display words with highest probability.








**Language models Training** :

**Idea** : 

1. Get text data  
2. Select window size (history,word to be predicted) to slide against the text to generate dataset. 
3. Train model to generate next word Using history sequence as input. 
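
Step 2 (sliding a window over the text to build the dataset) can be sketched as follows; the sentence and the history size of 2 are toy choices.

```python
def sliding_window_dataset(tokens, history_size):
    """Return (history, next_word) training pairs from a token list."""
    return [(tokens[i:i + history_size], tokens[i + history_size])
            for i in range(len(tokens) - history_size)]

tokens = "thou shalt not make a machine".split()
dataset = sliding_window_dataset(tokens, history_size=2)
```

Each pair is one training example: the history is the model input and the next word is the label.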

**Models for training LM** :

1. Traditional n - gram language models : 

    * Predict the next word based on the previous n-1 words (an n-gram is a chunk of n consecutive words).
    * "The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few."

      Markov Assumption  : The probability of a word depends only on the previous n-1 words (for a bi-gram, only the previous word) rather than on the entire history.

            Probability of next word given history :
            bi-gram history = 2-1 = 1 ( only previous word)
            P(next_word | history ) =  count( history, next_word) / count (history)
  Problem :
  * Sparsity Problems with n-gram Language Models ( counts could be 0)
  * Storage of all n-grams
  * Increasing n worsens sparsity problem, and increases model size

2. Neural language models : 
    * Deep neural network models trained on a corpus, which are better at handling long-distance relationships. 

  Training :

    1. Get big corpus of text.
    2. Feed into RNN-LM
    3. Compute distribution of word  y^(t) over vocab for every step t.
    4. Loss function(t) = cross-entropy between the predicted distribution y^(t) and the true next word (one-hot), i.e. the negative log of the predicted (softmax) probability of the true word; average over all steps t for the overall loss.
    
  Types:

    1. Fixed Window-based neural model
        
            Pipeline : Words/one-hot vectors (x) -> concatenated word embeddings (e) -> hidden layer  (h = f(We + b1)) -> output distribution over unique words (y = softmax(Uh + b2))

          Dis-advantages :

          1. Fixed window **too small**
          2. Input words are multiplied by completely different weights in W. **No symmetry**  

    2. RNN Language models
    
            Pipeline : Words/one-hot vectors (x(t)) -> word embeddings (e(t)) -> hidden states  (h(t) = sigmoid(Wh h(t-1) + We e(t) + b1)) -> output distribution over unique words (y(t) = softmax(U h(t) + b2))

        Dis-advantages: 

          1. Recurrent computation is slow and expensive. 
          2. Difficult to access information from many steps back. 
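
The count-based estimate from the n-gram model in item 1 above, `P(next_word | history) = count(history, next_word) / count(history)`, can be sketched for bi-grams on a toy corpus. Returning 0 for unseen histories is a simplification that makes the sparsity problem visible; real systems use smoothing or back-off.

```python
from collections import Counter

def train_bigram(tokens):
    history_counts = Counter(tokens[:-1])                    # count(history)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))    # count(history, next_word)
    return history_counts, bigram_counts

def bigram_prob(history_counts, bigram_counts, history, next_word):
    if history_counts[history] == 0:
        return 0.0                     # unseen history: the sparsity problem
    return bigram_counts[(history, next_word)] / history_counts[history]

tokens = "the cat sat on the mat the cat ran".split()
history_counts, bigram_counts = train_bigram(tokens)
p_cat = bigram_prob(history_counts, bigram_counts, "the", "cat")
p_dog = bigram_prob(history_counts, bigram_counts, "the", "dog")
```

"the" is followed by "cat" twice and by "mat" once, so `p_cat` is 2/3, while the unseen pair ("the", "dog") gets probability 0.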

#### Contextual word embeddings

Source: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018


Pre- trained contextual word representation : Capturing word senses of language. 

*   ELMO : Embeddings from Language Models
    * Learn word token vectors using long contexts ( whole sentence or longer) not context windows. 
    * Train **separate** left-to-right and right-to-left language models ( task agnostic and unsupervised ) using a bi-LSTM.
    * Use the task agnostic hidden states generated by the intermediate layers as "Pre-trained contextualised embeddings" . (every time step produces context specific word representation)
      
          word_embedding (hidden state) = previous_hidden_state which carry information + input_word_embedding*
    
    Steps:

      * Concatenate hidden layers. 
      * Multiply each vector by a weight based on the task ( for scaling over the downstream task).
      * Sum the vectors.

*   Flair
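
The ELMo combination steps listed above (weight each layer's vector, then sum) can be sketched as a softmax-weighted sum with an overall scale. The names `layer_logits` and `gamma` are assumptions for this example, and the layer vectors are toy values standing in for the concatenated forward and backward hidden states of one token.

```python
import math

def elmo_combine(layer_states, layer_logits, gamma=1.0):
    """Task-specific weighted sum of per-layer vectors for one token."""
    m = max(layer_logits)
    exps = [math.exp(a - m) for a in layer_logits]
    total = sum(exps)
    s = [e / total for e in exps]                 # softmax-normalized layer weights
    dim = len(layer_states[0])
    return [gamma * sum(w * layer[i] for w, layer in zip(s, layer_states))
            for i in range(dim)]

layer_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per layer (toy values)
embedding = elmo_combine(layer_states, [0.0, 0.0, 0.0])
```

With equal logits every layer contributes one third; a downstream task would learn the logits and the scale.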


#### Transfer learning 

Language Models pre-training : https://arxiv.org/pdf/1906.08237.pdf

* Pre-train a language model on a large unlabeled text corpus (unsupervised / semi-supervised sequence learning), rather than on task-specific data, to capture the patterns in the language.

* **Using the pre-trained language model; finetune it for the particular downstream tasks** ( classification ).

Idea (fast.ai) : 

1. In a pre-trained neural network (e.g. resnet34 CNN) the first layers are more generic and the last layers are more task specific. 
2. The final layer of the pretrained model (which projects the output from the previous layers to the categories) is deleted and a new layer with random weights is added for the downstream task.      
3. **The previous layers are frozen (no weight updates during SGD); only the newly added final layer is trained**.
4. For NLP :
  1. Freeze all layers except the last classifier layer ( randomly initialized weights) and train for a few epochs.
  2. Unfreeze all the layers and train again for a few epochs with discriminative learning rates (lower for the early layers, gradually increasing towards the last layers).
          learn.fit_one_cycle(2, max_lr = slice(1e-6,1e-4))


Pre-training LM:

1. Autoregressive LM : Factorize ( product of several factors ) the likelihood of word **after** a sequence (forward product of probabilities ) or word **before** a sequence ( backward product of probabilities )

2. Autoencoding LM : Reconstruct the original sequence of text from the corrupted input (masked).

### OpenAI Transformer GPT : Pre - training a transformer Decoder for language modeling

"Improving Language Understanding by Generative
Pre-Training, OpenAI, 2018"
*   Using Transformer decoder ( masked self attention ; no peeking at future tokens ) for the task of language modeling  as transformers are better choice when training on unlabeled large text data.
*   12 decoders are stacked from the vanilla transformer without the encoder-decoder attention sublayer ( as the encoder is absent).  
* Fine-tune on downstream task (classification task).
* Transformer decoders can be trained to generate text ( output for each timestep ).

Scaled Version of GPT :

1. GPT - 2 ( trained on about 40 GB of text )
2. GPT - 3 ( trained on 300B tokens of text )

### Problem with autoregressive language models

Source : https://web.stanford.edu/class/cs224n/slides/Jacob_Devlin_BERT.pdf

* Language models only use left context or right context, but language understanding is bi-directional. 

* OpenAI transformer LM is a left context / forward language model.

Why??

* Directionality is needed to generate a well-formed probability distribution.
* Words can “see themselves” in a bidirectional encoder.


### BERT : From Decoders to Encoders 

Source : https://web.stanford.edu/class/cs224n/slides/Jacob_Devlin_BERT.pdf

* Encoders self attention spans the left and right tokens in the sequence. 
* Language models need to learn to predict future tokens which would not be possible using the encoders self attention which peeks into future tokens.

Solution : Masked Language models ( autoencoding LM) 

* Mask out k% of the input words, and then predict the masked words. (k = 15%)
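
A simplified sketch of building one masked-language-model training example. It replaces every selected token with a `[MASK]` placeholder; BERT's actual recipe also sometimes keeps the token or swaps in a random one, which is left out here. The function name, seed handling and toy tokens are assumptions for illustration.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask about mask_rate of the tokens; return masked input and labels."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}                          # position -> original token to predict
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels

tokens = [f"w{i}" for i in range(20)]    # 20 toy tokens
masked, labels = mask_tokens(tokens)
```

The model sees `masked` with full left and right context and is trained to predict only the tokens stored in `labels`.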



BERT (Bidirectional Encoder Representations from Transformers):

Pre-training of Deep Bidirectional Transformers for Language
Understanding, which is then fine-tuned for a task.

* Uses transformers **encoders** and **masked language modeling** to capture left and right context and handle long distance context. 
* Pretrained on Wikipedia + bookcorpus
* Two variants based on pre-trained model sizes:
  1. BERT-Base : 12 layers, 768-hidden, 12-heads
  2. BERT-Large : 24 layers, 1024-hidden, 16-heads

BERT Input Representation :

* 30,000 wordpiece vocabulary on input.
* Each token is sum of three embeddings.

    Input -> Token embeddings ( represent word pieces ) + Segment embeddings ( represent segments of sentence pair ) + Positional embedding ( represents positions of word piece ) 
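
The sum of the three embeddings can be sketched with tiny lookup tables. All table values, ids and the embedding size of 2 are toy assumptions; in BERT these tables are learned.

```python
def bert_input_embedding(token_ids, segment_ids, token_table, segment_table, position_table):
    """Element-wise sum of token, segment and position embeddings per position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        out.append([t + s + p for t, s, p in
                    zip(token_table[tok], segment_table[seg], position_table[pos])])
    return out

token_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # one row per word piece
segment_table = [[0.0, 0.0], [1.0, 1.0]]                # sentence A / sentence B
position_table = [[0.0, 0.01], [0.0, 0.02], [0.0, 0.03]]

token_ids = [0, 2, 1]
segment_ids = [0, 0, 1]
embedded = bert_input_embedding(token_ids, segment_ids,
                                token_table, segment_table, position_table)
```

Because the three parts are summed rather than concatenated, the input vector keeps the same size as a single embedding.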



BERT model fine tuning :

* Use a classifier built on top layer for each task.
* BERT is fine-tuned for:
  1. Sentence pair classification tasks. ( eg. Multi-NLI: premise, hypothesis, label )
  2. Single sentence classification tasks. (eg. CoLA - Corpus of Linguistic Acceptability)
  3. Question Answering task.
  4. Single sentence tagging task.





### Post-BERT Pre-training Advancements

Source : https://web.stanford.edu/class/cs224n/slides/Jacob_Devlin_BERT.pdf

1. RoBERTa : A Robustly Optimized BERT Pretraining Approach (Liu et al, University of Washington and Facebook, 2019)
  * Trained BERT for more epochs and on more data.
2. XLNet : Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
  * Combination of auto-regressive (left-context) and auto-encoding (both context)
  * Innovation #1 : Permutation LM 
      * Randomly permute the factorization (prediction) order of the tokens rather than always predicting left-to-right or right-to-left while training; the text itself keeps its original order.

  * Innovation #2 : Masked two-stream attention (with relative position embeddings) 
    * Architecture change to handle permutation LM, with two hidden representations for each token (h, content stream - can attend to the context and to the token itself; g, query stream - can attend to the context and the token's position but not to the token's own content)
3. ALBERT : A Lite BERT for Self-supervised Learning of Language Representations (Lan et al, Google and TTI Chicago, 2019)
  * Innovation #1 : Factorized embedding parameterization
    * Use small embedding size (e.g., 128) and then project it to Transformer hidden size (e.g., 1024) with parameter matrix

  * Innovation #2: Cross-layer parameter sharing
    *  Share all parameters between Transformer layers
4. T5 
  * Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al, Google, 2019)
5. ELECTRA : Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al, 2020)
  * Train model to discriminate locally plausible text from real text
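
The saving from ALBERT's factorized embedding parameterization (item 3 above) is simple arithmetic. The sketch below uses the embedding size 128 and hidden size 1024 from the notes; the vocabulary size of 30,000 is borrowed from the BERT input description and is an assumption here.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Number of embedding parameters, with or without factorization."""
    if embedding_size is None:
        return vocab_size * hidden_size                    # V x H
    # V x E lookup table followed by an E x H projection matrix.
    return vocab_size * embedding_size + embedding_size * hidden_size

full = embedding_params(30_000, 1024)
factorized = embedding_params(30_000, 1024, embedding_size=128)
```

With these numbers the embedding parameters drop from 30,720,000 to 3,971,072, roughly an eightfold reduction.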

Distillation :

* BERT and other pre-trained language models are extremely large and expensive.
* Train a distilled, smaller version of BERT (e.g. DistilBERT).
  * Technique : 
    1. Train "Teacher": Use SOTA (state of the art) pre-training + fine-tuning technique to train a model with large data and maximum accuracy.
    2. Train "Student" : A much smaller model that mimics the Teacher's output.
    3. Student objective is typically Mean Square Error or Cross Entropy
    4. **Distillation works much better than pre-training + fine-tuning with smaller model**
      why??
        * Language modeling is the “ultimate” NLP task in many ways
          * I.e., a perfect language model is also a perfect question
            answering/entailment/sentiment analysis model
        * Finetuning mostly just picks up and tweaks these existing latent
features.
        * Distillation allows the model to only focus on those features.
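
The two student objectives mentioned above (cross-entropy and mean square error against the teacher's output) can be sketched as below. The temperature parameter is an addition that is common in the distillation literature but not part of these notes; the logits are toy values.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's soft distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(t, s))

def mse_loss(teacher_logits, student_logits):
    """Mean squared error between teacher and student logits."""
    return sum((a - b) ** 2 for a, b in zip(teacher_logits, student_logits)) / len(teacher_logits)

teacher = [2.0, 1.0, 0.1]
good_student = [2.0, 1.0, 0.1]      # mimics the teacher exactly
bad_student = [0.1, 1.0, 2.0]       # ranks the classes the other way round
loss_good = distillation_loss(teacher, good_student)
loss_bad = distillation_loss(teacher, bad_student)
```

Both losses are smallest when the student reproduces the teacher's output, which is the sense in which the student "mimics" the teacher.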