# Transfer learning models



* Features in deep neural networks tend to move from general to task-specific from the first layers to the last. The idea is therefore to reuse the first layers and finetune or replace the last layers for a different task. [source](https://arxiv.org/pdf/1801.06146.pdf)



## ELMo (Embeddings from Language Models)
<b>***Better for the disambiguation of different word senses***</b>

* Deep contextualized word representations
* Trained on a huge corpus of text data
* ELMo representations (aka embeddings) are a function of all the internal layers of the biLM. They are also a function of the entire input sentence, not of a single word in isolation.
* <b>Architecture: </b> [Stacked LSTM layer](https://machinelearningmastery.com/stacked-long-short-term-memory-networks/)
* <b>Previous works:</b> 
  - Assigning separate vectors to the different meanings (senses) of a word.
  - Context2vec: Encoding the context around a pivot word by using a biLSTM
  - BiRNNs: These also encode different information at different layers.
* Used in a semi-supervised way: the biLM is pretrained on unlabelled text and then transferred to supervised tasks.

<b>Mathematics for bi language model:</b>
  
  * The joint probability of a sequence of N tokens under a forward language model is
  
  $\boxed {P(t_1,t_2,\ldots,t_N) = \prod_{k=1}^N P(t_k \mid t_1,t_2,\ldots,t_{k-1})} $
  
  * Similarly, the joint probability under the backward language model is 
  
  $\boxed {P(t_1,t_2,\ldots,t_N) = \prod_{k=1}^N P(t_k \mid t_{k+1},t_{k+2},\ldots,t_N)} $
  
  - That is, the backward language model predicts each token from the context that follows it.
  - A biLM combines the forward and backward models and jointly maximizes the [log likelihood](https://en.wikipedia.org/wiki/Log_probability) of both directions. The token representation parameters $\theta_x$ and the softmax parameters $\theta_s$ are shared, while each direction has its own LSTM parameters.
  
  $\sum_{k=1}^N \left( \log P(t_k \mid t_1,\ldots,t_{k-1}; \theta_x, \overrightarrow{\theta}_{LSTM}, \theta_s) + \log P(t_k \mid t_{k+1},\ldots,t_N; \theta_x, \overleftarrow{\theta}_{LSTM}, \theta_s) \right)$

<b>Mathematics for ELMo:</b>
  * For each token $t_k$, the biLM computes 2L+1 representations, where L is the number of LSTM layers.
  
  $R_k =\{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1,\ldots,L\}$
  
  * Writing $h_{k,0}^{LM} = x_k^{LM}$ for the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer, this is the same as $R_k = \{h_{k,j}^{LM} \mid j = 0,\ldots,L\}$.
  
  * ELMo collapses all the layer representations in $R_k$ into a single vector. In the simplest case, it selects only the topmost layer's representation $h_{k,L}^{LM}$.
  
  * More generally, it computes a task-specific weighting of all biLM layers.
  
  $ELMo_k^{task} = E(R_k; \theta^{task}) = \gamma^{task}\sum_{j=0}^L s_j^{task} h_{k,j}^{LM}$
  
    - where $s_j^{task}$ = softmax-normalized weights
    - $\gamma^{task}$ = scalar parameter that scales the whole ELMo vector
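
The task-specific weighting can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `elmo_combine`, the toy layer vectors and the variable `s_logits` (the unnormalized weights before the softmax) are hypothetical and do not come from the paper's code.

```python
import numpy as np

def elmo_combine(layer_reps, s_logits, gamma):
    """Collapse the L+1 layer vectors of one token into a single vector.

    layer_reps: array of shape (L+1, dim), row j is h_{k,j}
    s_logits:   array of shape (L+1,), unnormalized layer weights
    gamma:      scalar that scales the final vector
    """
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                      # softmax-normalized weights s_j
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

# Toy example: token layer (j = 0) plus L = 2 LSTM layers, dimension 4.
layer_reps = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]])

# Equal logits give every layer the same weight of 1/3.
equal = elmo_combine(layer_reps, np.zeros(3), gamma=1.0)
```

With a very large logit on the last layer, the result approaches the "top layer only" special case mentioned above.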
    
<b>Pretrained biLM architecture:</b>
* This architecture is similar to the previous work of [Jozefowicz et al.](https://arxiv.org/pdf/1602.02410.pdf) and [Kim et al.](https://arxiv.org/pdf/1508.06615.pdf)
* The authors added residual connections between the LSTM layers and modified the model to support joint training of both directions.
* The model has 2 LSTM layers (L=2) with a residual connection from the first layer to the second.
    
<b>Adding ELMo to supervised tasks:</b>

* A supervised model for the desired NLP task can be improved by adding ELMo to its architecture.
* Most supervised NLP models share a similar architecture at the lowest layers: they build a context-independent token representation $x_k$ and then a context-sensitive representation, typically with an RNN.
* To add ELMo to a supervised model:
  - Freeze the biLM weights and concatenate the ELMo vector $ELMo_k^{task}$ with $x_k$.
  - Pass the enhanced representation [$x_k$; $ELMo_k^{task}$] into the task RNN.
  
<b>What biLM representations are capturing?</b>
* Since they give better results than word vectors, they must capture information that context-independent word vectors do not.
* The nearest neighbours of a biLM representation depend on the context of the word. The paper shows an example with the word "play".
  
<b>Evaluated tasks:</b> 
* Question-Answering
* Natural language inference(Textual entailment)
* Semantic role labelling
* Coreference resolution
* Named entity extraction
* Sentiment analysis

<b>Keynotes:</b>
* Higher-level LSTM states capture the context-dependent meaning of the word.
* Lower-level LSTM states capture the syntactic aspects of the word.
* Not evaluated on semantic text similarity.
* ELMo improves performance because it uses deep contextualized representations from all layers instead of only the top layer.
* A small regularization parameter (see "Terminology") is better for ELMo according to the authors.
* Including ELMo at both the input and the output of the task-specific biRNN gives better results for some tasks.

<b>Terminology:</b>
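* Regularization parameter ($\lambda$): controls the strength of the penalty on the model weights, which reduces overfitting. See [here](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a) and [here](https://stackoverflow.com/questions/12182063/how-to-calculate-the-regularization-parameter-in-linear-regression)
* Downstream models: the task-specific supervised models (for example a question answering or named entity model) that take the pretrained representations as input.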


<b>  Questions: </b>
* What is a coupled language model?
* What is the function of a language model?
* Why do the lower levels of a stacked architecture represent syntactic features and the upper levels semantic features?
* What are character convolutions?
* What is task-specific weighting?
* What are these supervised NLP models? Are they Word2Vec and GloVe?
* The authors talk about including ELMo in task-related biRNNs. Do we use this biRNN architecture for every task?

<b>Implementation:</b>

* See under "Using pretrained models" in [For training the models as ELMo](https://github.com/allenai/bilm-tf)

## BERT (Bidirectional Encoder Representations from Transformers)

* Improves on finetuning approaches by using bidirectional encoder representations from Transformers.
* Introduces a pretraining objective called the masked language model ("MLM"), which lets the representation use both left and right context.
* Unlike the ELMo approach of combining separately trained forward and backward representations, the authors use the MLM objective to pretrain deep bidirectional representations.
* <b>Architecture :</b> [Multi layer Bidirectional Transformer Encoder](https://arxiv.org/pdf/1706.03762.pdf)
* <b>Previous approaches:</b>
  - Unidirectional pretraining
  - Finetuning-based pretrained models (OpenAI GPT)
  - Feature-based pretrained models (ELMo)


<b>Model architecture:</b>
* The authors state that a standard language model can be trained only left-to-right or right-to-left. Conditioning on both directions at once would let each word indirectly see itself.
* Hence they did not use the traditional language model objective and pretrained with the two tasks below instead.

 * <b> Masked LM: </b>
   - They mask some of the input tokens and predict those tokens while training the deep bidirectional representation.
   - Around 15% of the WordPiece tokens in every sequence are chosen at random, and only these chosen tokens are predicted.
   - The disadvantage is that the [MASK] token never appears during finetuning, which creates a mismatch between pretraining and finetuning. So, of the chosen tokens, 80% are replaced with the [MASK] token, 10% with a random word, and 10% are left as the correct word.
   - Eg: 80% of the time "The dog has [MASK]", 10% of the time "The dog has apple", 10% of the time "The dog has hair". Keeping the correct word biases the representation towards the actually observed word.
   - Training takes longer to converge because only 15% of the tokens are predicted in each batch. However, the results surpass the previous state of the art.
   
  * <b> Next sentence prediction: </b>
   - The model is also pretrained on a binarized next sentence prediction task.
   - In pretraining, 50% of the time the actual next sentence is shown (labelled 'IsNext') and the remaining 50% of the time a random sentence is shown (labelled 'NotNext').
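
The 80/10/10 masking rule described under Masked LM can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `mask_tokens`, the toy vocabulary and the choice to select each position independently with probability 0.15 are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions chosen for prediction
    and None elsewhere, so the loss is computed on chosen positions only.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not chosen
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the correct word unchanged
    return corrupted, labels

tokens = "the dog has hair".split()
corrupted, labels = mask_tokens(tokens, vocab=["apple", "cat"], rng=random.Random(0))
```

Positions whose label is `None` are never modified, so the model always sees the original token there.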

<b>Finetuning procedures: </b>
  * For sentence classification tasks, finetuning BERT is simple.
  * During finetuning, most of the hyperparameters are kept the same as in pretraining.
  * The following values are said to work well across several tasks:
    - Batch size: 16, 32
    - Learning rate: 5e-5, 3e-5, 2e-5
    - Epochs: 3-4
  * One key observation by the authors is that large datasets are far less sensitive to the choice of hyperparameters than small datasets.
   
<b>Hyperparameters:</b>

| Hyperparameter | Value/name |
| --- | --- |
| Batch size | 256 sequences/batch (128,000 tokens per batch) |
| Training steps | 1,000,000 steps (approximately 40 epochs) |
| Optimizer | Adam |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| Adam momentum terms | $β_1 = 0.9, β_2 = 0.999$ |
| Dropout probability | 0.1 |
| Activation function | GELU |
| Training loss | sum (mean masked LM likelihood, mean next sentence prediction likelihood) |
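
The step count and the epoch count in the table can be related with a rough calculation. The sketch below assumes 512 tokens per sequence and treats tokens and words as interchangeable, which is only an approximation; the corpus sizes are the ones listed in the Overview table of these notes.

```python
sequences_per_batch = 256
tokens_per_sequence = 512            # assumed maximum sequence length
training_steps = 1_000_000
corpus_words = 800e6 + 2500e6        # BookCorpus + English Wikipedia

# One step processes one batch; one epoch is one pass over the corpus.
tokens_per_batch = sequences_per_batch * tokens_per_sequence
approx_epochs = training_steps * tokens_per_batch / corpus_words

print(tokens_per_batch)              # 131072, i.e. roughly 128,000
print(round(approx_epochs, 1))       # 39.7, i.e. roughly 40 epochs
```

A step is therefore a single parameter update on one batch, while an epoch is however many steps it takes to see the whole corpus once.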

<b>Evaluated tasks:</b> 


| Tasks | Tested datasets | 
| --- | --- |
| Natural Language Inference | MNLI <br> WNLI |
| Semantic Text Similarity | QQP <br> STS-B <br> MRPC |
| Question Answering | QNLI |
| Sentiment Analysis | SST-2 |
| Sentence Classification | CoLA |
| Binary Entailment | RTE |

* All the tested datasets are part of the GLUE benchmark, and each task yields a GLUE score.
* GLUE (General Language Understanding Evaluation) keeps its test data on an online evaluation server; labels for the test set are not provided.
* The server evaluates the submitted predictions and reports the GLUE score.

<b>Keynotes:</b>
* Input embeddings of BERT are the sum of positional, segment and token embeddings.
* Good at sentence prediction, entailment and similarity tasks.
* The left-to-right language model performs worse in all the tasks compared to masked language modelling.
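
The first keynote, that the input embedding is the sum of token, segment and position embeddings, can be illustrated with toy lookup tables. The sizes and random values below are arbitrary assumptions chosen for the sketch and are not BERT's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_segments, max_len, dim = 100, 2, 16, 8   # toy sizes

# Three lookup tables, one row per token id, segment id and position.
token_emb = rng.normal(size=(vocab_size, dim))
segment_emb = rng.normal(size=(n_segments, dim))
position_emb = rng.normal(size=(max_len, dim))

token_ids = np.array([5, 17, 42, 8, 3])      # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 1, 1])      # sentence A = 0, sentence B = 1
positions = np.arange(len(token_ids))

# Element-wise sum of the three embeddings for every input position.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
```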

<b>Terminology:</b>
* Feature-based approach: Uses task-specific architectures that include the pretrained representations as additional features. (Eg: ELMo)
* Finetuning approach: The pretrained model is trained on the downstream task by simply finetuning its parameters. (Eg: OpenAI GPT, ULMFiT)


<b>  Questions: </b>
* How does masking help?
* What is the difference between a masked language model and a normal language model?
* What is the difference between steps and epochs?

<b>Implementation:</b>

* [Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)


## ULMFiT (Universal Language Model Fine-tuning)

"Pretraining a language model on a huge corpus and finetuning it, with an AWD-LSTM architecture, for the target task"

"Note that this model is primarily focused on classification tasks"

* The aim is to finetune an existing model and use it for various NLP tasks.
* With only 100 labelled examples, ULMFiT matches the performance of training from scratch on 100x more data.
* In addition to the model, they also proposed 'discriminative fine-tuning', 'slanted triangular learning rates' and 'gradual unfreezing'.
* <b>Architecture:</b> [3-layer LSTM](https://arxiv.org/pdf/1708.02182.pdf)

<b>Approach:</b>
* Assume two tasks: a source task ($\tau_S$) and a target task ($\tau_T$).
* Language modeling is the ideal source task and is widely used. The authors use the state-of-the-art language model [AWD LSTM](https://arxiv.org/pdf/1708.02182.pdf).
* The method consists of the following steps:
  - General model pretraining: Pretraining on the Wikitext-103 dataset. Pretraining improves performance and speeds up convergence on the downstream tasks.
  - Target task LM finetuning: Finetuning the language model on target task data. For this step they proposed "discriminative finetuning" and "slanted triangular learning rates", which are explained below.
  - Target task classifier finetuning: Finetuning the classifier on the target task, using gradual unfreezing (explained below).
* <b>Discriminative Finetuning:</b> This is the process of assigning different learning rates to different layers. Each layer learns a different level of feature (for example, the initial layers capture general, low-level features and the final layers capture more task-specific, high-level features), so each layer is tuned with its own learning rate. The update of the parameters ($\theta$) at time step $t$ changes as follows:
\begin{equation}\theta_t = \theta_{t-1} - \eta \nabla_{\theta} J(\theta)\end{equation}
\begin{equation} \implies \theta_t^l = \theta_{t-1}^l - \eta^l \nabla_{\theta^l} J(\theta)\end{equation}
  - where $(\theta^1, \theta^2, \ldots, \theta^L)$ and $(\eta^1, \eta^2, \ldots, \eta^L)$ are the parameters and learning rates of the layers $1, 2, \ldots, L$ respectively.
  - $\nabla_{\theta}J(\theta)$ represents the gradient of the model's objective function.
* <b>Slanted triangular learning rate (STLR):</b> The learning rate is first increased linearly for a short period and then decreased linearly over a longer one. This lets the parameters quickly move to a suitable region and then converge to task-specific features.
* For finetuning the classifier, the authors attach two additional linear blocks to the pretrained model. As in computer vision classifiers, each block uses batch normalization and dropout, with a ReLU activation for the intermediate layer and a softmax for the final layer.
* The parameters in these last two layers are learnt from scratch. (Here we can see the transfer learning: everything below them is reused.)
* <b>Gradual unfreezing:</b> Since the model forgets what it has learnt if all the layers are finetuned at once, gradual unfreezing is introduced. In this process, initially only the last layer is unfrozen and finetuned for one epoch, then the last two layers are finetuned for another epoch, and so on until the whole model is finetuned.
* The preprocessing is the same as in [McCann et al.](https://arxiv.org/pdf/1708.00107.pdf) In addition, the authors added special tokens for uppercase words, elongations and repetitions.
* Similar to ELMo, the authors pretrain both a forward and a backward LM. Then they finetune a classifier for each LM using BPT3C (Backpropagation through time for text classification) and average the two classifiers' predictions.
* Various analyses of pretraining, finetuning and bidirectionality are given in the paper.
* ULMFiT is primarily useful for
  - NLP for non-English languages, where training data is scarce.
  - NLP for new tasks that do not have a state-of-the-art model.
  - Tasks with a limited amount of labelled data.
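
The three finetuning techniques described above (discriminative finetuning, slanted triangular learning rates and gradual unfreezing) can be sketched as small schedule functions. The function names are hypothetical. The default values (a factor of 2.6 between the learning rates of adjacent layers, `cut_frac=0.1`, `ratio=32`, `eta_max=0.01`) are the ones I recall from the ULMFiT paper and should be treated as assumptions to verify.

```python
import math

def discriminative_lrs(eta_last, n_layers, factor=2.6):
    """Per-layer learning rates, lowest layer first.

    The last layer gets eta_last; each layer below it gets the rate of the
    layer above divided by `factor`.
    """
    return [eta_last / factor ** (n_layers - 1 - l) for l in range(n_layers)]

def slanted_triangular_lr(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, long linear decay."""
    cut = math.floor(total_steps * cut_frac)       # step where the peak is reached
    if t < cut:
        p = t / cut                                # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decreasing phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

def gradual_unfreezing(n_layers, n_epochs):
    """Indices of trainable layers per epoch: last layer first, one more each epoch."""
    return [list(range(max(0, n_layers - 1 - e), n_layers)) for e in range(n_epochs)]

layer_lrs = discriminative_lrs(0.01, n_layers=3)
peak_lr = slanted_triangular_lr(100, total_steps=1000)
schedule = gradual_unfreezing(n_layers=3, n_epochs=3)
```

In a real training loop the three would be combined: at every step the slanted triangular value sets the rate of the last layer, the per-layer factors scale it for the layers below, and only the layers unfrozen for the current epoch are updated.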

<b>Evaluated tasks:</b> 

* <b>Classification tasks:</b>

| Tasks | Tested datasets | 
| --- | --- |
| Sentiment analysis | IMDB dataset<br> Yelp review dataset |
| Question classification | small TREC dataset |
| Topic classification | AG news dataset <br> DBpedia ontology dataset |

<b>Hyperparameters:</b>

| Hyper parameter | Value/name |
| --- | --- |
| Embedding size | 400 |
| Batch size | 64 for LM |
| Dropout | 0.4 to layers<br> 0.3 to RNN layers<br> 0.4 to input embedding layers<br> 0.05 to embedding layers<br> weight dropout of 0.5 to the RNN hidden-to-hidden matrix |
| Optimizer | Adam |
| Adam momentum terms | $β_1$ = 0.7, $β_2$ = 0.99 |
| Finetuning learning rate | 0.004 for LM, 0.01 for classifier |
| Training loss | cross-entropy (next-word prediction for the LM, class prediction for the classifier) |

<b>Keynotes:</b>
* Language modeling is the ideal source task and is widely used. This is because it can capture
  - Context
  - Long-term dependencies
  - Hierarchical relations and
  - Sentiment
* A language model provides a hypothesis space with all these features, which is useful for all NLP tasks.
* Finetuning is said to be the most critical part of transfer learning. "Overly aggressive fine-tuning will cause catastrophic forgetting, eliminating the benefit of the information captured through language modeling; too cautious fine-tuning will lead to slow convergence (and resultant overfitting)."

<b>Terminology:</b>
* **Transductive inference:** Reasoning from observed specific (training) cases to specific (test) cases.
 - In [transductive transfer](https://www.andrewoarnold.com/arnolda-transfer-icdm-short.pdf) learning, the test data is made available during training; here both sets of data are from the same domain.
* **Inductive inference:** Reasoning from observed specific (training) cases to general rules, which are then applied to test cases.
* **Hypercolumns:** We get embeddings as features at different levels (recall ELMo) and can use them in different ways. The concatenation of these embeddings from different layers is called a hypercolumn.
* **Multi-task learning:** Multiple tasks are solved at the same time. For more see [here](https://en.wikipedia.org/wiki/Multi-task_learning)
* **Pooling:** After convolving or computing embeddings we obtain features with thousands of dimensions, so we reduce the dimensions by aggregating the features. This process is called pooling, and it decreases the computational cost. [Here](https://computersciencewiki.org/index.php/Max-pooling_/_Pooling) you can check convolutional pooling and [here](http://jessicastringham.net/2018/12/30/conv-max-pool.html) pooling in NLP.
* **Batch normalization:** It normalizes the output of the previous layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation.
* **chain-thaw:** Sequentially unfreezes and finetunes a single layer at a time. See [here](https://www.aclweb.org/anthology/D17-1169)
* **Supervised learning:** Only the labelled examples are used to finetune the LM.
* **Semi-supervised learning:** All task data, including unlabelled examples, is available and can be used to finetune the LM.

<b>Questions</b>
* **What is convergence in machine learning/deep learning? Why is it required?**
  - Converging means that as the iterations of the training algorithm go on, the loss (or the parameters) changes less and less and settles near a stable value.
* **What is multi-task learning? What is the difference between multi-task learning and transfer learning?**

 

## GPT (Generative Pre-Training)

"Generative pre-training of a language model on unlabelled text data, followed by supervised discriminative fine-tuning on each specific task"

* They use task-aware input transformations during finetuning, so that effective transfer can be achieved with minimal changes to the model architecture.
* Learning more than word-level information from unlabelled text is difficult for two main reasons (according to the paper):
  - It is unclear which optimization objectives are most effective for learning text representations that transfer well.
  - There is no single, general procedure for transferring the learned representations to the target task.
* Considering these difficulties, the paper aims at effective transfer learning through a semi-supervised approach.

<b>Approach:</b>
* The approach is a combination of unsupervised pre-training and supervised finetuning.
   - First, a language modeling objective is applied to unlabelled data to learn the initial parameters of the neural network.
   - Then the parameters are finetuned for the required task with the corresponding supervised objective, using input transformations derived from [traversal-style approaches](https://arxiv.org/pdf/1509.06664.pdf). These approaches turn structured text into a single contiguous sequence of tokens.
* The *Transformer* architecture is used for the model, which provides a more structured memory for long-term dependencies than recurrent neural networks.

* <b>Unsupervised pre-training:</b>

 * Given an unlabelled corpus of tokens $U = \{u_1, u_2, \ldots, u_n\}$, we maximize the likelihood function:
 
 \begin{equation} L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta) \tag{1}\end{equation}
 
   -  where $k$ is the size of the context window and $\theta$ are the parameters of the network
   
 * They used a multi-layer Transformer decoder (a variant of the Transformer) for the language model.

  \begin{equation}h_0 = UW_e + W_p \end{equation}
  \begin{equation}h_l = transformer\_block(h_{l-1}) \quad \forall l \in[1,n] \tag{2}\end{equation}
  \begin{equation} P(u) = softmax(h_nW_e^T) \end{equation}

  - where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens and $n$ is the number of layers
  - $W_e$ = token embedding matrix
  - $W_p$ = position embedding matrix
  - $h_0$ is the input to the first transformer block: the token embeddings plus the position embeddings
  
* <b>Supervised fine-tuning :</b>
  - After training with the objective in equation (1), we adapt the parameters to the supervised target task.
  - For a labelled dataset $C$, each instance consists of a sequence of input tokens $x^1, x^2, \ldots, x^m$ and a label $y$. The inputs are passed through the pretrained model to obtain the final transformer block's activation $h_l^m$, which is fed into a linear output layer with parameters $W_y$ to predict $y$:

  \begin{equation}P(y \mid x^1, x^2, \ldots, x^m) = softmax(h_l^m W_y) \tag{3} \end{equation}

  - This gives the following objective to maximize:

  \begin{equation} L_2(C) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m) \tag{4}\end{equation}

  - One observation by the authors is that using language modeling as an auxiliary objective during finetuning
    - improves generalization of the supervised model
    - accelerates convergence
  - So the following objective is optimized, where $\lambda$ is a weight:

  \begin{equation} L_3(C) = L_2(C) + \lambda L_1(C) \end{equation}

* <b>Task-specific input transformations:</b>
 - Although classification is straightforward with this architecture, tasks like question answering and textual entailment have structured inputs, such as ordered sentence pairs or triplets of document, question and answers.
 - Hence the inputs have to be adapted for these tasks. The authors use a traversal-style approach, in which the structured inputs are converted into an ordered sequence that the pretrained model can process.
 - For the similarity task, they concatenate the two texts with a delimiter in between, in both orders (text1 followed by text2 and vice versa) separately. Each order results in a representation $h_l^m$, and the two representations are added element-wise before being fed into the linear layer. See [figure](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf).

* They trained the model on the *BookCorpus* dataset.
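
The task-specific input transformations described above can be sketched as plain list manipulation. The token strings `<s>`, `$` and `<e>` and the function names are hypothetical placeholders for the start, delimiter and extract tokens; the real model uses learned special tokens.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"   # hypothetical special tokens

def entailment_input(premise, hypothesis):
    """Premise and hypothesis joined by a delimiter in one sequence."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(text1, text2):
    """Both orderings; the two final representations are added element-wise."""
    return [
        [START, *text1, DELIM, *text2, EXTRACT],
        [START, *text2, DELIM, *text1, EXTRACT],
    ]

def multiple_choice_inputs(context, question, answers):
    """One sequence per candidate answer; each is scored independently."""
    return [[START, *context, *question, DELIM, *ans, EXTRACT] for ans in answers]

pair = similarity_inputs(["how", "old", "are", "you"], ["what", "is", "your", "age"])
```

Each resulting sequence is processed by the same pretrained Transformer, so no task-specific architecture is needed beyond the final linear layer.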


<b>Evaluated tasks:</b>

| Tasks | Tested datasets | 
| --- | --- |
| Natural Language Inference | SNLI<br><br>MultiNLI<br><br>Question NLI<br><br> RTE<br><br>SciTail |
| Question Answering | RACE<br><br>Story Cloze |
| Semantic Similarity | MSR Paraphrase Corpus<br><br>Quora Question Pairs<br><br>STS Benchmark |
| Text classification | SST-2 <br><br> CoLA |

<b>Hyperparameters:</b>
* **Pre-training:**

| Hyperparameter | Value/Name |
| --- | --- |
| Attention heads | 12 |
| Dimensions | 768 |
| Optimizer | Adam |
| Max learning rate | 2.5e-4 |
| Epochs | 100 |
| Batchsize | 64 |
| Dropouts | 0.1 |
| Regularization | L2 |
| Activation function | Gaussian Error Linear Unit (GELU) |

* **Finetuning:**

| Hyperparameter | Value/Name |
| --- | --- |
| Dropout | 0.1 |
| Learning rate | 6.25e-5 | 
|Batch size | 32 |
| Epochs | 3 |
 

<b>Terminology:</b>
* **Discriminatively trained models:** The models which are trained for a specific task with corresponding datasets.
* **Sequence labelling:** A pattern recognition task which assigns each word with a categorical label. (Eg: Parts of speech tagging, Named Entity Recognition)

<b>Preview:</b>
* The GPT tokenizer (byte pair encoding) splits a word into subword units, which often look like root + suffix.
* This tokenization differs from the word-level tokenization of nltk. For example (an illustrative split):
  - In nltk tokenizer 'Abstraction' ---> \[ 'Abstraction' ]
  - In GPT tokenizer 'Abstraction' ---> \[ 'Abstract', 'ion' ] 

## GPT-2
 "Our suspicion is that the prevalence of single task training
on single domain datasets is a major contributor to the lack
of generalization observed in current systems."

  - The aim is to show that language models can learn multiple tasks without explicit supervision. Transfer to a task is done without any modification of parameters or architecture, and can still reach state-of-the-art results.

  - The largest GPT-2 model has 1.5B parameters and achieves state-of-the-art results on 7 out of 8 tested language modeling datasets.

  - Although current deep learning models achieve high accuracy, they are sensitive to small changes in the data distribution. They perform well only on the specific tasks they were trained for and do not generalize.

  - Work such as BERT and GPT showed that task-specific architectures are not required; transferring the self-attention blocks is sufficient.

  - So the idea is to build a more general and robust language model without supervision, such that it can be used for various tasks without supervised training on the corresponding datasets.
   

<b>Approach:</b>
  - Previous approaches combined unsupervised pretraining of a language model with supervised finetuning, which is a convenient recipe for transfer learning.
  
  - Language modeling is the core of the approach. It is usually framed as unsupervised distribution estimation from a set of examples $(x_1, x_2, ..., x_n)$, each composed of a variable-length sequence of symbols $(s_1,s_2,...,s_n)$. The joint probability of a sequence factorizes into a product of conditional probabilities:

  \begin{equation} P(x) = \prod_{i=1}^n p(s_i \mid s_1, \ldots, s_{i-1}) \end{equation}

  - Learning a single task can be expressed as estimating **P(output|input)**. A general system that performs many tasks should also condition on the task: **P(output|input, task)**.

  - Byte-level language models are not competitive with word-level language models on large datasets such as the One Billion Word Benchmark.

  - Hence the authors used Byte Pair Encoding (BPE), which is a middle ground between word-level and character-level language modeling.

  - However, BPE implementations often operate on Unicode code points rather than bytes, which would require a base vocabulary of over 130,000 symbols. A byte-level version of BPE needs a base vocabulary of only 256.

  - Since this approach can assign a probability to any Unicode string, the LM can be evaluated on any dataset regardless of its size, pre-processing, etc.

  - <b>Architecture:</b>
    - The model is based on the attention mechanism and largely follows the GPT architecture, with small changes.
    
    - Layer normalization is moved to the input of each sub-block, and an additional layer normalization is added after the final self-attention block.

  - They trained four language models with approximately log-uniformly spaced sizes.

  - The smallest model is equivalent to the original GPT model. The second smallest is equivalent to the $BERT_{LARGE}$ model, and the largest model is called GPT-2.

  - They first evaluated zero-shot task transfer on language modeling datasets.


<b>Dataset</b>
  - The dataset was chosen to be as general as possible, without being restricted to a single domain such as news articles, Wikipedia or fiction books.

  - They considered web scrapes such as Common Crawl, since these are large and cover many domains.

  - However, Common Crawl has notable data quality issues, so the authors built a new, higher-quality web scrape from the outbound links shared on Reddit, a social media platform.

  - This new dataset created by the authors is called WebText.

  - They removed all Wikipedia documents from WebText for a more precise comparison with other approaches, many of which use Wikipedia as a data source.

  - Although the LMs were not finetuned on the evaluation datasets, they achieved state-of-the-art results on various language modeling datasets because of their number of parameters and byte-level vocabulary.


<b>Hyperparameters</b>

| Hyperparameter | Value/Name |
| --- | --- |
| Attention heads | 12 |
| Dimensions | 768 |
| Optimizer | Adam |
| Max learning rate | 2.5e-4 |
| Epochs | 100 |
| Batchsize | <b>512</b> |
| Dropouts | 0.1 |
| Regularization | L2 |
| Activation function | Gaussian Error Linear Unit (GELU) |
| Vocabulary size | 50,257 |
| Context size | 512-1024 |

- The hyperparameters used for the four LMs are as follows:

| Parameters |Layers | $d_{model}$ |
| --- | --- | --- |
| 117M | 12 | 768 | 
| 345M | 24 | 1024 | 
| 762M | 36 | 1280 |
| 1542M | 48 | 1600 |  

<b>Evaluated tasks</b>

| Tasks | Tested datasets |
| --- | --- |
| Language modeling | Children's Book Test (CBT) <br> LAMBADA |
| Common sense reasoning | Winograd Schema Challenge |
| Reading comprehension | Conversation Question Answering dataset (CoQA) |
| Summarization | CNN and Daily Mail dataset |
| Translation | WMT-14 English-French |
| Question Answering | Natural Questions |

<b>Key notes</b>
  - According to [McCann et al](https://arxiv.org/pdf/1806.08730.pdf), language provides a flexible way to specify the task, input and output all as a sequence of symbols. For eg: (translate from German to English, German text, English text).
  - Some of their experiments showed that unsupervised learning is able to perform multiple tasks, but learns much more slowly than explicit supervised learning.
  - Character-level language modeling is very useful for infrequent words and word-level modeling for frequent words.
  - Better-than-expected results can partly be due to overlap between the train and test data. As the dataset size increases, this phenomenon becomes more likely. According to the authors, this may also be happening with WebText.
  

<b>Terminology</b>
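  * **Web scraping:** The process of collecting data from the web into a local database for retrieval or analysis.
  * **Byte Pair Encoding:** A subword tokenization method that starts from single symbols (characters or bytes) and repeatedly merges the most frequent pair of adjacent symbols into a new symbol.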
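
A minimal sketch of how byte pair encoding learns its merges is given below. It works on characters rather than bytes and ignores word-boundary markers, so it is a simplification of what GPT-2 uses; the function name `learn_bpe` and the toy word list are made up for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

After two merges the frequent prefix "low" has become a single symbol, while the rarer endings "er" and "est" are still split into characters.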






## Overview

| Model | Architecture | Dataset name | Dataset size | Pretrained model |
| --- | --- | --- | --- | --- |
| BERT | Transformer |BookCorpus <br> English Wikipedia | 800M words <br> 2500M words | [Installing bert embeddings](https://pypi.org/project/bert-embedding/) |
| ELMo | biLSTM | [One billion word benchmark](https://arxiv.org/pdf/1312.3005.pdf) | 1B words(30M sentences) | [Elmo embeddings](https://allennlp.org/elmo) |
| GPT | Transformer | BookCorpus | 800M words | [Installing GPT embeddings](https://github.com/huggingface/pytorch-transformers) | 
| ULMFiT | AWD LSTM | Wikitext-103 | 103M words | [ULMFiT embeddings](http://files.fast.ai/models/wt103/) |


# Finetuning from scratch using pretrained parameters


## Creating vocab.txt file

* When pretraining on a particular domain, we do not need all the words of the general domain, so we do not use the vocab.txt provided by the BERT authors.
* Instead, we create a new vocab.txt file from our corpus using this [code](https://github.com/kwonmha/bert-vocab-builder).

## BERT
### Source:
 [Pretraining BERT-Google Research](https://github.com/google-research/bert#pre-training-with-bert)




* BERT pretrained models consist mainly of three files:
  - **.ckpt:** Checkpoint file containing the pretrained weights.
  - **vocab.txt**: Vocabulary file mapping each wordpiece to a word id.
  - **bert_config.json:** Hyperparameters of the pretrained model.
* For pretraining, the input file is a plain text (.txt) file.
* The documents are delimited by empty lines.
* The relevant scripts provided by Google Research are:
  - create_pretraining_data.py
  - run_pretraining.py
  - **extract_features.py** (We have to do only this)


* We can obtain embeddings for our data from BERT by using extract_features.py and model.ckpt. This extracts features from the pretrained checkpoint and does not update the weights.
* Here the training data is our textbooks dataset.


## GPT


## ELMo
* The pretraining of elmo from scratch can be seen [here.](https://github.com/allenai/bilm-tf/blob/master/README.md#training-a-bilm-on-a-new-corpus)

* The ELMo training works with GPU, but not with TPU.

# Sentence embeddings

* Can be seen [here](https://towardsdatascience.com/fse-2b1ffa791cf9).

# Semantic text similarity
   


* Finding the similarity between words is the fundamental step, which can be extended to sentences, paragraphs and documents.
* Words can be similar 
  - <b>Lexically:</b> If the words have similar character sequences. Lexical similarity can be checked by string-based algorithms.
  - <b>Semantically: </b> If the words have similar meanings or are used in the same context. Semantic similarity can be checked by corpus-based and knowledge-based algorithms.

  
## Types
### String based:
* Measures similarity from character composition and string sequences. The string-based algorithms are further divided into two types:

#### Character based: 
  * <b>Longest common substring:</b> It measures similarity by the longest contiguous chain of characters that appears in both strings.
  * <b>Damerau Levenshtein:</b> The similarity is measured by counting the minimum number of operations required to transform one string into the other. Each operation is an insertion, deletion, substitution, or transposition of two adjacent characters.
  * <b>Jaro:</b> It calculates similarity from the number and order of the common characters between the strings.
  * <b>Jaro Winkler:</b> It is an extension of Jaro similarity that uses a prefix scale, giving more favourable ratings to strings that match from the beginning for a set prefix length.
  * <b>Needleman Wunsch:</b> This dynamic programming algorithm performs a global alignment, finding the best alignment over the entire length of the two sequences. It is suitable for long sequences of similar texts.
  * <b>Smith-Waterman:</b> It is also a dynamic programming algorithm, but it performs a local alignment, finding the best-matching regions within the two sequences. It is suitable for long sequences of dissimilar texts that may share similar regions.
  * <b>N-grams:</b> The similarity is calculated from the n-grams (sub-sequences of n words or characters) of the two texts. It is the ratio of the number of shared n-grams to the maximal number of n-grams.
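
A few of the character-based measures above can be sketched in plain Python. The edit distance below is the restricted Damerau-Levenshtein variant (also called optimal string alignment), which is one common way to implement the idea; the function names are chosen for this example only.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def longest_common_substring(a, b):
    """Longest contiguous run of characters present in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def ngram_similarity(a, b, n=2):
    """Shared character n-grams divided by the larger n-gram set size."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    denom = max(len(grams_a), len(grams_b))
    return len(grams_a & grams_b) / denom if denom else 0.0

print(osa_distance("ca", "ac"))                              # 1
print(longest_common_substring("similarity", "dissimilar"))  # similar
print(ngram_similarity("night", "nacht"))                    # 0.25
```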
  
#### Term based:
  * <b>Block distance:</b> It is also known as Manhattan distance. It is the distance between two data points when a grid-like path is followed. Mathematically, it is the sum of the absolute differences of the corresponding components.
  * <b>Cosine similarity:</b> It is the cosine of the angle between the two vectors representing the corresponding data points.
  * <b>Dice's coefficient:</b> It is the ratio of twice the number of common terms in two strings to the total number of terms in both strings.
  * <b>Euclidean distance:</b> It is also called L2 distance. It is the square root of the sum of the squared differences of the corresponding components of the two vectors.
  * <b>Jaccard similarity:</b> It is the ratio of the number of shared terms to the number of unique terms across both sentences. The Jaccard distance is one minus this value.
  * <b>Matching coefficient:</b> It simply counts the number of positions at which both vectors are non-zero.
  * <b>Overlap coefficient:</b> It is similar to Dice's coefficient, but counts a full match if one sentence is a subset of the other. 
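
The term-based measures above can be sketched on whitespace-tokenised sentences as follows. This is only an illustration: the function names and the two example sentences are made up, the set-based measures ignore repeated words, and the vector-based ones use raw term counts (other weightings are possible). Inputs are assumed to be non-empty.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def dice_coefficient(a, b):
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def overlap_coefficient(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb))


def cosine_similarity(a, b):
    ca, cb = Counter(a), Counter(b)  # term-count vectors
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)


def block_distance(a, b):
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[t] - cb[t]) for t in set(ca) | set(cb))


def euclidean_distance(a, b):
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[t] - cb[t]) ** 2 for t in set(ca) | set(cb)))


s1 = "the cat sat on the mat".split()
s2 = "the cat sat".split()

print(jaccard_similarity(s1, s2))   # 0.6
print(dice_coefficient(s1, s2))     # 0.75
print(overlap_coefficient(s1, s2))  # 1.0, since s2 is a subset of s1
print(block_distance(s1, s2))       # 3
print(cosine_similarity(s1, s2))
print(euclidean_distance(s1, s2))
```

The example shows the difference noted in the list: the second sentence is a subset of the first, so the overlap coefficient reports a full match while Dice and Jaccard do not.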

### Corpus based:
* The similarity between words is measured from the information available in large corpora.
* <b>Hyperspace analogue to language (HAL):</b> 
  - A matrix is created representing the semantic strength between words using word co-occurrences. The user can eliminate the low-entropy columns from the matrix.
  - The semantic strength is calculated by placing a center word in a window and accumulating co-occurrence weights that are inversely proportional to the distance from the center word. It also considers whether the word has occurred before or after the center word.
* <b>Latent semantic analysis (LSA):</b>
  - It assumes that words occurring in similar contexts have similar meanings.
  - A matrix is created with words as rows and paragraphs as columns. This matrix is reduced to a lower number of columns using singular value decomposition (SVD). 
  - Then cosine similarity is used to measure the similarity between the vectors formed by any two rows.
* <b>General latent semantic analysis (GLSA):</b>
  - It is the extension of the latent semantic analysis(LSA). GLSA focuses on the term vectors instead of the dual document representation.
  - Mostly it is similar to the latent semantic analysis, except changes in consideration of columns and cells.(NOT very sure)
* <b>Explicit semantic analysis (ESA):</b>
  - It can be used to measure the distance between two arbitrary texts. Each term in the text is assigned a tf-idf vector. 
  - Then cosine similarity is used to find the distance between these two vectors.
* <b>Cross-language explicit semantic analysis (CESA):</b>
  - It is a multilingual generalization of ESA. It uses Wikipedia as a language-independent concept space in which documents are represented as vectors.
  - To measure the relatedness of two documents in different languages, the cosine distance between their corresponding vector representations is taken.
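
To make the LSA steps above concrete (word-by-paragraph matrix, SVD reduction, cosine similarity between rows), here is a minimal sketch with NumPy. The four tiny paragraphs, the choice of raw counts as cell values and the rank `k = 2` are all assumptions made for illustration; real applications use far larger corpora and usually some weighting of the counts.

```python
import numpy as np

# Toy corpus, made up for illustration: two "pet" and two "finance" paragraphs.
paragraphs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# Word-by-paragraph count matrix: words as rows, paragraphs as columns.
vocab = sorted({w for p in paragraphs for w in p.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(paragraphs)))
for j, p in enumerate(paragraphs):
    for w in p.split():
        counts[index[w], j] += 1

# Truncated SVD: keep only the k strongest latent dimensions.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional vector per word


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_same = cosine(word_vectors[index["cat"]], word_vectors[index["dog"]])
sim_diff = cosine(word_vectors[index["cat"]], word_vectors[index["stocks"]])
print(sim_same, sim_diff)
```

"cat" and "dog" never occur in the same paragraph, yet they end up close in the reduced space because they share context words, while "cat" and "stocks" do not.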
  
  

### Knowledge based:
* The similarity between words is measured from the information available in semantic networks.

<b>Notes:</b>
* As our aim is to find semantic similarity, the string-based measures are of little interest to us.

<b>Questions:</b>
* What is contagious chain?


## Challenges
 * According to [GPT paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf), the challenges of semantic text similarity task are:
   - Recognizing rephrasing of concepts
   - Understanding negation
   - Handling syntactic ambiguity

## Applications

* Information retrieval
* Text classification
* Document clustering
* Topic detection
* Topic tracking
* Questions generation
* Question answering
* Essay scoring
* Short answer scoring
* Machine translation
* Text summarization

## Metrics

* Root mean square error (RMSE): 
  - RMSE is the most commonly used metric to evaluate how well a model predicts values.
  - It is the quadratic mean of the differences between the actual and predicted values, i.e. the root of the mean of their squares.
  - Given $y_1, y_2, y_3,.....y_n$ as the actual values and $\hat{y_1},\hat{y_2},\hat{y_3},.....\hat{y_n}$ as the predicted values, the root mean square error is defined as
  \begin{equation} RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y_i})^2}\end{equation}
  - Properties:    
    - If the raw differences were summed, negative differences could cancel positive ones and the total could be zero.
    - To avoid that, the differences are squared, and the root of their mean brings the result back to the scale of the values.
    - Each error is squared before averaging, so a large error has a disproportionately large effect on the metric.
    - Hence, the lower the metric, the better the model fits.
    - For these reasons, it is used as the standard metric in many tasks and can also be used as a measure of accuracy.
    - However, this is not the best metric to evaluate the semantic text similarity model.
    
    
* Pearson correlation($\rho$):
  - Although RMSE is the standard metric for most tasks, it is not an appropriate metric for this specific task.
  - This is because a similarity score is subjective and changes from subject to subject.
  - So, it is important to follow the trend instead of the exact value given. For this we have to consider the correlation between the scores instead of their differences.
  - The Pearson correlation measures the strength and direction of the linear relation between the scores. It is denoted by $\rho$ and given by:

  \begin{equation} \rho = \frac{\operatorname{cov}(y,\hat{y})}{\sigma_y \sigma_{\hat{y}}}\end{equation}
  - where y and $\hat{y}$ are the actual and predicted scores respectively.
  - Given the expectation of y as $\operatorname{E}[y]$, **cov** represents the covariance of the two variables and is given by
  \begin{equation} \operatorname{cov}(y,\hat{y}) = \operatorname{E}{\big[(y - \operatorname{E}[y])(\hat{y} - \operatorname{E}[\hat{y}])\big]}\end{equation}
  - and $\sigma$ represents the standard deviation of the values of that particular variable.
  - The standard deviation is given by:
  \begin{equation} \sigma_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^N (y_i - \bar{y})^2 }\end{equation}
  - The range of the Pearson correlation ($\rho$) is from -1 to 1. A value of 1 means the scores are perfectly positively linearly correlated, 0 means there is no linear correlation, and -1 means they are perfectly negatively linearly correlated.
   

* Spearman correlation($r_s$):
  - However, the Pearson correlation captures linear correlation only.
  - To capture monotonic relations that may be non-linear, Spearman proposed an extension of the Pearson correlation: calculate the Pearson correlation of the ranks of the variables.
  - This measures how well the relation between the actual and predicted variables can be described by a monotonic function.
  - Hence, it is defined as the Pearson correlation of the ranks of the variables.
  - The relation is given by:
    \begin{equation} r_s = \frac {\operatorname{cov}(\operatorname{rg}_y,\operatorname{rg}_{\hat{y}})} { \sigma_{\operatorname{rg}_y} \sigma_{\operatorname{rg}_{\hat{y}}} } \end{equation}

    - where $\operatorname{rg}_y$ is the rank variable of the variable y and $\operatorname{rg}_{\hat{y}}$ is the rank variable of the variable $\hat{y}$.

  - For a more detailed explanation of the metrics, see the appendix.
  - Create some plots for each metric


* Appendix:
  - For example, given a set of values 
    - y = [0,1,2,3,4] and $\hat{y}$ = [5,6,7,8,9]
  - \begin{equation} RMSE(y, \hat{y}) = \sqrt{\frac{1}{5}((0-5)^2 + (1-6)^2 + (2-7)^2 + (3-8)^2 + (4-9)^2)} \end{equation}
  \begin{equation}\implies RMSE(y, \hat{y}) = 5 \end{equation}
  - whereas the Pearson correlation is $\rho = 1$, because $\hat{y} = y + 5$ is a perfectly linear function of y.
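
The three metrics can be checked numerically with a short NumPy sketch. The helper names (`rmse`, `pearson`, `ranks`, `spearman`) are made up for this note, and ties are given average ranks, which is one common convention. The normalisation factor (N or N-1) cancels in the Pearson ratio as long as the covariance and both standard deviations use the same one, so it is left out here.

```python
import numpy as np


def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def pearson(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    yc, hc = y - y.mean(), y_hat - y_hat.mean()  # centre both variables
    return float((yc @ hc) / np.sqrt((yc @ yc) * (hc @ hc)))


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    values = np.asarray(values, float)
    order = np.argsort(values, kind="mergesort")
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        r[mask] = r[mask].mean()
    return r


def spearman(y, y_hat):
    return pearson(ranks(y), ranks(y_hat))


y = [0, 1, 2, 3, 4]
y_hat = [5, 6, 7, 8, 9]          # the appendix example: shifted by 5
print(rmse(y, y_hat))            # 5.0
print(pearson(y, y_hat))         # 1.0

y_cubed = [v ** 3 for v in y]    # monotonic but non-linear in y
print(pearson(y, y_cubed))       # below 1
print(spearman(y, y_cubed))      # 1.0
```

The second pair illustrates why the rank-based measure is useful: the cubic relation is perfectly monotonic, so the Spearman correlation is 1 even though the Pearson correlation is not.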




# Various methods to calculate similarity



## Cosine similarity
 * Create the embeddings of the words of the two sentences.
 * Create the mean vectors of the sentences.
 * Calculate the cosine similarity between the two mean vectors.

 <b>Notes:</b>
 * This method gives high weight to irrelevant words, because every word contributes equally to the mean.
 * A difference in the number of words in the two sentences affects the accuracy of the similarity. 
  - For example: "My health has been improved compared to the last year" and "I am healthier now" are two sentences with similar meaning, but the number of words is different, which can move the mean away.
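
The three steps can be sketched as below. The 3-dimensional embeddings are invented numbers used purely for illustration (a real setup would load pretrained vectors), and the names `EMBEDDINGS`, `sentence_vector` and `cosine_similarity` are hypothetical.

```python
import numpy as np

# Hypothetical toy embeddings; the values are made up, not from any model.
EMBEDDINGS = {
    "my": np.array([0.1, 0.3, 0.0]),
    "health": np.array([0.9, 0.1, 0.2]),
    "has": np.array([0.0, 0.2, 0.1]),
    "improved": np.array([0.6, 0.5, 0.7]),
    "i": np.array([0.1, 0.2, 0.0]),
    "am": np.array([0.0, 0.3, 0.1]),
    "healthier": np.array([0.8, 0.3, 0.6]),
    "now": np.array([0.2, 0.1, 0.4]),
}


def sentence_vector(sentence, embeddings):
    """Mean of the embeddings of all known words in the sentence."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    return np.mean(vectors, axis=0)


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


v1 = sentence_vector("My health has improved", EMBEDDINGS)
v2 = sentence_vector("I am healthier now", EMBEDDINGS)
similarity = cosine_similarity(v1, v2)
print(similarity)
```

Because every word enters the mean with the same weight, function words such as "has" or "am" pull the sentence vector just as strongly as "health" or "healthier", which is the weakness mentioned in the notes.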

## Word mover's distance
* Remove the stop words from both sentences.
* For each word in one sentence, find the nearest word in the second sentence.
* Then sum up these distances and check which sentence is nearer.

<b>Notes:</b>
* This is used to say which sentence (out of many given sentences) is most similar to the given sentence.
* This cannot be used to say whether two sentences mean the same or not.
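
The steps listed above correspond to a relaxed form of the word mover's distance, in which each word moves entirely to its nearest counterpart; the full distance solves an optimal transport problem instead. A minimal sketch of that relaxed form follows. The stop-word list, the 2-dimensional embeddings and the function name `relaxed_wmd` are all made up for illustration, and the per-word distances are averaged rather than summed so that sentence length does not dominate.

```python
import numpy as np

STOP_WORDS = {"the", "a", "to", "in"}  # hypothetical, tiny stop-word list

# Hypothetical toy embeddings; values are invented for illustration.
EMB = {
    "obama": np.array([1.0, 0.1]), "president": np.array([0.9, 0.2]),
    "speaks": np.array([0.2, 1.0]), "greets": np.array([0.3, 0.9]),
    "media": np.array([0.8, 0.8]), "press": np.array([0.7, 0.9]),
    "illinois": np.array([0.1, 0.1]), "chicago": np.array([0.2, 0.1]),
    "band": np.array([-0.8, 0.5]), "gave": np.array([-0.2, -0.6]),
    "concert": np.array([-0.9, 0.6]), "japan": np.array([-0.5, -0.5]),
}


def content_words(sentence, embeddings, stop_words):
    return [w for w in sentence.lower().split()
            if w not in stop_words and w in embeddings]


def relaxed_wmd(s1, s2, embeddings=EMB, stop_words=STOP_WORDS):
    """Average distance from each word of s1 to its nearest word in s2."""
    w1 = content_words(s1, embeddings, stop_words)
    w2 = content_words(s2, embeddings, stop_words)
    if not w1 or not w2:
        raise ValueError("a sentence has no known content words")
    nearest = [
        min(np.linalg.norm(embeddings[w] - embeddings[v]) for v in w2)
        for w in w1
    ]
    return float(np.mean(nearest))


query = "Obama speaks to the media in Illinois"
candidates = [
    "The president greets the press in Chicago",
    "The band gave a concert in Japan",
]
distances = [relaxed_wmd(query, c) for c in candidates]
nearest_sentence = candidates[int(np.argmin(distances))]
print(distances, nearest_sentence)
```

As the notes say, the result is only meaningful for ranking candidates against a query; the raw number does not say whether two sentences mean the same.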

# Datasets

* Microsoft Research Paraphrase Corpus (MRPC)
* Quora Question Pairs (QQP)
* Semantic Textual Similarity Benchmark (STS-B)
* Mohler dataset


# Results

## Ridge regression 



### RMSE

| Model | Linear regression| Ridge regression | Ordinal ridge regression | 
| --- | --- | --- | --- | 
| ELMo | 0.965 | 0.965 | 1.010 |
| GPT | 1.060 | 1.060 | 1.103 | 
| BERT | 1.051 | 1.052 | 1.102 | 
| GPT-2 | 1.037 | 1.038 | 1.108 | 

### Pearson correlation

| Model | Linear regression| Ridge regression | Ordinal ridge regression | 
| --- | --- | --- | --- | 
| ELMo | 0.465 | 0.440 | 0.379 |
| GPT | 0.247 | 0.175 | 0.008 | 
| BERT | 0.287 | 0.170 | 0.105 | 
| GPT-2 | 0.313 | 0.206 | 0.118 | 

### Spearman correlation ($r_s$)


| Model | Linear regression| Ridge regression | Ordinal ridge regression | 
| --- | --- | --- | --- | 
| ELMo | 0.466 | 0.453 | 0.396 |
| GPT | 0.275 | 0.235 | -0.012 | 
| BERT | 0.209 | 0.181 | 0.078 | 
| GPT-2 | 0.180 | 0.168 | 0.065 | 

## Semantic text similarity- Benchmarks

| Model | STS-B | Mohler dataset |
| --- | --- | --- |
| ELMo | pearson = 0.65 [cite](https://arxiv.org/pdf/1806.06259.pdf) | --- |
| GPT | pearson = 0.823 | --- |
| BERT | corr = 0.8938784082825981<br>pearson = 0.8963020207577451<br> spearmanr = 0.8914547958074512 | --- |
| GPT-2 | --- | --- |


# Planned Implementation approach

## Training
* Train the transfer learning model architectures on the dataset of computer science textbooks.
* Put in the expected answer and calculate the vector of the corresponding sentence
* Calculate the vector of the desired sentence
* Find the distance for every desired sentence and the input sentence.
* Train the model with sentence's distance and corresponding grade/target value with all the transfer learning model architectures.

## Prediction
* Input the predicting sentence --> calculate the distance of this sentence vector with the expected sentence vector --> Predict the corresponding value
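
A minimal sketch of the last training step and the prediction step, assuming the sentence distances have already been computed by one of the models. It fits a ridge regression from distance to grade in closed form with NumPy. The distances and grades are invented toy numbers, and `fit_ridge`, `predict` and `alpha` are hypothetical names, not the implementation behind the results tables.

```python
import numpy as np


def fit_ridge(distances, grades, alpha=1.0):
    """Closed-form ridge regression of grade on distance (with intercept)."""
    distances = np.asarray(distances, float)
    grades = np.asarray(grades, float)
    X = np.column_stack([np.ones(len(distances)), distances])
    penalty = alpha * np.eye(2)
    penalty[0, 0] = 0.0  # leave the intercept unpenalised
    return np.linalg.solve(X.T @ X + penalty, X.T @ grades)


def predict(weights, distances):
    distances = np.asarray(distances, float)
    X = np.column_stack([np.ones(len(distances)), distances])
    return X @ weights


# Made-up toy data: distance between student answer and expected answer,
# and the grade given by a human grader.
train_distances = np.array([0.05, 0.10, 0.30, 0.50, 0.80])
train_grades = np.array([5.0, 4.5, 3.0, 2.0, 0.5])

weights = fit_ridge(train_distances, train_grades, alpha=0.1)
predicted = predict(weights, [0.2, 0.7])
print(weights, predicted)
```

With `alpha=0.0` this reduces to ordinary linear regression, which makes it easy to compare the two regressors on the same distances.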

## Things to do in R&D
* Plot the graphs with binary/ trinary and 5 grades clusters
* Create correlation between the graders.
* Train BERT&ELMo 

# Questions and Answers

<b>Date:</b> 19.06.2019

* <b>Q. Explain sentence embeddings.</b>
* A. Sentence embeddings are analogous to word embeddings. In sentence embeddings, sentences are mapped to vectors of real numbers. [source](https://en.wikipedia.org/wiki/Sentence_embedding)

* <b>Q. Explain word embedding technique</b>
* A. Have to read the paper [("How to generate a good word embedding?")](https://arxiv.org/pdf/1507.05523.pdf)

* <b>Q. Explain Language representation model.</b>
* A. A language model is a probability distribution over sequences of words; it gives the probability of a word given the preceding sequence. 

* <b>Q. What are word representations?</b>
* A. Word representations are same as word embeddings.

* <b> Q. Difference between multi-purpose NLP models and embedding models.</b>
* A. ????


# Resources

In [0]:
import tensorflow as tf

for example in tf.python_io.tf_record_iterator("/content/sample_data/tf_examples.tfrecord"):
    print(tf.train.Example.FromString(example))




## Presentation details

* Less text and more images
* Introduce topic with image ------> what are we doing with image!!
* Talk more about results
* (Abstract + Results)---> important
* The abstract and results lead to abstract and introduction
