
<p style="text-align:right; font-size:14px;"> University of Florence </p>
<p style="text-align:right; font-size:14px;"> Department of Information Engineering </p>
<p style="text-align:right; font-size:14px;"> Firenze, December 2021 </p>



<h2 align=center>Machine Learning Exam</h2>
<h3 align=center>RoBERTa by Facebook AI: reproducing some results.</h3>

<br>


In [None]:

__AUTHOR__ = {'lp': ("Lorenzo Pisaneschi",
                     "lorenzo.pisaneschi1@stud.unifi.it",
                     "https://github.com/pisalore/roberta_results")}

__TOPICS__ = ['Natural Language Processing', 'GLUE Benchmark', 'RACE Dataset']

__KEYWORDS__ = ['Machine Learning', 'AI', 'NLP', 'Transformers', 'BERT', 'fairseq', 'Facebook AI']

## Introduction
<br>
Natural Language Processing (NLP) is a research field at the intersection of linguistics, computer science and AI. It concerns
how computers process, analyze and understand large amounts of natural language data.

NLP can be used for:
<ul>
    <li>Machine Translation</li>
    <li>Sentiment Analysis</li>
    <li>Linguistic acceptability</li>
    <li>Sentences' similarity and entailment</li>
    <li>Automatic QA / Virtual assistants</li>
</ul>

<h2>Background</h2>

NLP has its roots in 1950, when Alan Turing published **"Computing Machinery and Intelligence"** in *Mind*,
an academic journal published by Oxford University Press. In this paper, Turing proposed the famous **Imitation game**, considering the question
*"Can machines think?"* and opening technical and philosophical debates.

From a technical point of view, we can divide NLP into three major historical periods and methods:


### Symbolic NLP (1950s – early 1990s)

Given a **collection of rules**, the computer emulates language understanding
without applying any kind of learning. This is a deterministic approach. An interesting related work is philosopher John
Searle's *"Chinese room"* thought experiment, which introduces the distinction between **Weak AI** and **Strong AI**.

The focus was on encoding natural languages (with their semantics, morphology and syntax) as data for machines.

### Statistical NLP (1990s–2010s)

The declining dominance of Chomskyan linguistic theories and the steady increase
of computational power made it possible to use machine learning algorithms in NLP as well. **IBM Research** developed
statistical NLP methods focused on machine translation, exploiting textual corpora.

Then, in the early 2000s, the web made huge amounts of unannotated data available, leading researchers to focus on
unsupervised learning algorithms.

### Neural NLP (present)

The use of deep learning, representation learning and word embedding techniques made it possible to achieve
state-of-the-art results in many NLP tasks. In particular, neural machine translation attempts to use a single neural network
to read a source sentence and output its translation.

## Neural NLP: Methods & techniques

As for any machine learning problem, three elements are the most important:
1. **Data**: how they must be mapped to be conveniently processed by a neural network.

2. **Architectures**: how they must be designed to make the learning process fast, reliable and as general as possible.

3. **Tasks**: the problems to be addressed.

### Neural Machine Translation

A first important example is machine translation, also because it was one of the first problems
considered in NLP:

<center> $\arg\max_y p(y \mid x)$ </center>

That is, find the target sentence **y** that maximizes the conditional probability of y given a source sentence **x**.

- **Neural machine translation** aims at building a single neural network that can be jointly tuned to maximize the
translation performance.

- **Encoder-decoder** architectures are often used for this goal.

- However, problems arise as the task gets harder: for example, encoding a long source sentence into a single
**fixed-length vector** forces the encoder to compress all the important information into it.

### Bidirectional Recurrent Neural Network

In *"Neural machine translation by jointly learning to align and translate"* (2014), Bahdanau et al. propose a method
in which the model learns to align and translate jointly, using the context vectors of the surrounding words.

- Encoding the input sentence into a **sequence of vectors** and choosing a subset of these vectors
adaptively while decoding the translation.

- Using a **Bidirectional RNN** for encoding.

- Using a **decoder** that emulates searching through the source sentence during translation.

### Bidirectional Recurrent Neural Network: some formulas

Given a sequence of vectors $x = (x_1, ..., x_{T_x})$, an encoder encodes it into a vector *c*:
<center> $h_t = f(x_t, h_{t-1})$ </center>
<center> $c = q(\{h_1, ..., h_{T_x}\})$ </center>
where $h_t \in \mathbb{R}^n$ is the hidden state at time $t$, $f$ and $q$ are some non-linear functions, and $c$ is a vector generated from the hidden states.
<br>
The decoder defines the probability of a translation $y$ as a product of conditionals, and the proposed idea uses a distinct context vector $c_i$ for each target word $y_i$:
<center> $ p(y) = \prod_{i=1}^{T} p(y_i \mid \{y_1, ..., y_{i-1}\}, x)$ </center>

$c_i$ depends on a **sequence of annotations** $(h_1, ..., h_{T_x})$, where each annotation $h_j$ contains information about the whole input sequence,
with a strong focus on the parts surrounding the *j-th* input word.

$c_i$ is a **weighted sum of annotations**; the weights depend on an *alignment model*, which scores how well the inputs
around position *j* match the output at position *i*.

This implements an **attention mechanism**.
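The weighted sum above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes the additive alignment model $e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j)$ described in the Bahdanau et al. paper, and the function name, the parameter shapes and the random inputs are hypothetical.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def additive_attention_context(s_prev, annotations, W_a, U_a, v_a):
    """Return (context vector c_i, weights alpha_i) for one decoding step.

    s_prev:      previous decoder state s_{i-1}, shape (n,)
    annotations: encoder annotations h_1..h_Tx, shape (Tx, m)
    W_a, U_a, v_a: alignment model parameters, with hypothetical
                   shapes (d, n), (d, m) and (d,)
    """
    # One alignment score e_ij per source position j.
    scores = np.tanh(W_a @ s_prev + annotations @ U_a.T) @ v_a
    # Normalizing the scores gives the weights alpha_ij.
    alpha = softmax(scores)
    # c_i is the weighted sum of the annotations.
    context = alpha @ annotations
    return context, alpha

# Toy example with random values (illustrative only).
rng = np.random.default_rng(0)
Tx, n, m, d = 5, 4, 6, 3
context, alpha = additive_attention_context(
    rng.normal(size=n),
    rng.normal(size=(Tx, m)),
    rng.normal(size=(d, n)),
    rng.normal(size=(d, m)),
    rng.normal(size=d),
)
print(alpha.round(3), context.shape)
```

The weights sum to one, so the context vector is a convex combination of the annotations: source positions with a higher alignment score contribute more to the translation of the current word.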

## Transformers: "Attention Is All You Need"
An **attention mechanism** makes it possible to dynamically highlight the relevant features of the input data; in NLP,
these are textual elements.

In *"Attention Is All You Need"* (2017), Vaswani et al. proposed the famous **Transformer** architecture, which relies
only on attention mechanisms, specifically **self-attention**, without recurrence, in order to:

- Improve performance
- Make parallelization possible
- Improve long sentence handling
- Improve generalization on other tasks, not only transduction

## Transformer: the model

<div>
    <div style="float: right">
        <img src="images/transformer_architecture.png" width="480">
    </div>
    <br>
     <div>
        <li>The <strong>transformer</strong> is an <strong>encoder-decoder</strong> architecture. The model is autoregressive: at each step, the previously generated symbols are used as additional input when generating the next one.</li>
        <br>
        <li>The <strong>encoder</strong> maps a sequence of symbols $x = (x_1, x_2, ..., x_n)$ to a sequence of continuous representations $z = (z_1, z_2, ..., z_n)$</li>
        <br>
        <li>The <strong>decoder</strong> then generates an output $y = (y_1, y_2, ..., y_m)$ one element at a time.</li>
    </div>
</div>

## Transformer: the attention mechanism

 <img src="images/transformer_attention.png">

 <div>
    <li><strong>Scaled Dot-Product Attention</strong>: $Attention(Q, K, V) = softmax(\frac{QK^T} {\sqrt{d_k}})V$</li>
    <li><strong>Multi-Head Attention</strong>: $MultiHead(Q, K, V) = Concat(head_1,...head_h)W^O$
    <br>
    $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$</li>
</div>
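The two formulas above can be sketched with NumPy as follows. This is a minimal, unbatched illustration without masking, dropout or biases; the toy dimensions, the random projection matrices and the function names are assumptions made for the example, not the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with the usual max-subtraction for stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # Self-attention: queries, keys and values are projections of the same X.
    # W_q, W_k, W_v are lists holding one projection matrix per head.
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)[0]
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Toy sizes (hypothetical): 4 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
_, weights = scaled_dot_product_attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
print(out.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the input tokens, which is what lets every position attend to all the others in a single matrix product.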

## Transformer: summary

- A Transformer is a deep learning model that adopts an **attention** mechanism, which weighs the importance
of the different parts of the input data.

- Unlike RNNs or LSTMs, a Transformer does not process the input sequentially: self-attention gives every word its context
within the sentence at once. This key idea allows parallelization and improves performance, generalizing to a wide range of tasks.


- The Transformer is an **encoder-decoder** architecture. The encoder produces word encodings that capture how relevant each
part of the sentence is to the others. The first encoder layer simply takes **word embeddings** as input.
The decoder, in turn, generates output probabilities over the vocabulary, starting from the encodings and from the
target sequence shifted by one position.


- The model learns three matrices for each attention unit: $W_Q, W_K, W_V$. **Training** is usually done in a
semi-supervised fashion: unsupervised pre-training followed by supervised fine-tuning.

## BERT: Bidirectional Encoder Representations from Transformers by Google AI

Transformers revolutionized NLP, and research moved in this direction. With *"BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding"* (2018), Devlin et al. proposed a new model to pre-train deep
bidirectional representations from unlabelled text; as a result, the pre-trained BERT can be fine-tuned to obtain
new state-of-the-art results in many tasks.

BERT rapidly became an important baseline for NLP research and experiments.

## BERT architecture

<img src="images/bert.png">

- Exploit **bidirectional pre-training** for language representations
- Exploit **fine-tuning** and **transfer learning** in order to avoid the use of heavily engineered architectures
- Pre-training is done with **unlabelled data**, while fine-tuning works with labelled data for the specific task

## BERT input representation and pre-training
<img src="images/bert2.png">

- To work with different **downstream tasks**, inputs are represented so that both single sentences and
sentence pairs can be encoded unambiguously. **WordPiece** embeddings are used.
- As training data, the **BooksCorpus** (800M words) is used, along with **English Wikipedia** (2,500M words).
- For bidirectional training, a percentage of the input tokens (15% in the paper) is masked (Masked Language Model, MLM); this has a downside, since the
[MASK] token never appears during fine-tuning. Next Sentence Prediction (NSP) is also used, in order to train the model to understand
relationships between sentences.
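The masking step can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it assumes the rule described in the BERT paper (15% of the positions are selected; of those, 80% become `[MASK]`, 10% become a random token and 10% are left unchanged, which mitigates the mismatch with fine-tuning). The tiny vocabulary and the function name are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked tokens, labels); labels are None where no prediction is needed."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # The model must predict the original token at this position.
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

# Toy example (hypothetical sentence and vocabulary).
tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```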

## Fine-tuning and experiments

- Transformer **self-attention** makes it possible to model many downstream tasks involving a single text or text pairs.

For fine-tuning, we simply plug the task-specific inputs and outputs into BERT. Experiments involved:
- Sentence equivalence
- Sentence entailment
- Question Answering
- Text classification (Linguistic Acceptability)

The metrics depend on the benchmarks and datasets considered.

Two BERT models have been used: $BERT_{base}$ (L=12, H=768, A=12 for 110M parameters) and $BERT_{large}$
(L=24, H=1024, A=16 for 340M parameters)
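As a sanity check on these sizes, the parameter counts can be estimated from L and H alone. The sketch below is a back-of-the-envelope count under stated assumptions that are not in the text above: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward width of 4H, and a pooler layer on top. The number of heads A does not change the count, because the heads split the same H-dimensional projections.

```python
def bert_parameter_count(L, H, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate number of weights in a BERT encoder with L layers of width H."""
    ffn = 4 * H  # assumed feed-forward inner size
    # Token, position and segment embeddings, plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * H + 2 * H
    # Query, key, value and output projections, each H x H with a bias.
    attention = 4 * (H * H + H)
    # Two dense layers: H -> 4H and 4H -> H, with biases.
    feed_forward = (H * ffn + ffn) + (ffn * H + H)
    # Two LayerNorms per layer.
    layer_norms = 2 * (2 * H)
    layer = attention + feed_forward + layer_norms
    # Dense pooler applied to the first token.
    pooler = H * H + H
    return embeddings + L * layer + pooler

base = bert_parameter_count(L=12, H=768)
large = bert_parameter_count(L=24, H=1024)
print(f"BERT base:  {base / 1e6:.1f}M")
print(f"BERT large: {large / 1e6:.1f}M")
```

Under these assumptions the estimate gives about 109.5M and 335.1M parameters, in line with the rounded figures of 110M and 340M.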

## RoBERTa by Facebook AI

As mentioned before, BERT became an essential baseline for NLP research.

Among other models, Facebook AI presented RoBERTa in **fairseq**, a sequence modeling toolkit written in PyTorch.

In *"RoBERTa: A Robustly Optimized BERT Pretraining Approach"* (2019), Liu et al. proposed an **optimized version of BERT**,
considering the following key points:

- Training the model **longer**, with bigger batches using **more data**
- Removing the use of **NSP** (Next Sentence Prediction) loss
- Training on **longer sequences**
- Using **Dynamic Masking** on training data: masking is done during training, not once during data preprocessing
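The difference between static and dynamic masking can be illustrated with a small sketch. It is deliberately simplified: masked positions are always replaced by a mask token, and the sentence, the 15% probability and the function names are assumptions chosen for the example rather than fairseq's actual data pipeline.

```python
import random

def sample_mask_positions(n_tokens, mask_prob, rng):
    # Each position is selected independently with probability mask_prob.
    return [i for i in range(n_tokens) if rng.random() < mask_prob]

def apply_mask(tokens, positions, mask_token="<mask>"):
    positions = set(positions)
    return [mask_token if i in positions else t for i, t in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)

# Static masking: positions are drawn once during preprocessing and then reused,
# so the model sees the same masked sequence at every epoch.
static_positions = sample_mask_positions(len(tokens), 0.15, rng)
static_epochs = [apply_mask(tokens, static_positions) for _ in range(3)]

# Dynamic masking: positions are re-drawn every time the sequence is fed to the
# model, so the masked sequence can differ from one epoch to the next.
dynamic_epochs = [
    apply_mask(tokens, sample_mask_positions(len(tokens), 0.15, rng))
    for _ in range(3)
]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(epoch, " ".join(s), "|", " ".join(d))
```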

## RoBERTa: some results reproduction

I experimented with two different benchmarks to investigate the power of **RoBERTa**:
- **GLUE benchmark** (General Language Understanding Evaluation)
- **RACE benchmark** (Large-scale ReAding Comprehension Dataset From Examinations)

For this purpose, I have worked at different levels:

1. **Data analysis**: understanding data shape
2. **Data preprocessing**: using BPE (Byte Pair Encoding)
3. **Pretrained model fine-tuning**: using the indicated hyperparameters
4. **Evaluation**: inference on the development set of each dataset, using task-specific metrics

It is important to stress that RoBERTa models have been pre-trained:

1. with 160GB of uncompressed text (**BookCorpus + English Wikipedia**, **CC-News**, **OpenWebText**, **Stories**)
2. for longer, up to 500K training steps

**Experiments were conducted on an `NVIDIA GeForce RTX 3090` GPU, 24GB**

## GLUE benchmark

The GLUE benchmark consists of 9 different datasets, one for each of 9 NLP tasks.

A leaderboard is available online at https://gluebenchmark.com/.

When it was published, the **RoBERTa** model achieved the best results on 4 of the 9 GLUE tasks: **MNLI, QNLI, RTE and STS-B**

In the following I list all the results I reproduced, from the fine-tuning of the RoBERTa base and large models
(provided by Facebook AI) to the validation of the results on the development sets of the various tasks.

### MNLI (The Multi-Genre Natural Language Inference Corpus)

*[...] is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence
and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts
the hypothesis (contradiction), or neither (neutral).*

**Matched** development sentences come from the same domains as the training data, while **mismatched** ones come from different domains.

- **392703** annotated train sentences
- **9816** annotated matched development sentences
- **9833** annotated mismatched development sentences

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 3           | 1.0e-05       | 32            | 123873            | 7432           |
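To make the roles of *total num updates* and *warmup updates* concrete, here is a minimal sketch of a learning rate schedule built from these values. The table only gives the numbers, so the shape of the schedule is an assumption: a linear warmup to the peak learning rate followed by a linear decay to zero. The function name is hypothetical.

```python
def learning_rate(step, peak_lr=1.0e-05, warmup_updates=7432, total_updates=123873):
    """Learning rate at a given update, assuming linear warmup and linear decay."""
    if step < warmup_updates:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_updates
    # Decay: shrink linearly from peak_lr to 0 at total_updates.
    remaining = max(total_updates - step, 0)
    return peak_lr * remaining / (total_updates - warmup_updates)

for step in (0, 3716, 7432, 65000, 123873):
    print(step, f"{learning_rate(step):.2e}")
```

With these values the warmup covers about 6% of the total updates (7432 / 123873).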


**Results**

| model              | Accuracy (mismatched) | Accuracy (matched) |
|--------------------|-----------------------|--------------------|
| roberta.base       | 86.7%                 | 87.2%              |
| roberta.large.mnli | 90.1%                 | 90.59%             |

<style>
td {
  font-size: 18px
}
th {
  font-size: 14px
}
</style>

### QNLI (Question Natural Language Inference, from the Stanford Question Answering Dataset)

*[...] The Stanford Question Answering Dataset (Rajpurkar et al. 2016) is a question-answering
dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn
from Wikipedia) contains the answer to the corresponding question (written by an annotator).*

- **104744** annotated train sentences
- **5464** annotated development sentences

Example: *17	What percentage of farmland grows wheat?	More than 50% of this area is sown for wheat, 33% for barley and 7% for oats.	entailment*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 1.0e-05       | 32            | 123873            | 7432           |


**Results**

| model              | Accuracy |
|--------------------|----------|
| roberta.base       | 93.2%    |
| roberta.large.mnli | 94.4%    |

### QQP (Quora Question Pairs)

*[...] The Quora Question Pairs dataset is a collection of question pairs from the community
question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.*

- **363847** annotated train sentences
- **40431** annotated development sentences

Example: *201359	303345	303346	Why are African-Americans so beautiful?	Why are hispanics so beautiful?	0*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 1.0e-05       | 32            | 123873            | 7432           |


**Results**

| model              | Accuracy |
|--------------------|----------|
| roberta.base       | 93.2%    |
| roberta.large.mnli | 94.4%    |

### RTE (Recognizing Textual Entailment)

*The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual
entailment challenges. All datasets are converted to a two-class split, where
for three-class datasets we collapse neutral and contradiction into not entailment, for consistency.*

As indicated in the RoBERTa paper, the large model fine-tuned on the RTE task starts from
`roberta.large` already fine-tuned on the MNLI dataset.

- **2491** annotated train sentences
- **278** annotated development sentences

Example: *0	No Weapons of Mass Destruction Found in Iraq Yet.	Weapons of Mass Destruction Found in Iraq.	not_entailment*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 2.0e-05       | 16            | 2036            | 122           |


**Results**

| model              | Accuracy |
|--------------------|----------|
| roberta.base       | 79.0%    |
| roberta.large.mnli | 90.1%    |

### SST-2 (Stanford Sentiment Treebank)

*The Stanford Sentiment Treebank consists of sentences from movie
reviews and human annotations of their sentiment. The task is to predict the sentiment of a given
sentence. Labels are from two-way (positive/negative) class split.*

As indicated in the RoBERTa paper, the large model fine-tuned on the STS task starts from
`roberta.large` already fine-tuned on the MNLI dataset.

- **67350** annotated train sentences
- **873** annotated development sentences

Example: *allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker. 1*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 1.0e-05       | 32            | 20935            | 1256           |


**Results**

| model              | Accuracy |
|--------------------|----------|
| roberta.base       | 95.0%    |
| roberta.large.mnli | 96.0%    |

### MRPC (Microsoft Research Paraphrase Corpus)

*The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically
extracted from online news sources, with human annotations for whether the sentences in the pair are semantically
equivalent.*

As indicated in the RoBERTa paper, the large model fine-tuned on the MRPC task starts from
`roberta.large` already fine-tuned on the MNLI dataset.

- **4077** annotated train sentences
- **1726** annotated development sentences

Example: *1	702876	702977	Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .	Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 1.0e-05       | 16            | 2296            | 137           |


**Results**

| model              | F1 score |
|--------------------|----------|
| roberta.base       | 90.0%    |
| roberta.large.mnli | 92.3%    |

### CoLA  (Corpus of Linguistic Acceptability)

*The Corpus of Linguistic Acceptability (Warstadt et al., 2018) consists of English acceptability judgments drawn from
books and journal articles on linguistic theory*

- **8551** annotated train sentences
- **1043** annotated development sentences

Example: *cj99	1		If you had eaten more, you would want less.
cj99	0	*	As you eat the most, you want the least.*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 2           | 1.0e-05       | 16            | 5336            | 320           |


**Results**

| model              | Matthews Corr |
|--------------------|---------------|
| roberta.base       | 62.0%         |
| roberta.large.mnli | 68.7%         |

Considering the **Matthews Correlation Coefficient**, computed from the confusion matrix of the binary predictions:

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
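The coefficient can be computed directly from the definition, as in the sketch below. The labels and predictions are made-up toy values, not CoLA data, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): return 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Toy data: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]
mcc = matthews_corrcoef(labels, preds)
print(mcc)  # (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays close to 0 for a classifier that ignores the input, which makes it suitable for unbalanced datasets.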

### STS-B (Semantic Textual Similarity Benchmark)

*The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence
pairs drawn from news headlines, video and image captions, and natural language inference data.
Each pair is human-annotated with a similarity score from 1 to 5; the task is to predict these scores.*

- **5750** annotated train sentences
- **1501** annotated development sentences

Example: *1	main-captions	MSRvid	2012test	0002	none	none	A young child is riding a horse.	A child is riding a horse.	4.750*

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **total num updates** | **warmup updates** |
|-------------|---------------|---------------|-------------------|----------------|
| 1           | 2.0e-05       | 16            | 3598            | 214           |


**Results**

| model              | Pearson-Spearman Corr |
|--------------------|-----------------------|
| roberta.base       | 91.0%                 |
| roberta.large.mnli | 92.2%                 |

Considering the **Pearson and Spearman correlation coefficients**. The Pearson coefficient is:

$\rho_{XY} = \frac {\sigma_{XY}}{\sigma_X\sigma_Y}$
where X and Y are the development set labels and predictions, respectively. The Spearman coefficient is the same quantity computed on the ranks of X and Y.
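Both coefficients can be sketched in plain Python as follows. The similarity scores are made-up toy values, not STS-B data, and the helper names are hypothetical; the Spearman coefficient is obtained by applying the Pearson formula to the ranks, with tied values receiving their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

# Toy similarity scores (hypothetical gold labels and model predictions).
gold = [4.75, 2.0, 3.5, 0.5, 5.0]
predicted = [4.5, 2.5, 3.0, 1.0, 4.9]
print(round(pearson(gold, predicted), 3), round(spearman(gold, predicted), 3))
```

In this toy example the predictions preserve the ordering of the gold scores exactly, so the Spearman coefficient is 1 even though the Pearson coefficient is slightly lower.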

## RACE benchmark

RACE is a large-scale reading comprehension dataset. It consists of **passages** and **multiple choice questions**.
The dataset is collected from English exams for middle and high school students in China.

A leaderboard is available online at http://www.qizhexie.com/data/RACE_leaderboard.html

```json
{
   "answers":[
      "D",
      "A",
      "B",
      "B"
   ],
   "options":[
      [
         "Help his mother.",
         "Watch TV.",
         "Wear his raincoat",
         "Go out."
      ],
      [
         "happy",
         "scary",
         "dangerous",
         "boring"
      ],
      [
         "The raincoat can stop the rain.",
         "The color of Robbie's raincoat is red.",
         "Robbie first watches with his Mum",
         "Robbie's mum doesn't wear a raincoat in the rain."
      ],
      [
         "It's raining",
         "Fun in the rain",
         "Robbie and His mother",
         "Robbie's raincoat"
      ]
   ],
   "questions":[
      "What does Robbie want to do on the rainy day?",
      "Robbie has a_day that day.",
      "Which of the following is TRUE according to the passage?",
      "Which is the best title for the passage?"
   ],
   "article":"Pit-a-pat. Pit-a-pat. It's raining. \"I want to go outside and play, Mum,\" Robbie says, \"When can the rain stop?\" His mum doesn't know what to say. She hopes the rain can stop, too. \"You can watch TV with me,\" she says. \"No, I just want to go outside.\" \"1Put on your raincoat.\" \"Does it stop raining?\" \"No, but you can go outside and play in the rain. Do you like that?\" \"Yes, mum.\" He runs to his bedroom and puts on his red raincoat. \"Here you go. Go outside and play.\" Mum opens the door and says. Robbie runs into the rain. Water goes 2here and there. Robbie's mum watches her son. He is having so much fun. \"Mum, come and play with me!\" Robbie calls. The door opens and his mum walks out. She is in her yellow raincoat. Mother and son are out in the rain for a long time. They play all kinds of games in the rain.",








  "id":"middle10.txt"
}
```

## RACE fine-tuning and evaluation

Each question is concatenated, together with its passage, with each of its four candidate answers; the four resulting
sequences are encoded, and their representations are passed through a fully-connected layer to predict the correct answer.
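The construction of the four input sequences can be sketched as follows, using the first question of the sample shown earlier. This is only an illustration on raw strings: the simple space-separated concatenation, the shortened passage and the function names are assumptions, and the real pipeline works on BPE-encoded tokens with separator symbols.

```python
def build_race_inputs(article, question, options):
    """Return one candidate sequence per option: passage + question + option."""
    return [f"{article} {question} {option}" for option in options]

def answer_index(letter):
    # RACE stores the gold answer as a letter; the classifier needs a class index.
    return "ABCD".index(letter)

article = "Pit-a-pat. Pit-a-pat. It's raining."  # shortened passage for the example
question = "What does Robbie want to do on the rainy day?"
options = ["Help his mother.", "Watch TV.", "Wear his raincoat", "Go out."]

inputs = build_race_inputs(article, question, options)
label = answer_index("D")

for i, sequence in enumerate(inputs):
    marker = "<- gold" if i == label else ""
    print(i, sequence, marker)
```

The model then produces one score per sequence, and the option with the highest score is the predicted answer, which is why the task is treated as a 4-class problem.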

**Data**
- 28,000 passages
- 100,000 questions

**Hyperparameters**

| **num classes** | **learning rate** | **max sentences** | **update frequency** |
|-----------------|-------------------|-------------------|----------------------|
| 4               | 1.0e-05           | 16                | 5336                 |

**Results**

| model         | Accuracy (Middle school) | Accuracy (High school) |
|---------------|--------------------------|------------------------|
| roberta.large | 81.8%                    | 87.7%                  |


## References

**Code**
- https://github.com/pisalore/roberta_results, Github repository for this work
- https://github.com/pytorch/fairseq/tree/main/examples/roberta, Fairseq RoBERTa

**Papers**
- https://arxiv.org/abs/1409.0473, Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate"
- https://arxiv.org/abs/1706.03762, Vaswani et al., "Attention Is All You Need"
- https://arxiv.org/abs/1810.04805, Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- https://arxiv.org/abs/1907.11692, Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach"

**Dataset**
- https://www.cs.cmu.edu/~glai1/data/race/, RACE dataset
- https://gluebenchmark.com/, GLUE