# 1. Introduction

[BERT](https://github.com/google-research/bert) stands for Bidirectional Encoder Representations from Transformers. Pre-trained language models have proven effective on both sentence-level and token-level tasks.

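There are two existing strategies for applying pre-trained representations to downstream tasks, both of which rely on unidirectional language models during pre-training:
- Feature-based
    - ELMo: uses task-specific architectures that include the pre-trained representations as additional features
- Fine-tuning
    - OpenAI GPT: fine-tunes all pre-trained parameters. It uses a left-to-right architecture, i.e. every token can only get information from previously seen tokens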

Unidirectional training is not optimal because tasks like question-answering need the whole context before prediction.

BERT uses two different tasks for pre-training:
- MLM (Masked Language Model): some tokens from the input are randomly masked, and the goal is to predict the original vocabulary ids of the masked tokens from their context
- Next Sentence Prediction (NSP): predicts whether the second sentence of a pair follows the first, which jointly pre-trains text-pair representations
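
To make the Next Sentence Prediction task concrete, here is a minimal sketch of how training pairs could be built. In the paper, sentence B is the actual next sentence 50% of the time and a random sentence from the corpus otherwise. The function name `make_nsp_pairs` and the choice to draw negatives from a different document are assumptions made for illustration, not the authors' code.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) examples.

    `documents` is a list of documents, each a list of sentences.
    This sketch needs at least two documents to draw negatives from.
    """
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: B is a random sentence from another document
                # (assumption: this avoids picking the true next sentence by chance)
                others = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(others)), False))
    return pairs


docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for a, b, is_next in make_nsp_pairs(docs, random.Random(0)):
    print(f"A={a!r} B={b!r} label={'IsNext' if is_next else 'NotNext'}")
```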

# 2. Related Work

### 2.1 Unsupervised Feature-based Approaches

This section mainly covers approaches that use left-to-right context for pre-training.

### 2.2 Unsupervised Fine-tuning Approaches

Pre-training is followed by supervised fine-tuning on the downstream task. The advantage is that only a few parameters need to be learned **from scratch** for the downstream task.

### 2.3 Transfer Learning from Supervised Data

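Similar to the previous approach, but the model is first trained on a supervised task with a large dataset (e.g. natural language inference or machine translation), and its parameters are then fine-tuned on the downstream task.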

# 3. BERT

There are two steps involved: pre-training and fine-tuning. In pre-training, the model is trained on a large amount of unlabeled data. For fine-tuning, the same model is initialized with the pre-trained parameters, and an additional output layer is added as required by the downstream task. The model is then trained further on the labeled, task-specific dataset without freezing any layers. Figure 1 shows the pre-training (same for all tasks) and fine-tuning (task-specific) architectures.
![Figure 1](resources/bert1.png "Bert1")
*Figure 1: BERT Architecture*

In the paper, the authors use two model sizes (L is the number of Transformer blocks, H is the hidden size, A is the number of self-attention heads):
- BERT<sub>BASE</sub> (L=12, H=768, A=12, Total Parameters=110M)
- BERT<sub>LARGE</sub> (L=24, H=1024, A=16, Total Parameters=340M)
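
As a rough sanity check on the parameter totals above, the encoder's parameters can be counted from L and H alone. The sketch below assumes a feed-forward size of 4H (as stated in the paper), and takes the vocabulary size (30522), the maximum of 512 positions, and the extra pooler layer on [CLS] from the released uncased checkpoints; those three are assumptions not stated in these notes. The pre-training output heads are left out. The helper name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, num_segments=2):
    """Approximate parameter count of a BERT encoder (weights and biases)."""
    ffn = 4 * hidden            # feed-forward (intermediate) size
    layer_norm = 2 * hidden     # scale and bias

    # Token + position + segment embedding tables, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    # Query, key, value, and output projections of self-attention
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Each Transformer block has two LayerNorms
    per_layer = attention + feed_forward + 2 * layer_norm

    # Dense layer applied to the final [CLS] hidden state (assumed from the released models)
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


for name, L, H in [("BASE", 12, 768), ("LARGE", 24, 1024)]:
    print(f"BERT-{name}: {bert_param_count(L, H) / 1e6:.1f}M parameters")
```

Under these assumptions the counts come out at about 109.5M and 335.1M, in line with the rounded 110M and 340M figures. The number of attention heads A does not appear because splitting H across heads does not change the size of the projection matrices.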

BERT<sub>BASE</sub> was chosen to have the same model size as OpenAI GPT so that the results can be compared. The main difference is that BERT uses bidirectional self-attention, while GPT uses constrained self-attention in which every token can only attend to the context on its left.
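
The difference between left-only and bidirectional context can be illustrated with attention masks. This is a toy sketch in plain Python, not code from either model; `attention_mask` is a hypothetical helper, and a 1 at row i, column j means token i may attend to token j.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len matrix of 0/1 attention permissions."""
    return [
        # Left-to-right models only allow j <= i; bidirectional models allow every j
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


for label, flag in [("left-to-right (GPT-style)", False), ("bidirectional (BERT-style)", True)]:
    print(label)
    for row in attention_mask(4, flag):
        print(" ".join(str(v) for v in row))
```
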
![Figure 2](resources/bert-input.png)
*Figure 2: BERT input representation*

Figure 2 shows how the input is constructed for the BERT model. It is the sum of three different embeddings: token embeddings, segment embeddings, and position embeddings.

For token embeddings, every word/sub-word is mapped to a unique id, which is then looked up in an embedding table. Every sequence starts with the special classification token [CLS]. The final hidden state of this token is treated as the aggregate representation of the whole sequence and is used for downstream classification tasks. A single sequence can pack two sentences together, e.g. for <Question, Answer> type datasets. Another special token, [SEP], is used to separate the sentences within a sequence.

Segment embeddings mark which sentence each token belongs to, and position embeddings encode the order of the input tokens. More details are given in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).
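
The input layout described above can be sketched as follows. This is a simplified illustration that works on already-tokenized input; `build_bert_input` is a hypothetical helper, and real implementations additionally convert tokens to vocabulary ids, then truncate and pad.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for one or two sentences."""
    # Every sequence starts with [CLS]; the first sentence ends with [SEP]
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b is not None:
        # The second sentence (and its closing [SEP]) belongs to segment 1
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    # Positions simply count tokens from the start of the sequence
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

Each of the three id lists indexes its own embedding table, and the three resulting vectors are summed per token to form the model input.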

### 3.1 Pre-training BERT



#### MaskedLM

In this step, 15% of the tokens in each sequence are selected at random for prediction. The final hidden vectors T<sub>i</sub> (shown in Figure 1) corresponding to the selected tokens are fed into an output softmax over the vocabulary to predict the original token. Because the [MASK] token never appears during fine-tuning, the authors replace a selected token with [MASK] only 80% of the time. In the remaining 20% of cases, the token is either replaced with a random token (10%) or left unchanged (10%).
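
A minimal sketch of the masking procedure from the paper: a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. As a simplification, this sketch selects each token independently with probability 0.15 rather than picking a fixed number of positions, and the names `mask_tokens` and `vocab` are assumptions for illustration.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is None where no prediction is made."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected for prediction
        if token in ("[CLS]", "[SEP]"):
            continue
        if rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model cannot tell which of the unmasked tokens it will be asked to predict.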

### 3.2 Fine-tuning BERT