In [None]:
#@title Install Required Packages

# On your local machine, uncomment the lines below
# !pip install -qU torch
# !pip install -qU numpy
# !pip install -qU pandas

!pip install -qU transformers


In [None]:
#@title Load Packages

from transformers import AutoTokenizer, AutoModelForMaskedLM

from pprint import pprint
from IPython import display

# BERT


<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/bert.png" height="400" />
    <br/>
    <em>Figure 1: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</em><br/>
    <em><a href="https://github.com/google-research/bert">https://github.com/google-research/bert</a></em>
</p>
<br/>



## Introduction

One of the biggest challenges in NLP is the shortage of labeled training data. To work around it, researchers have developed various techniques for training general-purpose language representation models on large unlabeled corpora (known as "`Pre-Training`").

<br/>

These general-purpose pre-trained models can then be fine-tuned on smaller task-specific datasets. In the BERT paper, Google researchers published a state-of-the-art language model that can be used for a wide range of natural language tasks without architecture modification. To understand BERT, it is essential to take a quick look at the models that inspired the Google researchers.

### ELMo

**ELMo (Embeddings from Language Models)** is an unsupervised, pre-trained, feature-based approach that learns word-level characteristics and linguistic context. ELMo is known as a shallow bidirectional language model because it extracts hidden states from a left-to-right pass and a right-to-left pass over the same sequence and concatenates them. Simply replacing this shallow combination with a deep bidirectional one would create a cycle in which each word can indirectly see itself, so the prediction task becomes trivial and the resulting model is useless.

<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/elmo.png" />
    <br/>
    <em>Figure 2: ELMo</em><br/>
    <em><a href="https://arxiv.org/abs/1802.05365">https://arxiv.org/abs/1802.05365</a></em>
</p>
<br/>

Bear in mind that ELMo needs to look at the entire sentence before assigning an embedding to each word. As a result, the same word receives a different embedding in each context in which it occurs.

### GPT


**OpenAI GPT (Generative Pre-trained Transformer)** is a multi-layer transformer that expands the unsupervised language model in a way that uses a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers.
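
The left-to-right constraint can be pictured as a lower-triangular attention mask. The snippet below is only an illustrative sketch in plain Python: the helper `attention_mask` is a hypothetical name, not part of any library, and real implementations apply the mask to attention scores inside the model.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "man", "went", "home"]

# GPT-style: each token sees itself and the tokens before it.
causal_mask = attention_mask(len(tokens), causal=True)
# Bidirectional encoder: every token sees every other token.
full_mask = attention_mask(len(tokens), causal=False)

for token, row in zip(tokens, causal_mask):
    print(f"{token:>5}: {row}")
```

With `causal=True` the first token can only see itself, while the last token sees the whole sequence; with `causal=False` no position is hidden.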


<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/gpt.png" height="400" />
    <br/>
    <em>Figure 3: GPT</em><br/>
    <em><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT Paper</a></em>
</p>
<br/>

In fact, GPT is a multi-layer transformer decoder. In contrast to ELMo, which feeds its embeddings as additional features into models customized for specific tasks, GPT fine-tunes the same base model for all end tasks. GPT also takes a different approach to tokenization, using Byte-Pair Encoding (BPE) to encode the sequences.

## BERT

Language models before BERT used only the left or the right context, or at most a concatenation of the two (a shallow bidirectional approach), but real language understanding needs a deeply bidirectional model.



<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/bert_arch.png" />
    <br/>
    <em>Figure 4: BERT Architecture</em><br/>
</p>
<br/>

This changed when Google researchers published BERT, a bidirectional representation pre-trained on unlabeled text corpora to learn `left-<>-right` contexts, which can be fine-tuned on downstream NLP tasks without architecture modification. As mentioned before, building a deep bidirectional model is not straightforward, so the researchers had to tackle those problems with some tricks. They introduced the following components:


- Masked Language Model ($MLM$): learns relationships between words and makes deep bidirectional training possible.
- Next Sentence Prediction ($NSP$): learns relationships between sentence pairs, which matter in tasks like QA/NLI.
- WordPiece and custom text preprocessing: every input token is represented by a combination of three embeddings
 - *input_ids:* the vocabulary ids of the tokens, used for $MLM$
 - *segment_ids:* the id of the sentence each token belongs to, used for $NSP$; a token belongs either to the first sentence $A$ ($E_{A}$) or to the second sentence $B$ ($E_{B}$)
 - *position_ids:* the index of each token's position in the sequence.
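
As a rough sketch of how the three id sequences line up for a sentence pair, the snippet below assembles `[CLS] A [SEP] B [SEP]` by hand. The function `build_pair_inputs` and the toy vocabulary are hypothetical and only for illustration; it also assumes that position ids are plain 0-based positions. Real ids come from the WordPiece vocabulary of the pre-trained tokenizer.

```python
def build_pair_inputs(tokens_a, tokens_b, vocab):
    """Assemble [CLS] A [SEP] B [SEP] and the three id sequences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Assumption: positions are simply counted from 0.
    position_ids = list(range(len(tokens)))
    return tokens, input_ids, segment_ids, position_ids

# Hypothetical toy vocabulary, not the real WordPiece vocabulary.
toy_vocab = {tok: i for i, tok in enumerate(
    ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "man", "went", "home", "he", "slept"]
)}

tokens, input_ids, segment_ids, position_ids = build_pair_inputs(
    ["the", "man", "went", "home"], ["he", "slept"], toy_vocab
)
print(tokens)
print(input_ids)
print(segment_ids)
print(position_ids)
```

All three sequences have the same length, one entry per token, which is what allows the model to look up one embedding of each kind for every position.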

### MLM

The main idea behind $MLM$ is to predict a missing word from within the sequence itself instead of predicting the next word in the sequence. To prevent the model from focusing too much on particular positions or on the masked tokens, BERT randomly selects around $15\%$ of the tokens for prediction. The selected words are not always replaced by $[MASK]$, because the $[MASK]$ token never appears during fine-tuning, which would cause a mismatch between pre-training and fine-tuning. BERT therefore applies the following rule to the selected words, so that the model cannot tell which words have been masked or replaced at random:

- $80\%$ of the time, the word is replaced with the mask token $[MASK]$
- $10\%$ of the time, the word is replaced with a random word. (This does not harm the model, because random replacement affects only $1.5\%$ [=$15\% \times 10\%$] of all tokens)
- $10\%$ of the time, the word is left unchanged.


```text
my dog is hairy → my dog is [MASK]
my dog is hairy → my dog is apple
my dog is hairy → my dog is hairy
```
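
The selection and replacement rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the original pre-processing code: `mask_tokens` and the tiny vocabulary are hypothetical, and special tokens such as `[CLS]` and `[SEP]` (which a real pipeline would skip) are not handled.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 rule; return corrupted tokens and {position: original}."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # this position is not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"            # 80%: mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random word
        # remaining 10%: keep the original word
    return corrupted, targets

rng = random.Random(0)
vocab = ["my", "dog", "is", "hairy", "apple", "store", "milk"]
tokens = ["my", "dog", "is", "hairy"] * 2500  # 10,000 tokens

corrupted, targets = mask_tokens(tokens, vocab, rng)
n_mask = sum(corrupted[i] == "[MASK]" for i in targets)

print(f"selected for prediction: {len(targets) / len(tokens):.1%}")
print(f"replaced by [MASK] among selected: {n_mask / len(targets):.1%}")
```

On a long enough sequence the two printed ratios should land close to $15\%$ and $80\%$; the model is trained to predict the original word at every selected position, whichever of the three cases applied.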


### NSP 

NSP is a binary classification task whose training data can be generated quickly from any corpus by splitting it into sentence pairs $A$/$B$. It complements $MLM$ by teaching the model the relationship between $A$ and $B$, which matters in tasks like QA.

- $50\%$ of the time, $B$ is the actual next sentence that follows $A$ (labeled as IsNext)
- $50\%$ of the time, $B$ is a random sentence from the corpus (labeled as NotNext)

```text
INPUT = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
LABEL = IsNext

INPUT = [CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds [SEP]
LABEL = NotNext
```
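
A minimal sketch of how such pairs could be generated is shown below. The function `make_nsp_pairs` and the toy corpus are hypothetical; the sketch also assumes that negatives are drawn from a different document, so that a "random" sentence can never be the true next sentence.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) examples.

    Each document is a list of sentences in reading order.
    """
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B comes from another document.
                other_idx = rng.choice(
                    [j for j in range(len(documents)) if j != doc_idx]
                )
                examples.append((doc[i], rng.choice(documents[other_idx]), "NotNext"))
    return examples

documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he walked home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]

examples = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, label in examples:
    print(f"{label:>7}: {sentence_a} | {sentence_b}")
```

No manual labeling is needed: the label follows directly from how each pair was built.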

### WordPiece

<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/wordpiece.png" />
    <br/>
    <em>Figure 5: WordPiece</em><br/>
    <em><a href="https://arxiv.org/pdf/1609.08144v2.pdf">https://arxiv.org/abs/1609.08144v2</a></em><br/>
</p>
<br/>


**Shape** (batch size $\times$ maximum sequence length $\times$ hidden size):
- input_ids [Token Embeddings]: $batch\_size \times 512 \times 768$
- segment_ids [Segment Embeddings]: $batch\_size \times 512 \times 768$
- position_ids [Position Embeddings]: $batch\_size \times 512 \times 768$
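
Because the three embeddings share the same shape, they can be combined element-wise. Assuming they are simply summed, as in the original BERT model, the input representation can be written as follows (the names $TokenEmb$, $SegmentEmb$ and $PositionEmb$ are only labels for the three lookup tables):

$$
E = TokenEmb(input\_ids) + SegmentEmb(segment\_ids) + PositionEmb(position\_ids), \qquad E \in \mathbb{R}^{batch\_size \times 512 \times 768}
$$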

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
text_a = "the man went to grocery store."
text_b = "he bought a gallon of milk."

tokenized = tokenizer.encode_plus(text_a, text_b)
pprint(tokenized)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               1996,
               2158,
               2253,
               2000,
               13025,
               3573,
               1012,
               102,
               2002,
               4149,
               1037,
               25234,
               1997,
               6501,
               1012,
               102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
def count_parameters(model):
    """Count the trainable parameters (those with requires_grad=True)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [None]:
print(f"Model has {count_parameters(model):,} parameters")

Model has 109,514,298 parameters
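
The printed total can be cross-checked by hand from the layer sizes. The arithmetic below is a sketch under three assumptions that the truncated printout does not show: 12 encoder layers (BERT-Base), a feed-forward width of $4 \times 768 = 3072$, and an MLM output projection that shares its weights with the word-embedding matrix, with no pooler layer. The helper `linear` is a hypothetical name.

```python
# Sizes taken from the module tree, plus the labeled assumptions.
VOCAB, MAX_POS, SEGMENTS, HIDDEN = 30522, 512, 2, 768
LAYERS = 12                # assumption: BERT-Base depth
INTERMEDIATE = 4 * HIDDEN  # assumption: feed-forward width of 3072

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + SEGMENTS) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm
)

# Assumption: the output projection is tied to the word embeddings,
# so the head only adds a dense layer, a LayerNorm and a vocabulary-sized bias.
mlm_head = linear(HIDDEN, HIDDEN) + layer_norm + VOCAB

total = embeddings + LAYERS * encoder_layer + mlm_head
print(f"{total:,}")
```

Under these assumptions the sum comes to 109,514,298, the same number reported by `count_parameters`. About 22% of it sits in the embedding tables and the rest almost entirely in the 12 encoder layers.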


## BERT Architecture

- BERT-Base: `12-layer`, `768-hidden-nodes`, `12-attention-heads`, `110M parameters`
- BERT-Large: `24-layer`, `1024-hidden-nodes`, `16-attention-heads`, `340M parameters`

*BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, and BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days!*

## BERT Configuration

**Pre-training**

- batch_size: 256 [sequences] x 512 [tokens] = 128K tokens/batch, for 1M steps
- Adam optimizer: 
 - $LR=1e-4$
 - $\beta_{1}=0.9, \beta_{2}=0.999, decay=0.01$
- dropout: $0.1$
- activation: $GELU$


**Fine-Tuning**
- batch_size: $16, 32$
- LR: $5e-5, 3e-5, 2e-5$
- epochs: $2, 3, 4$
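
Taken together, the fine-tuning values above define a small search grid. The snippet below is one hedged way to enumerate it; the dictionary keys are hypothetical names, not arguments of a specific training API.

```python
from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# One configuration per combination of the three hyperparameters.
grid = [
    {"batch_size": b, "learning_rate": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

print(len(grid))
for config in grid[:3]:
    print(config)
```

That is $2 \times 3 \times 3 = 18$ combinations, few enough to try exhaustively on a small task-specific dataset.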
