# 1. **ALBERT**
## A Lite version of BERT 
- In this section, we will learn about A Lite version of BERT, also known as ALBERT. One of the challenges with BERT is its size: BERT-base has 110 million parameters, which makes it hard to train and slow at inference time. Increasing the model size gives better results, but it also demands more computational resources. ALBERT was introduced to address this.
- ALBERT is a lite version of BERT with fewer parameters. It uses the following two techniques to reduce the number of parameters:
  - Cross-layer parameter sharing
  - Factorized embedding layer parameterization

By using these two techniques, we can reduce the training time and inference time of the BERT model.
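
To get a feel for how much the two techniques save, here is a rough parameter count. It is only a sketch: the vocabulary size, hidden size, embedding size, and the per-layer estimate below are assumptions chosen for illustration, not values taken from this notebook.

```python
# Assumed sizes, for illustration only
V = 30_000   # vocabulary size
H = 768      # hidden size
E = 128      # reduced embedding size used by the factorization
L = 12       # number of encoder layers

# Factorized embedding parameterization:
# one V x H matrix is replaced by a V x E matrix followed by an E x H projection
bert_style_embedding = V * H
factorized_embedding = V * E + E * H

# Cross-layer parameter sharing:
# a single set of encoder-layer weights is reused by all L layers.
# per_layer is a rough estimate (4*H^2 for attention + 8*H^2 for feed-forward, biases ignored)
per_layer = 12 * H * H
unshared_encoder = L * per_layer
shared_encoder = per_layer

print(f"embedding parameters: {bert_style_embedding:,} -> {factorized_embedding:,}")
print(f"encoder parameters:   {unshared_encoder:,} -> {shared_encoder:,}")
```

Note that sharing weights shrinks the number of stored parameters, but every layer is still executed, so this count says nothing about the amount of computation per forward pass.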

## **Extracting embeddings with ALBERT**

In [50]:
!pip install transformers


In [51]:
from transformers import AlbertTokenizer, AlbertModel

## Download and load the pre-trained ALBERT model and tokenizer. In this tutorial, we'll use the ALBERT-base model:

In [52]:
# pip install sentencepiece

In [53]:
model = AlbertModel.from_pretrained('albert-base-v2')

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.decoder.weight', 'predictions.LayerNorm.bias', 'predictions.decoder.bias', 'predictions.bias', 'predictions.LayerNorm.weight', 'predictions.dense.weight', 'predictions.dense.bias']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [54]:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

## Now, feed the sentence to the tokenizer and get the preprocessed input: 

In [55]:
sentence = "Paris is a beautiful city"
inputs = tokenizer(sentence, return_tensors="pt")
inputs

{'input_ids': tensor([[   2, 1162,   25,   21, 1632,  136,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [56]:
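# Run the model once and reuse its outputs
outputs = model(**inputs)
hidden_rep = outputs[0]  # hidden state of every token in the final layer
cls_head = outputs[1]    # pooled representation of the first token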

In [57]:
hidden_rep.shape

torch.Size([1, 7, 768])

In [58]:
cls_head.shape

torch.Size([1, 768])

# 2. **RoBERTa**
## RoBERTa is another interesting and popular variant of BERT. Researchers observed that BERT is severely undertrained and proposed several improvements to how the BERT model is pre-trained. RoBERTa is essentially BERT with the following changes in pre-training:
- Use dynamic masking instead of static masking in the MLM task.
- Remove the NSP task and train using only the MLM task.
- Train with more data points.
- Train with a large batch size.
- Use byte-level BPE (BBPE) as a tokenizer.

## Using dynamic masking instead of static masking 
- We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and let the network predict the masked tokens.
- For instance, say we have the sentence **We arrived at the airport in time**. After tokenizing and adding the [CLS] and [SEP] tokens, we have the following:
- tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]
- Next, we randomly mask 15% of the tokens:
- tokens = [ [CLS], we, [MASK], at, the, airport, in, [MASK], [SEP] ]
- Now, we feed the tokens to BERT and train it to predict the masked tokens.
- Note that the masking is done only once, during the preprocessing step, and we train the model over several epochs to predict the same masked tokens. This is known as **static masking**.
- RoBERTa uses dynamic masking instead of static masking: a new masking pattern is generated each time a sequence is fed to the model, so the model sees different masked tokens across epochs.
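
The difference between static and dynamic masking can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual pre-training code: the helper `mask_tokens` is hypothetical, it always substitutes `[MASK]` (real MLM pre-training sometimes keeps the token or swaps in a random one), and with only seven maskable tokens a 15% rate rounds to a single mask.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with about mask_prob of the non-special tokens masked."""
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    n_mask = max(1, round(len(candidates) * mask_prob))
    chosen = set(rng.sample(candidates, n_mask))
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]

tokens = ["[CLS]", "we", "arrived", "at", "the", "airport", "in", "time", "[SEP]"]
rng = random.Random(0)

# Static masking: mask once during preprocessing and reuse it in every epoch
static_example = mask_tokens(tokens, rng)
static_epochs = [static_example for _ in range(4)]

# Dynamic masking: draw a fresh mask every time the sequence is fed to the model
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(4)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch}\n  static : {s}\n  dynamic: {d}")
```

With static masking every epoch prints the same masked sequence, whereas the dynamic version re-samples the masked position on each pass.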

## Removing the NSP task
- The RoBERTa experiments compared several input settings, and BERT performed better in the **FULL-SENTENCES** and **DOC-SENTENCES** settings, where it is trained without the NSP task.
- To conclude, in RoBERTa, we train the model only with the MLM task and not with the NSP task. The input consists of full sentences sampled contiguously from one or more documents, up to a maximum of 512 tokens. If we reach the end of one document, we begin sampling from the next document.
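
The way inputs are assembled from full sentences can be sketched as a greedy packing loop. This is a minimal illustration under stated assumptions: `pack_full_sentences` is a hypothetical helper, special tokens and any separator between documents are left out, and a tiny `max_tokens` is used so the behavior is visible.

```python
def pack_full_sentences(documents, max_tokens=512):
    """Greedily pack contiguous sentences into inputs of at most max_tokens tokens.

    documents: a list of documents, each a list of sentences, each a list of tokens.
    Sentences are taken in order and an input may cross a document boundary.
    """
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            sentence = sentence[:max_tokens]  # truncate an over-long sentence (assumption)
            if current and len(current) + len(sentence) > max_tokens:
                inputs.append(current)
                current = []
            current = current + sentence
    if current:
        inputs.append(current)
    return inputs

docs = [
    [["paris", "is", "beautiful"], ["i", "love", "paris"]],
    [["great", "day"], ["see", "you", "soon"]],
]
packed = pack_full_sentences(docs, max_tokens=8)
for i, seq in enumerate(packed):
    print(i, len(seq), seq)
```

In this toy run the first input holds both sentences of the first document plus the first sentence of the second document, which shows sampling continuing into the next document once the current one is exhausted.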

## Training with more data points
We learned that we pre-train BERT with the Toronto BookCorpus and English Wikipedia datasets, which together amount to 16 GB. In addition to these two datasets, RoBERTa is pre-trained on the CC-News (Common Crawl-News), OpenWebText, and Stories (a subset of Common Crawl) datasets.
Thus, the RoBERTa model is pre-trained on five datasets with a combined size of 160 GB.

## Training with a large batch size 
We learned that BERT is pre-trained with a batch size of 256 sequences for 1 million steps. RoBERTa is pre-trained with a larger mini-batch: 8,000 sequences for 300,000 steps. We can also pre-train for longer, with a batch size of 8,000 sequences for 500,000 steps.
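
As a quick back-of-the-envelope check, multiplying batch size by the number of steps gives the total number of sequences each schedule processes. The figures are the ones quoted above; the dictionary layout is just for illustration.

```python
# (batch size in sequences, number of training steps)
schedules = {
    "BERT": (256, 1_000_000),
    "RoBERTa": (8_000, 300_000),
    "RoBERTa (longer)": (8_000, 500_000),
}

# Total sequences processed = batch size * steps
sequences_seen = {name: batch * steps for name, (batch, steps) in schedules.items()}

for name, total in sequences_seen.items():
    print(f"{name}: {total:,} sequences")
```

So although RoBERTa takes fewer steps, each step covers far more sequences, and the total amount of training is several times larger.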

## Using BBPE as a tokenizer 
- We know that BERT uses the WordPiece tokenizer. We learned that the WordPiece tokenizer works similarly to BPE, but it merges symbol pairs based on likelihood instead of frequency. Unlike BERT, RoBERTa uses BBPE as a tokenizer.
- We learned that BBPE works very similarly to BPE, but instead of using a character-level sequence, it uses a byte-level sequence. BERT uses a vocabulary of 30,000 tokens, whereas RoBERTa uses a vocabulary of 50,000 tokens.
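
To see what "byte-level" means in practice, compare the character sequence and the UTF-8 byte sequence of the same word. This only illustrates the starting point of BBPE (the merge steps are not shown), and the example word is an arbitrary choice.

```python
word = "caf\u00e9"  # the word "café"

# Character-level view: one symbol per character
chars = list(word)

# Byte-level view: one symbol per UTF-8 byte
byte_values = list(word.encode("utf-8"))
hex_bytes = [f"{b:02x}" for b in byte_values]

print(chars)      # 4 characters
print(hex_bytes)  # 5 bytes, because the accented letter takes two bytes
```

Because any text can be written as a sequence of bytes, and a byte has only 256 possible values, a byte-level base vocabulary can represent every input without needing an unknown token.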

## **RoBERTa (Robustly Optimized BERT pre-training Approach) tokenizer**

In [59]:
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

In [60]:
model = RobertaModel.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [61]:
model.config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

In [62]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

In [63]:
tokenizer.tokenize('It was a great day')

['It', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']

In [64]:
tokenizer.tokenize('I had a sudden epipheny')

['I', 'Ġhad', 'Ġa', 'Ġsudden', 'Ġep', 'ip', 'heny']
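
The `Ġ` character in the output above marks a token that was preceded by a space, while tokens without it (such as `ip` and `heny`) continue the previous word. The following sketch rebuilds the sentence from the token list by hand; it is a simplification for this ASCII example, and in practice the tokenizer's own decoding methods should be used.

```python
SPACE_MARKER = "\u0120"  # the character displayed as Ġ

# Tokens copied from the output above
tokens = ["I", "\u0120had", "\u0120a", "\u0120sudden", "\u0120ep", "ip", "heny"]

# Join the pieces and turn each marker back into a space
text = "".join(tokens).replace(SPACE_MARKER, " ")
words = text.split(" ")

print(text)
print(words)
```

The misspelled word is not in the vocabulary as a single token, so it is split into three subword pieces that join back into one word.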

## To summarize, RoBERTa is a variant of BERT that uses only the MLM task for pre-training. Unlike BERT, it uses dynamic masking instead of static masking, and it is trained on more data with a larger batch size. It uses BBPE as a tokenizer and has a vocabulary size of about 50,000 tokens (50,265 in the `roberta-base` configuration printed above).

# 3. **ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)**
- ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is yet another interesting variant of BERT. We learned that we pre-train BERT using the MLM and NSP tasks, and that in the MLM task we randomly mask 15% of the tokens and train BERT to predict the masked tokens. Instead of using the MLM task as a pre-training objective, ELECTRA is pre-trained using a task called **replaced token detection**: a generator network replaces some of the input tokens with plausible alternatives, and a discriminator is trained to classify each token as original or replaced.
- To train the ELECTRA model efficiently, we can share weights between the generator and the discriminator. If both are the same size, we can share all of the encoder weights.
- However, a generator that is as large as the discriminator increases the training time, so we use a smaller generator instead. When the generator is smaller, we can share only the embedding layers (token and positional embeddings) between the generator and the discriminator. These tied embeddings reduce the training time.
- The pre-trained ELECTRA model is available in three different configurations:
  - ELECTRA-small: 12 encoder layers and a hidden size of 256
  - ELECTRA-base: 12 encoder layers and a hidden size of 768
  - ELECTRA-large: 24 encoder layers and a hidden size of 1,024
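
The replaced token detection objective named above boils down to a per-token binary label for the discriminator. Here is a minimal sketch of how those labels could be derived; the sentences are a made-up example, and the corrupted sequence stands in for what a generator might produce.

```python
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output

# Label 1 = replaced, 0 = original.
# If the generator happens to reproduce the original token, it counts as original.
labels = [int(o != c) for o, c in zip(original, corrupted)]

for token, label in zip(corrupted, labels):
    print(f"{token:>6}  {'replaced' if label else 'original'}")
```

Unlike MLM, where only the masked positions contribute to the loss, the discriminator receives a label for every token in the input.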


In [65]:
from transformers import ElectraTokenizer, ElectraModel

In [67]:
model_dis = ElectraModel.from_pretrained('google/electra-small-discriminator')
model_gen = ElectraModel.from_pretrained('google/electra-small-generator')

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraModel: ['generator_lm_head.weight', 'generator_predictions.LayerNorm.weight', 'generator_predictions.dense.weight', 'generator_lm_head.bias', 'generator_predictions.LayerNorm.bias', 'generator_predictions.dense.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## In this way, we can also load the other configurations of ELECTRA. Now that we have learned how ELECTRA works, let's learn about another popular variant of BERT, called SpanBERT, in the next section.

# 4. **SpanBERT**
## Predicting span with SpanBERT
- SpanBERT is another interesting variant of BERT. As the name suggests, SpanBERT is
mostly used for tasks such as question answering where we predict the span of text. 

## **Performing Q&As with pre-trained SpanBERT**

In [68]:
from transformers import pipeline

In [69]:
qa_pipeline = pipeline(
 "question-answering",
 model="mrm8488/spanbert-large-finetuned-squadv2",
 tokenizer="SpanBERT/spanbert-large-cased"
)

In [70]:
results = qa_pipeline({
    'question': "What is machine learning?",
    'context': '''Machine learning is a subset of artificial intelligence. It
is widely used for creating a variety of applications such as email filtering
and computer vision'''
})

In [73]:
print(results['answer'])

a subset of artificial intelligence
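
Besides `answer`, the dictionary returned by the question-answering pipeline also carries a `score` and the `start` and `end` character offsets of the answer within the context. The sketch below shows what such offsets mean without loading the model: the offsets are computed with a plain string search rather than taken from the pipeline, and the context is written on a single line for simplicity.

```python
context = (
    "Machine learning is a subset of artificial intelligence. It is widely "
    "used for creating a variety of applications such as email filtering "
    "and computer vision"
)
answer = "a subset of artificial intelligence"

# Character offsets of the answer span inside the context
start = context.index(answer)
end = start + len(answer)

print(start, end)
print(context[start:end])
```

Slicing the context with the two offsets recovers the answer text, which is what "predicting a span" amounts to at the string level.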
