***Flavors of BERT***

**BERT Architectures**

BERT inspired several derivative architectures, each with its own strengths and drawbacks:

- RoBERTa
- DistilBERT
- ALBERT

Each flavor attempts to enhance BERT by altering its architecture and/or how it was pretrained

**RoBERTa**

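- The authors claim BERT was undertrained
- 10x the training data, from 16 GB to 160 GB
- About 15% more parameters
- Removed the next sentence prediction (NSP) task; the authors claim that the task is not useful
- Dynamic masking pattern: the masked positions are re-sampled each time a sequence is fed to the model instead of being fixed once during preprocessing, giving roughly 4x the masking tasks to learn from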
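
The difference between a fixed and a dynamic masking pattern can be sketched in plain Python. This is only an illustration under simplifying assumptions: the helper names (`mask_tokens`, `static_masking`, `dynamic_masking`) are hypothetical, whitespace splitting stands in for a real tokenizer, and every selected position is replaced by `[MASK]`, which is simpler than what real pretraining does.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with about mask_prob of the positions masked."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, epochs, seed=0):
    """Mask once during preprocessing and reuse that pattern every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(epochs)]

def dynamic_masking(tokens, epochs, seed=0):
    """Re-sample the masked positions every time the sequence is served."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]

tokens = "if you do not stop at the sign you will get a ticket".split()
static_views = static_masking(tokens, epochs=4)
dynamic_views = dynamic_masking(tokens, epochs=4)

print("static:")
for view in static_views:
    print(" ", " ".join(view))
print("dynamic:")
for view in dynamic_views:
    print(" ", " ".join(view))
```

With static masking the model sees the same masked sentence in every epoch, while dynamic masking usually produces a different prediction task each time the sentence comes around.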

**DistilBERT**

- Distillation is a technique for training a smaller student model to replicate a larger teacher model: BERT (the teacher) makes a prediction, DistilBERT (the student) is trained to match BERT's output distribution, and the resulting loss is what DistilBERT uses in its backpropagation
- 40% fewer parameters and 60% faster at inference while retaining 97% of BERT's performance
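
A minimal sketch of the core idea, assuming the common formulation in which the student is pushed toward the teacher's temperature-softened output distribution via a KL divergence. The function names and the logits are hypothetical, and the real DistilBERT training objective combines further loss terms that are omitted here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between the softened output distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

# Hypothetical scores over a 4-token vocabulary for one masked position
teacher_logits = [4.0, 3.5, 0.5, 0.1]
good_student = [3.8, 3.4, 0.6, 0.2]   # close to the teacher
poor_student = [0.1, 0.5, 3.5, 4.0]   # ranks the tokens in reverse

print(f"good student loss: {distillation_loss(good_student, teacher_logits):.4f}")
print(f"poor student loss: {distillation_loss(poor_student, teacher_logits):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the signal the student backpropagates.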

**ALBERT**

- Optimizes model performance per parameter (roughly 90% fewer parameters than BERT)
- Factorized embedding parameterization: the large token embedding matrix is split into two smaller matrices, so the token embeddings can be much smaller than the hidden layers
- Cross-layer parameter sharing: the same parameters are reused across layers
- The next sentence prediction (NSP) task became the sentence order prediction (SOP) task (the theory is that NSP and MLM are too similar as tasks, while SOP is harder)
  - Take two consecutive segments from the same document and swap their order to create a negative example

e.g. given "I completed high school" and "Then I joined undergrad", ALBERT has to learn the correct order. In next sentence prediction the negative examples pair sentences from two different documents, and we can't say for sure that they wouldn't come one after the other; swapping the order of consecutive sentences gives a more robust sample base.
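
Two of the ALBERT ideas above can be made concrete with a short sketch. The sizes used (a 30,000-token vocabulary, hidden size 768, embedding size 128) are illustrative assumptions, and `embedding_params` and `make_sop_example` are hypothetical helpers, not part of any library.

```python
import random

def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token embedding, optionally factorized."""
    if embedding_size is None:
        # One vocab_size x hidden_size matrix
        return vocab_size * hidden_size
    # A vocab_size x embedding_size matrix followed by an
    # embedding_size x hidden_size projection
    return vocab_size * embedding_size + embedding_size * hidden_size

def make_sop_example(first, second, rng):
    """Return (segment_a, segment_b, label): 1 = original order, 0 = swapped."""
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0

full = embedding_params(30_000, 768)
factorized = embedding_params(30_000, 768, embedding_size=128)
print(f"unfactorized: {full:,}  factorized: {factorized:,}")

rng = random.Random(0)
a, b = "I completed high school", "Then I joined undergrad"
examples = [make_sop_example(a, b, rng) for _ in range(4)]
for seg_a, seg_b, label in examples:
    print(label, "|", seg_a, "|", seg_b)
```

Under these assumed sizes the embedding drops from 23,040,000 to 3,938,304 parameters. The SOP helper always draws both segments from the same document, so the only signal left for the model is their order.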

In [1]:
# imports

from transformers import pipeline



In [5]:
# Instantiating model for masking

nlp = pipeline('fill-mask', model='bert-base-cased')

preds = nlp(f"If you don't {nlp.tokenizer.mask_token} at the sign, you will get a ticket")

print("\n\nIf you don't *** at the sign, you will get a ticket")

for p in preds:
    # 'score' is a probability between 0 and 1, so format it as a percentage
    print(f"Token: {p['token_str']}, Score: {p['score']:.0%}")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


If you don't *** at the sign, you will get a ticket
Token: look, Score: 47%
Token: stop, Score: 43%
Token: glance, Score: 1%
Token: wait, Score: 1%
Token: turn, Score: 1%


In [6]:
# Same task using RoBERTa

nlp = pipeline('fill-mask', model='roberta-base')

preds = nlp(f"If you don't {nlp.tokenizer.mask_token} at the sign, you will get a ticket")

print()
print(type(nlp.model))

print("\n\nIf you don't *** at the sign, you will get a ticket")

for p in preds:
    # 'score' is a probability between 0 and 1, so format it as a percentage
    print(f"Token: {p['token_str']}, Score: {p['score']:.0%}")


If you don't *** at the sign, you will get a ticket
Token:  look, Score: 44%
Token:  stop, Score: 41%
Token:  stay, Score: 3%
Token:  stand, Score: 2%
Token:  wave, Score: 1%


All of these models follow the same basic process: attention over an input wrapped in special start and separator tokens ([CLS] and [SEP] in BERT, `<s>` and `</s>` in RoBERTa). The differences come down to the architecture and how the models were pretrained.