***Pre-Training BERT***

BERT is pre-trained on two tasks:

- Masked language modelling (MLM)
- Next sentence prediction (NSP)

**Masked language modelling**

Replace 15% of the tokens in the corpus with the special `[MASK]` token and ask BERT to fill in the blanks.
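
The masking step can be sketched in plain Python. This is a simplified illustration, not BERT's actual preprocessing: it splits on whitespace instead of using the WordPiece tokenizer, and it always substitutes `[MASK]`, whereas the original BERT recipe replaces only 80% of the selected tokens with `[MASK]` (10% become a random token and 10% are left unchanged). The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None elsewhere,
    which is what the model would be asked to predict.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position so short sentences still get a blank.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Whitespace split is an assumption made for readability only.
tokens = "If you don't stop at the sign, you will get a ticket".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

Only the positions with a non-`None` label contribute to the training loss; every other token is there purely as context.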

BERT passes each sequence through its encoder layers, which give the masked position a representation informed by the surrounding context. Predicting the missing token is then a classification task over every token in BERT's vocabulary.
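
To make the "classification over the vocabulary" idea concrete, here is a toy sketch of the final step: one score (logit) per vocabulary entry at the masked position, turned into probabilities with a softmax. The five-word vocabulary and the logit values are invented for illustration; the real model scores every entry in its vocabulary at once.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and made-up logits for the masked position.
vocab = ["look", "stop", "glance", "wait", "turn"]
logits = [2.0, 1.9, -2.0, -2.1, -2.3]

probs = softmax(logits)
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
best_token = ranked[0][0]

for token, p in ranked:
    print(f"Token: {token}, Score: {100 * p:.2f}%")
```

The highest-probability entry is the model's best guess for the blank, and the remaining entries are the runner-up candidates.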

In [2]:
# imports

from transformers import BertForMaskedLM, pipeline

In [3]:
# The Transformers library comes with several standard heads
# on top of the base BERT model

bert_lm = BertForMaskedLM.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
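- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).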

In [4]:
# Looking at the model

bert_lm

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a
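
The shapes in the printout above are enough to estimate where the parameters sit. The following is a back-of-the-envelope sketch that uses only the sizes shown in the output (28996, 512, 2 and 768); the variable names are illustrative.

```python
# Sizes read from the printed BertEmbeddings block.
vocab_size, max_positions, token_types, hidden = 28996, 512, 2, 768

embedding_params = {
    "word_embeddings": vocab_size * hidden,
    "position_embeddings": max_positions * hidden,
    "token_type_embeddings": token_types * hidden,
    "LayerNorm": 2 * hidden,  # one weight and one bias per hidden unit
}
embeddings_total = sum(embedding_params.values())

# Each 768 -> 768 Linear layer (query, key, value, attention output dense)
# has a weight matrix plus a bias vector.
linear_768_params = hidden * hidden + hidden

for name, count in embedding_params.items():
    print(f"{name}: {count:,}")
print(f"embeddings total: {embeddings_total:,}")
print(f"one 768x768 Linear layer: {linear_768_params:,}")
```

The word embedding table dominates the embedding block, which is why the vocabulary size has such a large effect on model size.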

In [5]:
# Create a fill-mask pipeline that wraps the tokenizer and the masked-LM model

nlp = pipeline('fill-mask', model='bert-base-cased')



In [7]:
preds = nlp(f"If you don't {nlp.tokenizer.mask_token} at the sign, you will get a ticket")

print("If you don't *** at the sign, you will get a ticket.")

for p in preds:
    print(f"Token: {p['token_str']}, Score: {100*p['score']:,.2f}%")

If you don't *** at the sign, you will get a ticket.
Token: look, Score: 47.00%
Token: stop, Score: 43.15%
Token: glance, Score: 0.83%
Token: wait, Score: 0.76%
Token: turn, Score: 0.65%


In [14]:
preds1 = nlp(f"Hi hippie, I {nlp.tokenizer.mask_token} you, I always did")

for p in preds1:
    print(f"Token: {p['token_str']}, Score: {100*p['score']:,.2f}%")

Token: love, Score: 41.48%
Token: told, Score: 22.10%
Token: tell, Score: 5.42%
Token: like, Score: 3.36%
Token: miss, Score: 3.25%


In [15]:
preds = nlp(f"Goodnight kiddo, I {nlp.tokenizer.mask_token} you.")

for p in preds:
    print(f"Token: {p['token_str']}, Score: {100*p['score']:,.2f}%")

Token: love, Score: 90.56%
Token: miss, Score: 5.54%
Token: missed, Score: 0.76%
Token: hear, Score: 0.41%
Token: need, Score: 0.38%


In [16]:
preds = nlp(f"Why does Padel not {nlp.tokenizer.mask_token}.")

for p in preds:
    print(f"Token: {p['token_str']}, Score: {100*p['score']:,.2f}%")

Token: know, Score: 46.69%
Token: understand, Score: 6.98%
Token: care, Score: 4.36%
Token: answer, Score: 2.61%
Token: exist, Score: 2.31%
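
The same print loop appears in every cell above, so it could be factored into a small helper. This is one possible sketch; `format_predictions` is a hypothetical name, and the sample list below is a stand-in with the same `token_str` and `score` keys that the cells above read from the pipeline output, using rounded values from the "Goodnight kiddo" example.

```python
def format_predictions(preds):
    """Turn fill-mask predictions into display lines, one per candidate token."""
    return [
        f"Token: {p['token_str']}, Score: {100 * p['score']:,.2f}%"
        for p in preds
    ]


# Stand-in data so the helper can be tried without loading a model.
sample_preds = [
    {"token_str": "love", "score": 0.9056},
    {"token_str": "miss", "score": 0.0554},
]

lines = format_predictions(sample_preds)
print("\n".join(lines))
```

With the pipeline loaded, the helper would be called as `format_predictions(nlp(...))` in place of each `for` loop.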
