# 🎭 **Masked Language Modeling with BERT**

Explore masked language modeling (MLM) with BERT: the model uses the context on both sides of a masked position to predict the missing word in a sentence.

## 🛠️ Setup and Installation

Begin by installing (or upgrading to) the latest release of the Hugging Face `transformers` library, which provides the model and tokenizer.

In [None]:
!pip install -U transformers

Collecting transformers
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed transformers-4.39.3


## 📚 Importing Libraries

Import the model and tokenizer classes from `transformers`, along with the numerical libraries used to post-process the model's output.

In [2]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import pandas as pd
import numpy as np
from scipy.special import softmax

## 🤖 Model Setup

Load the pre-trained `bert-base-cased` checkpoint with a masked language modeling head, together with its matching tokenizer.

In [3]:
model_name = "bert-base-cased"

# Loading the pre-trained model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)




BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architect


## 🎭 Defining the Mask Token

Retrieve the tokenizer's mask token (`[MASK]` for BERT), which marks the position in the sentence that the model should predict.

## ✏️ Creating the Input Sentence

Craft a sentence with a missing word indicated by the mask token, to test the model's predictive power.

In [4]:
# Defining the mask token
mask = tokenizer.mask_token

# Defining the sentence
sentence = f"I want to {mask} pizza for tonight."

# Tokenizing the sentence
tokens = tokenizer.tokenize(sentence)

## 🔍 Tokenization and Encoding

Encode the sentence into PyTorch tensors (`return_tensors="pt"`), the input format the model expects.

## 📈 Model Prediction

Feed the encoded inputs to the model and extract the logits: one row per token position in the encoded input, one column per vocabulary entry.

In [5]:
# Encoding the input sentence and getting model predictions
encoded_inputs = tokenizer(sentence, return_tensors="pt")
output = model(**encoded_inputs)

# Detaching the logits from the computation graph and converting to a numpy array;
# [0] selects the only sentence in the batch
logits = output.logits.detach().numpy()[0]
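The rows of `logits` line up with positions in the encoded input, not with the output of `tokenizer.tokenize`. By default, BERT's tokenizer wraps the encoded sequence in `[CLS]` and `[SEP]`, so every token sits one position later than in the plain token list. The sketch below illustrates that shift with a hand-written token list; the exact split is an assumption, since the real one depends on the tokenizer's vocabulary.

```python
# Illustrative token list (assumed); the real split depends on the vocabulary.
tokens = ["I", "want", "to", "[MASK]", "pizza", "for", "tonight", "."]

# The encoded input is wrapped in [CLS] ... [SEP],
# so every position shifts right by one.
encoded_tokens = ["[CLS]"] + tokens + ["[SEP]"]

mask_in_tokens = tokens.index("[MASK]")
mask_in_encoded = encoded_tokens.index("[MASK]")

print(mask_in_tokens, mask_in_encoded)
```

With the real tokenizer, comparing `encoded_inputs["input_ids"][0]` against `tokenizer.mask_token_id` should locate the same position without a manual offset.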


## 🔎 Analyzing Predictions

Select the logits at the masked position and apply softmax to turn them into probabilities over the vocabulary, used here as confidence scores for each candidate replacement.

In [6]:
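# Extracting the logits for the masked token; the + 1 skips the [CLS] token
# that the tokenizer prepends to the encoded input (tokenize() does not add it)
masked_logits = logits[tokens.index(mask) + 1]
# Converting the logits to probabilities over the vocabulary
confidence_scores = softmax(masked_logits)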
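To make the softmax step concrete, here is a minimal plain-Python sketch of the same computation on a toy example. The function name `softmax_sketch` and the four logits are hypothetical, chosen only for illustration. The notebook itself uses `scipy.special.softmax` on the full vocabulary.

```python
import math

def softmax_sketch(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_sketch(toy_logits)

print([round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so the token with the largest logit also gets the highest confidence score.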

## 📝 Displaying Top Predictions

Cycle through the top 5 predicted tokens, substituting the masked token in the original sentence to show the model's suggestions.


In [7]:
# Iterating over the top 5 predicted tokens and printing the sentences with the masked token replaced
for i in np.argsort(confidence_scores)[::-1][:5]:
    pred_token = tokenizer.decode(i)
    score = confidence_scores[i]

    # print(pred_token, score)
    print(sentence.replace(mask, pred_token), f"--> Score - {score}")

I want to have pizza for tonight. --> Score - 0.25729063153266907
I want to get pizza for tonight. --> Score - 0.17849592864513397
I want to eat pizza for tonight. --> Score - 0.15555556118488312
I want to make pizza for tonight. --> Score - 0.11422409117221832
I want to order pizza for tonight. --> Score - 0.09823039919137955
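The loop above relies on `np.argsort` returning indices in ascending order of score, which are then reversed and truncated to the first five. The same ranking logic can be sketched in plain Python. The vocabulary, the scores, and the `top_k` helper below are hypothetical stand-ins, not the model's actual output.

```python
# Hypothetical vocabulary and probabilities standing in for the model's output.
vocab = ["have", "eat", "order", "get", "make", "the"]
scores = [0.26, 0.16, 0.10, 0.18, 0.11, 0.01]

def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

template = "I want to [MASK] pizza for tonight."
top_indices = top_k(scores, 3)

# Substitute each top-ranked token into the masked position.
for i in top_indices:
    print(template.replace("[MASK]", vocab[i]), f"--> Score - {scores[i]}")
```

Ranking by index rather than sorting the scores directly keeps the link back to the vocabulary, which is what lets each score be mapped to its token.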
