# BERT Architecture and Workflow

BERT Architecture Overview
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model designed to pre-train deep bidirectional representations. Its architecture is based on the Transformer encoder, and the model is pre-trained on large amounts of text data using two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). This allows BERT to understand the context of words in sentences both from the left and right directions.

Input (Tokens) 
       |
     [Embedding Layer]
       |
     [Transformer Encoder Layers]
       |
     [Final Embeddings]
       |
     [Prediction Heads] --> Masked Language Model (MLM)
                         --> Next Sentence Prediction (NSP)


- Input Layer: Token embeddings, segment embeddings, and position embeddings.
- Transformer Encoder Layers: A stack of identical layers (12 in BERT-base, 24 in BERT-large), each combining multi-head self-attention with a feed-forward network.
- Prediction Heads: Tasks like MLM and NSP use the final embeddings for prediction.
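
The input layer described above can be sketched in plain Python. This is a minimal illustration, assuming tiny randomly filled tables in place of learned weights; the names `token_table`, `position_table`, `segment_table`, and `embed` are hypothetical. The real model also applies LayerNorm and dropout after the sum, which this sketch leaves out.

```python
import random

# Toy sizes, assumed for illustration only
# (bert-base-uncased uses 30522 x 768, 512 x 768 and 2 x 768 tables).
VOCAB_SIZE, MAX_POSITIONS, NUM_SEGMENTS, HIDDEN = 10, 8, 2, 4

rng = random.Random(0)

def make_table(rows, dim):
    """Return a rows x dim table of random floats standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(VOCAB_SIZE, HIDDEN)
position_table = make_table(MAX_POSITIONS, HIDDEN)
segment_table = make_table(NUM_SEGMENTS, HIDDEN)

def embed(input_ids, token_type_ids):
    """Sum token, position and segment embeddings for each position."""
    embedded = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, token_type_ids)):
        vector = [
            t + p + s
            for t, p, s in zip(
                token_table[token_id],
                position_table[position],
                segment_table[segment_id],
            )
        ]
        embedded.append(vector)
    return embedded

embeddings = embed(input_ids=[1, 5, 7, 2], token_type_ids=[0, 0, 1, 1])
print(len(embeddings), len(embeddings[0]))  # sequence length, hidden size
```

Each position ends up with one vector that encodes which token it is, where it sits, and which sentence it belongs to.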


##### BERT Workflow Explanation in Steps
Let's go through the BERT workflow and understand it step-by-step using Python and Hugging Face's transformers library.

1. Input Preprocessing: BERT requires input in the form of tokens, attention masks, and token type IDs. The BertTokenizer handles tokenization and preprocessing.
2. Model Initialization: BERT’s pre-trained model is loaded using Hugging Face.
3. Forward Pass: Input tokens are passed through the model to get final embeddings and outputs.
4. Prediction Tasks: Use the outputs for downstream tasks (e.g., masked token prediction or sentence classification).


Step-by-Step Breakdown:

1. Tokenization:
Text input is tokenized into input IDs, and an attention mask is created.
The tokenizer adds special tokens ([CLS] at the start and [SEP] at the end of input).
2. Embedding:
Tokens are converted into embeddings. The embedding layer sums token embeddings, positional embeddings, and segment embeddings.
3. Transformer Encoder:
The core of BERT is the multi-layer transformer encoder. Each encoder layer applies multi-head attention and feed-forward networks to capture contextual information.
4. Final Layer Output:
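The final output contains token-level embeddings (last_hidden_state) and the pooled output (pooler_output), which is the final hidden state of the [CLS] token passed through a dense layer with a tanh activation. The pooled output can be used for classification tasks.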
5. Prediction:
For masked language modeling, the final embeddings are used to predict the masked tokens. For sentence classification, the pooled_output is typically passed through a classification layer.

This flow shows how BERT processes input text and generates contextualized embeddings for downstream tasks. The model can then be fine-tuned for specific tasks such as text classification or question answering.

##### Detailed:
1. Tokenization and Input Processing: How the tokenizer works, handling special tokens like [CLS] and [SEP], and token types (e.g., token type IDs for sentence pairs).
2. Transformer Encoder Layer: Detailed breakdown of the self-attention mechanism, multi-head attention, and the role of positional embeddings.
3. Masked Language Modeling (MLM): How BERT is pre-trained using MLM, where random tokens are masked and predicted.
4. Next Sentence Prediction (NSP): How BERT is pre-trained to understand sentence relationships and the workflow for this task.
5. Fine-Tuning for Downstream Tasks: How to fine-tune BERT for tasks like text classification, sentiment analysis, or question answering.

In [2]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


## 1. Tokenization and Input Processing

BERT’s tokenization process involves converting raw text into input tokens that the model can process. It uses a WordPiece Tokenizer, which splits words into subword units when necessary to handle rare or unknown words.
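
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first WordPiece segmentation for a single lowercase word. The toy vocabulary and the function name `wordpiece` are assumptions for illustration; the real vocabulary has 30522 entries, and the real tokenizer also handles lowercasing and punctuation before this step. Continuation pieces carry the `##` prefix.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a piece that continues a word
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"transform", "##er", "model", "nl", "##p"}

print(wordpiece("transformer", toy_vocab))
print(wordpiece("nlp", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because rare words decompose into known pieces, the model rarely has to fall back to `[UNK]`.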

- Special Tokens: BERT uses special tokens like [CLS] for classification and [SEP] to separate sentences (or signal the end of a sentence).
- Token Type IDs: These indicate sentence A or sentence B for tasks involving sentence pairs (used in Next Sentence Prediction).
- Attention Masks: These tell the model which tokens to focus on (non-padding tokens).

In [11]:
from transformers import BertTokenizer
import torch

In [13]:
## Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [15]:
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [92]:
# Example text (two sentences passed as a single sequence, so all token type IDs are 0)
text = "BERT is a transformer model. It is used for various NLP tasks."

In [94]:
# Tokenize input (convert to input IDs, attention masks, etc.)
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

In [96]:
print("Tokenized Input IDs: ", inputs['input_ids'])
print("Attention Mask: ", inputs['attention_mask'])
print("Token Type IDs: ", inputs['token_type_ids'])  # Used when handling sentence pairs

Tokenized Input IDs:  tensor([[  101, 14324,  2003,  1037, 10938,  2121,  2944,  1012,  2009,  2003,
          2109,  2005,  2536, 17953,  2361,  8518,  1012,   102]])
Attention Mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Token Type IDs:  tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


Key Points:

[CLS]: Added at the beginning for classification. 

[SEP]: Added at the end to separate sentences or mark the end of a single sentence.

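Padding: Sequences in a batch must share one length, so shorter sequences are padded with [PAD] (up to the model maximum of 512 tokens), and the attention mask marks padding positions with 0.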
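
The key points above can be sketched without the library. This is a minimal illustration of how the three input lists relate to each other, assuming the special token ids shown in the tokenizer printout (101, 102, 0); `build_inputs` is a hypothetical helper, not part of `transformers`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids taken from the tokenizer printout

def build_inputs(ids_a, ids_b=None, max_length=None):
    """Assemble input_ids, token_type_ids and attention_mask for one example."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)          # segment A
    if ids_b is not None:
        input_ids += list(ids_b) + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)   # segment B, including its [SEP]
    attention_mask = [1] * len(input_ids)          # 1 = real token
    if max_length is not None and len(input_ids) < max_length:
        padding = max_length - len(input_ids)
        input_ids += [PAD_ID] * padding
        token_type_ids += [0] * padding
        attention_mask += [0] * padding            # 0 = padding, ignored by attention
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Token ids of the two example sentences, copied from the output above
sentence_a = [14324, 2003, 1037, 10938, 2121, 2944, 1012]
sentence_b = [2009, 2003, 2109, 2005, 2536, 17953, 2361, 8518, 1012]

pair = build_inputs(sentence_a, sentence_b)
padded = build_inputs(sentence_a, max_length=12)
print(pair["token_type_ids"])
print(padded["attention_mask"])
```

Passing the sentences as a pair is what produces token type IDs of 1 for the second sentence; a single string always yields zeros.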


## 2. Transformer Encoder Layer

The transformer encoder layer is at the heart of BERT. It's a stack of layers that apply multi-head self-attention followed by position-wise feed-forward networks.

Key Concepts:
- Self-Attention: BERT applies attention to each token by considering other tokens in the input sequence, allowing it to capture contextual relationships.
- Multi-Head Attention: BERT uses multiple attention heads to capture different aspects of context.
- Positional Embeddings: Since transformers don't have inherent knowledge of token order, BERT adds positional embeddings to the input.

Self-Attention Formula:

Query (Q), Key (K), and Value (V) vectors are created for each token.

The attention scores are calculated as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the key/query vectors.
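
The scaled dot-product attention formula can be worked through on small lists. The sketch below is a plain-Python illustration, assuming Q, K, and V are already given as lists of vectors (in BERT they come from learned linear projections, which are omitted here); the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
for row in attention(Q, K, V):
    print(row)
```

Each output row is a weighted average of the value vectors, with larger weights on the keys most similar to that row's query.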

In [1]:
from transformers import BertModel

In [5]:
# Load pre-trained BERT model
model = BertModel.from_pretrained("bert-base-uncased")

In [35]:
# Forward pass through BERT
outputs = model(**inputs)
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.6356, -0.5310, -0.0545,  ..., -0.2000, -0.1567,  0.8965],
         [ 0.3849, -0.5095,  0.3477,  ..., -0.0883, -0.0330,  0.0359],
         [-0.6935, -0.3289,  0.0345,  ..., -0.1874, -0.3868,  0.9165],
         ...,
         [-0.1377, -0.1032, -0.0988,  ..., -0.4231, -1.1112, -0.4846],
         [ 0.4561, -0.1571, -0.4719,  ...,  0.2254, -0.4706, -0.2761],
         [-0.7604, -0.3017,  0.3763,  ...,  0.6815, -0.6139,  0.5033]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-8.7065e-01, -3.3561e-01, -8.6563e-01,  6.5553e-01,  7.4882e-01,
         -2.0365e-01,  5.5993e-01,  1.7072e-01, -6.4585e-01, -9.9999e-01,
         -3.5010e-01,  7.6290e-01,  9.6332e-01,  5.3351e-02,  7.4888e-01,
         -6.5730e-01, -3.4508e-02, -4.6789e-01,  3.2335e-01,  6.8526e-02,
          4.5708e-01,  9.9998e-01,  2.4238e-01,  2.6449e-01,  2.8081e-01,
          8.8371e-01, -5.7988e-01,  8.0844e-01,  9.1552e-01,  7.520

In [43]:
# Last hidden states (for each token) and pooled output (for classification)
last_hidden_state = outputs.last_hidden_state # Token-level embeddings
pooler_output = outputs.pooler_output  # Pooled [CLS] representation (dense layer + tanh over the [CLS] hidden state)

In [49]:
print("Last Hidden State Shape: ", last_hidden_state.shape)
print("Pooled Output Shape: ", pooler_output.shape)

Last Hidden State Shape:  torch.Size([1, 18, 768])
Pooled Output Shape:  torch.Size([1, 768])
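
The shapes above show one 768-dimensional vector per token and a single pooled vector per sequence. As a rough sketch of how the pooled vector is derived, the code below takes the first token's hidden state and applies a dense layer followed by tanh. The tiny 3-dimensional vectors and the identity weight matrix are assumptions for illustration; the real layer uses learned 768 x 768 weights.

```python
import math

def pooler(last_hidden_state, weight, bias):
    """Apply dense + tanh to the first ([CLS]) token's hidden state of one sequence."""
    cls_vector = last_hidden_state[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Hypothetical hidden states for a 2-token sequence with hidden size 3
hidden = [[0.5, -1.0, 2.0], [0.1, 0.2, 0.3]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

pooled = pooler(hidden, identity, zero_bias)
print(pooled)
```

The tanh explains why every value in `pooler_output` lies between -1 and 1.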


Key Points:
###### Multi-Head Self-Attention allows BERT to look at each token from different perspectives.
###### Feed-forward layers follow attention, applying non-linearity and projection.
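
The idea of several heads looking at the same tokens can be sketched by slicing each hidden vector into equal parts and running attention on each slice independently. This is a simplified illustration: it assumes identity projections (the learned query, key, value, and output matrices are omitted), and all function names are hypothetical. BERT-base splits its 768 dimensions into 12 heads of 64 dimensions each.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs

def split_heads(hidden_states, num_heads):
    """Slice every token vector into num_heads contiguous chunks."""
    head_dim = len(hidden_states[0]) // num_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in hidden_states]
        for h in range(num_heads)
    ]

def multi_head_self_attention(hidden_states, num_heads):
    """Run self-attention per head, then concatenate the head outputs per token."""
    per_head = [attention(head, head, head) for head in split_heads(hidden_states, num_heads)]
    merged = []
    for token_index in range(len(hidden_states)):
        row = []
        for head_output in per_head:
            row.extend(head_output[token_index])
        merged.append(row)
    return merged

# Hypothetical hidden states: 3 tokens, hidden size 4, split into 2 heads of size 2
hidden_states = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_self_attention(hidden_states, num_heads=2)
print(len(output), len(output[0]))
```

Since each head computes its own attention weights, two heads can attend to different tokens for the same position.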

## 3. Masked Language Modeling (MLM)

During pre-training, BERT randomly masks some input tokens (typically 15%) and trains the model to predict them. This allows BERT to learn bidirectional context.

###### Masking Strategy: 
Of the tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
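
The masking strategy can be sketched as a small function over token ids. This is a minimal illustration, not the original pre-training code: `mask_tokens` is a hypothetical helper, random replacements are drawn uniformly from the whole vocabulary, and unselected positions get the label -100, the value that PyTorch's `CrossEntropyLoss` ignores by default.

```python
import random

MASK_ID = 103          # [MASK] id from the tokenizer printout
IGNORE_INDEX = -100    # positions with this label do not contribute to the loss

def mask_tokens(input_ids, vocab_size, special_ids, rng, mask_prob=0.15):
    """Return (masked_ids, labels) following the 80/10/10 strategy."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never select special tokens; select other tokens with probability mask_prob.
        if token_id in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = token_id                        # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                     # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

ids = [101, 14324, 2003, 1037, 2944, 2005, 17953, 2361, 8518, 1012, 102]
masked_ids, labels = mask_tokens(
    ids, vocab_size=30522, special_ids={101, 102}, rng=random.Random(0)
)
print(masked_ids)
print(labels)
```

Keeping some selected tokens unchanged or randomized means the model cannot assume that only [MASK] positions need a prediction.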

In [58]:
from transformers import BertForMaskedLM
import torch

In [110]:
## Load pre-trained model for MLM
model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [134]:
## Example sentences with [MASK] token
masked_text = "BERT is [MASK] a model for NLP tasks."
inputs = tokenizer(masked_text, return_tensors='pt')
inputs

{'input_ids': tensor([[  101, 14324,  2003,   103,  1037,  2944,  2005, 17953,  2361,  8518,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [122]:
## Forward pass for MLM
outputs = model_mlm(**inputs)
outputs

MaskedLMOutput(loss=None, logits=tensor([[[ -6.5931,  -6.5281,  -6.5194,  ...,  -5.8572,  -5.6498,  -4.1681],
         [ -5.9752,  -5.9742,  -5.9816,  ...,  -6.0757,  -5.9236,  -4.5834],
         [-10.8940, -10.3371, -10.8128,  ...,  -9.9103,  -8.2486,  -8.0948],
         ...,
         [ -0.7863,  -0.4980,  -0.6281,  ...,  -1.3897,  -1.0887,  -2.6577],
         [-11.2825, -10.9634, -11.0110,  ...,  -8.5080,  -8.9784,  -7.6255],
         [-15.6298, -15.5663, -15.5410,  ..., -14.5949, -12.5687,  -9.8353]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [126]:
logits = outputs.logits
logits

tensor([[[ -6.5931,  -6.5281,  -6.5194,  ...,  -5.8572,  -5.6498,  -4.1681],
         [ -5.9752,  -5.9742,  -5.9816,  ...,  -6.0757,  -5.9236,  -4.5834],
         [-10.8940, -10.3371, -10.8128,  ...,  -9.9103,  -8.2486,  -8.0948],
         ...,
         [ -0.7863,  -0.4980,  -0.6281,  ...,  -1.3897,  -1.0887,  -2.6577],
         [-11.2825, -10.9634, -11.0110,  ...,  -8.5080,  -8.9784,  -7.6255],
         [-15.6298, -15.5663, -15.5410,  ..., -14.5949, -12.5687,  -9.8353]]],
       grad_fn=<ViewBackward0>)

In [154]:
inputs.input_ids

tensor([[  101, 14324,  2003,   103,  1037,  2944,  2005, 17953,  2361,  8518,
          1012,   102]])

In [160]:
tokenizer.mask_token_id

103

In [170]:
import numpy as np

In [180]:
masked_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
masked_index

tensor([3])

In [204]:
predicted_token_id = logits[0, masked_index].argmax(axis = 1)
predicted_token_id

tensor([2036])

In [208]:
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted token for [MASK]: {predicted_token}")


Predicted token for [MASK]: also


Key Points:
###### MLM Pre-training: 
This task teaches BERT to predict missing tokens, which enhances its understanding of the context.

## 4. Next Sentence Prediction (NSP)
BERT is also pre-trained on a binary classification task called Next Sentence Prediction (NSP). Given two sentences, it predicts whether the second sentence follows the first one in the original text.

- 50% of the time, the second sentence is the actual next sentence.
- 50% of the time, it’s a random sentence.
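
The 50/50 sampling can be sketched as follows. This is a simplified illustration: `make_nsp_pairs` is a hypothetical helper, and the random sentence is drawn from the same list rather than from a different document. The labels follow the Hugging Face convention, where 0 means the second sentence is the continuation and 1 means it is random.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, label) training triples from consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a random alternative")
    pairs = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the true next sentence
            pairs.append((first, sentences[i + 1], 0))
        else:
            # Negative example: any sentence other than the current or the true next one
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((first, rng.choice(candidates), 1))
    return pairs

# Hypothetical mini corpus
corpus = [
    "BERT is a transformer model.",
    "It is used for various NLP tasks.",
    "The weather is nice today.",
    "Fine-tuning adapts it to a task.",
]
for sentence_a, sentence_b, label in make_nsp_pairs(corpus, random.Random(0)):
    print(label, "|", sentence_a, "->", sentence_b)
```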

In [213]:
from transformers import BertForNextSentencePrediction

In [217]:
## Load pre-trained model for NSP
model_nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
model_nsp

BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [219]:
# Two sentences for NSP
sentence_1 = "BERT is a transformer model."
sentence_2 = "It is used for various NLP tasks."

In [225]:
# Tokenize sentence pair for NSP task
inputs = tokenizer(sentence_1, sentence_2, return_tensors='pt')
inputs

{'input_ids': tensor([[  101, 14324,  2003,  1037, 10938,  2121,  2944,  1012,   102,  2009,
          2003,  2109,  2005,  2536, 17953,  2361,  8518,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [229]:
## Forward pass for NSP
outputs = model_nsp(**inputs)
outputs

NextSentencePredictorOutput(loss=None, logits=tensor([[ 5.8095, -5.0802]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [235]:
logits = outputs.logits
logits

tensor([[ 5.8095, -5.0802]], grad_fn=<AddmmBackward0>)

In [237]:
# Logits are unnormalized scores: index 0 = sentence B follows sentence A, index 1 = sentence B is random
print("NSP logits: ", logits)

NSP logits:  tensor([[ 5.8095, -5.0802]], grad_fn=<AddmmBackward0>)
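
The logits above are raw scores, not probabilities. A softmax turns them into probabilities, as in this small sketch that reuses the two logit values printed above (in the Hugging Face convention, index 0 stands for "is the next sentence"). The `softmax` helper is written out by hand here so the example has no dependencies.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

nsp_logits = [5.8095, -5.0802]  # values copied from the output above
is_next_prob, not_next_prob = softmax(nsp_logits)
print(f"P(is next)  = {is_next_prob:.5f}")
print(f"P(not next) = {not_next_prob:.5f}")
```

With a gap of roughly 11 between the two logits, nearly all probability mass lands on "is next", which matches the example pair.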


Key Points:
###### NSP Pre-training: 
This task helps BERT understand sentence relationships, which is useful in tasks like question answering and summarization.

## 5. Fine-Tuning for Downstream Tasks
BERT can be fine-tuned for a variety of NLP tasks, such as text classification, question answering, and named entity recognition. Fine-tuning is typically done by adding a task-specific head (e.g., classification layer) on top of BERT and training the model on a specific dataset.

Example: Fine-tuning for Text Classification

1. Add a classification head (e.g., a simple linear layer) on top of the pooled output from BERT.
2. Train on a labeled dataset (e.g., sentiment analysis) by minimizing a loss function (e.g., cross-entropy loss).

In [287]:
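from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch.optim.AdamW is a drop-in replacement here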

In [289]:
# Load pre-trained BERT model for text classification (binary classification)
model_cls = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model_cls

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [290]:
# Example input (assigned to `inputs` so the forward pass below classifies this sentence)
inputs = tokenizer("BERT is awesome!", return_tensors='pt')
inputs

{'input_ids': tensor([[  101, 14324,  2003, 12476,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [291]:
# Forward pass (get classification logits)
outputs = model_cls(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[0.2670, 0.0884]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [294]:
logits = outputs.logits
logits

tensor([[0.2670, 0.0884]], grad_fn=<AddmmBackward0>)

In [297]:
# Labels for the loss: CrossEntropyLoss expects class indices of shape (batch_size,)
labels = torch.tensor([1])  # Batch size 1, class index 1
labels

tensor([1])

In [299]:
# Load pre-trained BERT model for text classification (binary classification)
model_cls = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Example input
inputs = tokenizer("BERT is awesome!", return_tensors='pt')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [306]:

# Forward pass (get classification logits)
outputs = model_cls(**inputs)
logits = outputs.logits

# Compute loss (if labels are provided)
# CrossEntropyLoss expects class indices of shape (batch_size,), not (batch_size, 1)
labels = torch.tensor([1])  # Batch size 1, class index 1
loss = torch.nn.CrossEntropyLoss()(logits, labels)

In [307]:
# Optimizer: one gradient step on the single example
optimizer = AdamW(model_cls.parameters(), lr=1e-5)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("Logits: ", logits)
print("Loss: ", loss.item())


Key Points:
1. Fine-Tuning: BERT adapts to many tasks by adding a small task-specific head on top of its embeddings and training the model (the head alone, or more commonly all layers) on labeled data.
2. Transfer Learning: Fine-tuning reuses BERT's pre-trained knowledge, so a specific task typically needs far less labeled data than training from scratch.

##### Summary of BERT Workflow:
1. Tokenization: Convert input text into tokens, including special tokens [CLS] and [SEP].
2. Transformer Encoding: Apply multi-head attention and feed-forward networks to create contextual embeddings.
3. Pre-training Tasks: MLM and NSP help BERT learn language representation.
4. Fine-Tuning: Adapt BERT for specific NLP tasks with minimal modifications to the architecture.