<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Architectures/bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT: Bidirectional Encoder Representations from Transformers

BERT is a transformer-based language model developed by Google in 2018. It revolutionized NLP by introducing deep bidirectional representations, achieving state-of-the-art performance on a wide range of tasks with minimal task-specific architecture modifications.

## Key Innovations

BERT introduced several key innovations to language modeling:

1. **Bidirectional Training**: Unlike earlier models that read text only left-to-right or only right-to-left, BERT conditions each token's representation on both its left and right context in every layer.

2. **Pre-training Tasks**:
   - **Masked Language Modeling (MLM)**: Randomly mask 15% of tokens and train the model to predict the original vocabulary ID of the masked word based on its context.
   - **Next Sentence Prediction (NSP)**: Train the model to understand relationships between sentences by predicting whether sentence B follows sentence A in the original text.

3. **Transfer Learning for NLP**: BERT demonstrated that a model pre-trained on large text corpora could be fine-tuned for specific downstream tasks with minimal additional training.

## Architecture

BERT's architecture is a stack of Transformer encoder layers, as introduced in "Attention Is All You Need" (Vaswani et al., 2017); it has no decoder:

- **BERT-Base**: 12 transformer layers, 12 attention heads, 768 hidden dimensions (110M parameters)
- **BERT-Large**: 24 transformer layers, 16 attention heads, 1024 hidden dimensions (340M parameters)
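
The parameter counts quoted above can be roughly reproduced from the layer count and hidden size alone. The sketch below is a back-of-the-envelope tally, not an official formula. It assumes a feed-forward size of 4× the hidden size, 512 positions, 2 segment types, a pooler layer on top, and the 30,522-entry vocabulary of the `bert-base-uncased` checkpoint. `count_bert_params` is a hypothetical helper, not a library function.

```python
def count_bert_params(num_layers, hidden, vocab_size=30522,
                      max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (no task head)."""
    ffn = 4 * hidden         # assumption: feed-forward size is 4x the hidden size
    layer_norm = 2 * hidden  # scale + bias

    # Token, position, and segment embedding tables plus one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Q, K, V, and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), then LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] output
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = count_bert_params(num_layers=12, hidden=768)
large = count_bert_params(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the tally lands close to the 110M and 340M figures. The number of attention heads does not appear in the formula because the heads split the same projection matrices rather than adding new ones.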

![BERT Architecture](https://miro.medium.com/max/700/1*0bxsEnVxnHIg2g5E5Y6q0g.png)

### Input Representation

BERT's input representation is the sum of three embeddings:

1. **Token Embeddings**: WordPiece embeddings with a 30,000-token vocabulary
2. **Segment Embeddings**: Mark which sentence of a pair each token belongs to (0 for the first sentence, 1 for the second)
3. **Position Embeddings**: Learned vectors that indicate the position of each token in the sequence

![BERT Input Representation](https://miro.medium.com/max/700/1*vLF7q-ktD73lmv6oL5bW3A.png)
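
To make the three-way sum concrete, here is a toy sketch in plain Python. It assumes a sentence pair packed as `[CLS] A [SEP] B [SEP]`, a tiny made-up vocabulary, and random 4-dimensional tables standing in for learned embeddings. The helper names (`build_inputs`, `make_table`, `embed`) are illustrative and not part of any library.

```python
import random

def build_inputs(tokens_a, tokens_b):
    """Pack a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def make_table(num_rows, dim, rng):
    """Random stand-in for a learned embedding table."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

def embed(tokens, segment_ids, position_ids, vocab, tables):
    """Sum token, segment, and position embeddings element-wise for each token."""
    token_table, segment_table, position_table = tables
    output = []
    for tok, seg, pos in zip(tokens, segment_ids, position_ids):
        rows = (token_table[vocab[tok]], segment_table[seg], position_table[pos])
        output.append([sum(values) for values in zip(*rows)])
    return output

rng = random.Random(0)
tokens, segment_ids, position_ids = build_inputs(["my", "dog"], ["he", "barks"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
dim = 4
tables = (make_table(len(vocab), dim, rng), make_table(2, dim, rng),
          make_table(len(tokens), dim, rng))
vectors = embed(tokens, segment_ids, position_ids, vocab, tables)

print(tokens)
print(segment_ids)
print(len(vectors), len(vectors[0]))
```

Each token ends up with a single vector of the same width as the individual embeddings, which is what the first encoder layer receives.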

## Pre-training Process

BERT was pre-trained on:
- BooksCorpus (800M words)
- English Wikipedia (2,500M words)

### Masked Language Modeling (MLM)

For each training example:
1. Randomly select 15% of tokens as prediction targets
2. Replace 80% of the selected tokens with [MASK]
3. Replace 10% with a random token from the vocabulary
4. Keep 10% unchanged

The model then predicts the original token at every selected position, including those left unchanged or randomly replaced, using context from both directions.
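
The selection and replacement steps above can be sketched in plain Python. This is a minimal illustration, not the original training code: it assumes whole-word tokens, a tiny made-up vocabulary, and an exact count of selected positions (15% of the non-special tokens, rounded) rather than an independent coin flip per token. `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 masking rule; return (corrupted tokens, labels)."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted

    # Special tokens are never chosen as prediction targets
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_selected = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    for i in rng.sample(candidates, num_selected):
        labels[i] = tokens[i]  # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(42)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
tokens = ["[CLS]"] + [rng.choice(vocab) for _ in range(40)] + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)

for tok, new, label in zip(tokens, corrupted, labels):
    if label is not None:
        print(f"{tok!r} -> {new!r} (predict {label!r})")
```

The loss is computed only where `labels` is set, so the unchanged and randomly replaced positions still contribute a training signal even though no [MASK] token appears there.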

### Next Sentence Prediction (NSP)

For each training example:
1. In 50% of examples, sentence B is the sentence that actually follows sentence A (labeled "IsNext")
2. In the other 50%, sentence B is a random sentence from the corpus (labeled "NotNext")
3. Train the model to classify each pair as "IsNext" or "NotNext"
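
A minimal sketch of how such pairs could be generated is shown below. It assumes the corpus is a list of documents, each already split into sentences, and that there are at least two documents so a negative can be drawn from a different one. `make_nsp_examples` and the sample sentences are made up for illustration.

```python
import random

def make_nsp_examples(documents, rng):
    """Build (sentence_a, sentence_b, label) triples from lists of sentences."""
    examples = []
    for doc_index, sentences in enumerate(documents):
        for i in range(len(sentences) - 1):
            sentence_a = sentences[i]
            if rng.random() < 0.5:
                # Positive example: B is the sentence that really follows A
                sentence_b, label = sentences[i + 1], "IsNext"
            else:
                # Negative example: B is drawn from a different document
                other = rng.choice([d for j, d in enumerate(documents) if j != doc_index])
                sentence_b, label = rng.choice(other), "NotNext"
            examples.append((sentence_a, sentence_b, label))
    return examples

documents = [
    ["The cat sat down.", "It began to purr.", "Then it fell asleep."],
    ["Stocks rose on Monday.", "Analysts were surprised.", "Trading volume doubled."],
]
examples = make_nsp_examples(documents, random.Random(0))
for sentence_a, sentence_b, label in examples:
    print(f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP] -> {label}")
```

Drawing negatives from a different document keeps a "NotNext" pair from accidentally being a true continuation.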

## Fine-tuning for Downstream Tasks

BERT can be fine-tuned for various NLP tasks with minimal architectural changes:

- **Classification Tasks** (Sentiment analysis, NLI): Add a classification layer on top of the [CLS] token output
- **Question Answering**: Add start/end span prediction layers
- **Named Entity Recognition**: Use outputs from all tokens for token-level predictions
- **Paraphrase Detection**: Use the [CLS] token for sentence pair classification
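
The heads listed above are small compared with the encoder itself. The following PyTorch sketch shows their shapes using a random tensor as a stand-in for BERT's output, so no model download is needed. The head names and `num_tags = 9` are arbitrary example choices, and library implementations typically add dropout (and, for classification, a pooling layer) that is omitted here.

```python
import torch
from torch import nn

hidden_size, num_tags = 768, 9  # num_tags is an arbitrary example value
batch_size, seq_len = 2, 10

# Random stand-in for BERT's last hidden state: (batch, seq_len, hidden)
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Classification / sentence-pair tasks: linear layer on the [CLS] position
classifier = nn.Linear(hidden_size, 2)
class_logits = classifier(hidden_states[:, 0, :])  # (batch, 2)

# Question answering: one start score and one end score per token
qa_head = nn.Linear(hidden_size, 2)
start_logits, end_logits = qa_head(hidden_states).unbind(dim=-1)  # each (batch, seq_len)

# Named entity recognition: one score per tag for every token
tagger = nn.Linear(hidden_size, num_tags)
tag_logits = tagger(hidden_states)  # (batch, seq_len, num_tags)

print(class_logits.shape, start_logits.shape, tag_logits.shape)
```

In each case only a single linear layer is new; everything else is initialized from the pre-trained weights and updated during fine-tuning.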

## Using BERT with Hugging Face Transformers

The Transformers library provides a convenient way to use BERT models.

In [None]:
# Install required libraries
!pip install transformers torch

In [None]:
import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
# Encode text
text = "Here's a sentence to encode with BERT."
inputs = tokenizer(text, return_tensors="pt")
print(inputs)

In [None]:
# Get BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual embedding per token: (batch, seq_len, hidden_size)
last_hidden_states = outputs.last_hidden_state
print(f"Shape of output embeddings: {last_hidden_states.shape}")

# CLS token embedding (often used for classification tasks)
cls_embedding = last_hidden_states[:, 0, :]
print(f"Shape of CLS embedding: {cls_embedding.shape}")

## Fine-tuning BERT for Sentiment Analysis

In [None]:
from torch.optim import AdamW  # the AdamW shipped in transformers is deprecated; use PyTorch's
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, Dataset

# Sample dataset class (replace with real dataset)
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Example fine-tuning setup
def fine_tune_bert(train_texts, train_labels, val_texts, val_labels):
    # Load pre-trained BERT model for sequence classification
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # Prepare datasets
    train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
    val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)
    
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16)
    
    # Optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(train_loader) * 3  # 3 epochs
    )
    
    # Training loop (simplified)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
    for epoch in range(3):  # 3 epochs
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            
            inputs = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**inputs)
            loss = outputs.loss
            
            loss.backward()
            optimizer.step()
            scheduler.step()
            
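        # Validation: report accuracy on the held-out set after each epoch
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for batch in val_loader:
                inputs = {k: v.to(device) for k, v in batch.items()}
                logits = model(**inputs).logits
                correct += (logits.argmax(dim=-1) == inputs['labels']).sum().item()
                total += inputs['labels'].size(0)
        print(f"Epoch {epoch + 1}: validation accuracy = {correct / max(total, 1):.3f}")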
        
    return model, tokenizer

## BERT Variants and Evolution

Since its introduction, several variants of BERT have been developed:

1. **RoBERTa** (Facebook, 2019): Removes NSP and trains with dynamic masking, larger batches, and more data
2. **DistilBERT** (Hugging Face, 2019): Uses knowledge distillation to produce a smaller, faster BERT model
3. **ALBERT** (Google, 2020): Cuts the parameter count with cross-layer parameter sharing and factorized embeddings
4. **ELECTRA** (Google, 2020): Swaps MLM for replaced token detection, a more sample-efficient pre-training objective
5. **DeBERTa** (Microsoft, 2021): Introduces disentangled attention, which encodes content and position separately

Each variant addresses specific limitations or improves certain aspects of the original BERT.

## Applications of BERT

BERT has been successfully applied to a wide range of NLP tasks:

- **Text Classification**: Sentiment analysis, topic classification, toxicity detection
- **Question Answering**: SQuAD, Natural Questions
- **Named Entity Recognition**: Identifying persons, locations, organizations in text
- **Text Summarization**: When combined with generation components
- **Information Retrieval**: Search engine results ranking
- **Language Understanding**: GLUE benchmark tasks (NLI, paraphrasing)
- **Document Classification**: Legal, medical, scientific document categorization

## Limitations of BERT

Despite its success, BERT has several limitations:

1. **Maximum Sequence Length**: Inputs are capped at 512 tokens, so longer documents must be truncated or split into chunks
2. **Computational Requirements**: Large model size requires significant computational resources
3. **Encoder-only Architecture**: Not directly suitable for text generation tasks
4. **Static Knowledge**: Knowledge is limited to the pre-training data and cannot be updated without further training
5. **Domain Specificity**: May require domain adaptation for specialized fields (medical, legal)
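
One common workaround for the length cap (a general technique, not something BERT itself provides) is to split a long input into overlapping windows and run the model on each. The sketch below works on plain lists of token ids. `chunk_token_ids`, the default `stride` of 128, and the two reserved slots for [CLS] and [SEP] are illustrative assumptions.

```python
def chunk_token_ids(token_ids, max_length=512, stride=128):
    """Split token ids into overlapping windows that each fit the length limit."""
    window = max_length - 2  # reserve room for [CLS] and [SEP]
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += window - stride  # consecutive windows share `stride` tokens
    return chunks

token_ids = list(range(1200))  # stand-in for a tokenized long document
chunks = chunk_token_ids(token_ids)
print([len(chunk) for chunk in chunks])
```

The overlap gives tokens near a window boundary some context from both sides. How the per-window outputs are combined afterwards (averaging, taking the maximum, voting) depends on the task.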

## Conclusion

BERT represents a major milestone in NLP, introducing techniques that have become standard in modern language models:

- Bidirectional context understanding
- Pre-training and fine-tuning paradigm
- Transformer-based architectures for language understanding

Its success paved the way for later models such as RoBERTa, T5, and other transformer-based architectures that continue to advance the state of natural language processing.

## References

- Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). arXiv.
- Vaswani, A., et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). NIPS.
- Liu, Y., et al. (2019). [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692). arXiv.
- Sanh, V., et al. (2019). [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). arXiv.
- Clark, K., et al. (2020). [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555). ICLR.