<a href="https://miro.com/app/board/uXjVGfTwWv8=/">Paper (with notes)</a>
## I. Introduction and Context

**BERT** (Bidirectional Encoder Representations from Transformers) is a language representation model introduced in a seminal Google paper (Devlin et al.) that revolutionized Natural Language Processing (NLP).

*   **Significance:** The paper has amassed nearly 150,000 citations. Along with "Attention Is All You Need" and "Vision Transformer," it represents a core pillar of Google's contribution to AI.
*   **The Core Distinction:** While models like **GPT** (Generative Pre-trained Transformer) excel at **text generation** (predicting the next word), BERT is designed for **language understanding**.
    *   *Analogy:* Just as image generation is not the only task in computer vision (object classification and segmentation are equally important), text generation is not the only goal in NLP. Tasks like classification, grammatical assessment, and Question Answering (QA) require deep understanding rather than just generation.

## II. Architecture and "Deep Bidirectionality"

The architecture of BERT is based on the **Transformer Encoder** from the "Attention Is All You Need" paper.

### A. The Unidirectionality Problem
Previous models had significant limitations regarding context:
1.  **GPT (OpenAI):** Uses a **Left-to-Right (LTR)** architecture. A token can only attend to previous tokens (causally masked attention). This suits generation but is suboptimal for understanding because the model cannot see the "future" context.
2.  **ELMo:** Attempted bidirectionality by training two separate LSTMs (one Left-to-Right, one Right-to-Left) and performing a **shallow concatenation** of their embeddings. This is not "deep" because the left and right contexts are trained independently.

### B. The BERT Solution
BERT achieves **Deep Bidirectionality**. It jointly conditions on both left and right context in *all* layers.
*   **Mechanism:** In BERT, the attention mechanism is **unmasked**. A specific token (e.g., "sat") pays attention to *every* other token in the sequence (e.g., "The", "cat", "on", "the", "mat") simultaneously.
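
The difference between masked (left-to-right) and unmasked attention can be illustrated with the visibility pattern alone. The following is a minimal sketch, not the paper's implementation: `causal_mask` and `bidirectional_mask` are hypothetical helper names, and the masks are plain boolean matrices where entry `[i][j]` says whether token `i` may attend to token `j`.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Unmasked attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
i = tokens.index("sat")

# Tokens that "sat" is allowed to attend to under each scheme.
ltr_visible = [t for t, ok in zip(tokens, causal_mask(len(tokens))[i]) if ok]
bert_visible = [t for t, ok in zip(tokens, bidirectional_mask(len(tokens))[i]) if ok]

print(ltr_visible)
print(bert_visible)
```

Under the causal mask, "sat" sees only itself and the tokens to its left; with the unmasked pattern it sees the whole sentence, including "on", "the", and "mat".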

## III. Input Representation

BERT uses a unified input representation to handle both single sentences and sentence pairs (e.g., Question-Answer) within a single token sequence. The input embedding is the sum of three distinct embeddings:

1.  **Token Embeddings:** Uses a 30,000 token vocabulary (WordPiece).
2.  **Segment Embeddings:** Learned embeddings added to every token to indicate whether it belongs to **Sentence A** or **Sentence B**.
3.  **Position Embeddings:** Learned embeddings that indicate each token's position in the sequence.

**Special Tokens:**
*   **`[CLS]`:** The first token of every sequence. Its final hidden state serves as the **aggregate sequence representation** for classification tasks (e.g., sentiment analysis).
*   **`[SEP]`:** A special separator token used to distinguish between Sentence A and Sentence B.
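
As an illustration of how a sentence pair might be packed into one sequence, here is a small sketch. The function names `build_input` and `embed` are hypothetical, the inputs are assumed to be already tokenized, and the toy vectors stand in for learned embedding tables.

```python
def build_input(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single BERT-style sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # Sentence A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # Sentence B and its closing [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def embed(token_vecs, segment_vecs, position_vecs):
    """Element-wise sum of the three embedding vectors for each token."""
    return [
        [t + s + p for t, s, p in zip(tv, sv, pv)]
        for tv, sv, pv in zip(token_vecs, segment_vecs, position_vecs)
    ]


tokens, segment_ids, position_ids = build_input(
    ["where", "is", "the", "cat"], ["on", "the", "mat"]
)
print(tokens)
print(segment_ids)
print(position_ids)
```

Each of the three id lists would index its own embedding table, and `embed` shows the final step: the three looked-up vectors are simply added per token.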

## IV. Pre-training Tasks

Because standard conditional language models (like GPT) cannot be trained bidirectionally (the model would "see" the target word it is trying to predict), BERT uses two novel unsupervised pre-training objectives.

### Task 1: Masked Language Model (MLM)
Inspired by the "Cloze" task, BERT randomly masks tokens and attempts to predict them based on the surrounding context.

*   **The Setup:** 15% of all tokens in a sequence are chosen for prediction.
*   **The Mismatch Problem:** The `[MASK]` token never appears during fine-tuning (real-world use). If every chosen token were replaced with `[MASK]`, pre-training and fine-tuning inputs would look systematically different.
*   **The Solution (Mixed Masking Strategy):** Of the 15% tokens chosen:
    1.  **80% of the time:** Replaced with the **`[MASK]`** token.
    2.  **10% of the time:** Replaced with a **random token** (noise).
    3.  **10% of the time:** Kept **unchanged** (the model must still predict the correct label for it).
*   **Outcome:** This forces the model to maintain a distributional contextual representation for *every* token, as it never knows which word might be wrong or masked.
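
The mixed masking strategy can be sketched in a few lines. This is a simplified illustration under stated assumptions: `mask_tokens` is a hypothetical helper, each token is selected independently with probability 0.15 (rather than choosing exactly 15% of positions), and special tokens are never selected.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 scheme; returns (corrupted tokens, labels)."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)   # None = position is not used for the MLM loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue
        labels[i] = tok             # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # random replacement (noise)
        # otherwise: keep the token unchanged, but still predict it
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, vocab, rng)
print(corrupted)
print(labels)
```

Positions with a non-`None` label are the only ones that contribute to the loss, which is why the unchanged 10% still carry a training signal.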

### Task 2: Next Sentence Prediction (NSP)
Many downstream tasks (like Question Answering or Natural Language Inference) rely on understanding the relationship between two sentences. Language modeling (MLM) does not inherently capture this.

*   **The Setup:** The model receives pairs of sentences (A and B).
*   **The Objective:** Predict whether B actually follows A in the original text.
    *   **50% of the time:** B is the actual next sentence (Label: **IsNext**).
    *   **50% of the time:** B is a random sentence from the corpus (Label: **NotNext**).
*   **Result:** This simple binary classification task significantly improves performance on QA and inference tasks.
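
A possible way to generate such training pairs is sketched below. `make_nsp_example` is a hypothetical helper; it assumes the corpus is a list of documents, each a list of at least two sentences, and it draws the negative sentence from a different document, which is one reasonable reading of "a random sentence from the corpus".

```python
import random


def make_nsp_example(documents, rng):
    """Sample one (sentence_a, sentence_b, label) triple for next sentence prediction."""
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)   # leave room for a true next sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    # Assumption: pick B from another document so it cannot be the true continuation.
    other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
    sentence_b = rng.choice(documents[other_idx])
    return sentence_a, sentence_b, "NotNext"


documents = [
    ["The cat sat on the mat.", "It fell asleep.", "Nobody woke it."],
    ["Rain fell all night.", "The river rose by morning."],
]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_example(documents, rng))
```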

## V. Fine-tuning vs. Feature-Based Approaches

BERT supports two methods of application:

### 1. Fine-tuning
*   **Method:** The pre-trained BERT parameters are used to initialize the model. A simple task-specific layer (e.g., a classifier) is added on top. **All** parameters (BERT + new layer) are trained (fine-tuned) on the downstream task.
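*   **Efficiency:** Fine-tuning is relatively inexpensive. According to the paper, all of its results can be replicated in at most 1 hour on a single Cloud TPU (or a few hours on a GPU), starting from the same pre-trained model.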

### 2. Feature-Based Approach (e.g., NER)
*   **Method:** The BERT model parameters are **frozen**. The model is used solely to extract "contextual embeddings" (features) which are fed into a separate, randomly initialized model (like a BiLSTM).
*   **Performance:** Concatenating the token representations from the **last four hidden layers** of BERT came within 0.3 F1 of fine-tuning the entire model.
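
The feature extraction step amounts to joining, for each token, the vectors from the last four layers end to end. The sketch below uses nested Python lists as a stand-in for the frozen encoder's outputs; `concat_last_four` and the toy `hidden_states` are illustrative assumptions, not part of any library.

```python
def concat_last_four(hidden_states):
    """hidden_states[layer][token] is a vector (list of floats).

    Returns one feature vector per token: the last four layers' vectors
    joined end to end, so each feature has 4 * hidden_size entries.
    """
    last_four = hidden_states[-4:]
    num_tokens = len(last_four[0])
    return [
        [x for layer in last_four for x in layer[t]]
        for t in range(num_tokens)
    ]


# Toy stand-in for a frozen encoder's outputs: 12 layers, 3 tokens, hidden size 2.
hidden_states = [
    [[float(layer), float(token)] for token in range(3)]
    for layer in range(12)
]
features = concat_last_four(hidden_states)
print(len(features), len(features[0]))
```

With a hidden size of 768, each token would get a 3,072-dimensional feature vector, which is then fed to the downstream model (such as the BiLSTM mentioned above) while BERT's own weights stay fixed.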

## VI. Ablation Studies & Findings

To prove the efficacy of BERT's design choices, the authors performed ablation studies (removing features to see the impact).

1.  **Impact of NSP:** Removing the Next Sentence Prediction task significantly hurt performance on QNLI, MNLI, and SQuAD 1.1.
2.  **Bidirectionality is Key:**
    *   A **Left-to-Right (LTR)** model (No MLM, No NSP) was compared to BERT. The LTR model performed significantly worse, particularly on SQuAD (Question Answering), because token-level hidden states had no "right-side" context.
    *   Adding a BiLSTM on top of an LTR model improved SQuAD results but was still far inferior to BERT's deep bidirectionality, and it hurt performance on the GLUE tasks.
3.  **Model Size Matters:**
    *   **BERT_BASE:** 12 Layers, 768 Hidden Dim, 110M Params (Same size as OpenAI GPT for fair comparison).
    *   **BERT_LARGE:** 24 Layers, 1024 Hidden Dim, 340M Params.
    *   **Finding:** Larger models lead to strict accuracy improvements across all the datasets tested, even those with very small training sets. This contrasts with earlier findings for feature-based models (like ELMo), where increasing model size beyond a certain point gave mixed or no gains.
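
The quoted parameter counts can be roughly reproduced from the layer count and hidden size. The following back-of-the-envelope sketch is an approximation under several assumptions that do not appear in the notes above: a feed-forward width of 4x the hidden size, 512 position embeddings, 2 segment types, a pooling layer over `[CLS]`, and the rounded 30,000-token vocabulary. `approx_bert_params` is a hypothetical helper.

```python
def approx_bert_params(num_layers, hidden, vocab_size=30_000,
                       max_positions=512, num_segments=2):
    """Rough parameter count for a BERT-style encoder (weights, biases, LayerNorm)."""
    layer_norm = 2 * hidden                                   # scale and shift
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output projections
    ffn_size = 4 * hidden                                     # assumed feed-forward width
    feed_forward = (
        (hidden * ffn_size + ffn_size)      # expand
        + (ffn_size * hidden + hidden)      # project back
        + layer_norm
    )
    pooler = hidden * hidden + hidden                         # dense layer over [CLS]
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = approx_bert_params(num_layers=12, hidden=768)
large = approx_bert_params(num_layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land within a few percent of the quoted 110M and 340M. Most of the parameters sit in the stacked encoder layers, so doubling the depth and widening the hidden size accounts for the roughly threefold jump.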

## VII. Summary of Training Details
*   **Data:** Trained on BooksCorpus (800M words) and English Wikipedia (2,500M words).
*   **Compute:** BERT_BASE was trained on 4 Cloud TPUs and BERT_LARGE on 16 Cloud TPUs. Each pre-training run took 4 days.
*   **Convergence:** MLM converges slightly slower than LTR models (because it learns from only 15% of tokens per batch vs. 100% in LTR), but achieves significantly higher absolute accuracy.