# BERT
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art model designed for natural language understanding. It is based on the Transformer architecture and is notable for its bidirectional training approach.

## BERT's Core Concept

BERT improves on previous models like GPT (which is unidirectional, i.e., left-to-right). BERT is **bidirectional**, meaning it looks at both the left and right context in all layers of the encoder, which leads to better understanding of the meaning of a word in context.

- **Main Concept**
    - **Bidirectional:** Unlike traditional language models that read text in a single direction (left-to-right or right-to-left), BERT reads the text in both directions simultaneously. This allows the model to understand the context of a word based on the words that come before and after it.

    - **Masked Language Model:** During pre-training, BERT randomly masks some tokens in a sentence and then tries to predict those masked tokens. This forces the model to build a deep understanding of both sides of the sentence (left and right context).
    
    - **Next Sentence Prediction:** To help BERT understand relationships between sentences, it is trained to predict whether one sentence follows another. This is crucial for tasks like question answering and language inference.

# BERT Architecture
**1. Input Representation:**

BERT processes text in the form of tokens. Before input is fed to BERT, the text is tokenized into subword units using the WordPiece tokenizer.
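As a rough illustration of subword tokenization, the sketch below splits a single word by greedy longest-match-first lookup, the matching strategy commonly described for WordPiece at inference time. The vocabulary is a toy one invented for this example, and the function name `wordpiece_tokenize` is hypothetical; a real tokenizer also handles casing, punctuation, and a much larger learned vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able", "the"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Splitting rare words into known pieces keeps the vocabulary small while still giving every input some representation.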

Each input token is represented by the sum of three embeddings:
- Token Embedding: A vector for each token in the input sequence.

- Segment Embedding: BERT can take a pair of sentences as input. Segment embeddings help distinguish between the two sentences.

- Positional Embedding: Adds positional information to each token so that the model can understand the order of words in the sequence.

**Example:** For the input: *"The cat sat on the mat."*, BERT adds special tokens:

- `[CLS] The cat sat on the mat. [SEP]` Here, `[CLS]` is a special token added at the beginning (its output is used for classification tasks) and `[SEP]` marks the end of a sentence.
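The sum of the three embeddings can be sketched in a few lines. This is a minimal illustration, assuming random toy lookup tables of width 4 and lowercased whole words standing in for WordPiece tokens; the names `make_table` and `embed` are hypothetical, and in the real model the tables are learned parameters.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy width chosen for readability, not the real model size

# Lowercased tokens stand in for WordPiece output (an assumption of this sketch).
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", ".", "[SEP]"]
segment_ids = [0] * len(tokens)          # single sentence: every token is segment A
position_ids = list(range(len(tokens)))  # 0, 1, 2, ...


def make_table(keys):
    """Build a toy lookup table with one random vector per key."""
    return {key: [rng.uniform(-1, 1) for _ in range(HIDDEN)] for key in keys}


token_table = make_table(sorted(set(tokens)))
segment_table = make_table([0, 1])
position_table = make_table(position_ids)


def embed(tokens, segment_ids, position_ids):
    """Return one input vector per token: token + segment + position."""
    vectors = []
    for tok, seg, pos in zip(tokens, segment_ids, position_ids):
        parts = zip(token_table[tok], segment_table[seg], position_table[pos])
        vectors.append([t + s + p for t, s, p in parts])
    return vectors


inputs = embed(tokens, segment_ids, position_ids)
print(len(inputs), len(inputs[0]))  # 9 4
```

Both occurrences of "the" share the same token embedding, but their input vectors differ because the positional embeddings differ.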

**2. Encoder Stack:**

BERT uses a multi-layer Transformer encoder. The standard BERT model has 12 Transformer layers (BERT-Base) or 24 layers (BERT-Large). Each layer contains:

- **Multi-Head Self-Attention:** This layer allows the model to attend to different words in the sequence simultaneously, capturing word relationships and dependencies. Each head captures different contextual information.

- **Feed-Forward Neural Network:** After attention, the output goes through a fully connected feed-forward network to refine the learned representation.

- **Layer Normalization & Residual Connections:** Normalization helps stabilize training, and residual connections enable the model to pass information across layers efficiently.
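To make the attention sublayer more concrete, here is a minimal single-head sketch in plain Python: scaled dot-product self-attention followed by a residual connection and layer normalization. It is a simplification under stated assumptions: the projection matrices are toy identity matrices, there is only one head, the learned gain and bias of layer normalization are omitted, and the feed-forward sublayer is left out. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def layer_norm(vec, eps=1e-12):
    """Normalize one token vector (learned gain and bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attention_sublayer(x, w_q, w_k, w_v):
    """Attention, then residual connection, then layer normalization."""
    attended, weights = self_attention(x, w_q, w_k, w_v)
    out = [layer_norm([a + b for a, b in zip(xi, ai)]) for xi, ai in zip(x, attended)]
    return out, weights


# Toy input: three tokens with two features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention_sublayer(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` shows how much one token attends to every token in the sequence, including tokens to its left and right.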

___
**What does Layer Normalization do?**

Layer Normalization is a technique used in machine learning and artificial intelligence to normalize the inputs of a neural network layer. It ensures that the inputs have a consistent distribution and reduces the internal covariate shift problem that can occur during training. By normalizing the inputs, Layer Normalization enhances the stability and generalization of the network.

Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for RNNs and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with Transformer models.

We compute the layer normalization statistics over all the hidden units in the same layer as follows:

$$
\begin{aligned}
\mu^l &= \frac{1}{H}\sum_{i=1}^{H} a_i^l \\
\sigma^l &= \sqrt{\frac{1}{H}\sum_{i=1}^{H} (a_i^l-\mu^l)^2}
\end{aligned}
$$

where $a_i^l$ is the summed input to hidden unit $i$ in layer $l$ and $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size $1$.
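The statistics above translate directly into code. The sketch below computes $\mu$ and $\sigma$ for one training case at a time; the small `eps` term is a numerical safeguard added for this example, and the learned gain and bias used in practice are omitted.

```python
import math


def layer_norm_stats(a):
    """Mean and standard deviation over the H hidden units of one training case."""
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)
    return mu, sigma


def layer_norm(a, eps=1e-5):
    """Normalize the summed inputs of one layer for a single training case."""
    mu, sigma = layer_norm_stats(a)
    return [(x - mu) / (sigma + eps) for x in a]


# Two training cases get their own statistics; no batch is needed.
case_1 = [1.0, 2.0, 3.0, 4.0]
case_2 = [10.0, 10.0, 30.0, 30.0]

print(layer_norm_stats(case_1))  # mean 2.5, std sqrt(1.25)
print(layer_norm_stats(case_2))  # mean 20.0, std 10.0
print(layer_norm(case_1))
```

Because the statistics depend only on the units within one layer for one input, the same code works with a batch size of 1.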

___

# Pre-training BERT

When training language models, one challenge is defining a prediction goal. Many models predict the next word in a sentence (e.g., "The child came home from ___"), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies:

**Masked LM (MLM):**

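Before word sequences are fed into BERT, 15% of the tokens in each sequence are selected for prediction, and most of them are replaced with a [MASK] token (in the original paper, 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged). The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence. In technical terms, the prediction of the output words requires: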

1. Adding a classification layer on top of the encoder output.

2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

3. Calculating the probability of each word in the vocabulary with softmax.
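The three steps can be sketched for a single masked position as follows. This is a toy illustration, not the actual model head: the vocabulary has three words, the "classification layer" is reduced to a plain linear map with identity weights, and the vectors are hand-picked. The function name `mlm_predict` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_predict(hidden, dense_w, embedding_matrix, vocab):
    """Predict the word at one masked position from its encoder output."""
    # Step 1: a toy linear layer applied to the encoder output vector.
    transformed = [sum(w * h for w, h in zip(row, hidden)) for row in dense_w]
    # Step 2: one logit per vocabulary entry via a dot product with its embedding.
    logits = [sum(e * t for e, t in zip(emb, transformed)) for emb in embedding_matrix]
    # Step 3: softmax turns the logits into a probability per word.
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


vocab = ["home", "school", "work"]                        # toy vocabulary
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per word
dense_w = [[1.0, 0.0], [0.0, 1.0]]                        # identity, for illustration
hidden = [2.0, 0.5]                                       # encoder output at the [MASK] position

word, probs = mlm_predict(hidden, dense_w, embedding_matrix, vocab)
print(word)  # home
```

Reusing the input embedding matrix in step 2 means a word scores highly when the transformed output vector points in the same direction as that word's embedding.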

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ViwaI3Vvbnd-CJSQ.png" width="450">

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness.
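A minimal sketch of the masking step and of a loss that counts only the selected positions is shown below. It assumes the 80/10/10 replacement split described in the original BERT paper; the function names, the toy vocabulary, and the dictionary-based probability format are hypothetical choices made for this example.

```python
import math
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Select tokens for prediction and corrupt them (80% [MASK], 10% random, 10% kept)."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means the position is ignored by the loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)
        # otherwise the token is left unchanged
    return masked, labels


def mlm_loss(predicted_probs, labels):
    """Mean negative log-likelihood over the selected positions only."""
    losses = [
        -math.log(probs[label])
        for probs, label in zip(predicted_probs, labels)
        if label is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
masked, labels = mask_tokens(tokens, toy_vocab, random.Random(0))
print(masked)
print(labels)

# Only the second position contributes to the loss in this hand-made example.
example_probs = [{"the": 0.1}, {"cat": 0.5}, {"sat": 0.9}]
example_labels = [None, "cat", None]
print(mlm_loss(example_probs, example_labels))  # -log(0.5)
```

Positions whose label is `None` never enter the average, which is why only a fraction of each sequence produces a training signal.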

**Next Sentence Prediction (NSP):**

In the BERT training process, the model receives pairs of sequences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
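The 50/50 sampling can be sketched as below. This is an illustrative simplification: the corpus is a toy list of documents, negatives are drawn from a different document so they cannot accidentally be the true next sentence, and the function name `make_nsp_pairs` is hypothetical.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) examples from a list of documents."""
    pairs = []
    for doc in documents:
        # Sentences from every other document serve as random negatives.
        others = [s for d in documents if d is not doc for s in d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], True))          # true next sentence
            else:
                pairs.append((doc[i], rng.choice(others), False))  # random sentence
    return pairs


# Toy corpus, invented for this example.
docs = [
    ["The cat sat on the mat.", "It started to purr.", "Then it fell asleep."],
    ["Stocks rose on Monday.", "Analysts were surprised."],
]

for a, b, is_next in make_nsp_pairs(docs, random.Random(0)):
    print(is_next, "|", a, "|", b)
```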

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.

2. A sentence (or segment) embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.

3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.

**Example input:**

- Sentence A: "The cat sat on the mat."
- Sentence B: "It started to purr."

The input becomes:
- [CLS] The cat sat on the mat [SEP] It started to purr [SEP]
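The packing of a sentence pair, together with the segment and position indices that select the embeddings, can be sketched as follows. Whitespace-split lowercase words stand in for WordPiece tokens here, and the function name `pack_pair` is hypothetical.

```python
def pack_pair(tokens_a, tokens_b):
    """Join two token lists into one BERT-style input with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(
    ["the", "cat", "sat", "on", "the", "mat"],
    ["it", "started", "to", "purr"],
)

for pos, seg, tok in zip(position_ids, segment_ids, tokens):
    print(pos, seg, tok)
```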

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/0*m_kXt3uqZH9e7H4w.png" width="450">

To predict if the second sentence is indeed connected to the first, the following steps are performed:

1. The entire input sequence goes through the Transformer encoder model.

    - Key outputs:

        - [CLS] Token Representation:
            
            - The output embedding for the [CLS] token is particularly important in NSP. It acts as a summary of the entire input sequence (i.e., the two sentences combined).
            - The contextualized representation of [CLS] is used for the NSP task, and it captures information from both Sentence A and Sentence B.
            
        - Contextualized Token Embeddings:
            - Each token in the sequence (e.g., "The", "cat", "sat", etc.) has its own output embedding, which is a vector that captures the token's meaning in the context of the sentence pair. These embeddings are not directly used for NSP but are available for other tasks.

2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).

3. Calculating the probability of IsNextSequence with softmax.
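Steps 2 and 3 amount to a small linear layer followed by softmax. The sketch below uses a hand-picked three-dimensional [CLS] vector and toy weights; the class order (index 0 for "is next") and the function name `nsp_head` are assumptions of this example.

```python
import math


def nsp_head(cls_vector, weights, bias):
    """Map the [CLS] output to two probabilities: [is next, not next]."""
    # Step 2: a linear layer producing two logits.
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Step 3: softmax over the two logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


cls_vector = [0.5, -1.0, 2.0]                 # toy [CLS] output
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # toy weights: 2 classes x 3 features
bias = [0.0, 0.0]

probs = nsp_head(cls_vector, weights, bias)
print(probs)  # first entry is the larger one for these toy values
```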

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.

- **Summary of Outputs in NSP:**
    1. Contextualized Token Embeddings for each token in the input sequence, capturing token meaning in context.

    2. [CLS] Token Embedding: A summary of the entire input (Sentence A + Sentence B) used for the NSP task.

    3. NSP Classification Output: The output of a classification layer that predicts whether Sentence B logically follows Sentence A (binary classification).
    
    Thus, the NSP task leverages the output of the [CLS] token from the Transformer encoder, using it to determine sentence-level relationships.


## Fine-tuning BERT

Once BERT is pre-trained, it can be fine-tuned for specific downstream tasks like text classification, named entity recognition, or question answering. This is done by adding task-specific layers on top of BERT's pre-trained layers and training the model on task-specific labeled data.
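As a rough sketch of what a task-specific layer looks like, the example below applies a single linear layer either to the first ([CLS]) output vector for sentence classification or to every token vector for tagging. The encoder outputs, weights, and label sets are toy values invented for illustration, the function names are hypothetical, and the training loop that updates the weights is not shown.

```python
def linear(vec, weights, bias):
    """Apply one linear layer: one score per output label."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def argmax_label(scores, labels):
    return labels[max(range(len(scores)), key=scores.__getitem__)]


def classify_sentence(encoder_outputs, weights, bias, labels):
    """Sentence-level task: use only the first ([CLS]) output vector."""
    return argmax_label(linear(encoder_outputs[0], weights, bias), labels)


def tag_tokens(encoder_outputs, weights, bias, labels):
    """Token-level task: apply the same layer to every output vector."""
    return [argmax_label(linear(vec, weights, bias), labels) for vec in encoder_outputs]


# Toy encoder outputs: three positions with two features each.
encoder_outputs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

print(classify_sentence(encoder_outputs, weights, bias, ["negative", "positive"]))
print(tag_tokens(encoder_outputs, weights, bias, ["O", "PERSON"]))
```

The pre-trained encoder is shared across tasks; only the small head on top changes from one task to the next.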

1. GLUE Benchmark Tasks (General Language Understanding Evaluation)
    
    GLUE is a widely used benchmark for evaluating models on multiple NLP tasks. BERT was fine-tuned on the following tasks from GLUE.

    1. CoLA (Corpus of Linguistic Acceptability)
        
        - Task: Sentence Classification, determining whether a sentence is grammatically acceptable or not.

        - Metric: Matthews Correlation Coefficient (MCC)

        - BERT Result: $60.5$ (for BERT-Large), which outperforms the previous state-of-the-art model by a significant margin.
    
    2. SST-2 (Stanford Sentiment Treebank)

        - Task: Binary sentiment classification (positive or negative sentiment) on movie reviews.

        - Metric: Accuracy

        - BERT Result: $94.9\%$ (BERT-Large), achieving near-human performance.

    3. MRPC (Microsoft Research Paraphrase Corpus)

        - Task: Classify whether two sentences are paraphrases of each other.

        - Metric: F1-score/Accuracy

        - BERT Result: $89.3$ (F1-score), $86.7\%$ (Accuracy) on BERT-Large.

    4. STS-B (Semantic Textual Similarity Benchmark)

        - Task: Predict the similarity between two sentences (on a scale of 1 to 5).

        - Metric: Pearson/Spearman Correlation

        - BERT Result: $90.0$ (BERT-Large), indicating strong sentence-pair similarity prediction.

    5. QQP (Quora Question Pairs)

        - Task: Determine if two questions from Quora are paraphrases.

        - Metric: F1-score/Accuracy

        - BERT Result: $88.5$ (F1-score), $91.3\%$ (Accuracy) on BERT-Large.

    6. MNLI (Multi-Genre Natural Language Inference)

        - Task: Sentence-pair classification task, where the goal is to predict whether the second sentence is an entailment, contradiction, or neutral to the first sentence.
        
        - Metric: Accuracy
        
        - BERT Result: $86.6\%$ (matched), $85.9\%$ (mismatched) with BERT-Large.

    7. QNLI (Question Natural Language Inference)

        - Task: Sentence-pair classification derived from the Stanford Question Answering Dataset (SQuAD), where the goal is to determine whether a given sentence contains the answer to a question.

        - Metric: Accuracy

        - BERT Result: $92.7\%$ (BERT-Large).

    8. RTE (Recognizing Textual Entailment)

        - Task: Predict whether one sentence entails another.

        - Metric: Accuracy

        - BERT Result: $70.1\%$ (BERT-Large).

    9. WNLI (Winograd NLI)

        - Task: Resolve pronoun reference ambiguities to determine sentence entailment.

        - Metric: Accuracy

        - BERT Result: $65.1\%$ (BERT-Large). The WNLI task is quite difficult and models often struggle with it.

2. SQuAD (Stanford Question Answering Dataset)

    SQuAD is a popular dataset for question-answering tasks, where the model must predict the span of text that answers a given question from a paragraph.

    1. SQuAD V1.1
        
        - Task: Given a passage, the model must predict the exact span of words that answers the question.

        - Metrics: Exact Match (EM) and F1-score

        - BERT Result: $84.1$ (EM), $90.9$ (F1-score) for BERT-Large.

    2. SQuAD V2.0

        - Task: In addition to answering questions, the model must also handle questions that have no answer in the passage.

        - Metrics: Exact Match (EM) and F1-score

        - BERT Result: $78.7$ (EM), $81.9$ (F1-score) for BERT-Large.

3. NER (Named Entity Recognition) on CoNLL-2003

    The task is to identify named entities like people, organizations, locations, etc., in text.

    - Task: Token-level classification where each token is labeled as belonging to an entity (e.g., PERSON, LOCATION).

    - Metric: F1-score

    - BERT Result: $92.8\%$ F1-score (BERT-Large), outperforming previous state-of-the-art models.
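Several of the metrics named above are easy to compute by hand. The sketch below implements the Matthews correlation coefficient and F1-score for binary labels, plus simplified versions of SQuAD-style Exact Match and token-overlap F1. The span metrics here only lowercase and split on whitespace; official evaluation scripts normalize answers more thoroughly, so treat these as approximations. All function names and inputs are hypothetical.

```python
import math
from collections import Counter


def confusion(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_binary(y_true, y_pred):
    tp, _, fp, fn = confusion(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def exact_match(prediction, gold):
    return float(prediction.strip().lower() == gold.strip().lower())


def span_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


y_true, y_pred = [1, 1, 0, 0], [1, 0, 0, 0]
print(matthews_corrcoef(y_true, y_pred))  # 2 / sqrt(12)
print(f1_binary(y_true, y_pred))          # 2 / 3
print(exact_match("the mat", "The mat"))  # 1.0
print(span_f1("on the mat", "the mat"))   # 0.8 up to floating-point rounding
```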

## Applications of BERT

BERT is used in a variety of NLP tasks:

1. Text Classification: Predicting a class label (e.g., spam detection).

2. Named Entity Recognition (NER): Identifying entities in a text like person names or locations.

3. Question Answering: BERT has been used for tasks like SQuAD (Stanford Question Answering Dataset).

4. Natural Language Inference: Determining if one sentence logically follows from another.

## BERT Variants

- BERT-Base: 12 layers (transformer blocks), 12 attention heads, 110M parameters.

- BERT-Large: 24 layers, 16 attention heads, 340M parameters.

- DistilBERT: A lighter version of BERT, optimized for speed and efficiency.
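The parameter counts for BERT-Base and BERT-Large can be approximately reproduced with some arithmetic. The sketch below relies on values that are not stated in this document and are taken from the originally released models as assumptions: hidden sizes of 768 and 1024, a feed-forward width of four times the hidden size, a vocabulary of 30,522 WordPiece tokens, 512 positions, and a pooling layer on top of [CLS]. Pre-training heads are not counted, and the function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, segments=2):
    """Estimate encoder parameters; default sizes are assumptions, see the text above."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden  # gain and bias
    # Token, position and segment tables, plus the embedding layer norm.
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] output
    return embeddings + layers * per_layer + pooler


print(bert_param_count(12, 768))   # 109482240, roughly 110M
print(bert_param_count(24, 1024))  # 335141888, roughly 340M
```

The number of attention heads does not appear in the formula: the heads split the hidden dimension among themselves, so changing their count leaves the total unchanged.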

## Ref

- https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

- https://arxiv.org/pdf/1810.04805

- https://paperswithcode.com/method/batch-normalization

- https://paperswithcode.com/method/layer-normalization

- https://www.analyticsvidhya.com/blog/2021/05/all-you-need-to-know-about-bert/