
# BERT Overview

## What is BERT?

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language model that Google introduced in 2018 for natural language processing (NLP). BERT is designed to understand the context of each word in a sentence by considering the words that come both before and after it, hence the term "bidirectional."

## What tasks does BERT solve?

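BERT can be applied to a variety of NLP tasks, although it is strongest on tasks that require understanding text rather than producing it:

- **Sentiment Analysis**: Determining the sentiment behind a piece of text, such as whether a review is positive or negative.
- **Question Answering**: Providing precise answers to questions based on a given context or passage, typically by selecting the answer span from that passage.
- **Text Prediction**: Predicting a missing (masked) word from the words on both sides of it. BERT is not trained to predict the next word in a left-to-right sequence.
- **Text Generation**: Creating coherent and contextually relevant text based on a prompt. This is not a natural fit for an encoder-only model, so BERT is usually paired with a decoder for it, and decoder-based models are generally preferred.
- **Summarization**: Condensing lengthy documents into concise summaries while preserving key information. BERT is mainly used for extractive summarization, which selects the most important sentences; writing new summary text also requires a decoder.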

BERT proved effective enough that Google announced in 2019 that it uses BERT in Google Search to better understand queries and improve the relevance of search results.

## How do encoder-only models work?

Encoder-only models, like BERT, process text by encoding the entire input sequence at once. They use a stack of Transformer encoder layers to capture contextual information from all parts of the text. This differs from recurrent models such as RNNs and LSTMs, which process text one token at a time. Through self-attention, each token's representation is computed as a weighted combination of every token in the sequence, so the model can weigh the importance of different words in relation to each other.
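
The attention step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual implementation: the function names and the toy 2-dimensional embeddings are made up, and the same vectors are reused as queries, keys, and values, whereas a real Transformer layer applies learned projections and several attention heads.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (weights, outputs) for one unmasked attention pass.

    Simplification: the same vectors act as queries, keys, and values.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Score the query against every position, including later ones.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        )
    return weights, outputs

# Toy embeddings for three tokens (made-up numbers, for illustration only).
tokens = ["the", "river", "bank"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence. Because nothing is masked, every row covers the whole sequence.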

## What is a Bidirectional Context?

A bidirectional context refers to the ability of BERT to consider both the preceding and following words around a given word in a sentence. This contrasts with unidirectional models, which can only use context from one direction. By understanding the full context, BERT can provide more accurate and nuanced interpretations of words and phrases.
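
One way to picture the difference is as a visibility mask over token positions. The sketch below is illustrative only: `visibility_mask` is a hypothetical helper, and the example sentence is made up. It compares what the word "bank" can see under a bidirectional mask and under a left-to-right (causal) mask.

```python
def visibility_mask(length, bidirectional):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [bidirectional or j <= i for j in range(length)]
        for i in range(length)
    ]

tokens = ["the", "river", "bank", "flooded"]
full = visibility_mask(len(tokens), bidirectional=True)
causal = visibility_mask(len(tokens), bidirectional=False)

target = tokens.index("bank")
visible_full = [t for t, ok in zip(tokens, full[target]) if ok]
visible_causal = [t for t, ok in zip(tokens, causal[target]) if ok]

print(visible_full)    # every token, including "flooded" after "bank"
print(visible_causal)  # only the tokens up to and including "bank"
```

With the bidirectional mask, "flooded" is available when interpreting "bank", which helps separate a river bank from a financial one.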

## What are the Drawbacks of Bidirectional Context?

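While bidirectional context offers richer understanding, it also presents some challenges. The main drawback is that a bidirectional model cannot be trained with the usual next-word objective, because each word could indirectly "see" itself through the context on its other side. BERT works around this with masked language modeling, but that objective predicts only a small fraction of the tokens in each pass (15% in the original paper), so the model gets less training signal per example than a left-to-right model. For the same reason, BERT is poorly suited to generating text word by word. Additionally, like any large model, BERT can overfit when fine-tuned on a small dataset, performing well on the training data but less so on unseen data.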

## Why do we want pretrained models?

Pretrained models like BERT are valuable because they have already been trained on vast amounts of text and have learned general language patterns and nuances. This pretraining gives them a strong general-purpose representation of language before they see any task-specific data. When adapted to a specific task through fine-tuning, they can perform well with relatively little additional data and training time.

## What are the pre-training tasks designed for BERT?

BERT’s pre-training involves two main tasks:

1. **Masked Language Model (MLM)**: Randomly masks out a fraction of the tokens in the input (15% in the original paper) and trains the model to predict the masked tokens based on the context provided by the other tokens.
2. **Next Sentence Prediction (NSP)**: Trains the model to predict whether one sentence follows another in a given text, which helps the model understand relationships between sentences.
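
The two tasks above can be sketched as data-preparation functions. This is a simplified illustration with hypothetical helper names and a toy vocabulary, not BERT's actual preprocessing code. The 15% masking rate and the 80/10/10 split ([MASK], random token, unchanged) follow the original BERT paper. The NSP helper draws its negative example from the same sentence list for brevity, whereas real pretraining samples it from elsewhere in the corpus.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: mask placeholder
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(token)              # 10%: left unchanged
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels

def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for next sentence prediction.

    Assumes at least three sentences and index + 1 < len(sentences).
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    # Negative example: any sentence other than the pair itself.
    others = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(others), False

toy_vocab = ["cat", "dog", "sat", "mat", "ran"]
sentence = "the cat sat on the mat".split()
inputs, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.15, seed=0)
print(inputs)
print(labels)

sentences = ["I went out.", "It was raining.", "Cats like fish."]
print(make_nsp_pair(sentences, 0, random.Random(0)))
```

Only positions with a non-`None` label contribute to the MLM loss, which is why each training example supervises just a small share of its tokens.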

## Special BERT Tokens

BERT uses several special tokens to manage different aspects of text processing:

- **[CLS] Token**: Added at the beginning of the input sequence, this token is used to aggregate information from the entire sequence. Its final hidden representation is often fed to a classifier for classification tasks.
- **[SEP] Token**: Used to separate different segments within the input sequence. For tasks like question answering or sentence pair classification, [SEP] tokens mark the boundaries between sentences or segments.
- **[MASK] Token**: Used during the pre-training phase for the Masked Language Model task. [MASK] tokens are placeholders for randomly masked words, which the model is trained to predict.
- **[UNK] Token**: Used to replace tokens that are not in the model's vocabulary. Because BERT's WordPiece tokenizer splits unfamiliar words into known subword pieces, [UNK] appears relatively rarely in practice.
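
As a rough sketch of how these tokens fit together, the snippet below assembles a sentence-pair input. The helper names and the tiny vocabulary are hypothetical. The segment IDs (0 for the first segment, 1 for the second) are part of BERT's input format but are not covered above, and the whole-word lookup is a simplification of real WordPiece tokenization.

```python
def build_pair_input(tokens_a, tokens_b):
    """Wrap two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def to_ids(tokens, vocab):
    """Map tokens to IDs, falling back to [UNK] for out-of-vocabulary tokens."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]

# Toy vocabulary for illustration; "here" is deliberately missing.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "where": 3, "is": 4, "bert": 5}

tokens, segment_ids = build_pair_input(
    ["where", "is", "bert"], ["bert", "is", "here"]
)
ids = to_ids(tokens, toy_vocab)
print(tokens)
print(segment_ids)
print(ids)  # "here" maps to 0, the [UNK] id
```
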

## What does it mean to freeze the pretrained layers and train only the final FFN?

When adapting BERT to a specific task, a common lightweight approach is to freeze the pretrained layers, meaning their weights are not updated during training, and to train only the final Feed-Forward Network (FFN) layer added for that task. This approach reuses the pretrained knowledge of BERT as a fixed feature extractor, which makes adaptation cheaper and less prone to overfitting on small datasets. The alternative, used in the original BERT paper, is to fine-tune all of the weights, which costs more but often gives higher accuracy.
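
In PyTorch, freezing comes down to turning off gradients for the pretrained parameters and giving the optimizer only the new layer. The sketch below is an assumption-laden illustration: a small randomly initialized `nn.TransformerEncoder` stands in for pretrained BERT, the inputs are random tensors rather than real embeddings, and the layer sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained encoder (not real BERT weights).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=32, nhead=4, dim_feedforward=64, batch_first=True
    ),
    num_layers=2,
)
# New task-specific head: a single linear layer over the first token.
classifier = nn.Linear(32, 2)

# Freeze the encoder: its weights will receive no gradient updates.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()  # also disable dropout inside the frozen layers

# Only the classifier's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

# Random stand-ins for a batch of 8 embedded sequences of length 10.
inputs = torch.randn(8, 10, 32)
labels = torch.randint(0, 2, (8,))

hidden = encoder(inputs)           # shape: (8, 10, 32)
logits = classifier(hidden[:, 0])  # classify from the first ([CLS]-like) position
loss = nn.functional.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()   # gradients flow only into the classifier
optimizer.step()

trainable = sum(p.numel() for p in classifier.parameters())
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

With a real checkpoint the same pattern applies: load the pretrained model, set `requires_grad = False` on its parameters, and optimize only the added head.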

## References

[https://huggingface.co/blog/bert-101](https://huggingface.co/blog/bert-101)
