# What is BERT?

## Introduction

Taken directly from the abstract of the paper: 

> We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications <cite data-cite="8277123/AADH59T3"></cite>.

Let's break it down a bit. Firstly, what is a *Transformer*?

## Transformers

### Self-Attention

For more details, see ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762.pdf). Roughly speaking, a Transformer has a sequence-to-sequence encoder-decoder architecture. We'll focus on the *encoder*, since that is the part BERT uses.

<img src="images/bert.png">

The image above shows the key component of a Transformer - *self-attention*.

> “Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” <cite data-cite="8277123/8IC9ACGZ"></cite>

Each of the encoder's input vectors is projected into three vectors of interest: a query, a key and a value. Stacking these row-wise over all words gives the matrices $Q$, $K$ and $V$. The attention is computed as

\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}}\right) V
\end{equation}

Let $q_i$ denote the query vector of word $i$ and $k_j$ the key vector of word $j$. For a three-word sequence, the matrix product $QK^T$ contains the dot product of every query with every key:

\begin{equation}
QK^T = \begin{bmatrix}
q_1 \cdot k_1 & q_1 \cdot k_2 & q_1 \cdot k_3 \\
q_2 \cdot k_1 & q_2 \cdot k_2 & q_2 \cdot k_3 \\
q_3 \cdot k_1 & q_3 \cdot k_2 & q_3 \cdot k_3
\end{bmatrix}
\end{equation}

This is the *'score'*. Each row holds the scores of one query against all keys. The scores are then scaled by a factor $1 / \sqrt{d_k}$, where $d_k$ is the dimensionality of the query/key vectors. The scaling keeps the arguments of the softmax function from becoming excessively large when the keys have many dimensions. Subsequently, each row is normalized with the softmax function. The $\text{softmax}$ term can therefore be viewed as the *weights* assigned to the value vectors $(v_1, v_2, v_3)$.

To summarise:

> "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key." <cite data-cite="8277123/8IC9ACGZ"></cite>
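The computation described in this section can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation from the paper: the function names and the toy $Q$, $K$, $V$ values are made up for the example, and in a real Transformer the three matrices would come from learned projections of the input.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of d_k-dimensional vectors; V: list of d_v-dimensional vectors.

    Returns (outputs, weights), with one row per query.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Score of this query against every key, scaled by 1 / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per value; the weights sum to 1
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy input (hypothetical): three "words", each with a 2-dimensional
# query, key and value vector.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
for w in weights:
    print([round(x, 3) for x in w])
```

Each printed row is the set of weights one query assigns to the three values. A query that scores all keys equally would simply return the average of the value vectors.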

### Multi-Head Attention

Instead of using a single attention function, *Multi-Head Attention* runs several in parallel. It linearly projects the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of the projected versions of queries, keys and values, the attention function is performed in parallel, yielding output values that are concatenated and once again projected. This results in the final values, as depicted in the figure above.
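To make the project, attend, concatenate, project sequence concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: the helper names (`matmul`, `random_matrix`, `multi_head_attention`), the toy sizes ($d_{model} = 8$, $h = 2$), and above all the random matrices, which merely stand in for projections that a real model would learn.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; the rows of Q are the queries."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    # Stand-in for a learned projection matrix (assumption for this sketch).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Project the same input into this head's query, key and value spaces.
        head_outputs.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads along the feature dimension, then project once more.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)


rng = random.Random(0)
n_words, d_model, h = 3, 8, 2
d_k = d_v = d_model // h  # the choice made in the paper: d_k = d_v = d_model / h

X = random_matrix(n_words, d_model, rng)
heads = [
    (
        random_matrix(d_model, d_k, rng),
        random_matrix(d_model, d_k, rng),
        random_matrix(d_model, d_v, rng),
    )
    for _ in range(h)
]
W_o = random_matrix(h * d_v, d_model, rng)

output = multi_head_attention(X, heads, W_o)
print(len(output), len(output[0]))  # one d_model-dimensional vector per word
```

Because each head works in a smaller space ($d_k = d_{model} / h$), the total cost stays close to that of a single full-width attention function.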

### Positional Encoding

In the proposed architecture thus far, there is no notion of word order. We need a position-dependent signal for each word embedding to incorporate this order information. Recurrent Neural Networks (RNNs) get it for free by processing a sentence sequentially, word by word.

In Transformers, "positional encodings" are added to the input embeddings, and they share the same dimension as the embeddings. In the work by Vaswani et al., sine and cosine functions of different frequencies are used:

\begin{align}
    PE_{(pos,2i)} &= \sin \left(pos/10000^{\frac{2i}{d_{model}}}\right) \\
    PE_{(pos,2i+1)} &= \cos \left(pos/10000^{\frac{2i}{d_{model}}}\right)
\end{align}

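where $pos$ is the position of the word in the sequence, $i$ indexes the sine/cosine dimension pairs, and $d_{model}$ is the dimension of the embeddings.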
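The two formulas translate almost directly into code. The sketch below is a minimal illustration that assumes an even $d_{model}$; the function name `positional_encoding` and the tiny table size are chosen just for the example.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings.

    Assumes d_model is even.
    """
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # even dimensions use sine
            row[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(x, 3) for x in row])
```

Row `pos` of the table is added element-wise to the embedding of the word at that position, which is why both must have the same dimension.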

## What makes BERT special?

BERT was pre-trained on two tasks: **Masked LM (MLM)** and **Next Sentence Prediction (NSP)**. The model was trained on both tasks together, minimizing their combined loss.

### Masked LM

In pre-training BERT, 15% of the tokens in each sequence were selected for prediction. Most of them (80%) were replaced by a `[MASK]` token; the remainder were either swapped for a random token or left unchanged. The model was then trained to predict the original value of the selected tokens. This was achieved by adding a classification layer on top of the encoder output: the output vectors are multiplied by the embedding matrix, transforming them into the vocabulary dimension.

The probability of each word in the vocabulary is computed using the softmax function. BERT uses a loss function that only takes into consideration the *prediction of the masked values*, and ignores the predictions of the *non-masked words*.
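A simplified sketch of the masking step and the masked-only loss is shown below. It is an illustration under several assumptions: the sentence is tokenized by whitespace rather than with BERT's WordPiece vocabulary, every selected token is replaced by `[MASK]` (the original recipe also leaves some selected tokens unchanged or swaps in a random one), and the "model predictions" are a hypothetical uniform distribution rather than real network output. The function names are made up for the example.

```python
import math
import random


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace roughly mask_prob of the tokens with [MASK].

    Returns the masked sequence and the sorted positions that were masked.
    """
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, positions


def masked_lm_loss(predicted_probs, tokens, positions):
    """Mean cross-entropy over the masked positions only.

    predicted_probs[p] maps each vocabulary word to its softmax probability
    at position p; predictions at unmasked positions are ignored entirely.
    """
    losses = [-math.log(predicted_probs[p][tokens[p]]) for p in positions]
    return sum(losses) / len(losses)


tokens = "the company reported a sharp rise in quarterly profit after cutting costs".split()
masked, positions = mask_tokens(tokens, rng=random.Random(0))
print(masked)

# Hypothetical predictions: a uniform distribution over the vocabulary
# at every position, standing in for the output of a real model.
vocab = sorted(set(tokens))
predicted_probs = [{word: 1.0 / len(vocab) for word in vocab} for _ in tokens]
loss = masked_lm_loss(predicted_probs, tokens, positions)
print(round(loss, 3))
```

With uniform predictions the loss equals the logarithm of the vocabulary size, which is a handy baseline: a trained model should do better than that on the masked positions.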

### Next Sentence Prediction

The model received pairs of sentences as input and was trained to predict whether the second sentence in the pair follows the first in the original document. During training, 50% of the inputs were actual consecutive pairs, while in the other 50% the second sentence was a random (disconnected) sentence from the corpus. To help the model distinguish between the two sentences, the inputs were processed as follows:

1. Two special tokens are inserted: `[CLS]` at the beginning of the first sentence, and `[SEP]` at the end of each sentence.

2. A sentence embedding indicating Sentence A or Sentence B is added to each token.

3. A positional embedding is added to each token to indicate its position in the sequence.

The entire input sequence first goes through the Transformer model, as described in the previous section. The output of the `[CLS]` token is then transformed into a $2 \times 1$ vector by a simple classification layer, and softmax turns it into the probability that the second sentence follows the first.
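The three processing steps above can be sketched as follows. This is a toy illustration: sentences are passed in as pre-split word lists instead of WordPiece tokens, the function name `build_nsp_input` is invented for the example, and the two logits at the end are hypothetical numbers standing in for the output of the classification layer.

```python
import math


def build_nsp_input(sentence_a, sentence_b):
    """Assemble tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


tokens, segment_ids, position_ids = build_nsp_input(
    ["profits", "rose"], ["shares", "jumped"]
)
print(tokens)
print(segment_ids)
print(position_ids)

# Hypothetical logits from the classification layer on the [CLS] output.
is_next_prob, not_next_prob = softmax([2.0, 0.5])
print(round(is_next_prob, 3), round(not_next_prob, 3))
```

In the model itself, the segment ids and position ids are not fed in directly; each is looked up in its own embedding table and the results are added to the token embeddings.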

## Why am I using BERT?

I wish to create a model that can accurately classify the sentiment of domain-specific texts. A 'Financial Phrasebank' exists on [Kaggle](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news). However, there are only 4837 unique examples.

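BERT is well suited to classification tasks: NSP pre-training already trains the `[CLS]` output to summarize the whole input for a classification layer, and the same output can feed a sentiment classifier. With so few labelled examples, we can take advantage of a pre-trained BERT model that has been trained on a vast corpus: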

> The original English-language BERT model comes with two pre-trained general types: (1) the BERT<sub>BASE</sub> model, a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, and (2) the BERT<sub>LARGE</sub> model, a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture; both of which were trained on the BooksCorpus with 800M words, and a version of the English Wikipedia with 2,500M words. <cite data-cite="8277123/AADH59T3"></cite>

In the next notebook, I use PyTorch to fine-tune a BERT model for financial news headlines.


## References

<div class="cite2c-biblio"></div>