Bidirectional Encoder Representations from Transformers (BERT): here are the things I need to cover to build one
- Implement the multi-head self-attention mechanism
- The input is projected into queries, keys, and values with linear transformations
- Split into heads and compute scaled dot-product attention for each head
- Concatenate the heads and project back
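
A minimal PyTorch sketch of this multi-head self-attention, not the exact implementation; the default sizes (768 hidden, 12 heads) and the mask handling are my assumptions, and `attention_probs` is cached so the visualization step later has something to read:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project to Q/K/V, split heads, scaled dot-product attention, merge heads."""

    def __init__(self, hidden_size=768, num_heads=12, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Linear transformations for queries, keys, and values
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.attention_probs = None  # cached here so it can be visualized later

    def forward(self, x, mask=None):
        B, T, H = x.shape

        def split(t):  # (B, T, H) -> (B, num_heads, T, head_dim)
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            if mask.dim() == 2:                # (B, T) padding mask
                mask = mask[:, None, None, :]
            scores = scores.masked_fill(mask == 0, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        self.attention_probs = probs.detach()
        context = self.dropout(probs) @ v      # (B, num_heads, T, head_dim)
        # Concatenate the heads and project back
        context = context.transpose(1, 2).contiguous().view(B, T, H)
        return self.out_proj(context)
```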
- Implement the position-wise feed-forward network
- Two linear transformations with a GELU activation in between
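
A possible sketch of that feed-forward block; the `intermediate_size=3072` default mirrors bert-base but is an assumption here:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> GELU -> Linear, applied independently at each position."""

    def __init__(self, hidden_size=768, intermediate_size=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```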
Math behind the `TransformerEncoderLayer`:
- Attention sub-layer: $x' = \mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Attention}(x)))$
- Implement layer normalization and residual connections
- Stack of `num_hidden_layers` encoder layers
- Combining these into a single encoder layer
- Stacking multiple encoder layers to form the BERT encoder
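
A sketch of how these pieces could combine, reusing the `MultiHeadAttention` and `FeedForward` sketches above; the token + position embeddings and all default sizes are assumptions, and the FFN sub-layer mirrors the attention one, $x'' = \mathrm{LayerNorm}(x' + \mathrm{Dropout}(\mathrm{FFN}(x')))$:

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder layer:
       x'  = LayerNorm(x  + Dropout(Attention(x)))
       x'' = LayerNorm(x' + Dropout(FFN(x')))"""

    def __init__(self, hidden_size=768, num_heads=12, intermediate_size=3072, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(hidden_size, num_heads, dropout)
        self.ffn = FeedForward(hidden_size, intermediate_size, dropout)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.dropout(self.attention(x, mask)))  # attention sub-layer
        x = self.norm2(x + self.dropout(self.ffn(x)))              # feed-forward sub-layer
        return x

class BERTEncoder(nn.Module):
    """Embeddings followed by a stack of num_hidden_layers encoder layers."""

    def __init__(self, vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                 num_heads=12, intermediate_size=3072, max_len=512, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb = nn.Embedding(max_len, hidden_size)
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(hidden_size, num_heads, intermediate_size, dropout)
            for _ in range(num_hidden_layers)
        ])

    def forward(self, input_ids, mask=None):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)  # (B, T, hidden_size)
        for layer in self.layers:
            x = layer(x, mask)
        return x
```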
Extract the `attention_probs` from the `MultiHeadAttention` module and create a heatmap to show how tokens attend to each other
My Approach
- Take the first sequence in the batch and then plot the heatmap
How does the visualization part work?
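
Roughly like this, assuming each layer's attention module has cached `attention_probs` (shape `(batch, num_heads, seq_len, seq_len)`) as in the sketch above; the `plot_attention` helper and its arguments are hypothetical names:

```python
import matplotlib.pyplot as plt

def plot_attention(encoder, tokens, layer=0, head=0):
    """Heatmap of attention for the first sequence in the batch, one layer and one head.

    Assumes a forward pass has already run, so attention_probs is populated.
    """
    probs = encoder.layers[layer].attention.attention_probs
    attn = probs[0, head].cpu().numpy()  # first sequence in the batch
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(attn, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel("Key (attended-to) token")
    ax.set_ylabel("Query (attending) token")
    fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.show()
```

Rows are the attending (query) tokens and columns the attended-to (key) tokens, so each row shows where one token distributes its attention weight.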
Approach:
- Classification head (`BERTClassifier`): the `BERTClassifier` class wraps the `BERTEncoder` and adds a linear layer to predict class probabilities
How are we doing it?
- Extract the [CLS] token's embedding (`output[:, 0, :]`) from the encoder's output, as BERT uses this for classification tasks.
- Apply dropout for regularization.
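
A sketch of that classification head on top of the `BERTEncoder` sketch above; the linear layer produces logits, so class probabilities come from a softmax applied afterwards (or inside the loss):

```python
import torch.nn as nn

class BERTClassifier(nn.Module):
    """Wraps the BERTEncoder and adds a linear head on the [CLS] embedding."""

    def __init__(self, encoder, hidden_size=768, num_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, mask=None):
        output = self.encoder(input_ids, mask)      # (B, T, hidden_size)
        cls = output[:, 0, :]                       # [CLS] token embedding
        return self.classifier(self.dropout(cls))   # (B, num_classes) logits
```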
- Tokenizer integration: using `transformers.BertTokenizer` from Hugging Face, pre-trained on `bert-base-uncased`.
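
A hedged usage sketch tying the pre-trained tokenizer to the from-scratch classes above; the example sentence, sequence length, and the small model sizes are arbitrary choices:

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize one example sentence; padding/length settings are arbitrary here
text = "building bert from scratch"
enc = tokenizer(text, return_tensors="pt", padding="max_length",
                truncation=True, max_length=32)

# Small model sizes just to keep the example light
encoder = BERTEncoder(vocab_size=tokenizer.vocab_size, hidden_size=128,
                      num_hidden_layers=2, num_heads=4, intermediate_size=512)
model = BERTClassifier(encoder, hidden_size=128, num_classes=2)

logits = model(enc["input_ids"], enc["attention_mask"])  # mask keeps padding out of attention
probs = torch.softmax(logits, dim=-1)                     # class probabilities
```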
- Convert the whole thing into a hybrid model
Maths here
BERT encoder: input token IDs
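
A summary of the maths built up so far, under the assumption that the embedding step is a token + position lookup (segment embeddings omitted), with $L$ = `num_hidden_layers` and position 0 being [CLS]:

$$
\begin{aligned}
h^{(0)}_i &= E_{\text{tok}}(t_i) + E_{\text{pos}}(i) \\
\tilde{h}^{(l)} &= \mathrm{LayerNorm}\big(h^{(l-1)} + \mathrm{Dropout}(\mathrm{Attention}(h^{(l-1)}))\big) \\
h^{(l)} &= \mathrm{LayerNorm}\big(\tilde{h}^{(l)} + \mathrm{Dropout}(\mathrm{FFN}(\tilde{h}^{(l)}))\big), \quad l = 1, \dots, L \\
p &= \mathrm{softmax}\big(W \, \mathrm{Dropout}(h^{(L)}_{0}) + b\big)
\end{aligned}
$$

where $t_i$ is the $i$-th input token ID and $h^{(L)}_0$ is the final [CLS] embedding fed to the classification head.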