# **Universal sentence embeddings background [2 points]**
In this takehome, we will be exploring different ways of learning sentence embeddings. Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers. For an overview of sentence embeddings and some common methods, we refer to these articles: [link1](https://txt.cohere.com/sentence-word-embeddings/), [link2](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)  

Q1: What are some real world applications of dense sentence embeddings?

Answer: Dense sentence embeddings find applications in Natural Language Processing (NLP) and Information Retrieval (IR) tasks. Some real-world applications include:
1. **Text Classification**: Email spam classification, sentiment analysis of customer reviews or tweets, etc.
2. **Text Similarity**: Identifying websites with similar product descriptions, detecting duplicate documents within an organization, etc.
3. **Question Answering**: Automating FAQ doc generation from product documentation, extracting specific information from legal and financial documents, etc.
4. **Text Summarization**: Generating simplified summaries of complex research papers, getting an overview of domain specific documents or websites, etc.
5. **Text Clustering**: Organizing large datasets, automating data labeling for machine learning systems, debugging noise in datasets by visualization, etc.
6. **Topic Modeling**: Extracting underlying themes in a large corpus without human involvement, automatically generating sections for a given document, etc.

Q2: Apart from using large language models, what are other ways to compute sentence embeddings?

Answer:
1. **TF-IDF**: Weights each word by how often it occurs within a document and how rare it is across documents to create vectors. Generates sparse vectors without any context information.
2. **Word2Vec**: Learns word embeddings from the surrounding words in the corpus. A sentence embedding can be obtained by averaging the word vectors, which gives a bag-of-words representation that ignores word order.
3. **Doc2Vec**: Extends Word2Vec to learn paragraph vectors jointly with the underlying word embeddings.
4. **GloVe**: Word2vec uses a shallow neural network to create vectors, while GloVe uses a global matrix factorization technique.
5. **Latent Semantic Analysis (LSA)**: Involves applying singular value decomposition (SVD) on the term-document matrix to derive sentence embeddings.
6. **FastText**: Word2Vec works on the word level, while fastText works on more granular character n-grams and can provide embeddings for OOV words.
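
As a rough illustration of the first method in the list, the sketch below builds TF-IDF sentence vectors with only the Python standard library and compares them with cosine similarity. The helper names `tfidf_vectors` and `cosine`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for this example; library implementations differ in their exact weighting.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return (vocabulary, vectors) with one TF-IDF vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency is the count normalized by sentence length.
        vectors.append([counts[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]
vocab, vectors = tfidf_vectors(sentences)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The vectors have one dimension per vocabulary word and are mostly zeros, which is the sparsity mentioned above. Sentences that share no words get a similarity of exactly zero, even if they mean the same thing.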

**Imports**

In [1]:
import json
import math
from collections import OrderedDict
import torch
from torch import nn, Tensor
from typing import Union, Tuple, List, Iterable, Dict
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.optim import AdamW
from torch.utils.data import DataLoader
from scipy.stats import pearsonr, spearmanr
import numpy as np
import gzip, csv
import pandas as pd
from tqdm.auto import tqdm

torch.manual_seed(0)
np.random.seed(0)

In [2]:
%pip install transformers
from transformers import AutoTokenizer
# If you can not find all the bugs, use the line below for AutoModel
#from transformers import AutoModel

Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

# **Coding Challenge Part 1: Debugging custom BERT code [7 points]**

BERT ([Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805)) is a transformer-based language model that is pretrained to generate contextualized embeddings. In this part, we provide a BERT implementation together with a pretrained checkpoint file. This BERT implementation contains 7 bugs: some of them break the code, while others only impact model performance.

Tasks:
* [**7 points**] Your goal is to get the code working. There are 7 bugs in the code; some of them raise errors, while others are designed to impair test accuracy without breaking the code. You will get one point for each of the 7 bugs you find.

* [**1 point**] You will get an extra point for adding improved documentation to each of the functions we introduce in this section and for describing the fixes to the bugs.


Note for usage and comparison:
*   In order to test this implementation, we provide ***bert_tiny.bin*** and example usage in the below cells.
*   You can check if your bugfixes are correct based on your results in "Coding challenge Part 2". Except for the BERT implementation, there are no bugs in other parts, so if your fixes are correct you should achieve the same results. We provide the expected results for you to compare.


**Please DO NOT use any additional library except the ones that are imported!!**

### **Identified bugs and explanations**

All the identified bugs are marked with comments in the code cell below. Here are their explanations:

**BUG 1**: Layer normalization normalizes the features (i.e., the `hidden_size` dimension) of each individual token to zero mean and unit variance, independently of the other tokens and examples in the batch. This is different from batch normalization, which normalizes across the batch dimension.
1. Therefore, the mean should be calculated over the feature dimension of each token (dim=-1) instead of over the batch dimension (dim=0).
2. The mean should be subtracted from the input to get zero mean. Hence, it should be (x-u) instead of (x+u).

Layer normalization is crucial for training deep neural networks because it helps center the inputs to activation functions and makes optimization with algorithms like gradient descent more stable.
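
The corrected normalization can be sanity-checked against PyTorch's built-in `F.layer_norm`. The snippet below is a minimal sketch: the helper name `manual_layer_norm` and the tensor sizes are made up for this check, and the learnable `gamma` and `beta` are left out so only the normalization itself is compared.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8)  # [batch_size, seq_len, hidden_size]
eps = 1e-12


def manual_layer_norm(x, eps):
    """Normalize over the last (hidden) dimension, as in the fixed forward()."""
    u = x.mean(-1, keepdim=True)               # per-token mean
    s = (x - u).pow(2).mean(-1, keepdim=True)  # per-token (biased) variance
    return (x - u) / torch.sqrt(s + eps)


manual = manual_layer_norm(x, eps)
reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)

print(torch.allclose(manual, reference, atol=1e-5))
print(manual.mean(-1).abs().max())  # close to zero for every token

# The buggy variant averaged over dim=0, which mixes examples in the batch:
buggy_mean = x.mean(0, keepdim=True)
print(buggy_mean.shape)  # one mean per (position, feature), not per token
```

Each token's feature vector ends up with approximately zero mean and unit variance, regardless of what else is in the batch.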

**BUG 2**: Tensors in PyTorch can be stored in memory in a non-contiguous manner, especially after certain operations like transpose, permute, or slicing. The `.contiguous()` method returns a tensor containing the same data laid out contiguously in memory, copying it if necessary. If we try to use `.view()` on a non-contiguous tensor whose strides are incompatible with the requested shape, PyTorch will raise an error.
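
The behaviour described for BUG 2 can be reproduced on a small tensor. This is an illustrative sketch with arbitrary shapes, not the shapes used in the model:

```python
import torch

x = torch.arange(24).view(2, 3, 4)
y = x.permute(0, 2, 1)  # shape [2, 4, 3]; same storage, different strides
print(y.is_contiguous())  # False

# .view() cannot merge the last two dimensions of the permuted tensor.
try:
    y.view(2, 12)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)

# Making the tensor contiguous first copies the data into the new layout.
flat = y.contiguous().view(2, 12)
# .reshape() does the same thing, copying only when it has to.
same_as_reshape = torch.equal(flat, y.reshape(2, 12))
print(flat.shape, same_as_reshape)
```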

**BUG 3**: `torch.matmul(q, k)` raises an error because the inner dimensions for matrix multiplication don't align. The second argument should be `k.transpose(-1, -2)`, so that the matrix multiplication becomes `(batch, num_heads, seq_len, attention_head_size) x (batch, num_heads, attention_head_size, seq_len) = (batch, num_heads, seq_len, seq_len)`.
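
The shape argument for BUG 3 can be checked directly. The sizes below are arbitrary values picked for the sketch, with `seq_len` deliberately different from `head_size`:

```python
import torch

torch.manual_seed(0)
batch, num_heads, seq_len, head_size = 2, 2, 5, 4
q = torch.randn(batch, num_heads, seq_len, head_size)
k = torch.randn(batch, num_heads, seq_len, head_size)

# Without the transpose, the inner dimensions (4 and 5) do not match.
try:
    torch.matmul(q, k)
    raised = False
except RuntimeError:
    raised = True

# With the transpose, entry [b, h, i, j] is the dot product of query i and key j.
scores = torch.matmul(q, k.transpose(-1, -2))
print(raised, scores.shape)
```

Note that if `seq_len` happened to equal `head_size`, the untransposed product would run without an error and silently produce the wrong scores, so a shape error is not guaranteed for every input.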

**BUG 4**: The attention mask tells the model which parts of the input to attend to. Where the mask is 0, setting the scores to `-inf` before applying softmax masks those positions out, because their softmax output becomes exactly zero. Where the mask is 1, the attention scores are kept unchanged. If we set the values to `inf` instead of `-inf`, we would get `nan` values in the output embeddings.
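
The effect of the fill value can be seen on a single row of attention scores. The sketch below uses a plain-Python softmax (the helper `softmax` and the example numbers are assumptions for illustration) rather than the model's tensors:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 0.5]
mask = [1, 1, 0]  # the last position is padding

# Correct: masked positions get -inf, so exp(-inf) = 0 and they receive no weight.
masked_neg = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
# Buggy: +inf leads to inf - inf = nan inside the softmax.
masked_pos = [s if keep else float("inf") for s, keep in zip(scores, mask)]

p_neg = softmax(masked_neg)
p_pos = softmax(masked_pos)
print(p_neg)  # last entry is 0.0, the rest sum to 1
print(p_pos)  # every entry is nan
```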

**BUG 5**: We need to convert raw attention scores (`s`) to probabilities (`p`) using softmax. This lets the model learn to distribute its attention over different parts of the input sequence when we multiply these probabilities with the `values` matrix. Therefore, `p = F.softmax(s, dim=-1)`.

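**BUG 6**: Like layer normalization, residual connections are another important component for stabilizing the training of deep neural networks. The Transformer applies a residual connection around both the multi-head attention block and the feed-forward block, so the output of each block is added to its input before layer normalization (`a + x` and `m + a`).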

**BUG 7**: The Transformer architecture doesn't have any inherent notion of the order and type of tokens. To address this, positional and type embeddings are added to the token embeddings. The Transformer architecture sums these embeddings instead of concatenating them, which keeps the input dimension unchanged, makes it simpler for the model to combine the different kinds of information, and avoids an increase in computational cost.
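
The dimension argument for BUG 7 can be illustrated with three small embedding tables. The vocabulary sizes, hidden size, and token ids below are arbitrary values assumed for the sketch:

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 8
token = nn.Embedding(100, hidden_size)
position = nn.Embedding(16, hidden_size)
token_type = nn.Embedding(2, hidden_size)

input_ids = torch.tensor([[5, 17, 42, 7]])  # [batch_size, seq_len]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
token_type_ids = torch.zeros_like(input_ids)

parts = [token(input_ids), position(position_ids), token_type(token_type_ids)]

# Summing keeps the last dimension equal to hidden_size.
summed = parts[0] + parts[1] + parts[2]
# Concatenating triples it, so later layers sized for hidden_size would not fit.
concatenated = torch.cat(parts, dim=-1)

print(summed.shape, concatenated.shape)
```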

In [10]:
def gelu(x):
    """Implementation of the gelu activation function.

    Returns:
        torch.Tensor: Output tensor of shape [batch_size, seq_len, hidden_size] containing the output of the gelu activation function.
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


class Config(object):
    """Configuration class to store the configuration of a `BertModel`."""
    def __init__(self,
                vocab_size,
                hidden_size=768,
                num_hidden_layers=12,
                num_attention_heads=12,
                intermediate_size=3072,
                # Dropout of 0.9 was too high, changing it to 0.1
                dropout_prob=0.1,
                max_position_embeddings=512,
                type_vocab_size=2,
                initializer_range=0.02):
        """Constructs Config for BertModel.

        Args:
            vocab_size (int): Vocabulary size of `inputs_ids` in `BertModel`.
            hidden_size (int, optional): Size of the encoder layers and the pooler layer. Defaults to 768.
            num_hidden_layers (int, optional): Number of hidden layers in the Transformer encoder. Defaults to 12.
            num_attention_heads (int, optional): Number of attention heads for each attention layer in the Transformer encoder. Defaults to 12.
            intermediate_size (int, optional): The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Defaults to 3072.
            dropout_prob (float, optional): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Defaults to 0.1.
            max_position_embeddings (int, optional): The maximum sequence length that this model might ever be used with. Defaults to 512.
            type_vocab_size (int, optional): The vocabulary size of the `token_type_ids` passed into `BertModel`. Defaults to 2.
            initializer_range (float, optional): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Defaults to 0.02.
        """

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = dropout_prob
        self.attention_probs_dropout_prob = dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range

    @classmethod
    def from_dict(cls, dict_object):
        """Constructs Config from a Python dictionary of parameters.

        Args:
            dict_object (dict): Dictionary containing the configuration parameters.

        Returns:
            Config: Configuration object constructed from the dictionary.
        """
        config = Config(vocab_size=None)
        for (key, value) in dict_object.items():
            config.__dict__[key] = value
        return config


class LayerNorm(nn.Module):
      """Layer normalization module."""
      def __init__(self, hidden_size, variance_epsilon=1e-12):
        """Constructs LayerNorm object for Transformer layer in BERT model.

        Args:
            hidden_size (int): Size of hidden dimension.
            variance_epsilon (float, optional): A value added to the denominator for numerical stability. Defaults to 1e-12.
        """
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(hidden_size))
        self.beta = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = variance_epsilon

      def forward(self, x):
        """Forward pass of the LayerNorm layer.

        Args:
            x (torch.Tensor): Input tensor of shape [..., hidden_size], e.g. [batch_size, seq_len, hidden_size].

        Returns:
            torch.Tensor: Normalized input tensor.
        """
        # BUG 1: Compute the mean over the hidden dimension of each token (dim=-1)
        # instead of across the examples in the batch (dim=0)
        u = x.mean(-1, keepdim=True)
        # BUG 1: Mean should be subtracted from the input
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.gamma * x + self.beta


class MLP(nn.Module):
      """Feed forward network with gelu activation"""
      def __init__(self, hidden_size, intermediate_size):
        """Constructs MLP object for Transformer layer in BERT model.

        Args:
            hidden_size (int): Size of hidden dimension.
            intermediate_size (int): Size of intermediate dimension.
        """
        super(MLP, self).__init__()
        self.dense_expansion = nn.Linear(hidden_size, intermediate_size)
        self.dense_contraction = nn.Linear(intermediate_size, hidden_size)

      def forward(self, x):
        """Forward pass of the MLP layer.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, hidden_size] containing the input features.

        Returns:
            torch.Tensor: Output tensor of shape [batch_size, seq_len, hidden_size] containing the output features.
        """
        x = self.dense_expansion(x)
        x = self.dense_contraction(gelu(x))
        return x


class Layer(nn.Module):
    """The Transformer layer"""
    def __init__(self, config):
        """Constructs Layer object for Transformer layer in BERT model based on config.

        Args:
            config (Config): Configuration of the BERT model containing the hyperparameters.
        """
        super(Layer, self).__init__()

        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

        self.attn_out = nn.Linear(config.hidden_size, config.hidden_size)
        self.ln1 = LayerNorm(config.hidden_size)

        self.mlp = MLP(config.hidden_size, config.intermediate_size)
        self.ln2 = LayerNorm(config.hidden_size)

    def split_heads(self, tensor, num_heads, attention_head_size):
        """Split hidden_size into num_heads * attention_head_size and transpose into shape [batch, num_heads, seq_len, attention_head_size]

        Args:
            tensor (torch.Tensor): Input tensor of shape [batch_size, seq_len, hidden_size]
            num_heads (int): Number of attention heads.
            attention_head_size (int): Size of each attention head.

        Returns:
            torch.Tensor: Tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the split heads.
        """
        new_shape = tensor.size()[:-1] + (num_heads, attention_head_size)
        tensor = tensor.view(*new_shape) # batch, seq_len, num_heads, attention_head_size
        # BUG 2: Call .contiguous() after .permute()
        return tensor.permute(0, 2, 1, 3).contiguous() # batch, num_heads, seq_len, attention_head_size

    def merge_heads(self, tensor, num_heads, attention_head_size):
        """Transpose and then reshape into shape [batch, seq_len, hidden_size]

        Args:
            tensor (torch.Tensor): Input tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the split heads.
            num_heads (int): Number of attention heads.
            attention_head_size (int): Size of each attention head in the input tensor.

        Returns:
            torch.Tensor: Tensor of shape [batch_size, seq_len, hidden_size] containing the merged heads.
        """
        # tensor.shape: batch, num_heads, seq_len, attention_head_size
        tensor = tensor.permute(0, 2, 1, 3).contiguous() # batch, seq_len, num_heads, attention_head_size
        new_shape = tensor.size()[:-2] + (num_heads * attention_head_size,)
        return tensor.view(new_shape) # batch, seq_len, all_head_size

    def attn(self, q, k, v, attention_mask):
        """Attention mechanism for the Transformer layer.

        Args:
            q (torch.Tensor): Query tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the query vectors for all attention heads.
            k (torch.Tensor): Key tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the key vectors for all attention heads.
            v (torch.Tensor): Value tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the value vectors for all attention heads.
            attention_mask (torch.Tensor): Mask to avoid performing attention on padding token indices of shape [batch_size, seq_len].

        Returns:
            torch.Tensor: Tensor of shape [batch_size, num_heads, seq_len, attention_head_size] containing the attention outputs for all attention heads.
        """
        mask = attention_mask == 1
        mask = mask.unsqueeze(1).unsqueeze(2)

        # BUG 3: k should be k.transpose(-1, -2)
        s = torch.matmul(q, k.transpose(-1, -2)) # batch, num_heads, seq_len, seq_len
        s = s / math.sqrt(self.attention_head_size)

        # BUG 4: Replace attention scores with -inf based on mask
        s = torch.where(mask, s, torch.tensor(float('-inf'), dtype=s.dtype, device=s.device))

        # BUG 5: Convert attention scores to probabilities using softmax
        p = F.softmax(s, dim=-1) # batch, num_heads, seq_len, seq_len
        p = self.dropout(p)

        a = torch.matmul(p, v) # batch, num_heads, seq_len, attention_head_size
        return a

    def forward(self, x, attention_mask):
        """Forward pass of the Transformer layer in BERT.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, seq_len, hidden_size]
            attention_mask (torch.Tensor): Mask to avoid performing attention on padding token indices of shape [batch_size, seq_len].

        Returns:
            torch.Tensor: Output tensor of shape [batch_size, seq_len, hidden_size] containing the output of the Transformer layer.
        """
        q, k, v = self.query(x), self.key(x), self.value(x) # batch, seq_len, all_head_size

        q = self.split_heads(q, self.num_attention_heads, self.attention_head_size) # batch, num_heads, seq_len, attention_head_size
        k = self.split_heads(k, self.num_attention_heads, self.attention_head_size)
        v = self.split_heads(v, self.num_attention_heads, self.attention_head_size)

        a = self.attn(q, k, v, attention_mask) # batch, num_heads, seq_len, attention_head_size
        a = self.merge_heads(a, self.num_attention_heads, self.attention_head_size) # batch, seq_len, all_head_size
        a = self.attn_out(a) # batch, seq_len, hidden_size (because: hidden_size == all_head_size)
        a = self.dropout(a)
        # BUG 6: Add the residual connection (attention output + layer input)
        a = self.ln1(a + x)

        m = self.mlp(a) # batch, seq_len, hidden_size
        m = self.dropout(m)
        # BUG 6: Add the residual connection (MLP output + MLP input)
        m = self.ln2(m + a)

        return m


class Bert(nn.Module):
      """Custom BERT implementation"""
      def __init__(self, config_dict):
        """Constructs Bert model based on config_dict.

        Args:
            config_dict (dict): Dictionary containing the configuration of the model.
        """
        super(Bert, self).__init__()
        self.config = Config.from_dict(config_dict)
        self.embeddings = nn.ModuleDict({
          'token': nn.Embedding(self.config.vocab_size, self.config.hidden_size, padding_idx=0),
          'position': nn.Embedding(self.config.max_position_embeddings, self.config.hidden_size),
          'token_type': nn.Embedding(self.config.type_vocab_size, self.config.hidden_size),
        })

        self.ln = LayerNorm(self.config.hidden_size)
        self.dropout = nn.Dropout(self.config.hidden_dropout_prob)

        self.layers = nn.ModuleList([
            Layer(self.config) for _ in range(self.config.num_hidden_layers)
        ])

        self.pooler = nn.Sequential(OrderedDict([
            ('dense', nn.Linear(self.config.hidden_size, self.config.hidden_size)),
            ('activation', nn.Tanh()),
        ]))

      def forward(self, input_ids, attention_mask=None, token_type_ids=None, ):
        """Forward pass of the model to generate embeddings.

        Args:
            input_ids (torch.Tensor): Input tensor of shape [batch_size, seq_len] containing token ids of tokens in the input sequence.
            attention_mask (torch.Tensor, optional): Mask to avoid performing attention on padding token indices. Defaults to None.
            token_type_ids (torch.Tensor, optional): Segment token indices to indicate first and second portions of the inputs. Defaults to None.

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: Tuple of output embeddings of shape [batch_size, seq_len, hidden_size] and pooled output of shape [batch_size, hidden_size].
        """
        position_ids = torch.arange(input_ids.size(1), dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # BUG 7: Embeddings should be summed together, not concatenated
        x = (self.embeddings.token(input_ids) +
              self.embeddings.position(position_ids) +
              self.embeddings.token_type(token_type_ids))
        x = self.dropout(self.ln(x))

        for layer in self.layers:
            x = layer(x, attention_mask)

        o = self.pooler(x[:, 0])
        return (x, o)

      def load_model(self, path):
        """Load model from path and return the model.

        Args:
            path (str): Path to the model checkpoint.

        Returns:
            Bert: Bert model loaded from the checkpoint.
        """
        self.load_state_dict(torch.load(path))
        return self

**Download weights for the custom Bert**

In [11]:
!wget https://github.com/for-ai/bert/raw/master/bert_tiny.bin

--2023-10-01 04:13:46--  https://github.com/for-ai/bert/raw/master/bert_tiny.bin
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/for-ai/bert/master/bert_tiny.bin [following]
--2023-10-01 04:13:47--  https://raw.githubusercontent.com/for-ai/bert/master/bert_tiny.bin
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17555010 (17M) [application/octet-stream]
Saving to: ‘bert_tiny.bin’


2023-10-01 04:13:47 (383 MB/s) - ‘bert_tiny.bin’ saved [17555010/17555010]



**An example use of pretrained BERT with transformers library to encode a sentence**

In [5]:
MODEL_NAME = 'prajjwal1/bert-tiny'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

bert_config = {"hidden_size": 128, "num_attention_heads": 2, "num_hidden_layers": 2, "intermediate_size": 512, "vocab_size": 30522}
bert = Bert(bert_config).load_model('bert_tiny.bin')

#EXAMPLE USE
sentence = 'An example use of pretrained BERT with transformers library to encode a sentence'
tokenized_sample = tokenizer(sentence, return_tensors='pt', padding='max_length', max_length=512)
output = bert(input_ids=tokenized_sample['input_ids'],
              attention_mask=tokenized_sample['attention_mask'],)

# We use "pooler_output" for simplicity. This corresponds to the last-layer
# hidden state of the first token of the sequence (CLS token) after
# further processing through the layers used for the auxiliary pretraining task.
embedding = output[1]
print(f'\nResulting embedding shape: {embedding.shape}')


Resulting embedding shape: torch.Size([1, 128])
