<b>IMPORTANT</b>: Run the cell below to import the modules from the previous challenge.

In [None]:
%run solution1.ipynb

# **Coding Challenge Part 2: Evaluate a pretrained BERT model on STS benchmark [4 points]**

**Please do not use any additional libraries beyond the ones that are already imported!**

In this part, we evaluate a pretrained BERT model on the STS benchmark without any additional training. For the evaluation, we provide Pearson and Spearman correlation functions and a cosine similarity method.

Tasks:

*   **[2 Points]** Prepare an evaluation data loader and evaluation loop: read in the STS data, tokenize it as shown in the example, generate the dataloader, and return the Pearson and Spearman correlation scores.
*   **[1 Point]** Implement the cosine similarity function, as described in the TODO.

**[1 Point] Question**: What is the difference between Pearson and Spearman correlation, and why might you want to evaluate using both metrics?

**Answer**:



| Criteria                  | Pearson Correlation                          | Spearman Correlation                         |
|---------------------------|----------------------------------------------|----------------------------------------------|
| **Type of Relationship**  | Linear                                       | Monotonic                                    |
| **Sensitivity to Outliers**| High                                         | Low                                          |
| **Distribution Assumptions**| Significance tests assume bivariate normality | No distributional assumptions              |
| **Scale of Measurement**  | Interval/Ratio                               | Ordinal, Interval/Ratio                      |
| **Mathematical Complexity**| Involves covariance and standard deviation  | Based on ranks                               |
| **Robustness**            | Less Robust                                  | More Robust                                  |
| **Focus**                 | Actual values (up to linear rescaling)       | Relative ordering                            |
| **Interpretability**      | Quantifies linear relationship strength      | Quantifies monotonic relationship strength   |
| **Use Case in STS**       | Checks linear alignment with ground truth    | Checks monotonic alignment with ground truth |
| **Diagnostic Utility**    | Sensitive to anomalies in dataset            | Robust to anomalies in dataset               |

**Why Use Both?**

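1. **Comprehensiveness**: Pearson checks how closely the predicted similarities track the gold scores linearly, while Spearman checks whether the pairs are ranked in the right order. Reporting both gives a more complete picture of the BERT model.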

2. **Nature of Relationship**: If both Pearson and Spearman are high, the relationship might be linear. If Spearman is high but Pearson is low, the relationship might be monotonic but not linear.

3. **Outlier Diagnosis**: A significant difference between Pearson and Spearman could indicate the presence of outliers.

4. **Assumptions**: These metrics can help validate or invalidate any assumptions about the linearity or monotonicity of the relationship.
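
The difference between the two metrics can be made concrete with a small dependency-free sketch. It is not the `pearsonr`/`spearmanr` code used in this notebook; the helper names and the `gold`/`pred` values are made up for illustration. It shows that Spearman is simply Pearson applied to ranks, so a monotonic but non-linear relationship scores a perfect Spearman correlation while Pearson stays below 1.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations.

    The 1/n factors cancel between numerator and denominator, so they are omitted.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical gold scores and a monotonic but non-linear set of predictions
gold = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
pred = [g ** 4 for g in gold]

print(f"Pearson:  {pearson(gold, pred):.3f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman(gold, pred):.3f}")  # 1.000: the ordering is preserved exactly
```

A gap like this between the two numbers is the "monotonic but not linear" case from point 2 above.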

**Semantic Textual Similarity (STS) benchmark:**
The STS dataset consists of sentence pairs, each with a semantic similarity score. We want to use this dataset to evaluate the quality of a sentence encoder.

Dataset includes:

*   Sentence pair: (sentence1, sentence2)
*   Similarity score: ranging from 0 to 5, where 5 corresponds to the highest similarity
*   Splits: train, dev, test

**Download datasets**

In [12]:
!wget https://sbert.net/datasets/stsbenchmark.tsv.gz

--2023-10-01 04:13:53--  https://sbert.net/datasets/stsbenchmark.tsv.gz
Resolving sbert.net (sbert.net)... 104.21.67.200, 172.67.180.145, 2606:4700:3036::6815:43c8, ...
Connecting to sbert.net (sbert.net)|104.21.67.200|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/stsbenchmark.tsv.gz [following]
--2023-10-01 04:13:54--  https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/stsbenchmark.tsv.gz
Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186
Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 392336 (383K) [application/octet-stream]
Saving to: ‘stsbenchmark.tsv.gz’


2023-10-01 04:13:56 (571 KB/s) - ‘stsbenchmark.tsv.gz’ saved [392336/392336]



In [7]:
data = pd.read_csv('stsbenchmark.tsv.gz', nrows=5, compression='gzip', delimiter='\t')
data.head()

|   | split | genre         | dataset | year     | sid | score | sentence1                                     | sentence2                                         |
|---|-------|---------------|---------|----------|-----|-------|-----------------------------------------------|---------------------------------------------------|
| 0 | train | main-captions | MSRvid  | 2012test | 1   | 5.0   | A plane is taking off.                        | An air plane is taking off.                       |
| 1 | train | main-captions | MSRvid  | 2012test | 4   | 3.8   | A man is playing a large flute.               | A man is playing a flute.                         |
| 2 | train | main-captions | MSRvid  | 2012test | 5   | 3.8   | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | train | main-captions | MSRvid  | 2012test | 6   | 2.6   | Three men are playing chess.                  | Two men are playing chess.                        |
| 4 | train | main-captions | MSRvid  | 2012test | 9   | 4.25  | A man is playing the cello.                   | A man seated is playing the cello.                |


### **Implementation explanation**

1. **load_sts_dataset()**: I use the `gzip` package and `csv.DictReader` to load the STS benchmark dataset, and I normalize the ground-truth similarity scores to the range 0-1 by dividing by 5. This puts the labels on a scale comparable to cosine similarity, which lies in [-1, 1]. Pearson and Spearman correlations are unaffected by this rescaling, so the normalization is a convenience rather than a requirement for the metrics.

2. **tokenize_sentence_pair_dataset()**: I tokenize each sentence with the tokenizer loaded from the Hugging Face Hub. An important step is adding the `[CLS]` and `[SEP]` tokens to the start and end of the token list before converting it to token ids. This matches the input format of the original BERT model, where the `[CLS]` token's representation is used for classification tasks. After generating the token ids, I create the input mask (1 for every real token) and zero-pad both the ids and the mask to the maximum sequence length.

3. **get_dataloader()**: No changes needed in the provided code.

4. **cosine_sim()**: Cosine similarity is the dot product of two vectors divided by the product of their L2 norms, which `torch.norm()` computes. The implementation is vectorized: a single matrix multiplication yields all pairwise cosine similarities at once.

5. **eval_loop()**: This function runs inference on the STS benchmark sentence pairs, which are tokenized with the BERT tokenizer and embedded with the pre-trained `bert-tiny` model. The resulting embeddings are used to compute cosine similarity scores, which are then compared with the ground truth via Pearson and Spearman correlation.
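
As a rough illustration of step 2, the sketch below walks one sentence through the same sequence of operations: tokenize, add `[CLS]`/`[SEP]`, convert to ids, build the mask, and zero-pad. It does not use the real tokenizer: `TOY_VOCAB`, the whitespace splitting, and the `encode` helper are hypothetical stand-ins chosen so the example runs without any downloads. The truncation step is also an assumption added here so that long inputs cannot exceed `max_length`.

```python
# Hypothetical toy vocabulary; a real BERT vocabulary has about 30k WordPiece entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "a": 4, "plane": 5, "is": 6, "taking": 7, "off": 8,
}

def encode(sentence, max_length=10):
    """Return (input_ids, input_mask), both padded to max_length."""
    # Stand-in for tokenizer.tokenize(): lowercase, drop periods, split on whitespace
    tokens = sentence.lower().replace(".", "").split()
    # Assumption: truncate so the two special tokens always fit
    tokens = tokens[: max_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Stand-in for tokenizer.convert_tokens_to_ids()
    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # The mask has 1 for real tokens and 0 for padding positions
    input_mask = [1] * len(input_ids)
    padding = [0] * (max_length - len(input_ids))
    return input_ids + padding, input_mask + padding

ids, mask = encode("A plane is taking off.")
print(ids)   # [2, 4, 5, 6, 7, 8, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The mask lets the model ignore the padded positions, which is why the ids and the mask must be padded to the same length.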

In [13]:
def load_sts_dataset(sts_dataset_path):
  # Load the gzipped TSV and split it into train/dev/test lists of (sentence1, sentence2, score)
  train_samples = []
  dev_samples = []
  test_samples = []
  with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
      reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
      for row in tqdm(reader):
          score = float(row['score']) / 5.0  # Normalize score to range 0-1
          if row['split'] == 'dev':
              dev_samples.append((row['sentence1'], row['sentence2'], score))
          elif row['split'] == 'test':
              test_samples.append((row['sentence1'], row['sentence2'], score))
          else:
              train_samples.append((row['sentence1'], row['sentence2'], score))
  sts_samples = {
      "train": train_samples,
      "dev": dev_samples,
      "test": test_samples
  }
  return sts_samples


def tokenize_sentence_pair_dataset(dataset, tokenizer, max_length=512):
  # Tokenize each sentence of a pair separately and build padded id/mask tensors
  tokenized_dataset = []
  for sentence1, sentence2, label in dataset:
      # Convert sentence 1 to features
      # Truncate so that [CLS] and [SEP] still fit within max_length
      tokens = tokenizer.tokenize(sentence1)[:max_length - 2]
      # Add the [CLS] and [SEP] tokens expected by the BERT input format
      tokens = ["[CLS]"] + tokens + ["[SEP]"]
      input_ids = tokenizer.convert_tokens_to_ids(tokens)
      # The mask has 1 for real tokens and 0 for padding tokens
      input_mask = [1] * len(input_ids)
      # Zero-pad up to the sequence length
      padding = [0] * (max_length - len(input_ids))
      input_ids += padding
      input_mask += padding
      input_ids1 = torch.tensor(input_ids, dtype=torch.long)
      input_mask1 = torch.tensor(input_mask, dtype=torch.long)

      # Convert sentence 2 to features
      # Truncate so that [CLS] and [SEP] still fit within max_length
      tokens = tokenizer.tokenize(sentence2)[:max_length - 2]
      # Add the [CLS] and [SEP] tokens expected by the BERT input format
      tokens = ["[CLS]"] + tokens + ["[SEP]"]
      input_ids = tokenizer.convert_tokens_to_ids(tokens)
      # The mask has 1 for real tokens and 0 for padding tokens
      input_mask = [1] * len(input_ids)
      # Zero-pad up to the sequence length
      padding = [0] * (max_length - len(input_ids))
      input_ids += padding
      input_mask += padding
      input_ids2 = torch.tensor(input_ids, dtype=torch.long)
      input_mask2 = torch.tensor(input_mask, dtype=torch.long)

      tokenized_dataset.append((input_ids1, input_mask1, input_ids2, input_mask2, label))
  return tokenized_dataset


def get_dataloader(tokenized_dataset, batch_size, shuffle=False):
  return DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=shuffle)


def cosine_sim(a, b):
  # Implement cosine similarity function **from scratch**:
  # This method should expect two 2D matrices (batch, vector_dim) and
  # return a 2D matrix (batch, batch) that contains all pairwise cosine similarities
  dot_product = torch.mm(a, b.t())
  norm_a = torch.norm(a, dim=1, keepdim=True)
  norm_b = torch.norm(b, dim=1, keepdim=True)
  return dot_product / (norm_a * norm_b.t())


def eval_loop(model, eval_dataloader, device):
  # Evaluation loop: embed both sentences of every pair, then correlate the
  # cosine similarities (cosine_sim above) with the ground truth using the imported pearsonr and spearmanr
  model.eval()
  model.to(device)
  sentence1_embeddings = []
  sentence2_embeddings = []
  gt_scores = []

  with torch.no_grad():
      for input_ids1, attention_mask1, input_ids2, attention_mask2, scores in tqdm(eval_dataloader):
          # Move the batch to device
          input_ids1, attention_mask1 = input_ids1.to(device), attention_mask1.to(device)
          # Run forward pass
          _, embeddings1 = model(input_ids=input_ids1, attention_mask=attention_mask1)
          input_ids2, attention_mask2 = input_ids2.to(device), attention_mask2.to(device)
          _, embeddings2 = model(input_ids=input_ids2, attention_mask=attention_mask2)
          # Store the embeddings and the ground truth scores
          sentence1_embeddings.extend(embeddings1)
          sentence2_embeddings.extend(embeddings2)
          # Store plain Python floats so the correlation functions receive a flat numeric list
          gt_scores.extend(scores.tolist())

  sentence1_embeddings = torch.stack(sentence1_embeddings)
  sentence2_embeddings = torch.stack(sentence2_embeddings)
  # Calculate cosine similarity scores
  cosine_scores = cosine_sim(sentence1_embeddings, sentence2_embeddings)
  # The diagonal elements correspond to the sentence similarity scores aligned with the ground truth
  diag_cosine_scores = torch.diag(cosine_scores).cpu().numpy()
  eval_pearson_cosine = pearsonr(diag_cosine_scores, gt_scores)[0]
  eval_spearman_cosine = spearmanr(diag_cosine_scores, gt_scores)[0]

  return [eval_pearson_cosine, eval_spearman_cosine]
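
The evaluation loop above keeps only the diagonal of the pairwise similarity matrix, because entry (i, i) compares sentence1 of pair i with sentence2 of the same pair. The following dependency-free sketch (plain Python lists instead of torch tensors, with made-up 2-D vectors and hypothetical helper names) illustrates that relationship and shows that cosine similarity can take values anywhere between -1 and 1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(A, B):
    """All pairwise similarities: entry [i][j] compares A[i] with B[j]."""
    return [[cosine(u, v) for v in B] for u in A]

# Made-up embeddings: row i of A and row i of B form one sentence pair
A = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
B = [[2.0, 0.0], [1.0, -1.0], [0.0, -3.0]]

matrix = pairwise_cosine(A, B)
diagonal = [matrix[i][i] for i in range(len(A))]
aligned = [cosine(u, v) for u, v in zip(A, B)]

print(diagonal)  # [1.0, 0.0, -1.0]: same direction, orthogonal, opposite direction
print(aligned)   # [1.0, 0.0, -1.0]: identical to the diagonal
```

Computing the aligned pairs directly gives the same scores as the diagonal without materializing the full N x N matrix, which may matter for larger evaluation sets.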

**Evaluation**

Expected result:

Pearson correlation: 0.32

Spearman correlation: 0.33

In [14]:
#INFO: model and tokenizer
model_name = 'prajjwal1/bert-tiny'
tokenizer = AutoTokenizer.from_pretrained(model_name)

#INFO: load bert
bert_config = {"hidden_size": 128, "num_attention_heads": 2, "num_hidden_layers": 2, "intermediate_size": 512, "vocab_size": 30522}
bert = Bert(bert_config).load_model('bert_tiny.bin')

#INFO: load dataset
sts_dataset = load_sts_dataset('stsbenchmark.tsv.gz')

#INFO: tokenize dataset
tokenized_test = tokenize_sentence_pair_dataset(sts_dataset['test'], tokenizer)

#INFO: generate dataloader
test_dataloader = get_dataloader(tokenized_test, batch_size=1)

#INFO: run evaluation loop
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
results_from_pretrained = eval_loop(bert, test_dataloader, device)

print(f'\nPearson correlation: {results_from_pretrained[0]:.2f}\nSpearman correlation: {results_from_pretrained[1]:.2f}')

Pearson correlation: 0.32
Spearman correlation: 0.33
