# Objective
The objective of this exercise is to observe how the representation of the same word differs when it occurs in different contexts.

# BERT
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

This means that a general-purpose "language understanding" model is first trained on a large text corpus (like Wikipedia), and that model is then used for the downstream NLP tasks that we care about (like question answering or classification).

In [1]:
%%bash
pip install tqdm boto3 requests regex sentencepiece sacremoses transformers

Collecting boto3
  Downloading boto3-1.25.3-py3-none-any.whl (132 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.5/132.5 kB 4.0 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp310-cp310-macosx_11_0_arm64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 9.7 MB/s eta 0:00:00
Collecting sacremoses
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting transformers
  Using cached transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting botocore<1.29.0,>=1.28.3
  Downloading botocore-1.28.3-py3-none-any.whl (9.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.3/9.3 MB 11.3 MB/s eta 0:00:00
Collecting jmespath<2.0.0,>=0.7.1
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.7.0,>=0.6.0
  Using cached s3transfer-0.6.0-py3-none-any.whl (79 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp310-cp3

# transformers
PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for models such as BERT, GPT, XLM, RoBERTa and DistilBERT.

[LINK FOR TRANSFORMERS](https://pytorch.org/hub/huggingface_pytorch-transformers/)

# DistilBERT
DistilBERT is a smaller general-purpose language representation model which can be fine-tuned with good performance on a wide range of tasks, like its larger counterparts.

[MORE ABOUT DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)

DistilBERT is trained with knowledge distillation. This approach reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.

In [2]:
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from sklearn.manifold import TSNE
import numpy as np

In [3]:
# Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
def get_features_diff_context(sentence_list, word_of_interest_list):
  """
  sentence_list: a list of sentences
  word_of_interest_list: list of words, one per sentence; each word occurs in the
                          corresponding sentence and is the word whose
                          representation we are interested in

  return dict
   key: word_s:{0,1,2...n-1} - the word of interest and the index of the sentence it occurs in.
   value: numpy array holding the last-hidden-state representation of that token.
  """
  assert len(sentence_list) == len(word_of_interest_list)
  # the uncased tokenizer produces lowercase tokens, so lowercase both inputs to match
  sentence_list = [sentence.lower() for sentence in sentence_list]
  word_of_interest_list = [word.lower() for word in word_of_interest_list]
  inputs = tokenizer(sentence_list, return_tensors="pt", padding=True, truncation=True)
  with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
  reps = outputs['last_hidden_state']  # shape: (batch, sequence_length, hidden_size)
  out_dict = {}
  for i, tokens in enumerate(inputs['input_ids'].tolist()):
    for tok_pos, tok_indx in enumerate(tokens):
      tok = tokenizer.convert_ids_to_tokens(tok_indx)
      # keep the vector at the position whose token equals the word of interest;
      # if the word repeats within a sentence, the last occurrence wins
      if tok == word_of_interest_list[i]:
        out_dict[f'{tok}_s:{i}'] = reps[i, tok_pos, :].detach().numpy()
  return out_dict
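
One caveat of matching whole tokens: BERT-style tokenizers split words that are not in their vocabulary into several word pieces, so a word of interest that is split this way would never equal any single token and would be silently missing from the returned dict. The sketch below illustrates greedy longest-match-first word-piece splitting in plain Python. The helper name `wordpiece_split` and the toy vocabulary are assumptions made for illustration; this is not the library's implementation and not the real `distilbert-base-uncased` vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real distilbert-base-uncased vocabulary.
toy_vocab = {"bank", "em", "##bed", "##ding", "##s"}

print(wordpiece_split("bank", toy_vocab))        # ['bank']
print(wordpiece_split("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
```

With this toy vocabulary, `bank` stays a single token and can be matched, while `embeddings` becomes four pieces, none of which equals the original word. Checking that every expected key is present in the returned dict is a cheap way to catch this case.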


In [5]:
sentence_list = ["the river bank was quite nice", "The bank ran out of money"]
word_of_interest_list = ["bank", "bank"]

bank_different_representations = get_features_diff_context(sentence_list, word_of_interest_list)

print('The representations obtained from sentence_list and words_of_interest:')
for k,v in bank_different_representations.items():
  print(f'key: `{k}`, representation_dimensions {v.shape}')

The representations obtained from sentence_list and words_of_interest:
key: `bank_s:0`, representation_dimensions (768,)
key: `bank_s:1`, representation_dimensions (768,)


# Exercise 1
Similar to the above example, get the representation of:

a. `good` from the sentence: `that is quite good`

b. `good` from `that is very good`

c. `good` from `that can be good`

d. `bad` from `that is bad`

Store the result of `get_features_diff_context` in a variable named `word_feature_dict`


In [6]:
## YOUR CODE GOES HERE
sentence_list = ["that is quite good", "that is very good", "that can be good", "that is bad"]
word_of_interest_list = ["good", "good", "good", "bad"]

word_feature_dict = get_features_diff_context(sentence_list, word_of_interest_list)

print('The representations obtained from sentence_list and words_of_interest:')
for k,v in word_feature_dict.items():
  print(f'key: `{k}`, representation_dimensions {v.shape}')


The representations obtained from sentence_list and words_of_interest:
key: `good_s:0`, representation_dimensions (768,)
key: `good_s:1`, representation_dimensions (768,)
key: `good_s:2`, representation_dimensions (768,)
key: `bad_s:3`, representation_dimensions (768,)


In [7]:
## Run this cell as it is:
{key: value.shape for key, value in word_feature_dict.items()}

{'good_s:0': (768,), 'good_s:1': (768,), 'good_s:2': (768,), 'bad_s:3': (768,)}

Expected result:
```python
{'good_s:0': (768,), 'good_s:1': (768,), 'good_s:2': (768,), 'bad_s:3': (768,)}
```

# Exercise 2:
Implement a similarity function that takes the previously generated dictionary of keys and features and calculates the cosine similarity of each representation with every other representation, excluding itself.

Example output:

```python
{
  'good_s:0 & good_s:1': COSINE_SIMILARITY_VALUE,
  'good_s:0 & good_s:2': COSINE_SIMILARITY_VALUE,
  'good_s:0 & bad_s:3': COSINE_SIMILARITY_VALUE,
  .
  .
  .
}
```
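
For reference, the cosine similarity of two representation vectors is their dot product divided by the product of their Euclidean norms. The symbols $u$, $v$ and the index $k$ below are notation introduced here, not names from the exercise:

$$
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{k} u_k v_k}{\sqrt{\sum_{k} u_k^2} \, \sqrt{\sum_{k} v_k^2}}
$$

The value lies in $[-1, 1]$ and is symmetric in $u$ and $v$, so the entries for `A & B` and `B & A` are always equal.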

In [36]:
def similarity(rep_dict):
    keys = list(rep_dict.keys())
    out_dict = {}

    for i in range(len(rep_dict)):
        for j in range(len(rep_dict)):
            if i == j:
                continue  # skip comparing a representation with itself
            # dot product of vector1 and vector2
            numerator = np.dot(rep_dict[keys[i]], rep_dict[keys[j]])
            # the product of the norms of the two vectors
            denominator = np.linalg.norm(rep_dict[keys[i]]) * np.linalg.norm(rep_dict[keys[j]])

            out_dict[f'{keys[i]} & {keys[j]}'] = round(numerator / denominator, 4)

    return out_dict

In [39]:
similarity_dict = similarity(word_feature_dict)
for k,v in similarity_dict.items():
  print(f'{k}: {v}')

good_s:0 & good_s:1: 0.9825999736785889
good_s:0 & good_s:2: 0.9071999788284302
good_s:0 & bad_s:3: 0.8004999756813049
good_s:1 & good_s:0: 0.9825999736785889
good_s:1 & good_s:2: 0.9132000207901001
good_s:1 & bad_s:3: 0.8098000288009644
good_s:2 & good_s:0: 0.9071999788284302
good_s:2 & good_s:1: 0.9132000207901001
good_s:2 & bad_s:3: 0.8065000176429749
bad_s:3 & good_s:0: 0.8004999756813049
bad_s:3 & good_s:1: 0.8098000288009644
bad_s:3 & good_s:2: 0.8065000176429749
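
Because cosine similarity is symmetric, every pair appears twice in the output above. If only the unique pairs are wanted, one option is to normalise all vectors once and read the upper triangle of the resulting similarity matrix. The following is a minimal sketch of that idea, not part of the exercise solution; the function name `pairwise_cosine` and the 2-D toy vectors are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np


def pairwise_cosine(rep_dict):
    """Cosine similarity for each unordered pair of keys in rep_dict."""
    keys = list(rep_dict)
    # Stack the vectors into one (n, d) matrix and scale each row to unit length.
    matrix = np.stack([np.asarray(rep_dict[k], dtype=np.float64) for k in keys])
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T  # sims[i, j] is the cosine similarity of rows i and j
    return {
        f"{keys[i]} & {keys[j]}": round(float(sims[i, j]), 4)
        for i, j in combinations(range(len(keys)), 2)
    }


# Toy 2-D vectors standing in for the 768-dimensional representations.
toy_reps = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([1.0, 1.0]),
    "c": np.array([0.0, 2.0]),
}
print(pairwise_cosine(toy_reps))
```

Converting each value with `float(...)` before rounding yields plain Python floats, which print with exactly the rounded digits rather than with trailing float32 digits.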


# Exercise 3
1. In the results of `Exercise 2`, why does the representation of `good` have a high cosine similarity between sentence 0 and sentence 1, while the similarity of `good` in either of those sentences to `good` in sentence 2 is lower?

2. How do the representations obtained here differ from the token representations of GloVe?



In [None]:
1. `good` in sentences 0 and 1 has a very similar meaning, because the two contexts are nearly identical, while the meaning of `good` in sentence 2 is slightly different.
Hence the cosine similarity measures the similarity of the term in its context, not of the term itself.

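2. GloVe assigns each word a single static vector that is the same in every sentence, so `good` would have one representation regardless of context. The representations obtained here are contextual: the vector for a token is computed from the whole sentence, which is why the same word gets different vectors in different sentences.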
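
To make the contrast in question 2 concrete, the sketch below compares a static lookup table with a toy context-dependent function. Everything in it is an assumption made for illustration: the 2-D vectors are invented (they are not real GloVe vectors), and `toy_contextual` only mimics context dependence by averaging; it is not how DistilBERT computes its representations.

```python
import numpy as np

# Hypothetical 2-D static embeddings, invented for illustration (not real GloVe vectors).
static_table = {
    "the": np.array([0.1, 0.1]),
    "river": np.array([0.9, 0.0]),
    "bank": np.array([0.5, 0.5]),
    "money": np.array([0.0, 0.9]),
}


def static_lookup(sentence, word):
    """A static model ignores the sentence: the vector depends on the word alone."""
    return static_table[word]


def toy_contextual(sentence, word):
    """Toy stand-in for a contextual encoder: mix the word with its sentence average.

    This is NOT how DistilBERT works; it only mimics context dependence.
    """
    context = np.mean([static_table[w] for w in sentence.split()], axis=0)
    return 0.5 * static_table[word] + 0.5 * context


s1, s2 = "the river bank", "the bank money"
same_static = np.array_equal(static_lookup(s1, "bank"), static_lookup(s2, "bank"))
same_contextual = np.allclose(toy_contextual(s1, "bank"), toy_contextual(s2, "bank"))
print(same_static, same_contextual)
```

The static lookup returns the identical vector for `bank` in both sentences, whereas the toy contextual function returns different vectors, mirroring the `bank_s:0` and `bank_s:1` entries produced earlier.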