<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024S2/blob/main/session-4/contextual_embedding_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embedding

One of the main drawbacks of embeddings such as Word2Vec and GloVE are that they have the same embedding for the same word regardless of its meaning in a particular context. For example, the word `rock` in `The rock concert is being held at national stadium` have a very different meaning in `The naughty boy throws a rock at the dog`.

Contextual embedding such as those produced by transformers (where the modern-day large language are based on) took into account the context of the word, and different embedding is generated for the same word depending on the context.

## Install Hugging Face Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [1]:
%%capture
!pip install transformers
!pip install datasets

Let's try to generate some embeddings using one of the transformer model `deberta`.

In [2]:
from transformers import AutoModel, AutoTokenizer
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



In [5]:
# Load a language model
model = AutoModel.from_pretrained("distilbert-base-uncased")
# Tokenize the sentence
tokens = tokenizer('The rock concert is being held at national stadium.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

{'input_ids': tensor([[ 101, 1996, 2600, 4164, 2003, 2108, 2218, 2012, 2120, 3346, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
the
rock
concert
is
being
held
at
national
stadium
.
[SEP]


We will pass the tokens through the model to generate embeddings.  We will take the embedding produced by the last layer.

In [6]:
# Process the tokens
embeddings_1 = model(**tokens)[0]
print(embeddings_1)

tensor([[[-0.1179, -0.2114,  0.1210,  ..., -0.0858,  0.3062,  0.1480],
         [-0.2119, -0.4472,  0.0427,  ...,  0.2377,  0.5819, -0.5535],
         [-0.2656, -0.3143,  0.3377,  ...,  0.1910,  0.2830, -0.6662],
         ...,
         [ 0.7404, -0.0133,  0.1456,  ..., -0.2208,  0.1618,  0.2280],
         [ 0.5188,  0.1038, -0.2805,  ...,  0.2582, -0.4593, -0.5703],
         [-0.1118,  0.2719,  0.2146,  ..., -0.0221, -0.1454, -0.4274]]],
       grad_fn=<NativeLayerNormBackward0>)


**Questions**

1. What is the shape of the embeddings?
2. Why is the shape is such?

Let's try to find the embedding of the token 'rock' used here.

In [7]:
embedding_rock1 = embeddings_1[0][2]
print(embedding_rock1)

tensor([-2.6564e-01, -3.1427e-01,  3.3765e-01,  1.5091e-01,  9.7145e-03,
        -2.5332e-01,  3.5789e-01,  3.5013e-02,  3.5348e-03, -1.0108e-01,
        -2.3330e-01, -4.9488e-01,  6.2507e-03,  4.0436e-01, -3.4843e-01,
         6.5464e-01, -2.9584e-01,  1.1097e-01,  2.7702e-01,  3.9513e-01,
         7.4761e-02,  3.3236e-01,  2.6188e-01,  2.1825e-01,  6.0295e-01,
        -3.6744e-01, -7.5139e-02,  7.9165e-01,  1.5885e-01, -2.7578e-01,
         1.3927e-01,  2.3892e-01,  4.7009e-02,  4.8528e-01, -2.6793e-01,
         2.6853e-01,  1.1397e-01,  1.8382e-01,  2.9557e-01, -2.5852e-01,
         8.2411e-02, -8.8077e-01,  7.1996e-01,  5.3384e-02, -7.4675e-02,
        -3.3500e-01,  5.6863e-01,  4.0533e-01,  1.0933e-01, -1.9471e-01,
         7.1417e-01,  9.2324e-01,  2.4231e-03,  8.5712e-02,  2.5317e-01,
         2.7494e-01, -2.8317e-01, -3.6612e-02, -1.5650e-01,  9.9709e-02,
         1.0116e-01,  3.5243e-01, -4.6183e-02, -5.5556e-01,  2.2894e-02,
         4.5337e-01,  5.1713e-01,  1.8168e-01, -1.8

Now write codes to find the embeddings of the word `rock` as used in the sentence `The naughty boy throws a rock at the dog.` and `The boy throws the rock into the drain`.


In [9]:
tokens = tokenizer('The naughty boy throws a rock at the dog.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_2 = model(**tokens)[0]
embedding_rock2 = embeddings_2[0][6]
print(embedding_rock2)

{'input_ids': tensor([[  101,  1996, 20355,  2879, 11618,  1037,  2600,  2012,  1996,  3899,
          1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
the
naughty
boy
throws
a
rock
at
the
dog
.
[SEP]
tensor([-1.8449e-02,  1.4041e-01, -5.9298e-01, -1.9007e-01,  2.9867e-01,
        -5.6538e-01,  2.2924e-01,  4.2855e-01, -1.6997e-01,  4.2268e-01,
        -2.0346e-01, -2.4299e-01, -1.5935e-01,  5.5291e-01, -5.0368e-01,
         1.6519e-01, -6.3577e-04,  1.9181e-02, -2.2280e-01,  4.9356e-01,
        -1.0993e-01, -9.2604e-02, -3.2445e-01, -1.6764e-01,  1.0712e+00,
         9.1160e-02,  1.6012e-01,  7.1009e-01, -1.3440e-01,  4.4889e-02,
         4.8974e-02,  3.6020e-01,  3.2765e-01, -3.6109e-01, -2.4887e-01,
         1.2486e-02,  2.9745e-01,  2.0511e-01,  4.8202e-03, -4.5461e-01,
         1.0606e-01, -1.1665e-01,  2.1970e-01, -1.3617e-01, -6.2869e-02,
        -4.7899e-02,  5.2956e-01,  1.2373e-02,  4.2364e-01,  6.3077e-02,
        -7.4303e-02,  8.8332e-

In [10]:
tokens = tokenizer('A big rock falls from the slope after heavy rain.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_3 = model(**tokens)[0]
embedding_rock3 = embeddings_3[0][3]
print(embedding_rock3)

{'input_ids': tensor([[ 101, 1037, 2502, 2600, 4212, 2013, 1996, 9663, 2044, 3082, 4542, 1012,
          102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
a
big
rock
falls
from
the
slope
after
heavy
rain
.
[SEP]
tensor([ 7.0959e-02, -1.8239e-03, -3.5609e-02,  1.7368e-03,  5.7576e-01,
        -4.2885e-01,  3.6450e-01,  4.7624e-01, -5.3132e-02,  1.3941e-01,
         1.8148e-01, -5.5918e-01, -1.0324e-02,  5.3650e-01, -8.8018e-01,
        -1.2646e-02,  5.2722e-02, -1.2881e-01,  2.9228e-02,  6.2961e-01,
        -3.0681e-01, -1.7172e-01, -2.8808e-01,  1.6210e-02,  8.3863e-01,
        -7.0563e-02, -2.8046e-03,  8.3378e-01, -1.4187e-01,  1.0302e-01,
         3.3615e-01,  3.3300e-01,  1.7306e-02,  4.0775e-01, -4.1927e-02,
         2.9804e-01,  5.7630e-01,  3.9225e-02, -1.4017e-01, -1.6162e-01,
        -1.2510e-01, -2.1063e-02, -3.6838e-03,  2.7771e-01, -1.7232e-01,
         6.8470e-02,  5.5323e-01,  2.3179e-01,  7.3501e-02, -2.3209e-01,
        -1.7352e-01,  7.5

Let's compute how similar are the embeddings to each other

In [11]:
import torch

cos = torch.nn.CosineSimilarity(dim=0)
similarity1 = cos(embedding_rock1, embedding_rock2)
print(similarity1)

similarity2 = cos(embedding_rock2, embedding_rock3)
print(similarity2)



tensor(0.5968, grad_fn=<SumBackward1>)
tensor(0.8156, grad_fn=<SumBackward1>)


We can see that embedding_rock2 are more similar to embedding_rock3 than with embedding_rock1.

## Train Text Classification Model with DistilBert Embeddings

In the previous lab, we have trained a text classification model using pretrained context-free embeddings GloVE.

In this exercise, we will replace the embeddings with embeddings produced by DistilBERT model and compare the performance.

### Create the dataset

Instead of using 10000 samples as before, we will just use 2000 samples for training.

In [12]:
from datasets import load_dataset

# downloaded the datasets.
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

dataset = load_dataset('csv', data_files=train_data_url, split="train").shuffle().select(range(2500))

Downloading data:   0%|          | 0.00/53.0M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
dataset

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 2000
})

In [None]:
def process_dataset(sample):
    sample['sentiment'] = 0 if sample['sentiment'] == 'negative' else 1
    return sample

dataset = dataset.map(process_dataset)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train'].batch(batch_size=5)
test_dataset = dataset['test'].batch(batch_size=5)

Batching examples:   0%|          | 0/1600 [00:00<?, ? examples/s]

Batching examples:   0%|          | 0/400 [00:00<?, ? examples/s]

## Tokenization

We will now load the DistilBert tokenizer for the pretrained model "distillbert-base-uncased".  This is the same as the other lab exercise.

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased').to('cuda')



Here we will tokenize the text string, and pad the text string to the longest sequence in the batch, and also to truncate the sequence if it exceeds the maximum length allowed by the model (in BERT's case, it is 512).

In [None]:
import torch

train_X = []
train_y = []

for data in train_dataset:
    train_encodings = tokenizer(data['review'], padding="max_length", max_length=512, truncation=True, return_tensors='pt').to('cuda')
    train_labels = data['sentiment']
    # train_encodings.to('cuda')
    # train_labels.to('cuda')
    outputs = model(**train_encodings)
    mean = torch.mean(outputs[0], axis=1)
    train_X.append(mean.detach().cpu().numpy())
    train_y.append(train_labels)



In [None]:
import numpy as np

X_train = np.concatenate(train_X)
y_train = np.concatenate(train_y)
print(X_train.shape, y_train.shape)

(1600, 768) (1600,)


In [None]:
import torch

test_X = []
test_y = []

for data in test_dataset:
    test_encodings = tokenizer(data['review'], padding="max_length", max_length=512, truncation=True, return_tensors='pt').to('cuda')
    test_labels = data['sentiment']
    # train_encodings.to('cuda')
    # train_labels.to('cuda')
    outputs = model(**train_encodings)
    mean = torch.mean(outputs[0], axis=1)
    # test_X.append(outputs[0].detach().cpu().numpy())
    test_X.append(mean.detach().cpu().numpy())
    test_y.append(test_labels)



In [None]:
import numpy as np

X_test = np.concatenate(test_X)
y_test = np.concatenate(test_y)
print(X_test.shape, y_test.shape)

(400, 768) (400,)


In [None]:
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)
np.save('X_test.npy', X_test)
np.save('y_test.npy', y_test)

In [None]:
import numpy as np

X_train = np.load('X_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('X_test.npy')
y_test = np.load('y_test.npy')

## Train a classifier using the extracted features (embeddings)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [None]:
print(f'train score : {clf.score(X_train, y_train)}')
# print(f'validation score : {clf.score(X_test, y_test)}')
print(f'test score : {clf.score(X_test, y_test)}')

train score : 0.90875
test score : 0.5025


We should be getting an validation and accuracy score of around 86% to 87% which is quite good, considering we are training with only 2000 samples!

**Exercise**

1. Modify the code to use the hidden states from a different attention layer as features or take average of hidden states  from few layers as features.
2. Modify the code to use BERT model and see if it performs better than the DistilBERT. For BERT Model, the output of different layers are in `output[2]`