<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024S2/blob/main/contextual_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embedding

One of the main drawbacks of static embeddings such as Word2Vec and GloVe is that they produce the same embedding for a word regardless of its meaning in a particular context. For example, the word `rock` in `The rock concert is being held at national stadium` has a very different meaning from `rock` in `The naughty boy throws a rock at the dog`.

Contextual embeddings, such as those produced by transformers (on which modern large language models are based), take the context of a word into account, so a different embedding is generated for the same word depending on its context.
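
To make the limitation of static embeddings concrete, here is a minimal sketch of a static lookup table. The vectors and the helper name `embed_static` are made up for illustration and are not taken from Word2Vec or GloVe; the point is only that the sentence plays no role in the lookup.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_embeddings = {
    "rock": [0.12, -0.48, 0.33],
    "concert": [0.71, 0.05, -0.22],
    "dog": [-0.30, 0.64, 0.10],
}


def embed_static(sentence, word):
    """Look up `word` in the table; the rest of the sentence is ignored."""
    if word not in sentence.lower().split():
        raise ValueError(f"{word!r} does not occur in the sentence")
    return static_embeddings[word]


music = embed_static("The rock concert is being held at national stadium", "rock")
stone = embed_static("The naughty boy throws a rock at the dog", "rock")

# Both senses of "rock" get exactly the same vector.
print(music == stone)  # True
```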

## Install Hugging Face Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [1]:
%%capture
!pip install transformers

Let's try to generate some embeddings using one of the transformer models, DeBERTa.

In [2]:
from transformers import AutoModel, AutoTokenizer
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")


In [4]:
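# Load a language model.
# Note: the tokenizer above was loaded from "microsoft/deberta-base". Tokenizer and model
# should normally come from the same checkpoint so that token IDs match the model's vocabulary.
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")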
# Tokenize the sentence
tokens = tokenizer('The rock concert is being held at national stadium.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

{'input_ids': tensor([[   1,  133, 3152, 4192,   16,  145,  547,   23,  632, 4773,    4,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
The
 rock
 concert
 is
 being
 held
 at
 national
 stadium
.
[SEP]


We will pass the tokens through the model to generate embeddings.  We will take the embedding produced by the last layer.

In [5]:
# Process the tokens
embeddings_1 = model(**tokens)[0]
print(embeddings_1)

tensor([[[-3.3300, -0.0213,  0.1184,  ..., -0.2487, -0.2351,  0.3524],
         [-0.1233, -0.1571,  0.7768,  ..., -0.3475,  0.5131, -0.0671],
         [-0.4908,  0.1666,  0.5813,  ..., -0.4292, -0.4206, -0.4516],
         ...,
         [-0.0890,  0.7978,  0.7153,  ...,  0.2171,  0.2045,  0.2699],
         [-1.3818,  0.4259,  0.5570,  ...,  0.6608,  0.6024, -0.3742],
         [-3.2719,  0.1538,  0.1215,  ..., -0.4146, -0.2362,  0.5298]]],
       grad_fn=<NativeLayerNormBackward0>)


**Questions**

1. What is the shape of the embeddings?
2. Why is the shape as such?

Let's try to find the embedding of the token 'rock' used here.

In [None]:
embedding_rock1 = embeddings_1[0][2]
print(embedding_rock1)
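
The index `2` used above was read off the printed token list. As a minimal sketch, the position can also be looked up programmatically. The helper `find_token_index` is hypothetical (not part of the Transformers library), and the token list is copied from the decoded output printed earlier; it assumes the word maps to a single token.

```python
def find_token_index(decoded_tokens, word):
    """Return the position of the first token equal to `word`, ignoring surrounding spaces."""
    for index, token in enumerate(decoded_tokens):
        if token.strip() == word:
            return index
    raise ValueError(f"{word!r} not found as a single token")


# Decoded tokens as printed for the first sentence.
sentence_1_tokens = ["[CLS]", "The", " rock", " concert", " is", " being",
                     " held", " at", " national", " stadium", ".", "[SEP]"]

rock_index = find_token_index(sentence_1_tokens, "rock")
print(rock_index)  # 2
```

With a live tokenizer, the same list could be built from the decode loop shown earlier, for example by collecting `tokenizer.decode(token)` for each id in `tokens['input_ids'][0]`.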

Now write code to find the embeddings of the word `rock` as used in the sentences `The naughty boy throws a rock at the dog.` and `The boy throws the rock into the drain.`


In [None]:
tokens = tokenizer('The naughty boy throws a rock at the dog.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_2 = model(**tokens)[0]
embedding_rock2 = embeddings_2[0][6]
print(embedding_rock2)

In [15]:
tokens = tokenizer('The boy throws the rock into the drain.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_3 = model(**tokens)[0]
embedding_rock3 = embeddings_3[0][5]
print(embedding_rock3)

{'input_ids': tensor([[    1,   133,  2143,  6989,     5,  3152,    88,     5, 15160,     4,
             2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
The
 boy
 throws
 the
 rock
 into
 the
 drain
.
[SEP]
tensor([-2.3196e-01,  2.1071e-01,  6.2342e-01, -2.7786e+00,  5.1574e-02,
         3.1599e-01,  7.5339e-02, -9.4382e-01,  1.3343e+00, -8.2008e-01,
        -2.5001e-01, -1.7147e-01,  1.9489e-01,  1.1405e-01, -2.6709e-01,
        -5.0175e-01,  2.8915e-02, -9.0471e-01,  6.7788e-01,  1.9722e+00,
        -3.1726e-01, -5.2483e-01,  6.1962e-01,  3.2000e+00,  1.1604e+00,
         3.0195e-01,  1.1973e-01,  1.5559e+00, -1.0279e+00,  2.4765e+00,
         1.9272e-01, -3.0063e-02,  5.6097e-01, -2.6215e+00,  2.0704e-01,
        -6.0283e-01,  8.3411e-01, -3.4919e-01,  4.9485e-01, -1.0404e-01,
        -9.4652e-02, -6.2800e-01, -1.8027e+00, -9.5371e-01,  5.2433e-01,
         3.0184e-01, -4.3497e-01,  1.0706e-01

Let's compute how similar the embeddings are to each other.

In [16]:
import torch

cos = torch.nn.CosineSimilarity(dim=0)
similarity1 = cos(embedding_rock1, embedding_rock2)
print(similarity1)

similarity2 = cos(embedding_rock2, embedding_rock3)
print(similarity2)



tensor(0.7695, grad_fn=<SumBackward1>)
tensor(0.8779, grad_fn=<SumBackward1>)


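We can see that `embedding_rock2` is more similar to `embedding_rock3` (0.8779) than to `embedding_rock1` (0.7695). This is what we expect, since `rock` refers to a stone in the second and third sentences but to a music genre in the first.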
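
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it on plain Python lists to show the arithmetic. It is not PyTorch's implementation (which also guards against division by zero with a small epsilon), and the function name is our own.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
```

A value close to 1 means the two embeddings point in nearly the same direction, regardless of their magnitudes.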

In [None]:
import pandas as pd

# Download the datasets.
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

The train set has 40000 samples. We will use a small subset (e.g. 2000 samples) for fine-tuning our pretrained model. Similarly, we will use a smaller test set for evaluating our model. We use the DataFrame's `sample()` method to randomly select a subset of samples.

In [None]:
TRAIN_SIZE = 2000
TEST_SIZE = 200

train_df = train_df.sample(n=TRAIN_SIZE, random_state=128)
test_df = test_df.sample(n=TEST_SIZE, random_state=128)

In [None]:
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [None]:
train_texts = train_df['review'].to_list()
train_labels = train_df['sentiment'].to_list()
test_texts = test_df['review'].to_list()
test_labels = test_df['sentiment'].to_list()

In [None]:
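from sklearn.model_selection import train_test_split

# Hold out 20% of the sampled training data for validation.
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)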
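
With 2000 sampled reviews and `test_size=.2`, the split leaves 1600 reviews for training and 400 for validation. The sketch below illustrates the idea of a shuffled hold-out split in plain Python. It is not scikit-learn's implementation; `shuffle_split` and the dummy data are hypothetical.

```python
import random


def shuffle_split(items, labels, holdout_fraction, seed=0):
    """Shuffle indices once, then carve off a hold-out portion of the given size."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_holdout = round(len(items) * holdout_fraction)
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    # Same return order as train_test_split: train items, held-out items, then the labels.
    return (pick(items, train_idx), pick(items, holdout_idx),
            pick(labels, train_idx), pick(labels, holdout_idx))


dummy_texts = [f"review {i}" for i in range(2000)]
dummy_labels = [i % 2 for i in range(2000)]

tr_x, va_x, tr_y, va_y = shuffle_split(dummy_texts, dummy_labels, 0.2)
print(len(tr_x), len(va_x))  # 1600 400
```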

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model `distilbert-base-uncased`. This is the same as in the other lab exercise.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

Here we tokenize the texts, pad each one to the length of the longest sequence in the batch, and truncate any sequence that exceeds the maximum length allowed by the model (512 tokens in BERT's case).

In [None]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
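
To show what `padding=True` and `truncation=True` do to the token ids, here is a simplified sketch on plain lists. The function `pad_and_truncate` is hypothetical and the ids are illustrative. Unlike the real tokenizer, it simply cuts long sequences and does not keep the closing special token.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut sequences to max_length, then pad all of them to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 7, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch["input_ids"])
# [[101, 7, 8, 102, 0], [101, 7, 102, 0, 0], [101, 1, 2, 3, 4]]
print(batch["attention_mask"])
# [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Every sequence in the result has the same length, which is what allows the encodings to be stacked into a single tensor.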

In [None]:
train_encodings[1]

We will create a TensorFlow dataset and use its efficient batching later to obtain the embeddings.

In [None]:
BATCH_SIZE = 16

In [None]:
import tensorflow as tf

batch_size = BATCH_SIZE

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)
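
As a rough illustration of what `.batch(batch_size)` produces, the sketch below groups a plain list into consecutive chunks. It is not how `tf.data` is implemented; the generator `batches` is hypothetical, and the sizes assume the 1600/400/200 split used in this notebook.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


train_batches = list(batches(list(range(1600)), 16))
test_batches = list(batches(list(range(200)), 16))

print(len(train_batches))     # 100
print(len(test_batches))      # 13
print(len(test_batches[-1]))  # 8
```

Note that the last batch can be smaller than the others when the number of samples is not a multiple of the batch size.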

## Feature Extraction using (Distil)BERT

Here we load the pretrained `distilbert-base-uncased` model and use it to extract features (i.e. embeddings) from the text. We specify `output_hidden_states=True` so that the model returns the output of every layer, not just the last one.

In [None]:
from transformers import TFAutoModel

model = TFAutoModel.from_pretrained("distilbert-base-uncased",output_hidden_states=True)

The model produces two outputs. The first, `output[0]`, has shape `(16, 512, 768)` and is the output of the last hidden layer. The second, `output[1]`, is a tuple of 7 tensors, each of shape `(16, 512, 768)`: the output of the embedding layer followed by the output of each of the 6 transformer layers. Here 16 is the batch size, 512 is the padded sequence length and 768 is the hidden size.
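
The indexing into `output[1]` can be pictured with placeholder labels instead of tensors. This is only a sketch of the ordering (embedding output first, then one entry per transformer layer); the strings are stand-ins, not real model outputs.

```python
NUM_LAYERS = 6  # DistilBERT has 6 transformer layers

# Stand-in for output[1]: index 0 is the embedding layer, indices 1..6 are the layers.
hidden_states = ["embeddings"] + [f"layer_{i}" for i in range(1, NUM_LAYERS + 1)]

print(len(hidden_states))  # 7
print(hidden_states[-1])   # layer_6 (the last layer)
print(hidden_states[-2])   # layer_5 (the second-last layer)
print(hidden_states[0])    # embeddings
```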

In [None]:
import numpy as np

def extract_features(dataset):

    embeddings = []
    labels = []

    for encoding, label in dataset:
        output = model(encoding)
        hidden_states = output[1]
        # Here we take the output of the second-last transformer layer as our embeddings.
        # We average the embeddings over all token positions (axis=1, padding included)
        # to generate one sentence embedding per review.
        sentence_embedding = tf.reduce_mean(hidden_states[-2], axis=1).numpy()
        embeddings.append(sentence_embedding)
        labels.append(label)

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)

    return embeddings, labels
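
The averaging in `extract_features` runs over every token position, including padding. One common alternative, sketched below on plain lists, is to average only the positions where the attention mask is 1. This is a variation, not what the notebook's code does; `masked_mean` and the toy vectors are hypothetical.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the vectors of one sequence, counting only positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    hidden_size = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]


# Two real tokens followed by one padding position (hidden size 2 for readability).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

print(masked_mean(token_vectors, attention_mask))  # [2.0, 3.0]
# Treating every position as real pulls the average towards the padding vector.
print(masked_mean(token_vectors, [1, 1, 1]))
```

In TensorFlow the same idea would multiply the hidden states by the mask before summing and divide by the number of real tokens per sequence.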

In [None]:
X_train, y_train = extract_features(train_dataset)
X_val, y_val = extract_features(val_dataset)
X_test, y_test = extract_features(test_dataset)

## Train a classifier using the extracted features (embeddings)

In [None]:
X_train.shape

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [None]:
print(f'train score : {clf.score(X_train, y_train)}')
print(f'validation score : {clf.score(X_val, y_val)}')
print(f'test score : {clf.score(X_test, y_test)}')

We should get validation and test accuracy scores of around 86% to 87%, which is quite good considering that we are training with only 2000 samples!
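
For a classifier, `clf.score` reports the mean accuracy, i.e. the fraction of predictions that match the true labels. A minimal sketch of that calculation, using made-up labels and a hypothetical helper name:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the corresponding true label."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Made-up labels: 3 of the 4 predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```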

**Exercise**

1. Modify the code to use the hidden states from a different transformer layer as features, or take the average of the hidden states from a few layers as features.
2. Modify the code to use the BERT model and see if it performs better than DistilBERT. For the BERT model, the outputs of the different layers are in `output[2]`.