<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024s2/blob/main/session-4/contextual_embedding_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embedding

One of the main drawbacks of embeddings such as Word2Vec and GloVe is that they assign the same embedding to a word regardless of its meaning in a particular context. For example, the word `rock` in `The rock concert is being held at national stadium` has a very different meaning from the same word in `The naughty boy throws a rock at the dog`.

Contextual embeddings, such as those produced by transformers (on which modern large language models are based), take the context of the word into account, so a different embedding is generated for the same word depending on its context.
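
To make the limitation of static embeddings concrete, here is a minimal sketch of a context-free lookup. The table `static_embeddings`, its values, and the helper `embed_static` are made up for illustration and are not real Word2Vec or GloVe vectors.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_embeddings = {
    "rock": [0.12, -0.48, 0.33],
    "concert": [0.90, 0.10, -0.25],
    "dog": [-0.70, 0.22, 0.41],
}


def embed_static(sentence):
    """Look up each known word; the sentence context is never consulted."""
    words = sentence.lower().replace(".", "").split()
    return {w: static_embeddings[w] for w in words if w in static_embeddings}


music = embed_static("The rock concert is being held at national stadium.")
stone = embed_static("The naughty boy throws a rock at the dog.")

# The vector for "rock" is identical in both sentences.
same_vector = music["rock"] == stone["rock"]
print(same_vector)
```

Because the lookup depends only on the word itself, both senses of `rock` collapse into a single vector. A contextual model avoids this by computing the vector from the whole sentence.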

## Install Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [None]:
%%capture
!pip install transformers
!pip install datasets

Let's try to generate some embeddings using one of the transformer models, DistilBERT (`distilbert-base-uncased`).

In [None]:
from transformers import TFAutoModel, AutoTokenizer
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Load a language model
model = TFAutoModel.from_pretrained("distilbert-base-uncased")
# Tokenize the sentence
tokens = tokenizer('The rock concert is being held at national stadium.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

We will pass the tokens through the model to generate embeddings, and take the embeddings produced by the last layer (the first element of the model output).

In [None]:
# Process the tokens
embeddings_1 = model(**tokens)[0]
print(embeddings_1)

**Questions**

1. What is the shape of the embeddings?
2. Why does it have this shape?

Let's try to find the embedding of the token 'rock' used here.

In [None]:
embedding_rock1 = embeddings_1[0][2]
print(embedding_rock1)
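
The index `2` is hardcoded above: the tokenizer adds a special `[CLS]` token at position 0, so `the` sits at position 1 and `rock` at position 2. As a hedged alternative, the position can be looked up from the token strings. The helper `find_token_index` below is hypothetical, and the token list is an illustrative stand-in for what `tokenizer.convert_ids_to_tokens(...)` would be expected to return for the first sentence.

```python
def find_token_index(token_strings, target):
    """Return the position of the first token equal to `target`."""
    try:
        return token_strings.index(target)
    except ValueError:
        raise ValueError(f"{target!r} not found in {token_strings}") from None


# Illustrative token strings for the first sentence, including the special
# [CLS] and [SEP] tokens that a BERT-style tokenizer adds.
example_tokens = ['[CLS]', 'the', 'rock', 'concert', 'is', 'being', 'held',
                  'at', 'national', 'stadium', '.', '[SEP]']

rock_index = find_token_index(example_tokens, 'rock')
print(rock_index)
```

This simple lookup only works when the word is kept as a single token. A word that the tokenizer splits into several word pieces would need its pieces located and combined instead.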

Now write code to find the embeddings of the word `rock` as used in the sentences `The naughty boy throws a rock at the dog.` and `A big rock falls from the slope after heavy rain.`.


<details>
<summary>Click here for answer</summary>

```
tokens = tokenizer('The naughty boy throws a rock at the dog.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_2 = model(**tokens)[0]
embedding_rock2 = embeddings_2[0][6]
```

</details>

In [None]:
# Write code to extract embedding of rock for sentence "The naughty boy throws a rock at the dog."
# store the embedding as embedding_rock2

embedding_rock2 = None

<details>
<summary>Click here for answer</summary>

```
tokens = tokenizer('A big rock falls from the slope after heavy rain.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_3 = model(**tokens)[0]
embedding_rock3 = embeddings_3[0][3]
```

</details>

In [None]:
# Write code to extract embedding of rock for sentence "A big rock falls from the slope after heavy rain."
# store the embedding as embedding_rock3

embedding_rock3 = None

Let's compute how similar the embeddings are to each other.

In [None]:
from keras.losses import CosineSimilarity

cos = CosineSimilarity(axis=0)
similarity1 = cos(embedding_rock1, embedding_rock2)
# CosineSimilarity is a loss and returns the negative cosine similarity, so negate it
print(-similarity1)

similarity2 = cos(embedding_rock2, embedding_rock3)
print(-similarity2)



We can see that `embedding_rock2` is more similar to `embedding_rock3` than to `embedding_rock1`.
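
The Keras `CosineSimilarity` class is a loss, so it returns the negative of the cosine similarity (minimizing the loss then maximizes similarity). That is why the cell above negates the result. The quantity itself can be sketched in plain Python as follows; the short vectors are made up and merely stand in for the much longer real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: the two "stone" vectors point in the same direction,
# while the "music" vector is orthogonal to both.
v_music = [1.0, 0.0, 0.0]
v_stone_a = [0.0, 1.0, 1.0]
v_stone_b = [0.0, 2.0, 2.0]

sim_same_direction = cosine_similarity(v_stone_a, v_stone_b)
sim_orthogonal = cosine_similarity(v_music, v_stone_a)
print(sim_same_direction, sim_orthogonal)
```

Cosine similarity depends only on direction, not on vector length: values near 1 indicate similar usage, while values near 0 indicate unrelated usage.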

## Train Text Classification Model with DistilBERT Embeddings

In the previous lab, we trained a text classification model using pretrained context-free GloVe embeddings.

In this exercise, we will replace those embeddings with embeddings produced by the DistilBERT model and compare the performance.

### Create the dataset

Instead of using 10000 samples as before, we will use just 2000 samples for training: we sample 2500 reviews and hold out 20% of them (500) for validation.

In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np


# Download the datasets.
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

In [None]:
TRAIN_SIZE = 2500
TEST_SIZE = 500
BATCH_SIZE = 2

train_df = train_df.sample(n=TRAIN_SIZE, random_state=128)
test_df = test_df.sample(n=TEST_SIZE, random_state=128)

# convert the text label to numeric label
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [None]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=128)

In [None]:
train_texts = train_df['review'].to_list()
train_labels = train_df['sentiment'].to_list()
val_texts = val_df['review'].to_list()
val_labels = val_df['sentiment'].to_list()
test_texts = test_df['review'].to_list()
test_labels = test_df['sentiment'].to_list()

In [None]:
len(train_texts)

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased". This is the same as in the other lab exercise.

In [None]:
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFAutoModel.from_pretrained('distilbert-base-uncased')

In [None]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)



In [None]:
batch_size = 32

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

In [None]:
def extract_features(dataset):

    embeddings = []
    labels = []

    for encoding, label in dataset:
        output = model(encoding)
        sentence_embedding = tf.reduce_mean(output[0], axis=1).numpy()
        embeddings.append(sentence_embedding)
        labels.append(label)

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)

    return embeddings, labels

In [None]:
X_train, y_train = extract_features(train_dataset)
X_val, y_val = extract_features(val_dataset)
X_test, y_test = extract_features(test_dataset)
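
Note that `extract_features` averages over every position in the sequence, so the vectors at padding positions are included in the mean. A common alternative, shown here only as a hedged sketch and not as what this notebook does, is to average only the positions where the attention mask is 1. The helper `masked_mean_pool` is hypothetical and uses plain Python lists in place of tensors.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors whose mask entry is 1, skipping padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in totals]


# Three positions with 2-dimensional vectors; the last position is padding.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(vectors, mask)
print(pooled)
```

With the mask applied, the padding vector does not distort the sentence embedding. Whether this improves the downstream classifier would need to be checked experimentally.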

In the tokenization step above, we tokenized each text string, padded it to the longest sequence in the list passed to the tokenizer (`padding=True`), and truncated any sequence that exceeds the maximum length allowed by the model (`truncation=True`; in BERT's case, the limit is 512 tokens).
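
The effect of padding and truncation can be sketched on plain lists of token IDs. This is a simplified illustration, not the tokenizer's actual implementation: the helper `pad_and_truncate`, the IDs, and `pad_id=0` are assumptions, and a real tokenizer keeps the closing `[SEP]` token when truncating, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return input_ids, attention_mask


# Made-up token IDs for three texts of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
padded_ids, padded_mask = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(padded_ids, padded_mask):
    print(ids, mask)
```

After this step every sequence has the same length, which is what allows the encodings to be stacked into a single tensor. The attention mask records which positions hold real tokens (1) and which are padding (0).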

## Train a classifier using the extracted features (embeddings)

In [None]:
model = tf.keras.Sequential([
    # tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])

In [None]:
import os
import time

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():
    # Use a new, timestamped directory for each run
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=run_logdir)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="bert_best.weights.h5",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)
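
With `save_best_only=True`, `monitor='val_accuracy'` and `mode='max'`, the checkpoint callback overwrites the saved weights only when the validation accuracy improves on the best value seen so far. The selection logic can be sketched roughly as follows; the helper `best_epoch` is hypothetical and the accuracy values are made up.

```python
def best_epoch(val_accuracies):
    """Return (epoch, accuracy) for the first epoch with the highest accuracy."""
    best_index, best_value = None, float("-inf")
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best_value:  # only a strict improvement triggers a save
            best_index, best_value = epoch, value
    return best_index, best_value


# Made-up validation accuracies, one per epoch.
history = [0.71, 0.80, 0.84, 0.83, 0.84, 0.82]
saved_epoch, saved_accuracy = best_epoch(history)
print(saved_epoch, saved_accuracy)
```

The weights left on disk at the end of training therefore come from the best validation epoch, not necessarily the last one. This is why the weights are reloaded before evaluating on the test set.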


In [None]:
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=30,
    callbacks=[tensorboard_callback, model_checkpoint_callback])

We should get a validation accuracy of around 86%, which is quite good considering we are training with only 2000 samples!

Let's evaluate it on the test set.

In [None]:
model.load_weights("bert_best.weights.h5")
model.evaluate(X_test, y_test)
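
The accuracy reported by `model.evaluate` is computed from the sigmoid outputs of the final layer: a probability above a threshold (0.5 by default in Keras binary accuracy) counts as a positive prediction. A minimal sketch of that computation follows; the helper `binary_accuracy` is hypothetical and the probabilities and labels are made up.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up sigmoid outputs and true sentiment labels (1 = positive).
probs = [0.91, 0.12, 0.55, 0.40]
true_labels = [1, 0, 0, 1]

acc = binary_accuracy(probs, true_labels)
print(acc)
```

Here two of the four predictions match their labels.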

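## Using Sentence Transformer

Instead of averaging token embeddings ourselves, we can use a Sentence Transformer model, which is trained to produce a single fixed-size embedding for a whole sentence or paragraph.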



In [None]:
%%capture
!pip install sentence_transformers

In [None]:
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)


In [None]:
train_df.head()

In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings (pass plain Python lists of strings to encode)
train_embeddings = model.encode(train_df["review"].to_list(), show_progress_bar=True)
val_embeddings = model.encode(val_df["review"].to_list(), show_progress_bar=True)
test_embeddings = model.encode(test_df["review"].to_list(), show_progress_bar=True)

In [None]:
train_embeddings.shape

In [None]:
model = tf.keras.Sequential([
    # tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])

In [None]:
import os
import time

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():
    # Use a new, timestamped directory for each run
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=run_logdir)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="sentence_transformer.weights.h5",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


In [None]:
train_df.head()

In [None]:
model.fit(
    train_embeddings, train_df['sentiment'],
    validation_data=(val_embeddings, val_df['sentiment']),
    batch_size=32,
    epochs=30,
    callbacks=[tensorboard_callback, model_checkpoint_callback])

In [None]:
model.load_weights("sentence_transformer.weights.h5")
model.evaluate(test_embeddings, test_df['sentiment'])