<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024S2/blob/main/session-4/contextual_embedding_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embedding

One of the main drawbacks of embeddings such as Word2Vec and GloVe is that they produce the same embedding for a word regardless of its meaning in a particular context. For example, the word `rock` in `The rock concert is being held at national stadium` has a very different meaning from the word `rock` in `The naughty boy throws a rock at the dog`.
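
To make the limitation concrete, here is a minimal sketch of a context-free lookup. The three-dimensional vectors and the `static_embed` helper are made up for illustration (they are not real Word2Vec or GloVe values); the point is only that the lookup never sees the surrounding words.

```python
import numpy as np

# Toy lookup table; the vectors are hypothetical, not real GloVe values.
static_table = {
    "rock": np.array([0.2, -0.5, 0.7]),
    "concert": np.array([0.9, 0.1, -0.3]),
    "dog": np.array([-0.4, 0.6, 0.2]),
}

def static_embed(sentence, word):
    """Return the vector for `word`; the sentence itself is ignored by design."""
    assert word in sentence.lower().split()
    return static_table[word]

v1 = static_embed("The rock concert is being held at national stadium", "rock")
v2 = static_embed("The naughty boy throws a rock at the dog", "rock")
print(np.array_equal(v1, v2))  # True
```

Both calls return the identical vector, even though the two sentences use `rock` in unrelated senses.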

Contextual embeddings, such as those produced by transformers (on which modern large language models are based), take the context of the word into account, so a different embedding is generated for the same word depending on its context.

## Install Hugging Face Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [1]:
%%capture
!pip install transformers
!pip install datasets

Let's try to generate some embeddings using one of the transformer models, `distilbert-base-uncased`.

In [2]:
from transformers import TFAutoModel, AutoTokenizer
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [3]:
# Load a language model
model = TFAutoModel.from_pretrained("distilbert-base-uncased")
# Tokenize the sentence
tokens = tokenizer('The rock concert is being held at national stadium.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


{'input_ids': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=
array([[ 101, 1996, 2600, 4164, 2003, 2108, 2218, 2012, 2120, 3346, 1012,
         102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
[CLS]
the
rock
concert
is
being
held
at
national
stadium
.
[SEP]


We will pass the tokens through the model to generate embeddings.  We will take the embedding produced by the last layer.

In [4]:
# Process the tokens
embeddings_1 = model(**tokens)[0]
print(embeddings_1)

tf.Tensor(
[[[-0.11788939 -0.2113927   0.12098115 ... -0.08584853  0.30624133
    0.1480256 ]
  [-0.2119007  -0.44722807  0.04267938 ...  0.2376779   0.58190066
   -0.55352634]
  [-0.26564342 -0.31427473  0.33765164 ...  0.19100845  0.2829538
   -0.6662007 ]
  ...
  [ 0.7403881  -0.0133046   0.14560933 ... -0.22076568  0.16180259
    0.22805092]
  [ 0.5187568   0.10377903 -0.28046018 ...  0.2582071  -0.45931965
   -0.57033604]
  [-0.11179692  0.27194744  0.21463615 ... -0.02214796 -0.14540192
   -0.42742828]]], shape=(1, 12, 768), dtype=float32)


**Questions**

1. What is the shape of the embeddings?
2. Why do the embeddings have this shape?

Let's try to find the embedding of the token 'rock' used here.

In [None]:
embedding_rock1 = embeddings_1[0][2]
print(embedding_rock1)

Now write code to find the embeddings of the word `rock` as used in the sentences `The naughty boy throws a rock at the dog.` and `A big rock falls from the slope after heavy rain.`


In [6]:
tokens = tokenizer('The naughty boy throws a rock at the dog.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_2 = model(**tokens)[0]
embedding_rock2 = embeddings_2[0][6]
print(embedding_rock2)

{'input_ids': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=
array([[  101,  1996, 20355,  2879, 11618,  1037,  2600,  2012,  1996,
         3899,  1012,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
[CLS]
the
naughty
boy
throws
a
rock
at
the
dog
.
[SEP]
tf.Tensor(
[-1.84490643e-02  1.40414923e-01 -5.92978776e-01 -1.90066323e-01
  2.98672557e-01 -5.65375507e-01  2.29240373e-01  4.28549826e-01
 -1.69967264e-01  4.22681034e-01 -2.03460813e-01 -2.42989957e-01
 -1.59345537e-01  5.52909553e-01 -5.03683090e-01  1.65194169e-01
 -6.35739416e-04  1.91807821e-02 -2.22798109e-01  4.93561566e-01
 -1.09925024e-01 -9.26045850e-02 -3.24452490e-01 -1.67636424e-01
  1.07115698e+00  9.11600143e-02  1.60117760e-01  7.10087717e-01
 -1.34398356e-01  4.48895246e-02  4.89737280e-02  3.60203445e-01
  3.27645361e-01 -3.61093700e-01 -2.48873815e-01  1.24855582e-02
  2.97447979e-01  2.05108792e-01  4.82026301e-

In [None]:
tokens = tokenizer('A big rock falls from the slope after heavy rain.', return_tensors='tf')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
embeddings_3 = model(**tokens)[0]
embedding_rock3 = embeddings_3[0][3]
print(embedding_rock3)
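
The position of `rock` is hard-coded in the cells above (2, 6 and 3). As a minimal sketch, the index can instead be looked up from the token list. The list below is copied from the printed output of the second sentence, and `find_token_index` is a hypothetical helper. With the real tokenizer, the token strings could be obtained with `tokenizer.convert_ids_to_tokens`. Note that this simple lookup only works when the word is kept as a single token; a word split into several word pieces would need extra handling.

```python
# Tokens as printed by the tokenizer for the second sentence above.
tokens_2 = ["[CLS]", "the", "naughty", "boy", "throws", "a",
            "rock", "at", "the", "dog", ".", "[SEP]"]

def find_token_index(tokens, word):
    """Return the position of the first token equal to `word`."""
    return tokens.index(word)

rock_index = find_token_index(tokens_2, "rock")
print(rock_index)  # 6
```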

Let's compute how similar the embeddings are to each other.

In [12]:
from keras.losses import CosineSimilarity

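# Keras' CosineSimilarity is a loss, so it returns the negative cosine similarity;
# negate the result to get the similarity itself.
cos = CosineSimilarity(axis=0)
similarity1 = cos(embedding_rock1, embedding_rock2)
print(-similarity1)

similarity2 = cos(embedding_rock2, embedding_rock3)
print(-similarity2)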



tf.Tensor(0.59675664, shape=(), dtype=float32)
tf.Tensor(0.81558204, shape=(), dtype=float32)


We can see that `embedding_rock2` is more similar to `embedding_rock3` (both refer to a stone) than to `embedding_rock1` (the music genre).
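
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The following NumPy sketch (the `cosine_similarity` helper is our own, not part of Keras) computes it directly, without the sign flip that the Keras loss applies.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))  # 1.0 0.0
```

Calling `cosine_similarity(embedding_rock1.numpy(), embedding_rock2.numpy())` should give the same value as the negated Keras result, up to floating-point precision.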

## Train Text Classification Model with DistilBERT Embeddings

In the previous lab, we trained a text classification model using pretrained context-free GloVe embeddings.

In this exercise, we will replace those embeddings with embeddings produced by the DistilBERT model and compare the performance.

### Create the dataset

Instead of using 10000 samples as before, we will sample 2500 reviews and hold out 20% of them for validation, leaving just 2000 samples for training.

In [99]:
import pandas as pd
import tensorflow as tf
import numpy as np


# Download the datasets.
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

In [100]:
TRAIN_SIZE = 2500
TEST_SIZE = 500
BATCH_SIZE = 2

train_df = train_df.sample(n=TRAIN_SIZE, random_state=128)
test_df = test_df.sample(n=TEST_SIZE, random_state=128)

# Convert the text label to a numeric label
train_df['sentiment'] = train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] = test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [101]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=128)

In [102]:
train_texts = train_df['review'].to_list()
train_labels = train_df['sentiment'].to_list()
val_texts = val_df['review'].to_list()
val_labels = val_df['sentiment'].to_list()
test_texts = test_df['review'].to_list()
test_labels = test_df['sentiment'].to_list()

In [103]:
len(train_texts)

2000

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased". This is the same as in the other lab exercise.

In [104]:
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFAutoModel.from_pretrained('distilbert-base-uncased')



In [105]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)

In [106]:
batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

In [107]:
for encoding, label in train_dataset:
    print(encoding)
    print(label)
    output = model(encoding)
    print(output[0])
    break

{'input_ids': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
array([[ 101, 1045, 2052, ...,    0,    0,    0],
       [ 101, 1996, 2190, ...,    0,    0,    0],
       [ 101, 2023, 2003, ...,    0,    0,    0],
       ...,
       [ 101, 2019, 5186, ...,    0,    0,    0],
       [ 101, 1000, 7114, ...,    0,    0,    0],
       [ 101, 1000, 4990, ...,    0,    0,    0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}
tf.Tensor([0 1 1 0 0 1 1 1 0 1 0 0 0 1 1 1], shape=(16,), dtype=int32)
tf.Tensor(
[[[ 0.20510897 -0.02048307  0.00196606 ... -0.13195124  0.56740284
    0.33686063]
  [ 0.7437506   0.01668045 -0.04088961 ...  0.15722656  0.77581644
    0.15504575]
  [ 0.24532312 -0.23797199  0.18643066 ... -0.04386552  0.53037524
   -0.1

In [108]:
def extract_features(dataset):

    embeddings = []
    labels = []

    for encoding, label in dataset:
        output = model(encoding)
        embeddings.append(output[0])
        labels.append(label)

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)

    return embeddings, labels
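
Because `extract_features` keeps one 768-dimensional vector for every token position, the resulting arrays are large. The back-of-the-envelope calculation below uses the shapes seen in the outputs above (2000 training reviews, 512 positions, 768 hidden units, `float32`); the comparison with one pooled vector per review is only an illustration of the trade-off, not something the notebook does at this stage.

```python
n_samples, seq_len, hidden_size = 2000, 512, 768
bytes_per_float32 = 4

# One vector per token position, as returned by extract_features.
full_bytes = n_samples * seq_len * hidden_size * bytes_per_float32
# One vector per review, e.g. after averaging over the sequence axis.
pooled_bytes = n_samples * hidden_size * bytes_per_float32

print(f"per-token features: {full_bytes / 2**30:.2f} GiB")   # 2.93 GiB
print(f"pooled features:    {pooled_bytes / 2**20:.2f} MiB")  # 5.86 MiB
```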

In [109]:
X_train, y_train = extract_features(train_dataset)
X_val, y_val = extract_features(val_dataset)
X_test, y_test = extract_features(test_dataset)

In [114]:
X_train[:10]

array([[[ 0.20510897, -0.02048307,  0.00196606, ..., -0.13195124,
          0.56740284,  0.33686063],
        [ 0.7437506 ,  0.01668045, -0.04088961, ...,  0.15722656,
          0.77581644,  0.15504575],
        [ 0.24532312, -0.23797199,  0.18643066, ..., -0.04386552,
          0.53037524, -0.1493563 ],
        ...,
        [ 0.2526188 ,  0.1111629 ,  0.18625496, ..., -0.2999372 ,
          0.02266818, -0.18007219],
        [ 0.5047112 , -0.1042671 ,  0.24382335, ...,  0.25614917,
         -0.03424211, -0.12829389],
        [ 0.4825932 , -0.04205373,  0.39244625, ...,  0.14891887,
          0.07405909, -0.3140675 ]],

       [[-0.02506212, -0.3235323 ,  0.0497772 , ...,  0.04040866,
          0.33898205,  0.31572133],
        [-0.29285005, -0.64260083, -0.47798085, ...,  0.3228737 ,
          0.6661966 , -0.26964435],
        [-0.725631  , -0.14314805, -0.33124873, ...,  0.16771463,
          0.302625  , -0.43758023],
        ...,
        [ 0.05692494, -0.3331193 ,  0.17502831, ..., -

In the tokenization step above, `padding=True` pads every text to the longest sequence in the list passed to the tokenizer, and `truncation=True` truncates any sequence that exceeds the maximum length allowed by the model (512 tokens for DistilBERT, as for BERT).
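
The effect of padding and truncation on `input_ids` and `attention_mask` can be illustrated with a simplified pure-Python sketch. `pad_and_truncate` is a hypothetical helper, not the tokenizer's implementation: it simply cuts sequences off at `max_length`, whereas the real tokenizer also keeps the closing `[SEP]` token. The padding id of 0 matches the zeros visible in the printed `input_ids`.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[101, 2600, 102], [101, 1996, 2600, 4164, 102]],
                             max_length=4)
print(ids)   # [[101, 2600, 102, 0], [101, 1996, 2600, 4164]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```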

## Train a classifier using the extracted features (embeddings)

In [110]:
model = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
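
`GlobalAveragePooling1D` averages the token vectors over the sequence axis, turning each `(512, 768)` review into a single 768-dimensional vector. Since the features are passed in as plain NumPy arrays without a mask, the padded positions are included in that average. The NumPy sketch below contrasts a plain average with a mask-aware one; `masked_mean_pool` is a hypothetical helper, and whether it would improve accuracy here has not been tested.

```python
import numpy as np

def masked_mean_pool(embeddings, attention_mask):
    """Average token vectors over the sequence axis, skipping padding."""
    mask = attention_mask[..., np.newaxis].astype(embeddings.dtype)
    summed = (embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy batch: 1 sequence, 3 positions (the last one is padding), 2 features.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(emb.mean(axis=1))             # plain average, pulled towards the padding vector
print(masked_mean_pool(emb, mask))  # [[2. 3.]]
```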

In [111]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])

In [112]:
import os
import time

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():
    # Use a new directory for each run
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=run_logdir)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="bestcheckpoint.weights.h5",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


In [None]:
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[tensorboard_callback, model_checkpoint_callback])

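In [None]:
from sklearn.linear_model import LogisticRegression

# LogisticRegression expects 2-D input. As an assumption, use the [CLS] token
# embedding (position 0) of each review as its feature vector.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[:, 0, :], y_train)

In [None]:
print(f'train score : {clf.score(X_train[:, 0, :], y_train)}')
# print(f'validation score : {clf.score(X_val[:, 0, :], y_val)}')
print(f'test score : {clf.score(X_test[:, 0, :], y_test)}')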

train score : 0.90875
test score : 0.5025


We should be getting a validation accuracy score of around 86% to 87%, which is quite good considering we are training with only 2000 samples!

**Exercise**

1. Modify the code to use the hidden states from a different transformer layer as features, or take the average of the hidden states from a few layers as features.
2. Modify the code to use the BERT model and see if it performs better than DistilBERT. For the BERT model, the outputs of the different layers are in `output[2]` when the model is called with `output_hidden_states=True`.
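
As a hint for the first exercise: when called with `output_hidden_states=True`, a Hugging Face model returns a tuple with one hidden-state tensor per layer (for DistilBERT, the embedding output plus 6 transformer layers). The sketch below shows only the averaging step, using random NumPy arrays as stand-ins for the real hidden states; `average_last_layers` and the choice of `k=4` are illustrative assumptions.

```python
import numpy as np

def average_last_layers(hidden_states, k=4):
    """Average the last k arrays in a sequence of per-layer hidden states."""
    return np.mean(np.stack(hidden_states[-k:], axis=0), axis=0)

# Stand-in for real hidden states: 7 arrays of shape (batch, seq_len, hidden).
rng = np.random.default_rng(0)
fake_hidden_states = tuple(rng.normal(size=(2, 5, 8)) for _ in range(7))

features = average_last_layers(fake_hidden_states, k=4)
print(features.shape)  # (2, 5, 8)
```

The averaged array has the same shape as a single layer's output, so it can be fed to the same classifier without further changes.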