<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024S2/blob/main/session-4/word_embedding_glove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Pre-trained Word Embeddings

In this lab exercise, we will use a pre-trained word embedding (GloVe) for our text classification task, instead of training our own embedding layer.

## Setup

In [1]:
import os
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

### Download the IMDb Dataset

Load the dataset from the hosted CSV files with pandas and do the clean-up as before.

In [26]:
# Download the datasets.
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

In [25]:
len(train_df)

40000

We will use a subset of only 10000 reviews for training/validation and 500 for testing.

In [28]:
TRAIN_SIZE = 10000
TEST_SIZE = 500
BATCH_SIZE = 64

train_df = train_df.sample(n=TRAIN_SIZE, random_state=128)
test_df = test_df.sample(n=TEST_SIZE, random_state=128)

# convert the text label to numeric label
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [29]:
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=128)

Let's convert the pandas DataFrames to TensorFlow Datasets (`tf.data.Dataset`) suitable for model training later.

In [30]:
train_ds = tf.data.Dataset.from_tensor_slices(
            (train_df['review'].values,
            train_df['sentiment'].values)
)

val_ds = tf.data.Dataset.from_tensor_slices(
            (val_df['review'].values,
            val_df['sentiment'].values)
)

test_ds = tf.data.Dataset.from_tensor_slices(
            (test_df['review'].values,
            test_df['sentiment'].values)
)

# optimize the data pipeline
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.batch(BATCH_SIZE).cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.batch(BATCH_SIZE).cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.batch(BATCH_SIZE).cache().prefetch(buffer_size=AUTOTUNE)

In [32]:
len(train_df)

8000
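
The figure of 8000 follows from the subset size and the 20% validation split. As a quick sanity check, the arithmetic can be sketched in plain Python. `VAL_FRACTION` is a name introduced here for the `test_size=0.2` argument; it is not a variable in the notebook.

```python
import math

TRAIN_SIZE = 10000     # rows sampled for training + validation
VAL_FRACTION = 0.2     # hypothetical name for the test_size argument
BATCH_SIZE = 64

n_val = round(TRAIN_SIZE * VAL_FRACTION)
n_train = TRAIN_SIZE - n_val
# The last batch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)

print(n_train, n_val, steps_per_epoch)  # 8000 2000 125
```

The 125 steps per epoch correspond to the `125/125` counter shown in the training log further down.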

## Text preprocessing

We then initialize a TextVectorization layer with the desired parameters to vectorize our movie reviews.

In [33]:
# Vocabulary size and number of words in a sequence.
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 500

# Use the text vectorization layer to normalize, split, and map strings to
# integers.
# Set output_sequence_length because the samples are not all of the same length.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
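
To make the effect of the layer more concrete, here is a rough pure-Python sketch of the same idea: normalize, split on whitespace, map tokens to integers, then pad or truncate. It is a simplification under stated assumptions, not the layer's actual implementation. The helper names (`standardize`, `build_vocab`, `vectorize`) are made up for illustration, and details such as tie-breaking between equally frequent words may differ from Keras.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (simplified normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, seq_len):
    """Map tokens to integer ids, then truncate or zero-pad to seq_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

# Toy data, not taken from the IMDb dataset.
toy_reviews = ["A great movie!", "Not a great plot, not great acting."]
toy_vocab = build_vocab(toy_reviews, max_tokens=5)

print(toy_vocab)
print(vectorize("Great acting, terrible plot", toy_vocab, seq_len=6))
# [2, 1, 1, 1, 0, 0]
```

Only the most frequent word, `great`, gets its own id in the query; the other three words fall outside the tiny vocabulary and map to the out-of-vocabulary id 1, and the remaining positions are padded with 0.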

## Getting the pre-trained embedding model

We will use the pre-trained GloVe embeddings available from the [Stanford site](https://nlp.stanford.edu/projects/glove/). The original zip file contains embeddings of different dimensions (e.g. 50d, 100d, 200d) and is more than 800MB in size. To save download time, we have made a copy of the 50d GloVe file available on our course website. If you want to experiment with other embedding dimensions, feel free to download them from the Stanford site.


In [8]:
glove_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/pretrained-models/glove.6B.50d.zip'

glove_files = tf.keras.utils.get_file("glove.6B.50d.zip", glove_url,
                                    extract=True, cache_dir='.',
                                    cache_subdir='')

## Load the Embedding layer

In the code below, we read the embeddings from the downloaded file line by line to build an embeddings index (a dictionary mapping each word to its vector), and then initialize the Keras embedding layer from this index.

In [9]:
# Load up the GloVe word embedding data
EMBEDDING_DIM = 50

print("Loading GloVe Word Embedding...")
embeddings_index = {}
with open('glove.6B.50d.txt', encoding="utf8") as f:
    # The with-block closes the file automatically.
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

Loading GloVe Word Embedding...
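
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The loop above relies on that layout. A minimal sketch of parsing a single line, using a made-up 4-dimensional sample rather than a real entry from the file, and a hypothetical helper name `parse_glove_line`:

```python
def parse_glove_line(line):
    """Split one 'word v1 v2 ... vN' line into (word, list_of_floats)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

# Made-up line for illustration; real glove.6B.50d lines carry 50 values.
sample = "happy 0.09 0.26 -0.59 -0.37\n"
word, vector = parse_glove_line(sample)

print(word, len(vector), vector)
```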


Let's print out the embedding for the word `happy`.

In [10]:
embeddings_index['happy']

array([ 0.092086,  0.2571  , -0.58693 , -0.37029 ,  1.0828  , -0.55466 ,
       -0.78142 ,  0.58696 , -0.58714 ,  0.46318 , -0.11267 ,  0.2606  ,
       -0.26928 , -0.072466,  1.247   ,  0.30571 ,  0.56731 ,  0.30509 ,
       -0.050312, -0.64443 , -0.54513 ,  0.86429 ,  0.20914 ,  0.56334 ,
        1.1228  , -1.0516  , -0.78105 ,  0.29656 ,  0.7261  , -0.61392 ,
        2.4225  ,  1.0142  , -0.17753 ,  0.4147  , -0.12966 , -0.47064 ,
        0.3807  ,  0.16309 , -0.323   , -0.77899 , -0.42473 , -0.30826 ,
       -0.42242 ,  0.055069,  0.38267 ,  0.037415, -0.4302  , -0.39442 ,
        0.10511 ,  0.87286 ], dtype=float32)

In [11]:
# Construct the word embedding matrix that will be used in the Embedding layer.
vocab = vectorize_layer.get_vocabulary()
glove_embedding_matrix = np.zeros((len(vocab), EMBEDDING_DIM))
for i, word in enumerate(vocab):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        glove_embedding_matrix[i] = embedding_vector
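
Words in our vocabulary that have no GloVe entry keep an all-zero row in the matrix. It can be useful to check how many words that affects. The sketch below assumes a hypothetical helper `embedding_coverage` and uses toy stand-ins for `vocab` and `embeddings_index`; in the notebook you would pass the real objects instead.

```python
def embedding_coverage(vocab, embeddings_index):
    """Return (hits, misses); misses lists vocabulary words without a vector."""
    misses = [w for w in vocab if w not in embeddings_index]
    return len(vocab) - len(misses), misses

# Toy stand-ins for vectorize_layer.get_vocabulary() and the GloVe index.
toy_vocab = ["", "[UNK]", "movie", "great", "mst3k"]
toy_index = {"movie": [0.1, 0.2], "great": [0.3, 0.4]}

hits, misses = embedding_coverage(toy_vocab, toy_index)
print(f"{hits}/{len(toy_vocab)} words found; missing: {misses}")
```

The padding token `""` and the out-of-vocabulary token `"[UNK]"` are expected to show up among the misses, since they are not real words.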

In [12]:
vocab_size = len(vocab)
# input_dim must match the number of rows in glove_embedding_matrix.
embedding_layer = tf.keras.layers.Embedding(vocab_size,
                            EMBEDDING_DIM,
                            input_length = MAX_SEQUENCE_LENGTH,
                            weights=[glove_embedding_matrix],
                            trainable=False)



In [39]:
print(vocab_size)

10000


## Create a classification model

We will now create the model as before, but this time with the embedding layer initialized from the pre-trained embeddings.

In [34]:
model = tf.keras.Sequential([
    vectorize_layer,
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

## Compile and train the model

In [35]:
import os
import time

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():
    # Use a new timestamped directory for each run.
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=run_logdir)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="bestcheckpoint.weights.h5",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


Compile and train the model using the `Adam` optimizer and `BinaryCrossentropy` loss.

In [36]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])
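
As a reference point for reading the loss values during training, it helps to know what binary cross-entropy looks like for an uninformative model. With `from_logits=False`, the sigmoid output is treated as a probability. The sketch below is a plain-Python version of the formula, not the Keras implementation; the clipping constant `eps` is an illustrative choice.

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    eps = 1e-7  # clip probabilities to avoid log(0); illustrative value
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [0, 1, 1, 0]
chance = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy(labels, [0.1, 0.9, 0.8, 0.2])

print(round(chance, 4))     # 0.6931, i.e. ln 2
print(confident < chance)   # True
```

A loss that stays close to 0.693 therefore means the model is predicting about 0.5 for every review, which is no better than guessing.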

In [37]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=30,
    callbacks=[tensorboard_callback, model_checkpoint_callback])

Epoch 1/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 5s 18ms/step - accuracy: 0.5000 - loss: 0.7083 - val_accuracy: 0.4900 - val_loss: 0.6932
Epoch 2/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.5002 - loss: 0.6937 - val_accuracy: 0.5115 - val_loss: 0.6927
Epoch 3/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.5013 - loss: 0.6937 - val_accuracy: 0.4900 - val_loss: 0.6932
Epoch 4/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 0.4951 - loss: 0.6932 - val_accuracy: 0.4935 - val_loss: 0.6926
Epoch 5/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 4s 12ms/step - accuracy: 0.5054 - loss: 0.6927 - val_accuracy: 0.5135 - val_loss: 0.6923
Epoch 6/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 0.4970 - loss: 0.6931 - val_accuracy: 0.5005 - val_loss: 0.6922
Epoch 7/30
125/125

<keras.src.callbacks.history.History at 0x7b989d1d15d0>

In [None]:
%load_ext tensorboard
%tensorboard --logdir tb_logs

With the pre-trained embedding layer, we reach a validation accuracy of around 70%, which is worse than jointly training our own embedding layer with the classification task.

This may be because the vocabulary used to train the pre-trained embedding is quite different from the one used in the IMDb dataset.

If we have enough data (as in our case), jointly training our own embedding layer will usually yield better performance.


Let's evaluate the model on our test dataset.

In [None]:
model.load_weights("bestcheckpoint.weights.h5")
model.evaluate(test_ds)

## Additional Exercise

You can try other pre-trained embeddings available from the GloVe website, such as those trained on the Common Crawl dataset. Beware: these pre-trained embeddings are huge (more than 1.7GB).