<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week7/word_embedding_glove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Pre-trained Word Embeddings

In this lab exercise, we will use a pretrained word embedding for our text classification task, instead of training our own embedding layer. 

## Setup

In [None]:
import io
import os
import shutil
import numpy as np
import tensorflow as tf

### Download the IMDb Dataset

Download the dataset using the Keras file utility and remove the unlabelled `unsup` directory, as before.

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(dataset_dir, 'train')

In [None]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, create a `tf.data.Dataset` using `tf.keras.preprocessing.text_dataset_from_directory`. 

Use the `train` directory to create both train and validation datasets with a split of 20% for validation.

In [None]:
batch_size = 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='validation', seed=seed)
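
As a quick sanity check on the split, the arithmetic can be sketched as below. This is only an illustration: it assumes the labelled part of `aclImdb/train` holds 25,000 reviews once `unsup` is removed, and the variable names are made up for this example.

```python
import math

num_reviews = 25_000       # assumption: labelled reviews left in aclImdb/train
validation_split = 0.2
batch_size = 1024

# 20% of the files are held out for validation, the rest are used for training.
num_val = int(validation_split * num_reviews)
num_train = num_reviews - num_val

# The last batch of each dataset may be smaller than batch_size.
train_batches = math.ceil(num_train / batch_size)
val_batches = math.ceil(num_val / batch_size)

print(num_train, num_val, train_batches, val_batches)
```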

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Text preprocessing

We then initialize a TextVectorization layer with the desired parameters to vectorize our movie reviews.

In [None]:
# Vocabulary size and number of words in a sequence.
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 200

# Use the text vectorization layer to normalize, split, and map strings to
# integers.
# Set output_sequence_length because the reviews vary in length: longer
# reviews are truncated and shorter ones are padded to this length.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
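
To make the vectorization step more concrete, here is a simplified pure-Python sketch of the idea: lowercase the text, strip punctuation, split on whitespace, map each token to an integer, then pad or truncate. It is not the Keras implementation; the helper names `standardize`, `build_vocab` and `vectorize` are hypothetical, and the sketch only mirrors the convention that index 0 is reserved for padding and index 1 for unknown tokens.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Build a vocabulary: index 0 is padding, index 1 is the unknown token."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    # Keep the most frequent tokens; two slots are reserved for '' and '[UNK]'.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, seq_len):
    """Map a string to a fixed-length list of integer token ids."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:seq_len]                       # truncate long sequences
    return ids + [0] * (seq_len - len(ids))   # pad short sequences with 0

toy_texts = ["A great movie!", "A bad movie."]
toy_vocab = build_vocab(toy_texts, max_tokens=10)
print(toy_vocab)
print(vectorize("A great film", toy_vocab, seq_len=5))
```

In this toy example the word `film` never appears in the training texts, so it is mapped to the unknown-token index.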

## Getting the pre-trained embedding model

We will use the pre-trained GloVe embeddings available from the [Stanford NLP site](https://nlp.stanford.edu/projects/glove/). The original zip file contains embeddings of different dimensions (e.g. 50d, 100d, 200d) and is more than 800MB in size. To save download time, we have made a copy of the 50d GloVe file available on our course website. If you want to experiment with other embedding dimensions, feel free to download them from the Stanford site.


In [None]:
glove_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/pretrained-models/glove.6B.50d.zip'

glove_files = tf.keras.utils.get_file("glove.6B.50d.zip", glove_url,
                                    extract=True, cache_dir='.',
                                    cache_subdir='')

## Load the Embedding layer

In the code below, we read the embeddings from the downloaded file line by line to build an embeddings index (a dictionary that maps each word to its vector). We then use this index to build the weight matrix that initializes the Keras embedding layer.

In [None]:
# Load up the GloVe word embedding data

EMBEDDING_DIM = 50

print("Loading GloVe Word Embedding...")
embeddings_index = {}
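# The `with` block closes the file automatically, so no explicit close() is needed.
with open('glove.6B.50d.txt', encoding="utf8") as f:
    for line in f:
        # Each line holds a word followed by its vector components.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs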
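
To see what the parsing loop does without downloading anything, the sketch below runs the same logic on two made-up lines in the GloVe text format (a word followed by space-separated floats). The function name `parse_glove` and the sample values are hypothetical and only for illustration.

```python
import io
import numpy as np

def parse_glove(lines):
    """Build a {word: vector} dictionary from lines in the GloVe text format."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:        # skip blank lines
            continue
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

# Two made-up 3-dimensional entries standing in for the real file.
sample_file = io.StringIO("the 0.1 0.2 0.3\nhappy -0.5 0.25 1.0\n")
toy_index = parse_glove(sample_file)

print(sorted(toy_index))
print(toy_index["happy"])
```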

Let's print out the embedding for the word `happy`.

In [None]:
embeddings_index['happy']
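
A raw vector is hard to interpret on its own. One common way to sanity-check embeddings is to compare two words with cosine similarity. The sketch below uses made-up 2-dimensional vectors, and the helper `cosine_similarity` is a hypothetical name defined here, not part of the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, not real GloVe values.
toy_vectors = {
    "happy": [1.0, 1.0],
    "glad": [2.0, 2.0],
    "table": [1.0, -1.0],
}

print(cosine_similarity(toy_vectors["happy"], toy_vectors["glad"]))
print(cosine_similarity(toy_vectors["happy"], toy_vectors["table"]))
```

In the notebook, the same function could be applied to any two entries of `embeddings_index`, provided both words are present in the index.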

In [None]:
# Construct the word embedding matrix that will be used in the Embedding layer.
vocab = vectorize_layer.get_vocabulary()
glove_embedding_matrix = np.zeros((len(vocab), EMBEDDING_DIM))
for i, word in enumerate(vocab):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        glove_embedding_matrix[i] = embedding_vector
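
Words that are missing from GloVe keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. The helper below is a minimal sketch with a hypothetical name (`embedding_coverage`); it assumes the first two vocabulary entries are the padding token `''` and the unknown token `'[UNK]'`, and it is demonstrated on toy data.

```python
def embedding_coverage(vocab, embeddings_index):
    """Return (number of vocabulary words found, list of words not found)."""
    # Ignore the padding and unknown tokens, which have no pretrained vector.
    real_tokens = [w for w in vocab if w not in ("", "[UNK]")]
    misses = [w for w in real_tokens if w not in embeddings_index]
    hits = len(real_tokens) - len(misses)
    return hits, misses

# Toy data standing in for the real vocabulary and GloVe index.
toy_vocab = ["", "[UNK]", "movie", "great", "zzzfilm"]
toy_index = {"movie": [0.1, 0.2], "great": [0.3, 0.4]}

hits, misses = embedding_coverage(toy_vocab, toy_index)
print(f"Found {hits} words, missing {len(misses)}: {misses}")
```

In the notebook, calling `embedding_coverage(vocab, embeddings_index)` would show how many of the vocabulary words have a pretrained vector.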

In [None]:
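vocab_size = len(vocab)
# Use the actual vocabulary size so that the layer matches the shape of
# glove_embedding_matrix, and freeze the pretrained weights.
embedding_layer = tf.keras.layers.Embedding(vocab_size,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            weights=[glove_embedding_matrix],
                            trainable=False)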

## Create a classification model

We will now create the model as before, but this time with the embedding layer initialized from the pretrained embeddings.

In [None]:
model = tf.keras.Sequential([
    vectorize_layer,
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
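
The first layers of this model can be pictured with plain NumPy: the embedding layer is a row lookup into the weight matrix, and `GlobalAveragePooling1D` averages the resulting vectors over the sequence axis. The sketch below uses a made-up 4-word, 2-dimensional matrix. It also assumes no masking is applied (the embedding layer above does not set `mask_zero`), so padded positions count towards the average.

```python
import numpy as np

# Made-up embedding matrix: one row per vocabulary index.
toy_matrix = np.array([
    [0.0, 0.0],   # index 0: padding
    [0.0, 0.0],   # index 1: [UNK]
    [1.0, 3.0],   # index 2
    [3.0, 5.0],   # index 3
], dtype="float32")

# One review of length 4: two real tokens followed by two padding tokens.
token_ids = np.array([[2, 3, 0, 0]])

embedded = toy_matrix[token_ids]   # lookup, shape (batch, sequence, dim) = (1, 4, 2)
pooled = embedded.mean(axis=1)     # average over the sequence, shape (1, 2)

print(embedded.shape, pooled.shape)
print(pooled)
```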

## Compile and train the model

In [None]:
import time

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():
    # Use a new timestamped directory for each run.
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=run_logdir)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="bestcheckpoint",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


Compile and train the model using the `Adam` optimizer and `BinaryCrossentropy` loss. 

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])
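
Because the last layer uses a sigmoid, the model outputs probabilities, which is why `from_logits=False` is used. The quantity being minimized can be sketched in NumPy as the mean of `-(y*log(p) + (1-y)*log(1-p))`. This is an illustrative re-implementation, not the Keras code; the function name and the clipping value `eps` are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip probabilities away from 0 and 1 so that log() stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

labels = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.8]
confident_and_wrong = [0.1, 0.9, 0.2]

print(binary_crossentropy(labels, confident_and_right))
print(binary_crossentropy(labels, confident_and_wrong))
```

Predictions that are confident and correct give a small loss, while confident mistakes are penalized heavily.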

In [None]:
model.fit(
    train_ds,
    validation_data=val_ds, 
    epochs=30,
    callbacks=[tensorboard_callback, model_checkpoint_callback])

In [None]:
%load_ext tensorboard
%tensorboard --logdir tb_logs

With the pretrained embedding layer, we reach a validation accuracy of around 70%, which is worse than when we train our own embedding layer jointly with the classification task.

This may be because the vocabulary of the text used to train the pretrained embedding is quite different from the vocabulary used in the IMDb dataset.

If we have enough data (as in our case), jointly training our own embedding layer will usually yield better performance.


Let's evaluate the model on our test dataset.

In [None]:
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

In [None]:
model.load_weights("bestcheckpoint")
model.evaluate(test_ds)

## Additional Exercise

You can try other pretrained embeddings available from the GloVe website, such as those trained on the Common Crawl dataset. Beware: these pretrained embeddings are huge (more than 1.7GB).