# 2 approaches for representing groups of words: Sets and sequences

## 11.3.1 Preparing the IMDB movie reviews data

### Download dataset and uncompress

In [8]:
! cat ../datasets/aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

prepare a validation set: move 20% of the training text files to a new directory

In [3]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("/home/user/development/datasets/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
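for category in ("neg", "pos"):
    # skip categories that were already split: calling os.makedirs on an
    # existing directory raises FileExistsError, and moving files a second
    # time would shrink the training set further
    if (val_dir / category).exists():
        continue
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)     # fixed seed for a reproducible shuffle
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]   # last 20% of the shuffled files
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)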
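The split logic can be tried out without touching the file system. A minimal sketch, assuming a plain list of file names; `split_train_val` and the file names are made up for illustration and are not part of the notebook:

```python
import random

def split_train_val(filenames, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) after a seeded shuffle."""
    files = list(filenames)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(files)   # same seed and same input order -> same shuffle
    num_val = int(val_fraction * len(files))
    if num_val == 0:                     # files[-0:] would return the whole list
        return files, []
    return files[:-num_val], files[-num_val:]

# Hypothetical file names standing in for the review files.
names = [f"{i}_7.txt" for i in range(10)]
train_files, val_files = split_train_val(names)
print(len(train_files), len(val_files))
```

Because the shuffle is seeded, calling the function twice on the same list gives the same split.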

### text_dataset_from_directory utility = create a batched Dataset of texts and their labels from a directory structure

In [4]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    base_dir / "train", batch_size=batch_size
)

val_ds = keras.utils.text_dataset_from_directory(
    base_dir / "val", batch_size=batch_size
)

test_ds = keras.utils.text_dataset_from_directory(
    base_dir / "test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.




Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [11]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'One reviewer notes that it does not seem to matter what Welles actually says or does, he moves you. I concur. He was and remains a unique force in film. More than a triple threat who could act, write and direct, he had a genius uniquely suited to film. One can consider whether in an earlier age he would have been a painter. This film certainly reinforces that impression. A musician, a theatre actor, an heir to Shakespeare? hard to tell but I am very grateful that his time cam with film and he have him captured on film. I like the accent. I like the face, the size, the style, the mind and the games. I love all of his movies and wish there were more. I particularly love how other actors interacted with him on film. Many were never better or at least somehow different with him because he was o firmly there. Even towards the end when his beauty was ruined, perhaps

## 11.3.2: Processing words as a set: the bag-of-words approach

simplest way to encode a piece of text: discard the word order and treat it as a set (a "bag") of tokens

let's process the raw text datasets with a TextVectorization layer that outputs multi-hot encoded binary word vectors: each vector has one dimension per vocabulary word, with 0s almost everywhere and 1s only for the words present in the text
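To make the multi-hot idea concrete, here is a minimal pure-Python sketch. It is not how TextVectorization is implemented: the `multi_hot` helper and the toy vocabulary are assumptions for illustration, and punctuation stripping and out-of-vocabulary handling are left out.

```python
def multi_hot(text, vocabulary):
    """Encode text as a 0/1 vector with one slot per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in text.lower().split():
        if token in index:
            vector[index[token]] = 1.0   # presence only: repeats do not add up
    return vector

vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]   # toy vocabulary
print(multi_hot("The cat sat on the mat", vocabulary))
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

Word order and repetition are lost: "the" occurs twice but still yields a single 1.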

### unigrams with binary encoding

In [5]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    max_tokens=20000,   # limit the vocabulary to the 20k most frequent words - a reasonable size for text classification
    output_mode="multi_hot"
)
text_only_train_ds = train_ds.map(lambda x, y: x)  # prepare a dataset that only yields raw text inputs (no labels)
text_vectorization.adapt(text_only_train_ds)  # use dataset to index the dataset vocabulary via adapt()

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

inspect output of one of these datasets

In [13]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)     
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


let's write a reusable model-building function that we'll use in all of our experiments 

In [6]:
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

let's train and test our model

In [15]:
#model was already trained - just load the model checkpoint here

model = get_model()
#model.summary()

#callbacks = [
#    keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)
#]
#model.fit(binary_1gram_train_ds.cache(),    # we call cache() on the datasets to cache them in memory - only do preprocessing once during 1st 
#          validation_data=binary_1gram_val_ds.cache(),  # epoch and reuse preprocessed texts for the following epochs
#          epochs=10,                                    # can only be done if the data is small enough to fit in memory
#          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1] :.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.885


### bigrams with binary encoding

e.g. "the cat sat on the mat" becomes {"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}

TextVectorization layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc
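As a rough illustration of what a bag of unigrams plus bigrams contains, the sketch below builds it by hand. `ngrams_up_to` is a hypothetical helper, not a Keras function, and it only splits on whitespace.

```python
def ngrams_up_to(text, n=2):
    """Return all 1-grams up to n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

bag = set(ngrams_up_to("the cat sat on the mat", n=2))
print(sorted(bag))
```

The set keeps local word pairs such as "cat sat", which a pure unigram bag cannot represent.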

In [7]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

let's see how our model performs when trained on such binary encoded bags of bigrams - any difference to unigrams?

In [17]:
#model was already trained - just load the model checkpoint here

text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
# model.summary()
# callbacks = [
#     keras.callbacks.ModelCheckpoint("binary_2gram.keras",
#                                     save_best_only=True)
# ]
# model.fit(binary_2gram_train_ds.cache(),
#           validation_data=binary_2gram_val_ds.cache(),
#           epochs=10,
#           callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")


Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.897


test accuracy improves from 0.885 (unigrams) to 0.897 (bigrams) -> local order is pretty important

### Bigrams with TF-IDF encoding: add more information by counting how many times each word/n-gram occurs - histogram of words over text

e.g. {"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}

count bigram occurrences with the TextVectorization layer

In [18]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)
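The counting behind such a histogram can be pictured with `collections.Counter`. This is only a sketch of the idea on one sentence; `count_ngrams` is a made-up helper and says nothing about how the layer does it internally.

```python
from collections import Counter

def count_ngrams(text, n=2):
    """Count every 1-gram up to n-gram in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    return counts

counts = count_ngrams("the cat sat on the mat")
print(counts["the"], counts["the cat"], counts["mat"])   # 2 1 1
```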

words like "the", "a", "is", "are" will always dominate the word counts, whatever the text is about -> normalization could solve this, but it would wreck the sparsity (many zeros) of the vectorized sentences and increase the computational effort

solution: TF-IDF = term frequency, inverse document frequency --> weight a term's frequency in the current document by how rare the term is across all documents in the dataset -> just change the output mode in the TextVectorization layer

In [19]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)
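To make the TF-IDF weighting concrete, here is a small sketch using one common smoothed formulation. The exact formula used by TextVectorization may differ, and the `tfidf` helper and the three toy documents are assumptions for illustration.

```python
import math

def tfidf(term, document, dataset):
    """Term frequency in one document, scaled by a smoothed inverse document frequency."""
    term_freq = document.count(term)
    doc_freq = sum(1 for doc in dataset if term in doc)   # documents containing the term
    inverse_doc_freq = math.log((1 + len(dataset)) / (1 + doc_freq)) + 1
    return term_freq * inverse_doc_freq

# Toy dataset: each document is a list of tokens.
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["the", "acting", "was", "great"],
]
print(tfidf("the", docs[1], docs))        # appears in every document -> low weight
print(tfidf("terrible", docs[1], docs))   # appears in one document -> higher weight
```

A word that shows up in every document carries little information, so it ends up with a smaller weight than a rare word with the same count.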


In [20]:
#model was already trained - just load the model checkpoint here

text_vectorization.adapt(text_only_train_ds)        # will learn tf-idf weights as well as vocabulary

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
# model.summary()
# callbacks = [
#     keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
#                                     save_best_only=True)
# ]
# model.fit(tfidf_2gram_train_ds.cache(),
#           validation_data=tfidf_2gram_val_ds.cache(),
#           epochs=10,
#           callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")


Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.886


# Processing words as a sequence - the sequence model approach

### Order matters! as seen in 11.3.2 - inducing order into our data like in bigrams yields better accuracy

#### represent my input samples as sequences of integer indices - where each integer stands for one word

#### We use in this chapter: Bidirectional RNNs / bidirectional LSTMs - SOTA for sequence modeling 2016-2017

##### Nowadays: almost universally done with Transformers

In [8]:
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,      # we truncate the inputs after the first 600 words (average review length is 233)
)                                           # only 5% of reviews are longer than 600 words

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
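What the "int" output mode with a fixed sequence length produces can be sketched in plain Python. The sketch assumes the Keras convention that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens; `encode_as_ints` and the toy vocabulary are made up for illustration.

```python
def encode_as_ints(text, vocabulary, sequence_length):
    """Map tokens to integer indices, then truncate or zero-pad to a fixed length."""
    # index 0 = padding, index 1 = out-of-vocabulary, real words start at 2
    index = {word: i + 2 for i, word in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]
    ids = ids[:sequence_length]                       # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))   # pad short inputs with zeros

vocabulary = ["the", "movie", "was", "great"]   # toy vocabulary
print(encode_as_ints("The movie was great", vocabulary, 6))
# -> [2, 3, 4, 5, 0, 0]
print(encode_as_ints("The plot was great fun to watch", vocabulary, 6))
# -> [2, 1, 4, 5, 1, 1]
```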

simplest way to convert our integer sequences to vector sequences: one-hot encode the ints, add simple bidirectional LSTM

In [23]:
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")      # one input is a sequence of integers
embedded = tf.one_hot(inputs, depth=max_tokens)         # encode the ints into binary 20,000-dimensional vectors
x = layers.Bidirectional(layers.LSTM(32))(embedded)     # add a bidirectional LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)      # add a classification layer
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot_1 (TFOpLambda)   (None, None, 20000)       0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               5128448   
 nal)                                                            
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
_________________________________________________

In [24]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10


2022-01-11 15:17:37.367506: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1536000000 exceeds 10% of free system memory.
2022-01-11 15:17:37.867303: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1536000000 exceeds 10% of free system memory.
2022-01-11 15:17:37.867437: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1536000000 exceeds 10% of free system memory.
2022-01-11 15:18:08.540364: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1536000000 exceeds 10% of free system memory.


  1/625 [..............................] - ETA: 23:12:06 - loss: 0.6942 - accuracy: 0.5000

2022-01-11 15:19:48.523781: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1536000000 exceeds 10% of free system memory.


 11/625 [..............................] - ETA: 2:32:05 - loss: 0.6941 - accuracy: 0.4773

result: this is super slow because each input sample is encoded as a matrix of size (600, 20000) - 600 words per sample, 20k possible words -> that's 12 million floats for a single movie review \
one-hot encoding is not a good idea here; better: word embeddings
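The size of the one-hot tensors can be checked with a few lines of arithmetic. The sketch assumes float32 values (4 bytes each) and the batch size of 32 used earlier; the result matches the 1536000000-byte allocation reported in the warnings above.

```python
max_length = 600        # tokens per review after truncation/padding
max_tokens = 20000      # vocabulary size = one-hot depth
batch_size = 32
bytes_per_float32 = 4

floats_per_review = max_length * max_tokens
bytes_per_batch = floats_per_review * batch_size * bytes_per_float32

print(floats_per_review)   # 12000000
print(bytes_per_batch)     # 1536000000
```

For comparison, a 256-dimensional embedding needs only 600 * 256 = 153,600 floats per review.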

#### understanding word embeddings

when I do one-hot encoding, I assume "that the different tokens I am encoding are all independent from each other" \
one-hot vectors are all orthogonal to one another - but the vectors for "movie" and "film" should be the same or very close \
the geometric relationship between word vectors should reflect the semantic relationship between the words!!

#### word embeddings are vector representations that map human language into a structured geometric space

word embeddings: low dimensional float vectors = dense vectors; structured and structure is learned from data \
similar words get embedded in close locations and specific directions in the embedding space are meaningful \
common meaningful geometric transformations: "gender" or "plural" vectors
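The idea of a meaningful direction can be shown with hand-picked toy vectors. These 2-D values are invented for illustration and are not learned embeddings; real embedding spaces only approximate such relationships.

```python
# Hand-picked 2-D toy vectors (NOT learned): axis 0 loosely stands for
# "royalty", axis 1 for "gender".
toy_embeddings = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# The "gender" vector is the offset from "man" to "woman".
gender_vector = sub(toy_embeddings["woman"], toy_embeddings["man"])
print(add(toy_embeddings["king"], gender_vector) == toy_embeddings["queen"])   # True
```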

2 ways of obtaining word embeddings: \
1: learn word embeddings jointly with main learning task \
2: load pretrained word embeddings into my model

#### 1: Learn word embeddings with the embedding layer

In [9]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

embedding layer: dictionary that maps integer indices (stand for specific words) to dense vectors \
takes ints as input, looks up these ints in an internal dictionary and returns the associated vectors \
word index -> embedding layer -> corresponding word vector \
input: rank-2 tensor of ints of shape (batch_size, sequence_length) \
output: 3D floating-point tensor of shape (batch_size, sequence_length, embedding_dimensionality) \
weights randomly initialized (like with any other layer) - gradually adjusted via backpropagation
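The lookup behaviour described above can be sketched without Keras. `TinyEmbedding` is a made-up stand-in using nested lists; it only mimics the shapes and the index-to-vector lookup and does no training.

```python
import random

class TinyEmbedding:
    """Toy lookup table: one randomly initialized vector per integer index."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
                      for _ in range(input_dim)]

    def __call__(self, batch):
        # (batch_size, sequence_length) ints -> (batch_size, sequence_length, output_dim) floats
        return [[self.table[index] for index in sequence] for sequence in batch]

embedding = TinyEmbedding(input_dim=10, output_dim=4)
out = embedding([[1, 2, 0], [3, 1, 1]])
print(len(out), len(out[0]), len(out[0][0]))   # 2 3 4
```

The same index always returns the same vector; in a real Embedding layer those vectors are then adjusted by backpropagation.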

In [10]:
#model was already trained - just load the model checkpoint here

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

# callbacks = [
#     keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
#                                     save_best_only=True)
# ]
# model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, 
#           callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               73984     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
___________________________________________________

#### understanding padding and masking

padding: \
what's hurting our model performance: the input sequences are full of zeros, which come from output_sequence_length=max_length (600) in TextVectorization -> this truncates reviews longer than 600 tokens BUT ALSO fills up shorter reviews with zeros = padding

masking: \
we use a bidirectional RNN - two RNN layers running in parallel \
one processes the tokens in their natural order (it will spend its last iterations seeing only vectors that encode padding) \
one processes the same tokens in reverse \
the way to tell the RNN to skip the padding iterations = masking \
\
mask = tensor of ones and zeros (or True/False booleans) of shape (batch_size, sequence_length) - entry mask[i, t] indicates whether timestep t of sample i should be skipped (0 or False) or processed (1 or True) \
masking is turned off by default - turn it on in the Embedding layer with mask_zero=True \
retrieve the mask with the compute_mask() method

In [11]:
embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
some_input = [
     [4, 3, 2, 1, 0, 0, 0],
     [5, 4, 3, 2, 1, 0, 0],
     [2, 1, 0, 0, 0, 0, 0]]

mask = embedding_layer.compute_mask(some_input)
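With mask_zero=True, the mask marks every non-zero token as a real timestep. A pure-Python sketch of that rule on the same input (an illustration of the idea, not the Keras implementation):

```python
some_input = [
    [4, 3, 2, 1, 0, 0, 0],
    [5, 4, 3, 2, 1, 0, 0],
    [2, 1, 0, 0, 0, 0, 0],
]

# True = process this timestep, False = padding that should be skipped.
mask = [[token != 0 for token in sequence] for sequence in some_input]
for row in mask:
    print(row)
# [True, True, True, True, False, False, False]
# [True, True, True, True, True, False, False]
# [True, True, False, False, False, False, False]
```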

In [12]:
#model was already trained - just load the model checkpoint here

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

# callbacks = [
#     keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
#                                     save_best_only=True)
# ]
# model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
#           callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")


Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________

### using pretrained word embeddings

useful if: \
too little training data to learn truly powerful embeddings of my vocabulary \
same as in CNNs - images as well as language have fairly generic features - visual or semantic respectively

well-known word embeddings - generally computed using word-occurrence statistics (which words co-occur in sentences): \
word2vec algorithm by Google in 2013: famous word-embedding scheme, captures specific semantic properties such as gender \
GloVe: Global Vectors for Word Representation, Stanford 2014 (based on factorizing a matrix of word co-occurrence statistics, from Wikipedia and Common Crawl data)

let's use GloVe embeddings in a keras model:

In [14]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2022-01-12 15:39:49--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-01-12 15:39:49--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-01-12 15:39:50--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-0

In [15]:
!unzip -q glove.6B.zip

let's parse the unzipped file to build an index that maps words as strings to their vector representations from the glove dictionary

In [16]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:   # explicit encoding: the file contains non-ASCII tokens
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.
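Each line of the GloVe text file holds a word followed by its space-separated coefficients. The sketch below parses one such line in plain Python; the sample line is made up and far shorter than a real 100-dimensional entry, and `parse_glove_line` is a hypothetical helper.

```python
def parse_glove_line(line):
    """Split 'word c1 c2 ...' into the word and a list of float coefficients."""
    word, coefs = line.split(maxsplit=1)   # first token is the word, the rest are numbers
    return word, [float(value) for value in coefs.split()]

sample_line = "movie 0.1 -0.2 0.3\n"   # made-up 3-dimensional example
word, vector = parse_glove_line(sample_line)
print(word, vector)   # movie [0.1, -0.2, 0.3]
```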


let's build an embedding matrix that you can load into an Embedding layer - it must be of shape (max_tokens, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization)

In [17]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()  #retrieve the vocabulary indexed by our previous TextVectorization layer
word_index = dict(zip(vocabulary, range(len(vocabulary))))  # use it to create a mapping from words to their index in the vocab

embedding_matrix = np.zeros((max_tokens, embedding_dim))    # prepare a matrix that we'll fill with the GloVe vectors
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # nested under the bounds check so a stale vector is never reused
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector      # fill entry i in the matrix with the word vector for index i; words not found in the embedding index stay all-zero


use a Constant initializer to load the pretrained embeddings in an embedding layer - we freeze the layer via trainable=False to not disrupt the pretrained representations during training (to not "learn" them)

In [18]:
embedding_layer = layers.Embedding(
    max_tokens, 
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

train the model again

In [19]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 100)         2000000   
                                                                 
 bidirectional_2 (Bidirectio  (None, 64)               34048     
 nal)                                                            
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
____________________________________________
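The parameter counts in the sequence-model summaries can be checked by hand. A minimal sketch, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    """4 gates, each with (input_dim + units) weights plus 1 bias per unit."""
    return 4 * (input_dim + units + 1) * units

def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

print(bidirectional_lstm_params(20000, 32))   # one-hot inputs        -> 5128448
print(bidirectional_lstm_params(256, 32))     # learned embeddings    -> 73984
print(bidirectional_lstm_params(100, 32))     # GloVe embeddings      -> 34048
print(20000 * 256, 20000 * 100)               # embedding tables      -> 5120000 2000000
print(64 + 1)                                 # final Dense(1) layer  -> 65
```

With the frozen GloVe layer, only the 34,048 LSTM parameters and the 65 dense parameters are trained, which gives the 34,113 trainable parameters in the last summary.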