### Using GloVe embeddings

This notebook shows how to use pretrained GloVe embeddings. It is adapted from a Keras [blog post](https://keras.io/examples/nlp/pretrained_word_embeddings/).

Read more about GloVe here: https://nlp.stanford.edu/projects/glove/

The notebook uses input data that was processed and pickled in the notebook 'Embedded Data Pre-Processing'.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Read in the preprocessed data. See the Embedding Data Pre-Processing notebook for details.

In [2]:
import pickle


def load_pickle(path):
    """Load one pickled object and close the file afterwards."""
    with open(path, 'rb') as f:
        return pickle.load(f)


train_samples = load_pickle('data/train_samples.pkl')
train_labels = load_pickle('data/train_labels.pkl')

val_samples = load_pickle('data/val_samples.pkl')
val_labels = load_pickle('data/val_labels.pkl')

test_samples = load_pickle('data/test_samples.pkl')
test_labels = load_pickle('data/test_labels.pkl')

class_names = load_pickle('data/class_names.pkl')

#### Set up the vectorizer

Use the Keras TextVectorization layer to vectorize the data, keeping only the 20,000 most frequent words. Each sample will be truncated or padded to a length of 200 tokens.

In [3]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

In [6]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
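To make the word-to-index mapping concrete, here is a minimal sketch on a made-up vocabulary. It assumes the list returned by get_vocabulary() starts with the padding token at index 0 and the out-of-vocabulary token at index 1. The names `toy_voc`, `toy_word_index`, and `encode` are hypothetical, and the whitespace tokenizer is a simplification of what the layer does.

```python
# Toy vocabulary in the order assumed for get_vocabulary():
# index 0 is the padding token, index 1 is the out-of-vocabulary (OOV) token.
toy_voc = ["", "[UNK]", "the", "graphics", "computer"]
toy_word_index = dict(zip(toy_voc, range(len(toy_voc))))


def encode(text, word_index, oov_index=1):
    """Lowercase, split on whitespace, and map each token to its index."""
    return [word_index.get(tok, oov_index) for tok in text.lower().split()]


encoded = encode("The computer graphics pipeline", toy_word_index)
print(encoded)  # [2, 4, 3, 1]
```

The word "pipeline" is not in the toy vocabulary, so it maps to the OOV index 1.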

### Load pretrained word embeddings

The pretrained GloVe embeddings were downloaded from: http://nlp.stanford.edu/data/glove.6B.zip

The file was then extracted under the .keras/datasets folder in the home directory.

The next block of code builds a dictionary, embeddings_index, that maps each word to its GloVe vector.

In [9]:
import os

path_to_glove_file = os.path.join(
    os.path.expanduser("~"), ".keras/datasets/glove.6B/glove.6B.100d.txt"
)

embeddings_index = {}
# Read as UTF-8 explicitly so the result does not depend on the platform default.
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
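The parsing step relies on each line of the GloVe text file holding a word followed by its space-separated coefficients. A minimal sketch of the same loop on two made-up lines (the words and numbers are invented, and `fake_glove` and `toy_index` are hypothetical names) looks like this:

```python
import io

import numpy as np

# Two invented lines in the GloVe text layout: a word, then its coefficients.
fake_glove = io.StringIO("cat 0.1 0.2 0.3\ndog -0.4 0.5 0.6\n")

toy_index = {}
for line in fake_glove:
    # Split off the word; everything after the first space is the vector.
    word, coefs = line.split(maxsplit=1)
    toy_index[word] = np.asarray([float(c) for c in coefs.split()], dtype="float32")

print(len(toy_index), toy_index["dog"].shape)  # 2 (3,)
```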


Create an embedding matrix in which row i holds the GloVe vector for the word at index i in the vectorizer's vocabulary.

In [10]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representations for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 17889 words (2111 misses)
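The hit and miss counting can be checked on a toy case. The sketch below uses an invented four-entry vocabulary and a one-word embedding index (all names prefixed with `toy_` are hypothetical). It shows that, in this toy case, the padding and OOV entries are counted as misses and keep all-zero rows.

```python
import numpy as np

toy_word_index = {"": 0, "[UNK]": 1, "cat": 2, "zzyzx": 3}
toy_embeddings = {"cat": np.array([0.1, 0.2, 0.3], dtype="float32")}

toy_matrix = np.zeros((len(toy_word_index), 3))
toy_hits, toy_misses = 0, 0
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        toy_matrix[i] = vec  # copy the pretrained vector into row i
        toy_hits += 1
    else:
        toy_misses += 1      # row i stays all-zeros

print(toy_hits, toy_misses)  # 1 3
```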


Set up the Embedding layer, setting trainable to False so that the embeddings are not modified during model training. 

In [11]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
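Conceptually, a frozen embedding layer is a table lookup: each integer token id selects one row of the matrix. The following NumPy sketch illustrates the idea with an invented 4 x 2 matrix. It is not the Keras implementation, and `toy_matrix` and `token_ids` are hypothetical names.

```python
import numpy as np

toy_matrix = np.array([
    [0.0, 0.0],   # index 0: padding
    [0.0, 0.0],   # index 1: OOV
    [0.5, -1.0],  # index 2: some word
    [2.0, 3.0],   # index 3: another word
])
token_ids = np.array([[2, 3, 1, 0]])  # a batch with one sequence of length 4

# Integer-array indexing returns one matrix row per token id.
embedded = toy_matrix[token_ids]
print(embedded.shape)  # (1, 4, 2)
```

The output shape is (batch, sequence length, embedding dimension), which matches the (None, None, 100) shape the Embedding layer reports for the real model.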

### Build the model

The model stacks several Conv1D layers with pooling and ends in a softmax classification layer. Instead of the Sequential syntax, this example uses the Keras Functional API: https://www.tensorflow.org/guide/keras/functional

In [12]:
from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         2000200   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         64128     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 128)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 128)        
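The parameter counts visible in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model code (Conv1D with "valid" padding and a bias term, MaxPooling1D with stride equal to the pool size) and an input length of 200 as produced by the vectorizer. The helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


def valid_conv_length(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1


def pool_length(length, pool_size):
    """Output length of pooling with stride equal to pool_size."""
    return length // pool_size


embedding_params = 20002 * 100             # num_tokens * embedding_dim
conv1_params = conv1d_params(5, 100, 128)  # first Conv1D
conv2_params = conv1d_params(5, 128, 128)  # second Conv1D
print(embedding_params, conv1_params, conv2_params)  # 2000200 64128 82048

# Track the sequence length through the conv/pool stack for a 200-token input.
length = 200
lengths = [length]
for kind, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    length = valid_conv_length(length, size) if kind == "conv" else pool_length(length, size)
    lengths.append(length)
print(lengths)  # [200, 196, 39, 35, 7, 3]
```

The summary shows None for the sequence dimension because the input is declared with shape=(None,). The lengths above apply only once a fixed-length batch is fed in.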

### Vectorize train and validation sets

Calling the vectorizer directly in this way right-pads each sample to the fixed output length.

In [13]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)
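As a rough illustration of right-padding and truncation to a fixed length, here is a pure-Python sketch. It assumes the padding id is 0 and that over-long samples lose their tail. `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Right-pad with pad_id, or cut off the tail, so the result has `length` items."""
    return (ids + [pad_id] * length)[:length]


print(pad_or_truncate([7, 8, 9], 5))           # [7, 8, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```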

### Train the model

Sparse categorical crossentropy is used because the labels are integer class indices rather than one-hot vectors, and the final layer is a multi-class softmax layer.

In [14]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f851667bdf0>
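To show what the sparse loss computes, the NumPy sketch below compares it with ordinary categorical crossentropy on one-hot labels. The probabilities and labels are invented, and this is an illustration of the formula rather than the Keras implementation.

```python
import numpy as np

# Invented softmax outputs for two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
int_labels = np.array([0, 2])  # integer class indices, as in y_train

# Sparse form: pick the predicted probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(int_labels)), int_labels]).mean()

# Dense form: the same quantity computed from one-hot labels.
one_hot = np.eye(3)[int_labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(np.isclose(sparse_loss, dense_loss))  # True
```

Both forms give the same value. The sparse version simply avoids building the one-hot matrix.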

### Export the model

The next code block shows how you could create an end-to-end system where the input is a text string and the output is the predicted label.

In [15]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]

'comp.graphics'

### Evaluate on the test data

In [19]:
test_x = vectorizer(np.array([[s] for s in test_samples])).numpy()

preds = model.predict(test_x)
pred_labels = [np.argmax(p) for p in preds]

In [18]:
from sklearn.metrics import classification_report

print(classification_report(test_labels, pred_labels))

              precision    recall  f1-score   support

           0       0.39      0.71      0.51       200
           1       0.60      0.59      0.60       202
           2       0.65      0.59      0.62       196
           3       0.44      0.57      0.49       192
           4       0.49      0.75      0.60       196
           5       0.67      0.62      0.65       190
           6       0.78      0.63      0.70       201
           7       0.77      0.72      0.75       200
           8       0.86      0.82      0.84       196
           9       0.94      0.88      0.91       213
          10       0.90      0.94      0.92       188
          11       0.89      0.78      0.83       196
          12       0.62      0.59      0.61       206
          13       0.79      0.81      0.80       190
          14       0.82      0.84      0.83       206
          15       0.70      0.69      0.69       193
          16       0.72      0.55      0.62       223
          17       0.76    

This accuracy is not terrible for a 20-class classification problem, but it is not terrific either. It is only slightly higher than in the other notebook, which trained its own embedding layer instead of using pretrained embeddings.
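For reference, the accuracy and the per-class precision and recall in a report like the one above come from simple counts over the true and predicted labels. A minimal NumPy sketch on invented labels follows. `precision_recall` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

# Invented true and predicted labels for a three-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Accuracy: fraction of samples whose predicted label matches the true label.
accuracy = (y_true == y_pred).mean()


def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class, from true-positive counts."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    predicted = np.sum(y_pred == cls)  # samples predicted as cls
    actual = np.sum(y_true == cls)     # samples that really are cls
    return tp / predicted, tp / actual


p1, r1 = precision_recall(y_true, y_pred, 1)
print(accuracy, p1, r1)
```

Here class 1 is predicted three times and two of those are correct, so its precision is 2/3. Both true class-1 samples are found, so its recall is 1.0.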