# Using pre-trained word embeddings


## Setup

In [65]:
import numpy as np
import io
import os
import re
import shutil
import string

import tensorflow as tf
from tensorflow import keras

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with a dataset of threat intelligence reports, organized into one directory
per threat group.

## Download the Threat Intelligence data

In [18]:
url = 'https://github.com/eyalmazuz/AttackAttributionDataset/archive/refs/heads/master.zip' 

dataset = tf.keras.utils.get_file('master.zip', url,
                                  extract=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'AttackAttributionDataset-master')
print(dataset_dir)
os.listdir(dataset_dir)

./AttackAttributionDataset-master


['APT29',
 'Lazarus',
 'FIN7',
 'APT17',
 'Winnti',
 'DeepPanda',
 'RocketKitten',
 'APT28',
 'Turla',
 'menuPass',
 'README.md',
 'OilRig',
 'APT3']

## Shuffle and split the data into training & validation sets

In [23]:
batch_size = 5
seed = 42

train_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

val_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

AUTOTUNE = tf.data.AUTOTUNE

class_names = train_ds.class_names

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 238 files belonging to 12 classes.
Using 191 files for training.
Found 238 files belonging to 12 classes.
Using 47 files for validation.


## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.

In [7]:
from tensorflow.keras.layers import TextVectorization

vectorize_layer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
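
The effect of `output_sequence_length=200` on each sample can be pictured with a small
pure-Python sketch. `pad_or_truncate` is a hypothetical helper written for illustration,
not part of Keras; it assumes padding is added on the right with index 0 and that long
inputs keep only their first tokens.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


# A short sequence is right-padded with the padding index.
short = pad_or_truncate([2, 17, 5], length=6)
# A long sequence keeps only its first `length` tokens.
long = pad_or_truncate(list(range(1, 10)), length=6)

print(short)  # [2, 17, 5, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6]
```

With `length=200`, every report ends up as a row of exactly 200 integers, which is what
lets the samples be stacked into a single batch tensor.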

You can retrieve the computed vocabulary via `vectorize_layer.get_vocabulary()`. Let's
print the top 5 words:

In [8]:
vectorize_layer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

Let's vectorize a test sentence:

In [9]:
output = vectorize_layer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([    2, 17174, 12257,    14,     2,     1])

As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
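
The reserved indices can be illustrated with a toy vocabulary builder. This is only a
sketch of the idea, not the layer's actual implementation: `build_vocab` and `encode` are
hypothetical helpers, and the tokenization is simplified to lowercasing and splitting on
whitespace.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Rank words by frequency, keeping slots 0 and 1 for padding and OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Two slots are reserved, so only max_tokens - 2 real words fit.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(text, vocab):
    """Map each word to its index, falling back to 1 for unknown words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.lower().split()]


toy_vocab = build_vocab(["the cat sat", "the cat", "the"], max_tokens=4)
print(toy_vocab)                          # ['', '[UNK]', 'the', 'cat']
print(encode("the sat cat", toy_vocab))   # [2, 1, 3]
```

The most frequent word lands at index 2, and "sat", which did not make the cut, is
encoded as 1.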

Here's a dict mapping words to their indices:

In [10]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

The same encoding as above can be recovered from this dict. Note that "mat" is not in the
vocabulary, so we fall back to index 1 (the "out of vocabulary" index):

In [12]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index.get(w, 1) for w in test]

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings (an 822 MB zip file).

In [14]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip


--2022-08-19 22:21:23--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-08-19 22:21:24--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-08-19 22:21:24--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [19]:
path_to_glove_file = os.path.join(
    os.path.dirname(dataset), "glove.6B.100d.txt"
)

embeddings_index = {}
# Each line holds a word followed by its space-separated vector coefficients.
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
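
To make the file format concrete, here is a minimal sketch of how a single line is split
into a word and its coefficients. The sample line and its three values are made up for
illustration (real lines in `glove.6B.100d.txt` carry 100 values), and `parse_glove_line`
is a hypothetical helper that uses plain Python floats instead of NumPy.

```python
def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    # maxsplit=1 separates the word from the rest of the line in one step.
    word, coefs = line.split(maxsplit=1)
    return word, [float(value) for value in coefs.split()]


sample_line = "cat 0.25 -0.5 1.0\n"  # made-up 3-dimensional example
word, vector = parse_glove_line(sample_line)
print(word, vector)  # cat [0.25, -0.5, 1.0]
```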


Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the
pre-trained vector for the word of index `i` in our `vectorize_layer`'s vocabulary.

In [20]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representations for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 12056 words (6898 misses)
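
The hit/miss bookkeeping is easier to follow on a toy example. The vocabulary and the
2-dimensional vectors below are invented for illustration, and plain lists stand in for
the NumPy matrix.

```python
toy_vocab = ["", "[UNK]", "the", "malware", "xyzzy123"]
toy_embeddings = {"the": [0.1, 0.2], "malware": [0.3, 0.4]}  # made-up vectors
toy_dim = 2

# Start from all zeros, one row per vocabulary entry.
toy_matrix = [[0.0] * toy_dim for _ in toy_vocab]
toy_hits = toy_misses = 0
for i, word in enumerate(toy_vocab):
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
        toy_hits += 1
    else:
        # Rows for padding, OOV and unknown words stay all-zeros.
        toy_misses += 1

print(toy_hits, toy_misses)  # 2 3
print(toy_matrix[4])         # [0.0, 0.0]
```

The padding and OOV entries are counted as misses too, so the miss count is always at
least 2.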


Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).

In [21]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

## Build the model

A simple model: the frozen embeddings are averaged over the sequence with global average
pooling, followed by a small stack of dense layers and a classifier at the end.

In [51]:
from tensorflow.keras import layers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

model = Sequential([
    vectorize_layer,
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(384, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    Dense(192, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    Dense(16, activation="relu"),
    # Softmax output: the string loss "sparse_categorical_crossentropy"
    # expects probabilities, not logits.
    Dense(len(class_names), activation="softmax"),
])

# int_sequences_input = keras.Input(shape=(None,), dtype=tf.string)
# embedded_sequences = embedding_layer(int_sequences_input)
# x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
# x = layers.MaxPooling1D(5)(x)
# x = layers.Conv1D(128, 5, activation="relu")(x)
# x = layers.MaxPooling1D(5)(x)
# x = layers.Conv1D(128, 5, activation="relu")(x)
# x = layers.Flatten()(embedded_sequences)
# x = layers.GlobalAveragePooling1D()(x)
# x = layers.Dense(128, activation="relu")(x)
# x = layers.Dropout(0.5)(x)
# preds = layers.Dense(len(class_names), activation="softmax")(x)
# model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 200)              0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       multiple                  1895600   
                                                                 
 global_average_pooling1d_9   (None, 100)              0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_15 (Dense)            (None, 384)               38784     
                                                                 
 dropout_8 (Dropout)         (None, 384)               0         
                                                                 
 dense_16 (Dense)            (None, 192)               7
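
The `GlobalAveragePooling1D` layer is what turns the `(200, 100)` block of word vectors
for each report into a single 100-dimensional vector. A minimal pure-Python sketch of
that operation, assuming no masking (so padded positions are averaged in as well);
`global_average_pool` is a hypothetical helper, not the Keras implementation:

```python
def global_average_pool(sequence):
    """Average a (timesteps x features) list of lists over the timestep axis."""
    steps = len(sequence)
    # zip(*sequence) iterates over feature columns.
    return [sum(column) / steps for column in zip(*sequence)]


# Three timesteps, two features each (made-up values).
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded)
print(pooled)  # [3.0, 4.0]
```

Because the average is taken over positions, word order is discarded: the model sees
each report as a bag of embedded words.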

## Train the model

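Because `vectorize_layer` is the first layer of the model, the raw-string datasets can be
passed to `fit()` directly. The manual conversion to right-padded NumPy arrays of integer
indices is therefore not needed and is left commented out: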

In [52]:
# x_train = vectorize_layer(np.array([[s] for s in train_ds])).numpy()
# x_val = vectorize_layer(np.array([[s] for s in val_ds])).numpy()

# y_train = np.array(train_labels)
# y_val = np.array(val_labels)

We use categorical crossentropy as our loss since we're doing softmax classification.
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
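
As a rough illustration of what this loss computes for a single sample: the class
scores are turned into probabilities, and the loss is the negative log of the
probability assigned to the integer label. The helpers and the scores below are made up
for this sketch; Keras computes the same quantity on batched tensors.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(probs[label])


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for 3 classes
loss = sparse_categorical_crossentropy(0, probs)
print(round(loss, 3))
```

With one-hot labels the non-sparse variant would give the same value; the sparse form
simply skips building the one-hot vector.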

In [53]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(train_ds, epochs=20, validation_data=val_ds)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fdf8adc02d0>

## Export an end-to-end model

Now, we may want a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. It makes the model much more portable,
since you don't have to worry about the input preprocessing pipeline.

Our `vectorize_layer` is itself a Keras layer and is already the first layer of `model`,
so the trained model accepts raw strings directly:

In [64]:
# `model` starts with `vectorize_layer`, so it consumes raw strings as-is.
preds = model.predict(val_ds)  # one row of class scores per validation sample

# Take the argmax per sample; without `axis`, np.argmax indexes the flattened array.
classes = np.argmax(preds, axis=-1)
print(classes[0])
class_names[classes[0]]

9


'Turla'
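
`model.predict` returns one row of class scores per input, so mapping predictions to
labels means taking the argmax of each row separately. A small pure-Python sketch with
made-up scores and a shortened, hypothetical class list:

```python
toy_class_names = ["APT28", "Lazarus", "Turla"]
toy_preds = [
    [0.1, 0.2, 0.7],  # scores for sample 0
    [0.8, 0.1, 0.1],  # scores for sample 1
]

# Index of the highest score in each row, i.e. a per-sample argmax.
predicted = [max(range(len(row)), key=row.__getitem__) for row in toy_preds]
labels = [toy_class_names[i] for i in predicted]

print(predicted)  # [2, 0]
print(labels)     # ['Turla', 'APT28']
```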