---

## 11.3.3 Processing words as a sequence: The sequence model approach

In [1]:
import os
import sys
import shutil
import random
import pathlib

import numpy as np
import tensorflow as tf

from IPython.display import YouTubeVideo

---

### Word embeddings

A major breakthrough of the last decade: researchers discovered that systems can **learn to project words** into a **vector space** that retains **semantic relationships**.

The idea behind this discovery is that **similar words** (similar meanings) occur in **similar contexts**.

*You shall know a word by the company it keeps.*  
(J. R. Firth, "A Synopsis of Linguistic Theory", 1957, cf. also [the late Wittgenstein](https://plato.stanford.edu/entries/wittgenstein/#MeanUse))

The algorithms train on large text corpora, and do one of the following:
- count how often words occur in each context;
- predict the word given its context;
- predict the context given the word.
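
To make the "counting" option concrete, here is a minimal sketch of how (word, context) pairs could be counted. The two-sentence corpus and the window size of 2 are invented for illustration; real systems work on far larger corpora and add weighting schemes on top.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
WINDOW = 2  # hypothetical context size: two words on each side


def context_pairs(tokens, window):
    """Yield (word, context_word) pairs at most `window` positions apart."""
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield word, tokens[j]


pair_counts = Counter()
for sentence in corpus:
    pair_counts.update(context_pairs(sentence.split(), WINDOW))

# "cat" and "dog" never co-occur, but they share contexts ("the", "sat", "on"):
print(pair_counts[("cat", "sat")], pair_counts[("dog", "sat")])
print(pair_counts[("cat", "dog")])
```

Words that never appear together can still end up with similar count profiles, which is exactly the "similar contexts" intuition above.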

In the context of a large dictionary, word embeddings typically have 256, 512 or 1024 dimensions  
(the dimension of a vector is its length, i.e. the number of components).

One-hot encoded vectors can exceed 20,000 dimensions!  
Also, technically all one-hot encoded vectors are **orthogonal**: there is no similarity between them.

Word embeddings compress the information into fewer dimensions.  
Two word vectors can be compared to each other!
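
A small NumPy sketch of that difference. The dense vectors below are made up by hand (they are not learnt embeddings) and only serve to show that a similarity measure such as the cosine becomes meaningful once vectors are dense.

```python
import numpy as np

vocab = ["good", "great", "terrible"]

# One-hot: the identity matrix, one row per word.
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # dot product between "good" and "great"

# Dense vectors: hand-picked 3-d values, for illustration only.
dense = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.4]),
    "terrible": np.array([-0.7, 0.1, 0.2]),
}


def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(dense["good"], dense["great"]))
print(cosine(dense["good"], dense["terrible"]))
```

Every pair of distinct one-hot vectors has a dot product of zero, whereas the dense vectors can be ranked by how close they are.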

<!-- <img style="height: 700px" src="images/nlp/chollet.one-hot-embeddings.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-text-and-sequences/images/nlp/chollet.one-hot-embeddings.png?raw=true">

<small>DLWP, p.330</small>

|One-hot|Word embeddings|
|:---|:---|
|binary (integers: 0/1)|floating point vectors|
|sparse (most elements are zeros)|*dense*|
|very high-dimensional|low-to-medium-dimensional|
|hard-coded|learnt from data|


The result is that each token will be represented as a **coordinate** (aka a **vector**) in a high-dimensional space.

The most striking feature of these spaces is that they seem to encode **semantic relationships**!

<!-- <img src="images/nlp/linear-relationships.svg"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-text-and-sequences/images/nlp/linear-relationships.svg?raw=true">


<small>[Embeddings: Translating to a Lower-Dimensional Space, Google Foundational Courses, Machine Learning, Embeddings](https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space)</small>
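
The kind of linear relationship shown in the figure can be mimicked with a deliberately artificial example. The 2-d vectors below are built by hand (one axis loosely standing for "royalty", the other for "gender"), so the arithmetic works out exactly; in real embedding spaces such relationships hold only approximately.

```python
import numpy as np

# Hand-built 2-d vectors, not learnt from data.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}


def nearest(vector, table):
    """Return the word whose vector has the highest cosine similarity to `vector`."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(table, key=lambda w: cosine(vector, table[w]))


result = toy["king"] - toy["man"] + toy["woman"]
print(nearest(result, toy))
```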

#### Note: universal embedding

A universal embedding is unlikely, or very difficult to achieve (although [recent work](https://phillipi.github.io/prh/) proposes the opposite conjecture, thanks Peyton Hammersley for the references).

Semantic relationships depend on the task: the text corpus and what we are learning.

Expect different geometries for different tasks (e.g. sentiment analysis is very different from classification of legal documents).

#### Also: bias

The biases of your dataset **will be encoded** in the space (for instance, gendered associations between professions).

### References

#### Word2Vec

Google, 2013. Comes in two variants: the Skip-Gram model and Continuous Bag of Words (CBOW).

<small>[Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality", arxiv](https://arxiv.org/abs/1310.4546)</small>  

Perhaps the most famous word embedding scheme.

#### GloVe: *Global Vectors for Word Representation*

Stanford University, 2014

<small>[Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP 2014](https://nlp.stanford.edu/pubs/glove.pdf)</small>  


Based on factorizing a matrix of word co-occurrence statistics.

Pretrained vectors are available, trained on billions of English tokens harvested from Wikipedia and other sources.
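
To give a feel for "factorizing a matrix of co-occurrence statistics", here is a simplified stand-in: a truncated SVD of log counts. This is not GloVe's actual objective (GloVe fits a weighted least-squares model), and the counts below are invented, but the spirit is the same: words with similar co-occurrence rows end up with similar vectors.

```python
import numpy as np

# Invented word-by-context co-occurrence counts (rows: words, columns: contexts).
words = ["ice", "snow", "steam", "vapour"]
contexts = ["cold", "frozen", "hot", "boiling"]
counts = np.array([
    [8, 6, 0, 1],
    [7, 8, 1, 0],
    [1, 0, 9, 7],
    [0, 1, 6, 8],
], dtype=float)

log_counts = np.log1p(counts)                        # dampen large counts
u, s, vt = np.linalg.svd(log_counts, full_matrices=False)

dim = 2                                              # hypothetical embedding size
word_vectors = u[:, :dim] * np.sqrt(s[:dim])         # one 2-d vector per word


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


ice, snow, steam, _ = word_vectors
print(cosine(ice, snow), cosine(ice, steam))
```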

#### Learning word embeddings with the Embedding layer

Later, researchers discovered that you can simply **learn** these vectors with your DL model using backprop like everything else!

In current models, you just invoke a specific layer, and all the work is done for you.

Note that you must specify in advance the **dimensionality** of the embedding space.

As usual, more dimensions == more **resolution** (finer-grained), but more computationally expensive.

In [3]:
max_tokens = 8000
inputs = tf.keras.Input(shape=(None,), dtype="int64")  # sequences of word indices, of any length
embedding_layer = tf.keras.layers.Embedding( # ← EMBEDDING LAYER
    input_dim=max_tokens,                    # the size of the vocabulary (8000)
    output_dim=256                           # the dimensionality of the embedding space
)
x = embedding_layer(inputs)
model = tf.keras.Model(inputs, x)
model.summary()

The two arguments of the embedding layer code are:

```python
tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=256)
```

- input_dim = 8000 (the size of our vocab)
- output_dim = 256 (the dimension of the embedding space)

In Chollet's example (the IMDB sentiment task again), reviews have been truncated (or padded) to a constant length of 600 words.

E.g. `x_test[0] = [65, 16, 38, 1334, 88, 12, ..., 16, 5345, 19, 178, 32]`

65, 16, 38... are dictionary entries – word 65, word 16, word 38... in the 8000-word dictionary.

The `Embedding` layer creates a matrix of randomly initialised vectors, one for each word: $d_{vocab} \times d_{embedding}$.

Then, each element of `x_test[0]` is used as an index to retrieve the appropriate embedding vector => 600 vectors of dimension $256$.
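
In NumPy terms, this lookup is plain integer-array indexing. The sketch below uses a random matrix and a random index sequence as stand-ins for the layer's weights and for an encoded review; it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 8000, 256, 600

# Randomly initialised embedding matrix: one row per word in the vocabulary.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

# Stand-in for one encoded review: 600 word indices (random here, not real data).
sequence = rng.integers(0, vocab_size, size=seq_len)

vectors = embedding_matrix[sequence]  # picks one row per index
print(vectors.shape)
```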

#### Note: one-hot vector matrix multiplication as index retrieval

To my knowledge, the actual operation implemented in `Keras` and other libraries does not work this way, as it would require allocating a large sparse matrix. It is still worth noting that if your word index is encoded as a one-hot vector, a matrix multiplication will in effect retrieve the vector in the corresponding row:


$$
\begin{bmatrix}
0 & \cdots & 1_{\text{col r}} & \cdots & 0\\
\end{bmatrix}
\overbrace{
\begin{bmatrix}
\text{e}_{11} & \cdots & \text{e}_{1d_e}\\
\vdots & & \vdots \\
\color{red}{\text{e}_{r1}} & \cdots & \color{red}{\text{e}_{rd_e}} \\ 
\vdots & & \vdots \\
\text{e}_{d_v1} & \cdots &\text{e}_{d_vd_e}\\
\end{bmatrix}}^{\text{all embedding vectors (rows)}} = 
\begin{bmatrix}
\text{e}_{r1} & \cdots & \text{e}_{rd_e}\\
\end{bmatrix}
$$

Or the corresponding column, if transposed:

$$
\underbrace{
\begin{bmatrix}
\text{e}_{11} & \cdots & \overbrace{\color{red}{\text{e}_{1r}}}^{\text{selected embedding}} & \cdots & \text{e}_{1\ d_{v}}\\
\vdots & & \vdots & & \vdots  \\
\text{e}_{d_e1} & \cdots & \color{red}{\text{e}_{d_er}} & \cdots & \text{e}_{d_e\ d_v}\\
\end{bmatrix}
}_{\text{all embedding vectors (columns)}}
\begin{bmatrix}
0\\
\vdots \\
1_{\text{row r}} \\
\vdots \\
0\\
\end{bmatrix} =
\begin{bmatrix}
\text{e}_{1r}\\
\vdots \\
\text{e}_{d_er}\\
\end{bmatrix}
$$

Where:
- $e_{ij}$ is a single component (a floating point number) of an embedding vector
- $d_v$ is the vocabulary dimension (`input_dim`)
- $d_e$ is the embedding dimension (`output_dim`)

In our case, a vanilla way of retrieving the vectors would be to encode our sequence of integers as a matrix of one-hot vectors of dimensions $600 \times 8'000$, which would then be multiplied by an embedding matrix of dimensions $8'000 \times 256$ => a sequence of embeddings, 600 floating point vectors of length 256, or $600 \times 256$.
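
The equivalence between the one-hot multiplication and direct indexing can be checked numerically. The sizes are kept tiny here (a 10-word vocabulary, 4-d embeddings, both arbitrary) so that the one-hot matrix stays small.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                 # small, arbitrary sizes
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
sequence = np.array([3, 0, 7, 3])             # four word indices

one_hot = np.eye(vocab_size)[sequence]        # shape (4, 10): one one-hot row per index
via_matmul = one_hot @ embedding_matrix       # shape (4, 4)
via_index = embedding_matrix[sequence]        # same rows, no large matrix needed

print(np.allclose(via_matmul, via_index))
```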

#### Embedding layer: \# of parameters and dimensions

How many learnable parameters does the embedding weight matrix have?

$$
\bbox[5px,border:2px solid red]
{
\mathrm{input\_dim} \times \mathrm{output\_dim}
}
$$

That is:

$$
\bbox[5px,border:2px solid red]
{
\mathrm{vocab\_size} \times \mathrm{embed\_size}
}
$$

Example:

$ 10'000 \times 8 = 80'000$ elements.



The embedding layer takes as input tensors of shape `(batch_size, sequence_length)`.  
It outputs tensors of shape `(batch_size, sequence_length, output_dim)`.

As usual in Keras, the batch_size is represented as `None`:

$$
\bbox[5px,border:2px solid red]
{
In: (None, sequence\_length) \to Out: (None, sequence\_length, output\_dim)
}
$$

`output_dim` could be called `embed_dim`, the number of dimensions of our embedding space!
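
Both the parameter count and the shape transformation can be reproduced with a NumPy array standing in for the layer's weights (a sketch with an all-zeros matrix, using the sizes from our IMDB setup and a batch size of 32):

```python
import numpy as np

vocab_size, embed_dim = 8000, 256
batch_size, sequence_length = 32, 600

# Stand-in for the embedding weights: vocab_size x embed_dim parameters.
weights = np.zeros((vocab_size, embed_dim), dtype="float32")
print(weights.size)

# A batch of integer sequences goes in, a batch of vector sequences comes out.
batch = np.zeros((batch_size, sequence_length), dtype="int64")
out = weights[batch]
print(batch.shape, "->", out.shape)
```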



### A first practical example

Sentiment analysis on the IMDB dataset.

#### Downloading the data


In [4]:
DATASET_DIR = pathlib.Path("aclImdb")

if "google.colab" in sys.modules and not DATASET_DIR.exists():
    !curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xf aclImdb_v1.tar.gz # this untars the archive to a folder called aclImdb
    !rm -r aclImdb/train/unsup

MODELS_DIR = pathlib.Path("models")
MODELS_DIR.mkdir(exist_ok=True)

In [5]:
# code to split the data into train/val folders
DATASET_DIR = pathlib.Path("aclImdb")
TRAIN_DIR = DATASET_DIR / "train"
VAL_DIR = DATASET_DIR / "val"
TEST_DIR = DATASET_DIR / "test"

for category in ("neg", "pos"):
    if not os.path.isdir(VAL_DIR / category):    # do this only once
        os.makedirs(VAL_DIR / category)          # make 'neg'/'pos' dir in validation
        files = os.listdir(TRAIN_DIR / category) # list files in 'train'
        random.Random(1337).shuffle(files)       # shuffle using a seed
        num_val_samples = int(0.2 * len(files))  # 20% of our samples for validation
        val_files = files[-num_val_samples:]
        for fname in val_files:                  # move our files
            shutil.move(TRAIN_DIR / category / fname,
                        VAL_DIR / category / fname)

#### Processing using `text_dataset_from_directory`


The [`tf.keras.utils.text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory) utility builds a labelled `tf.data.Dataset` from a directory structure like so:
```
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
```

In [6]:
BATCH_SIZE = 32

# each of these iterables returns tuples containing two tensors:
# samples, shape: (batch_size, sample_shape) ← our texts
# targets, shape: (batch_size,)              ← 0 or 1
train_ds = tf.keras.utils.text_dataset_from_directory(
    TRAIN_DIR, batch_size=BATCH_SIZE
)
val_ds = tf.keras.utils.text_dataset_from_directory(
    VAL_DIR, batch_size=BATCH_SIZE
)
test_ds = tf.keras.utils.text_dataset_from_directory(
    TEST_DIR, batch_size=BATCH_SIZE
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [7]:
# Preparing integer sequence train/val/test datasets
max_length = 600   # we cut our sequences to 600 words max! (For memory.) This will affect performance...
max_tokens = 8000

text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
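
To see what this integer encoding does to a text, here is a simplified pure-Python imitation. It is a sketch, not the Keras implementation: the helper names are hypothetical, and only the general conventions are mirrored (lowercasing and stripping punctuation, index 0 for padding, index 1 for out-of-vocabulary words, truncation/padding to a fixed length).

```python
import re
from collections import Counter


def standardize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(texts, max_tokens):
    """Most frequent words first; slots 0 and 1 are reserved (padding, unknown)."""
    counts = Counter(w for t in texts for w in standardize(t))
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, length):
    """Map words to indices, then truncate or zero-pad to `length`."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text)][:length]
    return ids + [0] * (length - len(ids))


texts = ["A great film!", "A terrible film.", "Great acting"]
vocab = build_vocab(texts, max_tokens=5)
print(vocab)
print(vectorize("a great boring film, truly", vocab, length=6))
```

With `max_tokens=5`, only the three most frequent words get their own index; everything else (here "boring" and "truly") collapses into the unknown token.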

#### Training with `one_hot` vectors


In [8]:
# A sequence model built on one-hot encoded vector sequences

inputs = tf.keras.Input(shape=(None,), dtype="int64")
# ↓ our one-hot vectors --------------------------------------------
embedded = tf.keras.ops.one_hot(inputs, max_tokens) # tf.one_hot incompatible with keras in new versions
# ---------------------------------------------------- passed here ↓
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedded)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

In [9]:
model.summary()

In [10]:
# Training a first basic sequence model
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        str(MODELS_DIR / "one_hot_bidir_lstm.keras"),
        save_best_only=True
    )
]
model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=10,
    callbacks=callbacks
)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 113s 174ms/step - accuracy: 0.6192 - loss: 0.6332 - val_accuracy: 0.8436 - val_loss: 0.4016
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 147s 186ms/step - accuracy: 0.8504 - loss: 0.3838 - val_accuracy: 0.7244 - val_loss: 0.5503
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 143s 188ms/step - accuracy: 0.8750 - loss: 0.3252 - val_accuracy: 0.8628 - val_loss: 0.3209
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 134s 174ms/step - accuracy: 0.8979 - loss: 0.2863 - val_accuracy: 0.8762 - val_loss: 0.3561
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 108s 173ms/step - accuracy: 0.9151 - loss: 0.2568 - val_accuracy: 0.8658 - val_loss: 0.3175
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 152s 189ms/step - accuracy: 0.9231 - loss: 0.2262 - val_accuracy: 0.8692 - val_loss: 0.3277
Epoc

<keras.src.callbacks.history.History at 0x7c274494c0a0>

In [11]:
model = tf.keras.models.load_model(MODELS_DIR / "one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds, verbose=0)[1]:.3f}")
del model

Test acc: 0.855


#### Learning word embeddings with the `Embedding` layer

In [12]:
# A model that uses an `Embedding` layer trained from scratch

tf.keras.backend.clear_session()

inputs = tf.keras.Input(shape=(None,), dtype="int64")
# ↓ our embedding layer --------------------------------------------
embedded = tf.keras.layers.Embedding(
    input_dim=max_tokens, # our data comes in with a vocab size of `max_tokens`
    output_dim=256        # and comes out as dense vectors of dimension 256 (the embedding size)
)(inputs)
# ---------------------------------------------------- passed here ↓
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedded)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

Note that given our vocabulary of $8'000$ tokens the embedding layer is **large**: $256 \times 8'000 = 2'048'000$ elements.

In [13]:
model.summary()

In [14]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        str(MODELS_DIR / "embeddings_bidir_lstm.keras"),
        save_best_only=True
    )
]
model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=10,
    callbacks=callbacks
)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 41s 60ms/step - accuracy: 0.6196 - loss: 0.6325 - val_accuracy: 0.8026 - val_loss: 0.4396
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 53ms/step - accuracy: 0.8230 - loss: 0.4248 - val_accuracy: 0.8142 - val_loss: 0.4022
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 58ms/step - accuracy: 0.8603 - loss: 0.3736 - val_accuracy: 0.8580 - val_loss: 0.3404
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 40s 57ms/step - accuracy: 0.8800 - loss: 0.3204 - val_accuracy: 0.8484 - val_loss: 0.3408
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 51s 73ms/step - accuracy: 0.9046 - loss: 0.2642 - val_accuracy: 0.8664 - val_loss: 0.3336
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 70s 55ms/step - accuracy: 0.9216 - loss: 0.2273 - val_accuracy: 0.8730 - val_loss: 0.3516
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x7c27ac0f5c60>

In [15]:
model = tf.keras.models.load_model(MODELS_DIR / "embeddings_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds, verbose=0)[1]:.3f}") # in this case, roughly the same as one-hot
del model

Test acc: 0.857


---

### Using pretrained word embeddings

Pretrained word embeddings are useful **when training data is limited** – just as with pretrained convnets.

Very structured embeddings hopefully capture **generic structure** appropriate to diverse domains.

(The more data you can train on, the more likely your task-specific embeddings will perform better.)

Let's see how we can use GloVe embeddings in `tensorflow.keras`.

(The same method applies to Word2Vec or any other embedding technique.)

#### Download the data

In [16]:
GLOVE_DIR = pathlib.Path("glove") # I have my file in a folder called 'glove'

if not GLOVE_DIR.exists():
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q glove.6B.zip -d glove # unzip to a directory called "glove"
    !rm glove.6B.zip                # remove the zip file

PATH_TO_GLOVE_FILE =  GLOVE_DIR / "glove.6B.100d.txt"

#### Inspect the data

In [17]:
# another Jupyter magic: use $python_variable in bash commands
!head -n 1 $PATH_TO_GLOVE_FILE
# ↓ the word "the" followed by its coordinates in a 100-dimensional space

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062


#### Import into an `Embedding` layer

In [18]:
# parsing the GloVe word-embeddings file
embeddings_index = {}                                   # our dictionary: {'word': np.array([...coordinates..])}
with open(PATH_TO_GLOVE_FILE, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)            # split: word | coordinates
        coefs = np.fromstring(coefs, "float", sep=" ")  # load string floats into numpy, space-separated
        embeddings_index[word] = coefs                  # save into dictionary

print(f"Found {len(embeddings_index):,} word vectors.")

Found 400,000 word vectors.


In [19]:
embedding_dim = 100

# we reuse the same TextVectorization object as earlier, turning our sentences into integers
# max_length: 600, max_tokens: 8000
vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

# preparing the GloVe word-embeddings matrix
embedding_matrix = np.zeros((max_tokens, embedding_dim))     # create a matrix (max_tokens, embedding_dim)
for word, i in word_index.items():                           # looping through our vocab
    if i < max_tokens:                                       # don't try and retrieve beyond max_tokens
        embedding_vector = embeddings_index.get(word)        # try and get the vector associated with the word
        if embedding_vector is not None:                     # if the vector exists
            embedding_matrix[i] = embedding_vector           # assign it to our matrix
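
One consequence of this procedure is that any word of our vocabulary absent from the pretrained dictionary keeps an all-zeros row. A toy version makes this visible; the 3-d "pretrained" vectors and the five-word vocabulary below are invented for illustration.

```python
import numpy as np

# Made-up 3-d "pretrained" vectors standing in for the GloVe dictionary.
pretrained = {
    "film": np.array([0.1, 0.2, 0.3]),
    "great": np.array([0.4, 0.5, 0.6]),
}
# Padding token, unknown token, two known words, one word without a vector.
vocabulary = ["", "[UNK]", "film", "great", "zzyzx"]

matrix = np.zeros((len(vocabulary), 3))
for i, word in enumerate(vocabulary):
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = vector

# Rows still at zero: tokens for which no pretrained vector was found.
missing = [w for i, w in enumerate(vocabulary) if not matrix[i].any()]
print(missing)
```

On the real data, counting such rows is a quick way to check how much of the vocabulary is covered by the pretrained vectors.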

In [20]:
embedding_layer = tf.keras.layers.Embedding(
    max_tokens,
    embedding_dim,        # using our embedding matrix through an initializer
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,      # WE DO NOT TRAIN IT!
    mask_zero=True,
)

# Given that our network is initialized randomly, the massive changes it undergoes at the beginning
# of training would certainly affect/damage the representations in our embedding matrix
# (same scenario as with pretrained ConvNets)

#### Define model & Train

In [21]:
# A model that uses a pretrained Embedding layer
tf.keras.backend.clear_session()

inputs = tf.keras.Input(shape=(None,), dtype="int64")
# ↓ our embedding layer --------------------------------------------
embedded = embedding_layer(inputs)
# ---------------------------------------------------- passed here ↓
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedded)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

In [22]:
model.summary()

In [23]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        str(MODELS_DIR / "glove_embeddings_sequence_model.keras"),
        save_best_only=True
    )
]
model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=10,
    callbacks=callbacks
)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 55ms/step - accuracy: 0.6133 - loss: 0.6418 - val_accuracy: 0.7982 - val_loss: 0.4468
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 37s 58ms/step - accuracy: 0.7835 - loss: 0.4693 - val_accuracy: 0.8238 - val_loss: 0.3928
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 38s 54ms/step - accuracy: 0.8112 - loss: 0.4135 - val_accuracy: 0.8370 - val_loss: 0.3720
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 80ms/step - accuracy: 0.8340 - loss: 0.3787 - val_accuracy: 0.8446 - val_loss: 0.3593
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 34s 55ms/step - accuracy: 0.8500 - loss: 0.3477 - val_accuracy: 0.8596 - val_loss: 0.3285
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 62ms/step - accuracy: 0.8613 - loss: 0.3282 - val_accuracy: 0.8602 - val_loss: 0.3192
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x7c27347d36d0>

In [24]:
print(f"Test acc: {model.evaluate(int_test_ds, verbose=0)[1]:.3f}")

Test acc: 0.858


### Save models to Google Drive


In [27]:
EXPORT=False

if EXPORT:
    # zip models
    !zip sequences.models.zip {MODELS_DIR}/*
    # connect to drive
    from google.colab import drive
    drive.mount('/content/drive')
    # copy zip to drive (adjust folder as needed)
    !cp sequences.models.zip drive/MyDrive/IS53024B-Artificial-Intelligence/models

## Summary

### Word embeddings

- **Various kinds of word encodings**:
  - **one-hot/multi-hot**: the presence of words is marked by a 1 (binary) → *sparse* & *hard-coded*
  - **word embeddings**: project words/tokens into vector spaces where collocations between words ("the company a word keeps") are modeled as the distance between vectors. → *dense* & *learnt from data*
- Two most important embedding models
    - **Word2Vec** from Google
    - **GloVe** from Stanford
- **Embedding layers** can be trained end to end with your net!
- **Pretrained embeddings** can also be plugged into your own models, like pretrained networks!
  