# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 1: Document Classification/Regression with Neural Networks</font>

# <font color="#003660">Notebook 6: Long Short-Term Memory (LSTM) Networks with Word Embeddings as Features</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... understand the logic behind recurrent neural networks, especially LSTMs, and <br>
        ... are able to train an LSTM with word embeddings as features.
    </font>
</div>
</center>
</p>

# What are Recurrent Neural Networks?

## Simple Recurrent Neural Network (RNN)

An RNN processes a sequence by iterating over its elements while maintaining a *state* that summarizes what it has seen so far. In effect, an RNN layer is a neural network layer with an internal loop, as shown in the figure below.

<br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/rnn1.png"/><br>
<center>Source: Chollet (2021)</center>

The figure below shows a simple RNN unrolled over time. As the figure shows, the output of the layer at each timestep is a combination of

1. its direct data input (`input_t`),
2. the layer's state from the previous timestep (`state_t`), and
3. a bias term (`bo`).

<br><img width=700 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/rnn2.png"/><br>

<center>Source: Chollet (2021)</center>
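
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not what Keras executes internally: the weight names `W`, `U`, and `b`, the `tanh` activation, and the toy dimensions are assumptions chosen for the example.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b):
    """Run a simple RNN over one sequence of shape (timesteps, input_dim)."""
    state_t = np.zeros(U.shape[0])            # initial state: all zeros
    outputs = []
    for input_t in inputs:                    # iterate over the timesteps
        # Combine the current input, the previous state, and the bias
        output_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(output_t)
        state_t = output_t                    # the output becomes the next state
    return np.stack(outputs)                  # shape: (timesteps, output_dim)

rng = np.random.default_rng(0)
timesteps, input_dim, output_dim = 5, 3, 4    # toy sizes (assumed)
inputs = rng.normal(size=(timesteps, input_dim))
W = rng.normal(size=(output_dim, input_dim))
U = rng.normal(size=(output_dim, output_dim))
b = np.zeros(output_dim)

outputs = simple_rnn_forward(inputs, W, U, b)
print(outputs.shape)  # (5, 4)
```

With `return_sequences = False`, as in the model trained later in this notebook, only the last row (`outputs[-1]`) would be passed on to the next layer.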

## Long Short-Term Memory (LSTM) Networks

Compared to a simple RNN layer, an LSTM layer contains one central innovation: a *carry track* that allows information from any previous timestep to be carried over to the current timestep. Consequently, the output of the layer is a combination of



1. its direct data input (`input_t`),
2. the layer's state from the previous timestep (`state_t`),
3. the input from the carry track (`c_t`), and
4. a bias term (`bo`).

<br><img width=700 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/lstm.png"/><br>
<center>Source: Chollet (2021)</center>
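
The following NumPy sketch shows one common textbook formulation of a single LSTM step, in which three gates (input, forget, output) control what is written to and read from the carry track. It is an illustration under assumptions: the stacked weight layout `W`, `U`, `b` and the helper names are chosen for this example, and the exact equations in the figure and in Keras may be arranged differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(input_t, state_t, c_t, W, U, b):
    """One LSTM timestep; W, U, b stack the parameters of all four transformations."""
    units = state_t.shape[0]
    z = W @ input_t + U @ state_t + b         # shape: (4 * units,)
    i = sigmoid(z[0 * units:1 * units])       # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])       # forget gate: how much old carry to keep
    g = np.tanh(z[2 * units:3 * units])       # candidate values to write
    o = sigmoid(z[3 * units:4 * units])       # output gate: how much carry to expose
    c_next = f * c_t + i * g                  # update the carry track
    state_next = o * np.tanh(c_next)          # new state / output of this timestep
    return state_next, c_next

rng = np.random.default_rng(0)
input_dim, units = 3, 4                       # toy sizes (assumed)
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

state_t, c_t = np.zeros(units), np.zeros(units)
for input_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 timesteps
    state_t, c_t = lstm_step(input_t, state_t, c_t, W, U, b)
print(state_t.shape, c_t.shape)  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate is close to 0, `c_next` is almost identical to `c_t`, which is how information can survive across many timesteps.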

## Comparing RNNs with MLPs

Andrej Karpathy's legendary blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" contains a very informative comparison of RNNs with traditional neural networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `sklearn` is a free software machine learning library for the Python programming language.
- `tensorflow` is an end-to-end open source platform for machine learning, especially deep learning.
- `matplotlib` is a plotting library for the Python programming language and its numerical mathematics extension NumPy.


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from sklearn import metrics
import matplotlib.pyplot as plt

Check if we are running on GPU.

In [3]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

# Load documents

Load wine reviews (Source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a csv file.

In [None]:
corpus = pd.read_csv("https://raw.githubusercontent.com/olivermueller/amlta-2025/main/Session_01/winemag-data-130k-v2.csv")

Split data into three sets: training, validation, and test.

In [5]:
training = corpus.iloc[0:80000,].sample(n=10000) # sample to speed up training
validation = corpus.iloc[80000:100000,]
test = corpus.iloc[100000:,]

For each dataset, store features and targets in separate variables.

In [6]:
train_corpus_features = training[["description"]]
train_corpus_target = training[["points"]]
val_corpus_features = validation[["description"]]
val_corpus_target = validation[["points"]]
test_corpus_features = test[["description"]]
test_corpus_target = test[["points"]]

Create [TensorFlow `Datasets`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) from the pandas DataFrames. The use of TensorFlow Datasets follows a common pattern:

1. Create a dataset from raw data (e.g., a pandas DataFrame, a CSV file, multiple text files).
2. Apply transformations to preprocess the data in the dataset (e.g., vectorize text data).
3. Iterate over the dataset and process its elements. Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
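
The three steps can be tried out on a tiny in-memory example. The sketch below uses made-up reviews and scores and a deliberately simple transformation (lowercasing); it only illustrates the create, transform, iterate pattern and is not part of the notebook's pipeline.

```python
import tensorflow as tf

# Toy reviews and scores, made up for illustration
texts = ["Crisp and FRUITY", "Heavy oak and tannins", "Light, floral nose"]
points = [88, 91, 85]

# 1. Create a dataset from raw in-memory data
ds = tf.data.Dataset.from_tensor_slices((texts, points))

# 2. Apply a transformation to every element (here: lowercase the text)
ds = ds.map(lambda text, score: (tf.strings.lower(text), score))

# 3. Iterate over the dataset in batches of two elements
batch_sizes = []
for text_batch, score_batch in ds.batch(2):
    batch_sizes.append(int(text_batch.shape[0]))
    print(text_batch.numpy(), score_batch.numpy())

print(batch_sizes)  # [2, 1]
```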

Here, we use the `from_tensor_slices` constructor to create datasets from dataframes.

In [7]:
train_ds = tf.data.Dataset.from_tensor_slices((tf.cast(train_corpus_features.values, tf.string),
                                               tf.cast(train_corpus_target.values, tf.int32)))

val_ds = tf.data.Dataset.from_tensor_slices((tf.cast(val_corpus_features.values, tf.string),
                                             tf.cast(val_corpus_target.values, tf.int32)))

test_ds = tf.data.Dataset.from_tensor_slices((tf.cast(test_corpus_features.values, tf.string),
                                              tf.cast(test_corpus_target.values, tf.int32)))



Display some stats and examples from the created datasets.

In [8]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (1,)
inputs.dtype: <dtype: 'string'>
targets.shape: (1,)
targets.dtype: <dtype: 'int32'>
===
inputs[0]: tf.Tensor(b"There's a salinity to this wine, which is grown to the west of Gilroy on the lower foothills of the Santa Cruz Mountains. Aromas are brackish and suggest sweet lemons, while the palate offers apples, squeezed citrus and a waxy character.", shape=(), dtype=string)
targets[0]: tf.Tensor(84, shape=(), dtype=int32)
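
The leading dimension of 1 in the shapes above comes from selecting the columns with double brackets: `training[["description"]]` returns a one-column DataFrame whose `.values` array is two-dimensional, so slicing it along the first axis yields elements of shape `(1,)`. The sketch below reproduces this with a made-up two-row DataFrame; the data are purely illustrative.

```python
import pandas as pd

# Tiny stand-in for the wine review corpus (values are made up)
toy = pd.DataFrame({"description": ["dry and crisp", "rich and oaky"],
                    "points": [87, 92]})

as_frame = toy[["points"]].values    # double brackets -> DataFrame -> 2D array
as_series = toy["points"].values     # single brackets -> Series -> 1D array

print(as_frame.shape)       # (2, 1)
print(as_series.shape)      # (2,)
print(as_frame[0].shape)    # (1,)  one slice along the first axis
print(as_series[0].shape)   # ()    a scalar
```

Because each element already carries this extra axis, Keras later treats every review as a batch of size 1.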


# Vectorize documents

We will now use [TensorFlow's `TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer to transform raw texts into numerical vectors. Again, we map each unique word to an integer.

In [9]:
max_tokens = 10000
max_length = 100

text_vectorization = TextVectorization(
    max_tokens = max_tokens,
    output_mode = "int",
    output_sequence_length = max_length
)

Some aspects of the `TextVectorization` layer (e.g., the size and contents of the vocabulary) have to be fit on the training data, which is done with the `adapt` method.

In [10]:
train_ds_features_only = train_ds.map(lambda x, y: x)
text_vectorization.adapt(train_ds_features_only)



Show the vocabulary that our vectorizer knows after being fit to the training data.

In [11]:
text_vectorization.get_vocabulary()[0:10]

['', '[UNK]', 'and', 'the', 'a', 'of', 'with', 'this', 'is', 'wine']
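
Conceptually, fitting the vocabulary amounts to counting tokens and keeping the most frequent ones, with index 0 reserved for padding and index 1 for out-of-vocabulary words (`[UNK]`). The plain-Python sketch below mimics this idea on made-up sentences; the function name `build_vocabulary`, the simplified lowercasing and punctuation stripping, and the alphabetical tie-breaking are assumptions of this example, not the layer's actual implementation.

```python
import re
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return a list of tokens in which the list index is the integer ID."""
    counts = Counter()
    for text in texts:
        cleaned = re.sub(r"[^\w\s]", "", text.lower())  # lowercase, strip punctuation
        counts.update(cleaned.split())                  # split on whitespace
    # Most frequent first; ties broken alphabetically (an assumption of this sketch)
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    # Two slots are reserved: '' for padding and '[UNK]' for unknown words
    return ["", "[UNK]"] + ranked[:max_tokens - 2]

toy_texts = ["This wine is dry.", "This wine is fruity and fresh.", "A dry, fresh wine!"]
vocab = build_vocabulary(toy_texts, max_tokens=6)
print(vocab)  # ['', '[UNK]', 'wine', 'dry', 'fresh', 'is']
```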

Next, we apply our `text_vectorization` layer to all three datasets. This corresponds to step 2 of the pattern described above.

In [12]:
vectorized_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

Show results.

In [13]:
for inputs, targets in vectorized_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (1, 100)
inputs.dtype: <dtype: 'int64'>
targets.shape: (1,)
targets.dtype: <dtype: 'int32'>
===
inputs[0]: tf.Tensor(
[ 125    4 2811   12    7    9  146    8  652   12    3 2595    5    1
   15    3 3733 3115    5    3 1123 4413 2045   17   29    1    2  514
   51 1132   57    3   18   52  479 2796   58    2    4 1217   83    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0], shape=(100,), dtype=int64)
targets[0]: tf.Tensor(84, shape=(), dtype=int32)
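
The output above illustrates three properties of the integer encoding: known words are replaced by their vocabulary index, unknown words become 1, and short sequences are padded with 0 up to `max_length` (longer ones are truncated). A minimal plain-Python sketch of this mapping, using a made-up six-entry vocabulary and a hypothetical helper `encode`:

```python
def encode(text, vocab, max_length):
    """Map a whitespace-tokenized text to a fixed-length list of integer IDs."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.lower().split()]  # 1 = '[UNK]'
    ids = ids[:max_length]                         # truncate long sequences
    return ids + [0] * (max_length - len(ids))     # pad short sequences with 0

toy_vocab = ["", "[UNK]", "wine", "dry", "fresh", "is"]  # made-up vocabulary

short_ids = encode("this wine is dry", toy_vocab, max_length=6)
long_ids = encode("dry wine is fresh wine is dry wine", toy_vocab, max_length=6)
print(short_ids)  # [1, 2, 5, 3, 0, 0]
print(long_ids)   # [3, 2, 5, 4, 2, 5]
```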


# Train model

We are now ready to specify a neural network and feed it with the vectorized datasets. For convenience, we define a custom function `get_model` which defines the network architecture, creates a model from it, and compiles this model (by defining, e.g., an optimizer and loss function).

Instead of averaging or flattening the outputs of the embedding layer, an RNN/LSTM layer can directly process its 2D output (i.e., it takes a sequence of vectors as input instead of a single vector).
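
To see what the LSTM receives, it helps to look at the shapes involved. The NumPy sketch below uses a small random matrix as a stand-in for the embedding layer (the sizes and values are made up): looking up token IDs yields one vector per token, whereas averaging would collapse each document into a single vector and discard word order.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4                # toy sizes (assumed)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Two padded documents of length 5; 0 is the padding ID
token_ids = np.array([[3, 7, 2, 0, 0],
                      [5, 1, 9, 4, 0]])

embedded = embedding_matrix[token_ids]           # lookup: one vector per token
print(embedded.shape)                            # (2, 5, 4): documents x timesteps x dims

averaged = embedded.mean(axis=1)                 # one vector per document
print(averaged.shape)                            # (2, 4): word order is lost

mask = token_ids != 0                            # True for real tokens, False for padding
print(mask.sum(axis=1))                          # [3 4] real tokens per document
```

In the model below, `mask_zero=True` asks the embedding layer to produce such a mask so that the LSTM can skip the padded timesteps.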

In [None]:
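def get_model(hidden_dim=32):
    # Input: one sequence of max_length integer token IDs per document
    inputs = keras.Input(shape=(max_length,), dtype="int64")
    # Map each token ID to a trainable 300-dimensional vector; mask_zero=True masks padding (ID 0)
    embedded = layers.Embedding(input_dim=max_tokens, output_dim=300, mask_zero=True)(inputs)
    # The LSTM reads the sequence of vectors and returns only its final output
    hidden1 = layers.LSTM(hidden_dim, return_sequences = False)(embedded)
    # Single linear unit for the regression target (points)
    outputs = layers.Dense(1, activation = "linear")(hidden1)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer = tf.optimizers.Adam(),
                  loss = "mean_absolute_error",
                  metrics = ["mean_absolute_error"])
    return model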

Instantiate the model and show its architecture.

In [15]:
model = get_model()
model.summary()



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 100)]             0         
                                                                 
 embedding (Embedding)       (None, 100, 300)          3000000   
                                                                 
 lstm (LSTM)                 (None, 32)                42624     
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 3042657 (11.61 MB)
Trainable params: 3042657 (11.61 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
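
The parameter counts in the summary can be verified by hand. An embedding layer stores one vector per vocabulary entry; an LSTM layer has four internal transformations, each with input weights, recurrent weights, and a bias; a dense layer has one weight per input plus a bias. A quick arithmetic check in Python, using the sizes from the model above:

```python
max_tokens, embedding_dim, hidden_dim = 10000, 300, 32   # sizes used in the model

embedding_params = max_tokens * embedding_dim
# 4 transformations x (input weights + recurrent weights + bias)
lstm_params = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)
dense_params = hidden_dim * 1 + 1
total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 3000000
print(lstm_params)       # 42624
print(dense_params)      # 33
print(total_params)      # 3042657
```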


Fit model on training data and save best model to disk.

In [None]:
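callbacks = [keras.callbacks.ModelCheckpoint("lstm.keras", save_best_only=True)]

# Note: `batch_size` should not be passed to `fit` when the input is a tf.data.Dataset,
# because the dataset itself determines the batches. Here each element has shape (1, 100),
# so every training step processes a single review.
history = model.fit(vectorized_train_ds.cache(),
          validation_data = vectorized_val_ds.cache(),
          epochs = 3,
          callbacks = callbacks)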

Epoch 1/3



Epoch 2/3
 1592/10000 [===>..........................] - ETA: 4:26 - loss: 2.4970 - mean_absolute_error: 2.4970
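
With `save_best_only=True`, the checkpoint file is only overwritten when the monitored quantity improves (by default the validation loss), so `lstm.keras` ends up holding the best epoch rather than the last one. The logic can be sketched in plain Python; the validation losses below are made up for illustration and are not results from this notebook.

```python
# Hypothetical validation losses per epoch (made up, not from the run above)
val_losses = [2.10, 1.85, 1.92]

best_loss = float("inf")
best_epoch = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_loss:           # improvement: "save" this epoch's model
        best_loss, best_epoch = val_loss, epoch
        print(f"Epoch {epoch}: val_loss improved to {val_loss:.2f}, saving model")
    else:                              # no improvement: keep the earlier checkpoint
        print(f"Epoch {epoch}: val_loss {val_loss:.2f} did not improve")

print(best_epoch, best_loss)  # 2 1.85
```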

Plot the learning process.

In [None]:
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.title('Model mean absolute error')
plt.ylabel('Mean Absolute Error')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Make predictions

Load best model from training phase.

In [None]:
model = keras.models.load_model("lstm.keras")

Make predictions on test set.

In [None]:
preds = model.predict(vectorized_test_ds)

Calculate the mean absolute error on the test set.

In [None]:
print(metrics.mean_absolute_error(test_corpus_target, preds))
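
The mean absolute error is the average absolute difference between predicted and actual points. A minimal NumPy sketch with made-up numbers (not results from this notebook), which also flattens the `(n, 1)` array that `model.predict` returns for a single output unit and compares the result against a naive baseline:

```python
import numpy as np

y_true = np.array([88, 91, 85, 90])                  # made-up actual scores
y_pred = np.array([[87.0], [93.5], [85.0], [88.5]])  # made-up predictions, shape (4, 1)

mae = np.mean(np.abs(y_true - y_pred.ravel()))
print(mae)  # 1.25

# Naive baseline for comparison: always predict the mean score
baseline_mae = np.mean(np.abs(y_true - y_true.mean()))
print(baseline_mae)  # 2.0
```

Comparing against such a baseline helps judge whether a model's error is meaningfully low; in practice the baseline mean would be computed on the training set.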