This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

# Deep learning for text

## Natural-language processing: The bird's eye view

When you find yourself writing a large number of ad hoc rules, ask yourself if there
is a better way: can you use a corpus of data to automate the creation of these rules?

- 1st approach: decision trees, i.e. lots of if-then-else rules
- 2nd approach: logistic regression
- Computers can now predict the next words from large amounts of data and statistics, which supports tasks such as:
    - content filtering
    - text classification
    - sentiment analysis
    - language modeling
    - translation
    - summarization

From roughly 2015 to 2017, recurrent neural networks dominated NLP, helped by the
release of TensorFlow and Keras in 2015. It was the era of the bidirectional LSTM.
Soon after, starting around 2017-2018, the Transformer architecture came around
to replace them.

## Preparing text data

Vectorizing text is the process of transforming text into numeric tensors.

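It takes three steps, described in the sections below: text standardization, text
splitting (tokenization), and vocabulary indexing.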

### Text standardization

Standardization applies basic transforms that make text easier to process, e.g.
lowercasing and removing punctuation. It is a form of feature engineering: it erases
differences that are unimportant and that you do not want your model to have to
worry about. Examples:

- Convert to lowercase and remove punctuation
- Convert special characters to a standard form, e.g. remove accents and combining characters
- Stemming: convert variations of a word to a shared stem (root)
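
As a rough illustration of the first two transforms, here is a minimal sketch that uses only the standard library. The `standardize` helper is hypothetical (it is not a Keras function), it only strips ASCII punctuation, and it leaves stemming out because that needs a dedicated library.

```python
import string
import unicodedata

def standardize(text):
    """Lowercase, strip accents, and remove ASCII punctuation."""
    text = text.lower()
    # NFKD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that "é" becomes "e".
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove ASCII punctuation; every other character is kept as-is.
    return "".join(c for c in no_accents if c not in string.punctuation)

print(standardize("Café, Déjà vu!"))  # cafe deja vu
```

After this step, "Café," and "cafe" map to the same string, so the model no longer has to learn that they mean the same thing.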



### Text splitting (tokenization)

Split the text into tokens, like characters, words, or groups of words.

- Word level: tokens are substrings separated by spaces
- N-gram: tokens are groups of N (or fewer) consecutive words
- Character level: each character is its own token. This is rarely used
  except for text generation or speech recognition
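
The three tokenization schemes can be sketched in a few lines of plain Python. The helper names below are made up for illustration, and the text is assumed to be standardized already.

```python
def word_tokens(text):
    # Word level: split on whitespace.
    return text.split()

def ngram_tokens(text, n):
    # N-gram: every group of 1..n consecutive words, joined by a space.
    words = text.split()
    return [
        " ".join(words[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(words) - size + 1)
    ]

def char_tokens(text):
    # Character level: each character is its own token.
    return list(text)

sentence = "the cat sat"
print(word_tokens(sentence))      # ['the', 'cat', 'sat']
print(ngram_tokens(sentence, 2))  # ['the', 'cat', 'sat', 'the cat', 'cat sat']
print(char_tokens("cat"))         # ['c', 'a', 't']
```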

There are two kinds of text-processing models: sequence models and bag-of-words
models. Sequence models care about token order; bag-of-words models treat the
tokens as a set and discard their order.

### Vocabulary indexing

This is the process of encoding your tokens as numbers. You find all unique
tokens in the training data and assign each one a unique integer index.

```python
vocabulary = {}
for text in dataset:
    text = standardize(text)
    tokens = tokenizer(text)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
```
You can then convert each integer index into a one-hot vector.

```python
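import numpy as np

def one_hot_encode_token(token):
    # One slot per vocabulary entry; only the token's own slot is set to 1.
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1
    return vector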
```
- It is common to restrict the vocabulary to the top 20,000 to 30,000 most common words
  in the training set. Text datasets often contain many unique words that are not
  very common, and indexing them all would make the feature space excessively large.
- Some words will not appear in your vocabulary. Map these to an out-of-vocabulary
  (OOV) index. When you decode, that index becomes a placeholder such as "[UNK]".
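
The two points above can be sketched as follows. `build_vocabulary` and `encode` are hypothetical helpers, and reserving index 0 for padding and index 1 for OOV tokens is an assumption that mirrors the pure-Python `Vectorizer` class in the next section.

```python
from collections import Counter

def build_vocabulary(token_lists, max_tokens):
    """Index the most frequent tokens; 0 is reserved for padding, 1 for OOV."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocabulary = {"": 0, "[UNK]": 1}
    # Keep only as many tokens as fit after the two reserved entries.
    for token, _ in counts.most_common(max_tokens - 2):
        vocabulary[token] = len(vocabulary)
    return vocabulary

def encode(tokens, vocabulary):
    # Tokens missing from the vocabulary map to the OOV index 1.
    return [vocabulary.get(token, 1) for token in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)                                  # {'': 0, '[UNK]': 1, 'the': 2, 'sat': 3}
print(encode(["the", "bird", "sat"], vocab))  # [2, 1, 3]
```

With `max_tokens=4` only the two most frequent words survive, so "bird" is encoded as the OOV index.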


### Using the TextVectorization layer

Here is the whole process written in pure Python:

In [1]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

In [2]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [3]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


You can use a `TextVectorization` layer to do much the same.

- By default it lowercases the text, removes punctuation, and splits on whitespace
- You can also supply your own standardization and splitting functions. (These must operate on `tf.string` tensors, not regular Python strings.)


In [4]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",
)


In [5]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

In [6]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

**Displaying the vocabulary**

Note that the vocabulary is sorted by frequency, so the most common words come first (after the reserved `''` and `[UNK]` entries).

In [7]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 np.str_('erase'),
 np.str_('write'),
 np.str_('then'),
 np.str_('rewrite'),
 np.str_('poppy'),
 np.str_('i'),
 np.str_('blooms'),
 np.str_('and'),
 np.str_('again'),
 np.str_('a')]

The example sentence is encoded and decoded in the cells further down. Before that, two notes on where vectorization runs:

- `TextVectorization` is mostly a dictionary lookup operation, so it runs on the CPU and cannot be executed on a GPU.
- You can vectorize in two places: in the data pipeline, or as part of the model.

```python
# Part of the data pipeline
int_sequence_dataset = string_dataset.map(  # string_dataset yields string tensors
    text_vectorization,
    num_parallel_calls=4,  # Number of elements to process in parallel on the CPU
)
```

```python
# Part of the Model
text_input = keras.Input(shape=(), dtype="string") # Create symbolic input that expects strings
vectorized_text = text_vectorization(text_input) # Apply the vectorization
embedded_input = keras.layers.Embedding(...)(vectorized_text) # Chain functions on top...
output = ...
model = keras.Model(text_input, output)
```
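When vectorization is part of the model, it runs synchronously with the rest of the model: at every training step, the rest of the model (possibly on a GPU) has to wait for the CPU-bound vectorization to finish.
When it is part of the data pipeline, it runs asynchronously on the CPU while the model trains, and only once if the dataset is cached. The data-pipeline option therefore gives the best performance when training on a GPU or TPU.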

In [8]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [9]:
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


## Two approaches for representing groups of words: Sets and sequences

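A key question is whether word order should matter. Bag-of-words models ignore
the order, recurrent models process words in order, and Transformers are a
hybrid: they are technically order-agnostic, but they inject word-position
information into the representations they process.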
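
To make the difference concrete, here is a small sketch using two made-up sentences that contain the same words in a different order. A bag-of-words view cannot tell them apart, while a sequence view can.

```python
from collections import Counter

first = "the dog bit the man".split()
second = "the man bit the dog".split()

# Bag-of-words view: only which words occur (and how often) is kept.
print(Counter(first) == Counter(second))  # True

# Sequence view: the order of the tokens is kept.
print(first == second)  # False
```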

### Preparing the IMDB movie reviews data

In [19]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz



In [20]:
!rm -r aclImdb/train/unsup

Always inspect your data before feeding it to a model.

In [21]:
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [22]:
# Shuffle and do validation split
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
print(val_dir)
train_dir = base_dir / "train"
print(train_dir)
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files) # Shuffle the list of files
    num_val_samples = int(0.2 * len(files)) # Take 20% of files for validation
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

aclImdb/val
aclImdb/train


In [23]:
# Use `text_dataset_from_directory` to create the train, validation, and test datasets
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


**Displaying the shapes and dtypes of the first batch**

In [24]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'As seems to be the general gist of these comments, the film has some stunning animation (I watched it on blu-ray) but it really falls short of any real depth.<br /><br />Firstly the characters are all pretty dull. I got a hint of a kind of Laputa situation between Agito, Toola and the main antagonist Shunack. However maybe my mind wanderd and this was wishful thinking (Laputa being my favourite anim\xc3\xa9, original Engilsh dub). The characters are not really lovable either and as mentioned in another post they fall in love exceptionally quickly, leaving poor old Minka jealous and rejected (she loves Agito, who seems oblivious of this). However she promptly seems to forgive Toola at the end with no explanation for the change of heart other than it makes the ending a little bit more "happy". <br /><br />There is also a serious lack of explanation. Like who are

### Processing words as a set: The bag-of-words approach

This is the easiest way to process text. You discard the order and treat the text
as a bag of words. You can look at single words (unigrams), or try to recover some
local order information by looking at groups of consecutive words (N-grams).

#### Single words (unigrams) with binary encoding

With unigrams, each token is a single word. You can represent an entire text as a
single vector: with multi-hot (binary) encoding, the vector has one entry per word
in the vocabulary, set to 1 if the word is present in the text and 0 otherwise.
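
A minimal pure-Python sketch of multi-hot encoding follows. The tiny vocabulary and its index layout (0 for padding, 1 for out-of-vocabulary words) are assumptions made for illustration and need not match the indices that Keras assigns.

```python
def multi_hot_encode(tokens, vocabulary):
    # One slot per vocabulary entry, 1 if the token occurs at least once.
    vector = [0] * len(vocabulary)
    for token in tokens:
        # Unknown tokens fall back to the OOV slot at index 1.
        vector[vocabulary.get(token, 1)] = 1
    return vector

vocabulary = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4, "mat": 5}
print(multi_hot_encode("the cat sat on the cat".split(), vocabulary))
# [0, 1, 1, 1, 1, 0]
```

Repeated words and word order leave no trace in the result: "the" and "cat" appear twice but still contribute a single 1 each.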

**Preprocessing our datasets with a `TextVectorization` layer**

In [26]:
text_vectorization = TextVectorization(
    max_tokens=20000, # Limit vocab.
    output_mode="multi_hot", # Use multi-hot binary vectors
)
text_only_train_ds = train_ds.map(lambda x, y: x) # get a dataset with text only inputs
text_vectorization.adapt(text_only_train_ds) # This indexes the vocabulary

# Prepare processed version of our training and validation and test sets
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)



**Inspecting the output of our binary unigram dataset**

In [27]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


**Our model-building utility**

In [28]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

**Training and testing the binary unigram model**

In [0]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(), # Here we cache the data in memory
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

#### Bigrams with binary encoding

**Configuring the `TextVectorization` layer to return bigrams**

In [29]:
text_vectorization = TextVectorization(
    ngrams=2, # Here we use bigrams
    max_tokens=20000,
    output_mode="multi_hot",
)

**Training and testing the binary bigram model**

In [30]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")



Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 9s 12ms/step - accuracy: 0.7821 - loss: 0.4664 - val_accuracy: 0.9006 - val_loss: 0.2604
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9109 - loss: 0.2513 - val_accuracy: 0.9032 - val_loss: 0.2557
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9295 - loss: 0.2082 - val_accuracy: 0.9032 - val_loss: 0.2780
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9408 - loss: 0.1862 - val_accuracy: 0.8976 - val_loss: 0.2870
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9447 - loss: 0.1725 - val_accuracy: 0.8980 - val_loss: 0.3125
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.9493 - loss: 0.1777 - val_accuracy: 0.8944 - val_loss: 0.3340
Epoch 7/10
625/625

#### Bigrams with TF-IDF encoding

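You can add a bit more information by counting how many times each token occurs,
i.e. taking a histogram of the tokens over the text. Raw counts are dominated by
words such as "the" or "a" that occur in nearly every document, so the counts are
usually normalized with TF-IDF (term frequency, inverse document frequency): a
term's frequency inside the current document is weighted down by how often the term
appears across all documents. Unlike mean/variance normalization, this keeps zero
entries at zero, so the vectors stay sparse, which is computationally cheaper.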
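
As a rough sketch of the idea, the function below uses one common smoothed TF-IDF variant on a toy dataset of token lists. The exact formula is an assumption chosen for illustration and may differ from what `TextVectorization` computes internally.

```python
import math

def tfidf(term, document, dataset):
    # Term frequency: how often the term occurs in this document.
    term_freq = document.count(term)
    # Document frequency: in how many documents the term occurs at all.
    doc_freq = sum(1 for doc in dataset if term in doc)
    # Smoothed inverse document frequency; terms found everywhere get weight 1.
    inverse_doc_freq = math.log((1 + len(dataset)) / (1 + doc_freq)) + 1
    return term_freq * inverse_doc_freq

dataset = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
]
print(tfidf("the", dataset[0], dataset))  # 1.0
print(tfidf("cat", dataset[0], dataset))  # about 1.69
print(tfidf("dog", dataset[0], dataset))  # 0.0
```

"the" and "cat" both occur once in the first document, but "cat" gets the larger weight because it is rare across the dataset. A term that is absent from the document keeps a weight of zero.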

**Configuring the `TextVectorization` layer to return token counts**

In [31]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

**Configuring `TextVectorization` to return TF-IDF-weighted outputs**

In [32]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

**Training and testing the TF-IDF bigram model**

In [33]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 9s 13ms/step - accuracy: 0.7289 - loss: 0.6409 - val_accuracy: 0.8942 - val_loss: 0.2899
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.8761 - loss: 0.3141 - val_accuracy: 0.8774 - val_loss: 0.2889
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.8967 - loss: 0.2795 - val_accuracy: 0.8846 - val_loss: 0.2849
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9036 - loss: 0.2552 - val_accuracy: 0.8770 - val_loss: 0.2962
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9094 - loss: 0.2480 - val_accuracy: 0.8820 - val_loss: 0.2984
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9092 - loss: 0.2358 - val_accuracy: 0.8706 - val_loss: 0.3234
Epoch 7/10
625/625

In [34]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [35]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

95.48 percent positive


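On many text-classification datasets, TF-IDF typically gives about a
one-percentage-point increase in accuracy over plain binary encoding, although the
validation accuracies logged above do not show such a gain here. If you want to
export your model, you will need to incorporate the text processing into the model
itself. This is easy to do with the `TextVectorization` layer, as the inference
model above shows: it takes raw strings as input.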