# 🥙 LSTM on Recipe Data

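In this notebook, we'll walk through the steps required to train your own LSTM. Although the title refers to recipe data, the corpus used here is three F. Scott Fitzgerald books from Project Gutenberg.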

In [None]:
#John Rogers
#Generative AI
#Assignment 5
#12/13/2024

In [1]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

In [2]:
VOCAB_SIZE = 20000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 50

## 1. Load the data <a name="load"></a>

In [3]:
import requests

# List of URLs for the training texts (F. Scott Fitzgerald books)
urls = [
    "https://www.gutenberg.org/cache/epub/805/pg805.txt",  # This Side of Paradise
    "https://www.gutenberg.org/cache/epub/64317/pg64317.txt",  # The Great Gatsby
    "https://www.gutenberg.org/cache/epub/6695/pg6695.txt",  # Tales of the Jazz Age
]

In [4]:
# Initialize an empty string to hold all text
all_text = ""

In [5]:
# Download each text file and append to all_text
for url in urls:
    response = requests.get(url)
    text = response.text
    all_text += text + "\n\n"  # Separate texts with blank lines

# Save combined text to a single file
with open("combined_fScott.txt", "w", encoding="utf-8") as file:
    file.write(all_text)

print(all_text)

The Project Gutenberg eBook of This Side of Paradise
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: This Side of Paradise

Author: F. Scott Fitzgerald

Release date: February 1, 1997 [eBook #805]
                Most recently updated: June 22, 2011

Language: English

Credits: Produced by David Reed, Ken Reeder, and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK THIS SIDE OF PARADISE ***




Produced by David Reed, and Ken Reeder





THIS SIDE OF PARADISE

By F. Scott Fitzgerald


      ... Well this side of Paradise!...
       There's
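
The printed output shows that each download starts with Project Gutenberg front matter, which the notebook keeps in the training text. A minimal sketch of one way to trim it is below. The START pattern follows the marker line visible in the output above; the END pattern is an assumption that the closing marker has the same shape, and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the original notebook.

```python
import re

# START pattern mirrors the marker line seen in the printed output.
# END pattern is assumed to follow the same shape at the end of each file.
START_RE = re.compile(
    r"^\*\*\* START OF THE PROJECT GUTENBERG EBOOK [^\n]*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* END OF THE PROJECT GUTENBERG EBOOK [^\n]*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the two markers, or the text unchanged
    if either marker is missing or they appear in the wrong order."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()


sample = (
    "This ebook is for the use of anyone anywhere...\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THIS SIDE OF PARADISE ***\n"
    "THIS SIDE OF PARADISE\n"
    "By F. Scott Fitzgerald\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THIS SIDE OF PARADISE ***\n"
    "Licence text...\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If used, the helper would be applied to each `response.text` before it is appended to `all_text`. Falling back to the unchanged text keeps the download loop working even when a file does not match the assumed markers.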

In [6]:
with open("combined_fScott.txt", "r", encoding="utf-8") as file:
    all_text = file.read()

In [7]:
text_data = all_text.split("\n")

filtered_data = [
    "Text: " + line
    for line in text_data
    if line.strip()
]

In [8]:
example = filtered_data[15046]
print(example)

Text: glasses, and siphon one of the bottles was handed back; thereafter the

## 2. Tokenise the data


In [9]:
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s

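padded_text = [pad_punctuation(x) for x in filtered_data]

# Note: this prints a raw line from text_data (which still contains blank lines),
# not a padded one, so index 15046 refers to a different line than in filtered_data
example_data = text_data[15046]
print(example_data)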

“Look at that,” she whispered, and then after a moment: “I’d like to
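
To make the effect of `pad_punctuation` concrete, here is a small standalone sketch that reuses the same function on two short strings (both made up for illustration). It also shows a limitation: `string.punctuation` only contains ASCII punctuation, so the curly quotes used throughout this corpus are not separated from the neighbouring characters.

```python
import re
import string


def pad_punctuation(s):
    # Surround every ASCII punctuation character with spaces...
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    # ...then collapse runs of spaces into a single space
    s = re.sub(" +", " ", s)
    return s


ascii_example = pad_punctuation("Hello, world!")
curly_example = pad_punctuation("“Look,” she said")

print(repr(ascii_example))
print(repr(curly_example))
```

In the second string the comma is padded, but the opening curly quote stays attached to `Look`. Handling those characters would need an extended character class, which is left out here.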


In [10]:
# Build a batched, shuffled dataset from the raw lines
# (text_data is used here, so the punctuation padding above is not applied)
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [11]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [12]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [13]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: the
3: and
4: a
5: of
6: to
7: in
8: he
9: was
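
The mapping above has two reserved entries: index 0 is the empty padding token and index 1 is `[UNK]`, followed by words in descending order of frequency. The sketch below is a simplified pure-Python stand-in for that behaviour (lowercase, split on whitespace, pad or truncate to a fixed length). It is not the Keras implementation, and `build_vocab` and `encode` are hypothetical names.

```python
from collections import Counter


def build_vocab(lines, max_tokens):
    """Padding and [UNK] take the first two slots; the rest are the most
    frequent lowercased, whitespace-separated words."""
    counts = Counter(word for line in lines for word in line.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(line, vocab, seq_len):
    """Map words to indices (unknown words -> 1), then truncate or pad with 0."""
    word_to_index = {word: index for index, word in enumerate(vocab)}
    ids = [word_to_index.get(word, 1) for word in line.lower().split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))


toy_lines = ["the cat and the dog", "the cat sat"]
toy_vocab = build_vocab(toy_lines, max_tokens=4)
encoded = encode("The cat barked", toy_vocab, seq_len=5)

print(toy_vocab)
print(encoded)
```

Because `max_tokens` includes the two reserved entries, only two real words survive in this toy vocabulary, and `barked` falls back to index 1.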


## 3. Create the Training Set

In [14]:
# Create the training set: each line of text and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_ds = text_ds.map(prepare_inputs)
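
The slicing in `prepare_inputs` pairs every token with the one that follows it. A minimal sketch with a plain Python list (the token ids are made up) shows the alignment:

```python
def make_input_target(token_ids):
    """Inputs are all tokens but the last; targets are all tokens but the first."""
    return token_ids[:-1], token_ids[1:]


# A hypothetical tokenised line: four words followed by padding zeros
tokens = [12, 7, 45, 3, 0, 0]
x, y = make_input_target(tokens)

for inp, tgt in zip(x, y):
    print(f"input {inp:>2} -> target {tgt:>2}")
```

This is why the vectorisation layer is given `MAX_LEN + 1` as its output length: after dropping one token from each end, both `x` and `y` have `MAX_LEN` positions.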

## 4. Build the LSTM <a name="build"></a>

In [15]:
# Define the LSTM model

inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

# Build the model (it is compiled in the next cell)
lstm = models.Model(inputs, outputs)


# Print the lstm summary
lstm.summary()
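
The summary output is not shown here, so as a rough cross-check the parameter counts can be worked out by hand. The sketch below assumes the default layer settings (bias terms enabled, a standard LSTM with four gates) and uses the constants from the Parameters section; the helper names are hypothetical.

```python
VOCAB_SIZE = 20000
EMBEDDING_DIM = 100
N_UNITS = 128


def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per token
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


n_embedding = embedding_params(VOCAB_SIZE, EMBEDDING_DIM)
n_lstm = lstm_params(EMBEDDING_DIM, N_UNITS)
n_dense = dense_params(N_UNITS, VOCAB_SIZE)
n_total = n_embedding + n_lstm + n_dense

print(f"Embedding: {n_embedding:,}")
print(f"LSTM:      {n_lstm:,}")
print(f"Dense:     {n_dense:,}")
print(f"Total:     {n_total:,}")
```

Under these assumptions most of the parameters sit in the embedding and output layers, both of which scale with `VOCAB_SIZE`, rather than in the LSTM itself.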


In [16]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)
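
For a single position, sparse categorical cross-entropy is the negative log of the probability the model assigned to the true next token. The sketch below computes it by hand with made-up numbers. The second half illustrates a likely consequence of this setup: the embedding layer does not mask padding, so as far as I can tell the padded positions are averaged into the loss as well, which can pull the reported value down.

```python
import math


def sparse_categorical_crossentropy(probs, target_index):
    """Loss for one position: -log of the probability given to the true token."""
    return -math.log(probs[target_index])


# Hypothetical distribution over a three-word vocabulary
predicted = [0.1, 0.7, 0.2]
loss_when_right = sparse_categorical_crossentropy(predicted, 1)
loss_when_wrong = sparse_categorical_crossentropy(predicted, 0)

# Hypothetical probabilities given to the true token at ten positions:
# two real words followed by eight easy-to-predict padding positions
true_token_probs = [0.2, 0.1] + [0.99] * 8
per_position = [-math.log(p) for p in true_token_probs]
mean_all = sum(per_position) / len(per_position)
mean_words_only = sum(per_position[:2]) / 2

print(f"confident and right: {loss_when_right:.4f}")
print(f"confident and wrong: {loss_when_wrong:.4f}")
print(f"mean over all positions: {mean_all:.4f}")
print(f"mean over word positions only: {mean_words_only:.4f}")
```

With lines of a dozen or so words padded to `MAX_LEN`, this may be one reason the training loss looks very low while the generated text is still rough.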

## 5. Train the LSTM <a name="train"></a>

In [17]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):  # top_k is accepted but not used
        super().__init__()
        self.index_to_word = index_to_word
        self._model = None
        # Inverse vocabulary mapping (word -> token index)
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    @property
    def model(self):
        return self._model

    @model.setter
    def model(self, value):
        self._model = value

    def sample_from(self, probs, temperature):
        # Rescale the probabilities with the temperature, then sample a token
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Words missing from the lowercased vocabulary (including capitalised
        # prompt words) map to [UNK], index 1
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Generate until max_tokens is reached or the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)
            # Sample the next word from the distribution at the last timestep
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("As he walked into class", max_tokens=100, temperature=1.0)
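
The temperature step in `sample_from` raises each probability to the power `1 / temperature` and renormalises. A small standalone sketch with a made-up three-word distribution shows what that does; `apply_temperature` is a hypothetical pure-Python version of the same two lines.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature, then renormalise to sum to 1."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


base = [0.7, 0.2, 0.1]
sharp = apply_temperature(base, 0.2)   # low temperature
same = apply_temperature(base, 1.0)    # temperature 1.0 leaves it unchanged
flat = apply_temperature(base, 5.0)    # high temperature

for name, dist in [("T=0.2", sharp), ("T=1.0", same), ("T=5.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 concentrates almost all of the mass on the most likely word, while a temperature well above 1 pushes the distribution towards uniform, so unlikely words get sampled far more often.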

In [19]:
# Train the model
text_generator = TextGenerator(vocab)
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator]
)

Epoch 1/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - loss: 0.1027
generated text:
As he walked into class well and talking to the city, and the

967/967 ━━━━━━━━━━━━━━━━━━━━ 18s 18ms/step - loss: 0.1027
Epoch 2/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - loss: 0.0980
generated text:
As he walked into class blankly.

967/967 ━━━━━━━━━━━━━━━━━━━━ 17s 18ms/step - loss: 0.0980
Epoch 3/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - loss: 0.0964
generated text:
As he walked into class fate in case with a certain verve that

967/967 ━━━━━━━━━━━━━━━━━━━━ 17s 18ms/step - loss: 0.0964
Epoch 4/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - loss: 0.0943
generated text:
As he walked into class thought of her. she doesn't carry

967/967 ━━━━━━━

<keras.src.callbacks.history.History at 0x7ac8700f85e0>

## 6. Generate text using the LSTM

In [20]:
def print_probs(info, vocab, top_k=5):
    # Print the top_k most likely next words for each generation step
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, idx in zip(p_sorted, i_sorted):
            print(f"{vocab[idx]}:   \t{np.round(100*p,2)}%")
        print("--------\n")

In [24]:

info = text_generator.generate(
    "After they ate at the restaurant", max_tokens=10, temperature=1.0
)


generated text:
After they ate at the restaurant like loud asked and



In [25]:
print_probs(info, vocab)


PROMPT: After they ate at the restaurant
[UNK]:   	85.31%
and:   	2.25%
so:   	1.48%
amory.:   	1.25%
tired:   	1.21%
--------


PROMPT: After they ate at the restaurant like
a:   	57.2%
loud:   	5.83%
another:   	4.62%
wells:   	4.19%
[UNK]:   	3.8%
--------


PROMPT: After they ate at the restaurant like loud
and:   	24.7%
with:   	15.37%
by:   	14.99%
things.:   	6.3%
himself:   	5.22%
--------


PROMPT: After they ate at the restaurant like loud asked
the:   	91.13%
and:   	1.92%
him,:   	1.35%
amory:   	0.78%
many:   	0.57%
--------
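
One detail of the prompt handling is worth keeping in mind when reading these tables. The vocabulary is lowercased by the vectorisation layer, but `generate` looks up the prompt words exactly as typed, so a capitalised first word such as `After` is not found and is encoded as `[UNK]`. The sketch below illustrates this with a toy vocabulary; `encode_prompt` and its `lowercase` switch are hypothetical and not part of the notebook.

```python
# Toy vocabulary in the same layout as the real one: padding, [UNK], then words
toy_vocab = ["", "[UNK]", "the", "after", "they", "ate", "at", "restaurant"]
word_to_index = {word: index for index, word in enumerate(toy_vocab)}


def encode_prompt(prompt, lowercase):
    """Look up each prompt word, falling back to index 1 ([UNK])."""
    words = prompt.lower().split() if lowercase else prompt.split()
    return [word_to_index.get(word, 1) for word in words]


prompt = "After they ate at the restaurant"
as_written = encode_prompt(prompt, lowercase=False)
lowered = encode_prompt(prompt, lowercase=True)

print("as written:", as_written)
print("lowercased:", lowered)
```

Lowercasing the prompt before the lookup would let the model see the first word. Whether that changes the generated text noticeably has not been tested here.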



In [26]:
info2 = text_generator.generate(
    "After they ate at the restaurant", max_tokens=10, temperature=0.2
)


generated text:
After they ate at the restaurant [UNK] was quite chatter,



In [27]:
print_probs(info2, vocab)


PROMPT: After they ate at the restaurant
[UNK]:   	100.0%
and:   	0.0%
so:   	0.0%
amory.:   	0.0%
tired:   	0.0%
--------


PROMPT: After they ate at the restaurant [UNK]
after:   	55.37%
away:   	42.16%
was:   	1.86%
she:   	0.35%
beyond:   	0.06%
--------


PROMPT: After they ate at the restaurant [UNK] was
quite:   	99.99%
[UNK]:   	0.01%
by:   	0.0%
too:   	0.0%
for:   	0.0%
--------


PROMPT: After they ate at the restaurant [UNK] was quite
chatter,:   	60.28%
a:   	36.63%
able:   	2.72%
sceptical:   	0.26%
radiant:   	0.05%
--------



In [34]:
info3 = text_generator.generate(
    "After they ate at the restaurant", max_tokens=7, temperature=1.0
)
print_probs(info3, vocab)


generated text:
After they ate at the restaurant and


PROMPT: After they ate at the restaurant
[UNK]:   	85.31%
and:   	2.25%
so:   	1.48%
amory.:   	1.25%
tired:   	1.21%
--------



In [41]:
def lstm2(num_layers=2, num_units=256, dropout_rate=0.2):
    inputs = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
    for _ in range(num_layers):
        x = layers.LSTM(num_units, return_sequences=True)(x)
        x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    lstm_model = models.Model(inputs, outputs)
    return lstm_model

model2 = lstm2()
model2.summary()

In [42]:
lstm_2 = lstm2(num_layers=2, num_units=256)
lstm_2.compile("adam", loss_fn)

In [45]:
lstm_2.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator]
)

Epoch 1/50
966/967 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.2768
generated text:
As he walked into class followed gatsby’s stenographers managed the of

967/967 ━━━━━━━━━━━━━━━━━━━━ 27s 28ms/step - loss: 0.2768
Epoch 2/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.2378
generated text:
As he walked into class the settled the tapestries

967/967 ━━━━━━━━━━━━━━━━━━━━ 27s 27ms/step - loss: 0.2378
Epoch 3/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.2225
generated text:
As he walked into class in the understand,” these reputation and married

967/967 ━━━━━━━━━━━━━━━━━━━━ 27s 28ms/step - loss: 0.2225
Epoch 4/50
967/967 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.2085
generated text:
As he walked into class and we was he’s just “i’ve

<keras.src.callbacks.history.History at 0x7ac87eef0250>

In [70]:
info1 = text_generator.generate(
    "Once exams were finished", max_tokens=10, temperature=1.0
)
print_probs(info1, vocab)


generated text:
Once exams were finished through the course of sophomore normal


PROMPT: Once exams were finished
and:   	25.31%
through:   	14.69%
by:   	12.31%
with:   	10.44%
along:   	6.16%
--------


PROMPT: Once exams were finished through
the:   	77.76%
her:   	7.77%
a:   	6.13%
his:   	2.95%
an:   	1.22%
--------


PROMPT: Once exams were finished through the
[UNK]:   	34.04%
front:   	5.05%
pages:   	5.0%
course:   	4.54%
corner:   	2.39%
--------


PROMPT: Once exams were finished through the course
of:   	99.89%
:   	0.02%
to:   	0.01%
you:   	0.01%
[UNK]:   	0.01%
--------


PROMPT: Once exams were finished through the course of
the:   	77.26%
his:   	2.99%
a:   	2.83%
her:   	1.77%
my:   	1.49%
--------


PROMPT: Once exams were finished through the course of sophomore
:   	58.15%
two:   	3.22%
o.:   	2.51%
many:   	2.0%
marble:   	1.49%
--------



In [72]:
info2 = text_generator.generate(
    "Around Christimas time is when", max_tokens=10, temperature=2.0
)
print_probs(info2, vocab)


generated text:
Around Christimas time is when goes council georgia buy they


PROMPT: Around Christimas time is when
they:   	10.16%
he:   	7.96%
i:   	6.11%
we:   	4.9%
she:   	4.55%
--------


PROMPT: Around Christimas time is when goes
out:   	5.93%
here:   	3.76%
on:   	3.7%
in:   	3.65%
to:   	2.24%
--------


PROMPT: Around Christimas time is when goes council
to:   	27.59%
in:   	3.56%
on:   	2.68%
for:   	2.62%
here:   	2.11%
--------


PROMPT: Around Christimas time is when goes council georgia
do:   	3.27%
you:   	2.86%
see:   	2.25%
buy:   	1.8%
find:   	1.61%
--------


PROMPT: Around Christimas time is when goes council georgia buy
the:   	7.83%
a:   	5.27%
my:   	5.14%
your:   	3.35%
it:   	2.82%
--------



In [80]:
info3 = text_generator.generate(
    "They had plans over the summer to", max_tokens=10, temperature=4.0
)
print_probs(info3, vocab)


generated text:
They had plans over the summer to listen henry kid


PROMPT: They had plans over the summer to
[UNK]:   	0.09%
eat:   	0.08%
come:   	0.08%
washington:   	0.08%
them.:   	0.07%
--------


PROMPT: They had plans over the summer to listen
to:   	0.26%
him:   	0.18%
into:   	0.13%
around:   	0.13%
them:   	0.13%
--------


PROMPT: They had plans over the summer to listen henry
to:   	0.14%
become:   	0.13%
laughed:   	0.12%
[UNK]:   	0.1%
started:   	0.09%
--------



In [89]:
info4 = text_generator.generate(
    "He wanted to hangout with his friends but", max_tokens=10, temperature=5.0
)
print_probs(info4, vocab)


generated text:
He wanted to hangout with his friends but sallee twenty-five


PROMPT: He wanted to hangout with his friends but
to:   	0.09%
[UNK]:   	0.07%
that:   	0.07%
look:   	0.07%
i:   	0.07%
--------


PROMPT: He wanted to hangout with his friends but sallee
in:   	0.27%
that:   	0.23%
for:   	0.22%
to:   	0.21%
at:   	0.19%
--------



I've observed that raising the temperature too far (4.0 and 5.0 in the runs above) seems to make the generated sentences shorter and less coherent.

At first there didn't seem to be much change between the sentences generated by the second LSTM and those from the first, but a few sentences were slightly more coherent.