# 🥙 LSTM on Recipe Data

In this notebook, we'll walk through the steps required to train your own LSTM on the recipes dataset.

In [1]:
#%load_ext autoreload
#%autoreload 2

import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

In [2]:
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

## 1. Load the data <a name="load"></a>

In [3]:
# Load the full dataset
with open("./full_format_recipes/full_format_recipes.json") as json_data:
    recipe_data = json.load(json_data)

In [37]:
print(len(recipe_data))  # list type
print(recipe_data[0])
print(recipe_data[0].keys())

20130
{'directions': ['1. Place the stock, lentils, celery, carrot, thyme, and salt in a medium saucepan and bring to a boil. Reduce heat to low and simmer until the lentils are tender, about 30 minutes, depending on the lentils. (If they begin to dry out, add water as needed.) Remove and discard the thyme. Drain and transfer the mixture to a bowl; let cool.', '2. Fold in the tomato, apple, lemon juice, and olive oil. Season with the pepper.', '3. To assemble a wrap, place 1 lavash sheet on a clean work surface. Spread some of the lentil mixture on the end nearest you, leaving a 1-inch border. Top with several slices of turkey, then some of the lettuce. Roll up the lavash, slice crosswise, and serve. If using tortillas, spread the lentils in the center, top with the turkey and lettuce, and fold up the bottom, left side, and right side before rolling away from you.'], 'fat': 7.0, 'date': '2006-09-01T04:00:00.000Z', 'categories': ['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turkey', 'Vegetab

In [5]:
# Filter the dataset
filtered_data = [
    "Recipe for " + x["title"] + " | " + " ".join(x["directions"])
    for x in recipe_data
    if "title" in x
    and x["title"] is not None
    and "directions" in x
    and x["directions"] is not None
]

In [6]:
# Count the recipes
n_recipes = len(filtered_data)
print(f"{n_recipes} recipes loaded")
#print(filtered_data[0])

20111 recipes loaded


In [7]:
example = filtered_data[0]
print(example)

Recipe for Lentil, Apple, and Turkey Wrap  | 1. Place the stock, lentils, celery, carrot, thyme, and salt in a medium saucepan and bring to a boil. Reduce heat to low and simmer until the lentils are tender, about 30 minutes, depending on the lentils. (If they begin to dry out, add water as needed.) Remove and discard the thyme. Drain and transfer the mixture to a bowl; let cool. 2. Fold in the tomato, apple, lemon juice, and olive oil. Season with the pepper. 3. To assemble a wrap, place 1 lavash sheet on a clean work surface. Spread some of the lentil mixture on the end nearest you, leaving a 1-inch border. Top with several slices of turkey, then some of the lettuce. Roll up the lavash, slice crosswise, and serve. If using tortillas, spread the lentils in the center, top with the turkey and lettuce, and fold up the bottom, left side, and right side before rolling away from you.


## 2. Tokenise the data

In [8]:
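# Pad punctuation with spaces so that each mark is treated as a separate 'word'
def pad_punctuation(s):
    # re.escape keeps characters such as \ ] ^ - literal inside the character class
    s = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", s)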
    s = re.sub(" +", " ", s)
    return s


text_data = [pad_punctuation(x) for x in filtered_data]

In [9]:
len(text_data)

20111

In [10]:
# Display an example of a recipe
example_data = text_data[0]
example_data

'Recipe for Lentil , Apple , and Turkey Wrap | 1 . Place the stock , lentils , celery , carrot , thyme , and salt in a medium saucepan and bring to a boil . Reduce heat to low and simmer until the lentils are tender , about 30 minutes , depending on the lentils . ( If they begin to dry out , add water as needed . ) Remove and discard the thyme . Drain and transfer the mixture to a bowl ; let cool . 2 . Fold in the tomato , apple , lemon juice , and olive oil . Season with the pepper . 3 . To assemble a wrap , place 1 lavash sheet on a clean work surface . Spread some of the lentil mixture on the end nearest you , leaving a 1 - inch border . Top with several slices of turkey , then some of the lettuce . Roll up the lavash , slice crosswise , and serve . If using tortillas , spread the lentils in the center , top with the turkey and lettuce , and fold up the bottom , left side , and right side before rolling away from you . '

In [11]:
# Convert to a TensorFlow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [12]:
for batch in text_ds.take(1):  # Take only 1 batch for display
    print(batch.numpy().shape)  # Convert Tensor to NumPy array for readability
    print(batch.numpy()[0])

(32,)
b'Recipe for Coffee Doughnuts With Coffee Glaze | Combine flour , baking powder , baking soda , and salt in a large bowl . Beat granulated sugar and butter in the bowl of a stand mixer fitted with the paddle attachment on medium speed until smooth , about 3 minutes . Add egg and continue to beat , scraping down sides of bowl as needed , until incorporated . Reduce speed to low and gradually add buttermilk and coffee concentrate , beating just until combined . Gradually add dry ingredients and beat just until dough comes together . Turn out dough onto a large piece of parchment paper and cover with a second sheet of parchment . Using a rolling pin , roll dough between parchment sheets to about 1 / 3 " thick . Transfer dough in parchment to a rimmed baking sheet and freeze until firm , about 20 minutes . Peel off top sheet of parchment . Working on bottom sheet , punch out as many rounds as you can with 3 " cutter , then use 1 " cutter to punch out center of each round . Gather dou

In [13]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [14]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [15]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a


In [16]:
test = 'and with the ,'
example_test = vectorize_layer(test)
print(example_test.numpy())

[4 8 7 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
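
To make the mapping above concrete, here is a minimal pure-Python sketch of roughly what the layer does with `standardize="lower"` and integer output: lowercase the text, split on whitespace, look each word up in the vocabulary, then truncate or pad with zeros. The helper `toy_vectorize` and the length of 8 are illustrative assumptions; the ten-word vocabulary is copied from the mapping printed earlier, and the real `TextVectorization` layer is implemented differently.

```python
# Toy vocabulary: the first ten entries printed above (0 = padding, 1 = unknown)
toy_vocab = ["", "[UNK]", ".", ",", "and", "to", "in", "the", "with", "a"]


def toy_vectorize(text, vocab, length):
    """Hypothetical helper: map a string to a fixed-length list of token ids."""
    word_to_index = {word: index for index, word in enumerate(vocab)}
    words = text.lower().split()  # lowercase, then split on whitespace
    ids = [word_to_index.get(word, 1) for word in words]  # 1 = [UNK]
    ids = ids[:length]  # truncate long inputs
    return ids + [0] * (length - len(ids))  # pad short inputs with 0


print(toy_vectorize("and with the ,", toy_vocab, length=8))
print(toy_vectorize("And lentils", toy_vocab, length=8))
```

The first call gives `[4, 8, 7, 3, 0, 0, 0, 0]`, the same leading ids as the layer's output above. In the second call, "lentils" is missing from the toy vocabulary and so maps to 1.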


In [17]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())
print(example_tokenised.numpy().shape)

[  26   16 1733    3  428    3    4  221  212   27   11    2   64    7
  300    3  924    3  353    3  576    3  307    3    4   24    6    9
   29   80    4   84    5    9   69    2  153   17    5  134    4   70
   10    7  924   79   85    3   19  126   12    3 1135   28    7  924
    2   34   92  316  601    5  162  124    3   18   39  151  542    2
   35   71    4  206    7  307    2  120    4   40    7   31    5    9
   21   22   67   60    2   15    2  255    6    7  265    3  428    3
  109  104    3    4  252   37    2   63    8    7   33    2   36    2
    5 1567    9  212    3   64   11 2863  105   28    9  370  387  207
    2  166  256   14    7 1733   31   28    7  613 1908  215    3  447
    9   11   13   53  813    2   72    8  688  160   14  221    3   46
  256   14    7  711    2  263   99    7 2863    3  284  446    3    4
   68    2   92   87  704    3  166    7  924    6    7  168    3   72
    8    7  221    4  711    3    4  255   99    7  169    3 1286   96
    3 

## 3. Create the Training Set

In [18]:
# Create the training set of recipes and the same text shifted by one word
def prepare_inputs(text):
    # Add a trailing dimension so the batch has the shape the vectorisation layer expects
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)  # Convert text into tokenized sequences (integer representation)
    x = tokenized_sentences[:, :-1]  # Input sequence (all words except last)
    y = tokenized_sentences[:, 1:]   # Target sequence (all words except first, i.e., shifted by one)
    return x, y


train_ds = text_ds.map(prepare_inputs)

In [35]:
train_ds

<_MapDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None, None), dtype=tf.int64, name=None))>
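
The one-token shift in `prepare_inputs` can be illustrated without TensorFlow. The sketch below uses a single made-up row of six token ids (standing in for `MAX_LEN + 1` tokens) and plain list slicing. The names `toy_batch`, `toy_x` and `toy_y` are hypothetical and used only for illustration.

```python
# One toy "recipe" of 6 token ids, already padded with 0 at the end
toy_batch = [[26, 16, 1733, 3, 0, 0]]

# Same slicing as prepare_inputs, written with plain lists
toy_x = [row[:-1] for row in toy_batch]  # every token except the last
toy_y = [row[1:] for row in toy_batch]   # every token except the first

# At each position the target is simply the next token of the input
for current_token, next_token in zip(toy_x[0], toy_y[0]):
    print(f"input {current_token:>5} -> target {next_token:>5}")
```

Both sequences have one token fewer than the original row, which is why the vectorisation layer is configured with `output_sequence_length=MAX_LEN + 1`.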

## 4. Build the LSTM <a name="build"></a>

In [20]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()
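
As a sanity check on the model size, the trainable parameter counts can be worked out by hand from the parameters in section 0. This is a sketch that assumes the default Keras layer settings (bias terms enabled, a standard four-gate LSTM). The variable names are illustrative.

```python
VOCAB_SIZE = 10000
EMBEDDING_DIM = 100
N_UNITS = 128

# Embedding: one EMBEDDING_DIM-sized vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (N_UNITS * (EMBEDDING_DIM + N_UNITS) + N_UNITS)

# Dense: a weight matrix from N_UNITS to VOCAB_SIZE plus one bias per output
dense_params = N_UNITS * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions the three layers have 1,000,000, 117,248 and 1,290,000 parameters, for 2,407,248 in total. The embedding and the output layer dominate because both scale with the vocabulary size.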

In [None]:
if LOAD_MODEL:
    # model.load_weights('./models/model')
    lstm = models.load_model("./models/lstm.h5", compile=False)

## 5. Train the LSTM <a name="train"></a>

In [22]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)
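
For intuition about the loss values reported during training, the following pure-Python sketch computes a sparse categorical cross-entropy by hand: the mean of the negative log-probability that the model assigns to each correct next token. The probabilities and targets are made up for illustration, and `toy_sparse_cce` is a hypothetical helper, not the Keras implementation.

```python
import math


def toy_sparse_cce(probs, targets):
    """Mean negative log-probability of the target token at each position."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


# Two positions, a toy vocabulary of 4 tokens
toy_probs = [
    [0.70, 0.10, 0.10, 0.10],  # fairly confident in token 0
    [0.25, 0.25, 0.25, 0.25],  # no idea: uniform over the 4 tokens
]
toy_targets = [0, 2]  # integer class ids, hence "sparse"

toy_loss = toy_sparse_cce(toy_probs, toy_targets)
print(round(toy_loss, 4))

# Loss of a uniform guess over a 10,000-word vocabulary, for comparison
uniform_loss = -math.log(1 / 10000)
print(round(uniform_loss, 4))
```

A model that guessed uniformly over all 10,000 tokens would score about 9.21, which gives a rough baseline for reading the per-epoch losses.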

In [23]:
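# Create a TextGenerator callback that prints a generated sample after each epoch
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        # top_k is accepted but not used by the sampling below
        super().__init__()
        self.index_to_word = index_to_word
        # Inverse vocabulary mapping: word -> token index
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        # Rescale the probabilities with the temperature, then renormalise
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Map the prompt words to token indices (1 = [UNK] for unknown words)
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens, or as soon as the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            # One probability distribution over the vocabulary per input position
            y = self.model.predict(x, verbose=0)
            # Sample the next token from the distribution at the last position
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            # Append the sampled token so it becomes part of the next input
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info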

    def on_epoch_end(self, epoch, logs=None):
        self.generate("recipe for", max_tokens=100, temperature=1.0)
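
The effect of the `temperature` argument in `sample_from` can be seen on a small distribution. The sketch below applies the same rescaling, `probs ** (1 / temperature)` followed by renormalisation, in plain Python. The five probabilities are the top five values printed in section 6 for the prompt ending in `1 /`; the rest of the vocabulary is left out, so the results are approximate. `apply_temperature` is a hypothetical helper name.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to the power 1/temperature, then renormalise."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Top-five probabilities for the tokens "2", "4", "8", "3", "6" (rest of vocabulary omitted)
top_probs = [0.567, 0.2788, 0.0807, 0.0441, 0.0035]

same = apply_temperature(top_probs, temperature=1.0)   # only renormalises
sharp = apply_temperature(top_probs, temperature=0.2)  # exponent of 5 sharpens the peak

for label, p1, p2 in zip(["2", "4", "8", "3", "6"], same, sharp):
    print(f"{label}: {100 * p1:6.2f}%  ->  {100 * p2:6.2f}%")
```

At a temperature of 0.2 the most likely token takes roughly 97% of the probability mass, so sampling becomes close to greedy. A temperature of 1.0 leaves the model's distribution unchanged.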

In [26]:
# Create a model save checkpoint
model_checkpoint_callback = callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.weights.h5",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = callbacks.TensorBoard(log_dir="./logs")

# Create the text generator callback from the vocabulary
text_generator = TextGenerator(vocab)

In [27]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[model_checkpoint_callback, tensorboard_callback, text_generator],
)

Epoch 1/25
629/629 ━━━━━━━━━━━━━━━━━━━━ 0s 134ms/step - loss: 4.9973
generated text:
recipe for pie pan sauce | heat then and inch sharp version which , lemon [UNK] layers - deep thermometer - round into a fingers - 2 mixture burger 

629/629 ━━━━━━━━━━━━━━━━━━━━ 87s 136ms/step - loss: 4.9959
Epoch 2/25
629/629 ━━━━━━━━━━━━━━━━━━━━ 0s 137ms/step - loss: 3.0447
generated text:
recipe for porcini - vegetable 6 timbales with sweet nuts | preheat oven to coat . preheat oven lime juice to processor 

629/629 ━━━━━━━━━━━━━━━━━━━━ 87s 139ms/step - loss: 3.0444
Epoch 3/25
629/629 ━━━━━━━━━━━━━━━━━━━━ 0s 140ms/step - loss: 2.4941
generated text:
recipe for matzo cake hazelnuts with cranberry cobbler with green cardamom in cucumber - orange mesa | melt potatoes in heavy large oil with some weights and beat sauce until tender . a

<keras.src.callbacks.history.History at 0x1aeeacd4ee0>

In [29]:
# Save the final model
lstm.save("./models/lstm.h5")



## 6. Generate text using the LSTM

In [39]:
def print_probs(info, vocab, top_k=5):
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        # Indices of the top_k most probable tokens, highest probability first
        top_indices = np.argsort(word_probs)[::-1][:top_k]
        for idx in top_indices:
            print(f"{vocab[idx]}:   \t{np.round(100*word_probs[idx],2)}%")
        print("--------\n")

In [40]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=10, temperature=1.0
)


generated text:
recipe for roasted vegetables | chop 1 / 3 cup



In [41]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	56.70000076293945%
4:   	27.8799991607666%
8:   	8.069999694824219%
3:   	4.409999847412109%
6:   	0.3499999940395355%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 3
cup:   	80.55999755859375%
inch:   	4.869999885559082%
of:   	2.25%
-:   	2.1600000858306885%
pound:   	1.2999999523162842%
--------



In [42]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=10, temperature=0.2
)


generated text:
recipe for roasted vegetables | chop 1 / 2 cup



In [43]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	97.19999694824219%
4:   	2.7899999618530273%
8:   	0.009999999776482582%
3:   	0.0%
6:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2
cup:   	99.9800033569336%
inch:   	0.019999999552965164%
teaspoon:   	0.0%
-:   	0.0%
pound:   	0.0%
--------



In [44]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=7, temperature=1.0
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | combine


PROMPT: recipe for chocolate ice cream |
in:   	25.780000686645508%
bring:   	13.210000038146973%
stir:   	9.34000015258789%
preheat:   	6.239999771118164%
whisk:   	4.210000038146973%
--------



In [45]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=7, temperature=0.2
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | in


PROMPT: recipe for chocolate ice cream |
in:   	95.9000015258789%
bring:   	3.390000104904175%
stir:   	0.6000000238418579%
preheat:   	0.07999999821186066%
whisk:   	0.009999999776482582%
--------

