# Neural Nourishment

This project draws on the Keras tutorials [Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/) and [GPT text generation from scratch with KerasNLP](https://keras.io/examples/generative/text_generation_gpt/), as well as the papers ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. and ["Language Models are Few-Shot Learners"](https://arxiv.org/abs/2005.14165) by Brown et al.

It uses [WordPiece Tokenization](https://research.google/blog/a-fast-wordpiece-tokenization-system/) and is trained on the [RecipeNLG dataset](https://www.kaggle.com/datasets/paultimothymooney/recipenlg) of 2,231,142 cooking recipes.


**Import necessary libraries**

In [1]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import keras_nlp
import pickle
import tensorflow as tf
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
import tensorflow.io as tf_io
from constants import *

Using TensorFlow backend


---
**Define the dataset as strings of full recipes**

To keep training manageable on a laptop, we load the data into a TensorFlow `Dataset` object. This lets us stream records into memory as needed, rather than loading them all at once.
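
To make the streaming idea concrete, here is a minimal pure-Python sketch of reading records lazily and shuffling them through a fixed-size buffer. It is an approximation of the general technique, not TensorFlow's implementation; `buffered_shuffle` and `read_lines` are hypothetical helper names introduced only for this illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def read_lines(lines: Iterable[str]) -> Iterator[str]:
    """Stand-in for a file reader: yields one record at a time."""
    for line in lines:
        yield line.rstrip("\n")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in a locally shuffled order, holding at most buffer_size in memory."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        # buffer is full: emit a random element and put the new item in its slot
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # source exhausted: drain the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


records = [f"recipe {i}\n" for i in range(10)]
shuffled = list(buffered_shuffle(read_lines(records), buffer_size=4))
print(shuffled)
```

Because only `buffer_size` records are held at once, the shuffle is local: with a buffer of 4, the first record emitted always comes from the first four records of the source. A larger buffer mixes the data more thoroughly at the cost of memory.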

In [2]:
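def csv_row_to_json(row):
    # parse one csv line into its 7 columns, all read as strings
    row = tf_io.decode_csv(records=row, record_defaults=[tf.constant([], dtype=tf.string)] * 7)

    # pick out the columns used for training by position
    title = row[1]
    ingredients = row[2]
    directions = row[3]
    ner = row[6]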

    return tf_strings.join([
        '{"ner": ', ner, ', ',
        '"title": "', title, '", ',
        '"ingredients": ', ingredients, ', ',
        '"directions": ', directions, '}',
    ])

dataset = (
    tf_data.TextLineDataset("RecipeNLG/RecipeNLG_dataset.csv") # load the csv file line by line
    .skip(1) # skip the header row
    .shuffle(buffer_size=256) # keep a 256-record buffer in memory and draw records from it at random, refilling as records are consumed
    .map(csv_row_to_json) # map each row of the csv to a jsonified recipe
    .apply(tf.data.experimental.ignore_errors()) # ignore any errors in the csv file
    .batch(BATCH_SIZE) # batch the dataset to train on multiple records at once
)
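
As a rough illustration of what each mapped record looks like, the sketch below rebuilds the same string with the standard library instead of TensorFlow ops. It assumes the column layout implied by the indices above (index, title, ingredients, directions, link, source, NER) and that the list columns are already stored as JSON-style arrays. The sample row and the name `csv_line_to_json` are made up for the example.

```python
import csv
import io
import json


def csv_line_to_json(line: str) -> str:
    """Pure-Python counterpart of csv_row_to_json, for illustration only."""
    fields = next(csv.reader(io.StringIO(line)))
    title, ingredients, directions, ner = fields[1], fields[2], fields[3], fields[6]
    return "".join([
        '{"ner": ', ner, ', ',
        '"title": "', title, '", ',
        '"ingredients": ', ingredients, ', ',
        '"directions": ', directions, '}',
    ])


# build a hypothetical csv line; csv.writer takes care of quoting the list columns
sample_fields = [
    "0",
    "Toast",
    json.dumps(["2 slices bread", "1 tbsp butter"]),
    json.dumps(["Toast the bread.", "Spread with butter."]),
    "example.com/toast",  # hypothetical link column
    "Example",            # hypothetical source column
    json.dumps(["bread", "butter"]),
]
buffer = io.StringIO()
csv.writer(buffer).writerow(sample_fields)
sample_line = buffer.getvalue().strip()

recipe = csv_line_to_json(sample_line)
parsed = json.loads(recipe)
print(recipe)
```

In this sketch the title is inserted verbatim, so a title containing a double quote would produce a string that no longer parses as JSON.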

---
**Load the WordPiece tokenizer**

In another notebook, we train a WordPiece tokenizer on the dataset. Special tokens are used for denoting the beginning and end of recipes. We can load the vocabulary and use the Keras `WordPieceTokenizer` to tokenize our tensors within the `tf.data` pipeline.

In [3]:
# load the tokenizer
with open(VOCAB_FILE, "rb") as f:
    vocab = pickle.load(f)

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    special_tokens_in_strings=True,
    special_tokens=SPECIAL_TOKENS,
    oov_token=OOV,
)
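
For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first lookup it performs on a single word. The toy vocabulary, the `##` continuation prefix, and the `[UNK]` token are assumptions chosen for the example; the real vocabulary and out-of-vocabulary token come from `VOCAB_FILE` and `OOV`, which are not shown here.

```python
from typing import List, Optional, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]", prefix: str = "##") -> List[str]:
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # pieces inside a word carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"bak", "##ing", "sugar", "un", "##salt", "##ed"}
baking = wordpiece_tokenize("baking", toy_vocab)
unsalted = wordpiece_tokenize("unsalted", toy_vocab)
flour = wordpiece_tokenize("flour", toy_vocab)
print(baking, unsalted, flour)
```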

---
**Tokenize the dataset**

Recipes are tokenized, then start and end tokens are added and the sequences are padded to a fixed length for training.

In [4]:
packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id(START_OF_RECIPE),
    end_value=tokenizer.token_to_id(END_OF_RECIPE),
    pad_value=tokenizer.token_to_id(PAD),
)

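def preprocess(recipe_batch):
    # tokenize the raw recipe strings into fixed-length rows of token ids
    outputs = tokenizer(recipe_batch)
    # features: the token ids with the start token prepended, i.e. shifted right by one
    features = packer(outputs)
    # labels: the unshifted token ids, so each position is trained to predict the next token
    labels = outputs
    return features, labels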

dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
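
The relationship between features and labels is easiest to see on a tiny example. The sketch below is a simplified illustration of the next-token setup and not the `StartEndPacker` implementation: it leaves out the end token, and `make_training_pair` and the ids it uses are hypothetical.

```python
from typing import List, Tuple


def make_training_pair(
    token_ids: List[int], seq_len: int, start_id: int, pad_id: int
) -> Tuple[List[int], List[int]]:
    """Build (features, labels) where features are the labels shifted right by one."""
    # labels: the tokens themselves, truncated and padded to seq_len
    labels = token_ids[:seq_len]
    labels = labels + [pad_id] * (seq_len - len(labels))
    # features: start token in front, then the same tokens, cut back to seq_len
    features = ([start_id] + labels)[:seq_len]
    return features, labels


features, labels = make_training_pair([11, 12, 13], seq_len=5, start_id=1, pad_id=0)
print(features)  # [1, 11, 12, 13, 0]
print(labels)    # [11, 12, 13, 0, 0]
```

At every position `i`, the model sees `features[: i + 1]` and is asked to predict `labels[i]`, which is the token that follows.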

---
**Define the model**

In [5]:
def create_model():
    inputs = keras.layers.Input(shape=(None,), dtype="int32")

    # token embedding layer
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
        mask_zero=True,
    )

    # transformer decoders
    transformer_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
        dropout=0.1
    )

    # output layer
    output_layer = keras.layers.Dense(VOCAB_SIZE)

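    # assemble the model
    x = embedding_layer(inputs)
    # note: the same decoder instance is applied NUM_LAYERS times, so every pass shares one set of weights
    for _ in range(NUM_LAYERS):
        x = transformer_layer(x)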
    outputs = output_layer(x)
    model = keras.Model(inputs=inputs, outputs=outputs)

    loss_function = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)

    model.compile(optimizer="adam", loss=loss_function, metrics=[perplexity])
    
    return model
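
The perplexity metric used above is the exponential of the mean cross-entropy over non-masked tokens. A minimal sketch of that calculation in plain Python follows; `masked_perplexity` is a hypothetical helper, and the probabilities are invented for the example.

```python
import math
from typing import Sequence


def masked_perplexity(token_probs: Sequence[float], labels: Sequence[int], mask_id: int = 0) -> float:
    """token_probs[i] is the probability the model assigned to the true token labels[i]."""
    # cross-entropy per token, skipping positions whose label is the mask (padding) id
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != mask_id]
    if not losses:
        raise ValueError("every position is masked")
    return math.exp(sum(losses) / len(losses))


# a model that spreads probability uniformly over 8 tokens has perplexity 8;
# the last position is padding (label 0) and is ignored
uniform_perplexity = masked_perplexity([1 / 8] * 4, labels=[3, 5, 2, 0])
print(uniform_perplexity)
```

Read this way, a perplexity of `k` means the model is on average as uncertain as if it were choosing uniformly among `k` tokens.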

In [6]:
model = create_model()
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 token_and_position_embedding (  (None, None, 256)   557056      ['input_1[0][0]']                
 TokenAndPositionEmbedding)                                                                       
                                                                                                  
 transformer_decoder (Transform  (None, None, 256)   329085      ['token_and_position_embedding[0]
 erDecoder)                                                      [0]',                            
                                                                  'transformer_decoder[0][0]']
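
As a sanity check on the summary, the embedding layer's parameter count can be reproduced by hand: a token-and-position embedding holds one row per vocabulary entry plus one row per position, each of width `EMBED_DIM`. The sketch below takes 256 from the output shape above. The values 2048 and 128 are only one hypothetical split that matches the reported total, since the actual values in `constants.py` are not shown.

```python
def embedding_param_count(vocab_size: int, seq_len: int, embed_dim: int) -> int:
    """Token table (vocab_size x embed_dim) plus position table (seq_len x embed_dim)."""
    return vocab_size * embed_dim + seq_len * embed_dim


embed_dim = 256    # last dimension of the embedding output shape in the summary
reported = 557056  # Param # reported for token_and_position_embedding

# the reported total fixes the sum VOCAB_SIZE + SEQ_LEN
combined_rows = reported // embed_dim
print(combined_rows)  # 2176

# hypothetical split of those rows; other splits with the same sum fit equally well
example_total = embedding_param_count(vocab_size=2048, seq_len=128, embed_dim=embed_dim)
print(example_total == reported)  # True
```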

---
**Fit the model**

In [7]:
# training one epoch on my M1 MacBook Pro with 16GB of RAM takes roughly 5 hours 20 minutes ...
checkpoint_callback = keras.callbacks.ModelCheckpoint(filepath='checkpoints/checkpoint_{epoch:02d}.tf')

callbacks = [
    checkpoint_callback,
]

model.fit(
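    dataset.take(5), # train on the first 5 batches only
    validation_data=dataset.skip(1).take(1), # one batch drawn from the same stream, not a separate held-out split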
    epochs=EPOCHS,
    callbacks=callbacks,
)

2024-04-26 18:16:07.793253: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Epoch 1/3
      5/Unknown - 4s 486ms/step - loss: 6.4108 - perplexity: 620.0692



INFO:tensorflow:Assets written to: checkpoints/checkpoint_01.tf/assets




Epoch 2/3



INFO:tensorflow:Assets written to: checkpoints/checkpoint_02.tf/assets




Epoch 3/3



INFO:tensorflow:Assets written to: checkpoints/checkpoint_03.tf/assets






<keras.callbacks.History at 0x16077fd30>
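
The `take` and `skip` calls in the fit cell select overlapping parts of the stream. The sketch below mimics their semantics on a plain list with `itertools.islice` to show the overlap, along with one way to obtain a disjoint validation slice. It is an illustration under simplifying assumptions: the integers stand in for batches, and the real pipeline reshuffles on every pass, so the overlap there is between underlying records and not identical batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take(items: Iterable[int], n: int) -> List[int]:
    """First n elements, like Dataset.take(n)."""
    return list(islice(items, n))


def skip(items: Iterable[int], n: int) -> Iterator[int]:
    """Everything after the first n elements, like Dataset.skip(n)."""
    return islice(items, n, None)


batches = list(range(8))  # stand-in for a stream of 8 batches

train = take(batches, 5)                       # batches 0..4
val_overlapping = take(skip(batches, 1), 1)    # batch 1, which is also in train
val_disjoint = take(skip(batches, 5), 1)       # batch 5, outside the training slice

print(train, val_overlapping, val_disjoint)
```

Skipping past the training slice before taking the validation batch keeps the two from overlapping in this toy setting.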