# Neural Nourishment

This project draws on examples from the Keras tutorials [Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/) and [GPT text generation from scratch with KerasNLP](https://keras.io/examples/generative/text_generation_gpt/), as well as the papers ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. and ["Language Models are Few-Shot Learners"](https://arxiv.org/abs/2005.14165) by Brown et al.

It uses [WordPiece Tokenization](https://research.google/blog/a-fast-wordpiece-tokenization-system/) and is trained on the [RecipeNLG dataset](https://www.kaggle.com/datasets/paultimothymooney/recipenlg) of 2,231,142 cooking recipes.


**Import necessary libraries and the Tokenizer class**

In [1]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import keras_nlp
import tensorflow as tf
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
import tensorflow.io as tf_io

Using TensorFlow backend


---
**Define hyperparameters**

In [2]:
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512  # Strings shorter than this will be discarded
SEQ_LEN = 128  # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 4096  # Limits parameters in model.

# Training
EPOCHS = 1
TOKENIZER_TRAINING_SIZE = 256

# Inference
NUM_TOKENS_TO_GENERATE = 80

# Special tokens
START_OF_RECIPE = "<|recipe_start|>"
END_OF_RECIPE = "<|recipe_end|>"
PAD = "<|pad|>"
OOV = "<|oov|>"
SPECIAL_TOKENS = [PAD, START_OF_RECIPE, END_OF_RECIPE, OOV]

---
**Define the dataset as strings of full recipes**

To keep training manageable on a laptop, we load the dataset into a TensorFlow `tf.data` dataset object. This lets us stream records into memory as needed, rather than loading the whole file at once.

In [3]:
def csv_row_to_json(row):
    row = tf_io.decode_csv(records=row, record_defaults=[tf.constant([],dtype=tf.string)] * 7)
    
    title = row[1]
    ingredients = row[2]
    directions = row[3]
    ner = row[6]

    return tf_strings.join([
        '{"ner": ', ner, ', ',
        '"title": "', title, '", ',
        '"ingredients": ', ingredients, ', ',
        '"directions": ', directions, '}',
    ])

dataset = (
    tf_data.TextLineDataset("RecipeNLG/RecipeNLG_dataset.csv") # load the csv file line by line
    .skip(1) # skip the header row
    .shuffle(buffer_size=256) # keep a 256-record buffer in memory and draw records from it at random
    .map(csv_row_to_json) # map each row of the csv to a jsonified recipe
    .apply(tf.data.experimental.ignore_errors()) # ignore any errors in the csv file
    .batch(BATCH_SIZE) # batch the dataset to train on multiple records at once
)
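
To make the row-to-JSON step concrete, here is a minimal pure-Python sketch of the same string join applied to a single line. It assumes the column order index, title, ingredients, directions, link, source, NER, and that the list-valued columns already hold JSON-style arrays. The helper name `csv_row_to_json_py` and the sample row are hypothetical, made up for illustration.

```python
import csv
import io
import json


def csv_row_to_json_py(line):
    """Pure-Python analogue of csv_row_to_json for one CSV line."""
    # Assumed column order: index, title, ingredients, directions, link, source, NER
    row = next(csv.reader(io.StringIO(line)))
    title, ingredients, directions, ner = row[1], row[2], row[3], row[6]
    # Same join as in the notebook: list columns are inserted verbatim
    return "".join([
        '{"ner": ', ner, ', ',
        '"title": "', title, '", ',
        '"ingredients": ', ingredients, ', ',
        '"directions": ', directions, '}',
    ])


# Hypothetical row, written in the format the dataset is assumed to use
sample = (
    '0,Toast,"[""1 slice bread""]","[""Toast the bread.""]",'
    'example.com/toast,Gathered,"[""bread""]"'
)
recipe = csv_row_to_json_py(sample)
parsed = json.loads(recipe)  # succeeds only if the joined string is valid JSON
print(recipe)
```

Note that the title is inserted without escaping, so a title containing a double quote would likely yield a string that is not valid JSON.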

---
**Train the WordPiece tokenizer**

On a subset of our data (the first `TOKENIZER_TRAINING_SIZE` batches), we train a WordPiece tokenizer. Special tokens denote the beginning and end of each recipe. Once trained, we can use the Keras `WordPieceTokenizer` to tokenize our tensors within the `tf.data` pipeline.

In [4]:
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    data=dataset.take(TOKENIZER_TRAINING_SIZE),
    vocabulary_size=VOCAB_SIZE,
    reserved_tokens=SPECIAL_TOKENS,
)



In [5]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    special_tokens_in_strings=True,
    special_tokens=SPECIAL_TOKENS,
    oov_token=OOV,
)
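
For intuition about what a trained WordPiece vocabulary is used for, the following is a rough sketch of greedy longest-match-first sub-word splitting for a single word. It is not the KerasNLP implementation. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and the `##` continuation prefix is assumed to match the tokenizer's default suffix indicator.

```python
def wordpiece_tokenize(word, vocab, oov="<|oov|>", suffix="##"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = suffix + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [oov]  # no piece fits, so the whole word is out of vocabulary
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only
toy_vocab = {"un", "##aff", "##able", "salt", "##ed"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("salted", toy_vocab))     # ['salt', '##ed']
print(wordpiece_tokenize("pepper", toy_vocab))     # ['<|oov|>']
```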

---
**Tokenize the dataset**

Recipes are tokenized, then start and end tokens are added to prepare them for training.

In [6]:
packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id(START_OF_RECIPE),
    end_value=tokenizer.token_to_id(END_OF_RECIPE),
    pad_value=tokenizer.token_to_id(PAD),
)

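def preprocess(recipe_batch):
    # Tokenize each recipe; the tokenizer pads or truncates to SEQ_LEN
    outputs = tokenizer(recipe_batch)
    # Features get a start token prepended, so they lag the labels by one position
    features = packer(outputs)
    # Labels are the unshifted token ids: the model predicts the next token at each step
    labels = outputs

    return features, labels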

dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
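
The relationship between features and labels can be sketched in plain Python. This is a simplified illustration, not the `StartEndPacker` behavior itself: it leaves out the end token and assumes the special-token ids follow the order of `SPECIAL_TOKENS` (pad = 0, start = 1). The helper names are hypothetical.

```python
PAD_ID, START_ID = 0, 1  # assumed ids, following the order of SPECIAL_TOKENS


def pad_or_truncate(ids, length, pad_id=PAD_ID):
    """Right-pad with pad_id, then cut to exactly `length` tokens."""
    return (ids + [pad_id] * length)[:length]


def make_example(token_ids, seq_len):
    """Build a (features, labels) pair for next-token prediction."""
    labels = pad_or_truncate(token_ids, seq_len)
    # Prepending the start token shifts every token one position to the right
    features = pad_or_truncate([START_ID] + token_ids, seq_len)
    return features, labels


features, labels = make_example([10, 11, 12], seq_len=5)
print(features)  # [1, 10, 11, 12, 0]
print(labels)    # [10, 11, 12, 0, 0]
```

At position `i` the model sees `features[:i + 1]` and is trained to predict `labels[i]`, which is the token that follows.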

---
**Define the model**

In [7]:
def create_model():
    inputs = keras.layers.Input(shape=(None,), dtype="int32")

    # token embedding layer
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
        mask_zero=True,
    )

    # transformer decoders
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )

    # output layer
    output_layer = keras.layers.Dense(VOCAB_SIZE)

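    # build the model
    x = embedding_layer(inputs)
    # NOTE: the same decoder_layer instance is applied NUM_LAYERS times, so every
    # pass shares one set of weights (the summary lists a single transformer_decoder)
    for _ in range(NUM_LAYERS):
        x = decoder_layer(x)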
    outputs = output_layer(x)
    model = keras.Model(inputs=inputs, outputs=outputs)

    loss_function = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)

    model.compile(optimizer="adam", loss=loss_function, metrics=[perplexity])
    
    return model
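
The perplexity metric used above can be read as the exponential of the average cross-entropy over non-padding tokens. Below is a minimal pure-Python sketch of that idea, assuming padding has id 0 as in `mask_token_id=0`. The function name `masked_perplexity` and the inputs are hypothetical, and this is not the KerasNLP implementation.

```python
import math


def masked_perplexity(token_probs, labels, pad_id=0):
    """exp of the mean negative log-likelihood over non-padding positions."""
    nll = [-math.log(p) for p, y in zip(token_probs, labels) if y != pad_id]
    return math.exp(sum(nll) / len(nll))


# token_probs[i] is the probability the model gave to the correct token at position i
labels = [10, 11, 12, 0, 0]
uniform = [1 / 4096] * len(labels)
print(masked_perplexity(uniform, labels))
```

A model that guesses uniformly over a 4096-token vocabulary has a perplexity of about 4096, so values well below `VOCAB_SIZE` indicate the model has learned something.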

In [8]:
model = create_model()
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 token_and_position_embedding (  (None, None, 256)   1081344     ['input_1[0][0]']                
 TokenAndPositionEmbedding)                                                                       
                                                                                                  
 transformer_decoder (Transform  (None, None, 256)   329085      ['token_and_position_embedding[0]
 erDecoder)                                                      [0]',                            
                                                                  'transformer_decoder[0][0]']
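
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch under assumptions about the layer internals that are not stated in this notebook: a token table plus a position table for the embedding, and for the decoder a self-attention block with per-head size `EMBED_DIM // NUM_HEADS`, two layer normalizations, and a two-layer feed-forward network. These assumptions reproduce both numbers shown above.

```python
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 4096, 128, 256
FEED_FORWARD_DIM, NUM_HEADS = 128, 3

# Embedding: one vector per token id plus one vector per position
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# Assumed per-head size; integer division, since 3 does not divide 256
head_dim = EMBED_DIM // NUM_HEADS  # 85

# Query, key, and value projections (weights + biases)
qkv = 3 * (EMBED_DIM * NUM_HEADS * head_dim + NUM_HEADS * head_dim)
# Attention output projection back to EMBED_DIM
attn_out = NUM_HEADS * head_dim * EMBED_DIM + EMBED_DIM
# Two layer norms, each with a scale and an offset vector
layer_norms = 2 * (2 * EMBED_DIM)
# Feed-forward: EMBED_DIM -> FEED_FORWARD_DIM -> EMBED_DIM
feed_forward = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
decoder_params = qkv + attn_out + layer_norms + feed_forward

print(embedding_params)  # 1081344
print(decoder_params)    # 329085
```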

---
**Fit the model**

In [9]:
model.fit(
    dataset,
    epochs=EPOCHS,
    callbacks=[
        keras.callbacks.ModelCheckpoint("save_at_{epoch}.keras"),
    ],
)