# Import data and map ingredients to IDs

This notebook generates training data for a Word2Vec model.

## Creating a vocabulary

In [1]:
import pandas as pd
import numpy as np
import ast
import dill as pickle
import tqdm
import tensorflow as tf

RANDOM_SEED=42


In [10]:
# parse the string form of a tuple, e.g. '("a", "b")', into a Python tuple
def parseTupleFunc(tuple_str: str):
    try:
        return ast.literal_eval(tuple_str)
    except Exception:
        # report the unparseable entry; it becomes None in the result
        print(tuple_str)
        return None

# import the ingredients column as a Series
recipes = pd.read_csv("../data/recipes.csv", usecols=["RecipeIngredientParts"]).squeeze("columns")

# drop rows that are not in the R-style vector format c("a", "b", ...)
recipes = recipes.drop(recipes[recipes.str[:2] != "c("].index)

# strip the leading "c" so that the remainder is a valid tuple literal
recipes = recipes.str[1:]

recipes = recipes.apply(parseTupleFunc)
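
As a small illustration of the parsing step, the sketch below applies the same slicing and `ast.literal_eval` call to made-up entries. The sample strings are assumptions about the R-style `c(...)` format selected by the filter above, not rows copied from the dataset.

```python
import ast

# hypothetical raw value in the R-style vector format, c("a", "b", ...)
raw = 'c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")'

# dropping the leading "c" leaves a valid Python tuple literal
parsed = ast.literal_eval(raw[1:])
print(parsed)

# a hypothetical one-element entry parses to a plain string, not a tuple
single = ast.literal_eval('c("salt")'[1:])
print(type(single).__name__)
```

The second case is worth keeping in mind: `("salt")` is just a parenthesised string in Python, so any such entries would need separate handling before being treated as a list of ingredients.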

There are approximately 510,000 recipes:

In [11]:
recipes.shape

(511626,)

Produce a vocabulary of all unique ingredients across all recipes and determine its length.

In [12]:
vocab = recipes.explode().unique()
vocab_size = len(vocab)

Insert an empty string into the vocabulary at index 0, to be used for padding recipes with different numbers of ingredients:

In [13]:
vocab = np.insert(vocab, 0, "")
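
To make the resulting ID scheme concrete, here is a plain-Python sketch on two made-up recipes. It assumes, as `unique()` does, that ingredients keep their order of first appearance. The names `toy_recipes` and `toy_vocab` are illustrative only.

```python
# two made-up recipes standing in for the parsed `recipes` Series
toy_recipes = [
    ("blueberries", "granulated sugar", "lemon juice"),
    ("onion", "lemon juice", "garlic"),
]

# dict.fromkeys keeps the first occurrence of each ingredient, in order
unique_ingredients = list(dict.fromkeys(
    ingredient for recipe in toy_recipes for ingredient in recipe
))

# reserve index 0 for the empty padding token
toy_vocab = [""] + unique_ingredients

print(toy_vocab)
print(len(toy_vocab))
```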

Save the vocabulary to an `.npy` binary file.

In [14]:
# with open("vocab.npy","wb") as f:
#     np.save(f, vocab)


The ingredient at a given index is retrieved by indexing into `vocab`:

In [15]:
print(vocab[0])
print(vocab[1])
print(vocab[23])


blueberries
onion


We use `argmax` to retrieve the index of a given word:

In [16]:
vocabIndex = lambda query: np.argmax(vocab == query)

print(vocabIndex(""))
print(vocabIndex("blueberries"))
print(vocabIndex("onion"))

0
1
23
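
One caveat with this lookup is worth a quick check: `np.argmax` on an all-`False` mask returns 0, so a word that is not in the vocabulary is silently mapped to the padding index. The sketch below uses a made-up four-entry vocabulary (`toy_vocab`) and a hypothetical helper (`toy_vocab_index`) to show this.

```python
import numpy as np

# made-up vocabulary with the padding token at index 0
toy_vocab = np.array(["", "blueberries", "onion", "garlic"])

def toy_vocab_index(query):
    # index of the first True in the boolean mask, or 0 if there is none
    return int(np.argmax(toy_vocab == query))

print(toy_vocab_index("onion"))        # a known word
print(toy_vocab_index("dragonfruit"))  # an unknown word also yields 0
```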


## Mapping recipe ingredient tokens to vocabulary IDs

We now aim to transform each recipe's ingredient tokens into their corresponding ingredient IDs.

In [17]:
def ingredient_mapper(recipe_ingredients):
    return [vocabIndex(ingredient) for ingredient in recipe_ingredients]

recipe_ids = recipes.apply(ingredient_mapper)

This process takes approximately 10 minutes. 
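
Much of that time is likely spent in `vocabIndex`, which scans the whole vocabulary for every ingredient. A possible alternative, sketched here on made-up data, is to build a dictionary from ingredient to ID once and look each ingredient up directly. `toy_vocab`, `ingredient_to_id` and `toy_mapper` are hypothetical names, not part of the notebook.

```python
# made-up vocabulary with the padding token at index 0
toy_vocab = ["", "blueberries", "granulated sugar", "vanilla yogurt", "lemon juice"]

# build the reverse mapping once
ingredient_to_id = {ingredient: i for i, ingredient in enumerate(toy_vocab)}

def toy_mapper(recipe_ingredients):
    # a KeyError here would flag an ingredient missing from the vocabulary
    return [ingredient_to_id[ingredient] for ingredient in recipe_ingredients]

recipe = ("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")
print(toy_mapper(recipe))
```
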
We query the resulting array to recover the ingredient names of the first recipe:

In [18]:
[vocab[id] for id in recipe_ids[0]]

['blueberries', 'granulated sugar', 'vanilla yogurt', 'lemon juice']

We confirm that these match the original entry at the same index in `recipes`:

In [19]:
recipes[0]

('blueberries', 'granulated sugar', 'vanilla yogurt', 'lemon juice')

Because this pre-processing is slow, we export the result to a binary file for reuse:

In [2]:
# with open("recipe_ids.npy","wb") as f:
#     np.save(f, recipe_ids)


## Pad data to correct length

Load the recipe IDs and vocabulary saved above:

In [6]:
# loaded under the name used by the following cells
recipe_ids = np.load("recipe_ids.npy", allow_pickle=True)
vocab = np.load("vocab.npy", allow_pickle=True)
vocab_size = len(vocab)

UnpicklingError: Failed to interpret file 'recipe_ids.npy' as a pickle

Calculate the maximum number of ingredients in any single recipe:

In [None]:
max_ingredients = max(len(rec) for rec in recipe_ids)

Pad all recipes to have the max number of ingredients:

In [None]:
# create an array of 0s (the padding ID) to fill
padded_recipes = np.zeros((len(recipe_ids), max_ingredients), dtype="int64")

for i, row in enumerate(recipe_ids):
    padded_recipes[i, :len(row)] = row
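
On a toy input the padding step behaves as follows. This is a minimal sketch with made-up ID lists (`toy_ids`), using plain assignment into the zero-filled array.

```python
import numpy as np

# made-up recipes of different lengths, already mapped to IDs
toy_ids = [[1, 2, 3, 4], [5, 4], [6]]

max_len = max(len(rec) for rec in toy_ids)

# start from all zeros, since ID 0 is the padding token
toy_padded = np.zeros((len(toy_ids), max_len), dtype="int64")

for i, row in enumerate(toy_ids):
    toy_padded[i, :len(row)] = row

print(toy_padded)
```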

## Collect positive and negative context words

Each recipe's list of ingredient IDs is now padded to the same length. Next we build a training dataset in which each record contains:
- The ID of a single ingredient in a particular recipe (the target)
- The ID of a positive context word found in the same recipe (within a window of size $w$)
- Some number ($n$) of negative context words sampled from the vocabulary

A recipe with $r$ ingredients therefore yields one record per (target, context) pair inside the window, which is at most $2w \times r$ records, each holding $n$ negative samples.

The dataset comprises three arrays:
- `targets`, a one-dimensional array with length equal to the number of pairings of each ingredient (in each recipe) with every ingredient in its surrounding window (for a window size $w$). This array stores the target word ID for each pairing.
- `contexts`, a two-dimensional array with the same number of rows as `targets`. Each row has length $n+1$: the positive context word followed by the $n$ negatively-sampled context words for that pairing.
- `labels`, a two-dimensional array of the same shape as `contexts`, where each row marks the corresponding context words as positive (1) or negative (0).
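
The shapes described above can be checked on a single toy recipe. The sketch below is written in plain Python rather than TensorFlow, and it draws negatives uniformly at random, which is a simplifying assumption and not the sampler used in this notebook. All names and values (`toy_recipe`, `toy_vocab_size`, and so on) are illustrative.

```python
import random

random.seed(42)

toy_recipe = [1, 2, 3, 4]   # ingredient IDs of one made-up recipe
toy_vocab_size = 50         # hypothetical vocabulary size, padding included
window_size = 4
num_negative_samples = 5

targets, contexts, labels = [], [], []

for i, target in enumerate(toy_recipe):
    # positions within the window on either side of the target
    lo = max(0, i - window_size)
    hi = min(len(toy_recipe), i + window_size + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        positive = toy_recipe[j]
        # uniform negatives over non-padding IDs, excluding the positive word
        candidates = [k for k in range(1, toy_vocab_size) if k != positive]
        negatives = random.sample(candidates, num_negative_samples)
        targets.append(target)
        contexts.append([positive] + negatives)
        labels.append([1] + [0] * num_negative_samples)

print(len(targets))                      # one record per (target, context) pair
print(len(contexts[0]), len(labels[0]))  # n + 1 entries per record
```

With four ingredients and a window of 4, every ingredient pairs with the other three, giving 12 records, each with $n+1 = 6$ context IDs.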

**This likely needs to be reworked - we just need pairs of (target, context, label). This potentially means we can use the `negative_samples` parameter of tf.keras.preprocessing.sequence.skipgrams to generate the entire training dataset.**

In [None]:

targets = []
contexts = []
labels = []

window_size = 4
num_negative_samples = 5

# for each recipe
for recipe in tqdm.tqdm(padded_recipes):

    # generate all positive skip-grams using the given window size
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
        recipe,
        vocabulary_size=vocab_size,
        window_size=window_size,
        # will generate negative samples separately as the returned format is unhelpful
        negative_samples=0
    )

    # for each (target, positive context) pair
    for target_ingredient, context_ingredient in positive_skip_grams:

        # wrap the positive context ID in a tensor of shape (1,)
        context_const = tf.constant([context_ingredient], dtype="int64")

        # add a dimension to get shape (1, 1), i.e. (batch_size, num_true)
        context_class = tf.expand_dims(context_const, 1)

        # generate negative samples
        negative_samples, _, _ = tf.random.log_uniform_candidate_sampler(
            true_classes=context_class,
            # number of target classes per example
            num_true=1,
            # number to generate
            num_sampled=num_negative_samples,
            # ensure all samples in the batch are unique
            unique=True,
            # max value
            range_max=vocab_size,
            seed=RANDOM_SEED,
            name="negative_sampling"
        )

        # combine positive sample with negative samples
        context = tf.concat([context_const, negative_samples], 0)
        label = tf.constant([1] + [0] * num_negative_samples, dtype="int64")
        # add to training data
        targets.append(target_ingredient)
        contexts.append(context)
        labels.append(label)

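For intuition about the negative sampler: the TensorFlow documentation describes `tf.random.log_uniform_candidate_sampler` as drawing class `c` with probability `(log(c + 2) - log(c + 1)) / log(range_max + 1)`, which favours low IDs and is intended for vocabularies sorted by decreasing frequency. The plain-Python sketch below evaluates that formula for a hypothetical `range_max`. Since the IDs in this notebook follow order of first appearance rather than frequency, it may be worth checking whether this bias is desired.

```python
import math

def log_uniform_prob(class_id, range_max):
    # probability of drawing class_id under the documented log-uniform formula
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

range_max = 1000  # hypothetical vocabulary size

probs = [log_uniform_prob(c, range_max) for c in range(range_max)]

print(round(sum(probs), 6))   # the probabilities sum to 1
print(probs[0] > probs[999])  # low IDs are more likely than high ones
```
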

In [None]:
with open("targets.pkl","wb") as f:
    pickle.dump(targets, f)
with open("contexts.pkl","wb") as f:
    pickle.dump(contexts, f)
with open("labels.pkl","wb") as f:
    pickle.dump(labels, f)

We can inspect the first few training records:

In [None]:
print_num = 10
for target, context, label in zip(targets[:print_num],contexts[:print_num],labels[:print_num]):
    
    print(f"target_index: {target}")
    print(f"target ingredient: {vocab[target]}")
    print(f"context indexes: {context}")
    print(f"context words: {[vocab[c.numpy()] for c in context]}")
    print(f"label: {label}")
    print("`")

target_index: 4
target ingredient: lemon juice
context indexes: [   3   12 4945    0   13 1555]
context words: ['vanilla yogurt', 'cardamom seed', 'Liptauer cheese', '', 'cumin seed', 'Anjou pears']
label: [1 0 0 0 0 0]
`
target_index: 1
target ingredient: blueberries
context indexes: [   3    9    6 1863    8   10]
context words: ['vanilla yogurt', 'garlic', 'milk', 'chunk pineapple', 'onions', 'clove']
label: [1 0 0 0 0 0]
`
target_index: 3
target ingredient: vanilla yogurt
context indexes: [   1   15 1192 3124 5904  533]
context words: ['blueberries', 'mace', 'unflavored gelatin', 'chestnuts', 'vegan worcestershire sauce', 'turkey Polish kielbasa']
label: [1 0 0 0 0 0]
`
target_index: 4
target ingredient: lemon juice
context indexes: [  2   4  73 141 221  12]
context words: ['granulated sugar', 'lemon juice', 'cake flour', 'dark molasses', 'romano cheese', 'cardamom seed']
label: [1 0 0 0 0 0]
`
target_index: 3
target ingredient: vanilla yogurt
context indexes: [  2   3  23  84   1 

It's looking good. We can now build the model using this training data.

# Generate training dataset

Assuming:
- Ingredients are imported
- Unique IDs are given to each ingredient
- The vocabulary is established and its size is known
- Each recipe is transformed into a list of IDs for each ingredient

The training dataset will use a tuple of (target_word, context, label), where:
- `target_word` is a single target word ID
- `context` is an array of context word IDs
- `label` is an array of flags of the same length as `context`, where the value at each index indicates whether the corresponding element of `context` is a positive or negative sample

1. Import ingredients into a 2D array
2. Identify unique ingredients
3. Assign IDs to each unique ingredient
4. Perform some exploration to determine relevant window size and add padding, OR
    1. Choose a window size, $n$
    2. For each recipe, and for each ingredient in the recipe:
        - Randomly sample $n$ other ingredients (and get their IDs) from the window without replacement
        - Pad if there are not enough ingredients to fill the window
        - Add the token's ID and its neighbouring sample to a 2D array
5. Choose output vector size
6. Create Keras model and train
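
The sampling alternative in step 4 of this outline could look roughly like the following. It is only a sketch under assumptions the outline leaves open: the "window" is taken to be the whole recipe, padding uses ID 0, and `sample_neighbours` is a hypothetical helper name.

```python
import random

def sample_neighbours(recipe_ids, n, rng, pad_id=0):
    """For each ingredient, sample up to n other ingredients without replacement."""
    rows = []
    for i, token in enumerate(recipe_ids):
        others = recipe_ids[:i] + recipe_ids[i + 1:]
        sampled = rng.sample(others, min(n, len(others)))
        # pad when the recipe has fewer than n other ingredients
        sampled += [pad_id] * (n - len(sampled))
        rows.append([token] + sampled)
    return rows

rng = random.Random(42)
rows = sample_neighbours([1, 2, 3], n=4, rng=rng)
for row in rows:
    print(row)
```

Each row holds the token's ID followed by its $n$ sampled neighbours, so the result can be stacked directly into the 2D array mentioned in the outline.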

# Research


## Thorough Stanford notes:
https://web.stanford.edu/~jurafsky/slp3/6.pdf
https://web.stanford.edu/~jurafsky/slp3/

## CBOW Example
https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

## TF Skipgram implementation
https://www.tensorflow.org/tutorials/text/word2vec#compile_all_steps_into_one_function


## Word2Vec Illustrated
https://jalammar.github.io/illustrated-word2vec/

In [21]:
"""
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

from keras.preprocessing import text

# raw data
sentences = [
    "hi",
    "hello",
    "another",
    "word two",
    "a much longer sentence with a word"
]
y = ["a", "b", "c"]

# fit tokenizer
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(sentences)

# get unique ids of each word
vocab = tokenizer.word_index
vocab_size = len(vocab)

np.random.seed(42)

# four example word IDs, each below the Embedding layer's input_dim of 10
inps = np.array([2, 9, 4, 6])

# size of outputted vector
output_size = 5

model = keras.Sequential(
    [   
        # convert word instance to an embedding
        layers.Embedding(
            # number of distinct, unique input values
            input_dim=10, 
            output_dim=output_size,
            # number of values per input
            input_length=1
        )
    ]
)

model.compile("rmsprop", "mse")

outs = model.predict(inps)
print(outs.shape)

"""
