In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

# What's Cooking

Today we are going to use the Kaggle [What's Cooking dataset](https://www.kaggle.com/c/whats-cooking-kernels-only/data). Please download the data and adjust the file path below to follow along.


This is basically a list of recipes, each given as a list of ingredients, and we need to decide which cuisine each recipe comes from. We can check out some of the data below:

In [2]:
import json
recipeRaw = pd.read_json("../whats-cooking/train.json")
recipeRaw["ingredientsFlat"] = recipeRaw["ingredients"].apply(lambda x: ' '.join(x))
recipeRaw.head()

| | cuisine | id | ingredients | ingredientsFlat |
|---|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce black olives grape tomatoes ga... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour ground pepper salt tomatoes ground... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs pepper salt mayonaise cooking oil green c... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water vegetable oil wheat salt |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper shallots cornflour cayenne pepper... |


So our goal is to predict the cuisine, which makes this a multiclass classification problem. We can see all the classes below:

In [3]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(recipeRaw["cuisine"].values)
le.classes_

array(['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',
       'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',
       'japanese', 'korean', 'mexican', 'moroccan', 'russian',
       'southern_us', 'spanish', 'thai', 'vietnamese'], dtype=object)

For Keras to be able to work with these labels, we need to convert the strings into one-hot encodings:

In [4]:
docs = recipeRaw["ingredientsFlat"].values
labels_enc = le.transform(recipeRaw["cuisine"].values)
labels = tf.keras.utils.to_categorical(labels_enc)
labels

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
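
To make the label encoding step concrete, here is a minimal NumPy-only sketch of the same idea on a toy label array. The array `cuisines` is made up for illustration, and `np.unique` plus `np.eye` stand in for `LabelEncoder` and `to_categorical`; this is not the code the notebook runs.

```python
import numpy as np

# Toy labels standing in for recipeRaw["cuisine"]; not the real data.
cuisines = np.array(["greek", "indian", "filipino", "indian"])

# np.unique returns the sorted classes and, with return_inverse=True,
# the integer code of each label (the same idea as LabelEncoder).
classes, codes = np.unique(cuisines, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes), dtype="float32")[codes]

print(classes)
print(codes)
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its class, which is the target format `categorical_crossentropy` expects.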

One useful numeric feature is the number of ingredients in each recipe, which we then standardize:

In [5]:
recipeRaw['ingredients_len'] = recipeRaw['ingredients'].apply(len)
doc_lengths = recipeRaw[['ingredients_len']].values

In [6]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

doc_lengths_standardized = ss.fit_transform(doc_lengths)
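
As a rough sketch of what the scaler does, standardization subtracts the column mean and divides by the column standard deviation. The toy counts below are invented; only the formula is the point.

```python
import numpy as np

# Toy ingredient counts, shaped (n_samples, 1) like doc_lengths.
lengths = np.array([[9.0], [11.0], [12.0], [4.0], [20.0]])

mean = lengths.mean(axis=0)
std = lengths.std(axis=0)  # population std (ddof=0), as StandardScaler uses

standardized = (lengths - mean) / std

print(standardized.ravel())
print(standardized.mean(), standardized.std())
```

After the transform the feature has mean 0 and standard deviation 1 (up to floating point error), which keeps it on a scale comparable to the other inputs of the network.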



Next we need to transform the ingredients into categories. In one sense this is a pretty typical NLP problem, but the cool thing about it is that the order of the ingredients does not matter. So this is a problem with unordered, variable-length features built from high-cardinality categorical variables.

---

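To transform these into categories we use the tokenizer below. Note that it splits on whitespace, so each word (not each full ingredient name) gets its own integer index: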

In [7]:
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

t = tf.keras.preprocessing.text.Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1

# integer-encode the documents (one index per word)
encoded_docs = t.texts_to_sequences(docs)

# pad (or truncate) every document to exactly 40 tokens
max_length = 40
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

vocab_size

3065
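
For intuition, here is a simplified pure-Python sketch of what the tokenizer and padding steps do. The helper names `fit_word_index` and `encode_and_pad` are hypothetical, and the sketch ignores the punctuation filtering that the Keras `Tokenizer` also performs.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(texts, word_index, maxlen):
    """Integer-encode each text, keep the last `maxlen` tokens, zero-pad on the right."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]                            # like truncating='pre'
        rows.append(seq + [0] * (maxlen - len(seq)))   # like padding='post'
    return rows

toy_docs = ["water vegetable oil wheat salt", "black pepper salt"]
word_index = fit_word_index(toy_docs)
padded = encode_and_pad(toy_docs, word_index, maxlen=6)
vocab = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(padded)
print(vocab)
```

The `+ 1` mirrors `vocab_size = len(t.word_index) + 1` above: index 0 never appears in `word_index`, but the embedding layer still needs a row for it because padded positions are filled with 0.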

And now we are ready for modeling. We start with a generator that feeds random batches to the model:

In [8]:
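def bootstrap_sample_generator(batch_size):
    """Yield endless random batches, sampled with replacement."""
    while True:
        batch_idx = np.random.choice(
            padded_docs.shape[0], batch_size)
        # feed the standardized lengths computed above, not the raw counts
        yield ({'cat_inputs': padded_docs[batch_idx],
                'numeric_inputs': doc_lengths_standardized[batch_idx]
               }, 
               {'output': labels[batch_idx] })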
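
The generator draws row indices uniformly with replacement, so an "epoch" here is just a fixed number of random batches rather than one full pass over the data. A small self-contained sketch with toy arrays (all `toy_*` names are made up) shows the shape of what it yields and that inputs and targets stay aligned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for padded_docs, the numeric feature, and labels.
toy_docs = np.arange(50).reshape(10, 5)
toy_numeric = np.arange(10, dtype="float32").reshape(10, 1)
toy_labels = np.eye(10, dtype="float32")

def toy_bootstrap_generator(batch_size):
    """Yield batches whose rows are drawn uniformly with replacement."""
    while True:
        idx = rng.integers(0, toy_docs.shape[0], size=batch_size)
        yield ({"cat_inputs": toy_docs[idx], "numeric_inputs": toy_numeric[idx]},
               {"output": toy_labels[idx]})

inputs, targets = next(toy_bootstrap_generator(batch_size=4))
print(inputs["cat_inputs"].shape,
      inputs["numeric_inputs"].shape,
      targets["output"].shape)
```

The dictionary keys have to match the `name` arguments of the Keras input and output layers, which is how each array is routed to the right place in the model.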

In [9]:
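def emb_sz_rule(n_cat): 
    # heuristic embedding width: grows with the number of categories, capped at 600
    return min(600, round(1.6 * n_cat**0.56))

# dropout rate used throughout the model
p = .1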
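
As a quick worked example, plugging the vocabulary size printed above into the rule gives the embedding width and parameter count that the model summary later reports for the embedding layer:

```python
def emb_sz_rule(n_cat):
    # Same heuristic as the cell above: grow with n_cat**0.56, cap at 600.
    return min(600, round(1.6 * n_cat ** 0.56))

vocab_size = 3065                   # value printed by the tokenizer cell
emb_dim = emb_sz_rule(vocab_size)
n_params = vocab_size * emb_dim     # one emb_dim-long vector per token index

print(emb_dim, n_params)  # 143 438295
```

The cap only matters for very large vocabularies; for example, `emb_sz_rule(10**6)` returns 600.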

Notice that again we have two types of inputs:

In [10]:
cat_inputs = tf.keras.layers.Input((40,), name='cat_inputs')
numeric_inputs = tf.keras.layers.Input((1,), name='numeric_inputs')

And we use the same rule as last time to size the embedding layer, then pool over the sequence axis so the result no longer depends on recipe length:

In [11]:
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, 
    emb_sz_rule(vocab_size), 
    input_length=40)
cat_x = embedding_layer(cat_inputs)

global_ave = tf.keras.layers.GlobalAveragePooling1D()(cat_x)
global_max = tf.keras.layers.GlobalMaxPool1D()(cat_x)
x = tf.keras.layers.Concatenate()([global_ave, global_max])
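
Average and max pooling are a natural fit for unordered ingredients because neither depends on the order of the sequence. A small NumPy check on random stand-in embeddings (not the trained ones) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)

# One toy "recipe": 6 token embeddings of width 4 (random stand-ins).
embedded = rng.normal(size=(6, 4))
shuffled = embedded[rng.permutation(6)]

def pool(seq):
    """Concatenate the average and the max over the sequence axis."""
    return np.concatenate([seq.mean(axis=0), seq.max(axis=0)])

print(pool(embedded).shape)
print(np.allclose(pool(embedded), pool(shuffled)))
```

Shuffling the tokens leaves the pooled vector unchanged, and its width is twice the embedding width regardless of how many tokens went in.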

In [12]:
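# bonus: copy the pooled recipe summary to every position, concatenate it
# with the per-token embeddings, apply a kernel-size-1 convolution
# (a small dense layer shared across positions), then pool again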
x = tf.keras.layers.RepeatVector(40)(x)
x = tf.keras.layers.Concatenate()([cat_x, x])

x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Conv1D(20, 1)(x)
x = tf.keras.layers.Activation('relu')(x)

global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)
global_max = tf.keras.layers.GlobalMaxPool1D()(x)
x = tf.keras.layers.Concatenate()([global_ave, global_max])
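
To see what the bonus block computes, here is a shape-level NumPy sketch of the same steps. It assumes a toy embedding width of 8 and random weights, and it leaves out dropout, so it only illustrates the data flow, not the trained layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, emb_dim, filters = 2, 40, 8, 20  # emb_dim is a toy width

cat_x = rng.normal(size=(batch, steps, emb_dim))  # per-token embeddings
summary = np.concatenate([cat_x.mean(axis=1), cat_x.max(axis=1)], axis=-1)

# RepeatVector: copy the recipe-level summary to every position.
repeated = np.repeat(summary[:, None, :], steps, axis=1)
x = np.concatenate([cat_x, repeated], axis=-1)    # (batch, steps, 3 * emb_dim)

# Conv1D with kernel size 1: the same dense layer applied at each position.
w = rng.normal(size=(3 * emb_dim, filters))
b = np.zeros(filters)
x = np.maximum(x @ w + b, 0.0)                    # ReLU

pooled = np.concatenate([x.mean(axis=1), x.max(axis=1)], axis=-1)
print(repeated.shape, x.shape, pooled.shape)
```

Because the convolution has kernel size 1 and the result is pooled again, the output is still independent of ingredient order, but each token is now transformed with knowledge of the whole recipe.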

After we process the variable-length data, we add on the fixed numeric inputs (notice they go in right where they went in the first two lessons):

In [13]:
x = tf.keras.layers.Concatenate()([x, numeric_inputs])

In [14]:
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(20, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
out = tf.keras.layers.Dense(20, activation='softmax', name='output')(x)

In [15]:
model = tf.keras.models.Model(inputs=[cat_inputs, numeric_inputs], outputs=out)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
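
For reference, categorical cross-entropy on one-hot targets is the negative log of the probability assigned to the true class, averaged over the batch. A minimal NumPy sketch with made-up predictions (the function here is an illustration, not the Keras implementation):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

y_true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_pred)
accuracy = float(np.mean(y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)))

print(round(loss, 4), accuracy)
```

Here the loss is the average of `-log(0.8)` and `-log(0.5)`, while accuracy only checks whether the largest probability falls on the true class, so the two numbers can move quite differently during training.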

In [16]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
cat_inputs (InputLayer)         [(None, 40)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 40, 143)      438295      cat_inputs[0][0]                 
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 143)          0           embedding[0][0]                  
__________________________________________________________________________________________________
global_max_pooling1d (GlobalMax (None, 143)          0           embedding[0][0]                  
______________________________________________________________________________________________

In [17]:
batch_size = 16

# note: in TF 2.1+ fit_generator is deprecated; model.fit accepts the generator directly
model.fit_generator(
    bootstrap_sample_generator(batch_size),
    steps_per_epoch=10_000 // batch_size,
    epochs=5,
    max_queue_size=10,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x13af03198>
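
Once trained, the model outputs one row of class probabilities per recipe. A sketch of mapping such rows back to cuisine names, using the first three entries of `le.classes_` and invented probabilities in place of a real `model.predict` result:

```python
import numpy as np

# First three entries of le.classes_ from above; the probabilities are made up.
classes = np.array(["brazilian", "british", "cajun_creole"])
probs = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])

# The predicted class is the column with the highest probability.
predicted = classes[probs.argmax(axis=1)]
print(predicted)
```

With the real encoder, `le.inverse_transform` applied to the argmax indices does the same lookup.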

The absolute accuracy is not as good as before, but hey, we are looking at a different dataset with a different loss function.

---

I hope you can start to see how you can transform your old techniques into deep learning ones. There is a ton more to do, of course, and a couple of further topics I'm thinking about including:

* Time series
* Natural data (images, language, sound, etc)