## Neural-Net for Tabular data
### <u>Problem statement 5:</u> 
The following dataset is taken from Kaggle: [What's Cooking Dataset](https://www.kaggle.com/c/whats-cooking-kernels-only/data). 

The dataset consists of recipes, each given as a list of ingredients, and the model is expected to classify the cuisine from that list. The problem can be framed as multi-class classification with a single independent variable, `ingredients`.<br> <br>
However, this variable holds several ingredients, each of which can be treated as a feature/category. Because almost every observation contains a different combination of ingredients, the variable has high cardinality.<br>
The problem is addressed with a CNN because this one variable contains several features that must first be captured and then aggregated.<br>
The steps are as follows: <br>
* Encode each category within the `ingredients` variable 
* Embed them all to have a fixed-size vector representation 
* Capture features of each embedded vector of `ingredients`
* Pool the captured features into a 2-D tensor (batch, features) so that each observation is represented by one vector
* Classify the pooled features using dense layers as the classifier 
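
The steps above can be sketched with plain NumPy to show how the shapes evolve. This is a toy illustration under stated assumptions, not the model built below: the two recipes, the random untrained weights, and names such as `toy_vocab` and `embed_dim` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two recipes with different numbers of tokens
recipes = [["salt", "water", "wheat"], ["salt", "eggs"]]

# 1. Encode: map each token to an integer id (0 is reserved for padding)
tokens = sorted({t for r in recipes for t in r})
toy_vocab = {tok: i + 1 for i, tok in enumerate(tokens)}
max_len = 4
encoded = np.zeros((len(recipes), max_len), dtype=int)
for row, recipe in enumerate(recipes):
    ids = [toy_vocab[t] for t in recipe][:max_len]
    encoded[row, :len(ids)] = ids

# 2. Embed: look up a fixed-size vector for every id
embed_dim = 5
embedding_matrix = rng.normal(size=(len(toy_vocab) + 1, embed_dim))
embedded = embedding_matrix[encoded]        # (recipes, max_len, embed_dim)

# 3-4. Capture and aggregate: pool over the token axis
pooled = np.concatenate([embedded.mean(axis=1), embedded.max(axis=1)], axis=1)

# 5. Classify: a dense layer followed by softmax (random, untrained weights)
n_classes = 3
logits = pooled @ rng.normal(size=(pooled.shape[1], n_classes))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

print(encoded.shape, embedded.shape, pooled.shape, probs.shape)
```

Each recipe ends up as one row of `pooled`, whatever its original length, which is what lets a dense classifier consume it.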

In [13]:
import numpy as np
import pandas as pd
import json

# Print out all rows and cols
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [8]:
df_recipes = pd.read_json('datasets/train.json')
df_recipes["ingredients_flat"] = df_recipes["ingredients"].apply(lambda x: ' '.join(x))

df_recipes.head()

| | cuisine | id | ingredients | ingredients_flat |
|---|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce black olives grape tomatoes ga... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour ground pepper salt tomatoes ground... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs pepper salt mayonaise cooking oil green c... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water vegetable oil wheat salt |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper shallots cornflour cayenne pepper... |


In [107]:
# Encode the target variable 
from sklearn import preprocessing
from tensorflow.keras.utils import to_categorical

enc = preprocessing.LabelEncoder()
enc.fit(df_recipes["cuisine"].values)

targets_enc = enc.transform(df_recipes["cuisine"].values)
# One-hot encode the target
# so that we can use `categorical_crossentropy` as the loss function 
# and `softmax` as the activation of the output layer
targets = to_categorical(targets_enc)

print("="*80)
print(f"Number & List of classes:\n {len(enc.classes_)}/{enc.classes_}")
print("="*80)
print(f"Target shape after encoding(OHE): \n {targets.shape}")


LabelEncoder()

Number & List of classes:
 20/['brazilian' 'british' 'cajun_creole' 'chinese' 'filipino' 'french'
 'greek' 'indian' 'irish' 'italian' 'jamaican' 'japanese' 'korean'
 'mexican' 'moroccan' 'russian' 'southern_us' 'spanish' 'thai'
 'vietnamese']
Target shape after encoding(OHE): 
 (39774, 20)
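
To make the two encoding steps concrete, here is a small NumPy sketch of what `LabelEncoder` followed by `to_categorical` produces. The four-label sample is invented for illustration, and `np.unique` and `np.eye` only stand in for the scikit-learn and Keras helpers used above.

```python
import numpy as np

labels = np.array(["greek", "indian", "greek", "thai"])  # hypothetical sample

# Label encoding: sorted unique classes, each label replaced by its index
classes, labels_enc = np.unique(labels, return_inverse=True)

# One-hot encoding: row i of the identity matrix represents class i
one_hot = np.eye(len(classes), dtype="float32")[labels_enc]

print(classes)        # ['greek' 'indian' 'thai']
print(labels_enc)     # [0 1 0 2]
print(one_hot.shape)  # (4, 3)

# Going back from a one-hot (or softmax) row to the class name
recovered = classes[one_hot.argmax(axis=1)]
```

The same `argmax` trick applies to model predictions: the index of the largest softmax output can be mapped back to a cuisine name with `enc.inverse_transform`.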


In [85]:
# Create a new feature that may be informative: the number of ingredients in the recipe
df_recipes["ingredients_len"] = df_recipes["ingredients"].apply(len)
ingredients_len = df_recipes[["ingredients_len"]].values 

print(f"Maximum number of ingredients in the recipes:\n {ingredients_len.max()}")
print(f"Minimum number of ingredients in the recipes:\n {ingredients_len.min()}")
print("="*80)

# Scale the new feature 
scaler = preprocessing.StandardScaler()
ingredients_len_scaled = scaler.fit_transform(ingredients_len)
print(f"Final scaled 'ingredients length' variable shape:\n {ingredients_len_scaled.shape}")
print("="*80)

Maximum number of ingredients in the recipes:
 65
Minimum number of ingredients in the recipes:
 1
Final scaled 'ingredients length' variable shape:
 (39774, 1)
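
As a quick check of what the scaling step does, the sketch below reproduces standardization by hand on a made-up column of ingredient counts. It assumes the default `StandardScaler` behaviour (centering on the mean and dividing by the population standard deviation); the sample values are hypothetical.

```python
import numpy as np

lengths = np.array([[9.0], [11.0], [12.0], [4.0]])  # hypothetical ingredient counts

# Standardize: subtract the column mean, divide by the population std (ddof=0)
mean = lengths.mean(axis=0)
std = lengths.std(axis=0)
scaled = (lengths - mean) / std

print(scaled.shape)   # (4, 1)
```

After scaling, the column has zero mean and unit variance, which keeps this single numeric input on a scale comparable to the pooled embedding features.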




In [88]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Get list of ingredients of all rows to numpy array 
ingredients_docs = df_recipes["ingredients_flat"].values
print(f"Few sample as array of strings:\n {ingredients_docs[:2]}")
print("="*80)

# Fit the tokenizer on every document to build the word index;
# its size gives the number of distinct words in the whole corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(ingredients_docs)

# Vocab size is used to encode each and every word in the documents to a unique integer
# (+1 because index 0 is reserved for padding)
ingredients_docs_vocab_size = len(tokenizer.word_index) + 1
print(f"Vocab size of the documents:\n {ingredients_docs_vocab_size}")
print("="*80)


# Encode/Transform rows to integer after tokenization
ingredients_docs_enc = tokenizer.texts_to_sequences(ingredients_docs)
print(f"Few sample as array of integer:\n {ingredients_docs_enc[:1]}")
print("="*80)

# Fix the length of each encoded document to 40 tokens: shorter documents are zero-padded, longer ones are cut.
# The tokenizer splits on whitespace, so this length counts words, not ingredients
# (the range of 1 to 65 seen earlier is measured in ingredients).
# `truncating='post'` keeps the first 40 words of a longer document; the default ('pre') would keep the last 40.
# This number is a hyper-parameter that we can tune later on
max_length = 40
ingredients_docs_padded = pad_sequences(ingredients_docs_enc, 
                                        maxlen=max_length, 
                                        padding='post', 
                                        truncating='post',
                                        value=0, # Zero padding
                                       )

print(f"Final 'ingredients' variable shape after padding:\n {ingredients_docs_padded.shape}")
print("="*80)

print(f"Sample of 'ingredients' variable representation after padding:\n {ingredients_docs_padded[:1]}")
print("="*80)

Few sample as array of strings:
 ['romaine lettuce black olives grape tomatoes garlic pepper purple onion seasoning garbanzo beans feta cheese crumbles'
 'plain flour ground pepper salt tomatoes ground black pepper thyme eggs green tomatoes yellow corn meal milk vegetable oil']
Vocab size of the documents:
 3065
Few sample as array of integer:
 [[314, 138, 13, 128, 339, 18, 4, 1, 104, 25, 79, 489, 50, 204, 10, 287]]
Final 'ingredients' variable shape after padding:
 (39774, 40)
Sample of 'ingredients' variable representation after padding:
 [[314 138  13 128 339  18   4   1 104  25  79 489  50 204  10 287   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]]
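
The tokenize, encode, and pad steps can be mimicked in pure Python to see exactly what ends up in each row. This is a simplified sketch, not the Keras implementation: the two documents and the `pad` helper are hypothetical, and the real `Tokenizer` additionally lowercases text and strips punctuation.

```python
from collections import Counter

docs = ["salt water wheat", "salt eggs pepper salt"]  # hypothetical documents

# Index words by descending frequency, starting at 1 (0 is kept for padding)
counts = Counter(word for doc in docs for word in doc.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
vocab_size = len(word_index) + 1

def pad(seq, maxlen, truncating="post"):
    """Zero-pad at the end; cut long sequences at the end ('post') or start ('pre')."""
    seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

encoded = [[word_index[w] for w in doc.split()] for doc in docs]
padded = [pad(seq, maxlen=3) for seq in encoded]

print(encoded)                          # [[1, 2, 3], [1, 4, 5, 1]]
print(padded)                           # [[1, 2, 3], [1, 4, 5]]
print(pad(encoded[1], 3, "pre"))        # [4, 5, 1]
```

The last two lines show why the truncation side matters: 'post' keeps the beginning of a long document, while 'pre' keeps its end.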


In [89]:
def sample_generator(batch_size):
    while True:
        batch_idx = np.random.choice(ingredients_docs_padded.shape[0], batch_size)
        
        yield ({'cat_inputs': ingredients_docs_padded[batch_idx],
                'num_inputs': ingredients_len_scaled[batch_idx],}, 
               {'output': targets[batch_idx]}
              )
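
A generator like this can be checked by pulling a single batch and inspecting its shapes before handing it to Keras. The sketch below is a parameterized variant written for illustration: `make_generator` and the zero-filled stand-in arrays are assumptions, chosen only to mirror the shapes of the real features.

```python
import numpy as np

def make_generator(cat, num, y, batch_size, seed=0):
    """Yield random batches forever, keyed by the model's input/output names."""
    rng = np.random.default_rng(seed)
    while True:
        idx = rng.choice(cat.shape[0], batch_size)   # sampled with replacement
        yield ({"cat_inputs": cat[idx], "num_inputs": num[idx]},
               {"output": y[idx]})

# Stand-in arrays with the same column layout as the real features (fewer rows)
cat = np.zeros((100, 40), dtype=int)
num = np.zeros((100, 1))
y = np.zeros((100, 20))

inputs, outputs = next(make_generator(cat, num, y, batch_size=32))
print(inputs["cat_inputs"].shape, inputs["num_inputs"].shape, outputs["output"].shape)
```

Because rows are drawn with replacement, an "epoch" here is simply a fixed number of random batches rather than one full pass over the data.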

In [90]:
# Rule of thumb: size of embedding vector w.r.t the number of categories 
def embedding_size(no_cat):
    return min(600, round(1.6 * no_cat**0.56))
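
As a worked example, the rule can be evaluated for the vocabulary size printed earlier (3065). The function is repeated so the snippet runs on its own; the second check uses an arbitrarily large category count just to show the cap.

```python
def embedding_size(no_cat):
    return min(600, round(1.6 * no_cat**0.56))

vocab_size = 3065                 # vocabulary size reported above
dim = embedding_size(vocab_size)

print(dim)                        # 143
print(vocab_size * dim)           # 438295 weights in the embedding matrix
print(embedding_size(10**6))      # 600, the upper cap
```

The size grows sub-linearly with the number of categories, so a large vocabulary does not blow up the embedding width.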

In [91]:
from tensorflow.keras.layers import (Input, Dropout, Dense, 
                                     BatchNormalization, Embedding, 
                                     Flatten, Concatenate, Conv1D, 
                                     Activation, GlobalAveragePooling1D,
                                     GlobalMaxPool1D, RepeatVector
                                    )
from tensorflow.keras.models import Model

# Dropout probability 
p = .1
batch_size = 32

In [99]:
cat_inputs = Input(shape=(40,), name="cat_inputs")
num_inputs = Input(shape=(1,), name="num_inputs")

#==== Categorical Transformation ====
# Encode the category using embedding layer
# Embedding layer weights are trainable 
embedding_layer = Embedding(input_dim=ingredients_docs_vocab_size, # number of unique words in the vocab
                            output_dim=embedding_size(ingredients_docs_vocab_size),
                            input_length=40
                           )

X_cat = embedding_layer(cat_inputs)

global_avg = GlobalAveragePooling1D(name='global_avg_1')(X_cat)
global_max = GlobalMaxPool1D(name='global_max_1')(X_cat)
x = Concatenate()([global_avg, global_max])

# Repeat the pooled vector 40 times (once per token position) so that it matches
# the sequence length of the embedded input and can be concatenated with it
x = RepeatVector(40, name='repeat')(x) 
x = Concatenate()([X_cat, x])

x = Dropout(p)(x)
x = Conv1D(20, 1)(x)
x = Activation('relu')(x)

global_avg = GlobalAveragePooling1D(name='global_avg_2')(x)
global_max = GlobalMaxPool1D(name='global_max_2')(x)
x = Concatenate()([global_avg, global_max])

#==== Append the categorical transformed features to the numerical features ====
x = Concatenate()([x, num_inputs])
#===== 
x = Dropout(p)(x)
x = Dense(100, activation='relu')(x)
#===== 
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(20, activation='relu')(x)
#===== 
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(10, activation='relu')(x)
#===== 
x = BatchNormalization()(x)
x = Dropout(p)(x)
#=====
out = Dense(20, activation='softmax', name='output')(x)

In [100]:
model = Model(inputs=[cat_inputs, num_inputs], outputs=out)

for layer in model.layers:
    print(layer.output_shape)

[(None, 40)]
(None, 40, 143)
(None, 143)
(None, 143)
(None, 286)
(None, 40, 286)
(None, 40, 429)
(None, 40, 429)
(None, 40, 20)
(None, 40, 20)
(None, 20)
(None, 20)
(None, 40)
[(None, 1)]
(None, 41)
(None, 41)
(None, 100)
(None, 100)
(None, 100)
(None, 20)
(None, 20)
(None, 20)
(None, 10)
(None, 10)
(None, 10)
(None, 20)
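
The shapes listed above can be reproduced with NumPy to see what each stage of the categorical branch does. This is a sketch with random stand-in values and weights (dropout is omitted, and names such as `summary` and `features` are made up here); it relies on the fact that a `Conv1D` with kernel size 1 acts as a dense layer applied to each token position independently.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, emb = 2, 40, 143

x_cat = rng.normal(size=(batch, steps, emb))            # embedding output

# Global average and max pooling over the token axis, then concatenate
summary = np.concatenate([x_cat.mean(axis=1), x_cat.max(axis=1)], axis=-1)

# RepeatVector: copy the recipe-level summary to every token position
repeated = np.repeat(summary[:, None, :], steps, axis=1)
x = np.concatenate([x_cat, repeated], axis=-1)

# Conv1D(20, 1) + ReLU: the same linear map applied at each position
w, b = rng.normal(size=(x.shape[-1], 20)), np.zeros(20)
conv = np.maximum(x @ w + b, 0.0)

# Second pooling stage, then append the scaled ingredient count
pooled = np.concatenate([conv.mean(axis=1), conv.max(axis=1)], axis=-1)
num = rng.normal(size=(batch, 1))
features = np.concatenate([pooled, num], axis=-1)

print(summary.shape, x.shape, conv.shape, pooled.shape, features.shape)
```

Concatenating the repeated summary gives every token access to recipe-level context before the convolution, which is the purpose of the `RepeatVector` step.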


In [101]:
model.compile(optimizer='rmsprop',
             loss='categorical_crossentropy', 
             metrics=['accuracy'])
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
cat_inputs (InputLayer)         [(None, 40)]         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 40, 143)      438295      cat_inputs[0][0]                 
__________________________________________________________________________________________________
global_avg_1 (GlobalAveragePool (None, 143)          0           embedding_1[0][0]                
__________________________________________________________________________________________________
global_max_1 (GlobalMaxPooling1 (None, 143)          0           embedding_1[0][0]                
____________________________________________________________________________________________

In [98]:
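# `steps_per_epoch` must be an integer, hence the floor division.
# Note: `fit_generator` is deprecated in later TensorFlow 2.x releases, where `model.fit` accepts generators directly.
model.fit_generator(
    sample_generator(batch_size),
    steps_per_epoch=10_000 // batch_size,
    epochs=20,
    max_queue_size=10
)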

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc4c4fbf310>