# What's cooking?

Here, we predict which cuisine a given recipe belongs to, based on its ingredients.

We choose a very simple model (logistic regression) and obtain a score only a few percentage points below the top ones on the leaderboard. The important difference is that this kernel takes less than 15 minutes to run (including tuning), as opposed to several hours.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

### Load data, basic preprocessing

We begin by reading in the data, and inspecting it. We see that each row corresponds to a recipe. Our objective is to predict `cuisine` given `ingredients`.

In [None]:
train = pd.read_json('../input/train.json')
test = pd.read_json('../input/test.json')
print('Train size is', train.shape)
print('Test size is', test.shape)
train.head()

As the ingredients are given as a list, join them together into a string.

In [None]:
for t in (train, test):
    t.set_index('id', inplace=True)
    t.ingredients = t.ingredients.str.join(' ')
    
train.head()

Then, give each word appearing in the ingredients its own column. Words are lemmatised first, and words that appear in fewer than 5 recipes are dropped. Each entry counts how many times that word occurs in the recipe, so it is `0` when the word is absent. Most values will be `0`, so the result is stored as a sparse matrix, and we need to call `todense` to visualise it.

We see that, for the column corresponding to 'lettuce', the first row is `1` and the next four are `0`, as expected.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

class LemmaTokenizer(object):
    """Tokenise a string and reduce each token to its WordNet lemma."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

#count_vec = CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], min_df=10)
count_vec = CountVectorizer(tokenizer=LemmaTokenizer(), min_df=5)
X_train = count_vec.fit_transform(train.ingredients)
X_test = count_vec.transform(test.ingredients)

X_train[:5, count_vec.vocabulary_['lettuce']].todense()

In [None]:
count_vec.vocabulary_
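
To make the encoding above concrete, here is a minimal sketch that builds the same kind of count matrix by hand. The three recipes are hypothetical, and the sketch splits on whitespace only (no lemmatisation and no `min_df` filtering), so it illustrates the idea rather than reproducing what `CountVectorizer` does internally.

```python
from collections import Counter

# Hypothetical toy recipes, already joined into strings as in the cells above.
recipes = [
    "romaine lettuce black olives",
    "plain flour ground pepper",
    "iceberg lettuce romaine lettuce",
]

# Vocabulary: every distinct word, sorted, mapped to a column index.
all_words = sorted({word for recipe in recipes for word in recipe.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Count matrix: one row per recipe, one column per word.
counts = []
for recipe in recipes:
    word_counts = Counter(recipe.split())
    counts.append([word_counts.get(word, 0) for word in vocabulary])

lettuce_column = [row[vocabulary["lettuce"]] for row in counts]
print(lettuce_column)  # [1, 0, 2]
```

Note that the last recipe gets a `2` rather than a `1`: the entries are word counts, so a recipe that mentions two kinds of lettuce is counted twice in that column.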

Visualise the target variable.

*Note: many kernels in this competition use a label encoder at this stage. However, when using one of sklearn's classifiers, this is unnecessary.*

In [None]:
y_train = train.cuisine
y_train.value_counts().sort_values().plot(kind='barh')

### TFIDF

The first two steps below were already carried out by `CountVectorizer`; the next cell performs the last two:
- each recipe is split into separate words ('tokens');
- a column is created for each word, where the row value corresponds to how many times that particular word is present in the recipe;
- each count is multiplied by the inverse document frequency of its column: this is $\ln((n+1)/(d+1))+1$, where $d$ is the number of rows containing the corresponding word and $n$ is the total number of rows;
- finally, rows are normalised by dividing by their $L2$ norm.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train_idf = tfidf.fit_transform(X_train)
X_test_idf = tfidf.transform(X_test)
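
As a rough illustration of the weighting and normalisation steps, the sketch below applies them to a tiny count matrix in plain Python. The matrix and word list are hypothetical, and the formula is the smoothed inverse document frequency $\ln((n+1)/(d+1))+1$, which is what `TfidfTransformer` uses with its default settings.

```python
import math

# Hypothetical count matrix: 3 recipes (rows) x 3 words (columns).
words = ["lettuce", "olives", "pepper"]
counts = [
    [1, 1, 0],
    [0, 0, 1],
    [2, 0, 1],
]
n = len(counts)

# Document frequency d: number of rows in which each word appears at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]

# Smoothed inverse document frequency: ln((n + 1) / (d + 1)) + 1.
idf = [math.log((n + 1) / (d + 1)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Scale each count by its column's idf, then divide by the row's L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

Rarer words (here 'olives', which appears in a single recipe) receive a larger idf, and after normalisation every non-empty row has unit length.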

Look at how this changes our rows in the column corresponding to 'lettuce'.

In [None]:
X_train_idf[:5, count_vec.vocabulary_['lettuce']].todense()

Let's try to match the top row by hand (because we'd rather understand the tools we use... right?).

In [None]:
from collections import Counter

n = train.shape[0]
first_row_ingredients = train.ingredients.iloc[0]
# CountVectorizer lowercases the text before tokenising, so we do the same
first_row_counts = Counter(LemmaTokenizer()(first_row_ingredients.lower()))
first_row_tfidf = {}
for lemma, tf in first_row_counts.items():
    if lemma not in count_vec.vocabulary_:
        continue  # appears in fewer than min_df=5 recipes, so it has no column
    # document frequency: number of recipes containing this lemma at least once
    d = (X_train[:, count_vec.vocabulary_[lemma]] > 0).sum()
    idf = np.log((n+1)/(d+1))+1
    first_row_tfidf[lemma] = tf*idf
    if lemma=='lettuce':
        print('The ingredient "lettuce" appears in {} recipes, so d={}.'.format(d, d))
        print('In total, there are {} recipes, so n={}.'.format(n, n))
        print('Substituting into the formula above, we get an idf of {}.'.format(idf))

our_result = first_row_tfidf['lettuce']/np.linalg.norm(list(first_row_tfidf.values()))
print('Normalising across ingredients, we get {}, which should match sklearn\'s result.'.format(our_result))

### Linear model: logistic regression

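We use Bayesian optimisation to tune the regularisation parameter `C` of the logistic regression. The search runs over the base-10 exponent of `C`, in the range $[-3, 2]$, and each candidate is scored by its mean 5-fold cross-validation score.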

In [None]:
from bayes_opt import BayesianOptimization
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def optimise_lr(X, y, C):

    def target(C):
        clf = Pipeline(
            [("tf_idf", TfidfVectorizer(tokenizer=LemmaTokenizer(), min_df=5)),
             ("lr", LogisticRegression(C=10**C))])
        cv_results = np.mean(cross_val_score(clf, X, y, cv=5))
        return cv_results
    
    bo = BayesianOptimization(target, {'C': C})
    bo.maximize(init_points=2, n_iter=10)
    return bo.res['max']['max_params']

import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    best_params = optimise_lr(train.ingredients, train.cuisine, (-3, 2))
print(best_params)
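
One thing to keep in mind when reading the printed result: since `target` builds the model with `C=10**C`, the value reported under the key `'C'` is an exponent, not the regularisation strength itself. A small sketch of the conversion, using a hypothetical helper `exponent_to_C` that is not part of the kernel:

```python
def exponent_to_C(exponent):
    """Map the searched exponent to the value passed to LogisticRegression(C=...)."""
    return 10.0 ** exponent

bounds = (-3, 2)  # same bounds as passed to optimise_lr above
C_range = tuple(exponent_to_C(e) for e in bounds)
print(C_range)  # (0.001, 100.0)
```

Searching over the exponent means equal steps in the search space correspond to equal ratios of `C`, which suits a parameter whose plausible values span several orders of magnitude.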

![](https://i.imgur.com/XVLoRHA.jpg)

Logistic regression can be thought of as a neural network with no hidden layers. So...let's add a hidden layer!

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential([
    Dropout(.2, input_shape=(X_train.shape[1],)),
    Dense(2048, activation='relu'),
    Dropout(.2),
    Dense(len(set(y_train)), activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_hinge',
              metrics=['accuracy'])
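
To see why logistic regression is the "no hidden layers" case, the sketch below writes both forward passes in plain Python. All weights, biases and layer sizes are hypothetical toy values, and dropout is left out; it is only meant to show the structural difference, not to mirror the Keras model above.

```python
import math

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(z):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in z]

# Hypothetical toy sizes: 3 input features, 2 classes, 4 hidden units.
x = [1.0, 0.0, 2.0]

# Multinomial logistic regression: inputs -> softmax, no hidden layer.
W = [[0.5, -1.0, 0.25], [-0.5, 1.0, 0.0]]
b = [0.0, 0.1]
lr_probs = softmax(dense(x, W, b))

# One hidden layer: inputs -> ReLU hidden units -> softmax.
W1 = [[0.2, 0.1, -0.3], [0.0, 0.5, 0.5], [-0.4, 0.3, 0.1], [0.6, -0.2, 0.0]]
b1 = [0.0, 0.0, 0.1, -0.1]
W2 = [[0.3, -0.2, 0.5, 0.1], [-0.3, 0.2, -0.5, 0.4]]
b2 = [0.0, 0.0]
mlp_probs = softmax(dense(relu(dense(x, W1, b1)), W2, b2))

print(lr_probs, mlp_probs)
```

The only difference between the two is the extra `relu(dense(...))` step, which lets the second model represent interactions between input features that a purely linear model cannot.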

In [None]:
import keras
from sklearn.preprocessing import LabelEncoder
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1)

lb = LabelEncoder()
lb_train = lb.fit_transform(y_train)

one_hot_labels = keras.utils.to_categorical(lb_train, num_classes=len(set(y_train)))
history = model.fit(X_train_idf, one_hot_labels, validation_split=0.33, epochs=50, batch_size=128, callbacks=[early_stopping])
import matplotlib.pyplot as plt
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

### Submit!

In [None]:
preds = model.predict(X_test_idf, batch_size=32)
test['cuisine'] = lb.inverse_transform(np.argmax(preds, axis=1))
test.reset_index()[['id', 'cuisine']].to_csv('preds.csv', index=False)
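
The round trip from cuisine names to one-hot rows and back can be sketched in plain Python. The labels and probabilities below are hypothetical; the sketch assumes only that the encoder numbers the sorted unique classes, as `LabelEncoder` does, and that each prediction row holds one probability per class.

```python
# Hypothetical labels and predicted probabilities, for illustration only.
labels = ["italian", "greek", "italian", "mexican"]

# Like LabelEncoder: sorted unique classes, each mapped to an integer.
classes = sorted(set(labels))                # ['greek', 'italian', 'mexican']
encode = {name: i for i, name in enumerate(classes)}
encoded = [encode[name] for name in labels]  # [1, 0, 1, 2]

# One-hot rows: a 1.0 in the column of the true class, 0.0 elsewhere.
one_hot = [[1.0 if j == i else 0.0 for j in range(len(classes))]
           for i in encoded]

# Each prediction row holds one probability per class; argmax picks the column.
preds = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
decoded = [classes[max(range(len(row)), key=row.__getitem__)] for row in preds]
print(decoded)  # ['italian', 'greek']
```

Decoding must use the same fitted encoder that produced the training labels, otherwise the column order of the predictions would not line up with the class names.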