## What's Cooking?
Use recipe ingredients to categorize the cuisine    
Kaggle source: https://www.kaggle.com/c/whats-cooking

#### Competition Description
Picture yourself strolling through your local, open-air market... What do you see? What do you smell? What will you make for dinner tonight?

If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India’s market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see.

Some of our strongest geographic and cultural associations are tied to a region's local foods. This playground competition asks you to predict the category of a dish's cuisine given a list of its ingredients. 

### Summary
- [Understand Features](#feature)
- [Binary Feature Matrix](#matrix)
- [Naïve Bayes Classifier](#naive)
  - 3-fold cross-validation with both a Gaussian and a Bernoulli feature-distribution assumption.
- [Logistic Regression](#logistic)
  - 3-fold cross-validation
- [Predict Test Results](#results)

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

In [2]:
train_data = pd.read_json('train.json')

In [3]:
train_data

|  | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... |
| ... | ... | ... | ... |
| 39769 | 29109 | irish | [light brown sugar, granulated sugar, butter, ... |
| 39770 | 11462 | italian | [KRAFT Zesty Italian Dressing, purple onion, b... |
| 39771 | 2238 | irish | [eggs, citrus fruit, raisins, sourdough starte... |
| 39772 | 41882 | chinese | [boneless chicken skinless thigh, minced garli... |


#### <a id = 'feature'>Understand Features</a>

In [4]:
# Collect the distinct cuisine labels and the distinct ingredient strings
cuisines = set(train_data['cuisine'])
ingredients = set()

for recipe in train_data['ingredients']:
    ingredients.update(recipe)
    
print("There are {} dishes in the training set.".format(len(train_data)))
print("There are {} unique cuisine in the training set.".format(len(cuisines)))
print("There are {} unique ingredients in the training set.".format(len(ingredients)))

There are 39774 dishes in the training set.
There are 20 unique cuisine in the training set.
There are 6714 unique ingredients in the training set.


#### <a id ='matrix'>Binary Feature Matrix</a>

Represent each dish by a binary ingredient feature vector. The feature matrix has size n × d, where n is the number of dishes and d is the number of unique ingredients. The same d columns are used for both the training set and the test set.

In [38]:
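ingred_tot = [",".join(ele) for ele in train_data.ingredients]
# lowercase=False: match the vocabulary exactly (some entries contain capitals); binary=True: 0/1 instead of counts
ctV = CountVectorizer(vocabulary=ingredients, tokenizer=lambda ele: ele.split(','),
                      lowercase=False, binary=True)
X = ctV.fit_transform(ingred_tot)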

In [47]:
X.shape

(39774, 6714)
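
To make the n × d binary encoding concrete, here is a minimal pure-Python sketch of the same idea on toy data. The helper names (`build_vocabulary`, `to_binary_matrix`) and the toy recipes are hypothetical and are not part of the notebook; the sketch also assumes that ingredients never seen in training are simply dropped when encoding new dishes.

```python
def build_vocabulary(recipes):
    """Map each unique ingredient to a column index (sorted for determinism)."""
    unique = sorted({ing for recipe in recipes for ing in recipe})
    return {ing: col for col, ing in enumerate(unique)}


def to_binary_matrix(recipes, vocabulary):
    """Return an n x d list of lists with 1 where a recipe uses an ingredient."""
    matrix = []
    for recipe in recipes:
        row = [0] * len(vocabulary)
        for ing in recipe:
            col = vocabulary.get(ing)
            if col is not None:  # ingredients unseen in training are skipped
                row[col] = 1
        matrix.append(row)
    return matrix


# Hypothetical toy data, not taken from train.json
toy_train = [["salt", "water", "wheat"], ["eggs", "salt", "pepper"]]
toy_test = [["eggs", "saffron", "salt"]]

vocab = build_vocabulary(toy_train)          # columns: eggs, pepper, salt, water, wheat
X_toy_train = to_binary_matrix(toy_train, vocab)
X_toy_test = to_binary_matrix(toy_test, vocab)  # "saffron" has no column
```

The test rows are encoded with the vocabulary built from the training rows, so both matrices share the same d columns.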

Use LabelEncoder to encode the cuisine column of the training data as integer class indices.

In [69]:
le = LabelEncoder()
y = le.fit_transform(train_data.cuisine)

In [70]:
y

array([ 6, 16,  4, ...,  8,  3, 13])

In [71]:
len(y)

39774
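
Conceptually, the label encoding above assigns each class its position in the sorted list of unique labels, and the inverse mapping is a lookup in that list. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation; `fit_label_encoder` and the toy labels are hypothetical.

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a label -> index mapping."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return classes, to_index


# Hypothetical toy labels, not the full cuisine column
toy_labels = ["greek", "indian", "filipino", "indian"]

classes, to_index = fit_label_encoder(toy_labels)
encoded = [to_index[label] for label in toy_labels]  # labels -> integers
decoded = [classes[i] for i in encoded]              # integers -> labels
```

Decoding the encoded values recovers the original labels, which is the same round trip used later to turn predicted indices back into cuisine names.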

#### <a id = 'naive'>Naïve Bayes Classifier</a>

Run 3-fold cross-validation on the training set with two Naïve Bayes classifiers: one that assumes Gaussian-distributed features and one that assumes Bernoulli-distributed features.

In [74]:
gaussian_accur = []
bernoulli_accur = []
kf = KFold(n_splits = 3)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index].toarray(), X[test_index].toarray()
    y_train, y_test = y[train_index], y[test_index]
    
    #Gaussian assumption
    gaussian = GaussianNB()    
    gaussian.fit(X_train, y_train)
    g_score = gaussian.score(X_test, y_test)
    gaussian_accur.append(g_score)

    #Bernoulli assumption
    bernoulli = BernoulliNB()
    bernoulli.fit(X_train, y_train)
    b_score = bernoulli.score(X_test, y_test)
    bernoulli_accur.append(b_score)
    
print("NaïveBayes Classifier with Gaussian distribution prior assumption has an average accuracy of {:.2f}"\
      .format(np.average(gaussian_accur)))
print("NaïveBayes Classifier with Bernoulli distribution prior assumption has an average accuracy of {:.2f}"\
      .format(np.average(bernoulli_accur)))

NaïveBayes Classifier with Gaussian distribution prior assumption has an average accuracy of 0.37
NaïveBayes Classifier with Bernoulli distribution prior assumption has an average accuracy of 0.68


Accuracy is higher under the Bernoulli assumption because our feature matrix is binary (0 or 1), which is exactly the kind of data a Bernoulli model describes.
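
To illustrate why a Bernoulli model suits 0/1 features, here is a minimal pure-Python sketch of Bernoulli Naïve Bayes. It estimates, per class, the probability that each ingredient is present, and scores a dish using both the ingredients it contains and the ones it lacks. This is a simplified illustration, not the scikit-learn implementation; the function names, the smoothing value `alpha=1.0`, and the toy data are assumptions.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature presence probabilities."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed probability that feature j is present in class c
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict_bernoulli_nb(model, x):
    """Return the class with the highest log-posterior for one binary row."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xj, p in zip(x, probs):
            # Both presence and absence of a feature contribute evidence
            score += math.log(p) if xj else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: columns are [soy sauce, tortilla, salt]
X_toy = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y_toy = ["chinese", "chinese", "mexican", "mexican"]

nb_model = fit_bernoulli_nb(X_toy, y_toy)
prediction = predict_bernoulli_nb(nb_model, [1, 0, 1])
```

A Gaussian model would instead fit a mean and variance to each 0/1 column, which is a poor description of values that can only be 0 or 1.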

#### <a id='logistic'>Logistic Regression</a>

In [80]:
logistic_accur  = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index].toarray(), X[test_index].toarray()
    y_train, y_test = y[train_index], y[test_index]

    lr = LogisticRegression(max_iter =500, solver='lbfgs').fit(X_train, y_train)
    lr_score = lr.score(X_test, y_test)
    logistic_accur.append(lr_score)

print("Logistic Regression has an average accuracy of {:.2f}".format(np.average(logistic_accur)))

Logistic Regression has an average accuracy of 0.77


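We use logistic regression for the final predictions, since it has the highest cross-validation accuracy (0.77, versus 0.68 and 0.37 for the two Naïve Bayes variants).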
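
The accuracies compared here all come from the same unshuffled 3-fold split. As a rough illustration of how such a split partitions the rows, here is a pure-Python sketch that produces contiguous folds, with the first `n_samples % n_splits` folds one row larger than the rest. `kfold_indices` is a hypothetical helper, not the scikit-learn code.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds absorb the remainder, one row each
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Fold sizes for the 39774 training dishes with 3 splits
fold_sizes = [len(test) for _, test in kfold_indices(39774, 3)]

# A smaller case that does not divide evenly
small_folds = [test for _, test in kfold_indices(7, 3)]
```

Because the folds are contiguous blocks, the estimate depends on the row order of the file. Passing `shuffle=True` to `KFold`, or using `StratifiedKFold`, are alternatives worth considering if the rows might be ordered in some systematic way.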

#### <a id ='results'>Predict Test Results</a>

In [82]:
#read test data
test_data = pd.read_json('test.json')

#process test data into a feature matrix
ingred_tot_test = [",".join(ele) for ele in test_data.ingredients]
#transform only: reuse the vocabulary fixed on the training data so the columns match X
X_test = ctV.transform(ingred_tot_test)

In [83]:
lr_final = LogisticRegression(max_iter =500, solver='lbfgs')
lr_final.fit(X.toarray(),y)
y_pred = lr_final.predict(X_test)

In [90]:
#using labelencoder to convert numerical indexes back to categories. 
y_predlabel = list(le.inverse_transform(y_pred))

#### Export

In [103]:
output = test_data[['id']].copy()
output['cuisine'] = y_predlabel
output.sort_values('id' , ascending=True, inplace = True)

In [105]:
output.head()

|  | id | cuisine |
|---|---|---|
| 4987 | 5 | mexican |
| 9232 | 7 | indian |
| 9638 | 11 | vietnamese |
| 4927 | 12 | italian |
| 3280 | 13 | southern_us |


In [109]:
output[['id', 'cuisine']].to_csv('output.csv', index=False)