## What's Cooking?
### Competition Description 

If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India’s market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see.

Some of our strongest geographic and cultural associations are tied to a region's local foods. This playground competition asks you to predict the category of a dish's cuisine given a list of its ingredients.

Competition Link: https://www.kaggle.com/c/whats-cooking

### Practice Skills 

* Logistic Regression
* K-fold cross-validation

(a) Join the What’s Cooking competition on Kaggle. Download the training and test data (in .json). The competition page describes how these files are formatted.

In [6]:
import numpy as np
import pandas as pd
train = pd.read_json('cooking/train.json')
test = pd.read_json('cooking/test.json')

In [7]:
train.head()

|   | cuisine | id | ingredients |
|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |


(b) Tell us about the data. How many samples (dishes) are there in the training set? How many categories (types of cuisine)? Use a list to keep all the unique ingredients appearing in the training set. How many unique ingredients are there?


In [8]:
num_of_s = len(train)
print('There are', num_of_s, 'samples in the training set.')

There are 39774 samples in the training set.


In [9]:
cuisine = train['cuisine'].unique()
num_of_c = len(cuisine)
print('There are', num_of_c, 'types of cuisine in the training set.')

There are 20 types of cuisine in the training set.


In [10]:
import itertools
# Flatten every dish's ingredient list and keep the distinct names.
ingd = set(itertools.chain.from_iterable(train['ingredients']))
ingd = np.array(list(ingd))
num_of_i = len(ingd)
print('There are', num_of_i, 'unique ingredients in the training set.')

There are 6714 unique ingredients in the training set.
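
As a small illustration of this step, the sketch below collects the unique ingredients from a made-up list of dishes (the `toy_dishes` data is hypothetical, not taken from the competition files). A Python set has no guaranteed iteration order, so sorting the result is one way to keep the ingredient order, and hence the feature columns, reproducible between runs.

```python
from itertools import chain

# Hypothetical toy data standing in for train['ingredients'].
toy_dishes = [
    ["chicken", "lettuce", "tomato", "rice"],
    ["beef", "egg", "rice"],
    ["egg", "tomato"],
]

# Flatten the per-dish lists, deduplicate with a set, then sort for a stable order.
vocabulary = sorted(set(chain.from_iterable(toy_dishes)))

print(vocabulary)
print(len(vocabulary), "unique ingredients")
```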


(c) Represent each dish by a binary ingredient feature vector. Suppose there are d different ingredients in total in the training set. Represent each dish by a 1 × d binary ingredient vector x, where x_i = 1 if the dish contains ingredient i and x_i = 0 otherwise. For example, suppose the ingredients in the training set are { beef, chicken, egg, lettuce, tomato, rice } and the dish is made from { chicken, lettuce, tomato, rice }; then the dish is represented by the 1 × 6 binary vector [0, 1, 0, 1, 1, 1] as its feature vector. Use an n × d feature matrix to represent all the dishes in the training set, and likewise for the test set, where n is the number of dishes.
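
The example in the prompt can be checked with a few lines of plain Python. This is a sketch of the encoding only, not the code used for the full dataset; the names `vocabulary`, `column` and `encode` are illustrative.

```python
# Ingredient vocabulary from the example above, in the stated order.
vocabulary = ["beef", "chicken", "egg", "lettuce", "tomato", "rice"]

# Map each ingredient to its column index for constant-time lookup.
column = {name: i for i, name in enumerate(vocabulary)}

def encode(dish):
    """Return a binary vector with a 1 in the column of each ingredient in the dish."""
    x = [0] * len(vocabulary)
    for ingredient in dish:
        if ingredient in column:  # ingredients outside the vocabulary are ignored
            x[column[ingredient]] = 1
    return x

print(encode(["chicken", "lettuce", "tomato", "rice"]))  # [0, 1, 0, 1, 1, 1]
```

Ignoring unknown ingredients matters for the test set, which can contain ingredients that never appear in the training data.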

In [11]:
pd.DataFrame(columns = ingd)

|   | no-salt-added black beans | grassfed beef | soba noodles | Herdez Salsa Verde | and fat free half half | knockwurst | marshmallow creme | medium salsa | mahi mahi | food gel | ... | mint leaves | sweet peas | red rice | seitan | chicken schmaltz | savory | beef smoked sausage | bay scallops | fresh parmesan cheese | broccolini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [13]:
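import numpy as np

def feature(s):
    # Start from an all-zero vector with one entry per known ingredient.
    x = np.zeros(len(ingd), dtype=int)
    for i in s:
        # Add 1 at the position where ingd matches this ingredient.
        x += (ingd == i)
    return x

# Encode every training dish; rows follow the order of train['ingredients'].
s = train['ingredients']
m = list(map(feature, s))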

In [14]:
matrix = pd.DataFrame(np.array(m))

In [18]:
matrix.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6704,6705,6706,6707,6708,6709,6710,6711,6712,6713
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [109]:
matrix.shape

(39774, 6714)
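
A quick back-of-the-envelope check shows why this dense matrix is heavy to work with. The sketch assumes 8 bytes per entry (NumPy's default integer type on most 64-bit platforms); the actual dtype depends on the platform and NumPy version.

```python
n_dishes, n_ingredients = 39774, 6714  # shape reported above
bytes_per_entry = 8  # assumption: 64-bit integers

cells = n_dishes * n_ingredients
size_gib = cells * bytes_per_entry / 2**30

print(f"{cells:,} entries, about {size_gib:.2f} GiB")
```

Because each dish lists only a small number of the 6714 ingredients, almost all entries are zero, so a sparse representation would be one way to reduce the footprint.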

(d) Use a Naïve Bayes classifier to perform 3-fold cross-validation on the training set and report your average classification accuracy. Try both the Gaussian distribution assumption and the Bernoulli distribution assumption.
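
For orientation, here is a sketch of how 3-fold cross-validation partitions the sample indices when the data is split into contiguous blocks without shuffling, which is what scikit-learn's `KFold` does by default. The helper `contiguous_folds` is a hypothetical stand-in written for illustration, not the library's implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        # The first `extra` folds take one additional sample each.
        stop = start + base + (1 if k < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# Small example: 7 samples, 3 folds.
for train_idx, test_idx in contiguous_folds(7, 3):
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in exactly one test fold, and the cross-validation accuracy is the mean of the per-fold scores.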

In [134]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

In [102]:
gnb = GaussianNB()
gnb.fit(matrix, train['cuisine'])

GaussianNB(priors=None)

In [None]:
# train1 is another name for matrix (not a copy), so matrix gains the 'cuisine' column too.
train1 = matrix
train1['cuisine'] = train['cuisine']

In [150]:
kf = KFold(n_splits = 3)
gnb_accuracy = []
for train_id, test_id in kf.split(train1):
    train = train1.iloc[train_id]
    test = train1.iloc[test_id]
    train_y = train['cuisine']
    train_x = train.drop('cuisine', axis = 1)
    test_y = test['cuisine']
    test_x = test.drop('cuisine', axis = 1)
    
    gnb = GaussianNB()
    gnb.fit(train_x, train_y)
    prediction = gnb.predict(test_x)
    ac = accuracy_score(test_y, prediction)
    gnb_accuracy.append(ac)
gnb_accuracy

[0.37901644290239855, 0.38293860310755767, 0.37765877206215115]

In [151]:
kf = KFold(n_splits = 3)
bn_accuracy = []
for train_id, test_id in kf.split(train1):
    train = train1.iloc[train_id]
    test = train1.iloc[test_id]
    train_y = train['cuisine']
    train_x = train.drop('cuisine', axis = 1)
    test_y = test['cuisine']
    test_x = test.drop('cuisine', axis = 1)
    
    bn = BernoulliNB()
    bn.fit(train_x, train_y)
    prediction = bn.predict(test_x)
    ac = accuracy_score(test_y, prediction)
    bn_accuracy.append(ac)
bn_accuracy

[0.68419067732689698, 0.67951425554382261, 0.68690601900739179]

(e) For Gaussian prior and Bernoulli prior, which performs better in terms of cross-validation accuracy? Why? Please give specific arguments.


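Answer: The Bernoulli Naive Bayes classifier performs better (about 68% accuracy per fold, versus about 38% for Gaussian). <br>
Reasons: 
1. Gaussian NB models each feature within a class as normally distributed, but our features are binary indicators, so a Bernoulli distribution fits them directly. <br>
2. Gaussian NB treats features as continuous, ordered values, whereas ours are discrete presence/absence flags with no meaningful scale. Most ingredients are also absent from most dishes, so the per-class variances are close to zero, which tends to make the Gaussian likelihoods unreliable. <br>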
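
To make the first reason concrete, consider a single ingredient that appears in 1% of the dishes of some cuisine (an assumed figure chosen for illustration). The sketch compares the likelihood each model assigns to seeing that ingredient in a dish. It ignores the variance smoothing that Gaussian NB implementations typically apply, so the numbers are indicative only.

```python
import math

p = 0.01  # assumed: ingredient appears in 1% of a cuisine's dishes

# Bernoulli model: probability of observing the ingredient (x = 1).
bernoulli_likelihood = p

# Gaussian model fitted to the same 0/1 data: mean p, variance p * (1 - p).
mean, var = p, p * (1 - p)
gaussian_density = math.exp(-((1 - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

print(f"Bernoulli P(x=1): {bernoulli_likelihood:.3g}")
print(f"Gaussian density at x=1: {gaussian_density:.3g}")
```

Under the Gaussian fit, x = 1 sits about ten standard deviations from the mean, so a rare but perfectly ordinary ingredient is scored as a near-impossible event and can dominate the class score.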

(f) Use a logistic regression model to perform 3-fold cross-validation on the training set and report your average classification accuracy.

In [154]:
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits = 3)
lr_accuracy = []
for train_id, test_id in kf.split(train1):
    train = train1.iloc[train_id]
    test = train1.iloc[test_id]
    train_y = train['cuisine']
    train_x = train.drop('cuisine', axis = 1)
    test_y = test['cuisine']
    test_x = test.drop('cuisine', axis = 1)
    
    lr = LogisticRegression()
    lr.fit(train_x, train_y)
    prediction = lr.predict(test_x)
    ac = accuracy_score(test_y, prediction)
    lr_accuracy.append(ac)
lr_accuracy

[0.77590888520138779, 0.77213757731181176, 0.7786242268818826]
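
The questions ask for the average accuracy, while the cells above print the per-fold scores. A short sketch that averages the three reported lists (values copied from the outputs above; the dictionary name `fold_scores` is illustrative):

```python
from statistics import mean

# Per-fold accuracies copied from the cross-validation outputs above.
fold_scores = {
    "Gaussian NB": [0.37901644290239855, 0.38293860310755767, 0.37765877206215115],
    "Bernoulli NB": [0.68419067732689698, 0.67951425554382261, 0.68690601900739179],
    "Logistic regression": [0.77590888520138779, 0.77213757731181176, 0.7786242268818826],
}

for model, scores in fold_scores.items():
    print(f"{model}: {mean(scores):.4f}")
```

This gives roughly 0.380 for Gaussian NB, 0.684 for Bernoulli NB and 0.776 for logistic regression, so logistic regression is the best of the three on this split.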

(g) Train your best-performing classifier on all of the training data and generate labels for the test set. Submit your results to Kaggle and report the accuracy.


In [162]:
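# Reload the raw data: the cross-validation loops reused the names train and test for fold subsets.
train = pd.read_json('cooking/train.json')
test = pd.read_json('cooking/test.json')
# matrix picked up a 'cuisine' column through train1, so drop it to recover the features.
train_x = matrix.drop('cuisine', axis = 1)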

In [157]:
test_m = list(map(feature, test['ingredients']))

In [158]:
test_x = pd.DataFrame(np.array(test_m))

In [168]:
lr = LogisticRegression()
lr.fit(train_x, train['cuisine'])
prediction = lr.predict(test_x)

In [179]:
type(prediction)

numpy.ndarray

In [180]:
result = pd.DataFrame({'id': test['id'], 'cuisine': prediction})

In [182]:
result.to_csv('cooking_prediction.csv', index = False)

Kaggle submission accuracy: 0.783