**Objective:**
This project is about predicting which cuisine a recipe belongs to, given a list of its ingredients. The dataset consists of recipes, each labelled with a cuisine and a list of ingredients.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

**Data Loading and Preprocessing:**
First we have to load the training dataset. It has a rich variety of cuisines and ingredients. Let's have a look at the data.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
df_train = pd.read_json('../input/train.json')
df_train.head()

These are the first 5 recipes in the dataset. It seems each cuisine uses different types of ingredients.

**Let's find out the top cuisines:**
Here, I am going to use `value_counts` to find the top cuisines. `value_counts` tells us how many recipes we have for each cuisine.

In [None]:
plt.style.use('ggplot')
df_train['cuisine'].value_counts().plot(kind='bar')

From the graph, we see that Italian is the most common cuisine, followed by Mexican and Southern US, with fewer recipes for each of the remaining cuisines.
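
As a small illustration of what `value_counts` returns, here is a sketch on a made-up four-row frame (the `toy` data is hypothetical and not taken from the dataset). Passing `normalize=True` gives shares instead of raw counts, which can be easier to compare.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate value_counts
toy = pd.DataFrame({'cuisine': ['italian', 'mexican', 'italian', 'indian']})

# Number of recipes per cuisine, most frequent first
counts = toy['cuisine'].value_counts()

# Same information as fractions that sum to 1
shares = toy['cuisine'].value_counts(normalize=True)

print(counts)
print(shares)
```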

**To get a little insight into the data, we can look at a couple of recipes. In particular, we can count the most frequent ingredients for each cuisine. To do that, I am going to use Python `Counter` objects.**

In [None]:
from collections import Counter
counters = {}
for cuisine in df_train['cuisine'].unique():
    counters[cuisine] = Counter()
    indices = (df_train['cuisine'] == cuisine)
    for ingredients in df_train[indices]['ingredients']:
        counters[cuisine].update(ingredients)

In [None]:
counters['italian'].most_common(10)

Since Italian is the most common cuisine, I looked at its most common ingredients. From the result, salt, olive oil, garlic cloves, grated parmesan cheese, garlic, ground black pepper, extra-virgin olive oil, onions, water and butter are the ten most common ingredients in Italian recipes.
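
To make the counting step concrete, here is a minimal sketch of how a `Counter` accumulates ingredients across recipes. The three recipes are invented for illustration only.

```python
from collections import Counter

# Hypothetical recipes, each a list of ingredient strings
recipes = [['salt', 'olive oil', 'garlic'],
           ['salt', 'butter'],
           ['olive oil', 'salt']]

counter = Counter()
for ingredients in recipes:
    # update() adds one count for every element of the list
    counter.update(ingredients)

# The two most frequent ingredients with their counts
top2 = counter.most_common(2)
print(top2)  # [('salt', 3), ('olive oil', 2)]
```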

**Let's look at the most common ingredients for every cuisine:**

In [None]:
top10 = pd.DataFrame([[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
            index=[cuisine for cuisine in counters],
            columns=['top{}'.format(i) for i in range(1, 11)])
top10

These are the most common ingredients used in different cuisines.

**Let's see which of the top 10 ingredients are highly specific to a certain cuisine:**
One way to do this is to count the number of recipes in a given cuisine that contain an ingredient and divide by the total number of recipes in that cuisine.
To do this, I first created a new column (`every_ingredients`) in our dataframe by concatenating each recipe's ingredients into a single string.


In [None]:
df_train['every_ingredients'] = df_train['ingredients'].map(";".join)
df_train.head()
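
The per-cuisine share of recipes containing an ingredient can be sketched as follows. This is only an illustration on a hypothetical four-recipe frame shaped like `df_train`; it uses a grouped mean of a boolean column, which is one of several ways to get the same ratio.

```python
import pandas as pd

# Hypothetical toy data in the same shape as df_train
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'mexican', 'mexican'],
    'ingredients': [['salt', 'olive oil'], ['garlic'],
                    ['salt', 'avocado'], ['avocado']],
})
toy['every_ingredients'] = toy['ingredients'].map(';'.join)

# True where the joined string mentions the ingredient
has_avocado = toy['every_ingredients'].str.contains('avocado', regex=False)

# The mean of a boolean column per group is the share of that
# cuisine's recipes that contain the ingredient
share = has_avocado.groupby(toy['cuisine']).mean()
print(share)
```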

**We can now check for the presence of an ingredient in a recipe:**
Let's take pepper as an example. This can be used to group our recipes by the presence of that ingredient.


In [None]:
df_train['every_ingredients'].str.contains('pepper')

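The result is a boolean Series marking which recipes mention "pepper". Note that this is a substring match, so ingredients such as "bell pepper" or "ground black pepper" also count. It can be used to analyse and group the recipes by the presence of that ingredient.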
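
Because `str.contains` matches substrings of the joined string, it answers a slightly different question than "is this exact ingredient in the list". A small sketch of the difference, on hypothetical data:

```python
import pandas as pd

# Hypothetical recipes to contrast substring and exact matching
toy = pd.DataFrame({'ingredients': [['black pepper', 'salt'],
                                    ['bell pepper'],
                                    ['pepper'],
                                    ['salt']]})
toy['every_ingredients'] = toy['ingredients'].map(';'.join)

# Substring match on the joined string: any ingredient containing "pepper"
substring_match = toy['every_ingredients'].str.contains('pepper', regex=False)

# Exact match on the original list: only the ingredient named exactly "pepper"
exact_match = toy['ingredients'].apply(lambda items: 'pepper' in items)

print(substring_match.tolist())  # [True, True, True, False]
print(exact_match.tolist())      # [False, False, True, False]
```

Which of the two is appropriate depends on whether variants of an ingredient should be grouped together.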

**Let's plot a graph for the ingredient "pepper" per cuisine.**

In [None]:
indices = df_train['every_ingredients'].str.contains('pepper')
df_train[indices]['cuisine'].value_counts().plot(kind='bar',
                                                 title='pepper as found per cuisine')

From the result, pepper is used mostly in Italian, Mexican and Southern US cuisines.

**We can do this sort of plot for all the ingredients. First let's determine the unique ingredients:**

In [None]:
import numpy as np
unique = np.unique(top10.values.ravel())
unique

**Let's plot for all the ingredients as per the cuisine:**

In [None]:
fig, axes = plt.subplots(8, 8, figsize=(20, 20))
for ingredient, ax_index in zip(unique, range(64)):
    indices = df_train['every_ingredients'].str.contains(ingredient, regex=False)
    relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
    relative_freq.plot(kind='bar', ax=axes.ravel()[ax_index], fontsize=7, title=ingredient)

The figure highlights the ingredients that are highly specific to one cuisine or group of cuisines. They are listed below:
1. soy sauce (Asian cuisines)
2. sake (Japanese)
3. sesame oil (Asian cuisines)
4. feta cheese crumbs (Greek)
5. garam masala (Indian)
6. ground ginger (Moroccan)
7. avocado (Mexican)


**Training the Models:**
We are going to train three different models on the data:
1. Logistic Regression
2. Decision Tree classifier
3. KNeighborsClassifier

We are going to use scikit-learn to perform classification. First, we need to encode our features as a matrix using `CountVectorizer`. It builds a matrix with one row per recipe and one column per token, holding a non-zero entry when the token is present in the recipe and 0 otherwise.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer()
X = cv.fit_transform(df_train['every_ingredients'].values)
X.shape

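The resulting matrix has 39,774 rows, one per recipe in the training dataset, and 3,010 columns. Because `CountVectorizer` tokenizes on word boundaries by default, these columns are individual words (for example "olive" and "oil") rather than whole ingredient names.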
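
With default settings, `CountVectorizer` splits text on word boundaries, so a multi-word ingredient contributes several columns. The sketch below, on two hypothetical strings, contrasts that with an alternative that keeps each ingredient whole by splitting on the `;` separator and records presence as 0/1 via `binary=True`. The helper `split_ingredients` is an illustrative name, and this is not the setup used in the rest of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical joined-ingredient strings, like the every_ingredients column
docs = ['olive oil;salt', 'salt;garlic cloves;salt']

def split_ingredients(text):
    # Illustrative helper: one token per ingredient
    return text.split(';')

# Default behaviour: one column per word
default_cv = CountVectorizer()
X_default = default_cv.fit_transform(docs)

# Alternative: one column per whole ingredient, entries are 0 or 1
whole_cv = CountVectorizer(tokenizer=split_ingredients,
                           token_pattern=None, binary=True)
X_whole = whole_cv.fit_transform(docs)

print(sorted(default_cv.vocabulary_))  # ['cloves', 'garlic', 'oil', 'olive', 'salt']
print(sorted(whole_cv.vocabulary_))    # ['garlic cloves', 'olive oil', 'salt']
print(X_whole.toarray())
```

Whether whole ingredients or single words work better as features is an empirical question; word-level tokens let "olive oil" and "extra-virgin olive oil" share columns, while whole ingredients keep them distinct.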

In [None]:
print(list(cv.vocabulary_.keys())[:100])

Each token in the vocabulary is assigned a column index, and the entry in that column is non-zero when the token appears in the recipe and 0 otherwise.

We have our feature matrix, but we still need to encode the labels that represent the cuisine of each recipe.

In [None]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df_train.cuisine)
y[:100]

In [None]:
# We can check the result by inspecting the encoder's classes
enc.classes_
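
For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, and `inverse_transform` maps codes back to names. A minimal sketch on hypothetical labels (`enc_demo` is a separate encoder so the notebook's `enc` is left untouched):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical cuisine labels
labels = ['mexican', 'italian', 'indian', 'italian']

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(labels)

print(list(enc_demo.classes_))  # ['indian', 'italian', 'mexican']
print(codes.tolist())           # [2, 1, 0, 1]

# Map integer codes back to cuisine names
decoded = list(enc_demo.inverse_transform([0, 2]))
print(decoded)                  # ['indian', 'mexican']
```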

**Let's train a logistic regression on the dataset:**
We'll split the dataset so that we can also test our classifier on held-out data.

In [None]:
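# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hold out 20% of the recipes for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
logistic = LogisticRegression()
logistic.fit(X_train, y_train)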

In [None]:
logistic.score(X_test, y_test)

Its performance is quite good, with an accuracy of 78%.
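
The accuracy above depends on which rows happen to land in the test set, because `train_test_split` shuffles randomly. A hedged sketch of how the split could be made reproducible and class-balanced, using `random_state` and `stratify` on synthetic data (the arrays `X_demo` and `y_demo` are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 15 samples of class 0 and 5 of class 1
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify keeps class proportions
# the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(len(y_tr), len(y_te))  # 16 4
print(np.bincount(y_te))     # [3 1]
```

With 20 cuisines of very different sizes, stratifying would keep the rarer cuisines represented in the test set.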

**Let's train a Decision Tree Classifier on the dataset:**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

In [None]:
tree.score(X_test, y_test)

Its performance is weaker, with an accuracy of 63%.

**Let's train a KNeighborsClassifier on the dataset:**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neighbor=KNeighborsClassifier()
neighbor.fit(X_train,y_train)

In [None]:
neighbor.score(X_test, y_test)

Its accuracy is also 63%, the same as the decision tree and below logistic regression.
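
A single train/test split gives one accuracy number per model, which can vary from split to split. One common alternative is k-fold cross-validation. The sketch below compares the same three model types on synthetic data generated with `make_classification`; the data, the `max_iter` and `random_state` settings, and the resulting scores are illustrative assumptions and say nothing about the recipe dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem, only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'k-nearest neighbors': KNeighborsClassifier(),
}

mean_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[name] = scores.mean()
    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))
```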

**Continuing with our best model (Logistic Regression), we are going to look at the confusion matrix, which shows how well the predicted labels agree with the true labels.**


In [None]:
#Inspecting the classification results using a confusion matrix
from sklearn.metrics import confusion_matrix

plt.figure(figsize=(10, 10))

cm = confusion_matrix(y_test, logistic.predict(X_test))
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.imshow(cm_normalized, interpolation='nearest')
plt.title("confusion matrix")
plt.colorbar(shrink=0.3)
# Use the encoder's class order so tick labels match the matrix rows and columns
cuisines = enc.classes_
tick_marks = np.arange(len(cuisines))
plt.xticks(tick_marks, cuisines, rotation=90)
plt.yticks(tick_marks, cuisines)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')

From the matrix, the best-predicted cuisines are Moroccan, Thai, Greek and Indian.
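
To see what the row normalization in the plot does, here is a worked example on a small, made-up 3x3 confusion matrix. After dividing each row by its total, the diagonal holds the fraction of each true class that was predicted correctly (the per-class recall).

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm_demo = np.array([[8, 2, 0],
                    [1, 6, 3],
                    [0, 0, 5]])

# Divide every row by the number of true samples in that class
row_totals = cm_demo.sum(axis=1)[:, np.newaxis]
cm_demo_normalized = cm_demo / row_totals

# Diagonal of the normalized matrix = per-class recall
per_class_recall = np.diag(cm_demo_normalized)
print(per_class_recall)  # [0.8 0.6 1. ]
```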

In [None]:
from sklearn.metrics import classification_report
y_pred = logistic.predict(X_test)
# Class names in encoder order, matching the label codes in y_test
print(classification_report(y_test, y_pred, target_names=enc.classes_))

The report shows precision, recall and F1 score for each cuisine. From the result, Moroccan, Thai, Vietnamese, Spanish and Korean have the highest scores.
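
The three metrics in the report can be derived directly from a confusion matrix. A minimal sketch on a made-up 3x3 matrix (rows are true classes, columns are predictions), using the standard definitions:

```python
import numpy as np

# Hypothetical confusion matrix
cm_demo = np.array([[8, 2, 0],
                    [1, 6, 3],
                    [0, 0, 5]])

true_positives = np.diag(cm_demo)

# Precision: of everything predicted as a class, how much was correct
precision = true_positives / cm_demo.sum(axis=0)

# Recall: of everything that truly is a class, how much was found
recall = true_positives / cm_demo.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision.round(3))  # [0.889 0.75  0.625]
print(recall.round(3))     # [0.8 0.6 1. ]
print(f1.round(3))
```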