<p style="font-size:32px;text-align:center"> <b>Cuisine Prediction</b> </p>

<h2>Description</h2>

Source: https://www.kaggle.com/c/whats-cooking

## Dataset
The data is stored in JSON format.
Each document in train.json has an id, a cuisine (the label), and a list of ingredients.
Each document in test.json has only an id and a list of ingredients.

An example of a recipe node in train.json:
```
{
    "id": 24717,
    "cuisine": "indian",
    "ingredients": [
        "tumeric",
        "vegetable stock",
        "tomatoes",
        "garam masala",
        "naan",
        "red lentils",
        "red chili peppers",
        "onions",
        "spinach",
        "sweet potatoes"
    ]
}
```
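The difference between the two files can be checked programmatically. The sketch below parses two shortened, hand-written records (the variable names and the trimmed ingredient lists are illustrative assumptions, not the real files) and shows that the label is the only key missing from a test record.

```python
import json

# Shortened stand-ins for one train.json record and one test.json record.
train_doc = json.loads(
    '{"id": 24717, "cuisine": "indian", "ingredients": ["tumeric", "naan"]}'
)
test_doc = json.loads(
    '{"id": 18009, "ingredients": ["baking powder", "eggs"]}'
)

# Keys present in the training record but absent from the test record.
label_only_keys = set(train_doc) - set(test_doc)
print(label_only_keys)
```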

In [2]:
import numpy as np 
import pandas as pd
import json
import pprint

with open("train.json") as data_file:
    data = json.load(data_file)
    
print("No. of recipes: %d"%len(data))
print("Example:")
pprint.pprint(data[0])

No. of recipes: 39774
Example:
{'cuisine': 'greek',
 'id': 10259,
 'ingredients': ['romaine lettuce',
                 'black olives',
                 'grape tomatoes',
                 'garlic',
                 'pepper',
                 'purple onion',
                 'seasoning',
                 'garbanzo beans',
                 'feta cheese crumbles']}


## Cleaning the ingredient strings
<ol>
    <li>Replace every character other than a-z and A-Z with a space, then collapse repeated whitespace</li>
    <li>Convert all letters to lower case</li>
</ol>

In [3]:
import re

def clean_string(s):
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", s)
    words = letters_only.lower().split()
    return " ".join(words)

clean_string("Garbanzo beans25")

'garbanzo beans'
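One side effect of the `[^a-zA-Z]` pattern is that accented letters are also replaced by spaces, so a word containing them is split into fragments. If that matters for the data at hand, a possible variant is to fold accents to plain ASCII before applying the same regex. The function name `clean_string_ascii` is hypothetical and this is only a sketch, not part of the original pipeline.

```python
import re
import unicodedata

def clean_string_ascii(s):
    # Hypothetical variant of clean_string: decompose accented letters,
    # drop the combining marks, then apply the same letters-only cleaning.
    decomposed = unicodedata.normalize("NFKD", s)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub("[^a-zA-Z]", " ", ascii_only)
    return " ".join(letters_only.lower().split())

print(clean_string_ascii("Crème Fraîche 2%"))
```

Without the folding step, the same input would come out as `cr me fra che`, which adds meaningless tokens to the vocabulary.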

In [4]:
# Join each recipe's ingredient list into one cleaned string and collect its label.
ingredients = []
cuisine = []
for doc in data:
    cuisine.append(doc["cuisine"])
    ingredients.append(clean_string(" ".join(doc["ingredients"])))

cuisine = np.array(cuisine)
pprint.pprint(ingredients[0:5])
print('-'*50)
print("Cuisines: ")
print(cuisine)

['romaine lettuce black olives grape tomatoes garlic pepper purple onion '
 'seasoning garbanzo beans feta cheese crumbles',
 'plain flour ground pepper salt tomatoes ground black pepper thyme eggs green '
 'tomatoes yellow corn meal milk vegetable oil',
 'eggs pepper salt mayonaise cooking oil green chilies grilled chicken breasts '
 'garlic powder yellow onion soy sauce butter chicken livers',
 'water vegetable oil wheat salt',
 'black pepper shallots cornflour cayenne pepper onions garlic paste milk '
 'butter salt lemon juice water chili powder passata oil ground cumin boneless '
 'chicken skinless thigh garam masala double cream natural yogurt bay leaf']
--------------------------------------------------
Cuisines: 
['greek' 'southern_us' 'filipino' ... 'irish' 'chinese' 'mexican']


## Bag of Words
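The features are a bag-of-words representation: each distinct word in the cleaned ingredient strings becomes a column, and each entry counts how often that word occurs in a recipe.
With max_features = 5000 the vocabulary is capped at 5,000 words; this corpus yields only 3,002 columns, so the cap is not reached.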

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word", max_features = 5000) 
data_features = vectorizer.fit_transform(ingredients)
data_features = data_features.toarray()
print(data_features.shape)

(39774, 3002)
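To make the representation concrete, here is a minimal sketch on two made-up ingredient strings (the names `toy_recipes`, `toy_vectorizer`, and `toy_counts` are illustrative assumptions). It prints the learned vocabulary in column order and the resulting count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical cleaned ingredient strings.
toy_recipes = ["garlic olive oil tomatoes", "soy sauce garlic garlic"]

toy_vectorizer = CountVectorizer(analyzer="word")
toy_counts = toy_vectorizer.fit_transform(toy_recipes)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one recipe and each column one word, so the repeated `garlic` in the second string shows up as a count of 2.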


## Training
1. Split the labelled data into a training set (67%) and a held-out set (33%).
2. Train a Random Forest classifier on the training set.
3. Measure the classifier's accuracy on the held-out set.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_features, cuisine, test_size=0.33, random_state=42)
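The split above is purely random, so rare cuisines may end up under- or over-represented in the held-out set. One option, shown here as a sketch on made-up labels rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 12 "italian" rows and 4 "thai" rows.
X_toy = np.arange(32).reshape(16, 2)
y_toy = np.array(["italian"] * 12 + ["thai"] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# The 3:1 class ratio is preserved in the 4-row held-out set.
print(sorted(y_te.tolist()))
```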

In [9]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 500, verbose = 1)
clf = clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
test_score = accuracy_score(y_test, y_pred)

print("-"*70)
print("Test accuracy: %f"%test_score)

[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  5.1min finished


----------------------------------------------------------------------
Test accuracy: 0.754381


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    9.9s finished
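The log shows the forest being trained with `n_jobs=1`, and no `random_state` is set, so the accuracy can vary slightly between runs. The sketch below, on a small made-up count matrix, illustrates two standard `RandomForestClassifier` parameters that address this: `n_jobs=-1` to use all available cores and `random_state` to make results repeatable. The helper `fit_forest` and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy count matrix and labels, not the recipe data.
X_toy = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1],
                  [0, 1, 2], [3, 0, 0], [0, 2, 2]])
y_toy = np.array(["italian", "italian", "thai", "thai", "italian", "thai"])

def fit_forest(seed):
    # n_jobs=-1 trains trees in parallel; random_state fixes the randomness.
    forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=seed)
    return forest.fit(X_toy, y_toy)

proba_a = fit_forest(0).predict_proba(X_toy)
proba_b = fit_forest(0).predict_proba(X_toy)

# Two forests built with the same seed give identical probabilities.
same = np.allclose(proba_a, proba_b)
print(same)
```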


## Prediction

In [11]:
with open("test.json") as data_file:
    data = json.load(data_file)
    
print("No. of documents: %d"%len(data))
print("Example:")
pprint.pprint(data[0])

No. of documents: 9944
Example:
{'id': 18009,
 'ingredients': ['baking powder',
                 'eggs',
                 'all-purpose flour',
                 'raisins',
                 'milk',
                 'white sugar']}


In [12]:
# Build one cleaned ingredient string per test recipe, as was done for the training data.
test_ingredients = []
for doc in data:
    test_ingredients.append(clean_string(" ".join(doc['ingredients'])))
pprint.pprint(test_ingredients[0:5])

['baking powder eggs all purpose flour raisins milk white sugar',
 'sugar egg yolks corn starch cream of tartar bananas vanilla wafers milk '
 'vanilla extract toasted pecans egg whites light rum',
 'sausage links fennel bulb fronds olive oil cuban peppers onions',
 'meat cuts file powder smoked sausage okra shrimp andouille sausage water '
 'paprika hot sauce garlic cloves browning lump crab meat vegetable oil all '
 'purpose flour freshly ground pepper flat leaf parsley boneless chicken '
 'skinless thigh dried thyme white rice yellow onion ham',
 'ground black pepper salt sausage casings leeks parmigiano reggiano cheese '
 'cornmeal water extra virgin olive oil']


In [13]:
test_data_features = vectorizer.transform(test_ingredients)
test_data_features = test_data_features.toarray()
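Note that the test strings go through `transform`, not `fit_transform`: the vocabulary learned from the training data is reused, so the test matrix has the same columns, and words never seen during fitting are ignored. A minimal sketch with made-up strings (all names here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["garlic olive oil", "soy sauce garlic"]
test_docs = ["garlic saffron"]  # "saffron" never appears in train_docs

vec = CountVectorizer(analyzer="word")
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vec.transform(test_docs)        # reuses it unchanged

print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Both matrices have five columns, and `saffron` contributes nothing to the test row because it is not in the fitted vocabulary.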

In [14]:
y_predicted = clf.predict(test_data_features)
print(y_predicted)

['southern_us' 'southern_us' 'italian' ... 'italian' 'southern_us'
 'mexican']


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    9.0s finished


In [15]:
# Pair each test recipe id with its predicted cuisine (same order as data).
id_array = np.array([doc['id'] for doc in data])
result_array = np.array(y_predicted)

print(id_array)
print(result_array)

[18009 28583 41580 ... 22339 42525  1443]
['southern_us' 'southern_us' 'italian' ... 'italian' 'southern_us'
 'mexican']


In [16]:
submission = pd.DataFrame({
        "id": id_array,
        "cuisine": result_array
    })
submission.head()

        cuisine     id
0   southern_us  18009
1   southern_us  28583
2       italian  41580
3  cajun_creole  29752
4       italian  35687


In [17]:
submission.to_csv('submission.csv', index=False)
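The preview above lists `cuisine` before `id`. If a fixed column order is wanted in the CSV (assuming here that `id` should come first), it can be requested explicitly with the `columns` argument of `pd.DataFrame`. The sketch below uses two made-up rows and writes to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for id_array and result_array.
toy_ids = [18009, 28583]
toy_predictions = ["southern_us", "italian"]

toy_submission = pd.DataFrame(
    {"id": toy_ids, "cuisine": toy_predictions},
    columns=["id", "cuisine"],  # explicit column order
)

# Write to an in-memory buffer to check the exact CSV text.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```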