# Exploratory Data Analysis
### Jeff Ho

This notebook walks through ingesting, cleaning, and exploring the data for this challenge.

**Question:**
Can we design and execute a method to predict the cuisine of a recipe given only its ingredients?

**Deliverables:**
Produce guidelines that a team can use to hand-label cuisines based on ingredients.

**Why?**
To improve the product by building a feature for the food publication that enables users to query by cuisine.

In [2]:
import json
import pandas as pd

In [3]:
# load data
df = pd.read_json('recipies.json')
df.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


### Basic questions about the data

In [7]:
# How many recipes are there?
print(len(df['cuisine']))

39774


In [11]:
# Why are there so many recipes? The prompt said about 10,000. Are some rows empty?
df.sample(50)  # Random samples (re-run several times) show no obviously empty rows.

Unnamed: 0,id,cuisine,ingredients
510,32316,mexican,"[blanched almond flour, sea salt, tapioca flou..."
28479,41417,southern_us,"[simple syrup, branca menta, mint, bourbon whi..."
19794,3283,cajun_creole,"[tilapia fillets, unsalted butter, celery salt..."
32691,49022,italian,"[pinenuts, baby spinach, parmesan cheese, chee..."
22451,36209,southern_us,"[garlic powder, brown sugar, bacon, butter, so..."
5072,43256,mexican,"[green chile, fresh cilantro, jalapeno chilies..."
36031,31782,italian,"[fresh basil, dried thyme, diced tomatoes, sal..."
12576,49353,thai,"[whole wheat buns, pepper, shredded carrots, s..."
36378,10109,filipino,"[fresh tomatoes, olive oil, salt, bread crumbs..."
10717,36590,southern_us,"[green bell pepper, salt, shell-on shrimp, oni..."

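Eyeballing random samples can miss rare problems, so a direct count of missing values and empty ingredient lists may be a useful complement. The sketch below runs on a tiny stand-in frame with made-up rows (`sample` is a hypothetical name); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Tiny stand-in frame with hypothetical rows; in the notebook, use the loaded `df`.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "cuisine": ["greek", None, "indian", "thai"],
    "ingredients": [["feta", "olives"], ["salt"], [], None],
})

# Number of missing values in each column.
null_counts = sample.isna().sum()

# Flag rows whose ingredient list is missing or contains zero items.
empty_ingredients = sample["ingredients"].apply(
    lambda items: not isinstance(items, list) or len(items) == 0
)

print(null_counts)
print("rows with no ingredients:", int(empty_ingredients.sum()))
```

If both counts come back as zero on the real data, the row count is not being inflated by empty records.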

In [9]:
# Are there duplicate ids?
df['id'].value_counts()

# No: every id appears exactly once. The dataset may simply be larger than the prompt suggested.

2047     1
11663    1
44447    1
42398    1
48541    1
        ..
29339    1
31386    1
25241    1
27288    1
0        1
Name: id, Length: 39774, dtype: int64
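Scanning the `value_counts()` output works, but the duplicate check can also be reduced to a single boolean and a count. A minimal sketch on a hypothetical series of ids with one deliberate repeat; in the notebook the same attributes would be read from `df['id']`.

```python
import pandas as pd

# Hypothetical ids with one deliberate repeat, standing in for df['id'].
ids = pd.Series([10259, 25693, 20130, 25693], name="id")

# True only when no id occurs more than once.
all_unique = ids.is_unique

# Number of rows that repeat an id already seen earlier in the series.
n_repeats = int(ids.duplicated().sum())

print(all_unique, n_repeats)
```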

In [10]:
# How many cuisines?
df['cuisine'].value_counts()  # 20 cuisines; mostly Italian, Mexican, and Southern US.

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64
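To put the class imbalance in numbers, the counts above can be converted to shares of the whole dataset. The sketch below re-enters the counts from the output by hand so it runs on its own; in the notebook, `df['cuisine'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "italian": 7838, "mexican": 6438, "southern_us": 4320, "indian": 3003,
    "chinese": 2673, "french": 2646, "cajun_creole": 1546, "thai": 1539,
    "japanese": 1423, "greek": 1175, "spanish": 989, "korean": 830,
    "vietnamese": 825, "moroccan": 821, "british": 804, "filipino": 755,
    "irish": 667, "jamaican": 526, "russian": 489, "brazilian": 467,
})

# Fraction of all recipes belonging to each cuisine.
shares = (counts / counts.sum()).round(3)

# Ratio between the largest and smallest class.
imbalance = counts.max() / counts.min()

print(shares.head(3))
print("top-3 share:", round(counts.head(3).sum() / counts.sum(), 3))
print("largest / smallest:", round(imbalance, 1))
```

The three largest cuisines account for just under half of all recipes, which is worth keeping in mind when evaluating any classifier or labeling guideline built on this data.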

In [4]:
# How many unique ingredients total? How many ingredients per cuisine?

# first need to unpack the ingredients column. get new rows for each ingredient.

# then count ingredients by recipe id.

# then count number of ingredients by cuisine type.
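The three steps in the comments above could be sketched with `DataFrame.explode`, which turns each list element into its own row. This is one possible approach, not necessarily the one the notebook goes on to use; `recipes` and `long_df` are hypothetical names and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `df`, with the same three columns.
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "cuisine": ["greek", "indian", "indian"],
    "ingredients": [
        ["feta", "olives", "salt"],
        ["water", "salt"],
        ["salt", "cumin", "water", "rice"],
    ],
})

# Step 1: unpack the lists so there is one row per (recipe, ingredient) pair.
long_df = recipes.explode("ingredients").rename(columns={"ingredients": "ingredient"})

# Total number of distinct ingredients across all recipes.
n_unique = long_df["ingredient"].nunique()

# Step 2: number of ingredients in each recipe.
per_recipe = long_df.groupby("id")["ingredient"].count()

# Step 3: number of distinct ingredients used within each cuisine.
unique_per_cuisine = long_df.groupby("cuisine")["ingredient"].nunique()

print(n_unique)
print(per_recipe)
print(unique_per_cuisine)
```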



In [None]:
# How different are the ingredients? E.g., "all-purpose flour" and "flour" are likely the same ingredient, but red onions and yellow onions should stay distinct.
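One conservative way to start on this question is to normalize only case and whitespace, then merge names through an explicit, hand-reviewed alias table, so that genuinely different ingredients are never collapsed automatically. In the sketch below, `ALIASES` and `normalize_ingredient` are hypothetical names, and the alias entries are illustrative assumptions rather than decisions drawn from the data.

```python
import re

# Hypothetical alias table: only pairs a reviewer has judged equivalent go here.
ALIASES = {
    "all-purpose flour": "flour",
    "plain flour": "flour",
}


def normalize_ingredient(name):
    """Lowercase, trim, and collapse whitespace, then apply the alias table."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return ALIASES.get(cleaned, cleaned)


print(normalize_ingredient("All-Purpose  Flour"))
print(normalize_ingredient("Red Onions"))
print(normalize_ingredient("yellow onions"))
```

Because nothing is merged unless it appears in the table, red onions and yellow onions stay separate by default.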



In [None]:
# For each cuisine, what is the histogram of ingredients?
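A per-cuisine ingredient histogram could be computed by exploding the ingredient lists and counting values within each cuisine group. The sketch below uses made-up rows and the hypothetical names `recipes`, `ingredient_counts`, and `top_per_cuisine`; on the real data the counts could then be plotted per cuisine.

```python
import pandas as pd

# Hypothetical stand-in for `df`.
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "cuisine": ["greek", "indian", "indian"],
    "ingredients": [
        ["feta", "olives", "salt"],
        ["water", "salt"],
        ["salt", "cumin"],
    ],
})

# Count how often each ingredient appears within each cuisine.
ingredient_counts = (
    recipes.explode("ingredients")
    .groupby("cuisine")["ingredients"]
    .value_counts()
)

# Keep the two most frequent ingredients for every cuisine.
top_per_cuisine = ingredient_counts.groupby(level=0).head(2)

print(top_per_cuisine)
```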

