### Example: Recipe Database

In [27]:
import pandas as pd
import numpy as np

In [9]:
# unzip the file 
!gunzip recipeitems-latest.json.gz

* The database is in JSON format, so we will try pd.read_json to read it:

In [10]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


* We get a ValueError mentioning that there is “trailing data.” It seems that it’s due to using a file in which each line is itself a valid JSON, but the full file is not. Let’s check if this interpretation is true:

In [13]:
with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(line).shape

(2, 12)

* Yes, apparently each line is a valid JSON, so we’ll need to string them together. One
way we can do this is to actually construct a string representation containing all these
JSON entries, and then load the whole thing with pd.read_json:

In [17]:
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r', encoding="utf8") as f:
    
    # Extract each line
    data = (line.strip() for line in f)
    
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))

# read the result as a JSON
recipes = pd.read_json(data_json)

In [19]:
recipes.shape

(173278, 17)

In [20]:
recipes.head()

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,creator,recipeCategory,dateModified,recipeInstructions
0,{'$oid': '5160756b96cc62079cc2db15'},Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276011104},PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,
1,{'$oid': '5160756d96cc62079cc2db16'},Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276013902},PT20M,thepioneerwoman,12,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,
2,{'$oid': '5160756f96cc6207a37ff777'},Morrocan Carrot and Chickpea Salad,Dressing:\n1 tablespoon cumin seeds\n1/3 cup /...,http://www.101cookbooks.com/archives/moroccan-...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276015332},,101cookbooks,,2013-01-07,PT15M,A beauty of a carrot salad - tricked out with ...,,,,,
3,{'$oid': '5160757096cc62079cc2db17'},Mixed Berry Shortcake,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/mix...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276016700},PT15M,thepioneerwoman,8,2013-03-18,PT15M,It's Monday! It's a brand new week! The birds ...,,,,,
4,{'$oid': '5160757496cc6207a37ff778'},Pomegranate Yogurt Bowl,For each bowl: \na big dollop of Greek yogurt\...,http://www.101cookbooks.com/archives/pomegrana...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276020318},,101cookbooks,Serves 1.,2013-01-20,PT5M,A simple breakfast bowl made with Greek yogurt...,,,,,


* Let’s start by taking a closer look at the ingredients:

In [24]:
recipes.ingredients.str.len().describe()

count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

* let’s see which recipe has the longest ingredient list:

In [35]:
recipes.name[np.argmax(recipes.ingredients.str.len())]

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'

* let’s see how many of the recipes are for breakfast food:

In [39]:
recipes.description.str.contains('[Bb]reakfast').sum()

3524

* Or how many of the recipes list cinnamon as an ingredient:

In [41]:
recipes.ingredients.str.contains('[Cc]innamon').sum()

10526

* We could even look to see whether any recipes misspell the ingredient as “cinamon”:

In [43]:
recipes.ingredients.str.contains('[Cc]inamon').sum()

11

#### A simple recipe recommender

we’ll start with a list of common ingredients, and simply search
to see whether they are in each recipe’s ingredient list. For simplicity, let’s just stick
with herbs and spices for the time being:

In [45]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
                'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

* We can then build a Boolean DataFrame consisting of True and False values, indicating
whether this ingredient appears in the list:

In [49]:
import re

spice_df = pd.DataFrame(
            dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
            for spice in spice_list))

spice_df.head()

Unnamed: 0,salt,pepper,oregano,sage,parsley,rosemary,tarragon,thyme,paprika,cumin
0,False,False,False,True,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,True,True,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


* Now, as an example, let’s say we’d like to find a recipe that uses parsley, paprika, and
tarragon. We can compute this very quickly using the `query()` method of Data
Frames,

In [51]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

10

* We find only 10 recipes with this combination; let’s use the index returned by this
selection to discover the names of the recipes that have this combination:

In [55]:
recipes.name[selection.index]

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object

Source: Python Data Science Handbook - Jake VanderPlas