## Example: Recipe Database

These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.
Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.

As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:

In [11]:
!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz # this link is deprecated now.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  4 29.3M    4 1414k    0     0  1414k      0  0:00:21 --:--:--  0:00:21 1561k
 26 29.3M   26 7967k    0     0  7967k      0  0:00:03  0:00:01  0:00:02 4320k
 47 29.3M   47 14.0M    0     0  7175k      0  0:00:04  0:00:02  0:00:02 5046k
 66 29.3M   66 19.5M    0     0  6687k      0  0:00:04  0:00:03  0:00:01 5219k
 88 29.3M   88 25.9M    0     0  6651k      0  0:00:04  0:00:04 --:--:-- 5492k
100 29.3M  100 29.3M    0     0  6014k      0  0:00:05  0:00:05 --:--:-- 6391k


In [12]:
!gunzip 20170107-061401-recipeitems.json.gz # invalid command in windows
!mv 20170107-061401-recipeitems.json recipeitems-latest.json

'gunzip' is not recognized as an internal or external command,
operable program or batch file.
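Where `gunzip` is not available, as in the Windows error above, the same step can be done from Python. This is a minimal sketch using only the standard library; the helper name `gunzip_file` is hypothetical, and the file names in the usage comment are the ones from the shell commands above.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path using only the standard library."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream the data in chunks so the whole file never sits in memory
        shutil.copyfileobj(src, dst)

# Usage, assuming the downloaded archive is in the working directory:
# gunzip_file("20170107-061401-recipeitems.json.gz", "recipeitems-latest.json")
```

Unlike the `gunzip` command, this leaves the original `.gz` file in place, and it writes directly to the final file name, so no separate `mv` step is needed.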


The database is in JSON format, so we will try ``pd.read_json`` to read it:

In [31]:
import pandas as pd
import numpy as np

In [32]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


Oops! We get a ``ValueError`` mentioning that there is "trailing data."
Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not.
Let's check if this interpretation is true:

In [19]:
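from io import StringIO

with open('recipeitems-latest.json', encoding="utf8") as f:
    line = f.readline()
# wrap the line in StringIO; newer pandas deprecates passing a literal JSON string
pd.read_json(StringIO(line)).shape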

(2, 12)

Yes, apparently each line is a valid JSON, so we'll need to string them together.
One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:

In [24]:
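from io import StringIO

# read the file line by line and join the records into one JSON array
with open('recipeitems-latest.json', 'r', encoding="utf8") as f:
    # strip the trailing newline from each line
    data = (line.strip() for line in f)
    # join the lines with commas and wrap them in brackets to form a JSON array
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON (StringIO avoids the literal-string deprecation in newer pandas)
recipes = pd.read_json(StringIO(data_json))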
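As an alternative to joining the lines by hand, ``pd.read_json`` accepts a ``lines=True`` argument for files that hold one JSON object per line. The sketch below uses two made-up records rather than the recipe file, so the column names and values are purely illustrative.

```python
from io import StringIO
import pandas as pd

# Two hypothetical records, one JSON object per line
sample = StringIO(
    '{"name": "Toast", "ingredients": "bread\\nbutter"}\n'
    '{"name": "Tea", "ingredients": "water\\ntea leaves"}\n'
)

# lines=True tells pandas to parse each line as a separate record
sample_df = pd.read_json(sample, lines=True)
print(sample_df.shape)  # (2, 2)
```

Assuming the recipe file follows the same layout, ``pd.read_json('recipeitems-latest.json', lines=True)`` should give the same table as the manual approach.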

In [25]:
recipes.shape

(12624, 17)

We see there are about 12,600 recipes and 17 columns.
Let's take a look at one row to see what we have:

In [27]:
recipes.iloc[0]

_id                                {'$oid': '5160756b96cc62079cc2db15'}
name                                    Drop Biscuits and Sausage Gravy
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                             {'$date': 1365276011104}
cookTime                                                          PT30M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-11
prepTime                                                          PT10M
description           Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                           NaN
creator                                                         

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [28]:
recipes.ingredients.str.len().describe()

count    12624.000000
mean       309.432193
std        186.765976
min          0.000000
25%        180.000000
50%        270.000000
75%        400.000000
max       3247.000000
Name: ingredients, dtype: float64

The ingredient lists average about 310 characters long, with a minimum of 0 and a maximum of more than 3,200 characters!

Just out of curiosity, let's see which recipe has the longest ingredient list:

In [36]:
recipes.name[np.argmax(recipes.ingredients.str.len())]

'Braised Beef cheeks Recipe'
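Note that ``np.argmax`` returns an integer *position*, which matches the index label here only because ``recipes`` has the default integer index. A label-based variant, sketched on a hypothetical toy table with a non-default index, uses ``idxmax`` together with ``loc``:

```python
import pandas as pd

# Hypothetical mini-table with a non-default index
toy = pd.DataFrame(
    {"name": ["Toast", "Stew", "Tea"],
     "ingredients": ["bread\nbutter", "beef\ncarrots\nonions\nstock", "water"]},
    index=[10, 20, 30],
)

lengths = toy.ingredients.str.len()
longest_label = lengths.idxmax()       # index label of the longest entry
print(toy.loc[longest_label, "name"])  # Stew
```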

That certainly looks like an involved recipe.

We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:

In [37]:
recipes.description.str.contains('[Bb]reakfast').sum()

185

Or how many of the recipes list cinnamon as an ingredient:

In [38]:
recipes.ingredients.str.contains('[Cc]innamon').sum()

921

We could even look to see whether any recipes misspell the ingredient as "cinamon":

In [39]:
recipes.ingredients.str.contains('[Cc]inamon').sum()

0
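The character-class patterns above only cover a capitalized or lowercase first letter. For a fully case-insensitive search, ``str.contains`` also takes ``case=False``, and ``na=False`` treats missing entries as non-matches. A small sketch on made-up descriptions:

```python
import pandas as pd

# Hypothetical descriptions, including a missing value
descriptions = pd.Series(
    ["A hearty Breakfast dish", "BREAKFAST for two", None, "Dinner stew"]
)

# case=False matches any capitalization; na=False counts missing entries as False
is_breakfast = descriptions.str.contains("breakfast", case=False, na=False)
print(is_breakfast.sum())  # 2
```

The pattern ``'[Bb]reakfast'`` would miss the all-caps entry in this toy data.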

This is the type of essential data exploration that is possible with Pandas string tools. It is data munging like this that Python really excels at.

### A simple recipe recommender

Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.
For simplicity, let's just stick with herbs and spices for the time being:

In [41]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

We can then build a Boolean ``DataFrame`` consisting of ``True`` and ``False`` values, indicating whether each ingredient appears in each recipe's ingredient list:

In [42]:
import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()

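|   | salt | pepper | oregano | sage | parsley | rosemary | tarragon | thyme | paprika | cumin |
|---|------|--------|---------|------|---------|----------|----------|-------|---------|-------|
| 0 | False | False | False | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False |
| 2 | True | True | False | False | False | False | False | False | False | True |
| 3 | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False |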
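One caveat about the call above: the second positional parameter of ``str.contains`` is ``case``, not ``flags``, so ``re.IGNORECASE`` passed positionally is read as a truthy ``case`` value and the match stays case-sensitive. Plain substring matching also lets ``sage`` match inside words such as "Sausage", which may be why the first row is flagged. The sketch below illustrates both points on made-up ingredient strings; it is not a rerun of the cell above, and applying it to the real data would change the counts that follow.

```python
import re
import pandas as pd

# Hypothetical ingredient strings, not taken from the recipe database
ingredients = pd.Series([
    "1 pound Sausage\n2 teaspoons Salt",
    "fresh sage leaves, salt to taste",
])

# Positional form: re.IGNORECASE lands in `case`, so matching is case-sensitive
positional = ingredients.str.contains("salt", re.IGNORECASE)
# Keyword form: the flag is actually applied to the regular expression
keyword = ingredients.str.contains("salt", flags=re.IGNORECASE)
# Word boundaries keep 'sage' from matching inside 'Sausage'
whole_word = ingredients.str.contains(r"\bsage\b", flags=re.IGNORECASE)

print(positional.tolist())  # [False, True]
print(keyword.tolist())     # [True, True]
print(whole_word.tolist())  # [False, True]
```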


Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and sage. We can compute this very quickly using the ``query()`` method of ``DataFrame``s:

In [48]:
selection = spice_df.query('parsley & paprika & sage')
len(selection)

6

We find only 6 recipes with this combination; let's use the index returned by this selection to discover the names of the recipes that have this combination:

In [49]:
recipes.name[selection.index]

2482                        Turkey Stuffing
2950                                 Strata
4635             Cod with lentils and choka
5006                Sausage Stuffing Recipe
6955        Sausage-Currant Stuffing Recipe
12052    Chorizo and Gigante Bean Cassoulet
Name: name, dtype: object
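The same lookup can be wrapped in a small helper that takes any list of spices. This is only a sketch: the function name ``recipes_with`` and the toy tables are hypothetical stand-ins for ``recipes.name`` and ``spice_df``, and it uses a Boolean mask rather than ``query()``.

```python
import pandas as pd

# Hypothetical stand-ins for recipes.name and spice_df
names = pd.Series(["Stuffing", "Plain Rice", "Herb Stew"])
flags = pd.DataFrame({
    "parsley": [True, False, True],
    "paprika": [True, False, False],
    "sage":    [True, False, True],
})

def recipes_with(spices, flag_df, name_series):
    """Return the names of recipes flagged True for every requested spice."""
    # all(axis=1) is True only for rows where each selected column is True
    mask = flag_df[list(spices)].all(axis=1)
    return name_series[mask]

print(recipes_with(["parsley", "sage"], flags, names).tolist())
# ['Stuffing', 'Herb Stew']
```

With the real data, the equivalent call would be ``recipes_with(['parsley', 'paprika', 'sage'], spice_df, recipes.name)``.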

Now that we have narrowed down our recipe selection from more than 12,000 to just 6, a factor of about 2,000, we are in a position to make a more informed decision about what we'd like to cook for dinner.

### Going further with recipes

Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.
Of course, building a very robust recipe recommendation system would require a *lot* more work!
Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.
This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.