A first look at the ingredients in the JSON file and in the .csv ingredient list (*seznam*, Czech for "list").

In [1]:
import pandas as pd
import json
import numpy as np
import csv

In [2]:
from tqdm import tqdm
import inflect

In [3]:
recipes_filename = "../../data/full_format_recipes.json"

In [4]:
df = pd.read_json(recipes_filename)
len(df)

20130

In [5]:
df.head(5)

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


In [6]:
null_values_indexes = df[df['ingredients'].isnull()].index
null_values_indexes

Int64Index([ 1076,  1135,  1907,  5146,  5424,  5558,  7607,  7768,  7881,
             8177,  9590, 10085, 11224, 13206, 13944, 14684, 16210, 16903,
            19547],
           dtype='int64')

In [7]:
df = df.drop(index=null_values_indexes).reset_index(drop=True)  # drop rows whose ingredients are NaN

### Ingredients

I took the first n ingredients from the JSON file (*ingredients*) and the list of ingredients from the .csv file (*seznam*). This was later extended to the full set of ingredients.

From the discussion: we want to distinguish between e.g. *teaspoon* and *tea* (a search for *tea* must not match *teaspoon*). Padding each element of *seznam* with a space before and after it gives whole-word matching, but it creates a second problem: *noodles* in the ingredients no longer matches *noodle* in *seznam* (same with *potatoes* and *potato*, etc.).

Both problems were solved by adding a space before and after each ingredient in *seznam* and by appending the plural form of every ingredient to *seznam*.
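The padding idea can be illustrated on toy data. The snippet below is a minimal sketch, not the notebook's actual code: the helper names `pad` and `find_terms` and the three-term list are made up for illustration, and the plural form is written by hand instead of being generated with `inflect`.

```python
# Minimal sketch of whole-word matching via space padding (toy data only).

def pad(term):
    """Surround a term with spaces so that it only matches whole words."""
    return ' {0} '.format(term)

def find_terms(text, terms):
    """Return the terms that occur as whole words in text."""
    padded_text = ' ' + text + ' '  # pad the text so first and last words can match
    return [t for t in terms if pad(t) in padded_text]

# 'noodles' is listed explicitly, mimicking the plural forms appended to seznam
terms = ['tea', 'noodle', 'noodles']
hits = find_terms('1 teaspoon salt , 2 cups noodles', terms)
print(hits)  # ['noodles']
```

Here *tea* is not found inside *teaspoon*, and *noodles* is found only because the plural form is present in the term list.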

In [8]:
p = inflect.engine() #plural forms engine

In [9]:
ingredients = df['ingredients']
with open('../../data/seznam_all.csv', 'r') as f:
    reader = csv.reader(f)
    seznam = list(reader)
seznam = [''.join(x) for x in seznam]  # make a list of strings from the list of rows
seznam = [' {0} '.format(elem) for elem in seznam]  # add a space before and after each entry
seznam = seznam[1:]  # drop the column name

# append the plural form of every entry to seznam
for i in range(len(seznam)):
    seznam.append(p.plural(seznam[i]))

In [10]:
ingredients[0]

['4 cups low-sodium vegetable or chicken stock',
 '1 cup dried brown lentils',
 '1/2 cup dried French green lentils',
 '2 stalks celery, chopped',
 '1 large carrot, peeled and chopped',
 '1 sprig fresh thyme',
 '1 teaspoon kosher salt',
 '1 medium tomato, cored, seeded, and diced',
 '1 small Fuji apple, cored and diced',
 '1 tablespoon freshly squeezed lemon juice',
 '2 teaspoons extra-virgin olive oil',
 'Freshly ground black pepper to taste',
 '3 sheets whole-wheat lavash, cut in half crosswise, or 6 (12-inch) flour tortillas',
 '3/4 pound turkey breast, thinly sliced',
 '1/2 head Bibb lettuce']

In [11]:
len(seznam)

8964

In [12]:
seznam[:10]

[' salt ',
 ' onions ',
 ' olive oil ',
 ' water ',
 ' garlic ',
 ' sugar ',
 ' garlic cloves ',
 ' butter ',
 ' ground black pepper ',
 ' all-purpose flour ']

In [13]:
ingredients_split = ingredients  # alias, not a copy: df['ingredients'] is overwritten as well
for j in tqdm(range(len(ingredients))):
    # make one long string of ingredients per recipe
    ingredients_split[j] = ' , '.join(ingredients[j])
    # pad commas and asterisks with spaces so that they do not stick to words
    ingredients_split[j] = ingredients_split[j].replace(",", " , ")
    ingredients_split[j] = ingredients_split[j].replace("*", " * ")
    # trailing space so that the last word can match a space-padded entry of seznam
    ingredients_split[j] = ingredients_split[j] + ' '

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|████████████████████████████████████████████

In [14]:
ingredients_split[0]

'4 cups low-sodium vegetable or chicken stock  ,  1 cup dried brown lentils  ,  1/2 cup dried French green lentils  ,  2 stalks celery ,  chopped  ,  1 large carrot ,  peeled and chopped  ,  1 sprig fresh thyme  ,  1 teaspoon kosher salt  ,  1 medium tomato ,  cored ,  seeded ,  and diced  ,  1 small Fuji apple ,  cored and diced  ,  1 tablespoon freshly squeezed lemon juice  ,  2 teaspoons extra-virgin olive oil  ,  Freshly ground black pepper to taste  ,  3 sheets whole-wheat lavash ,  cut in half crosswise ,  or 6 (12-inch) flour tortillas  ,  3/4 pound turkey breast ,  thinly sliced  ,  1/2 head Bibb lettuce '
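The same normalization can be written as a pure function that returns a new string instead of assigning into the Series element by element. This is a sketch under assumptions: the function name `normalize` is hypothetical, and it is meant to reproduce the string format shown above, not to replace the cell that was actually run.

```python
# Sketch: the per-recipe normalization as a standalone function (hypothetical name).

def normalize(ingredient_list):
    """Join a recipe's ingredient strings into one space-padded string."""
    text = ' , '.join(ingredient_list)  # one long string per recipe
    text = text.replace(',', ' , ')     # detach commas from neighbouring words
    text = text.replace('*', ' * ')     # detach asterisks from neighbouring words
    return text + ' '                   # trailing space for the last word

example = normalize(['2 stalks celery, chopped', '1 sprig fresh thyme'])
print(repr(example))
```

Such a function could be applied with `df['ingredients'].apply(normalize)` and the result stored in a separate column, which would leave the original lists untouched.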

In [15]:
final_ing = [[] for i in range(len(ingredients))]
for j in tqdm(range(len(ingredients))):
    for polozka in seznam:  # polozka = one space-padded entry of seznam
        if polozka in ingredients_split[j]:
            final_ing[j].append(polozka)

100%|████████████████████████████████████████████████████████████████████████████| 20111/20111 [35:27<00:00,  9.45it/s]
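The nested loop above tests every entry of *seznam* against every recipe, which is why it runs for about 35 minutes. One possible alternative, sketched below on toy data, is to split each recipe string into tokens, build word n-grams, and look them up in a set. The names `ngrams` and `match_terms` are hypothetical, the terms are stored without padding, and the results are not guaranteed to be identical to the substring search (for example, runs of several spaces are treated as one separator here).

```python
# Sketch: set lookup of word n-grams instead of scanning all terms per recipe.

def ngrams(tokens, max_len):
    """Yield all runs of 1..max_len consecutive tokens, joined by single spaces."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def match_terms(text, term_set, max_len):
    """Return the terms from term_set that occur in text as whole-word n-grams."""
    found, seen = [], set()
    for gram in ngrams(text.split(), max_len):
        if gram in term_set and gram not in seen:
            seen.add(gram)
            found.append(gram)
    return found

terms = {'olive oil', 'oil', 'olive', 'salt', 'tea'}  # toy stand-in for seznam
max_len = max(len(t.split()) for t in terms)           # longest term, in words
result = match_terms(
    '2 teaspoons extra-virgin olive oil , 1 teaspoon kosher salt', terms, max_len)
print(result)  # ['olive', 'oil', 'salt', 'olive oil']
```

The work per recipe then depends on the length of the recipe text and the longest term, not on the number of entries in the term list.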


In [16]:
final_ing[0]

[' salt ',
 ' olive oil ',
 ' ground black pepper ',
 ' pepper ',
 ' kosher salt ',
 ' extra-virgin olive oil ',
 ' black pepper ',
 ' oil ',
 ' flour ',
 ' lemon juice ',
 ' lemon ',
 ' chicken stock ',
 ' chicken ',
 ' celery ',
 ' flour tortillas ',
 ' thyme ',
 ' fresh thyme ',
 ' juice ',
 ' tortillas ',
 ' lettuce ',
 ' stock ',
 ' lentils ',
 ' turkey ',
 ' turkey breast ',
 ' brown lentils ',
 ' crosswise ',
 ' fresh ',
 ' carrot ',
 ' olive ',
 ' vegetable ',
 ' apple ',
 ' green ',
 ' green lentils ']

In [17]:
df['ingredients_individually'] = df['ingredients']
for j in tqdm(range(len(ingredients))):
    df['ingredients_individually'][j] = final_ing[j]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:05<00:00, 3652.66it/s]


In [18]:
df.head()

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium,ingredients_individually
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap",4 cups low-sodium vegetable or chicken stock ...,559.0,"[ salt , olive oil , ground black pepper , ..."
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"1 1/2 cups whipping cream , 2 medium onions ...",1439.0,"[ salt , onions , olive oil , garlic , sug..."
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"1 fennel bulb (sometimes called anise) , stal...",165.0,"[ butter , unsalted butter , milk , chicken..."
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"2 tablespoons extra-virgin olive oil , 1 cup...",,"[ olive oil , garlic , tomatoes , extra-vir..."
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"1 12-ounce package frozen spinach soufflé , t...",452.0,"[ sour cream , ground nutmeg , cheddar chees..."


In [19]:
df['ingredients_individually'][0]

[' salt ',
 ' olive oil ',
 ' ground black pepper ',
 ' pepper ',
 ' kosher salt ',
 ' extra-virgin olive oil ',
 ' black pepper ',
 ' oil ',
 ' flour ',
 ' lemon juice ',
 ' lemon ',
 ' chicken stock ',
 ' chicken ',
 ' celery ',
 ' flour tortillas ',
 ' thyme ',
 ' fresh thyme ',
 ' juice ',
 ' tortillas ',
 ' lettuce ',
 ' stock ',
 ' lentils ',
 ' turkey ',
 ' turkey breast ',
 ' brown lentils ',
 ' crosswise ',
 ' fresh ',
 ' carrot ',
 ' olive ',
 ' vegetable ',
 ' apple ',
 ' green ',
 ' green lentils ']

### Save dataframe with column of found ingredients

In [20]:
recipes_filename_parsed_ingredients = "../../data/full_format_recipes_parsed_ingredients.json"
df.to_json(recipes_filename_parsed_ingredients)

### Find most frequent ingredients
This list will be fed to the search engine.

In [21]:
all_ingredients_list = []
for idx in tqdm(range(len(df['ingredients_individually']))):
    # collect all ingredients from seznam used in all recipes
    all_ingredients_list.extend(df['ingredients_individually'][idx])
all_ingredients = np.array(all_ingredients_list)

100%|███████████████████████████████████████████████████████████████████████████| 20111/20111 [00:29<00:00, 681.32it/s]


In [22]:
unique, counts = np.unique(all_ingredients, return_counts=True)
unique_df = pd.DataFrame(data = zip(unique, counts), columns = ['Unique', 'Counts'])
unique_df = unique_df.sort_values(by = 'Counts', ascending = False)
print(len(unique))

3189
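The same frequency table can also be obtained with the standard library alone. The snippet below is a sketch on a toy list of per-recipe matches (the variable `recipes` is made up for illustration); it uses `collections.Counter` in place of `np.unique` and is not the code that produced the numbers in this notebook.

```python
# Sketch: counting ingredient occurrences with collections.Counter (toy data).
from collections import Counter
from itertools import chain

recipes = [
    [' salt ', ' oil '],
    [' salt ', ' sugar '],
    [' salt ', ' oil '],
]

# flatten the per-recipe lists and count every ingredient
counts = Counter(chain.from_iterable(recipes))
top = counts.most_common(2)  # the two most frequent ingredients with their counts
print(len(counts))  # 3
print(top)          # [(' salt ', 3), (' oil ', 2)]
```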


In [23]:
unique_df.head(50)

Unnamed: 0,Unique,Counts
1152,fresh,10981
1923,oil,9790
2465,salt,8658
2764,sugar,6636
1930,olive,6515
1931,olive oil,6475
1271,garlic,6032
2068,pepper,5977
415,butter,5970
1556,juice,5785


In [24]:
n_most_frequent = 100
string_to_feed_search_engine = unique_df['Unique'][:n_most_frequent].values.astype(str)
string_to_feed_search_engine

array([' fresh ', ' oil ', ' salt ', ' sugar ', ' olive ', ' olive oil ',
       ' garlic ', ' pepper ', ' butter ', ' juice ', ' water ',
       ' unsalted butter ', ' cloves ', ' lemon ', ' red ', ' onion ',
       ' cream ', ' flour ', ' leaves ', ' garlic cloves ', ' vinegar ',
       ' black pepper ', ' lemon juice ', ' chicken ', ' vegetable ',
       ' fresh lemon ', ' fresh lemon juice ', ' wine ', ' green ',
       ' cheese ', ' parsley ', ' eggs ', ' vegetable oil ', ' broth ',
       ' egg ', ' tomatoes ', ' ground black pepper ', ' sauce ',
       ' onions ', ' large eggs ', ' milk ', ' vanilla ',
       ' extra-virgin olive oil ', ' chicken broth ', ' seeds ',
       ' kosher salt ', ' ginger ', ' thyme ', ' all-purpose flour ',
       ' orange ', ' lime ', ' mustard ', ' cilantro ', ' cinnamon ',
       ' white wine ', ' brown sugar ', ' lime juice ', ' rice ',
       ' clove ', ' celery ', ' fresh lime ', ' extract ', ' bread ',
       ' fresh lime juice ', ' potatoes ',

### Final checks

In [25]:
import random
n = random.randint(0, 100)
print(df['ingredients'][n], '\n')
print(df['ingredients_individually'][n])

2 cups pecan halves  ,  3 tablespoons unsalted butter ,  melted  ,  1 1/4 teaspoons fine sea salt  

[' salt ', ' butter ', ' unsalted butter ', ' sea salt ', ' fine sea salt ', ' pecan halves ', ' pecan ']


In [26]:
n = random.randint(0, len(df) - 1)  # randint includes both endpoints
print(df['ingredients'][n], '\n')
print(df['ingredients_individually'][n])

3 firm-ripe mangoes (3 pounds total) ,  peeled and cut into 1/2-inch cubes  ,  1/3 cup distilled white vinegar  ,  1/3 cup packed dark brown sugar  ,  1/3 cup golden raisins  ,  1 3/4 teaspoons salt  ,  1 (1-inch) piece fresh ginger ,  peeled and chopped  ,  1 tablespoon chopped fresh jalapeño including seeds (from 1 chile)  ,  3 garlic cloves ,  chopped  ,  3/4 teaspoon ground cumin  ,  3/4 teaspoon ground coriander  ,  1/2 teaspoon turmeric  ,  2 tablespoons vegetable oil  ,  1 medium onion ,  chopped  ,  1 red bell pepper ,  cut into 1/4-inch dice  ,  1 (3-inch) cinnamon stick  

[' salt ', ' garlic ', ' sugar ', ' garlic cloves ', ' vegetable oil ', ' pepper ', ' ground cumin ', ' red bell pepper ', ' oil ', ' ginger ', ' brown sugar ', ' fresh ginger ', ' cumin ', ' ground coriander ', ' cinnamon ', ' raisins ', ' white vinegar ', ' bell pepper ', ' coriander ', ' vinegar ', ' dark brown sugar ', ' golden raisins ', ' seeds ', ' red ', ' fresh ', ' onion ', ' cloves ', ' mangoes '

In [27]:
n = random.randint(0, len(df) - 1)  # randint includes both endpoints
print(df['ingredients'][n], '\n')
print(df['ingredients_individually'][n])

1 navel orange  ,  3 tablespoons sugar  ,  3 tablespoons unsalted butter  ,  1 large egg  ,  1/2 cup well-chilled heavy cream  

[' sugar ', ' butter ', ' unsalted butter ', ' heavy cream ', ' orange ', ' cream ', ' egg ']
