# Processing of individual ingredients

This notebook processes the df['ingredients'] column obtained from the notebook Adam_ingredients_v2. The goal is to have three columns: the original df['ingredients']; a second one, df['ingredients_preformatted'], produced by the logical steps explained in this notebook; and a third one, df['ingredients_to_cnn'], which contains only the ingredients that will be fed into the CNN.

In [1]:
import pandas as pd
import json
import numpy as np
import csv

from tqdm import tqdm
import inflect
import copy

recipes_filename_parsed_ingredients = "../../data/full_format_recipes_parsed_ingredients.json"

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
p = inflect.engine()

In [4]:
df = pd.read_json(recipes_filename_parsed_ingredients)

## Ingredients preformatted

- Make plurals singular.

- Make 'olive oil' 'oil' again.

- Remove ingredients that appear more than once in a recipe.

- Thresholding.

### Plural 

An example of a recipe containing plural ingredients is recipe no. 557.

In [5]:
print(df['ingredients'][557], '\n')
print(df['ingredients_individually'][557])

['1 cup frozen baby peas (not thawed)', '1/2 cup heavy cream', '1/4 teaspoon dried hot red-pepper flakes', '1 garlic clove, smashed', '3 cups packed baby spinach (3 ounces)', '1 teaspoon grated lemon zest', '1 1/2 teaspoons fresh lemon juice', '1 pound dried gnocchi (preferably De Cecco)', '1/4 cup grated parmesan'] 

['peas', 'cream', 'pepper flakes', 'garlic', 'spinach', 'lemon zest', 'lemon juice', 'gnocchi', 'parmesan']


In [6]:
df['ingredients_preformatted'] = [x[:] for x in df['ingredients_individually']]  # new df column; copy each list
for ingredients in tqdm(df['ingredients_preformatted']):
    for j, ingredient in enumerate(ingredients):
        singular = p.singular_noun(ingredient)  # returns False if the word is not plural
        if singular is not False:  # therefore, the ingredient is plural
            ingredients[j] = singular  # the list is modified in place

100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:16<00:00, 1201.54it/s]


In [7]:
print(df['ingredients_individually'][557], '\n')
print(df['ingredients_preformatted'][557])

['peas', 'cream', 'pepper flakes', 'garlic', 'spinach', 'lemon zest', 'lemon juice', 'gnocchi', 'parmesan'] 

['pea', 'cream', 'pepper flake', 'garlic', 'spinach', 'lemon zest', 'lemon juice', 'gnocchi', 'parmesan']


### Make 'olive oil' 'oil' again

Unfortunately, this step is manual; see manual_rename.py. Typically, e.g. all kinds of oil are renamed to 'oil'.

In [8]:
from manual_rename import rename
df = rename(df)

100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:03<00:00, 6070.15it/s]
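manual_rename.py is not reproduced in this notebook, so the following is only a minimal sketch of what such a renaming step might look like. The names `RENAME_MAP` and `rename_ingredients` are hypothetical; the four entries are consistent with renames visible in the outputs of this notebook, and the real module presumably covers many more.

```python
# Hypothetical sketch; the actual rules live in manual_rename.py.
RENAME_MAP = {
    "olive oil": "oil",
    "vegetable oil": "oil",
    "granulated sugar": "sugar",
    "vanilla extract": "vanilla",
}


def rename_ingredients(ingredients, rename_map=RENAME_MAP):
    """Return a new list with every known ingredient replaced by its generic name."""
    return [rename_map.get(name, name) for name in ingredients]


renamed = rename_ingredients(["olive oil", "garlic", "vegetable oil"])
print(renamed)  # ['oil', 'garlic', 'oil']
```

Note that renaming can itself create duplicates within a recipe (here, 'oil' twice).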


### Duplicate ingredients within a recipe: original, plural-to-singular and renaming duplicates

- If a recipe contained e.g. both olives and olive, the plural-to-singular change kept both (olive and olive).

- Duplicates in the recipes themselves.

- Duplicates from renaming ('olive oil' renamed to 'oil' AND an existing 'oil').

All of this can be seen in recipe no. 135.

In [9]:
# Assign the whole column at once to avoid chained assignment (SettingWithCopyWarning)
df['ingredients_preformatted'] = [list(set(x)) for x in tqdm(df['ingredients_preformatted'])]

100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:08<00:00, 2358.34it/s]


Recipe no. 135 contains all three cases: a plural (*blueberries*), a duplicate (*flour* appears twice) and a renaming (*granulated sugar* and *sugar*).

In [10]:
print(len(df['ingredients_individually'][135]), df['ingredients_individually'][135], '\n')
print(len(df['ingredients_preformatted'][135]), df['ingredients_preformatted'][135], '\n')

16 ['flour', 'sugar', 'granulated sugar', 'cinnamon', 'butter', 'flour', 'powder', 'soda', 'salt', 'cream', 'vanilla extract', 'butter', 'granulated sugar', 'egg', 'blueberries', 'cream'] 

11 ['sugar', 'cream', 'egg', 'butter', 'blueberry', 'salt', 'cinnamon', 'flour', 'vanilla', 'powder', 'soda'] 
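
Converting to a `set` removes duplicates but does not keep the original ingredient order, which is why the 11 ingredients above appear shuffled. If the order matters downstream, one alternative (not what this notebook does) is `dict.fromkeys`, which keeps the first occurrence of each item. A minimal sketch on a shortened, already singularized and renamed list made up for illustration:

```python
# Already singularized and renamed ingredients (shortened, for illustration only).
ingredients = ["flour", "sugar", "sugar", "cinnamon", "butter", "flour", "blueberry"]

# set() deduplicates, but the resulting order is arbitrary.
deduplicated_unordered = list(set(ingredients))

# dict.fromkeys() deduplicates and keeps first-occurrence order.
deduplicated_ordered = list(dict.fromkeys(ingredients))
print(deduplicated_ordered)  # ['flour', 'sugar', 'cinnamon', 'butter', 'blueberry']
```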



### Thresholding

All ingredients with an occurrence count lower than the threshold *t* are dismissed; this is achieved by intersecting each recipe's ingredients with the thresholding list.

The threshold is set arbitrarily to 10 and can be changed. This corresponds to 689 ingredients.
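
How thresholding_list.csv was produced is not shown in this notebook. The following is a minimal sketch, assuming the file lists ingredients sorted by how often they occur (most frequent first) and that an ingredient is kept when it occurs at least *t* times. The function `build_thresholding_list` and the toy recipes are hypothetical.

```python
from collections import Counter


def build_thresholding_list(recipes, t):
    """Return the ingredients that occur at least t times, most frequent first."""
    counts = Counter(name for recipe in recipes for name in recipe)
    return [name for name, count in counts.most_common() if count >= t]


# Toy data for illustration; the real input would be the preformatted column.
toy_recipes = [
    ["oil", "salt", "garlic"],
    ["oil", "salt"],
    ["oil", "saffron"],
]
kept = build_thresholding_list(toy_recipes, t=2)
print(kept)  # ['oil', 'salt']
```

Under this assumption, taking the first 689 entries of the sorted list is the same as applying the threshold.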

In [14]:
with open('thresholding_list.csv', 'r') as f:
    reader = csv.reader(f)
    thresholding_list = list(reader)
thresholding_list = [''.join(x) for x in thresholding_list] #make list of strings from list of lists
thresholding_list = thresholding_list[1:] #drop column name

In [15]:
thresholding_list = thresholding_list[:689]

In [17]:
len(thresholding_list)

689

In [18]:
thresholding_set = set(thresholding_list)  # set membership tests are faster than list lookups

# Assign the whole column at once to avoid chained assignment (SettingWithCopyWarning)
df['ingredients_preformatted'] = [
    [x for x in ingredients if x in thresholding_set]
    for ingredients in tqdm(df['ingredients_preformatted'])
]

100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:08<00:00, 2264.87it/s]


## Unique ingredients left

In [19]:
all_ingredients_list = []
for ingredients in tqdm(df['ingredients_preformatted']):
    # all ingredients from the thresholding list used in all recipes
    all_ingredients_list.extend(ingredients)  # extend avoids rebuilding the list on every iteration
all_ingredients = np.array(all_ingredients_list)

100%|███████████████████████████████████████████████████████████████████████████| 20111/20111 [00:35<00:00, 560.84it/s]


In [20]:
unique, counts = np.unique(all_ingredients, return_counts=True)
unique_df = pd.DataFrame(data = zip(unique, counts), columns = ['Unique', 'Counts'])
unique_df = unique_df.sort_values(by = 'Counts', ascending = False)
print("Number of unique (both singular and plural) ingredients:", len(unique))

unique_df.head()

Number of unique (both singular and plural) ingredients: 689


|     | Unique | Counts |
|-----|--------|--------|
| 383 | oil    | 9671   |
| 527 | salt   | 8553   |
| 601 | sugar  | 6587   |
| 246 | garlic | 6063   |
| 72  | butter | 5809   |


In [21]:
unique_df.to_csv("unique_ingredients_final.csv")

### Example
The whole processing chain from *ingredients* to *ingredients_individually* to *ingredients_preformatted* can be seen in e.g. recipe no. 4028.

In [22]:
print(df['ingredients'][4028], '\n')
print(len(df['ingredients_individually'][4028]), df['ingredients_individually'][4028], '\n')
print(len(df['ingredients_preformatted'][4028]), df['ingredients_preformatted'][4028])

['1 cup pitted brine-cured green olives', '1 tablespoon olive oil', '1 large garlic clove, peeled', '1 1/2 teaspoons chopped fresh rosemary', '1/2 teaspoon finely grated orange peel', '6 boneless chicken breast halves with skin', '6 tablespoons orange juice', '2 tablespoons chopped fresh rosemary', '4 garlic cloves, pressed', '1 tablespoon finely grated orange peel', '3/4 cup olive oil', '1/2 cup chopped pitted brine-cured green olives', 'Nonstick vegetable oil spray', '2 large unpeeled oranges, each cut into 6 wedges'] 

14 ['olives', 'olive oil', 'garlic', 'rosemary', 'orange peel', 'chicken breast', 'orange juice', 'rosemary', 'garlic cloves', 'orange peel', 'olive oil', 'olives', 'vegetable oil', 'oranges'] 

6 ['oil', 'garlic', 'olive', 'chicken', 'rosemary', 'orange']


## Ingredients to feed CNN 

Intersect the preformatted ingredients with the handpicked list of ingredients to feed into the CNN.
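
The intersection can be sketched on a single recipe as follows. The contents of to_cnn_list.csv are not shown in this notebook, so the `to_cnn` set below is a hypothetical stand-in. A set is used because membership tests are faster than on a list, and the list comprehension keeps the order of the preformatted ingredients.

```python
# Hypothetical stand-in for the handpicked list loaded from to_cnn_list.csv.
to_cnn = {"tomato", "oil", "garlic", "wine", "onion", "lemon"}

preformatted = ["tomato", "oil", "garlic", "serrano chile", "cilantro",
                "water", "wine", "tomatillo", "onion"]

# Keep only the handpicked ingredients, preserving the recipe's own order.
to_cnn_ingredients = [name for name in preformatted if name in to_cnn]
print(to_cnn_ingredients)  # ['tomato', 'oil', 'garlic', 'wine', 'onion']
```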

In [23]:
with open('to_cnn_list.csv', 'r') as f:
    reader = csv.reader(f)
    to_cnn_list = list(reader)
to_cnn_list = [''.join(x) for x in to_cnn_list] #make list of strings from list of lists
to_cnn_list = to_cnn_list[1:] #drop column name

In [24]:
to_cnn_set = set(to_cnn_list)  # set membership tests are faster than list lookups

# Assign the whole column at once to avoid chained assignment (SettingWithCopyWarning)
df['ingredients_to_cnn'] = [
    [x for x in ingredients if x in to_cnn_set]
    for ingredients in tqdm(df['ingredients_preformatted'])
]

100%|██████████████████████████████████████████████████████████████████████████| 20111/20111 [00:07<00:00, 2832.96it/s]


## Final checks

In [25]:
import random
i = random.randrange(len(df))  # randint(0, len(df)) is inclusive and could pick an out-of-range index

print(df['ingredients'][i], '\n')
print(df['ingredients_individually'][i], '\n')
print(df['ingredients_preformatted'][i], '\n')
print(df['ingredients_to_cnn'][i])

['2 tablespoons unsalted butter', '2 medium onions, very thinly sliced', '4 stalks celery, very thinly sliced', '2 medium carrots, very thinly sliced', '2 dried bay leaves', '1/4 cup roughly chopped fresh flat-leaf parsley leaves and stems', '6 to 8 sprigs fresh thyme', '2 tablespoons black peppercorns', '1 large (6 inches long or more) or 2 small (4 inches long or less) fish heads from cod or haddock, split lengthwise, gills removed, and rinsed clean of any blood', '2 1/2 to 3 pounds fish frames (bones) from sole, flounder, bass, and/or halibut, cut into 2-inch pieces and rinsed clean of any blood', '1/4 cup dry white wine', 'About 2 quarts very hot or boiling water', 'Kosher or sea salt'] 

['butter', 'onions', 'celery', 'carrots', 'bay leaves', 'parsley leaves', 'thyme', 'peppercorns', 'haddock', 'flounder', 'wine', 'water', 'salt'] 

['parsley', 'haddock', 'flounder', 'wine', 'onion', 'carrot', 'salt', 'pepper', 'thyme', 'butter', 'water', 'celery', 'bay leaf'] 

['wine', 'onion', 

In [26]:
import random
i = random.randrange(len(df))  # randint(0, len(df)) is inclusive and could pick an out-of-range index

print(df['ingredients'][i], '\n')
print(df['ingredients_individually'][i], '\n')
print(df['ingredients_preformatted'][i], '\n')
print(df['ingredients_to_cnn'][i])

['1/2 pound fresh tomatillos, husked, rinsed, and quartered', '1 1/2 pounds tomatoes, chopped, divided', '1/2 cup chopped white onion, divided', '1 fresh serrano chile, coarsely chopped, including seeds', '1 garlic clove, quartered', '2 tablespoons red-wine vinegar', '1 cup water', '2 tablespoons olive oil', '1/2 cup chopped cilantro'] 

['tomatillos', 'tomatoes', 'onion', 'serrano chile', 'garlic', 'wine vinegar', 'water', 'olive oil', 'cilantro'] 

['tomato', 'oil', 'garlic', 'serrano chile', 'cilantro', 'water', 'wine', 'tomatillo', 'onion'] 

['tomato', 'oil', 'garlic', 'wine', 'onion']


In [27]:
import random
i = random.randrange(len(df))  # randint(0, len(df)) is inclusive and could pick an out-of-range index

print(df['ingredients'][i], '\n')
print(df['ingredients_individually'][i], '\n')
print(df['ingredients_preformatted'][i], '\n')
print(df['ingredients_to_cnn'][i])

['4 cups chopped cantaloupe', '1 1/3 cups Asti Spumante (Italian sparkling wine) or water', '3/4 cup sugar', '2 tablespoons fresh lemon juice'] 

['cantaloupe', 'water', 'sugar', 'lemon juice'] 

['lemon', 'water', 'cantaloupe', 'sugar'] 

['lemon']


## Save dataframe with three columns of found ingredients

In [29]:
recipes_filename_parsed_ingredients_final = "../../data/full_format_recipes_parsed_ingredients_final.json"
df.to_json(recipes_filename_parsed_ingredients_final)