# Using a subscription-based service for ingredient parsing

Zestful is a subscription-based service that uses a prebuilt named-entity recognition (NER) model to parse ingredients. It parses up to 30 ingredients per day for free.

However, it only accepts single lines of text, each describing one ingredient:

       - 3 red peppers, finely chopped
       - 200 ml vegetable stock
       - 1 tin of baked beans
       
Therefore, this service is only useful if individual ingredient lines can be extracted from the post descriptions. This is a difficult task, as the posts come in a variety of formats. Here I shall focus on the most common format of ingredient list in the posts, i.e. one which is preceded by the word `ingredients` and whose items are separated by `\n` characters:

        ingredients:
        2 large onions, chopped
        3 celery stalks, sliced
        500g boneless chicken
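
As a minimal sketch of why this format is convenient, splitting such a description on `\n` already yields one ingredient per element. The `description` string below is a made-up example in the format shown above, not a post from the dataset.

```python
# Toy description in the format shown above (assumed, not taken from the dataset)
description = "ingredients:\n2 large onions, chopped\n3 celery stalks, sliced\n500g boneless chicken"

# Splitting on "\n" gives the header followed by one ingredient per element
header, *ingredient_lines = description.split("\n")

print(header)            # ingredients:
print(ingredient_lines)  # ['2 large onions, chopped', '3 celery stalks, sliced', '500g boneless chicken']
```

Real posts also contain text before and after the list, which is why the start and end of the list have to be located first.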

In [41]:
import sys
sys.path.append("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta")
import functions

import numpy as np
import pandas as pd
import re

In [2]:
posts1 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts.csv")
posts2 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts_2.csv")
posts3 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts_3.csv")

posts = pd.concat([posts1,posts2,posts3])

In [None]:
# Removing duplicate posts
posts = functions.remove_duplicates(posts)

In [3]:
# Getting english posts
posts = functions.get_english_posts(posts)

Detecting language of each post...
Language detection complete.
Time taken: 0:10:13.917597


In [4]:
# Preprocessing text descriptions
posts['description_preprocessed'] = posts['description'].apply(functions.preprocess_text)

In [35]:
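def includes_ingredient_list(text):
    
    # Convert to a string first, in case a description is not a string (e.g. NaN)
    text_str = str(text)
    
    # Require an "ingredients:" header and more than 8 line breaks,
    # so that posts too short to hold a list are excluded
    has_header = "ingredients:" in text_str or "ingredients :" in text_str
    return has_header and text_str.count('\n') > 8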

In [179]:
posts['includes_ingredient_list'] = posts['description_preprocessed'].apply(includes_ingredient_list)
posts = posts[posts['includes_ingredient_list']]
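
As a quick check of how this kind of filter behaves, the sketch below applies the same rule to two made-up descriptions and to a missing value. It is a standalone variant rather than the function used in this notebook: `has_ingredient_list` and `min_line_breaks` are hypothetical names, and the regular expression `ingredients\s*:` is an assumption that tolerates any amount of whitespace before the colon.

```python
import re

# Assumed pattern: "ingredients" followed by optional whitespace and a colon
HEADER_PATTERN = re.compile(r"ingredients\s*:")

def has_ingredient_list(text, min_line_breaks=8):
    """Return True if the text has an ingredients header and enough line breaks."""
    text_str = str(text)  # guards against non-string values such as NaN
    return bool(HEADER_PATTERN.search(text_str)) and text_str.count("\n") > min_line_breaks

# Made-up posts: one with a header and 10 line breaks, one with a header but no list
long_post = "tasty stew\n\ningredients :\n" + "\n".join(f"item {i}" for i in range(8))
short_post = "ingredients: salt and pepper"

print(has_ingredient_list(long_post))     # True
print(has_ingredient_list(short_post))    # False
print(has_ingredient_list(float("nan")))  # False
```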

## Function to get ingredient lines

In [170]:
def get_ingredient_lines(text):
    
    '''
    Function to get list of ingredient lines.
    Only works if ingredient lines are preceded by "ingredients:" and are followed by a blank line
    (or run to the end of the post).
    '''
    
    # Split post description by new line
    text_list = str(text).split("\n")
    
    # The idea is to only take the lines between "ingredients:" and the next blank line,
    # as the ingredient list items will usually be on new lines after "ingredients:",
    # separated by a blank line at the end before further text

    # Find index of first line containing "ingredients:"
    start_index = None
    for i, line in enumerate(text_list):
        if 'ingredients:' in line or 'ingredients :' in line:
            start_index = i
            break
    
    # No header found, so there is nothing to extract
    if start_index is None:
        return []
    
    # Assume no end index to begin with (as ingredient list might come at end of post)
    end_index = None
    # Find index of first (near-)blank line after "ingredients:"
    # Start at start_index+2 so that a blank line directly after the header is skipped
    for i in range(start_index + 2, len(text_list)):
        if len(text_list[i]) < 3:
            end_index = i
            break

    # Slicing up to None runs to the end of the post
    ingredient_lines = text_list[start_index+1 : end_index]
    
    # Drop a blank line sitting directly after the header
    if ingredient_lines and len(ingredient_lines[0]) < 2:
        ingredient_lines.pop(0)
        
    return ingredient_lines

In [197]:
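# post_list is not defined in the cells shown; assumed here to be the preprocessed descriptions
post_list = posts['description_preprocessed'].tolist()
for i in range(len(post_list)):
    print(i)
    print(get_ingredient_lines(post_list[i]))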

0
['1/2 cup instant oats', '1/2 cup milk', '1/2 cup water', '1 scoop chocolate protein powder', '1 teaspoon brown sugar']
1
['1 chicken breast⠀', '1 ball mozarella cheese ⠀', '2 tomatoes ⠀', 'small handful of basil ⠀', 'pesto ⠀', 'balsamic glaze ⠀']
2
['2 pound bok choy, small ', '1 tbs ginger, fresh, minced', ' 1/4 cup scallions, thinly sliced ', '1 1 /2 tbs olive oil, divided 6 tbs tamari sauce ', '1 tbs hemp seeds ']
3
['200 gram swiss brown mushrooms, chopped ', '220 gram all purpose flour', '20 gram sugar', '6 gram baking powder', '1/2 teaspoon salt', '1 teaspoon ', '55 gram cold butter, cut into small cubes', '25 gram beaten egg', 'approx. 100 ml milk ']
4
['curd - 1/2 cup', 'roasted gram flour - 1.5 tablespoon', 'carom seeds/ajwain - 1/4 teaspoon', 'ginger garlic paste - 1 tablespoon', 'kashmiri red mirch - 1 tablespoon', 'cumin powder - 1/2 teaspoon', 'black pepper powder - 1/2 teaspoon', 'turmeric powder - 1/2 teaspoon', 'salt as per taste', 'chaat masala - 1 teaspoon', 'garam

## Getting ingredient lines for each post

The ingredient lines can now be extracted for each post and added to the dataframe, ready to try out Zestful on them.

In [200]:
# Work on an explicit copy so the new column is not set on a slice of the filtered dataframe
posts = posts.copy()
posts['ingredient_lines'] = posts['description_preprocessed'].apply(get_ingredient_lines)


In [206]:
for lines in posts['ingredient_lines'][0:4]:
    print(lines)
    print(len(lines))

['1/2 cup instant oats', '1/2 cup milk', '1/2 cup water', '1 scoop chocolate protein powder', '1 teaspoon brown sugar']
5
['1 chicken breast⠀', '1 ball mozarella cheese ⠀', '2 tomatoes ⠀', 'small handful of basil ⠀', 'pesto ⠀', 'balsamic glaze ⠀']
6
['2 pound bok choy, small ', '1 tbs ginger, fresh, minced', ' 1/4 cup scallions, thinly sliced ', '1 1 /2 tbs olive oil, divided 6 tbs tamari sauce ', '1 tbs hemp seeds ']
5
['200 gram swiss brown mushrooms, chopped ', '220 gram all purpose flour', '20 gram sugar', '6 gram baking powder', '1/2 teaspoon salt', '1 teaspoon ', '55 gram cold butter, cut into small cubes', '25 gram beaten egg', 'approx. 100 ml milk ']
9
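
Some of the extracted lines above carry leading or trailing spaces and an invisible padding character, which appears to be U+2800 (braille pattern blank). A minimal sketch of a clean-up step that could be run before sending lines to the parser follows. `clean_ingredient_lines` is a hypothetical helper, and whether this cleaning changes Zestful's results has not been tested here.

```python
BRAILLE_BLANK = "\u2800"  # assumed identity of the invisible character seen in the output above

def clean_ingredient_lines(lines):
    """Strip padding characters and surplus whitespace, dropping lines left empty."""
    cleaned = []
    for line in lines:
        # Replace the braille blank with a space, then collapse runs of whitespace
        line = " ".join(line.replace(BRAILLE_BLANK, " ").split())
        if line:
            cleaned.append(line)
    return cleaned

raw = ['1 chicken breast\u2800', ' 1/4 cup scallions, thinly sliced ', 'pesto \u2800', '']
print(clean_ingredient_lines(raw))
# ['1 chicken breast', '1/4 cup scallions, thinly sliced', 'pesto']
```

It could be applied to the dataframe with `posts['ingredient_lines'].apply(clean_ingredient_lines)`.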


In [209]:
# Zestful only parses up to 30 ingredients a day for free, so for now let's just use the first 4 posts (25 ingredient lines in total)
posts_sample = posts.iloc[0:4,:]
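
Rather than counting by hand, the number of leading posts that fit within the free allowance can be computed from a running total of line counts. The sketch below is one way to do this; `count_posts_within_budget` and `budget` are hypothetical names, and the example uses the line counts printed above (5, 6, 5, 9) plus a made-up fifth post.

```python
from itertools import accumulate

def count_posts_within_budget(ingredient_lists, budget=30):
    """Number of leading posts whose combined ingredient lines fit within the budget."""
    running_totals = accumulate(len(lines) for lines in ingredient_lists)
    # Running totals never decrease, so this counts the leading posts that fit
    return sum(1 for total in running_totals if total <= budget)

# Line counts of the first four posts printed above, plus a made-up fifth post
example = [["x"] * n for n in (5, 6, 5, 9, 12)]
print(count_posts_within_budget(example))  # 4
```

With the dataframe, the result `n` of `count_posts_within_budget(posts['ingredient_lines'])` could replace the hard-coded `4` in `posts.iloc[0:n, :]`.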

In [210]:
# zestful ingredient parser
import parse_ingredient

def get_ingredients_dict(ingredient_lines):
    
    ingredients = parse_ingredient.parse_multiple(ingredient_lines)
    
    return ingredients.as_dict()

In [211]:
# Work on an explicit copy so the new column is not set on a slice of posts
posts_sample = posts_sample.copy()
posts_sample['ingredients_dict'] = posts_sample['ingredient_lines'].apply(get_ingredients_dict)


In [236]:
print(posts_sample['description_preprocessed'].iloc[2])

looking for a fast recipe to make for lunch? this dish is light and can be made in only 25 minutes!

bok choy with tamari sauce

ingredients:
2 pound bok choy, small 
1 tbs ginger, fresh, minced
 1/4 cup scallions, thinly sliced 
1 1 /2 tbs olive oil, divided 6 tbs tamari sauce 
1 tbs hemp seeds 

directions:
prep : 
1 . trim bok choy bottoms and cut in half lengthwise. 
2 . mince ginger. 
3 . slice scallions. 

make :
1 . coat bottom of pan with 1 1 /2 teaspoon oil, over medium-high heat. when hot, lay bok choy in single layer, cut side down (do not overcrowd). add 2 tablespoon water and cover about 2 minutes. 

2 . lift out of pan, drain, and place on platter. remove water from pan and repeat with remaining bok choy. leave residual water in pan. 

3 . add 1 tablespoon oil and ginger to pan and cook, stirring over high 
heat until ginger is browned. remove from heat and add tamari sauce. 

4 . pour sauce over bok choy, sprinkle with scallions and hemp seeds. 

for more healthy recipes

In [237]:
posts_sample['ingredient_lines'].iloc[2]

['2 pound bok choy, small ',
 '1 tbs ginger, fresh, minced',
 ' 1/4 cup scallions, thinly sliced ',
 '1 1 /2 tbs olive oil, divided 6 tbs tamari sauce ',
 '1 tbs hemp seeds ']

In [224]:
posts_sample['ingredients_dict'].iloc[2]

[{'error': None,
  'raw': '2 pound bok choy, small ',
  'parsed': {'confidence': 0.8712382,
   'product': 'bok choy',
   'productSizeModifier': None,
   'quantity': 2.0,
   'unit': 'pound',
   'preparationNotes': None,
   'usda_info': {'category': 'Vegetables and Vegetable Products',
    'description': 'Cabbage, chinese (pak-choi), raw',
    'fdcId': '170390',
    'matchMethod': 'exact'}}},
 {'error': None,
  'raw': '1 tbs ginger, fresh, minced',
  'parsed': {'confidence': 0.8276143,
   'product': 'ginger',
   'productSizeModifier': None,
   'quantity': 1.0,
   'unit': 'tablespoon',
   'preparationNotes': 'fresh, minced',
   'usda_info': {'category': 'Vegetables and Vegetable Products',
    'description': 'Ginger root, raw',
    'fdcId': '169231',
    'matchMethod': 'exact'}}},
 {'error': None,
  'raw': ' 1/4 cup scallions, thinly sliced ',
  'parsed': {'confidence': 0.9478576,
   'product': 'scallions',
   'productSizeModifier': None,
   'quantity': 0.25,
   'unit': 'cup',
   'prepara
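
The nested structure above is awkward to analyse directly, so it may help to flatten each result into a single row. The sketch below is one possible approach: `flatten_parsed` is a hypothetical helper, the selection of fields is an assumption about what is useful downstream, and `sample` holds two entries copied from the outputs in this notebook. It tolerates `usda_info` being `None`, as happens for "instant oats".

```python
def flatten_parsed(results):
    """Turn a list of parser result dicts into flat rows, one per ingredient line."""
    rows = []
    for result in results:
        parsed = result.get("parsed") or {}
        usda = parsed.get("usda_info") or {}  # usda_info can be None when no match is found
        rows.append({
            "raw": result.get("raw"),
            "product": parsed.get("product"),
            "quantity": parsed.get("quantity"),
            "unit": parsed.get("unit"),
            "confidence": parsed.get("confidence"),
            "fdcId": usda.get("fdcId"),
        })
    return rows

sample = [
    {'error': None, 'raw': '2 pound bok choy, small ',
     'parsed': {'confidence': 0.8712382, 'product': 'bok choy', 'productSizeModifier': None,
                'quantity': 2.0, 'unit': 'pound', 'preparationNotes': None,
                'usda_info': {'category': 'Vegetables and Vegetable Products',
                              'description': 'Cabbage, chinese (pak-choi), raw',
                              'fdcId': '170390', 'matchMethod': 'exact'}}},
    {'error': None, 'raw': '1/2 cup instant oats',
     'parsed': {'confidence': 0.976161, 'product': 'instant oats', 'productSizeModifier': None,
                'quantity': 0.5, 'unit': 'cup', 'preparationNotes': None,
                'usda_info': None}},
]

for row in flatten_parsed(sample):
    print(row["product"], row["quantity"], row["unit"], row["fdcId"])
# bok choy 2.0 pound 170390
# instant oats 0.5 cup None
```

The resulting rows can be passed to `pd.DataFrame(rows)` for further analysis.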

In [219]:
# Total number of ingredient lines across all posts
num_ingredients = sum(len(lines) for lines in posts['ingredient_lines'])
print(f"Zestful charges £0.02 per ingredient, so the cost of parsing all ingredients would be £{num_ingredients*0.02:.2f}.")

Zestful charges £0.02 per ingredient, so the cost of parsing all ingredients would be £43.28.
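
As a worked check of the alternative, the quoted cost implies the total number of ingredient lines, and from that a rough number of days needed to parse everything with the free allowance alone. This is a back-of-the-envelope sketch: it assumes the allowance is used in full every day, so it is a lower bound if whole posts are parsed together.

```python
import math

PRICE_PER_INGREDIENT = 0.02  # pounds, as quoted above
FREE_PER_DAY = 30            # free ingredient parses per day, as stated earlier
total_cost = 43.28           # pounds, from the output above

# Recover the number of ingredient lines from the total cost
num_ingredients = round(total_cost / PRICE_PER_INGREDIENT)
# Days needed if exactly 30 lines are parsed each day
days_needed = math.ceil(num_ingredients / FREE_PER_DAY)

print(num_ingredients)  # 2164
print(days_needed)      # 73
```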


### Using the free daily allowance to parse a few more posts each day

In [239]:
posts_sample.to_csv("posts_with_ingredients.csv")

In [241]:
posts.to_csv("posts_preprocessed.csv")
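
One thing to watch when reloading these files: a list stored in a CSV cell is written as its string representation, so columns such as `ingredient_lines` come back as strings rather than lists. A minimal sketch of restoring them with the standard library's `ast.literal_eval` is shown below; the `lines` value is copied from the first post above.

```python
import ast

lines = ['1/2 cup instant oats', '1/2 cup milk']

# A list written to a CSV cell ends up as its string representation
stored = str(lines)
print(type(stored).__name__)  # str

# ast.literal_eval safely turns that string back into a Python list
restored = ast.literal_eval(stored)
print(restored == lines)  # True

# With pandas, the same conversion can be applied while reading, e.g.
# pd.read_csv("posts_preprocessed.csv", converters={"ingredient_lines": ast.literal_eval})
```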

In [None]:
def add_new_posts(posts_file_path, posts_sample_file_path):
    
    posts = pd.read_csv(posts_file_path)
    posts_sample = pd.read_csv(posts_sample_file_path)
    
    

In [None]:
for index, row in posts.iterrows():
    pass  # loop body was left unwritten in this cell

In [247]:
posts['ingredient_lines'].iloc[0]

['1/2 cup instant oats',
 '1/2 cup milk',
 '1/2 cup water',
 '1 scoop chocolate protein powder',
 '1 teaspoon brown sugar']

In [249]:
posts_sample['ingredients_dict'].iloc[0]

[{'error': None,
  'raw': '1/2 cup instant oats',
  'parsed': {'confidence': 0.976161,
   'product': 'instant oats',
   'productSizeModifier': None,
   'quantity': 0.5,
   'unit': 'cup',
   'preparationNotes': None,
   'usda_info': None}},
 {'error': None,
  'raw': '1/2 cup milk',
  'parsed': {'confidence': 0.9959719,
   'product': 'milk',
   'productSizeModifier': None,
   'quantity': 0.5,
   'unit': 'cup',
   'preparationNotes': None,
   'usda_info': {'category': 'Dairy and Egg Products',
    'description': 'Milk, fluid, 1% fat, without added vitamin A and vitamin D',
    'fdcId': '173441',
    'matchMethod': 'closestUnbranded'}}},
 {'error': None,
  'raw': '1/2 cup water',
  'parsed': {'confidence': 0.9923842,
   'product': 'water',
   'productSizeModifier': None,
   'quantity': 0.5,
   'unit': 'cup',
   'preparationNotes': None,
   'usda_info': {'category': 'Beverages',
    'description': 'Beverages, water, tap, drinking',
    'fdcId': '173647',
    'matchMethod': 'exact'}}},
 {'er