# Using a subscription-based service for ingredient parsing

Zestful is a subscription-based service that uses a prebuilt named-entity recognition (NER) model to parse ingredients. Its free tier allows 30 ingredients to be parsed per day.

However, it only accepts single lines of text, each describing one ingredient:

       - 3 red peppers, finely chopped
       - 200 ml vegetable stock
       - 1 tin of baked beans
       
Therefore, this service is only useful if individual ingredient lines can be extracted from the post descriptions. This is a difficult task, as the posts come in a variety of formats. Here I shall focus on the most common format of ingredient list in the posts, i.e. lists which are preceded by the word `ingredients` and whose items are separated by `\n` characters:

        ingredients:
        2 large onions, chopped
        3 celery stalks, sliced
        500g boneless chicken

In [41]:
import sys
sys.path.append("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta")
import functions

import numpy as np
import pandas as pd
import re

In [2]:
posts1 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts.csv")
posts2 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts_2.csv")
posts3 = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts_3.csv")

posts = pd.concat([posts1,posts2,posts3])

In [None]:
# Removing duplicate posts
posts = functions.remove_duplicates(posts)

In [3]:
# Getting english posts
posts = functions.get_english_posts(posts)

Detecting language of each post...
Language detection complete.
Time taken: 0:10:13.917597


In [4]:
# Preprocessing text descriptions
posts['description_preprocessed'] = posts['description'].apply(functions.preprocess_text)

In [35]:
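def includes_ingredient_list(text):
    
    # Cast to str so that missing values (e.g. NaN) do not raise a TypeError
    text_str = str(text)
    # Require an ingredient header and enough line breaks to hold a list
    has_header = "ingredients:" in text_str or "ingredients :" in text_str
    return has_header and text_str.count('\n') > 8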
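
The substring test above only recognises `ingredients:` and `ingredients :`. A minimal sketch of a more tolerant check follows, using a regular expression that allows any amount of whitespace before the colon. The names `has_ingredient_header` and `min_newlines` are hypothetical and are not part of the `functions` module; the threshold of 8 line breaks is simply carried over from the cell above.

```python
import re

# Matches "ingredients" followed by optional whitespace and a colon.
INGREDIENT_HEADER = re.compile(r"ingredients\s*:")

def has_ingredient_header(text, min_newlines=8):
    """Return True if text has an ingredient header and more than min_newlines line breaks."""
    text_str = str(text)  # guards against NaN / non-string cells
    found = INGREDIENT_HEADER.search(text_str) is not None
    return found and text_str.count("\n") > min_newlines

# Hypothetical sample post: a title, a header and eight ingredient lines
sample = "title\ningredients  :\n" + "\n".join(["1 cup oats"] * 8)
print(has_ingredient_header(sample))
```

Whether the looser pattern picks up extra false positives would need checking against the actual posts.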

In [179]:
posts['includes_ingredient_list'] = posts['description_preprocessed'].apply(includes_ingredient_list)
posts = posts[posts['includes_ingredient_list']]

In [36]:
post_list = []
for post in posts['description_preprocessed']:
    if includes_ingredient_list(post):
        post_list.append(post)

In [39]:
for post in post_list:
    print(post)
    print("\n----------------\n")

chocolate fudge protein oatmeal 
ingredients:
1/2 cup instant oats
1/2 cup milk
1/2 cup water
1 scoop chocolate protein powder
1 teaspoon brown sugar

toppings:
1 tablespoon peanut butter
1 tablespoon chocolate spread
1/2 a banana
protein bar .in

----------------

⁣caprese chicken with pesto ⠀
⠀
this dish is a perfect light dinner for the warm weather we’ve been having and so easy to cook with hardly any ingredients needed! ⠀
⠀
ingredients:⠀
1 chicken breast⠀
1 ball mozarella cheese ⠀
2 tomatoes ⠀
small handful of basil ⠀
pesto ⠀
balsamic glaze ⠀
⠀
for the chicken: ⠀
preheat the oven to 180 degrees ⠀
chop your tomatoes ans mozarella into even slice ⠀
cut slice into the chicken breast deep enough for the stuffing ⠀
place the chicken breast on the oven tray and place slice of mozarella and tomato into the slice made in the chicken ⠀
top with 2 teaspoon of pesto, salt & pepper⠀
cook in the oven for roughly 30 mins till the chicken is cooked through ⠀
top with fresh basil & serve ⠀
⠀
for 

In [40]:
post_list

['chocolate fudge protein oatmeal \ningredients:\n1/2 cup instant oats\n1/2 cup milk\n1/2 cup water\n1 scoop chocolate protein powder\n1 teaspoon brown sugar\n\ntoppings:\n1 tablespoon peanut butter\n1 tablespoon chocolate spread\n1/2 a banana\nprotein bar .in',
 '\u2063caprese chicken with pesto ⠀\n⠀\nthis dish is a perfect light dinner for the warm weather we’ve been having and so easy to cook with hardly any ingredients needed! ⠀\n⠀\ningredients:⠀\n1 chicken breast⠀\n1 ball mozarella cheese ⠀\n2 tomatoes ⠀\nsmall handful of basil ⠀\npesto ⠀\nbalsamic glaze ⠀\n⠀\nfor the chicken: ⠀\npreheat the oven to 180 degrees ⠀\nchop your tomatoes ans mozarella into even slice ⠀\ncut slice into the chicken breast deep enough for the stuffing ⠀\nplace the chicken breast on the oven tray and place slice of mozarella and tomato into the slice made in the chicken ⠀\ntop with 2 teaspoon of pesto, salt & pepper⠀\ncook in the oven for roughly 30 mins till the chicken is cooked through ⠀\ntop with fresh

In [43]:
print(post_list[0])

chocolate fudge protein oatmeal 
ingredients:
1/2 cup instant oats
1/2 cup milk
1/2 cup water
1 scoop chocolate protein powder
1 teaspoon brown sugar

toppings:
1 tablespoon peanut butter
1 tablespoon chocolate spread
1/2 a banana
protein bar .in


In [170]:
def get_ingredient_lines(text):
    
    '''
    Function to get list of ingredient lines.
    Only works if the ingredient lines are preceded by a line containing
    "ingredients:" (or "ingredients :") and end at a blank line or at the
    end of the post. Returns an empty list if no such header is found.
    '''
    
    # Split post description by new line
    text_list = text.split("\n")
    
    # The idea is to only take the part of the list between "ingredients:" and
    # the next blank line, as the ingredient list items will usually be on new
    # lines after "ingredients:", separated from further text by a blank line
    
    # Find index of first case of "ingredients:"
    start_index = None
    for i, line in enumerate(text_list):
        if 'ingredients:' in line or 'ingredients :' in line:
            start_index = i
            break
    
    # No ingredient header found
    if start_index is None:
        return []
    
    # Assume no end index to begin with (as ingredient list might come at end of post)
    end_index = None
    # Find index of the first (near-)blank line after "ingredients:"
    # Skip the line directly after the header in case it is itself blank
    for i in range(start_index + 2, len(text_list)):
        if len(text_list[i]) < 3:
            end_index = i
            break

    # Slicing up to None runs to the end of the post
    ingredient_lines = text_list[start_index + 1 : end_index]
    
    # Drop a blank line between the header and the first ingredient
    if ingredient_lines and len(ingredient_lines[0]) < 2:
        ingredient_lines.pop(0)
        
    return ingredient_lines

In [173]:
for i in range(len(post_list)):
    print(i)
    print(get_ingredient_lines(post_list[i]))

0
['1/2 cup instant oats', '1/2 cup milk', '1/2 cup water', '1 scoop chocolate protein powder', '1 teaspoon brown sugar']
1
['1 chicken breast⠀', '1 ball mozarella cheese ⠀', '2 tomatoes ⠀', 'small handful of basil ⠀', 'pesto ⠀', 'balsamic glaze ⠀']
2
['2 pound bok choy, small ', '1 tbs ginger, fresh, minced', ' 1/4 cup scallions, thinly sliced ', '1 1 /2 tbs olive oil, divided 6 tbs tamari sauce ', '1 tbs hemp seeds ']
3
['200 gram swiss brown mushrooms, chopped ', '220 gram all purpose flour', '20 gram sugar', '6 gram baking powder', '1/2 teaspoon salt', '1 teaspoon ', '55 gram cold butter, cut into small cubes', '25 gram beaten egg', 'approx. 100 ml milk ']
4
['curd - 1/2 cup', 'roasted gram flour - 1.5 tablespoon', 'carom seeds/ajwain - 1/4 teaspoon', 'ginger garlic paste - 1 tablespoon', 'kashmiri red mirch - 1 tablespoon', 'cumin powder - 1/2 teaspoon', 'black pepper powder - 1/2 teaspoon', 'turmeric powder - 1/2 teaspoon', 'salt as per taste', 'chaat masala - 1 teaspoon', 'garam
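
Several of the lines above still carry trailing padding: ordinary spaces and what appears to be the braille blank character (U+2800), which `str.strip()` does not treat as whitespace. A minimal sketch of a clean-up step that could be applied before sending lines to a parser is given below. The function names are hypothetical, and the set of characters to strip is an assumption based on the outputs shown here (U+2063 appears at the start of one post).

```python
# Characters to trim from both ends of a line: ordinary whitespace,
# U+2800 (braille blank) and U+2063 (invisible separator).
PADDING_CHARS = " \t\r\n\u2800\u2063"

def clean_ingredient_line(line):
    """Strip padding characters from both ends of a single line."""
    return line.strip(PADDING_CHARS)

def clean_ingredient_lines(lines):
    """Clean every line and drop those that end up empty."""
    cleaned = [clean_ingredient_line(line) for line in lines]
    return [line for line in cleaned if line]

example = ['1 chicken breast\u2800', 'pesto \u2800', ' \u2800 ']
print(clean_ingredient_lines(example))
```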

# Now ready to get the ingredient lines for each post and add them to the dataframe...
## Zestful can then be tried out on them
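
Since the free tier mentioned at the top is limited to 30 ingredients per day, it may help to decide in advance which posts fit within one day's allowance. The sketch below takes whole posts in order until the next one would exceed the quota. The name `select_within_quota` is hypothetical, and it assumes that each ingredient line counts as one ingredient towards the limit.

```python
def select_within_quota(lines_per_post, daily_quota=30):
    """Pick whole posts, in order, until the next one would exceed daily_quota lines.

    Returns the selected posts and the number of ingredient lines they use.
    """
    selected = []
    used = 0
    for lines in lines_per_post:
        if used + len(lines) > daily_quota:
            break
        selected.append(lines)
        used += len(lines)
    return selected, used

# Hypothetical input: three posts with 12 ingredient lines each
demo = [["a"] * 12, ["b"] * 12, ["c"] * 12]
selected, used = select_within_quota(demo)
print(len(selected), used)
```

Stopping at the first post that does not fit keeps the order of the posts intact; skipping it and continuing with smaller posts would use the allowance more fully.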