In [1]:
import operator
import os.path
import pickle
import re
import time
from collections import Counter
from fractions import Fraction

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

$\textbf{1. Introduction and Background}$

There are often reasons why we can't eat what we want: sometimes we don't have it in our fridge; sometimes those eating with us are allergic to an ingredient in our desired food; sometimes we are on a plane, and plane food doesn't cut it.  So, we decided to ask the NP-hard question: "What DOES cut it?  How can I eat what I want given the circumstances I am in!?"

We might be in a position to choose among many ingredients to make any recipe we like, but those times are few and far between for young, poor college students with potentially not much more than beans, rice, and ramen.  So is there any hope for us?  Maybe.

If we want to overcome the nutrition crisis, we need solutions.  Fast.  We need a way to decide what we can eat, given the constraints we are in.  Don't have apples?  Constraint.  Are you a diabetic?  Constraint.  Will you settle for nothing less than a Mexican meal?  Constraint.

In order to break out of the constraints of what we CAN'T have, we need the liberating capability to see what we CAN have.  This will ultimately be solved through machine learning, data science, and magic.  But in order for those things to happen, we need a ground to build upon: we need data.

Due to the nature of the question "What can I eat that will fit my nutritional needs, dietary constraints, and current cravings?", we need recipes and their corresponding nutritional data.  How does one come by these, you may ask?  Scraping!  We delve into the world of numbers and human error in the following story.

Let me take you on a mouth-watering and brain-hurting adventure.

$\textbf{2.A Data Collection: Recipes and Ingredients Scraping}$

Why are we here?  Ah!  To collect recipes.

When one first googles 'recipes'... one stops.  Let it suffice to say that the formatting, information, and quality of online recipe databases differ greatly.  There is little order or consistency, even across a single website.  So, rather than suffer greatly trying to use Selenium and Beautiful Soup, we decided to search out recipes in text format only, and only suffer a little, opting for our pain to come from regex and a little bit of Beautiful Soup.

What follows is some code we used to scrape the text files.

In [None]:
def get_recipes():
    ''' Gets all the recipes and saves them in a pickle.  Doesn't scrape
    the website if the information is already there.'''

    # if the website has been scraped, don't scrape it again;
    # return the saved contents instead
    if os.path.exists('recipes.pickle'): # checks if the pickle file already exists
        print("pickle already here: returning contents")
        with open('recipes.pickle','rb') as f:
            recipes = pickle.load(f) # load the saved contents 
            return recipes
    # otherwise, scrapes the website, pickles the information, and 
    # returns the contents
    else:
        print("pickle not here yet: creating contents")
        type_list = get_HTML_extensions() # names of the .txt files to download
        text_data = [] # create the list to store the contents
        for i in type_list:
            time.sleep(.25)
            # gets the text files from the links on the website
            contents = requests.get(f"http://mc6help.tripod.com/RecipeLibrary/{i}") 
            text_data.append(contents.text) # appends the contents to the list
            
        with open('recipes.pickle','wb') as f:
            pickle.dump(text_data,f) # save the contents
            
        return text_data
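
The load-or-scrape pattern in `get_recipes` can be isolated from the network entirely.  Here is a minimal sketch, assuming a hypothetical helper `load_or_build` and a stand-in `fake_scrape` function in place of the real HTTP requests (neither name is part of our original code):

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the pickled object at `path`, building and saving it first if missing.

    `load_or_build` and `build` are hypothetical names used for illustration.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data


calls = []  # records how many times the expensive step actually runs


def fake_scrape():
    """Stand-in for the real scraping step."""
    calls.append(1)
    return ["recipe text A", "recipe text B"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "recipes.pickle")
    first = load_or_build(cache_path, fake_scrape)   # builds and saves
    second = load_or_build(cache_path, fake_scrape)  # loads from disk
```

The second call never touches `fake_scrape`, which is the whole point: the website gets hit once, and every later run reads the pickle.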

In [None]:
def get_HTML_extensions(filename = 'recipes.html'):
    """Return a list of the names of the text file extensions from the recipe website."""
    extensions = []
    with open(filename, 'r') as f:
        text = f.read()
    soup = bs(text,"html.parser") # create a beautiful soup object of the given code
    table_list = soup.find_all(href=True)
    for i in table_list:
        if len(i.text) > 3: # ignore the texts that are blank - all the ones
                            # we need are .txt files, so at least 3 chars long
            extensions.append(i.text)
    return extensions    # return the tag name list

We then broke down the recipes into their respective categories: Mexican, Breakfast foods, Fish recipes, etc.

In [None]:
def get_categories():
    ''' Gets all the category variables and saves them in a pickle.'''
  
    if os.path.exists('categories.pickle'): # checks if the pickle file already exists
        print("pickle already here: returning contents")
        with open('categories.pickle','rb') as f:
            categories = pickle.load(f) # load the saved contents 
            return categories
    # otherwise, scrapes the website, pickles the information, and 
    # returns the contents
    else:
        print("pickle not here yet: creating contents")
        type_list = get_HTML_extensions() # contains the extensions for all
                                          # recipes of a specific food type 
        categories = [t[:-4] for t in type_list] # remove the '.txt' from the list names
        # put a space between the words
        categories = [re.sub(r"(?<=\w)([A-Z])", r" \1", c) for c in categories]
        # then put a space between 'and' and the preceding word (if there is an 'and')
        categories = [re.sub(r"(and )", r" \1", c) for c in categories]
     
        with open('categories.pickle','wb') as f:
            pickle.dump(categories,f) # save the contents
            
        return categories
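
To see what the two substitutions in `get_categories` do, here is a small sketch run on made-up file names (the real names come from the website, and `prettify_category` is a hypothetical helper, not part of our original code):

```python
import re


def prettify_category(filename):
    """Turn a file name such as 'FishandSeafood.txt' into 'Fish and Seafood'.

    The file names used below are made-up examples.
    """
    name = filename[:-4]  # drop the '.txt' extension
    name = re.sub(r"(?<=\w)([A-Z])", r" \1", name)  # space before inner capitals
    name = re.sub(r"(and )", r" \1", name)  # space before a fused 'and '
    return name


examples = ["MexicanRecipes.txt", "FishandSeafood.txt", "Breakfast.txt"]
pretty = [prettify_category(e) for e in examples]
```

One caveat of this approach: any word that ends in "and" and is followed by a space gets split too, so a hypothetical `IslandDishes.txt` would come out as "Isl and Dishes".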

$\textbf{2.B Data Collection: Recipe and Ingredient Cleaning}$

Now that we have our recipes, we turn to the task of making the food they contain worth their weight in data.  Literally.  The more of an ingredient, the bigger the impact it will have on the recipe's final numbers.  Here's how we started to get the goods.

We used regex to single out each recipe name, the ingredients in it, and the serving size.

In [None]:
def split_category_to_recipes(category):
    '''Takes in a string of text from a single category of recipes,
       and returns a list of strings containing the recipes contained
       in that category'''
    # create the regex that all the recipes follow, not to give a 'clean'
    # cut-off, but rather to separate one recipe from the next.
    one_recipe_pattern = re.compile(r"\* Exported from MasterCook \*(.+?)Nutr\. Assoc\. : (\d+?)", re.DOTALL)
    batch = one_recipe_pattern.findall(category) # splits the text into its portions,
                               # returned as a list of (recipe, digit) tuples
    # keep only the recipe text from each tuple
    singles = [match[0] for match in batch]
    return singles
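
As a quick check of how the splitting pattern behaves, the sketch below runs it on two made-up, heavily abbreviated recipes (the sample text is an assumption; real exports are much longer):

```python
import re

ONE_RECIPE = re.compile(
    r"\* Exported from MasterCook \*(.+?)Nutr\. Assoc\. : (\d+?)", re.DOTALL
)

# Two made-up recipes, heavily abbreviated.
sample = (
    "* Exported from MasterCook *\n\nApple Crisp\n...\nNutr. Assoc. : 0 0\n"
    "* Exported from MasterCook *\n\nBean Soup\n...\nNutr. Assoc. : 0 0\n"
)

matches = ONE_RECIPE.findall(sample)    # list of (body, digit) tuples
bodies = [body for body, _ in matches]  # keep only the recipe text
```

Because the pattern has two capture groups, `findall` returns tuples, which is why the recipe text has to be unpacked from position 0.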



In [None]:
def get_recipe_info(recipe):
    '''Takes in a recipe string, and uses regex to parse out:
       the Title, ingredients (as a group), and the serving size
       
    '''
    title_pattern = re.compile(r"([A-Za-z]{1}[^\r\n\t\f\v]*)") # take the first match
    ingredients_batch_pattern = re.compile(r"--------------------------------(.+?)[\n\r]{4}", re.S)
    serving_size_pattern = re.compile(r"Serving Size  :\s*(\d*)")
    
    title = title_pattern.search(recipe).group(0) # we only need the first match
    serving_size = serving_size_pattern.findall(recipe)[0]
    ingredients_batch = ingredients_batch_pattern.findall(recipe)[0]
        
    return title, serving_size, ingredients_batch
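
Here is a sketch of the three patterns at work on a single made-up recipe.  The sample layout and its Windows-style (CRLF) line endings are assumptions about the MasterCook export format, inferred from the `[\n\r]{4}` in the ingredients pattern:

```python
import re

TITLE = re.compile(r"([A-Za-z]{1}[^\r\n\t\f\v]*)")
SERVING = re.compile(r"Serving Size  :\s*(\d*)")
# 32 dashes close the column header; a blank CRLF line ends the block.
INGREDIENTS = re.compile("-" * 32 + r"(.+?)[\n\r]{4}", re.S)

# A made-up recipe with Windows (CRLF) line endings.
sample = "\r\n".join([
    "",
    "                      Apple Crisp",
    "",
    "Recipe By     : ",
    "Serving Size  : 6    Preparation Time :0:00",
    "Categories    : Desserts",
    "",
    "  Amount  Measure       Ingredient -- Preparation Method",
    "--------  ------------  " + "-" * 32,
    "       4  cups          apples -- sliced",
    "     1/2  cup           sugar",
    "",
    "Bake until golden.",
])

title = TITLE.search(sample).group(0)        # first run of text starting with a letter
serving_size = SERVING.findall(sample)[0]    # digits after 'Serving Size  :'
ingredient_lines = INGREDIENTS.findall(sample)[0].splitlines()[1:]
```

The `[1:]` drops the empty string left over from the line break that follows the dashes, so `ingredient_lines` holds one entry per ingredient.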

In [None]:
# now to create the dictionary of all this information
# to turn it into a pandas dataframe.
# Note that for now, I leave the ingredients as a list.

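# assumption: split_rs and categories are not defined in the cells shown here;
# we take them to be built from the helper functions above
recipes = get_recipes()
categories = get_categories()
split_rs = [split_category_to_recipes(r) for r in recipes]

# to create the dictionary, it needs to be of the form
# {'col1':list(),'col2':list(),...}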
category_list = []
title_list = []
serving_size_list = []
ingredients_batch_list = []
for category,category_name in zip(split_rs,categories):
    for recipe in category:
        title,serving_size,ingredients_batch = get_recipe_info(recipe)
        
        # append the appropriate elements to create the needed dictionary
        category_list.append(category_name)
        title_list.append(title)
        serving_size_list.append(serving_size)
        ingredients_batch_list.append(ingredients_batch)

In [None]:
df = pd.DataFrame({'category':category_list,
                   'title':title_list,
                   'serving size':serving_size_list,
                   'ingredients batch':ingredients_batch_list})



In [None]:
def get_ingredients(batch):
    '''Takes in the ingredients block of one recipe and returns a list of
       (amount, unit, ingredient string) tuples, one per ingredient line'''
    l = batch.splitlines()[1:]
    # amount (e.g. "1 1/2", "3/4", "2") followed by an optional unit
    measure_pattern = re.compile(r"(\d+ \d/\d|\d+/\d+|\d+)?(?=  ) +?([A-Za-z]+)?(?=  )", re.S)
    #TODO: Add the ability to check out the measurements
    # the ingredient text starts after the 24-character amount/measure columns
    ingredient_info_pattern = re.compile(r".{24}(.*)")
    
    ans = []
    for line in l:
        mini = measure_pattern.findall(line)
        # first non-empty amount and unit on the line, if any
        numb = next((measure[0] for measure in mini if measure[0] != ''), None)
        unit = next((measure[1] for measure in mini if measure[1] != ''), None)
            
        # get the string containing info about the ingredient
        ing_string = ingredient_info_pattern.findall(line)
        ing_string = ing_string[0] if ing_string else None
        ans.append((numb, unit, ing_string))
    return ans
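
The amounts captured above are still strings such as "1 1/2".  One possible way to turn them into numbers is sketched below; `parse_quantity` is a hypothetical helper and was not part of our original pipeline:

```python
from fractions import Fraction


def parse_quantity(text):
    """Convert '2', '3/4' or '1 1/2' to a Fraction; return None for empty input.

    `parse_quantity` is a hypothetical helper, not part of the original pipeline.
    """
    if not text or not text.strip():
        return None
    # Fraction accepts strings such as '1' and '1/2', so sum the parts.
    return sum((Fraction(part) for part in text.split()), Fraction(0))


amounts = [parse_quantity(s) for s in ["2", "3/4", "1 1/2", None]]
as_floats = [float(a) if a is not None else None for a in amounts]
```

Keeping the values as `Fraction` objects until the last moment avoids rounding surprises when a recipe is later scaled by its serving size.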



In [None]:
b = df['ingredients batch']
all_ings = []
for i in range(len(b)):
    all_ings += [group[2] for group in get_ingredients(b[i])]
ingredient_counter = Counter(all_ings)
sorted_ings = sorted(ingredient_counter.items(), key=lambda kv: kv[1],reverse=True)


df_ings = pd.DataFrame(sorted_ings)
# keep the 100 most common ingredients, skipping entries that carry
# preparation notes ('--') or alternatives ('OR')
goods = []
for name in df_ings[0]:
    if len(goods) == 100:
        break
    if isinstance(name, str) and '--' not in name and 'OR' not in name:
        goods.append(name)

$\textbf{2.C Data Collection: Ingredient Nutritional Value Scraping}$

The following function takes in the ingredients, looks them up on nutritionvalue.org, and returns a dictionary with each ingredient as a key and the links of the top 3 results as its value.
If it does not find any results, it stores None as the value for that key.  That's how we roll.

In [None]:
def nutrition_website(ingredients):
    """Use Selenium to enter the given search query into the search bar of
    the nutrition website and get the links to scrape data from.

    Returns:
        (dictionary): maps each ingredient to a list of URLs, or None.
    """
    #initialize variables and chrome
    ingredients_dictionary = {}
    browser = webdriver.Chrome()
    browser.get("https://www.nutritionvalue.org/")
    num_links = 3
    try:
        for i in ingredients:
            try:
                
                search_bar = browser.find_element(By.NAME, 'food_query')
                search_bar.clear()
                search_bar.send_keys(Keys.CONTROL + "a")
                search_bar.send_keys(Keys.DELETE)
                search_bar.send_keys(i)

                search_bar.send_keys(Keys.RETURN)
                
                words = i.split()
                x = ""
                for n,w in enumerate(words):
                    if n == 0:
                        x += r".*(?<!food_query=)"+w+".*|"
                    else:
                        x += r".*(?<!\+)"+w+".*|"
                x = x[:-1]
                find = re.compile(x,re.IGNORECASE)
            
                
                # wait for page to load
                time.sleep(2)
                currentURL = browser.current_url
                if "food_query" in currentURL:
                    
                    links = browser.find_elements(By.TAG_NAME, 'a')
                    links = [link.get_attribute("href") for link in links if isinstance(link.get_attribute("href"),str)]
                    urls = [link for link in links if len(find.findall(link)) > 0]
                    if len(urls) >num_links:
                        ingredients_dictionary[i] = urls[:num_links]
                    elif len(urls) == 0:
                        ingredients_dictionary[i] = None
                    else:
                        ingredients_dictionary[i] = urls
                else:
                    ingredients_dictionary[i] = [currentURL]
                

            except NoSuchElementException:
                print("could not find the search bar!")
                print(i)
                return ingredients_dictionary
    # close window
    finally:
        browser.close()
    # list with all the links
    return ingredients_dictionary
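
The link filter is the least obvious part of `nutrition_website`, so here is a sketch of it on its own.  The URLs are made up for illustration, `build_link_filter` is a hypothetical helper, and `re.escape` is an addition that the original code does not use:

```python
import re


def build_link_filter(ingredient):
    """Build a case-insensitive pattern that keeps links mentioning the ingredient."""
    parts = []
    for n, word in enumerate(ingredient.split()):
        # Skip matches that are merely the echoed query string:
        # the first word follows 'food_query=', later words follow '+'.
        guard = r"(?<!food_query=)" if n == 0 else r"(?<!\+)"
        parts.append(".*" + guard + re.escape(word) + ".*")
    return re.compile("|".join(parts), re.IGNORECASE)


link_filter = build_link_filter("olive oil")

# Made-up URLs: a search page, a result page, and an unrelated page.
search_url = "https://www.nutritionvalue.org/search.php?food_query=olive+oil"
result_url = "https://www.nutritionvalue.org/Oil%2C_olive_nutritional_value.html"
other_url = "https://www.nutritionvalue.org/contact.html"

kept = [u for u in (search_url, result_url, other_url) if link_filter.search(u)]
```

The negative lookbehinds are what keep the search page itself out of the results: there, the ingredient words appear only as part of the query string.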


Once we had the dictionary of ingredients and sources, we ran the following code, which collects the nutritional value of all ingredients and builds a dictionary of dictionaries. The main key is the ingredient and the value is a dictionary containing the nutritional values (e.g. Calories: 300, Protein: 10g).

If there is an error in the process, the code prints the ingredient, the link, and the error. If the error is reported as "IDK", it can be checked manually and the code improved accordingly (this is how we improved our code). The ingredients that have errors are added to "error_items", which can also be used for debugging.

The code also returns a set of all the links that have been used, so that we do not overuse the website's server (the set can be passed as the second argument).



In [None]:
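def nutrition_value(dictionary,set_of_links = None):
    """Takes in a dictionary with ingredients as keys and lists of links as
    values, looks through the websites, and scrapes the nutritional values"""
    if set_of_links is None: # avoid sharing one default set across calls
        set_of_links = set()
    error_items =[]
    df_d = dict()
    browser = webdriver.Chrome()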
    try:
        for k,v in dictionary.items():
            if v is None:
                df_d[k] = {}
            else:    
                for l in v:
                    try:
                        if l is None:
                            print(f"No Website for: {k}")
                            print()
                        elif l in set_of_links:
                            print(f"Duplicates for {k}")
                        else:
                            browser.get(l)
                            time.sleep(5)
                            # name of ingredient
                            name = browser.find_elements(By.TAG_NAME, 'h1')[0]
                            name = name.text

                            #setting up nutritional values
                            nut = dict()
                            c = "tbody"
                            tables = browser.find_elements(By.TAG_NAME, c)
                            #### For essentials [4]
                            ser_cal = tables[4].text.split('\n')
                            # Serving Size
                            nut[ser_cal[1][:12]] = ser_cal[1][13:]
                            #Calories
                            nut[ser_cal[3][:8]] = ser_cal[3][9:]

                            #### For all others [7-13]
                            n_v = re.compile(r'\s*(.*)\s([0-9]+\.[0-9]+\s\w+)')
                            for i in range(7,14):

                                nutrient_value = [n_v.findall(t) for t in tables[i].text.split('\n') if len(t) >0]
                                for t in nutrient_value:
                                    if len(t)>0:
                                        nut[t[0][0]] = t[0][1]

                            df_d[name] = nut
                            set_of_links.add(l)
                        
                    except IndexError as e:
                        error_items.append(k)
                        print(k)
                        print(l)
                        print(e)
                        print()
                    except Exception:
                        error_items.append(k)
                        print(k)
                        print("IDK")
                        print(l)
                        print()
            
    finally:
        browser.close()
    # list with all the links
    
    
    df = pd.DataFrame.from_dict(df_d,'index')
    return df, error_items, set_of_links
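
The nutrient regex does most of the work in `nutrition_value`.  Here is a sketch of it on made-up lines written in the style of a nutrition table's text (the actual page text may differ):

```python
import re

NUTRIENT_LINE = re.compile(r"\s*(.*)\s([0-9]+\.[0-9]+\s\w+)")

# Made-up lines in the style of a nutrition table's text.
table_text = (
    "Protein 10.50 g\n"
    "Total Fat 0.30 g\n"
    "Vitamin C 12.00 mg 20 %\n"
    "Amount Per Serving"
)

nutrients = {}
for line in table_text.split("\n"):
    found = NUTRIENT_LINE.findall(line)
    if found:  # lines without a decimal value are skipped
        name, amount = found[0]
        nutrients[name] = amount
```

Note that the pattern requires a decimal point, so a line with a whole-number value would be skipped rather than parsed.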

    

$\textbf{2.D Data Collection: Ingredient Nutritional Value Cleaning}$

Basic cleaning includes taking care of any columns that are singletons for a data point, changing texts to floats, and documenting units and converting them to similar units.

In [None]:
# floats for calories
df = ds
df["Calories"] = df['Calories'].astype('float')/100
df = df.drop(columns = ["18","Adjusted Protein"])
# changing the entries to floats and keeping track of units
clean_columns = list(df.columns)
other_units = []
g_units = []


for j,c in enumerate(clean_columns):
    if c ==  "Calories":
        
        pass
    else:    
        # float values
        n = re.compile(r"(^\d*\.?\d*)\s(\w+)")
        
        ### replace missing values (NaN) with 0; to undo, change 0 back to np.nan
        num = np.array([float(n.findall(i)[0][0]) if isinstance(i,str) else 0 for i in df[c].values])
        mes = np.array([n.findall(i)[0][1] if isinstance(i,str) else 'g' for i in df[c].values])

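        # measurements to grams (1 mg = 1e-3 g, 1 mcg = 1e-6 g), keeping the
        # same division by 100 that is applied to Calories above
        mask_mg = (mes == 'mg')*1e-5
        mask_mcg = (mes == "mcg")*1e-8
        mask_g = (mes == "g")*.01
        mask = mask_mg + mask_mcg +mask_g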
        

        mask += (mask==0)*-1

        if sum(mask < 0) >1 :
            df[c] = num/100
            other_units.append(c)

        else:
            df[c] = num*mask
            g_units.append(c)

# add the units to the column names
df.rename(columns = {c:c+ " (g)" for c in g_units if c!= "Calories"},inplace = True)
df.rename(columns = {c:c+ " (IU)" for c in other_units}, inplace = True)
df.head()
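
The unit handling is easier to follow one value at a time.  Below is a minimal sketch that converts a single entry to grams, leaving out the division by 100 used in the cell above; `to_grams` and `DIVISOR` are hypothetical names, and the sample strings are made up:

```python
import re

AMOUNT = re.compile(r"(^\d*\.?\d*)\s(\w+)")
# How many of each unit make up one gram.
DIVISOR = {"g": 1, "mg": 1_000, "mcg": 1_000_000}


def to_grams(text):
    """Parse a string like '250 mg' into grams; return None for other units."""
    found = AMOUNT.findall(text)
    if not found:
        return None
    number, unit = found[0]
    if unit not in DIVISOR or number in ("", "."):
        return None
    return float(number) / DIVISOR[unit]


samples = ["1.5 g", "250 mg", "20.00 mcg", "100 IU"]
in_grams = [to_grams(s) for s in samples]
```

Entries in units that cannot be converted to grams (such as IU) come back as `None`, which mirrors how the cell above sets those columns aside in `other_units`.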

$\textbf{3. Data Visualization}$

Now we have our final, epic, humungo sized dataframe!  What should we ask it?

"Oh Oracle, tell me: from the many ingredients that we have, is there a correlation among the nutritional contents they contain?"

<img src="corr_m.png" style="width: 500px;"/>

Oracle says not really...

Ok.  How about this:
"Oh Oracle, tell me: I fear getting fat over Christmas.  From what should I abstain?"

<img src="fat.png" style="width: 500px;"/>

Oracle clearly has prohibited popcorn and Easter food... That's ok.  It's Christmas time, and I like my gingerbread houses.

I also want to get swole.  Gains matter.  What should I beef up on? (Pun very much intended)

<img src="Protein_comp.png">

Looks like I should stick to seafood... doesn't seem very seasonal though.  (Unless I lived in sea-attle!)

Ok, well, being that I'm only going to eat Beef, Pork, or Poultry for my protein source, and that some cousin of mine might be vegetarian now, I'll just compare those options.  They look relatively similar overall... I wonder if they have a different makeup?  Wiki-legend and sources confirm that protein is made of amino acids... let's see how they compare!

<img src="amino_acids.png" style="width: 600px;"/>

That's funny... they look about the same!  I wonder how those values compare to the average of all the recipes we've looked at.

<img src="AA_ovarall.png" style="width: 500px;"/>

And there you have it.  The same pattern yet again.  Well, if in proportion, my amino acid count isn't going to differ much, I'll probably stick with crawfish.  Like I said, 'Gains'.

$\textbf{4. Conclusion}$

Many new discoveries. Have crawfish for your gains, and lose some weight over your pork or lamb holiday meal.

Eat, drink, and be merry.

Merry Christmas :)