# Motivation and Overview of Data:

## Project Purpose:

We want to compare recipes of different types by looking for trends in their nutritional value.


## Objective Questions:

What kind of protein-based meal gives me the most protein per serving?

What kind of recipe gives the largest 'healthy fat' to 'unhealthy fat' ratio? (And vice versa)

Is there a correlation between the protein, mineral, and vitamin contents per serving of a recipe?


## Background knowledge and resources: 

#### What has been done

There are already many resources available online to help you search for recipes.  Many of these let you search by the type of recipe you want (comfort food, quick and easy, Mexican, etc.), and some let you search based on the ingredients you have in your fridge, so that the user can say "I have milk, eggs, raisins, chicken, and rice. What can I make?"  However, we did not find a readily available recipe search database that gives the nutritional value of its recipes.

#### How this is different

In this project, we are interested in creating a table that will help us sort through recipes based on nutritional value so that in a future project we can work on finding a set of recipes to match specific nutritional needs.  The table will enable us to answer the above questions. 

#### Resources

We will create the table with the information about the nutritional values of recipes by starting with two specific resources.  

The first resource pulls from recipe text files that used to be offered at the Recipe Library at MasterCook.com.  It is a website that contains a list of links, where each link leads to a text file containing many recipes that all fit in the same 'category' of food.  This is the resource we will be scraping in order to collect a reasonable sample of recipes.  

The second resource is a website called nutritionvalue.org (NutritionValue) that we will use to search for the nutritional content of the ingredients used in the recipes mentioned above.  NutritionValue allows us to search, for example, "banana", which will then return a list of links of all the various brands of bananas that it has in the database.  Each one of these links then leads to a page that contains all the nutritional information (calorie content, protein content, vitamins, minerals etc.) of that particular banana selection.

$$ Website\ Containing\ Recipe\ Collection\ Text\ Files $$ | $$ Website\ to\ Collect\ Individual\ Ingredient\ Nutrition $$ 
- | -
![title](recipe_website.png) | ![title](nutrition_website.png)


We will use these two resources to collect the needed information to create a table that will help us answer our objective questions.

#### Validity of Resources

It is important to note that both of these resources are appropriate for the question at hand.  

The first is valid because it contains recipes that were created by various users and covers a wide variety of dishes, so we expect the sample of recipes to be reasonably well distributed.  It is also a suitable source because it offers a particularly easy way to download a large number of recipes with consistent formatting.

The second resource is valid because it pulls its data from the USDA National Nutrient Database for Standard Reference.  It has a broad range of ingredients available to search, and is also very detailed in its nutritional content breakdown, which makes it suitable for this project.


## Table Construction Overview:


In order to answer the objective questions, we need to create a table containing the relevant information. We will construct this table in the following format:

|Category|Recipe|Serving Size|Salt|Sugar|...|Ingredient n|
|------|------|------|------|------|------|------|
|Chowders|Clam Chowder|4|...|...|...|...|


Each recipe will have an associated category, serving size, and ingredient columns.  Each ingredient column shown here actually represents a 'batch' of many columns that contain: one column for how much of the ingredient is present, and a column for how much of each nutritional property (protein content, vitamin content, Calorie count etc) is contained per 100 grams of that ingredient.
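
To make this layout concrete, here is a minimal sketch of how one row could be turned into per-serving nutrient totals. The ingredient names, quantities, and nutrient figures are made-up placeholders rather than scraped values, and `per_serving` is a hypothetical helper that is not part of the project code.

```python
# Hypothetical row: grams of each ingredient used in the recipe.
quantities_g = {"milk": 480.0, "clams": 300.0}

# Hypothetical nutrient content per 100 g of each ingredient.
per_100g = {
    "milk": {"protein_g": 3.0, "calories": 60.0},
    "clams": {"protein_g": 15.0, "calories": 90.0},
}

def per_serving(quantities_g, per_100g, servings):
    """Sum each nutrient over all ingredients, then divide by the serving count."""
    totals = {}
    for ingredient, grams in quantities_g.items():
        for nutrient, amount in per_100g[ingredient].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + amount * grams / 100.0
    return {nutrient: total / servings for nutrient, total in totals.items()}

result = per_serving(quantities_g, per_100g, servings=4)
```

Storing the nutrient values per 100 grams keeps the scraped numbers unchanged, and the scaling by quantity happens only when a recipe total is computed.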

# Data Collection and Cleaning:

In order to form the desired table, we first need to collect the recipes and their associated ingredient nutritional information.  The process for this is broken down into two steps: recipe collection and ingredient information collection.

## Recipe Collection Procedure:

### Link Collection

In order to collect the information for the recipes, we first need to scrape all the links that lead to the text files of each category.  Each link leads to a text file that contains a list of recipes, as in the picture on the left below.  Each recipe consists of a title, ingredients, and an explanation of how to 'make' the recipe.  The only portion of each recipe we will be using is the quantitative portion, an example of which is given on the right.

$$ Portion\ of\ one\ Text\ File $$ | $$ Example\ Recipe $$ 
- | -
<img src="recipesamples.png" width="800" height="400">| <img src="recipeexample.png" width="800" height="400">

We therefore first scrape the website for the appropriate links, and then scrape each link for the list of recipes it contains.  Here is some code that does that:

![title](recipe_scrape_code.png)
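
Since the code above is only available as an image, the following is a rough sketch of the link-collection step using only the standard library. It assumes the category links end in `.txt` and works on an already-downloaded HTML string; the class name `TextFileLinkParser`, the sample HTML, and the `example.com` address are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class TextFileLinkParser(HTMLParser):
    """Collect (title, absolute URL) pairs for every link that ends in .txt."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".txt"):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a matching <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical page fragment standing in for the downloaded index page.
sample_html = """
<a href="/about.html">About</a>
<a href="files/chowders.txt">Chowders</a>
<a href="files/brownies.txt">Brownies</a>
"""
parser = TextFileLinkParser("https://example.com/recipes/")
parser.feed(sample_html)
category_links = parser.links
```

Keeping the link text next to each URL means the category names can later be read directly from the collected pairs.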

### Create Recipe Categories

We then use the titles of each scraped link to make a list of the category names.

![title](find_rcategories_code.png)

In all, there are 85 different categories, and they have names such as 'Alcoholic Beverages', 'All Appetizer Recipes', 'All Beverage Recipes', 'All Bread Recipes', 'All Breakfast Recipes', ..., 'Biscuits and Scones', 'Bread Machine Recipes', 'Brownies',... and many more.  

### Separate Individual Recipes

We then separate each category's single long string of recipes into a list of strings each containing one recipe.  We use Regex to do this in the following manner:

![title](separate_recipes_code.png)
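
As a text version of the idea, here is a minimal sketch of the splitting step. It assumes every recipe in a category file starts with a marker line reading `* Exported from MasterCook *`; that marker, the sample text, and the helper name `split_recipes` are assumptions for illustration and may differ from the code in the image.

```python
import re

# Assumed marker line that opens each recipe in the export.
RECIPE_HEADER = re.compile(r"^\s*\* Exported from MasterCook \*\s*$", re.MULTILINE)

def split_recipes(category_text):
    """Split one category's text into a list of single-recipe strings."""
    pieces = RECIPE_HEADER.split(category_text)
    # Drop the empty text before the first header and trim whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]

sample = """
        * Exported from MasterCook *

        Clam Chowder
Serving Size  : 4

        * Exported from MasterCook *

        Corn Chowder
Serving Size  : 6
"""
recipes = split_recipes(sample)
```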

### Get Ingredient Info

Now we take each recipe and use Regex to separate out the title, serving size, and 'ingredients batch' (which is a string containing the ingredients, their quantities, and units of measurement).

![title](get_recipe_info_code.png)
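
The extraction can be sketched as follows. The recipe layout in `sample_recipe` (field labels, dashed rule above the ingredient lines) is an assumed example, and `get_recipe_info` is a hypothetical helper, so the patterns would need adjusting to the real files.

```python
import re

# Assumed layout of a single recipe string; the field labels are illustrative.
sample_recipe = """Clam Chowder

Recipe By     :
Serving Size  : 4    Preparation Time :0:00
Categories    : Chowders

  Amount  Measure       Ingredient -- Preparation Method
--------  ------------  --------------------------------
       2  cups          milk
   1 1/2  pounds        clams -- chopped

Simmer the milk, then add the clams.
"""

SERVING = re.compile(r"Serving Size\s*:\s*(\d+)")
# Everything between the dashed rule and the first blank line after it.
INGREDIENTS = re.compile(r"^-{4,}[ -]*\n(.*?)(?:\n\s*\n|\Z)", re.MULTILINE | re.DOTALL)

def get_recipe_info(recipe):
    """Return (title, serving size, ingredients batch), or None if a part is missing."""
    title = recipe.strip().split("\n", 1)[0].strip()
    serving = SERVING.search(recipe)
    batch = INGREDIENTS.search(recipe)
    if not title or serving is None or batch is None:
        return None  # misformatted recipe; the caller can drop it
    return title, int(serving.group(1)), batch.group(1)

info = get_recipe_info(sample_recipe)
```

Returning `None` for a recipe that lacks a serving size or an ingredient block gives a simple way to count and drop entries that do not follow the expected format.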

### Create First Basic Table

Now we want to create a table that uses the data that we are able to successfully scrape.  

![title](create_recipe_df_code.png)

Here is the table that we end up with:

![title](basic_table_head.png)

### Dropping Misformatted Recipes

Note that we drop some of the recipes because they don't fit the expected formatting. There are few enough of them that dropping them does not cost much in the overall effort of creating a table that represents the nutritional values of various recipes.  It was interesting to note, however, that the category with the most 'erroneous' recipes was 'Sourdough bread': there were multiple instances where people would give 'hints' about how to care for the bread, rather than actually providing a recipe!

#### $$ Example\ of\ poorly\ formatted\ recipe $$ 

![title](bad_recipe.png)

### Parsing the Ingredient Information

We now need to go through the recipe data and separate it into its constituent parts: the ingredient name, the quantity, and the unit.  Here is a sample of code used to do that:

![title](separate_ingredients_code.png)
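
A text sketch of this parsing step is given here. It assumes each ingredient line has an optional quantity (such as `2`, `1/2`, or `1 1/2`), an optional unit, and then the name, with any preparation note after `--`. The `UNITS` set is a small illustrative list and `parse_ingredient` is a hypothetical helper, not the code from the image.

```python
import re
from fractions import Fraction

# Optional quantity ("1/2", "2", "1.5", "1 1/2"), then the rest of the line.
LINE = re.compile(r"^\s*(?P<qty>\d+/\d+|\d+(?:\.\d+)?(?:\s+\d+/\d+)?)?\s*(?P<rest>.*)$")
# Small illustrative unit list; the real data contains many more spellings.
UNITS = {"cup", "cups", "teaspoon", "teaspoons", "tablespoon", "tablespoons",
         "pound", "pounds", "ounce", "ounces"}

def parse_ingredient(line):
    """Split one ingredient line into (quantity, unit, name)."""
    match = LINE.match(line)
    qty_text, rest = match.group("qty"), match.group("rest")
    # "1 1/2" becomes 1 + 1/2; a missing quantity stays None.
    quantity = float(sum(Fraction(part) for part in qty_text.split())) if qty_text else None
    words = rest.split("--")[0].split()  # drop the preparation note after "--"
    unit = words.pop(0) if words and words[0].lower() in UNITS else None
    return quantity, unit, " ".join(words)
```

For example, `parse_ingredient("   1 1/2  pounds        clams -- chopped")` separates the mixed fraction, the unit, and the name, while a line such as `salt to taste` comes back with no quantity and no unit.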

### Normalize Ingredient Names

Next we need to have a way of normalizing the ingredient names.  This is because many ingredients are presented slightly differently.  For example, brown sugar, dark brown sugar, lightly-packed brown sugar, and Brn. sugar could all be reasonably categorized into the same ingredient class: "brown sugar."

In order to do this, we first normalized all the ingredients to make them lowercase.  

We then created 'class' categories by looking at the most common 200 ingredient types and creating a category for each of them.  We selected the top 200 categories because the 200th most common ingredient is only used in about 12 of the roughly 5,500 recipes we scraped from, and there were over 10,000 ingredients total.  Thus we decided the tail of ingredients that were used very little could be cut off, and the majority of the informational content would still be observed in the table we are creating. 

We ultimately created a dictionary that maps each raw ingredient name to the class to which it belongs. (See the pertinent code in the Appendix.)

For the ingredients that did not fit into any class, we put their information in an 'OTHER' class, allowing us to cleanly sort through the data.
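
The mapping idea can be illustrated with a simplified sketch. It uses plain word containment instead of the fuzzy matching used in the Appendix code, so it would not catch abbreviations such as "Brn. sugar". The names `build_class_map`, `raw_names`, and `classes` are hypothetical.

```python
from collections import Counter

def build_class_map(raw_names, classes):
    """Map each raw ingredient name to the first class whose words all appear in it."""
    mapping = {}
    for raw in raw_names:
        words = set(raw.lower().replace("-", " ").split())
        mapping[raw] = next(
            (cls for cls in classes if set(cls.split()) <= words), "OTHER"
        )
    return mapping

raw_names = ["Brown Sugar", "dark brown sugar", "lightly-packed brown sugar",
             "sugar", "saffron threads"]
# More specific classes come first so "brown sugar" wins over "sugar".
classes = ["brown sugar", "sugar"]
class_map = build_class_map(raw_names, classes)
class_counts = Counter(class_map.values())
```

Counting the mapped classes afterwards shows how many raw names ended up in 'OTHER', which is a quick check on how much information the cutoff discards.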

### Create Second Basic Table

With the categories created, we can now create the basic layout of our recipe-ingredients table.  The following is the head of this dataframe, with the ingredients separated out, but the nutritional information of each ingredient not yet present.

![title](sorted_dataframe_head.png)

## Ingredient Information Collection Procedure:

Now that we have the individual ingredients, we can start collecting nutritional data.

The scraper we built for this task searches for each ingredient on nutritionvalue.org and takes the first 3 available links.

Ex: Oats

![title](nutrition_value_query.png)

In [1]:
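## CODE: Scraper 1
import re
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

def nutrition_website(ingredients):
    """Use Selenium to enter each search query into the search bar of the
    nutrition website and collect the links to scrape data from.

    Parameters:
        ingredients (iterable of str): ingredient names to search for.

    Returns:
        (dictionary): maps each ingredient to a list of urls (or None).
    """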
    #initialize variables and chrome
    ingredients_dictionary = {}
    browser = webdriver.Chrome()
    browser.get("https://www.nutritionvalue.org/")
    num_links = 3
    try:
        for i in ingredients:
            try:
                #navigate
                search_bar = browser.find_element_by_name('food_query')
                search_bar.clear()
                search_bar.send_keys(Keys.CONTROL + "a")
                search_bar.send_keys(Keys.DELETE)
                search_bar.send_keys(i)

                search_bar.send_keys(Keys.RETURN)
                
                words = i.split()
                x = ""
                for n,w in enumerate(words):
                    if n == 0:
                        x += ".*(?<!food_query=)"+w+".*|"
                    else:
                        x += r".*(?<!\+)"+w+".*|"
                x = x[:-1]
                find = re.compile(x,re.IGNORECASE)
            
                
                # wait for page to load
                time.sleep(2)
                currentURL = browser.current_url
                if "food_query" in currentURL:
                    
                    links = browser.find_elements_by_tag_name('a')
                    links = [link.get_attribute("href") for link in links if isinstance(link.get_attribute("href"),str)]
                    urls = [link for link in links if len(find.findall(link)) > 0]
                    if len(urls) >num_links:
                        ingredients_dictionary[i] = urls[:num_links]
                    elif len(urls) == 0:
                        ingredients_dictionary[i] = None
                    else:
                        ingredients_dictionary[i] = urls
                else:
                    ingredients_dictionary[i] = [currentURL]
                

            except NoSuchElementException:
                print("could not find the search bar!")
                print(i)
                return ingredients_dictionary
    # close window
    finally:
        browser.close()
    # list with all the links
    return ingredients_dictionary


These links are collected in a dictionary, with each ingredient as a key and its list of links as the corresponding value.
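
The link filter inside `nutrition_website` can be looked at in isolation. The sketch below rebuilds the same pattern in a helper, here called `build_link_filter` (a hypothetical name), and additionally escapes each word with `re.escape`, which the scraper above does not do. The URLs are made-up examples and are not guaranteed to match the real site.

```python
import re

def build_link_filter(ingredient):
    """Compile a pattern matching links that mention any word of the ingredient,
    except where that word is part of the search query string itself."""
    parts = []
    for n, word in enumerate(ingredient.split()):
        # The first query word follows "food_query="; later words follow "+".
        lookbehind = r"(?<!food_query=)" if n == 0 else r"(?<!\+)"
        parts.append(".*" + lookbehind + re.escape(word) + ".*")
    return re.compile("|".join(parts), re.IGNORECASE)

# Hypothetical URLs, for illustration only.
links = [
    "https://www.nutritionvalue.org/search.php?food_query=brown+sugar",
    "https://www.nutritionvalue.org/Sugars%2C_brown_nutritional_value.html",
    "https://www.nutritionvalue.org/about.php",
]
link_filter = build_link_filter("brown sugar")
kept = [link for link in links if link_filter.search(link)]
```

The negative lookbehinds are what keep the search-results URL itself out of the list, since that URL always contains the query words.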

Then we are ready to get the nutritional value!

Continuing with our oats example, these are a few of the tables from which we got our nutrition data.

![title](nutrition_values_oats.png)

In [None]:
# code for scraper
def nutrition_value(dictionary, set_of_links=None):
    """Take a dictionary with ingredients as keys and lists of links as values,
    visit each link, and scrape the nutritional values."""
    if set_of_links is None:  # a mutable default would be shared between calls
        set_of_links = set()
    error_items = []
    df_d = dict()
    browser = webdriver.Chrome()
    try:
        for k, v in dictionary.items():
            if v is None:
                df_d[k] = {}
            else:    
                for l in v:
                    try:
                        if l is None:
                            print(f"No Website for: {k}")
                        elif l in set_of_links:
                            print(f"Duplicates for {k}")
                        else:
                            browser.get(l)
                            time.sleep(5)
                            # name of ingredient
                            name = browser.find_elements_by_tag_name('h1')[0]
                            name = name.text

                            #setting up nutritional values
                            nut = dict()
                            c = "tbody"
                            tables = browser.find_elements_by_tag_name(c)
                            #### For essentials [4]
                            ser_cal = tables[4].text.split('\n')
                            # Serving Size
                            nut[ser_cal[1][:12]] = ser_cal[1][13:]
                            #Calories
                            nut[ser_cal[3][:8]] = ser_cal[3][9:]

                            #### For all others [7-13]
                            n_v = re.compile(r'\s*(.*)\s([0-9]+\.[0-9]+\s\w+)')
                            for i in range(7,14):
                                nutrient_value = [n_v.findall(t) for t in tables[i].text.split('\n') if len(t) >0]
                                for t in nutrient_value:
                                    if len(t)>0:
                                        nut[t[0][0]] = t[0][1]

                            df_d[name] = nut
                            set_of_links.add(l)
                        
                    except IndexError as e:
                        error_items.append(k)
                        print(f"ingredient:{k}, error: {e}, link: {l}")
                    except Exception:  # do not swallow KeyboardInterrupt or SystemExit
                        error_items.append(k)
                        print(f"ingredient:{k}, error: IDK, link: {l}")
            
    finally:
        browser.close()
    # build a DataFrame with one row per scraped food
    df = pd.DataFrame.from_dict(df_d,'index')
    return df, error_items, set_of_links

The scraper printed any errors it ran into, along with the link that caused them. We inspected each link to make sure we weren't losing any important information. All of the printed links pointed to other parts of the website that had been picked up because of the words we were searching for.

In order not to abuse the website, we also added an argument that skips a link if it has already been scraped.

Once we had collected the data, we analyzed it for errors. Two columns, "18" and "Adjusted Protein", had only one value across all the ingredients, so we dropped those columns.

The rest of the data was cleaned by converting all values to floats, converting values to grams (g), and rescaling the serving size to 1 g for all ingredients.

Vitamins A and D were a special case because they were measured in International Units (IU), so we converted them to grams.

Lastly, we engineered columns for total minerals and total vitamins for future investigation of trends in nutrition, such as whether you can get all the nutrition you need from certain foods or recipes.
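
As a worked illustration of the unit cleaning, here is a minimal sketch that converts a single scraped value. It assumes values are reported per 100 g of food and written as a number followed by `g`, `mg`, or `mcg`; the helper name `to_grams_per_gram` is hypothetical, and IU values are deliberately left unconverted here.

```python
import re

# Grams per unit of mass.
GRAMS_PER_UNIT = {"g": 1.0, "mg": 1e-3, "mcg": 1e-6}
VALUE = re.compile(r"^(\d*\.?\d+)\s*(\w+)$")

def to_grams_per_gram(text, serving_g=100.0):
    """Convert a string such as '12.5 mg' (per `serving_g` grams of food)
    into grams of nutrient per 1 g of food. Returns None for unknown units."""
    match = VALUE.match(text.strip())
    if match is None or match.group(2) not in GRAMS_PER_UNIT:
        return None
    amount, unit = float(match.group(1)), match.group(2)
    return amount * GRAMS_PER_UNIT[unit] / serving_g
```

Under these assumptions, `"12.5 mg"` per 100 g becomes 12.5 × 0.001 / 100 grams of nutrient per gram of food, and a value such as `"100 IU"` returns `None` so that it can be handled by a separate IU conversion.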

Code:

In [None]:
import re
import numpy as np

# floats for calories
df = ds
df["Calories"] = df['Calories'].astype('float')/100
df = df.drop(columns = ["18","Adjusted Protein"]) # contains singleton

for j,c in enumerate(list(df.columns)):
    if c ==  "Calories":     
        pass
    else:    
        n = re.compile(r"(^\d*\.?\d*)\s(\w+)")     # float values
        
        ### replacing NaN with 0; to go back, change 0 to np.nan
        num = np.array([float(n.findall(i)[0][0]) if isinstance(i,str) else 0 for i in df[c].values])
        mes = np.array([n.findall(i)[0][1] if isinstance(i,str) else 'g' for i in df[c].values])

        # measurements: convert to grams, then from a 100 g serving to a 1 g serving
        mask_mg = (mes == 'mg')*1e-5
        mask_mcg = (mes == "mcg")*1e-8
        mask_g = (mes == "g")*.01
        mask = mask_mg + mask_mcg +mask_g
        mask += (mask==0)*-1

        if sum(mask < 0) >1 :
            df[c] = num/100
        else:
            df[c] = num*mask

df["Vitamin A"] *= 0.6/1000000
df["Vitamin D"] *= 0.025/1000000

vitamins=['Choline','Niacin','Pantothenic acid','Riboflavin','Thiamin',
          'Vitamin A','Vitamin B12','Vitamin B6','Vitamin C','Vitamin D','Vitamin E','Vitamin K']
minerals = ['Calcium, Ca','Copper, Cu','Iron, Fe',    'Magnesium, Mg',
            'Manganese, Mn','Phosphorus, P','Potassium, K','Selenium, Se','Sodium, Na','Zinc, Zn']
# append new features
df['Vitamins'] = df[vitamins].sum(axis=1)
df['Minerals'] = df[minerals].sum(axis=1)
df.head()

Now that both data sets are clean and contain the engineered feature columns, we merged them to form a large sparse dataframe (see the Appendix).

One of our biggest challenges was sorting through the different quantities and converting them to grams. We have come up with a temporary solution that converts abstract measurements into the equivalent in grams for the most used ingredient. 

Example:

In [None]:
conv = {c:[] for c in set(u.values())}
# grams per unit (liter, milliliter, and gallon assume a density of 1 g/mL)
conv['gram'] = 1
conv['liter'] = 1000
conv["gallon"] = 3785.41
conv['lbs'] = 453.592
conv['milliliter'] = 1
conv['oz']= 28.35
conv['kilogram'] = 1000
conv['tablespoon'] = 14.3

We are still searching for a better solution to this problem. Nevertheless, we proceed with the analysis so that when an improvement to this method is found we can get better results. 
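
One small safeguard that fits the current approach is to make the lookup tolerant of units that have no conversion factor yet. The sketch below is an illustration under the same water-density assumption for volume units. `GRAMS_PER_UNIT` and `quantity_in_grams` are hypothetical names and are not part of the project code.

```python
# Grams per one unit of measure. Volume units assume a water-like density of
# 1 g/mL, which is only a rough stand-in for real ingredients.
GRAMS_PER_UNIT = {
    "gram": 1.0,
    "kilogram": 1000.0,
    "oz": 28.35,
    "lbs": 453.592,
    "milliliter": 1.0,
    "liter": 1000.0,
}

def quantity_in_grams(quantity, unit, table=GRAMS_PER_UNIT):
    """Return the quantity in grams, or None when the unit is not in the table."""
    factor = table.get(unit)
    return None if factor is None else quantity * factor
```

Returning `None` for an unknown unit (for example a "pinch") lets the caller decide whether to skip the ingredient or flag the recipe, instead of failing with a `KeyError`.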

# Code Quality and Robustness:

# Data Visualization and Analysis:

![title](hefty_head.png)


# Appendix

#### Code for creating Ingredient Classes

In [None]:
# next singularize all the words in all_ingredients using the inflect package
ingredient_class_dict1 = []
stripped_ingredients = []
p = inflect.engine() # this will be used to singularize plural words
for ingredient_pair in ingredient_class_dict:
    # for each string, strip the non alphanumeric chars and
    # singularize the word if it is plural, then rejoin the string
    alph = re.compile(r'[\W_]+')
#     word = ingredient_pair[1]
    word_list = [p.singular_noun(alph.sub('', word))
                 if p.singular_noun(alph.sub('', word)) 
                 else 
                 (alph.sub('', word) if word not in not_foods
                  else None)                 
                 for word in ingredient_pair[1].split()]
#     if not p.singular_noun(alph.sub('', word)) and word in not_foods: # not necessary: just seeing what I'm throwing out.
#         print(ingredient_pair[0])
    no_nones = list(filter(None, word_list))
    word = " ".join(map(str,no_nones))
    # set the revised word as the second element in the pair
    ingredient_class_dict1.append((ingredient_pair[0],word))
    # also I'll save the word here so that I can sort through all words easily
    if word not in not_foods:
        stripped_ingredients.append(word)
stripped_ingredients[:10]    

In [None]:
def get_ingredient_classes(sorted_ings):
    ''' Takes in a list of sorted ingredients,
     and gets all the ingredient 'head' variables and saves them in a pickle.
     Returns a list of sublists, where the first element of each sublist is the 'head' 
     variable, and the other elements are the variables that fall under its 'ingredient type'.
     '''
    if os.path.exists('ingredient_name_groups.pickle'): # checks if the pickle file already exists
        print("pickle already here: returning contents")
        with open('ingredient_name_groups.pickle','rb') as f:
            name_groups = pickle.load(f) # load the saved contents 
            return name_groups
    # otherwise, builds the name groups, pickles them, and 
    # returns the contents
    else:
        print("pickle not here yet: creating contents")
        name_groups = []
        check_list = set()
        j = 0
        while len(name_groups) < 200:
            current_item = sorted_ings[j]
            if current_item not in check_list:
                check_list.add(current_item)
                # use fuzzywuzzy to appropriately add elements to the same list as the 'head' element
                name_groups.append([sorted_ings[i][0] for i in range(j,2000) if fuzz.partial_ratio(current_item,sorted_ings[i]) > 76])
            j += 1

        with open('ingredient_name_groups.pickle','wb') as f:
            pickle.dump(name_groups,f) # save the contents

        return name_groups  

#### Code for hefty DataFrame

In [None]:
# gets mask for ingredients (CHECK MASK BEING USED)
col = list(neal_df.columns) 
mask_class = np.array([i for i in col if i[:6] == "class_"])

In [None]:

hefty_columns = [nut_element+"_"+category[6:] for category in mask_class for nut_element in df.columns]
hefty_columns += ["Total "+ nut_element for nut_element in df.columns]

In [None]:
hefty_df = pd.DataFrame(columns=hefty_columns)
num_recipies = len(list(neal_df.index))
for num, recipe in enumerate(neal_df.index):
    r = neal_df.iloc[recipe]
    r_contents = r[mask_class].values != None
    ingredients = r[mask_class][r_contents].values
    hefty_df.loc[len(hefty_df)] = 0
    if num% 200 == 0:
        print(f"{100*num/num_recipies:.1f}%")
    for ing in ingredients:
        quantity = st_to_fl(r['quant_' + ing])
        unit_factor = conv[r['unit_' + ing]]
        conversion = quantity*unit_factor  
#         print(conversion)
        most_similar_ing = process.extractOne(ing,df.index)[0]
#         print(ing)
        col = list(hefty_df.columns) 
        hefty_mask = np.array([i for i in col if i[-(len(ing)+1):] == "_"+ing])
        for m,i,c in zip(hefty_mask, df.loc[most_similar_ing].values,df.columns):
            value = i*conversion
            hefty_df.loc[num, m] = value
            hefty_df.loc[num, "Total " + c] += value


#### Code for making similar units the same by human input (for first time seen only)

In [None]:
# changing units by hand
# u = dict()
# for i,rec in enumerate(neal_df[mask].values):
#     for j,mes in enumerate(rec):
#         if mes in u:
#             pass
#         else:
#             print(neal_df[mask].columns[j])
#             print(mes)
#             u[mes] = input()
#             print()