By Neal Munson and Isaac Becker

# Motivation and Overview of Data:

## Project Purpose:

We want to compare recipes of different types by looking for trends in their nutritional value.


## Objective Questions:

What kind of protein-based meal gives me the most protein per serving?

What kind of recipe gives the largest 'healthy fat' to 'unhealthy fat' ratio?

Is there a correlation between the protein, mineral, and vitamin contents per serving of a recipe?


## Background knowledge and resources: 

#### What has been done

There are already many resources online to help you search for recipes.  Many of these let you search by the type of recipe you want (comfort food, quick and easy, Mexican, etc.), and some let you search by the ingredients you have in your fridge, allowing the user to say "I have milk, eggs, raisins, chicken, and rice. What can I make?"  However, we found no readily available recipe search databases that give the nutritional value of the recipes.

#### How this is different

In this project, we are interested in creating a table that will help us sort through recipes based on nutritional value so that in a future project we can work on finding a set of recipes to match specific nutritional needs.  The table will enable us to answer the above questions. 

#### Resources

We will create the table with information about the nutritional values of recipes by starting with two specific resources.  

The first resource pulls from recipe text files that used to be offered at the Recipe Library at MasterCook.com.  It is a website that contains a list of links, where each link leads to a text file containing many recipes that all fit in the same 'category' of food.  This is the resource we will be scraping from to collect a reasonable sample of recipes.  

The second resource is a website called nutritionvalue.org (NutritionValue) that we will use to search for the nutritional content of the ingredients used in the recipes mentioned above.  NutritionValue allows us to search, for example, "banana", which will then return a list of links of all the various brands of bananas that it has in the database.  Each one of these links then leads to a page that contains all the nutritional information (calorie content, protein content, vitamins, minerals, etc.) of that particular banana selection.

![](websites.png)


We will use these two resources to collect the needed information to create a table that will help us answer our objective questions.

#### Validity of Resources

It is important to note that both of these resources are appropriate for the question at hand.  

The first is valid because it contains recipes that were created by various users, and offers a variety of recipes.  Therefore, the sample of recipes is well distributed.  It is also a suitable source because it offers a particularly easy way to download a large number of recipes with consistent formatting.

The second resource is valid because it pulls its data from the USDA National Nutrient Database for Standard Reference.  It has a broad range of ingredients available to search and is also very detailed in its nutritional content breakdown, which makes it suitable for this project.


## Table Construction Overview:


To answer the objective questions, we need to create a table containing the relevant information. We will construct this table after the following format:

|Category|Recipe|Serving Size|Salt|Sugar|...|Ingredient n|
|------|------|------|------|------|------|------|
|Chowders|Clam Chowder|4|...|...|...|...|


Each recipe will have an associated category, serving size, and ingredient columns.  Each ingredient column shown here actually represents a 'batch' of many columns that contain: one column for how much of the ingredient is present, and a column for how much of each nutritional property (protein content, vitamin content, Calorie count, etc.) is contained per 100 grams of that ingredient.

# Data Collection and Cleaning:

In order to form the desired table, we first need to collect the recipes and their associated ingredient nutritional information.  The process for this is broken down into two steps: recipe collection and ingredient information collection.

## Recipe Collection Procedure:

### Link Collection

To collect the information for the recipes, we need to first scrape all the links that will lead to the text files of each category.  Each link contains a text file that looks like a list of recipes such as the picture on the left below.  The recipes consist of the title, ingredients, and an explanation portion on how to 'make' the recipe.  The portion of each recipe we will be using is just the quantitative portion, for which an example is given at the right.

![](2textimg.png)

We therefore first scrape the website for the appropriate links, and then scrape each link for the list of recipes it contains.  Here is some code that does that:

![](split1.png)

![](split2.png)
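
As a rough text-form sketch of the link-collection step (the original code appears only in the images above), the following uses Python's standard `html.parser` to pull links to `.txt` files, together with their titles, out of an index page. The sample HTML, the file names, and the `LinkCollector` class are hypothetical stand-ins; the real page layout may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for anchors that point at .txt files."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".txt"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a .txt anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical index page; the real page would be downloaded first.
sample_html = """
<ul>
  <li><a href="chowders.txt">Chowders</a></li>
  <li><a href="brownies.txt">Brownies</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
category_links = collector.links
```

The link titles collected this way could also serve as the category names used in the next step.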

### Create Recipe Categories

We then use the titles of each scraped link to make a list of the category names.

![](find_rcategories_code.png)

In all, there are 85 different categories, and they have names such as 'Alcoholic Beverages', 'All Appetizer Recipes', 'All Beverage Recipes', 'All Bread Recipes', 'All Breakfast Recipes', ..., 'Biscuits and Scones', 'Bread Machine Recipes', 'Brownies',... and many more.  

### Separate Individual Recipes

We then separate each category's single long string of recipes into a list of strings each containing one recipe.  We use Regex to do this in the following manner:

![](separate_recipes_code.png)
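
A minimal sketch of this splitting step, assuming every recipe in a category file begins with a `* Exported from MasterCook *` banner line. That delimiter, the `split_recipes` name, and the sample text are assumptions for illustration, not the code from the image above.

```python
import re

# Assumed delimiter: each recipe starts with a MasterCook export banner line.
RECIPE_START = re.compile(
    r"^[ \t]*\* Exported from MasterCook \*[ \t]*$", re.MULTILINE
)


def split_recipes(category_text):
    """Split one category's text into a list of single-recipe strings."""
    chunks = RECIPE_START.split(category_text)
    # Drop the empty chunk before the first banner and trim whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = (
    "   * Exported from MasterCook *\n\nClam Chowder\n\nServing Size  : 4\n"
    "   * Exported from MasterCook *\n\nCorn Chowder\n\nServing Size  : 6\n"
)
recipes = split_recipes(sample)
```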

### Get Ingredient Info

Now we take each recipe and use Regex to separate the title, serving size, and 'ingredients batch' (a string containing the ingredients, their quantities, and their units of measurement).

![](get_recipe_info_code.png)
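
The extraction could look roughly like the sketch below. It assumes the title is the first non-empty line, that the serving size follows a `Serving Size :` label, and that the ingredient rows sit between a dashed rule and the next blank line. The function name, the patterns, and the sample recipe are illustrative assumptions.

```python
import re

SERVING_RE = re.compile(r"Serving Size\s*:\s*(\d+)")
# A line made only of dashes and spaces, assumed to sit under the column headers.
RULE_RE = re.compile(r"^[ \t]*-{2,}[ \t-]*$", re.MULTILINE)


def get_recipe_info(recipe_text):
    """Return (title, serving_size, ingredients_batch), or None if misformatted."""
    lines = [ln.strip() for ln in recipe_text.splitlines() if ln.strip()]
    serving = SERVING_RE.search(recipe_text)
    rule = RULE_RE.search(recipe_text)
    if not lines or serving is None or rule is None:
        return None
    # The ingredient batch runs from the rule to the first blank line.
    after_rule = recipe_text[rule.end():].lstrip("\n")
    batch = after_rule.split("\n\n", 1)[0]
    return lines[0], int(serving.group(1)), batch


sample = """Clam Chowder

Recipe By     :
Serving Size  : 4    Preparation Time :0:00
Categories    : Chowders

  Amount  Measure       Ingredient -- Preparation Method
--------  ------------  --------------------------------
       2  cups          potatoes -- diced
   1 1/2  teaspoons     salt

Combine and simmer.
"""

info = get_recipe_info(sample)
```

Returning `None` for text that lacks the expected fields gives a simple way to flag recipes that do not fit the format.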

### Create a First Basic Table

Now we want to create a table that uses the data that we are able to successfully scrape.  

![](create_recipe_df_code.png)

Here's the table that we end up with here:

![](basic_table_head.png)

### Dropping Misformatted Recipes

Note that we drop some of the recipes because they don't fit the correct formatting, and there are few enough of them that they don't pose too much of a loss to the overall effort of creating a table that represents the nutritional values of various recipes.  It was interesting to note however that the category with the most 'erroneous' recipes was that of 'Sourdough bread' - there were multiple instances where people would give 'hints' about how to care for the bread, rather than actually providing a recipe!

#### $$ Example\ of\ poorly\ formatted\ recipe $$ 

![](bad_recipe.png)

### Parsing the Ingredient Information

We now need to go through the recipe data and separate it into its constituent parts: separate the ingredient name from the quantity from the unit.  Here is a sample of code used to do that:

![](screen3.png)



![](split4.png)
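
In text form, one way to split an ingredient row is sketched below. It assumes the columns are separated by runs of two or more spaces and that any preparation note follows ` -- `. The names `QUANTITY_RE` and `parse_ingredient_line` are hypothetical, and the code in the images above may work differently.

```python
import re

# Quantity: an integer, decimal, or mixed number ("1 1/2"), or a bare fraction.
QUANTITY_RE = re.compile(r"^\d+(?:\.\d+)?(?:\s+\d+/\d+)?$|^\d+/\d+$")


def parse_ingredient_line(line):
    """Split one row into (quantity, unit, name); missing parts are None."""
    # Columns are assumed to be separated by two or more spaces.
    parts = re.split(r"\s{2,}", line.strip())
    quantity = unit = None
    if parts and QUANTITY_RE.match(parts[0]):
        quantity = parts.pop(0)
    if len(parts) > 1:
        unit = parts.pop(0)
    # Drop the preparation note after " -- " and lowercase the name.
    name = " ".join(parts).split(" -- ")[0].strip().lower()
    return quantity, unit, name


row_a = parse_ingredient_line("       2  cups          potatoes -- diced")
row_b = parse_ingredient_line("   1 1/2  teaspoons     salt")
row_c = parse_ingredient_line("       3                Eggs")
```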

### Normalize ingredient names

Next, we need to have a way of normalizing the ingredient names.  This is because many ingredients are presented slightly differently.  For example, brown sugar, dark brown sugar, lightly-packed brown sugar, and Brn. sugar could all be reasonably categorized into the same ingredient class: "brown sugar."

To do this, we first normalized all the ingredients to make them lowercase.  

We then created 'class' categories by looking at the most common 200 ingredient types and creating a category for each of them.  We selected the top 200 categories because the 200th most common ingredient is only used in about 12 of the roughly 5,500 recipes we scraped from, and there were over 10,000 ingredients total.  Thus we decided the tail of ingredients that were used very little could be cut off, and the majority of the informational content would still be observed in the table we are creating. 

We ultimately created a dictionary that mapped the raw ingredient name to the class to which it pertains. (See pertinent Code in the Appendix)

For the ingredients that did not fit into any class, we put their information in an 'OTHER' class, allowing us to cleanly sort through the data.
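
The normalization idea can be sketched as follows. This version counts lowercase names to find the most common ones and then assigns each raw name to the longest class name it contains, which is an assumed matching rule. Abbreviations such as "Brn. sugar" would still need hand-written entries, and both function names are hypothetical.

```python
from collections import Counter


def most_common_names(names, n):
    """Return the n most frequent lowercase ingredient names."""
    counts = Counter(name.lower() for name in names)
    return [name for name, _ in counts.most_common(n)]


def build_class_map(raw_names, classes):
    """Map each raw name to the longest class it contains, else 'OTHER'."""
    # Try longer class names first so "brown sugar" wins over "sugar".
    ordered = sorted(classes, key=len, reverse=True)
    mapping = {}
    for raw in raw_names:
        name = raw.lower()
        mapping[raw] = next((c for c in ordered if c in name), "OTHER")
    return mapping


raw = ["Brown Sugar", "dark brown sugar", "sugar", "saffron threads"]
class_map = build_class_map(raw, ["sugar", "brown sugar"])
```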

### Create a Second Basic Table

With the categories created, we can now create the basic layout of our recipe-ingredients table.  The following is the head of this dataframe, with the ingredients separated, but the nutritional information of each ingredient not yet present.

![](sorted_dataframe_head.png)

## Ingredient Information Collection Procedure:

### Link Collection/Error Handling

Now that we have the individual ingredients we can start collecting nutritional data. The site that we used is okay with us using their data as long as we are not going to sell it.

The scraper we built for this task searches each ingredient on nutritionvalue.org and takes the first three available links. We take three because the top link might not be what we are looking for but, since results are ordered by relevance, the right one is highly likely to be in the top three.

Example of query with Oats:

![](nutrition_value_query.png)

These links are collected in a dictionary with the ingredient as its respective value. The code is able to handle common errors like "no results", "less than 3 links", and "can't find search bar".

![](isaac_code1.png)
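
To illustrate the error handling without touching the live site, here is a sketch in which the actual query is replaced by a hypothetical `search` callable. `collect_links`, `fake_search`, and the link paths are invented for the example; only the top-three rule and the three error cases come from the description above.

```python
def collect_links(ingredients, search, max_links=3):
    """Return {link: ingredient} for up to max_links results per ingredient.

    `search` is a hypothetical stand-in for the real site query: it takes an
    ingredient name and returns a relevance-ordered list of links.
    """
    links = {}
    for ingredient in ingredients:
        try:
            results = search(ingredient)
        except Exception as err:  # e.g. the search bar could not be found
            print(f"search failed for {ingredient!r}: {err}")
            continue
        if not results:
            print(f"no results for {ingredient!r}")
            continue
        # Slicing copes with fewer than max_links results.
        for link in results[:max_links]:
            links[link] = ingredient
    return links


def fake_search(name):
    """Toy search used only to exercise the function."""
    catalogue = {
        "oats": ["/oats-raw", "/oats-rolled", "/oat-bran", "/oat-milk"],
        "saffron": ["/saffron"],
    }
    if name == "broken":
        raise RuntimeError("search bar not found")
    return catalogue.get(name, [])


links = collect_links(["oats", "saffron", "unicorn", "broken"], fake_search)
```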


### Nutritional Values

Now we are ready to get the nutritional values.

Each ingredient link contains tables of nutritional properties and their quantities. Here is an example of the content of the link related to "Oats".

![](nutrition_values_oats.png)

The scraper will go through the dictionary of links and collect the name, serving size, macronutrients (protein, fat, carbohydrates), micronutrients (vitamins and minerals), and any other facts available in the tables.

The code prints out any links that cause an error to help minimize the number of missing ingredients. There were a few links with errors, but we inspected them and made sure we weren't losing any important information. They turned out to be links that were not related to food but rather to other parts of the website that were at times included in our query results.

In order to not abuse the website's information we also added an argument that skips the search if it has already been pulled.

![](split5.png)

![](split6.png)
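
The skip-if-already-pulled behaviour and the error printing could be sketched like this. The `fetch` callable, the `cache` dictionary, and the placeholder values are assumptions for illustration and stand in for the real page download and parsing.

```python
def pull_nutrition(links, fetch, cache):
    """Fetch each link once; skip cached links and report failures."""
    failed = []
    for link in links:
        if link in cache:  # already pulled, so do not hit the site again
            continue
        try:
            cache[link] = fetch(link)
        except Exception as err:
            print(f"error on {link}: {err}")
            failed.append(link)
    return failed


calls = []


def fake_fetch(link):
    """Toy fetcher that records calls and fails on a non-food page."""
    calls.append(link)
    if link == "/contact":
        raise ValueError("no nutrition table")
    return {"Protein": "1 g"}  # placeholder value, not real data


cache = {"/oats-raw": {"Protein": "1 g"}}
failed = pull_nutrition(["/oats-raw", "/oats-rolled", "/contact"], fake_fetch, cache)
```

Because `cache` is passed in, it could be saved between runs so that a restarted scrape only requests pages it has not yet seen.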

### Cleaning Nutritional Data

Once we collected the data we analyzed it for errors. There were two columns that had only one value out of all the ingredients - the number "18" and "adjusted Protein". For this reason, we dropped those columns. 

The rest of the data was cleaned by making all values floats, converting values to grams (g) and making serving sizes be 1g for all ingredients.

Vitamins A and D were a special case because they were measured in International Units (IU), so we converted them to grams.
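
A sketch of these conversions is given below. The IU factors are assumptions about the vitamin form: 0.3 mcg per IU treats vitamin A as retinol, and 0.025 mcg per IU treats vitamin D as cholecalciferol. The input format (`"4.7 mg"`), the function names, and the dictionary names are also illustrative.

```python
GRAMS_PER_UNIT = {"g": 1.0, "mg": 1e-3, "mcg": 1e-6}

# Grams per IU. Assumptions: vitamin A counted as retinol (0.3 mcg per IU),
# vitamin D counted as cholecalciferol (0.025 mcg per IU).
GRAMS_PER_IU = {"Vitamin A": 0.3e-6, "Vitamin D": 0.025e-6}


def to_grams(nutrient, text):
    """Convert a string such as '4.7 mg' or '40 IU' to grams as a float."""
    amount, unit = text.split()
    if unit == "IU":
        return float(amount) * GRAMS_PER_IU[nutrient]
    return float(amount) * GRAMS_PER_UNIT[unit]


def per_gram(value_in_grams, serving_size_g):
    """Rescale a per-serving value to a 1 g serving."""
    return value_in_grams / serving_size_g


iron_g = to_grams("Iron", "4.7 mg")
vitamin_d_g = to_grams("Vitamin D", "40 IU")
```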

Lastly, we engineered columns for all minerals and all vitamins for future investigation of trends in nutrition, such as whether you can get all the nutrition you need from certain foods or recipes.

We create a dataframe with this adjusted information, which we then title 'nutritional_value'.

![](isaac_code4.png)


### Merging the Dataframes

Now that both data sets are clean and contain featured columns, we begin the merging process to produce a giant sparse dataframe where the rows are recipes and the columns are the nutritional properties of every ingredient (ing). Example:

|recipe|ing1_calories|ing1_protein |... | ing1_minerals| ing2_calories|...|
|-|-|-|-|-|-|-|

### Normalizing quantities

In the transition to one dataframe we encountered one of our biggest challenges thus far. We need to sort through the quantities of each ingredient and convert them to grams. Unfortunately, we have not found a comprehensive list of food densities online, and the measurements are not consistent on the recipe website. We have created a temporary solution that converts abstract measurements into their equivalent in grams for the most used ingredient, and we map words that mean the same thing to a uniform name (e.g., tbs, tablesp, TBS -> tablespoon; see the appendix for code).

Small Example of the unit converter:

![](isaac_code5.png)
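
As a text-form sketch of the synonym mapping, the snippet below normalizes a raw unit string and then looks up its gram factor. The three gram factors are copied from the appendix dictionary; the synonym table and the function names are hypothetical.

```python
# Gram factors taken from the appendix conversion dictionary.
GRAMS_PER = {"tablespoon": 14.3, "teaspoon": 4.77, "cup": 201}

# Hypothetical synonym table; the real one was built by hand.
UNIT_SYNONYMS = {
    "tbs": "tablespoon",
    "tablesp": "tablespoon",
    "tbsp": "tablespoon",
    "tsp": "teaspoon",
}


def normalize_unit(raw):
    """Map spelling variants of a unit to one uniform name."""
    key = raw.strip().lower().rstrip(".")
    if key in UNIT_SYNONYMS:
        return UNIT_SYNONYMS[key]
    # Treat a simple plural ("cups") as its singular form when known.
    if key not in GRAMS_PER and key.endswith("s") and key[:-1] in GRAMS_PER:
        return key[:-1]
    return key


def quantity_in_grams(quantity, raw_unit):
    """Convert a quantity in the given unit to grams."""
    return quantity * GRAMS_PER[normalize_unit(raw_unit)]
```

For example, `quantity_in_grams(2, "Cups")` uses the appendix factor of 201 grams per cup.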


We are continually searching for a solution to the conversion problem. Nevertheless, we proceed with the analysis so that when an improvement to this method is found we can get better results.

There are multiple steps to build what we are calling "hefty_df". For each recipe, we find the ingredients it requires and look up the units and quantities. For the quantities, we built a function that takes in strings of the form "1/2, 5 3/2, None" and returns the float equivalent, or 1 for None (which represents "whole"). Each ingredient is then matched to an entry in our nutritional_value dataframe using the Levenshtein distance, which measures the similarity of two words by counting how many edits (for example, adding or dropping a letter) are needed to change one word into the other.  We then multiply the quantity, the unit's gram factor (from our unit converter), and the nutritional value of the ingredient and place the result into hefty_df. Lastly, we sum the values of each property, enter the sum into the "Total (respective property)" column, and repeat the process (see the appendix for code).

![](hefty_head.png)
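
For readers unfamiliar with the Levenshtein distance, here is a small self-contained implementation and a nearest-name lookup built on it. It is a sketch of the idea only: the appendix code calls a library function (`process.extractOne`), which ranks candidates by a similarity score rather than by the raw distance used here.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest(name, candidates):
    """Return the candidate with the smallest edit distance to name."""
    return min(candidates, key=lambda c: levenshtein(name, c))


best = closest("oat", ["oats", "goat cheese", "salt"])
```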

We built this sparse dataframe for use in next semester's project. For the analysis here, we use only the columns with the total nutritional values, together with the nutritional_value dataframe.

# Data Visualization and Analysis

### Answering the Questions

#### What kind of protein-based meal gives me the most protein per serving?

To answer this question we simply take each category that represents a protein-based meal and compare their protein content.  As can be seen below, seafood-based recipes contain a greater amount of protein per serving than other recipe types.

![](protein_graph.png) 
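
The comparison behind a chart like this can be sketched as a per-category mean of total protein divided by serving size. The function name, the tuple layout, and the toy numbers are assumptions for illustration and are not values from our data.

```python
from collections import defaultdict


def mean_protein_per_serving(recipes):
    """recipes: iterable of (category, total_protein_g, serving_size)."""
    per_category = defaultdict(list)
    for category, total_protein, servings in recipes:
        if servings:  # skip recipes with a missing or zero serving size
            per_category[category].append(total_protein / servings)
    return {c: sum(v) / len(v) for c, v in per_category.items()}


# Made-up rows, used only to show the calculation.
toy_rows = [
    ("Category A", 120.0, 4),
    ("Category A", 90.0, 3),
    ("Category B", 100.0, 5),
    ("Category B", 50.0, 0),  # ignored: serving size of zero
]
averages = mean_protein_per_serving(toy_rows)
```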

#### What kind of recipe gives the largest 'healthy fat' to 'unhealthy fat' ratio?

There is an important distinction between the different fat values printed on food products. Our body needs good fats ("Polyunsaturated fatty acids" and "Monounsaturated fatty acids") but not the bad ones ("Saturated fatty acids" and "Trans fats").

The left figure below shows the recipes with the highest good/bad fat ratio. These are the recipes that would be recommended if someone set the constraint of having more good fats than bad. These results look reasonable, since they tend to be recipes that use leaner meat and/or are considered "healthier".

The figure on the right shows the recipes with the lowest good/bad fat ratio. These are the recipes you might want to abstain from making or eating to stay healthy. Once again we see values that we expect, like cake, pasta, and breakfast foods. Asparagus likely appears there because it is often cooked with butter, cheese, or oils.

It is important to note that each bar on these graphs represents the average of many recipes of a certain class, each with many ingredients.

![](isaac_2graphs.png) 

We chose to remove the following categories from the graph above:
- Alcoholic beverage - Most of the ingredients are not well represented since they are too specific and were put in the "other" ingredients pile
- Jam - Likely there are low levels of fat which would make the ratio extreme depending on one or two recipes
- Dog Biscuits - The reader can know that they are good for their dog 
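
The ratio itself can be sketched as below, with the two good fats in the numerator and the two bad fats in the denominator. Returning `None` when a recipe has no bad fat is an assumed way of handling the unstable ratios mentioned for categories like Jam; `fat_ratio` is a hypothetical name.

```python
def fat_ratio(poly, mono, saturated, trans):
    """Return (poly + mono) / (saturated + trans), or None if undefined."""
    bad = saturated + trans
    if bad == 0:
        return None  # ratio is undefined when there is no bad fat
    return (poly + mono) / bad


example_ratio = fat_ratio(poly=2.0, mono=4.0, saturated=2.0, trans=1.0)
```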



#### Is there a correlation between the protein, mineral, and vitamin contents per serving of a recipe?

The last question we would like to address is whether there is any correlation between protein, vitamins, and minerals in our list of ingredients. If so, then we know that choosing one highly nutritious ingredient will provide more balance to your diet.

As seen below, the highest correlation is between protein and minerals, which makes sense since meat contains iron. The lowest is between vitamins and minerals, which is probably why they are sold separately as supplements.

![](corr_m.png)
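
Each cell of a correlation matrix like this one is a pairwise correlation coefficient. Assuming the Pearson coefficient is the one used (the text above does not say), a single cell can be computed as in this sketch; the input lists are toy values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
```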

# Appendix

### Code for hefty DataFrame

In [None]:
# changing units by hand (only first time seen)
u = dict()
for i,rec in enumerate(neal_df[mask].values):
    for j,mes in enumerate(rec):
        if mes in u:
            pass
        else:
            print(neal_df[mask].columns[j])
            print(mes)
            u[mes] = input()
            print()

In [None]:
# Unit Conversion dictionary
# grams per unit
conv = dict()
conv['gram'] = 1
conv['liter'] = 1000
# a US gallon is 3785.41 ml (the reciprocal factor would shrink quantities)
conv["gallon"] = 3785.41
conv['lbs'] = 453.592
conv['milliliter'] = 1
conv['oz']= 28.34
conv['kilogram'] = 1000
conv['tablespoon'] = 14.3
conv['teaspoon'] = 4.77
conv["cup"] = 201
conv["ear"] = 92
conv['clove'] = 7
conv['pinch'] = 0.36
conv['quart'] = 946.353
conv['pint'] = 473.176
conv['envelope'] = 7.085
conv['None'] = 0
conv['dash'] = 0.72
conv['head'] = 539
conv['stick'] = 113
conv['package'] = 7.085
conv['small'] = 75
conv['medium'] = 150
conv['large'] = 225
conv['stalk'] = 50
conv['strip'] = 10
conv['square'] = 56.7
conv['box'] = 382.59
conv['whole'] = 100
conv['bag'] = 453.59
conv['sprig'] = 30
conv['bulb'] = 30
conv['slice'] = 5
conv['bunch'] = 120
conv['part'] = 1
conv['cube'] = 57

In [None]:
# gets mask for ingredients
import numpy as np

col = list(neal_df.columns)
mask_class = np.array([i for i in col if i.startswith("class_")])

In [None]:
# create the columns for Hefty
hefty_columns = [nut_element+"_"+category[6:] for category in mask_class for nut_element in df.columns]
hefty_columns += ["Total "+ nut_element for nut_element in df.columns]

In [None]:
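# Turning strings to floats
from fractions import Fraction

def st_to_fl(s):
    # None implies "whole", i.e. one item
    if s is None:
        return 1.0
    try:
        # plain numbers such as "2" or "0.5"
        return float(s)
    except ValueError:
        # fractions and mixed numbers such as "1/2" or "1 1/2":
        # sum the whitespace-separated parts
        return float(sum(Fraction(c) for c in s.split()))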

In [None]:
# building hefty
import pandas as pd
# `process` is assumed to come from fuzzywuzzy; the original import is not shown
from fuzzywuzzy import process

hefty_df = pd.DataFrame(columns=hefty_columns)
num_recipes = len(neal_df.index)

# for each recipe
for num, recipe in enumerate(neal_df.index):
    # look the row up by label and mask the ingredient slots that are not None
    r = neal_df.loc[recipe]
    r_contents = r[mask_class].values != None  # elementwise mask

    # get the values that are not None and initialize the row with zeros
    ingredients = r[mask_class][r_contents].values
    hefty_df.loc[len(hefty_df)] = 0

    # progress report as a percentage
    if num % 200 == 0:
        print(f"{100 * num / num_recipes:.1f}%")

    # for each ingredient in the recipe
    for ing in ingredients:
        # change the quantity to float and convert units to grams
        quantity = st_to_fl(r['quant_' + ing])
        unit_factor = conv[r['unit_' + ing]]
        conversion = quantity * unit_factor

        # find the most similar ingredient in the nutrition dataframe
        most_similar_ing = process.extractOne(ing, df.index)[0]

        # match the columns of hefty with the ingredient
        col = list(hefty_df.columns)
        hefty_mask = np.array([i for i in col if i.endswith("_" + ing)])

        # for each column with the ingredient multiply nutrition value by conversion
        # and add to total column
        for m, i, c in zip(hefty_mask, df.loc[most_similar_ing].values, df.columns):
            value = i * conversion
            # single .loc call so the assignment writes to hefty_df itself
            hefty_df.loc[num, m] = value
            hefty_df.loc[num, "Total " + c] += value
