# NUTRITION FROM WEB-SCRAPED RECIPES

<><><> PICTURE <><><>

# DATASET AND MOTIVATION
**How**  
This dataset was collected by scraping a selection of online cooking websites and matching each recipe's ingredients to nutrition data from the USDA's Nutrient Data Laboratory.

**Why**  
We thought that this project presented a pop-culture focus, demonstrated some proof-of-concept potential in nutritional health and research, and was both challenging and fun.

**Metadata**
- Scraped Data
    - BeautifulSoup and open source recipe scraping code
    - 155,876 recipes with url, recipe title, total time, and each ingredient as a separate column

- USDA Nutrition Data 
    - Most major food items in the American diet
    - Common nutrition markers and experimental/analytic data
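
The scraped layout described above (URL, recipe title, total time, then one ingredient per column) can be read with the standard `csv` module. This is a minimal sketch, not the project's own loader: `parse_recipe_row` and the sample row are hypothetical, and it assumes that trailing empty columns are only padding.

```python
import csv
import io

def parse_recipe_row(row):
    """Split one scraped row into its fixed fields and a list of ingredients."""
    url, title, total_time, *ingredient_cells = row
    # Rows are padded with empty cells up to the widest recipe; drop the padding.
    ingredients = [cell.strip() for cell in ingredient_cells if cell.strip()]
    return {"url": url, "title": title, "total_time": total_time,
            "ingredients": ingredients}

# Hypothetical row in the same shape as the scraped file.
sample = ('https://example.com/recipe/pear-salad,"Pear Salad, Simple",15,'
          '"2 pears, sliced",1 cup pecans,,,\n')

recipes = [parse_recipe_row(row) for row in csv.reader(io.StringIO(sample))]
print(recipes[0]["ingredients"])  # ['2 pears, sliced', '1 cup pecans']
```

When reading from disk, opening the file with `encoding="utf-8-sig"` strips a leading byte-order mark if the file has one.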

# RESEARCH QUESTIONS  
<br>  
**PROBLEM**: Nutritional information is not widely available on recipe websites. Automating this process would let websites offer that information to consumers and create a better user experience.
<br>  
**INPUT**: Raw web-scraped recipe data.
<br>  
**OUTPUT**: USDA nutritional information.

<><><> PICTURE <><><>

# DATA CLEANING/TRANSFORMATION

### RECIPE WEB-SCRAPING

blah

### USDA Nutrient Parsing

- Parsing quantity and weight information using keywords
    - Differentiating unique Unicode characters such as '½' and accounting for every numeric combination that can indicate quantity and weight.
- Removal of stop words
- Truncation of each ingredient to match an entry in the USDA name file
    - Which heuristic to use?
        - Highest number of matches?
        - Lowest number of words?
        - **_"The correct item is the one that (1) contains the highest number of matches and (2) has the shortest length if there is a tie for the highest number of matches."_**
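
The matching heuristic above can be sketched as follows. This is an illustration under stated assumptions, not the code in `get_nutrition_demo.py`: the stop-word list, the four USDA-style names, the crude singularization, and the choice to measure "length" in words are all hypothetical.

```python
import re

# Hypothetical stop words and USDA-style descriptions, for illustration only.
STOP_WORDS = {"a", "an", "of", "the", "fresh", "chopped", "divided",
              "cup", "cups", "teaspoon", "teaspoons", "tablespoon", "tablespoons"}

USDA_NAMES = [
    "Onions, raw",
    "Onions, dehydrated flakes",
    "Oil, olive, salad or cooking",
    "Nuts, pecans",
]

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words, strip a plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    # Crude singularization so "onion" can meet "Onions".
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in kept]

def best_match(ingredient, names=USDA_NAMES):
    """Pick the name with the most shared words; break ties by fewest words."""
    wanted = set(tokenize(ingredient))

    def rank(name):
        name_words = tokenize(name)
        matches = len(wanted & set(name_words))
        return (-matches, len(name_words))

    best = min(names, key=rank)
    # Report no match at all rather than an arbitrary zero-overlap name.
    return best if wanted & set(tokenize(best)) else None

print(best_match("1 cup chopped onion"))  # Onions, raw
```

Both onion entries share one word with the ingredient, so the shorter name wins the tie. Names that tie on both criteria fall back to list order here, which is one place a wrong match can slip through.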

### USDA Nutrient Parsing (Unusual Incidents)  
Many edge cases came up when parsing numeric descriptors. Two six-and-a-half-ounce steaks may be written in many ways. For example:
- 2 6 1/2 ounce steaks
- 2, 6½ ounce steaks
- 2 6 ½ oz. steaks  

Making sure you don't end up with 26 steaks or 2 ounces of steak required EDGE CASES GALORE.
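
One way to handle the three spellings above is to normalize Unicode fractions first and then fold a fraction into the whole number before it. The sketch below is an assumption-laden illustration rather than the project's parser: `parse_quantity`, the `VULGAR` and `UNITS` tables, and the rule "two leading numbers plus a unit means count and size" are hypothetical.

```python
import re
from fractions import Fraction

# Hypothetical lookup tables; a real parser would need many more entries.
VULGAR = {"½": "1/2", "⅓": "1/3", "⅔": "2/3", "¼": "1/4", "¾": "3/4"}
UNITS = {"oz": "ounce", "ounce": "ounce", "ounces": "ounce",
         "lb": "pound", "pound": "pound", "pounds": "pound"}
NUMBER = re.compile(r"\d+/\d+|\d+(?:\.\d+)?")

def parse_quantity(text):
    """Return (count, amount_each, unit) from the leading numbers, or None."""
    for char, ascii_form in VULGAR.items():
        text = text.replace(char, " " + ascii_form)        # "6½" -> "6 1/2"
    text = re.sub(r"(?<=\d)-(?=[A-Za-z])", " ", text)      # "1/2-pound" -> "1/2 pound"
    tokens = re.split(r"[\s,]+", text.strip())

    numbers = []
    while tokens and NUMBER.fullmatch(tokens[0]):
        value = Fraction(tokens.pop(0))
        # A proper fraction right after a whole number completes a mixed number.
        if numbers and value < 1 and numbers[-1].denominator == 1:
            numbers[-1] += value
        else:
            numbers.append(value)

    unit = UNITS.get(tokens[0].lower().rstrip(".")) if tokens else None
    if len(numbers) == 2 and unit:
        count, amount = numbers          # 2 items of 6.5 ounces each
    elif len(numbers) == 1:
        count, amount = Fraction(1), numbers[0]
    else:
        return None                      # leave anything else for manual review
    return float(count), float(amount), unit

for text in ["2 6 1/2 ounce steaks", "2, 6½ ounce steaks", "2 6 ½ oz. steaks"]:
    print(parse_quantity(text))          # (2.0, 6.5, 'ounce') each time
```

Returning `None` for shapes the rules do not cover keeps unexpected inputs visible instead of silently guessing a quantity.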
<br>  
Often an incorrect match would satisfy the heuristic's properties. For example, _"1 clove of garlic"_ may match either _"garlic, raw"_ or _"clove, spice"_ in the USDA nutrition database, leading to a possible error.

# DEMONSTRATION

In [1]:
# Input scraped data
# from bonappetit.com ("Rack of Venison Stuffed with Pecans, Currants, Sausage, and Pears")

! curl -O https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/get_nutrition_demo.py
! curl -O https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/get_nutrition_vars.py
! curl -O https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/recipe_output_demo.csv
# get_nutrition_demo.py opens USDA_Nutrition_DataSet/FOOD_DES.txt, so keep the repository's
# folder layout when downloading (NUT_DATA.txt is assumed to be read from the same folder)
! curl --create-dirs -o USDA_Nutrition_DataSet/FOOD_DES.txt https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/USDA_Nutrition_DataSet/FOOD_DES.txt
! curl --create-dirs -o USDA_Nutrition_DataSet/NUT_DATA.txt https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/USDA_Nutrition_DataSet/NUT_DATA.txt

from get_nutrition_demo import *

ingredients_df = GetNutrition("recipe_output_demo.csv")

https://www.bonappetit.com/recipe/rack-of-venison-stuffed-with-pecans-currants-sausage-and-pears,"Rack of Venison Stuffed with Pecans, Currants, Sausage, and Pears",0,"5 tablespoons olive oil, divided",1 cup chopped onion,"4 ounces sweet Italian sausages, casings removed",Roasted Bosc Pears,"1/2 cup pecans, toasted",1/3 cup dried currants,1 teaspoon chopped fresh rosemary,1,"1 2 1/2-pound rack of venison, frenched ,",1,"2 medium onions, thinly sliced","2 heads of garlic, cloves separated, root ends trimmed, unpeeled",6 fresh rosemary sprigs,1 bunch fresh thyme,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

In [4]:
import re
# !pip install pandas

In [None]:
ingredients_df

# STATISTICS

In [None]:
# WEB-SCRAPED RECIPES

In [None]:
# USDA NUTRITION BY RECIPE

# Analysis tools imported from python script in repository
! curl -O https://raw.githubusercontent.com/nemasobhani/Nutrition-from-Web-Scraped-Recipes/master/NS_analysis.py
from NS_analysis import *

# Stats by Recipe
RecipeAnalyze(analysis=True)

In [None]:
# Stats by Ingredient
IngredientAnalyze(analysis=True)

In [None]:
# Fun Facts
Factoids()

# VISUALIZATION

In [None]:
# FOR NAOMI!!!

In [None]:
# Nutrient Visualization by Recipe
RecipeAnalyze(plot=True)

In [None]:
# Nutrient Visualization by Ingredient
IngredientAnalyze(plot=True)