# Nutrition from Web-Scraped Recipes
### Nema Sobhani and Naomi Goodnight
#### Data Science Tools I: Final Project, Winter 2019

## Dataset and Motivation

**How**  
This dataset was collected by web scraping a selection of online cooking websites and associating their ingredients with nutrition data from the USDA's Nutrient Data Laboratory.

**Why**  
We thought that this project presented a pop-culture focus, demonstrated some proof-of-concept potential in nutritional health and research, and was both challenging and fun.

**Meta Data**
- Scraped Data
    - This dataset started its life on various recipe database websites. Scraper code built on BeautifulSoup, with open-source code as a base, produced an initial CSV of 155,876 lines listing the URL, recipe title, total time, and each ingredient as a separate column. Recipe titles or ingredients that contained commas were surrounded by double quotes.

- USDA Nutrition Data 
    - These data contained most major food items in the American diet, along with common nutrition markers and experimental/analytic data.
    - Typical attributes include:
        - Food group
        - Name
        - Preparation
        - Nutrient data / 100 grams (Calories, Protein, Carbs, Fat, etc.)
        


## Task Definition/Research Question

As consumers shift away from prepackaged foods with clearly labeled nutritional content, and as cook-it-yourself meal-delivery services grow in popularity, culinary websites face growing demand to provide nutritional information for home-cooked recipes.

To meet this demand, **we developed an automated process that returns nutritional information from an input of raw web-scraped recipe data**. This would allow websites to update their entire catalog of recipes to include nutritional data from the USDA and offer a significant advantage in the growing market for home cooking.

This would also allow the website to conduct broad surveys of tendencies in nutritional content, ingredient frequencies, and health trends. 

<br>
  
**Future Directions**  
As we enter our next phase, there are several areas of interest that could be incorporated into our project to both increase its accuracy and widen its scope. One focus would be to scrape more attributes, particularly location data and insights into ethnic cuisine. Incorporating machine learning techniques would offer the ability to create generic grocery lists based on the unique profile of a site. Natural language processing could be leveraged to determine ingredients more accurately based on n-gram frequency. Visualizing popular recipes and ingredients by site, using word clouds, would offer a custom representation of a website's culinary focus and style, which carries innate marketing value.

## Literature Review

**Fitness Trackers**  
Mobile applications such as "MyFitnessPal" are consumer-level calorie-counting applications that require manual input of food items in order to retrieve nutritional data. Our feature is distinct in that it does not require any input from consumers, since it is geared towards producers (food websites). It would be ideal for recipe websites, as it could be developed to interface with fitness apps and transfer all nutritional information to the user, removing the need for manual entry.

https://www.myfitnesspal.com/


**Geographic/Cultural Food Association**  
There is a **_Kaggle_** competition in which a list of ingredients, provided via **_Yummly_**, is used to predict the cuisine of a dish. This is similar to some of the word matching and recognition done in our project, but the goals differ: the competition only asked for the ethnic origin of a dish based on keywords, while our project focused on matching the ingredients to the USDA nutrition database in order to extract nutritional content. Beyond this, the Kaggle competition used thoroughly cleaned data from only one site, obtained through an API. Our process is intended to work with almost any recipe website, taking raw HTML data and returning nutrition data.

https://www.kaggle.com/c/whats-cooking

## Quality of Cleaning

### 1. Web Scraping


### 2. USDA Nutrition Retrieval
**Data Cleaning / Transformation**  
Raw scraped data is first split, and all newlines and empty elements are removed. Next, each individual ingredient line is parsed to determine quantity and weight or volume, using a mix of numeric parsing and keyword location for measurement units (g, ml, tbsp, ounce, cup, etc.). The ingredient line is then truncated to remove numerics, measurements, and stop words, leaving only the words used for text matching against the USDA food description file. Once a description is found that shares a high number of words with the ingredient, the USDA nutritional data file is accessed by food ID and all of its data is retrieved (calories, protein, fat, carbs, etc.).
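
The truncation step could be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name `core_words` and the `UNITS` and `STOP_WORDS` lists are hypothetical and far shorter than what a real pipeline would need.

```python
import re

# Hypothetical word lists for illustration; the project's real lists are not shown here.
UNITS = {"g", "gram", "grams", "ml", "tbsp", "tsp", "ounce", "ounces",
         "oz", "cup", "cups", "lb", "pound", "pounds"}
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to", "for", "with"}


def core_words(ingredient_line):
    """Return the lowercase words left after dropping numbers, units, and stop words."""
    # Keeping only runs of letters discards digits, fractions, and punctuation.
    tokens = re.findall(r"[a-z]+", ingredient_line.lower())
    return [t for t in tokens if t not in UNITS and t not in STOP_WORDS]


print(core_words("2 cups of chopped fresh basil"))
```

The remaining words are what would be compared against the USDA food descriptions.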

**Unusual Incidents**  
Many edge cases came up in parsing numeric descriptors. For example, two steaks of six and a half ounces each may be written in many ways:
- 2 6 1/2 ounce steaks
- 2, 6½ ounce steaks
- 2 6 ½ oz. steaks  

Parsing this was tricky, and many edge cases were defined in order to handle unusual Unicode characters (such as the single-character fraction ½), as well as to distinguish quantities from weights.  
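
One way to handle the three spellings above is sketched below. It is an assumption-laden illustration, not the project's parser: `parse_count_and_weight`, `_to_fraction`, and the `WEIGHT_UNITS` subset are hypothetical names, and the rule "a fraction after a whole number forms a mixed number" is a simplification.

```python
import unicodedata
from fractions import Fraction

WEIGHT_UNITS = {"g", "oz", "ounce", "ounces", "lb", "pound", "pounds"}  # illustrative subset


def _to_fraction(token):
    """Convert tokens like '2', '1/2', or '0.5' to a Fraction, else return None."""
    try:
        return Fraction(token)
    except (ValueError, ZeroDivisionError):
        return None


def parse_count_and_weight(line):
    """Return (count, weight, unit) parsed from the start of an ingredient line."""
    # Rewrite single-character fractions such as "½" as " 1/2".
    chars = []
    for ch in line:
        if unicodedata.name(ch, "").startswith("VULGAR FRACTION"):
            value = Fraction(unicodedata.numeric(ch)).limit_denominator(100)
            chars.append(f" {value}")
        else:
            chars.append(ch)
    tokens = "".join(chars).replace(",", " ").split()

    # Collect the leading numeric tokens.
    numbers = []
    position = 0
    while position < len(tokens):
        value = _to_fraction(tokens[position])
        if value is None:
            break
        # A proper fraction right after a whole number is read as a mixed number (6 1/2).
        if numbers and 0 < value < 1 and numbers[-1].denominator == 1:
            numbers[-1] += value
        else:
            numbers.append(value)
        position += 1

    unit = tokens[position].rstrip(".").lower() if position < len(tokens) else None
    if unit not in WEIGHT_UNITS:
        # No weight unit follows, so treat the first number as a plain count.
        return (numbers[0] if numbers else None), None, None
    if len(numbers) >= 2:
        return numbers[0], numbers[1], unit
    if len(numbers) == 1:
        return Fraction(1), numbers[0], unit
    return None, None, unit


for text in ["2 6 1/2 ounce steaks", "2, 6½ ounce steaks", "2 6 ½ oz. steaks"]:
    print(parse_count_and_weight(text))
```

Note that the mixed-number rule is itself ambiguous: under this sketch, "2 1/2 pounds" is read as one item of two and a half pounds, not two half-pound items.
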
<br>
Another recurring problem that we have not been able to fully control is the mismatching of food descriptions. We used the following heuristic: **_"the correct item is the one that (1) contains the highest number of matches and (2) has the shortest length if there is a tie for the highest number of matches."_** This heuristic is not always correct; an incorrect match often satisfies it. For example, _"fresh garlic bread"_ may match either _"fresh garlic"_ or _"fresh bread"_ in the USDA nutrition database, leading to a possible error.
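
The heuristic and its failure mode can be illustrated with a short sketch. The function `best_match` is hypothetical, and it assumes that "length" means the character length of the description; the two candidate descriptions are the illustrative ones from the example above, not actual USDA entries.

```python
def best_match(ingredient_words, descriptions):
    """Pick the description sharing the most words; break ties by shortest description."""
    wanted = set(ingredient_words)

    def overlap(description):
        words = set(description.lower().replace(",", " ").split())
        return len(wanted & words)

    # Sort key: most shared words first, then fewest characters (assumed meaning of "length").
    best = min(descriptions, key=lambda d: (-overlap(d), len(d)), default=None)
    if best is None or overlap(best) == 0:
        return None  # nothing in the list shares a word with the ingredient
    return best


candidates = ["fresh garlic", "fresh bread"]
print(best_match(["fresh", "garlic", "bread"], candidates))
```

Both candidates share two words with the ingredient, so the tie-break decides the result purely on length, which says nothing about which food was actually meant.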

**Missing Values**  
In the USDA nutritional data, a missing value means that the nutrient was not measured. These values were stored in the dataframe as `NaN` and were excluded from analysis. A value that was measured as 0 was included.  
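
The distinction between "not measured" and "measured as zero" matters for any aggregate. The sketch below uses plain Python lists and a hypothetical helper, `mean_of_measured`, to show the intended behavior; pandas aggregations such as `Series.mean` skip `NaN` by default, which has the same effect on a dataframe column.

```python
import math


def mean_of_measured(values):
    """Average the values that were measured, skipping NaN but keeping zeros."""
    measured = [v for v in values if not math.isnan(v)]
    return sum(measured) / len(measured) if measured else math.nan


# Hypothetical nutrient values for three foods: measured 0, not measured, measured 3.
print(mean_of_measured([0.0, math.nan, 3.0]))
```

Here the zero counts toward the average while the unmeasured entry does not, so the mean is taken over two values rather than three.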

## Stats and Interpretation

### 1. Web Scraping

### 2. USDA Nutrition

## Visualization

### 1. Web Scraping

### 2. USDA Nutrition