# Healthier Groceries Recommender
### Research, Data Wrangling, Exploratory Data Analysis

*In this notebook you will find:*
1. High Level Overview of Project
2. Background Research
3. Data Imports
4. Exploratory Data Analysis

-------------------------------------

# High Level Overview of Project


 - Recommend healthier options
 - Would be an add-on, opt-in feature for people who want healthier options
 
 
 
 ### Data & Research
 
 - Food Data Central - https://fdc.nal.usda.gov/download-datasets.html
 - Nutrient Rich Food Index - https://onlinelibrary.wiley.com/doi/full/10.1111/j.1753-4887.2007.00003.x
 - Scientific-Report-of-the-2015-Dietary-Guidelines-Advisory-Committee
 - Instacart Kaggle Dataset - https://www.kaggle.com/c/instacart-market-basket-analysis/data


### Inputs

 - Branded food item
 

### Outputs

 - Branded food item in same category with similar ingredients that is more nutrient dense
 
### How
 
 - Food Data Central has label nutrient information for branded foods
 - Calculate Nutritious Food Index (or Nutrient Density Score) for all items in dataset
 - Tag items based on food category (as defined by food data central), food name (without brand)
 - Build recommender that provides a more nutritious alternative in the category
 - Test recommender
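
The steps above can be sketched end to end on toy data. Everything in this sketch is an assumption for illustration: the item records, the made-up `score` values, the Jaccard overlap of ingredient tokens (a simple stand-in for whatever ingredient similarity the recommender ends up using), and the `min_similarity` threshold.

```python
def ingredient_tokens(ingredients):
    """Split a comma-separated ingredient string into a set of lowercase tokens."""
    return {tok.strip().lower() for tok in ingredients.split(",") if tok.strip()}


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(item, catalog, min_similarity=0.3):
    """Return the highest-scoring item in the same category that has
    sufficiently similar ingredients and a higher score, or None."""
    base = ingredient_tokens(item["ingredients"])
    candidates = [
        other for other in catalog
        if other["fdc_id"] != item["fdc_id"]
        and other["category"] == item["category"]
        and other["score"] > item["score"]
        and jaccard(base, ingredient_tokens(other["ingredients"])) >= min_similarity
    ]
    return max(candidates, key=lambda c: c["score"], default=None)


# Hypothetical toy catalog: the fdc_id values and scores are made up.
catalog = [
    {"fdc_id": 1, "category": "Ketchup",
     "ingredients": "SUGAR, WATER, TOMATO PASTE, VINEGAR", "score": -5.0},
    {"fdc_id": 2, "category": "Ketchup",
     "ingredients": "TOMATO PASTE, WATER, VINEGAR, SALT", "score": 2.0},
    {"fdc_id": 3, "category": "Ice Cream",
     "ingredients": "MILK, CREAM, SUGAR", "score": 4.0},
]

best = recommend(catalog[0], catalog)
print(best["fdc_id"] if best else None)
```

Returning `None` when nothing in the category scores higher keeps the feature opt-in friendly: the shopper is only interrupted when a better alternative actually exists.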
 

----------------

# Background Research

### Defining "Healthier"

"Nutrient profiling of foods, described as the science of ranking foods based on their
nutrient content, is fast becoming the basis for regulating nutrition labels, health
claims, and marketing and advertising to children. A number of nutrient profile
models have now been developed by research scientists, regulatory agencies, and by
the food industry. Whereas some of these models have focused on nutrients to limit,
others have emphasized nutrients known to be beneficial to health, or some combination of both. Although nutrient profile models are often tailored to specific goals,
the development process ought to follow the same science-driven rules. These
include the selection of index nutrients and reference amounts, the development of
an appropriate algorithm for calculating nutrient density, and the validation of the
chosen nutrient profile model against healthy diets. It is extremely important that
nutrient profiles be validated rather than merely compared to prevailing public
opinion. Regulatory agencies should act only when they are satisfied that the scientific process has been followed, that the algorithms are transparent, and that the
profile model has been validated with respect to objective measures of a healthy diet."
© 2008 International Life Sciences Institute

*May want to consider different profiles for different categories*

The food industry’s concern has been that some profiling approaches, notably the “traffic light” system tend to separate foods into “good” and “bad”, such that whole categories of foods may be penalized. One way to deal with this issue is to create nutrient profiles that are category-specific, as opposed to an across-the-board approach.

### THE DEFINITION OF NUTRIENT-DENSE FOODS
*according to the article cited*

The definition of healthy or nutrient dense foods has taken many forms over the years: 
 - more nutrients, fewer calories
 - bucketing certain food groups as healthy: fruits, veggies, lean meats, etc
 - bucketing certain foods as unhealthy: high fat (including nuts, olives, coconuts and avocados) (fat phobia)
 - "dark-green leafy and deep-orange vegetables [as] especially good sources of vitamins and minerals"
 - "Generally excluded from the definition of nutrient-dense foods were those that contained added fat, sugar, or sodium."
 - Focus on Recommended Dietary Allowance - rigidity of this method disqualified a lot of foods
 - Reference amounts: Gradients of nutrient density - "good source of", "excellent source of" vs. "free", "low", "reduced", "less"
 - Tiers of foods


"It is only recently that the FDA has explored the feasibility of allowing health claims if a food had a favorable nutrient profile, as an alternative to current measures based on grams of nutrients per serving"

"Nutrient profile models can be based on 1) qualifying nutrients known to be beneficial to health, mostly vitamins and minerals; 2) disqualifying nutrients, mostly fats, sugars, and sodium; or 3) some combination of both. The content of fruits, vegetables, nuts, or whole grains in a food can also be taken into account."



<img src='Nutrient_content_claims.png'>

<img src='Food Nutrient Index.png'>

<img src='Scores.png'>

**Leaning towards using one of the following:**
1. Nutrient Rich Food NRFNn.3 (NRFNn - LIM) = Unweighted arithmetic mean of %DV for n nutrients minus 3 negative nutrients
2. Nutritious Food Index (NFI) = Sum of weighted desirable and less desirable food components; each divided by recommended daily intake (RDI).



### Selecting Nutrients to be Used in Calculation

Nutrients to be used can be:
1. Nutrients beneficial to health
    - selected macronutrients (protein, fiber, essential fatty acids)
    - vitamins (vitamins A and C), and minerals (calcium and iron)
    - omega-3 fatty acids, B-vitamins and folate, and additional minerals, typically potassium, zinc, and magnesium
2. Harmful nutrients
    - FDA: total fat, saturated fat, cholesterol, sodium
    - European Commission: saturated fat, trans-fat, total sugar, sodium
    - British FSA: energy, saturated fat, total sugar, sodium
    - USDA: solid fat, added sugar, alcohol
3. Combination of above
4. Contents of fruits, veg, nuts, whole grain

**CHOICE FOR NOW**:

Nutrient Rich Food NRFNn.3 (NRFNn - LIM) = Unweighted arithmetic mean of %DV for n nutrients minus 3 negative nutrients

NRFNn = Nutrient Density Score (NDS6): protein, fiber, vitamin A, vitamin C, calcium, iron

LIM = Limited Nutrients Score: saturated fat, added sugar, sodium
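
A minimal sketch of this choice, under several assumptions that the notebook has not settled: the Daily Values below are the current FDA label reference amounts for adults, each %DV is capped at 100, and LIM is taken as the unweighted mean %DV of the three limiting nutrients (one reading of "minus 3 negative nutrients"). The dictionary keys and function names are hypothetical, and the amounts can be on whatever basis is eventually chosen (per serving, per 100 g, or per 100 kcal).

```python
# Assumed reference Daily Values (FDA label values for adults).
DV_POSITIVE = {
    "protein_g": 50.0,
    "fiber_g": 28.0,
    "vitamin_a_mcg": 900.0,
    "vitamin_c_mg": 90.0,
    "calcium_mg": 1300.0,
    "iron_mg": 18.0,
}
DV_LIMIT = {
    "saturated_fat_g": 20.0,
    "added_sugar_g": 50.0,
    "sodium_mg": 2300.0,
}


def mean_percent_dv(amounts, daily_values, cap=100.0):
    """Unweighted mean of %DV over the given nutrients.

    Nutrients missing from `amounts` count as 0; each %DV is capped at `cap`
    (the cap is an assumption, not something the notebook specifies).
    """
    percents = [
        min(100.0 * amounts.get(name, 0.0) / dv, cap)
        for name, dv in daily_values.items()
    ]
    return sum(percents) / len(percents)


def nrf6_3(amounts):
    """NDS6 (mean %DV of six positive nutrients) minus LIM (mean %DV of three limiting ones)."""
    nds6 = mean_percent_dv(amounts, DV_POSITIVE)
    lim = mean_percent_dv(amounts, DV_LIMIT)
    return nds6 - lim


# Made-up food: 10% DV of every positive nutrient, light on limiting nutrients.
example = {
    "protein_g": 5.0, "fiber_g": 2.8, "vitamin_a_mcg": 90.0,
    "vitamin_c_mg": 9.0, "calcium_mg": 130.0, "iron_mg": 1.8,
    "saturated_fat_g": 1.0, "added_sugar_g": 0.0, "sodium_mg": 230.0,
}
print(round(nrf6_3(example), 2))
```

Keeping the two Daily Value tables as plain dictionaries makes it cheap to try the extension noted under the update below: adding a nutrient is one new key.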

**UPDATE**

 - Thinking about adding more nutrients, specifically monounsaturated fatty acids (MUFA) to the positive nutrients and trans fats to the negative nutrients
 - Also need to deal with weighting nutrients based on population needs

# Scientific Report of the 2015 Dietary Guidelines Advisory Committee

#### Advisory Report to the Secretary of Health and Human Services and the Secretary of Agriculture

*Executive Summary*
 - "shortfall nutrients: vitamin A, vitamin D, vitamin E, vitamin C, folate, calcium, magnesium, fiber, and potassium." pg 1
 - "sodium and saturated fat are overconsumed by the U.S. population relative to the Tolerable Upper Intake Level" pg 2
 - "key food groups that are important sources of the shortfall nutrients, including vegetables, fruits, whole grains, and dairy." pg 2
 - "population intake is too high for refined grains and added sugars." pg 2

*Nutrients of Concern*
1. Shortfall nutrients - underconsumed relative to the Estimated Average Requirement
 - compared intakes in the US population among age, sex and race groups
 - vitamin D, vitamin E, magnesium, calcium, vitamin A, vitamin C, fiber, potassium, iron
2. Overconsumed nutrients - consumed in amounts above the Tolerable Upper Intake Level
 - sodium and saturated fat
3. Food groups
 - "Increasing low-fat/fat-free fluid milk and yogurt and decreasing cheese would result in higher intakes of magnesium, potassium, vitamin A, and vitamin D while simultaneously decreasing the intake of sodium and saturated fat."
 - "Replacing soft drinks and other sugar-sweetened beverages (including sports drinks) with non-fat fluid milk would substantially reduce added sugars and empty calories and increase the intake of shortfall nutrients, including calcium, vitamin D, and magnesium."
 - "Consuming all vegetables, including starchy vegetables, with minimal additions of salt and solid fat will help minimize intake of overconsumed nutrients - sodium and saturated fat." pg 64


## Import Libraries and Read in Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances
import matplotlib.pyplot as plt
import seaborn as sns


The primary data source for training the recommender is Food Data Central, which compiles existing food data from the USDA. I am focusing on the branded food data, since I believe the healthier grocery recommender will be most useful for branded foods. For example, someone buying a frozen chicken dinner can be alerted to an alternative brand that offers a healthier frozen chicken dinner (per the definition above, **which still needs to be finalized**).

In [2]:
branded_food = pd.read_csv(r'branded_food.csv')
food = pd.read_csv(r'food.csv')
food_nutrient = pd.read_csv(r'food_nutrient.csv')
food_attribute = pd.read_csv(r'food_attribute.csv')
food_attribute_type = pd.read_csv(r'food_attribute_type.csv')
food_calorie_conversion_factor = pd.read_csv(r'food_calorie_conversion_factor.csv')
food_category = pd.read_csv(r'food_category.csv')
food_component = pd.read_csv(r'food_component.csv')
food_nutrient_conversion_factor = pd.read_csv(r'food_nutrient_conversion_factor.csv')
food_nutrient_derivation = pd.read_csv(r'food_nutrient_derivation.csv')
food_nutrient_source = pd.read_csv(r'food_nutrient_source.csv')
food_portion = pd.read_csv(r'food_portion.csv')
food_protein_conversion_factor = pd.read_csv(r'food_protein_conversion_factor.csv')
foundation_food = pd.read_csv(r'foundation_food.csv')
input_food = pd.read_csv(r'input_food.csv')
lab_method = pd.read_csv(r'lab_method.csv')
lab_method_code = pd.read_csv(r'lab_method_code.csv')
lab_method_nutrient = pd.read_csv(r'lab_method_nutrient.csv')
market_acquisition = pd.read_csv(r'market_acquisition.csv')
measure_unit = pd.read_csv(r'measure_unit.csv')
nutrient = pd.read_csv(r'nutrient.csv')
nutrient_incoming_name = pd.read_csv(r'nutrient_incoming_name.csv')
retention_factor = pd.read_csv(r'retention_factor.csv')
sample_food = pd.read_csv(r'sample_food.csv')
sr_legacy_food = pd.read_csv(r'sr_legacy_food.csv')
sub_sample_food = pd.read_csv(r'sub_sample_food.csv')
sub_sample_result = pd.read_csv(r'sub_sample_result.csv')
survey_fndds_food = pd.read_csv(r'survey_fndds_food.csv')
wweia_food_category = pd.read_csv(r'wweia_food_category.csv')



## Exploratory Data Analysis

**The dataframe below contains the branded foods in Food Data Central. These will be the items used in the recommender.**

In [3]:
branded_food.head()

Unnamed: 0,fdc_id,brand_owner,gtin_upc,ingredients,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,modified_date,available_date
0,356425,"G. T. Japan, Inc.",19022128593,"ICE CREAM INGREDIENTS: MILK, CREAM, SUGAR, STR...",40.0,g,1 PIECE,Ice Cream & Frozen Yogurt,LI,2017-11-15,2017-11-15
1,356426,FRESH & EASY,5051379043735,"WATER, SUGAR, TOMATO PASTE, MOLASSES, DISTILLE...",37.0,g,2 Tbsp,"Ketchup, Mustard, BBQ & Cheese Sauce",LI,2018-04-26,2018-04-26
2,356427,FRESH & EASY,5051379009434,"SUGAR, WATER, DISTILLED VINEGAR, TOMATO PASTE,...",34.0,g,2 Tbsp,"Ketchup, Mustard, BBQ & Cheese Sauce",LI,2018-04-26,2018-04-26
3,356428,FRESH & EASY,5051379019969,"TOMATO PUREE (WATER, TOMATO PASTE), SUGAR, DIS...",35.0,g,2 Tbsp,"Ketchup, Mustard, BBQ & Cheese Sauce",LI,2018-04-26,2018-04-26
4,356429,FRESH & EASY,5051379009526,"SUGAR, DISTILLED VINEGAR, WATER, TOMATO PASTE,...",37.0,g,2 Tbsp,"Ketchup, Mustard, BBQ & Cheese Sauce",LI,2018-04-26,2018-04-26


In [4]:
branded_food['serving_size_unit'].value_counts()

g     226163
ml     34207
Name: serving_size_unit, dtype: int64

In [5]:
branded_food.shape

(260370, 11)

**The dataframe below has all the food descriptions, which I will need to merge into the branded foods dataframe**

In [6]:
food.head() #will use this

Unnamed: 0,fdc_id,data_type,description,food_category_id,publication_date
0,321612,sample_food,"Broccoli, raw (IN1,NY1) - CY0906E",11.0,2019-04-01
1,321613,market_acquisition,"Broccoli, raw, 2 bunches (IN1) - NFY0905DD",11.0,2019-04-01
2,321614,market_acquisition,"Broccoli, raw, 3 bunches (NY1) - NFY0905EV",11.0,2019-04-01
3,321615,market_acquisition,"Broccoli, raw, 3 bunches (IN1) - NFY0905DC",11.0,2019-04-01
4,321618,sample_food,"Broccoli, cooked (IN1,NY1) - CY0906K",11.0,2019-04-01
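
One way to pull the descriptions into the branded foods dataframe is a left merge on `fdc_id`. The sketch below uses small stand-in frames (the `_demo` names and the description strings are made up) so it runs without the CSV files; with the real data the same call would be applied to `branded_food` and `food`, assuming `fdc_id` is unique in both.

```python
import pandas as pd

# Toy stand-ins for `branded_food` and `food`; descriptions are invented.
branded_food_demo = pd.DataFrame({
    "fdc_id": [356425, 356426],
    "brand_owner": ["G. T. Japan, Inc.", "FRESH & EASY"],
})
food_demo = pd.DataFrame({
    "fdc_id": [356425, 356426, 321612],
    "data_type": ["branded_food", "branded_food", "sample_food"],
    "description": ["MOCHI ICE CREAM", "BARBECUE SAUCE", "Broccoli, raw"],
})

# Left merge keeps every branded item; validate raises if fdc_id is duplicated.
branded_with_desc = branded_food_demo.merge(
    food_demo[["fdc_id", "description"]],
    on="fdc_id",
    how="left",
    validate="one_to_one",
)
print(branded_with_desc)
```

The `validate` argument is a cheap guard: if either table ever contains a repeated `fdc_id`, the merge fails loudly instead of silently duplicating rows.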


**For each branded food, I will need the nutrient information. The dataframe below has nutrient amounts per food (`fdc_id`), which I will use to calculate the nutrient density score**

In [7]:
food_nutrient.head() #will use this

Unnamed: 0,id,fdc_id,nutrient_id,amount,data_points,derivation_id,min,max,median,footnote,min_year_acquired
0,2220550,321616,1051,89.0,1.0,1.0,,,,,
1,2220551,321617,1162,92.2,1.0,1.0,,,,,
2,2220552,321619,1051,89.2,1.0,1.0,,,,,
3,2220553,321620,1162,78.4,1.0,1.0,,,,,
4,2220554,321626,1051,88.1,1.0,1.0,,,,,


In [8]:
food_nutrient.drop(columns='id', inplace=True)

In [9]:
food_nutrient.rename(columns={'nutrient_id': 'id'}, inplace=True)

**The nutrient names and units are stored in a separate lookup dataframe keyed by nutrient ID, which I will need to merge with the above**

In [10]:
nutrient.head() #will use this

Unnamed: 0,id,name,unit_name,nutrient_nbr,rank
0,1002,Nitrogen,G,202.0,500.0
1,1003,Protein,G,203.0,600.0
2,1004,Total lipid (fat),G,204.0,800.0
3,1005,"Carbohydrate, by difference",G,205.0,1110.0
4,1007,Ash,G,207.0,1000.0
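
A sketch of that combination, again on made-up stand-in frames: the nutrient lookup is merged onto the per-food amounts via the shared `id` column (the renamed `nutrient_id`), then pivoted so each food becomes one row with one column per nutrient. The `_demo` names and the amounts are hypothetical, and the wide layout is only one convenient shape for scoring, not necessarily the one used later.

```python
import pandas as pd

# Stand-in for `food_nutrient` after dropping its own id and renaming nutrient_id -> id.
food_nutrient_demo = pd.DataFrame({
    "fdc_id": [356425, 356425, 356426],
    "id": [1003, 1004, 1003],
    "amount": [2.5, 7.5, 0.0],  # made-up amounts
})
# Stand-in for the `nutrient` lookup table.
nutrient_demo = pd.DataFrame({
    "id": [1003, 1004],
    "name": ["Protein", "Total lipid (fat)"],
    "unit_name": ["G", "G"],
})

# Attach the nutrient name and unit to every amount row.
named = food_nutrient_demo.merge(
    nutrient_demo, on="id", how="left", validate="many_to_one"
)

# One row per food, one column per nutrient name; missing nutrients become NaN.
wide = named.pivot_table(
    index="fdc_id", columns="name", values="amount", aggfunc="first"
)
print(wide)
```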


**I will need a way to categorize the data, so that the recommender can pull from similar foods.  I can use these food attributes to tag the foods**

In [11]:
food_attribute.head() #will use this

Unnamed: 0,id,fdc_id,seq_num,food_attribute_type_id,name,value
0,2074,167525,0.0,1000,,Latino food
1,2075,167526,0.0,1000,,Latino food
2,2076,167527,0.0,1000,,Latino food
3,2077,167528,0.0,1000,,"Latino food, pastelios de guava, turnovers"
4,2078,167529,0.0,1000,,Latino food


In [12]:
food_attribute = food_attribute[food_attribute['food_attribute_type_id']==1000]
food_attribute.head(10)

Unnamed: 0,id,fdc_id,seq_num,food_attribute_type_id,name,value
0,2074,167525,0.0,1000,,Latino food
1,2075,167526,0.0,1000,,Latino food
2,2076,167527,0.0,1000,,Latino food
3,2077,167528,0.0,1000,,"Latino food, pastelios de guava, turnovers"
4,2078,167529,0.0,1000,,Latino food
5,2079,167530,0.0,1000,,Latino food
6,2080,167571,0.0,1000,,confectioner's coating
7,2081,167605,0.0,1000,,"Potatoes, hashed brown"
8,2082,167622,0.0,1000,,"sitka deer, venison"
9,2083,167626,0.0,1000,,sheefish


In [13]:
food_attribute.isnull().sum()

id                           0
fdc_id                       0
seq_num                      0
food_attribute_type_id       0
name                      1084
value                        4
dtype: int64

In [14]:
food_attribute.sort_values(by='value').tail() # none of these are in branded_food, so no need to address the missing values

Unnamed: 0,id,fdc_id,seq_num,food_attribute_type_id,name,value
29,2093,167789,0.0,1000,,yellow cherries
900,2964,173681,0.0,1000,,
901,2965,173725,0.0,1000,,
902,2966,173726,0.0,1000,,
1081,3145,175130,0.0,1000,,


In [15]:
branded_food[branded_food['fdc_id']==175130]

Unnamed: 0,fdc_id,brand_owner,gtin_upc,ingredients,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,modified_date,available_date
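
The cell above checks a single `fdc_id`. A sketch of checking every attribute row with a missing `value` at once, using `isin`, is shown here on small stand-in frames (the `_demo` names are hypothetical); an empty result would confirm that none of the affected foods are branded.

```python
import pandas as pd

# Stand-ins for `food_attribute` and `branded_food`.
food_attribute_demo = pd.DataFrame({
    "fdc_id": [167789, 173681, 173725],
    "value": ["yellow cherries", None, None],
})
branded_food_demo = pd.DataFrame({"fdc_id": [356425, 356426]})

# fdc_ids whose common-name value is missing.
missing_ids = food_attribute_demo.loc[
    food_attribute_demo["value"].isnull(), "fdc_id"
]

# Branded rows that share an fdc_id with any of those attribute rows.
overlap = branded_food_demo[branded_food_demo["fdc_id"].isin(missing_ids)]
print(len(overlap))
```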


**I do not expect to need the dataframes below, but I am leaving them in for now just in case**

In [16]:
food_attribute_type.head()

Unnamed: 0,id,name,description
0,999,Attribute,Generic attributes
1,1000,Common Name,Common names associated with a food.
2,1001,Additional Description,Additional descriptions for the food.
3,1002,Adjustments,"Adjustments made to foods, including moisture ..."


In [17]:
food_nutrient_conversion_factor.head()

Unnamed: 0,id,fdc_id
0,16365,167512
1,16366,167513
2,11672,167518
3,16367,167524
4,16368,167526


In [18]:
food_component.head()

Unnamed: 0,id,fdc_id,name,pct_weight,is_refuse,gram_weight,data_points,min_year_acquired
0,57035,330966,Handling loss,,N,6.7,1,
1,57036,330966,Skin and separable fat,,Y,16.0,1,
2,57037,330966,Bone and cartilage,,Y,51.5,1,
3,57038,330966,Edible portion,,N,87.6,1,
4,57039,330966,Total gram weight,,N,162.0,1,


In [19]:
food_nutrient_derivation.head()

Unnamed: 0,id,code,description,source_id
0,1,A,Analytical,1
1,2,AI,Analytical data; from the literature or gover...,10
2,3,AR,Analytical data; derived by linear regression,1
3,4,AS,Summed,1
4,5,BD,Based on same food; Drained solids from solids...,2


In [20]:
food_nutrient_source.head()

Unnamed: 0,id,code,description
0,1,1,Analytical or derived from analytical
1,2,4,Calculated or imputed
2,3,5,Value manufacturer based label claim for added...
3,4,6,Aggregated data involving combinations of sour...
4,5,7,Assumed zero


In [21]:
food_portion.head()

Unnamed: 0,id,fdc_id,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired
0,119701,329720,,1.0,1000,,grated,100.0,1.0,,
1,119702,329721,,1.0,1000,,grated,115.0,1.0,,
2,119703,329735,,1.0,1000,,grated,95.5,1.0,,
3,119704,329736,,1.0,1000,,grated,94.8,1.0,,
4,119705,329754,,1.0,1000,,grated,82.1,1.0,,


In [22]:
food_protein_conversion_factor.head()

Unnamed: 0,food_nutrient_conversion_factor_id,value
0,16365,6.25
1,16366,6.25
2,16367,6.25
3,16368,6.25
4,16369,6.25


In [23]:
foundation_food.head()

Unnamed: 0,fdc_id,NDB_number,footnote
0,321358,16158.0,
1,321359,1079.0,
2,321505,2047.0,
3,321611,11056.0,
4,321900,11090.0,


In [24]:
input_food.head()

Unnamed: 0,id,fdc_id,fdc_id_of_input_food,seq_num,amount,sr_code,sr_description,unit,portion_code,portion_description,gram_weight,retention_code,survey_flag
0,10428,321358,319874.0,,,,,,,,,,
1,10429,321358,319879.0,,,,,,,,,,
2,10430,321358,319885.0,,,,,,,,,,
3,10431,321358,319894.0,,,,,,,,,,
4,10432,321358,319901.0,,,,,,,,,,


In [25]:
lab_method.head()

Unnamed: 0,id,description,technique
0,1000,NIST Handbook 133,Gravimetric
1,1001,AOAC 968.06 + 992.15,Combustion
2,1002,AOAC 960.39 39.1,Extraction
3,1003,AOAC 922.06,Acid hydrolysis
4,1004,AOAC 923.03,Gravimetric


In [26]:
lab_method_code.head()

Unnamed: 0,id,lab_method_id,code
0,1000,1000,SPGP_S
1,1001,1001,DGEN_S
2,1003,1003,FAT_AH_S
3,1004,1004,ASHM_S
4,1005,1005,SUGN_S


In [27]:
lab_method_nutrient.head()

Unnamed: 0,id,lab_method_id,nutrient_id
0,1000,1000,1024
1,1001,1001,1002
2,1002,1001,1003
3,1003,1002,1004
4,1004,1003,1004


In [28]:
market_acquisition.head()

Unnamed: 0,fdc_id,brand_description,expiration_date,label_weight,location,acquisition_date,sales_type,sample_lot_nbr,sell_by_date,store_city,store_name,store_state,upc_code
0,321613,,,,,2009-11-16,Retail store,,,ANDERSON,Pay Less Super Market,IN,
1,321614,,,,,2009-11-16,Retail store,,,COLLEGE POINT,Waldbaum,NY,
2,321615,,,,,2009-11-16,Retail store,,,ANDERSON,Pay Less Super Market,IN,
3,321622,OCEAN MIST,,,,2009-12-01,Retail store,PLU 4060,,CARRBORO,Food Lion,NC,
4,321623,OCEAN MIST,,,,2009-12-01,Retail store,PLU 4060,,CARRBORO,Food Lion,NC,


In [29]:
measure_unit.head()

Unnamed: 0,id,name
0,1000,cup
1,1001,tablespoon
2,1002,teaspoon
3,1003,liter
4,1004,milliliter


In [30]:
nutrient_incoming_name.head()

Unnamed: 0,id,name,nutrient_id
0,1000,NITROGEN-DUMAS METHO,1002
1,1001,Nitrogen,1002
2,1002,NITROGEN-DUMAS METHOD,1002
3,1003,Nitrogen - Kjeldahl,1002
4,1004,Protein,1003
