## Project 2: Food Facts
#### Presented by Carlee Price, Yubo Zhang, and Nikki Haas
#### August 2016


### Introduction

### Part 0: The Questions

Our project will attempt to answer the following questions:

1.  How likely is it that a basket of commercially available food items will meet recommended nutrient levels while staying within the recommended calorie allowance?
2.  How often does a daily allotment of food contain ingredients known to cause harm to human health, such as high fructose corn syrup and hydrogenated oils?
3.  What are the most common categories of food available from the dataset, and how often are those foods considered nutrient dense?

### Part 1: The Data

Our main data source was provided by the Open Food Facts project<sup>1</sup>.  This is a user-contributed database containing nutrition data from around the world.  We will be studying the subset containing foods from the US market that have the macronutrient fields available.

The data will be groomed and manipulated using the pandas, NumPy, Matplotlib, and SciPy modules available for Python.

#### *Overview of Data:*

In [2]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
import re


df = pd.read_csv("openfoodfacts_search.csv", sep = '\t')
#the datafields are quite long; tell pandas to show the whole fields
pd.set_option('display.max_colwidth', -1)
df.head()

3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
1.11.1
0.18.1


Unnamed: 0,code,url,creator,created_t,last_modified_t,product_name,generic_name,quantity,packaging,packaging_tags,...,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g
0,5010092093045,http://world.openfoodfacts.org/product/5010092093045,bcatelin,1389309305,1461479010,Soft white,White bread,800g,Plastic bag,plastic-bag,...,,,,,,,,125.0,-1.0,-1.0
1,44000030377,http://world.openfoodfacts.org/product/0044000030377,openfoodfacts-contributors,1385850411,1459174448,Wheat Thins Original,,258g,,,...,,,,,,,,,,
2,7832309,http://world.openfoodfacts.org/product/07832309,openfoodfacts-contributors,1403210081,1458995984,Diet Dr Pepper,,,can,can,...,,,,,,,,,,
3,5099353000169,http://world.openfoodfacts.org/product/5099353000169,bcatelin,1385926289,1413659845,Eggs,Eggs,6,Cardbox,cardbox,...,,,,,,,,,,
4,82592720153,http://world.openfoodfacts.org/product/0082592720153,openfoodfacts-contributors,1389308826,1459174499,Green Machine,,15.2 fl. oz (450 mL),,,...,,,,,,,,,,


In [3]:
list(df.columns)

['code',
 'url',
 'creator',
 'created_t',
 'last_modified_t',
 'product_name',
 'generic_name',
 'quantity',
 'packaging',
 'packaging_tags',
 'brands',
 'brands_tags',
 'categories',
 'categories_tags',
 'labels',
 'labels_tags',
 'origins',
 'origins_tags',
 'manufacturing_places',
 'manufacturing_places_tags',
 'emb_codes',
 'emb_codes_tags',
 'cities',
 'cities_tags',
 'purchase_places',
 'stores',
 'countries',
 'ingredients_text',
 'allergens',
 'allergens_tags',
 'traces',
 'traces_tags',
 'serving_size',
 'no_nutriments',
 'additives_n',
 'additives',
 'additives_tags',
 'ingredients_from_palm_oil_n',
 'ingredients_from_palm_oil',
 'ingredients_from_palm_oil_tags',
 'ingredients_that_may_be_from_palm_oil_n',
 'ingredients_that_may_be_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil_tags',
 'pnns_groups_1',
 'pnns_groups_2',
 'main_category',
 'image_url',
 'image_small_url',
 'image_front_url',
 'image_front_small_url',
 'image_ingredients_url',
 'image_ingredients_smal

We can see that there is a column for energy per 100 grams; however, it does not appear to correspond to calories per 100 grams.  We can test this against an example:

In [4]:
df[['code', 'url', 'product_name', 'energy_100g','carbohydrates_100g', 
    'fat_100g', 'proteins_100g','serving_size']].iloc[20]

code                  36632036506                                         
url                   http://world.openfoodfacts.org/product/0036632036506
product_name          Activia light blueberry                             
energy_100g           222                                                 
carbohydrates_100g    9.73                                                
fat_100g              0                                                   
proteins_100g         3.54                                                
serving_size          1 container (113g)                                  
Name: 20, dtype: object

In [11]:
site = 'http://www.activia.us.com/probiotic-yogurt/products/activia-light-blueberry'
blueberry_yougurt = pd.read_html(site)[0].set_index('Nutritional Facts')
blueberry_yougurt

Unnamed: 0_level_0,per serving (113g),Calories from Fat
Nutritional Facts,Unnamed: 1_level_1,Unnamed: 2_level_1
,,% Daily Value*
Calories,60,0
Total Fat,0g,0%
Saturated Fat,0g,0%
Trans Fat,0g,
Cholesterol,<5mg,1%
Sodium,65mg,3%
Potassium,200mg,6%
Total Carbohydrate,11g,4%
Dietary Fiber,2g,8%


The calories (energy) value does not look like 220 calories per 100 g: the label lists 60 calories per 113 g serving, or about 53 per 100 g.  The carbohydrate/protein/fat ratio, however, looks about right.  We can verify that the macronutrients per 100 g match the Food Facts database:

In [6]:

print(float(blueberry_yougurt.loc['Protein'][0].strip('g'))/1.13)
print(float(blueberry_yougurt.loc['Total Carbohydrate'][0].strip('g'))/1.13)
print(float(blueberry_yougurt.loc['Total Fat'][0].strip('g'))/1.13)

3.5398230088495577
9.734513274336283
0.0


Calories are a function of carbohydrates, fat, and protein (roughly 4 calories per gram of carbohydrate or protein and 9 calories per gram of fat), so we can derive the calories per one hundred grams fairly easily:

In [8]:
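def cals(fat, protein, carbs):
    # 9 calories per gram of fat; 4 per gram of protein or carbohydrate
    return (9*fat) + (4*(protein + carbs))

# pandas arithmetic is already element-wise, so np.vectorize is not needed
df['calories_100g'] = cals(df['fat_100g'], df['proteins_100g'], df['carbohydrates_100g'])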
df[['code', 'url', 'quantity', 'product_name', 'calories_100g','carbohydrates_100g', 
    'fat_100g', 'proteins_100g','serving_size', 'ingredients_text', 'categories', 'nutrition-score-uk_100g']].iloc[20]

code                       36632036506                                         
url                        http://world.openfoodfacts.org/product/0036632036506
quantity                   113 g                                               
product_name               Activia light blueberry                             
calories_100g              53.08                                               
carbohydrates_100g         9.73                                                
fat_100g                   0                                                   
proteins_100g              3.54                                                
serving_size               1 container (113g)                                  
ingredients_text           NaN                                                 
categories                 yogurt                                              
nutrition-score-uk_100g   -3                                                   
Name: 20, dtype: object
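
As a cross-check on the numbers above, the sketch below recomputes the derived calories for this yogurt and also converts the 222 in energy_100g under the assumption that the field is reported in kilojoules. That unit is an assumption, not something verified in this analysis, and the helper names `atwater_kcal` and `kj_to_kcal` are hypothetical.

```python
KJ_PER_KCAL = 4.184  # one kilocalorie (food "calorie") is 4.184 kilojoules


def atwater_kcal(fat_g, protein_g, carb_g):
    """Estimate calories from macronutrient grams using the 9-4-4 rule."""
    return 9 * fat_g + 4 * (protein_g + carb_g)


def kj_to_kcal(energy_kj):
    """Convert kilojoules to kilocalories."""
    return energy_kj / KJ_PER_KCAL


# Values for the yogurt in row 20, copied from the output above
derived = atwater_kcal(fat_g=0, protein_g=3.54, carb_g=9.73)

# ASSUMPTION: energy_100g is in kilojoules
converted = kj_to_kcal(222)

print(round(derived, 2), round(converted, 2))
```

Both figures land near 53 calories per 100 g, which is consistent with the label's 60 calories per 113 g serving. If the kilojoule assumption holds, energy_100g could serve as a fallback for rows where a macronutrient is missing.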

We will be unable to check the ingredients on this particular item for harmful substances like high fructose corn syrup and hydrogenated oils, but we are able to pull in information regarding the product's size, its category, and its Nutrition Score in the UK.  The UK's nutrition score<sup>2</sup> is calculated from the item's nutrient value, constituent ingredients, and calories.  A low score denotes a healthy food, and a high score denotes an unhealthy food.  Foods scoring 4 or more points, and drinks scoring 1 or more points, are classified as 'less healthy' by the Food Standards Agency in the United Kingdom.  We can use this score, when available, to check an item's overall healthfulness if other metrics are not available.
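
The thresholds described above could be applied with a small helper like the one below. This is only a sketch: the function name `is_less_healthy` and the `is_drink` flag are hypothetical, and the dataset does not necessarily mark drinks in a way that maps directly onto that flag.

```python
def is_less_healthy(score, is_drink=False):
    """Return True when a nutrition score meets the 'less healthy' threshold.

    Thresholds follow the text: 4 or more for foods, 1 or more for drinks.
    Returns None when the score is missing (None or NaN).
    """
    if score is None or score != score:  # NaN is the only value unequal to itself
        return None
    threshold = 1 if is_drink else 4
    return bool(score >= threshold)


# The yogurt above has a score of -3, so it is not flagged
print(is_less_healthy(-3))
```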

The categories are important to know for nutrition purposes as well.  If a basket has too many 'snack' items in it, we hypothesize that the basket will not meet nutrition guidelines set by regulatory bodies.


This dataset is quite large, so we will remove some unnecessary columns.  Items that do not contain nutritive data will also be removed.  The algorithm for selecting columns to remove is defined as follows:

In [9]:
column_list = list(df.columns.values)
print(column_list)
column_list_remove = []
for column in column_list:
    if re.search('image', column):
        print(column)
        column_list_remove.append(column)
print(column_list_remove)

['code', 'url', 'creator', 'created_t', 'last_modified_t', 'product_name', 'generic_name', 'quantity', 'packaging', 'packaging_tags', 'brands', 'brands_tags', 'categories', 'categories_tags', 'labels', 'labels_tags', 'origins', 'origins_tags', 'manufacturing_places', 'manufacturing_places_tags', 'emb_codes', 'emb_codes_tags', 'cities', 'cities_tags', 'purchase_places', 'stores', 'countries', 'ingredients_text', 'allergens', 'allergens_tags', 'traces', 'traces_tags', 'serving_size', 'no_nutriments', 'additives_n', 'additives', 'additives_tags', 'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil', 'ingredients_from_palm_oil_tags', 'ingredients_that_may_be_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil', 'ingredients_that_may_be_from_palm_oil_tags', 'pnns_groups_1', 'pnns_groups_2', 'main_category', 'image_url', 'image_small_url', 'image_front_url', 'image_front_small_url', 'image_ingredients_url', 'image_ingredients_small_url', 'image_nutrition_url', 'image_nutrition_sma
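
The cell above only collects the names of the image columns. One way they might then be dropped is sketched below on a tiny stand-in frame; `demo` and its values are made up for illustration, and the real frame would be the `df` loaded earlier.

```python
import re

import pandas as pd

# Stand-in frame; the real data comes from openfoodfacts_search.csv
demo = pd.DataFrame({
    "code": [1, 2],
    "image_url": ["a", "b"],
    "image_small_url": ["c", "d"],
    "fat_100g": [0.0, 1.5],
})

# Same selection rule as the cell above: any column name containing 'image'
image_columns = [col for col in demo.columns if re.search("image", col)]

# drop returns a new frame; axis=1 means the labels are column names
trimmed = demo.drop(image_columns, axis=1)
print(list(trimmed.columns))
```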

Our dataframe will be pared down to around 1,200 items to select from, given that we want to disregard incomplete records:

In [10]:
target_list = ['code', 'url', 'quantity', 'product_name', 'calories_100g','carbohydrates_100g', 
    'fat_100g', 'proteins_100g','serving_size', 'ingredients_text', 'categories', 'nutrition-score-uk_100g']
for item in target_list:
    print(item, ":", df[item].count())

code : 2820
url : 2820
quantity : 1846
product_name : 2356
calories_100g : 1270
carbohydrates_100g : 1292
fat_100g : 1295
proteins_100g : 1284
serving_size : 1362
ingredients_text : 1165
categories : 1695
nutrition-score-uk_100g : 1110
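
The counts above show how many non-null values each field has. Discarding incomplete records could then look like the sketch below, which keeps only rows where every required macronutrient is present. The `demo` frame and the choice of `required` columns are illustrative assumptions, not the study's final selection.

```python
import numpy as np
import pandas as pd

# Stand-in frame with some missing macronutrient values
demo = pd.DataFrame({
    "product_name": ["yogurt", "crackers", "soda"],
    "fat_100g": [0.0, 14.0, np.nan],
    "proteins_100g": [3.54, 9.0, 0.0],
    "carbohydrates_100g": [9.73, np.nan, 10.6],
})

# Columns that must all be present for a record to count as complete
required = ["fat_100g", "proteins_100g", "carbohydrates_100g"]

# dropna with subset only inspects the listed columns
complete = demo.dropna(subset=required)
print(len(demo), "->", len(complete))
```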


### Part 2: Statistical Methodology

We will use the bootstrapping method and the Student's t-distribution to construct and analyze a random set of 10,000 days.  For those days, we will see how often a day meets, exceeds, or falls short of the FDA guidelines for calorie and nutrient intake.

The bootstrapping method is random sampling with replacement from a set that is assumed to be incomplete.<sup>1</sup>  NumPy's random.choice function supports sampling with replacement<sup>2</sup>, so it is well suited to building a bootstrapping model.  The SciPy module includes a Student's t-distribution<sup>3</sup>, so we will be able to use these packages to answer our study questions.
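
A minimal sketch of how these two pieces might fit together is shown below. The calorie values, the 2,000-calorie allowance, and the fixed seed are all hypothetical placeholders; the 10,000 days and 20 servings per day follow the figures used elsewhere in this report.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL calories per 100 g serving for a small pool of products
calories_100g = np.array([53.0, 480.0, 42.0, 250.0, 365.0, 90.0, 520.0, 150.0])

n_days = 10000       # number of simulated days
items_per_day = 20   # 100 g servings eaten per day

# Sample with replacement: each row is one simulated day of 20 servings
days = rng.choice(calories_100g, size=(n_days, items_per_day), replace=True)
daily_calories = days.sum(axis=1)

# Share of simulated days above a HYPOTHETICAL 2,000-calorie allowance
share_over = (daily_calories > 2000).mean()

# 95% Student's t interval for the mean daily calories
mean = daily_calories.mean()
sem = stats.sem(daily_calories)
low, high = stats.t.interval(0.95, len(daily_calories) - 1, loc=mean, scale=sem)

print(share_over, mean, low, high)
```

With the real data, `calories_100g` would be replaced by the cleaned column from the dataframe, and the same sampling could be repeated for each nutrient of interest.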

### Part 3: Regulatory Body Recommendations

something about FDA and USDA here

### Part 4: Bias

Bias is introduced in our study via numerous channels.  The entire layout of a grocery store is extremely biased, with soda and snack companies paying top dollar to occupy the most coveted positions and aisles.  Customers have to take extra care to avoid the appealing packages and advertising of calorically dense food and often have to seek out the healthy and minimally processed foods.<sup>1</sup>

Raw foods such as bulk rice, beans, and produce often lack Universal Product Codes (UPCs).  Our data source is based upon UPCs, and thus may exclude some of the most healthful foods in the average supermarket.  The Food Facts data source may therefore skew towards potentially unhealthy foods and not reflect what people actually purchase at the grocery store.

Food Facts' collection methods also allow for bias.  The database is built upon an opt-in model of user-contributed data.  Contributors have to both know about the project and take the time and effort to upload images, UPCs, and packaging details to Food Facts.  This will bias against consumers who cannot afford the time, or the luxury of a smartphone, to contribute.<sup>2</sup>

There is also personal bias from the preparers of this study.  One of the authors is a plant-based vegetarian, and another is a food scientist.  Our unique perspectives may introduce bias to our findings.  All of the authors reside in the United States, and thus our findings may not hold true in all markets.

### Part 5: Limitations & Shopping Habits

The Open Food Facts database is populated by consumers who opt in.  It does not include a comprehensive list of what is available in the typical grocery store, nor does it represent all foods available in all markets.  In addition, the package sizing data is not consistent for all products.  Some products state that their package contains 800 grams, while other products state that they contain six servings.  The data may or may not include a serving size, and if it does, the units of measure are not consistent.

To mitigate these limitations, we rely on outside estimates: the typical US resident consumes roughly 4 to 4.5 pounds of food per day (about 1,800 to 2,040 grams per day)<sup>1</sup>, the average person purchases 20 items at a grocery store per week<sup>2</sup>, and the average person eats out about six times a week<sup>3</sup>.  Much of the dataset reports nutrition information per 100 grams, so we estimate that each person purchases 20 five-serving items from a grocery store each week and that each serving is about 100 grams.  At roughly 2,040 grams of food per day, a person therefore eats about 20 100-gram servings per day.  We will therefore subset the data to exclude products that have no nutritive information listed and randomly select 20 items to represent one day's worth of food.  In addition, the calorie field did not exist in the dataset, so we had to derive it as a calculated field.
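
The unit conversion behind the 20-servings figure can be checked with a few lines of arithmetic. This is only a sketch of the reasoning above; the variable names are illustrative.

```python
GRAMS_PER_POUND = 453.592

# Daily food intake range from the text: 4 to 4.5 pounds
low_g = 4.0 * GRAMS_PER_POUND
high_g = 4.5 * GRAMS_PER_POUND

# Assumed serving size of 100 grams, matching the per-100 g nutrition fields
serving_g = 100
servings_low = low_g / serving_g
servings_high = high_g / serving_g

print(round(low_g), round(high_g), round(servings_low, 1), round(servings_high, 1))
```

The upper end of the range works out to a little over 20 servings of 100 grams, which is the basis for drawing 20 items per simulated day.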

### References

#### Part 1: The Data
1. [Open Food Facts](http://world.openfoodfacts.org/)
2. [Food Standards Agency - Food Profiling](http://www.food.gov.uk/sites/default/files/multimedia/pdfs/techguidenutprofiling.pdf)


#### Part 2: Statistical Methodology
1. [Introduction to Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))
2. [Implementing Bootstrapping using Python](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html)
3. [Student-T Python Module](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html)


#### Part 3: Regulatory Body Recommendations
1. [FDA Dietary Guidelines, 2015](https://health.gov/dietaryguidelines/2015/guidelines/)
2. [USDA National Nutrient Database](https://ndb.nal.usda.gov/)

#### Part 4: Bias
1. [The Omnivore's Dilemma by Michael Pollan](http://michaelpollan.com/books/the-omnivores-dilemma/)
2. [Open Food Facts' Terms of Contribution](http://world.openfoodfacts.org/terms-of-use#contribution)

#### Part 5: Limitations
1. [USDA Fact Book](http://www.usda.gov/factbook/chapter2.pdf)
2. [Grocery Shoppers' Habits](http://www.marketingcharts.com/traditional/the-average-grocery-shopper-buys-less-than-1-of-available-items-over-the-course-of-the-year-39360/)
3. [United States Healthful Food Council](http://ushfc.org/about/)