# FoodDude: Nutritional Analysis of Foods
### Team Dabberlee - Jaiden Gerig, Oron Hazi, Justin Katz, Kyle Wilson

## Overview and Motivation
In this notebook, we explore the foods in the USDA database to see whether any of them can provide all the nutrients required by a basic 2000 calorie diet. That is, could a single food provide all the nutrients a human needs while staying within a reasonable calorie range?

Originally, one of our team members, Kyle Wilson, prepared a data science spotlight about trying to fulfill desired macronutrients for the least amount of money with fast food. We wanted to look more closely at food analysis through nutrients without regard to price. The USDA database provided us with nutrients for many groceries and fast foods, which gave us a nice base to work off of.

## Related work
Soylent advertises itself as the only food a human needs in a day. It provides “all the protein, carbohydrates, lipids, and micronutrients that a body needs to thrive.” While Soylent is engineered specifically for this purpose, we wanted to see if there is a food already out there that can provide the same benefits that Soylent can.

## Initial Questions
* Is there a food that, if eaten solely, could fulfill all the nutritional requirements of a 2000 calorie diet?
* What are the foods that provide all the nutritional value you need for the fewest calories?
* What are some overall trends of the best foods?
* What are the trends of all the foods that provide you with enough nutrients?

## Exploratory Analysis

All of the data we needed was readily available in a CSV file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from bokeh.io import push_notebook,show,output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.charts import Bar, output_file, show
from bokeh.models import Range1d
from bokeh.charts.operations import blend
from bokeh import palettes
%matplotlib inline
output_notebook()
calfoods = pd.read_csv('ABBREV.csv', index_col=0)
calfoods.head(5)

After importing, we did some light cleaning, where we:
* Removed foods that don't contain all the essential nutrients
* Normalized all the nutrients to 1 calorie
* Made the column names a bit easier to read
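
As a rough illustration of the normalization step, the sketch below divides every numeric column by each food's calorie count using a vectorized pandas operation instead of a row-by-row function. The two foods and their values are made up for the example and are not taken from the USDA file.

```python
import pandas as pd

# Hypothetical per-100 g values, invented for illustration only.
toy = pd.DataFrame({
    'Name': ['food_a', 'food_b'],
    'Protein (g)': [10.0, 2.0],
    'Sodium (mg)': [400.0, 50.0],
    'Calories': [200.0, 50.0],
})

numeric_cols = ['Protein (g)', 'Sodium (mg)', 'Calories']

# Divide each row's numeric columns by that row's calorie count,
# so every food is expressed per 1 calorie.
per_calorie = toy.copy()
per_calorie[numeric_cols] = toy[numeric_cols].div(toy['Calories'], axis=0)

print(per_calorie)
```

After this step the `Calories` column is 1 for every food, which makes the foods directly comparable.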

In [None]:
#Rename some of our columns to something a bit easier on the eyes
calfoods = calfoods.rename(index=str,columns={'Shrt_Desc':'Name','Protein_(g)':'Protein (g)','Lipid_Tot_(g)':'Total Fat(g)','Cholestrl_(mg)':'Cholesterol (mg)',
               'FA_Sat_(g)':'Saturated Fat (g)','Sodium_(mg)':'Sodium (mg)','Potassium_(mg)':'Potassium (mg)',
               'Carbohydrt_(g)':'Carbohydrates (g)','Fiber_TD_(g)':'Fiber (g)','Energ_Kcal':'Calories'})
# Look at a specific subset of nutrients
calnutrients = ['Name','Protein (g)','Total Fat(g)','Cholesterol (mg)',
               'Saturated Fat (g)','Sodium (mg)','Potassium (mg)',
               'Carbohydrates (g)','Fiber (g)','Calories','Weight (g)']
calfoods['Weight (g)'] = 100.0
calfoods = calfoods[calnutrients]
calfoods = calfoods.fillna(0)
# Drop foods with no calories, since they cannot be normalized per calorie
calfoods = calfoods[calfoods['Calories'] > 0]

# Normalizing all our foods to 1 calorie
def normalizeNutrientsCal(x):
    ratio = x['Calories']
    for nutrient in calnutrients:
        if(type(x[nutrient]) is str):
            continue
        x[nutrient] = x[nutrient]/ratio
    return x
calfoods = calfoods.apply(normalizeNutrientsCal,axis=1)
calfoods.head()

# We only want foods that have a chance to sustain our needs,
# i.e. foods with a nonzero amount of every nutrient
def filterNutrientsCal(x):
    for nutrient in calnutrients:
        # Skip text columns such as 'Name'
        if(type(x[nutrient]) is str):
            continue
        if(x[nutrient] <= 0):
            return False
    return True
calfoods = calfoods[calfoods.apply(filterNutrientsCal, axis=1)]
print("Foods found: %d" % len(calfoods))
calfoods.head(5)

Once the foods were cleaned, we could scale each one up until it met the recommended daily amounts of nutrients for a 2000 Calorie diet, which are:


| Nutrient              | Unit of Measure | Daily Values |
|-----------------------|-----------------|--------------|
| Total Fat             | grams (g)       | 65           |
| Saturated fatty acids | grams (g)       | 20           |
| Cholesterol           | milligrams (mg) | 300          |
| Sodium                | milligrams (mg) | 2400         |
| Potassium             | milligrams (mg) | 3500         |
| Total carbohydrate    | grams (g)       | 300          |
| Fiber                 | grams (g)       | 25           |
| Protein               | grams (g)       | 50           |

For the purposes of our analysis we ignored vitamins as they would disqualify too many foods and could be obtained without any calories through a multi-vitamin.
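
To make the scaling concrete, here is a small sketch of the underlying arithmetic, assuming a food has already been normalized to 1 calorie. The per-calorie values for the example food are hypothetical, while the daily values come from the table above. This is a compact reformulation rather than the notebook's own loop: since every nutrient must reach its daily value, the food has to be scaled by the largest of the per-nutrient ratios.

```python
# Daily values for a 2000 Calorie diet, taken from the table above.
daily_values = {
    'Protein (g)': 50, 'Total Fat(g)': 65, 'Cholesterol (mg)': 300,
    'Saturated Fat (g)': 20, 'Sodium (mg)': 2400, 'Potassium (mg)': 3500,
    'Carbohydrates (g)': 300, 'Fiber (g)': 25,
}

# Hypothetical nutrient amounts per 1 calorie, invented for illustration.
example_food = {
    'Protein (g)': 0.05, 'Total Fat(g)': 0.04, 'Cholesterol (mg)': 0.2,
    'Saturated Fat (g)': 0.01, 'Sodium (mg)': 4.0, 'Potassium (mg)': 2.0,
    'Carbohydrates (g)': 0.125, 'Fiber (g)': 0.02,
}

def calories_to_satisfy(food, targets):
    """Return the calories of `food` needed so that every target is met."""
    # Each ratio is the number of calories needed for that nutrient alone;
    # the largest ratio is the binding requirement.
    return max(targets[n] / float(food[n]) for n in targets)

needed = calories_to_satisfy(example_food, daily_values)

# Nutrient totals once the food is scaled up to `needed` calories.
scaled = {n: example_food[n] * needed for n in daily_values}

print("Calories needed: %.0f" % needed)
```

In this made-up case carbohydrates are the binding nutrient, so the food would have to be eaten in a 2400 Calorie quantity and every other nutrient ends up above its daily value.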

In [None]:
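# Daily values from http://www.netrition.com/rdi_page.html, listed in the
# same order as calnutrients; -1 marks columns that have no target
recommended = [-1,50,65,300,20,2400,3500,300,25,-1,-1]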
def findSatisfyingWeightCal(food):
    for x in range(0,len(calnutrients)):
        nutrient = calnutrients[x]
        rec = recommended[x]
        if(rec == -1 or food[nutrient] >= rec):
            continue
        ratio = rec/food[nutrient]
        for y in calnutrients:
            if(type(food[y]) is str):
                continue
            food[y] = food[y]*ratio
    return food   
calweighted_foods = calfoods.apply(findSatisfyingWeightCal,axis=1)
display = ['Name','Calories','Weight (g)','Protein (g)','Total Fat(g)','Cholesterol (mg)',
               'Saturated Fat (g)','Sodium (mg)','Potassium (mg)',
               'Carbohydrates (g)','Fiber (g)']

### The Best Foods
Without further ado, here are the best foods we found, sorted by the total calories needed to reach the daily recommended nutrients:

In [None]:
calweighted_foods[display].sort_values(by='Calories').head(10)

This looks a lot like what we would expect "healthy" foods to be, mostly soups, potatoes, and baby food.

*Wait, did you just say baby food?*

While baby food may seem an odd pick for a "healthy" food at first, it's important to remember that babies often have a very singular diet since they have yet to develop the ability to eat most foods, and it would only make sense that one of the very few foods they eat would contain a good balance of essential nutrients.

As for our goal of finding a singular food that fits the 2000 Calorie diet, the closest we can get to the ideal 2000 Calories is vegetable spinach baby food at 2220 Calories. So it seems that no single food in the USDA database can give you the right amount of nutrients for 2000 Calories or less.

After looking at the data it was hard to judge just how closely these foods stuck to the recommended amount of nutrients, so we converted the raw numbers into their percentage above the recommended daily amount.
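
As a minimal sketch of that conversion, the percentage above the recommended amount is the scaled total divided by the daily value, minus one, times 100. The scaled totals below are hypothetical and only a subset of the nutrients is shown.

```python
import pandas as pd

# Daily values from the table above (subset, for brevity).
rec = pd.Series({'Protein (g)': 50.0, 'Sodium (mg)': 2400.0, 'Fiber (g)': 25.0})

# Hypothetical scaled totals for one food, invented for illustration.
scaled = pd.Series({'Protein (g)': 75.0, 'Sodium (mg)': 7200.0, 'Fiber (g)': 25.0})

# Percent above the recommended amount: 0 means exactly on target.
percent_over = (scaled / rec - 1) * 100

print(percent_over)
```

Here protein comes out at 50% over, sodium at 200% over, and fiber exactly on target at 0%.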

In [None]:
def findOveragesCal(food):
    for x in range(0,len(calnutrients)):
        nutrient = calnutrients[x]
        rec = recommended[x]
        if(rec == -1):
            continue
        food[nutrient] -= rec
    return food   
caloverage_foods = calweighted_foods.apply(findOveragesCal,axis=1)
def findPercentOveragesCal(food):
    for x in range(0,len(calnutrients)):
        nutrient = calnutrients[x]
        rec = recommended[x]
        if(rec == -1):
            continue
        food[nutrient] = ((food[nutrient]/rec)*100)-100
    return food   
caloverage_foods = calweighted_foods.apply(findPercentOveragesCal,axis=1)
caloverage_foods[display].sort_values(by='Calories').head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
df = caloverage_foods.sort_values(by='Calories').head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
a = Bar(df, label='vars',group='Name', 
        values=blend('Protein (%)', 'Total Fat(%)','Cholesterol (%)',
                     'Saturated Fat (%)','Sodium (%)','Potassium (%)',
                     'Carbohydrates (%)','Fiber (%)',name='values', labels_name='vars'),
        title="Excess Nutrients (% above recommended daily intake)",width=900,palette=palettes.BrBG11)
a.xaxis.axis_label = ""
a.yaxis.axis_label = "% above recommended daily intake"
show(a)

Wow! That's a ton of sodium!

Looking at this graph makes it seem as though sodium and potassium are the largest overages by a huge margin, but it's important to keep in mind that both of these nutrients are measured in milligrams and thus, changes to their content have a greater impact on these percentages.

We decided we needed to take another look at this graph without potassium and sodium to get a clearer picture of how the nutrients measured in grams stacked up to each other.

In [None]:
df = caloverage_foods.sort_values(by='Calories').head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
a = Bar(df, label='vars',group='Name', 
        values=blend('Protein (%)', 'Total Fat(%)','Cholesterol (%)',
                     'Saturated Fat (%)',
                     'Carbohydrates (%)','Fiber (%)',name='values', labels_name='vars'),
        title="Excess Nutrients (% above recommended daily intake) (Excluding Sodium & Potassium)",width=900,height=1000,palette=palettes.BrBG11)
a.xaxis.axis_label = ""
a.yaxis.axis_label = "% above recommended daily intake"
show(a)

So, second to sodium and potassium, our best foods have the highest overages in protein and saturated fat. While these foods may be great for giving you all the essential nutrients you need, you may get a bit more than you bargained for in the form of sodium, protein, and saturated fat.

Taking another look at our darling child, vegetable-spinach baby food, shows that its most prominent overages are in saturated fat, fiber, and protein. This, again, makes sense in the context of a baby's life, as a baby needs a lot of these nutrients to grow bigger and stronger.

The final thing to note is that out of our best foods, the lowest nutrient overall seems to be carbohydrates. This may explain why low carb diets often work for so many people as they may be getting a much more balanced set of nutrients, which would improve their health overall.


### The Worst Foods
After looking at the best foods, we thought it was only fair to look at the worst foods by calorie as well, hoping that it would provide insight into what makes certain foods more nutritious than others.

In [None]:
calweighted_foods[display].sort_values(by='Calories',ascending=False).head(10)

More than half of that list is just candy and cookies, with Mother's circus animals and its variations being the most prominent.

Sausage, however, is the clear leader here, but it's not clear by just how much until we look at the overages chart.

The most interesting part of the list was McDonald's yogurt parfaits, as a parfait is usually seen as one of the healthier desserts you can have, and its appearance on our list actually exposes a small hole in our methodology. The food itself is fairly healthy overall, but it lacks many of the key nutrients we're looking for and thus ends up on our worst foods list because it has to be scaled up very high to get these nutrients to where we want them to be.
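
One way to see this effect is to check which nutrient forces the scaling for each food. The sketch below uses hypothetical per-calorie values (not USDA data) and a subset of the daily values, and reports the nutrient with the largest required multiplier for each food.

```python
import pandas as pd

# Daily values from the table earlier in the notebook (subset for brevity).
rec = pd.Series({'Protein (g)': 50.0, 'Carbohydrates (g)': 300.0, 'Fiber (g)': 25.0})

# Hypothetical per-calorie nutrient values, invented for illustration.
foods = pd.DataFrame(
    {'Protein (g)': [0.05, 0.10],
     'Carbohydrates (g)': [0.20, 0.01],
     'Fiber (g)': [0.001, 0.02]},
    index=['low_fiber_food', 'low_carb_food'],
)

# Calories needed to reach each nutrient's daily value on its own.
calories_per_nutrient = (1.0 / foods).mul(rec, axis='columns')

# The scarcest nutrient sets the total amount that has to be eaten.
limiting = calories_per_nutrient.idxmax(axis=1)
calories_needed = calories_per_nutrient.max(axis=1)

print(pd.DataFrame({'limiting nutrient': limiting,
                    'calories needed': calories_needed}))
```

A food that is nearly missing a single nutrient gets a very large multiplier from that nutrient alone, regardless of how balanced the rest of its profile is.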

In [None]:
def findOveragesCal(food):
    for x in range(0,len(calnutrients)):
        nutrient = calnutrients[x]
        rec = recommended[x]
        if(rec == -1):
            continue
        food[nutrient] -= rec
    return food   
caloverage_foods = calweighted_foods.apply(findOveragesCal,axis=1)
def findPercentOveragesCal(food):
    for x in range(0,len(calnutrients)):
        nutrient = calnutrients[x]
        rec = recommended[x]
        if(rec == -1):
            continue
        food[nutrient] = ((food[nutrient]/rec)*100)-100
    return food   
caloverage_foods = calweighted_foods.apply(findPercentOveragesCal,axis=1)
caloverage_foods[display].sort_values(by='Calories',ascending=False).head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
df = caloverage_foods.sort_values(by='Calories',ascending=False).head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
a = Bar(df, label='vars',group='Name', 
        values=blend('Protein (%)', 'Total Fat(%)','Cholesterol (%)',
                     'Saturated Fat (%)','Sodium (%)','Potassium (%)',
                     'Carbohydrates (%)','Fiber (%)',name='values', labels_name='vars'),
        title="Excess Nutrients (% above recommended daily intake)",width=900,palette=palettes.BrBG11)
a.xaxis.axis_label = ""
a.yaxis.axis_label = "% above recommended daily intake"
show(a)

Pork sausage is far and away the winner here, destroying the competition in five of the eight nutrients we're looking at. This large scaling is due to its lack of carbohydrates and fiber, which accentuates its wealth of saturated fat, protein, and sodium.

Overall though, these foods are much lower in sodium, potassium, and protein than our best foods.

### Looking At Everything

After looking at the best and worst foods, we wanted a better look at the trends across all our foods, so we took a look at the average overages across our entire corpus of data.


In [None]:
avg_nutrients = caloverage_foods.rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'}).mean()
avg_nutrients = avg_nutrients.drop("Weight (g)")
avg_nutrients = avg_nutrients.drop("Calories")
p = Bar(avg_nutrients)
show(p)

So over our entire data set, we're seeing very high levels of protein, fats and sodium compared to other nutrients. Surprisingly, the lowest average was cholesterol, which indicates that many foods have a healthy proportion of cholesterol overall.

We observed similar trends when looking at the medians, which suggests that the pattern is not driven by a few outliers.
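
A quick way to compare the two summaries side by side is sketched below. The overage percentages are hypothetical, and one extreme value is included to show why the median is a useful cross-check on the mean.

```python
import pandas as pd

# Hypothetical percent overages for four foods, invented for illustration.
overages = pd.DataFrame({
    'Protein (%)': [100.0, 120.0, 140.0, 2000.0],
    'Cholesterol (%)': [10.0, 20.0, 30.0, 40.0],
})

# Put the mean and the median of each nutrient next to each other.
summary = pd.DataFrame({'mean': overages.mean(), 'median': overages.median()})

print(summary)
```

In this toy data the single extreme protein value pulls the mean far above the median, while the two summaries agree for cholesterol.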

## Final Analysis

Several foods could, on their own, fulfill the nutritional recommendations of a traditional diet, although none did so within 2000 Calories. Spinach baby food provides the best balance of nutrients. Considering that the amount that must be eaten to reach 2000 calories is twice as much as that of the other foods, perhaps we should include grams as an additional parameter to draw a more reasonable conclusion (see appendix). If you're skeptical about reliving the good ol' days with spinach mush, chicken noodle soup, chili and beans, or Kashi chicken fettuccini are great 'superfood' alternatives.
Our 'best foods' tend to be lower in carbohydrates than the worst foods, and have a good balance of nutrients. Sodium levels in the foods tend to be extremely high, which makes sense, as most of them are pre-prepared meals. On average, foods in the USDA database tend to have more excess fat and sodium than any other nutrient.

## Appendix: What about grams?
Throughout this notebook we looked at our foods through the lens of calories, but we also had a few interesting findings when we looked at which foods meet the nutritional guidelines for the least weight.

In [None]:
calweighted_foods[display].sort_values(by='Weight (g)').head(10)

There are a lot of pancakes, pizza, and school lunches on the list.
To put some of these foods into perspective:

* Potato Pancakes: 1078 grams = 49 pancakes
* Chicken Noodle Soup: 1143 grams = 15 packets
* Digiorno Thin Crust Pizza: 1305 grams = 2.4 Pizzas
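
These conversions are simple divisions of the required weight by the weight of one item. In the sketch below, the per-item weights are assumptions backed out of the figures in the list, not values from the USDA file.

```python
# Required weight in grams for each food, from the list above.
required_grams = {
    'Potato Pancakes': 1078.0,
    'Chicken Noodle Soup': 1143.0,
    'Digiorno Thin Crust Pizza': 1305.0,
}

# Assumed weight of one item in grams (hypothetical, backed out of the list).
grams_per_item = {
    'Potato Pancakes': 22.0,              # one pancake
    'Chicken Noodle Soup': 76.2,          # one packet
    'Digiorno Thin Crust Pizza': 543.75,  # one pizza
}

# Number of items needed = required weight / weight of one item.
items_needed = {}
for name in sorted(required_grams):
    items_needed[name] = required_grams[name] / grams_per_item[name]
    print("%s: %.1f items" % (name, items_needed[name]))
```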

These are usually the foods that people buy when they need to feed a lot of people, but what's interesting is that the companies seem to be maximizing the nutritional content of these foods for the least amount of weight possible.

In [None]:
df = caloverage_foods.sort_values(by='Weight (g)').head(10).rename(index=str,columns={'Protein (g)':'Protein (%)','Total Fat(g)':'Total Fat(%)','Cholesterol (mg)':'Cholesterol (%)',
               'Saturated Fat (g)':'Saturated Fat (%)','Sodium (mg)':'Sodium (%)','Potassium (mg)':'Potassium (%)',
               'Carbohydrates (g)':'Carbohydrates (%)','Fiber (g)':'Fiber (%)'})
a = Bar(df, label='vars',group='Name', 
        values=blend('Protein (%)', 'Total Fat(%)','Cholesterol (%)',
                     'Saturated Fat (%)','Sodium (%)','Potassium (%)',
                     'Carbohydrates (%)','Fiber (%)',name='values', labels_name='vars'),
        title="Excess Nutrients (% above recommended daily intake)",width=900)
a.xaxis.axis_label = ""
a.yaxis.axis_label = "% above recommended daily intake"
show(a)

When you look at the percent overages on the foods in the above graph, it's clear that they run into the same problems as most of the foods in our data set with high levels of sodium, protein, and fats.

On the whole, the results we get when optimizing by weight exhibit similar characteristics to those found when we optimized by calorie, but reveal a different set of foods that serve a different purpose.