# Exploring the data downloaded from USDA FoodData Central

See the download here: https://fdc.nal.usda.gov/download-datasets.html

The data is available in `data/`.

The data dictionary is available in `nutrify/data_exploration/data/FoodData_Central_foundation_food_csv_2021-04-28/Download & API Field Descriptions April 2021.pdf`.





In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Get Data

In [2]:
df = pd.read_csv("data/FoodData_Central_foundation_food_csv_2021-04-28/food.csv")
df

|  | fdc_id | data_type | description | food_category_id | publication_date |
|---|---|---|---|---|---|
| 0 | 319874 | sample_food | HUMMUS, SABRA CLASSIC | 16.0 | 2019-04-01 |
| 1 | 319875 | market_acquisition | HUMMUS, SABRA CLASSIC | 16.0 | 2019-04-01 |
| 2 | 319876 | market_acquisition | HUMMUS, SABRA CLASSIC | 16.0 | 2019-04-01 |
| 3 | 319877 | sub_sample_food | Hummus | 16.0 | 2019-04-01 |
| 4 | 319878 | sub_sample_food | Hummus | 16.0 | 2019-04-01 |
| ... | ... | ... | ... | ... | ... |
| 27588 | 1757386 | sub_sample_food | NaN | 16.0 | 2021-04-23 |
| 27589 | 1757387 | sub_sample_food | NaN | 16.0 | 2021-04-23 |
| 27590 | 1757388 | sub_sample_food | NaN | 16.0 | 2021-04-23 |
| 27591 | 1757389 | sub_sample_food | NaN | 16.0 | 2021-04-23 |


In [15]:
# How many unique?
unique_descriptions = df["description"].unique()
len(unique_descriptions)

11368

Beautiful, this gives us 11,368 unique descriptions to work with as a goal to model (one of them is NaN, since `unique()` counts missing values as an entry). But surely they can be split into fewer categories?
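Since `unique()` keeps NaN as its own entry while `nunique()` drops it by default, the two counts can differ by one. Here is a minimal sketch on a toy Series. The name `descriptions` is a stand-in for `df["description"]`, and the handful of values are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["description"] (assumption: not the real column)
descriptions = pd.Series(
    ["HUMMUS, SABRA CLASSIC", "Hummus", "Hummus", np.nan, np.nan]
)

n_with_nan = len(descriptions.unique())  # unique() keeps NaN as a distinct value
n_without_nan = descriptions.nunique()   # nunique() ignores NaN by default
n_missing = int(descriptions.isna().sum())  # rows with no description at all

print(n_with_nan, n_without_nan, n_missing)
```

On the real data, `df["description"].isna().sum()` would show how many rows have no description to model against.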

In [16]:
unique_descriptions[:10]

array(['HUMMUS, SABRA CLASSIC', 'Hummus', 'HUMMUS, OTHER',
       'Hummus - NFY12140O', 'Hummus - NFY12140P', 'Hummus - NFY12140Q',
       'Hummus - NFY12140R', 'Hummus - NFY12140S', 'Hummus - NFY12140F',
       'Hummus - NFY12140G'], dtype=object)

In [17]:
unique_descriptions[-10:]

array(['MUSHROOMS, LIONS MANE', 'OIL, PEANUT', 'OIL, SUNFLOWER',
       'OIL, SAFFLOWER', 'OIL, OLIVE, EXTRA LIGHT',
       'SPINACH, REGULAR (MATURE)', 'SPINACH, BABY', 'TOMATOES, ROMA',
       "MUSHROOMS, LION'S MANE", nan], dtype=object)

In [18]:
# Find random indexes of food to explore
import random
random_number = random.randint(0, len(unique_descriptions) - 10)
unique_descriptions[random_number:random_number+10]

array(['Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT & VINE RIPE (CA2, NC) -  CY120AX',
       'Tomatoes, diced, canned, STORE BRAND, VINE RIPE (NC) - NFY120AAU',
       'Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT (CA) - NFY120AD2',
       'Tomatoes, diced, canned, STORE BRAND, VINE RIPE (NC) - NFY120AAV',
       'Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT (CA) - NFY120AD1',
       'Minerals, Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT & VINE RIPE (CA2, NC) - NFY120ANZ',
       'Proximates, Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT & VINE RIPE (CA2, NC) - NFY120AO0',
       'Sugars, Tomatoes, diced, canned, STORE BRAND, SUNNY SELECT & VINE RIPE (CA2, NC) - NFY120AO1',
       'Tomatoes, diced, canned, STORE BRAND, HYTOP PREMIUM QUALITY & AMERICAS CHOICE (CA1, NY) -  CY120B0',
       'Tomatoes, diced, canned, STORE BRAND, AMERICAS CHOICE (NY) - NFY120AAL'],
      dtype=object)
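Many of these descriptions end with what looks like a sample identifier (for example ` - NFY120AAU`), which inflates the unique count. One possible way to collapse them is to strip that trailing code with a regular expression. This is only a sketch: the pattern is a guess based on the handful of strings shown above, and the variable names are hypothetical.

```python
import pandas as pd

# A few descriptions copied from the outputs above
descriptions = pd.Series([
    "Hummus - NFY12140O",
    "Hummus - NFY12140P",
    "Tomatoes, diced, canned, STORE BRAND, VINE RIPE (NC) - NFY120AAU",
    "Tomatoes, diced, canned, STORE BRAND, VINE RIPE (NC) - NFY120AAV",
    "OIL, PEANUT",
])

# Assumed pattern: whitespace, a dash, whitespace, then an uppercase
# alphanumeric code at the very end of the string
suffix_pattern = r"\s+-\s+[A-Z0-9]+$"
base_names = descriptions.str.replace(suffix_pattern, "", regex=True)

print(descriptions.nunique(), "->", base_names.nunique())
```

This does not touch prefixes such as `Minerals,` or `Proximates,`, nor does it merge differently cased names like `Hummus` and `HUMMUS`, so it would only be a first pass.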

In [14]:
df.columns

Index(['fdc_id', 'data_type', 'description', 'food_category_id',
       'publication_date'],
      dtype='object')

### Food Categories

Let's dive into food categories. 

In [20]:
unique_categories = df["food_category_id"].unique()
len(unique_categories)

19

19 distinct values in `food_category_id` (`unique()` counts NaN as one of them if any rows are missing a category)... I wonder what these are?

In [21]:
df["food_category_id"].value_counts()

1.0     6406
9.0     3982
11.0    3788
4.0     2924
16.0    2450
5.0     1503
14.0     918
15.0     913
7.0      795
10.0     613
20.0     588
6.0      568
18.0     488
25.0     474
13.0     454
2.0      386
12.0     267
19.0      54
Name: food_category_id, dtype: int64
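The listing above has 18 rows while `unique()` reported 19 values, which is consistent with one of the unique values being NaN: `value_counts()` drops missing values by default. A small sketch of the difference, using a toy Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["food_category_id"] (assumption: illustrative values only)
category_ids = pd.Series([1.0, 1.0, 9.0, 16.0, np.nan])

n_unique = len(category_ids.unique())                      # NaN counted as a value
counts_default = category_ids.value_counts()               # NaN dropped
counts_with_nan = category_ids.value_counts(dropna=False)  # NaN kept as its own row

print(n_unique, len(counts_default), len(counts_with_nan))
```

Running `df["food_category_id"].value_counts(dropna=False)` on the real data would show how many foods have no category assigned.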

In [25]:
# Get food categories
food_cats = pd.read_csv("data/FoodData_Central_Supporting_Data_csv_2021-04-28/food_category.csv")
food_cats

|  | id | code | description |
|---|---|---|---|
| 0 | 1 | 100 | Dairy and Egg Products |
| 1 | 2 | 200 | Spices and Herbs |
| 2 | 3 | 300 | Baby Foods |
| 3 | 4 | 400 | Fats and Oils |
| 4 | 5 | 500 | Poultry Products |
| 5 | 6 | 600 | Soups, Sauces, and Gravies |
| 6 | 7 | 700 | Sausages and Luncheon Meats |
| 7 | 8 | 800 | Breakfast Cereals |
| 8 | 9 | 900 | Fruits and Fruit Juices |
| 9 | 10 | 1000 | Pork Products |
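With the category table loaded, one way to put names on the IDs is a left merge of the food table against it. The sketch below uses toy frames: the category rows are copied from the table above, but the food rows (`fdc_id` values and descriptions) are made-up placeholders, and the names `foods`, `cats` and `labelled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Placeholder food rows (assumption: not real FoodData Central records)
foods = pd.DataFrame({
    "fdc_id": [1, 2, 3],
    "description": ["food A", "food B", "food C"],
    "food_category_id": [1.0, 9.0, np.nan],
})

# Category rows taken from food_category.csv as shown above
cats = pd.DataFrame({
    "id": [1, 4, 9],
    "code": [100, 400, 900],
    "description": ["Dairy and Egg Products", "Fats and Oils", "Fruits and Fruit Juices"],
})

# Rename to avoid clashing with the food "description" column, and cast the
# key to float so it matches the dtype of food_category_id
cats_lookup = cats.rename(columns={"id": "food_category_id", "description": "category"})
cats_lookup["food_category_id"] = cats_lookup["food_category_id"].astype(float)

# how="left" keeps every food row, leaving NaN where no category matches
labelled = foods.merge(
    cats_lookup[["food_category_id", "category"]],
    on="food_category_id",
    how="left",
)
print(labelled[["description", "category"]])
```

The left merge is chosen so that foods without a category are kept rather than silently dropped.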
