# Background

### The Five Data Types of FoodData Central

1. USDA National Nutrient Database for Standard Reference (SR)
    - last version labeled SR Legacy and released in 2018
    - primary food-composition data type in the US for decades
    - provides average values from selected foods in the marketplace, derived from analyses of composites, calculations, and/or literature
    - analytical nutrient profiles were determined for foods in the marketplace and selected based on the market share, representing an estimate of average exposure across the United States
    - values will become less relevant as the food supply evolves over time while SR remains static
<br><br>
2. Global Branded Food Products Database (GBFPD): database of food products based on commercial label data
    - includes data from various groups
    - data centers on nutrients required by FDA food labeling, but additional data may also be available
    - nutrient data that appear on branded and private-label foods and are voluntarily provided by the food industry
<br><br>
3. Foundation Foods (FF)
    - goal is to provide information on foods used to make more complex foods (such as those from GBFPD)
    - provides information on single-ingredient foods
    - provides analytical values for components of individual food samples (NOT composite values)
    - provides data including number of samples, sampling location, date of collection, analytical approaches used, agricultural information (e.g. genotype, growing location, production practices)
    - intended to be used in coordination with food-composition data from other USDA datasets (e.g. Nutrient Uptake and Outcomes Network, the Agricultural Collaborative Research Outcomes System, Economic Research Service)
    - considered the future of USDA food information
<br><br>
4. Experimental Foods (EF): data type for food-composition data within the context of an experimental design or derived from new analytical methodology research
    - foods produced, acquired, or studied under unique conditions (e.g. alternative management systems, experimental genotypes, or research/analytical protocols)
    - set of experiments and results
    - implemented for an intended audience of researchers
<br><br>
5. Food and Nutrient Database for Dietary Studies (FNDDS): a data type developed by the Beltsville Human Nutrition Research Center's Food Surveys Research Group for national nutrition monitoring (released every 2 years in coordination with the release of "What We Eat in America" as part of NHANES)
    - provides nutrient values for the foods and beverages reported in What We Eat in America (the dietary intake component of the National Health and Nutrition Examination Survey (NHANES))
    - nutrient profiles for a majority of foods and beverages were derived via calculation utilizing 2 or more ingredient codes from FF and SR Legacy data

"USDA’s FoodData Central: what is it and why is it needed today?" (https://www.sciencedirect.com/science/article/pii/S0002916522001794?via%3Dihub)

---

# Files

include
- `food.csv`: contains ID, data_type, description (name), food category, publication_date
- `food_nutrient.csv`: table containing nutrient values for a food; contains ID of the corresponding food, ID of the corresponding nutrient, amount (per 100g), additional data such as number of observations, min, max, etc.
- `nutrient.csv`: mapping of nutrient ID to name, unit_name
- `foundation_food.csv`: footnotes on select Foundation Food items
- `food_category.csv`: mapping of food category ID to food category description
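
The included tables are meant to be joined: `food_nutrient.csv` links a food to a nutrient, and the other files supply the names. A minimal sketch of that join on invented rows follows. The column names `nutrient_id` and `amount` are assumptions based on the descriptions above, so check them against the actual CSV headers.

```python
import pandas as pd

# Toy rows shaped like the three core tables; all values are invented.
food = pd.DataFrame({
    "fdc_id": [1, 2],
    "description": ["Example apple", "Example milk"],
})
nutrient = pd.DataFrame({
    "id": [10, 20],
    "name": ["Protein", "Water"],
    "unit_name": ["G", "G"],
})
food_nutrient = pd.DataFrame({
    "fdc_id": [1, 1, 2],
    "nutrient_id": [10, 20, 10],   # assumed column name
    "amount": [0.3, 85.6, 3.2],    # per 100 g
})

# food_nutrient is the link table: one row per (food, nutrient) pair.
profile = (
    food_nutrient
    .merge(food, on="fdc_id", how="left")
    .merge(nutrient, left_on="nutrient_id", right_on="id", how="left")
    [["description", "name", "amount", "unit_name"]]
)
print(profile)
```

The left joins keep every row of `food_nutrient`, so a nutrient value with no matching food or nutrient shows up as a missing name rather than disappearing.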

exclude
- `Download & API Field Descriptions April 2023.pdf`: description of data fields
- `acquisition_samples.csv`: maps acquisition ID to sample ID (acquisitions may be blended with other acquisitions to create a sample food, and a given acquisition may be used in the creation of multiple sample foods)
- `agricultural_samples.csv`: non-processed foods; n=810; contains ID, date obtained, the name of the specific kind of food, special conditions relevant to the production of this food (e.g. "drought"), the state where this food was produced
- `all_downloaded_table_record_counts.csv`: maps table name to number of records in the table
- `food_attribute.csv`: entries for food attributes, with FDC_ID mapping the attribute to the food that the attribute pertains to, the name of the attribute, the value of the attribute (e.g. ontology name for source, is organic, ingredients, barcode, etc.) 
- `food_attribute_type.csv`: mapping of food attribute type ID to name and description (not all food attributes have a value for this field)
- `food_calorie_conversion_factor.csv`: mapping of ID (from nutrient_conversion_factor table) to multiplication factors for protein, fat, and carbohydrates used when calculating energy from macronutrients
- `food_component.csv`: data on part of a food (e.g. percent weight)
- `food_nutrient_conversion_factor.csv`: mapping of nutrient conversion factor ID to food ID
- `food_portion.csv`: food portions (e.g. 3 tsp, 1 slice) with data on gram_weight
- `food_protein_conversion_factor.csv`: mapping of ID from nutrient_conversion_factor table to multiplication factor used to calculate protein from nitrogen
- `food_update_log_entry.csv`: record on when the food was published to FoodData Central
- `input_food.csv`: foods that are ingredients in other foods
- `lab_method.csv`: mapping of method ID to description for measuring amount of a nutrient in a given food
- `lab_method_code.csv`: mapping of method ID to method code
- `lab_method_nutrient.csv`: mapping of lab method ID to nutrient ID
- `market_acquisition.csv`: data on foods acquired for analysis (e.g. brand name, store location, store name, etc.)
- `measure_unit.csv`: mapping of measurement ID to measurement description (for foods; e.g. banana, bar, can, lb)
- `sample_food.csv`: a food that is acquired as a representative sample of the food supply. It may be created from a single acquired food, or from a composite of multiple acquired foods (only contains food ID)
- `sub_sample_food.csv`: mapping of fdc_id to sample_food used in analysis
- `sub_sample_result.csv`: result of chemical analysis on a particular sample for a particular nutrient
- `survey_fndds_food.csv`: foods whose consumption is measured in NHANES What We Eat in America
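
The two conversion-factor tables above describe simple multiplications. A rough sketch of both calculations is given here. The default factors (4, 9, 4 kcal/g and 6.25) are the commonly quoted general values, used only as placeholders; the files hold food-specific factors, and the function names are hypothetical.

```python
def energy_kcal(protein_g, fat_g, carbohydrate_g,
                protein_factor=4.0, fat_factor=9.0, carbohydrate_factor=4.0):
    """Energy in kcal: each macronutrient mass times its conversion factor."""
    return (protein_g * protein_factor
            + fat_g * fat_factor
            + carbohydrate_g * carbohydrate_factor)


def protein_from_nitrogen(nitrogen_g, factor=6.25):
    """Protein in grams estimated from measured nitrogen."""
    return nitrogen_g * factor


# Hypothetical food with 3 g protein, 1 g fat and 12 g carbohydrate per 100 g.
print(energy_kcal(3.0, 1.0, 12.0))   # 69.0
print(protein_from_nitrogen(0.5))    # 3.125
```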

source: https://fdc.nal.usda.gov/fdc-datasets/FoodData_Central_foundation_food_csv_2023-04-20.zip

---

# Development Plan

1. ~~read background on USDA FoodData Central (continue reading at EF section)~~

2. ~~download files for `Foundation Foods` and review available data files~~

3. ~~clean data~~

4. ~~set up models in Django with SQLite~~

5. ~~load data into SQLite instance~~

### Future Updates
- add GBFPD foods for branded products
- add SR foods (?)
- PostgreSQL, GraphQL?

In [1]:
import os
import pandas as pd

data_dir='../data/FoodData_Central_foundation_food_csv_2023-04-20'
save_dir='../data'

## 1. Load data

In [2]:
food=pd.read_csv(f'{data_dir}/food.csv')
food_nutrient=pd.read_csv(f'{data_dir}/food_nutrient.csv')
nutrient=pd.read_csv(f'{data_dir}/nutrient.csv')
foundation_food_codes=pd.read_csv(f'{data_dir}/foundation_food.csv').values
food_category=pd.read_csv(f'{data_dir}/food_category.csv',dtype={'id':int,'code':int,'description':str})


## 2. Clean data

In [3]:
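# assign the result rather than calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about
food['description']=food['description'].fillna("")
nutrient.drop(columns=["nutrient_nbr","rank"],inplace=True)
# match on the fdc_id column only (assumed to be the first column of foundation_food.csv);
# testing against the whole .values array would also match NDB numbers and footnotes
foundation_food=food[food['fdc_id'].isin(foundation_food_codes[:,0])].copy()
food_category.drop(columns=['code'],inplace=True)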
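
The Foundation Foods filter in this step depends on which columns of `foundation_food.csv` the IDs are compared against. A small sketch with made-up IDs shows that testing against every column can admit a false match, while testing against `fdc_id` alone does not. The `NDB_number` column name is an assumption about the file layout.

```python
import pandas as pd

# Made-up IDs: food 200 is not a Foundation Food, but 200 appears as an NDB number.
food = pd.DataFrame({"fdc_id": [100, 200, 300]})
foundation = pd.DataFrame({"fdc_id": [100, 300], "NDB_number": [200, 999]})

all_values = foundation.values  # 2-D array holding every column
loose = food[food["fdc_id"].isin(all_values.ravel())]
strict = food[food["fdc_id"].isin(foundation["fdc_id"])]

print(loose["fdc_id"].tolist())   # [100, 200, 300]
print(strict["fdc_id"].tolist())  # [100, 300]
```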

In [4]:
# reorder columns to match database implementation as SQLite .import assumes columns
# are in the same order
foundation_food=foundation_food[['fdc_id','data_type','description','publication_date','food_category_id']]

## 3. Save data

In [10]:
foundation_food.to_csv(f'{save_dir}/cleaned_foundation_food.csv',index=False)
food_nutrient.to_csv(f'{save_dir}/cleaned_food_nutrient.csv',index=False)
nutrient.to_csv(f'{save_dir}/cleaned_nutrient.csv',index=False)
food_category.to_csv(f'{save_dir}/cleaned_food_category.csv',index=False)
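
The column reordering earlier exists because a positional import depends on CSV column order. One alternative, sketched here with the standard library only, is to name the columns in the `INSERT` so the load no longer depends on that order. The table definition and rows are hypothetical stand-ins; the real schema is whatever the Django models generate.

```python
import csv
import io
import sqlite3

# Stand-in for cleaned_food_category.csv; the rows are invented and the
# columns are deliberately in a different order from the table definition.
csv_text = "description,id\nExample category A,1\nExample category B,2\n"

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE food_category (id INTEGER PRIMARY KEY, description TEXT)")

reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(record["id"]), record["description"]) for record in reader]

# Naming the target columns makes the load independent of CSV column order.
conn.executemany("INSERT INTO food_category (id, description) VALUES (?, ?)", rows)
conn.commit()

loaded = conn.execute(
    "SELECT id, description FROM food_category ORDER BY id"
).fetchall()
print(loaded)
```

Reading by header name costs a little more code than a positional import, but a reordered or extended CSV then fails loudly with a `KeyError` instead of silently filling the wrong columns.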