# Macro Density Analysis: Nutritional Structure Exploration

This notebook explores the latent structure of food nutrition using multivariate statistical analysis and unsupervised learning techniques.

## Research Objective

This project investigates the following questions:
- Can foods be grouped by nutritional density patterns?
- What latent dimensions define the structure of food composition?
- Can multivariate statistical techniques reveal hidden nutritional clusters?

Rather than analyzing nutrients independently, this study treats nutrition as a high-dimensional structured space.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

## Dataset Overview

The dataset used in this analysis is the USDA Food Composition Dataset, which contains detailed nutritional information for a wide range of foods.
Each food item includes macro and micronutrient measurements, allowing for statistical exploration of nutritional patterns.

In [2]:
df = pd.read_csv("../data/USDA.csv")
df.head()

|   | ID | Description | Calories | Protein | TotalFat | Carbohydrate | Sodium | SaturatedFat | Cholesterol | Sugar | Calcium | Iron | Potassium | VitaminC | VitaminE | VitaminD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | BUTTER,WITH SALT | 717.0 | 0.85 | 81.11 | 0.06 | 714.0 | 51.368 | 215.0 | 0.06 | 24.0 | 0.02 | 24.0 | 0.0 | 2.32 | 1.5 |
| 1 | 1002 | BUTTER,WHIPPED,WITH SALT | 717.0 | 0.85 | 81.11 | 0.06 | 827.0 | 50.489 | 219.0 | 0.06 | 24.0 | 0.16 | 26.0 | 0.0 | 2.32 | 1.5 |
| 2 | 1003 | BUTTER OIL,ANHYDROUS | 876.0 | 0.28 | 99.48 | 0.0 | 2.0 | 61.924 | 256.0 | 0.0 | 4.0 | 0.0 | 5.0 | 0.0 | 2.8 | 1.8 |
| 3 | 1004 | CHEESE,BLUE | 353.0 | 21.4 | 28.74 | 2.34 | 1395.0 | 18.669 | 75.0 | 0.5 | 528.0 | 0.31 | 256.0 | 0.0 | 0.25 | 0.5 |
| 4 | 1005 | CHEESE,BRICK | 371.0 | 23.24 | 29.68 | 2.79 | 560.0 | 18.764 | 94.0 | 0.51 | 674.0 | 0.43 | 136.0 | 0.0 | 0.26 | 0.5 |


## Dataset Structure

Before conducting statistical analysis, we first examine the structure of the dataset, including the number of observations, feature types, and potential missing values.

In [3]:
df.shape

(7058, 16)

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 7058 entries, 0 to 7057
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            7058 non-null   int64  
 1   Description   7058 non-null   str    
 2   Calories      7057 non-null   float64
 3   Protein       7057 non-null   float64
 4   TotalFat      7057 non-null   float64
 5   Carbohydrate  7057 non-null   float64
 6   Sodium        6974 non-null   float64
 7   SaturatedFat  6757 non-null   float64
 8   Cholesterol   6770 non-null   float64
 9   Sugar         5148 non-null   float64
 10  Calcium       6922 non-null   float64
 11  Iron          6935 non-null   float64
 12  Potassium     6649 non-null   float64
 13  VitaminC      6726 non-null   float64
 14  VitaminE      4338 non-null   float64
 15  VitaminD      4224 non-null   float64
dtypes: float64(14), int64(1), str(1)
memory usage: 882.4 KB
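
The non-null counts above imply uneven coverage: `VitaminD` has 4224 of 7058 values present, so roughly 40% of its entries are missing, while `Calories` is missing only one. A minimal sketch of how this could be tabulated per column is shown below. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook, and the toy values are made up rather than taken from the USDA data.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most incomplete first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up values, not USDA data)
toy = pd.DataFrame({
    "Calories": [717.0, 353.0, np.nan, 371.0],
    "Sugar": [0.06, np.nan, np.nan, 0.51],
    "VitaminD": [1.5, np.nan, np.nan, np.nan],
})

print(missing_summary(toy))
```

In the notebook itself, the same helper would be called as `missing_summary(df)`. Sorting by missing count puts the sparsest nutrients first, which may help when deciding later whether to impute or drop a column.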


In [5]:
df.describe()

|   | ID | Calories | Protein | TotalFat | Carbohydrate | Sodium | SaturatedFat | Cholesterol | Sugar | Calcium | Iron | Potassium | VitaminC | VitaminE | VitaminD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7058.0 | 7057.0 | 7057.0 | 7057.0 | 7057.0 | 6974.0 | 6757.0 | 6770.0 | 5148.0 | 6922.0 | 6935.0 | 6649.0 | 6726.0 | 4338.0 | 4224.0 |
| mean | 14259.821196 | 219.695338 | 11.710368 | 10.320614 | 20.69786 | 322.05922 | 3.452267 | 41.551994 | 8.25654 | 73.530627 | 2.828368 | 301.357949 | 9.43598 | 1.487462 | 0.576918 |
| std | 8577.179705 | 172.198755 | 10.919356 | 16.814191 | 27.630443 | 1045.416931 | 6.921267 | 122.963028 | 15.361509 | 222.445338 | 6.019878 | 415.638949 | 71.256536 | 5.386914 | 4.301147 |
| min | 1001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8387.25 | 85.0 | 2.29 | 0.72 | 0.0 | 37.0 | 0.172 | 0.0 | 0.0 | 9.0 | 0.52 | 135.0 | 0.0 | 0.12 | 0.0 |
| 50% | 13293.5 | 181.0 | 8.2 | 4.37 | 7.13 | 79.0 | 1.256 | 3.0 | 1.395 | 19.0 | 1.33 | 250.0 | 0.0 | 0.27 | 0.0 |
| 75% | 18336.75 | 331.0 | 20.43 | 12.7 | 28.17 | 386.0 | 4.028 | 69.0 | 7.875 | 56.0 | 2.62 | 348.0 | 3.1 | 0.71 | 0.1 |
| max | 93600.0 | 902.0 | 88.32 | 100.0 | 100.0 | 38758.0 | 95.6 | 3100.0 | 99.8 | 7364.0 | 123.6 | 16500.0 | 2400.0 | 149.4 | 250.0 |


## Nutrient Distribution Analysis


To understand the statistical structure of the dataset, we examine the distribution of several key nutritional variables, including calories, protein, fat, and carbohydrates.
Distribution plots help reveal skewness, outliers, and general patterns in nutritional density.
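
As a numeric complement to the plots, skewness and outlier counts can also be computed directly. The summary statistics already hint at heavy right tails: for `Sodium`, the median is 79.0 while the mean is about 322 and the maximum is 38758.0. The sketch below is one possible way to quantify this. The helper name `skew_and_outliers`, the choice of the 1.5 × IQR rule, and the toy values are assumptions made for illustration, not the notebook's own method or data.

```python
import pandas as pd


def skew_and_outliers(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Return sample skewness and the count of IQR-rule outliers per column."""
    rows = {}
    for col in columns:
        values = frame[col].dropna()  # ignore missing entries
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        rows[col] = {
            "skew": values.skew(),
            "n_outliers": int(((values < lower) | (values > upper)).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up values: one right-skewed column, one symmetric column
toy = pd.DataFrame({
    "Calories": [50, 60, 70, 80, 90, 100, 900],
    "Protein": [1, 2, 3, 4, 5, 6, 7],
})

result = skew_and_outliers(toy, ["Calories", "Protein"])
print(result)
```

On the notebook's data this could be called as `skew_and_outliers(df, ["Calories", "Protein", "TotalFat", "Carbohydrate"])`. A clearly positive skew value suggests that a log-type transform may be worth considering before any distance-based analysis, although that decision depends on the downstream method.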