# Generating Nutrition Data

This notebook focuses on **dataset creation and understanding**, not analysis. The goal is to:

* generate a realistic time-series dataset
* understand what the data represents
* clarify assumptions
* freeze the dataset for reuse across the project

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.random.seed(42)

date_range = pd.date_range(
    start="2022-01-11",
    end="2025-12-25",
    freq="D"
)

meals = ["Breakfast", "Lunch", "Supper", "Dinner"]

food_items = {
    "Breakfast": ["Oats", "Eggs", "Toast", "Smoothie", "Pancakes", "Poha"],
    "Lunch": ["Rice", "Chicken", "Dal", "Salad", "Paneer", "Curd"],
    "Supper": ["Fruit", "Yogurt", "Nuts", "Sandwich", "Sprouts"],
    "Dinner": ["Roti", "Fish", "Vegetables", "Soup", "Khichdi", "Tofu"]
}

meal_times = {
    "Breakfast": "08:00",
    "Lunch": "13:00",
    "Supper": "17:00",
    "Dinner": "20:30"
}

rows = []

for date in date_range:

    # Daily total water intake in ml: 1800 to 3499 (randint's upper bound is exclusive)
    daily_water = np.random.randint(1800, 3500)

    # Randomly choose 0 to 2 meals to skip (randint's upper bound is exclusive)
    skipped_meals = np.random.choice(
        meals,
        size=np.random.randint(0, 3),
        replace=False
    )

    # Split water across meals
    water_split = np.random.dirichlet(np.ones(len(meals))) * daily_water

    for i, meal in enumerate(meals):

        # Skipped meal: keep the row, but leave all seven measurement columns missing
        if meal in skipped_meals:
            rows.append([date, meal] + [np.nan] * 7)
            continue

        food = np.random.choice(food_items[meal])

        calories = np.random.randint(250, 700)
        protein = np.random.uniform(10, 45)
        carbs = np.random.uniform(25, 90)
        fat = np.random.uniform(8, 35)

        # Introduce missing macros (explicit)
        if np.random.rand() < 0.12:
            protein = np.nan
        if np.random.rand() < 0.12:
            carbs = np.nan
        if np.random.rand() < 0.12:
            fat = np.nan

        meal_time = meal_times[meal]
        if np.random.rand() < 0.08:
            meal_time = np.nan

        water = round(water_split[i], 0)
        if np.random.rand() < 0.10:
            water = np.nan

        rows.append([
            date,
            meal,
            food,
            calories,
            round(protein, 1) if not np.isnan(protein) else np.nan,
            round(carbs, 1) if not np.isnan(carbs) else np.nan,
            round(fat, 1) if not np.isnan(fat) else np.nan,
            meal_time,
            water
        ])

df = pd.DataFrame(
    rows,
    columns=[
        "Date", "Meal", "Food_Item", "Calories",
        "Protein_g", "Carbs_g", "Fat_g",
        "Meal_Time", "Water_ml"
    ]
)
df.head(10)

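|   | Date       | Meal      | Food_Item | Calories | Protein_g | Carbs_g | Fat_g | Meal_Time | Water_ml |
| - | ---------- | --------- | --------- | -------- | --------- | ------- | ----- | --------- | -------- |
| 0 | 2022-01-11 | Breakfast | Pancakes  | 349.0    | 15.0      | 67.3    | NaN   | 08:00     | 2036.0   |
| 1 | 2022-01-11 | Lunch     | Chicken   | 271.0    | 10.2      | NaN     | 22.2  | 13:00     | NaN      |
| 2 | 2022-01-11 | Supper    | Nuts      | 613.0    | 28.0      | 63.5    | NaN   | 17:00     | 378.0    |
| 3 | 2022-01-11 | Dinner    | Fish      | 514.0    | 10.6      | 40.0    | 14.5  | 20:30     | 133.0    |
| 4 | 2022-01-12 | Breakfast | NaN       | NaN      | NaN       | NaN     | NaN   | NaN       | NaN      |
| 5 | 2022-01-12 | Lunch     | Chicken   | 451.0    | NaN       | 63.9    | NaN   | 13:00     | 116.0    |
| 6 | 2022-01-12 | Supper    | Yogurt    | 616.0    | 39.0      | 48.2    | 15.6  | NaN       | 1977.0   |
| 7 | 2022-01-12 | Dinner    | Roti      | 385.0    | 10.2      | 78.0    | NaN   | 20:30     | 845.0    |
| 8 | 2022-01-13 | Breakfast | Poha      | 284.0    | 26.5      | 32.8    | 27.3  | 08:00     | 59.0     |
| 9 | 2022-01-13 | Lunch     | Chicken   | 291.0    | 25.4      | 38.1    | 32.2  | 13:00     | 337.0    |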


In [3]:
df.shape

(5780, 9)

In [4]:
df.isnull().sum()

Date            0
Meal            0
Food_Item    1411
Calories     1411
Protein_g    1920
Carbs_g      1937
Fat_g        1958
Meal_Time    1743
Water_ml     1857
dtype: int64
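
These counts can be sanity-checked against the probabilities used in the generator. The sketch below is a back-of-the-envelope check, not part of the original notebook. `expected_missing` is a hypothetical helper, and it assumes that the number of skipped meals per day is uniform over 0, 1, and 2, and that the per-field drop (12% for each macro, 8% for meal time, 10% for water) applies only to meals that were not skipped.

```python
def expected_missing(n_rows, n_skipped, p_drop):
    """Skipped rows are always missing; logged rows go missing with probability p_drop."""
    return n_skipped + p_drop * (n_rows - n_skipped)

n_rows = 5780      # from df.shape
n_skipped = 1411   # observed: rows with no Food_Item

# On average (0 + 1 + 2) / 3 = 1 of the 4 meals is skipped per day.
expected_skipped = n_rows * ((0 + 1 + 2) / 3) / 4

expected = {
    "Protein_g": expected_missing(n_rows, n_skipped, 0.12),
    "Carbs_g": expected_missing(n_rows, n_skipped, 0.12),
    "Fat_g": expected_missing(n_rows, n_skipped, 0.12),
    "Meal_Time": expected_missing(n_rows, n_skipped, 0.08),
    "Water_ml": expected_missing(n_rows, n_skipped, 0.10),
}

print(f"expected skipped rows: {expected_skipped:.0f}")
for column, value in expected.items():
    print(f"{column}: about {value:.0f} missing expected")
```

Under these assumptions the expected values (about 1445 skipped rows, about 1935 missing per macro, about 1761 for `Meal_Time`, about 1848 for `Water_ml`) sit close to the observed counts, which suggests the generator behaves as intended.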

### Time Coverage

* **Date Range:** `2022-01-11` to `2025-12-25` (1,445 consecutive days)
* Covers just under 4 years of data
* A span this long supports:
  * long-term trend analysis
  * seasonality detection
  * meaningful forecasting later
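
As a quick check on the coverage described above, the following sketch rebuilds the same daily index and derives the row count from it. It assumes exactly four rows per day, as in the generator; the variable names are illustrative.

```python
import pandas as pd

# Same bounds and frequency as the generator cell.
date_range = pd.date_range(start="2022-01-11", end="2025-12-25", freq="D")

n_days = len(date_range)
n_rows = n_days * 4              # one row per meal, four meals per day
span_years = n_days / 365.25

print(n_days, n_rows, round(span_years, 2))
```

The resulting row count should match the first element of `df.shape`.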


### Granularity

* **Event-based time series**
* **4 meals per day**:
  * Breakfast
  * Lunch
  * Supper
  * Dinner
* Each row represents **one meal event**, not an entire day

This allows:
* intra-day analysis
* aggregation from meal → daily → weekly → monthly
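
The roll-up from meal to day, week, and month can be sketched as follows. This is one possible approach, not the project's fixed method. It uses a small hand-built frame with the same `Date`, `Meal`, and `Calories` columns so that it runs on its own, and `min_count=1` is a choice made here so that a period with no logged values stays `NaN` instead of becoming 0.

```python
import numpy as np
import pandas as pd

# Two days of meal-level rows; the first meal of day two is skipped (NaN).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-11"] * 4 + ["2022-01-12"] * 4),
    "Meal": ["Breakfast", "Lunch", "Supper", "Dinner"] * 2,
    "Calories": [349.0, 271.0, 613.0, 514.0, np.nan, 451.0, 616.0, 385.0],
})

# Meal -> daily: sum calories per date and count how many meals were logged.
daily = sample.groupby("Date")["Calories"].sum(min_count=1)
meals_logged = sample.groupby("Date")["Calories"].count()

# Daily -> weekly (weeks ending on Sunday) and daily -> monthly.
weekly = daily.resample("W").sum(min_count=1)
monthly = daily.groupby(daily.index.to_period("M")).sum(min_count=1)

print(daily, meals_logged, weekly, monthly, sep="\n\n")
```

Keeping `meals_logged` next to the daily sum makes it possible to tell a genuinely low-calorie day from a day with skipped or unlogged meals.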


### Synthetic but Realistic Data

This dataset is **synthetically generated**, but designed to closely mimic real-world nutrition tracking:
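* Meals may be **skipped** (0 to 2 per day)
* Nutritional values may be **missing** (each macro is dropped from about 12% of logged meals)
* Logging is **inconsistent**, as it is for real users (meal time is missing for about 8% of logged meals, water for about 10%)
* Values are drawn at random from plausible ranges; calories and macros are sampled independently, so they are not guaranteed to be nutritionally consistent with each other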


### Explicit Missing Values (Intentional)

Missing values are **deliberately introduced** to simulate real behavior:

| Column                    | Reason for Missingness                        |
| ------------------------- | --------------------------------------------- |
| Food_Item                 | Meal skipped (row kept, nothing logged)       |
| Calories                  | Meal skipped                                  |
| Protein_g, Carbs_g, Fat_g | Meal skipped, or partial macro tracking       |
| Meal_Time                 | Meal skipped, or time not recorded            |
| Water_ml                  | Meal skipped, or incomplete hydration logging |
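
Because a skipped meal blanks the whole row, it helps to separate that structural missingness from field-level gaps before computing missing rates. The sketch below is a minimal illustration on a hand-built frame. It assumes, as in the generator, that a missing `Food_Item` marks a skipped meal; the names `skipped`, `logged`, and `partial_rate` are illustrative.

```python
import numpy as np
import pandas as pd

# Four logged meals with some macros missing, plus one skipped meal (last row).
sample = pd.DataFrame({
    "Meal": ["Breakfast", "Lunch", "Supper", "Dinner", "Breakfast"],
    "Food_Item": ["Pancakes", "Chicken", "Nuts", "Fish", np.nan],
    "Protein_g": [15.0, 10.2, 28.0, 10.6, np.nan],
    "Carbs_g": [67.3, np.nan, 63.5, 40.0, np.nan],
    "Fat_g": [np.nan, 22.2, np.nan, 14.5, np.nan],
})

# Assumption: no Food_Item means the meal was skipped.
skipped = sample["Food_Item"].isna()
logged = sample[~skipped]

# Share of logged meals where each macro is missing.
partial_rate = logged[["Protein_g", "Carbs_g", "Fat_g"]].isna().mean()

print(f"skipped meals: {skipped.sum()}")
print(partial_rate)
```

On the full dataset, the same split should give partial-missing rates near the 12% used by the generator, whereas raw `isnull()` counts mix both causes.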


### Water Intake Logic

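* A **daily total water intake** (1800 to 3499 ml) is drawn first, because people typically think about water per day, not per meal.
* That total is then **distributed across the four meals**, so each `Water_ml` value is one meal's share of the day's total.
* Missing values simulate inconsistent hydration tracking.

Summing `Water_ml` over a day recovers the daily total only when all four values are present; skipped meals and missing entries make the sum an underestimate.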
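
The split and its reversal can be sketched as follows. This is an illustration under stated assumptions: it uses `np.random.default_rng` where the notebook uses the legacy `np.random` functions, the daily total of 2500 ml is arbitrary, and the values in `sample` are made up for the example, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Splitting: Dirichlet shares are non-negative and sum to 1,
# so the per-meal amounts add back up to the daily total.
rng = np.random.default_rng(0)
daily_water = 2500
water_split = rng.dirichlet(np.ones(4)) * daily_water

# Recovering: sum the per-meal values per date and flag incomplete days.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-13"] * 4 + ["2022-01-14"] * 4),
    "Water_ml": [59.0, 337.0, 1200.0, 904.0, 500.0, np.nan, 800.0, 300.0],
})
by_day = sample.groupby("Date")["Water_ml"]
daily_total = by_day.sum(min_count=1)   # NaN if nothing was logged that day
complete = by_day.count() == 4          # True only when all four meals have a value

print(water_split.round(0), daily_total, complete, sep="\n\n")
```

Days where `complete` is `False` give only a lower bound on the true intake, so they may need to be excluded or imputed before daily hydration is analyzed.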


### Dataset Columns Explained

| Column    | Description                                                   |
| --------- | ------------------------------------------------------------- |
| Date      | Calendar date of the meal                                     |
| Meal      | Meal type (Breakfast, Lunch, Supper, Dinner)                  |
| Food_Item | Food consumed during the meal                                 |
| Calories  | Calories consumed in that meal (kcal)                         |
| Protein_g | Protein intake (grams)                                        |
| Carbs_g   | Carbohydrate intake (grams)                                   |
| Fat_g     | Fat intake (grams)                                            |
| Meal_Time | Time of meal as `HH:MM`, fixed per meal type (if logged)      |
| Water_ml  | This meal's share of the day's total water intake (ml)        |

### Saving the Dataset for Reuse

In [5]:
df.to_csv("nutrition_data.csv", index=False)
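
When the frozen file is read back elsewhere in the project, the CSV does not carry dtypes, so `Date` comes back as plain text unless it is parsed explicitly. The following sketch shows one way to reload it, using a tiny stand-in frame and a temporary file so that it runs independently; the file name `nutrition_sample.csv` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for df with the same kinds of columns (dates, text, floats, NaN).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-11", "2022-01-12"]),
    "Meal": ["Breakfast", "Breakfast"],
    "Calories": [349.0, np.nan],
    "Meal_Time": ["08:00", np.nan],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nutrition_sample.csv")
    sample.to_csv(path, index=False)                      # same call as the notebook
    reloaded = pd.read_csv(path, parse_dates=["Date"])    # restore datetime dtype

print(reloaded.dtypes)
print(reloaded.isna().sum())
```

Missing values survive the round trip as `NaN`, and `Meal_Time` remains a string such as `"08:00"` unless it is converted separately.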

----------