# 03 — Data Preparation

**Objective**  
Create a deterministic train/validation/test split and persist split manifests for reproducibility.

**Inputs**  
- Image dataset: `inputs/cherry_leaves_dataset/`  
  - Subfolders: `healthy/`, `powdery_mildew/`

**Outputs**  
- CSV manifests for each split under `inputs/manifests/v1/`:  
  - `train.csv`, `val.csv`, `test.csv`  
  Each file contains `filepath` and `label` columns.
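
As a rough illustration of the manifest format described above, the sketch below writes a two-column CSV and reads it back. The helper name `write_manifest` and the example file names are hypothetical, and the file is written to a temporary directory rather than to `inputs/manifests/v1/`.

```python
import csv
import tempfile
from pathlib import Path


def write_manifest(rows, out_path):
    """Write (filepath, label) pairs to a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label"])
        writer.writerows(rows)


# Hypothetical rows, for illustration only (these files need not exist)
rows = [
    ("inputs/cherry_leaves_dataset/healthy/example_1.jpg", "healthy"),
    ("inputs/cherry_leaves_dataset/powdery_mildew/example_2.jpg", "powdery_mildew"),
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "train.csv"
    write_manifest(rows, manifest_path)
    # Read the manifest back as a list of dicts keyed by column name
    with open(manifest_path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

print(loaded[0]["filepath"], loaded[0]["label"])
```

Storing paths relative to the project root, as assumed here, keeps a manifest usable on another machine.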

**Notes**  
Splits are stratified by class, so each split preserves the class proportions of the full dataset. A fixed random seed makes the split reproducible across runs.
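
A minimal sketch of a stratified, seeded split using only the standard library is shown below. The 70/10/20 ratios, the seed value `42`, the function name `stratified_split`, and the file names are all assumptions made for illustration; the notebook's actual values and method may differ.

```python
import random


def stratified_split(paths_by_class, percents=(70, 10, 20), seed=42):
    """Split each class separately so every split keeps the class ratio.

    `percents` gives the train/val share in whole percent; the test split
    receives whatever remains after train and val are taken.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(paths_by_class):
        # Sort before shuffling so the result does not depend on input order
        paths = sorted(paths_by_class[label])
        rng.shuffle(paths)
        n_train = len(paths) * percents[0] // 100
        n_val = len(paths) * percents[1] // 100
        splits["train"] += [(p, label) for p in paths[:n_train]]
        splits["val"] += [(p, label) for p in paths[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in paths[n_train + n_val:]]
    return splits


# Hypothetical file names, for illustration only
demo = {
    "healthy": [f"healthy/img_{i}.jpg" for i in range(10)],
    "powdery_mildew": [f"powdery_mildew/img_{i}.jpg" for i in range(10)],
}

first = stratified_split(demo)
second = stratified_split(demo)
print({name: len(rows) for name, rows in first.items()})
print("identical across runs:", first == second)
```

Shuffling within each class, rather than across the whole dataset, is what keeps the class balance equal in every split; the fixed seed is what makes two runs return the same assignment.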

In [1]:
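import os
from pathlib import Path

# Ensure working directory is project root: the notebook may be launched
# from the root itself or from a subfolder one level below it.
nb_cwd = Path.cwd()
candidates = (nb_cwd, nb_cwd.parent)
project_root = next((p for p in candidates if (p / "inputs").is_dir()), None)
if project_root is None:
    # Fail early instead of creating inputs/manifests/ in the wrong place
    raise FileNotFoundError(f"No 'inputs' directory found in {nb_cwd} or its parent")
os.chdir(project_root)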

DATA_DIR = Path("inputs/cherry_leaves_dataset")
CLASSES = ("healthy", "powdery_mildew")

MANIFESTS_DIR = Path("inputs") / "manifests" / "v1"
MANIFESTS_DIR.mkdir(parents=True, exist_ok=True)

print("CWD:", Path.cwd())
print("DATA_DIR:", DATA_DIR.resolve())
print("MANIFESTS_DIR:", MANIFESTS_DIR.resolve())
for cls in CLASSES:
    print(f"{cls:>16} ->", (DATA_DIR / cls).exists())

CWD: c:\Users\ksstr\Documents\Coding\milestone-project-5
DATA_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\inputs\cherry_leaves_dataset
MANIFESTS_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\inputs\manifests\v1
         healthy -> True
  powdery_mildew -> True
