# 01 — Data Collection

**Purpose**  
Set up local dataset access and verify class distributions to ensure the project can load images reliably.

**Objectives**  
- Verify dataset folder structure (healthy / powdery_mildew).  
- Count images per class and detect missing/corrupted files.  
- Produce a compact summary table for later reference.

**Inputs**  
- Local dataset path: `../inputs/cherry_leaves_dataset/`  
  - Subfolders: `healthy/`, `powdery_mildew/`

**Outputs**  
- Summary table with image counts per class (for README and EDA planning).  
- Notes on data integrity (missing files, unexpected extensions).

**Notes**  
This notebook does *not* perform any model training. It only validates the availability and integrity of the raw data used later in EDA and modelling.

In [None]:
from pathlib import Path
import os

# Ensure CWD is project root
nb_cwd = Path.cwd()
project_root = nb_cwd if (nb_cwd / "inputs").exists() else nb_cwd.parent
os.chdir(project_root)

print("CWD:", Path.cwd())
assert Path("inputs").exists(), "Missing 'inputs' folder at project root."

CWD: c:\Users\ksstr\Documents\Coding\milestone-project-5


In [15]:
from pathlib import Path
import pandas as pd

DATA_DIR = Path("inputs/cherry_leaves_dataset")
CLASSES = ("healthy", "powdery_mildew")
PATTERNS = ("*.jpg", "*.jpeg", "*.png", "*.JPG", "*.JPEG", "*.PNG")

def count_images(base: Path, classes, patterns):
    counts = {}
    for cls in classes:
        p = base / cls
        n = sum(len(list(p.glob(pt))) for pt in patterns) if p.exists() else 0
        counts[cls] = n
    return pd.DataFrame.from_dict(counts, orient="index", columns=["image_count"])

print("DATA_DIR:", DATA_DIR.resolve())
assert DATA_DIR.exists(), f"Dataset folder not found: {DATA_DIR.resolve()}"

df_summary = count_images(DATA_DIR, CLASSES, PATTERNS)
df_summary

DATA_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\inputs\cherry_leaves_dataset


Unnamed: 0,image_count
healthy,4208
powdery_mildew,4208


In [None]:
# Integrity scan (unexpected extensions, zero-byte files)
ALLOWED = {".jpg", ".jpeg", ".png"}

unexpected, zero = [], []
for cls in CLASSES:
    p = DATA_DIR / cls
    if not p.exists():
        continue
    for f in p.rglob("*"):
        if f.is_file():
            suf = f.suffix.lower()
            if suf not in ALLOWED:
                unexpected.append(f)
            elif f.stat().st_size == 0:
                zero.append(f)

print(f"Unexpected file types: {len(unexpected)}")
print(f"Zero-byte files: {len(zero)}")

unexpected[:3], zero[:3]

Unexpected file types: 0
Zero-byte files: 0


([], [])