# 05 — Modelling & Evaluation (CNN)

**Objective**  
Train and evaluate a baseline CNN to predict powdery mildew from cherry leaf images.

**Inputs**  
- Split manifests: `inputs/manifests/v1/{train,val,test}.csv`
- Images: `inputs/cherry_leaves_dataset/{healthy,powdery_mildew}`

**Outputs (planned)**  
- Trained model artifacts under `artifacts/v1/models/`
- Training history & evaluation plots under `plots/v1/`
- Metrics report under `artifacts/v1/reports/`

**Notes**  
Images are resized to a fixed 100×100 input and scaled to [0, 1]. Early stopping and model checkpointing will be used during training.
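
The early-stopping and checkpointing idea in the note above can be illustrated without any framework. The following is a minimal pure-Python sketch, not the notebook's training code: `best_epoch_with_patience`, `patience`, and `min_delta` are hypothetical names, and the rule is a simplified version of what a callback such as Keras `EarlyStopping` applies to a monitored validation loss.

```python
from typing import List, Tuple


def best_epoch_with_patience(
    val_losses: List[float], patience: int = 3, min_delta: float = 0.0
) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. The best epoch is
    the one whose weights a checkpoint would keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it (this is where a checkpoint is saved).
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(val_losses) - 1  # ran all epochs


# Hypothetical validation-loss curve: improves, then plateaus.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
best, stopped = best_epoch_with_patience(losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

With this made-up curve the late improvement at the last epoch is never reached, which is the usual trade-off when choosing a small `patience`.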

In [14]:
from pathlib import Path
import sys

def find_project_root(start: Path) -> Path:
    """Walk up until a folder containing 'src' is found, else return start."""
    p = start
    for _ in range(5):
        if (p / "src").exists():
            return p
        p = p.parent
    return start

PROJECT_ROOT = find_project_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.paths import PROJECT_ROOT, DATA_DIR, MANIFESTS_DIR, PLOTS_DIR, ARTIFACTS_DIR

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_DIR:", DATA_DIR)
print("MANIFESTS_DIR:", MANIFESTS_DIR)
print("PLOTS_DIR:", PLOTS_DIR)
print("ARTIFACTS_DIR:", ARTIFACTS_DIR)

PROJECT_ROOT: C:\Users\ksstr\Documents\Coding\milestone-project-5
DATA_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\inputs\cherry_leaves_dataset
MANIFESTS_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\inputs\manifests\v1
PLOTS_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\plots\v1
ARTIFACTS_DIR: C:\Users\ksstr\Documents\Coding\milestone-project-5\artifacts


In [15]:
# Modelling configuration
IMG_SIZE = (100, 100)   # (height, width), the order tf.image.resize expects
BATCH_SIZE = 32
SEED = 42

print("Config → IMG_SIZE:", IMG_SIZE, "| BATCH_SIZE:", BATCH_SIZE, "| SEED:", SEED)

Config → IMG_SIZE: (100, 100) | BATCH_SIZE: 32 | SEED: 42


In [16]:
import pandas as pd

paths = {
    "train": MANIFESTS_DIR / "train.csv",
    "val":   MANIFESTS_DIR / "val.csv",
    "test":  MANIFESTS_DIR / "test.csv",
}

for name, p in paths.items():
    assert p.exists(), f"Missing manifest: {p}"

df_train = pd.read_csv(paths["train"])
df_val   = pd.read_csv(paths["val"])
df_test  = pd.read_csv(paths["test"])

for name, df in [("train", df_train), ("val", df_val), ("test", df_test)]:
    print(f"{name:>5} n={len(df)}")
    vc = df["label"].value_counts(normalize=True).rename("proportion").round(3)
    print(vc, "\n")

display(df_train.head(3))

train n=2945
label
powdery_mildew    0.5
healthy           0.5
Name: proportion, dtype: float64 

  val n=631
label
healthy           0.501
powdery_mildew    0.499
Name: proportion, dtype: float64 

 test n=632
label
powdery_mildew    0.5
healthy           0.5
Name: proportion, dtype: float64 



| | filepath | label |
|---|---|---|
| 0 | C:\Users\ksstr\Documents\Coding\milestone-proj... | healthy |
| 1 | C:\Users\ksstr\Documents\Coding\milestone-proj... | powdery_mildew |
| 2 | C:\Users\ksstr\Documents\Coding\milestone-proj... | healthy |


### Pre-flight checks

- Split manifests found and loaded successfully.
- Class proportions are balanced across train/val/test: each class makes up between 49.9% and 50.1% of every split.
- Next step: implement a TensorFlow `tf.data` pipeline (decode → resize → normalize → batch → prefetch) using these manifests.
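
The balance check above was read off the printed proportions. It can also be expressed as an explicit tolerance test. This is a minimal sketch: `class_proportions`, `is_balanced`, and the 0.05 tolerance are hypothetical choices, not part of the notebook, and the label list is synthetic (sized to resemble the validation split).

```python
from collections import Counter
from typing import Dict, Iterable


def class_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Map each class label to its share of the rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels is empty")
    return {label: n / total for label, n in counts.items()}


def is_balanced(labels: Iterable[str], tolerance: float = 0.05) -> bool:
    """True if every class share is within `tolerance` of the uniform share."""
    props = class_proportions(labels)
    uniform_share = 1.0 / len(props)
    return all(abs(p - uniform_share) <= tolerance for p in props.values())


# Synthetic label column: 631 rows split almost evenly between two classes.
val_labels = ["healthy"] * 316 + ["powdery_mildew"] * 315
print(class_proportions(val_labels), is_balanced(val_labels))
```

In the notebook, the same helper could be called on a manifest column such as `df_val["label"]`, since iterating a pandas Series yields its values.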

In [17]:
import tensorflow as tf
import numpy as np 

# Seed the TensorFlow and NumPy RNGs (SEED is set in the configuration cell above)
tf.random.set_seed(SEED)
np.random.seed(SEED)

AUTOTUNE = tf.data.AUTOTUNE

# Stable label index from train manifest (sorted → reproducible)
labels_sorted = sorted(df_train["label"].unique().tolist())
label_to_index = {lbl: i for i, lbl in enumerate(labels_sorted)}
index_to_label = {i: lbl for lbl, i in label_to_index.items()}

print("Label map (in-memory):", label_to_index)

Label map (in-memory): {'healthy': 0, 'powdery_mildew': 1}
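
The label map above exists only in memory. Inference code needs the same index order, so one option is to persist it next to the model artifacts. The sketch below assumes a hypothetical file name `label_map.json` and uses a temporary directory as a stand-in for the real artifacts folder. `save_label_map` and `load_label_map` are illustrative helpers, not functions from the project.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

# Same mapping as printed by the notebook cell above.
label_to_index = {"healthy": 0, "powdery_mildew": 1}


def save_label_map(mapping: Dict[str, int], path: Path) -> None:
    """Write the label map as sorted, human-readable JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def load_label_map(path: Path) -> Dict[str, int]:
    """Read the label map back from JSON."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; the real notebook would use its artifacts folder.
    out_path = Path(tmp) / "models" / "label_map.json"
    save_label_map(label_to_index, out_path)
    restored = load_label_map(out_path)

# Rebuild the inverse map from the restored file.
index_to_label = {i: lbl for lbl, i in restored.items()}
print(restored)
```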


In [18]:
def decode_and_resize(img_bytes, img_size=(100, 100)):
    """Decode JPEG/PNG, force 3 channels, resize, and return float32 in [0,1]."""
    img = tf.io.decode_image(img_bytes, channels=3, expand_animations=False)
    img.set_shape([None, None, 3])  # static rank for TF graph
    img = tf.image.resize(img, img_size, method=tf.image.ResizeMethod.BILINEAR)
    img = tf.cast(img, tf.float32) / 255.0
    return img

def load_example(path, label, img_size=(100, 100)):
    """Read file, decode, resize, normalize; return (image, int_label)."""
    bytes_ = tf.io.read_file(path)
    img = decode_and_resize(bytes_, img_size)
    return img, label

def make_dataset(manifest_df, img_size=(100, 100), batch_size=32, training=False, seed=42):
    """Build a performant tf.data pipeline from a manifest DataFrame."""
    paths = manifest_df["filepath"].astype(str).values
    labels = manifest_df["label"].map(label_to_index).astype("int32").values

    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    if training:
        ds = ds.shuffle(buffer_size=len(manifest_df), seed=seed, reshuffle_each_iteration=True)

    ds = ds.map(lambda p, y: load_example(p, y, img_size), num_parallel_calls=AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=training)  # fixed batch size during training
    ds = ds.prefetch(AUTOTUNE)
    return ds

### Functional Smoke Test — Input Pipeline Validation

A quick functional smoke test checks that the `tf.data` pipeline runs end to end.

This test inspects a single training batch and confirms that:
- The file paths drawn into that batch are readable.
- Images are properly decoded, resized to 100×100, and normalized to [0,1].
- Labels are correctly mapped to their integer indices based on the stable label map.
- Batch dimensions and value ranges are consistent with expectations.

If this step passes without errors, the dataset is ready for model training.
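
Inspecting one batch only exercises the files that happen to land in it. To cover every manifest row, a separate existence check is cheap. The following is a minimal standard-library sketch: `missing_files` is a hypothetical helper, and the demo builds a throwaway manifest with the same `filepath` and `label` columns rather than touching the real data.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def missing_files(manifest_csv: Path, path_column: str = "filepath") -> List[str]:
    """Return the manifest paths that do not point to an existing file."""
    with manifest_csv.open(newline="", encoding="utf-8") as fh:
        return [
            row[path_column]
            for row in csv.DictReader(fh)
            if not Path(row[path_column]).is_file()
        ]


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    present = tmp_dir / "leaf_ok.jpg"
    present.write_bytes(b"")            # placeholder file, not a real image
    absent = tmp_dir / "leaf_gone.jpg"  # deliberately never created

    # Throwaway manifest mirroring the notebook's column names.
    manifest = tmp_dir / "demo.csv"
    with manifest.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filepath", "label"])
        writer.writerow([str(present), "healthy"])
        writer.writerow([str(absent), "powdery_mildew"])

    missing = missing_files(manifest)

print(len(missing), "missing file(s)")
```

An existence check does not prove that a file decodes as an image. It only rules out broken paths before a long training run starts.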

In [19]:
# Build datasets
train_ds = make_dataset(df_train, img_size=IMG_SIZE, batch_size=BATCH_SIZE, training=True,  seed=SEED)
val_ds   = make_dataset(df_val,   img_size=IMG_SIZE, batch_size=BATCH_SIZE, training=False, seed=SEED)
test_ds  = make_dataset(df_test,  img_size=IMG_SIZE, batch_size=BATCH_SIZE, training=False, seed=SEED)

# Inspect one batch
xb, yb = next(iter(train_ds))
print("Train batch →", xb.shape, yb.shape, "| range:", float(tf.reduce_min(xb)), "→", float(tf.reduce_max(xb)))
print("Labels sample (int):", yb[:8].numpy())
print("Decoded labels sample:", [index_to_label[int(i)] for i in yb[:8].numpy()])

Train batch → (32, 100, 100, 3) (32,) | range: 0.0 → 1.0
Labels sample (int): [0 0 1 1 1 0 1 0]
Decoded labels sample: ['healthy', 'healthy', 'powdery_mildew', 'powdery_mildew', 'powdery_mildew', 'healthy', 'powdery_mildew', 'healthy']


### Input pipeline checks

- Manifests successfully converted to `tf.data` datasets.
- Decoding, resizing to 100×100, and normalization to [0,1] verified on a sample batch.
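
The checks listed above were verified by eye from printed output. They could also be written as assertions so that a regression fails loudly. This sketch uses NumPy on a synthetic batch: `check_batch` is a hypothetical helper, and in the notebook it would presumably receive `xb.numpy()` and `yb.numpy()` instead of the random arrays used here.

```python
import numpy as np


def check_batch(images, labels, img_size=(100, 100), num_classes=2):
    """Assert the shape, dtype, value range, and label set of one batch."""
    assert images.ndim == 4, f"expected NHWC batch, got rank {images.ndim}"
    assert images.shape[1:] == (*img_size, 3), f"unexpected image shape {images.shape[1:]}"
    assert images.shape[0] == labels.shape[0], "image/label count mismatch"
    assert images.dtype == np.float32, f"expected float32, got {images.dtype}"
    assert images.min() >= 0.0 and images.max() <= 1.0, "pixels outside [0, 1]"
    assert set(np.unique(labels).tolist()) <= set(range(num_classes)), "unknown label index"


# Synthetic stand-in for one training batch (not real leaf images).
rng = np.random.default_rng(42)
fake_images = rng.random((32, 100, 100, 3), dtype=np.float32)
fake_labels = rng.integers(0, 2, size=32).astype(np.int32)

check_batch(fake_images, fake_labels)
print("batch checks passed")
```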

**Next:** define a compact baseline CNN (softmax), add EarlyStopping & ModelCheckpoint, and run v1 training.