# 01 - Data Mining & Exploration (Transfer Learning)

This notebook performs exploratory data analysis for the Transfer Learning pipeline.
It mirrors  but focuses on considerations specific to
fine-tuning a pretrained EfficientNet-B4:

- **Input resolution**: EfficientNet-B4 was designed for **380x380** inputs (vs 224x224 for the custom CNN).
- **Normalisation**: We switch from dataset-specific stats to **ImageNet mean/std** because
  the pretrained weights learned features relative to the ImageNet distribution.
- **Data augmentation**: More conservative than training from scratch -- strong augmentation
  can corrupt the pretrained feature representations.

In [None]:
import collections
import json
import os
import pathlib as pl
import warnings

import numpy as np
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

warnings.filterwarnings("ignore")  # keep notebook output readable

DATA_DIR = "../riceleaf"
CLASSES = ["blast", "healthy", "insect", "leaf_folder", "scald", "stripes", "tungro"]
print(f"Data directory: {DATA_DIR}")
print(f"Classes ({len(CLASSES)}): {CLASSES}")

Data directory: ../riceleaf
Classes (7): ['blast', 'healthy', 'insect', 'leaf_folder', 'scald', 'stripes', 'tungro']


## 1. Dataset Statistics

### Why We Reuse the Same Dataset Split
The train/test split from  (DL) is reused unchanged.
This ensures a **fair comparison** between the custom CNN (87.43% accuracy) and EfficientNet-B4 (94.71%).
Evaluating both on the same test set means the accuracy gap reflects the model and its training setup, not a difference in evaluation data.

In [None]:
split_counts={
    "train":{"blast":3601,"healthy":3229,"insect":1654,"leaf_folder":1332,"scald":294,"stripes":1458,"tungro":1415},
    "test": {"blast":775, "healthy":694, "insect":357, "leaf_folder":288, "scald":64, "stripes":315, "tungro":306}
}

for split,counts in split_counts.items():
    total=sum(counts.values())
    print(f"\n{split.upper()} SET ({total:,} images):")
    for cls,n in counts.items():
        bar="#"*int(n/max(counts.values())*30)
        print(f"  {cls:<14} {n:>4}  {bar}")
print()
train_n=sum(split_counts["train"].values())
test_n=sum(split_counts["test"].values())
print(f"Total: {train_n+test_n:,} | Train: {train_n:,} ({train_n/(train_n+test_n):.0%}) | Test: {test_n:,} ({test_n/(train_n+test_n):.0%})")
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Imbalance ratio: {max(split_counts['train'].values())/min(split_counts['train'].values()):.1f}x (blast:scald)")


TRAIN SET (12,983 images):
  blast          3601  ##############################
  healthy        3229  ##########################
  insect         1654  #############
  leaf_folder    1332  ###########
  scald           294  ##
  stripes        1458  ############
  tungro         1415  ###########

TEST SET (2,799 images):
  blast           775  ##############################
  healthy         694  ##########################
  insect          357  #############
  leaf_folder     288  ###########
  scald            64  ##
  stripes         315  ############
  tungro          306  ###########

Total: 15,782 | Train: 12,983 (82%) | Test: 2,799 (18%)
Imbalance ratio: 12.2x (blast:scald)


## 2. Normalisation: ImageNet vs Dataset Statistics

**For Transfer Learning, ImageNet normalisation is mandatory:**
EfficientNet-B4 pretrained weights learned features relative to ImageNet-normalised inputs.
Using dataset-specific statistics would shift all pixel values out of the range the
network was calibrated for, effectively invalidating the pretrained features.

| Channel | ImageNet Mean | Dataset Mean | Difference |
|---|---|---|---|
| R | 0.485 | 0.8835 | +0.3985 |
| G | 0.456 | 0.8862 | +0.4302 |
| B | 0.406 | 0.8480 | +0.4420 |

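Rice leaf images are about 0.4 brighter per channel (on the [0, 1] scale) than the average
ImageNet image, so ImageNet-normalised inputs are not zero-centred. We still use ImageNet stats:
they keep inputs on the scale the pretrained filters expect, and the remaining brightness offset
has to be absorbed by the layers that are updated during fine-tuning. Batch normalisation layers
that stay frozen keep their ImageNet running statistics and cannot adapt to the offset themselves.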

In [None]:
imagenet_mean=[0.485,0.456,0.406]
imagenet_std=[0.229,0.224,0.225]
dataset_mean=[0.8835,0.8862,0.8480]
dataset_std=[0.2158,0.2074,0.2905]

print("Channel | ImageNet Mean | Dataset Mean | Offset")
print("-"*52)
for ch,im,dm in zip(["R","G","B"],imagenet_mean,dataset_mean):
    print(f"{ch:<8}|{im:>14.4f} |{dm:>13.4f} |{dm-im:>+8.4f}")

Channel | ImageNet Mean | Dataset Mean | Offset
----------------------------------------------------
R       |        0.4850 |       0.8835 | +0.3985
G       |        0.4560 |       0.8862 | +0.4302
B       |        0.4060 |       0.8480 | +0.4420
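
To see what this offset means for the network's inputs, the arithmetic below applies ImageNet normalisation to the dataset's mean pixel. It is a small worked example using only the statistics listed above; the variable name `normalised_mean` is illustrative.

```python
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]
dataset_mean = [0.8835, 0.8862, 0.8480]

# Normalize computes (x - mean) / std per channel; apply it to the dataset's mean pixel.
normalised_mean = [
    (dm - im) / s for dm, im, s in zip(dataset_mean, imagenet_mean, imagenet_std)
]

for ch, z in zip("RGB", normalised_mean):
    print(f"{ch}: average rice-leaf pixel maps to {z:+.2f} after ImageNet normalisation")
```

An average ImageNet pixel maps to 0 by construction, so the rice-leaf inputs sit roughly 1.7 to 2.0 ImageNet standard deviations above that on every channel.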


## 3. Input Resolution Analysis

EfficientNet-B4 was designed for **380x380** inputs (EfficientNet compound scaling increases
resolution with depth and width).

Compared to the custom CNN (224x224):
- Image area: 380x380 vs 224x224 = **2.9x larger** per image
- Batch size must be halved (32->16) to fit GPU memory
- Training time per epoch ~2.8x longer

This is the trade-off for accessing pretrained features: higher quality, higher compute cost.
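
The 2.9x figure, and how it interacts with the smaller batch, can be checked with a few lines of arithmetic. This is a sketch based only on the image sizes and batch sizes quoted above; it counts input pixels and says nothing about activation memory inside the network, which typically dominates GPU usage.

```python
cnn_size, tl_size = 224, 380    # input side length in pixels
cnn_batch, tl_batch = 32, 16    # batch sizes quoted above

# Ratio of pixels per image between the two pipelines.
area_ratio = tl_size**2 / cnn_size**2

# Ratio of input pixels processed per training step.
pixels_per_batch_cnn = cnn_batch * cnn_size**2
pixels_per_batch_tl = tl_batch * tl_size**2
batch_ratio = pixels_per_batch_tl / pixels_per_batch_cnn

print(f"Area per image  : {area_ratio:.2f}x")
print(f"Pixels per batch: {batch_ratio:.2f}x "
      f"(batch {tl_batch} @ {tl_size} vs batch {cnn_batch} @ {cnn_size})")
```

Even with the batch halved, each step still processes about 1.44x as many input pixels as the custom CNN did.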

In [None]:
from torchvision.transforms import InterpolationMode
import torch

IMG_SIZE_TL=380
MEAN_IN=[0.485,0.456,0.406];STD_IN=[0.229,0.224,0.225]

tl_train_tf=transforms.Compose([
    transforms.Resize((IMG_SIZE_TL,IMG_SIZE_TL),interpolation=InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(),transforms.RandomVerticalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2,contrast=0.2,saturation=0.1),
    transforms.ToTensor(),transforms.Normalize(MEAN_IN,STD_IN)])
tl_val_tf=transforms.Compose([
    transforms.Resize((IMG_SIZE_TL,IMG_SIZE_TL),interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),transforms.Normalize(MEAN_IN,STD_IN)])

print("TL train transforms:")
for t in tl_train_tf.transforms: print(f"  {t.__class__.__name__}")
print()
dummy=torch.randn(1,3,380,380)
print(f"Input shape: {dummy.shape}")
print(f"Memory/img : {dummy.nbytes/1024:.1f} KB")
print(f"Batch=16   : {dummy.nbytes*16/1024**2:.1f} MB")

TL train transforms:
  Resize
  RandomHorizontalFlip
  RandomVerticalFlip
  RandomRotation
  ColorJitter
  ToTensor
  Normalize

Input shape: torch.Size([1, 3, 380, 380])
Memory/img : 1692.2 KB
Batch=16   : 26.4 MB


## 4. Class Weight Computation for Transfer Learning

The same inverse-frequency weighting strategy is used as in the DL pipeline.
The class weight for scald is ~12x higher than for blast, directing the loss
function to penalise scald misclassifications proportionally more.

In [None]:
import numpy as np
train_counts=np.array([3601,3229,1654,1332,294,1458,1415],dtype=float)
raw_w=1.0/train_counts
cw=raw_w/raw_w.sum()
CLASSES=["blast","healthy","insect","leaf_folder","scald","stripes","tungro"]
print("Class weights for CrossEntropyLoss:")
print("-"*42)
for cls,w,n in zip(CLASSES,cw,train_counts.astype(int)):
    bar="|"*int(w/cw.max()*20)
    print(f"{cls:<14} w={w:.4f} n={n:>4} {bar}")
print()
print(f"scald weight / blast weight = {cw[4]/cw[0]:.1f}x")

Class weights for CrossEntropyLoss:
------------------------------------------
blast          w=0.0412 n=3601 |
healthy        w=0.0460 n=3229 |
insect         w=0.0897 n=1654 |||
leaf_folder    w=0.1114 n=1332 ||||
scald          w=0.5049 n= 294 ||||||||||||||||||||
stripes        w=0.1018 n=1458 ||||
tungro         w=0.1049 n=1415 ||||

scald weight / blast weight = 12.2x
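
How such weights enter the loss can be sketched in plain Python. The function below is an illustrative re-implementation, not the pipeline's code: it follows the weighted-mean reduction that `torch.nn.CrossEntropyLoss` applies when given a `weight` tensor with the default `reduction="mean"`. The logits are made-up values for a two-sample batch.

```python
import math


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean cross-entropy: sum(w[y] * nll) / sum(w[y])."""
    num, den = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-likelihood of the true class
        num += weights[y] * nll
        den += weights[y]
    return num / den


train_counts = [3601, 3229, 1654, 1332, 294, 1458, 1415]
raw = [1.0 / n for n in train_counts]
weights = [r / sum(raw) for r in raw]  # inverse-frequency weights, normalised to sum to 1

# Hypothetical logits: a confidently correct blast sample (class 0)
# and a scald sample (class 4) that the model mistakes for blast.
logits = [[2.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
          [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]]
targets = [0, 4]

loss_plain = weighted_cross_entropy(logits, targets, [1.0] * 7)
loss_weighted = weighted_cross_entropy(logits, targets, weights)
loss_scaled = weighted_cross_entropy(logits, targets, [10 * w for w in weights])

print(f"unweighted: {loss_plain:.4f}")
print(f"weighted  : {loss_weighted:.4f}")
print(f"weights x10: {loss_scaled:.4f}")
```

The weighted loss is pulled towards the scald sample's error. Because the weighted mean divides by the sum of the weights actually used, rescaling all weights by a constant leaves the loss unchanged; only the ratios between classes matter.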


## 5. Dataset Inspection -- Sample Quality Check

Visual inspection of samples is critical before fine-tuning. Key observations:
- Images are RGB photographs of rice leaves at various scales.
- Background varies: white paper, soil, field, close-up.
-  images show water-soaked lesions often confused with  brown lesions.
-  shows distinctive rolled-leaf morphology, easy to distinguish visually.

These observations informed the augmentation choice: moderate colour jitter (not extreme,
as colour is a diagnostic feature for distinguishing diseases).

In [None]:
from torchvision import datasets as tvd
train_ds=tvd.ImageFolder("../riceleaf/train",transform=tl_train_tf)
test_ds=tvd.ImageFolder("../riceleaf/test",transform=tl_val_tf)

print(f"Train: {len(train_ds):,} images in {len(train_ds.classes)} classes")
print(f"Test : {len(test_ds):,} images in {len(test_ds.classes)} classes")
print(f"Class-to-idx: {train_ds.class_to_idx}")
print()
# Sample count per split per class
for split,ds in [("train",train_ds),("test",test_ds)]:
    cnt={}
    for _,lbl in ds.samples: cnt[lbl]=cnt.get(lbl,0)+1
    print(f"{split}: "+", ".join(f"{CLASSES[k]}:{v}" for k,v in sorted(cnt.items())))

Train: 12,983 images in 7 classes
Test : 2,799 images in 7 classes
Class-to-idx: {'blast': 0, 'healthy': 1, 'insect': 2, 'leaf_folder': 3, 'scald': 4, 'stripes': 5, 'tungro': 6}

train: blast:3601, healthy:3229, insect:1654, leaf_folder:1332, scald:294, stripes:1458, tungro:1415
test: blast:775, healthy:694, insect:357, leaf_folder:288, scald:64, stripes:315, tungro:306
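
The per-class printout above indexes `CLASSES` with the integer labels from `ImageFolder`, which assigns indices to class folders in sorted name order. A small guard like the following makes that assumption explicit. It is a sketch; the `expected` mapping is built here for illustration.

```python
CLASSES = ["blast", "healthy", "insect", "leaf_folder", "scald", "stripes", "tungro"]

# ImageFolder sorts class folder names before numbering them,
# so CLASSES[k] is only the right name for label k if CLASSES is already sorted.
expected = {name: idx for idx, name in enumerate(sorted(CLASSES))}
assert list(expected) == CLASSES, "CLASSES is not in sorted order; labels would be mismatched"

print(expected)
```

In the notebook itself, comparing `expected` with `train_ds.class_to_idx` would catch a renamed or missing class folder before training starts.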


## 6. Summary & Key Differences from Deep Learning Pipeline

| Aspect | DL (Custom CNN) | TL (EfficientNet-B4) |
|---|---|---|
| Input size | 224x224 | **380x380** |
| Normalisation | Dataset stats | **ImageNet stats** |
| Augmentation strength | Strong | Moderate |
| Batch size | 32 | **16** |
| Pretrained | No | **Yes (ImageNet1K_V1)** |

These differences are deliberate and necessary for correct fine-tuning.

In [None]:
import json,pathlib as pl
stats={
    "imagenet_mean":[0.485,0.456,0.406],
    "imagenet_std":[0.229,0.224,0.225],
    "input_size_tl":380,
    "batch_size":16,
    "train_counts":{"blast":3601,"healthy":3229,"insect":1654,"leaf_folder":1332,"scald":294,"stripes":1458,"tungro":1415},
    "test_counts":{"blast":775,"healthy":694,"insect":357,"leaf_folder":288,"scald":64,"stripes":315,"tungro":306}
}
with open(pl.Path(".")/"dataset_stats_tl.json","w") as f:
    json.dump(stats,f,indent=2)
print("dataset_stats_tl.json saved.")
print("Next: 02_training_compare.ipynb")

dataset_stats_tl.json saved.
Next: 02_training_compare.ipynb
