# ODIR-5K Preprocessing

## Objective
This notebook prepares the final metadata used for model training by:
- Loading clean image-level labels
- Performing a reproducible **train/validation split**
- Preserving multi-label distributions
- Saving `train.csv` and `val.csv` for downstream PyTorch loading

All models will use the **same frozen split** to ensure fair comparison.

## Section 1 - Import Required Libraries


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split

# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [2]:
# Resolve project root (assumes notebook is inside /notebooks)
PROJECT_ROOT = Path.cwd().parent

# Data directories
DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Input from previous notebook
IMAGE_LEVEL_FILE = PROCESSED_DIR / "image_level_labels.csv"

assert IMAGE_LEVEL_FILE.exists(), f"{IMAGE_LEVEL_FILE} not found; run 02_label_analysis.ipynb first"

## Load Image-Level Labels

This file was generated in `02_label_analysis.ipynb` and contains one row per fundus image with multi-label targets.

In [3]:
df = pd.read_csv(IMAGE_LEVEL_FILE)
df.head()

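|   | image_name | eye | split | N | D | G | C | A | H | M | O |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_left.jpg | left | train | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0_right.jpg | right | train | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1_left.jpg | left | train | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1_right.jpg | right | train | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2_left.jpg | left | train | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |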


## Define Disease Label Columns

In [4]:
label_cols = ['N', 'D', 'G', 'C', 'A', 'H', 'M', 'O']

# Sanity check
assert all(col in df.columns for col in label_cols), "Missing label columns!"
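
Beyond checking that the columns exist, it can help to confirm that every label value is 0 or 1, since the string key built in the next section assumes binary columns. The following is a minimal sketch on toy data; `find_non_binary_labels` is a hypothetical helper, not part of this project, and the toy frame stands in for `df`.

```python
import pandas as pd

def find_non_binary_labels(frame: pd.DataFrame, cols: list[str]) -> list[str]:
    """Return the label columns that contain anything other than 0 or 1."""
    return [col for col in cols if not frame[col].isin([0, 1]).all()]

# Toy data standing in for df (hypothetical values, not from ODIR-5K)
toy = pd.DataFrame({"N": [1, 0, 0], "D": [0, 1, 2]})
bad_cols = find_non_binary_labels(toy, ["N", "D"])
print(bad_cols)
```

With the real data one would call `find_non_binary_labels(df, label_cols)` and expect an empty list. Missing values are flagged as well, because `isin` treats them as non-matching.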

## Section 2 - Multi-Label Stratification Strategy

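Scikit-learn's `train_test_split` has no dedicated multi-label stratification: it stratifies on a single class label per sample. To approximate multi-label stratification, we build a **label-combination key** per image by concatenating its eight binary labels into one string.

Stratifying on this key keeps each label combination at approximately the same proportion in both splits, which in turn keeps per-label prevalence close. It is a common practical approximation rather than true multi-label stratification.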

In [5]:
# Convert label vector to a string key, e.g. "10001000"
df["label_key"] = df[label_cols].astype(str).agg("".join, axis=1)

# Inspect most common label combinations
df["label_key"].value_counts().head()

label_key
10000000    2280
01000000    1394
00000001    1102
01000001     564
00010000     292
Name: count, dtype: int64
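
`train_test_split` raises a `ValueError` when a stratification class has only one member, so rare combinations are worth inspecting before splitting. The sketch below uses toy data and a hypothetical helper, `rare_label_keys`; it rebuilds the key the same way as the cell above and returns the combinations that fall under a minimum count.

```python
import pandas as pd

def rare_label_keys(frame, cols, min_count=2):
    """Return label-combination keys with fewer than min_count images."""
    keys = frame[cols].astype(str).agg("".join, axis=1)
    counts = keys.value_counts()
    return counts[counts < min_count]

# Toy data (hypothetical): the combination "011" occurs only once
toy = pd.DataFrame({
    "N": [1, 1, 0, 0, 0],
    "D": [0, 0, 1, 1, 1],
    "G": [0, 0, 0, 0, 1],
})
rare = rare_label_keys(toy, ["N", "D", "G"])
print(rare)
```

On the real data the call would be `rare_label_keys(df, label_cols)`. How to handle any keys it returns (merging them into a catch-all key, for instance) is a design choice this notebook does not prescribe.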

## Section 3 - Train / Validation Split

We split the data into:
- 80% training
- 20% validation

The split is **reproducible** (fixed `random_state`) and **label-aware** (stratified on `label_key`).

In [6]:
train_df, val_df = train_test_split(
    df,
    test_size=0.20,
    random_state=RANDOM_SEED,
    stratify=df["label_key"]
)

# Reset indices
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")

Training samples: 5600
Validation samples: 1400
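
A quick sanity check on the result is that no image appears in both splits and that together they cover the full dataset. Here is a minimal sketch on toy data, assuming `image_name` is unique per row as the preview of `df` suggests; `check_split` is a hypothetical helper.

```python
import pandas as pd

def check_split(full, train, val, key="image_name"):
    """Return True if train and val are disjoint and together cover full."""
    train_keys, val_keys = set(train[key]), set(val[key])
    disjoint = train_keys.isdisjoint(val_keys)
    complete = (train_keys | val_keys) == set(full[key])
    return disjoint and complete and len(train) + len(val) == len(full)

# Toy data (hypothetical): ten images split 8 / 2 by position
toy = pd.DataFrame({"image_name": [f"{i}_left.jpg" for i in range(10)]})
ok = check_split(toy, toy.iloc[:8], toy.iloc[8:])

# An overlapping split (row 7 in both parts) should fail the check
overlapping = check_split(toy, toy.iloc[:8], toy.iloc[7:])
print(ok, overlapping)
```

With the real frames the call would be `check_split(df, train_df, val_df)`.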


## Section 4 - Cleanup Helper Columns

The stratification key is no longer needed for training.

In [7]:
# Drop the stratification helper column from both splits
train_df = train_df.drop(columns=["label_key"])
val_df = val_df.drop(columns=["label_key"])

## Verify Label Distribution Consistency

We compare the per-label positive rate (the fraction of images carrying each label) in the training and validation sets to confirm that the class imbalance is preserved.

In [8]:
train_dist = train_df[label_cols].mean()
val_dist = val_df[label_cols].mean()

dist_df = pd.DataFrame({
    "train_ratio": train_dist,
    "val_ratio": val_dist
})

dist_df

|   | train_ratio | val_ratio |
|---|---|---|
| N | 0.325714 | 0.325714 |
| D | 0.322500 | 0.321429 |
| G | 0.061786 | 0.060000 |
| C | 0.060714 | 0.060000 |
| A | 0.046786 | 0.047143 |
| H | 0.029286 | 0.030000 |
| M | 0.049821 | 0.049286 |
| O | 0.280000 | 0.278571 |
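
The comparison can also be reduced to a single number: the largest absolute gap between the two ratios. The sketch below re-enters the values from the table above so that it runs on its own; the 0.01 tolerance is an arbitrary assumption, not a threshold used in this project.

```python
import pandas as pd

labels = ["N", "D", "G", "C", "A", "H", "M", "O"]
dist = pd.DataFrame(
    {
        "train_ratio": [0.325714, 0.3225, 0.061786, 0.060714,
                        0.046786, 0.029286, 0.049821, 0.28],
        "val_ratio": [0.325714, 0.321429, 0.06, 0.06,
                      0.047143, 0.03, 0.049286, 0.278571],
    },
    index=labels,
)

# Absolute train/val difference per label, and the worst offender
abs_gap = (dist["train_ratio"] - dist["val_ratio"]).abs()
worst_label = abs_gap.idxmax()
max_gap = abs_gap.max()

TOLERANCE = 0.01  # assumed threshold, chosen only for illustration
within_tolerance = max_gap < TOLERANCE
print(f"Largest gap: {worst_label} ({max_gap:.6f}), within tolerance: {within_tolerance}")
```

For these values the largest gap is about 0.0018, on label G. In the notebook itself the same three lines could be applied directly to `dist_df`.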


## Section 5 - Save Final Train & Validation CSVs

These files are **frozen inputs** for all downstream models.

In [9]:
TRAIN_CSV = PROCESSED_DIR / "train.csv"
VAL_CSV = PROCESSED_DIR / "val.csv"

train_df.to_csv(TRAIN_CSV, index=False)
val_df.to_csv(VAL_CSV, index=False)

print(f"Saved: {TRAIN_CSV}")
print(f"Saved: {VAL_CSV}")

Saved: c:\Users\ibaan\Documents\Coding\Python\odir_ocular_disease_recognition\data\processed\train.csv
Saved: c:\Users\ibaan\Documents\Coding\Python\odir_ocular_disease_recognition\data\processed\val.csv
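
Downstream code can read these files back as a list of image names and a float label matrix, which is the shape most multi-label losses expect. The following is a minimal sketch assuming the column layout shown earlier; `load_split` is a hypothetical helper, and the temporary two-row CSV stands in for `train.csv`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

LABEL_COLS = ["N", "D", "G", "C", "A", "H", "M", "O"]

def load_split(csv_path, label_cols=LABEL_COLS):
    """Read a split CSV and return (image names, float32 label matrix)."""
    frame = pd.read_csv(csv_path)
    names = frame["image_name"].tolist()
    targets = frame[label_cols].to_numpy(dtype=np.float32)
    return names, targets

# Round-trip a two-row toy file to show the shapes involved
with tempfile.TemporaryDirectory() as tmp:
    toy_csv = Path(tmp) / "train.csv"
    pd.DataFrame(
        [["0_left.jpg", 0, 0, 0, 1, 0, 0, 0, 0],
         ["2_left.jpg", 0, 1, 0, 0, 0, 0, 0, 1]],
        columns=["image_name"] + LABEL_COLS,
    ).to_csv(toy_csv, index=False)
    names, targets = load_split(toy_csv)

print(names, targets.shape, targets.dtype)
```

In a training script the call would be `load_split(TRAIN_CSV)`, with the returned arrays wrapped in whatever dataset class the model code defines.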


## Save Label Statistics

This summary is useful for reporting and for configuring class weights or focal loss later.

In [10]:
label_stats = pd.DataFrame({
    "train_positive": train_df[label_cols].sum(),
    "val_positive": val_df[label_cols].sum(),
    "train_ratio": train_df[label_cols].mean(),
    "val_ratio": val_df[label_cols].mean()
})

LABEL_STATS_CSV = PROCESSED_DIR / "label_stats.csv"
label_stats.to_csv(LABEL_STATS_CSV)

label_stats

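|   | train_positive | val_positive | train_ratio | val_ratio |
|---|---|---|---|---|
| N | 1824 | 456 | 0.325714 | 0.325714 |
| D | 1806 | 450 | 0.322500 | 0.321429 |
| G | 346 | 84 | 0.061786 | 0.060000 |
| C | 340 | 84 | 0.060714 | 0.060000 |
| A | 262 | 66 | 0.046786 | 0.047143 |
| H | 164 | 42 | 0.029286 | 0.030000 |
| M | 279 | 69 | 0.049821 | 0.049286 |
| O | 1568 | 390 | 0.280000 | 0.278571 |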
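
One common way to turn these counts into class weights is the negative-to-positive ratio per label, which is the form PyTorch's `BCEWithLogitsLoss` accepts as `pos_weight`. The sketch below only computes the ratios with pandas, using the training counts from the table above. Whether to use them as-is, clip them, or prefer focal loss is a modelling choice this notebook does not make.

```python
import pandas as pd

N_TRAIN = 5600  # training set size reported by the split above
train_positive = pd.Series(
    {"N": 1824, "D": 1806, "G": 346, "C": 340,
     "A": 262, "H": 164, "M": 279, "O": 1568}
)

# pos_weight = number of negatives / number of positives, per label
pos_weight = (N_TRAIN - train_positive) / train_positive
print(pos_weight.round(2))
```

The rarest label, H, gets the largest weight (about 33), while N gets roughly 2.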


## Key Outcomes

- Image-level dataset was split into **training and validation sets**
- Multi-label distribution was preserved using label-combination stratification
- Reproducible CSV files (`train.csv`, `val.csv`) were generated
- These files will be used unchanged by all models to ensure fair comparison