## 📂 Section 1: Data Cleaning & Dataset Preparation

In this section, I prepare a **clean and structured dataset** from the original rice disease image dataset provided in the assignment. The original dataset includes labeled images along with metadata such as `image_id`, `label` (disease), `variety`, and `age`.

A clean dataset is essential for reliable model training. Therefore, I conduct several preprocessing steps to ensure **data integrity**, **consistency**, and **fair distribution** across classes.

---

### ✅ Step-by-Step Breakdown

### 1. Imports and Configuration

Here, I import all necessary Python libraries for:
- **File handling** (`os`, `shutil`)
- **Metadata management** (`pandas`)
- **Image inspection** (`PIL.Image`)
- **Dataset splitting** (`train_test_split`)
- **Duplicate detection** (`hashlib.md5`)

I also define constants such as the directory structure and the accepted image formats. This ensures maintainability and reduces hardcoding.

---

### 2. Metadata Loading

I load the `meta_train.csv` file, which includes labels and metadata for each image. I make sure `image_id` is treated as a string so that it matches filenames exactly later on; an ID parsed as a number would lose its leading zeros (e.g., `00001` becoming `1`).
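
As a small illustration of why the string type matters, the sketch below reads a tiny in-memory CSV twice. The sample rows and label values are invented for the example; only `pd.read_csv` and its `dtype` parameter come from pandas itself.

```python
import io
import pandas as pd

# Hypothetical sample: IDs with leading zeros, as they might appear in a CSV.
sample_csv = "image_id,label\n00001,blast\n00002,hispa\n"

# Default parsing infers an integer column, so the zeros are lost on read
# and a later cast to str cannot bring them back.
inferred = pd.read_csv(io.StringIO(sample_csv))
late_cast = inferred["image_id"].astype(str).tolist()

# Requesting str at read time keeps the IDs exactly as written.
as_text = pd.read_csv(io.StringIO(sample_csv), dtype={"image_id": str})
early_cast = as_text["image_id"].tolist()

print(late_cast)   # ['1', '2']
print(early_cast)  # ['00001', '00002']
```

For IDs that already contain a file extension the cast is harmless either way, since pandas reads them as text.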

---

### 3. Duplicate ID Detection

Before proceeding, I check for duplicate `image_id` entries in the metadata. Duplicate IDs could lead to **data leakage**, label conflicts, or redundancy. The goal is to confirm that each image in the dataset is uniquely identifiable.
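
If the duplicate count were ever non-zero, the next question would be whether the repeated rows agree on the label. A minimal sketch of that follow-up check, using an invented four-row table rather than the real metadata:

```python
import pandas as pd

# Hypothetical metadata with one repeated ID that carries two different labels.
meta = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
    "label": ["blast", "hispa", "tungro", "blast"],
})

# keep=False marks every row of a repeated ID, not only the later copies.
repeated = meta[meta["image_id"].duplicated(keep=False)]

# IDs whose rows disagree on the label are the conflicting ones.
labels_per_id = repeated.groupby("image_id")["label"].nunique()
conflicting_ids = labels_per_id[labels_per_id > 1].index.tolist()

print(len(repeated))    # 2
print(conflicting_ids)  # ['a.jpg']
```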

---

### 4. Label Folder Verification

I compare the list of labels (disease names) in the CSV with the actual subfolders in the `train_images/` directory. This step helps identify:
- Any **missing folders**
- **Spelling inconsistencies** or formatting errors

Why I do this: 
> Many deep learning pipelines (e.g., Keras `flow_from_directory`) rely on folder names as class labels. Mismatches can silently cause errors or mislabel data during training.
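
The comparison can be run in both directions, so that folders with no matching CSV label are reported as well. The sketch below assumes a helper named `compare_labels` (hypothetical, not part of the script) and runs on a throwaway directory with made-up folder names:

```python
import os
import tempfile

def compare_labels(csv_labels, image_root):
    """Return (labels missing a folder, folders missing from the CSV)."""
    folders = {
        name for name in os.listdir(image_root)
        if os.path.isdir(os.path.join(image_root, name))
    }
    csv_labels = set(csv_labels)
    return sorted(csv_labels - folders), sorted(folders - csv_labels)

# Demo: one folder differs from its CSV label only by capitalization.
with tempfile.TemporaryDirectory() as root:
    for name in ("blast", "hispa", "Tungro"):
        os.mkdir(os.path.join(root, name))
    missing_folders, extra_folders = compare_labels(
        ["blast", "hispa", "tungro"], root
    )

print(missing_folders)  # ['tungro']
print(extra_folders)    # ['Tungro']
```

Seeing the same name on both sides with different casing is usually the sign of a spelling or formatting inconsistency rather than a truly missing class.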

---

### 5. Image Existence Check

I iterate through each row in the metadata and verify whether the image file actually exists at the expected location (`train_images/<label>/<image_id>`). Missing files are counted and reported, and only valid paths are retained.

This step protects against broken links and helps prevent `FileNotFoundError` during training.
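
To make missing files easier to trace, the paths themselves can be collected instead of only a count. A minimal sketch, assuming a hypothetical helper `split_existing` and a temporary directory standing in for `train_images/`:

```python
import os
import tempfile

def split_existing(rows, image_root):
    """Split (image_id, label) rows into found paths and missing paths."""
    found, missing = [], []
    for image_id, label in rows:
        path = os.path.join(image_root, label, image_id)
        (found if os.path.isfile(path) else missing).append(path)
    return found, missing

with tempfile.TemporaryDirectory() as root:
    # Only 1.jpg exists on disk; 2.jpg is listed in the rows but absent.
    os.mkdir(os.path.join(root, "blast"))
    open(os.path.join(root, "blast", "1.jpg"), "wb").close()
    rows = [("1.jpg", "blast"), ("2.jpg", "blast")]
    found, missing = split_existing(rows, root)
    missing_names = [os.path.basename(p) for p in missing]

print(len(found), missing_names)  # 1 ['2.jpg']
```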

---

### 6. Corrupted Image Detection

Using PIL's `Image.verify()` method, which checks a file's integrity without decoding the pixel data, I identify corrupted or unreadable images. These are **excluded** from the dataset to prevent:
- Training crashes
- Unexpected behavior in augmentation or batching
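
Because `verify()` does not decode the image, a stricter variant can follow it with a full `load()`. This is a sketch of that idea, not what the script below does; the helper name `is_readable` and the test files are assumptions made for the example.

```python
import os
import tempfile
from PIL import Image

def is_readable(path):
    """Return True if the file both verifies and decodes fully."""
    try:
        with Image.open(path) as img:
            img.verify()  # structural check only, no pixel decoding
        # The image object is unusable after verify(), so reopen the file.
        with Image.open(path) as img:
            img.load()    # forces a full decode of the pixel data
        return True
    except Exception:
        return False

with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "good.png")
    bad = os.path.join(root, "bad.png")
    Image.new("RGB", (8, 8), "green").save(good)
    with open(bad, "wb") as fh:
        fh.write(b"this is not an image")
    results = (is_readable(good), is_readable(bad))

print(results)  # (True, False)
```

The second pass roughly doubles the I/O, which is a reasonable price for a one-off cleaning run.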

---

### 7. Remove Corrupted Files

After detecting corrupted images, I update the list of valid images and labels to **exclude them entirely**. This guarantees that downstream processes only work with healthy images.
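
When paths and labels live in two parallel lists, they have to be filtered together; filtering one list first and then zipping it against the unfiltered other list would shift the labels. A minimal sketch with a hypothetical helper `drop_paths` and invented file names:

```python
def drop_paths(paths, labels, bad_paths):
    """Filter paths and labels together so the two lists stay aligned."""
    bad = set(bad_paths)
    kept = [(p, l) for p, l in zip(paths, labels) if p not in bad]
    kept_paths = [p for p, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_paths, kept_labels

paths = ["a.jpg", "b.jpg", "c.jpg"]
labels = ["blast", "hispa", "tungro"]
kept_paths, kept_labels = drop_paths(paths, labels, ["b.jpg"])

print(kept_paths)   # ['a.jpg', 'c.jpg']
print(kept_labels)  # ['blast', 'tungro']
```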

---

### 8. Duplicate Image Content Check

I calculate the **MD5 hash** of each image's decoded pixel data to detect exact duplicates (e.g., the same image saved under different names). This is useful to:
- Prevent overfitting from repeated data
- Avoid artificially inflating model performance

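Note: In this script, I only *flag* duplicates; I could extend it to remove them later if needed. Until then, the flagged copies stay in the dataset, so a duplicate pair can still end up on both sides of the train/validation split.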
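
One possible way to extend the check toward removal is to group files by hash, so that every copy in a group is visible before deciding which one to keep. The sketch below swaps in a hash of the raw file bytes instead of the decoded pixel data, which is an assumption made to keep the example dependency-free: it only catches byte-identical files and would miss re-encoded copies. The helper names and demo files are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def file_md5(path, chunk_size=65536):
    """MD5 of the raw file bytes, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_duplicates(paths):
    """Return lists of paths that share the same content hash."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]

with tempfile.TemporaryDirectory() as root:
    contents = {"a.jpg": b"same", "b.jpg": b"same", "c.jpg": b"other"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(root, name)
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    dup_groups = [
        [os.path.basename(p) for p in group]
        for group in group_duplicates(paths)
    ]

print(dup_groups)  # [['a.jpg', 'b.jpg']]
```

Keeping the first path of each group and dropping the rest before the split would prevent the same picture from appearing in both subsets.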

---

### 9. Stratified Train/Validation Split

I use `train_test_split` with stratification to split the cleaned data into 80% training and 20% validation sets. **Stratification ensures the same class distribution** across both sets, which is crucial in imbalanced datasets like this one.

Why not use a random split?
> A purely random split might leave some disease classes overrepresented in one set and underrepresented in the other, which makes validation scores unreliable and can hurt generalization.
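
To show what stratification means in practice, here is a hand-rolled sketch that splits each class separately. It is an illustration of the idea only, not scikit-learn's implementation of `train_test_split`; the function name and the 90/10 toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, val_fraction=0.2, seed=42):
    """Split each class separately so both subsets keep the class ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, val = [], []
    for label, members in by_label.items():
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        val += [(m, label) for m in members[:n_val]]
        train += [(m, label) for m in members[n_val:]]
    return train, val

# Toy imbalanced dataset: 90 samples of one class, 10 of another.
labels = ["common"] * 90 + ["rare"] * 10
items = list(range(100))
train, val = stratified_split(items, labels)

val_counts = sorted(Counter(label for _, label in val).items())
train_counts = sorted(Counter(label for _, label in train).items())
print(val_counts)    # [('common', 18), ('rare', 2)]
print(train_counts)  # [('common', 72), ('rare', 8)]
```

Both subsets keep the 9:1 ratio of the full set, which a plain shuffle of 100 items would not guarantee for the rare class.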

---

### 10. File Copying

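Finally, I copy the images into a new directory structure, `dataset/train/<label>/` and `dataset/val/<label>/`. Any existing `dataset/` directory is deleted first so that stale files from an earlier run cannot leak into the new split.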
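
A copy step like this overwrites silently when two entries resolve to the same target file, which could happen if the metadata ever contained a repeated row. The sketch below is one hedged way to surface such clashes; `copy_into_subset` and the demo paths are hypothetical and not part of the script.

```python
import os
import shutil
import tempfile

def copy_into_subset(paths, labels, dest_root, subset):
    """Copy files to dest_root/subset/<label>/ and return skipped clashes."""
    clashes = []
    for path, label in zip(paths, labels):
        dest_dir = os.path.join(dest_root, subset, label)
        os.makedirs(dest_dir, exist_ok=True)
        target = os.path.join(dest_dir, os.path.basename(path))
        if os.path.exists(target):
            clashes.append(path)  # a plain copy would overwrite here
            continue
        shutil.copy2(path, target)
    return clashes

with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src.jpg")
    with open(src, "wb") as fh:
        fh.write(b"demo")
    dest_root = os.path.join(root, "dataset")
    # The same source path is listed twice to simulate a repeated row.
    clashes = copy_into_subset([src, src], ["blast", "blast"], dest_root, "train")
    copied = os.listdir(os.path.join(dest_root, "train", "blast"))
    clash_count = len(clashes)

print(copied, clash_count)  # ['src.jpg'] 1
```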

In [6]:
import os
import shutil
import pandas as pd
from sklearn.model_selection import train_test_split
from PIL import Image
from hashlib import md5

# === CONFIG ===
BASE_PATH = os.path.abspath(os.path.join(os.getcwd(), ".."))  # Run from MLmodels/
SOURCE_DIR = os.path.join(BASE_PATH, "train_images")
DEST_DIR = os.path.join(BASE_PATH, "dataset")
META_CSV = os.path.join(BASE_PATH, "meta_train.csv")
IMG_EXT = (".jpg", ".jpeg", ".png")
LOG_PREFIX = "[CLEAN]"

# === Load metadata ===
print(f"{LOG_PREFIX} Loading metadata...")
df = pd.read_csv(META_CSV, dtype={'image_id': str})  # read IDs as text so they match filenames exactly

# === Check for duplicate metadata entries ===
dupe_count = df['image_id'].duplicated().sum()
print(f"{LOG_PREFIX} Duplicate image_id entries in CSV: {dupe_count}")

# === Verify that label folders exist ===
label_folders = set(os.listdir(SOURCE_DIR))
csv_labels = set(df['label'].unique())
invalid_labels = csv_labels - label_folders
if invalid_labels:
    print(f"{LOG_PREFIX} ❌ Labels in CSV not found in image folders: {invalid_labels}")
else:
    print(f"{LOG_PREFIX} ✅ All CSV labels match image folders.")

# === Check that images listed in metadata exist ===
print(f"{LOG_PREFIX} Checking file existence...")
valid_images = []
valid_labels = []
missing = 0

for _, row in df.iterrows():
    img_name = row['image_id']
    label = row['label']
    path = os.path.join(SOURCE_DIR, label, img_name)
    if os.path.isfile(path):
        valid_images.append(path)
        valid_labels.append(label)
    else:
        missing += 1

print(f"{LOG_PREFIX} ✅ Valid image files found: {len(valid_images)}")
if missing:
    print(f"{LOG_PREFIX} ❌ Missing images: {missing}")

# === Check for corrupted/unreadable images ===
print(f"{LOG_PREFIX} Checking for corrupted images...")
corrupted_paths = []

for path in valid_images:
    try:
        with Image.open(path) as img:
            img.verify()
    except Exception:
        corrupted_paths.append(path)

print(f"{LOG_PREFIX} ✅ Healthy images: {len(valid_images) - len(corrupted_paths)}")
print(f"{LOG_PREFIX} ❌ Corrupted images: {len(corrupted_paths)}")

# === Remove corrupted images from list ===
# Filter paths and labels in one pass so the two lists stay aligned.
corrupted_set = set(corrupted_paths)
kept = [(p, l) for p, l in zip(valid_images, valid_labels) if p not in corrupted_set]
valid_images = [p for p, _ in kept]
valid_labels = [l for _, l in kept]

# === Check for duplicate image content ===
print(f"{LOG_PREFIX} Checking for duplicate image content...")
hashes = {}
duplicate_content = []

for path in valid_images:
    try:
        with Image.open(path) as img:
            img_hash = md5(img.tobytes()).hexdigest()
            if img_hash in hashes:
                duplicate_content.append((path, hashes[img_hash]))
            else:
                hashes[img_hash] = path
    except Exception:
        continue  # Unreadable files were already removed above

print(f"{LOG_PREFIX} ✅ Unique image content: {len(valid_images) - len(duplicate_content)}")
print(f"{LOG_PREFIX} ⚠️ Duplicate image content found: {len(duplicate_content)}")

# === Final train/val split ===
print(f"{LOG_PREFIX} Splitting cleaned data into train/val (80/20)...")
train_paths, val_paths, train_labels, val_labels = train_test_split(
    valid_images, valid_labels, test_size=0.2, stratify=valid_labels, random_state=42
)

# === Copy files to cleaned dataset ===
def copy_files(paths, labels, subset):
    for path, label in zip(paths, labels):
        dest = os.path.join(DEST_DIR, subset, label)
        os.makedirs(dest, exist_ok=True)
        shutil.copy(path, dest)

# === Clear old dataset if exists ===
if os.path.exists(DEST_DIR):
    shutil.rmtree(DEST_DIR)

copy_files(train_paths, train_labels, "train")
copy_files(val_paths, val_labels, "val")

print(f"{LOG_PREFIX} ✅ Finished: dataset/train/ and dataset/val/ created with clean data.")


[CLEAN] Loading metadata...
[CLEAN] Duplicate image_id entries in CSV: 0
[CLEAN] ✅ All CSV labels match image folders.
[CLEAN] Checking file existence...
[CLEAN] ✅ Valid image files found: 10407
[CLEAN] Checking for corrupted images...
[CLEAN] ✅ Healthy images: 10407
[CLEAN] ❌ Corrupted images: 0
[CLEAN] Checking for duplicate image content...
[CLEAN] ✅ Unique image content: 10333
[CLEAN] ⚠️ Duplicate image content found: 74
[CLEAN] Splitting cleaned data into train/val (80/20)...
[CLEAN] ✅ Finished: dataset/train/ and dataset/val/ created with clean data.
