<a href="https://colab.research.google.com/github/malakelbasha78/AI-Skills-Projects/blob/main/data/preprocessing_restnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Image Preprocessing Notebook (NO MERGE – NO DATA LEAKAGE)

This notebook performs **safe preprocessing** for an image classification task:
- Keeps **train / val / test** completely separate
- No merge, no reshuffle, no resplit
- Converts images to RGB
- Removes corrupted images
- Removes duplicate images **inside each split only** (compared within each class folder)
- No resize, no normalization, no augmentation (these are done in training)

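Because the splits are never merged, reshuffled, or compared against each other, preprocessing itself cannot move an image from one split to another. It does not detect images that already appear in more than one split, so that source of leakage has to be checked separately.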



## 1️⃣ Imports
We import only what is required for preprocessing.


In [None]:

import os
import hashlib
from PIL import Image
from collections import defaultdict



## 2️⃣ Configuration
Define dataset structure and splits.


In [None]:

DATASET_DIR = "dataset"   # contains train / val / test
SPLITS = ["train", "val", "test"]



## 3️⃣ Dataset Structure Check
Verify that train / val / test folders exist.


In [None]:

for split in SPLITS:
    split_path = os.path.join(DATASET_DIR, split)
    if os.path.exists(split_path):
        print(f"✔️ Found {split} folder")
    else:
        print(f"❌ Missing {split} folder")



## 4️⃣ Count Images Before Preprocessing
We count images per class to compare before vs after.


In [None]:

def count_images():
    counts = defaultdict(dict)
    for split in SPLITS:
        split_path = os.path.join(DATASET_DIR, split)
        if not os.path.exists(split_path):
            continue
        for cls in os.listdir(split_path):
            cls_path = os.path.join(split_path, cls)
            if os.path.isdir(cls_path):
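                # Count regular files only, so nested folders are not counted as images
                counts[split][cls] = sum(
                    os.path.isfile(os.path.join(cls_path, f))
                    for f in os.listdir(cls_path)
                )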
    return counts

before_counts = count_images()
print("Image counts BEFORE preprocessing:")
before_counts



## 5️⃣ Remove Corrupted Images + Convert to RGB
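Every file is opened, converted to RGB, and saved back over the original. Files that Pillow cannot open or save are deleted permanently, and re-saving a JPEG re-encodes it (which is lossy), so run this step on a copy of the dataset if the originals matter.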


In [None]:

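def clean_images(split):
    """Convert every image in a split to RGB in place; delete unreadable files."""
    split_path = os.path.join(DATASET_DIR, split)
    for cls in os.listdir(split_path):
        cls_path = os.path.join(split_path, cls)
        if not os.path.isdir(cls_path):
            continue
        for img_name in os.listdir(cls_path):
            img_path = os.path.join(cls_path, img_name)
            if not os.path.isfile(img_path):
                continue  # skip nested folders; os.remove would fail on them
            try:
                # The context manager closes the file handle before the file is overwritten
                with Image.open(img_path) as img:
                    rgb = img.convert("RGB")
                rgb.save(img_path)
            except Exception:
                # Unreadable or unsaveable file: remove it from the split
                os.remove(img_path)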
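
Because `clean_images` deletes files, a read-only pass over a class folder can be useful before running it. The following is a minimal sketch, assuming Pillow is installed; `find_unreadable_images` is a hypothetical helper that is not part of the original notebook, and it only reports file names without modifying anything.

```python
import os
from PIL import Image


def find_unreadable_images(class_dir):
    """Return the sorted names of files in class_dir that Pillow cannot decode.

    Hypothetical helper for a dry run: nothing is deleted or rewritten.
    """
    unreadable = []
    for name in sorted(os.listdir(class_dir)):
        path = os.path.join(class_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            # verify() checks file integrity without decoding the pixel data
            with Image.open(path) as img:
                img.verify()
            # An image cannot be used after verify(), so reopen it;
            # load() forces a full decode, which also catches truncated files
            with Image.open(path) as img:
                img.load()
        except Exception:
            unreadable.append(name)
    return unreadable
```

Calling it on one class folder of a split (for example a path built with `os.path.join(DATASET_DIR, "train", cls)`) returns the files that the cleaning step would likely remove.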



## 6️⃣ Remove Duplicate Images (Inside Each Split)
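Duplicates are detected by hashing the decoded pixel data and are removed **per split only**, one class folder at a time. Images are never compared across splits.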


In [None]:

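def remove_duplicates(split):
    """Delete pixel-identical duplicates within each class folder of a split."""
    split_path = os.path.join(DATASET_DIR, split)
    for cls in os.listdir(split_path):
        cls_path = os.path.join(split_path, cls)
        if not os.path.isdir(cls_path):
            continue

        hashes = set()
        # Sort so the same file is kept on every run (os.listdir order is arbitrary)
        for img_name in sorted(os.listdir(cls_path)):
            img_path = os.path.join(cls_path, img_name)
            if not os.path.isfile(img_path):
                continue
            try:
                with Image.open(img_path) as img:
                    # Hash mode and size along with the pixels so that images
                    # with different shapes never share a hash by accident
                    hasher = hashlib.md5(f"{img.mode}:{img.size}".encode())
                    hasher.update(img.tobytes())
                img_hash = hasher.hexdigest()
            except Exception:
                os.remove(img_path)  # unreadable file
                continue
            if img_hash in hashes:
                os.remove(img_path)  # same pixels as a file already kept
            else:
                hashes.add(img_hash)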



## 7️⃣ Run Preprocessing
Apply cleaning and duplicate removal independently for each split.


In [None]:

for split in SPLITS:
    split_path = os.path.join(DATASET_DIR, split)
    if os.path.exists(split_path):
        print(f"Processing {split}...")
        clean_images(split)
        remove_duplicates(split)



## 8️⃣ Count Images After Preprocessing
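Recount the images so the totals can be compared with the counts taken before preprocessing. Any drop reflects corrupted or duplicate files that were removed.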


In [None]:

after_counts = count_images()
print("Image counts AFTER preprocessing:")
after_counts
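
To make the comparison explicit, the two count dictionaries can be diffed. This is a minimal sketch: `diff_counts` is a hypothetical helper, and the sample numbers and class names below are made up to mimic the shape of the `count_images()` output. In the notebook it would be called as `diff_counts(before_counts, after_counts)`.

```python
def diff_counts(before, after):
    """Return {split: {cls: files_removed}} for every class whose count changed."""
    changes = {}
    for split, classes in before.items():
        for cls, n_before in classes.items():
            n_after = after.get(split, {}).get(cls, 0)
            if n_after != n_before:
                changes.setdefault(split, {})[cls] = n_before - n_after
    return changes


# Made-up counts, shaped like the dictionaries returned by count_images()
before = {"train": {"cat": 100, "dog": 98}, "val": {"cat": 20, "dog": 20}}
after = {"train": {"cat": 97, "dog": 98}, "val": {"cat": 20, "dog": 19}}

print(diff_counts(before, after))
# {'train': {'cat': 3}, 'val': {'dog': 1}}
```

An empty result means no file was removed from any class.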



## 9️⃣ Final Sanity Check
This cell prints a summary of the properties that follow from the design above. It does not inspect the files itself.


In [None]:

print("✅ Preprocessing finished")
print("✔️ No merge")
print("✔️ No resplit")
print("✔️ No augmentation")
print("✔️ No data moved between splits")
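
The print statements do not examine the dataset. One way to actually test for overlap between splits is to look for files with identical bytes in more than one split. The sketch below uses a different technique from the notebook's pixel hash: it hashes raw file contents with SHA-256, so it only catches exact file copies, not re-encoded or resized versions of the same picture. `file_digest` and `find_cross_split_duplicates` are hypothetical helpers, and the check is read-only.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's raw bytes, read in chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_cross_split_duplicates(dataset_dir, splits):
    """Return {digest: [(split, relative_path), ...]} for files whose exact
    bytes appear in two or more different splits. Nothing is deleted."""
    seen = defaultdict(list)
    for split in splits:
        split_path = os.path.join(dataset_dir, split)
        if not os.path.isdir(split_path):
            continue
        for root, _dirs, files in os.walk(split_path):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, split_path)
                seen[file_digest(path)].append((split, rel))
    # Keep only digests that occur in more than one distinct split
    return {
        digest: sorted(locations)
        for digest, locations in seen.items()
        if len({split for split, _ in locations}) > 1
    }
```

In the notebook this would be called as `find_cross_split_duplicates(DATASET_DIR, SPLITS)`. An empty dictionary means no byte-identical file is shared between splits; any entry lists where the copies live so they can be reviewed by hand.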
