## 03_DataCleaning.ipynb

### 🎯 Objective
Identify and remove corrupted or unreadable images from the dataset to ensure clean input for the modeling pipeline.

### 🧩 Steps
- Iterate through dataset folders
- Attempt to open each image using `PIL`
- Flag and list any corrupted or unreadable images
- (Optionally) delete them

#### ℹ️ Note:
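- PIL (Pillow) is used for the cleaning/validation step.
- It is strict about malformed files: `Image.open` followed by `verify()` raises an error when a file cannot be identified as an image or is found to be broken, which helps guarantee the integrity of the dataset before training.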
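
As a hedged illustration of how strict the check can be made: Pillow's `verify()` inspects the file without decoding the pixel data, so a fuller check can also decode the image with `load()`. The helper name `is_fully_readable` is hypothetical and not part of this notebook; this is a minimal sketch, not the check used in the cells below.

```python
from PIL import Image


def is_fully_readable(file_path):
    """Return True if the file passes verify() and also decodes fully.

    Hypothetical helper for illustration only.
    """
    try:
        # verify() checks the file without decoding pixel data.
        with Image.open(file_path) as img:
            img.verify()
        # An image object cannot be reused after verify(), so reopen the
        # file and force a full decode with load().
        with Image.open(file_path) as img:
            img.load()
    except Exception:
        return False
    return True
```

The second pass costs extra time per file, so it is a trade-off between speed and thoroughness.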

In [9]:
from PIL import Image
import os

def find_invalid_images(folder_path):
    """Return paths of files under folder_path/<label>/ that PIL cannot open or verify."""
    corrupted_files = []

    for label in os.listdir(folder_path):
        if label.startswith('.'):  # Skip hidden/system folders
            continue
        label_path = os.path.join(folder_path, label)
        if not os.path.isdir(label_path):
            continue

        for filename in os.listdir(label_path):
            if filename.startswith('.'):  # Skip hidden/system files
                continue
            file_path = os.path.join(label_path, filename)
            try:
                with Image.open(file_path) as img:
                    img.verify()  # Check file integrity without decoding pixel data
            except Exception:
                corrupted_files.append(file_path)

    print(f"Found {len(corrupted_files)} invalid image(s).")
    if corrupted_files:
        print("Example invalid file(s):")
        print("\n".join(corrupted_files[:5]))
    
    return corrupted_files


In [10]:
corrupted_files = find_invalid_images("../inputs/cherry_leaves/cherry-leaves")

Found 0 invalid image(s).


#### ℹ️ Note:
Since no invalid images were found, no deletion step was needed.
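
If a future run does flag files, the optional delete step from the list of steps could look like the sketch below. The function name `delete_files` and its `dry_run` flag are hypothetical and not part of this notebook; the dry run defaults to on so that nothing is removed until the list has been reviewed.

```python
import os


def delete_files(file_paths, dry_run=True):
    """Delete the listed files and return the paths actually removed.

    Hypothetical helper: with dry_run=True it only prints what would be
    deleted and removes nothing.
    """
    removed = []
    for file_path in file_paths:
        if not os.path.isfile(file_path):
            continue  # Skip paths that no longer exist
        if dry_run:
            print(f"Would delete: {file_path}")
        else:
            os.remove(file_path)
            removed.append(file_path)
    return removed
```

It would be called on the list returned by `find_invalid_images`, first as `delete_files(corrupted_files)` to preview and then with `dry_run=False` to actually remove the files.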