## Data Collection

### Objective

Fetch the cherry leaves dataset from Kaggle and prepare it for the later stages of the project.

### Inputs

Kaggle JSON file - the authentication token.

### Outputs

Generate Dataset: inputs/cherry_leaves (downloaded images), with train, validation and test subsets in inputs/split-leaves

### Install Packages

In [1]:
!pip3 install pandas numpy matplotlib seaborn plotly streamlit pillow joblib scikit-learn tensorflow-cpu keras kaggle



### Import Dataset

In [2]:
import os

In [3]:
# Build the kaggle.json path relative to the current working directory
config_path = os.path.join(os.getcwd(), "jupyter_notebooks", "kaggle.json")

# Set Kaggle config env var
os.environ['KAGGLE_CONFIG_DIR'] = os.path.dirname(config_path)

# Apply permissions
!chmod 600 jupyter_notebooks/kaggle.json
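
The shell `chmod` call above assumes a Unix-like shell and a notebook started from the project root. A minimal pure-Python sketch of the same setup follows. The helper name `prepare_kaggle_token` is hypothetical and not part of the original notebook, and the early failure on a missing token file is an addition.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(token_path):
    """Restrict the token file to its owner and point the Kaggle CLI at it.

    Hypothetical helper, not part of the original notebook.
    """
    token_path = Path(token_path).resolve()
    if not token_path.is_file():
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")

    # Equivalent of `chmod 600`: owner read/write only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # KAGGLE_CONFIG_DIR names the directory that holds kaggle.json.
    os.environ["KAGGLE_CONFIG_DIR"] = str(token_path.parent)
    return token_path
```

Calling `prepare_kaggle_token("jupyter_notebooks/kaggle.json")` from the project root would cover the three steps in the cell above.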

#### ℹ️ Note:
Download the dataset from Kaggle into the destination folder inputs/cherry_leaves.

In [4]:
! kaggle datasets download -d "codeinstitute/cherry-leaves" -p "inputs/cherry_leaves"

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry_leaves
  0%|                                               | 0.00/55.0M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 1.95GB/s]



## Data Preparation

Unzip the downloaded archive using Python's zipfile module.

In [5]:
import zipfile
with zipfile.ZipFile("inputs/cherry_leaves/cherry-leaves.zip", 'r') as zip_ref:
    zip_ref.extractall("inputs/cherry_leaves/")
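
As a minimal sketch, the extraction step could also check the archive for corrupt members before unpacking. `ZipFile.testzip()` returns the name of the first bad member, or `None` when every member passes its CRC check. The function name `extract_verified` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_verified(zip_path, dest_dir):
    """Extract zip_path into dest_dir after a CRC check of every member.

    Hypothetical helper. Returns the number of archive members.
    """
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() reads every member and reports the first corrupt one.
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(dest_dir)
        return len(zip_ref.namelist())
```

With the paths used in this notebook, the call would be `extract_verified("inputs/cherry_leaves/cherry-leaves.zip", "inputs/cherry_leaves/")`.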

#### ℹ️ Note:
Use PIL for the cleaning/validation step.
Its verify() method checks each file for corruption without decoding the full image, which helps guarantee the integrity of the dataset before training.

In [8]:
from PIL import Image

def find_invalid_images(folder_path):
    corrupted_files = []

    for label in os.listdir(folder_path):
        label_path = os.path.join(folder_path, label)
        if not os.path.isdir(label_path):
            continue

        for filename in os.listdir(label_path):
            file_path = os.path.join(label_path, filename)
            try:
                with Image.open(file_path) as img:
                    img.verify()  # Check file integrity without decoding the image data
            except Exception:
                corrupted_files.append(file_path)

    print(f"Found {len(corrupted_files)} invalid image(s).")
    if corrupted_files:
        print("Example invalid file(s):")
        print("\n".join(corrupted_files[:5]))
    
    return corrupted_files


In [9]:
corrupted_files = find_invalid_images("inputs/cherry_leaves/cherry-leaves")

Found 0 invalid image(s).


#### ℹ️ Note:
Since no invalid images were found, no files need to be deleted.
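
Had the check reported invalid files, the returned list could be passed to a small clean-up step. The sketch below is illustrative only. `remove_files` is a hypothetical helper that was not needed in this notebook.

```python
from pathlib import Path


def remove_files(file_paths):
    """Delete each listed file and return how many were removed.

    Hypothetical helper. Paths that no longer exist are skipped.
    """
    removed = 0
    for file_path in file_paths:
        path = Path(file_path)
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```

With an empty list, as in this run, `remove_files(corrupted_files)` would return 0 and touch nothing.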

In [12]:
from pathlib import Path
import shutil
import random

def organize_dataset_into_subsets(image_dir, output_dir, splits=(0.7, 0.2, 0.1)):
    image_dir = Path(image_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

    labels = [d.name for d in image_dir.iterdir() if d.is_dir()]
    print(f"Found labels: {labels}")

    for label in labels:
        files = [f for f in (image_dir / label).iterdir() if f.is_file()]
        random.shuffle(files)

        train_count = int(splits[0] * len(files))
        val_count = int(splits[1] * len(files))

        subsets = {
            "train": files[:train_count],
            "validation": files[train_count:train_count + val_count],
            "test": files[train_count + val_count:]
        }

        for subset_name, subset_files in subsets.items():
            subset_path = output_dir / subset_name / label
            subset_path.mkdir(parents=True, exist_ok=True)

            for file in subset_files:
                shutil.copy(file, subset_path)

    print("\n✅ Dataset split complete.")
    for subset in ["train", "validation", "test"]:
        total = sum(1 for p in (output_dir / subset).rglob("*") if p.is_file())  # count files only
        print(f"{subset.capitalize()} set: {total} images")

#### ℹ️ Note:
A separate output directory is used to preserve the original dataset. This approach ensures data integrity, supports reproducibility, and allows for safe reprocessing or experimentation without altering the source files.
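
The `random.shuffle` call in the function above is unseeded, so each run produces a different split. A minimal sketch of a repeatable variant follows. The helper name `split_files` and the default seed of 42 are assumptions made for illustration, not part of the original notebook.

```python
import random


def split_files(files, splits=(0.7, 0.2, 0.1), seed=42):
    """Return train/validation/test lists that are identical for a given seed.

    Hypothetical helper. The seed value is arbitrary.
    """
    files = sorted(files)                # fixed starting order
    random.Random(seed).shuffle(files)   # local RNG, global random state untouched

    train_count = int(splits[0] * len(files))
    val_count = int(splits[1] * len(files))
    return {
        "train": files[:train_count],
        "validation": files[train_count:train_count + val_count],
        "test": files[train_count + val_count:],
    }
```

Sorting before shuffling matters because directory listings are not guaranteed to come back in the same order on every machine.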


In [14]:
organize_dataset_into_subsets(
    image_dir="inputs/cherry_leaves/cherry-leaves",
    output_dir="inputs/split-leaves",
)

Found labels: ['powdery_mildew', 'healthy']

✅ Dataset split complete.
Train set: 4052 images
Validation set: 1782 images
Test set: 980 images


#### ℹ️ Note:
A 70-20-10 split was chosen to ensure that the model has sufficient data to learn from (70%), while also providing enough examples to validate its performance during training (20%) and evaluate its generalization on unseen data (10%).
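
The counts printed above can be checked against the intended ratios with a short calculation. In this sketch `split_ratios` is a hypothetical helper, and the counts are copied from the output shown earlier.

```python
def split_ratios(counts):
    """Return each subset's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {name: round(count / total, 3) for name, count in counts.items()}


# Counts copied from the output printed by the split cell.
reported = {"train": 4052, "validation": 1782, "test": 980}
print(split_ratios(reported))
# {'train': 0.595, 'validation': 0.262, 'test': 0.144}
```

The result is roughly 59/26/14 rather than 70/20/10. One possible explanation, which the notebook does not confirm, is that the split cell ran more than once into the same output directory. Because the shuffle is unseeded, each run copies a different selection, so the subset folders accumulate files and the same image may end up in more than one subset. Emptying inputs/split-leaves before re-running the split would rule this out.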
