## 📥 Data Collection

This notebook handles the collection of the raw dataset used in this project.

- The dataset is hosted on Kaggle and requires authentication via the Kaggle API.
- We configure Kaggle credentials securely and download the dataset programmatically.
- Once downloaded, the dataset is extracted from its zip archive and made ready for further analysis.

This phase does **not** involve any modification or transformation of the data. The goal is to retrieve the dataset in its original form for inspection and preprocessing.

### Objective
Fetch the dataset from its Kaggle URL and prepare it for the following steps.

### Inputs
Kaggle JSON file (`kaggle.json`): the authentication token.

### Outputs
Generate Dataset: inputs/cherry_leaves

### Import Dataset

In [1]:
import os

In [3]:
import os

for root, dirs, files in os.walk("."):
    for file in files:
        if file == "kaggle.json":  # exact match, so names like "kaggle.json.bak" are ignored
            print("✅ Found:", os.path.join(root, file))

✅ Found: ./kaggle.json


In [4]:
# Build the kaggle.json path from the current working directory
config_path = os.path.join(os.getcwd(), "kaggle.json")

# Set Kaggle config env var
os.environ['KAGGLE_CONFIG_DIR'] = os.path.dirname(config_path)

# Apply permissions
!chmod 600 kaggle.json
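
The credential setup above can also be checked before any download is attempted. The following is a minimal sketch, not part of the original notebook: `check_kaggle_credentials` is a hypothetical helper name, and it assumes the token file uses the standard `username` and `key` entries of a Kaggle API token. It also applies the same owner-only permissions as `chmod 600`, but from Python.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(config_dir):
    """Return True if config_dir holds a usable kaggle.json (hypothetical helper)."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False

    with open(token_path) as f:
        try:
            token = json.load(f)
        except json.JSONDecodeError:
            return False

    # Assumption: a Kaggle API token file contains "username" and "key" entries.
    if not isinstance(token, dict) or not all(token.get(k) for k in ("username", "key")):
        return False

    # Same effect as `chmod 600`: read/write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Calling `check_kaggle_credentials(os.getcwd())` before the download cell would surface a missing or malformed token early instead of as a CLI error.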

#### ℹ️ Note:
Download the dataset from Kaggle into the destination folder `../inputs/cherry_leaves`.

In [6]:
! kaggle datasets download -d "codeinstitute/cherry-leaves" -p "../inputs/cherry_leaves"

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to ../inputs/cherry_leaves
  0%|                                               | 0.00/55.0M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 1.45GB/s]



**Unzip the downloaded archive using the `zipfile` module.**

In [9]:
import zipfile
with zipfile.ZipFile("../inputs/cherry_leaves/cherry-leaves.zip", 'r') as zip_ref:
    zip_ref.extractall("../inputs/cherry_leaves/")
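
If the download failed or was interrupted, the extraction step above raises a fairly opaque error. A slightly more defensive variant might look like the sketch below. `extract_archive` is a hypothetical helper introduced here for illustration; it uses only the standard `zipfile` and `pathlib` modules.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of files extracted."""
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)

    # is_zipfile returns False for a missing file as well as for a corrupt one.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")

    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
        # Count only real files, not directory entries.
        return sum(1 for info in zip_ref.infolist() if not info.is_dir())
```

The returned file count gives a quick sanity check against the image totals computed later in the notebook.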

### Dataset Statistics Collection
We count the total number of images in each class (healthy vs infected cherry leaves) and save the statistics for use in the Streamlit dashboard metrics.
- 📌 **Insight:** These counts give an overview of the dataset's total size and class balance, which informs model development planning.

In [12]:
import os
import json

# Count images in each folder
healthy_path = "../inputs/cherry_leaves/cherry-leaves/healthy"
infected_path = "../inputs/cherry_leaves/cherry-leaves/powdery_mildew"

healthy_count = len(os.listdir(healthy_path))
infected_count = len(os.listdir(infected_path))
total_count = healthy_count + infected_count

# Create results dictionary
dataset_stats = {
    "total_images": total_count,
    "healthy_leaves": healthy_count,
    "powdery_mildew_leaves": infected_count
}

# Save to JSON file (create the outputs folder first if it does not exist)
os.makedirs("../outputs", exist_ok=True)
with open("../outputs/dataset_stats.json", "w") as f:
    json.dump(dataset_stats, f, indent=2)

print("Dataset stats saved to outputs/dataset_stats.json")
print(f"Total Images: {total_count}")
print(f"Healthy Leaves: {healthy_count}")
print(f"Infected Leaves: {infected_count}")

Dataset stats saved to outputs/dataset_stats.json
Total Images: 4208
Healthy Leaves: 2104
Infected Leaves: 2104
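
To make the class balance explicit, the counts reported above can be turned into proportions. This is a small illustrative sketch; `class_distribution` is a hypothetical helper, and the numbers passed in are the counts printed by the cell above.

```python
def class_distribution(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No images counted")
    return {label: n / total for label, n in counts.items()}


shares = class_distribution({"healthy": 2104, "powdery_mildew": 2104})
print(shares)  # {'healthy': 0.5, 'powdery_mildew': 0.5}
```

Both classes hold exactly half of the 4208 images, so the dataset is balanced.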


## Split Dataset into Train, Validation, and Test

The function below organizes the cherry leaf dataset into three separate subsets:
- **70% for training**
- **20% for validation**
- **10% for testing**

We also:
- **Skip hidden folders/files** such as `.ipynb_checkpoints`, which are automatically generated by Jupyter and are not actual image data.
- **Filter for only valid image formats** like `.jpg`, `.jpeg`, and `.png` to avoid including non-image files by mistake.
- **Write to a separate output directory** to preserve the original dataset. This approach ensures data integrity, supports reproducibility, and allows for safe reprocessing or experimentation without altering the source files.

This structure supports robust training and evaluation.
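
The split percentages translate into whole-image counts as follows. The sketch below mirrors the arithmetic used by the splitting function in this notebook: the train and validation sizes are truncated with `int()`, and the test set receives whatever remains. `split_counts` is a hypothetical helper name used only for this illustration.

```python
def split_counts(n_files, splits=(0.7, 0.2, 0.1)):
    """Return (train, validation, test) sizes for n_files images of one class."""
    train = int(splits[0] * n_files)   # truncated, not rounded
    val = int(splits[1] * n_files)     # truncated, not rounded
    test = n_files - train - val       # the remainder goes to the test set
    return train, val, test


# Each class in this dataset contains 2104 images.
print(split_counts(2104))  # (1472, 420, 212)
```

Because of the truncation, the test set ends up marginally above 10% of each class (212 of 2104 images).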

In [14]:
from pathlib import Path
import shutil
import random

def organize_dataset_into_subsets(image_dir, output_dir, splits=(0.7, 0.2, 0.1)):
    """
    Organizes the dataset into train, validation, and test folders.
    Skips hidden folders and filters only valid image formats.
    """
    valid_extensions = ['.jpg', '.jpeg', '.png']
    image_dir = Path(image_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Skip hidden/system folders like .ipynb_checkpoints
    labels = [d.name for d in image_dir.iterdir() if d.is_dir() and not d.name.startswith('.')]
    print(f"Found labels: {labels}")

    for label in labels:
        files = [
            f for f in (image_dir / label).iterdir()
            if f.is_file() and not f.name.startswith('.') and f.suffix.lower() in valid_extensions
        ]
        random.shuffle(files)

        train_count = int(splits[0] * len(files))
        val_count = int(splits[1] * len(files))

        subsets = {
            "train": files[:train_count],
            "validation": files[train_count:train_count + val_count],
            "test": files[train_count + val_count:]
        }

        for subset_name, subset_files in subsets.items():
            subset_path = output_dir / subset_name / label
            subset_path.mkdir(parents=True, exist_ok=True)

            for file in subset_files:
                shutil.copy(file, subset_path)

    print("\n✅ Dataset split complete.")
    for subset in ["train", "validation", "test"]:
        total = sum(1 for _ in (output_dir / subset).rglob("*.*"))
        print(f"{subset.capitalize()} set: {total} images")


In [15]:
organize_dataset_into_subsets(
    image_dir="../inputs/cherry_leaves/cherry-leaves",
    output_dir="../inputs/split-leaves",
)

Found labels: ['powdery_mildew', 'healthy']

✅ Dataset split complete.
Train set: 2944 images
Validation set: 840 images
Test set: 424 images


#### ℹ️ Note:
A 70-20-10 split was chosen to ensure that the model has sufficient data to learn from (70%), while also providing enough examples to validate its performance during training (20%) and evaluate its generalization on unseen data (10%).
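
The shuffle in the splitting function is unseeded, so rerunning the notebook assigns different images to each subset. If a repeatable split is wanted, one option is to sort the file list and shuffle with a fixed seed. The following is a minimal sketch under that assumption; `split_files` and its `seed` parameter are hypothetical and not part of the original function.

```python
import random


def split_files(files, splits=(0.7, 0.2, 0.1), seed=42):
    """Deterministically split a list of files into train/validation/test."""
    # Sort first so the starting order does not depend on the filesystem.
    ordered = sorted(files)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(ordered)

    train_count = int(splits[0] * len(ordered))
    val_count = int(splits[1] * len(ordered))
    return {
        "train": ordered[:train_count],
        "validation": ordered[train_count:train_count + val_count],
        "test": ordered[train_count + val_count:],
    }
```

With the same seed and the same set of files, every run produces identical subsets, regardless of the order in which the files were listed.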
