## 📥 Data Collection

This notebook handles the collection of the raw dataset used in this project.

- The dataset is hosted on Kaggle and requires authentication via the Kaggle API.
- We configure Kaggle credentials securely and download the dataset programmatically.
- Once downloaded, the dataset is extracted from its zip archive and made ready for further analysis.

This phase does **not** involve any modification or transformation of the data. The goal is to retrieve the dataset in its original form for inspection and preprocessing.

### Objective
Fetch the dataset from its Kaggle URL and prepare it for further processing.

### Inputs
Kaggle JSON file (`kaggle.json`): the API authentication token.

### Outputs
Generated dataset: inputs/cherry_leaves

### Import Dataset

In [None]:
import os

In [3]:
# Dynamically find the kaggle.json path
config_path = os.path.join(os.getcwd(), "jupyter_notebooks", "kaggle.json")

# Set Kaggle config env var
os.environ['KAGGLE_CONFIG_DIR'] = os.path.dirname(config_path)

# Apply permissions
!chmod 600 jupyter_notebooks/kaggle.json
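
The same setup can be done in pure Python, which avoids the shell escape and lets the notebook fail early when the token is missing. The following is a minimal sketch, not the notebook's own method: `configure_kaggle` is a hypothetical helper, and the directory passed to it is an assumption about where `kaggle.json` lives.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper: the name and signature are illustrative only.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")

    # Equivalent to `chmod 600`: read/write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # The Kaggle CLI looks for kaggle.json in this directory
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

It could be called as `configure_kaggle(os.path.join(os.getcwd(), "jupyter_notebooks"))`, mirroring the path used in the cell above.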

#### ℹ️ Note:
Download the dataset from Kaggle into the destination folder defined below.

In [4]:
! kaggle datasets download -d "codeinstitute/cherry-leaves" -p "inputs/cherry_leaves"

Dataset URL: https://www.kaggle.com/datasets/...
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry_leaves
  0%|                                               | 0.00/55.0M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 1.95GB/s]
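
If the download needs to stop the notebook on failure rather than print an error and continue, the CLI call can be wrapped with `subprocess`. This is a sketch under assumptions: `download_dataset` is a hypothetical helper, the `runner` parameter exists only so the command can be inspected without network access, and the `kaggle` executable is assumed to be on the `PATH`. The `-d` and `-p` flags are the ones used in the cell above.

```python
import os
import subprocess


def download_dataset(dataset, dest, runner=subprocess.run):
    """Download a Kaggle dataset archive into dest via the Kaggle CLI.

    Hypothetical helper; `runner` is injectable for dry runs and testing.
    """
    # Make sure the destination folder exists before the CLI writes to it
    os.makedirs(dest, exist_ok=True)
    cmd = ["kaggle", "datasets", "download", "-d", dataset, "-p", dest]
    # With the default runner, check=True raises CalledProcessError on failure
    runner(cmd, check=True)
    return cmd
```

For this project the call would be `download_dataset("codeinstitute/cherry-leaves", "inputs/cherry_leaves")`.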


**Unzip the downloaded archive using the `zipfile` module.**

In [5]:
import zipfile
with zipfile.ZipFile("../inputs/cherry_leaves/cherry-leaves.zip", 'r') as zip_ref:
    zip_ref.extractall("../inputs/cherry_leaves/")
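
A slightly more defensive variant checks that the file really is a zip archive before extracting and reports what was unpacked. This is a minimal sketch; `extract_archive` is a hypothetical helper name, and the paths passed to it are up to the caller.

```python
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return its top-level entry names."""
    # is_zipfile returns False for missing or corrupt files
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
        # Member names always use "/" as separator inside a zip archive
        top_level = sorted({name.split("/")[0] for name in zip_ref.namelist()})
    return top_level
```

The returned list makes it easy to confirm the folder layout (for example, a single top-level dataset folder) before later cells rely on it.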

### Dataset Statistics Collection
We count the images in each class (healthy vs. powdery-mildew-infected cherry leaves) and save the counts for use in the Streamlit dashboard metrics.
- 📌 **Insight:** The counts show the total dataset size and the class balance, both of which inform model development.

In [5]:
import os
import json

# Count images in each folder
healthy_path = "../inputs/cherry_leaves/cherry-leaves/healthy"
infected_path = "../inputs/cherry_leaves/cherry-leaves/powdery_mildew"

healthy_count = len(os.listdir(healthy_path))
infected_count = len(os.listdir(infected_path))
total_count = healthy_count + infected_count

# Create results dictionary
dataset_stats = {
    "total_images": total_count,
    "healthy_leaves": healthy_count,
    "powdery_mildew_leaves": infected_count
}

# Save to JSON file
with open("../outputs/dataset_stats.json", "w") as f:
    json.dump(dataset_stats, f, indent=2)

print("Dataset stats saved to outputs/dataset_stats.json")
print(f"Total Images: {total_count}")
print(f"Healthy Leaves: {healthy_count}")
print(f"Infected Leaves: {infected_count}")

Dataset stats saved to outputs/dataset_stats.json
Total Images: 4208
Healthy Leaves: 2104
Infected Leaves: 2104
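
Note that `os.listdir` counts every entry in a folder, including any stray non-image files. The sketch below shows one way to count only image files and to express the class balance as fractions. The helper names and the extension set are assumptions for illustration, not part of the notebook.

```python
import os

# Assumed set of image extensions; adjust to match the dataset
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(folder):
    """Count entries in folder whose extension is in IMAGE_EXTENSIONS."""
    return sum(
        1
        for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    )


def class_shares(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}
```

With the counts reported above, `class_shares({"healthy": 2104, "powdery_mildew": 2104})` yields a share of 0.5 for each class, confirming that the dataset is balanced.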
