# Dataset Preparation

This script splits the acquired mushroom image dataset into training, validation, and test sets for neural network model development.

After acquiring the images from [MushroomObserver](https://mushroomobserver.org/articles/20), the data needs to be properly divided to ensure reliable model evaluation. This script creates a stratified split where each species is represented in all three sets according to the specified ratios.

## Dataset Split Configuration

The default split configuration follows standard machine learning practices:

- **Training set**: 70% - Used to train the model
- **Validation set**: 15% - Used for hyperparameter tuning and model selection
- **Test set**: 15% - Used for final model evaluation

The script processes each species directory separately and writes CSV files that list every image path with its binary label, so each split can be loaded directly during model training and evaluation.
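The per-species split described above can be sketched as follows. This is a minimal illustration, not the actual `split_dataset` implementation: `split_paths` is a hypothetical helper, and it assumes the training and validation counts are floored while the test set receives whatever remains.

```python
import random


def split_paths(paths, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Shuffle one species' image paths and cut them into three lists.

    Hypothetical helper: flooring the counts and giving the remainder to
    the test set is an assumption, not taken from the real split_dataset.
    """
    shuffled = sorted(paths)  # sort first so the result ignores input order
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_ratio)  # floored
    n_val = int(len(shuffled) * val_ratio)      # floored

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]           # whatever is left
    return train, val, test


# Illustrative file names only.
paths = [f"img_{i:04d}.jpg" for i in range(601)]
train, val, test = split_paths(paths)
print(len(train), len(val), len(test))
```

With 601 paths this sketch yields 420, 90, and 91 images, which agrees with the counts logged for Tylopilus felleus in the run output, although that agreement does not confirm how the real function rounds.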

## Binary Classification Setup

The dataset is configured for binary classification: one species is designated as the positive class (label 1) and all others form the negative class (label 0). This one-vs-rest setup suits the task of telling a single target species apart from every other species in the dataset.
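The labeling rule can be sketched as below. `binary_label` is a hypothetical helper, and comparing the names by exact string match is an assumption about how the positive class is identified.

```python
def binary_label(species_name, positive_class):
    """Return 1 for the positive species and 0 for every other species."""
    # Assumption: names are compared by exact string match.
    return 1 if species_name == positive_class else 0


species = ["Tylopilus felleus", "Boletus edulis", "Imleria badia"]
labels = {name: binary_label(name, "Tylopilus felleus") for name in species}
print(labels)
```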

### Output Files

The split process generates three CSV files in the `data/` directory:

- `training_split.csv`: Contains training images with their labels
- `validation_split.csv`: Contains validation images with their labels
- `testing_split.csv`: Contains test images with their labels

Each CSV file contains the columns `species_name`, `image_path`, and `label_id`.
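A split file with these columns could be read with the standard library as sketched below. The example assumes a comma-separated file with a header row and integer `label_id` values. `count_labels` is a hypothetical helper, and the sample rows and image paths are made up so the snippet runs on its own.

```python
import csv
import os
import tempfile
from collections import Counter


def count_labels(csv_path):
    """Count rows per label_id in one split file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(int(row["label_id"]) for row in reader)


# Tiny stand-in file; rows and paths are illustrative, not real data.
fieldnames = ["species_name", "image_path", "label_id"]
rows = [
    {"species_name": "Tylopilus felleus", "image_path": "a.jpg", "label_id": 1},
    {"species_name": "Boletus edulis", "image_path": "b.jpg", "label_id": 0},
    {"species_name": "Imleria badia", "image_path": "c.jpg", "label_id": 0},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_split.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    counts = count_labels(sample_path)

print(counts)
```

Pointing `count_labels` at one of the generated files, for example `data/training_split.csv`, would give a quick view of the class balance in that split.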

In [None]:
import os

from utils.dataset_preparation import split_dataset

# Configuration parameters
DATA_PATH = os.path.join("data", "images")
TRAINING_IDS_PATH = os.path.join("data", "training_split.csv")
VALIDATION_IDS_PATH = os.path.join("data", "validation_split.csv")
TESTING_IDS_PATH = os.path.join("data", "testing_split.csv")

TRAINING_SPLIT_RATIO = 0.7
VALIDATION_SPLIT_RATIO = 0.15
TESTING_SPLIT_RATIO = 0.15  # Not passed to split_dataset; kept for clarity (1 - 0.7 - 0.15)

POSITIVE_CLASS = "Tylopilus felleus"  # Species to be labeled as positive (1)

split_dataset(
    data_path=DATA_PATH,
    training_ids_path=TRAINING_IDS_PATH,
    validation_ids_path=VALIDATION_IDS_PATH,
    testing_ids_path=TESTING_IDS_PATH,
    training_split_ratio=TRAINING_SPLIT_RATIO,
    validation_split_ratio=VALIDATION_SPLIT_RATIO,
    positive_class=POSITIVE_CLASS,
    random_seed=0,
)

2025-07-17 17:30:28,164 - INFO - Processed Tylopilus felleus: 420 train, 90 val, 91 test images (label_id: 1)
2025-07-17 17:30:28,175 - INFO - Processed Boletus edulis: 415 train, 88 val, 90 test images (label_id: 0)
2025-07-17 17:30:28,178 - INFO - Processed Imleria badia: 151 train, 32 val, 34 test images (label_id: 0)
2025-07-17 17:30:28,178 - INFO - Dataset created with Tylopilus felleus as positive class (randomized 3-way split)
2025-07-17 17:30:28,179 - INFO - Split ratios: Train 70%, Val 15%, Test 15%
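As a quick sanity check, the logged counts can be compared with the configured ratios. The sketch below only does arithmetic on the numbers copied from the log lines above and does not call into the project code.

```python
# Counts copied from the log output: (train, val, test) per species.
logged = {
    "Tylopilus felleus": (420, 90, 91),
    "Boletus edulis": (415, 88, 90),
    "Imleria badia": (151, 32, 34),
}

totals = [sum(column) for column in zip(*logged.values())]  # per-split totals
grand_total = sum(totals)
fractions = [round(n / grand_total, 3) for n in totals]     # realized ratios

# Share of positive-class images within the training split.
positive_share = round(logged["Tylopilus felleus"][0] / totals[0], 3)

print(totals, grand_total, fractions, positive_share)
```

The splits total 986, 210, and 215 images out of 1411, so the realized ratios sit within about half a percentage point of 70/15/15, and the positive class makes up roughly 43% of the training split.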
