## Dataset Preparation and Preprocessing for Waste Classification

This notebook prepares the Helene waste dataset for machine learning model training. The process includes loading labeled waste images, creating standardized category mappings, and organizing the data into train/validation/test splits while maintaining class distribution. The resulting structured dataset enables efficient training and evaluation of the waste classification chatbot system.

### 1. Setup and Library Imports

In [1]:
import shutil
import sys
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

### 2. Configuration Setup

This code adds the project's config directory to the Python path and imports the dataset file paths from the configuration file. It sets up three main path variables: SOURCE (original dataset), OUTPUT (processed data destination), and excel_path (waste category labels file).

In [None]:
# Add config directory to Python path for importing project settings
project_root = Path(r"C:\Users\Lejlum\Documents\PA2_Recycling_Chatbot\waste_recycling_chatbot_pa2\config")
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import dataset paths and configuration from config file
from config import WASTE_CHATBOT_DATASET, PROCESSED_NEW, WASTE_CHATBOT_EXCEL

# Set source dataset and output directories
SOURCE = WASTE_CHATBOT_DATASET          # Original dataset location
OUTPUT = PROCESSED_NEW                  # Processed dataset output location  
excel_path = WASTE_CHATBOT_EXCEL        # Excel file with waste category labels

print(f"Source dataset exists: {SOURCE.exists()}")
print(f"Labels file exists: {excel_path.exists()}")

Source dataset exists: True
Labels file exists: True
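
The absolute path above only works on one machine. As a minimal sketch of a more portable alternative, the config directory could be located by walking upwards from the current working directory. The helper `find_config_dir` is hypothetical and not part of the project; the folder name `config` and the file name `config.py` are assumptions inferred from the import statement in this cell.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config_dir(start: Path, folder: str = "config",
                    marker: str = "config.py") -> Optional[Path]:
    """Walk upwards from `start` until a `folder/marker` file is found.

    Hypothetical helper: the folder and file names are assumptions
    based on the import used in this notebook.
    """
    start = start.resolve()
    for parent in [start, *start.parents]:
        candidate = parent / folder
        if (candidate / marker).is_file():
            return candidate
    return None


# Demonstration on a throwaway directory tree (not the real project).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "config").mkdir()
    (root / "config" / "config.py").write_text("# placeholder\n")
    (root / "notebooks").mkdir()

    found = find_config_dir(root / "notebooks")
    found_matches = found == root / "config"
    missing = find_config_dir(root / "notebooks", folder="no_such_folder")

print(found_matches, missing)
```

In the notebook, the result of `find_config_dir(Path.cwd())` could be appended to `sys.path` in place of the hard-coded string, provided the notebook is started from somewhere inside the project tree.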


### 3. Data Loading and Exploration

The Excel file containing waste image labels and metadata is loaded into a pandas DataFrame. Two column names are defined for later processing: "ID" for image filenames and "Label" for waste categories. The dataset's shape, column names, and first rows are then printed to show how the data is structured.

In [3]:
# Load Excel file containing image labels and metadata
df = pd.read_excel(excel_path) 

# Define column names for processing
id_col = "ID"           # Column containing image filenames/IDs
label_col = "Label"     # Column containing waste category labels

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst few rows:")
print(df.head())

Dataset shape: (5795, 4)
Columns: ['ID', 'Label', 'Source', 'Prompt Cumo V1']

First few rows:
         ID              Label    Source  \
0  0001.jpg  Plastic,Aluminium  own work   
1  0002.jpg  Plastic,Aluminium  own work   
2  0003.jpg  Plastic,Aluminium  own work   
3  0004.jpg  Plastic,Aluminium  own work   
4  0005.jpg  Plastic,Aluminium  own work   

                                      Prompt Cumo V1  
0  This item consists of plastic and aluminium. T...  
1  This item consists of plastic and aluminium. T...  
2  This item consists of plastic and aluminium. T...  
3  This item consists of plastic and aluminium. T...  
4  This item consists of plastic and aluminium. T...  
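
Because the splitting step later in this notebook stratifies on the label column, it can help to check at this stage whether any category has very few rows. The following is a dependency-free sketch on toy labels; in the notebook, `df[label_col].value_counts()` would give the same counts. The helper name `rare_classes` and the threshold `min_per_class=4` are illustrative assumptions, not values taken from the project.

```python
from collections import Counter

# Toy labels standing in for df[label_col]; the values are illustrative only.
toy_labels = ["Paper"] * 6 + ["PET"] * 5 + ["Metal"] * 2 + ["Brown Glass "]


def rare_classes(labels, min_per_class=4):
    """Return {label: count} for labels with fewer than `min_per_class` rows.

    Labels are converted to str and stripped first, mirroring the
    normalization applied when the category mapping is built.
    """
    counts = Counter(str(label).strip() for label in labels)
    return {label: n for label, n in counts.items() if n < min_per_class}


rare = rare_classes(toy_labels)
print(rare)  # {'Metal': 2, 'Brown Glass': 1}
```

An empty result suggests that every category is large enough to be spread across the train, validation, and test subsets.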


### 4. Label Processing and Category Mapping

Unique waste categories are extracted from the dataset (after stripping surrounding whitespace) and sorted alphabetically. A mapping dictionary then converts each original label into a standardized folder name by lowercasing it and replacing spaces and commas with underscores. Other punctuation is left untouched, so "Hazardous waste (Battery)" keeps its parentheses. The output lists every detected category next to its standardized name, followed by the total number of categories.

In [4]:
# Extract unique waste categories from dataset
labels = sorted(df[label_col].astype(str).str.strip().unique().tolist())

# Create mapping from original labels to standardized folder names
# Convert to lowercase and replace spaces/commas with underscores
MAPPING = {label: label.lower().replace(" ", "_").replace(",", "_") for label in labels}

print("Detected waste categories:")
for original, standardized in MAPPING.items():
    print(f"{original} -> {standardized}")

print(f"\nTotal number of categories: {len(MAPPING)}")

Detected waste categories:
Aluminium -> aluminium
Brown Glass -> brown_glass
Cardboard -> cardboard
Composite Carton -> composite_carton
Green Glass -> green_glass
Hazardous waste (Battery) -> hazardous_waste_(battery)
Metal -> metal
Organic waste -> organic_waste
PET -> pet
Paper -> paper
Plastic -> plastic
Plastic,Aluminium -> plastic_aluminium
Residual waste -> residual_waste
Rigid plastic container -> rigid_plastic_container
White Glass -> white_glass
White Glass,Metal -> white_glass_metal

Total number of categories: 16
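
The mapping above leaves parentheses in one folder name (`hazardous_waste_(battery)`), which some shells and tools handle awkwardly. A possible stricter variant, sketched here with a hypothetical helper `to_folder_name`, collapses every run of non-alphanumeric characters into a single underscore. This is an alternative to the notebook's mapping, not a description of it, and it assumes labels contain only ASCII letters and digits.

```python
import re


def to_folder_name(label: str) -> str:
    """Lowercase a label and collapse every non-alphanumeric run into '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.strip().lower())
    return slug.strip("_")


examples = ["Hazardous waste (Battery)", "White Glass,Metal", " PET "]
for label in examples:
    print(f"{label!r} -> {to_folder_name(label)}")
```

Switching to such a function would rename the battery category folder, so any downstream code that reads the processed dataset would need to use the same mapping.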


### 5. Directory Structure Creation

A nested folder structure is created for the dataset organization. Three main directories (train, val, test) are generated, each containing subdirectories for every waste category. The `mkdir()` function with `parents=True` creates all necessary parent directories, while `exist_ok=True` prevents errors if directories already exist. The final structure follows the pattern: OUTPUT/split/category/ for organized model training.

In [5]:
# Create train/validation/test split directories
splits = ["train", "val", "test"]

# Create folder structure: OUTPUT/split/category/
for split in splits:
    for category in MAPPING.values():
        (OUTPUT / split / category).mkdir(parents=True, exist_ok=True)

print("Created directory structure:")
print(f"Base output directory: {OUTPUT}")
print(f"Splits: {splits}")
print(f"Categories per split: {len(MAPPING)}")

Created directory structure:
Base output directory: C:\Users\Lejlum\Documents\PA2_Recycling_Chatbot\data\processed_new\organized_dataset
Splits: ['train', 'val', 'test']
Categories per split: 16


### 6. Data Splitting Strategy

A stratified split is performed to maintain proportional class distribution across all subsets. The dataset is divided into 70% training, 15% validation, and 15% test data through two sequential splits: the first holds out 30% of the rows, and the second divides that held-out portion in half. The `stratify` parameter keeps each waste category at roughly the same proportion in every split, while `random_state=42` makes the result reproducible. Split statistics are then printed, showing the number and percentage of images in each subset.

In [6]:
# Perform stratified split to maintain class distribution
# First split: 70% train, 30% temp (for val+test)
train_df, temp_df = train_test_split(
    df, 
    test_size=0.3, 
    stratify=df[label_col], 
    random_state=42
)

# Second split: Split temp into 15% val, 15% test
val_df, test_df = train_test_split(
    temp_df, 
    test_size=0.5, 
    stratify=temp_df[label_col], 
    random_state=42
)

# Store splits in dictionary for easy iteration
splits_data = {
    "train": train_df,
    "val": val_df,
    "test": test_df
}

# Display split statistics
print("Dataset split summary:")
for split_name, split_df in splits_data.items():
    print(f"{split_name}: {len(split_df)} images ({len(split_df)/len(df)*100:.1f}%)")

Dataset split summary:
train: 4056 images (70.0%)
val: 869 images (15.0%)
test: 870 images (15.0%)
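
The validation and test sets differ by one image because 5795 rows cannot be divided evenly. The arithmetic can be sketched as follows, assuming (as the reported numbers suggest) that the held-out portion of each split is rounded up and the remainder goes to the other part. The function `two_stage_sizes` is a hypothetical illustration and does not call scikit-learn.

```python
from math import ceil


def two_stage_sizes(n_samples, holdout_fraction=0.3, test_share_of_holdout=0.5):
    """Return (train, val, test) sizes for two sequential splits.

    Assumes the held-out part of each split is rounded up and the
    remainder goes to the other part.
    """
    n_holdout = ceil(holdout_fraction * n_samples)    # 30% held out for val + test
    n_train = n_samples - n_holdout
    n_test = ceil(test_share_of_holdout * n_holdout)  # half of the held-out rows
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


sizes = two_stage_sizes(5795)
print(sizes)  # (4056, 869, 870)
```

The second split uses `test_size=0.5` rather than 0.15 because it operates on the 30% held-out portion: 0.5 × 0.3 = 0.15 of the full dataset.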


### 7. File Copy Operations

Images are copied from the source directory to their designated split and category folders. For each dataset split, the loop iterates over all assigned images and shows a progress bar. Source image paths are built from the ID column, and missing files are skipped with a warning message. Target paths follow the structure OUTPUT/split/category/image_name, and `shutil.copy2()` preserves file metadata such as timestamps during the copy.

In [8]:
# Copy images to appropriate directories based on split and category
for split_name, split_df in splits_data.items():
    print(f"\nCopying {split_name} images...")
    
    for _, row in tqdm(split_df.iterrows(), total=len(split_df), desc=f"Processing {split_name}"):
        # Source image path
        src = SOURCE / row[id_col]
        
        # Skip if source image doesn't exist
        if not src.exists():
            print(f"Warning: Image {src.name} not found, skipping...")
            continue
        
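        # Determine target category folder; normalize the label the same way
        # the MAPPING keys were built (str + strip) to avoid a KeyError
        category_folder = MAPPING[str(row[label_col]).strip()]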
        
        # Target path: OUTPUT/split/category/image_name
        dst = OUTPUT / split_name / category_folder / src.name
        
        # Copy image with metadata preservation
        shutil.copy2(src, dst)


Copying train images...


Processing train: 100%|██████████| 4056/4056 [00:53<00:00, 75.56it/s] 



Copying val images...


Processing val: 100%|██████████| 869/869 [00:11<00:00, 77.48it/s]



Copying test images...


Processing test: 100%|██████████| 870/870 [00:11<00:00, 75.04it/s]
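
Warnings printed inside a progress-bar loop are easy to miss and can break up the bar's output. One possible refinement, sketched below under the assumption that a list of skipped files is useful afterwards, is to collect the missing filenames and return them. The helper `copy_with_report` and the demo filenames are hypothetical; the sketch uses temporary directories rather than the real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_report(items, source_dir: Path, target_dir: Path):
    """Copy (filename, category) pairs into target_dir/category/.

    Returns (copied, missing) lists of filenames so that skipped files
    can be reviewed after the loop instead of scrolling through warnings.
    """
    copied, missing = [], []
    for filename, category in items:
        src = source_dir / filename
        if not src.is_file():
            missing.append(filename)
            continue
        dst_dir = target_dir / category
        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst_dir / src.name)  # copy2 also copies timestamps
        copied.append(filename)
    return copied, missing


# Demonstration with throwaway files; names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp) / "source", Path(tmp) / "out"
    source.mkdir()
    (source / "0001.jpg").write_bytes(b"fake image bytes")

    items = [("0001.jpg", "paper"), ("0002.jpg", "paper")]
    copied, missing = copy_with_report(items, source, target)
    copied_exists = (target / "paper" / "0001.jpg").is_file()

print(copied, missing, copied_exists)
```

If per-file messages are still wanted while the bar is running, `tqdm.write()` prints them without disturbing the progress bar.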


### 8. Summary and Validation

Image counts are checked for each dataset split by traversing the created directory structure and counting the files in every category subdirectory. The resulting counts (4056, 869, and 870) match the split sizes reported in Section 6, which indicates that every image was copied to its designated location. The cell also prints the processed dataset location for use in subsequent model training steps.

In [10]:
# Count images in each split for validation
for split_name in splits_data.keys():
    split_path = OUTPUT / split_name
    total_images = sum(len(list(category_path.glob("*"))) 
                      for category_path in split_path.iterdir() 
                      if category_path.is_dir())
    print(f"{split_name.upper()}: {total_images} images")

print(f"\nProcessed dataset location: {OUTPUT}")

TRAIN: 4056 images
VAL: 869 images
TEST: 870 images

Processed dataset location: C:\Users\Lejlum\Documents\PA2_Recycling_Chatbot\data\processed_new\organized_dataset
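
The check above relies on reading the printed counts and comparing them by eye, and `glob("*")` counts every entry, including non-image files. A stricter check could compare the counts programmatically and only count image files. The sketch below runs on a temporary directory; the helper names and the set of accepted extensions are assumptions.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images(split_dir: Path) -> int:
    """Count image files in all category subfolders of one split."""
    return sum(
        1
        for category_dir in split_dir.iterdir()
        if category_dir.is_dir()
        for path in category_dir.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def find_mismatches(output_dir: Path, expected: dict) -> dict:
    """Return {split: (expected, found)} for splits whose counts differ."""
    mismatches = {}
    for split_name, n_expected in expected.items():
        n_found = count_images(output_dir / split_name)
        if n_found != n_expected:
            mismatches[split_name] = (n_expected, n_found)
    return mismatches


# Demonstration on a throwaway tree: "val" is deliberately one image short.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for split_name, n_files in {"train": 3, "val": 1}.items():
        folder = out / split_name / "paper"
        folder.mkdir(parents=True)
        for i in range(n_files):
            (folder / f"{i:04d}.jpg").write_bytes(b"")
    (out / "train" / "paper" / "Thumbs.db").write_bytes(b"")  # non-image file

    mismatches = find_mismatches(out, {"train": 3, "val": 2})

print(mismatches)  # {'val': (2, 1)}
```

In the notebook, the expected counts could be built as `{name: len(split_df) for name, split_df in splits_data.items()}`, and an empty result would mean every split is complete.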
