# Split Dataset

## Author: Ahmed Mrabet

## Description
In this notebook, we split the dataset into three parts for training, validation, and testing.

In [6]:
import os
import shutil
from sklearn.model_selection import train_test_split

### Defining the split_dataset function
The split_dataset function takes the dataset and the split proportions as input.
The result is a new dataset containing the three parts, each keeping the original per-class folders.

**Arguments:**
- input_dir: path to the dataset, organized as one folder per class
- output_dir: path to the new, split dataset
- train_size: proportion of the data assigned to the training set
- val_size: proportion of the data assigned to the validation set (the test set receives the remainder, 1 - train_size - val_size)
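
The function takes no explicit test proportion: the test share is whatever remains after train_size and val_size, and the second split has to be expressed as a fraction of the non-train remainder. The sketch below works through that arithmetic. The helper name `implied_proportions` is hypothetical and is not part of the notebook.

```python
def implied_proportions(train_size=0.7, val_size=0.2):
    """Return (test share of the full dataset, test share of the non-train remainder).

    Hypothetical helper used only to illustrate the two-stage split arithmetic.
    """
    if not 0 < train_size + val_size < 1:
        raise ValueError("train_size + val_size must be strictly between 0 and 1")
    # Whatever is not train or validation ends up in the test set
    test_size = 1 - train_size - val_size
    # The second split only sees the non-train remainder, so rescale accordingly
    second_stage_fraction = test_size / (1 - train_size)
    return test_size, second_stage_fraction


test_size, second_stage_fraction = implied_proportions(0.7, 0.2)
print(f"test share of full dataset: {test_size:.2f}")
print(f"test share of remainder:    {second_stage_fraction:.4f}")
```

With train_size=0.7 and val_size=0.2, the test set gets 10% of the data, which is one third of the 30% left over after the training split.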

In [7]:

def split_dataset(input_dir, output_dir, train_size=0.7, val_size=0.2):
    """
    Splits dataset into training, validation, and test sets.

    Args:
        input_dir (str): Path to the dataset organized by class folders.
        output_dir (str): Path to the output directory for train, test, and validation sets.
        train_size (float): Proportion of the dataset to include in the training set.
        val_size (float): Proportion of the dataset to include in the validation set.
    """
    # Ensure train_size and val_size are compatible
    assert train_size + val_size < 1.0, "Train and validation sizes must sum to less than 1.0"
    
    # Create directories for train, test, and validation
    for split in ['train', 'validation', 'test']:
        split_dir = os.path.join(output_dir, split)
        os.makedirs(split_dir, exist_ok=True)

    # Iterate through each class folder
    for class_name in os.listdir(input_dir):
        class_path = os.path.join(input_dir, class_name)
        
        if not os.path.isdir(class_path):
            continue
        
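        # Get all images in the class folder. The listing is sorted because
        # os.listdir returns entries in arbitrary order, which would make the
        # seeded split below differ from one machine to another.
        images = sorted(os.listdir(class_path))
        images = [os.path.join(class_path, img) for img in images if os.path.isfile(os.path.join(class_path, img))]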
        
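        # Split data in two stages: first carve out the training set, then
        # divide the remainder between validation and test. The second
        # test_size is a fraction of the remainder, not of the full class.
        train, remaining = train_test_split(images, test_size=1 - train_size, random_state=42)
        val, test = train_test_split(remaining, test_size=(1 - train_size - val_size) / (1 - train_size), random_state=42)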
        
        # Copy files to their respective folders
        for split, split_data in zip(['train', 'validation', 'test'], [train, val, test]):
            split_class_dir = os.path.join(output_dir, split, class_name)
            os.makedirs(split_class_dir, exist_ok=True)
            
            for img in split_data:
                shutil.copy(img, split_class_dir)
    
    print(f"Dataset split completed. Organized into {output_dir}")


### Splitting the dataset

We split the dataset into three parts, matching the arguments passed in the call below:
- Train: 70%
- Validation: 20%
- Test: 10% (the remainder)

In [8]:
input_dir = "chest_xray_dataset_no_split"
output_dir = "chest_xray_dataset"
split_dataset(input_dir, output_dir, train_size=0.7, val_size=0.2)

Dataset split completed. Organized into chest_xray_dataset
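
To check that the split produced the expected proportions, the files in each output folder can be counted. This is a minimal sketch that assumes the `<output_dir>/<split>/<class_name>/` layout created by split_dataset. The helper name `count_images_per_split` is hypothetical and is not part of the notebook.

```python
import os


def count_images_per_split(output_dir, splits=("train", "validation", "test")):
    """Return {split: {class_name: n_files}} for a directory laid out by split_dataset.

    Hypothetical helper; split folders that do not exist are reported as empty.
    """
    counts = {}
    for split in splits:
        split_dir = os.path.join(output_dir, split)
        counts[split] = {}
        if not os.path.isdir(split_dir):
            continue  # nothing to count for this split
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if os.path.isdir(class_dir):
                # Count regular files only, ignoring nested folders
                counts[split][class_name] = sum(
                    os.path.isfile(os.path.join(class_dir, name))
                    for name in os.listdir(class_dir)
                )
    return counts


counts = count_images_per_split("chest_xray_dataset")
for split, per_class in counts.items():
    print(split, per_class, "total:", sum(per_class.values()))
```

Comparing the per-class totals across the three splits shows how closely the result matches the requested proportions. Small deviations are expected because each class is split separately and the counts are rounded to whole files.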
