# 02. Data Splitting

This notebook:
- Loads the cleaned dataset
- Splits it into training, validation, and test sets (70/15/15)
- Stratifies the splits to maintain target class distribution
- Saves the split datasets to the `data/processed/` directory
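
Because the 70/15/15 split is done in two stages, the second-stage fraction is relative to the held-out remainder, not to the full dataset. The arithmetic can be sketched as follows; `two_stage_fractions` is a hypothetical helper written for illustration and is not part of this notebook.

```python
def two_stage_fractions(val_frac, test_frac):
    """Return the two `test_size` values for a two-stage split.

    Hypothetical helper, for illustration only.
    Stage 1 holds out val + test together; stage 2 divides that
    hold-out set, so its fraction is relative to the hold-out size.
    """
    holdout = val_frac + test_frac
    if not 0 < holdout < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    return holdout, test_frac / holdout


# For a 70/15/15 split: hold out 30% first, then send half of it to test.
first_stage, second_stage = two_stage_fractions(0.15, 0.15)
print(first_stage, second_stage)
```

With unequal validation and test shares, the second-stage value is no longer one half, which is why it helps to compute it rather than hard-code it.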

In [1]:
# 02_data_split.ipynb

# ====================================================
# 02. Data Splitting
# ----------------------------------------------------
# Objective:
# - Load the cleaned dataset using a relative path
# - Split the dataset into training, validation, and test sets
# - Save the split datasets into the processed data directory
# ====================================================

## 1. Import necessary libraries
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## 2. Define relative file paths
CLEANED_DATA_PATH = Path('../data/interim/alcohol_consumption_cleaned.csv')  # Cleaned input
PROCESSED_TRAIN_PATH = Path('../data/processed/train.csv')
PROCESSED_VAL_PATH = Path('../data/processed/val.csv')
PROCESSED_TEST_PATH = Path('../data/processed/test.csv')

## 3. Load the cleaned dataset
df_cleaned = pd.read_csv(CLEANED_DATA_PATH)

print("Shape of cleaned dataset:", df_cleaned.shape)
df_cleaned.head()

## 4. Separate features (X) and target (y)
# Assuming the target column is 'DRK_YN'
TARGET_COLUMN = 'DRK_YN'
X = df_cleaned.drop(columns=[TARGET_COLUMN])
y = df_cleaned[TARGET_COLUMN]

print("\nFeature matrix X shape:", X.shape)
print("Target vector y shape:", y.shape)

## 5. Split the dataset: train (70%), val (15%), test (15%)
# Step 1: Split train vs temp (val+test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    stratify=y,
    random_state=42
)

# Step 2: Split temp into val and test
relative_test_size = 0.5  # Half of temp goes to test, the other half to validation
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=relative_test_size,
    stratify=y_temp,
    random_state=42
)

## 6. Verify the split sizes (expected: roughly 70% / 15% / 15% of the rows)
print("\nTrain set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Test set shape:", X_test.shape, y_test.shape)

## 7. Combine features and targets for saving
train_set = pd.concat([X_train, y_train], axis=1)
val_set = pd.concat([X_val, y_val], axis=1)
test_set = pd.concat([X_test, y_test], axis=1)

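## 8. Save the split datasets
# Create the output directory first; to_csv fails if it does not exist
PROCESSED_TRAIN_PATH.parent.mkdir(parents=True, exist_ok=True)
train_set.to_csv(PROCESSED_TRAIN_PATH, index=False)
val_set.to_csv(PROCESSED_VAL_PATH, index=False)
test_set.to_csv(PROCESSED_TEST_PATH, index=False)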

print("\nDatasets saved successfully to 'data/processed/' directory.")


Shape of cleaned dataset: (991346, 24)

Feature matrix X shape: (991346, 23)
Target vector y shape: (991346,)

Train set shape: (693942, 23) (693942,)
Validation set shape: (148702, 23) (148702,)
Test set shape: (148702, 23) (148702,)

Datasets saved successfully to 'data/processed/' directory.
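
The shapes above confirm the 70/15/15 row counts, but not that stratification kept the class balance. One way to check this is sketched below on synthetic data. The toy frame, its 40/60 class mix, the `'Y'`/`'N'` labels, and the `class_share` helper are all assumptions made for illustration; only the column name `DRK_YN` and the split parameters come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumed labels and class mix).
toy = pd.DataFrame({
    "feature": range(1000),
    "DRK_YN": ["Y"] * 400 + ["N"] * 600,
})

# Same two-stage stratified split as in the notebook.
train, temp = train_test_split(
    toy, test_size=0.30, stratify=toy["DRK_YN"], random_state=42
)
val, test = train_test_split(
    temp, test_size=0.5, stratify=temp["DRK_YN"], random_state=42
)


def class_share(frame, column="DRK_YN"):
    """Hypothetical helper: fraction of rows per class, sorted by label."""
    return frame[column].value_counts(normalize=True).sort_index()


# One column per split; rows are the class labels.
summary = pd.DataFrame({
    "full": class_share(toy),
    "train": class_share(train),
    "val": class_share(val),
    "test": class_share(test),
})
print(summary.round(3))
```

If stratification worked, every column of `summary` should be close to the `full` column. The same comparison could be run on `y`, `y_train`, `y_val`, and `y_test` from the cell above.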
