# Synthetic Data Generation for XGBoost Classification

This notebook generates synthetic tabular data to demonstrate the end-to-end Kubeflow workflow with XGBoost.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## Generate Synthetic Dataset

We'll create a dataset with the following characteristics:
- 10,000 samples
- 20 features (15 informative, 5 redundant)
- Binary classification
- Correlated features: each redundant feature is a random linear combination of the informative ones, which makes the data more realistic

In [None]:
# Generate synthetic data
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=RANDOM_STATE,
    class_sep=0.8,  # Below the default of 1.0, so the classes overlap more
    flip_y=0.1      # Assign a random class to 10% of samples (label noise)
)

# Create feature names
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# Display first few rows and basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
display(df.head())
print("\nClass distribution:")
display(df['target'].value_counts(normalize=True))
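
The correlation between features comes from the redundant columns, which `make_classification` builds as linear combinations of the informative ones. A minimal sketch of one way to confirm this, assuming the same generator settings as the cell above (the helper name `effective_rank` is hypothetical): with 15 informative and 5 redundant features, we would expect the 20-column matrix to have numerical rank 15.

```python
import numpy as np
from sklearn.datasets import make_classification

def effective_rank(n_samples=10000, random_state=42):
    """Hypothetical helper: numerical rank of the generated feature matrix."""
    X, _ = make_classification(
        n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5,
        n_classes=2, random_state=random_state, class_sep=0.8, flip_y=0.1,
    )
    # Redundant columns add no new directions, so they do not raise the rank
    return int(np.linalg.matrix_rank(X))

rank = effective_rank()
print("Numerical rank of X:", rank)
```

A rank below the column count is worth keeping in mind for models that are sensitive to collinearity; tree-based models such as XGBoost are generally less affected by it.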

## Add Some Missing Values

To make the dataset more realistic, we'll randomly set about 5% of the values in the first five features to missing (NaN).

In [None]:
# Set roughly 5% of the values in the first five features to NaN
features_with_missing = feature_names[:5]
for feature in features_with_missing:
    mask = np.random.random(len(df)) < 0.05  # each row is masked independently with probability 0.05
    df.loc[mask, feature] = np.nan

print("Missing values per column:")
display(df.isnull().sum())
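
Because each value is masked independently, the realized missing rate per column is only approximately 5%. The sketch below reproduces the masking on a stand-in DataFrame and reports the observed rates. The names `add_missing` and `demo_df` are hypothetical, and the sketch uses a local `np.random.default_rng` generator rather than the global seed set in this notebook.

```python
import numpy as np
import pandas as pd

def add_missing(frame, columns, rate, seed):
    """Hypothetical helper: return a copy with roughly `rate` of each column set to NaN."""
    rng = np.random.default_rng(seed)
    out = frame.copy()
    for col in columns:
        mask = rng.random(len(out)) < rate  # independent draw per row
        out.loc[mask, col] = np.nan
    return out

# Stand-in data: 10,000 rows, 3 numeric columns
demo_df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(10000, 3)), columns=["a", "b", "c"]
)
masked = add_missing(demo_df, ["a", "b"], rate=0.05, seed=42)

# Fraction of missing values per column; "c" is left untouched
missing_rates = masked.isnull().mean()
print(missing_rates)
```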

## Split Data into Train, Validation, and Test Sets

We split the data 60/20/20 in two steps, stratifying on the target so that each set keeps the same class balance.

In [None]:
# First split: separate test set
train_val_data, test_data = train_test_split(
    df,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=df['target']
)

# Second split: separate train and validation sets
train_data, val_data = train_test_split(
    train_val_data,
    test_size=0.25,  # 25% of the remaining 80% = 20% of the original data
    random_state=RANDOM_STATE,
    stratify=train_val_data['target']
)

print("Dataset splits:")
print(f"Train set: {len(train_data)} samples")
print(f"Validation set: {len(val_data)} samples")
print(f"Test set: {len(test_data)} samples")
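
The two-step split above should yield a 60/20/20 partition with the class ratio preserved in each part. A rough check of both properties, using a stand-in DataFrame built here for illustration (the helper `three_way_split` and the frame `demo` are assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(frame, target, seed=42):
    """Hypothetical helper mirroring the two-step stratified split."""
    train_val, test = train_test_split(
        frame, test_size=0.2, random_state=seed, stratify=frame[target]
    )
    train, val = train_test_split(
        train_val, test_size=0.25, random_state=seed, stratify=train_val[target]
    )
    return train, val, test

# Stand-in data: one numeric column and a binary target
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"x": rng.normal(size=10000), "target": rng.integers(0, 2, size=10000)}
)
train, val, test = three_way_split(demo, "target")

# Report the size and positive-class share of each part
for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part["target"].mean(), 3))
```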

## Save Data to Disk

Save the datasets in CSV format for use in the preprocessing notebook.

In [None]:
# Create data directory if it doesn't exist
import os
os.makedirs('../data/raw', exist_ok=True)

# Save datasets
train_data.to_csv('../data/raw/train.csv', index=False)
val_data.to_csv('../data/raw/validation.csv', index=False)
test_data.to_csv('../data/raw/test.csv', index=False)

print("Datasets saved successfully!")
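
Since the preprocessing notebook reads these files back, it can be worth confirming that shapes and missing values survive the CSV round trip. A sketch using a temporary directory instead of `../data/raw` (the stand-in frame and variable names are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in frame with one missing value
original = pd.DataFrame({"feature_0": [0.1, np.nan, 2.5], "target": [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    original.to_csv(path, index=False)  # NaN is written as an empty field
    restored = pd.read_csv(path)        # empty fields are parsed back as NaN

same_shape = restored.shape == original.shape
same_nans = restored.isnull().equals(original.isnull())
# Compare floats with a tolerance rather than exact equality
same_values = np.allclose(
    restored["feature_0"], original["feature_0"], equal_nan=True
)
print(same_shape, same_nans, same_values)
```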

## Data Description

Save a description of the dataset for documentation purposes.

In [None]:
import json

# Summarize the dataset; 'missing_values' lists the columns that contain NaNs
data_description = {
    'n_samples': len(df),
    'n_features': len(feature_names),
    'features': feature_names,
    'target': 'Binary classification (0 or 1)',
    'missing_values': features_with_missing,
    'train_samples': len(train_data),
    'val_samples': len(val_data),
    'test_samples': len(test_data)
}

with open('../data/raw/data_description.json', 'w') as f:
    json.dump(data_description, f, indent=2)

print("Data description saved successfully!")
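
If the description is later extended with pandas or NumPy results, such as per-column missing counts, note that the `json` module does not serialize NumPy integer types. A hedged sketch of one way to handle this by converting to built-in `int` first; the `missing_counts` key and the stand-in frame are hypothetical and not part of the description saved above:

```python
import json

import numpy as np
import pandas as pd

# Stand-in frame with a few missing values
frame = pd.DataFrame(
    {"feature_0": [1.0, np.nan, np.nan], "feature_1": [0.5, 0.2, np.nan]}
)

# isnull().sum() yields NumPy integers; cast to built-in int for JSON
missing_counts = {col: int(n) for col, n in frame.isnull().sum().items()}

description = {"n_samples": len(frame), "missing_counts": missing_counts}
encoded = json.dumps(description, indent=2)
print(encoded)
```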