# ML WSL Boilerplate - Complete Data Science Workflow

This notebook demonstrates a complete machine learning workflow using the ML WSL Boilerplate framework.

## What we'll cover:

1. **Project Setup** - Configuration and imports
2. **Data Loading & Generation** - Creating a synthetic dataset and validating it with our custom DataLoader
3. **Exploratory Data Analysis** - Understanding the data
4. **Feature Engineering & Data Preprocessing** - Cleaning and splitting the data
5. **Model Training & Cross-Validation** - Multiple algorithms with cross-validation
6. **Model Evaluation & Comparison** - Metrics and visualization across models
7. **Detailed Analysis of Best Model** - Classification report, confusion matrix, and feature importance
8. **Model Persistence** - Saving and loading models
9. **Experiment Summary & Results** - Key metrics and artifacts
10. **Next Steps & Recommendations** - Interpretation and recommendations

---

## 1. Project Setup

Let's start by setting up our environment and loading the configuration.

In [None]:
# Standard imports
import sys
import warnings
from pathlib import Path
import time

# Data science stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Add the project root to the path so the `src` package used below is importable
sys.path.append('..')

# Our custom modules
from src.utils.config import get_config
from src.utils.logging_config import setup_logging, get_logger
from src.data.loader import DataLoader
from src.models.trainer import get_trainer

# Configure display
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Setup logging
ml_logger = setup_logging(log_level="INFO")
logger = get_logger(component="notebook")

print("ML WSL Boilerplate - Ready for Data Science!")
print("=" * 50)

# Load configuration
config = get_config()

print("Configuration Overview:")
print(f"Project: {config.get('project.name')}")
print(f"Model Type: {config.get('model.type')}")
print(f"Train/Test Split: {config.get('data.train_test_split')}")
print(f"CV Folds: {config.get('training.cv_folds')}")
print(f"Random Seed: {config.get('environment.seed')}")
print(f"MLflow Enabled: {config.get('mlflow.enabled')}")

# Validate configuration
if config.validate_config():
    print("Configuration is valid!")
else:
    print("Configuration validation failed!")
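
The configuration calls above read nested settings through dotted keys such as `'project.name'`. The framework's own implementation is not shown in this notebook, so the following is only a minimal sketch of how such a lookup could work. The class `DotConfig` and the variable `demo_config` are hypothetical names introduced for illustration.

```python
from typing import Any


class DotConfig:
    """Hypothetical stand-in for the framework's config object (illustration only)."""

    def __init__(self, data: dict):
        self._data = data

    def get(self, key: str, default: Any = None) -> Any:
        # Walk the nested dict one dotted segment at a time.
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node

    def set(self, key: str, value: Any) -> None:
        # Create intermediate dicts as needed, then assign the leaf value.
        *parents, leaf = key.split(".")
        node = self._data
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value


demo_config = DotConfig({"model": {"type": "sklearn"}, "training": {"cv_folds": 5}})
demo_config.set("environment.seed", 42)
print(demo_config.get("model.type"))
print(demo_config.get("mlflow.enabled", False))
```

Returning a default for missing keys is what lets calls like `config.get('environment.seed', 42)` fall back safely when a setting is absent.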

## 2. Data Loading & Generation

For this demo, we'll generate a synthetic dataset that simulates a real-world classification problem.

In [None]:
# Generate synthetic dataset
logger.info("Generating synthetic dataset...")

X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_clusters_per_class=2,
    weights=[0.6, 0.4],  # Slightly imbalanced
    flip_y=0.01,  # Add some noise
    random_state=config.get('environment.seed', 42)
)

# Convert to DataFrame for easier handling
feature_names = [f"feature_{i:02d}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print(f"Dataset created: {df.shape[0]:,} samples, {df.shape[1]-1} features")
print(f"Target distribution:")
print(df['target'].value_counts(normalize=True).round(3))

# Use DataLoader for validation
data_loader = DataLoader()
validation_results = data_loader.validate_data(df)

print(f"\nData Validation:")
print(f"   Shape: {validation_results['shape']}")
print(f"   Missing values: {sum(validation_results['missing_values'].values())}")
print(f"   Duplicates: {validation_results['duplicates']}")
print(f"   Memory usage: {validation_results['memory_usage'] / 1024:.1f} KB")
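
The validation summary above reads four keys: `shape`, `missing_values`, `duplicates`, and `memory_usage`. As a rough sketch of what a report with those keys could contain, the hypothetical helper `validate_frame` below computes them with plain pandas. It is an assumption about the general idea, not the DataLoader's actual code.

```python
import numpy as np
import pandas as pd


def validate_frame(frame: pd.DataFrame) -> dict:
    """Hypothetical helper mirroring the keys printed from validate_data."""
    return {
        "shape": frame.shape,
        # Per-column count of missing values
        "missing_values": frame.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicates": int(frame.duplicated().sum()),
        # Total memory footprint in bytes
        "memory_usage": int(frame.memory_usage(deep=True).sum()),
    }


demo = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [2, 2, 3]})
report = validate_frame(demo)
print(report["shape"], sum(report["missing_values"].values()), report["duplicates"])
```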

## 3. Exploratory Data Analysis

Let's explore our data to understand its characteristics and patterns.

In [None]:
# Basic statistics
print("Dataset Statistics:")
print(df.describe().round(3))

In [None]:
# Visualization setup
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Exploratory Data Analysis', fontsize=16, fontweight='bold')

# 1. Target distribution
df['target'].value_counts().plot(kind='bar', ax=axes[0,0], color=['skyblue', 'lightcoral'])
axes[0,0].set_title('Target Distribution')
axes[0,0].set_xlabel('Class')
axes[0,0].set_ylabel('Count')
axes[0,0].tick_params(axis='x', rotation=0)

# 2. Feature correlation heatmap (first 10 features plus target)
first_features = feature_names[:10] + ['target']
corr_matrix = df[first_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[0,1], cbar_kws={"shrink": .8})
axes[0,1].set_title('Feature Correlation (First 10 Features)')

# 3. Feature distributions by class
feature_to_plot = 'feature_00'  # First feature, chosen for illustration
for class_val in [0, 1]:
    subset = df[df['target'] == class_val][feature_to_plot]
    axes[1,0].hist(subset, alpha=0.7, label=f'Class {class_val}', bins=30)
axes[1,0].set_title(f'{feature_to_plot} Distribution by Class')
axes[1,0].set_xlabel('Feature Value')
axes[1,0].set_ylabel('Frequency')
axes[1,0].legend()

# 4. Feature importance (correlation with target)
feature_importance = df[feature_names].corrwith(df['target']).abs().sort_values(ascending=False)
feature_importance.head(10).plot(kind='barh', ax=axes[1,1], color='lightgreen')
axes[1,1].set_title('Top 10 Feature Correlations with Target')
axes[1,1].set_xlabel('Absolute Correlation')

plt.tight_layout()
plt.show()

print(f"\nMost correlated features:")
for i, (feature, corr) in enumerate(feature_importance.head(5).items()):
    print(f"   {i+1}. {feature}: {corr:.3f}")
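
One caveat about ranking features by absolute correlation: Pearson correlation only measures linear association, so a feature can fully determine the target and still score near zero. The toy example below (made-up data, hypothetical column names `raw` and `magnitude`) illustrates this with the same `corrwith(...).abs()` pattern used above.

```python
import numpy as np
import pandas as pd

# Symmetric values in [-1, 1]; the target depends only on the distance from zero.
x = np.linspace(-1.0, 1.0, 200)
toy = pd.DataFrame({"raw": x, "magnitude": np.abs(x)})
target = pd.Series((np.abs(x) > 0.5).astype(int), name="target")

# Same ranking recipe as in the EDA cell
ranking = toy.corrwith(target).abs().sort_values(ascending=False)
print(ranking.round(3))
```

Here `raw` determines the target completely, yet its correlation is essentially zero because the relationship is symmetric rather than linear. Tree-based feature importances, shown later for the best model, do not share this limitation.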

## 4. Feature Engineering & Data Preprocessing

Let's prepare our data for machine learning: we clean it with our DataLoader, then split it into stratified training and test sets.

In [None]:
# Separate features and target
X = df[feature_names].copy()
y = df['target'].copy()

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Clean data using our DataLoader
X_clean = data_loader.clean_data(
    X,
    drop_duplicates=True,
    fill_missing=None  # No missing values in synthetic data
)

# Keep the target aligned with the cleaned features
# (assumes clean_data preserves the original row index)
y_clean = y.loc[X_clean.index]

print("Data cleaning completed")
print(f"   Rows before: {len(X)}, after: {len(X_clean)}")

# Split data according to config
test_size = 1.0 - config.get('data.train_test_split', 0.8)
random_state = config.get('environment.seed', 42)

X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean,
    test_size=test_size,
    random_state=random_state,
    stratify=y_clean
)

print("\nData Split:")
print(f"   Training: {len(X_train):,} samples ({len(X_train)/len(X_clean):.1%})")
print(f"   Testing:  {len(X_test):,} samples ({len(X_test)/len(X_clean):.1%})")
print(f"   Train target dist: {y_train.value_counts(normalize=True).round(3).tolist()}")
print(f"   Test target dist:  {y_test.value_counts(normalize=True).round(3).tolist()}")
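
Two details of this step are easy to get wrong: the target must stay aligned with the features after duplicate rows are dropped, and `stratify` is what keeps the class ratio the same in both splits. The sketch below shows both on made-up data; every variable name in it is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with a 60/40 class split, plus 10 exact duplicate rows.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
y_toy = pd.Series(np.repeat([0, 1], [600, 400]))
X_dup = pd.concat([X_toy, X_toy.iloc[:10]], ignore_index=True)
y_dup = pd.concat([y_toy, y_toy.iloc[:10]], ignore_index=True)

# Dropping duplicate feature rows shrinks X, so y is re-indexed to match.
X_dedup = X_dup.drop_duplicates()
y_dedup = y_dup.loc[X_dedup.index]

# Stratifying on the target preserves the class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dedup, y_dedup, test_size=0.2, random_state=42, stratify=y_dedup
)
print(len(X_dedup), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```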

## 5. Model Training & Cross-Validation

Now let's train multiple models and compare their performance using our training framework.

In [None]:
# Define models to test
model_types = ['sklearn', 'xgboost', 'lightgbm']
results = {}

print("Starting model training and evaluation...\n")

for model_type in model_types:
    print(f"Training {model_type.upper()} model...")

    try:
        # Update config for current model
        config.set('model.type', model_type)

        # Get trainer
        trainer = get_trainer(model_type)

        # Time the full run (cross-validation, final fit, and test evaluation)
        start_time = time.time()

        # Cross-validation
        cv_results = trainer.cross_validate(
            X_train, y_train,
            cv_folds=config.get('training.cv_folds', 5),
            scoring='accuracy'
        )

        # Train final model
        train_metrics = trainer.fit(X_train, y_train)

        # Evaluate on test set
        test_metrics = trainer.evaluate(X_test, y_test, stage='test')

        training_time = time.time() - start_time

        # Store results
        results[model_type] = {
            'trainer': trainer,
            'cv_results': cv_results,
            'train_metrics': train_metrics,
            'test_metrics': test_metrics,
            'training_time': training_time
        }

        print(f"   CV Accuracy: {cv_results['cv_mean']:.4f} (±{cv_results['cv_std']:.4f})")
        print(f"   Test Accuracy: {test_metrics['accuracy']:.4f}")
        print(f"   Training time: {training_time:.2f}s\n")

    except Exception as e:
        print(f"   Failed to train {model_type}: {e}\n")
        continue

print("Model training completed!")
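
The training loop reads `cv_mean` and `cv_std` from the result of `trainer.cross_validate`. The trainer's internals are not shown here, so the following is only a sketch of how such a summary could be produced with scikit-learn's `cross_val_score`. The function `cross_validate_sketch`, the extra `cv_scores` key, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validate_sketch(model, X, y, cv_folds=5, scoring="accuracy", seed=42):
    """Hypothetical helper; the trainer's real cross_validate may differ."""
    # Stratified folds keep the class ratio similar in every fold.
    folds = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring=scoring)
    return {
        "cv_scores": scores,
        "cv_mean": float(scores.mean()),
        "cv_std": float(scores.std()),
    }


X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
summary = cross_validate_sketch(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(f"{summary['cv_mean']:.4f} (±{summary['cv_std']:.4f})")
```

The standard deviation across folds gives a rough sense of how much the accuracy estimate depends on the particular split.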

## 6. Model Evaluation & Comparison

Let's compare the performance of our models and visualize the results.
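
The comparison in this section relies on the metric keys `accuracy`, `precision`, `recall`, and `f1`. As a small worked example on made-up labels, the sketch below computes them with scikit-learn and shows how each relates to the confusion-matrix counts. Whether the trainer's `evaluate` uses these exact functions or averaging settings is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

metrics_demo = {
    "accuracy": accuracy_score(y_true, y_hat),    # (TP + TN) / total
    "precision": precision_score(y_true, y_hat),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_hat),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_hat),                # harmonic mean of the two above
}
print(tn, fp, fn, tp)
print({name: round(value, 3) for name, value in metrics_demo.items()})
```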

In [None]:
# Create comparison DataFrame
comparison_data = []

for model_type, result in results.items():
    row = {
        'Model': model_type.upper(),
        'CV_Accuracy': result['cv_results']['cv_mean'],
        'CV_Std': result['cv_results']['cv_std'],
        'Test_Accuracy': result['test_metrics']['accuracy'],
        'Test_Precision': result['test_metrics']['precision'],
        'Test_Recall': result['test_metrics']['recall'],
        'Test_F1': result['test_metrics']['f1'],
        'Training_Time': result['training_time']
    }

    # Add ROC-AUC if available
    if 'roc_auc' in result['test_metrics']:
        row['Test_ROC_AUC'] = result['test_metrics']['roc_auc']

    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.round(4)

print("Model Comparison Results:")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)

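# Find best model using cross-validation accuracy, so the test set
# stays an unbiased estimate for the chosen model
best_idx = comparison_df['CV_Accuracy'].idxmax()
best_model = comparison_df.loc[best_idx, 'Model']
best_accuracy = comparison_df.loc[best_idx, 'Test_Accuracy']
print(f"\nBest Model: {best_model} (Test Accuracy: {best_accuracy:.4f})")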

In [None]:
# Visualization of model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Model Performance Comparison', fontsize=16, fontweight='bold')

# 1. Accuracy comparison
x_pos = range(len(comparison_df))
axes[0,0].bar(x_pos, comparison_df['Test_Accuracy'],
              color=['skyblue', 'lightgreen', 'lightcoral'], alpha=0.8)
axes[0,0].set_title('Test Accuracy by Model')
axes[0,0].set_xlabel('Model')
axes[0,0].set_ylabel('Accuracy')
axes[0,0].set_xticks(x_pos)
axes[0,0].set_xticklabels(comparison_df['Model'])
axes[0,0].set_ylim(0, 1)

# Add value labels on bars
for i, v in enumerate(comparison_df['Test_Accuracy']):
    axes[0,0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')

# 2. Multiple metrics comparison
metrics = ['Test_Accuracy', 'Test_Precision', 'Test_Recall', 'Test_F1']
x = np.arange(len(comparison_df))
width = 0.2

for i, metric in enumerate(metrics):
    axes[0,1].bar(x + i*width, comparison_df[metric], width,
                  label=metric.replace('Test_', ''), alpha=0.8)

axes[0,1].set_title('Multiple Metrics Comparison')
axes[0,1].set_xlabel('Model')
axes[0,1].set_ylabel('Score')
axes[0,1].set_xticks(x + width * 1.5)
axes[0,1].set_xticklabels(comparison_df['Model'])
axes[0,1].legend()
axes[0,1].set_ylim(0, 1)

# 3. Training time comparison
axes[1,0].bar(x_pos, comparison_df['Training_Time'],
              color=['gold', 'orange', 'red'], alpha=0.8)
axes[1,0].set_title('Training Time by Model')
axes[1,0].set_xlabel('Model')
axes[1,0].set_ylabel('Training Time (seconds)')
axes[1,0].set_xticks(x_pos)
axes[1,0].set_xticklabels(comparison_df['Model'])

# Add value labels
for i, v in enumerate(comparison_df['Training_Time']):
    axes[1,0].text(i, v + 0.1, f'{v:.2f}s', ha='center', va='bottom')

# 4. Cross-validation scores with error bars
axes[1,1].errorbar(x_pos, comparison_df['CV_Accuracy'],
                   yerr=comparison_df['CV_Std'],
                   fmt='o', capsize=5, capthick=2, markersize=8)
axes[1,1].set_title('Cross-Validation Accuracy (±1 Std)')
axes[1,1].set_xlabel('Model')
axes[1,1].set_ylabel('CV Accuracy')
axes[1,1].set_xticks(x_pos)
axes[1,1].set_xticklabels(comparison_df['Model'])
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Detailed Analysis of Best Model

Let's dive deeper into the performance of our best model.

In [None]:
# Get best model results
best_model_name = best_model.lower()
best_result = results[best_model_name]
best_trainer = best_result['trainer']

print(f"Detailed Analysis of {best_model} Model")
print("=" * 50)

# Predictions
y_pred = best_trainer.model.predict(X_test)
y_pred_proba = None

if hasattr(best_trainer.model, 'predict_proba'):
    y_pred_proba = best_trainer.model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:")
print(f"   True Negatives:  {cm[0,0]:4d}    False Positives: {cm[0,1]:4d}")
print(f"   False Negatives: {cm[1,0]:4d}    True Positives:  {cm[1,1]:4d}")

In [None]:
# Visualize confusion matrix and additional plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle(f'{best_model} Model - Detailed Performance Analysis', fontsize=16)

# 1. Confusion Matrix Heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'],
            ax=axes[0])
axes[0].set_title('Confusion Matrix')

# 2. Feature Importance (if available)
if hasattr(best_trainer.model, 'feature_importances_'):
    feature_imp = pd.DataFrame({
        'feature': feature_names,
        'importance': best_trainer.model.feature_importances_
    }).sort_values('importance', ascending=False)

    # Plot top 10 features
    top_features = feature_imp.head(10)
    axes[1].barh(range(len(top_features)), top_features['importance'],
                 color='lightgreen', alpha=0.8)
    axes[1].set_yticks(range(len(top_features)))
    axes[1].set_yticklabels(top_features['feature'])
    axes[1].set_xlabel('Feature Importance')
    axes[1].set_title('Top 10 Feature Importances')
    axes[1].invert_yaxis()
else:
    axes[1].text(0.5, 0.5, 'Feature importance\nnot available for\nthis model type',
                 ha='center', va='center', transform=axes[1].transAxes, fontsize=12)
    axes[1].set_title('Feature Importance')

# 3. Prediction Probability Distribution (if available)
if y_pred_proba is not None:
    # Separate probabilities by actual class
    prob_class_0 = y_pred_proba[y_test == 0]
    prob_class_1 = y_pred_proba[y_test == 1]

    axes[2].hist(prob_class_0, bins=30, alpha=0.7, label='Actual Class 0',
                 color='skyblue', density=True)
    axes[2].hist(prob_class_1, bins=30, alpha=0.7, label='Actual Class 1',
                 color='lightcoral', density=True)
    axes[2].axvline(x=0.5, color='black', linestyle='--', alpha=0.8, label='Decision Threshold')
    axes[2].set_xlabel('Predicted Probability (Class 1)')
    axes[2].set_ylabel('Density')
    axes[2].set_title('Prediction Probability Distribution')
    axes[2].legend()
else:
    axes[2].text(0.5, 0.5, 'Probability predictions\nnot available for\nthis model type',
                 ha='center', va='center', transform=axes[2].transAxes, fontsize=12)
    axes[2].set_title('Prediction Probabilities')

plt.tight_layout()
plt.show()
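
The dashed line at 0.5 in the probability plot marks the decision threshold. Moving it trades false positives against false negatives, which the sketch below illustrates with made-up probabilities. The function `predict_at` is a hypothetical helper, and treating a probability equal to the threshold as class 1 is a convention chosen here.

```python
import numpy as np

proba = np.array([0.10, 0.35, 0.48, 0.52, 0.70, 0.95])  # P(class 1), made-up values
actual = np.array([0, 0, 1, 0, 1, 1])


def predict_at(threshold: float) -> np.ndarray:
    """Label a sample as class 1 when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)


for threshold in (0.3, 0.5, 0.7):
    pred = predict_at(threshold)
    tp = int(((pred == 1) & (actual == 1)).sum())
    fp = int(((pred == 1) & (actual == 0)).sum())
    fn = int(((pred == 0) & (actual == 1)).sum())
    print(f"threshold={threshold:.1f}  TP={tp}  FP={fp}  FN={fn}")
```

A lower threshold catches more true positives at the cost of more false positives, which matters when the two error types have different costs.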

## 8. Model Persistence

Let's save our best model for future use.

In [None]:
# Save the best model
print(f"Saving {best_model} model...")

# Save model using our trainer
model_path = best_trainer.save_model()
print(f"Model saved to: {model_path}")

# Demonstrate loading the model (get_trainer was already imported during setup)
print("\nTesting model loading...")

# Create new trainer and load model
new_trainer = get_trainer(best_model_name)
new_trainer.load_model(model_path)

# Test prediction
test_sample = X_test.iloc[:5]  # First 5 test samples
original_pred = best_trainer.model.predict(test_sample)
loaded_pred = new_trainer.model.predict(test_sample)

print(f"Original predictions: {original_pred}")
print(f"Loaded predictions:   {loaded_pred}")
print(f"Predictions match: {np.array_equal(original_pred, loaded_pred)}")

if np.array_equal(original_pred, loaded_pred):
    print("Model saved and loaded successfully!")
else:
    print("Model loading verification failed!")
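
How `save_model` and `load_model` serialize the model is not shown in this notebook. For readers who want the same round-trip check outside the framework, here is a minimal sketch that assumes `joblib` as the serializer and uses a hypothetical file name in a temporary directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Write the fitted model to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "model.joblib"  # hypothetical file name
    joblib.dump(model, demo_path)
    restored = joblib.load(demo_path)

# The restored model should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(model.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(f"Predictions match: {round_trip_ok}")
```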

## 9. Experiment Summary & Results

Let's summarize our findings and provide recommendations.

In [None]:
# Create experiment summary
print("EXPERIMENT SUMMARY")
print("=" * 60)
print(f"Dataset: Synthetic classification dataset")
print(f"Samples: {len(df):,} total, {len(X_train):,} train, {len(X_test):,} test")
print(f"Features: {len(feature_names)}")
print(f"Class balance: {df['target'].value_counts(normalize=True).round(3).tolist()}")
print(f"Cross-validation: {config.get('training.cv_folds')}-fold stratified")
print(f"Random seed: {config.get('environment.seed')}")

print(f"\nBEST MODEL: {best_model}")
print("=" * 60)
best_metrics = best_result['test_metrics']
print(f"Test Accuracy:  {best_metrics['accuracy']:.4f}")
print(f"Test Precision: {best_metrics['precision']:.4f}")
print(f"Test Recall:    {best_metrics['recall']:.4f}")
print(f"Test F1-Score:  {best_metrics['f1']:.4f}")
if 'roc_auc' in best_metrics:
    print(f"Test ROC-AUC:   {best_metrics['roc_auc']:.4f}")
print(f"Training Time:  {best_result['training_time']:.2f} seconds")

cv_result = best_result['cv_results']
print(f"CV Accuracy:    {cv_result['cv_mean']:.4f} (±{cv_result['cv_std']:.4f})")

print(f"\nMODEL ARTIFACTS:")
print("=" * 60)
print(f"Saved model:     {model_path}")
print(f"Configuration:   ../config/config.yaml")
print(f"Experiment logs: ../logs/")

## 10. Next Steps & Recommendations

Based on our analysis, here are the recommended next steps for improving the model:

In [None]:
print("NEXT STEPS & RECOMMENDATIONS")
print("=" * 60)

# Performance-based recommendations
if best_metrics['accuracy'] > 0.95:
    performance_level = "Excellent"
    recommendations = [
        "Model performance is excellent!",
        "Consider deploying to production",
        "Monitor for concept drift over time",
        "Collect more diverse real-world data"
    ]
elif best_metrics['accuracy'] > 0.85:
    performance_level = "Good"
    recommendations = [
        "Model performance is good",
        "Try hyperparameter optimization",
        "Consider ensemble methods",
        "Feature engineering improvements"
    ]
else:
    performance_level = "Needs Improvement"
    recommendations = [
        "Model needs improvement",
        "Collect more training data",
        "Try different algorithms",
        "Feature selection and engineering",
        "Address class imbalance if present"
    ]

print(f"Performance Level: {performance_level}")
print(f"Current Accuracy: {best_metrics['accuracy']:.4f}")
print("\nIMMEDIATE ACTIONS:")
for i, rec in enumerate(recommendations, 1):
    print(f"   {i}. {rec}")

print("\nTECHNICAL IMPROVEMENTS:")
tech_improvements = [
    "Implement hyperparameter optimization (Optuna)",
    "Add feature selection pipeline",
    "Set up automated model retraining",
    "Implement model monitoring",
    "Add A/B testing framework",
    "Create model explanation tools (SHAP)"
]

for i, improvement in enumerate(tech_improvements, 1):
    print(f"   {i}. {improvement}")

print("\nLEARNING RESOURCES:")
resources = [
    "MLflow documentation for experiment tracking",
    "Optuna for hyperparameter optimization",
    "SHAP for model interpretability",
    "MLOps best practices and tools"
]

for i, resource in enumerate(resources, 1):
    print(f"   {i}. {resource}")

print("\n" + "=" * 60)
print("EXPERIMENT COMPLETED SUCCESSFULLY!")
print("=" * 60)

---

## Summary

This notebook demonstrated a complete machine learning workflow using the **ML WSL Boilerplate** framework:

- **Configuration Management** - Flexible, hierarchical configuration  
- **Data Loading & Validation** - Custom DataLoader with validation  
- **Exploratory Data Analysis** - Comprehensive data understanding  
- **Model Training** - Multiple algorithms with cross-validation  
- **Model Evaluation** - Detailed performance analysis  
- **Model Persistence** - Save/load functionality  
- **Experiment Tracking** - Structured logging and metrics  

### Framework Benefits:

- **Reproducible**: Fixed random seeds and versioned configurations
- **Modular**: Separate components for data, models, and utilities
- **Extensible**: Easy to add new models and features
- **Production-Ready**: Logging, testing, and deployment tools
- **Developer-Friendly**: Type hints, documentation, and VS Code integration

### Ready for Production:

Use `make train` to run the training pipeline from the command line, or customize the configuration files for your specific use case!

---

*Happy Machine Learning!*