# Model Selection and Pipelines - Data Science Koans

Welcome to Notebook 14: Model Selection and Pipelines!

## What You Will Learn
- Pipeline fundamentals for workflow automation
- ColumnTransformer for feature-specific preprocessing  
- Custom transformers extending scikit-learn
- Integrated hyperparameter tuning with pipelines
- Production-ready ML pipelines

## Why This Matters
ML Pipelines are essential for production machine learning because they:
- **Prevent Data Leakage**: Ensure proper train/test separation
- **Ensure Reproducibility**: Same preprocessing steps every time
- **Enable Automation**: From raw data to predictions in one step
- **Simplify Deployment**: Encapsulate entire workflow in one object
- **Reduce Errors**: Eliminate manual preprocessing mistakes
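
The data leakage point can be made concrete with a small comparison. The sketch below is illustrative only: it scales the breast cancer data once before cross-validation (leaky, because the scaler sees validation rows) and once inside a pipeline (the scaler is refit on the training part of each fold). The variable names are assumptions for this example, and on a dataset this clean the two scores will likely be close; the difference is in what the score is allowed to claim.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler computes its mean/std from ALL rows,
# including rows that later serve as validation data.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y, cv=5
)

# Leak-free: cross_val_score clones and refits the whole pipeline per fold,
# so the scaler only ever sees that fold's training rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(f"scaled before CV: {leaky_scores.mean():.4f}")
print(f"scaled inside CV: {safe_scores.mean():.4f}")
```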

## Key Concepts
- **Pipeline**: Sequential chain of data transformations + final estimator
- **ColumnTransformer**: Apply different transformations to different columns
- **Custom Transformers**: Extend sklearn with domain-specific preprocessing
- **Pipeline + GridSearch**: Tune preprocessing AND model parameters together
- **Production Considerations**: Serialization, versioning, monitoring

## Prerequisites  
- Hyperparameter Tuning (Notebook 13)
- Understanding of preprocessing techniques
- Experience with scikit-learn workflows

## How to Use
1. Build increasingly sophisticated pipeline components
2. Learn to avoid common data leakage pitfalls
3. Implement custom preprocessing logic
4. Combine pipelines with hyperparameter tuning
5. Create deployment-ready ML workflows

Ready to build production-grade ML pipelines? Let's automate your workflow! 🔧

In [None]:
# Setup - Run first!
import sys
sys.path.append('../..')

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_wine, make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
import joblib

from koans.core.validator import KoanValidator
from koans.core.progress import ProgressTracker

validator = KoanValidator("14_pipelines")
tracker = ProgressTracker()

print("Setup complete!")
print(f"Current progress: {tracker.get_notebook_progress('14_pipelines')}%")

## KOAN 14.1: Pipeline Basics - Workflow Automation
**Objective**: Build a basic pipeline chaining preprocessing and modeling  
**Difficulty**: Advanced

Pipelines ensure that preprocessing steps are applied consistently during training and prediction, preventing data leakage and ensuring reproducible results.

**Key Concept**: Pipelines apply transformations in sequence, with the final step being an estimator. Each step gets the output of the previous step as input.
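
To see what "each step gets the output of the previous step" means in practice, the following sketch fits a two-step pipeline and then repeats the same work by hand. It uses the wine dataset rather than the koan's dataset, and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipeline_pred = pipe.predict(X_test)

# Manual version: fit the scaler on training data only,
# then feed its output to the classifier.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

same = np.array_equal(pipeline_pred, manual_pred)
print(f"Pipeline and manual predictions match: {same}")
```

The pipeline does nothing the manual code cannot do; its value is that the order of operations cannot be forgotten or applied inconsistently at prediction time.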

In [None]:
def create_basic_pipeline():
    """
    Create a basic pipeline that scales features and applies logistic regression.
    
    Returns:
        dict: Pipeline performance results
    """
    # Load breast cancer dataset
    cancer = load_breast_cancer()
    X, y = cancer.data, cancer.target
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TODO: Create a pipeline with StandardScaler and LogisticRegression
    # Pipeline steps should be: [('scaler', StandardScaler()), ('classifier', LogisticRegression(random_state=42))]
    pipeline = None
    
    # TODO: Fit the pipeline
    # pipeline.fit(X_train, y_train)
    
    # TODO: Make predictions
    # y_pred = pipeline.predict(X_test)
    # accuracy = accuracy_score(y_test, y_pred)
    accuracy = None
    
    # Test individual steps
    scaler_step = pipeline.named_steps['scaler'] if pipeline else None
    classifier_step = pipeline.named_steps['classifier'] if pipeline else None
    
    return {
        'pipeline': pipeline,
        'accuracy': accuracy,
        'scaler': scaler_step,
        'classifier': classifier_step,
        'pipeline_steps': len(pipeline.steps) if pipeline else 0
    }

@validator.koan(1, "Pipeline Basics - Workflow Automation", difficulty="Advanced")
def validate():
    results = create_basic_pipeline()
    
    assert results['pipeline'] is not None, "Pipeline not created"
    assert results['accuracy'] is not None, "Accuracy not calculated"
    assert results['pipeline_steps'] == 2, f"Pipeline should have 2 steps, got {results['pipeline_steps']}"
    assert 0.85 <= results['accuracy'] <= 1.0, f"Accuracy should be reasonable, got {results['accuracy']:.3f}"
    
    # Check pipeline components
    assert results['scaler'] is not None, "Scaler step not found"
    assert results['classifier'] is not None, "Classifier step not found"
    assert isinstance(results['scaler'], StandardScaler), "First step should be StandardScaler"
    assert isinstance(results['classifier'], LogisticRegression), "Second step should be LogisticRegression"
    
    print("✓ Basic pipeline created successfully!")
    print(f"  - Pipeline steps: {results['pipeline_steps']}")
    print(f"  - Test accuracy: {results['accuracy']:.4f}")
    print(f"  - Scaler: {type(results['scaler']).__name__}")
    print(f"  - Classifier: {type(results['classifier']).__name__}")
    
    # Show pipeline structure
    pipeline = results['pipeline']
    print(f"\n  🔧 Pipeline Structure:")
    for i, (name, step) in enumerate(pipeline.steps):
        print(f"    {i+1}. {name}: {type(step).__name__}")
    
    print(f"\n  💡 Pipeline Benefits:")
    print(f"    • Prevents data leakage (scaler fit only on training data)")
    print(f"    • Ensures consistent preprocessing")
    print(f"    • Simplifies prediction workflow")
    print(f"    • Easy to serialize and deploy")
    
    print(f"\n  🎯 Key Methods:")
    print(f"    • fit(): Fits all steps on training data")
    print(f"    • predict(): Transforms data through all steps, predicts with final")
    print(f"    • named_steps: Access individual pipeline components")

validate()

## KOAN 14.2: ColumnTransformer - Feature-Specific Preprocessing
**Objective**: Use ColumnTransformer to apply different preprocessing to different columns  
**Difficulty**: Advanced

Real datasets often have mixed data types requiring different preprocessing. ColumnTransformer lets you apply specific transformations to specific columns.

**Key Concept**: ColumnTransformer applies different transformers to different subsets of columns, then combines results. Essential for mixed-type datasets.

In [None]:
def create_column_transformer_pipeline():
    """
    Create a pipeline using ColumnTransformer for mixed data types.
    We'll create a synthetic dataset with numeric and categorical features.
    
    Returns:
        dict: Pipeline results with mixed data preprocessing
    """
    # Create a mixed dataset
    np.random.seed(42)
    n_samples = 1000
    
    # Numeric features (different scales)
    numeric_data = np.random.randn(n_samples, 3)
    numeric_data[:, 0] *= 100  # Different scale
    numeric_data[:, 1] *= 10   # Different scale
    
    # Categorical features
    categories = ['A', 'B', 'C']
    cat_data = np.random.choice(categories, size=(n_samples, 2))
    
    # Combine into DataFrame
    df = pd.DataFrame(numeric_data, columns=['num1', 'num2', 'num3'])
    df['cat1'] = cat_data[:, 0] 
    df['cat2'] = cat_data[:, 1]
    
    # Create target (classification)
    y = (df['num1'] + df['num2'] > 0).astype(int)
    
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TODO: Define column names for different transformations
    numeric_columns = None  # ['num1', 'num2', 'num3']
    categorical_columns = None  # ['cat1', 'cat2']
    
    # TODO: Create ColumnTransformer
    # Use StandardScaler for numeric columns, OneHotEncoder for categorical
    preprocessor = None
    # ColumnTransformer([
    #     ('num', StandardScaler(), numeric_columns),
    #     ('cat', OneHotEncoder(drop='first'), categorical_columns)
    # ])
    
    # TODO: Create pipeline with preprocessor and classifier
    pipeline = None
    # Pipeline([
    #     ('preprocessor', preprocessor),
    #     ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    # ])
    
    # TODO: Fit pipeline and calculate accuracy
    # pipeline.fit(X_train, y_train)
    # accuracy = pipeline.score(X_test, y_test)
    accuracy = None
    
    return {
        'pipeline': pipeline,
        'preprocessor': preprocessor,
        'accuracy': accuracy,
        'numeric_columns': numeric_columns,
        'categorical_columns': categorical_columns,
        'n_numeric': len(numeric_columns) if numeric_columns else 0,
        'n_categorical': len(categorical_columns) if categorical_columns else 0
    }

@validator.koan(2, "ColumnTransformer - Feature-Specific Preprocessing", difficulty="Advanced")
def validate():
    results = create_column_transformer_pipeline()
    
    assert results['pipeline'] is not None, "Pipeline not created"
    assert results['preprocessor'] is not None, "Preprocessor not created" 
    assert results['accuracy'] is not None, "Accuracy not calculated"
    assert results['n_numeric'] == 3, f"Should have 3 numeric columns, got {results['n_numeric']}"
    assert results['n_categorical'] == 2, f"Should have 2 categorical columns, got {results['n_categorical']}"
    assert 0.7 <= results['accuracy'] <= 1.0, f"Accuracy should be reasonable, got {results['accuracy']:.3f}"
    
    print("✓ ColumnTransformer pipeline created successfully!")
    print(f"  - Test accuracy: {results['accuracy']:.4f}")
    print(f"  - Numeric columns: {results['n_numeric']} (scaled)")
    print(f"  - Categorical columns: {results['n_categorical']} (one-hot encoded)")
    
    # Show preprocessor structure
    preprocessor = results['preprocessor']
    print(f"\n  🔄 Preprocessing Steps:")
    for name, transformer, columns in preprocessor.transformers:
        print(f"    {name}: {type(transformer).__name__} → {columns}")
    
    # Test the preprocessing
    pipeline = results['pipeline']
    if pipeline:
        # Get transformed feature names (if available)
        try:
            feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
            print(f"\n  📊 Transformed Features: {len(feature_names)} total")
            print(f"    Sample names: {feature_names[:5]}...")
        except Exception:
            print(f"\n  📊 Feature transformation completed")
    
    print(f"\n  💡 ColumnTransformer Benefits:")
    print(f"    • Different preprocessing for different data types")
    print(f"    • Automatic feature concatenation")
    print(f"    • Maintains column-specific transformations")
    print(f"    • Handles mixed datasets elegantly")
    
    print(f"\n  🎯 Common Use Cases:")
    print(f"    • Scale numeric, encode categorical")
    print(f"    • Different imputation strategies per column type")
    print(f"    • Feature selection on subsets")
    print(f"    • Custom transformations per feature group")

validate()

## KOAN 14.3: Custom Transformers - Extending scikit-learn
**Objective**: Create custom transformer classes for domain-specific preprocessing  
**Difficulty**: Advanced

Sometimes you need preprocessing logic that isn't available in scikit-learn. Custom transformers let you integrate domain-specific transformations into pipelines.

**Key Concept**: Custom transformers inherit from BaseEstimator and TransformerMixin, implementing fit() and transform() methods to work seamlessly in pipelines.
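
As a warm-up before the koan, here is a small sketch of the contract. `QuantileClipper` is a hypothetical transformer invented for this example (it is not part of scikit-learn): `fit()` learns per-column quantile bounds from the training data, and `transform()` clips values to those bounds.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore, by scikit-learn convention.
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[0, 0] = 50.0  # inject an obvious outlier

pipe = Pipeline([("clip", QuantileClipper()), ("scale", StandardScaler())])
X_out = pipe.fit_transform(X_demo)

clipper = pipe.named_steps["clip"]
clipped = clipper.transform(X_demo)
print(f"Outlier 50.0 was clipped to {clipped[0, 0]:.3f}")
```

Because the bounds are learned in `fit()`, they come from training data only when the transformer runs inside a pipeline, which is the same leakage protection the built-in transformers get.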

In [None]:
class FeatureEngineeringTransformer(BaseEstimator, TransformerMixin):
    """
    Custom transformer that creates new features from existing ones.
    
    This example creates polynomial features and interaction terms.
    """
    
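    def __init__(self, create_interactions=True, create_polynomials=True):
        # Only store constructor arguments here; learned attributes such as
        # feature_names_ are set in fit(), following scikit-learn convention.
        self.create_interactions = create_interactions
        self.create_polynomials = create_polynomials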
    
    def fit(self, X, y=None):
        """
        Fit the transformer. For this transformer, we just store feature names.
        
        Args:
            X: Input features
            y: Target (unused)
        Returns:
            self
        """
        # Store original feature names for later use (already implemented)
        if hasattr(X, 'columns'):
            self.feature_names_ = list(X.columns)
        else:
            self.feature_names_ = [f'feature_{i}' for i in range(X.shape[1])]
        
        return self
    
    def transform(self, X):
        """
        Transform the input by creating new engineered features.
        
        Args:
            X: Input features
        Returns:
            X_transformed: Original features plus engineered features
        """
        # Remember the input type so the output can mirror it
        input_is_dataframe = isinstance(X, pd.DataFrame)
        if not input_is_dataframe:
            X = pd.DataFrame(X, columns=self.feature_names_)
        
        X_new = X.copy()
        
        # TODO: Create polynomial features (squares) if enabled
        if self.create_polynomials:
            for col in self.feature_names_:
                if pd.api.types.is_numeric_dtype(X_new[col]):
                    # Create squared feature
                    pass  # X_new[f'{col}_squared'] = X_new[col] ** 2
        
        # TODO: Create interaction features if enabled  
        if self.create_interactions:
            numeric_cols = [col for col in self.feature_names_ 
                          if pd.api.types.is_numeric_dtype(X_new[col])]
            
            # Create interactions between first few numeric columns (to avoid explosion)
            for i, col1 in enumerate(numeric_cols[:3]):
                for col2 in numeric_cols[i+1:4]:  # Limit interactions
                    # Create interaction feature
                    pass  # X_new[f'{col1}_x_{col2}'] = X_new[col1] * X_new[col2]
        
        # DataFrame in -> DataFrame out; array in -> array out
        return X_new if input_is_dataframe else X_new.to_numpy()

def test_custom_transformer():
    """
    Test the custom transformer in a complete pipeline.
    
    Returns:
        dict: Results from pipeline with custom transformer
    """
    # Load wine dataset
    wine = load_wine()
    X, y = wine.data, wine.target
    
    # Convert to DataFrame for easier feature engineering
    X_df = pd.DataFrame(X, columns=wine.feature_names)
    
    # Use only first 5 features to keep it manageable
    X_subset = X_df.iloc[:, :5]
    
    X_train, X_test, y_train, y_test = train_test_split(
        X_subset, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TODO: Create pipeline with custom transformer
    pipeline = None
    # Pipeline([
    #     ('feature_eng', FeatureEngineeringTransformer(create_interactions=True, create_polynomials=True)),
    #     ('scaler', StandardScaler()),
    #     ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    # ])
    
    # TODO: Fit pipeline and calculate accuracy
    # pipeline.fit(X_train, y_train)
    # accuracy = pipeline.score(X_test, y_test)
    accuracy = None
    
    # Compare with baseline (no feature engineering)
    baseline_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    ])
    baseline_pipeline.fit(X_train, y_train)
    baseline_accuracy = baseline_pipeline.score(X_test, y_test)
    
    # Get feature count after engineering
    if pipeline:
        feature_eng = pipeline.named_steps['feature_eng']
        X_transformed = feature_eng.transform(X_train)
        n_engineered_features = X_transformed.shape[1]
    else:
        n_engineered_features = 0
    
    return {
        'pipeline': pipeline,
        'accuracy': accuracy,
        'baseline_accuracy': baseline_accuracy,
        'improvement': (accuracy - baseline_accuracy) if accuracy is not None else 0,
        'original_features': X_subset.shape[1],
        'engineered_features': n_engineered_features
    }

@validator.koan(3, "Custom Transformers - Extending scikit-learn", difficulty="Advanced")
def validate():
    results = test_custom_transformer()
    
    assert results['pipeline'] is not None, "Pipeline with custom transformer not created"
    assert results['accuracy'] is not None, "Accuracy not calculated"
    assert results['engineered_features'] > results['original_features'], "Should create additional features"
    assert 0.7 <= results['accuracy'] <= 1.0, f"Accuracy should be reasonable, got {results['accuracy']:.3f}"
    assert 0.7 <= results['baseline_accuracy'] <= 1.0, f"Baseline should be reasonable, got {results['baseline_accuracy']:.3f}"
    
    print("✓ Custom transformer pipeline created successfully!")
    print(f"  - Original features: {results['original_features']}")
    print(f"  - Engineered features: {results['engineered_features']}")
    print(f"  - Feature expansion: +{results['engineered_features'] - results['original_features']} features")
    
    print(f"\n  📊 Performance Comparison:")
    print(f"    Baseline (no engineering): {results['baseline_accuracy']:.4f}")
    print(f"    With feature engineering: {results['accuracy']:.4f}")
    print(f"    Improvement: {results['improvement']:+.4f}")
    
    if results['improvement'] > 0.01:
        print(f"    🎉 Significant improvement from feature engineering!")
    elif results['improvement'] > -0.01:
        print(f"    ✓ Feature engineering didn't hurt performance")
    else:
        print(f"    ⚠️ Feature engineering may have added noise")
    
    print(f"\n  🛠️ Custom Transformer Requirements:")
    print(f"    • Inherit from BaseEstimator, TransformerMixin")
    print(f"    • Implement fit(X, y=None) method")
    print(f"    • Implement transform(X) method")
    print(f"    • Return self from fit() for method chaining")
    
    print(f"\n  💡 Custom Transformer Use Cases:")
    print(f"    • Domain-specific feature engineering")
    print(f"    • Business rule transformations")
    print(f"    • Complex data cleaning logic")
    print(f"    • Integration with external APIs")
    print(f"    • Time series feature extraction")

validate()

## KOAN 14.4: Pipeline with GridSearch - Integrated Hyperparameter Tuning
**Objective**: Combine pipelines with GridSearchCV to tune both preprocessing and model parameters  
**Difficulty**: Advanced

One of the most powerful features of pipelines is the ability to tune preprocessing parameters alongside model parameters, preventing data leakage and finding optimal combinations.

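**Key Concept**: Use double underscores (`step_name__parameter_name`) to address a parameter of a specific pipeline step in a GridSearchCV parameter grid. Use the bare step name (for example `scaler`) to swap out the whole step.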
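
The naming scheme can be explored without running a search. The sketch below (variable names are assumptions for this example) lists the addressable parameters with `get_params()`, changes two of them with `set_params()`, and counts the combinations a small grid would produce with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])

# Every name that a parameter grid may use is a key of get_params().
svc_param_names = sorted(
    name for name in pipe.get_params() if name.startswith("classifier__")
)
print(svc_param_names[:5])

# set_params() accepts the same names GridSearchCV uses internally:
# "step__param" changes one parameter, a bare step name replaces the step.
pipe.set_params(classifier__C=10, scaler=MinMaxScaler())

# A small illustrative grid: 2 scalers x 3 values of C.
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "classifier__C": [0.1, 1, 10],
}
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 6
```

Checking `get_params()` first is a cheap way to catch a misspelled grid key before starting a long search.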

In [None]:
def pipeline_with_gridsearch():
    """
    Create a pipeline and use GridSearchCV to tune both preprocessing and model parameters.
    
    Returns:
        dict: GridSearch results with best pipeline configuration
    """
    # Load breast cancer dataset
    cancer = load_breast_cancer()
    X, y = cancer.data, cancer.target
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TODO: Create pipeline with multiple preprocessing options
    pipeline = None
    # Pipeline([
    #     ('scaler', StandardScaler()),  # This will be tuned
    #     ('classifier', SVC(random_state=42))  # This will be tuned
    # ])
    
    # TODO: Define parameter grid for pipeline tuning
    # Use step_name__parameter_name format
    param_grid = [
        {
            'scaler': None,  # [StandardScaler(), MinMaxScaler()]
            'classifier__C': None,  # [0.1, 1, 10]
            'classifier__kernel': None,  # ['linear', 'rbf']
            'classifier__gamma': None   # ['scale', 'auto']
        }
    ]
    
    # TODO: Create GridSearchCV for the pipeline
    grid_search = None
    # GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    
    # TODO: Fit the grid search
    # grid_search.fit(X_train, y_train)
    
    # Evaluate best pipeline
    best_score = grid_search.best_score_ if grid_search else 0
    test_score = grid_search.score(X_test, y_test) if grid_search else 0
    best_params = grid_search.best_params_ if grid_search else {}
    
    return {
        'grid_search': grid_search,
        'best_params': best_params,
        'best_cv_score': best_score,
        'test_score': test_score,
        'best_pipeline': grid_search.best_estimator_ if grid_search else None,
        'n_combinations': len(grid_search.cv_results_['params']) if grid_search else 0
    }

@validator.koan(4, "Pipeline with GridSearch - Integrated Hyperparameter Tuning", difficulty="Advanced")
def validate():
    results = pipeline_with_gridsearch()
    
    assert results['grid_search'] is not None, "GridSearch not created"
    assert results['best_params'] is not None, "Best parameters not found"
    assert results['best_cv_score'] > 0, "Best CV score not calculated"
    assert results['test_score'] > 0, "Test score not calculated"
    assert results['n_combinations'] > 0, "No parameter combinations tested"
    assert 0.85 <= results['best_cv_score'] <= 1.0, f"CV score should be good, got {results['best_cv_score']:.3f}"
    
    print("✓ Pipeline GridSearch optimization complete!")
    print(f"  - Parameter combinations tested: {results['n_combinations']}")
    print(f"  - Best CV score: {results['best_cv_score']:.4f}")
    print(f"  - Test score: {results['test_score']:.4f}")
    
    print(f"\n  🏆 Best Configuration:")
    for param, value in results['best_params'].items():
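        step, param_name = param.split('__', 1) if '__' in param else ('pipeline', param)
        # Show estimator objects (e.g. a scaler chosen by the grid) by class name
        display_value = type(value).__name__ if isinstance(value, BaseEstimator) else value
        print(f"    {step} {param_name}: {display_value}")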
    
    # Check generalization
    cv_test_diff = abs(results['best_cv_score'] - results['test_score'])
    if cv_test_diff < 0.02:
        print(f"\n  ✓ Excellent generalization (CV-Test diff: {cv_test_diff:.3f})")
    elif cv_test_diff < 0.05:
        print(f"\n  ✓ Good generalization (CV-Test diff: {cv_test_diff:.3f})")
    else:
        print(f"\n  ⚠️ Possible overfitting (CV-Test diff: {cv_test_diff:.3f})")
    
    print(f"\n  🔧 Pipeline Parameter Tuning:")
    print(f"    • Use step_name__parameter_name syntax")
    print(f"    • Can tune preprocessing AND model parameters")
    print(f"    • Prevents data leakage automatically")
    print(f"    • Finds optimal preprocessing-model combinations")
    
    print(f"\n  💡 Advanced Tuning Tips:")
    print(f"    • Try different scalers (Standard, MinMax, Robust)")
    print(f"    • Tune feature selection parameters")
    print(f"    • Include different algorithms in same grid")
    print(f"    • Use nested CV for unbiased evaluation")

validate()

## KOAN 14.5: Production Pipeline - Deployment-Ready Workflow
**Objective**: Create a complete, production-ready ML pipeline with serialization  
**Difficulty**: Advanced

Production pipelines must handle missing data, unknown categories, and feature validation, and they must be serializable for deployment. This final koan brings together all pipeline concepts.

**Key Concept**: Production pipelines need robust error handling, data validation, versioning, and the ability to be saved/loaded for deployment.
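
The save/load round trip and the handling of unseen inputs can be tried on a tiny example first. This is a sketch under stated assumptions: the data, column names, and file name are invented, and the file is written to a temporary directory so nothing is left behind. Keep in mind that a file written by `joblib.dump` is generally only safe to load with the same library versions that produced it.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one missing numeric value.
train = pd.DataFrame({
    "amount": [10.0, np.nan, 35.0, 80.0, 5.0, 60.0],
    "channel": ["web", "app", "web", "store", "app", "store"],
})
labels = [0, 0, 1, 1, 0, 1]

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(train, labels)

# New rows: a missing amount and a channel never seen during training.
new_rows = pd.DataFrame({"amount": [np.nan, 42.0], "channel": ["phone", "web"]})

# Round trip through disk, as a deployment would do.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

original_pred = model.predict(new_rows)
restored_pred = restored.predict(new_rows)
same_predictions = np.array_equal(original_pred, restored_pred)
print(f"Restored pipeline reproduces predictions: {same_predictions}")
```

With `handle_unknown="ignore"`, the unseen channel is encoded as all zeros instead of raising an error, and the imputer fills the missing amount with the training median.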

In [None]:
def create_production_pipeline():
    """
    Create a comprehensive, production-ready ML pipeline.
    
    Handles: missing values, mixed data types, feature engineering, 
    model training, serialization, and prediction.
    
    Returns:
        dict: Complete production pipeline results
    """
    # Create realistic dataset with missing values and mixed types
    np.random.seed(42)
    n_samples = 1000
    
    # Create mixed dataset with missing values
    data = {
        'numeric_1': np.random.randn(n_samples),
        'numeric_2': np.random.randn(n_samples) * 10,
        'numeric_3': np.random.randn(n_samples) * 100,
        'category_1': np.random.choice(['A', 'B', 'C', None], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
        'category_2': np.random.choice(['X', 'Y', 'Z'], n_samples),
        'binary_feature': np.random.choice([0, 1], n_samples)
    }
    
    # Add some missing values to numeric features
    missing_mask = np.random.random(n_samples) < 0.05
    data['numeric_1'][missing_mask] = np.nan
    
    df = pd.DataFrame(data)
    y = (df['numeric_1'].fillna(0) + df['numeric_2'] > 0).astype(int)
    
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TODO: Define column groups
    numeric_columns = None  # ['numeric_1', 'numeric_2', 'numeric_3']
    categorical_columns = None  # ['category_1', 'category_2'] 
    binary_columns = None  # ['binary_feature']
    
    # TODO: Create comprehensive preprocessor
    preprocessor = None
    # ColumnTransformer([
    #     ('num', Pipeline([
    #         ('imputer', SimpleImputer(strategy='median')),
    #         ('scaler', StandardScaler())
    #     ]), numeric_columns),
    #     ('cat', Pipeline([
    #         ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    #         ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
    #     ]), categorical_columns),
    #     ('bin', 'passthrough', binary_columns)  # Keep binary as-is
    # ])
    
    # TODO: Create full production pipeline
    production_pipeline = None
    # Pipeline([
    #     ('preprocessor', preprocessor),
    #     ('feature_selection', 'passthrough'),  # Could add feature selection
    #     ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    # ])
    
    # TODO: Fit the pipeline
    # production_pipeline.fit(X_train, y_train)
    
    # Evaluate performance
    train_score = production_pipeline.score(X_train, y_train) if production_pipeline else 0
    test_score = production_pipeline.score(X_test, y_test) if production_pipeline else 0
    
    # TODO: Save pipeline for deployment
    pipeline_filename = 'production_model.joblib'
    # joblib.dump(production_pipeline, pipeline_filename)
    
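    # TODO: Load pipeline (simulate deployment)
    # loaded_pipeline = joblib.load(pipeline_filename)
    # Placeholder so validation can run. Delete the next line once joblib.load
    # above is enabled; otherwise it overwrites the loaded pipeline.
    loaded_pipeline = production_pipeline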
    
    # Test loaded pipeline
    loaded_test_score = loaded_pipeline.score(X_test, y_test) if loaded_pipeline else 0
    
    return {
        'pipeline': production_pipeline,
        'loaded_pipeline': loaded_pipeline,
        'train_score': train_score,
        'test_score': test_score, 
        'loaded_test_score': loaded_test_score,
        'numeric_cols': len(numeric_columns) if numeric_columns else 0,
        'categorical_cols': len(categorical_columns) if categorical_columns else 0,
        'binary_cols': len(binary_columns) if binary_columns else 0,
        'serialization_works': abs(test_score - loaded_test_score) < 0.001 if test_score and loaded_test_score else False
    }

@validator.koan(5, "Production Pipeline - Deployment-Ready Workflow", difficulty="Advanced")
def validate():
    results = create_production_pipeline()
    
    assert results['pipeline'] is not None, "Production pipeline not created"
    assert results['loaded_pipeline'] is not None, "Pipeline serialization/loading failed"
    assert results['train_score'] > 0, "Training score not calculated"
    assert results['test_score'] > 0, "Test score not calculated"
    assert results['numeric_cols'] == 3, f"Should handle 3 numeric columns, got {results['numeric_cols']}"
    assert results['categorical_cols'] == 2, f"Should handle 2 categorical columns, got {results['categorical_cols']}"
    assert results['binary_cols'] == 1, f"Should handle 1 binary column, got {results['binary_cols']}"
    assert 0.7 <= results['test_score'] <= 1.0, f"Test score should be reasonable, got {results['test_score']:.3f}"
    
    print("✓ Production pipeline created successfully!")
    print(f"  - Training accuracy: {results['train_score']:.4f}")
    print(f"  - Test accuracy: {results['test_score']:.4f}")
    print(f"  - Loaded model accuracy: {results['loaded_test_score']:.4f}")
    
    if results['serialization_works']:
        print(f"  ✓ Serialization working correctly")
    else:
        print(f"  ⚠️ Serialization may have issues")
    
    # Check for overfitting
    train_test_diff = results['train_score'] - results['test_score']
    if train_test_diff < 0.05:
        print(f"  ✓ No overfitting detected (diff: {train_test_diff:.3f})")
    elif train_test_diff < 0.1:
        print(f"  ⚠️ Mild overfitting (diff: {train_test_diff:.3f})")
    else:
        print(f"  🚨 Significant overfitting (diff: {train_test_diff:.3f})")
    
    print(f"\n  🏗️ Pipeline Architecture:")
    print(f"    Numeric columns: {results['numeric_cols']} (impute → scale)")
    print(f"    Categorical columns: {results['categorical_cols']} (impute → encode)")
    print(f"    Binary columns: {results['binary_cols']} (passthrough)")
    
    print(f"\n  🚀 Production Readiness Checklist:")
    print(f"    ✅ Handles missing values gracefully")
    print(f"    ✅ Handles unknown categories")
    print(f"    ✅ Mixed data type support")
    print(f"    ✅ Serialization for deployment")
    print(f"    ✅ Consistent preprocessing pipeline")
    print(f"    ✅ No data leakage")
    
    print(f"\n  📦 Deployment Considerations:")
    print(f"    • Version your pipelines (model versioning)")
    print(f"    • Monitor feature distributions in production")
    print(f"    • Handle feature drift and concept drift")  
    print(f"    • Implement A/B testing for model updates")
    print(f"    • Add input validation and error handling")
    print(f"    • Consider batch vs. real-time prediction needs")

validate()

## 🎉 Congratulations!

You have mastered ML Pipelines - the foundation of production machine learning!

### What You've Built
- ✅ **Basic Pipelines**: Automated preprocessing + modeling workflows
- ✅ **ColumnTransformer**: Mixed data type preprocessing strategies
- ✅ **Custom Transformers**: Domain-specific transformation logic
- ✅ **Integrated Tuning**: Combined preprocessing and model optimization  
- ✅ **Production Pipelines**: Deployment-ready ML workflows

### Critical Skills Gained
1. **Data Leakage Prevention**: Proper train/test separation in preprocessing
2. **Workflow Automation**: End-to-end ML pipelines  
3. **Robust Preprocessing**: Handling missing values, mixed types, unknowns
4. **Hyperparameter Integration**: Tuning preprocessing + model together
5. **Deployment Readiness**: Serializable, versioned, production workflows

### Real-World Impact
- 🛡️ **Fewer data leakage bugs** because preprocessing is fit on training data only
- ⚡ **Faster development** with reusable pipeline components
- 🎯 **Better models** through integrated hyperparameter tuning
- 🚀 **Seamless deployment** with serialized pipeline objects
- 📈 **Maintainable ML** with standardized workflows

### Next Steps
- **Notebook 15**: Ethics and Bias (responsible ML practices!)
- **Advanced**: MLOps, model monitoring, automated retraining
- **Practice**: Build pipelines for your own ML projects

### Production Pipeline Checklist
- ✅ Input validation and error handling
- ✅ Missing value imputation strategies  
- ✅ Unknown category handling
- ✅ Feature scaling and encoding
- ✅ Model serialization/versioning
- ✅ Consistent preprocessing logic
- ✅ Performance monitoring hooks
- ✅ A/B testing capabilities

You're now ready to build production-grade ML systems! 🏭✨

In [None]:
# Final Progress Check
progress = tracker.get_notebook_progress('14_pipelines')
print(f"\n📊 Your Progress: {progress}% complete!")

if progress == 100:
    print("🎉 Phenomenal! You've mastered all pipeline techniques!")
    print("🎯 Ready for Notebook 15: Ethics and Bias (the final frontier!)")
elif progress >= 75:
    print("🌟 Excellent progress! Pipeline mastery is within reach.")
elif progress >= 50:
    print("💪 Great work! You're building production-ready skills.")
else:
    print("🚀 Keep going! Each pipeline concept builds toward deployment readiness.")

print(f"\n📈 Overall course progress:")
total_notebooks = 15
completed_notebooks = len([nb for nb in range(1, 15) if tracker.get_notebook_progress(f'{nb:02d}_*') == 100])
print(f"   Completed notebooks: {completed_notebooks}/{total_notebooks}")
print(f"   Course progress: {(completed_notebooks/total_notebooks)*100:.1f}%")

print(f"\n🏭 Production ML Pipeline Mastery Achieved!")
print(f"   You can now build enterprise-grade ML systems! 🎖️")

if completed_notebooks >= 14:
    print(f"\n🎊 Almost finished! One more notebook to complete your")  
    print(f"   Data Science Koans journey: Ethics and Bias! 🧭")