# Diabetes Progression Prediction: Spark Data Processing + XGBoost Training

This example demonstrates a hybrid approach using **Spark for data processing** and **native XGBoost for model training**.

## Architecture

- **Data Processing**: PySpark for distributed ETL operations (scales to large datasets)
- **Training**: Native XGBoost (efficient gradient boosting on driver)
- **Model Logging**: MLflow `xgboost` flavor (a standard MLflow flavor; the logged model loads with `mlflow.xgboost.load_model`)

## Execution Modes

This notebook supports two execution modes:
- **Darwin Cluster Mode**: Uses Darwin SDK with Ray for distributed Spark processing
- **Local Mode**: Uses local Spark session for development/testing
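
The mode is chosen by whether the Darwin SDK can be imported. A minimal sketch of that check, assuming only the standard library: `detect_mode` is a hypothetical helper name, and the notebook itself wraps the real imports in `try`/`except ImportError` instead.

```python
import importlib.util


def detect_mode(required_modules=("ray", "darwin")):
    """Return "darwin" if every required module can be located, else "local".

    Hypothetical helper for illustration; not part of the Darwin SDK.
    """
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            return "local"
    return "darwin"


print(f"Execution mode: {detect_mode()}")
```

Locating a module does not guarantee that importing it succeeds, which is why the notebook also handles errors raised during the import itself.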

## Dataset

- **Name**: Diabetes Dataset
- **Samples**: 442
- **Features**: 10 baseline variables
- **Target**: Quantitative measure of disease progression one year after baseline
- **Type**: Regression

## Model

- **Framework**: Native XGBoost
- **Data Processing**: PySpark
- **Objective**: Squared error regression
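
For reference, the squared error objective (`reg:squarederror` in XGBoost) penalizes the squared difference between prediction and target. The sketch below uses plain Python and made-up numbers to show the per-sample gradient and Hessian that a boosting step works with for this loss, assuming the common `0.5 * (pred - label)**2` convention. The function name is hypothetical and this is not XGBoost's internal code.

```python
def squared_error_grad_hess(preds, labels):
    """Gradient and Hessian of 0.5 * (pred - label)**2 with respect to pred."""
    grads = [p - y for p, y in zip(preds, labels)]  # first derivative
    hess = [1.0 for _ in preds]                     # second derivative is constant
    return grads, hess


# Illustrative values only, not taken from the dataset
labels = [150.0, 75.0, 200.0]
preds = [140.0, 90.0, 200.0]

grads, hess = squared_error_grad_hess(preds, labels)
print(grads)  # [-10.0, 15.0, 0.0]
print(hess)   # [1.0, 1.0, 1.0]
```

A negative gradient means the prediction is too low, so the next tree is pushed to raise it.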

## Features

The dataset includes 10 baseline variables:
- `age`: Age in years
- `sex`: Sex
- `bmi`: Body mass index
- `bp`: Average blood pressure
- `s1`: Total serum cholesterol
- `s2`: Low-density lipoproteins
- `s3`: High-density lipoproteins
- `s4`: Total cholesterol / HDL
- `s5`: Possibly log of serum triglycerides level
- `s6`: Blood sugar level
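
Note that `load_diabetes` returns these columns mean-centered and scaled by default rather than in raw units, so each column sums to roughly zero and has a sum of squares of 1. A small sketch of that transformation on made-up ages, using only the standard library (`center_and_scale` is a hypothetical helper, not a scikit-learn function):

```python
import math


def center_and_scale(values):
    """Mean-center a column, then scale it so its sum of squares is 1."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    # Equivalent to dividing by (population std * sqrt(n_samples))
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


raw_ages = [59.0, 48.0, 72.0, 24.0, 50.0]  # illustrative values only
scaled = center_and_scale(raw_ages)

print([round(v, 4) for v in scaled])
print("sum of squares:", round(sum(v * v for v in scaled), 10))
```

This is why a value such as `age` appears as a small signed float in the loaded data instead of a number of years.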

## Key Features

- Spark handles the train/test split and any distributed transformations (this stage can scale to large datasets); the dataset itself is loaded with scikit-learn
- Native XGBoost handles model training (efficient and fast)
- Model logged using `mlflow.xgboost` flavor (works with any MLflow server)
- Fast model loading at serving time (no Spark dependencies needed)
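
One practical consequence of splitting with Spark: `randomSplit` assigns each row to a split independently according to the weights, so the resulting sizes are approximate rather than exactly 80/20. The plain-Python sketch below imitates that per-row assignment. It is an illustration under that assumption, not Spark's implementation, and `bernoulli_split` is a hypothetical name.

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently, one random draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(442))  # same row count as the Diabetes dataset
train, test = bernoulli_split(rows)

print(f"train={len(train)}, test={len(test)}, total={len(train) + len(test)}")
```

The two counts always add up to 442, but the train count only lands near 354 instead of matching it exactly.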

In [None]:
# Fix pyOpenSSL/cryptography compatibility issue first
%pip install --upgrade pyOpenSSL cryptography

# Install main dependencies (pin MLflow to match server version)
%pip install xgboost pandas numpy scikit-learn mlflow==2.12.2 pyspark

In [None]:
import os
import argparse
import json
import tempfile
import numpy as np
import pandas as pd
from datetime import datetime

# XGBoost imports
import xgboost as xgb

# Spark imports (for data processing only)
from pyspark.sql import SparkSession

# MLflow imports
import mlflow
import mlflow.xgboost
from mlflow import set_tracking_uri, set_experiment
from mlflow.client import MlflowClient
from mlflow.models import infer_signature

# Scikit-learn imports (for loading dataset and metrics)
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Darwin SDK imports (optional - only available on Darwin cluster)
DARWIN_SDK_AVAILABLE = False
try:
    import ray
    from darwin import init_spark_with_configs, stop_spark
    DARWIN_SDK_AVAILABLE = True
    print("Darwin SDK available - will use distributed Spark on Darwin cluster")
except ImportError as e:
    print(f"Darwin SDK not available: {e}")
    print("Running in LOCAL mode - will use local Spark session")
except AttributeError as e:
    # This typically happens with a pyOpenSSL/cryptography version mismatch
    if "X509_V_FLAG" in str(e) or "lib" in str(e):
        print("=" * 80)
        print("ERROR: pyOpenSSL/cryptography version conflict detected!")
        print("Please run the following in a cell before importing:")
        print("  %pip install --upgrade pyOpenSSL cryptography")
        print("Then restart the kernel and try again.")
        print("=" * 80)
    # Re-raise in every case so the failure is not silently swallowed
    raise

In [None]:
def initialize_spark():
    """Initialize Spark session for data processing.
    
    Uses Darwin SDK on cluster, local Spark otherwise.
    Spark is used for distributed data processing (ETL, splitting).
    Training is done with native XGBoost on the driver.
    """
    print("\n" + "=" * 80)
    print("INITIALIZING SPARK SESSION")
    print("=" * 80)
    
    # Base Spark configurations for data processing
    spark_configs = {
        "spark.sql.execution.arrow.pyspark.enabled": "true",
        "spark.sql.session.timeZone": "UTC",
        "spark.sql.shuffle.partitions": "4",
        "spark.default.parallelism": "4",
        "spark.executor.memory": "1g",
        "spark.executor.cores": "1",
        "spark.driver.memory": "1g",
        "spark.executor.instances": "2",
    }
    
    if DARWIN_SDK_AVAILABLE:
        # Running on Darwin cluster - use distributed Spark via Ray
        print("Mode: Darwin Cluster (Distributed)")
        # ignore_reinit_error avoids a RuntimeError when this cell is re-run
        ray.init(ignore_reinit_error=True)
        spark = init_spark_with_configs(spark_configs=spark_configs)
    else:
        # Running locally - use local Spark session
        print("Mode: Local Spark Session")
        builder = SparkSession.builder \
            .appName("Diabetes-Spark-DataProcessing") \
            .master("local[*]")
        
        for key, value in spark_configs.items():
            builder = builder.config(key, value)
        
        spark = builder.getOrCreate()
    
    print(f"Spark version: {spark.version}")
    print(f"Application ID: {spark.sparkContext.applicationId}")
    
    return spark


def cleanup_spark(spark):
    """Stop Spark session properly based on environment."""
    print("\nStopping Spark session...")
    if DARWIN_SDK_AVAILABLE:
        stop_spark()
    else:
        spark.stop()
    print("Spark session stopped.")

In [None]:
def setup_mlflow(mlflow_uri: str, username: str, password: str) -> MlflowClient:
    """Configure MLflow tracking and return client."""
    os.environ["MLFLOW_TRACKING_USERNAME"] = username
    os.environ["MLFLOW_TRACKING_PASSWORD"] = password
    
    set_tracking_uri(mlflow_uri)
    client = MlflowClient(mlflow_uri)
    
    print(f"MLflow tracking URI: {mlflow_uri}")
    return client


def load_and_prepare_data(spark: SparkSession):
    """Load Diabetes dataset using Spark for processing, return pandas for training.
    
    Uses Spark for distributed data operations (can scale to large datasets).
    Returns pandas DataFrames for XGBoost training.
    """
    print("\n" + "=" * 80)
    print("LOADING DATASET")
    print("=" * 80)
    
    # Load dataset
    data = load_diabetes(as_frame=True)
    pdf = data.data.copy()
    pdf['target'] = data.target
    
    feature_names = data.feature_names
    
    print("Dataset: Diabetes")
    print(f"Samples: {len(pdf):,}")
    print(f"Features: {len(feature_names)}")
    
    print("\nFeature names:")
    for i, col in enumerate(feature_names, 1):
        print(f"  {i}. {col}")
    
    print("\nTarget statistics:")
    print(f"  Mean: {pdf['target'].mean():.2f}")
    print(f"  Std: {pdf['target'].std():.2f}")
    print(f"  Min: {pdf['target'].min():.2f}")
    print(f"  Max: {pdf['target'].max():.2f}")
    
    # Use Spark for distributed data splitting (demonstrates Spark processing)
    print("\nUsing Spark for distributed data splitting...")
    spark_df = spark.createDataFrame(pdf)
    train_spark, test_spark = spark_df.randomSplit([0.8, 0.2], seed=42)
    
    # Collect to pandas for XGBoost training
    print("Collecting to pandas for training...")
    train_pdf = train_spark.toPandas()
    test_pdf = test_spark.toPandas()
    
    print(f"\nTrain samples: {len(train_pdf):,}")
    print(f"Test samples: {len(test_pdf):,}")
    
    return train_pdf, test_pdf, feature_names


def train_model(train_pdf, test_pdf, feature_names, hyperparams: dict):
    """Train XGBoost model using native XGBoost."""
    print("\n" + "=" * 80)
    print("TRAINING MODEL (Native XGBoost)")
    print("=" * 80)
    
    print("Hyperparameters:")
    for key, value in hyperparams.items():
        print(f"  {key}: {value}")
    
    # Prepare data
    X_train = train_pdf[feature_names].values
    y_train = train_pdf["target"].values
    X_test = test_pdf[feature_names].values
    y_test = test_pdf["target"].values
    
    print(f"\nTraining samples: {len(X_train)}, Test samples: {len(X_test)}")
    
    # Create DMatrix for XGBoost
    dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=list(feature_names))
    dtest = xgb.DMatrix(X_test, label=y_test, feature_names=list(feature_names))
    
    # XGBoost parameters
    params = {
        "objective": hyperparams.get("objective", "reg:squarederror"),
        "max_depth": hyperparams.get("max_depth", 5),
        "learning_rate": hyperparams.get("learning_rate", 0.1),
        "subsample": hyperparams.get("subsample", 0.8),
        "colsample_bytree": hyperparams.get("colsample_bytree", 0.8),
        "seed": hyperparams.get("random_state", 42),
    }
    
    # Train model
    print("\nTraining XGBoost model...")
    model = xgb.train(
        params,
        dtrain,
        num_boost_round=hyperparams.get("n_estimators", 100),
        evals=[(dtrain, "train"), (dtest, "test")],
        verbose_eval=False
    )
    
    print("Training completed!")
    
    # Make predictions
    y_train_pred = model.predict(dtrain)
    y_test_pred = model.predict(dtest)
    
    return model, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred


def calculate_metrics(y_true, y_pred, dataset_name="Test"):
    """Calculate evaluation metrics using sklearn."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    return {
        f"{dataset_name.lower()}_rmse": rmse,
        f"{dataset_name.lower()}_mae": mae,
        f"{dataset_name.lower()}_r2": r2
    }


def log_to_mlflow(model, X_train, y_train, y_test, y_train_pred, y_test_pred, 
                  hyperparams, feature_names):
    """Log XGBoost model, parameters, and metrics to MLflow."""
    print("\n" + "=" * 80)
    print("LOGGING TO MLFLOW")
    print("=" * 80)
    
    # Log hyperparameters
    for key, value in hyperparams.items():
        mlflow.log_param(key, value)
    
    # Log additional params
    mlflow.log_param("training_framework", "xgboost")
    mlflow.log_param("data_processing", "spark")
    
    # Calculate and log metrics
    train_metrics = calculate_metrics(y_train, y_train_pred, dataset_name="Train")
    test_metrics = calculate_metrics(y_test, y_test_pred, dataset_name="Test")
    all_metrics = {**train_metrics, **test_metrics}
    
    for metric_name, metric_value in all_metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    
    print("\nModel Performance:")
    print(f"  Training RMSE: {train_metrics['train_rmse']:.4f}")
    print(f"  Training R2: {train_metrics['train_r2']:.4f}")
    print(f"  Test RMSE: {test_metrics['test_rmse']:.4f}")
    print(f"  Test MAE: {test_metrics['test_mae']:.4f}")
    print(f"  Test R2: {test_metrics['test_r2']:.4f}")
    
    # Log model using mlflow.xgboost
    model_logged = False
    print("\nSaving model artifacts...")
    try:
        # Create a sample input and infer the signature from a real prediction
        sample_input = pd.DataFrame([X_train[0]], columns=feature_names)
        sample_output = model.predict(xgb.DMatrix(sample_input))
        signature = infer_signature(sample_input, sample_output)
        
        # Log XGBoost model
        print("  Logging to MLflow using xgboost flavor...")
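        mlflow.xgboost.log_model(
            xgb_model=model,
            artifact_path="model",
            signature=signature,
            input_example=sample_input,
            # Save as model.json so the manual artifact fallback in
            # load_model_and_predict can find "model/model.json"
            model_format="json"
        )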
        
        model_logged = True
        print("Model artifacts logged successfully!")
    except Exception as e:
        print(f"Warning: Could not log model artifacts: {e}")
        import traceback
        traceback.print_exc()
        print("Metrics and parameters were logged successfully.")
    
    # Store native model reference for later use
    all_metrics["_native_model"] = model
    all_metrics["_model_logged"] = model_logged
    
    return all_metrics


def register_model(client: MlflowClient, model_name: str, run_id: str, experiment_id: str):
    """Register model in MLflow Model Registry."""
    print("\n" + "=" * 80)
    print("REGISTERING MODEL")
    print("=" * 80)
    
    model_uri = f"runs:/{run_id}/model"
    
    # Create registered model if it doesn't exist
    try:
        client.get_registered_model(model_name)
        print(f"Model '{model_name}' already exists in registry")
    except Exception:
        try:
            client.create_registered_model(model_name)
            print(f"Created registered model: {model_name}")
        except Exception as e:
            print(f"Could not create registered model: {e}")
    
    # Create model version
    try:
        result = client.create_model_version(
            name=model_name,
            source=model_uri,
            run_id=run_id
        )
        print("Model version registered successfully!")
        print(f"   Model Name: {model_name}")
        print(f"   Version: {result.version}")
        print(f"   Run ID: {run_id}")
        return result.version
    except Exception as e:
        print(f"Model registration failed (model still usable via run URI): {e}")
        print(f"   You can deploy using: mlflow-artifacts:/{experiment_id}/{run_id}/artifacts/model")
        return None


def load_model_and_predict(model_uri: str, sample_data: pd.DataFrame, native_model=None):
    """Load model from MLflow URI and run prediction.
    
    Args:
        model_uri: MLflow model URI (e.g., runs:/{run_id}/model)
        sample_data: Pandas DataFrame with feature data
        native_model: Optional - use this model directly instead of loading from MLflow
    """
    print("\n" + "=" * 80)
    print("LOADING MODEL AND RUNNING PREDICTION")
    print("=" * 80)
    
    if native_model is not None:
        # Use the native model directly (no need to load from MLflow)
        print("Using in-memory model for prediction")
        loaded_model = native_model
    else:
        # Try to load from MLflow
        print(f"Model URI: {model_uri}")
        try:
            # Try MLflow's xgboost loader first
            loaded_model = mlflow.xgboost.load_model(model_uri)
            print("Model loaded from MLflow successfully!")
        except Exception as e:
            # Fallback: Download artifact and load manually
            print(f"MLflow loader failed, trying artifact download: {e}")
            try:
                client = MlflowClient()
                run_id = model_uri.split("/")[1] if model_uri.startswith("runs:/") else model_uri
                
                with tempfile.TemporaryDirectory() as tmpdir:
                    local_path = client.download_artifacts(run_id, "model/model.json", tmpdir)
                    loaded_model = xgb.Booster()
                    loaded_model.load_model(local_path)
                    print("Model loaded from artifact successfully!")
            except Exception as e2:
                print(f"Could not load model: {e2}")
                raise
    
    # Create DMatrix for prediction
    dmatrix = xgb.DMatrix(sample_data)
    
    # Run prediction
    predictions = loaded_model.predict(dmatrix)
    
    print("\n" + "=" * 80)
    print("SAMPLE PREDICTION RESULTS")
    print("=" * 80)
    print("\nInput features:")
    for col in sample_data.columns:
        print(f"  {col}: {sample_data[col].iloc[0]:.4f}")
    print(f"\nPredicted diabetes progression: {predictions[0]:.2f}")
    
    return predictions


def print_deployment_info(run_id: str, experiment_id: str, model_name: str, model_version: str):
    """Print deployment instructions and sample payloads."""
    print("\n" + "=" * 80)
    print("TRAINING COMPLETE!")
    print("=" * 80)
    
    print("\nRun Information:")
    print(f"  Run ID: {run_id}")
    print(f"  Experiment ID: {experiment_id}")
    print(f"  Model URI (run): runs:/{run_id}/model")
    if model_version:
        print(f"  Model URI (registry): models:/{model_name}/{model_version}")
    
    print("\n" + "=" * 80)
    print("DEPLOYMENT PAYLOAD (deploy-model API)")
    print("=" * 80)
    
    deploy_payload = {
        "serve_name": "diabetes-xgboost-spark-regressor",
        "model_uri": f"mlflow-artifacts:/{experiment_id}/{run_id}/artifacts/model",
        "env": "local",
        "cores": 2,
        "memory": 4,
        "node_capacity": "spot",
        "min_replicas": 1,
        "max_replicas": 3
    }
    
    print(json.dumps(deploy_payload, indent=2))

In [None]:
def main():
    parser = argparse.ArgumentParser(description="Train Diabetes Regression Model (Spark data processing + XGBoost training)")
    parser.add_argument(
        "--mlflow-uri",
        default="http://darwin-mlflow-lib.darwin.svc.cluster.local:8080",
        help="MLflow tracking URI"
    )
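    parser.add_argument(
        "--username",
        # Prefer credentials from the environment; the literal is a placeholder
        default=os.environ.get("MLFLOW_TRACKING_USERNAME", "abc@gmail.com"),
        help="MLflow username"
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("MLFLOW_TRACKING_PASSWORD", "password"),
        help="MLflow password"
    )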
    parser.add_argument(
        "--experiment-name",
        default="diabetes_spark_xgboost_regression",
        help="MLflow experiment name"
    )
    parser.add_argument(
        "--model-name",
        default="DiabetesXGBoostRegressor",
        help="Registered model name"
    )
    
    args, _ = parser.parse_known_args()
    
    print("\n" + "=" * 80)
    print("DIABETES REGRESSION: SPARK DATA PROCESSING + XGBOOST TRAINING")
    print("=" * 80)
    print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Initialize Spark for data processing
    spark = initialize_spark()
    
    # Setup MLflow
    client = setup_mlflow(args.mlflow_uri, args.username, args.password)
    set_experiment(experiment_name=args.experiment_name)
    print(f"Experiment: {args.experiment_name}")
    
    # Load and prepare data using Spark (returns pandas DataFrames)
    train_pdf, test_pdf, feature_names = load_and_prepare_data(spark)
    
    # Define hyperparameters
    hyperparams = {
        "objective": "reg:squarederror",
        "max_depth": 5,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Start MLflow run
    with mlflow.start_run(run_name=f"xgboost_diabetes_{datetime.now().strftime('%Y%m%d_%H%M%S')}"):
        # Train XGBoost model
        model, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred = train_model(
            train_pdf, test_pdf, feature_names, hyperparams
        )
        
        # Log to MLflow
        metrics = log_to_mlflow(
            model, X_train, y_train, y_test, y_train_pred, y_test_pred,
            hyperparams, feature_names
        )
        
        # Get run information
        run_id = mlflow.active_run().info.run_id
        experiment_id = mlflow.active_run().info.experiment_id
    
    # Register model (outside of run context) - only if artifacts were logged
    model_version = None
    if metrics.get("_model_logged", False):
        model_version = register_model(client, args.model_name, run_id, experiment_id)
    else:
        print("\nSkipping model registration (artifacts not logged to MLflow)")
    
    # Demo prediction with XGBoost model
    print("\n" + "=" * 80)
    print("SAMPLE PREDICTION")
    print("=" * 80)
    model_uri = f"runs:/{run_id}/model"
    sample_pdf = test_pdf[feature_names].head(1)
    
    # Use the native model for prediction
    native_model = metrics.get("_native_model")
    predictions = load_model_and_predict(model_uri, sample_pdf, native_model=native_model)
    
    # Get actual value for comparison
    actual_value = test_pdf["target"].iloc[0]
    print(f"\nActual diabetes progression: {actual_value:.2f}")
    print(f"Prediction error: {abs(predictions[0] - actual_value):.2f}")
    
    # Print deployment information
    print_deployment_info(run_id, experiment_id, args.model_name, model_version)
    
    # Cleanup: Stop Spark session
    cleanup_spark(spark)
    
    print("\nScript completed successfully!")
    print("=" * 80 + "\n")


if __name__ == "__main__":
    main()