# 📘 Day 2: MLOps Best Practices

**🎯 Goal:** Master MLOps workflows for production ML systems

**⏱️ Time:** 120-150 minutes

**🌟 Why This Matters for AI (2024-2025):**
- MLOps is one of the most commonly cited skill gaps in AI, and engineers who can run models in production are in high demand
- Many ML models never reach production or fail once there, often because of weak MLOps practices rather than weak models
- Model versioning prevents disasters and enables rollbacks
- Experiment tracking saves months of wasted work
- Model monitoring catches problems before users do
- A/B testing shows whether your model actually works in production
- Successful AI companies rely on robust MLOps pipelines

**What You'll Build Today:**
1. **Model versioning** with MLflow and Weights & Biases
2. **Experiment tracking** to compare 100s of model runs
3. **Model monitoring** to detect drift and degradation
4. **A/B testing** to validate models in production
5. **Complete MLOps pipeline** from training to monitoring

---

## 🌍 MLOps Landscape (2024-2025)

**MLOps = DevOps for Machine Learning**

### 🎯 What is MLOps?

**The practice of deploying and maintaining ML models in production reliably and efficiently.**

```
Traditional Software:        Machine Learning:
Code → Test → Deploy         Code + Data + Model → Test → Deploy → Monitor
         ↓                                        ↓
      DevOps                                    MLOps
```

### 🏗️ MLOps Components:

#### 1️⃣ **Experiment Tracking**
**Problem:** "Which hyperparameters gave best results 2 weeks ago?"  
**Solution:** Track every experiment automatically

**Tools:**
- **MLflow** (open-source, industry standard)
- **Weights & Biases** (W&B) - best visualizations
- **Neptune.ai** - enterprise features
- **TensorBoard** - TensorFlow focused

#### 2️⃣ **Model Versioning**
**Problem:** "Which model is in production? Can we rollback?"  
**Solution:** Version models like code (Git for models)

**Tools:**
- **MLflow Model Registry**
- **DVC** (Data Version Control)
- **Pachyderm**

#### 3️⃣ **Model Monitoring**
**Problem:** "Model accuracy dropped from 95% to 60%!"  
**Solution:** Monitor performance, data drift, predictions

**Tools:**
- **Evidently AI**
- **WhyLabs**
- **Arize AI**
- **Fiddler**

#### 4️⃣ **CI/CD for ML**
**Problem:** "Manual deployment is slow and error-prone"  
**Solution:** Automate training, testing, deployment

**Tools:**
- **GitHub Actions** + ML
- **Jenkins** + ML pipelines
- **Argo Workflows**
- **Kubeflow**
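
As one illustration of automating the testing step, a CI job can refuse to deploy a candidate whose metrics regress against the current baseline. The sketch below is tool-agnostic: the function name `passes_quality_gate`, the metric names, and the 0.01 tolerance are assumptions chosen for illustration, not part of any tool listed above.

```python
def passes_quality_gate(candidate, baseline, tolerance=0.01):
    """Return (ok, failures). ok is False if any baseline metric regresses
    by more than `tolerance` (absolute) in the candidate."""
    failures = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            failures.append(f"{name}: missing from candidate metrics")
        elif cand_value < base_value - tolerance:
            failures.append(f"{name}: {cand_value:.3f} < baseline {base_value:.3f}")
    return (not failures), failures


# Hypothetical metrics, as a training job might write them out
baseline = {"accuracy": 0.85, "f1_score": 0.83}
candidate = {"accuracy": 0.87, "f1_score": 0.80}

ok, failures = passes_quality_gate(candidate, baseline)
print("deploy" if ok else "block", failures)
```

In a CI workflow, the script would typically exit with a non-zero status when `ok` is false so that the deployment step never runs.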

#### 5️⃣ **Feature Stores**
**Problem:** "Different features in training vs production"  
**Solution:** Centralized feature management

**Tools:**
- **Feast** (open-source)
- **Tecton**
- **Hopsworks**
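
The core idea behind a feature store, one definition of each feature shared by training and serving, can be sketched without any of these tools. Everything here (the `FEATURES` registry, `build_features`, the raw field names) is hypothetical and only illustrates the idea; real feature stores add storage, versioning, and point-in-time correctness.

```python
import math

# Hypothetical feature definitions: each maps a raw record to one value.
FEATURES = {
    "log_income": lambda r: math.log1p(r["income"]),
    "is_senior": lambda r: 1 if r["age"] >= 65 else 0,
}


def build_features(record):
    """Compute every registered feature for one raw record, in a fixed (sorted) order."""
    return [FEATURES[name](record) for name in sorted(FEATURES)]


# Training and serving call the same function, so the logic cannot diverge.
training_rows = [{"age": 30, "income": 50_000}, {"age": 70, "income": 20_000}]
training_features = [build_features(r) for r in training_rows]

live_request = {"age": 70, "income": 20_000}
serving_features = build_features(live_request)

print(serving_features == training_features[1])
```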

### 📊 MLOps Maturity Levels:

| Level | Description | Characteristics |
|-------|-------------|------------------|
| **0 - Manual** | No automation | Jupyter notebooks, manual deployment |
| **1 - DevOps** | Code automation | Automated testing, CI/CD for code |
| **2 - Automated Training** | ML automation | Automated retraining, pipelines |
| **3 - Full MLOps** | Complete automation | Auto deploy, monitor, retrain |

**Most companies in 2024-2025: Level 1-2**  
**Goal: Reach Level 3**

### 🌟 Why MLOps Matters:

**Without MLOps:**
- ❌ Lost experiments (can't reproduce results)
- ❌ Model rot (accuracy degrades over time)
- ❌ Slow iteration (manual everything)
- ❌ Production failures (untested deployments)
- ❌ No accountability (who deployed what?)

**With MLOps:**
- ✅ Reproducible experiments
- ✅ Automated monitoring and alerts
- ✅ Fast iteration cycles
- ✅ Reliable deployments
- ✅ Complete audit trail

Let's build MLOps pipelines!

---

## 🛠️ Setup & Installation

In [None]:
# Install MLOps libraries
import sys

# Core ML libraries
!{sys.executable} -m pip install scikit-learn numpy pandas --quiet

# MLOps tools
!{sys.executable} -m pip install mlflow wandb --quiet

# Monitoring
!{sys.executable} -m pip install evidently --quiet

# Visualization
!{sys.executable} -m pip install matplotlib seaborn plotly --quiet

print("✅ MLOps libraries installed successfully!")
print("\n📦 Installed:")
print("   - MLflow (experiment tracking & model registry)")
print("   - Weights & Biases (W&B)")
print("   - Evidently (model monitoring)")
print("\n🚀 Ready for MLOps!")

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.datasets import make_classification

# MLOps libraries
import mlflow
import mlflow.sklearn

# Set random seed
np.random.seed(42)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("📦 Libraries imported successfully!")
print("🎯 Ready to build MLOps pipelines!\n")

## 📊 Step 1: Experiment Tracking with MLflow

**MLflow is one of the most widely used open-source tools for ML experiment tracking**

### 🎯 What is MLflow?

**Open-source platform for the complete ML lifecycle:**
1. **Tracking**: Log params, metrics, artifacts
2. **Projects**: Package ML code
3. **Models**: Manage and deploy models
4. **Registry**: Version and stage models

### 🏗️ MLflow Tracking:

```
Experiment
    ↓
├─ Run 1 (model=RandomForest, n_estimators=100)
│  ├─ Parameters: {n_estimators: 100, max_depth: 10}
│  ├─ Metrics: {accuracy: 0.85, f1: 0.83}
│  └─ Artifacts: model.pkl, feature_importance.png
│
├─ Run 2 (model=RandomForest, n_estimators=200)
│  ├─ Parameters: {n_estimators: 200, max_depth: 15}
│  ├─ Metrics: {accuracy: 0.87, f1: 0.86}
│  └─ Artifacts: model.pkl, feature_importance.png
│
└─ Run 3 (model=GradientBoosting)
   └─ ...
```
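
To make the hierarchy above concrete, here is a toy in-memory model of experiments and runs. It is a conceptual sketch only, not MLflow's API: the `Run` and `Experiment` classes and the `best_run` helper are invented for illustration, and the numbers are the ones shown in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    name: str
    params: dict
    metrics: dict
    artifacts: list = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    runs: list = field(default_factory=list)

    def log_run(self, run):
        self.runs.append(run)

    def best_run(self, metric):
        """Return the run with the highest value of `metric`."""
        return max(self.runs, key=lambda r: r.metrics[metric])


exp = Experiment("binary_classification")
exp.log_run(Run("run_1", {"n_estimators": 100, "max_depth": 10},
                {"accuracy": 0.85, "f1": 0.83}, ["model.pkl"]))
exp.log_run(Run("run_2", {"n_estimators": 200, "max_depth": 15},
                {"accuracy": 0.87, "f1": 0.86}, ["model.pkl"]))

best = exp.best_run("accuracy")
print(best.name, best.params)
```

A tracking server does the same bookkeeping, but persists it and adds a UI for comparing runs.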

### 🌟 Why Use MLflow?

✅ **Never lose experiments**: Every logged run is stored  
✅ **Easy comparison**: Compare 100s of runs visually  
✅ **Reproducible**: Track everything needed to reproduce  
✅ **Team collaboration**: Share results with team  
✅ **Production ready**: Deploy best models directly  

Let's track experiments!

In [None]:
# Create a sample dataset for our experiments

print("📊 Creating Sample Dataset\n")
print("="*70)

# Generate synthetic classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"✅ Dataset created:")
print(f"   Training samples: {len(X_train)}")
print(f"   Test samples: {len(X_test)}")
print(f"   Features: {X.shape[1]}")
print(f"   Classes: {len(np.unique(y))}")

# Class distribution
unique, counts = np.unique(y_train, return_counts=True)
print(f"\n📈 Class distribution:")
for cls, count in zip(unique, counts):
    print(f"   Class {cls}: {count} samples ({count/len(y_train)*100:.1f}%)")

print("\n" + "="*70)

In [None]:
# Basic MLflow experiment tracking

print("🔬 Running MLflow Experiment\n")
print("="*70)

# Set experiment name
mlflow.set_experiment("binary_classification")

# Start MLflow run
with mlflow.start_run(run_name="random_forest_v1"):
    
    # Define hyperparameters
    params = {
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 2,
        'random_state': 42
    }
    
    # Log parameters
    mlflow.log_params(params)
    
    print("📝 Logged parameters:")
    for key, value in params.items():
        print(f"   {key}: {value}")
    
    # Train model
    print("\n🎯 Training model...")
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred)
    }
    
    # Log metrics
    mlflow.log_metrics(metrics)
    
    print("\n📊 Logged metrics:")
    for key, value in metrics.items():
        print(f"   {key}: {value:.4f}")
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    print("\n✅ Model logged to MLflow!")
    
    # Get run info
    run = mlflow.active_run()
    print(f"\n🔗 Run ID: {run.info.run_id}")
    print(f"📁 Artifact URI: {run.info.artifact_uri}")

print("\n" + "="*70)
print("\n💡 View results: mlflow ui")
print("   Then open: http://localhost:5000")

In [None]:
# Run multiple experiments with different models

print("🔬 Running Multiple Experiments\n")
print("="*70)

# Define different models and their hyperparameters
experiments = [
    {
        'name': 'random_forest_small',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 50, 'max_depth': 5, 'random_state': 42}
    },
    {
        'name': 'random_forest_medium',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
    },
    {
        'name': 'random_forest_large',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 200, 'max_depth': 15, 'random_state': 42}
    },
    {
        'name': 'gradient_boosting',
        'model': GradientBoostingClassifier,
        'params': {'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'random_state': 42}
    },
    {
        'name': 'logistic_regression',
        'model': LogisticRegression,
        'params': {'max_iter': 1000, 'random_state': 42}
    }
]

# Store results for comparison
results = []

# Run all experiments
for i, exp in enumerate(experiments, 1):
    print(f"\n🔬 Experiment {i}/{len(experiments)}: {exp['name']}")
    
    with mlflow.start_run(run_name=exp['name']):
        # Log parameters
        mlflow.log_params(exp['params'])
        mlflow.log_param('model_type', exp['model'].__name__)
        
        # Train model
        model = exp['model'](**exp['params'])
        model.fit(X_train, y_train)
        
        # Predict
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1_score': f1_score(y_test, y_pred)
        }
        
        # Log metrics
        mlflow.log_metrics(metrics)
        
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
        # Store for comparison
        results.append({
            'name': exp['name'],
            'model': exp['model'].__name__,
            **metrics
        })
        
        print(f"   ✅ Accuracy: {metrics['accuracy']:.4f}")

print("\n" + "="*70)

# Display results
results_df = pd.DataFrame(results)
print("\n📊 Experiment Results:\n")
print(results_df.to_string(index=False))

# Find best model (idxmax returns an index label, so select with .loc)
best_idx = results_df['accuracy'].idxmax()
best_model = results_df.loc[best_idx]

print(f"\n🏆 Best Model: {best_model['name']}")
print(f"   Model Type: {best_model['model']}")
print(f"   Accuracy: {best_model['accuracy']:.4f}")
print(f"   F1 Score: {best_model['f1_score']:.4f}")

print("\n💡 All experiments tracked in MLflow!")
print("   Run 'mlflow ui' to compare visually")

In [None]:
# Visualize experiment results

print("📊 Visualizing Experiment Results\n")
print("="*70)

# Create comparison plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Accuracy comparison
axes[0, 0].barh(results_df['name'], results_df['accuracy'], color='skyblue')
axes[0, 0].set_xlabel('Accuracy', fontsize=12)
axes[0, 0].set_title('Model Accuracy Comparison', fontsize=14, fontweight='bold')
axes[0, 0].set_xlim(0, 1)

# Plot 2: F1 Score comparison
axes[0, 1].barh(results_df['name'], results_df['f1_score'], color='lightcoral')
axes[0, 1].set_xlabel('F1 Score', fontsize=12)
axes[0, 1].set_title('Model F1 Score Comparison', fontsize=14, fontweight='bold')
axes[0, 1].set_xlim(0, 1)

# Plot 3: Precision vs Recall
axes[1, 0].scatter(results_df['precision'], results_df['recall'], s=200, alpha=0.6, c=range(len(results_df)), cmap='viridis')
for i, name in enumerate(results_df['name']):
    axes[1, 0].annotate(name, (results_df['precision'][i], results_df['recall'][i]), 
                        fontsize=8, ha='center')
axes[1, 0].set_xlabel('Precision', fontsize=12)
axes[1, 0].set_ylabel('Recall', fontsize=12)
axes[1, 0].set_title('Precision vs Recall', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: All metrics comparison (heatmap)
metrics_only = results_df[['accuracy', 'precision', 'recall', 'f1_score']]
im = axes[1, 1].imshow(metrics_only.values, cmap='YlGnBu', aspect='auto', vmin=0, vmax=1)
axes[1, 1].set_xticks(range(len(metrics_only.columns)))
axes[1, 1].set_xticklabels(metrics_only.columns, rotation=45, ha='right')
axes[1, 1].set_yticks(range(len(results_df)))
axes[1, 1].set_yticklabels(results_df['name'])
axes[1, 1].set_title('All Metrics Heatmap', fontsize=14, fontweight='bold')

# Add values to heatmap
for i in range(len(results_df)):
    for j in range(len(metrics_only.columns)):
        text = axes[1, 1].text(j, i, f'{metrics_only.values[i, j]:.3f}',
                              ha="center", va="center", color="black", fontsize=9)

plt.colorbar(im, ax=axes[1, 1])
plt.tight_layout()
plt.savefig('experiment_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Visualizations created!")
print("   📄 Saved to: experiment_comparison.png")
print("\n💡 This is how you compare 100s of experiments visually!")

## 🗄️ Step 2: Model Versioning & Registry

**Model Registry = Git for ML Models**

### 🎯 Why Model Versioning?

**Problems without versioning:**
- ❌ "Which model is in production?"
- ❌ "Can we rollback to last week's model?"
- ❌ "Who deployed this model?"
- ❌ "What data was it trained on?"

**Solutions with versioning:**
- ✅ Track all model versions
- ✅ Easy rollback to any version
- ✅ Complete audit trail
- ✅ Stage-based deployment (staging → production)

### 🏗️ MLflow Model Registry:

```
Model: sentiment_classifier
│
├─ Version 1 [Archived]
│  ├─ Accuracy: 0.82
│  ├─ Created: 2024-01-10
│  └─ Status: Archived (old)
│
├─ Version 2 [Production]
│  ├─ Accuracy: 0.85
│  ├─ Created: 2024-01-15
│  └─ Status: Production
│
└─ Version 3 [Staging]
   ├─ Accuracy: 0.87
   ├─ Created: 2024-01-20
   └─ Status: Staging (testing)
```

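### 🌟 Model Stages:

Recent MLflow releases (2.9 and later) deprecate stages in favor of model version aliases and tags, but the stage names still describe the usual lifecycle:

1. **None**: Just registered
2. **Staging**: Being tested
3. **Production**: Serving users
4. **Archived**: Deprecated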
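
The stage lifecycle can be sketched as a small state machine. This is a toy in-memory registry written for illustration, not MLflow's registry API: `ToyRegistry`, `register`, `promote`, and `rollback` are hypothetical names, and the rule that promoting a version archives the previous production version is an assumption about a typical workflow.

```python
class ToyRegistry:
    """Minimal model registry: version numbers map to a stage and metadata."""

    def __init__(self):
        self.versions = {}   # version -> {"stage": str, "accuracy": float}
        self.history = []    # versions that have held Production, oldest first

    def register(self, accuracy):
        version = len(self.versions) + 1
        self.versions[version] = {"stage": "None", "accuracy": accuracy}
        return version

    def production(self):
        return self.history[-1] if self.history else None

    def promote(self, version):
        """Move `version` to Production and archive the current production version."""
        current = self.production()
        if current is not None:
            self.versions[current]["stage"] = "Archived"
        self.versions[version]["stage"] = "Production"
        self.history.append(version)

    def rollback(self):
        """Restore the previous production version and archive the failing one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        failed = self.history.pop()
        self.versions[failed]["stage"] = "Archived"
        self.versions[self.history[-1]]["stage"] = "Production"


registry = ToyRegistry()
v1 = registry.register(accuracy=0.85)
v2 = registry.register(accuracy=0.87)
registry.promote(v1)
registry.promote(v2)      # v1 becomes Archived
registry.rollback()       # v2 misbehaves in production: restore v1
print(registry.versions)
```

Nothing is ever deleted here: a rolled-back version stays in the registry as Archived, which keeps the audit trail intact.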

### 📊 Versioning Best Practices:

✅ **Version everything**: Model, code, data, config  
✅ **Tag versions**: Add metadata (accuracy, dataset, etc.)  
✅ **Stage gradually**: Staging → Canary → Production  
✅ **Never delete**: Archive instead of delete  
✅ **Audit trail**: Log who did what when  
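
One lightweight way to act on "version everything" is to store a content hash of the training data and config next to each model version, so any later change to either is detectable. The helper names and the tag layout below are assumptions for illustration; tools such as DVC handle this far more completely.

```python
import hashlib
import json


def fingerprint_bytes(data: bytes) -> str:
    """Short SHA-256 fingerprint of raw bytes (for example, a dataset file's contents)."""
    return hashlib.sha256(data).hexdigest()[:12]


def fingerprint_config(config: dict) -> str:
    """Fingerprint a config dict; sorted keys make the hash independent of key order."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return fingerprint_bytes(canonical)


# Stand-in for the bytes of a real training file
dataset_bytes = b"age,income,label\n30,50000,0\n70,20000,1\n"
config = {"model": "RandomForest", "n_estimators": 100, "max_depth": 10}

# Hypothetical metadata to attach to a model version as tags
version_tags = {
    "data_sha": fingerprint_bytes(dataset_bytes),
    "config_sha": fingerprint_config(config),
}
print(version_tags)
```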

Let's use the model registry!

In [None]:
# MLflow Model Registry demo

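print("🗄️  MLflow Model Registry Demo\n")
print("="*70)

# Set tracking URI to a local SQLite database.
# Runs logged from here on go to mlflow.db, not the store used by the earlier cells.
mlflow.set_tracking_uri("sqlite:///mlflow.db")
# Re-select the experiment so that it exists in the new store
mlflow.set_experiment("binary_classification")

# Register a model
model_name = "binary_classifier"

print(f"📝 Registering model: {model_name}")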

# Train and register model
with mlflow.start_run(run_name="registry_demo") as run:
    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Calculate metrics
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Log parameters and metrics
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log and register model
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name=model_name
    )
    
    print(f"\n✅ Model registered!")
    print(f"   Name: {model_name}")
    print(f"   Run ID: {run.info.run_id}")
    print(f"   Accuracy: {accuracy:.4f}")

print("\n" + "="*70)
print("\n💡 Model Registry Features:")
print("   ✅ Version control for models")
print("   ✅ Stage-based deployment (Staging → Production)")
print("   ✅ Model lineage (track training data, code)")
print("   ✅ Easy rollback to previous versions")
print("\n🌐 View in UI: mlflow ui --backend-store-uri sqlite:///mlflow.db")

## 📈 Step 3: Model Monitoring

**Monitor models in production to catch issues early**

### 🎯 What to Monitor?

#### 1️⃣ **Model Performance**
- Accuracy, precision, recall
- Latency (prediction time)
- Throughput (requests/second)

#### 2️⃣ **Data Drift**
**Problem:** Input data distribution changes over time

```
Training data:        Production data (6 months later):
Age: 25-35            Age: 45-55    ← DRIFT!
Income: $50k          Income: $80k  ← DRIFT!
```

**Detection:**
- Statistical tests (KS test, Chi-square)
- Distribution comparison
- Feature drift monitoring
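
As a rough illustration of the KS test mentioned above, the sketch below computes the two-sample statistic in plain Python and converts it to a p-value with the standard asymptotic series. It is a teaching sketch under those assumptions (the function names are invented, and the approximation is conservative when the data contain many ties); in practice a library routine such as `scipy.stats.ks_2samp` is the usual choice.

```python
import math


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(reference), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value approximation for the two-sample KS statistic."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:   # the series converges slowly here and p is effectively 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * total))


# Ages as in the example above: 25-35 at training time, 45-55 in production
train_age = list(range(25, 36)) * 10
prod_age = list(range(45, 56)) * 10

d = ks_statistic(train_age, prod_age)
p = ks_pvalue(d, len(train_age), len(prod_age))
print(f"D={d:.2f}, drift detected: {p < 0.05}")
```

Running one such test per feature and flagging those with a small p-value is the essence of feature drift monitoring.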

#### 3️⃣ **Concept Drift**
**Problem:** Relationship between features and target changes

```
Before: "Buy" button clicks → High conversion
After: "Buy" button clicks → Low conversion  ← CONCEPT DRIFT!
(User behavior changed)
```

#### 4️⃣ **Prediction Drift**
**Problem:** Model outputs change distribution

```
Week 1: 50% positive predictions
Week 2: 90% positive predictions ← DRIFT!
```

### 🚨 When to Alert?

**Trigger alerts when:**
- ❌ Accuracy drops > 5%
- ❌ Data drift detected (p-value < 0.05)
- ❌ Latency increases > 2x
- ❌ Error rate spikes
- ❌ Prediction distribution shifts significantly
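
These rules translate directly into a small check that a monitoring job could run on each batch of metrics. The sketch below is illustrative: the function name, the metric keys, the reading of "drops > 5%" as 5 percentage points, and the 3x factor for an error-rate spike are all assumptions.

```python
def check_alerts(baseline, current, drift_p_value):
    """Return a list of alert messages; thresholds mirror the list above."""
    alerts = []
    # Assumption: "drops > 5%" means more than 5 percentage points of accuracy.
    if baseline["accuracy"] - current["accuracy"] > 0.05:
        alerts.append("accuracy dropped by more than 5 points")
    if drift_p_value < 0.05:
        alerts.append("data drift detected")
    if current["latency_ms"] > 2 * baseline["latency_ms"]:
        alerts.append("latency more than doubled")
    # Assumption: a "spike" is an error rate above 3x the baseline.
    if current["error_rate"] > 3 * baseline["error_rate"]:
        alerts.append("error rate spike")
    return alerts


# Hypothetical snapshot: accuracy fell from 95% to 60%, everything else is normal
baseline = {"accuracy": 0.95, "latency_ms": 50, "error_rate": 0.01}
current = {"accuracy": 0.60, "latency_ms": 60, "error_rate": 0.012}

alerts = check_alerts(baseline, current, drift_p_value=0.20)
print(alerts)
```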

### 🛠️ Monitoring Tools:

**Open Source:**
- **Evidently AI** - drift detection
- **Great Expectations** - data validation
- **Prometheus + Grafana** - metrics

**Commercial:**
- **WhyLabs** - data logging
- **Arize AI** - observability
- **Fiddler** - explainability

Let's implement monitoring!

In [None]:
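# Model monitoring with Evidently
# Note: these import paths match Evidently's older Report API. Newer releases
# reorganized the package, so pin an older version if the imports fail.

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

print("📈 Model Monitoring Demo\n")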
print("="*70)

# Create reference and current datasets
# Reference: Original training data
# Current: Simulated production data (with drift)

reference_data = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
reference_data['target'] = y_train

# Simulate production data with drift
# Add some noise to simulate distribution shift
X_prod = X_test + np.random.normal(0, 0.5, X_test.shape)
current_data = pd.DataFrame(X_prod, columns=[f'feature_{i}' for i in range(X_prod.shape[1])])
current_data['target'] = y_test

print("📊 Creating Drift Report...\n")

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset()
])

report.run(
    reference_data=reference_data,
    current_data=current_data,
    column_mapping=None
)

# Save report
report.save_html('drift_report.html')

print("✅ Drift report generated!")
print("   📄 Saved to: drift_report.html")
print("\n💡 Open the HTML file to see:")
print("   - Data drift detection")
print("   - Feature-by-feature analysis")
print("   - Distribution comparisons")
print("   - Data quality metrics")

print("\n" + "="*70)

In [None]:
# Custom monitoring: Track performance over time

print("📊 Performance Monitoring Over Time\n")
print("="*70)

# Simulate model performance over time
import datetime

# Generate time series of model performance
dates = pd.date_range(start='2024-01-01', periods=30, freq='D')

# Simulate degrading performance (model rot)
np.random.seed(42)
base_accuracy = 0.85
degradation = np.linspace(0, -0.15, 30)  # Gradual degradation
noise = np.random.normal(0, 0.02, 30)  # Daily variance
accuracy_over_time = base_accuracy + degradation + noise

# Create monitoring dataframe
monitoring_df = pd.DataFrame({
    'date': dates,
    'accuracy': accuracy_over_time,
    'predictions_count': np.random.randint(100, 1000, 30),
    'avg_latency_ms': np.random.normal(50, 10, 30)
})

# Plot monitoring dashboard
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Plot 1: Accuracy over time
axes[0].plot(monitoring_df['date'], monitoring_df['accuracy'], marker='o', linewidth=2, color='blue')
axes[0].axhline(y=0.80, color='red', linestyle='--', label='Alert Threshold (80%)')
axes[0].axhline(y=0.85, color='green', linestyle='--', label='Target (85%)')
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Accuracy Over Time', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Highlight degradation period
degraded_dates = monitoring_df[monitoring_df['accuracy'] < 0.80]['date']
if len(degraded_dates) > 0:
    axes[0].axvspan(degraded_dates.min(), degraded_dates.max(), 
                    alpha=0.2, color='red', label='Degraded Performance')

# Build the legend last so the shaded span's label is included
axes[0].legend()

# Plot 2: Prediction volume
axes[1].bar(monitoring_df['date'], monitoring_df['predictions_count'], 
            color='skyblue', alpha=0.7)
axes[1].set_ylabel('Predictions Count', fontsize=12)
axes[1].set_title('Daily Prediction Volume', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Plot 3: Latency
axes[2].plot(monitoring_df['date'], monitoring_df['avg_latency_ms'], 
            marker='s', linewidth=2, color='orange')
axes[2].axhline(y=100, color='red', linestyle='--', label='SLA Limit (100ms)')
axes[2].set_ylabel('Latency (ms)', fontsize=12)
axes[2].set_xlabel('Date', fontsize=12)
axes[2].set_title('Average Prediction Latency', fontsize=14, fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('monitoring_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

# Print alerts
print("\n🚨 Monitoring Alerts:\n")

# Check for accuracy degradation
low_accuracy_days = monitoring_df[monitoring_df['accuracy'] < 0.80]
if len(low_accuracy_days) > 0:
    print(f"⚠️  ALERT: Accuracy below 80% on {len(low_accuracy_days)} days")
    print(f"   First occurrence: {low_accuracy_days['date'].min().strftime('%Y-%m-%d')}")
    print(f"   Lowest accuracy: {low_accuracy_days['accuracy'].min():.2%}")
    print(f"   🔧 Action: Retrain model with recent data")
else:
    print("✅ Accuracy within acceptable range")

# Check for latency issues
high_latency_days = monitoring_df[monitoring_df['avg_latency_ms'] > 100]
if len(high_latency_days) > 0:
    print(f"\n⚠️  ALERT: High latency on {len(high_latency_days)} days")
    print(f"   Max latency: {monitoring_df['avg_latency_ms'].max():.1f}ms")
    print(f"   🔧 Action: Optimize model or scale infrastructure")
else:
    print("\n✅ Latency within SLA")

print("\n" + "="*70)
print("\n💡 In production, set up automated alerts:")
print("   - Email/Slack when accuracy drops")
print("   - PagerDuty for critical failures")
print("   - Auto-trigger retraining pipelines")

## 🔬 Step 4: A/B Testing ML Models

**Validate model improvements in production with real users**

### 🎯 What is A/B Testing?

**Compare two models with real traffic:**

```
Users
  ↓
  50% → Model A (current)
  50% → Model B (new)
  ↓
Measure: Conversion, engagement, accuracy
  ↓
Winner → 100% traffic
```

### 🏗️ A/B Testing Strategies:

#### 1️⃣ **Random Split**
- 50/50 split
- Simple and fair
- Fast results

#### 2️⃣ **Canary Deployment**
```
5% → New model (canary)
95% → Old model (stable)
```
- Lower risk
- Gradual rollout
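
A common way to implement such a split is to hash a stable user identifier, so each user always lands in the same group without any stored state. The sketch below assumes string user IDs; the function name `assign_variant`, the salt value, and the 100-bucket scheme are illustrative choices, not a prescribed design.

```python
import hashlib


def assign_variant(user_id: str, canary_percent: int = 5, salt: str = "exp-1") -> str:
    """Deterministically map a user to 'canary' or 'stable'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # roughly uniform bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


# Simulated user IDs to check the resulting traffic share
users = [f"user_{i}" for i in range(10_000)]
share = sum(assign_variant(u) == "canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")
```

Raising `canary_percent` step by step gives the gradual rollout, and changing the salt reshuffles users for a new experiment.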

#### 3️⃣ **Multi-Armed Bandit**
- Dynamic allocation
- More traffic to better model
- Faster convergence
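
Dynamic allocation can be illustrated with epsilon-greedy, one of the simplest bandit policies (chosen here as an assumption; the text does not name a specific policy). The conversion rates are made up and deliberately far apart so the effect is easy to see, and the function name and parameters are hypothetical.

```python
import random


def epsilon_greedy(true_rates, rounds=2000, epsilon=0.1, seed=0):
    """Simulate traffic allocation; returns pulls and observed successes per arm."""
    rng = random.Random(seed)
    arms = list(true_rates)
    pulls = {a: 0 for a in arms}
    wins = {a: 0 for a in arms}
    for _ in range(rounds):
        untried = [a for a in arms if pulls[a] == 0]
        if untried:
            arm = untried[0]                                   # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.choice(arms)                             # explore
        else:
            arm = max(arms, key=lambda a: wins[a] / pulls[a])  # exploit best estimate
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]            # simulated user outcome
    return pulls, wins


# Hypothetical success rates for the current model (A) and the new model (B)
pulls, wins = epsilon_greedy({"A": 0.10, "B": 0.90})
print(pulls)
```

Most traffic ends up on the better arm while a small share keeps exploring, which is the trade-off against a fixed 50/50 split.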

### 📊 What to Measure?

**Technical Metrics:**
- Accuracy, precision, recall
- Latency, throughput
- Error rate

**Business Metrics:**
- Click-through rate (CTR)
- Conversion rate
- Revenue per user
- User engagement

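### ✅ Statistical Significance:

**Don't declare a winner too early!**

```python
def can_declare_winner(p_value, n_per_group, alpha=0.05, min_samples=1000):
    """alpha is the significance level; min_samples is a rough floor, not a power analysis."""
    return p_value < alpha and n_per_group >= min_samples
```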

### 🎯 A/B Testing Best Practices:

✅ **Define success metrics** before testing  
✅ **Calculate required sample size**  
✅ **Run for sufficient time** (1-2 weeks)  
✅ **Check for statistical significance**  
✅ **Monitor both groups** for differences  
✅ **Have rollback plan** ready  
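
For the sample-size step, a standard normal-approximation formula for comparing two proportions gives a quick estimate. The sketch below assumes a two-sided test at the given significance level and power; the function name and the example rates (85% versus 87%) are illustrative.

```python
import math
from statistics import NormalDist


def samples_per_group(p_baseline, p_new, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_baseline) ** 2)


n = samples_per_group(0.85, 0.87)
print(f"need about {n} samples per group")
```

Small improvements need surprisingly large samples: a two-point lift at this baseline calls for several thousand observations per group, which is why tests stopped early so often come out inconclusive.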

Let's implement A/B testing!

In [None]:
# A/B Testing simulation

from scipy import stats

print("🔬 A/B Testing Simulation\n")
print("="*70)

# Simulate two models
print("Training two models...\n")

# Model A: Current (baseline)
model_a = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model_a.fit(X_train, y_train)

# Model B: New (challenger)
model_b = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
model_b.fit(X_train, y_train)

print("✅ Models trained!")

# Simulate A/B test with production data
print("\n🎲 Simulating A/B Test...\n")

# Randomly assign users to groups
n_samples = len(X_test)
np.random.seed(42)
assignments = np.random.choice(['A', 'B'], size=n_samples, p=[0.5, 0.5])

# Make predictions
results_a = []
results_b = []

for i in range(n_samples):
    if assignments[i] == 'A':
        pred = model_a.predict([X_test[i]])[0]
        correct = (pred == y_test[i])
        results_a.append(correct)
    else:
        pred = model_b.predict([X_test[i]])[0]
        correct = (pred == y_test[i])
        results_b.append(correct)

# Calculate metrics
accuracy_a = np.mean(results_a)
accuracy_b = np.mean(results_b)

print("📊 A/B Test Results:\n")
print(f"Model A (Baseline):")
print(f"   Samples: {len(results_a)}")
print(f"   Accuracy: {accuracy_a:.4f} ({accuracy_a:.2%})")

print(f"\nModel B (Challenger):")
print(f"   Samples: {len(results_b)}")
print(f"   Accuracy: {accuracy_b:.4f} ({accuracy_b:.2%})")

# Statistical significance test
print("\n🔬 Statistical Significance Test:\n")

# Chi-square test
contingency_table = [
    [sum(results_a), len(results_a) - sum(results_a)],  # A: correct, incorrect
    [sum(results_b), len(results_b) - sum(results_b)]   # B: correct, incorrect
]

chi2, p_value = stats.chi2_contingency(contingency_table)[:2]

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")

# Determine winner
alpha = 0.05  # Significance level

if p_value < alpha:
    if accuracy_b > accuracy_a:
        improvement = (accuracy_b - accuracy_a) / accuracy_a * 100
        print(f"\n🏆 WINNER: Model B!")
        print(f"   ✅ Statistically significant (p < {alpha})")
        print(f"   📈 Improvement: {improvement:.2f}%")
        print(f"\n🚀 Recommendation: Deploy Model B to production")
    else:
        print(f"\n🏆 WINNER: Model A (current)")
        print(f"   ✅ Statistically significant (p < {alpha})")
        print(f"\n⚠️  Recommendation: Keep Model A, don't deploy B")
else:
    print(f"\n🤔 INCONCLUSIVE")
    print(f"   ❌ Not statistically significant (p >= {alpha})")
    print(f"\n📊 Recommendation:")
    print(f"      - Collect more data")
    print(f"      - Run test longer")
    print(f"      - Increase sample size")

print("\n" + "="*70)

In [None]:
# Visualize A/B test results

print("📊 Visualizing A/B Test Results\n")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy comparison
models = ['Model A\n(Baseline)', 'Model B\n(Challenger)']
accuracies = [accuracy_a, accuracy_b]
colors = ['lightblue', 'lightgreen']

bars = axes[0].bar(models, accuracies, color=colors, alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('A/B Test: Model Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1)

# Add value labels
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{acc:.2%}',
                ha='center', va='bottom', fontsize=12, fontweight='bold')

# Add winner indicator
if p_value < alpha:
    winner_idx = 1 if accuracy_b > accuracy_a else 0
    bars[winner_idx].set_edgecolor('gold')
    bars[winner_idx].set_linewidth(3)
    axes[0].text(winner_idx, accuracies[winner_idx] + 0.05, '👑 WINNER',
                ha='center', fontsize=14, fontweight='bold', color='gold')

# Plot 2: Sample distribution
sizes = [len(results_a), len(results_b)]
axes[1].pie(sizes, labels=models, autopct='%1.1f%%',
           colors=colors, startangle=90, textprops={'fontsize': 12})
axes[1].set_title('Traffic Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ A/B test visualization created!")
print("   📄 Saved to: ab_test_results.png")
print("\n💡 In production:")
print("   - Run for 1-2 weeks minimum")
print("   - Monitor both technical AND business metrics")
print("   - Have automatic rollback if B performs worse")

## 🏭 Real AI Example: Complete MLOps Pipeline

**End-to-end MLOps workflow from training to production**

This example demonstrates a production-ready MLOps pipeline with:
- Automated experiment tracking
- Model versioning and registry
- Performance monitoring
- Automated retraining triggers
- A/B testing framework
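
A retraining trigger can be as simple as a rolling average compared against a threshold, which avoids reacting to a single noisy day. The sketch below is a stand-alone illustration: the class name `RetrainTrigger`, the window length, and the daily accuracy values are hypothetical.

```python
from collections import deque


class RetrainTrigger:
    """Fire when the rolling mean accuracy over `window` days falls below `threshold`."""

    def __init__(self, threshold=0.80, window=7):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, daily_accuracy):
        """Record one day's accuracy and report whether retraining should start."""
        self.recent.append(daily_accuracy)
        window_full = len(self.recent) == self.recent.maxlen
        rolling_mean = sum(self.recent) / len(self.recent)
        return window_full and rolling_mean < self.threshold


trigger = RetrainTrigger(threshold=0.80, window=3)

# Hypothetical daily accuracies drifting downward
daily = [0.85, 0.84, 0.82, 0.81, 0.78, 0.75]
fired_on = [day for day, acc in enumerate(daily, start=1) if trigger.update(acc)]
print(fired_on)
```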

In [None]:
# Complete MLOps Pipeline
from scipy import stats  # used by deploy_with_ab_test for the chi-square test

class MLOpsPipeline:
    """
    Production-ready MLOps pipeline
    
    Features:
    - Experiment tracking with MLflow
    - Model versioning
    - Performance monitoring
    - Automated retraining
    - A/B testing support
    """
    
    def __init__(self, experiment_name):
        self.experiment_name = experiment_name
        mlflow.set_experiment(experiment_name)
        self.performance_threshold = 0.80
        self.production_model = None
        
    def train_and_track(self, model, params, X_train, y_train, X_test, y_test):
        """Train model with full MLflow tracking"""
        
        with mlflow.start_run() as run:
            # Log parameters
            mlflow.log_params(params)
            mlflow.log_param('model_type', type(model).__name__)
            
            # Train model
            model.fit(X_train, y_train)
            
            # Evaluate
            y_pred = model.predict(X_test)
            
            # Calculate metrics
            metrics = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, average='weighted'),
                'recall': recall_score(y_test, y_pred, average='weighted'),
                'f1_score': f1_score(y_test, y_pred, average='weighted')
            }
            
            # Log metrics
            mlflow.log_metrics(metrics)
            
            # Log model
            mlflow.sklearn.log_model(model, "model")
            
            return run.info.run_id, metrics
    
    def check_performance(self, metrics):
        """Check if model meets performance threshold"""
        
        if metrics['accuracy'] < self.performance_threshold:
            print(f"⚠️  ALERT: Accuracy {metrics['accuracy']:.2%} below threshold {self.performance_threshold:.2%}")
            print("🔧 Triggering retraining...")
            return False
        return True
    
    def deploy_with_ab_test(self, new_model, current_model, X_test, y_test, traffic_split=0.5):
        """Deploy new model with A/B testing.

        traffic_split is the share of traffic kept on the current model (group A).
        """
        
        print("\n🔬 Starting A/B Test Deployment...\n")
        
        # Random assignment
        assignments = np.random.choice(['A', 'B'], size=len(X_test), 
                                      p=[traffic_split, 1-traffic_split])
        
        # Track performance
        results_a = []
        results_b = []
        
        for i in range(len(X_test)):
            if assignments[i] == 'A':
                pred = current_model.predict([X_test[i]])[0]
                results_a.append(pred == y_test[i])
            else:
                pred = new_model.predict([X_test[i]])[0]
                results_b.append(pred == y_test[i])
        
        acc_a = np.mean(results_a)
        acc_b = np.mean(results_b)
        
        # Statistical test
        contingency = [
            [sum(results_a), len(results_a) - sum(results_a)],
            [sum(results_b), len(results_b) - sum(results_b)]
        ]
        _, p_value = stats.chi2_contingency(contingency)[:2]
        
        # Decision
        if p_value < 0.05 and acc_b > acc_a:
            print(f"‚úÖ New model wins!")
            print(f"   Current: {acc_a:.2%}")
            print(f"   New: {acc_b:.2%}")
            print(f"   Improvement: {(acc_b-acc_a)/acc_a*100:.2f}%")
            print(f"\nüöÄ Deploying new model to production...")
            self.production_model = new_model
            return 'new'
        else:
            print(f"‚ö†Ô∏è  Keeping current model")
            print(f"   Current: {acc_a:.2%}")
            print(f"   New: {acc_b:.2%}")
            return 'current'

# Demo the pipeline
print("🏭 Complete MLOps Pipeline Demo\n")
print("="*70)

# Initialize pipeline
pipeline = MLOpsPipeline("production_pipeline")

# Train initial model
print("\n1️⃣ Training initial model...\n")
model_v1 = RandomForestClassifier(n_estimators=100, random_state=42)
run_id_v1, metrics_v1 = pipeline.train_and_track(
    model_v1, 
    {'n_estimators': 100}, 
    X_train, y_train, X_test, y_test
)
print(f"✅ Model v1 trained: Accuracy = {metrics_v1['accuracy']:.2%}")

# Check performance
print("\n2️⃣ Monitoring performance...\n")
pipeline.check_performance(metrics_v1)

# Train improved model
print("\n3️⃣ Training improved model...\n")
model_v2 = GradientBoostingClassifier(n_estimators=150, random_state=42)
run_id_v2, metrics_v2 = pipeline.train_and_track(
    model_v2,
    {'n_estimators': 150},
    X_train, y_train, X_test, y_test
)
print(f"✅ Model v2 trained: Accuracy = {metrics_v2['accuracy']:.2%}")

# A/B test deployment
print("\n4️⃣ A/B Testing Deployment...")
winner = pipeline.deploy_with_ab_test(model_v2, model_v1, X_test, y_test)

print("\n" + "="*70)
print("\n🎉 MLOps Pipeline Complete!")
print("\n💡 In production, this pipeline would:")
print("   ✅ Run automatically on schedule")
print("   ✅ Monitor performance continuously")
print("   ✅ Trigger retraining when needed")
print("   ✅ A/B test before full deployment")
print("   ✅ Rollback automatically if issues detected")
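
The `check_performance` method above inspects a single metrics snapshot and only prints a message. As a rough illustration of what continuous monitoring with a retraining trigger might look like, here is a minimal sketch that assumes labelled feedback arrives after each prediction. `RollingAccuracyMonitor`, its parameters, and the threshold values are hypothetical and not part of the pipeline class.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Hypothetical helper: tracks accuracy over the most recent predictions."""

    def __init__(self, threshold=0.85, window_size=100, min_samples=20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window_size)  # 1 = correct, 0 = wrong

    def record(self, prediction, actual):
        """Store whether one labelled prediction was correct."""
        self.outcomes.append(int(prediction == actual))

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once enough samples exist and accuracy is below the threshold."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.threshold


monitor = RollingAccuracyMonitor(threshold=0.85, window_size=50, min_samples=10)

# Simulated labelled feedback: 8 correct predictions followed by 4 wrong ones
for prediction, actual in [(1, 1)] * 8 + [(1, 0)] * 4:
    monitor.record(prediction, actual)

print(f"Rolling accuracy: {monitor.accuracy():.2%}")
print(f"Retraining needed: {monitor.needs_retraining()}")
```

The `min_samples` guard is a design choice: it avoids triggering retraining on the first few noisy predictions after a deployment.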

## 🎯 Interactive Exercises

**Practice your MLOps skills!**

### Exercise 1: Build Your MLOps Pipeline

**Task:** Create a complete MLOps pipeline for a classification task

**Requirements:**
1. Use MLflow to track 5+ experiments
2. Try different models and hyperparameters
3. Register the best model
4. Create a monitoring dashboard
5. Implement A/B testing

**Dataset suggestions:**
- Iris classification
- Breast cancer detection
- Wine quality prediction

**Bonus:** Add automated retraining triggers!

```python
# YOUR SOLUTION HERE

# TODO: Load dataset
# from sklearn.datasets import load_iris

# TODO: Set up MLflow experiment
# mlflow.set_experiment("my_mlops_pipeline")

# TODO: Train multiple models
# models = [RandomForestClassifier, GradientBoostingClassifier, ...]

# TODO: Track all experiments
# for model in models:
#     with mlflow.start_run():
#         ...

# TODO: Register best model
# mlflow.sklearn.log_model(..., registered_model_name="my_model")

# TODO: Create monitoring dashboard

# TODO: Implement A/B testing

print("Complete the exercise above!")
print("\nHints:")
print("1. Use sklearn.datasets for quick datasets")
print("2. Track at least: params, metrics, model")
print("3. Compare models using MLflow UI")
print("4. Use matplotlib for monitoring visualization")
```

### Exercise 2: Implement Drift Detection

**Task:** Build a drift detection system

**Requirements:**
1. Create reference dataset (training data)
2. Generate current dataset (with artificial drift)
3. Use Evidently to detect drift
4. Create custom drift metrics
5. Set up automated alerts

**Bonus:** Implement automatic retraining when drift detected!

```python
# YOUR SOLUTION HERE

# TODO: Create reference data
# reference_data = ...

# TODO: Simulate drift (add noise, shift distribution)
# current_data = reference_data + noise

# TODO: Use Evidently for drift detection
# from evidently.report import Report
# from evidently.metric_preset import DataDriftPreset

# TODO: Create custom drift metrics
# def calculate_ks_statistic(ref, curr):
#     ...

# TODO: Set alert thresholds
# if drift_detected:
#     send_alert()
#     trigger_retraining()

print("Complete the exercise above!")
print("\nLibraries to explore:")
print("- evidently (drift detection)")
print("- scipy.stats (statistical tests)")
print("- alibi-detect (advanced drift detection)")
```
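
As a starting point for the custom drift metric in requirement 4, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic written with the standard library only. The function name `ks_statistic`, the simulated data, and the `DRIFT_THRESHOLD` value of 0.2 are assumptions for illustration. In practice `scipy.stats.ks_2samp` also returns a p-value, which is usually a better basis for alerts than a fixed cutoff.

```python
from bisect import bisect_right
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed data points
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stands in for training data
same_dist = [rng.gauss(0.0, 1.0) for _ in range(500)]  # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]    # artificial mean shift

DRIFT_THRESHOLD = 0.2  # assumed alert threshold, tune for your data
for name, sample in [("same distribution", same_dist), ("shifted mean", shifted)]:
    stat = ks_statistic(reference, sample)
    print(f"{name}: KS = {stat:.3f}, drift = {stat > DRIFT_THRESHOLD}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), and it is computed per feature, so a multi-feature dataset needs one call per column.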

## 🎉 Key Takeaways

**Congratulations! You've mastered MLOps best practices!**

### 1️⃣ **Experiment Tracking (MLflow)**
   - ✅ Track parameters, metrics, artifacts
   - ✅ Compare 100s of experiments easily
   - ✅ Reproduce any experiment
   - **Use when:** Training any ML model (always!)

### 2️⃣ **Model Versioning**
   - ✅ Version control for models
   - ✅ Stage-based deployment (staging → production)
   - ✅ Easy rollback to previous versions
   - **Use when:** Deploying to production (essential!)
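
To make the rollback idea concrete, here is a minimal in-memory sketch of versioning with promotion and rollback. `InMemoryModelRegistry` and its methods are hypothetical and are not the MLflow Model Registry API. Strings stand in for fitted model objects.

```python
class InMemoryModelRegistry:
    """Hypothetical stand-in for a model registry; not the MLflow API."""

    def __init__(self):
        self.versions = {}            # version number -> model object
        self.production_history = []  # promoted versions, oldest first

    def register(self, model):
        """Store a model and return its new version number."""
        version = len(self.versions) + 1
        self.versions[version] = model
        return version

    def promote(self, version):
        """Mark a registered version as the current production model."""
        if version not in self.versions:
            raise KeyError(f"Unknown version: {version}")
        self.production_history.append(version)

    def rollback(self):
        """Return production to the previously promoted version."""
        if len(self.production_history) < 2:
            raise RuntimeError("No earlier production version to roll back to")
        self.production_history.pop()
        return self.production_history[-1]

    @property
    def production_version(self):
        """Currently promoted version, or None if nothing is in production."""
        return self.production_history[-1] if self.production_history else None


registry = InMemoryModelRegistry()
v1 = registry.register("random_forest_v1")      # placeholder for a fitted model
v2 = registry.register("gradient_boosting_v2")  # placeholder for a fitted model

registry.promote(v1)
registry.promote(v2)
print("Production version:", registry.production_version)

registry.rollback()
print("After rollback:", registry.production_version)
```

Keeping the promotion history, rather than only the current version, is what makes a one-step rollback possible.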

### 3️⃣ **Model Monitoring**
   - ✅ Track performance over time
   - ✅ Detect data and concept drift
   - ✅ Automated alerts for issues
   - **Use when:** Models in production (mandatory!)

### 4️⃣ **A/B Testing**
   - ✅ Validate improvements with real traffic
   - ✅ Statistical significance testing
   - ✅ Safe, gradual rollouts
   - **Use when:** Deploying model updates
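
The significance check can be illustrated without any third-party library. The sketch below uses a two-proportion z-test, which is a different test from the `stats.chi2_contingency` call in the pipeline code. SciPy applies a continuity correction to 2x2 tables by default, so the p-values will not match exactly. The function name and the request counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one success rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # every outcome is identical, so no evidence of a difference
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: correct predictions out of total requests per group
z, p_value = two_proportion_z_test(successes_a=850, n_a=1000,
                                   successes_b=890, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")

if p_value < 0.05 and z > 0:
    print("Group B (new model) is significantly better")
else:
    print("No significant improvement, keep the current model")
```

A positive `z` means group B has the higher success rate. Checking the sign together with the p-value mirrors the `p_value < 0.05 and acc_b > acc_a` decision rule used in the pipeline.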

---

## 🌟 MLOps Maturity Progression

**Your journey:**

```
Level 0: Manual
  ↓
Level 1: Basic tracking (✅ You are here after this lesson!)
  ↓
Level 2: Automated pipelines
  ↓
Level 3: Full MLOps (production-ready)
```

---

## 📊 Production MLOps Stack (2024-2025)

**Recommended tools:**

| Component | Tool | Why |
|-----------|------|-----|
| **Tracking** | MLflow | Industry standard, open-source |
| **Versioning** | MLflow Registry | Integrated with tracking |
| **Monitoring** | Evidently + Grafana | Open-source, powerful |
| **Orchestration** | Airflow / Prefect | Workflow automation |
| **Feature Store** | Feast | Open-source |
| **Model Serving** | FastAPI + Docker | Fast, scalable |
| **CI/CD** | GitHub Actions | Free, integrated |

---

## ✅ MLOps Checklist

**Before deploying to production:**

- [ ] All experiments tracked in MLflow
- [ ] Model versioned in registry
- [ ] Performance monitoring in place
- [ ] Drift detection configured
- [ ] Alerts set up (email/Slack/PagerDuty)
- [ ] A/B testing framework ready
- [ ] Rollback procedure documented
- [ ] Retraining pipeline automated
- [ ] Data versioning (DVC)
- [ ] Model documentation complete

---

## 🚀 Next Steps

**Continue your MLOps journey:**

1. **Day 3: Cloud Deployment**
   - Deploy to AWS, GCP, Azure
   - Serverless ML
   - Hugging Face Spaces
   - Streamlit apps

2. **Advanced MLOps Topics:**
   - Feature stores (Feast)
   - Model serving at scale (KServe)
   - ML pipelines (Kubeflow, Vertex AI)
   - Data versioning (DVC)

3. **Practice:**
   - Build end-to-end MLOps pipeline
   - Contribute to open-source MLOps tools
   - Set up CI/CD for ML project

---

**💬 Final Thoughts:**

*"MLOps is not optional - it's the difference between a demo and a production system. You now have the skills to build reliable, maintainable ML systems that companies actually use. MLflow + monitoring + A/B testing is the minimum viable MLOps stack for 2024-2025. Master these, and you're ready for ML engineering roles!"*

**🎉 Day 2 Complete! Tomorrow: Cloud Deployment! 🚀**

---

**📚 Additional Resources:**
- MLflow Docs: https://mlflow.org
- Evidently AI: https://evidentlyai.com
- Made With ML (MLOps): https://madewithml.com
- ML-Ops.org: https://ml-ops.org
- Google's MLOps Guide: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

**Keep building! 🌟**