# HubSpot ML Framework - Complete Demo
## Training with MLflow + Serving with FastAPI

**What this notebook demonstrates:**
1. Training a customer conversion model
2. Tracking experiments with MLflow
3. Testing the FastAPI prediction service
4. End-to-end ML workflow

---
## 📋 Setup and Initialization

In [None]:
# Cell 1: Environment Setup
import os
import sys
from pathlib import Path

print("🔍 Environment Check")
print("=" * 60)
print(f"Current directory: {os.getcwd()}")
print(f"Python version: {sys.version}")

# If in notebooks/ directory, move to project root
if os.path.basename(os.getcwd()) == 'notebooks':
    os.chdir('..')
    print(f"✓ Changed to project root: {os.getcwd()}")

# Add src to Python path
src_path = Path(os.getcwd()) / 'src'
if src_path.exists():
    sys.path.insert(0, str(src_path))
    print(f"✓ Added to path: {src_path}")

# Verify directory structure
print("\n📁 Directory Check:")
for folder in ['data', 'configs', 'artifacts', 'mlruns']:
    exists = Path(folder).exists()
    status = "✓" if exists else "✗"
    print(f"  {status} {folder}/ exists: {exists}")
print("=" * 60)

In [None]:
# Cell 2: Import Libraries
from ml_framework.training import Trainer
from ml_framework.utils import load_config
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow

# For API testing
import requests
import json
from datetime import datetime

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

In [None]:
# Cell 3: Load Configuration
config = load_config('configs/config.yaml')

print("📋 Configuration Loaded")
print("=" * 60)
print(f"Experiment: {config.experiment.name}")
print(f"Model: {config.model.type}")
print(f"MLflow URI: {config.experiment.mlflow_tracking_uri}")
print(f"Test size: {config.data.test_size}")
print(f"Random seed: {config.reproducibility.seed}")
print("=" * 60)

---
## 🎯 Part 1: Model Training with MLflow

In [None]:
# Cell 4: Train Model
print("🚀 Starting Model Training with MLflow Tracking")
print("=" * 60)

# Initialize trainer
trainer = Trainer(config)

# Run training (automatically logs to MLflow)
results = trainer.train()

print("\n" + "=" * 60)
print("✅ TRAINING COMPLETE!")
print("=" * 60)
print("\n📊 Performance Metrics:")
print(f"  Accuracy:  {results['metrics']['accuracy']:.4f}")
print(f"  Precision: {results['metrics']['precision']:.4f}")
print(f"  Recall:    {results['metrics']['recall']:.4f}")
print(f"  F1 Score:  {results['metrics']['f1']:.4f}")
print(f"\n💾 Model saved to: {results['model_path']}")
print(f"📝 MLflow Run ID: {results['run_id']}")
print("=" * 60)

### 🔬 Explore MLflow Tracking

**To view experiments:**
1. Open a new terminal in the project root
2. Run: `mlflow ui --backend-store-uri ./mlruns`
3. Open http://localhost:5000 in a browser

**What you'll see:**
- All training runs
- Metrics comparison charts
- Hyperparameters
- Saved artifacts
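
The run comparison the UI offers can also be done in code. The sketch below picks the run with the highest value of one metric. It relies only on the `run.data.metrics` and `run.info.run_id` attributes that MLflow run objects expose; the helper name `best_run` and the stand-in runs are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace
from typing import Iterable


def best_run(runs: Iterable, metric: str = "f1"):
    """Return the run with the highest value for `metric`, or None.

    Runs that did not log the metric are skipped rather than treated as 0.
    """
    scored = [run for run in runs if metric in run.data.metrics]
    if not scored:
        return None
    return max(scored, key=lambda run: run.data.metrics[metric])


# Stand-in objects shaped like MLflow runs (hypothetical values, no server needed)
demo_runs = [
    SimpleNamespace(info=SimpleNamespace(run_id="aaa111"),
                    data=SimpleNamespace(metrics={"f1": 0.71, "accuracy": 0.80})),
    SimpleNamespace(info=SimpleNamespace(run_id="bbb222"),
                    data=SimpleNamespace(metrics={"f1": 0.78, "accuracy": 0.79})),
    SimpleNamespace(info=SimpleNamespace(run_id="ccc333"),
                    data=SimpleNamespace(metrics={"accuracy": 0.90})),  # no f1 logged
]

winner = best_run(demo_runs, "f1")
print(winner.info.run_id)  # prints "bbb222"
```

With a real tracking store, the list returned by `MlflowClient.search_runs` could be passed in place of `demo_runs`.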

In [None]:
# Cell 5: View MLflow Runs Programmatically
from mlflow.tracking import MlflowClient

# Point the client at the tracking URI named in the config
client = MlflowClient(tracking_uri=config.experiment.mlflow_tracking_uri)
experiment = client.get_experiment_by_name(config.experiment.name)

if experiment:
    # search_runs takes a list of experiment IDs; results are newest first by default
    runs = client.search_runs([experiment.experiment_id])

    print(f"📊 MLflow Experiment: {config.experiment.name}")
    print("=" * 80)
    print(f"Total runs: {len(runs)}\n")

    for run in runs[:5]:  # Show the 5 most recent runs
        print(f"Run ID: {run.info.run_id[:8]}...")
        print(f"  Status: {run.info.status}")
        print(f"  Start: {datetime.fromtimestamp(run.info.start_time / 1000)}")
        print("  Metrics:")
        for key, value in run.data.metrics.items():
            print(f"    {key}: {value:.4f}")
        print("-" * 80)
else:
    print("⚠️ No experiment found. Run training first!")

In [None]:
# Cell 6: Visualize Metrics Across Runs
if experiment and len(runs) > 0:
    # Extract metrics from all runs (NaN marks a metric that a run did not log,
    # so it is left out of the chart and summary instead of counting as 0)
    metrics_data = []
    for run in runs:
        metrics_data.append({
            'run_id': run.info.run_id[:8],
            'accuracy': run.data.metrics.get('accuracy', np.nan),
            'precision': run.data.metrics.get('precision', np.nan),
            'recall': run.data.metrics.get('recall', np.nan),
            'f1': run.data.metrics.get('f1', np.nan)
        })
    
    df_metrics = pd.DataFrame(metrics_data)
    
    # Plot comparison
    fig, ax = plt.subplots(figsize=(12, 6))
    df_metrics.set_index('run_id')[['accuracy', 'precision', 'recall', 'f1']].plot(
        kind='bar', ax=ax, width=0.8
    )
    ax.set_title('Model Performance Across Runs', fontsize=14, fontweight='bold')
    ax.set_xlabel('Run ID', fontsize=12)
    ax.set_ylabel('Score', fontsize=12)
    ax.set_ylim([0, 1])
    ax.legend(title='Metrics')
    ax.grid(axis='y', alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    print("📈 Metrics Summary:")
    print(df_metrics.describe())
else:
    print("⚠️ No runs to visualize")

---
## 🚀 Part 2: FastAPI Testing

**Before running these cells:**
1. Open a NEW terminal
2. Run: `python run_api.py`
3. Server starts on http://localhost:8000
4. View docs: http://localhost:8000/docs
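
Because the server is started by hand in another terminal, it may not be listening yet when the next cell runs. A minimal sketch of polling `/health` until it answers, using only the standard library. The helper names `wait_for_api` and `_http_status` and the retry defaults are assumptions for illustration, not part of the framework.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def _http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_for_api(url: str,
                 attempts: int = 10,
                 delay: float = 1.0,
                 get_status: Callable[[str], int] = _http_status,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False


# Example (requires the API from this notebook to be running):
# ready = wait_for_api("http://localhost:8000/health")
```

The `get_status` and `sleep` parameters exist so the retry logic can be exercised without a live server.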

In [None]:
# Cell 7: Check API Health
API_URL = "http://localhost:8000"

try:
    response = requests.get(f"{API_URL}/health", timeout=5)
    if response.status_code == 200:
        health_data = response.json()
        if health_data.get('model_loaded') is False:
            print("⚠️ API is reachable, but it reports that no model is loaded.")
        else:
            print("✅ API is healthy and ready for predictions!")
        print("=" * 60)
        print(f"Status: {health_data.get('status')}")
        print(f"Model Loaded: {health_data.get('model_loaded')}")
        if 'model_version' in health_data:
            print(f"Model Version: {health_data.get('model_version')}")
        print("=" * 60)
    else:
        print(f"⚠️ API returned status code: {response.status_code}")
except requests.exceptions.ConnectionError:
    print("❌ Cannot connect to API!")
    print("\n📝 To start the API:")
    print("1. Open a new terminal")
    print("2. cd to project root")
    print("3. Run: python run_api.py")
    print("4. Then re-run this cell")
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Cell 8: Single Company Prediction
print("🎯 Testing Single Prediction")
print("=" * 60)

# Example company data
company_data = {
    "id": 123,
    "ALEXA_RANK": 50000,
    "EMPLOYEE_RANGE": "26 to 50",
    "INDUSTRY": "COMPUTER_SOFTWARE",
    "total_actions": 150,
    "total_users": 5,
    "days_active": 30,
    "activity_frequency": 5.0
}

try:
    response = requests.post(
        f"{API_URL}/predict/single",
        json=company_data,
        timeout=10
    )
    
    if response.status_code == 200:
        result = response.json()
        
        print(f"Company ID: {result['company_id']}")
        print("\n📊 Prediction Result:")
        
        # Keep a plain-text label for the plot title; emoji may not render in matplotlib fonts
        label_text = "CUSTOMER" if result['prediction'] == 1 else "NON-CUSTOMER"
        prediction_label = f"🟢 {label_text}" if result['prediction'] == 1 else f"🔴 {label_text}"
        print(f"  {prediction_label}")
        print(f"  Probability: {result['conversion_probability']:.2%}")
        print(f"  Confidence: {result['confidence'].upper()}")
        print("=" * 60)
        
        # Visualize
        fig, ax = plt.subplots(figsize=(8, 2))
        prob = result['conversion_probability']
        ax.barh(['Conversion Probability'], [prob], color='green' if prob > 0.5 else 'red')
        ax.set_xlim([0, 1])
        ax.set_xlabel('Probability', fontsize=12)
        ax.set_title(f'Prediction: {label_text}', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
    else:
        print(f"❌ Error: {response.status_code}")
        print(response.text)

except requests.exceptions.ConnectionError:
    print("❌ API not running! Start it with: python run_api.py")
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Cell 9: Batch Predictions
print("📦 Testing Batch Predictions")
print("=" * 60)

# Multiple companies
companies_batch = [
    {
        "id": 101,
        "ALEXA_RANK": 10000,
        "EMPLOYEE_RANGE": "51 to 100",
        "INDUSTRY": "COMPUTER_SOFTWARE",
        "total_actions": 300,
        "total_users": 15,
        "days_active": 45,
        "activity_frequency": 6.7
    },
    {
        "id": 102,
        "ALEXA_RANK": 500000,
        "EMPLOYEE_RANGE": "1 to 10",
        "INDUSTRY": "RETAIL",
        "total_actions": 10,
        "total_users": 1,
        "days_active": 5,
        "activity_frequency": 2.0
    },
    {
        "id": 103,
        "ALEXA_RANK": 75000,
        "EMPLOYEE_RANGE": "26 to 50",
        "INDUSTRY": "INTERNET",
        "total_actions": 180,
        "total_users": 8,
        "days_active": 35,
        "activity_frequency": 5.1
    }
]

try:
    response = requests.post(
        f"{API_URL}/predict/batch",
        json=companies_batch,
        timeout=10
    )
    
    if response.status_code == 200:
        results = response.json()
        
        print(f"Processed {len(results)} companies:\n")
        
        for result in results:
            prediction_label = "✅ Customer" if result['prediction'] == 1 else "❌ Non-Customer"
            print(f"Company {result['company_id']}: {prediction_label} ({result['conversion_probability']:.1%})")
        
        print("=" * 60)
        
        # Visualize batch results
        df_results = pd.DataFrame(results)
        df_results['label'] = df_results['prediction'].apply(
            lambda x: 'Customer' if x == 1 else 'Non-Customer'
        )
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # Probability distribution
        colors = ['green' if p > 0.5 else 'red' for p in df_results['conversion_probability']]
        ax1.bar(df_results['company_id'], df_results['conversion_probability'], color=colors)
        ax1.axhline(y=0.5, color='black', linestyle='--', label='Decision Boundary')
        ax1.set_xlabel('Company ID', fontsize=12)
        ax1.set_ylabel('Conversion Probability', fontsize=12)
        ax1.set_title('Batch Prediction Probabilities', fontsize=14, fontweight='bold')
        ax1.legend()
        ax1.grid(axis='y', alpha=0.3)
        
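        # Prediction distribution (colors are mapped by label, because
        # value_counts() orders slices by count, not by class)
        prediction_counts = df_results['label'].value_counts()
        label_colors = {'Customer': 'green', 'Non-Customer': 'red'}
        ax2.pie(prediction_counts, labels=prediction_counts.index, autopct='%1.1f%%',
                colors=[label_colors[label] for label in prediction_counts.index], startangle=90)
        ax2.set_title('Prediction Distribution', fontsize=14, fontweight='bold')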
        
        plt.tight_layout()
        plt.show()
    else:
        print(f"❌ Error: {response.status_code}")
        print(response.text)

except requests.exceptions.ConnectionError:
    print("❌ API not running! Start it with: python run_api.py")
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Cell 10: Top Conversion Prospects
print("🎯 Finding Top Conversion Prospects")
print("=" * 60)

# Generate sample companies
np.random.seed(42)
sample_companies = []

industries = ["COMPUTER_SOFTWARE", "INTERNET", "RETAIL", "MARKETING"]
employee_ranges = ["1 to 10", "11 to 25", "26 to 50", "51 to 100"]

for i in range(10):
    sample_companies.append({
        "id": 1000 + i,
        "ALEXA_RANK": np.random.randint(10000, 200000),
        "EMPLOYEE_RANGE": np.random.choice(employee_ranges),
        "INDUSTRY": np.random.choice(industries),
        "total_actions": np.random.randint(50, 400),
        "total_users": np.random.randint(2, 20),
        "days_active": np.random.randint(10, 60),
        "activity_frequency": np.random.uniform(2.0, 8.0)
    })

try:
    response = requests.post(
        f"{API_URL}/predict/batch",
        json=sample_companies,
        timeout=10
    )
    
    if response.status_code == 200:
        results = response.json()
        
        # Sort by probability
        df_prospects = pd.DataFrame(results)
        df_prospects = df_prospects.sort_values('conversion_probability', ascending=False)
        
        print("\n🏆 Top 5 Conversion Prospects:\n")
        for rank, (_, row) in enumerate(df_prospects.head(5).iterrows(), start=1):
            print(f"#{rank}. Company {row['company_id']}")
            print(f"   Probability: {row['conversion_probability']:.1%}")
            print(f"   Confidence: {row['confidence'].upper()}")
            print()
        
        # Visualize top prospects
        fig, ax = plt.subplots(figsize=(10, 6))
        top_10 = df_prospects.head(10)
        colors = ['green' if x >= 0.7 else 'orange' if x >= 0.5 else 'red' 
                  for x in top_10['conversion_probability']]
        
        ax.barh(top_10['company_id'].astype(str), top_10['conversion_probability'], color=colors)
        ax.set_xlabel('Conversion Probability', fontsize=12)
        ax.set_ylabel('Company ID', fontsize=12)
        ax.set_title('Top 10 Conversion Prospects', fontsize=14, fontweight='bold')
        ax.set_xlim([0, 1])
        ax.axvline(x=0.5, color='black', linestyle='--', label='50% Threshold')
        ax.legend()
        ax.grid(axis='x', alpha=0.3)
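        plt.tight_layout()
        plt.show()
    else:
        # Report non-200 responses instead of failing silently
        print(f"❌ Error: {response.status_code}")
        print(response.text)

except requests.exceptions.ConnectionError:
    print("❌ API not running! Start it with: python run_api.py")
except Exception as e:
    print(f"❌ Error: {e}")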

---
## 📚 API Documentation & Testing

**Interactive API Documentation:**
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

**Available Endpoints:**
- `GET /health` - Check API health
- `POST /predict/single` - Single prediction
- `POST /predict/batch` - Batch predictions
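
Both prediction endpoints take the company fields used in the example payloads of this notebook. A minimal sketch of checking a payload on the client side before posting it. The field names are copied from those examples; the expected types are inferred from the example values and are an assumption, not the server's actual schema. The helper name `validate_company` is hypothetical.

```python
from numbers import Real
from typing import List

# Field names come from this notebook's example payloads; the types are
# inferred from those examples (assumption) and may not match the server.
EXPECTED_FIELDS = {
    "id": int,
    "ALEXA_RANK": int,
    "EMPLOYEE_RANGE": str,
    "INDUSTRY": str,
    "total_actions": int,
    "total_users": int,
    "days_active": int,
    "activity_frequency": Real,  # accepts both int and float
}


def validate_company(payload: dict) -> List[str]:
    """Return a list of problems found in `payload` (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


example = {
    "id": 123, "ALEXA_RANK": 50000, "EMPLOYEE_RANGE": "26 to 50",
    "INDUSTRY": "COMPUTER_SOFTWARE", "total_actions": 150,
    "total_users": 5, "days_active": 30, "activity_frequency": 5.0,
}
print(validate_company(example))  # prints []
```

A check like this reports a malformed record locally instead of relying on the error body of a failed request.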

In [None]:
# Cell 11: View API Info
try:
    response = requests.get(f"{API_URL}/docs", timeout=5)
    if response.status_code == 200:
        print("✅ API Documentation Available")
        print("=" * 60)
        print("\n📖 Interactive Documentation:")
        print(f"   Swagger UI: {API_URL}/docs")
        print(f"   ReDoc:      {API_URL}/redoc")
        print("\n🔌 Available Endpoints:")
        print(f"   GET  {API_URL}/health")
        print(f"   POST {API_URL}/predict/single")
        print(f"   POST {API_URL}/predict/batch")
        print("=" * 60)
        print("\n💡 Tip: Open the Swagger UI in your browser to test interactively!")
    else:
        print(f"⚠️ Unexpected status code: {response.status_code}")
except requests.exceptions.ConnectionError:
    print("❌ API not running")
except Exception as e:
    print(f"❌ Error: {e}")

---
## 🎓 Summary

### What We Accomplished:

1. **✅ Trained a customer conversion model**
   - Random Forest classifier
   - Feature engineering from company and activity data
   - Train/test split for evaluation

2. **✅ Tracked experiments with MLflow**
   - Logged hyperparameters
   - Logged metrics (accuracy, precision, recall, F1)
   - Saved model artifacts
   - Compared runs visually

3. **✅ Served predictions via FastAPI**
   - Health check endpoint
   - Single prediction endpoint
   - Batch prediction endpoint
   - Interactive API documentation
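
The four metrics logged to MLflow (accuracy, precision, recall, F1) all derive from the counts of a binary confusion matrix. A minimal sketch, assuming the standard binary definitions. The framework's `Trainer` may compute them differently (for example with another averaging mode), and the function name `classification_metrics` and the counts are made up for illustration.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives, and
    true negatives. Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts: 100 companies, 60 of them actual customers
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.7000
# precision: 0.8000
# recall: 0.6667
# f1: 0.7273
```

Here the model finds 40 of the 60 real customers (recall 0.67), and 40 of its 50 positive calls are correct (precision 0.80).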

### Complete Workflow:

```
Data (CSV) → Preprocessing → Training → MLflow Tracking
                                ↓
                          Save Artifacts
                                ↓
                          FastAPI Loads Model
                                ↓
                          REST API Predictions
```

### Next Steps:

- 🔄 Experiment with different models (XGBoost, LightGBM)
- 🎯 Tune hyperparameters
- 📊 Add more features
- 🚀 Deploy to production
- 📈 Add monitoring and logging

---
**🎉 Congratulations! You've built a complete ML system with MLflow tracking and FastAPI serving!**