# ML Model Training - Revenue Forecasting & Churn Prediction

Interactive machine learning training notebook for the Sales Analytics Platform.

This notebook provides:
- Revenue forecasting using time series ML models
- Churn prediction using classification algorithms (covered in the next notebook)
- Model training and evaluation (hyperparameter tuning is listed under Next Steps)
- Feature importance analysis
- Model performance metrics and visualization
- Model export for production deployment

Designed for data scientists and ML engineers.

## 1. Import Libraries & Setup

In [None]:
# Data science libraries
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Add project to path
sys.path.append(str(Path('../src').resolve()))

print("Libraries imported successfully")

## 2. Load & Prepare Data for ML

In [None]:
# Load sales data
df = pd.read_csv('../data/sales_data.csv', parse_dates=['date'])
print(f"Loaded {len(df):,} transactions")

# Aggregate daily revenue for forecasting
daily_data = df.groupby('date').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'order_id': 'nunique'
}).reset_index()

daily_data = daily_data.sort_values('date')
print(f"Daily data: {len(daily_data)} days")

# Preview
daily_data.head()
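
The lag features in the next section shift by rows, so a lag of 7 only means "seven days ago" when every calendar day has a row. Below is a minimal sketch of one way to check for and fill gaps. The `toy` frame is made up for illustration, and treating a missing day as zero revenue is an assumption that may not suit every dataset.

```python
import pandas as pd

# Hypothetical daily revenue with 2024-01-03 missing (no orders that day).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
    "revenue": [100.0, 120.0, 90.0],
})

# Build the full calendar range and find dates that have no row.
full_range = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")
missing = full_range.difference(pd.DatetimeIndex(toy["date"]))
print(f"Missing days: {list(missing.strftime('%Y-%m-%d'))}")

# Reindex to one row per day; zero-filling is an assumption, not a rule.
filled = (
    toy.set_index("date")
       .reindex(full_range, fill_value=0.0)
       .rename_axis("date")
       .reset_index()
)
print(filled)
```

If gaps exist in the real data, applying the same reindexing to `daily_data` before building lag features would keep row offsets aligned with calendar days.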

## 3. Feature Engineering for Revenue Forecasting

In [None]:
# Time-based features
daily_data['day_of_week'] = daily_data['date'].dt.dayofweek
daily_data['day_of_month'] = daily_data['date'].dt.day
daily_data['month'] = daily_data['date'].dt.month
daily_data['quarter'] = daily_data['date'].dt.quarter
daily_data['is_weekend'] = (daily_data['day_of_week'] >= 5).astype(int)

# Lag features (past revenue)
for lag in [1, 7, 14, 30]:
    daily_data[f'revenue_lag_{lag}'] = daily_data['revenue'].shift(lag)

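# Rolling window features, computed on revenue shifted by one day so each window
# uses past days only (an unshifted window would include the same-day target)
past_revenue = daily_data['revenue'].shift(1)
for window in [7, 14, 30]:
    daily_data[f'revenue_roll_mean_{window}'] = past_revenue.rolling(window=window).mean()
    daily_data[f'revenue_roll_std_{window}'] = past_revenue.rolling(window=window).std()

# Growth rates, also based on past revenue to avoid leaking the target
daily_data['revenue_growth_7d'] = past_revenue.pct_change(7)
daily_data['revenue_growth_30d'] = past_revenue.pct_change(30)

# Drop rows with NaN (lag/rolling warm-up) or inf (growth measured from a zero-revenue day)
daily_data = daily_data.replace([np.inf, -np.inf], np.nan).dropna()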

print(f"Features engineered! Shape: {daily_data.shape}")
print(f"Features: {list(daily_data.columns)}")
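
One thing worth verifying in features like these is whether a rolling window includes the current row, because the current row's revenue is also the prediction target. The toy sketch below (values are made up) compares a rolling mean with and without a one-step shift.

```python
import pandas as pd

revenue = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Window ends at the current row, so it includes the value being predicted.
roll_incl = revenue.rolling(window=3).mean()

# Shifting first makes the window end at the previous row (past data only).
roll_past = revenue.shift(1).rolling(window=3).mean()

print(pd.DataFrame({
    "revenue": revenue,
    "incl_current": roll_incl,
    "past_only": roll_past,
}))
```

In the last row, `incl_current` averages 30, 40, and 50 (including the target itself), while `past_only` averages 20, 30, and 40.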

## 4. Train Revenue Forecasting Models

In [None]:
# Prepare features and target
feature_cols = [col for col in daily_data.columns if col not in ['date', 'revenue', 'quantity', 'order_id']]
X = daily_data[feature_cols]
y = daily_data['revenue']

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(f"📊 Training set: {len(X_train)}, Test set: {len(X_test)}")

# Train multiple models
models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
    'Ridge Regression': Ridge(alpha=1.0)
}

results = {}

for name, model in models.items():
    print(f"\n🤖 Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {'model': model, 'rmse': rmse, 'mae': mae, 'r2': r2, 'predictions': y_pred}
    
    print(f"  RMSE: ${rmse:,.2f}")
    print(f"  MAE: ${mae:,.2f}")
    print(f"  R²: {r2:.4f}")

print("\n✅ All models trained!")
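
To put the scores above in context, it can help to compare them against a naive forecast such as "same as seven days ago". The sketch below uses synthetic data with a weekly pattern; the series and the 80/20 split point are assumptions for illustration. In this notebook the equivalent check would be scoring `X_test['revenue_lag_7']` against `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic daily revenue: weekly seasonality plus noise (illustrative only).
rng = np.random.default_rng(42)
days = np.arange(200)
revenue = pd.Series(
    1000 + 200 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 30, size=200)
)

# Naive forecast: revenue from seven days earlier.
naive_pred = revenue.shift(7)

# Score on the last 20% of days, mirroring a chronological split.
split = int(len(revenue) * 0.8)
y_true = revenue.iloc[split:]
y_naive = naive_pred.iloc[split:]

baseline = {
    "rmse": float(np.sqrt(mean_squared_error(y_true, y_naive))),
    "mae": float(mean_absolute_error(y_true, y_naive)),
    "r2": float(r2_score(y_true, y_naive)),
}
print(baseline)
```

A trained model that does not clearly beat this kind of baseline is probably not adding much beyond the weekly pattern.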

## 5. Visualize Model Performance

In [None]:
# Plot predictions vs actual for all models
test_dates = daily_data.iloc[-len(y_test):]['date']

fig = go.Figure()

# Actual values
fig.add_trace(go.Scatter(
    x=test_dates, y=y_test,
    mode='lines+markers',
    name='Actual Revenue',
    line=dict(color='black', width=3)
))

# Predictions from each model
colors = {'Random Forest': '#2ecc71', 'Gradient Boosting': '#e74c3c', 'Ridge Regression': '#3498db'}
for name, result in results.items():
    fig.add_trace(go.Scatter(
        x=test_dates, y=result['predictions'],
        mode='lines',
        name=f'{name} (R²={result["r2"]:.3f})',
        line=dict(color=colors[name], width=2, dash='dash')
    ))

fig.update_layout(
    title="📈 Revenue Forecasting: Actual vs Predicted",
    xaxis_title="Date",
    yaxis_title="Revenue ($)",
    height=500,
    hovermode='x unified'
)

fig.show()

# Model comparison
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'RMSE': [results[m]['rmse'] for m in results.keys()],
    'MAE': [results[m]['mae'] for m in results.keys()],
    'R²': [results[m]['r2'] for m in results.keys()]
})

print("\n📊 Model Comparison:")
print(comparison_df.to_string(index=False))

## 6. Feature Importance Analysis

In [None]:
# Feature importance from Random Forest
rf_model = results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)

fig = px.bar(
    feature_importance, 
    x='importance', 
    y='feature', 
    orientation='h',
    title="🎯 Top 15 Most Important Features (Random Forest)",
    labels={'importance': 'Importance Score', 'feature': 'Feature'}
)
fig.update_layout(height=500, yaxis={'categoryorder':'total ascending'})
fig.show()

print("📊 Top 5 Features:")
print(feature_importance.head().to_string(index=False))

## 7. Summary & Next Steps

### Model Performance Summary
- Random Forest typically performs best, with R² above 0.85 (check the comparison table in section 5 for your data)
- Gradient Boosting provides competitive accuracy
- Ridge Regression serves as a simple baseline

### Key Insights
- Lag features (past revenue) are the most predictive
- Rolling window statistics capture trends effectively
- Time-based features help capture seasonality
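
Impurity-based importances from a Random Forest can overstate some features, so the insights above are worth cross-checking on held-out data. Below is a hedged sketch using scikit-learn's `permutation_importance` on synthetic data. The feature names and data are made up; in this notebook you would pass `rf_model`, `X_test`, and `y_test` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 5.0 * X_demo["informative"] + rng.normal(scale=0.1, size=300)

# Fit on the first 240 rows and hold out the last 60.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo.iloc[:240], y_demo.iloc[:240])

# Shuffle each feature on the held-out rows and measure the drop in score.
result = permutation_importance(
    model, X_demo.iloc[240:], y_demo.iloc[240:], n_repeats=5, random_state=42
)
perm = pd.Series(result.importances_mean, index=X_demo.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```

Features whose permutation importance is near zero on held-out data contribute little to predictions, whatever their impurity-based score.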

### Next Steps
1. **Hyperparameter Tuning**: Use GridSearchCV with a time-ordered splitter (for example, TimeSeriesSplit) to search for better parameters
2. **Production Deployment**: Export best model to models directory
3. **Monitoring**: Track model performance over time
4. **Retraining**: Schedule periodic model retraining with new data
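
As a starting point for the tuning step, here is a minimal sketch of a grid search that respects time order by using `TimeSeriesSplit`, so each fold trains on earlier rows and validates on later ones. The data and the parameter grid are illustrative assumptions; in this notebook you would fit on `X_train` and `y_train`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for chronologically ordered training data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),          # folds never train on future rows
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

The default K-fold splitter would mix past and future rows across folds, which tends to give optimistic scores for forecasting tasks.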

### Export Best Model
```python
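import pickle
# Export the model with the lowest test RMSE rather than assuming Random Forest wins
best_name = min(results, key=lambda name: results[name]['rmse'])
best_model = results[best_name]['model']
Path('../models').mkdir(parents=True, exist_ok=True)  # Path is imported in section 1
with open('../models/revenue_forecaster.pkl', 'wb') as f:
    pickle.dump(best_model, f)
print(f"{best_name} exported to ../models/revenue_forecaster.pkl")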
```
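
A quick round-trip check can confirm that the exported file loads and predicts as expected. The sketch below uses a temporary directory and a small stand-in model, both assumptions so the example runs on its own. Note that `pickle.load` should only be used on files you trust.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge

# Stand-in model trained on toy data (not the notebook's forecaster).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 2.0, 4.0, 6.0])
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "revenue_forecaster.pkl"
    # Write the model, then read it back from disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = bool(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
print(f"Predictions match after reload: {same}")
```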

---

**Next Notebook:** Churn Prediction & Customer Scoring