# Customer Churn Prediction - Model Training

This notebook demonstrates the complete model training workflow for the Customer Churn Prediction System.

## Overview

We'll walk through:
1. Loading and validating training data
2. Data preprocessing and feature engineering
3. Training an XGBoost model
4. Evaluating model performance
5. Saving the model with versioning

## Setup

First, let's import the required libraries and services.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Import our custom services
from services.preprocessing import DataValidator, DataTransformer, FeatureEngineer
from services.model_training import ModelTrainer
from services.model_repository import ModelRepository

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

print("✓ All imports successful")

## Step 1: Load and Explore Data

Let's load the training and test datasets and examine their structure.

In [None]:
# Load data
train_df = pd.read_csv('../data/raw/training_data.csv')
test_df = pd.read_csv('../data/raw/test_data.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"\nTraining data churn rate: {train_df['churn'].mean():.2%}")
print(f"Test data churn rate: {test_df['churn'].mean():.2%}")

In [None]:
# Display first few rows
train_df.head()

In [None]:
# Check data types and missing values
print("Data Info:")
train_df.info()  # info() prints its report directly and returns None
print(f"\nMissing values:\n{train_df.isnull().sum()}")

In [None]:
# Visualize churn distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Churn distribution (sort_index keeps class 0 first so colors and tick labels match)
train_df['churn'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Churn Distribution')
axes[0].set_xlabel('Churn (0=No, 1=Yes)')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No Churn', 'Churn'], rotation=0)

# Monthly charges distribution by churn
train_df.boxplot(column='monthly_charges', by='churn', ax=axes[1])
fig.suptitle('')  # drop the automatic "Boxplot grouped by churn" super-title
axes[1].set_title('Monthly Charges by Churn Status')
axes[1].set_xlabel('Churn (0=No, 1=Yes)')
axes[1].set_ylabel('Monthly Charges')

plt.tight_layout()
plt.show()

## Step 2: Data Validation

Validate that the data meets our requirements.

In [None]:
# Initialize validator
validator = DataValidator()

# Validate training data
try:
    validator.validate(train_df)
    print("✓ Training data validation passed")
except Exception as e:
    print(f"✗ Training data validation failed: {e}")

# Validate test data
try:
    validator.validate(test_df)
    print("✓ Test data validation passed")
except Exception as e:
    print(f"✗ Test data validation failed: {e}")
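`DataValidator` is a project service whose rules are not shown in this notebook. As a rough illustration of the kind of checks such a validator might run, here is a minimal sketch in plain Python over row dictionaries (for example, the output of `train_df.to_dict('records')`). The function name `validate_records`, the required-column set, and the specific rules are assumptions for illustration, not the service's actual behavior.

```python
from __future__ import annotations

from typing import Any

# Hypothetical schema: only 'churn' and 'monthly_charges' appear in this
# notebook; the real DataValidator may require other columns.
REQUIRED_COLUMNS = {"churn", "monthly_charges"}


def validate_records(records: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not records:
        return ["dataset is empty"]
    problems: list[str] = []
    for i, row in enumerate(records):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue  # the remaining checks need these columns
        if row["churn"] not in (0, 1):
            problems.append(f"row {i}: churn must be 0 or 1, got {row['churn']!r}")
        charges = row["monthly_charges"]
        if not isinstance(charges, (int, float)) or charges < 0:
            problems.append(f"row {i}: invalid monthly_charges {charges!r}")
    return problems


good = [{"churn": 0, "monthly_charges": 29.9}, {"churn": 1, "monthly_charges": 80.0}]
bad = [{"churn": 2, "monthly_charges": -5.0}, {"monthly_charges": 10.0}]
print(validate_records(good))
print(validate_records(bad))
```

Collecting every problem instead of raising on the first one makes it easier to fix a bad file in a single pass.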

## Step 3: Feature Engineering

Create derived features to improve model performance.

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer()

# Engineer features for training data
train_engineered = engineer.engineer_features(train_df.copy())
test_engineered = engineer.engineer_features(test_df.copy())

print(f"Original features: {train_df.shape[1]}")
print(f"After feature engineering: {train_engineered.shape[1]}")
print(f"\nNew features added: {train_engineered.shape[1] - train_df.shape[1]}")

In [None]:
# Display engineered features
new_columns = [col for col in train_engineered.columns if col not in train_df.columns]
print(f"New features: {new_columns}")
train_engineered[new_columns].head()

## Step 4: Data Preprocessing

Transform the data for model training (encoding, scaling, etc.).

In [None]:
# Separate features and target
X_train = train_engineered.drop('churn', axis=1)
y_train = train_engineered['churn']
X_test = test_engineered.drop('churn', axis=1)
y_test = test_engineered['churn']

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Initialize and fit transformer
transformer = DataTransformer()
X_train_transformed = transformer.fit_transform(X_train, y_train)
X_test_transformed = transformer.transform(X_test)

print(f"Transformed training data shape: {X_train_transformed.shape}")
print(f"Transformed test data shape: {X_test_transformed.shape}")
print("\n✓ Data preprocessing complete")
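The transformer is fitted on the training split only and then applied unchanged to the test split, so test-set statistics never leak into preprocessing. The internals of `DataTransformer` are not shown here. The sketch below illustrates the fit/transform split with a hypothetical `StandardScalerSketch` class that standardizes a single numeric column; the values are made up.

```python
from __future__ import annotations

from statistics import fmean, pstdev


class StandardScalerSketch:
    """Hypothetical single-column standardizer illustrating fit vs. transform."""

    def fit(self, values: list[float]) -> "StandardScalerSketch":
        # Statistics are learned from the training values only.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero for constant columns
        return self

    def transform(self, values: list[float]) -> list[float]:
        # Reuse the stored training statistics; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]


train_charges = [20.0, 40.0, 60.0, 80.0]
test_charges = [50.0, 100.0]

scaler = StandardScalerSketch().fit(train_charges)
train_scaled = scaler.transform(train_charges)
test_scaled = scaler.transform(test_charges)
print(train_scaled)
print(test_scaled)
```

The scaled test values are not centered on zero, which is expected: they are expressed relative to the training distribution.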

## Step 5: Train XGBoost Model

Train the churn prediction model. The hyperparameters come from `ModelTrainer`; none are set in this notebook.

In [None]:
# Initialize trainer
trainer = ModelTrainer()

# Train model
print("Training XGBoost model...")
model = trainer.train(X_train_transformed, y_train)
print("✓ Model training complete")

## Step 6: Evaluate Model Performance

Assess the model's performance on the test set.

In [None]:
# Evaluate on test set
metrics = trainer.evaluate(model, X_test_transformed, y_test)

print("Model Performance Metrics:")
print(f"  Precision: {metrics['precision']:.4f}")
print(f"  Recall: {metrics['recall']:.4f}")
print(f"  F1-Score: {metrics['f1_score']:.4f}")
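accuracy = metrics.get('accuracy')  # optional key; format it like the other metrics when present
print(f"  Accuracy: {accuracy:.4f}" if accuracy is not None else "  Accuracy: N/A")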

# Check if recall meets threshold
if metrics['recall'] >= 0.85:
    print("\n✓ Model meets recall threshold (>= 85%)")
else:
    print(f"\n⚠ Model recall ({metrics['recall']:.2%}) is below 85% threshold")

In [None]:
# Generate predictions for confusion matrix
y_pred = model.predict(X_test_transformed)
y_pred_proba = model.predict_proba(X_test_transformed)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

print(f"\nTrue Negatives: {cm[0, 0]}")
print(f"False Positives: {cm[0, 1]}")
print(f"False Negatives: {cm[1, 0]}")
print(f"True Positives: {cm[1, 1]}")
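The four confusion-matrix cells determine the headline metrics reported earlier. Assuming `trainer.evaluate` uses the standard definitions for the positive (churn) class, which this notebook does not confirm, the metrics can be recomputed by hand. The counts in the sketch are made-up illustration values, not results from this model.

```python
from __future__ import annotations


def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard binary-classification metrics for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only (same layout as cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]).
example = metrics_from_counts(tn=80, fp=10, fn=2, tp=8)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these example counts, accuracy is high while recall is 0.80, which would fall below the 85% recall threshold used in this notebook. That gap is why recall, not accuracy, is the gating metric for an imbalanced churn problem.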

In [None]:
# Classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))

In [None]:
# Predicted-probability distribution and top feature importances
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of predicted probabilities
axes[0].hist(y_pred_proba[y_test == 0], bins=30, alpha=0.5, label='No Churn', color='green')
axes[0].hist(y_pred_proba[y_test == 1], bins=30, alpha=0.5, label='Churn', color='red')
axes[0].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[0].set_xlabel('Predicted Probability')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Predicted Probabilities')
axes[0].legend()

# Feature importance
feature_importance = model.feature_importances_
feature_names = transformer.get_feature_names()
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False).head(10)

axes[1].barh(importance_df['feature'], importance_df['importance'])
axes[1].set_xlabel('Importance')
axes[1].set_title('Top 10 Feature Importances')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
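The dashed line in the histogram marks the default 0.5 decision threshold. If recall at that cut-off falls short of the 85% target, one option is to lower the threshold. The sketch below is a minimal illustration with made-up labels and scores: the hypothetical helper `highest_threshold_for_recall` scans candidate thresholds from high to low and returns the first one whose recall reaches the target.

```python
from __future__ import annotations


def recall_at(threshold: float, y_true: list[int], scores: list[float]) -> float:
    """Recall of the positive class when predicting 1 for scores >= threshold."""
    positives = sum(1 for y in y_true if y == 1)
    true_positives = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return true_positives / positives if positives else 0.0


def highest_threshold_for_recall(
    y_true: list[int], scores: list[float], target: float = 0.85
) -> float | None:
    """Return the largest observed score usable as a threshold that meets the target."""
    # Recall can only grow as the threshold drops, so the first hit is the highest.
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(threshold, y_true, scores) >= target:
            return threshold
    return None  # only reached when there are no positive examples


# Made-up example data: 7 churners, 3 non-churners.
y_true = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
scores = [0.05, 0.20, 0.30, 0.40, 0.45, 0.55, 0.65, 0.70, 0.80, 0.95]

print(f"recall at 0.5: {recall_at(0.5, y_true, scores):.3f}")
print(f"chosen threshold: {highest_threshold_for_recall(y_true, scores)}")
```

Choosing the threshold on the same test set used for the final report makes that report optimistic, so a separate validation split is the usual place to tune it.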

## Step 7: Save Model with Versioning

Save the trained model and transformer to the model repository.

In [None]:
# Initialize repository
repo = ModelRepository()

# Save model with metadata
version = repo.save(model, transformer, metrics)

print("✓ Model saved successfully")
print(f"  Version: {version}")
print(f"  Recall: {metrics['recall']:.4f}")
print(f"  Precision: {metrics['precision']:.4f}")
print(f"  F1-Score: {metrics['f1_score']:.4f}")

In [None]:
# List all available versions
versions = repo.list_versions()

print(f"\nTotal model versions: {len(versions)}")
print("\nAvailable versions:")
for v in versions[-5:]:  # Show last 5 versions
    print(f"  {v.version}")
    print(f"    Recall: {v.metadata.get('recall', 'N/A')}")
    print(f"    Timestamp: {v.metadata.get('timestamp', 'N/A')}")
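`ModelRepository` handles versioning internally, and its storage layout is not shown in this notebook. As one possible approach, assuming a directory-per-version layout with a JSON metadata file, a save/list pair could look like the sketch below. The names `save_version` and `list_versions`, the `vN` naming scheme, and the file names are all hypothetical.

```python
from __future__ import annotations

import json
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def list_versions(root: Path) -> list[str]:
    """Return existing version names sorted by their numeric suffix."""
    names = [p.name for p in root.glob("v*") if p.is_dir() and p.name[1:].isdigit()]
    return sorted(names, key=lambda name: int(name[1:]))


def save_version(root: Path, model: object, metrics: dict[str, float]) -> str:
    """Pickle the model into the next free version directory with its metadata."""
    existing = list_versions(root)
    next_number = int(existing[-1][1:]) + 1 if existing else 1
    version = f"v{next_number}"
    target = root / version
    target.mkdir(parents=True)
    (target / "model.pkl").write_bytes(pickle.dumps(model))
    metadata = {**metrics, "timestamp": datetime.now(timezone.utc).isoformat()}
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return version


# A plain dict stands in for the trained model in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = save_version(root, {"weights": [0.1, 0.2]}, {"recall": 0.90})
    second = save_version(root, {"weights": [0.3, 0.4]}, {"recall": 0.92})
    versions = list_versions(root)
    stored = json.loads((root / second / "metadata.json").read_text())
    print(versions, stored["recall"])
```

Storing the metrics next to the artifact lets a later step pick a version by recall without loading every model. Pickle files should only be loaded from trusted locations.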

## Summary

In this notebook, we:

1. ✓ Loaded and explored customer churn data
2. ✓ Validated data quality and schema
3. ✓ Engineered derived features
4. ✓ Preprocessed data (encoding, scaling)
5. ✓ Trained an XGBoost classifier
6. ✓ Evaluated model performance
7. ✓ Saved the model with versioning

The trained model is now ready for making predictions. See the next notebook (`02_prediction_and_explanation.ipynb`) to learn how to use the model for inference and generate SHAP explanations.

## Next Steps

- **Hyperparameter Tuning**: Experiment with different XGBoost parameters to improve performance
- **Feature Selection**: Analyze feature importance and remove low-impact features
- **Cross-Validation**: Use k-fold cross-validation for more robust evaluation
- **Threshold Optimization**: Adjust the classification threshold based on business requirements
- **Make Predictions**: Use the trained model in the prediction notebook
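
For the cross-validation item, the key idea is to split the training data into k folds that each preserve the churn rate, then train on k-1 folds and evaluate on the remaining one. The sketch below shows only the fold assignment, in plain Python with a hypothetical `stratified_folds` helper and toy labels. In practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used instead.

```python
from __future__ import annotations

import random
from collections import defaultdict


def stratified_folds(labels: list[int], k: int, seed: int = 0) -> list[list[int]]:
    """Assign row indices to k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class: dict[int, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: list[list[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [1] * 10 + [0] * 40  # toy data with a 20% churn rate
folds = stratified_folds(labels, k=5)

for held_out in folds:
    train_idx = [i for fold in folds if fold is not held_out for i in fold]
    churn_rate = sum(labels[i] for i in held_out) / len(held_out)
    print(f"train={len(train_idx)} held_out={len(held_out)} churn_rate={churn_rate:.0%}")
```

When combining this with the preprocessing step, the transformer should be refitted on the training folds of each split so that the held-out fold stays unseen.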