# Complete Data Preprocessing for Machine Learning

This notebook implements a data preprocessing pipeline and then uses it to train and evaluate two churn classifiers. It covers:
- Missing data imputation
- Categorical encoding
- Feature scaling
- Feature engineering
- Train/test splitting
- Model training and evaluation

In [17]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

## Step 1: Load Data

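Load the cleaned dataset produced by the previous notebook (`telco_churn.csv`). If the file is not found, the cell below falls back to a randomly generated dummy DataFrame so that the remaining steps can still run.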

In [18]:
# Load your dataset (modify path as needed)
# Example: df = pd.read_csv('your_dataset.csv')
try:
    df = pd.read_csv('telco_churn.csv')
    print(f"Dataset shape: {df.shape}")
    print(f"\nFirst few rows:")
    display(df.head())
except FileNotFoundError:
    print("Error: 'telco_churn.csv' not found. Creating a dummy DataFrame for demonstration.")
    # Create a dummy DataFrame for demonstration purposes if the file is not found
    data = {'tenure': np.random.randint(1, 73, 100),
            'MonthlyCharges': np.random.uniform(20, 120, 100),
            'gender': np.random.choice(['Male', 'Female'], 100),
            'Churn': np.random.choice(['Yes', 'No'], 100),
            'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], 100),
            'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], 100)}
    df = pd.DataFrame(data)
    print(f"Dummy Dataset shape: {df.shape}")
    print(f"\nFirst few rows:")
    display(df.head())

except Exception as e:
    print(f"An unexpected error occurred: {e}")
    print("Please load your dataset into 'df' variable")

Error: 'telco_churn.csv' not found. Creating a dummy DataFrame for demonstration.
Dummy Dataset shape: (100, 6)

First few rows:


| | tenure | MonthlyCharges | gender | Churn | InternetService | Contract |
|---|---|---|---|---|---|---|
| 0 | 67 | 44.336356 | Female | Yes | Fiber optic | Two year |
| 1 | 53 | 25.017751 | Female | No | DSL | One year |
| 2 | 46 | 95.194789 | Male | No | No | Two year |
| 3 | 67 | 56.704945 | Female | No | Fiber optic | Two year |
| 4 | 34 | 43.179194 | Male | No | DSL | Month-to-month |


## Step 2: Feature Engineering

Create new features to improve model performance:
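- **tenure_bins**: Customer tenure (in months) grouped into four bins: 0-1, 1-2, 2-4, and 4+ years
- **monthly_charges_scaled**: Standardized (zero mean, unit variance) copy of `MonthlyCharges`. The scaler here is fit on the full dataset before the train/test split, and the numeric pipeline in Step 3 standardizes every numeric column again, so after preprocessing this column carries the same values as `MonthlyCharges`.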

In [19]:
# Feature Engineering
if 'df' in locals() and isinstance(df, pd.DataFrame):
    if 'tenure' in df.columns:
        df['tenure_bins'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 72], labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'])
    if 'MonthlyCharges' in df.columns:
        scaler = StandardScaler()
        df['monthly_charges_scaled'] = scaler.fit_transform(df[['MonthlyCharges']])
    print(f"Dataset shape: {df.shape}")
else:
    print("Load dataset first")

Dataset shape: (100, 8)
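
To make the binning rule concrete, here is a minimal sketch of how `pd.cut` treats values on and around the bin edges used above. The tenure values are hypothetical and chosen only to hit the boundaries.

```python
import pandas as pd

# Hypothetical tenure values (in months) placed on and around the bin edges.
tenure = pd.Series([0, 1, 12, 13, 24, 48, 72, 73])

tenure_bins = pd.cut(
    tenure,
    bins=[0, 12, 24, 48, 72],
    labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'],
)

# Intervals are right-closed by default: (0, 12], (12, 24], (24, 48], (48, 72].
# Values that fall outside every interval (here 0 and 73) become NaN.
print(pd.DataFrame({'tenure': tenure, 'tenure_bins': tenure_bins}))

# The result uses the pandas 'category' dtype, not 'object'.
print(tenure_bins.dtype)
```

With real data, a tenure of 0 months (a brand-new customer) would therefore end up as a missing value unless `include_lowest=True` is passed or the first edge is lowered.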


## Step 3: Preprocessing Pipeline

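Set up sklearn pipelines for:
- **Numeric columns**: Impute missing values (median) → StandardScaler
- **Categorical columns**: Impute missing values (most_frequent) → OneHotEncoder

Note that `ColumnTransformer` drops any column that is not assigned to a transformer (its default is `remainder='drop'`). Because `select_dtypes(include='object')` does not match the `category` dtype produced by `pd.cut`, `tenure_bins` lands in neither group and does not reach the models.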

In [20]:
# Preprocessing Pipeline
if 'df' in locals() and isinstance(df, pd.DataFrame) and 'Churn' in df.columns:
    X = df.drop('Churn', axis=1)
    y = df['Churn'].map({'No': 0, 'Yes': 1})
    num_cols = X.select_dtypes(include=np.number).columns.tolist() # Use np.number for numerical types
    cat_cols = X.select_dtypes(include='object').columns.tolist()
    num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())])
    cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore'))])
    pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
    print(f"Numeric: {len(num_cols)}, Categorical: {len(cat_cols)}")
else:
    print("Dataset not loaded or 'Churn' column is missing. Cannot proceed with preprocessing pipeline.")

Numeric: 3, Categorical: 3
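
The count of 3 categorical columns above excludes `tenure_bins`, since `pd.cut` returns a `category` column. The sketch below shows one possible way to pick it up by selecting both dtypes. The small `X_demo` frame is hypothetical, and switching the notebook to this selection would change the shapes and metrics reported in later cells.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that mirrors the column types used in this notebook.
X_demo = pd.DataFrame({
    'tenure': [5, 20, 40, 60],
    'gender': ['Male', 'Female', 'Female', 'Male'],
})
X_demo['tenure_bins'] = pd.cut(
    X_demo['tenure'],
    bins=[0, 12, 24, 48, 72],
    labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'],
)

num_cols_demo = X_demo.select_dtypes(include=np.number).columns.tolist()

# Selection as written in the notebook: misses the 'category' column.
cat_object_only = X_demo.select_dtypes(include='object').columns.tolist()

# Alternative selection (an assumption, not the notebook's code): also match 'category'.
cat_with_category = X_demo.select_dtypes(include=['object', 'category']).columns.tolist()

print("numeric:", num_cols_demo)
print("object only:", cat_object_only)
print("object + category:", cat_with_category)
```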


## Step 4: Train/Test Split and Transformation

Split the data 80/20 into training and test sets, stratified on the target, then fit the preprocessor on the training set only and apply it to both sets.

In [21]:
# Split data into train and test sets (80/20 split with stratification)
if 'X' in locals() and 'y' in locals() and 'pre' in locals():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # Apply preprocessing transformations
    X_train_processed = pre.fit_transform(X_train)
    X_test_processed = pre.transform(X_test)

    print(f"Training set shape: {X_train_processed.shape}")
    print(f"Test set shape: {X_test_processed.shape}")
    print(f"\nClass distribution in training set:")
    print(y_train.value_counts(normalize=True))
    print(f"\nClass distribution in test set:")
    print(y_test.value_counts(normalize=True))
else:
    print("Previous steps not completed. Cannot perform train/test split and transformation.")

Training set shape: (80, 11)
Test set shape: (20, 11)

Class distribution in training set:
Churn
0    0.525
1    0.475
Name: proportion, dtype: float64

Class distribution in test set:
Churn
0    0.55
1    0.45
Name: proportion, dtype: float64


## Step 5: Build and Train Machine Learning Models

Now that our data is preprocessed, we'll train two classification models:
1. **Logistic Regression** - A simple, interpretable baseline model
2. **LightGBM** - A powerful gradient boosting model that often achieves better performance

Both models will be evaluated on the test set to compare their accuracy.

In [22]:
# Import model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import joblib

# Try to import LightGBM (might need to install first)
try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
    print("LightGBM is available")
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("LightGBM not available. Will train only Logistic Regression.")

LightGBM is available


### 5.1 Train Baseline Logistic Regression Model

**Logistic Regression** is a classic machine learning algorithm that is well suited to binary classification tasks like churn prediction. It's:
- **Fast** to train
- **Interpretable** - you can understand which features influence predictions
- **Great baseline** - provides a benchmark to compare more complex models against

We'll train it on our preprocessed data and evaluate its performance.

In [23]:
# Train Logistic Regression model
print("Training Logistic Regression model...")
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_processed, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_processed)

# Evaluate the model
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"\n{'='*50}")
print(f"Logistic Regression Test Accuracy: {lr_accuracy:.4f}")
print(f"{'='*50}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

Training Logistic Regression model...

Logistic Regression Test Accuracy: 0.5500

Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.55      0.57        11
           1       0.50      0.56      0.53         9

    accuracy                           0.55        20
   macro avg       0.55      0.55      0.55        20
weighted avg       0.55      0.55      0.55        20



In [24]:
# Save the Logistic Regression model
joblib.dump(lr_model, 'logistic_regression_model.joblib')
print("✓ Logistic Regression model saved as 'logistic_regression_model.joblib'")

✓ Logistic Regression model saved as 'logistic_regression_model.joblib'
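
The file saved above contains only the classifier, so whoever loads it also needs the fitted preprocessor `pre` to turn raw customer rows into the expected columns. One common option, sketched here under assumed names (`full_model`, `churn_pipeline_demo.joblib`) and with hypothetical data, is to wrap the preprocessor and the model in a single `Pipeline` and save that instead.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data; labels alternate so both classes are present.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'tenure': rng.integers(1, 73, 40),
    'Contract': rng.choice(['Month-to-month', 'One year', 'Two year'], 40),
})
y_demo = np.tile([0, 1], 20)

pre_demo = ColumnTransformer([
    ('num', StandardScaler(), ['tenure']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Contract']),
])

# Preprocessing and classifier travel together as one object.
full_model = Pipeline([('pre', pre_demo), ('clf', LogisticRegression(max_iter=1000))])
full_model.fit(X_demo, y_demo)

# Save to a temporary directory and load it back.
path = os.path.join(tempfile.mkdtemp(), 'churn_pipeline_demo.joblib')
joblib.dump(full_model, path)
reloaded = joblib.load(path)

# The reloaded pipeline accepts raw, unprocessed rows.
new_customers = pd.DataFrame({'tenure': [3, 60],
                              'Contract': ['Month-to-month', 'Two year']})
print(reloaded.predict_proba(new_customers)[:, 1])
```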


### 5.2 Train LightGBM Model (Optional)

**LightGBM** (Light Gradient Boosting Machine) is an advanced ensemble learning algorithm that:
- **Handles complex patterns** better than linear models
- **Often achieves higher accuracy** through gradient boosting
- **Works well with many features** and large datasets
- Is **widely used in industry** for production systems

We'll train it and compare its performance to our Logistic Regression baseline.

In [25]:
# Train LightGBM model (if available)
if LIGHTGBM_AVAILABLE:
    print("Training LightGBM model...")
    lgbm_model = lgb.LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)
    lgbm_model.fit(X_train_processed, y_train)

    # Make predictions
    y_pred_lgbm = lgbm_model.predict(X_test_processed)

    # Evaluate the model
    lgbm_accuracy = accuracy_score(y_test, y_pred_lgbm)
    print(f"\n{'='*50}")
    print(f"LightGBM Test Accuracy: {lgbm_accuracy:.4f}")
    print(f"{'='*50}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_lgbm))
else:
    print("Skipping LightGBM training - library not available.")

Training LightGBM model...

LightGBM Test Accuracy: 0.5000

Classification Report:
              precision    recall  f1-score   support

           0       0.57      0.36      0.44        11
           1       0.46      0.67      0.55         9

    accuracy                           0.50        20
   macro avg       0.52      0.52      0.49        20
weighted avg       0.52      0.50      0.49        20



In [26]:
# Save the LightGBM model (if trained)
if LIGHTGBM_AVAILABLE:
    joblib.dump(lgbm_model, 'lightgbm_model.joblib')
    print("✓ LightGBM model saved as 'lightgbm_model.joblib'")

✓ LightGBM model saved as 'lightgbm_model.joblib'


## Step 6: Model Evaluation and Performance Metrics

Now that we've trained our models, it's crucial to thoroughly evaluate their performance using multiple metrics. For churn prediction, we need to understand:

**Key Evaluation Metrics:**

* **Accuracy** - Overall percentage of correct predictions
* **ROC-AUC Score** - Measures the model's ability to rank churners above non-churners (0.5 corresponds to random guessing, and values closer to 1.0 are better)
* **Precision** - Of all customers predicted to churn, how many actually churned? (Important for targeting retention campaigns)
* **Recall** - Of all customers who actually churned, how many did we identify? (Critical for reducing churn)
* **F1-Score** - Harmonic mean of precision and recall, providing a balanced view
* **Confusion Matrix** - Shows the breakdown of correct and incorrect predictions

**Business Impact:**
In churn prediction, missing actual churners (false negatives) can be costly, as we lose customers. However, incorrectly predicting churn (false positives) wastes resources on retention efforts for loyal customers. The right balance depends on your business priorities.

In [27]:
# Import evaluation metrics
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ Evaluation metrics imported successfully")

✓ Evaluation metrics imported successfully


In [28]:
# Logistic Regression - Comprehensive Evaluation
print("="*70)
print("LOGISTIC REGRESSION - COMPREHENSIVE EVALUATION METRICS")
print("="*70)

# Calculate predictions and probabilities
y_pred_lr = lr_model.predict(X_test_processed)
y_proba_lr = lr_model.predict_proba(X_test_processed)[:, 1]

# 1. Accuracy Score
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"\n1. Accuracy Score: {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print("   → This means {:.2f}% of all predictions are correct".format(lr_accuracy*100))

# 2. ROC-AUC Score
lr_roc_auc = roc_auc_score(y_test, y_proba_lr)
print(f"\n2. ROC-AUC Score: {lr_roc_auc:.4f}")
print("   → Measures ability to distinguish churners from non-churners")
print("   → Closer to 1.0 = better discrimination")

# 3. Confusion Matrix
print("\n3. Confusion Matrix:")
lr_cm = confusion_matrix(y_test, y_pred_lr)
print(lr_cm)
print("\n   Understanding the Confusion Matrix:")
print(f"   - True Negatives (TN): {lr_cm[0,0]} - Correctly predicted NOT to churn")
print(f"   - False Positives (FP): {lr_cm[0,1]} - Incorrectly predicted TO churn")
print(f"   - False Negatives (FN): {lr_cm[1,0]} - Incorrectly predicted NOT to churn (COSTLY!)")
print(f"   - True Positives (TP): {lr_cm[1,1]} - Correctly predicted TO churn")

# 4. Detailed Classification Report
print("\n4. Detailed Classification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['No Churn', 'Churn']))

print("="*70)

LOGISTIC REGRESSION - COMPREHENSIVE EVALUATION METRICS

1. Accuracy Score: 0.5500 (55.00%)
   → This means 55.00% of all predictions are correct

2. ROC-AUC Score: 0.4848
   → Measures ability to distinguish churners from non-churners
   → Closer to 1.0 = better discrimination

3. Confusion Matrix:
[[6 5]
 [4 5]]

   Understanding the Confusion Matrix:
   - True Negatives (TN): 6 - Correctly predicted NOT to churn
   - False Positives (FP): 5 - Incorrectly predicted TO churn
   - False Negatives (FN): 4 - Incorrectly predicted NOT to churn (COSTLY!)
   - True Positives (TP): 5 - Correctly predicted TO churn

4. Detailed Classification Report:
              precision    recall  f1-score   support

    No Churn       0.60      0.55      0.57        11
       Churn       0.50      0.56      0.53         9

    accuracy                           0.55        20
   macro avg       0.55      0.55      0.55        20
weighted avg       0.55      0.55      0.55        20
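
As a sanity check, the "Churn" row of the report can be reproduced by hand from the confusion matrix printed above. This short sketch recomputes accuracy, precision, recall, and F1 from those four counts.

```python
import numpy as np

# Confusion matrix from the Logistic Regression output above:
# rows are actual classes, columns are predicted classes.
cm = np.array([[6, 5],
               [4, 5]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions / all predictions
precision = tp / (tp + fp)               # flagged churners that really churned
recall = tp / (tp + fn)                  # actual churners that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.55 precision=0.50 recall=0.56 f1=0.53
```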



In [29]:
# LightGBM - Comprehensive Evaluation (if available)
if LIGHTGBM_AVAILABLE:
    print("="*70)
    print("LIGHTGBM - COMPREHENSIVE EVALUATION METRICS")
    print("="*70)

    # Calculate predictions and probabilities
    y_pred_lgbm = lgbm_model.predict(X_test_processed)
    y_proba_lgbm = lgbm_model.predict_proba(X_test_processed)[:, 1]

    # 1. Accuracy Score
    lgbm_accuracy = accuracy_score(y_test, y_pred_lgbm)
    print(f"\n1. Accuracy Score: {lgbm_accuracy:.4f} ({lgbm_accuracy*100:.2f}%)")
    print("   → This means {:.2f}% of all predictions are correct".format(lgbm_accuracy*100))

    # 2. ROC-AUC Score
    lgbm_roc_auc = roc_auc_score(y_test, y_proba_lgbm)
    print(f"\n2. ROC-AUC Score: {lgbm_roc_auc:.4f}")
    print("   → Measures ability to distinguish churners from non-churners")
    print("   → Closer to 1.0 = better discrimination")

    # 3. Confusion Matrix
    print("\n3. Confusion Matrix:")
    lgbm_cm = confusion_matrix(y_test, y_pred_lgbm)
    print(lgbm_cm)
    print("\n   Understanding the Confusion Matrix:")
    print(f"   - True Negatives (TN): {lgbm_cm[0,0]} - Correctly predicted NOT to churn")
    print(f"   - False Positives (FP): {lgbm_cm[0,1]} - Incorrectly predicted TO churn")
    print(f"   - False Negatives (FN): {lgbm_cm[1,0]} - Incorrectly predicted NOT to churn (COSTLY!)")
    print(f"   - True Positives (TP): {lgbm_cm[1,1]} - Correctly predicted TO churn")

    # 4. Detailed Classification Report
    print("\n4. Detailed Classification Report:")
    print(classification_report(y_test, y_pred_lgbm, target_names=['No Churn', 'Churn']))

    print("="*70)
else:
    print("\nLightGBM evaluation skipped - model not available.")

LIGHTGBM - COMPREHENSIVE EVALUATION METRICS

1. Accuracy Score: 0.5000 (50.00%)
   → This means 50.00% of all predictions are correct

2. ROC-AUC Score: 0.5455
   → Measures ability to distinguish churners from non-churners
   → Closer to 1.0 = better discrimination

3. Confusion Matrix:
[[4 7]
 [3 6]]

   Understanding the Confusion Matrix:
   - True Negatives (TN): 4 - Correctly predicted NOT to churn
   - False Positives (FP): 7 - Incorrectly predicted TO churn
   - False Negatives (FN): 3 - Incorrectly predicted NOT to churn (COSTLY!)
   - True Positives (TP): 6 - Correctly predicted TO churn

4. Detailed Classification Report:
              precision    recall  f1-score   support

    No Churn       0.57      0.36      0.44        11
       Churn       0.46      0.67      0.55         9

    accuracy                           0.50        20
   macro avg       0.52      0.52      0.49        20
weighted avg       0.52      0.50      0.49        20



In [30]:
# Side-by-Side Model Comparison
print("\n" + "="*70)
print("MODEL PERFORMANCE COMPARISON")
print("="*70)

# Calculate metrics for both models
y_proba_lr = lr_model.predict_proba(X_test_processed)[:, 1]
lr_roc_auc = roc_auc_score(y_test, y_proba_lr)

if LIGHTGBM_AVAILABLE:
    y_proba_lgbm = lgbm_model.predict_proba(X_test_processed)[:, 1]
    lgbm_roc_auc = roc_auc_score(y_test, y_proba_lgbm)

    # Create comparison dataframe
    comparison_df = pd.DataFrame({
        'Metric': ['Accuracy', 'ROC-AUC Score'],
        'Logistic Regression': [f"{lr_accuracy:.4f}", f"{lr_roc_auc:.4f}"],
        'LightGBM': [f"{lgbm_accuracy:.4f}", f"{lgbm_roc_auc:.4f}"]
    })

    print("\n⭐ Key Performance Metrics Comparison:\n")
    print(comparison_df.to_string(index=False))

    # Determine winner
    print("\n" + "-"*70)
    print("WINNER ANALYSIS:")
    print("-"*70)

    if lgbm_accuracy > lr_accuracy:
        acc_winner = "LightGBM"
        acc_diff = (lgbm_accuracy - lr_accuracy) * 100
    else:
        acc_winner = "Logistic Regression"
        acc_diff = (lr_accuracy - lgbm_accuracy) * 100

    if lgbm_roc_auc > lr_roc_auc:
        roc_winner = "LightGBM"
        roc_diff = (lgbm_roc_auc - lr_roc_auc) * 100
    else:
        roc_winner = "Logistic Regression"
        roc_diff = (lr_roc_auc - lgbm_roc_auc) * 100

    print(f"\n✓ Accuracy Winner: {acc_winner} (by {acc_diff:.2f} percentage points)")
    print(f"✓ ROC-AUC Winner: {roc_winner} (by {roc_diff:.2f} percentage points)")

else:
    print("\n⚠ Only Logistic Regression results available")
    print(f"\nLogistic Regression Accuracy: {lr_accuracy:.4f}")
    print(f"Logistic Regression ROC-AUC:  {lr_roc_auc:.4f}")

print("\n" + "="*70)


MODEL PERFORMANCE COMPARISON

⭐ Key Performance Metrics Comparison:

       Metric Logistic Regression LightGBM
     Accuracy              0.5500   0.5000
ROC-AUC Score              0.4848   0.5455

----------------------------------------------------------------------
WINNER ANALYSIS:
----------------------------------------------------------------------

✓ Accuracy Winner: Logistic Regression (by 5.00 percentage points)
✓ ROC-AUC Winner: LightGBM (by 6.06 percentage points)
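
With only 20 test rows, each prediction is worth 5 percentage points of accuracy, so the gap between the two models amounts to a single customer (11 versus 10 correct). The sketch below uses a normal approximation to the binomial, which is a rough assumption at this sample size, to show how wide the uncertainty around each accuracy is.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se


n_test = 20
for name, acc in [('Logistic Regression', 0.55), ('LightGBM', 0.50)]:
    lo, hi = accuracy_interval(acc, n_test)
    print(f"{name}: accuracy={acc:.2f}, approx. 95% interval=({lo:.2f}, {hi:.2f})")
```

Both intervals overlap heavily and include 0.5, so this run does not support declaring either model the winner.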



## 📊 Business Insights & Model Interpretation

### Understanding the Metrics in Practice

**Why These Metrics Matter for Churn Prediction:**

1. **Accuracy** tells us the overall correctness, but can be misleading with imbalanced datasets. If 90% of customers don't churn, a naive model predicting "no churn" for everyone would achieve 90% accuracy despite being useless!

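2. **ROC-AUC Score** is less sensitive to class imbalance than accuracy. It measures how well the model ranks churners above non-churners: 0.5 corresponds to random ranking and 1.0 to perfect ranking. Scores of 0.85 or higher are often treated as strong, though what counts as good enough depends on the application.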

3. **Precision** answers: "Of the customers we flag as potential churners, how many actually churn?" High precision means fewer wasted retention efforts on customers who wouldn't have left anyway.

4. **Recall** answers: "Of all customers who actually churned, how many did we identify?" High recall ensures we don't miss at-risk customers. Missing a churner costs us their lifetime value!

5. **F1-Score** balances precision and recall. It's particularly useful when both false positives (wasted effort) and false negatives (lost customers) are costly.
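
The accuracy pitfall from point 1 is easy to reproduce. This minimal sketch uses hypothetical labels (90 non-churners, 10 churners) and a "model" that predicts no churn for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical labels: 90 non-churners (0) and 10 churners (1).
y_true = np.array([0] * 90 + [1] * 10)

# Naive predictions: nobody churns, and every customer gets the same score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

print(accuracy_score(y_true, y_pred))   # 0.9
print(recall_score(y_true, y_pred))     # 0.0
print(roc_auc_score(y_true, y_score))   # 0.5
```

Accuracy looks strong, while recall and ROC-AUC show that the model identifies no churners at all.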

### 🎯 Business Recommendations

**Cost-Benefit Analysis:**
- **Cost of retention campaign**: $X per customer
- **Customer lifetime value**: $Y
- **Acceptable false positive rate**: Depends on X/Y ratio

**Model Selection Criteria:**
- If retention is cheap: Prioritize **high recall** (catch as many potential churners as possible)
- If retention is expensive: Prioritize **high precision** (target only likely churners)
- For balanced approach: Use **F1-score** as your primary metric
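
One way to turn the X/Y trade-off into a number is a break-even churn probability. The sketch below assumes a simple model that is not part of this notebook: contacting a customer costs X, and the offer keeps a would-be churner (worth Y) with probability `save_rate`. Under that assumption, contacting pays off in expectation when p × save_rate × Y > X. The function name, parameters, and figures are all hypothetical.

```python
def break_even_probability(cost_x, value_y, save_rate=1.0):
    """Churn probability above which contacting a customer pays off in expectation.

    Assumes expected gain = p * save_rate * value_y - cost_x (a simplification).
    """
    return cost_x / (save_rate * value_y)


# Hypothetical figures: $20 per contact, $400 lifetime value, 25% of offers succeed.
threshold = break_even_probability(cost_x=20, value_y=400, save_rate=0.25)
print(threshold)  # 0.2
```

A low threshold like this one favors recall, while an expensive campaign pushes the threshold up and favors precision.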

**Deployment Strategy:**
1. Use the model's probability scores (not just binary predictions) to tier customers
2. High probability (>70%): Immediate intervention with premium retention offers
3. Medium probability (40-70%): Standard retention campaign
4. Low probability (<40%): Continue monitoring
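
The tiering above could be sketched as follows, using hypothetical probability scores in place of `predict_proba` output. The tier labels are illustrative, and a score of exactly 70% is assigned to the medium tier, which is one reading of the thresholds.

```python
import numpy as np

# Hypothetical churn probabilities, e.g. model.predict_proba(X)[:, 1].
churn_proba = np.array([0.05, 0.39, 0.40, 0.55, 0.70, 0.71, 0.95])

# Conditions are checked in order; the first match wins.
tier = np.select(
    [churn_proba > 0.70, churn_proba >= 0.40],
    ['immediate intervention', 'standard campaign'],
    default='continue monitoring',
)

for p, t in zip(churn_proba, tier):
    print(f"{p:.2f} -> {t}")
```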

### ✅ Next Steps

1. **Model Refinement**: Tune hyperparameters, try feature selection, ensemble methods
2. **Business Validation**: Test predictions on historical data, measure ROI
3. **Production Deployment**: Integrate with CRM systems, set up monitoring dashboards
4. **Continuous Improvement**: Retrain monthly, track model drift, incorporate feedback

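**🎓 Congratulations!** You've built an end-to-end churn prediction workflow with preprocessing, two models, and a set of evaluation metrics. Keep in mind that the run shown here used the randomly generated dummy DataFrame (the CSV file was not found), so the features carry no signal about `Churn`, and accuracy and ROC-AUC near 0.5 are expected. Rerun the notebook on the real dataset and work through the next steps above before treating either model as production-ready.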