# Preprocessing 

This notebook prepares the dataset for machine learning modeling.

**Objectives:**
- Separate features and target variable
- Create train/validation/test splits
- Build a preprocessing pipeline for numerical and categorical features
- Handle missing values with a business-justified strategy
- Save the preprocessing pipeline for reproducibility

**Key Principle:** All preprocessing is learned from training data only to prevent data leakage.

## 1. Setup and Data Loading

In [19]:
import pandas as pd
import numpy as np
import joblib
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')

In [20]:
# Load prepared data
df = pd.read_csv("../data/processed/telco_churn_prepared.csv")

df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## 2. Feature and Target Separation

In [21]:
# Define target and features
target = "Churn"

# Drop customerID (identifier) and target
X = df.drop([target, "customerID"], axis=1)
y = df[target]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features shape: (7043, 19)
Target shape: (7043,)


In [22]:
# Encode target variable (Yes=1, No=0)
print("Target distribution before encoding:")
print(y.value_counts())

y = y.map({"Yes": 1, "No": 0})

print("\nTarget distribution after encoding:")
print(y.value_counts())

Target distribution before encoding:
Churn
No     5174
Yes    1869
Name: count, dtype: int64

Target distribution after encoding:
Churn
0    5174
1    1869
Name: count, dtype: int64


## 3. Train / Validation / Test Split

**Split Strategy:**
- Training set: 60% — for model training
- Validation set: 20% — for model selection and hyperparameter tuning
- Test set: 20% — for final unbiased evaluation

**Important:** Both splits are stratified on the target so that each set keeps the overall class proportions.
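
The two-step split can be wrapped in a small helper so the fractions are stated once. The sketch below is illustrative only: `three_way_split`, `X_demo` and `y_demo` are hypothetical names, and the data is synthetic rather than the churn dataset. The second `test_size` is expressed relative to the held-out portion, which is why 20% / 40% becomes 0.5.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=42):
    """Stratified train/validation/test split (hypothetical helper)."""
    holdout_frac = val_frac + test_frac
    # First cut: carve off the combined validation + test portion.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout_frac, stratify=y, random_state=seed
    )
    # Second cut: test_frac expressed relative to the held-out portion.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold,
        test_size=test_frac / holdout_frac,
        stratify=y_hold,
        random_state=seed,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data: 1,000 rows, 30% positive class.
X_demo = pd.DataFrame({"x": np.arange(1000)})
y_demo = pd.Series([1] * 300 + [0] * 700)

Xtr, Xva, Xte, ytr, yva, yte = three_way_split(X_demo, y_demo)
print(len(Xtr), len(Xva), len(Xte))
print(round(ytr.mean(), 2), round(yva.mean(), 2), round(yte.mean(), 2))
```

Because both cuts pass `stratify`, the positive-class rate should stay close to the overall rate in all three sets.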

In [23]:
# First split: 60% train, 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.4,
    stratify=y,
    random_state=42
)

# Second split: divide temp evenly into validation and test (20% of the full data each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    stratify=y_temp,
    random_state=42
)

In [24]:
# Sanity check: verify split sizes and class balance
split_summary = []

for name, X_set, y_set in [
    ("Train", X_train, y_train),
    ("Validation", X_val, y_val),
    ("Test", X_test, y_test)
]:
    split_summary.append({
        "Set": name,
        "Samples": len(X_set),
        "Percentage": round(len(X_set) / len(X) * 100, 1),
        "Churn Rate (%)": round(y_set.mean() * 100, 1)
    })

pd.DataFrame(split_summary)

| | Set | Samples | Percentage | Churn Rate (%) |
|---|---|---|---|---|
| 0 | Train | 4225 | 60.0 | 26.5 |
| 1 | Validation | 1409 | 20.0 | 26.5 |
| 2 | Test | 1409 | 20.0 | 26.5 |


## 4. Feature Grouping

Based on EDA findings, we group features by type for appropriate preprocessing.

Note: `SeniorCitizen` is stored as binary (0/1) but is treated as categorical, so every non-numerical column goes through the same one-hot encoder.

In [25]:
# Numerical features
numerical_features = [
    "tenure",
    "MonthlyCharges",
    "TotalCharges"
]

print(f"Numerical features ({len(numerical_features)}): {numerical_features}")

Numerical features (3): ['tenure', 'MonthlyCharges', 'TotalCharges']


In [26]:
# Categorical features (all columns not in numerical_features)
categorical_features = [
    col for col in X_train.columns
    if col not in numerical_features
]

print(f"Categorical features ({len(categorical_features)}):")

Categorical features (16):


In [27]:
# Verify all features are accounted for
all_features = numerical_features + categorical_features
missing_features = set(X_train.columns) - set(all_features)
extra_features = set(all_features) - set(X_train.columns)

print(f"Total features in X_train: {len(X_train.columns)}")
print(f"Total features defined: {len(all_features)}")
print(f"Missing features: {missing_features if missing_features else 'None'}")
print(f"Extra features: {extra_features if extra_features else 'None'}")

Total features in X_train: 19
Total features defined: 19
Missing features: None
Extra features: None


## 5. Preprocessing Strategy

### Design Decisions (Based on EDA)

| Feature Type | Transformation | Rationale |
|-------------|---------------|-----------|
| Numerical | StandardScaler | Required for algorithms sensitive to feature scale |
| Numerical (missing) | Impute with 0 + missing indicator | Missing values correspond to new customers (tenure = 0) |
| Categorical | OneHotEncoder | No ordinal relationships identified |

---

### Missing Value Strategy

- Missing values are observed **only in the `TotalCharges` feature**
- These missing values correspond to **new customers who have not yet been billed**
- Therefore, missingness is **not random** and carries meaningful business information
- Strategy adopted:
  - Impute missing values with `0`
  - Add an explicit missing indicator column
- This approach preserves the **"not yet billed" signal** while allowing models to learn from it

---

### Preprocessing Principles

- All preprocessing steps are **learned exclusively from the training set**
- The same transformations are applied consistently to validation and test sets
- Preprocessing is implemented using a **reproducible scikit-learn pipeline**
- This design prevents data leakage and ensures consistency across experiments
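
As a small illustration of the first two principles, the sketch below fits a `StandardScaler` on toy training values only and reuses those statistics for a toy validation array. The arrays and the `leaky_scaler` contrast are assumptions added for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values standing in for one numerical feature.
train = np.array([[10.0], [20.0], [30.0]])
val = np.array([[100.0], [200.0]])

# Correct: statistics come from the training set only.
scaler = StandardScaler().fit(train)
val_scaled = scaler.transform(val)  # reuse train statistics, never refit

# Leaky (for contrast): fitting on train + validation lets validation
# values influence the mean and standard deviation.
leaky_scaler = StandardScaler().fit(np.vstack([train, val]))

print(scaler.mean_, leaky_scaler.mean_)
print(val_scaled)
```

The two means differ, which shows how fitting on pooled data would let validation information shape the transformation.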

## 6. Build Preprocessing Pipeline

In [28]:
# Numerical transformer
# - Impute missing values with 0 and add indicator column
# - Apply standard scaling (the indicator column passes through the scaler too)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)),
    ("scaler", StandardScaler())
])

In [29]:
# Categorical transformer
# - OneHotEncoder for all categorical features
# - handle_unknown='ignore' for robustness to unseen categories

categorical_transformer = OneHotEncoder(
    handle_unknown="ignore",
    sparse_output=False,
    drop=None  # Keep all categories (no reference encoding)
)

In [30]:
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ],
    remainder="drop",  # Drop any columns not specified
    verbose_feature_names_out=False  # Cleaner feature names
)

## 7. Fit and Transform Data

In [31]:
# Fit on training data ONLY (prevent data leakage)
preprocessor.fit(X_train)

print("Preprocessor fitted on training data.")

Preprocessor fitted on training data.


In [32]:
# Transform all datasets
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print("Transformation complete:")
print(f"  X_train: {X_train.shape} → {X_train_processed.shape}")
print(f"  X_val:   {X_val.shape} → {X_val_processed.shape}")
print(f"  X_test:  {X_test.shape} → {X_test_processed.shape}")

Transformation complete:
  X_train: (4225, 19) → (4225, 47)
  X_val:   (1409, 19) → (1409, 47)
  X_test:  (1409, 19) → (1409, 47)


In [33]:
# Get feature names after transformation
def get_feature_names(preprocessor, numerical_features, categorical_features):
    """Extract feature names from fitted ColumnTransformer."""
    feature_names = []
    
    # Numerical features (original + indicator if added)
    num_transformer = preprocessor.named_transformers_['num']
    imputer = num_transformer.named_steps['imputer']
    
    # Original numerical features
    feature_names.extend(numerical_features)
    
    # Add indicator feature names if the imputer created them
    indicator = getattr(imputer, 'indicator_', None)
    if indicator is not None:
        # indicator.features_ holds the indices of the columns that had
        # missing values during fit; name them like get_feature_names_out does
        feature_names.extend(
            f"missingindicator_{numerical_features[i]}" for i in indicator.features_
        )
    
    # Categorical features (one-hot encoded)
    cat_transformer = preprocessor.named_transformers_['cat']
    cat_feature_names = cat_transformer.get_feature_names_out(categorical_features)
    feature_names.extend(cat_feature_names)
    
    return feature_names

# Get feature names (fall back to the manual helper if the method is unavailable)
try:
    feature_names = preprocessor.get_feature_names_out()
except AttributeError:
    feature_names = get_feature_names(preprocessor, numerical_features, categorical_features)

print(f"Total features after preprocessing: {len(feature_names)}")
print("First 20 features:")
for f in feature_names[:20]:
    print(f" - {f}")

Total features after preprocessing: 47
First 20 features:
 - tenure
 - MonthlyCharges
 - TotalCharges
 - missingindicator_TotalCharges
 - gender_Female
 - gender_Male
 - SeniorCitizen_0
 - SeniorCitizen_1
 - Partner_No
 - Partner_Yes
 - Dependents_No
 - Dependents_Yes
 - PhoneService_No
 - PhoneService_Yes
 - MultipleLines_No
 - MultipleLines_No phone service
 - MultipleLines_Yes
 - InternetService_DSL
 - InternetService_Fiber optic
 - InternetService_No


## 8. Validation Checks

In [34]:
# Check for any remaining NaN values
print("NaN values after preprocessing:")
print(f"  X_train: {np.isnan(X_train_processed).sum()}")
print(f"  X_val:   {np.isnan(X_val_processed).sum()}")
print(f"  X_test:  {np.isnan(X_test_processed).sum()}")

NaN values after preprocessing:
  X_train: 0
  X_val:   0
  X_test:  0


In [35]:
# Check for any infinite values
print("\nInfinite values after preprocessing:")
print(f"  X_train: {np.isinf(X_train_processed).sum()}")
print(f"  X_val:   {np.isinf(X_val_processed).sum()}")
print(f"  X_test:  {np.isinf(X_test_processed).sum()}")


Infinite values after preprocessing:
  X_train: 0
  X_val:   0
  X_test:  0


In [36]:
# Consistent number of features
print("Feature dimensions:")
print("Train:", X_train_processed.shape)
print("Validation:", X_val_processed.shape)
print("Test:", X_test_processed.shape)

Feature dimensions:
Train: (4225, 47)
Validation: (1409, 47)
Test: (1409, 47)


## 9. Save Preprocessing Artifacts

In [37]:
# Project root (we are inside /notebooks)
PROJECT_ROOT = Path.cwd().parent

# Define paths
MODELS_PATH = PROJECT_ROOT / "models"
DATA_PROCESSED_PATH = PROJECT_ROOT / "data" / "processed"

# Create directories if needed
MODELS_PATH.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED_PATH.mkdir(parents=True, exist_ok=True)

# ------------------------
# Save fitted preprocessor
# ------------------------
preprocessor_path = MODELS_PATH / "preprocessor.joblib"
joblib.dump(preprocessor, preprocessor_path)

print(f"✅ Preprocessor saved to: {preprocessor_path}")

# ------------------------
# Save feature names
# ------------------------
feature_names_df = pd.DataFrame({
    "feature": feature_names
})

feature_names_path = MODELS_PATH / "feature_names.csv"
feature_names_df.to_csv(feature_names_path, index=False)

print(f"✅ Feature names saved to: {feature_names_path}")

# ------------------------
# Save processed datasets
# ------------------------
np.save(DATA_PROCESSED_PATH / "X_train_processed.npy", X_train_processed)
np.save(DATA_PROCESSED_PATH / "X_val_processed.npy", X_val_processed)
np.save(DATA_PROCESSED_PATH / "X_test_processed.npy", X_test_processed)

np.save(DATA_PROCESSED_PATH / "y_train.npy", y_train.values)
np.save(DATA_PROCESSED_PATH / "y_val.npy", y_val.values)
np.save(DATA_PROCESSED_PATH / "y_test.npy", y_test.values)

print("✅ Processed datasets saved:")
print("  - X_train_processed.npy")
print("  - X_val_processed.npy")
print("  - X_test_processed.npy")
print("  - y_train.npy")
print("  - y_val.npy")
print("  - y_test.npy")

# ------------------------
# Save raw splits for audit/debug
# ------------------------
X_train.to_csv(DATA_PROCESSED_PATH / "X_train.csv", index=False)
X_val.to_csv(DATA_PROCESSED_PATH / "X_val.csv", index=False)
X_test.to_csv(DATA_PROCESSED_PATH / "X_test.csv", index=False)

y_train.to_csv(DATA_PROCESSED_PATH / "y_train.csv", index=False)
y_val.to_csv(DATA_PROCESSED_PATH / "y_val.csv", index=False)
y_test.to_csv(DATA_PROCESSED_PATH / "y_test.csv", index=False)

print("ℹ️ Raw splits saved for audit/reference only")

✅ Preprocessor saved to: /Users/omarpiro/churn_ml_decision/models/preprocessor.joblib
✅ Feature names saved to: /Users/omarpiro/churn_ml_decision/models/feature_names.csv
✅ Processed datasets saved:
  - X_train_processed.npy
  - X_val_processed.npy
  - X_test_processed.npy
  - y_train.npy
  - y_val.npy
  - y_test.npy
ℹ️ Raw splits saved for audit/reference only


## 10. Preprocessing Summary

### Pipeline Overview

```
Raw Data (19 features)
        │
        ▼
┌───────────────────────────────────────────┐
│           ColumnTransformer               │
├───────────────────┬───────────────────────┤
│  Numerical (3)    │  Categorical (16)     │
│  ┌─────────────┐  │  ┌─────────────────┐  │
│  │ Imputer     │  │  │ OneHotEncoder   │  │
│  │ (0 + flag)  │  │  │                 │  │
│  ├─────────────┤  │  │                 │  │
│  │ Scaler      │  │  │                 │  │
│  └─────────────┘  │  └─────────────────┘  │
└───────────────────┴───────────────────────┘
        │
        ▼
Processed Data (47 features)
```

### Transformations Applied

| Feature Type | Transformation | Output |
|-------------|----------------|--------|
| Numerical (3) | Impute(0) + Indicator + Scale | 4 features (3 + 1 indicator) |
| Categorical (16) | OneHotEncode | 43 features |
| **Total** | | **47 features** |
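
The output width can be cross-checked from the data itself: numerical columns, plus one indicator per numerical column with missing values, plus one column per distinct category. The sketch below runs that arithmetic on a made-up four-row frame; the column lists and values are illustrative assumptions, and the same expression could be evaluated on `X_train`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Partner": ["Yes", "No", "No", "No"],
})
numerical = ["tenure", "TotalCharges"]
categorical = ["Contract", "Partner"]

n_indicators = int(toy[numerical].isna().any().sum())  # numerical columns with NaN
n_onehot = int(toy[categorical].nunique().sum())       # one column per category
expected_width = len(numerical) + n_indicators + n_onehot
print(expected_width)
```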

### Data Splits

| Set | Samples | % | Churn Rate |
|-----|---------|---|------------|
| Train | 4,225 | 60% | 26.5% |
| Validation | 1,409 | 20% | 26.5% |
| Test | 1,409 | 20% | 26.5% |


### Key Decisions

1. **TotalCharges NaN handling**: Impute with 0 + indicator flag
   - Preserves "not yet billed" signal for new customers
   
2. **No feature dropping**: All predictive features kept; only the `customerID` identifier is removed (model can learn to ignore weak ones)

3. **No ordinal encoding**: All categoricals one-hot encoded (no ordinal relationships)

4. **Stratified splits**: Maintain class proportions across all sets


In [38]:
# Final summary
print("=" * 60)
print("PREPROCESSING COMPLETE")
print("=" * 60)
print(f"\nInput features:  {X_train.shape[1]}")
print(f"Output features: {X_train_processed.shape[1]}")
print(f"\nTraining samples:   {X_train_processed.shape[0]:,}")
print(f"Validation samples: {X_val_processed.shape[0]:,}")
print(f"Test samples:       {X_test_processed.shape[0]:,}")
print(f"\nPreprocessor saved: {preprocessor_path}")
print("\nReady for modeling!")

PREPROCESSING COMPLETE

Input features:  19
Output features: 47

Training samples:   4,225
Validation samples: 1,409
Test samples:       1,409

Preprocessor saved: /Users/omarpiro/churn_ml_decision/models/preprocessor.joblib

Ready for modeling!
