# Feature Engineering – Customer Churn Prediction

This notebook performs feature engineering on the dataset produced
during the Exploratory Data Analysis (EDA) stage.

The input data has already been cleaned and validated in the EDA notebook
and is further transformed here into a model-ready format.

## Objective

The objectives of this notebook are to:
- Apply feature transformations to the cleaned dataset from EDA
- Prevent data leakage through careful feature selection
- Encode and scale features using reproducible pipelines
- Produce train and test datasets ready for machine learning models


## Data Source

The dataset used in this notebook is **not raw data**.

It is the output of the **Exploratory Data Analysis (EDA) notebook**, where:
- Missing values were handled
- Data types were corrected
- Obvious inconsistencies were resolved
- Initial feature understanding was established

This separation ensures a clean and modular ML workflow.


## Assumptions from EDA

Based on findings from the EDA phase, the following assumptions apply:
- The dataset contains no duplicate customer records
- Missing values have been appropriately handled
- Feature distributions have been inspected
- The churn label has been validated

Feature engineering is therefore focused on transformation rather than cleaning.


## Feature Categorization

Features are grouped based on their role in the modeling pipeline:

- **Identifier columns:** Used for reference only and excluded from modeling
- **Numerical features:** Continuous and count-based variables
- **Categorical features:** Discrete customer attributes and service types
- **Target variable:** Customer churn indicator


## Importing Libraries

The following libraries and configurations are imported to ensure a robust, reproducible, and visually consistent environment for data analysis and machine learning tasks.


In [96]:
import matplotlib.pyplot as plt
import seaborn as sns
import math
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer




# Configure Seaborn style and Matplotlib figure size
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

## Loading Dataset

In [97]:
# Load the Telco customer churn dataset from an Excel file into a pandas DataFrame
# Check if the file exists
if not os.path.exists("../data/processed/Telco_Churn_clean.csv"):
    raise FileNotFoundError("Dataset file missing.")

# Read the Excel file into a DataFrame
df = pd.read_csv("../data/processed/Telco_Churn_clean.csv")

## Data Overview and Inspection

In [98]:
df.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip_Code,Lat_Long,Latitude,Longitude,Gender,...,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Label,Churn_Value,Churn_Score,CLTV,Churn_Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7032 non-null   object 
 1   Count              7032 non-null   int64  
 2   Country            7032 non-null   object 
 3   State              7032 non-null   object 
 4   City               7032 non-null   object 
 5   Zip_Code           7032 non-null   int64  
 6   Lat_Long           7032 non-null   object 
 7   Latitude           7032 non-null   float64
 8   Longitude          7032 non-null   float64
 9   Gender             7032 non-null   object 
 10  Senior_Citizen     7032 non-null   object 
 11  Partner            7032 non-null   object 
 12  Dependents         7032 non-null   object 
 13  Tenure_Months      7032 non-null   int64  
 14  Phone_Service      7032 non-null   object 
 15  Multiple_Lines     7032 non-null   object 
 16  Internet_Service   7032 

In [100]:
df.shape

(7032, 33)

In [72]:
df.isnull().sum()

CustomerID              0
Count                   0
Country                 0
State                   0
City                    0
Zip_Code                0
Lat_Long                0
Latitude                0
Longitude               0
Gender                  0
Senior_Citizen          0
Partner                 0
Dependents              0
Tenure_Months           0
Phone_Service           0
Multiple_Lines          0
Internet_Service        0
Online_Security         0
Online_Backup           0
Device_Protection       0
Tech_Support            0
Streaming_TV            0
Streaming_Movies        0
Contract                0
Paperless_Billing       0
Payment_Method          0
Monthly_Charges         0
Total_Charges           0
Churn_Label             0
Churn_Value             0
Churn_Score             0
CLTV                    0
Churn_Reason         5163
dtype: int64

## Dropped Columns and Data Leakage Prevention

Some columns are excluded because they:
- Contain post-churn or outcome-related information
- Act as identifiers with no predictive value
- Introduce unnecessary dimensionality

Removing these columns ensures that only pre-churn information
is used during model training.


In [102]:
# Define the list of columns to be excluded from the model
# These include identifiers, geographical noise, and leakage variables
cols_to_drop = [
    'CustomerID', 'Count', 'Country', 'State', 'City', 
    'Zip_Code', 'Lat_Long', 'Latitude', 'Longitude', 
    'Churn_Label', 'Churn_Score', 'CLTV', 'Churn_Reason'
]

# Create a cleaned DataFrame by dropping the specified columns
# We use axis=1 implicitly via columns parameter to drop vertical features
df_cleaned = df.drop(columns=cols_to_drop)

## Train-Test Split

The dataset is split into training and testing sets prior to scaling
to prevent information leakage.

Stratified sampling is used to maintain the churn class distribution
across both subsets.


In [104]:
# Separate the features (X) from the target variable (y)
# 'Churn_Value' is the label we want to predict
X = df_cleaned.drop(columns='Churn_Value')
y = df_cleaned['Churn_Value']

# Split the dataset into training and testing sets
# test_size=0.25: Allocates 25% of the data for testing and 75% for training
# random_state=42: Ensures the split is reproducible (you get the same results every time you run it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Total Services Feature

The `Total_Services` feature captures the total number of services a customer subscribes to, including both optional add-ons and internet service.

- Counts each add-on service: Phone, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies, if the customer subscribes.
- Adds 1 for internet service (DSL or Fiber) to include the foundational service.
- Aggregating these services into a single numeric feature reduces dimensionality while retaining predictive information about customer engagement.

This feature allows the model to distinguish between lightly subscribed and fully subscribed customers, which is often correlated with churn likelihood.


In [105]:
# 1. List the add-on services
add_on_services = [
    'Phone_Service', 'Multiple_Lines', 'Online_Security', 
    'Online_Backup', 'Device_Protection', 'Tech_Support', 
    'Streaming_TV', 'Streaming_Movies'
]

for df_set in [X_train, X_test]:
    # Sums all 'Yes' values (Handles Phone and all add-ons)
    count = (df_set[add_on_services] == 'Yes').sum(axis=1)

    # Manually adds 1 for the Internet foundation (DSL/Fiber)
    count += df_set['Internet_Service'].apply(lambda x: 1 if x != 'No' else 0)

    df_set['Total_Services'] = count
              

# Verify: Customers with no internet should now have a lower score than those with 
print(X_train[['Internet_Service', 'Total_Services']].head())

     Internet_Service  Total_Services
3161      Fiber optic               4
4326               No               1
1922      Fiber optic               8
2310              DSL               9
856       Fiber optic               5


In [106]:
X_train.head(2)

Unnamed: 0,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,Online_Security,Online_Backup,Device_Protection,Tech_Support,Streaming_TV,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Total_Services
3161,Male,No,Yes,Yes,54,Yes,No,Fiber optic,No,Yes,No,Yes,No,No,Two year,Yes,Bank transfer (automatic),79.95,4362.05,4
4326,Male,No,No,Yes,12,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Bank transfer (automatic),19.35,212.3,1


## Feature Categorization

Features are grouped based on their role in the modeling pipeline:

- **Numerical features:** Continuous and count-based variables
- **Categorical features:** Discrete customer attributes and service types


In [107]:

# Identify numeric columns explicitly
numeric_features = [
    'Tenure_Months',
    'Monthly_Charges',
    'Total_Charges',
    'Total_Services'
]

# Identify categorical columns safely
categorical_features = (
    X_train
    .select_dtypes(include=['object'])
    .columns
    .difference(['CustomerID'])   # exclude ID-like columns if present
    .tolist()
)


## Preprocessing Pipeline

The preprocessing pipeline includes:
- **StandardScaler** for numerical feature normalization
- **OneHotEncoder** for categorical feature encoding

A `ColumnTransformer` ensures transformations are applied
only to their respective feature groups.


In [109]:
#  Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(
            drop='first',
            handle_unknown='ignore',
            sparse_output=False    # avoids sparse → dense conversion later
        ), categorical_features)
    ]
)

# Fit ONLY on training data
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

#  Recover feature names
cat_colnames = preprocessor.named_transformers_['cat'] \
    .get_feature_names_out(categorical_features)

all_colnames = numeric_features + list(cat_colnames)

# Restore index alignment (CRITICAL)
X_train_final = pd.DataFrame(
    X_train_transformed,
    columns=all_colnames,
    index=X_train.index
)

X_test_final = pd.DataFrame(
    X_test_transformed,
    columns=all_colnames,
    index=X_test.index
)

print("Preprocessing complete. Data is model-ready.")


Preprocessing complete. Data is model-ready.


## Saving Processed Artifacts

The following artifacts are saved for downstream modeling:
- Training and testing feature matrices
- Target vectors
- Preprocessing pipeline object

This enables reproducibility and efficient experimentation.


In [110]:
import joblib

# 1. Save the processed Feature sets
X_train_final.to_csv('../data/ml_ready/X_train_final.csv', index = True)
X_test_final.to_csv('../data/ml_ready/X_test_final.csv', index = True)

# 2. Save the Target sets (the 'y' values haven't changed, but we save them for consistency)
y_train.to_csv('../data/ml_ready/y_train.csv', index=True)
y_test.to_csv('../data/ml_ready/y_test.csv', index=True)

# 3. Save the Preprocessor Object
# This is vital! If you ever want to predict churn for a NEW customer, 
# you need this exact object to scale their data the same way.
joblib.dump(preprocessor, 'preprocessor.joblib')

['preprocessor.joblib']