# Telco Customer Churn: Feature Engineering

This notebook performs feature engineering on the cleaned Telco Customer Churn dataset.

## Table of Contents
1. [Configure Settings and Data Loading](#1-configure-settings-and-data-loading)  
2. [Target Encoding: City](#2-target-encoding-city)  
3. [Feature Encoding](#3-feature-encoding)  
4. [Feature Selection](#4-feature-selection)  
5. [Test-Train Split](#5-test-train-split)  
6. [Feature Scaling](#6-feature-scaling)  
7. [Data Export](#7-data-export)  

## 1. Configure Settings and Data Loading

In [236]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import joblib
import seaborn as sns
import numpy as np
import warnings
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Configure settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
RANDOM_STATE = 42

In [237]:
# Load dataset
df = pd.read_csv(r"C:\Users\linto\Code\churn-x\ml\notebooks\artifacts\eda\cleaned_df.csv")
print(f"Dataset shape: {df.shape}")

Dataset shape: (7043, 29)


In [238]:
# Display first 3 rows
df.head(3)

Unnamed: 0,Country,State,City,Zip Code,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV
0,United States,California,Los Angeles,90003,33.964131,-118.272783,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108,Yes,1,86,3239
1,United States,California,Los Angeles,90005,34.059281,-118.30742,Female,No,No,Yes,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151,Yes,1,67,2701
2,United States,California,Los Angeles,90006,34.048013,-118.293953,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820,Yes,1,86,5372


In [239]:
# Analyze unique values in each column
print("Unique values analysis:")
for col in df.columns:
    if df[col].nunique() <= 5:
        print(f"{col} -> {df[col].nunique()} -> {df[col].unique()}")
    else:
        print(f"{col} -> {df[col].nunique()}")

Unique values analysis:
Country -> 1 -> ['United States']
State -> 1 -> ['California']
City -> 1129
Zip Code -> 1652
Latitude -> 1652
Longitude -> 1651
Gender -> 2 -> ['Male' 'Female']
Senior Citizen -> 2 -> ['No' 'Yes']
Partner -> 2 -> ['No' 'Yes']
Dependents -> 2 -> ['No' 'Yes']
Tenure Months -> 73
Phone Service -> 2 -> ['Yes' 'No']
Multiple Lines -> 3 -> ['No' 'Yes' 'No phone service']
Internet Service -> 3 -> ['DSL' 'Fiber optic' 'No']
Online Security -> 3 -> ['Yes' 'No' 'No internet service']
Online Backup -> 3 -> ['Yes' 'No' 'No internet service']
Device Protection -> 3 -> ['No' 'Yes' 'No internet service']
Tech Support -> 3 -> ['No' 'Yes' 'No internet service']
Streaming TV -> 3 -> ['No' 'Yes' 'No internet service']
Streaming Movies -> 3 -> ['No' 'Yes' 'No internet service']
Contract -> 3 -> ['Month-to-month' 'Two year' 'One year']
Paperless Billing -> 2 -> ['Yes' 'No']
Payment Method -> 4 -> ['Mailed check' 'Electronic check' 'Bank transfer (automatic)'
 'Credit card (automatic

## 2. Target Encoding: City

In [240]:
# Target Encoding for City Feature
# Apply target encoding to capture city-churn relationship
city_mean_churn = df.groupby('City')['Churn Value'].mean()

# Map encoding back to dataframe
df['city_encoded'] = df['City'].map(city_mean_churn)

# Check correlation with target
correlation = df['city_encoded'].corr(df['Churn Value'])
print(f"Correlation between city (encoded) and churn: {correlation:.4f}")

# Display encoding statistics
print(f"\nCity encoding statistics:")
print(f"Min churn rate: {df['city_encoded'].min():.4f}")
print(f"Max churn rate: {df['city_encoded'].max():.4f}")
print(f"Mean churn rate: {df['city_encoded'].mean():.4f}")

Correlation between city (encoded) and churn: 0.4185

City encoding statistics:
Min churn rate: 0.0000
Max churn rate: 1.0000
Mean churn rate: 0.2654
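
The encoding above is computed on the full dataset before the train-test split, and cities with very few customers get extreme rates (the minimum of 0.0 and maximum of 1.0 reported above). A common alternative is to fit the encoder on the training rows only and shrink small-city means toward the global mean. The sketch below illustrates that idea on toy data; `fit_target_encoder` and its `smoothing` parameter are hypothetical names, not part of this notebook.

```python
import pandas as pd

def fit_target_encoder(cities, target, smoothing=10.0):
    """Return (mapping, global_mean) using additive smoothing.

    `smoothing` is a hypothetical hyperparameter, not a value taken from this notebook.
    """
    frame = pd.DataFrame({"city": cities.values, "target": target.values})
    global_mean = frame["target"].mean()
    stats = frame.groupby("city")["target"].agg(["sum", "count"])
    # Shrink each city's mean toward the global mean; small cities shrink the most.
    encoded = (stats["sum"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return encoded.to_dict(), global_mean

# Toy rows standing in for the training split (not the real dataset).
train = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B"],
    "Churn Value": [1, 0, 0, 0, 1],
})
mapping, global_mean = fit_target_encoder(train["City"], train["Churn Value"], smoothing=2.0)
print(mapping, global_mean)
```

In this toy example, city `B` has a single churned customer, so its raw rate of 1.0 is pulled toward the global mean instead of being taken at face value.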


In [241]:
# Convert Series to dict
city_encoder = city_mean_churn.to_dict()

# Save city encoder
joblib.dump(city_encoder, 'artifacts/feature_engineering/city_encoder.pkl')

['artifacts/feature_engineering/city_encoder.pkl']
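
When the saved dictionary is applied to new data, a city that was not present at fit time maps to `NaN`. One possible way to handle this, shown as a sketch, is to fall back to the overall churn rate (0.2654 in the statistics above). The dictionary values and the `Springfield` row below are made up for illustration.

```python
import pandas as pd

# Hypothetical encoder contents; the real values live in city_encoder.pkl.
city_encoder_demo = {"Los Angeles": 0.30, "San Diego": 0.25}
fallback_rate = 0.2654  # overall mean churn rate reported above

new_customers = pd.DataFrame({"City": ["Los Angeles", "Springfield"]})
# Unknown cities become NaN after map(); fill them with the fallback rate.
new_customers["city_encoded"] = (
    new_customers["City"].map(city_encoder_demo).fillna(fallback_rate)
)
print(new_customers)
```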

## 3. Feature Encoding

In [242]:
# Feature Encoding - One-Hot Encoding for Categorical Variables
nominal_cols = ['Senior Citizen', 'Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 
                'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 
                'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 
                'Paperless Billing', 'Payment Method']

print(f"Applying one-hot encoding to {len(nominal_cols)} categorical columns")

Applying one-hot encoding to 15 categorical columns


In [243]:
# Apply one-hot encoding
df2 = pd.get_dummies(df, columns=nominal_cols, drop_first=True, dtype=int)
print(f"Shape after encoding: {df2.shape}")

Shape after encoding: (7043, 41)


## 4. Feature Selection

Based on the analysis above, the following columns can be removed:
- ```Country``` and ```State``` each have only one unique value.
- ```Churn Label``` and ```Churn Value``` are dropped because the model predicts ```Churn Score```.
- ```Latitude``` and ```Longitude``` can be removed.
- ```Total Charges``` is approximately ```Tenure Months``` * ```Monthly Charges```, so it is redundant.
- ```Zip Code``` and ```Gender``` are removed to reduce bias.
- ```City``` is replaced by ```city_encoded```.
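
The relationship between `Total Charges`, `Tenure Months`, and `Monthly Charges` can be checked directly. The sketch below uses only the three rows shown by `df.head(3)` earlier, so it is an illustration rather than a check on the full dataset; the column name `relative_gap` is introduced here for the example.

```python
import pandas as pd

# The three rows displayed by df.head(3) above.
sample = pd.DataFrame({
    "Tenure Months": [2, 2, 8],
    "Monthly Charges": [53.85, 70.7, 99.65],
    "Total Charges": [108, 151, 820],
})

# Compare the product of tenure and monthly charges with the recorded total.
sample["product"] = sample["Tenure Months"] * sample["Monthly Charges"]
sample["relative_gap"] = (
    (sample["Total Charges"] - sample["product"]).abs() / sample["Total Charges"]
)
print(sample)
```

On these rows the product tracks `Total Charges` closely without matching it exactly, which is consistent with treating the column as redundant rather than as an exact duplicate.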

In [244]:
columns_to_drop = ['Country', 'State', 'Churn Label', 'Churn Value', 'Latitude', 'Longitude', 'Total Charges', 'Zip Code', 'Gender', 'City']
df3 = df2.drop(columns_to_drop, axis=1)
print(f"Removed Columns: {columns_to_drop}")
print(f"New dataset shape: {df3.shape}")

Removed Columns: ['Country', 'State', 'Churn Label', 'Churn Value', 'Latitude', 'Longitude', 'Total Charges', 'Zip Code', 'Gender', 'City']
New dataset shape: (7043, 31)


In [245]:
# Feature Selection - VIF Analysis Functions
def calculate_vif(data):
    """
    Calculate Variance Inflation Factor for each feature
    """
    vif_df = pd.DataFrame()
    vif_df['Column'] = data.columns
    vif_df['VIF'] = [variance_inflation_factor(data.values, i) 
                     for i in range(data.shape[1])]
    return vif_df

def reduce_vif(df, threshold=10.0):
    """
    Iteratively removes features with VIF above threshold
    
    Parameters:
        df: Input dataframe with numeric features
        threshold: Maximum allowed VIF
    
    Returns:
        Reduced dataframe and final VIF table
    """
    df_clean = df.copy()
    
    while True:
        vif_df = calculate_vif(df_clean)
        vif_df = vif_df.sort_values(by="VIF", ascending=False).reset_index(drop=True)
        
        max_vif = vif_df.loc[0, "VIF"]
        if max_vif > threshold:
            drop_col = vif_df.loc[0, "Column"]
            print(f"Dropping '{drop_col}' with VIF={max_vif:.2f}")
            df_clean = df_clean.drop(columns=[drop_col])
        else:
            break
    
    return df_clean, vif_df

print("VIF analysis functions defined")

VIF analysis functions defined


In [246]:
# Calculate initial VIF (excluding target variable)
initial_vif = calculate_vif(df3.drop('Churn Score', axis=1))
print("Initial VIF values:")
initial_vif.sort_values(by='VIF', ascending=False)

Initial VIF values:


Unnamed: 0,Column,VIF
11,Internet Service_No,inf
12,Online Security_No internet service,inf
18,Tech Support_No internet service,inf
20,Streaming TV_No internet service,inf
16,Device Protection_No internet service,inf
14,Online Backup_No internet service,inf
22,Streaming Movies_No internet service,inf
7,Phone Service_Yes,1776.493115
1,Monthly Charges,863.090966
10,Internet Service_Fiber optic,148.349647
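
The infinite values arise because a customer without internet service has `No internet service` in every internet add-on column, so the corresponding dummy columns are exact copies of `Internet Service_No`. A small sketch on made-up rows shows the duplication:

```python
import pandas as pd

# Made-up rows that follow the same category scheme as the dataset.
toy = pd.DataFrame({
    "Internet Service": ["DSL", "No", "Fiber optic", "No"],
    "Online Security": ["Yes", "No internet service", "No", "No internet service"],
    "Tech Support": ["No", "No internet service", "Yes", "No internet service"],
})
dummies = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True, dtype=int)

# The "no internet" indicator appears once per column, with identical values.
same_as_security = (
    dummies["Internet Service_No"] == dummies["Online Security_No internet service"]
).all()
same_as_support = (
    dummies["Internet Service_No"] == dummies["Tech Support_No internet service"]
).all()
print(same_as_security, same_as_support)
```

Because the columns are identical, each one is perfectly predicted by the others, and only one of them needs to survive the VIF reduction.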


In [247]:
# Apply VIF reduction to remove multicollinearity
df4, final_vif = reduce_vif(df3.drop('Churn Score', axis=1), threshold=10.0)
print(f"\nFinal dataset shape: {df4.shape}")

Dropping 'Internet Service_No' with VIF=inf
Dropping 'Online Security_No internet service' with VIF=inf
Dropping 'Tech Support_No internet service' with VIF=inf
Dropping 'Streaming TV_No internet service' with VIF=inf
Dropping 'Device Protection_No internet service' with VIF=inf
Dropping 'Online Backup_No internet service' with VIF=inf
Dropping 'Phone Service_Yes' with VIF=1776.49
Dropping 'Monthly Charges' with VIF=74.18
Dropping 'CLTV' with VIF=10.62

Final dataset shape: (7043, 21)


In [248]:
# Display final VIF values
print("Final VIF values (all <= 10):")
final_vif

Final VIF values (all <= 10):


Unnamed: 0,Column,VIF
0,Tenure Months,7.458194
1,Contract_Two year,3.426331
2,Internet Service_Fiber optic,3.271298
3,city_encoded,2.763273
4,Paperless Billing_Yes,2.726404
5,Streaming Movies_No internet service,2.67716
6,Streaming Movies_Yes,2.668018
7,Streaming TV_Yes,2.639613
8,Partner_Yes,2.576993
9,Multiple Lines_Yes,2.441669


In [249]:
# Add target variable back and analyze correlations
df4['Churn Score'] = df3['Churn Score']

# Compute correlations with target variable
corr_with_target = df4.corr()['Churn Score'].drop('Churn Score').sort_values(key=abs, ascending=False)

print("Feature correlations with Churn Score:")
print("-" * 50)
for feature, corr in corr_with_target.items():
    print(f"{feature:<40} | {corr:>7.4f}")

Feature correlations with Churn Score:
--------------------------------------------------
city_encoded                             |  0.2883
Tenure Months                            | -0.2250
Internet Service_Fiber optic             |  0.2087
Contract_Two year                        | -0.2005
Payment Method_Electronic check          |  0.1946
Dependents_Yes                           | -0.1750
Streaming Movies_No internet service     | -0.1455
Paperless Billing_Yes                    |  0.1293
Online Security_Yes                      | -0.1190
Contract_One year                        | -0.1167
Partner_Yes                              | -0.1110
Tech Support_Yes                         | -0.1063
Senior Citizen_Yes                       |  0.1022
Payment Method_Credit card (automatic)   | -0.0858
Payment Method_Mailed check              | -0.0631
Device Protection_Yes                    | -0.0549
Streaming Movies_Yes                     |  0.0483
Online Backup_Yes                        | 

## 5. Test-Train Split

In [250]:
# Separate features and target
X = df4.drop('Churn Score', axis=1)
y = df4['Churn Score']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=RANDOM_STATE
)

print("Train-Test Split completed:")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Train mean churn score: {y_train.mean():.3f}")
print(f"Test mean churn score: {y_test.mean():.3f}")

Train-Test Split completed:
Training set: 5634 samples
Test set: 1409 samples
Train mean churn score: 58.408
Test mean churn score: 59.865


## 6. Feature Scaling

In [251]:
# Scale numerical features (Tenure Months and city_encoded)
scaler = MinMaxScaler()
cols_to_scale = ['Tenure Months', 'city_encoded']

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test_scaled[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

print("Data scaling completed")
print(f"X_train_scaled: {X_train_scaled.shape}")
print(f"X_test_scaled: {X_test_scaled.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")

Data scaling completed
X_train_scaled: (5634, 21)
X_test_scaled: (1409, 21)
y_train: (5634,)
y_test: (1409,)
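
Because the scaler is fitted on the training split only, test values that fall outside the training range are mapped outside [0, 1]. This is expected behaviour rather than an error. The sketch below uses made-up tenure values to illustrate it; `demo_scaler` is a separate object from the `scaler` fitted above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up tenure values; the training range is 1 to 60 months.
train_tenure = np.array([[1.0], [12.0], [60.0]])
test_tenure = np.array([[0.0], [30.0], [72.0]])

demo_scaler = MinMaxScaler()
demo_scaler.fit(train_tenure)  # learns min and max from training data only
scaled_test = demo_scaler.transform(test_tenure)
print(scaled_test.ravel())
```

Here the test values 0 and 72 lie outside the training range, so they scale to below 0 and above 1 respectively.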


## 7. Data Export

In [254]:
# y_train and y_test are already Series; squeeze() only matters if they are single-column DataFrames
y_train = y_train.squeeze()
y_test = y_test.squeeze()

# Save datasets
X_train_scaled.to_csv('artifacts/feature_engineering/X_train.csv', index=False)
X_test_scaled.to_csv('artifacts/feature_engineering/X_test.csv', index=False)
y_train.to_csv('artifacts/feature_engineering/y_train.csv', index=False)
y_test.to_csv('artifacts/feature_engineering/y_test.csv', index=False)

# Save feature names for future reference
feature_names = X_train_scaled.columns.tolist()
pd.DataFrame({'features': feature_names}).to_csv('artifacts/feature_engineering/feature_names.csv', index=False)

# Save scaler for future use
joblib.dump(scaler, 'artifacts/feature_engineering/scaler.pkl')

print("Data saved successfully.")
print("Feature names saved successfully.")
print("Scaler saved successfully.")

Data saved successfully.
Feature names saved successfully.
Scaler saved successfully.
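
The saved feature names can be used later to align new records with the training columns, since one-hot encoding a small batch may not produce every dummy column. The sketch below shows one way to do this with `DataFrame.reindex`; `feature_names_demo` is a hypothetical four-column subset, not the real contents of `feature_names.csv`.

```python
import pandas as pd

# Hypothetical subset of the saved feature names, in training order.
feature_names_demo = ["Tenure Months", "city_encoded", "Contract_Two year", "Partner_Yes"]

# A single new record before encoding (made-up values).
raw = pd.DataFrame({
    "Tenure Months": [5],
    "city_encoded": [0.25],
    "Contract": ["Month-to-month"],
    "Partner": ["Yes"],
})

encoded = pd.get_dummies(raw, columns=["Contract", "Partner"], dtype=int)
# Keep only the training columns, in order; dummies absent from this batch become 0.
aligned = encoded.reindex(columns=feature_names_demo, fill_value=0)
print(aligned)
```

Columns that the model never saw, such as `Contract_Month-to-month` here, are dropped by the reindex, which matches the effect of `drop_first=True` during training.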
