## Objective

In this notebook we will:

- Load the merged telco churn dataset
- Remove columns that are identifiers, leakage, or not useful for modelling
- Engineer two interpretable, high-impact features
- Convert all categorical variables into numeric format
- Save a model-ready dataset for use in the modelling notebook

## Load the dataset

In [2]:
from pathlib import Path 
import pandas as pd 
import numpy as np 

NOTEBOOK_DIR = Path.cwd()

DATA_PROCESSED = (NOTEBOOK_DIR / "../data/processed").resolve()

df = pd.read_csv(DATA_PROCESSED / "telco_churn_master.csv", keep_default_na=False)
print("Loaded:", df.shape)
df.head()

Loaded: (7043, 48)


Unnamed: 0,Customer ID,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents,City,Zip Code,...,Total Revenue,Satisfaction Score,Customer Status,Churn Label,Churn Value,Churn Score,CLTV,Churn Category,Churn Reason,Population
0,8779-QRDMV,Male,78,No,Yes,No,No,0,Los Angeles,90022,...,59.65,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data,68701
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,Los Angeles,90063,...,1024.1,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer,55668
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,Los Angeles,90065,...,1910.88,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer,47534
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,Inglewood,90303,...,2995.07,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services,27778
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,Whittier,90602,...,3102.36,2,Churned,Yes,1,67,2793,Price,Extra data charges,26265


## Separate target (Churn) from features

We want the model to predict churn, so we first define `Churn Value` as the target.

In [3]:
target = "Churn Value"

## Removing Non-Model Features

Based on our EDA and correlation analysis, several variables should be excluded before modelling. These fall into four main groups:


**1. Identifiers and overly granular location fields**

These variables do not generalise, provide no predictive value, or introduce high-cardinality noise:

- Customer ID - purely an identifier  
- City - over 1,000 categories, too granular  
- Zip Code - over 1,600 categories, behaves like an ID  
- Latitude, Longitude - overly specific geographic precision with no clear relationship to churn  


**2. Redundant age-derived fields**

- Under 30 - fully determined by *Age*.  
  We retain *Senior Citizen* because we consider it a meaningful demographic segment, whereas the under-30 threshold adds no independent explanatory power.
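One way to confirm that a flag is fully determined by another column is to rebuild it and compare. The sketch below uses a few made-up rows and assumes the flag is defined as `Age < 30`; the cut-off is an assumption and should be checked against the real data.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22, 29, 30, 45, 71],
    "Under 30": ["Yes", "Yes", "No", "No", "No"],
})

# Rebuild the flag from Age alone (assumed rule: Age < 30) and compare
# it element-wise with the stored column.
rebuilt = (toy["Age"] < 30).map({True: "Yes", False: "No"})
is_redundant = bool((rebuilt == toy["Under 30"]).all())

print("Under 30 fully determined by Age:", is_redundant)
```

If the comparison holds on the full dataset, the flag carries no information that *Age* does not already provide.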



**3. Leakage variables**

These variables contain information that is not available prior to churn and would cause the model to “cheat”:

- Customer Status (Stayed / Churned / Joined)  
- Churn Label (direct copy of the target)  
- Churn Score (precomputed using churn outcomes)  
- Satisfaction Score (strongly correlated with churn, likely measured post-hoc)  
- Churn Category (only defined after churn occurs)  
- Churn Reason (observed only for churned customers)  

All of these are excluded to prevent data leakage.
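A quick way to see why such columns leak is to check how well they reproduce the target on their own. The sketch below uses made-up rows and assumes that *Churn Reason* is an empty string for customers who stayed; that encoding is an assumption about the real data.

```python
import pandas as pd

# Toy rows; the real columns may encode missing reasons differently.
toy = pd.DataFrame({
    "Churn Value": [1, 0, 0, 1, 0],
    "Churn Label": ["Yes", "No", "No", "Yes", "No"],
    "Churn Reason": ["Price", "", "", "Competitor made better offer", ""],
})

# Churn Label is the target under a different encoding.
label_matches_target = bool(
    (toy["Churn Label"].eq("Yes").astype(int) == toy["Churn Value"]).all()
)

# Churn Reason is filled in only for churned customers, so simply asking
# "is it non-empty?" reproduces the target as well.
reason_matches_target = bool(
    (toy["Churn Reason"].ne("").astype(int) == toy["Churn Value"]).all()
)

print(label_matches_target, reason_matches_target)
```

A feature that predicts the target perfectly by construction is a sign of leakage, not of signal.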


**4. Highly redundant lifetime and accumulated totals**

Correlation analysis revealed strong redundancy among accumulated spending variables, largely driven by tenure and historical usage:

- Tenure in Months <-> Total Charges (≈ 0.8+)  
- Total Revenue <-> Total Charges (≈ 0.9+)  
- Tenure in Months <-> Total Long Distance Charges (≈ 0.7)  

These variables largely reflect *how long* a customer has been active and *how much they have historically spent*, rather than independent behavioural signals. They add little incremental information beyond:

- Tenure in Months  
- Monthly Charge  
- Usage-related variables  

We therefore drop the following accumulated totals:

- Total Charges  
- Total Revenue  
- Total Long Distance Charges  
- Total Extra Data Charges  
- Total Refunds  

These features remain available in the original dataset for descriptive analysis, but are excluded from the modelling feature set.
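The redundancy check can be sketched as a scan of the absolute correlation matrix for pairs above a threshold. The toy frame below builds its totals from tenure and monthly charge on purpose, so the numbers are illustrative and are not the correlations reported above; the 0.7 threshold is likewise an assumption.

```python
import pandas as pd

# Toy data: the totals are derived from tenure and monthly charge,
# which mimics how accumulated totals arise.
toy = pd.DataFrame({
    "Tenure in Months": [1, 8, 18, 25, 37, 50, 72],
    "Monthly Charge": [39.65, 80.65, 95.45, 98.5, 76.5, 20.0, 60.0],
})
toy["Total Charges"] = toy["Tenure in Months"] * toy["Monthly Charge"]
toy["Total Revenue"] = toy["Total Charges"] * 1.1 + 10

corr = toy.corr().abs()
threshold = 0.7  # assumed cut-off for "highly correlated"
cols = corr.columns

# Collect each unordered pair once (upper triangle only).
flagged = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > threshold
]

for a, b, r in flagged:
    print(f"{a} <-> {b}: {r}")
```

On the real data, the same loop applied to the numeric columns would list the candidate pairs to review before dropping anything.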


In [4]:
cols_to_drop = [
    # Identifiers / geography
    "Customer ID",
    "City",
    "Zip Code",
    "Latitude", "Longitude",

    # Age-derived redundancy
    "Under 30",

    # Leakage / post-churn labels
    "Customer Status",
    "Churn Label",
    "Churn Score",
    "Satisfaction Score",
    "Churn Category",
    "Churn Reason",

    # Highly redundant lifetime totals
    "Total Charges",
    "Total Revenue",
    "Total Long Distance Charges",
    "Total Extra Data Charges",
    "Total Refunds",
]

df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])

print("Columns after dropping non-model features:", df.shape[1])
df.head()

Columns after dropping non-model features: 31


Unnamed: 0,Gender,Age,Senior Citizen,Married,Dependents,Number of Dependents,Referred a Friend,Number of Referrals,Tenure in Months,Offer,...,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Churn Value,CLTV,Population
0,Male,78,Yes,No,No,0,No,0,1,No Offer,...,Yes,No,No,Month-to-Month,Yes,Bank Withdrawal,39.65,1,5433,68701
1,Female,74,Yes,Yes,Yes,1,Yes,1,8,Offer E,...,No,No,Yes,Month-to-Month,Yes,Credit Card,80.65,1,5302,55668
2,Male,71,Yes,No,Yes,3,No,0,18,Offer D,...,Yes,Yes,Yes,Month-to-Month,Yes,Bank Withdrawal,95.45,1,3179,47534
3,Female,78,Yes,Yes,Yes,1,Yes,1,25,Offer C,...,Yes,No,Yes,Month-to-Month,Yes,Bank Withdrawal,98.5,1,5337,27778
4,Female,80,Yes,Yes,Yes,1,Yes,1,37,Offer C,...,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,76.5,1,2793,26265


## Adding new features that might help improve the model

We now create a small set of engineered features based on the EDA insights:

- TenureGroup - lifecycle bands capturing the sharp drop in churn after the first months
- NumServices - number of add-on services (strong retention effect)

### 1. TenureGroup

In [5]:
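# Bands are right-closed: (0, 6], (6, 12], ..., (48, 72];
# include_lowest=True also places a tenure of 0 in the first band.
df["TenureGroup"] = pd.cut(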
    df["Tenure in Months"],
    bins=[0, 6, 12, 24, 48, 72],
    labels=["0–6", "6–12", "12–24", "24–48", "48–72"],
    include_lowest=True
)

df["TenureGroup"].value_counts().sort_index()

TenureGroup
0–6      1470
6–12      716
12–24    1024
24–48    1594
48–72    2239
Name: count, dtype: int64
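The band edges are easy to misread, so here is a small sketch of how `pd.cut` assigns boundary values with the same bins and labels. The tenure values are made up to hit each edge; 73 is included only to show what happens outside the last bin.

```python
import pandas as pd

tenure = pd.Series([0, 1, 6, 7, 12, 13, 72, 73])

bands = pd.cut(
    tenure,
    bins=[0, 6, 12, 24, 48, 72],
    labels=["0–6", "6–12", "12–24", "24–48", "48–72"],
    include_lowest=True,
)

# Intervals are closed on the right, so 6 falls in "0–6" and 12 in "6–12".
# Values above the last edge (73 here) get no band and become NaN.
print(pd.DataFrame({"tenure": tenure, "band": bands}))
```

In the notebook output above, the band counts sum to 7043, so no customer falls outside the bins.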

### 2. NumServices (bundled add-ons)

In [6]:
service_cols = [
    "Online Security",
    "Online Backup",
    "Device Protection Plan",
    "Premium Tech Support",
    "Streaming TV",
    "Streaming Movies",
    "Streaming Music",
]

df["NumServices"] = df[service_cols].eq("Yes").sum(axis=1)
df["NumServices"].value_counts().sort_index()

NumServices
0    2173
1     827
2     852
3     901
4     851
5     688
6     494
7     257
Name: count, dtype: int64

## Separate features and target, inspect types

We now separate:
- y = target (Churn Value)
- feature_df = all remaining columns used as inputs

Then we inspect numeric vs categorical columns before encoding.

In [7]:
# Separate target and feature frame
y = df[target]
feature_df = df.drop(columns=[target])

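# Note: TenureGroup has dtype "category", so it is not counted in either list here;
# it is picked up in the re-inspection after the Yes/No mapping.
num_cols_before = feature_df.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols_before = feature_df.select_dtypes(include=["object"]).columns.tolist()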

print("Numeric columns before encoding:", len(num_cols_before))
print("Categorical columns before encoding:", len(cat_cols_before))
cat_cols_before

Numeric columns before encoding: 10
Categorical columns before encoding: 21


['Gender',
 'Senior Citizen',
 'Married',
 'Dependents',
 'Referred a Friend',
 'Offer',
 'Phone Service',
 'Multiple Lines',
 'Internet Service',
 'Internet Type',
 'Online Security',
 'Online Backup',
 'Device Protection Plan',
 'Premium Tech Support',
 'Streaming TV',
 'Streaming Movies',
 'Streaming Music',
 'Unlimited Data',
 'Contract',
 'Paperless Billing',
 'Payment Method']

In [8]:
# Target
target = "Churn Value"

# Features before encoding: 31 columns after dropping, plus 2 engineered features,
# minus the target = 32
feature_df = df.drop(columns=[target])

print("Total features before encoding:", feature_df.shape[1])


Total features before encoding: 32


## Encode categorical variables

### 1. Map Yes/No to 1/0

First, we convert true binary Yes/No variables to 1/0.  
We only map columns where the unique values are a subset of {"Yes", "No"} to avoid touching other text fields.

In [7]:
# Identify true Yes/No columns among categoricals
yes_no_cols = [
    col for col in cat_cols_before
    if set(feature_df[col].dropna().unique()).issubset({"Yes", "No"})
]

yes_no_cols

['Senior Citizen',
 'Married',
 'Dependents',
 'Referred a Friend',
 'Phone Service',
 'Multiple Lines',
 'Internet Service',
 'Online Security',
 'Online Backup',
 'Device Protection Plan',
 'Premium Tech Support',
 'Streaming TV',
 'Streaming Movies',
 'Streaming Music',
 'Unlimited Data',
 'Paperless Billing']

In [8]:
for col in yes_no_cols:
    feature_df[col] = feature_df[col].map({"Yes": 1, "No": 0})
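The subset check matters because `Series.map` with a dictionary turns every unmapped value into `NaN`. A minimal sketch on a toy frame (the helper name `is_yes_no` is introduced here for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Paperless Billing": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-Month", "One Year", "Two Year"],
})

def is_yes_no(series):
    """True when every non-missing value is 'Yes' or 'No'."""
    return set(series.dropna().unique()).issubset({"Yes", "No"})

yes_no = [col for col in toy.columns if is_yes_no(toy[col])]

# Mapping a non-binary column anyway would silently erase its labels.
unsafe = toy["Contract"].map({"Yes": 1, "No": 0})
n_lost = int(unsafe.isna().sum())

print(yes_no, n_lost)
```

Here only `Paperless Billing` passes the check, and mapping `Contract` regardless would have lost all three of its values.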

In [9]:
# Re-inspect numeric vs categorical columns after mapping
num_cols_after_yesno = feature_df.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols_after_yesno = feature_df.select_dtypes(include=["object", "category"]).columns.tolist()

print("Numeric columns after Yes/No mapping:", len(num_cols_after_yesno))
print("Remaining categorical columns:", len(cat_cols_after_yesno))
cat_cols_after_yesno

Numeric columns after Yes/No mapping: 26
Remaining categorical columns: 6


['Gender',
 'Offer',
 'Internet Type',
 'Contract',
 'Payment Method',
 'TenureGroup']

### 2. One-hot encode remaining categorical variables

The remaining categorical variables are Gender (binary, but not coded as Yes/No) and five variables with more than two categories (Offer, Internet Type, Contract, Payment Method, TenureGroup).

We use one-hot encoding (`pd.get_dummies`) with `drop_first=True` to:

- convert them into numeric dummy variables
- avoid perfect multicollinearity between categories.
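To illustrate the multicollinearity point, the sketch below compares the rank of a design matrix (intercept plus dummies) with and without `drop_first=True`. The four contract values are toy data, and the helper `rank_with_intercept` is a name made up for this example.

```python
import numpy as np
import pandas as pd

contract = pd.Series(
    ["Month-to-Month", "One Year", "Two Year", "One Year"], name="Contract"
)

full = pd.get_dummies(contract, dtype=int)                      # one column per level
reduced = pd.get_dummies(contract, drop_first=True, dtype=int)  # first level dropped

def rank_with_intercept(dummies):
    """Rank and column count of [intercept | dummies]."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

# With every level kept, the dummies sum to 1 in each row and therefore
# duplicate the intercept column: the rank is lower than the column count.
full_rank, full_cols = rank_with_intercept(full)
red_rank, red_cols = rank_with_intercept(reduced)

print("all levels :", full_rank, "of", full_cols)
print("drop_first :", red_rank, "of", red_cols)
```

The dropped level becomes the baseline that the remaining dummies are compared against. This matters mainly for linear models; tree-based models are not affected by the redundant column.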

In [10]:
feature_df_encoded = pd.get_dummies(
    feature_df,
    drop_first=True
)

print("Shape before encoding:", feature_df.shape)
print("Shape after encoding:", feature_df_encoded.shape)
feature_df_encoded.head()

Shape before encoding: (7043, 32)
Shape after encoding: (7043, 43)


Unnamed: 0,Age,Senior Citizen,Married,Dependents,Number of Dependents,Referred a Friend,Number of Referrals,Tenure in Months,Phone Service,Avg Monthly Long Distance Charges,...,Internet Type_Fiber Optic,Internet Type_No Internet,Contract_One Year,Contract_Two Year,Payment Method_Credit Card,Payment Method_Mailed Check,TenureGroup_6–12,TenureGroup_12–24,TenureGroup_24–48,TenureGroup_48–72
0,78,1,0,0,0,0,0,1,0,0.0,...,False,False,False,False,False,False,False,False,False,False
1,74,1,1,1,1,1,1,8,1,48.85,...,True,False,False,False,True,False,True,False,False,False
2,71,1,0,1,3,0,0,18,1,11.33,...,True,False,False,False,False,False,False,True,False,False
3,78,1,1,1,1,1,1,25,1,19.76,...,True,False,False,False,False,False,False,False,True,False
4,80,1,1,1,1,1,1,37,1,6.33,...,True,False,False,False,False,False,False,False,True,False


## Final sanity checks and assemble model-ready dataframe

At this point:

- All features in `feature_df_encoded` should be numeric (`int64` / `float64` / `bool`).
- The target `Churn Value` is stored separately as `y`, but we will add it back to the final dataframe for convenience.
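Besides reading the `info()` listing, the numeric-only expectation can be checked programmatically. A small sketch on a toy frame, where `Gender` is deliberately left unencoded to show what a failure looks like:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

toy = pd.DataFrame({
    "Age": [78, 74],
    "Monthly Charge": [39.65, 80.65],
    "Contract_One Year": [False, True],
    "Gender": ["Male", "Female"],  # left as text on purpose
})

# Any column that is neither numeric nor boolean still needs encoding.
non_numeric = [
    col for col in toy.columns
    if not (is_numeric_dtype(toy[col]) or is_bool_dtype(toy[col]))
]

print("Columns still needing encoding:", non_numeric)
```

Applied to `feature_df_encoded`, an empty list would confirm that the frame is fully numeric.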

In [11]:
feature_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 43 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Age                                7043 non-null   int64  
 1   Senior Citizen                     7043 non-null   int64  
 2   Married                            7043 non-null   int64  
 3   Dependents                         7043 non-null   int64  
 4   Number of Dependents               7043 non-null   int64  
 5   Referred a Friend                  7043 non-null   int64  
 6   Number of Referrals                7043 non-null   int64  
 7   Tenure in Months                   7043 non-null   int64  
 8   Phone Service                      7043 non-null   int64  
 9   Avg Monthly Long Distance Charges  7043 non-null   float64
 10  Multiple Lines                     7043 non-null   int64  
 11  Internet Service                   7043 non-null   int64

In [12]:
model_df = feature_df_encoded.copy()
model_df[target] = y

print("Final model-ready shape:", model_df.shape)
model_df.head()

Final model-ready shape: (7043, 44)


Unnamed: 0,Age,Senior Citizen,Married,Dependents,Number of Dependents,Referred a Friend,Number of Referrals,Tenure in Months,Phone Service,Avg Monthly Long Distance Charges,...,Internet Type_No Internet,Contract_One Year,Contract_Two Year,Payment Method_Credit Card,Payment Method_Mailed Check,TenureGroup_6–12,TenureGroup_12–24,TenureGroup_24–48,TenureGroup_48–72,Churn Value
0,78,1,0,0,0,0,0,1,0,0.0,...,False,False,False,False,False,False,False,False,False,1
1,74,1,1,1,1,1,1,8,1,48.85,...,False,False,False,True,False,True,False,False,False,1
2,71,1,0,1,3,0,0,18,1,11.33,...,False,False,False,False,False,False,True,False,False,1
3,78,1,1,1,1,1,1,25,1,19.76,...,False,False,False,False,False,False,False,True,False,1
4,80,1,1,1,1,1,1,37,1,6.33,...,False,False,False,False,False,False,False,True,False,1


### Why We Do Not Scale Features in This Notebook

Tree-based models - such as Random Forest, Gradient Boosting, XGBoost, and LightGBM - do NOT require scaling, because they split data using thresholds on one feature at a time (e.g., Monthly Charge > 80?). Their splits depend only on the ordering of values within a feature, not on distances or dot products between feature vectors. As a result:

- Scaling does not change tree decisions
- Scaling does not improve accuracy
- Scaling does not improve training speed
- Scaling may even make feature interpretation harder
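The first point can be sketched without any modelling library: a threshold split depends only on the ordering of a feature, so standardising the feature leaves the resulting partition unchanged. The function below is a simplified single-feature Gini split written for illustration (not the algorithm of any particular library), and the data is made up.

```python
import numpy as np

def best_threshold_partition(x, y):
    """Boolean mask of the rows sent left by the Gini-optimal split on x."""
    xs = np.sort(x)
    best_score, best_mask = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        threshold = (xs[i] + xs[i - 1]) / 2
        left = x <= threshold
        score = 0.0
        for side in (left, ~left):
            p = y[side].mean()
            score += side.sum() * 2 * p * (1 - p)  # size-weighted Gini impurity
        if score < best_score:
            best_score, best_mask = score, left
    return best_mask

# Toy values for illustration only.
monthly_charge = np.array([39.65, 80.65, 95.45, 98.5, 76.5, 20.0, 60.0, 105.0])
churned = np.array([0, 1, 1, 1, 0, 0, 0, 1])

raw = best_threshold_partition(monthly_charge, churned)
standardised = (monthly_charge - monthly_charge.mean()) / monthly_charge.std()
scaled = best_threshold_partition(standardised, churned)

same_partition = bool(np.array_equal(raw, scaled))
print("Same rows on each side after scaling:", same_partition)
```

The threshold value itself changes with the scale, but the rows on each side of it do not.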

Since tree-based models are a common choice for churn prediction, we keep raw feature scales.

Distance-based and regularised linear models, such as KNN, SVM, and Logistic Regression, are sensitive to feature scale and generally do need scaling.
To keep a single unscaled dataset that also works for tree models, we apply scaling inside the modelling pipeline only when needed, not during feature engineering.

This ensures:

- No mixing of scaled and unscaled versions
- No data leakage (scaling must be fit on the training split only)
- Flexibility to use both linear and tree-based models correctly
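As a rough sketch of what scaling inside the pipeline could look like, the example below uses scikit-learn's `make_pipeline` with `StandardScaler` and `LogisticRegression`. The data is synthetic and the model choice is an assumption; the modelling notebook may be set up differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two features and a target (illustration only).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 73, size=200),    # plays the role of Tenure in Months
    rng.uniform(20, 120, size=200),   # plays the role of Monthly Charge
])
y = (X[:, 0] < 12).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is a pipeline step, so fit() learns its mean and variance
# from the training split only and reuses them when scoring the test split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
print("Scaler means (from training data):", scaler.mean_)
print("Test accuracy:", pipe.score(X_test, y_test))
```

Because the saved dataset stays unscaled, the same file can feed this pipeline and a tree-based model without any extra bookkeeping.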

## Save model-ready dataset

We save the processed, fully numeric dataset into a new folder: `data/model_ready`.

This file will be the input for the modelling notebook (`04_Modelling.ipynb`).

In [13]:
MODEL_DATA = (NOTEBOOK_DIR / "../data/model_ready").resolve()
MODEL_DATA.mkdir(parents=True, exist_ok=True)

output_path = MODEL_DATA / "telco_churn_model_ready.csv"
model_df.to_csv(output_path, index=False)

print("Saved model-ready dataset to:", output_path)

Saved model-ready dataset to: /Users/pedro.cabeco/Project_EDSB-1/Sandbox/data/model_ready/telco_churn_model_ready.csv
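For orientation, here is a sketch of how the saved file might be read back and split into features and target. The loading code of `04_Modelling.ipynb` is not shown in this notebook, so this is an assumption; an in-memory buffer stands in for the CSV file and the rows are toy values.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Age": [78, 74],
    "Contract_One Year": [False, True],
    "Churn Value": [1, 0],
})

# Write and re-read through an in-memory buffer instead of a real file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
X = reloaded.drop(columns=["Churn Value"])
y = reloaded["Churn Value"]

# Dummy columns written as True/False come back with a boolean dtype.
print(X.dtypes)
```

Writing with `index=False`, as in the cell above, avoids an extra unnamed index column when the file is reloaded.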
