# **Telecom Churn Analytics:** 
## ***Building Predictive Models to Improve Customer Retention***

### Section 2 - ETL 🧹🛠️

---

### Objectives

This section prepares a cleaned dataset for visualization and analysis from the raw data file. The ETL procedure runs from data extraction, through data cleaning and processing, to data load.


### Inputs

* The dataset used for this analysis is the customer churn dataset from Kaggle (https://www.kaggle.com/datasets/mubeenshehzadi/customer-churn-dataset/).

* 1 raw file will be used.
    * [telecom_customer_churn.csv](../dataset/raw/telecom_customer_churn.csv) 

### Outputs

* A cleaned dataset will be saved as the CSV file below:
    * [telecom_customer_churn_cleaned.csv](../dataset/processed/telecom_customer_churn_cleaned.csv)





---

### Change working directory

* The notebooks are assumed to be stored in a subfolder of the project, so the working directory must be changed to the project root before running a notebook in the editor.

In [1]:
import os

os.chdir(os.path.dirname(os.getcwd()))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")


You set a new current directory: /Users/denniskwok/Documents/data-analytics/telecom-churn-ml-prediction
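
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of an alternative that searches upwards for the project root instead, assuming the root is the folder that contains `dataset/` (the helper name `find_project_root` and the `notebooks` folder name are hypothetical and used only for illustration):

```python
import tempfile
from pathlib import Path

def find_project_root(start: Path, marker: str = "dataset") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' folder found at or above {start}")

# Demonstration on a throwaway folder layout (not the real project)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "dataset").mkdir()
    (root / "notebooks").mkdir()
    found = find_project_root(root / "notebooks")
    print(found == root)
```

In the notebook, calling `os.chdir(find_project_root(Path.cwd()))` would then give the same result no matter how often the cell is run.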


---

## **Part A**

### **Data Extraction**

In [2]:
import numpy as np
import pandas as pd

#### Step 1: Load Dataset

In [3]:
# Load dataset from csv file
def load_csv(filepath):
    try:
        df = pd.read_csv(filepath)
        print(f"Loaded {filepath} successfully.")
        return df
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return pd.DataFrame()

df = load_csv("dataset/raw/telecom_customer_churn.csv")


Loaded dataset/raw/telecom_customer_churn.csv successfully.


#### Step 2: Overview of the Raw Dataset

**General dataframe information** 

In [4]:
df.info() # Display dataframe information  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


**Dataframe data overview**

In [5]:
df.head() # Display the first few rows of the dataframe

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


**Checking missing values**

In [6]:
df.isnull().sum() # Check for missing values

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
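
`isnull()` only counts true `NaN`/`None` values, so a blank string in a text column is reported as non-missing. A minimal sketch of a complementary check, run on a small made-up frame rather than the real dataset (the helper name `count_blank_strings` is hypothetical):

```python
import pandas as pd

def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only strings in each non-numeric column."""
    counts = {}
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold blank strings
        counts[name] = int(col.astype(str).str.strip().eq("").sum())
    return pd.Series(counts, dtype="int64")

# Toy data (illustrative only): one blank TotalCharges entry
toy = pd.DataFrame({
    "tenure": [0, 12],
    "TotalCharges": [" ", "358.2"],
    "Churn": ["No", "Yes"],
})

print(toy.isnull().sum().sum())   # isnull() reports no missing values
print(count_blank_strings(toy))   # the blank string is counted here
```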

---

## **Part B**

### **Data Transformation** 
### *Data cleaning and processing pipeline*

### 🧠 Pipeline Summary: Telecom Churn Data Preprocessing & Feature Engineering

The preprocessing pipeline is designed to transform the raw Telecom Customer Churn dataset into a clean, machine-learning-ready format. It standardises data quality, engineers meaningful service-based features, and applies robust encoding and scaling methods to support predictive modelling.

**Steps in the ETL Pipeline:**

1️⃣ Data Loading & Cleaning

* Source: dataset/raw/telecom_customer_churn.csv

* Removes personally identifiable information (customerID).

* Converts TotalCharges to numeric and imputes missing values using
TotalCharges = MonthlyCharges × tenure.

* Ensures the dataset contains no missing values before model training.
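
The conversion and imputation described in this step can be sketched on a few made-up rows. The values below are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "tenure": [0, 2, 10],
    "MonthlyCharges": [52.55, 53.85, 20.0],
    "TotalCharges": [" ", "108.15", ""],
})

# Blank or non-numeric strings become NaN instead of raising an error
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Fill only the NaN rows with MonthlyCharges * tenure
estimate = toy["MonthlyCharges"] * toy["tenure"]
toy["TotalCharges"] = toy["TotalCharges"].fillna(estimate)

print(toy)
```

With this formula, a row with a tenure of 0 receives a TotalCharges of 0.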

2️⃣ Feature Engineering

* Creates a new CustomerType feature categorising users as: "Phone only", "Internet only", "Both"

* Normalises inconsistent entries (“No internet service” → “No”).

* Derives NumInternetServices, counting the number of active internet-based add-ons such as
OnlineSecurity, OnlineBackup, TechSupport, StreamingTV, etc. This feature quantifies customer engagement with telecom products.
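
As a rough illustration of these two engineered features, the sketch below builds them on a toy frame with only two of the add-on columns. It uses `np.select` as a vectorised alternative to a row-wise `apply`; this is a swapped-in technique for illustration, not the implementation used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy customers (illustrative only)
toy = pd.DataFrame({
    "PhoneService":    ["Yes", "No", "Yes"],
    "InternetService": ["No", "DSL", "Fiber optic"],
    "OnlineSecurity":  ["No internet service", "Yes", "No"],
    "StreamingTV":     ["No internet service", "Yes", "Yes"],
})
add_ons = ["OnlineSecurity", "StreamingTV"]  # subset of the add-on columns

# Rule order: no internet first, then no phone, otherwise both
conditions = [toy["InternetService"] == "No", toy["PhoneService"] == "No"]
labels = ["Phone only", "Internet only"]
toy["CustomerType"] = np.select(conditions, labels, default="Both")

# "No internet service" means the same as "No" for an add-on
toy[add_ons] = toy[add_ons].replace("No internet service", "No")

# Count the add-ons each customer has switched on
toy["NumInternetServices"] = (toy[add_ons] == "Yes").sum(axis=1)

print(toy[["CustomerType", "NumInternetServices"]])
```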

3️⃣ Preprocessing Pipelines

🔹 Numeric Pipeline

* Features: tenure, MonthlyCharges, TotalCharges, NumInternetServices

    Steps:

    * SimpleImputer(strategy='median') – handles any residual missing numeric values.

    * StandardScaler() – standardises features to mean = 0, std = 1 for stable model performance.

🔹 Categorical Pipeline

*   Features: all non-numeric columns (e.g., Contract, PaymentMethod, gender, CustomerType), which at this stage also includes the target column Churn.

    Steps:

    * SimpleImputer(strategy='most_frequent') – fills missing categorical values.

    * OneHotEncoder(handle_unknown='ignore', drop='first') – converts categories into binary variables and drops the first category of each feature, which avoids perfectly collinear dummy columns.

    Both pipelines are combined via a ColumnTransformer into a single unified preprocessing step.

4️⃣ Outputs & Diagnostics

* The pipeline produces a fully encoded numeric feature matrix ready for model training.

* Prints detailed diagnostics:

    * Number of categorical and numeric features.

    * Mapping of categories per feature.

    * Final encoded column names.

* Asserts that no missing values remain in the final dataset.

In [7]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# =========================
# 1. Add service-based features
# =========================
def add_service_features(df):
    # CustomerType: Phone only, Internet only, Both
    def service_type(row):
        if row['InternetService'] == 'No':
            return "Phone only"
        elif row['PhoneService'] == 'No':
            return "Internet only"
        else:
            return "Both"
    df['CustomerType'] = df.apply(service_type, axis=1)

    # Internet add-ons
    internet_add_ons = [
        "OnlineSecurity", "OnlineBackup", "DeviceProtection",
        "TechSupport", "StreamingTV", "StreamingMovies"
    ]

    # Normalize "No internet service" to "No"
    for col in internet_add_ons:
        df[col] = df[col].replace("No internet service", "No")

    # Count number of "Yes" internet services
    df["NumInternetServices"] = (df[internet_add_ons] == "Yes").sum(axis=1)

    return df

# =========================
# 2. Load and clean dataset
# =========================
def etl_clean_data(filepath):
    df = pd.read_csv(filepath)

    # Add engineered features
    df = add_service_features(df)

    # Drop PII
    if 'customerID' in df.columns:
        df.drop("customerID", axis=1, inplace=True)

    # Convert TotalCharges to numeric
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

    # Impute missing TotalCharges using MonthlyCharges * tenure
    df.loc[df['TotalCharges'].isnull(), 'TotalCharges'] = (
        df['MonthlyCharges'] * df['tenure']
    )

    return df

# =========================
# 3. Preprocessing Pipelines
# =========================
def build_preprocessor(df):
    categorical_features = df.select_dtypes(include=["object"]).columns.tolist()
    numeric_features = ["tenure", "MonthlyCharges", "TotalCharges", "NumInternetServices"]

    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first"))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features)
        ]
    )

    return preprocessor

# =========================
# Prepare cleaned and encoded dataset for usage
# =========================
df = etl_clean_data("dataset/raw/telecom_customer_churn.csv")
preprocessor = build_preprocessor(df)
df_encoded = preprocessor.fit_transform(df)

print("✅ ETL and preprocessing pipeline built successfully!")
print(f"Data shape after cleaning: {df.shape}")
print(f"Categorical features: {len(df.select_dtypes(include=['object']).columns)}")
print(f"Numeric features: {len(df.select_dtypes(exclude=['object']).columns)}")
print(f"Data shape after encoding: {df_encoded.shape}")
# Get the OneHotEncoder from the preprocessor's categorical transformer
encoder = preprocessor.named_transformers_['cat'].named_steps['encoder']

# Get categorical feature names from the preprocessor
categorical_features = preprocessor.transformers_[1][2]

# Show the categories for each categorical feature
print('\nOneHotEncoder categories for each feature:\n')
for feature, cats in zip(categorical_features, encoder.categories_):
    print(f"{feature}: {cats}")

print('\nNew Columns after One-Hot Encoding:\n')
# Get the new column names after one-hot encoding
onehot_columns = encoder.get_feature_names_out(categorical_features)
print(onehot_columns.tolist(), '\n')

# Verify no missing values remain
assert df.isnull().sum().sum() == 0, "Missing values remain in the DataFrame"



✅ ETL and preprocessing pipeline built successfully!
Data shape after cleaning: (7043, 22)
Categorical features: 17
Numeric features: 5
Data shape after encoding: (7043, 27)

OneHotEncoder categories for each feature:

gender: ['Female' 'Male']
Partner: ['No' 'Yes']
Dependents: ['No' 'Yes']
PhoneService: ['No' 'Yes']
MultipleLines: ['No' 'No phone service' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes']
OnlineBackup: ['No' 'Yes']
DeviceProtection: ['No' 'Yes']
TechSupport: ['No' 'Yes']
StreamingTV: ['No' 'Yes']
StreamingMovies: ['No' 'Yes']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['No' 'Yes']
PaymentMethod: ['Bank transfer (automatic)' 'Credit card (automatic)' 'Electronic check'
 'Mailed check']
Churn: ['No' 'Yes']
CustomerType: ['Both' 'Internet only' 'Phone only']

New Columns after One-Hot Encoding:

['gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service', 'MultipleLines_Yes',

Remarks: After the conversion to a numeric type, empty strings and other invalid entries in the TotalCharges column become NaN. These missing values were imputed with MonthlyCharges × tenure, on the assumption that TotalCharges roughly equals the monthly charge multiplied by the tenure.

---

## **Part C**

### **Data Load** 
### *Creating cleaned and encoded dataset*

* The preprocessor built by the pipeline can be plugged directly into an end-to-end machine learning workflow. The cleaned and encoded datasets are now saved for use in the ML models.

In [8]:
import os

# Make sure the output folder exists before writing any file
os.makedirs("dataset/processed", exist_ok=True)

# Export the cleaned dataframe to a CSV file in the folder for processed CSV
df_cleaned = df.copy()
df_cleaned.to_csv("dataset/processed/telecom_customer_churn_cleaned.csv", index=False)

# Get encoded column names
encoded_columns = encoder.get_feature_names_out(categorical_features)

# Define numeric features
numeric_features = ["tenure", "MonthlyCharges", "TotalCharges", "NumInternetServices"]

# ColumnTransformer outputs the "num" block first, then the "cat" block,
# so the column names must follow the same order
df_encoded_final = pd.DataFrame(df_encoded, columns=numeric_features + list(encoded_columns))

# Save to CSV
df_encoded_final.to_csv("dataset/processed/telecom_customer_churn_encoded.csv", index=False)

print("Cleaned and encoded dataset exported to dataset/processed/")

Cleaned and encoded dataset exported to dataset/processed/


---

To be continued in Section 3: Data Visualization.