# Data Cleaning & Preparation

Purpose:
- Prepare clean, analysis-ready tables
- Fix data quality issues identified during EDA
- Make explicit decisions about scope and assumptions

This notebook focuses on correctness and consistency, not analysis.


In [2]:
import pandas as pd
import numpy as np

data_path = "../data/raw/"

customer = pd.read_excel(data_path + "CustomerChurn.xlsx")
services = pd.read_excel(data_path + "Telco_customer_churn_services.xlsx")
status = pd.read_excel(data_path + "Telco_customer_churn_status.xlsx")
demographics = pd.read_excel(data_path + "Telco_customer_churn_demographics.xlsx")
location = pd.read_excel(data_path + "Telco_customer_churn_location.xlsx")
population = pd.read_excel(data_path + "Telco_customer_churn_population.xlsx")

In [3]:
# Coerce to numeric; values that cannot be parsed become NaN instead of raising.
customer["Total Charges"] = pd.to_numeric(customer["Total Charges"], errors="coerce")

In [4]:
customer["Total Charges"].isna().sum()

np.int64(11)
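
Before deciding how to treat these 11 nulls, it can help to see which raw values failed to parse. The sketch below uses a made-up toy frame rather than the real workbook, and it assumes the check runs before the column is overwritten in place:

```python
import pandas as pd

# Toy stand-in for the raw column; the values are illustrative only.
raw = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Total Charges": ["29.85", " ", "1889.5"],
})

converted = pd.to_numeric(raw["Total Charges"], errors="coerce")

# Rows that held some raw value but became NaN after coercion.
failed = raw.loc[converted.isna() & raw["Total Charges"].notna()]
print(failed)
```

In this toy case the whitespace-only string is the value that fails to parse.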

In [5]:
status["churned"] = status["Churn Label"].map({
    "Yes": 1,
    "No": 0
})
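
`Series.map` with a dict returns NaN for any label missing from the dict, so a quick check for unmapped values is a reasonable guard. A minimal sketch on invented data (the `"Unknown"` label is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Illustrative labels only; "Unknown" is a hypothetical unexpected value.
status_demo = pd.DataFrame({"Churn Label": ["Yes", "No", "Unknown"]})
status_demo["churned"] = status_demo["Churn Label"].map({"Yes": 1, "No": 0})

# Any label that did not match a dict key shows up as NaN in the new column.
unmapped = status_demo.loc[status_demo["churned"].isna(), "Churn Label"].unique()
print(unmapped)
```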

### Columns intentionally excluded
- Offer: not relevant to revenue risk analysis
- Device protection details: not decision-critical

In [7]:
customer["Customer ID"].isna().sum()

np.int64(0)

In [8]:
services["Customer ID"].isna().sum()

np.int64(0)

In [9]:
status["Customer ID"].isna().sum()

np.int64(0)

In [10]:
customer["Customer ID"].nunique()

7043

In [11]:
services["Customer ID"].nunique()

7043

In [12]:
status["Customer ID"].nunique()

7043
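
The distinct counts above only demonstrate uniqueness when compared with each table's row count. One way to make that comparison explicit is sketched below; `key_report` is a hypothetical helper and the demo frame is invented:

```python
import pandas as pd

def key_report(df, key="Customer ID"):
    """Return row, null, and distinct counts for a candidate key column."""
    return {
        "rows": len(df),
        "nulls": int(df[key].isna().sum()),
        "distinct": int(df[key].nunique()),
    }

# Toy frame with a deliberate duplicate to show a failing case.
demo = pd.DataFrame({"Customer ID": ["A1", "B2", "B2"]})
report = key_report(demo)

# A valid key has no nulls and as many distinct values as rows.
is_valid_key = report["nulls"] == 0 and report["distinct"] == report["rows"]
print(report, is_valid_key)
```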

In [13]:
# Work on copies so the loaded frames are not modified by later steps.
customer_clean = customer.copy()
services_clean = services.copy()
status_clean = status.copy()

In [14]:
customer_clean.to_csv("../data/processed/customer_clean.csv", index=False)
services_clean.to_csv("../data/processed/services_clean.csv", index=False)
status_clean.to_csv("../data/processed/status_clean.csv", index=False)

## Data Cleaning Summary

### Data type corrections
- Converted `Total Charges` from string to numeric to enable revenue-related calculations.
- Invalid or non-numeric values were coerced to nulls to explicitly surface data quality issues rather than masking them.

### Missing values handling
- Identified 11 customers with missing `Total Charges` after conversion.
- These records were retained for churn analysis and excluded only from revenue calculations to avoid removing valid customers.
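
As a rough illustration of this rule, the sketch below keeps every customer in the churn rate and drops rows only for the revenue figure. The single toy frame is an assumption made for brevity; in this notebook `churned` lives in `status` and `Total Charges` in `customer`:

```python
import numpy as np
import pandas as pd

# Invented values; one customer has a missing Total Charges.
demo = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3", "D4"],
    "Total Charges": [100.0, np.nan, 50.0, 250.0],
    "churned": [1, 0, 0, 1],
})

# Churn rate uses all customers, including the one with missing charges.
churn_rate = demo["churned"].mean()

# Revenue calculations use only rows with a known Total Charges.
revenue_rows = demo.dropna(subset=["Total Charges"])
total_revenue = revenue_rows["Total Charges"].sum()
```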

### Churn variable standardization
- Selected `Churn Label` as the authoritative churn indicator based on clear business semantics.
- Created a binary `churned` flag (1 = churned, 0 = retained) to support aggregation and KPI calculations.

### Scope control
- Limited cleaning to columns required for churn and revenue analysis.
- Descriptive attributes not directly supporting the defined business questions were intentionally excluded at this stage.
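
If the exclusions were applied in code rather than left as a downstream convention, a sketch might look like the following. The column names in `EXCLUDED` and the demo frame are assumptions, not confirmed names from the source tables:

```python
import pandas as pd

# Assumed column names for the excluded attributes.
EXCLUDED = ["Offer", "Device Protection Plan"]

services_demo = pd.DataFrame({
    "Customer ID": ["A1"],
    "Offer": ["Offer A"],
    "Contract": ["Month-to-Month"],
})

# errors="ignore" skips any listed column that is absent from the frame.
services_scoped = services_demo.drop(columns=EXCLUDED, errors="ignore")
```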

### Key validation
- Validated that `Customer ID` is non-null and unique across core tables.
- Confirmed one-to-one cardinality between customer, services, and status tables prior to modelling.
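
Equal distinct counts do not by themselves show that the same IDs appear in every table. One possible stricter check, sketched on invented frames (the `Tenure` column is hypothetical), is an outer merge with `validate` and `indicator`:

```python
import pandas as pd

customer_demo = pd.DataFrame({"Customer ID": ["A1", "B2"], "Tenure": [1, 34]})
status_demo = pd.DataFrame({"Customer ID": ["A1", "B2"], "churned": [0, 1]})

# validate raises MergeError if the key is duplicated on either side;
# indicator adds a _merge column showing which side each row came from.
merged = customer_demo.merge(
    status_demo,
    on="Customer ID",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows present in only one table would appear here.
unmatched = merged[merged["_merge"] != "both"]
```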

### Processed datasets
- Created cleaned, analysis-ready versions of core tables.
- Saved processed datasets separately from raw data to preserve source integrity and reproducibility.
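
`to_csv` does not create missing directories, so a fresh checkout without `../data/processed/` would fail at the save step. A minimal sketch of a guard follows; `save_processed` is a hypothetical helper, and a temporary directory stands in for the real output path:

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(tables, out_dir):
    """Write each named DataFrame to <out_dir>/<name>.csv, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, df in tables.items():
        path = out_dir / f"{name}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths

# Demo against a temporary folder rather than the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    tables = {"customer_clean": pd.DataFrame({"Customer ID": ["A1"]})}
    written = save_processed(tables, Path(tmp) / "processed")
    written_names = [p.name for p in written]
    round_trip = pd.read_csv(written[0])
```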

