# Customer Churn - Data Preprocessing

This notebook prepares the data for modeling:
1. Handle missing values
2. Drop unnecessary columns
3. Encode the target and categorical features
4. Split into train/test sets
5. Scale numerical features (fit on the training set only)
6. Save processed data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

Libraries loaded successfully


## 1. Load the Raw Data

In [2]:
df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')
print(f"Original dataset: {df.shape[0]} rows, {df.shape[1]} columns")
df.head(3)

Original dataset: 7043 rows, 21 columns


index,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


## 2. Handle Missing Values

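During EDA we found that TotalCharges contains 11 blank strings rather than NaN, so pandas reads the column as text and `isna()` reports nothing missing until the column is converted to numeric.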

In [3]:
# Convert TotalCharges to numeric (blanks become NaN)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check how many NaN values we have now
print(f"Missing TotalCharges: {df['TotalCharges'].isna().sum()}")

Missing TotalCharges: 11
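
As a small illustration of why the conversion is needed, the sketch below uses a made-up three-row Series (not the real data) to show that a blank string is invisible to `isna()` until `pd.to_numeric(..., errors='coerce')` turns it into NaN.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are illustrative only
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# A blank string is still a string, so nothing is flagged as missing yet
missing_before = int(raw.isna().sum())

# errors='coerce' replaces anything unparseable with NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
missing_after = int(numeric.isna().sum())

print(missing_before, missing_after)
```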


In [4]:
# Let's look at these rows
df[df['TotalCharges'].isna()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']]

index,customerID,tenure,MonthlyCharges,TotalCharges
488,4472-LVYGI,0,52.55,
753,3115-CZMZD,0,20.25,
936,5709-LVOEQ,0,80.85,
1082,4367-NUYAO,0,25.75,
1340,1371-DWPAZ,0,56.05,
3331,7644-OMVMY,0,19.85,
3826,3213-VVOLG,0,25.35,
4380,2520-SGTTA,0,20.0,
5218,2923-ARZLG,0,19.7,
6670,4075-WKNIU,0,73.35,


In [5]:
# These are all new customers (tenure = 0)
# Their TotalCharges should logically be 0 or equal to MonthlyCharges
# We'll fill with 0 since they just joined

df['TotalCharges'] = df['TotalCharges'].fillna(0)
print(f"Missing values after fix: {df['TotalCharges'].isna().sum()}")

Missing values after fix: 0
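
Filling with 0 is only safe if every missing value belongs to a customer with tenure 0. A minimal guard, sketched here on a hypothetical toy frame (`toy` and its values are made up), could check that assumption before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one brand-new customer with no TotalCharges yet
toy = pd.DataFrame({
    "tenure": [0, 12, 3],
    "MonthlyCharges": [52.55, 20.00, 80.85],
    "TotalCharges": [np.nan, 240.00, 242.55],
})

missing = toy["TotalCharges"].isna()

# Fail loudly if a missing value belongs to an established customer
assert (toy.loc[missing, "tenure"] == 0).all(), "unexpected missing TotalCharges"

toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
print(toy)
```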


## 3. Drop Unnecessary Columns

In [6]:
# customerID is just an identifier, not useful for prediction
df = df.drop('customerID', axis=1)
print(f"Columns after dropping customerID: {df.shape[1]}")

Columns after dropping customerID: 20


## 4. Encode the Target Variable

In [7]:
# Convert Churn from Yes/No to 1/0
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

print("Churn value counts after encoding:")
print(df['Churn'].value_counts())

Churn value counts after encoding:
Churn
0    5174
1    1869
Name: count, dtype: int64
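
One caveat with `.map`: any label missing from the dictionary silently becomes NaN. The sketch below, run on a made-up Series with a deliberately miscased label, shows one way to catch that. The helper name `encode_yes_no` is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0 and raise if any other label is present."""
    encoded = series.map({"Yes": 1, "No": 0})
    # Labels that were not in the mapping come back as NaN
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped labels in {series.name}: {unmapped}")
    return encoded.astype(int)

clean = encode_yes_no(pd.Series(["Yes", "No", "No"], name="Churn"))
print(clean.tolist())

try:
    encode_yes_no(pd.Series(["Yes", "no"], name="Churn"))  # lowercase label
except ValueError as err:
    print(err)
```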


## 5. Encode Categorical Features

We'll use two strategies:
- **Binary columns** (Yes/No): Simple 0/1 encoding
- **Multi-category columns**: One-hot encoding
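
To make the two strategies concrete, here is a sketch on a made-up three-row frame. With `drop_first=True`, a column with k categories becomes k-1 indicator columns, and the dropped category (the first in sorted order) is represented by all of its indicators being false.

```python
import pandas as pd

# Illustrative rows only; the column names mirror the dataset
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Binary strategy: direct 0/1 mapping
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})

# Multi-category strategy: one-hot encoding, dropping the first category
toy = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(toy.columns.tolist())
```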

In [8]:
# First, let's see all unique values for each categorical column
cat_columns = df.select_dtypes(include=['object']).columns
print(f"Categorical columns: {len(cat_columns)}\n")

for col in cat_columns:
    print(f"{col}: {df[col].unique()}")

Categorical columns: 15

gender: <StringArray>
['Female', 'Male']
Length: 2, dtype: str
Partner: <StringArray>
['Yes', 'No']
Length: 2, dtype: str
Dependents: <StringArray>
['No', 'Yes']
Length: 2, dtype: str
PhoneService: <StringArray>
['No', 'Yes']
Length: 2, dtype: str
MultipleLines: <StringArray>
['No phone service', 'No', 'Yes']
Length: 3, dtype: str
InternetService: <StringArray>
['DSL', 'Fiber optic', 'No']
Length: 3, dtype: str
OnlineSecurity: <StringArray>
['No', 'Yes', 'No internet service']
Length: 3, dtype: str
OnlineBackup: <StringArray>
['Yes', 'No', 'No internet service']
Length: 3, dtype: str
DeviceProtection: <StringArray>
['No', 'Yes', 'No internet service']
Length: 3, dtype: str
TechSupport: <StringArray>
['No', 'Yes', 'No internet service']
Length: 3, dtype: str
StreamingTV: <StringArray>
['No', 'Yes', 'No internet service']
Length: 3, dtype: str
StreamingMovies: <StringArray>
['No', 'Yes', 'No internet service']
Length: 3, dtype: str
Contract: <StringArray>
['Month

In [9]:
# Binary columns (Yes/No only)
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']

for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

print("Binary columns encoded:")
df[binary_cols].head()

Binary columns encoded:


index,Partner,Dependents,PhoneService,PaperlessBilling
0,1,0,0,1
1,0,0,1,0
2,0,0,1,1
3,0,0,0,0
4,0,0,1,1


In [10]:
# Gender: Male/Female -> 0/1
df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})
print(f"Gender encoded: {df['gender'].unique()}")

Gender encoded: [1 0]


In [11]:
# Multi-category columns - use one-hot encoding
multi_cat_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 
                  'OnlineBackup', 'DeviceProtection', 'TechSupport',
                  'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']

# One-hot encode
df = pd.get_dummies(df, columns=multi_cat_cols, drop_first=True)

print(f"Shape after one-hot encoding: {df.shape}")

Shape after one-hot encoding: (7043, 31)


In [12]:
# View all columns now
print("All columns after encoding:")
print(df.columns.tolist())

All columns after encoding:
['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'Churn', 'MultipleLines_No phone service', 'MultipleLines_Yes', 'InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service', 'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service', 'StreamingTV_Yes', 'StreamingMovies_No internet service', 'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']
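
The jump from 20 to 31 columns can be checked by hand. The category counts below are read off the outputs in this section; the counts for Contract and PaymentMethod are inferred from their remaining dummy columns plus the one dropped category each.

```python
# Number of distinct categories per one-hot encoded column
n_categories = {
    "MultipleLines": 3, "InternetService": 3, "OnlineSecurity": 3,
    "OnlineBackup": 3, "DeviceProtection": 3, "TechSupport": 3,
    "StreamingTV": 3, "StreamingMovies": 3, "Contract": 3,
    "PaymentMethod": 4,
}

columns_before = 20                       # after dropping customerID
untouched = columns_before - len(n_categories)

# drop_first=True keeps k - 1 indicator columns per k-category feature
dummies = sum(k - 1 for k in n_categories.values())

columns_after = untouched + dummies
print(untouched, dummies, columns_after)  # 10 21 31
```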


## 6. Separate Features and Target

In [13]:
# X = features (everything except Churn)
# y = target (Churn)

X = df.drop('Churn', axis=1)
y = df['Churn']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({X.shape[1]}): {X.columns.tolist()}")

Features shape: (7043, 30)
Target shape: (7043,)

Feature columns (30): ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'MultipleLines_No phone service', 'MultipleLines_Yes', 'InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service', 'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service', 'StreamingTV_Yes', 'StreamingMovies_No internet service', 'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']


## 7. Train/Test Split

We split the data:
- **Training set (80%)**: Model learns from this
- **Test set (20%)**: We evaluate the model on this (data it's never seen)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,           # 20% for testing
    random_state=42,         # For reproducibility
    stratify=y               # Keep same churn ratio in both sets
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nChurn rate in training: {y_train.mean()*100:.1f}%")
print(f"Churn rate in test: {y_test.mean()*100:.1f}%")

Training set: 5634 samples
Test set: 1409 samples

Churn rate in training: 26.5%
Churn rate in test: 26.5%
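
The identical churn rates are the effect of `stratify=y`. A minimal sketch on made-up data (100 samples with a 25% positive rate, not the real dataset) shows the same behavior:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 25% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 25 + [0] * 75)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 25% positive rate
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```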


## 8. Scale Numerical Features

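Standardization rescales each numerical feature to mean 0 and standard deviation 1.
This helps models that are sensitive to feature magnitude, such as regularized logistic regression, SVMs, and k-nearest neighbors; tree-based models are largely unaffected.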

**Important**: We fit the scaler on the training data only, then transform both sets, so that no test-set statistics leak into training.
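
As a sketch of what "fit on train, transform both" means, the example below uses a made-up single feature and compares `StandardScaler` with the equivalent manual formula. The arrays are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature; the test set contains a value outside the training range
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(train)      # learns mean and std from train only
test_scaled = scaler.transform(test)      # reuses those training statistics

# Equivalent manual computation (StandardScaler uses the population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual))
```

Because the test set is scaled with training statistics, its mean and standard deviation will generally not be exactly 0 and 1, which is expected.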

In [15]:
# Identify numerical columns to scale
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Before scaling
print("Before scaling (training set):")
print(X_train[numerical_cols].describe().round(2))

Before scaling (training set):
        tenure  MonthlyCharges  TotalCharges
count  5634.00         5634.00       5634.00
mean     32.49           64.93       2299.33
std      24.57           30.14       2279.20
min       0.00           18.40          0.00
25%       9.00           35.66        402.98
50%      29.00           70.50       1394.92
75%      55.00           90.00       3835.82
max      72.00          118.75       8684.80


In [16]:
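# Initialize scaler
scaler = StandardScaler()

# Work on explicit copies so the assignments below never write into a view of X
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on training data only, then apply the same transform to both sets
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])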

print("After scaling (training set):")
print(X_train[numerical_cols].describe().round(2))

After scaling (training set):
        tenure  MonthlyCharges  TotalCharges
count  5634.00         5634.00       5634.00
mean     -0.00           -0.00          0.00
std       1.00            1.00          1.00
min      -1.32           -1.54         -1.01
25%      -0.96           -0.97         -0.83
50%      -0.14            0.18         -0.40
75%       0.92            0.83          0.67
max       1.61            1.79          2.80


## 9. Save Processed Data

In [17]:
# Save to processed folder
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

print("Saved processed data to ../data/processed/")
print("  - X_train.csv")
print("  - X_test.csv")
print("  - y_train.csv")
print("  - y_test.csv")

Saved processed data to ../data/processed/
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv
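
Because the files are written with `index=False`, the original row labels are discarded, and X and y stay aligned only by row order. A sketch of reading such files back, using a temporary directory and made-up values rather than the real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-ins for the saved splits, with a non-default index
X_toy = pd.DataFrame(
    {"tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]}, index=[10, 20]
)
y_toy = pd.Series([0, 1], name="Churn", index=[10, 20])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    X_toy.to_csv(out_dir / "X_train.csv", index=False)
    y_toy.to_csv(out_dir / "y_train.csv", index=False)

    X_back = pd.read_csv(out_dir / "X_train.csv")
    # A one-column CSV loads as a DataFrame; squeeze it back into a Series
    y_back = pd.read_csv(out_dir / "y_train.csv").squeeze("columns")

print(X_back.index.tolist(), y_back.name)
```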


## 10. Summary

**What we did:**
1. Fixed 11 missing TotalCharges values (filled with 0 for new customers)
2. Dropped customerID (not useful for prediction)
3. Encoded target: Churn → 0/1
4. Encoded categorical features:
   - Binary (Yes/No) → 0/1
   - Multi-category → One-hot encoding
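5. Split into 80% train / 20% test, stratified on Churn
6. Scaled numerical features (tenure, MonthlyCharges, TotalCharges) using training-set statistics
7. Saved the train/test splits to `../data/processed/`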

**Data is now ready for modeling!**

---
*Next: Proceed to `03_modeling.ipynb` to build and evaluate models*