# Pre-processing and Training Data Development: Telco Customer Churn

This notebook prepares the Telco Customer Churn dataset for machine learning by performing the following preprocessing steps:

1. **Data Loading and Cleaning** - Load data and handle any missing values
2. **Creating Dummy Features** - Convert categorical variables into numeric format
3. **Feature Scaling** - Standardize numeric features so they're on the same scale
4. **Train/Test Split** - Divide data into training and testing sets

**Goal:** Create a clean, preprocessed dataset ready for building predictive models.


In [65]:
# Import the libraries needed for preprocessing work
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import preprocessing tools from sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Settings for displaying plots in the notebook
%matplotlib inline
sns.set(style='whitegrid')


## 1. Data Loading and Initial Cleaning

First, load the dataset and apply the basic cleaning steps identified during EDA.


In [66]:
# Load the Telco Customer Churn dataset
df = pd.read_csv('./WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Take a look at the data
print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns")
df.head()


Dataset has 7043 rows and 21 columns


,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [67]:
# Check the data types of each column, see which columns are categorical vs numerical
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### Fixing the TotalCharges Column

From the EDA, we found that the 'TotalCharges' column is stored as text (object) instead of a number. This happens because some rows contain a blank space instead of a number, so pandas reads the whole column as strings. We need to mark those blanks as missing and convert the column to a numeric type.


In [68]:
# Replace empty spaces with NaN so pandas recognizes them as missing
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)

# Convert the column to numeric type
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])

# Check how many missing values we have now
print(f"Number of missing values in TotalCharges: {df['TotalCharges'].isnull().sum()}")


Number of missing values in TotalCharges: 11
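
As an aside, the replace-then-convert step above can also be written as a single call. The sketch below uses a small made-up Series (not the real column) and assumes that any unparseable entry should simply become missing; `errors="coerce"` is the `pd.to_numeric` option that does this.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus blank entries.
# These values are illustrative only.
raw = pd.Series(["29.85", " ", "1889.5", ""], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an error, so no separate replace step is needed.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(f"Missing values: {converted.isnull().sum()}")
```

The trade-off is that `errors="coerce"` also hides unexpected junk values, so it is worth checking the count of missing values afterwards, as the cell above does.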


In [69]:
# Look at the rows with missing TotalCharges
# These are new customers with 0 tenure (they just signed up)
df[df['TotalCharges'].isnull()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']]


,customerID,tenure,MonthlyCharges,TotalCharges
488,4472-LVYGI,0,52.55,
753,3115-CZMZD,0,20.25,
936,5709-LVOEQ,0,80.85,
1082,4367-NUYAO,0,25.75,
1340,1371-DWPAZ,0,56.05,
3331,7644-OMVMY,0,19.85,
3826,3213-VVOLG,0,25.35,
4380,2520-SGTTA,0,20.0,
5218,2923-ARZLG,0,19.7,
6670,4075-WKNIU,0,73.35,


In [70]:
# Fill the missing TotalCharges with the MonthlyCharges value
# Assumption: a customer with 0 tenure has been billed for at most their first month,
# so MonthlyCharges is a reasonable stand-in (filling with 0 would be another defensible choice)
missing_total = df['TotalCharges'].isnull()
df.loc[missing_total, 'TotalCharges'] = df.loc[missing_total, 'MonthlyCharges']

# Verify there are no more missing values
print(f"Missing values after filling: {df['TotalCharges'].isnull().sum()}")


Missing values after filling: 0


### Dropping Unnecessary Columns

The 'customerID' column is just an identifier and won't help the model predict churn, so we should remove it.


In [71]:
# Drop the customerID column since it's not useful for prediction
df = df.drop('customerID', axis=1)

print(f"Dataset now has {df.shape[1]} columns")
print(f"\nColumns in dataset: {list(df.columns)}")


Dataset now has 20 columns

Columns in dataset: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


## 2. Creating Dummy Features for Categorical Variables

Most machine learning models expect numeric input, not text. So we need to convert our categorical columns (like "Yes"/"No", or "Male"/"Female") into numeric format.


In [72]:
# Identify categorical columns (those with 'object' data type)
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Categorical columns ({len(categorical_cols)} total):")
for col in categorical_cols:
    unique_values = df[col].unique()
    print(f"  - {col}: {unique_values}")


Categorical columns (16 total):
  - gender: ['Female' 'Male']
  - Partner: ['Yes' 'No']
  - Dependents: ['No' 'Yes']
  - PhoneService: ['No' 'Yes']
  - MultipleLines: ['No phone service' 'No' 'Yes']
  - InternetService: ['DSL' 'Fiber optic' 'No']
  - OnlineSecurity: ['No' 'Yes' 'No internet service']
  - OnlineBackup: ['Yes' 'No' 'No internet service']
  - DeviceProtection: ['No' 'Yes' 'No internet service']
  - TechSupport: ['No' 'Yes' 'No internet service']
  - StreamingTV: ['No' 'Yes' 'No internet service']
  - StreamingMovies: ['No' 'Yes' 'No internet service']
  - Contract: ['Month-to-month' 'One year' 'Two year']
  - PaperlessBilling: ['Yes' 'No']
  - PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
  - Churn: ['No' 'Yes']


In [73]:
# Identify numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Numerical columns ({len(numerical_cols)} total):")
for col in numerical_cols:
    print(f"  - {col}: range from {df[col].min():.2f} to {df[col].max():.2f}")


Numerical columns (4 total):
  - SeniorCitizen: range from 0.00 to 1.00
  - tenure: range from 0.00 to 72.00
  - MonthlyCharges: range from 18.25 to 118.75
  - TotalCharges: range from 18.80 to 8684.80


### Converting the Target Variable (Churn)

Before creating dummy variables, convert the target variable "Churn" to numeric format:
- "Yes" → 1 (customer churned)
- "No" → 0 (customer stayed)


In [74]:
# Convert the Churn column: Yes = 1, No = 0
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Verify the conversion worked
print("Churn value counts after conversion:")
print(df['Churn'].value_counts())


Churn value counts after conversion:
Churn
0    5174
1    1869
Name: count, dtype: int64
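
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN without any warning. The counts above add up to 7043, so nothing was lost here. The sketch below uses a made-up Series with a deliberate typo to show what such a check might look like.

```python
import pandas as pd

# Illustrative values only; the lowercase "yes" is a deliberate typo.
churn_demo = pd.Series(["Yes", "No", "yes", "No"])

mapped = churn_demo.map({"Yes": 1, "No": 0})

# Values missing from the dictionary are silently mapped to NaN
unmapped = mapped.isnull().sum()
print(f"Unmapped values: {unmapped}")
```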


### Creating Dummy Variables for All Categorical Features

Now use `pd.get_dummies()` to convert all remaining categorical columns into dummy variables.

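**Note:** We use `drop_first=True` to avoid redundancy. For a variable with k categories, the last dummy column is fully determined by the other k-1, so keeping all k adds no information and introduces perfect multicollinearity.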
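
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a four-row toy frame (the rows are made up; only the `Contract` category names come from the dataset).

```python
import pandas as pd

# Toy frame with the three Contract categories (rows are illustrative)
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# Full encoding: one column per category
full = pd.get_dummies(toy, columns=["Contract"])

# Reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding every row sums to 1, so any one column
# can be reconstructed from the others.
row_sums = full.sum(axis=1)
print(row_sums.tolist())
```

A row where both remaining `Contract_*` columns are False therefore means "Month-to-month", which becomes the baseline category.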


In [75]:
# Get the list of categorical columns (excluding Churn which we already converted)
categorical_cols_for_dummies = df.select_dtypes(include=['object']).columns.tolist()

print(f"Columns to convert to dummies: {categorical_cols_for_dummies}")


Columns to convert to dummies: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']


In [76]:
# Create dummy variables for all categorical columns
# drop_first=True removes one category from each variable to avoid multicollinearity
df_dummies = pd.get_dummies(df, columns=categorical_cols_for_dummies, drop_first=True)

# Check the shape before and after
print(f"Original dataframe shape: {df.shape}")
print(f"After creating dummies: {df_dummies.shape}")
print(f"\nWe went from {df.shape[1]} columns to {df_dummies.shape[1]} columns")


Original dataframe shape: (7043, 20)
After creating dummies: (7043, 31)

We went from 20 columns to 31 columns


In [77]:
# Print all the new column names
print("New column names after creating dummies:")
print(list(df_dummies.columns))


New column names after creating dummies:
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn', 'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service', 'MultipleLines_Yes', 'InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_No internet service', 'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service', 'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service', 'StreamingTV_Yes', 'StreamingMovies_No internet service', 'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']
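
Several of the new columns end in `_No internet service`. Since a customer without internet service has that value in every internet-related column, these dummies may carry the same information as `InternetService_No`. The sketch below shows one way to check for exact duplicate columns; it runs on a small made-up frame, and `duplicate_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame mimicking the structure of the internet-related columns
toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic", "No", "DSL"],
    "OnlineSecurity": ["Yes", "No internet service", "No",
                       "No internet service", "No"],
    "TechSupport": ["No", "No internet service", "Yes",
                    "No internet service", "Yes"],
})
encoded = pd.get_dummies(
    toy,
    columns=["InternetService", "OnlineSecurity", "TechSupport"],
    drop_first=True,
)

# Transpose so each dummy column becomes a row, then flag rows that
# repeat an earlier row exactly (the first occurrence is not flagged).
duplicate_mask = encoded.T.duplicated()
duplicate_cols = duplicate_mask[duplicate_mask].index.tolist()
print(duplicate_cols)
```

Whether to drop such columns depends on the model: tree-based models are largely indifferent, while unregularized linear models can be affected by exact duplicates.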


In [78]:
# Take a look at the first few rows of the transformed data
df_dummies.head()


,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,0,False,True,False,False,True,...,False,False,False,False,False,False,True,False,True,False
1,0,34,56.95,1889.5,0,True,False,False,True,False,...,False,False,False,False,True,False,False,False,False,True
2,0,2,53.85,108.15,1,True,False,False,True,False,...,False,False,False,False,False,False,True,False,False,True
3,0,45,42.3,1840.75,0,True,False,False,False,True,...,False,False,False,False,True,False,False,False,False,False
4,0,2,70.7,151.65,1,False,False,False,True,False,...,False,False,False,False,False,False,True,False,True,False


## 3. Standardizing Numeric Features

The numeric features have very different ranges:
- tenure: 0 to 72 (months)
- MonthlyCharges: 18.25 to 118.75 (dollars)
- TotalCharges: 18.80 to 8684.80 (dollars)

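Without scaling, features with larger values (like TotalCharges) can dominate models that are sensitive to feature magnitude, such as distance-based methods or regularized linear models.
We only scale these three continuous columns. The dummy variables and SeniorCitizen are already 0/1, so we leave them as they are.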
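
Standardizing means replacing each value x with z = (x - mean) / std, computed per column. The sketch below checks on a few made-up tenure values that this manual formula matches what `StandardScaler` produces.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy tenure values in months (illustrative, not taken from the dataset)
tenure = np.array([[0.0], [12.0], [24.0], [72.0]])

# Manual z-score. np.std uses the population standard deviation (ddof=0)
# by default, which is also what StandardScaler uses.
manual = (tenure - tenure.mean(axis=0)) / tenure.std(axis=0)

scaled = StandardScaler().fit_transform(tenure)

print(np.allclose(manual, scaled))
```

Note that pandas `describe()` reports the sample standard deviation (ddof=1), so on a scaled column it shows a value very slightly above 1 before rounding.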


In [79]:
# Define the columns that need to be scaled (original numeric columns)
columns_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Look at the current values before scaling
print("BEFORE Scaling:")
print(df_dummies[columns_to_scale].describe().round(2))


BEFORE Scaling:
        tenure  MonthlyCharges  TotalCharges
count  7043.00         7043.00       7043.00
mean     32.37           64.76       2279.80
std      24.56           30.09       2266.73
min       0.00           18.25         18.80
25%       9.00           35.50        398.55
50%      29.00           70.35       1394.55
75%      55.00           89.85       3786.60
max      72.00          118.75       8684.80


In [80]:
# Create a StandardScaler object
scaler = StandardScaler()

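# Fit the scaler on the full dataset and transform it
# Caveat: this runs before the train/test split, so test rows also influence the fitted mean and std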
df_dummies[columns_to_scale] = scaler.fit_transform(df_dummies[columns_to_scale])

# Look at the values after scaling
print("AFTER Scaling:")
print(df_dummies[columns_to_scale].describe().round(2))


AFTER Scaling:
        tenure  MonthlyCharges  TotalCharges
count  7043.00         7043.00       7043.00
mean     -0.00           -0.00         -0.00
std       1.00            1.00          1.00
min      -1.32           -1.55         -1.00
25%      -0.95           -0.97         -0.83
50%      -0.14            0.19         -0.39
75%       0.92            0.83          0.66
max       1.61            1.79          2.83


After scaling:
- The mean is now approximately 0 for all three columns
- The standard deviation is now approximately 1 for all three columns
- The values now range from about -1.55 to +2.83 instead of the original wide ranges

The three scaled features are now on a comparable scale.
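
The scaler in this notebook is fit on the full dataset, before the train/test split. A common alternative is to split first and fit the scaler on the training rows only, so that no information from the test set enters the preprocessing. The following is a minimal sketch of that ordering on randomly generated toy data; the frame and the variable names (`toy`, `X_tr`, `X_te`) are assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=100).astype(float),
    "MonthlyCharges": rng.uniform(18, 119, size=100),
    "Churn": rng.integers(0, 2, size=100),
})
cols = ["tenure", "MonthlyCharges"]

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="Churn"), toy["Churn"], test_size=0.2, random_state=42
)
X_tr = X_tr.copy()
X_te = X_te.copy()

# 2) Fit the scaler on the training rows only
scaler = StandardScaler()
X_tr[cols] = scaler.fit_transform(X_tr[cols])

# 3) Reuse the training mean and std to transform the test rows
X_te[cols] = scaler.transform(X_te[cols])

print(X_tr[cols].mean().round(2).tolist())
print(X_te[cols].mean().round(2).tolist())
```

With this ordering the training columns have mean 0 exactly, while the test columns are only close to 0, which is expected because they were scaled with statistics they did not contribute to.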

## 4. Splitting Data into Training and Testing Sets

We need to evaluate how well the model performs on data it has never seen before. 
We will use an 80/20 split: 80% for training and 20% for testing.


In [81]:
# Separate features (X) from target variable (y)
# X contains all columns EXCEPT Churn (the features we use to predict)
# y contains ONLY Churn (what we are trying to predict)

X = df_dummies.drop('Churn', axis=1)  # Features
y = df_dummies['Churn']                # Target variable

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")


Features (X) shape: (7043, 30)
Target (y) shape: (7043,)


In [82]:
# Split the data into training and testing sets
# test_size=0.2 means 20% goes to testing, 80% goes to training
# random_state=42 ensures we get the same split each time we run this code (for reproducibility)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples ({100*X_train.shape[0]/len(X):.1f}%)")
print(f"Testing set size: {X_test.shape[0]} samples ({100*X_test.shape[0]/len(X):.1f}%)")


Training set size: 5634 samples (80.0%)
Testing set size: 1409 samples (20.0%)


In [83]:
# Verify that the churn distribution is similar in both sets to ensure we have a fair representation in both training and testing

print("Churn distribution in Training Set:")
print(y_train.value_counts(normalize=True).round(3) * 100)

print("\nChurn distribution in Testing Set:")
print(y_test.value_counts(normalize=True).round(3) * 100)

print("\nChurn distribution in Original Data:")
print(y.value_counts(normalize=True).round(3) * 100)


Churn distribution in Training Set:
Churn
0    73.4
1    26.6
Name: proportion, dtype: float64

Churn distribution in Testing Set:
Churn
0    73.5
1    26.5
Name: proportion, dtype: float64

Churn distribution in Original Data:
Churn
0    73.5
1    26.5
Name: proportion, dtype: float64


The churn distribution is nearly identical across all three sets (about 73.5% not churned, 26.5% churned), so the random split is representative of the original data.
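
The split above is purely random, so the matching proportions are not guaranteed in general. `train_test_split` accepts a `stratify` argument that preserves the class proportions in both sets. The sketch below demonstrates it on a made-up target with roughly the same imbalance as Churn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 265 positives out of 1000 (illustrative values)
y_toy = np.array([1] * 265 + [0] * 735)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the share of each class the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
```

Stratifying matters most for small or heavily imbalanced datasets, where a random split can leave one set with too few positive cases.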


## 5. Saving the Preprocessed Data

Now that the data has been cleaned and prepared, save it so we can easily use it for model building later.


In [84]:
# Save the full preprocessed dataframe
df_dummies.to_csv('telco_churn_preprocessed.csv', index=False)
print("Saved preprocessed data to 'telco_churn_preprocessed.csv'")

# Save the training and testing sets separately (these are ready for modeling)
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

print("\nSaved training and testing sets:")
print("  - X_train.csv (training features)")
print("  - X_test.csv (testing features)")
print("  - y_train.csv (training target)")
print("  - y_test.csv (testing target)")


Saved preprocessed data to 'telco_churn_preprocessed.csv'

Saved training and testing sets:
  - X_train.csv (training features)
  - X_test.csv (testing features)
  - y_train.csv (training target)
  - y_test.csv (testing target)
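
When these files are loaded again for modeling, note that `pd.read_csv` always returns a DataFrame, even for a single-column file such as `y_train.csv`. The sketch below shows the round trip on a made-up target written to a temporary directory, so it does not touch the files saved above.

```python
import os
import tempfile

import pandas as pd

# Toy target (illustrative values), named like the real one
y_demo = pd.Series([0, 1, 0, 1], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_train.csv")
    y_demo.to_csv(path, index=False)

    # read_csv gives back a one-column DataFrame ...
    loaded = pd.read_csv(path)

# ... so select the column to get a Series again
y_loaded = loaded["Churn"]

print(type(loaded).__name__, loaded.shape)
print(type(y_loaded).__name__, y_loaded.shape)
```

Because the files were saved with `index=False`, the original row indices are not stored; rows in the `X_*` and `y_*` files line up by position only.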
