# Data Cleaning for Customer Churn Analysis

This notebook focuses on preparing raw data for machine learning.

## Project Overview
This notebook focuses on the **data understanding and data cleaning** phase of a customer churn dataset.
The objective is to prepare a clean and reliable dataset that can be used for further analysis or predictive modeling.

## Objectives
- Understand the structure and characteristics of the dataset
- Identify data quality issues
- Apply appropriate data cleaning techniques
- Produce a cleaned dataset ready for analysis

In [32]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [33]:
df = pd.read_csv("../data/raw_data.csv")
df.head()

| customer_id | gender | age | tenure | monthly_charges | total_charges | churn |
|---|---|---|---|---|---|---|


In [34]:
df.shape

(0, 7)
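
A shape of `(0, 7)` means the header was read but no data rows were, so it can help to fail early at load time. The sketch below is one possible guard, not part of the original notebook: `check_loaded` and `EXPECTED_COLUMNS` are hypothetical names, and the column list is copied from the `df.head()` output.

```python
import pandas as pd

# Column names as they appear in this notebook's output.
EXPECTED_COLUMNS = [
    "customer_id", "gender", "age", "tenure",
    "monthly_charges", "total_charges", "churn",
]

def check_loaded(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> None:
    """Raise ValueError if expected columns are absent or the frame has no rows."""
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if frame.empty:
        raise ValueError("no data rows were loaded; check the source file")

# Toy frame with the same header and zero rows, mimicking the shape (0, 7).
empty = pd.DataFrame(columns=EXPECTED_COLUMNS)
try:
    check_loaded(empty)
    status = "ok"
except ValueError as exc:
    status = str(exc)
print(status)
```

Calling `check_loaded(df)` right after `pd.read_csv` would stop the notebook before any cleaning step runs on an empty frame.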

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   customer_id      0 non-null      object
 1   gender           0 non-null      object
 2   age              0 non-null      object
 3   tenure           0 non-null      object
 4   monthly_charges  0 non-null      object
 5   total_charges    0 non-null      object
 6   churn            0 non-null      object
dtypes: object(7)
memory usage: 132.0+ bytes


In [36]:
df.describe(include="all")

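|        | customer_id | gender | age | tenure | monthly_charges | total_charges | churn |
|--------|---|---|---|---|---|---|---|
| count  | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| unique | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| top    | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq   | NaN | NaN | NaN | NaN | NaN | NaN | NaN |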


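### Initial Observations
From the initial exploration, we can observe that:
- The loaded DataFrame has 0 rows and 7 columns, so the source file should be checked before drawing conclusions about its contents
- Every column is read as `object`; judging by their names, `age`, `tenure`, `monthly_charges`, and `total_charges` are intended to be numerical, while `gender` and `churn` are categorical
- Missing values cannot be assessed until data rows are present
- Certain categorical variables may require normalization
- Further data quality checks are required before modeling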
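
As an illustration of what normalizing a categorical variable can look like, the sketch below trims whitespace and lowercases labels so that spelling variants collapse into one category. The helper name `normalize_categories` and the sample labels are made up for this example; they are not taken from the dataset.

```python
import pandas as pd

def normalize_categories(series: pd.Series) -> pd.Series:
    """Trim whitespace and lowercase string labels; missing values stay missing."""
    return series.str.strip().str.lower()

# Made-up labels for illustration only.
gender = pd.Series(["Male", " male", "FEMALE", None, "female "])
cleaned = normalize_categories(gender)
print(cleaned.value_counts().to_dict())
```

After normalization the five raw entries reduce to two labels plus one missing value.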

In [37]:
df.isnull().sum()

customer_id        0
gender             0
age                0
tenure             0
monthly_charges    0
total_charges      0
churn              0
dtype: int64

### Handling Missing Values
The following strategies are applied:
- Numerical columns: imputed using median values
- Categorical columns: imputed using the most frequent category

The median is used because it is less sensitive to outliers than the mean, and the most frequent category keeps imputed values within the labels already observed.
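
To make the outlier argument concrete, the small sketch below fills the same gap once with the mean and once with the median. The numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up monthly charges with one extreme value and one gap.
charges = pd.Series([20.0, 25.0, 30.0, None, 1000.0])

mean_fill = charges.fillna(charges.mean())      # mean of observed values: 268.75
median_fill = charges.fillna(charges.median())  # median of observed values: 27.5

print(mean_fill[3], median_fill[3])  # prints: 268.75 27.5
```

The single value of 1000 pulls the mean far above the typical charge, while the median stays close to it.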

In [38]:
# Numerical columns: fill gaps with each column's median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill gaps with the most frequent value
cat_cols = df.select_dtypes(include=["object"]).columns

for col in cat_cols:
    # Skip columns with no observed values: mode() is empty there,
    # so mode()[0] would raise
    if df[col].isnull().all():
        continue
    df[col] = df[col].fillna(df[col].mode()[0])

### Note on Categorical Variables
A categorical column may contain no observed values, in which case `mode()` returns an empty result and `mode()[0]` fails.
To prevent invalid imputation, such columns are skipped and left unchanged.
Note that `isnull().all()` is also `True` for a column with no rows, so on an empty DataFrame every column is skipped.
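
The same guard can be written as a small helper that tests the result of `mode()` directly. This is a sketch under the assumption that skipped columns should be returned unchanged; `fill_with_mode` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def fill_with_mode(series: pd.Series) -> pd.Series:
    """Fill gaps with the most frequent value; return the input unchanged
    when no value is observed."""
    modes = series.mode()  # missing values are excluded by default
    if modes.empty:
        return series
    return series.fillna(modes.iloc[0])

observed = pd.Series(["yes", None, "no", "yes"], dtype="object")
all_missing = pd.Series([None, None], dtype="object")

print(fill_with_mode(observed).tolist())
print(bool(fill_with_mode(all_missing).isna().all()))
```

Checking `modes.empty` covers both an all-missing column and a column with no rows.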

In [39]:
df.duplicated().sum()

np.int64(0)

In [40]:
df = df.drop_duplicates()
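
`df.duplicated()` with no arguments only flags rows that match on every column. If `customer_id` is meant to be unique (an assumption; the notebook does not state it), repeated IDs can be counted separately, as in this sketch on made-up rows.

```python
import pandas as pd

# Made-up rows: customer "2" appears twice with different tenure values.
toy = pd.DataFrame({
    "customer_id": ["1", "2", "2", "3"],
    "tenure": [5, 12, 13, 7],
})

exact_dupes = int(toy.duplicated().sum())                       # rows identical in every column
key_dupes = int(toy.duplicated(subset=["customer_id"]).sum())   # repeated customer IDs

# Keep the first row seen for each customer ID.
deduped = toy.drop_duplicates(subset=["customer_id"], keep="first")
print(exact_dupes, key_dupes, len(deduped))  # prints: 0 1 3
```

Which row to keep for a repeated ID is a modeling decision; `keep="first"` is only one option.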

In [41]:
df.dtypes

customer_id        object
gender             object
age                object
tenure             object
monthly_charges    object
total_charges      object
churn              object
dtype: object
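
Since every column is reported as `object`, numeric fields would need an explicit conversion before median imputation or modeling can use them. A minimal sketch with `pd.to_numeric` follows; the `NUMERIC_COLUMNS` list is an assumption based only on the column names, `coerce_numeric` is a hypothetical helper, and the sample values are invented.

```python
import pandas as pd

# Assumed numeric columns, judged only from their names.
NUMERIC_COLUMNS = ["age", "tenure", "monthly_charges", "total_charges"]

def coerce_numeric(frame: pd.DataFrame, columns=NUMERIC_COLUMNS) -> pd.DataFrame:
    """Return a copy with the given columns parsed as numbers.
    Values that cannot be parsed become NaN."""
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Made-up values for illustration only.
toy = pd.DataFrame({
    "age": ["34", "51"],
    "tenure": ["12", "n/a"],
    "monthly_charges": ["70.5", "20"],
    "total_charges": ["846.0", "unknown"],
})
converted = coerce_numeric(toy)
print(converted.dtypes)
print(int(converted.isna().sum().sum()))
```

Running the conversion before the imputation step would let `select_dtypes` pick these columns up as numerical, and the values coerced to `NaN` would then be filled by the median.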

### Data Cleaning Summary
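After applying the data cleaning steps:
- Missing-value imputation has been applied (median for numerical columns, most frequent category for categorical columns)
- Exact duplicate rows have been removed
- Data types have been inspected; all seven columns are currently stored as `object`
- The cleaned dataset is saved for further analysis or modeling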

In [42]:
df.to_csv("../data/cleaned_data.csv", index=False)
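
One way to confirm the export is to read the file back and compare its structure with the frame that was written. The sketch below does this with an in-memory buffer and made-up rows instead of the real file path.

```python
import io
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"customer_id": ["a1", "b2"], "churn": ["yes", "no"]})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

same_columns = list(restored.columns) == list(toy.columns)
same_rows = len(restored) == len(toy)
print(same_columns, same_rows)  # prints: True True
```

Passing `index=False` matters here: if the index is written, `read_csv` returns it as an extra column named `Unnamed: 0`.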

## Conclusion
This notebook demonstrates a structured approach to data understanding and data cleaning.
These steps support data quality and reproducibility, which are critical foundations for any data science workflow.