# 🧹 Customer Churn Data Cleaning

## Project Overview
This notebook focuses on data understanding and cleaning for the Customer Churn dataset.
The objective is to ensure the dataset is clean, consistent, and ready for exploratory
data analysis (EDA) and predictive modeling.

## 1️⃣ Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

## 2️⃣ Load Raw Dataset

In [2]:
df = pd.read_csv("../data/raw_data.csv")

## 3️⃣ Initial Data Inspection

### 3.1 Preview Dataset

In [3]:
df.head()

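|  | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |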


### 3.2 Dataset Shape

In [4]:
df.shape

(200, 28)

**Insight:**  
The dataset contains **200 rows** and **28 columns**.

## 4️⃣ Data Structure & Data Types

### 4.1 Dataset Information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tenure    200 non-null    float64
 1   age       200 non-null    float64
 2   address   200 non-null    float64
 3   income    200 non-null    float64
 4   ed        200 non-null    float64
 5   employ    200 non-null    float64
 6   equip     200 non-null    float64
 7   callcard  200 non-null    float64
 8   wireless  200 non-null    float64
 9   longmon   200 non-null    float64
 10  tollmon   200 non-null    float64
 11  equipmon  200 non-null    float64
 12  cardmon   200 non-null    float64
 13  wiremon   200 non-null    float64
 14  longten   200 non-null    float64
 15  tollten   200 non-null    float64
 16  cardten   200 non-null    float64
 17  voice     200 non-null    float64
 18  pager     200 non-null    float64
 19  internet  200 non-null    float64
 20  callwait  200 non-null    float64

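**Insight:**  
- All features are stored as numerical (`float64`), including 0/1 flags such as `equip` and `callcard`
- No categorical or string-type (`object`) variables detected, so no text parsing or encoding is needed
- Data types are consistent across columns, and at 200 rows the memory footprint is small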
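
Because even the 0/1 flags are held as `float64`, an optional follow-up is to cast whole-number columns to a compact integer dtype. The sketch below is not a step this notebook performs. It uses a hypothetical helper, `downcast_whole_number_columns`, and a three-row sample whose values are copied from the preview in section 3.1.

```python
import numpy as np
import pandas as pd


def downcast_whole_number_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which float columns holding only whole numbers become integers."""
    out = frame.copy()
    for col in out.select_dtypes(include="float64").columns:
        values = out[col]
        # Convert only when there are no NaNs and every value is a whole number.
        if values.notna().all() and np.all(values == np.floor(values)):
            out[col] = pd.to_numeric(values.astype("int64"), downcast="integer")
    return out


# Illustrative sample: first three preview rows of three columns.
sample = pd.DataFrame({
    "churn": [1.0, 1.0, 0.0],
    "custcat": [4.0, 1.0, 3.0],
    "longmon": [4.4, 9.45, 6.3],
})

compact = downcast_whole_number_columns(sample)
print(compact.dtypes)
```

Columns with fractional values, such as `longmon`, are left untouched. Whether the cast is worthwhile depends on the downstream tooling; many modeling libraries accept floats directly.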

## 5️⃣ Missing Value Analysis

In [6]:
df.isnull().sum()

tenure      0
age         0
address     0
income      0
ed          0
employ      0
equip       0
callcard    0
wireless    0
longmon     0
tollmon     0
equipmon    0
cardmon     0
wiremon     0
longten     0
tollten     0
cardten     0
voice       0
pager       0
internet    0
callwait    0
confer      0
ebill       0
loglong     0
logtoll     0
lninc       0
custcat     0
churn       0
dtype: int64

**Insight:**  
No missing values were detected across all columns.  
Data imputation is not required.
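
For datasets that do contain gaps, raw counts are easier to act on when paired with percentages. The following is a minimal sketch using a hypothetical helper, `missing_report`, on a made-up frame with one injected `NaN`; the values are illustrative and are not taken from the churn data.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column as a count and a percentage of rows."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that have at least one missing value.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative frame: one missing tenure value out of four rows.
demo = pd.DataFrame({
    "tenure": [11.0, 33.0, np.nan, 38.0],
    "income": [136.0, 33.0, 30.0, 76.0],
})

print(missing_report(demo))
```

An empty report corresponds to the situation found here: nothing to impute.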

## 6️⃣ Duplicate Data Check

In [7]:
df.duplicated().sum()

np.int64(0)

**Insight:**  
No duplicate records found in the dataset.
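
Had the count been non-zero, the next step would usually be to look at the offending rows before dropping them. A small sketch on a made-up frame (not the churn data) shows one way to do that with `duplicated(keep=False)` and `drop_duplicates()`.

```python
import pandas as pd

# Illustrative frame: the first and third rows are identical.
demo = pd.DataFrame({
    "tenure": [11.0, 33.0, 11.0],
    "age": [33.0, 33.0, 33.0],
})

# keep=False flags every member of a duplicate group, not just the repeats.
duplicate_rows = demo[demo.duplicated(keep=False)]
print(duplicate_rows)

# Drop the repeats, keeping the first occurrence, and rebuild the index.
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```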

## 7️⃣ Target Variable Validation

### 7.1 Churn Distribution

In [8]:
df["churn"].value_counts()

churn
0.0    142
1.0     58
Name: count, dtype: int64

**Insight:**  
The target variable `churn` is available and encoded as 0/1,
making it suitable for binary classification tasks.
The classes are moderately imbalanced: 142 rows (71%) are labeled 0 and 58 rows (29%) are labeled 1.
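
Class proportions are often more informative than raw counts when judging imbalance. As a sketch, the snippet below rebuilds a label column from the counts printed above (142 zeros and 58 ones) and uses `value_counts(normalize=True)`; in the notebook itself the same call would be applied to `df["churn"]`.

```python
import pandas as pd

# Rebuild the label column from the reported counts: 142 zeros, 58 ones.
churn = pd.Series([0.0] * 142 + [1.0] * 58, name="churn")

# normalize=True returns fractions of the total instead of counts.
proportions = churn.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(majority_baseline)
```

The majority-class share is a useful floor: a churn model needs to beat it to add value.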

## 8️⃣ Statistical Summary

In [9]:
df.describe()

|  | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | ... | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 35.505 | 41.165 | 11.65 | 75.13 | 2.825 | 10.225 | 0.425 | 0.705 | 0.29 | 11.78925 | ... | 0.275 | 0.44 | 0.455 | 0.46 | 0.44 | 2.193285 | 3.229185 | 3.951015 | 2.475 | 0.29 |
| std | 21.640971 | 13.076803 | 10.158419 | 128.430468 | 1.28555 | 8.95743 | 0.495584 | 0.457187 | 0.454901 | 9.88725 | ... | 0.447635 | 0.497633 | 0.49922 | 0.499648 | 0.497633 | 0.731282 | 0.281019 | 0.752553 | 1.079445 | 0.454901 |
| min | 1.0 | 19.0 | 0.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.095 | 1.749 | 2.197 | 1.0 | 0.0 |
| 25% | 16.75 | 31.0 | 3.0 | 31.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 5.5375 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.71175 | 3.2265 | 3.434 | 2.0 | 0.0 |
| 50% | 33.5 | 40.0 | 9.0 | 48.0 | 3.0 | 7.5 | 0.0 | 1.0 | 0.0 | 8.25 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.11 | 3.24 | 3.871 | 2.0 | 0.0 |
| 75% | 55.25 | 51.0 | 18.0 | 80.0 | 4.0 | 17.0 | 1.0 | 1.0 | 1.0 | 14.3 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.66 | 3.24 | 4.382 | 3.0 | 1.0 |
| max | 72.0 | 76.0 | 48.0 | 1668.0 | 5.0 | 44.0 | 1.0 | 1.0 | 1.0 | 62.3 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.132 | 4.227 | 7.419 | 4.0 | 1.0 |


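**Insight:**  
- Most feature ranges are within reasonable limits, and binary flags stay between 0 and 1
- `income` is strongly right-skewed: its maximum (1668) is far above the 75th percentile (80) and the mean (75.13), so it should be examined during EDA
- No rows are removed at this stage; the existing `lninc` column appears to hold the natural log of income and is less affected by the skew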
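
One way to check ranges numerically is Tukey's 1.5 × IQR rule. The sketch below defines two hypothetical helpers, `iqr_fences` and `flag_outliers`, and applies the rule to the `income` quartiles from the summary table above. The demo series combines the five `income` values from the preview with the reported maximum, so it is illustrative rather than the full column.

```python
import pandas as pd


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(series, k=1.5):
    """Return a boolean mask marking values outside the Tukey fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    low, high = iqr_fences(q1, q3, k)
    return (series < low) | (series > high)


# Quartiles for `income` copied from the describe() output: 25% = 31, 75% = 80.
low, high = iqr_fences(q1=31.0, q3=80.0)
print(low, high)  # -42.5 153.5

# The reported maximum of 1668 lies far above the upper fence.
print(1668.0 > high)

# Illustrative series: five preview values plus the reported maximum.
demo_income = pd.Series([136.0, 33.0, 30.0, 76.0, 80.0, 1668.0], name="income")
print(demo_income[flag_outliers(demo_income)])
```

A flagged value is not necessarily an error. For a skewed variable such as income, a log transform or a robust model may be preferable to dropping rows.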

## 9️⃣ Export Clean Dataset

In [10]:
df.to_csv("../data/cleaned_data.csv", index=False)

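**Insight:**  
The cleaned dataset has been exported without the index column and is ready
for EDA and machine learning modeling. Since no missing values or duplicates
were found, it keeps all 200 rows and 28 columns of the raw data.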
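
A quick round-trip check can confirm that an exported file reloads with the same shape, column names, dtypes, and values. This is a sketch on a tiny made-up frame written to a temporary directory, so it does not touch `../data/cleaned_data.csv`; the same comparison could be run on `df` after the export.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame standing in for the cleaned dataset.
demo = pd.DataFrame({"tenure": [11.0, 33.0], "churn": [1.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_data.csv"
    # index=False keeps the row index out of the file, as in the export cell.
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if shape, column names, dtypes, or values differ.
pd.testing.assert_frame_equal(demo, reloaded)
print(reloaded.shape)
```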

## ✅ Final Conclusion

### Data Cleaning Summary
- Dataset is complete (no missing values)
- Dataset is consistent (all features are numeric, stored as `float64`)
- Dataset is free from duplicates, and the summary statistics show no impossible values (binary flags stay within 0 to 1)
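
Value-level validity can also be asserted explicitly instead of read off the summary table. The sketch below uses a hypothetical helper, `invalid_binary_columns`, and a made-up frame with one deliberately bad label, to show how 0/1 columns such as `equip` or `churn` might be checked.

```python
import pandas as pd


def invalid_binary_columns(frame, columns):
    """Return the names of columns containing values other than 0 or 1."""
    # isin() is False for NaN, so columns with missing values are reported too.
    return [col for col in columns if not frame[col].isin([0, 1]).all()]


# Illustrative frame: the last churn label is deliberately invalid.
demo = pd.DataFrame({
    "equip": [0.0, 0.0, 1.0],
    "churn": [1.0, 0.0, 2.0],
})

print(invalid_binary_columns(demo, ["equip", "churn"]))
```

An empty list means every listed column contains only 0 and 1.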

### Next Steps
- Perform Exploratory Data Analysis (EDA)
- Build predictive churn models
- Evaluate and improve model performance