## Churn Data - Type Description & Data Fixing

### By:
jdg

### Date:
2026-02-21

### Description:

Loads the intermediate Parquet file produced by `1-data/01_jdg_churn_data_loading_20260221.ipynb`.
The goals of this notebook are:
- Inspect inferred data types and identify columns with wrong types
- Detect and quantify missing values
- Explore the cardinality and value distribution of each column
- Apply the necessary fixes (type casting, whitespace handling, encoding corrections)
- Save the cleaned dataset as a new Parquet file ready for EDA


## 📚 Import libraries

In [5]:
from pathlib import Path

import numpy as np
import pandas as pd

## 💾 Load data

In [None]:
INTERMEDIATE_PATH = Path("../../data/02_intermediate/Churn/churn_raw.parquet")

df = pd.read_parquet(INTERMEDIATE_PATH)

print(f"Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
df.head()

## 👷 Data description

### 1. Shape and column overview

In [7]:
print(f"Rows   : {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")
print()
print("Column names:")
for col in df.columns:
    print(f"  - {col}")

Rows   : 14,214
Columns: 20

Column names:
  - gender
  - SeniorCitizen
  - Partner
  - Dependents
  - tenure
  - PhoneService
  - MultipleLines
  - InternetService
  - OnlineSecurity
  - OnlineBackup
  - DeviceProtection
  - TechSupport
  - StreamingTV
  - StreamingMovies
  - Contract
  - PaperlessBilling
  - PaymentMethod
  - MonthlyCharges
  - TotalCharges
  - Churn


### 2. Inferred data types (`dtypes`)

In [8]:
dtype_summary = df.dtypes.reset_index().rename(columns={"index": "column", 0: "dtype"})
dtype_summary

| | column | dtype |
|---|---|---|
| 0 | gender | object |
| 1 | SeniorCitizen | float64 |
| 2 | Partner | object |
| 3 | Dependents | object |
| 4 | tenure | float64 |
| 5 | PhoneService | object |
| 6 | MultipleLines | object |
| 7 | InternetService | object |
| 8 | OnlineSecurity | object |
| 9 | OnlineBackup | object |


### 3. Detailed schema - `df.info()`

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14214 entries, 0 to 14213
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14187 non-null  object 
 1   SeniorCitizen     14154 non-null  float64
 2   Partner           14153 non-null  object 
 3   Dependents        14124 non-null  object 
 4   tenure            14114 non-null  float64
 5   PhoneService      14096 non-null  object 
 6   MultipleLines     14077 non-null  object 
 7   InternetService   14053 non-null  object 
 8   OnlineSecurity    14024 non-null  object 
 9   OnlineBackup      14013 non-null  object 
 10  DeviceProtection  14020 non-null  object 
 11  TechSupport       14017 non-null  object 
 12  StreamingTV       14015 non-null  object 
 13  StreamingMovies   14036 non-null  object 
 14  Contract          14092 non-null  object 
 15  PaperlessBilling  14092 non-null  object 
 16  PaymentMethod     14088 non-null  object

### 4. Numeric columns - descriptive statistics

In [10]:
df.describe(include=[np.number])

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 14154.0 | 14114.0 | 14101.0 |
| mean | 0.162074 | 32.36276 | 372028800.0 |
| std | 0.368532 | 24.568811 | 44155450000.0 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.9 |
| max | 1.0 | 72.0 | 5243355000000.0 |
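
The mean of `MonthlyCharges` (about 3.7e8) sits far above its 75th percentile (89.9), which suggests a handful of corrupted values rather than a genuinely skewed distribution. A minimal sketch of one way to flag such values, assuming an interquartile-range rule with an arbitrary multiplier of 3; the helper name `flag_iqr_outliers` and the toy data are illustrative and not part of this notebook:

```python
import numpy as np
import pandas as pd


def flag_iqr_outliers(s: pd.Series, k: float = 3.0) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaN compares False on both sides, so missing values are never flagged
    return (s < lower) | (s > upper)


# Toy data: plausible monthly charges plus one corrupted entry
toy = pd.Series([18.25, 35.5, 70.35, 89.9, np.nan, 5.2e12], name="MonthlyCharges")
mask = flag_iqr_outliers(toy)
print(toy[mask])
```

Whether flagged rows should be set to `NaN`, capped, or dropped is a separate decision that depends on how many rows are affected.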


### 5. Categorical / object columns - descriptive statistics

In [11]:
df.describe(include=[object])

| | gender | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14187 | 14153 | 14124 | 14096 | 14077 | 14053 | 14024 | 14013 | 14020 | 14017 | 14015 | 14036 | 14092 | 14092 | 14088 | 14143.0 | 14125 |
| unique | 2 | 2 | 2 | 3 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | 4 | 3 | 2 | 4 | 6537.0 | 2 |
| top | Male | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 20.2 | No |
| freq | 7160 | 7316 | 9899 | 12740 | 6784 | 6182 | 6958 | 6154 | 6159 | 6899 | 5591 | 5549 | 7760 | 8343 | 4737 | 24.0 | 10365 |


In [None]:
print(df["MultipleLines"].value_counts(dropna=False))

### 6. Missing values

In [12]:
missing = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().sum() / len(df) * 100).round(2),
    }
)
missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)

| | missing_count | missing_pct |
|---|---|---|
| OnlineBackup | 201 | 1.41 |
| StreamingTV | 199 | 1.4 |
| TechSupport | 197 | 1.39 |
| DeviceProtection | 194 | 1.36 |
| OnlineSecurity | 190 | 1.34 |
| StreamingMovies | 178 | 1.25 |
| InternetService | 161 | 1.13 |
| MultipleLines | 137 | 0.96 |
| PaymentMethod | 126 | 0.89 |
| Contract | 122 | 0.86 |


In [13]:
print(f"Total cells with NaN: {df.isna().sum().sum()}")

Total cells with NaN: 2556


### 7. Unique value counts per categorical column

In [14]:
CARDINALITY_THRESHOLD = 10

object_cols = df.select_dtypes(include=object).columns

for col in object_cols:
    n_unique = df[col].nunique()
    print(f"\n{'─' * 50}")
    print(f"Column : {col}")
    print(f"Unique values: {n_unique}")
    if n_unique <= CARDINALITY_THRESHOLD:
        print(df[col].value_counts().to_string())
    else:
        print(f"  (high cardinality - sample: {df[col].unique()[:5].tolist()}...)")


──────────────────────────────────────────────────
Column : gender
Unique values: 2
gender
Male      7160
Female    7027

──────────────────────────────────────────────────
Column : Partner
Unique values: 2
Partner
No     7316
Yes    6837

──────────────────────────────────────────────────
Column : Dependents
Unique values: 2
Dependents
No     9899
Yes    4225

──────────────────────────────────────────────────
Column : PhoneService
Unique values: 3
PhoneService
Yes        12740
No          1355
1324134        1

──────────────────────────────────────────────────
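
`PhoneService` contains one value (`1324134`) besides `Yes` and `No`, which looks like a data-entry error. A minimal sketch of one possible fix, assuming that anything outside an expected set of labels should be treated as missing; the `EXPECTED` set and the toy data are illustrative assumptions:

```python
import pandas as pd

# Assumed valid labels for a binary Yes/No column
EXPECTED = {"Yes", "No"}

toy = pd.Series(["Yes", "No", "1324134", None, "Yes"], name="PhoneService")

# Series.where keeps values where the condition holds and sets the rest to NaN;
# values that were already missing stay missing
cleaned = toy.where(toy.isin(EXPECTED))

print(cleaned.value_counts(dropna=False))
```

The same pattern could be applied to the other categorical columns once their valid label sets are confirmed.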

### 8. Target variable - `Churn` distribution

In [15]:
churn_counts = df["Churn"].value_counts()
churn_pct = df["Churn"].value_counts(normalize=True).mul(100).round(2)

churn_summary = pd.DataFrame({"count": churn_counts, "pct": churn_pct})
print("Churn distribution:")
print(churn_summary)

Churn distribution:
       count    pct
Churn              
No     10365  73.38
Yes     3760  26.62


## 🔧 Data fixing

Apply corrections identified in the description above. Work on a copy to preserve the original.

In [16]:
df_fixed = df.copy()

### Fix 1: `TotalCharges` - cast to numeric

`TotalCharges` is read as `object` because some rows contain whitespace instead of a number.
We coerce to numeric (whitespace → `NaN`) and inspect the affected rows.

In [None]:
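n_nulls_before = df_fixed["TotalCharges"].isna().sum()
df_fixed["TotalCharges"] = pd.to_numeric(df_fixed["TotalCharges"], errors="coerce")

# Count only the NaN created by the coercion, not the values that were already missing
n_nulls = df_fixed["TotalCharges"].isna().sum() - n_nulls_before
print(f"NaN introduced in TotalCharges: {n_nulls}")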

# Inspect the affected rows
df_fixed[df_fixed["TotalCharges"].isna()]
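
To confirm that whitespace is what blocks the conversion, it can help to list the raw values that fail to parse before overwriting the column. A minimal sketch on toy data (the sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

raw = pd.Series(["29.85", " ", "1889.5", None, "   ", "108.15"], name="TotalCharges")

parsed = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw column but became NaN after coercion
failed = raw[parsed.isna() & raw.notna()]
print(failed.value_counts())
```

If anything other than blank strings shows up in `failed`, the coercion is hiding a different kind of problem and deserves a closer look.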

### Fix 3: `TotalCharges` - handle NaN rows

Decide on a strategy after inspecting the rows above (e.g. impute with 0 for new customers with `tenure == 0`, or drop the rows).

In [None]:
# Example: impute with 0 where tenure == 0 (new customers, no charges yet)
mask_new = df_fixed["TotalCharges"].isna() & (df_fixed["tenure"] == 0)
df_fixed.loc[mask_new, "TotalCharges"] = 0.0

print(f"Remaining NaN in TotalCharges: {df_fixed['TotalCharges'].isna().sum()}")
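
Rows whose `TotalCharges` is still missing after the imputation (for example where `tenure` is positive or itself missing) need the other option mentioned above. A minimal sketch on toy data, assuming rows without a usable `TotalCharges` are simply dropped:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "tenure": [0.0, 12.0, np.nan, 5.0],
        "TotalCharges": [np.nan, 350.0, np.nan, np.nan],
    }
)

# Impute 0 only for new customers (tenure == 0)
mask_new = toy["TotalCharges"].isna() & (toy["tenure"] == 0)
toy.loc[mask_new, "TotalCharges"] = 0.0

# Drop what is still missing; reset the index so it stays contiguous
toy_clean = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(toy_clean)
```

Dropping is only reasonable if few rows are affected; otherwise keeping the `NaN` for a later imputation step may be preferable.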

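### Fix 4: `SeniorCitizen` - cast to boolean

Stored as `float64` (0/1, with some missing values); convert to pandas' nullable `boolean` dtype to match the semantic type of other binary columns. A plain `bool` cast would silently turn `NaN` into `True`.

In [None]:
# Nullable boolean keeps missing values as <NA> instead of coercing them to True
df_fixed["SeniorCitizen"] = df_fixed["SeniorCitizen"].astype("boolean")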
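
`SeniorCitizen` has 14,154 non-null values out of 14,214 rows, so the choice of target dtype matters. A quick illustration on toy data of how NumPy's `bool` and pandas' nullable `"boolean"` dtype treat a missing value:

```python
import numpy as np
import pandas as pd

toy = pd.Series([0.0, 1.0, np.nan], name="SeniorCitizen")

# NumPy bool has no missing-value marker: NaN is truthy and becomes True
as_numpy_bool = toy.astype(bool)

# The nullable extension dtype keeps the gap as <NA>
as_nullable = toy.astype("boolean")

print(as_numpy_bool.tolist())
print(as_nullable.tolist())
```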

### Verify fixed dataset

In [None]:
print("dtypes after fixing:")
print(df_fixed.dtypes)
print()
print(f"Remaining NaNs: {df_fixed.isna().sum().sum()}")
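
Printing the dtypes relies on reading them by eye. A minimal sketch of an automated check, assuming a hypothetical `EXPECTED_DTYPES` mapping that would have to be filled in with the schema actually wanted for this dataset:

```python
import pandas as pd

# Hypothetical target schema: column name -> expected dtype string
EXPECTED_DTYPES = {"TotalCharges": "float64", "tenure": "float64"}

# Toy frame in which `tenure` was accidentally left as text
toy = pd.DataFrame({"TotalCharges": [29.85, 0.0], "tenure": ["1", "0"]})

# Collect every column whose actual dtype differs from the expected one
mismatches = {
    col: str(toy[col].dtype)
    for col, expected in EXPECTED_DTYPES.items()
    if str(toy[col].dtype) != expected
}
print(mismatches)
```

An empty `mismatches` dict would indicate that every listed column has the expected dtype.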

### Export fixed dataset

In [None]:
OUTPUT_PATH = Path("../../data/03_primary/Churn/churn_fixed.parquet")
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

df_fixed.to_parquet(OUTPUT_PATH)

print(f"Saved to: {OUTPUT_PATH}")
print(f"File size: {OUTPUT_PATH.stat().st_size / 1024:.1f} KB")