## Churn Data: Dtype Optimization & Cleaning

### By:
jdg

### Date:
2026-02-21

### Description:

Loads the intermediate Parquet file produced by `1-data/01_jdg_churn_data_loading_20260221.ipynb`.
The goals of this notebook are:
- Clean dirty string values (rogue numeric strings injected into categorical columns)
- Fix `TotalCharges` dtype (object → float64)
- Cap the extreme outlier in `MonthlyCharges` at the 99th percentile
- Cast every column to its optimal dtype (bool, category, Int16, float64)
- Persist the result to `03_primary` as `churn_primary.parquet`

## 📚 Import libraries

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

## 💾 Load data

In [None]:
INTERMEDIATE_PATH = Path("../../data/02_intermediate/Churn/churn_raw.parquet")

df = pd.read_parquet(INTERMEDIATE_PATH)

print(f"Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
df.head()

## 🔧 Cleaning & dtype optimization

### Step 1: Replace dirty string values → NaN

Five categorical columns each contain a single rogue numeric string that was injected
into otherwise clean data. Replace each with `np.nan` before casting, so that the rogue
value does not become a category of its own.

In [None]:
dirty = {
    "MultipleLines": "1244132",
    "OnlineSecurity": "23453432",
    "DeviceProtection": "1243524",
    "StreamingTV": "5412335",
    "StreamingMovies": "1523434",
}
for col, val in dirty.items():
    df[col] = df[col].replace(val, np.nan)

print("Dirty values replaced with NaN.")
for col in dirty:
    print(f"  {col}: {df[col].isna().sum()} NaNs")
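
The mapping above hard-codes each rogue value. As a hedged alternative, the sketch below scans for values made up only of digits, which is how the rogue strings in this dataset happen to look. The toy frame and the helper name `find_numeric_strings` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame(
    {
        "MultipleLines": ["Yes", "No", "1244132", "No phone service"],
        "StreamingTV": ["No", "Yes", "Yes", "No"],
    }
)


def find_numeric_strings(frame, columns):
    """Return, per column, the distinct values that consist only of digits."""
    found = {}
    for col in columns:
        values = frame[col].dropna().astype(str)
        hits = values[values.str.fullmatch(r"\d+")].unique().tolist()
        if hits:
            found[col] = hits
    return found


rogue = find_numeric_strings(toy, ["MultipleLines", "StreamingTV"])
print(rogue)

# Blank out every detected value (mask() fills with NaN by default).
for col, values in rogue.items():
    toy[col] = toy[col].mask(toy[col].isin(values))
```

A scan like this only helps when legitimate categories are never purely numeric, so it is worth reviewing the hits before replacing them.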

### Step 2: Fix `TotalCharges` (object → float64)

`TotalCharges` was stored as object because some rows contain whitespace instead of a number.
`pd.to_numeric(..., errors='coerce')` converts those to `NaN`.

In [None]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

print(f"NaN in TotalCharges after cast: {df['TotalCharges'].isna().sum()}")
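
To see what the coercion does in isolation, here is a minimal sketch on a made-up series. It also keeps the raw strings that failed to parse, which can be useful for checking that only whitespace was lost. The names `raw` and `failed` are illustrative.

```python
import pandas as pd

# Made-up values: two valid numbers and one whitespace-only entry.
raw = pd.Series(["29.85", " ", "1889.5"], dtype="object")

converted = pd.to_numeric(raw, errors="coerce")
print(converted.dtype)

# Raw strings that were present but could not be parsed as numbers.
failed = raw[converted.isna() & raw.notna()]
print(failed.tolist())
```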

### Step 3: Cap `MonthlyCharges` outlier at 99th percentile

The descriptive statistics revealed an extreme outlier (~5 × 10¹²) in `MonthlyCharges`.
Clip values above the 99th percentile to remove its influence. Note that clipping caps
every value above that threshold (roughly the top 1% of rows), not only the single bad row.

In [None]:
cap = df["MonthlyCharges"].quantile(0.99)
df["MonthlyCharges"] = df["MonthlyCharges"].clip(upper=cap)

print(f"MonthlyCharges 99th-percentile cap: {cap:.2f}")
print(f"MonthlyCharges max after capping  : {df['MonthlyCharges'].max():.2f}")
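
The following toy example illustrates the side effect of quantile clipping: legitimate values that sit above the cap are lowered too. The data is synthetic (199 evenly spaced values plus one absurd one) and is not drawn from the churn file.

```python
import numpy as np
import pandas as pd

# 199 plausible values plus one absurd outlier (all values are made up).
charges = pd.Series(np.append(np.linspace(20.0, 120.0, 199), 5e12))

cap = charges.quantile(0.99)
clipped = charges.clip(upper=cap)

# Count how many rows the clip actually changes.
n_changed = int((charges > cap).sum())
print(f"cap={cap:.2f}, rows changed={n_changed}")
```

Here the clip changes two rows: the outlier and the largest legitimate value. If only the bad row should change, replacing that row with `NaN` based on a domain threshold is an alternative worth considering.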

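### Step 4: Cast Yes/No columns → bool

Columns with exactly two values (`Yes` / `No`) are mapped to `True` / `False`.
The result has `bool` dtype only if every value is `Yes` or `No`; any other value
(or a missing one) maps to `NaN` and leaves the column as `object`.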

In [None]:
yes_no_bool = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
for col in yes_no_bool:
    df[col] = df[col].map({"Yes": True, "No": False})

print("Yes/No columns cast to bool:", yes_no_bool)
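
A small sketch of how `map` behaves when a column is not perfectly clean. The stray value `"maybe"` is invented for illustration; casting to the nullable `"boolean"` dtype afterwards is one option, not something the original notebook does.

```python
import pandas as pd

mapping = {"Yes": True, "No": False}

clean = pd.Series(["Yes", "No", "Yes"])
dirty = pd.Series(["Yes", "No", "maybe"])  # "maybe" is a made-up stray value

clean_mapped = clean.map(mapping)
dirty_mapped = dirty.map(mapping)  # the unmapped value becomes NaN

# The nullable boolean dtype keeps the missing entry as <NA>.
dirty_nullable = dirty_mapped.astype("boolean")

print(clean_mapped.dtype, dirty_mapped.dtype, dirty_nullable.dtype)
```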

### Step 5: Cast `SeniorCitizen` (0/1 float) → nullable boolean

In [None]:
df["SeniorCitizen"] = df["SeniorCitizen"].astype("boolean")

print(f"SeniorCitizen dtype: {df['SeniorCitizen'].dtype}")
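
As an illustration on made-up data, the cast to `"boolean"` accepts floats that are exactly 0 or 1 and keeps missing values as `<NA>`. The second part shows that a value such as 2.0 is rejected rather than silently treated as `True`.

```python
import numpy as np
import pandas as pd

senior = pd.Series([0.0, 1.0, np.nan, 1.0])  # made-up values
as_bool = senior.astype("boolean")
print(as_bool.dtype, int(as_bool.sum()), int(as_bool.isna().sum()))

# Floats other than 0 and 1 are not bool-like, so the cast raises.
try:
    pd.Series([0.0, 2.0]).astype("boolean")
except (TypeError, ValueError) as exc:
    print(f"rejected: {exc}")
```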

### Step 6: Cast `gender` → unordered category

In [None]:
df["gender"] = df["gender"].astype("category")

print(f"gender dtype    : {df['gender'].dtype}")
print(f"gender categories: {df['gender'].cat.categories.tolist()}")

### Step 7: Cast ternary service columns → unordered category

In [None]:
ternary_cols = [
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]
for col in ternary_cols:
    df[col] = df[col].astype("category")

print("Ternary service columns cast to category:")
for col in ternary_cols:
    print(f"  {col}: {df[col].cat.categories.tolist()}")
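
Printing the categories is a visual check. A possible programmatic version compares them against an expected label set, as sketched below. The labels in `expected` are an assumption about what a ternary service column contains; adjust them to the real data.

```python
import pandas as pd

# Assumed labels for one ternary column (not confirmed by the notebook).
expected = {"Yes", "No", "No internet service"}

# Made-up column where one rogue value survived the cleaning step.
col = pd.Series(
    ["Yes", "No", "No internet service", "5412335", None]
).astype("category")

# Any category outside the expected set points at leftover dirty data.
unexpected = set(col.cat.categories) - expected
print(unexpected)
```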

### Step 8: Cast `PaymentMethod` → unordered category

In [None]:
df["PaymentMethod"] = df["PaymentMethod"].astype("category")

print(f"PaymentMethod categories: {df['PaymentMethod'].cat.categories.tolist()}")

### Step 9: Cast `Contract` → ordered category

Contract length has a natural ordering: Month-to-month < One year < Two year.

In [None]:
contract_order = CategoricalDtype(
    categories=["Month-to-month", "One year", "Two year"], ordered=True
)
df["Contract"] = df["Contract"].astype(contract_order)

print(f"Contract dtype  : {df['Contract'].dtype}")
print(f"Contract ordered: {df['Contract'].cat.ordered}")
print(f"Contract cats   : {df['Contract'].cat.categories.tolist()}")
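
What the ordering buys in practice: comparisons, sorting, and `min`/`max` follow contract length instead of alphabetical order. A minimal sketch on made-up rows, reusing the same category definition:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

contract_order = CategoricalDtype(
    categories=["Month-to-month", "One year", "Two year"], ordered=True
)
contract = pd.Series(
    ["Two year", "Month-to-month", "One year", "Month-to-month"]
).astype(contract_order)

# Comparisons against a category label respect the declared order.
long_term = contract >= "One year"
print(long_term.tolist())

# Sorting and max() also follow the declared order.
print(contract.sort_values().tolist())
print(contract.max())
```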

### Step 10: Cast `tenure` → nullable Int16

In [None]:
df["tenure"] = df["tenure"].astype("Int16")

print(f"tenure dtype: {df['tenure'].dtype}")
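
A short sketch of the same cast on made-up values, with a range guard added as a precaution. The guard is an illustrative addition: `Int16` holds values up to 32767, which is far above any plausible tenure in months.

```python
import numpy as np
import pandas as pd

tenure = pd.Series([1.0, 34.0, np.nan, 72.0])  # made-up values

# Guard against values that would not fit into 16 bits.
int16_max = np.iinfo(np.int16).max
assert tenure.max() <= int16_max

# Whole-number floats convert cleanly; NaN becomes <NA>.
tenure_int = tenure.astype("Int16")
print(tenure_int.dtype, int(tenure_int.isna().sum()))
```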

## ✅ Verification

In [None]:
df.info()
print()
print(f"object columns remaining: {(df.dtypes == 'object').sum()}")
print(f"Contract ordered        : {df['Contract'].cat.ordered}")
print(f"MonthlyCharges max      : {df['MonthlyCharges'].max():.2f} (should equal cap={cap:.2f})")
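
`df.info()` reports memory after the casts but gives no baseline. One way to quantify the saving is to measure a frame before and after casting, as in the sketch below. It uses a synthetic frame and a hypothetical helper `memory_mb`; in the notebook, the baseline would have to be taken right after loading.

```python
import pandas as pd


def memory_mb(frame):
    """Total memory of a frame in MiB, including string contents."""
    return frame.memory_usage(deep=True).sum() / 1024**2


# Synthetic data standing in for the raw frame.
before = pd.DataFrame(
    {
        "Contract": ["Month-to-month", "One year", "Two year"] * 1000,
        "Churn": ["Yes", "No", "No"] * 1000,
    }
)

after = before.copy()
after["Contract"] = after["Contract"].astype("category")
after["Churn"] = after["Churn"].map({"Yes": True, "No": False})

print(f"before: {memory_mb(before):.3f} MiB, after: {memory_mb(after):.3f} MiB")
```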

## 💾 Save to primary layer

In [None]:
OUTPUT_PATH = Path("../../data/03_primary/Churn/churn_primary.parquet")
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

df.to_parquet(OUTPUT_PATH, index=False)

print(f"Saved to: {OUTPUT_PATH}")
print(f"File size: {OUTPUT_PATH.stat().st_size / 1024:.1f} KB")

## 📊 Analysis of Results and Conclusions

- All 17 previously `object` columns were cast to their correct dtypes (bool, category, float64)
- 5 rogue numeric strings in categorical columns replaced with `NaN`
- `TotalCharges` is now `float64`; rows with whitespace become `NaN`
- `MonthlyCharges` extreme outlier (~5 × 10¹²) capped at 99th percentile
- `Contract` carries its natural order (Month-to-month < One year < Two year)
- Memory usage is expected to drop thanks to the category, boolean, and `Int16` casts (no before/after comparison is made in this notebook)

## 💡 Proposals and Ideas

- Proceed to `2-exploration/01_jdg_churn_data_description_20260221.ipynb` for EDA on
  `churn_primary.parquet`
- Investigate imputation strategies for the remaining NaNs in feature engineering
- Consider ordinal encoding for `Contract` when feeding to tree-based models
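
For the ordinal-encoding idea, the ordered category already carries the integer codes. The sketch below shows one possible way to extract them on made-up rows; converting the `-1` sentinel for missing values to `<NA>` is an optional step added here for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

contract_order = CategoricalDtype(
    categories=["Month-to-month", "One year", "Two year"], ordered=True
)
contract = pd.Series(
    ["One year", None, "Two year", "Month-to-month"]
).astype(contract_order)

# Codes follow the category order; -1 marks a missing value.
codes = contract.cat.codes
print(codes.tolist())

# Optional: turn the -1 sentinel into a proper missing value.
codes_nullable = codes.astype("Int8").mask(codes == -1)
print(codes_nullable.isna().sum())
```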