## Churn Data: Type Description & EDA

### By:
jdg

### Date:
2026-02-21

### Description:

Loads the primary Parquet file produced by `1-data/02_jdg_churn_data_cleaning_20260221.ipynb`.
The goals of this notebook are:
- Inspect optimized data types and confirm zero `object` columns
- Detect and quantify missing values
- Explore the cardinality and value distribution of each column
- Analyse the target variable (`Churn`) distribution

## 📚 Import libraries

In [5]:
from pathlib import Path

import numpy as np
import pandas as pd

## 💾 Load data

In [None]:
PRIMARY_PATH = Path("../../data/03_primary/Churn/churn_primary.parquet")

df = pd.read_parquet(PRIMARY_PATH)

print(f"Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
df.head()
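
A missing file is easier to diagnose with an explicit check before the read. The following is a minimal sketch, not part of the original notebook; `load_primary` is a hypothetical helper name used only for illustration.

```python
from pathlib import Path

import pandas as pd


def load_primary(path: Path) -> pd.DataFrame:
    """Read the primary Parquet file, failing early with a readable message.

    `load_primary` is a hypothetical helper name, not part of the notebook.
    """
    if not path.is_file():
        raise FileNotFoundError(
            f"Primary dataset not found at {path}; run the cleaning notebook first."
        )
    # Reading Parquet requires an engine such as pyarrow to be installed.
    return pd.read_parquet(path)
```

In this notebook it could be called as `df = load_primary(PRIMARY_PATH)`.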

## 👷 Data description

### 1. Shape and column overview

In [7]:
print(f"Rows   : {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")
print()
print("Column names:")
for col in df.columns:
    print(f"  - {col}")

Rows   : 14,214
Columns: 20

Column names:
  - gender
  - SeniorCitizen
  - Partner
  - Dependents
  - tenure
  - PhoneService
  - MultipleLines
  - InternetService
  - OnlineSecurity
  - OnlineBackup
  - DeviceProtection
  - TechSupport
  - StreamingTV
  - StreamingMovies
  - Contract
  - PaperlessBilling
  - PaymentMethod
  - MonthlyCharges
  - TotalCharges
  - Churn


### 2. Inferred data types (`dtypes`)

In [8]:
dtype_summary = df.dtypes.reset_index().rename(columns={"index": "column", 0: "dtype"})
dtype_summary

| | column | dtype |
|---|---|---|
| 0 | gender | object |
| 1 | SeniorCitizen | float64 |
| 2 | Partner | object |
| 3 | Dependents | object |
| 4 | tenure | float64 |
| 5 | PhoneService | object |
| 6 | MultipleLines | object |
| 7 | InternetService | object |
| 8 | OnlineSecurity | object |
| 9 | OnlineBackup | object |
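
The listing above still shows `object` columns, which the first goal of this notebook says should not remain. The following is a minimal sketch of how that check could be automated and how leftover `object` columns could be converted to `category`. The toy frame and the helper name `object_columns` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def object_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns stored with the generic `object` dtype."""
    return frame.select_dtypes(include="object").columns.tolist()


# Toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame(
    {
        "gender": pd.Series(["Female", "Male", None], dtype="object"),
        "tenure": [1.0, 34.0, 2.0],
        "Contract": pd.Series(
            ["Month-to-month", "One year", "Month-to-month"], dtype="object"
        ),
    }
)

leftover = object_columns(toy)
print(f"object columns before: {leftover}")

# Convert the leftover columns; missing values stay missing after the cast.
toy = toy.astype({col: "category" for col in leftover})
print(f"object columns after : {object_columns(toy)}")
```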


### 3. Detailed schema: `df.info()`

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14214 entries, 0 to 14213
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14187 non-null  object 
 1   SeniorCitizen     14154 non-null  float64
 2   Partner           14153 non-null  object 
 3   Dependents        14124 non-null  object 
 4   tenure            14114 non-null  float64
 5   PhoneService      14096 non-null  object 
 6   MultipleLines     14077 non-null  object 
 7   InternetService   14053 non-null  object 
 8   OnlineSecurity    14024 non-null  object 
 9   OnlineBackup      14013 non-null  object 
 10  DeviceProtection  14020 non-null  object 
 11  TechSupport       14017 non-null  object 
 12  StreamingTV       14015 non-null  object 
 13  StreamingMovies   14036 non-null  object 
 14  Contract          14092 non-null  object 
 15  PaperlessBilling  14092 non-null  object 
 16  PaymentMethod     14088 non-null  object

### 4. Numeric columns: descriptive statistics

In [10]:
df.describe(include=[np.number])

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 14154.0 | 14114.0 | 14101.0 |
| mean | 0.162074 | 32.36276 | 372028800.0 |
| std | 0.368532 | 24.568811 | 44155450000.0 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.9 |
| max | 1.0 | 72.0 | 5243355000000.0 |
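
In the output above, the `MonthlyCharges` mean (about 3.7e8) and maximum (about 5.2e12) sit far above the 75th percentile (89.9), which points to a small number of extreme values. One common way to flag such values is the 1.5 × IQR rule. The sketch below applies it to made-up data; it is an assumption about how one might investigate, not a step from the original notebook, and `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up charges with one implausibly large entry and one missing value.
charges = pd.Series([18.25, 35.5, 70.35, 89.9, 5.2e12, None], name="MonthlyCharges")

mask = iqr_outlier_mask(charges)
print(charges[mask])
```

Quantile-based fences are used here because they are not distorted by the extreme values themselves, unlike a mean and standard deviation rule.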


### 5. Categorical columns: descriptive statistics

In [None]:
df.describe(include="category")

### 6. Missing values

In [12]:
missing = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().sum() / len(df) * 100).round(2),
    }
)
missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)

| | missing_count | missing_pct |
|---|---|---|
| OnlineBackup | 201 | 1.41 |
| StreamingTV | 199 | 1.4 |
| TechSupport | 197 | 1.39 |
| DeviceProtection | 194 | 1.36 |
| OnlineSecurity | 190 | 1.34 |
| StreamingMovies | 178 | 1.25 |
| InternetService | 161 | 1.13 |
| MultipleLines | 137 | 0.96 |
| PaymentMethod | 126 | 0.89 |
| Contract | 122 | 0.86 |


In [13]:
print(f"Total cells with NaN: {df.isna().sum().sum()}")

Total cells with NaN: 2556
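
The total above counts cells, so it does not say how many rows are affected. The following is a minimal sketch of the row-level view on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame(
    {
        "tenure": [1.0, np.nan, 34.0, 2.0],
        "Contract": ["One year", None, "Two year", None],
    }
)

missing_per_row = toy.isna().sum(axis=1)  # number of NaN cells in each row
rows_with_missing = int((missing_per_row > 0).sum())

print(f"Total cells with NaN : {int(toy.isna().sum().sum())}")
print(f"Rows with any NaN    : {rows_with_missing} of {len(toy)}")
```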


### 7. Unique value counts per categorical column

In [None]:
CARDINALITY_THRESHOLD = 10

cat_cols = df.select_dtypes(include="category").columns

for col in cat_cols:
    n_unique = df[col].nunique()
    print(f"\n{'─' * 50}")
    print(f"Column : {col}")
    print(f"Ordered: {df[col].cat.ordered}")
    print(f"Unique values: {n_unique}")
    if n_unique <= CARDINALITY_THRESHOLD:
        print(df[col].value_counts().to_string())
    else:
        print(f"  (high cardinality, sample: {df[col].cat.categories[:5].tolist()}...)")

### 8. Target variable: `Churn` distribution

In [15]:
churn_counts = df["Churn"].value_counts()
churn_pct = df["Churn"].value_counts(normalize=True).mul(100).round(2)

churn_summary = pd.DataFrame({"count": churn_counts, "pct": churn_pct})
print("Churn distribution:")
print(churn_summary)

Churn distribution:
       count    pct
Churn              
No     10365  73.38
Yes     3760  26.62
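
The two counts above sum to 14,125, which is 89 fewer than the 14,214 rows loaded. This is because `value_counts()` drops missing values by default, so the percentages are relative to non-missing rows only. The following is a minimal sketch of the same summary with missing labels included, on made-up data.

```python
import pandas as pd

# Made-up target column with one missing label.
churn = pd.Series(["No", "No", "Yes", None, "No"], name="Churn", dtype="object")

# dropna=False keeps the missing label as its own row in the counts.
counts = churn.value_counts(dropna=False)
pct = (counts / counts.sum() * 100).round(2)

summary = pd.DataFrame({"count": counts, "pct": pct})
print(summary)
```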


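### Fix 4: cast `SeniorCitizen` to boolean

Stored as `float64` (0/1, with missing values) according to the schema above; convert to the nullable `boolean` dtype so that it matches the semantic type of other binary columns and missing values are preserved.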
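
Because the schema above reports `SeniorCitizen` as `float64` with missing values, a plain `bool` cast would silently turn `NaN` into `True`. The following is a minimal sketch, on made-up values, comparing that cast with pandas' nullable `boolean` dtype.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 flags stored as float64, with one missing entry.
senior = pd.Series([0.0, 1.0, np.nan, 0.0], name="SeniorCitizen")

naive = senior.astype(bool)          # NaN is truthy, so it becomes True
nullable = senior.astype("boolean")  # NaN becomes <NA> and stays missing

print(naive.tolist())
print(nullable.tolist())
```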