## Churn Data: Type Description & EDA

### By:
jdg

### Date:
2026-02-21

### Description:

Loads the primary Parquet file produced by `1-data/02_jdg_churn_data_cleaning_20260221.ipynb`.
The goals of this notebook are:
- Inspect optimized data types and confirm zero `object` columns
- Detect and quantify missing values
- Explore the cardinality and value distribution of each column
- Analyse the target variable (`Churn`) distribution

## 📚 Import libraries

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd

## 💾 Load data

In [2]:
PRIMARY_PATH = Path("../../data/03_primary/Churn/churn_primary.parquet")

df = pd.read_parquet(PRIMARY_PATH, dtype_backend="numpy_nullable")

print(f"Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
df.head()

Loaded: 14,214 rows x 20 columns


|  | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Female | False | True | False | 1 | False | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | True | Electronic check | 29.85 | 29.85 | False |
| 1 | Male | False | False | False | 34 | True | No | DSL | Yes | No | Yes | No | No | No | One year | False | Mailed check | 56.95 | 1889.5 | False |
| 2 | Male | False | False | False | 2 | True | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | True | Mailed check | 53.85 | 108.15 | True |
| 3 | Male | False | False | False | 45 | False | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | False | Bank transfer (automatic) | 42.3 | 1840.75 | False |
| 4 | Female | False | False | False | 2 | True | No | Fiber optic | No | No | No | No | No | No | Month-to-month | True | Electronic check | 70.7 | 151.65 | True |


## 👷 Data description

### 1. Shape and column overview

In [3]:
print(f"Rows   : {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")
print()
print("Column names:")
for col in df.columns:
    print(f"  - {col}")

Rows   : 14,214
Columns: 20

Column names:
  - gender
  - SeniorCitizen
  - Partner
  - Dependents
  - tenure
  - PhoneService
  - MultipleLines
  - InternetService
  - OnlineSecurity
  - OnlineBackup
  - DeviceProtection
  - TechSupport
  - StreamingTV
  - StreamingMovies
  - Contract
  - PaperlessBilling
  - PaymentMethod
  - MonthlyCharges
  - TotalCharges
  - Churn


### 2. Inferred data types (`dtypes`)

In [4]:
dtype_summary = df.dtypes.reset_index().rename(columns={"index": "column", 0: "dtype"})
dtype_summary

|  | column | dtype |
| --- | --- | --- |
| 0 | gender | category |
| 1 | SeniorCitizen | boolean |
| 2 | Partner | boolean |
| 3 | Dependents | boolean |
| 4 | tenure | Int16 |
| 5 | PhoneService | boolean |
| 6 | MultipleLines | category |
| 7 | InternetService | category |
| 8 | OnlineSecurity | category |
| 9 | OnlineBackup | category |
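
The "zero `object` columns" goal can also be checked programmatically rather than by reading the dtype table. The following is a minimal sketch on a toy frame with hypothetical values (the real `df` is not reproduced here); `dtype_counts` and `object_cols` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data: hypothetical values, same dtype families.
toy = pd.DataFrame(
    {
        "gender": pd.Categorical(["Female", "Male", None]),
        "SeniorCitizen": pd.array([False, True, None], dtype="boolean"),
        "tenure": pd.array([1, 34, None], dtype="Int16"),
        "MonthlyCharges": pd.array([29.85, 56.95, None], dtype="Float64"),
    }
)

# Number of columns per dtype; casting to str keeps extension dtypes readable.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Columns still stored as generic Python objects (expected: none).
object_cols = toy.select_dtypes(include="object").columns.tolist()

print(dtype_counts.to_string())
print(f"object columns: {object_cols}")
```

Run against the real `df`, the same two expressions would give the per-dtype tally directly and flag any column that slipped through the cleaning step as `object`.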


### 3. Detailed schema: `df.info()`

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14214 entries, 0 to 14213
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   gender            14187 non-null  category
 1   SeniorCitizen     14154 non-null  boolean 
 2   Partner           14153 non-null  boolean 
 3   Dependents        14124 non-null  boolean 
 4   tenure            14114 non-null  Int16   
 5   PhoneService      14095 non-null  boolean 
 6   MultipleLines     14076 non-null  category
 7   InternetService   14053 non-null  category
 8   OnlineSecurity    14023 non-null  category
 9   OnlineBackup      14013 non-null  category
 10  DeviceProtection  14019 non-null  category
 11  TechSupport       14017 non-null  category
 12  StreamingTV       14014 non-null  category
 13  StreamingMovies   14035 non-null  category
 14  Contract          14092 non-null  category
 15  PaperlessBilling  14092 non-null  boolean 
 16  PaymentMethod     1408

### 4. Numeric columns: descriptive statistics

In [6]:
df.describe(include=[np.number])

|  | tenure | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- |
| count | 14114.0 | 14101.0 | 14121.0 |
| mean | 32.36276 | 64.780374 | 2283.91528 |
| std | 24.568811 | 30.084023 | 2263.862562 |
| min | 0.0 | 18.25 | 0.0 |
| 25% | 9.0 | 35.5 | 400.0 |
| 50% | 29.0 | 70.35 | 1398.25 |
| 75% | 55.0 | 89.9 | 3804.4 |
| max | 72.0 | 114.85 | 8045.81 |


### 5. Categorical columns: descriptive statistics

In [7]:
df.describe(include="category")

|  | gender | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaymentMethod |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 14187 | 14076 | 14053 | 14023 | 14013 | 14019 | 14017 | 14014 | 14035 | 14092 | 14088 |
| unique | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 4 |
| top | Male | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Electronic check |
| freq | 7160 | 6784 | 6182 | 6958 | 6154 | 6159 | 6899 | 5591 | 5549 | 7760 | 4737 |


### 6. Missing values

In [8]:
missing = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().sum() / len(df) * 100).round(2),
    }
)
missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)

|  | missing_count | missing_pct |
| --- | --- | --- |
| StreamingTV | 200 | 1.41 |
| OnlineBackup | 201 | 1.41 |
| TechSupport | 197 | 1.39 |
| DeviceProtection | 195 | 1.37 |
| OnlineSecurity | 191 | 1.34 |
| StreamingMovies | 179 | 1.26 |
| InternetService | 161 | 1.13 |
| MultipleLines | 138 | 0.97 |
| PaymentMethod | 126 | 0.89 |
| Contract | 122 | 0.86 |


In [9]:
print(f"Total cells with NaN: {df.isna().sum().sum()}")

Total cells with NaN: 2584
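
The per-column counts and the cell total do not show how the missing cells are spread across rows. A possible complementary check counts missing cells per row. This is a sketch on a toy frame with hypothetical values; `missing_per_row`, `rows_with_any`, and `complete_rows` are illustrative names.

```python
import pandas as pd

# Toy frame with hypothetical values, using the same nullable dtypes.
toy = pd.DataFrame(
    {
        "Contract": pd.Categorical(["One year", None, "Two year", "One year"]),
        "tenure": pd.array([1, 34, None, 2], dtype="Int16"),
        "Churn": pd.array([False, None, True, False], dtype="boolean"),
    }
)

# Number of missing cells in each row.
missing_per_row = toy.isna().sum(axis=1)

rows_with_any = int((missing_per_row > 0).sum())   # rows touched by at least one NaN
complete_rows = int((missing_per_row == 0).sum())  # rows that are fully populated

print(f"Rows with >= 1 missing value: {rows_with_any}")
print(f"Complete rows               : {complete_rows}")
print(f"Share of incomplete rows    : {rows_with_any / len(toy):.2%}")
```

On the real data this would indicate whether the 2,584 missing cells are concentrated in a few rows or scattered thinly, which affects whether dropping rows is a reasonable alternative to imputation.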


### 7. Unique value counts per categorical column

In [10]:
CARDINALITY_THRESHOLD = 10

cat_cols = df.select_dtypes(include="category").columns

for col in cat_cols:
    n_unique = df[col].nunique()
    print(f"\n{'─' * 50}")
    print(f"Column : {col}")
    print(f"Ordered: {df[col].cat.ordered}")
    print(f"Unique values: {n_unique}")
    if n_unique <= CARDINALITY_THRESHOLD:
        print(df[col].value_counts().to_string())
    else:
        print(f"  (high cardinality; sample: {df[col].cat.categories[:5].tolist()}...)")


──────────────────────────────────────────────────
Column : gender
Ordered: False
Unique values: 2
gender
Male      7160
Female    7027

──────────────────────────────────────────────────
Column : MultipleLines
Ordered: False
Unique values: 3
MultipleLines
No                  6784
Yes                 5940
No phone service    1352

──────────────────────────────────────────────────
Column : InternetService
Ordered: False
Unique values: 3
InternetService
Fiber optic    6182
DSL            4819
No             3052

──────────────────────────────────────────────────
Column : OnlineSecurity
Ordere

### 8. Target variable: `Churn` distribution

In [11]:
churn_counts = df["Churn"].value_counts()
churn_pct = df["Churn"].value_counts(normalize=True).mul(100).round(2)

churn_summary = pd.DataFrame({"count": churn_counts, "pct": churn_pct})
print("Churn distribution:")
print(churn_summary)

Churn distribution:
       count    pct
Churn              
False  10365  73.38
True    3760  26.62
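
The two counts sum to 14,125, which is 89 fewer than the 14,214 rows in the frame, because `value_counts()` drops missing values by default. A minimal sketch of how the missing labels could be surfaced, using a toy `Churn` series with hypothetical values:

```python
import pandas as pd

# Toy target column (hypothetical values) with one missing label.
churn = pd.Series(
    pd.array([False, True, None, False, False], dtype="boolean"), name="Churn"
)

# dropna=False keeps the missing labels as their own row in the counts.
with_missing = churn.value_counts(dropna=False)

# normalize=True (with the default dropna=True) gives shares of labelled rows only.
share_labelled = churn.value_counts(normalize=True)

n_missing = int(churn.isna().sum())

print(with_missing.to_string())
print(f"Missing labels: {n_missing} of {len(churn)}")
print(share_labelled.mul(100).round(2).to_string())
```

Rows without a label cannot be used for supervised training or evaluation, so it may be worth deciding early whether they are dropped or kept for other purposes.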


## 📊 Analysis of Results and Conclusions

- **Dtypes confirmed**: 6 `boolean`, 11 `category` (1 ordered), 1 `Int16`, 2 `Float64`,
  0 `object`; every column carries its semantically correct type
- **Missing values**: All columns < 2% missing; `OnlineBackup` / `StreamingTV` highest
  (≈1.4%); `gender` lowest (0.19%). This is low enough for simple imputation strategies
- **Class imbalance**: 26.6% churn (`True`), 73.4% no churn (`False`) among the 14,125 rows
  with a non-missing `Churn` label (89 labels are missing); the majority-class baseline
  accuracy is 73.4%, so imbalance should be addressed in the modelling phase
- **`TotalCharges`**: Negative values (data corruption) and extreme positives were
  capped to [0, 99th percentile] in the cleaning step
- **`MonthlyCharges`**: Extreme outlier capped at 99th percentile (114.85); range is
  plausible for telecom billing
- **`Contract`**: ≈55% of customers with a recorded contract are on Month-to-month; a
  minority are on two-year contracts. Strong potential predictor of churn
## 💡 Proposals and Ideas

- Proceed to `3-analysis/` for distribution plots, correlation analysis, and churn breakdown by
  feature
- Proceed to `4-feat_eng/` for imputation strategy (all features < 2% missing, so median/mode
  imputation is viable)
- Consider encoding plan: `Contract` (ordinal, already ordered category), boolean columns
  (already binary), remaining categories (one-hot or label encoding)
- Class imbalance (26.6% vs 73.4%) warrants SMOTE or class-weight adjustment in the
  modelling phase
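
As a rough illustration of the median/mode imputation proposed above, the sketch below fills numeric columns with the median and boolean/category columns with the mode. The toy values and the helper name `simple_fill_value` are hypothetical, and the actual strategy in `4-feat_eng/` may differ.

```python
import pandas as pd

# Toy frame with hypothetical values; dtypes mirror those used in this notebook.
toy = pd.DataFrame(
    {
        "tenure": pd.array([1, 34, None, 45, 2], dtype="Int16"),
        "MonthlyCharges": pd.array([29.85, None, 53.85, 42.30, 70.70], dtype="Float64"),
        "Contract": pd.Categorical(
            ["Month-to-month", "One year", "Month-to-month", None, "Two year"]
        ),
        "PaperlessBilling": pd.array([True, False, True, None, True], dtype="boolean"),
    }
)


def simple_fill_value(s: pd.Series):
    """Return the mode for boolean/category columns and the median otherwise."""
    if pd.api.types.is_bool_dtype(s) or isinstance(s.dtype, pd.CategoricalDtype):
        return s.mode().iloc[0]
    median = s.median()
    # Round for integer columns so the nullable integer dtype survives the fill.
    return int(round(median)) if pd.api.types.is_integer_dtype(s) else median


imputed = toy.copy()
for col in imputed.columns:
    imputed[col] = imputed[col].fillna(simple_fill_value(imputed[col]))

print(imputed)
print(imputed.dtypes)
```

In a modelling pipeline the fill values would normally be computed on the training split only and then reused for validation and test data, to avoid leaking information across splits.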