# 01 - Data Exploration: Telco Customer Churn

**Objective**: Understand the dataset before building any models.

**What we need to answer**:
1. What is the schema? What types are the columns?
2. Are there missing values? How should we handle them?
3. What is the class balance? Is churn rare or common?
4. Are there any data leakage risks?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
pd.set_option('display.max_columns', None)
print('Libraries loaded successfully')

## 1. Load and Inspect Raw Data

In [None]:
DATA_PATH = Path('../data/raw/telco_customer_churn.csv')
df = pd.read_csv(DATA_PATH)
print(f'Dataset shape: {df.shape[0]:,} rows x {df.shape[1]} columns')

In [None]:
df.head()

In [None]:
df.info()
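
Question 4 asks about leakage risks. One narrow check that fits at the schema stage is flagging identifier-like columns, since a column with a distinct value on nearly every row carries no generalizable signal and can let a model memorize rows. The sketch below runs on a small made-up frame; the helper name `identifier_like_columns`, the 0.99 threshold, and the toy rows are illustrative assumptions, and the check does not detect other forms of leakage such as features recorded after the churn event.

```python
import pandas as pd

def identifier_like_columns(frame, threshold=0.99):
    """Return non-numeric columns whose share of distinct values is >= threshold."""
    flagged = []
    # Numeric columns are skipped: continuous values are often near-unique by nature
    for col in frame.select_dtypes(exclude='number').columns:
        unique_share = frame[col].nunique(dropna=False) / len(frame)
        if unique_share >= threshold:
            flagged.append(col)
    return flagged

# Toy rows for illustration only (not taken from the real file)
toy = pd.DataFrame({
    'customerID': ['A1', 'B2', 'C3', 'D4'],
    'Contract': ['Two year', 'One year', 'Two year', 'One year'],
    'MonthlyCharges': [20.5, 70.1, 19.9, 55.0],
    'Churn': ['No', 'Yes', 'No', 'No'],
})
print(identifier_like_columns(toy))
```

On the real data the same call would be `identifier_like_columns(df)`; any flagged column is a candidate to drop before modeling.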

## 2. Missing Values Analysis

In [None]:
missing = df.isnull().sum()
print('Missing Values:')
print(missing[missing > 0])

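# isnull() does not count whitespace-only strings as missing, so check
# for them separately when TotalCharges was read in as text
if df['TotalCharges'].dtype == 'object':
    blank = df[df['TotalCharges'].str.strip() == '']
    print(f'\nBlank TotalCharges: {len(blank)} rows')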
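
One possible way to handle blank `TotalCharges` values, sketched on a small made-up frame rather than the real file: coerce the column to numeric so blanks become NaN, then fill with 0 only where `tenure` is 0. The helper name `clean_total_charges` and the toy rows are illustrative assumptions, and the zero-fill assumes that a blank at tenure 0 means the customer has not been billed yet.

```python
import pandas as pd

def clean_total_charges(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with TotalCharges as float; blanks at tenure 0 become 0.0."""
    out = frame.copy()
    # Blank or whitespace-only strings cannot be parsed, so coerce turns them into NaN
    out['TotalCharges'] = pd.to_numeric(out['TotalCharges'], errors='coerce')
    # Fill only rows with tenure 0; any other NaN stays visible for manual review
    new_customer = out['TotalCharges'].isna() & (out['tenure'] == 0)
    out.loc[new_customer, 'TotalCharges'] = 0.0
    return out

# Toy rows for illustration only (not taken from the real file)
toy = pd.DataFrame({
    'tenure': [0, 12, 5],
    'TotalCharges': [' ', '840.50', ''],
})
cleaned = clean_total_charges(toy)
print(cleaned)
```

Restricting the fill to `tenure == 0` is deliberate: a blank on a customer with billing history would point to a data problem rather than a new account, so it is left as NaN.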

## 3. Target Variable Analysis

In [None]:
churn_counts = df['Churn'].value_counts()
churn_pct = df['Churn'].value_counts(normalize=True) * 100

print('Churn Distribution:')
for label in churn_counts.index:
    print(f'  {label}: {churn_counts[label]:,} ({churn_pct[label]:.1f}%)')

In [None]:
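fig, ax = plt.subplots(figsize=(8, 5))
# value_counts() sorts by frequency, so colors follow that order:
# green for the larger class, red for the smaller one
colors = ['#2ecc71', '#e74c3c']
ax.bar(churn_counts.index, churn_counts.values, color=colors)
ax.set_title('Customer Churn Distribution')
ax.set_xlabel('Churn')
ax.set_ylabel('Number of customers')
plt.show()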
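
Beyond the overall class balance, the churn rate within each level of a categorical column gives a first look at which features separate churners from non-churners. The sketch below shows that calculation on a small made-up frame; the helper name `churn_rate_by` and the toy rows are illustrative assumptions, and the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

def churn_rate_by(frame, column, target='Churn', positive='Yes'):
    """Percent of rows where target == positive, within each level of column."""
    is_churn = frame[target].eq(positive)
    # The mean of a boolean Series is the share of True values
    rates = is_churn.groupby(frame[column]).mean() * 100
    return rates.sort_values(ascending=False)

# Toy rows for illustration only (not taken from the real file)
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Two year',
                 'Two year', 'One year', 'Month-to-month'],
    'Churn': ['Yes', 'No', 'No', 'No', 'No', 'Yes'],
})
rates = churn_rate_by(toy, 'Contract')
print(rates.round(1))
```

On the real data the call would be `churn_rate_by(df, 'Contract')`, repeated for any other categorical column of interest.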

## 4. Summary

- Dataset: 7,043 customers x 21 columns
- Churn rate: ~26.5% (moderate imbalance)
- 11 blank TotalCharges (new customers with tenure=0)
- Likely key drivers (to be confirmed in later analysis): Contract type, Internet service, Payment method