# Customer Churn: Initial EDA Notebook

This notebook covers the **first exploration** stage, before the logic is modularized into `src/`.
It loads the raw dataset, inspects its structure, visualizes distributions, and explores
how features relate to the churn target.


## 1. Setup

In [5]:
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

RAW = Path('../data/raw/Customer-Churn.csv')
RAW

WindowsPath('../data/raw/Customer-Churn.csv')

## 2. Load data

In [7]:
df = pd.read_csv(RAW)
print(df.shape)
df.head(10)

FileNotFoundError: [Errno 2] No such file or directory: '..\\data\\raw\\Customer-Churn.csv'
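
The `FileNotFoundError` above indicates that the relative path did not resolve from the notebook's working directory. One way to make the lookup less dependent on where the kernel was started is to search upward from the current directory. This is a minimal sketch only: `find_upwards` is a hypothetical helper, and the `data/raw/` layout is assumed from the `RAW` path in the setup cell.

```python
from pathlib import Path


def find_upwards(relative, start=None):
    """Return the first existing `base / relative`, walking from `start` up to the filesystem root.

    Hypothetical helper for illustration; returns None when nothing is found.
    """
    start = Path(start).resolve() if start is not None else Path.cwd().resolve()
    for base in [start, *start.parents]:
        candidate = base / relative
        if candidate.exists():
            return candidate
    return None


# Assumed layout: <project root>/data/raw/Customer-Churn.csv
raw_path = find_upwards(Path('data') / 'raw' / 'Customer-Churn.csv')
if raw_path is None:
    print('Customer-Churn.csv not found when searching upward from', Path.cwd())
else:
    print('Found', raw_path)
```

If the file is found, `pd.read_csv(raw_path)` can be used in place of `pd.read_csv(RAW)`.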

## 3. Structure and completeness

In [None]:
print('Columns ({}):'.format(len(df.columns)))
print(df.columns.tolist())

print('\nDtypes:')
print(df.dtypes)

print('\nMissing values (top 20):')
print(df.isna().sum().sort_values(ascending=False).head(20))
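
One caveat for the completeness check: `isna()` only counts true missing values, so text columns holding empty or whitespace-only strings look complete. The sketch below shows this on a small toy frame. The values are illustrative, not taken from the dataset, and whether the raw file contains such blanks is an assumption to verify.

```python
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only).
toy = pd.DataFrame({
    'TotalCharges': ['29.85', ' ', '108.15', ''],
    'Contract': ['One year', 'Two year', 'Month-to-month', 'One year'],
})

# isna() reports no gaps because blank strings are still strings.
print(toy.isna().sum())

# Count empty or whitespace-only entries per column.
blank_counts = {
    col: int(toy[col].astype(str).str.strip().eq('').sum())
    for col in toy.columns
}
print(blank_counts)
```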

## 4. Target distribution

In [None]:
target = 'Churn'
print(df[target].value_counts())
print('\nShare:')
print((df[target].value_counts(normalize=True) * 100).round(2).astype(str) + '%')

plt.figure()
df[target].astype(str).value_counts().plot(kind='bar')
plt.title('Churn distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

## 5. Quick cleanup for EDA only
- Convert `TotalCharges` to numeric so that plots and summary statistics work. Values that cannot be parsed become `NaN` and are then filled with 0.
- No encoder or scaler is saved here.


In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)
df[['tenure','MonthlyCharges','TotalCharges']].describe()
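
Because the cell above fills the coerced values with 0 right away, the number of entries that failed to parse is never shown, and the later missing-values check will not report them. A minimal sketch of counting them first, using a toy Series with made-up values rather than the real column:

```python
import pandas as pd

# Toy stand-in for the raw TotalCharges column (illustrative values only).
raw = pd.Series(['29.85', ' ', '108.15', 'n/a'], name='TotalCharges')

# errors='coerce' turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors='coerce')

# Entries that became NaN during conversion (not missing beforehand).
n_coerced = int(converted.isna().sum() - raw.isna().sum())
print(f'{n_coerced} value(s) could not be parsed')

# Same fill as in the cell above, applied after the count is recorded.
filled = converted.fillna(0)
print(filled.tolist())
```

Filling with 0 is a convenience for plotting. Whether 0 is a sensible value for those rows depends on what the blanks mean in the source data.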

## 6. Missing values

In [None]:
na = df.isna().sum()
na = na[na > 0].sort_values(ascending=False)
if not na.empty:
    plt.figure()
    na.plot(kind='bar')
    plt.title('Missing values per column')
    plt.ylabel('Count')
    plt.show()
else:
    print('No missing values detected.')

## 7. Numeric feature distributions

In [None]:
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
for c in num_cols:
    if c in df.columns:
        plt.figure()
        df[c].dropna().hist(bins=30)
        plt.title(f'Distribution of {c}')
        plt.xlabel(c)
        plt.ylabel('Frequency')
        plt.show()

## 8. Churn rate by key categories

In [None]:
def churn_rate_by(col):
    if col not in df.columns:
        return None
    tmp = df[[col, 'Churn']].copy()
    tmp['ChurnBin'] = tmp['Churn'].astype(str).str.lower().isin(['yes','1', 'true'])
    rates = tmp.groupby(col)['ChurnBin'].mean().sort_values(ascending=False)
    return rates

for c in ['Contract','InternetService','PaymentMethod','PaperlessBilling']:
    rates = churn_rate_by(c)
    if rates is not None:
        plt.figure()
        rates.plot(kind='bar')
        plt.title(f'Churn rate by {c}')
        plt.ylabel('Rate')
        plt.xlabel(c)
        plt.show()
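
A churn rate for a small category can look extreme purely because few customers fall into it. As a sketch, the same grouping can return the group size next to the rate. The toy frame and the name `summary` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Month-to-month',
                 'One year', 'Two year'],
    'Churn': ['Yes', 'No', 'Yes', 'No', 'No'],
})

# Same binarization rule as churn_rate_by above.
churn_bin = toy['Churn'].astype(str).str.lower().isin(['yes', '1', 'true'])

# Rate and group size side by side, highest rate first.
summary = (
    toy.assign(ChurnBin=churn_bin)
       .groupby('Contract')['ChurnBin']
       .agg(rate='mean', n='size')
       .sort_values('rate', ascending=False)
)
print(summary)
```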

## 9. Correlation heatmap (numeric only)

In [None]:
# Build a numeric-only view for a quick correlation check
numeric_df = df.select_dtypes(include='number').copy()
# If Churn is stored as text (e.g. Yes/No), add a 0/1 version so it appears in the matrix
if 'Churn' in df.columns and 'Churn' not in numeric_df.columns:
    numeric_df['Churn'] = df['Churn'].astype(str).str.lower().isin(['yes', '1', 'true']).astype(int)

if numeric_df.shape[1] >= 2:
    corr = numeric_df.corr()
    plt.figure(figsize=(7,6))
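    # Fix the color range to [-1, 1] so colors map to correlation values consistently
    plt.imshow(corr, interpolation='nearest', vmin=-1, vmax=1, cmap='coolwarm')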
    plt.title('Correlation matrix')
    plt.colorbar()
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.tight_layout()
    plt.show()
else:
    print('Not enough numeric columns for correlation.')

## 10. Notes and next steps
- Based on this EDA, define the preprocessing rules: binary encoding for two-level columns, one-hot encoding for multi-category columns, and scaling for numeric columns.
- Move stable logic into reusable modules (done later in the project, under `src/`).
- Train and evaluate the model in a separate pipeline script and a testing notebook.
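
The preprocessing rules listed above could be sketched roughly as follows. This is an illustration using pandas only, not the project's actual `src/` code: the `preprocess` function, its arguments, and the toy data are all hypothetical.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    'PaperlessBilling': ['Yes', 'No', 'Yes', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'MonthlyCharges': [70.0, 50.0, 90.0, 30.0],
})


def preprocess(df, binary_cols, multi_cols, numeric_cols):
    """Hypothetical sketch: binary encoding, one-hot encoding, standard scaling."""
    out = df.copy()
    # Binary encoding: assumes the two levels are exactly 'Yes' and 'No'.
    for col in binary_cols:
        out[col] = out[col].map({'No': 0, 'Yes': 1})
    # One-hot encoding for multi-category columns.
    out = pd.get_dummies(out, columns=multi_cols, dtype=int)
    # Standard scaling using statistics computed on this frame.
    for col in numeric_cols:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


encoded = preprocess(
    toy,
    binary_cols=['PaperlessBilling'],
    multi_cols=['Contract'],
    numeric_cols=['MonthlyCharges'],
)
print(encoded)
```

In a real pipeline the category levels and the scaling statistics would be fitted on the training split only and then reused on new data, which is why a saved encoder and scaler are typically needed at that stage.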
