# 01 - Exploratory Data Analysis (EDA)

**Objective:** Understand the churn dataset structure, quality, and distributions to inform preprocessing and modeling decisions.

## 1. Load Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../data/raw/churn-bigml-20_raw.csv")
df.head()

## 2. Shape and Data Types

In [None]:
print("Shape:", df.shape)
print("\nData types:")
df.dtypes

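**Executive interpretation:** The full churn dataset has 3,333 rows and 20 columns, but the `churn-bigml-20` file loaded here appears to be the 20% split, so expect roughly 667 rows; confirm against the `df.shape` output. Features are a mix of numeric and categorical columns. The target is `Churn` (boolean).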
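To make the numeric/categorical mix explicit before encoding decisions, the columns can be grouped by dtype. The following is a minimal sketch, not part of the original notebook: the `toy` frame and the `split_by_dtype` helper are hypothetical, and the column names are only assumed to resemble those in the churn file.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (illustrative values).
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "International plan": ["No", "Yes", "No"],
    "Total day minutes": [265.1, 161.6, 243.4],
    "Customer service calls": [1, 1, 0],
    "Churn": [False, False, True],
})


def split_by_dtype(df, target="Churn"):
    """Return (numeric, categorical) feature names, excluding the target."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


numeric_cols, categorical_cols = split_by_dtype(toy)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

On the real data, calling `split_by_dtype(df)` would give the two lists that later feed scaling (numeric) and encoding (categorical).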

## 3. Missing Value Analysis

In [None]:
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
pd.DataFrame({"Missing": missing, "%": missing_pct})[missing > 0]

**Executive interpretation:** [Describe missing value patterns. If none, state that the dataset is complete.]

## 4. Target Distribution

In [None]:
churn_counts = df["Churn"].value_counts()
print(churn_counts)
print("\nChurn rate:", df["Churn"].mean().round(3))
sns.countplot(data=df, x="Churn")
plt.title("Target Distribution (Churn)")
plt.show()

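**Executive interpretation:** The churn rate is the share of customers in the data who churned. If the classes are imbalanced, plain accuracy becomes misleading, which affects model choice: prefer stratified splits, class weighting, and metrics such as recall, precision, or ROC-AUC.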
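One way to quantify the imbalance is to compute the majority/minority ratio and per-class weights. The sketch below uses made-up labels (2 churners out of 10, not the real churn rate) and the common "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`. Whether weighting is appropriate for this project is an assumption to revisit at modeling time.

```python
import pandas as pd

# Hypothetical labels: 8 non-churners and 2 churners (illustrative only).
y = pd.Series([False] * 8 + [True] * 2, name="Churn")

counts = y.value_counts()
churn_rate = y.mean()
imbalance_ratio = counts.max() / counts.min()

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
class_weights = (len(y) / (len(counts) * counts)).to_dict()

print(f"Churn rate: {churn_rate:.1%}")
print(f"Majority/minority ratio: {imbalance_ratio:.1f}")
print("Class weights:", class_weights)
```

The rarer class receives the larger weight, so errors on churners count more during training.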

## 5. Feature Distributions

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].hist(figsize=(12, 10), bins=30)
plt.suptitle("Numeric Feature Distributions", y=1.02)
plt.tight_layout()
plt.show()

**Executive interpretation:** Identify skewed features, outliers, and scaling needs for modeling.
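Histograms show skew and outliers visually; a small numeric summary can back up that reading. The sketch below is one possible approach, assuming the conventional 1.5 × IQR fence for outliers. The `toy` data and the `distribution_summary` helper are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: one roughly symmetric column, one with a large value.
toy = pd.DataFrame({
    "Total day minutes": [180.0, 175.5, 190.2, 185.0, 178.3, 182.1, 188.7, 176.9],
    "Customer service calls": [1, 2, 3, 2, 1, 2, 3, 12],
})


def distribution_summary(df, iqr_factor=1.5):
    """Per-column skewness and count of values outside the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align each column with its own fence value.
    outside = (df < q1 - iqr_factor * iqr) | (df > q3 + iqr_factor * iqr)
    return pd.DataFrame({
        "skew": df.skew().round(2),
        "outliers": outside.sum(),
    })


summary = distribution_summary(toy)
print(summary)
```

On the real data, `distribution_summary(df[numeric_cols])` would flag columns that may benefit from a transform or robust scaling.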

## 6. Correlation Analysis

In [None]:
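# select_dtypes(np.number) excludes the boolean target, so add it back as 0/1.
# .copy() guards against a SettingWithCopyWarning on the assignment below.
df_numeric = df.select_dtypes(include=[np.number]).copy()
df_numeric["Churn"] = df["Churn"].astype(int)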
corr = df_numeric.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix (Numeric Features)")
plt.tight_layout()
plt.show()

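**Executive interpretation:** Near-perfect correlations between features (e.g., minutes vs. charge) indicate redundancy, so one column of each such pair is a candidate for removal during feature selection. Correlation with `Churn` highlights candidate drivers, though it captures only linear association.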
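The heatmap can be hard to read with many columns, so it may help to pull the redundant pairs and the churn correlations out as ranked lists. The following sketch uses hypothetical toy data in which the charge column is an exact multiple of minutes. The helpers `correlated_pairs` and `churn_correlations` and the 0.95 threshold are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: charge is an exact multiple of minutes (illustrative).
minutes = np.array([120.0, 180.0, 240.0, 300.0, 150.0, 330.0])
toy = pd.DataFrame({
    "Total day minutes": minutes,
    "Total day charge": minutes * 0.17,
    "Customer service calls": [1, 0, 2, 5, 1, 4],
    "Churn": [False, False, False, True, False, True],
})


def correlated_pairs(df, target="Churn", threshold=0.95):
    """Feature pairs whose absolute Pearson correlation exceeds threshold."""
    features = df.drop(columns=[target]).select_dtypes(include="number")
    corr = features.corr().abs()
    # Keep the strict upper triangle so each pair appears only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


def churn_correlations(df, target="Churn"):
    """Correlation of each numeric feature with the 0/1 target, strongest first."""
    numeric = df.select_dtypes(include="number").copy()
    numeric[target] = df[target].astype(int)
    by_target = numeric.corr()[target].drop(target)
    return by_target.sort_values(key=lambda s: s.abs(), ascending=False)


pairs = correlated_pairs(toy)
drivers = churn_correlations(toy)
print(pairs)
print(drivers)
```

Ranking by absolute value keeps strongly negative relationships near the top as well as positive ones.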

## 7. Summary

**Key findings:**
- Dataset is [complete / has X% missing]
- Churn rate: [X%]
- Top churn-related features: [list]
- Preprocessing needs: encoding, scaling, potential feature selection