# Exploratory Data Analysis (EDA) for Customer Churn Dataset
This notebook explores the customer churn dataset. We examine the data structure, missing values, feature distributions, and relationships between features and churn, then summarize the observations that can inform modeling.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1. Load the Data
We will load the raw dataset and display its basic structure.

In [None]:
# Load the raw data
df = pd.read_csv('../data/raw/raw.csv')
df.head()

## 2. Data Overview
Let's check the shape, columns, and data types.

In [None]:
# Data shape and info
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
df.info()

## 3. Missing Values Analysis
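Count the missing values in each column. Note that `isnull()` only detects true `NaN` entries; blank strings in a column that pandas read as text are not counted.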

In [None]:
# Missing values count
missing = df.isnull().sum()
missing[missing > 0]
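
If a numeric column was read as text, its gaps may be stored as empty or whitespace-only strings, which the count above would miss. The sketch below is one way to check for that. The helper name `count_blank_strings` and the toy frame are hypothetical, and whether `raw.csv` actually contains such blanks is an assumption to verify against the real data.

```python
import pandas as pd


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only cells in each non-numeric column."""
    counts = {}
    for col in frame.columns:
        # Numeric columns cannot hold blank strings, so skip them.
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue
        n_blank = frame[col].astype(str).str.strip().eq("").sum()
        if n_blank > 0:
            counts[col] = int(n_blank)
    return pd.Series(counts, dtype="int64")


# Toy frame standing in for df; the values are made up for illustration.
demo = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
    "Contract": ["Two year", "Month-to-month", "One year"],
})
blank_counts = count_blank_strings(demo)
print(blank_counts)
```

In the notebook, `count_blank_strings(df)` could be run right after the `isnull()` check, before any column is coerced to numeric.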

## 4. Target Variable Distribution
Examine the distribution of the target variable `Churn`.

In [None]:
# Target variable distribution
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()
df['Churn'].value_counts(normalize=True)

## 5. Numerical Feature Distributions
Visualize the distributions of key numerical features.

In [None]:
# Numerical features
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
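# Coerce to numeric; entries that cannot be parsed (e.g. blank strings) become NaN
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
# Lay the three histograms out in a single row to match the wide figure
df[num_cols].hist(bins=20, figsize=(12, 4), layout=(1, 3))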
plt.suptitle('Numerical Feature Distributions')
plt.show()

## 6. Categorical Feature Distributions
Show the distribution of selected categorical features.

In [None]:
# Categorical features
cat_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'InternetService', 'Contract', 'PaymentMethod']
fig, axes = plt.subplots(2, 4, figsize=(18, 8))
# Pair each feature with one subplot of the 2x4 grid
for ax, col in zip(axes.flat, cat_cols):
    sns.countplot(x=col, data=df, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()

## 7. Churn Rate by Categorical Features
Analyze how churn varies across key categorical features.

In [None]:
# Churn rate by categorical features
for col in cat_cols:
    # Share of customers with Churn == 'Yes' within each category
    churn_rate = df['Churn'].eq('Yes').groupby(df[col]).mean()
    churn_rate.plot(kind='bar', title=f'Churn Rate by {col}')
    plt.ylabel('Churn Rate')
    plt.show()
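
A churn rate computed on a small group can be misleading, so it may help to read each rate alongside its group size. The sketch below is one way to tabulate both. The helper name `churn_rate_table` and the toy frame are hypothetical, and it assumes `Churn` is coded as `'Yes'`/`'No'`, as the correlation cell does.

```python
import pandas as pd


def churn_rate_table(frame, col, target="Churn", positive="Yes"):
    """Return churn rate and customer count for each category of `col`."""
    is_churn = frame[target].eq(positive).astype(int)
    table = (
        is_churn.groupby(frame[col])
        .agg(churn_rate="mean", customers="count")
        .sort_values("churn_rate", ascending=False)
    )
    return table


# Toy frame standing in for df; the values are made up for illustration.
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Two year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "No"],
})
table = churn_rate_table(demo, "Contract")
print(table)
```

Calling `churn_rate_table(df, col)` inside the loop over `cat_cols` would give a numeric companion to each bar chart.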

## 8. Correlation Analysis
Check correlations between numerical features and churn.

In [None]:
# Encode target for correlation
corr_df = df.copy()
corr_df['Churn'] = corr_df['Churn'].map({'Yes': 1, 'No': 0})
sns.heatmap(corr_df[num_cols + ['Churn']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

## 9. Insights and Observations
- The dataset is moderately imbalanced with more non-churned customers.
- Senior citizens, customers with month-to-month contracts, and those with higher monthly charges have higher churn rates.
- Tenure is negatively correlated with churn: newer customers are more likely to churn.
- Some features have missing values (notably `TotalCharges`), which should be imputed.
- Categorical features like contract type and payment method show strong relationships with churn.
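
The imputation mentioned above could be sketched as follows. The strategy is an assumption, not something this analysis establishes: customers with zero tenure are given `TotalCharges` of 0, and any remaining gaps are filled with the median of the observed values. The helper name `impute_total_charges` and the toy frame are hypothetical.

```python
import pandas as pd


def impute_total_charges(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with gaps in TotalCharges filled."""
    out = frame.copy()
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")

    # Median of the values that were actually observed.
    observed_median = out["TotalCharges"].median()

    # Assumption: a customer with zero tenure has not been billed yet.
    new_customer = out["TotalCharges"].isna() & out["tenure"].eq(0)
    out.loc[new_customer, "TotalCharges"] = 0.0

    # Any other gap falls back to the observed median.
    out["TotalCharges"] = out["TotalCharges"].fillna(observed_median)
    return out


# Toy frame standing in for df; the values are made up for illustration.
demo = pd.DataFrame({
    "tenure": [0, 10, 20, 30],
    "TotalCharges": [" ", "100.0", None, "300.0"],
})
imputed = impute_total_charges(demo)
print(imputed)
```

Other strategies, such as dropping the affected rows, may suit the data better; the choice is worth checking against how many rows are affected.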

These insights will guide feature engineering and model selection.