# Lesson 4.3: Exploratory Data Analysis (EDA)

## The Detective Work Before ML

EDA is like **reading the database schema and running test queries** before building a Laravel app: you need to understand your data before you build models on it.

### EDA Checklist
1. Load the data and inspect its shape and dtypes
2. Check missing values
3. Understand distributions (univariate)
4. Find relationships (bivariate)
5. Correlation analysis
6. Spot outliers

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')
%matplotlib inline

# Load Titanic dataset (built into seaborn)
df = sns.load_dataset('titanic')
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

In [None]:
# Step 1: Basic inspection
print("=== Data Types ===")
print(df.dtypes)
print("\n=== Quick Stats ===")
# describe() summarizes only the numeric columns by default
df.describe()
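
Because `describe()` skips text columns by default, categorical features need their own summary. The sketch below uses a small toy frame with made-up values (only the column names mirror the Titanic data) to show one way this might be done; in the notebook the same calls could be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "embarked": ["S", "C", "S", None, "S"],
    "age": [22.0, 38.0, 26.0, 35.0, None],
})

# Excluding numeric columns makes describe() report count, unique, top, freq.
cat_summary = toy.describe(exclude=["number"])
print(cat_summary)

# value_counts lists each category's frequency;
# dropna=False keeps missing values visible as their own row.
embarked_counts = toy["embarked"].value_counts(dropna=False)
print(embarked_counts)
```

`top` and `freq` give the most common category and how often it occurs, which is a quick way to spot heavily imbalanced columns.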

In [None]:
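# Step 2: Missing values (count and percentage per column)
missing = df.isna().sum()
missing_pct = (missing / len(df) * 100).round(1)
# Show only the columns that have gaps, worst first
(pd.DataFrame({'Missing': missing, 'Percent': missing_pct})
   .query('Missing > 0')
   .sort_values('Missing', ascending=False))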

In [None]:
# Step 3: Univariate analysis - distribution of key columns
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.histplot(df['age'].dropna(), bins=30, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')

sns.countplot(x='survived', data=df, ax=axes[0, 1])
axes[0, 1].set_title(f'Survival Rate: {df["survived"].mean():.1%}')

sns.countplot(x='pclass', data=df, ax=axes[1, 0])
axes[1, 0].set_title('Passenger Class')

sns.histplot(df['fare'], bins=30, kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Fare Distribution')

plt.tight_layout()
plt.show()

In [None]:
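# Step 4: Bivariate - relationships with target (survived)
# barplot draws the mean of 'survived' per group (the survival rate);
# the error bars are 95% confidence intervals by default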
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.barplot(x='sex', y='survived', data=df, ax=axes[0])
axes[0].set_title('Survival by Gender')

sns.barplot(x='pclass', y='survived', data=df, ax=axes[1])
axes[1].set_title('Survival by Class')

sns.boxplot(x='survived', y='age', data=df, ax=axes[2])
axes[2].set_title('Age by Survival')

plt.tight_layout()
plt.show()
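
Bar plots show the pattern, but it often helps to see the exact numbers behind them. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not the real Titanic figures); swapping `toy` for `df` should give the actual rates.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 3, 1, 3, 2],
    "survived": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
# Including the count shows how many rows each rate is based on.
rate_by_sex = toy.groupby("sex")["survived"].agg(["mean", "count"])
print(rate_by_sex)

# Cross-tabulate class against outcome; normalize="index" makes each row sum to 1.
class_vs_outcome = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")
print(class_vs_outcome)
```

Keeping the group size next to each rate is a cheap guard against over-reading a rate computed from a handful of rows.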

In [None]:
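# Step 5: Correlation heatmap (Pearson correlation, numeric columns only;
# boolean and categorical columns are excluded by the dtype filter)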
plt.figure(figsize=(8, 6))
numeric = df.select_dtypes(include=[np.number])
sns.heatmap(numeric.corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlations')
plt.show()
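
The checklist's last item, spotting outliers, has no cell of its own. One common approach is the 1.5 × IQR rule that box plots use for their whiskers. The helper below, `iqr_outliers`, is a hypothetical name introduced for this sketch, and the fares are made-up sample values.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)  # quantile() ignores missing values
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Made-up fares with one extreme value and one missing value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33, None], name="fare")
mask = iqr_outliers(fares)
print(fares[mask])
```

Here only the 512.33 fare falls outside the fences. In the notebook, `df[iqr_outliers(df['fare'])]` would apply the same rule to the real column; whether flagged rows are errors or just rare cases is a judgment call the rule cannot make for you.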

## Key Insights from EDA

- Women survived at much higher rates than men (about 74% vs. 19%)
- 1st class passengers had higher survival rates (about 63%, vs. 24% in 3rd class)
- Age is missing for 177 of 891 passengers (about 20%) and needs handling
- Fare is heavily right-skewed (few very expensive tickets)

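These insights guide our ML model decisions: which features to include, how to fill the missing ages, and whether to transform fare before modeling.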
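
As a rough illustration of how such findings might turn into preprocessing steps, the sketch below measures skew, applies a log transform, and fills missing ages with the median. The toy values are made up, and median imputation and `log1p` are just one reasonable choice among several, not a prescribed method.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0, None, 54.0],
    "fare": [7.25, 8.05, 13.0, 26.0, 30.0, 512.33],
})

# Positive skewness indicates a long right tail.
print("fare skew before:", toy["fare"].skew())

# log1p computes log(1 + x), so it stays defined for fares of zero.
toy["fare_log"] = np.log1p(toy["fare"])
print("fare skew after: ", toy["fare_log"].skew())

# Record which ages were missing before filling, then impute with the median,
# which is less sensitive to extreme values than the mean.
toy["age_missing"] = toy["age"].isna()
toy["age_filled"] = toy["age"].fillna(toy["age"].median())
print(toy[["age", "age_missing", "age_filled"]])
```

Keeping the `age_missing` flag lets a later model learn whether missingness itself carries signal.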

## Exercise

1. Load `sns.load_dataset('tips')` and perform a full EDA:
   - Shape, dtypes, missing values
   - Distribution of tip amounts
   - Tips by `day` and by `time` (Lunch/Dinner)
   - Correlation between `total_bill` and `tip`

In [None]:
# YOUR CODE HERE