# 📊 Exploratory Data Analysis (EDA) Template

This notebook provides a general template for exploring a new dataset.

You should customize:
- The **data loading** section
- The **target column name**
- Lists of **numerical** and **categorical** columns if needed

---


## 1. Setup & Imports

Import core libraries used throughout the notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import for the display() calls used later

# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:0.4f}")


## 2. Load Data

Update the path to your dataset as needed.

In [None]:
# TODO: Update this path or data source
DATA_PATH = "your_data.csv"  # e.g., "data/health_scores.csv"

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head()

## 3. Basic Info & Structure

Look at data types, non-null counts, and a quick preview.

In [None]:
df.info()

In [None]:
# Preview random rows
df.sample(5, random_state=42)

## 4. Basic Statistics

Summary statistics for numeric features and value counts for categorical features.

In [None]:
# Numeric summary
df.describe().T

In [None]:
# Categorical summary: collect text-like and categorical columns by dtype
cat_cols = df.select_dtypes(include=["object", "string", "category"]).columns.tolist()
cat_cols


In [None]:
for col in cat_cols:
    print(f"\n=== {col} value counts ===")
    print(df[col].value_counts(dropna=False).head(20))

## 5. Missing Values

Check where data is missing and how much per column.

In [None]:
missing_count = df.isna().sum()
missing_pct = (missing_count / len(df)) * 100
missing_df = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct
}).sort_values("missing_pct", ascending=False)

missing_df[missing_df["missing_count"] > 0]

In [None]:
# Bar plot of missing percentages (only columns with missing values)
cols_with_missing = missing_df[missing_df["missing_count"] > 0]
if not cols_with_missing.empty:
    plt.figure()
    cols_with_missing["missing_pct"].plot(kind="bar")
    plt.ylabel("% Missing")
    plt.title("Missing Data by Column")
    plt.tight_layout()
    plt.show()
else:
    print("No missing values detected.")

## 6. Identify Numeric & Categorical Columns

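Numeric columns are detected by dtype here; `cat_cols` was built in Section 4. You can manually override either list if needed.

In [None]:
num_cols = df.select_dtypes(include="number").columns.tolist()  # all int/float widths, not only 64-bit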
num_cols
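
A manual override is most often needed when a column has a numeric dtype but is not a measurement, such as an identifier. The sketch below illustrates this on a toy frame; the frame, its column names, and the `id_like` list are hypothetical and not part of the template.

```python
import pandas as pd

# Hypothetical toy data: customer_id is stored as an integer but is an identifier
toy = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 51, 29],
    "plan": ["basic", "pro", "basic"],
})

# Automatic detection by dtype
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()
toy_cat_cols = [c for c in toy.columns if c not in toy_num_cols]

# Manual override: drop identifier-like columns from the numeric list
id_like = ["customer_id"]  # assumption: chosen by inspecting the data
toy_num_cols = [c for c in toy_num_cols if c not in id_like]

print("numeric:", toy_num_cols)
print("categorical:", toy_cat_cols)
```

Without the override, the identifier would appear in the histograms, the correlation matrix, and the IQR filter, where it carries no meaning.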

## 7. Distributions of Numeric Features

Histograms and boxplots help understand the distribution and potential outliers.

In [None]:
for col in num_cols:
    fig, ax = plt.subplots()
    ax.hist(df[col].dropna(), bins=30)
    ax.set_title(f"Histogram of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
    plt.tight_layout()
    plt.show()


In [None]:
for col in num_cols:
    fig, ax = plt.subplots()
    ax.boxplot(df[col].dropna(), vert=True)
    ax.set_title(f"Boxplot of {col}")
    ax.set_ylabel(col)
    plt.tight_layout()
    plt.show()


## 8. Correlation Analysis

Examine linear correlations between numeric features.

In [None]:
if len(num_cols) > 1:
    corr = df[num_cols].corr()
    display(corr)  # a bare `corr` inside an `if` block is not rendered as cell output

In [None]:
if len(num_cols) > 1:
    fig, ax = plt.subplots(figsize=(8, 6))
    cax = ax.imshow(corr, interpolation="nearest")
    ax.set_xticks(range(len(num_cols)))
    ax.set_yticks(range(len(num_cols)))
    ax.set_xticklabels(num_cols, rotation=90)
    ax.set_yticklabels(num_cols)
    fig.colorbar(cax)
    ax.set_title("Correlation Matrix")
    plt.tight_layout()
    plt.show()
else:
    print("Not enough numeric columns for correlation analysis.")
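
With many columns the heatmap gets hard to read, and a ranked list of column pairs can be easier to scan. The following is a minimal sketch on a hypothetical toy frame (the names `toy`, `corr_toy`, and `pairs` are illustrative); it lists each pair once and sorts by absolute correlation.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: y follows x closely, z does not
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 11.0],
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr_toy = toy.corr()  # Pearson correlation by default

# Visit each unordered pair of columns once and sort by |correlation|
pairs = sorted(
    ((a, b, corr_toy.loc[a, b]) for a, b in combinations(corr_toy.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, value in pairs:
    print(f"{a} ~ {b}: {value:+.3f}")
```

In the notebook, the same loop could be applied to `corr` in place of `corr_toy`.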

## 9. Specify Target Column (Optional)

If you're doing supervised learning, set your target column here.

In [None]:
# TODO: Set your target column name, if applicable
TARGET_COL = None  # e.g., "Health_Score"

if TARGET_COL is not None and TARGET_COL in df.columns:
    print("Target column:", TARGET_COL)
else:
    print("No valid TARGET_COL set yet.")

### 9.1 Target Distribution (Regression Case)

If the target is numeric, inspect its distribution.

In [None]:
if TARGET_COL is not None and TARGET_COL in df.columns and pd.api.types.is_numeric_dtype(df[TARGET_COL]):
    fig, ax = plt.subplots()
    ax.hist(df[TARGET_COL].dropna(), bins=30)
    ax.set_title(f"Distribution of target: {TARGET_COL}")
    ax.set_xlabel(TARGET_COL)
    ax.set_ylabel("Count")
    plt.tight_layout()
    plt.show()
else:
    print("Target is not numeric or not set; skipping numeric target distribution.")

### 9.2 Target vs Features (Regression Case)

Scatter plots of numeric features vs target and sorted correlations.

In [None]:
if TARGET_COL is not None and TARGET_COL in df.columns and pd.api.types.is_numeric_dtype(df[TARGET_COL]):
    feature_cols = [c for c in num_cols if c != TARGET_COL]
    if feature_cols:
        corrs = df[feature_cols + [TARGET_COL]].corr()[TARGET_COL].drop(TARGET_COL)
        print("\nCorrelation with target:")
        print(corrs.sort_values(ascending=False))

        # Scatter plots for the 4 features most strongly correlated with the target (by absolute value)
        top_features = corrs.abs().sort_values(ascending=False).head(4).index.tolist()
        for col in top_features:
            fig, ax = plt.subplots()
            ax.scatter(df[col], df[TARGET_COL])
            ax.set_xlabel(col)
            ax.set_ylabel(TARGET_COL)
            ax.set_title(f"{col} vs {TARGET_COL}")
            plt.tight_layout()
            plt.show()
else:
    print("Target is not numeric or not set; skipping target vs feature analysis.")

### 9.3 Target vs Categorical Features

Use groupby statistics for numeric targets or crosstabs for classification targets.

In [None]:
if TARGET_COL is not None and TARGET_COL in df.columns and cat_cols:
    if pd.api.types.is_numeric_dtype(df[TARGET_COL]):
        # Regression-like target: show mean target per category
        for col in cat_cols:
            print(f"\n=== {col} vs {TARGET_COL} (mean) ===")
            display(df.groupby(col)[TARGET_COL].agg(['count', 'mean', 'std']).sort_values('mean', ascending=False))
    else:
        # Classification target: crosstab
        for col in cat_cols:
            print(f"\n=== {col} vs {TARGET_COL} (crosstab) ===")
            display(pd.crosstab(df[col], df[TARGET_COL], normalize='index'))
else:
    print("Target or categorical columns not suitable or not set; skipping target vs categorical analysis.")
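
To make the two branches concrete, here is a small sketch on a hypothetical toy frame (the columns `segment`, `score`, and `label` are invented for illustration). It shows what the groupby summary and the row-normalized crosstab each return.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature, one numeric and one categorical target
toy = pd.DataFrame({
    "segment": ["a", "a", "b", "b"],
    "score": [1.0, 3.0, 10.0, 20.0],
    "label": ["yes", "no", "yes", "yes"],
})

# Numeric target: per-category count and mean
per_group = toy.groupby("segment")["score"].agg(["count", "mean"])
print(per_group)

# Categorical target: each row sums to 1, giving class shares within a category
shares = pd.crosstab(toy["segment"], toy["label"], normalize="index")
print(shares)
```

With `normalize="index"`, each cell is the fraction of that category's rows falling into a given target class, so categories of different sizes can be compared directly.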

## 10. Outlier Detection with IQR

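Use the Interquartile Range (IQR) rule to remove outliers: a value is treated as an outlier when it falls below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`, where `IQR = Q3 - Q1`.

- Here we *create a cleaned copy* `df_clean` using IQR-based filtering.
- We do not modify the original `df` in place.
- Columns are filtered one after another, so each column's bounds are computed on the rows that survived the earlier columns.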

In [None]:
df_clean = df.copy()

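for col in num_cols:
    # Bounds come from the rows still in df_clean, so filtering is sequential
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    low = Q1 - 1.5 * IQR
    high = Q3 + 1.5 * IQR

    before_rows = len(df_clean)
    # Keep missing values: NaN fails both comparisons and would otherwise be dropped silently
    keep = df_clean[col].between(low, high) | df_clean[col].isna()
    df_clean = df_clean[keep]
    after_rows = len(df_clean)
    print(f"{col}: removed {before_rows - after_rows} rows (remaining: {after_rows})")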

print("\nOriginal shape:", df.shape)
print("Cleaned shape:", df_clean.shape)
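
To flag outliers rather than drop them, the same rule can be written as a boolean mask. This is a minimal sketch, assuming a hypothetical helper `iqr_outlier_mask` and a toy series; neither is part of the template.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Missing values are never flagged, because NaN fails both comparisons.
    """
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return (s < low) | (s > high)


# Hypothetical toy data: one obvious outlier and one missing value
toy = pd.Series([10, 12, 11, 13, 12, 100, None], name="toy_value")
mask = iqr_outlier_mask(toy)

# Here Q1 = 11.25 and Q3 = 12.75, so the bounds are [9.0, 15.0]
print(toy[mask])
```

A mask like this can be stored as an extra column (for example `df[f"{col}_is_outlier"]`), which keeps every row available for later inspection.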

## 11. Save Processed Data (Optional)

You can save the cleaned dataset for modeling or further analysis.

In [None]:
# TODO: Update output path if you want to persist the cleaned data
OUTPUT_PATH = "cleaned_data.csv"
df_clean.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned data saved to {OUTPUT_PATH}")

## 12. Next Steps

- Feature engineering
- Train/validation/test splits
- Model training and evaluation
- Hyperparameter tuning

You can now build on top of this EDA notebook for your modeling work.
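
As a starting point for the split step, the sketch below shuffles row positions once and slices them into three disjoint parts. The helper `split_frame`, its default fractions, and the toy frame are assumptions for illustration; a modeling library's own splitting utilities may be a better fit, for example when stratification is needed.

```python
import numpy as np
import pandas as pd


def split_frame(frame: pd.DataFrame, val_frac: float = 0.2,
                test_frac: float = 0.2, seed: int = 42):
    """Randomly split a frame into disjoint train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))  # shuffled row positions
    n_test = round(len(frame) * test_frac)
    n_val = round(len(frame) * val_frac)
    test = frame.iloc[order[:n_test]]
    val = frame.iloc[order[n_test:n_test + n_val]]
    train = frame.iloc[order[n_test + n_val:]]
    return train, val, test


# Hypothetical toy data standing in for df_clean
toy = pd.DataFrame({"feature": range(20), "target": range(20)})
train, val, test = split_frame(toy)
print(len(train), len(val), len(test))
```

Fixing `seed` keeps the split reproducible across notebook runs.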