# Exploratory Data Analysis (EDA) Methodology

EDA is the "detective work" of Data Science. Before you build any fancy AI models, you must understand your data. Is it messy? Are there errors? What are the patterns?

If you skip this step, you risk "Garbage In, Garbage Out"—your model will learn from bad data and give bad predictions.

## Learning Objectives
- **The Checklist**: A systematic way to inspect any new dataset.
- **Data Quality**: How to spot missing values (holes in your data) and outliers (weird values).
- **Distributions**: Understanding the "shape" of your data using histograms.

## 1. Creating a Sample Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a dummy dataset with some issues
np.random.seed(42)
df = pd.DataFrame({
    "age": np.random.randint(20, 60, 100),
    "salary": np.random.normal(50000, 15000, 100),
    "department": np.random.choice(["IT", "HR", "Sales"], 100)
})

# Introduce some missing data and outliers
df.loc[0, "salary"] = np.nan
df.loc[1, "salary"] = 1000000  # Outlier

## 2. Inspecting Structure and Quality

The first thing you do is check the "vitals" of your dataset:
1.  **Shape**: How much data do we have? `df.shape` returns a `(rows, columns)` tuple.
2.  **Missing Values**: Do we have `NaN` (Not a Number) values? If a column is 90% empty, we might want to drop it.
3.  **Summary Stats**: `describe()` gives you the count, mean, standard deviation, min, quartiles, and max. By default it only covers numeric columns, so `department` will not appear. This is a quick way to spot issues (e.g., if the `min` age is -5, you know something is wrong).

In [None]:
print("Shape:", df.shape)
print("\nMissing Values:\n", df.isnull().sum())
print("\nSummary Stats:\n", df.describe())
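
The "90% empty" rule of thumb from the checklist can be sketched as follows. This is a minimal illustration, not a fixed recipe: the `toy` frame, its `notes` column, and the `threshold` value are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "salary": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "notes": [np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_frac = toy.isna().mean()
print(missing_frac)

# Assumed cutoff: drop columns that are at least 90% empty
threshold = 0.9
cols_to_drop = missing_frac[missing_frac >= threshold].index.tolist()
trimmed = toy.drop(columns=cols_to_drop)

print("Dropped:", cols_to_drop)
print("Remaining columns:", trimmed.columns.tolist())
```

Looking at fractions rather than raw counts makes columns comparable across datasets of different sizes. Whether 90% is the right cutoff depends on how important the column is, so treat the number as a starting point.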

## 3. Univariate Analysis (Distributions)

"Univariate" means looking at **one variable** at a time. For a numeric variable, the usual starting point is a **Histogram**.

It shows you the "shape" of the data:
*   **Normal Distribution**: Shaped like a bell (e.g., height: most people are close to average, few are very tall or very short).
*   **Skewed**: Leaning to one side (e.g., salaries: most people earn a moderate amount, but a few billionaires form a long right tail and pull the average up).

In the plot below, look for the "outlier"—a bar that is far away from the rest.

In [None]:
df["salary"].hist(bins=20)
plt.title("Salary Distribution (Notice the outlier!)")
plt.show()
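
A histogram shows skew visually; a few summary numbers can back up what you see. The sketch below compares the mean, median, and skewness of a salary column with and without a single extreme value. The data is synthetic and generated only for this example (it is not the notebook's `df`), and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic salaries, generated only for this illustration
rng = np.random.default_rng(0)
salaries = pd.Series(rng.normal(50000, 15000, 99))

# Same data plus one extreme value
with_outlier = pd.concat(
    [salaries, pd.Series([1_000_000.0])], ignore_index=True
)

for label, s in [("without outlier", salaries), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={s.mean():,.0f} "
        f"median={s.median():,.0f} skew={s.skew():.2f}"
    )
```

The mean shifts noticeably when the extreme value is added, while the median barely moves. A large gap between mean and median, or a skewness far from zero, is a hint to look at the tails of the histogram.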

## 4. Handling Outliers

An **outlier** is a data point that differs significantly from other observations. It could be a real anomaly (e.g., a billionaire in a salary dataset) or an error (e.g., someone typed an extra zero).

Outliers can ruin your analysis (like how one billionaire changes the average income of a whole room). A common strategy is to **filter** them out if they look like errors or extreme edge cases.

In [None]:
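# Keep rows with a plausible salary; 200000 is a hand-picked cutoff for this toy data.
# Note: NaN < 200000 evaluates to False, so the row with the missing salary is dropped too.
clean_df = df[df["salary"] < 200000].copy()
clean_df["salary"].hist(bins=20)
plt.title("Cleaned Salary Distribution")
plt.show()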
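
A hand-picked cutoff works for a toy dataset, but it does not transfer to other columns. One common alternative, which this notebook does not use, is the 1.5 × IQR rule of thumb, which derives the limits from the data itself. The sketch below assumes a small made-up `salary` series; the 1.5 multiplier is a convention, not a law.

```python
import numpy as np
import pandas as pd

# Small hypothetical salary column with one missing and one extreme value
salary = pd.Series(
    [42000, 48000, 50000, 51000, 55000, 61000, np.nan, 1_000_000],
    dtype="float64",
)

# Quartiles ignore NaN by default
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (salary < lower) | (salary > upper)

print("Fences:", lower, upper)
print("Flagged:", salary[is_outlier].tolist())

# Comparisons with NaN are False, so missing values are NOT flagged here;
# they stay in `kept` and can be handled as a separate decision.
kept = salary[~is_outlier]
print("Rows kept:", len(kept), "of", len(salary))
```

Building an explicit `is_outlier` mask keeps the outlier decision separate from the missing-value decision, and lets you inspect the flagged rows before deleting anything.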