# Unit 3: EDA (Exploratory Data Analysis)

---

## Lesson 3.2: Basic Seaborn for EDA

In [None]:
# First we will load all the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")
df.head()

In this lesson, we’ll learn how to use **Seaborn**, a Python plotting library built on top of Matplotlib, to make clear, informative charts.

Plotting the data helps us understand it before we use it to train a machine learning model.

## 1. Visualize a column with countplot()

A countplot shows how many times each category appears.

In [None]:
sns.countplot(x="Sex", data=df)
plt.title("Passenger Count by Sex")
plt.show()

**What it shows**:

This shows how many male vs. female passengers were on the Titanic.

**Why it's useful**:

Helps you check whether your data is balanced (e.g., is there a similar number of males and females?)

If one group is much larger, a model trained on the data might be biased toward that group
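The counts behind a countplot can also be read off as numbers, which makes the balance check more precise. Here is a minimal sketch using a small made-up table (the `toy` DataFrame is an illustration, not the real Titanic data):

```python
import pandas as pd

# Made-up rows for illustration only (not the real Titanic data)
toy = pd.DataFrame({"Sex": ["male", "male", "male", "female", "female"]})

# Number of rows in each category: the bar heights in a countplot
counts = toy["Sex"].value_counts()

# Same information as a fraction of all rows, handy for judging balance
shares = toy["Sex"].value_counts(normalize=True)

print(counts)
print(shares)
```

With the lesson's data, the same two calls on `df["Sex"]` should give the numbers that the chart displays as bars.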

## 2. Compare two columns with barplot()

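A barplot shows the average of one column, grouped by another. By default, the height of each bar is the mean, and the thin line on top of each bar is a 95% confidence interval. Because "Survived" is 0 or 1, its mean is the survival rate.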

In [None]:
sns.barplot(x="Sex", y="Survived", data=df)
plt.title("Survival Rate by Sex")
plt.show()

**What it shows**:

The average survival rate for each sex.

**Why it's useful**:

Shows which group is more likely to survive

Tells us which columns (like "Sex") may be important for prediction
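The bar heights in this kind of plot are group means, so they can be reproduced with a pandas `groupby`. Here is a small sketch on made-up rows (the `toy` table is hypothetical, not the real dataset):

```python
import pandas as pd

# Made-up rows for illustration only (not the real Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = share of 1s in that group
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(survival_rate)
```

Checking the numbers this way is useful when two bars look close and you want the exact difference.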

## 3. Look at Distributions with histplot()

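A histogram shows how the values of a numeric column, such as age, are spread out. Setting `kde=True` adds a smooth curve that estimates the shape of the distribution.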

In [None]:
sns.histplot(data=df, x="Age", bins=20, kde=True)
plt.title("Age Distribution")
plt.show()

**What it shows**:

How passenger ages are spread out (e.g., many young people? few older ones?).

**Why it's useful**:

Helps find common value ranges (e.g., most passengers are 20–40)

Can show gaps or outliers

Helps decide how to group data
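Once a histogram suggests sensible ranges, one way to group the data is `pd.cut`. The sketch below uses made-up ages, and the bin edges and labels are assumptions chosen for illustration, not values taken from the lesson's data:

```python
import pandas as pd

# Made-up ages for illustration only; None stands in for a missing value
toy = pd.DataFrame({"Age": [4, 15, 22, 28, 35, 41, 67, None]})

# Hypothetical bin edges and labels; each bin includes its right edge
bins = [0, 12, 18, 40, 60, 120]
labels = ["child", "teen", "adult", "middle-aged", "senior"]

toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Count rows per group, keeping the groups in bin order
group_counts = toy["AgeGroup"].value_counts(sort=False)

print(group_counts)
```

Rows with a missing age get a missing group as well, so they are left out of the counts.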

## 4. Look for Relationships with scatterplot()

Use a scatterplot to see how two numerical columns are related.

In [None]:
sns.scatterplot(x="Age", y="Fare", data=df)
plt.title("Age vs. Fare")
plt.show()

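**What it shows**:

Each point is one passenger, positioned by age (x-axis) and fare paid (y-axis).

**Why it's useful**:

Shows whether two numeric columns tend to rise or fall together (e.g., do older passengers pay higher fares?)

Makes unusual points easy to spot (e.g., a passenger with an unusually high fare)

Reveals clusters or gaps that a single-column plot would hide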
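A scatterplot shows a relationship visually; a correlation coefficient puts a single number on it. Below is a minimal sketch using the Pearson correlation on made-up values (the `toy` table is hypothetical, not the real dataset):

```python
import pandas as pd

# Made-up values for illustration only (not the real Titanic data)
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Fare": [10, 30, 20, 40],
})

# Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line,
# and values near 0 mean little linear relationship
r = toy["Age"].corr(toy["Fare"])

print(round(r, 2))
```

A value near 0 does not rule out a relationship; it only says a straight line describes it poorly, which is one reason to look at the plot as well.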

## 5. Look for Relationships with lmplot()

Use `lmplot` to draw the same scatterplot with a fitted regression line (a line of best fit).

In [None]:
sns.lmplot(x="Age", y="Fare", data=df)
plt.title("Age vs. Fare with Regression Line")
plt.show()

**What it shows**:

Same as scatterplot, but also shows the line of best fit.

**Why it's useful**:

Makes patterns easier to see

Suggests whether there’s a linear relationship between two variables
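To see what a line of best fit is numerically, the sketch below fits a straight line by least squares with NumPy. It illustrates the idea on made-up values and is not a reproduction of Seaborn's internals:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; the last row has a missing age
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, None],
    "Fare": [10, 30, 20, 40, 15],
})

# np.polyfit cannot handle missing values, so drop incomplete rows first
clean = toy.dropna(subset=["Age", "Fare"])

# Degree-1 fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(clean["Age"].to_numpy(),
                              clean["Fare"].to_numpy(), deg=1)

print(f"Fare is approximately {slope:.2f} * Age + {intercept:.2f}")
```

The slope says how much the fitted fare changes per extra year of age in this toy table.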

## 6. Boxplots

A boxplot summarizes the distribution of a numeric column for each group and marks outliers as individual points.

In [None]:
sns.boxplot(x="Sex", y="Age", data=df)
plt.title("Age Spread by Sex")
plt.show()

**What it shows**:

Shows the range, median, and outliers of ages for males and females.

**Why it's useful**:

Helps you find outliers that may need to be investigated or removed

Shows whether one group is more spread out or sits higher than another (e.g., are male passengers older than female passengers?)
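By default, a boxplot draws points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below applies that rule by hand to made-up ages for a single group (the values are hypothetical):

```python
import pandas as pd

# Made-up ages for illustration only (not the real Titanic data)
ages = pd.Series([20, 22, 24, 26, 28, 30, 80])

q1 = ages.quantile(0.25)   # bottom edge of the box
q3 = ages.quantile(0.75)   # top edge of the box
iqr = q3 - q1              # height of the box

# Values outside these limits are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]

print(outliers.tolist())
```

Flagged values are candidates for a closer look; whether to remove them depends on whether they are errors or real but rare cases.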

## 7. Heatmaps

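A heatmap colors each cell of a table by its value, which makes it useful for spotting missing values or correlations. The example here maps missing values: each row is a passenger, each column is a feature, and cells with missing data stand out in a different color.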

In [None]:
sns.heatmap(df.isnull())
plt.title("Where Data is Missing")
plt.show()

**What it shows**:

Where data is missing (NaN)

**Why it's useful**:

Tells you which columns need cleaning (for example, filling in or dropping values) because they have missing data
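The heatmap shows where values are missing; a per-column summary shows how many. Here is a small sketch on a made-up table (the `toy` data and its missing cells are invented for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; None marks a missing value
toy = pd.DataFrame({
    "Age": [22, None, 35, None],
    "Sex": ["male", None, "female", "female"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives True/False per cell; summing counts the True values
missing_count = toy.isnull().sum()

# The mean of True/False values is the fraction missing in each column
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share)
```

Columns with a large share of missing values often need a different treatment than columns with only a few gaps.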