# Advanced Seaborn — Part 2 & Titanic EDA  v.ekc-c

Two new high-level Seaborn functions — `displot` and `relplot` — plus a full EDA group activity on the **Titanic dataset**.

| Section | Topic |
|---------|-------|
| 1 | Setup & Warm-Up |
| 2 | `displot` — Faceted Distribution Plots |
| 3 | `relplot` — Faceted Relationship Plots |
| 4 | 🔬 Titanic EDA Activity |
| Appendix | Quick Reference |


---
## 1. Setup & Warm-Up

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = [6, 3]
plt.rcParams['figure.dpi'] = 80


In [None]:
# Reload and clean Covid data (same as last class)
covid = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Python-Data-Cleaning-Cookbook/master/Chapter05/data/covidtotals.csv')
covid.dropna(axis=0, inplace=True)
covid['lastdate'] = pd.to_datetime(covid['lastdate'])
high_regions = covid.loc[covid.region.isin(['South America', 'Western Europe', 'North America'])]
print(f"Covid: {covid.shape}, High-regions subset: {high_regions.shape}")


### 🔬 Warm-Up — Recall from last class

Without peeking, write code to:
1. Make a **kdeplot** of `total_deaths_pm` for the 3 high-covid regions (overlapping, colored by region)
2. Make a **violinplot** of `total_deaths_pm` by `region` for the same subset


In [None]:
# 1. kdeplot recall


In [None]:
# 2. violinplot recall


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. kdeplot
sns.kdeplot(data=high_regions, x='total_deaths_pm', hue='region', fill=True)
plt.show()

# 2. violinplot
sns.violinplot(data=high_regions, y='region', x='total_deaths_pm')
plt.show()
```

</details>

---
## 2. `displot` — Faceted Distribution Plots

### 📋 Board Reference

`sns.displot` is a **figure-level** function that wraps `histplot`, `kdeplot`, and `ecdfplot` with built-in faceting.

```python
sns.displot(data=df, x='num', col='cat', kind='hist')   # faceted histogram
sns.displot(data=df, x='num', col='cat', kind='kde')    # faceted KDE
sns.displot(data=df, x='num', hue='cat', kind='kde', fill=True)  # overlapping KDE
```

| `kind=` | Plot type |
|---------|-----------|
| `'hist'` (default) | Histogram |
| `'kde'` | KDE (smooth density) |
| `'ecdf'` | Empirical CDF |

**Advantage over FacetGrid:** Less boilerplate — faceting is built in.


In [None]:
# displot — faceted KDE, one panel per region
sns.displot(data=high_regions, x='total_deaths_pm',
            hue='region', kind='kde', fill=True, col='region');


In [None]:
# displot — overlapping KDE with hue only (no faceting)
sns.displot(data=high_regions, x='total_deaths_pm',
            hue='region', kind='kde', fill=True);


In [None]:
# displot — histogram with binwidth
sns.displot(data=high_regions, x='total_deaths_pm',
            col='region', hue='region', binwidth=50);


---
### 🔬 Explore 1 — `displot`

1. Use `sns.displot` to make a **faceted histogram** of `gdp_per_capita` by `region`  
   (use all regions, `col='region'`, `col_wrap=4`).  
   Add `binwidth=5000`. Which regions have the most spread?

2. Use `kind='kde'` and `fill=True` to make an **overlapping density plot** of `median_age`  
   for `high_regions`. Add `hue='region'`.

3. Use `kind='ecdf'` on `total_cases_pm` for the 3 high-covid regions.  
   What does an ECDF tell you that a histogram doesn't?

4. **Bonus**: Store any displot in a variable (`g = sns.displot(...)`) and call `g.set(xlim=(0, 100000))`.  
   How does this differ from setting axis limits on a single-axes plot?


In [None]:
# 1. displot histogram gdp_per_capita all regions


In [None]:
# 2. displot KDE median_age for 3 regions


In [None]:
# 3. displot ECDF total_cases_pm


**What does the ECDF tell you?** *(double-click to edit)*

*Your answer here*

In [None]:
# 4. Bonus: set axis limits on displot


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*ECDF (empirical CDF): the y-value at any x is the fraction of countries with at most x cases per million. It lets you compare distributions without choosing bins or assuming a shape.*

```python
# 1. Faceted histogram of GDP
g = sns.displot(data=covid, x='gdp_per_capita', col='region',
                hue='region', col_wrap=4, binwidth=5000)
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

# 2. Overlapping KDE of median age
sns.displot(data=high_regions, x='median_age', hue='region',
            kind='kde', fill=True)
plt.show()

# 3. ECDF
sns.displot(data=high_regions, x='total_cases_pm', hue='region', kind='ecdf')
plt.show()
# ECDF shows: what fraction of countries are AT OR BELOW a given value?
# The y-axis is the cumulative proportion (0→1)

# 4. Axis limits on figure-level plot
g = sns.displot(data=covid, x='gdp_per_capita', hue='region',
                kind='kde', fill=True)
g.set(xlim=(0, 100000))
plt.show()
```

</details>
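
The ECDF reading described in the peek above can be checked by hand. The sketch below uses made-up numbers (the `cases` list and the `ecdf` helper are hypothetical, written only for this illustration) and computes the same step heights that `kind='ecdf'` would draw.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up cases-per-million figures for five countries (illustration only)
cases = [120.0, 450.0, 700.0, 900.0, 3000.0]
x, y = ecdf(cases)

# Read one point off the curve: share of countries at or below 500
frac_at_or_below_500 = float(np.mean(np.asarray(cases) <= 500))
print(frac_at_or_below_500)   # 0.4
```

Two of the five made-up values are at or below 500, so the curve sits at 0.4 there. No bin width is involved, which is why ECDFs are easier to compare across groups than histograms.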

---
## 3. `relplot` — Faceted Relationship Plots

### 📋 Board Reference

`sns.relplot` is the figure-level version of `scatterplot` and `lineplot` with built-in faceting.

```python
sns.relplot(data=df, x='num', y='num2', col='cat', kind='scatter')
sns.relplot(data=df, x='num', y='num2', row='cat2', col='cat', hue='cat3')
```

| `kind=` | Plot type |
|---------|-----------|
| `'scatter'` (default) | Scatter plot |
| `'line'` | Line plot (useful for time series) |

**Advantage:** Handles `col=`, `row=`, and `hue=` all at once — great for multi-faceted investigation.
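
As a rough sketch of combining several semantic mappings in one call, the example below uses made-up data. The `toy` frame and its `site`, `label`, and `weight` columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data (illustration only): two sites, two labels, a numeric weight
rng = np.random.default_rng(1)
n = 80
toy = pd.DataFrame({
    'x': rng.uniform(0, 10, n),
    'site': rng.choice(['north', 'south'], n),
    'label': rng.choice(['ctrl', 'test'], n),
    'weight': rng.uniform(1, 5, n),
})
toy['y'] = 2 * toy['x'] + rng.normal(0, 2, n)

# col= splits into panels, hue= colors the points, size= scales the markers
g = sns.relplot(data=toy, x='x', y='y',
                col='site', hue='label', size='weight', kind='scatter')
g.set_axis_labels('x (made up)', 'y (made up)')
g.set_titles('{col_name}')
```

Each mapping answers a different question: panels separate the sites, color separates the labels, and marker size carries a third numeric variable.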


In [None]:
# relplot — faceted scatter
sns.relplot(data=high_regions,
            x='total_deaths_pm',
            y='total_cases_pm',
            col='region', kind='scatter');


In [None]:
# relplot with hue + col
sns.relplot(data=high_regions,
            x='gdp_per_capita',
            y='total_deaths_pm',
            col='region',
            hue='region',
            kind='scatter');


---
### 🔬 Explore 2 — `relplot`

1. Use `sns.relplot` to make a **faceted scatter** of `gdp_per_capita` vs `total_cases_pm`  
   with `col='region'` and `col_wrap=3` (all regions). Which region has the strangest pattern?

2. Make a `relplot` of `median_age` (x) vs `total_deaths_pm` (y) for the 3 high-covid regions,  
   with `col='region'`. Size the points by `pop_density`.

3. **Discussion**: For this Covid data, which visualization from the last two classes  
   was most informative to you and why?

4. **Bonus**: Try `kind='line'` on the flight delays dataset below:


In [None]:
# For bonus: flight delays
flights = pd.read_csv('flight_delays.csv')
flights['Flight_Date'] = pd.to_datetime(flights['Flight_Date'])
flights['Year'] = flights['Flight_Date'].dt.year
flights.head(3)


In [None]:
# 1. relplot GDP vs cases, all regions


In [None]:
# 2. relplot median_age vs deaths, sized by pop_density


**Answer to Q3 — Most informative viz:** *(double-click to edit)*

*Your answer here*

In [None]:
# 4. Bonus: relplot on flight delays


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. GDP vs cases, all regions
g = sns.relplot(data=covid, x='gdp_per_capita', y='total_cases_pm',
                col='region', col_wrap=3)
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

# 2. Median age vs deaths, sized by density
sns.relplot(data=high_regions,
            x='median_age', y='total_deaths_pm',
            col='region', size='pop_density', hue='region')
plt.show()

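# 4. Bonus: mean departure delay over time (line plot)
# kind='line' averages the y values that share an x value;
# errorbar=None skips the bootstrapped confidence band
g = sns.relplot(data=flights, x='Flight_Date', y='Departure_Delay_Minutes',
                kind='line', errorbar=None, aspect=2)
g.set_axis_labels('Flight date', 'Mean departure delay (min)')
plt.show()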
```

</details>

---
## 4. 🔬 Titanic EDA Activity

Now you put it all together: a full EDA on the **Titanic dataset**.  
Work through the tasks below. There is no single right answer, only stronger and weaker analyses.


In [None]:
plt.style.use('ggplot')
titanic = sns.load_dataset('titanic')
titanic.head()


In [None]:
titanic.describe()


In [None]:
titanic.info()


### Task 1 — Visualize Missing Data

Use a heatmap to visualize where the null values are.  
*(Hint: pass `titanic.isnull()` as the data to `sns.heatmap`)*


In [None]:
# Heatmap of null values


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*The `age` column is missing for roughly one in five passengers. The `deck` column is missing for about three quarters of them, so it may need to be dropped.*

```python
plt.figure(figsize=(10, 4))
sns.heatmap(titanic.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values in Titanic Dataset (yellow = missing)')
plt.tight_layout()
plt.show()
```

</details>
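
A heatmap shows where the gaps are; a per-column fraction tells you how large they are. The sketch below uses a tiny made-up frame (`toy` is an assumption for illustration, not the Titanic data), and the same two calls work on `titanic`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps (illustration only)
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, np.nan],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print(missing_frac)
```

Sorting puts the worst columns first, which helps when deciding what to drop and what to impute.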

**What do you notice about the missing data?** *(double-click to edit)*

*Your answer here*

### Task 2 — Correlation Heatmap

Use a heatmap to visualize correlations among the **numeric** columns.  
What do you notice about `survived`?


In [None]:
# Correlation heatmap of numeric columns


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Note: pclass has a negative correlation with survived — lower class numbers (1=first class) survived more.*

```python
corrmat = titanic.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corrmat, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Titanic Correlation Matrix')
plt.tight_layout()
plt.show()
```

</details>
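
When one column is the target, it can help to pull its row out of the correlation matrix and sort it. This is a sketch on made-up data (the `toy` frame is an assumption for illustration); the same chain should work on `titanic` with `'survived'` as the target.

```python
import pandas as pd

# Made-up passengers (illustration only)
toy = pd.DataFrame({
    'survived': [1, 1, 1, 0, 0, 0],
    'pclass':   [1, 1, 2, 3, 3, 2],
    'fare':     [80.0, 60.0, 30.0, 8.0, 7.0, 13.0],
    'name':     list('abcdef'),   # non-numeric, skipped by numeric_only=True
})

# Correlation of every numeric column with the target, most negative first
corr_with_target = (toy.corr(numeric_only=True)['survived']
                       .drop('survived')
                       .sort_values())
print(corr_with_target)
```

In this toy data `pclass` comes out negative and `fare` positive, mirroring the direction described in the peek above. Correlation only captures linear association, so treat the ranking as a starting point.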

**What do you notice about the `survived` column?** *(double-click to edit)*

*Your answer here*

### Task 3 — Count Plots

Create subplots showing the count of each category for:  
`survived`, `pclass`, `sex`, `sibsp`, `parch`, `embark_town`, `alone`

*(Hint: use `plt.subplots` + a loop over columns)*


In [None]:
# Subplot grid of countplots


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
cat_cols = ['survived', 'pclass', 'sex', 'sibsp', 'parch', 'embark_town', 'alone']
fig, axes = plt.subplots(1, len(cat_cols), figsize=(18, 3))

for i, col in enumerate(cat_cols):
    sns.countplot(data=titanic, x=col, ax=axes[i])
    axes[i].set_title(col)
    axes[i].tick_params(axis='x', rotation=30)

plt.suptitle('Category Counts — Titanic', y=1.05)
plt.tight_layout()
plt.show()
```

</details>

**Observations from count plots:** *(double-click to edit)*

*Your answer here*

### Task 4 — Survival by Class and Sex

Create a plot that shows survived vs. not-survived counts for each `pclass`, faceted by `sex`.  
*(Hint: `sns.countplot(x='pclass', hue='survived')` inside a `sns.FacetGrid`, or the figure-level `sns.catplot(kind='count', col='sex')`)*


In [None]:
# Survival by pclass, faceted by sex


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
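# catplot(kind='count') is the figure-level countplot. It dodges the hue bars
# and keeps the pclass order consistent across panels, which
# FacetGrid.map(sns.countplot, ...) does not guarantee.
g = sns.catplot(data=titanic, x='pclass', hue='survived', col='sex',
                kind='count', palette={0: 'salmon', 1: 'steelblue'})
g.set_axis_labels('Passenger Class', 'Count')
g.set_titles(col_template='{col_name}')
g.legend.set_title('Survived')
plt.show()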
```

</details>
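
Counts are easier to interpret next to rates. Because `survived` is coded 0/1, its group mean is the survival rate. The sketch below uses made-up passengers (the `toy` frame is an assumption for illustration); the same `groupby` should apply to `titanic`.

```python
import pandas as pd

# Made-up passengers (illustration only)
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'pclass':   [1, 1, 3, 3, 1, 1, 3, 3],
    'survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the survival rate within each group
rates = toy.groupby(['sex', 'pclass'])['survived'].mean().unstack('pclass')
print(rates)
```

The result is a small sex-by-class table of rates that you can read alongside the count plot, or pass to `sns.heatmap(rates, annot=True)`.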

### Task 5 — Age Distributions by Survival

Create a visualization that compares the age distribution of survivors vs. non-survivors.  
Use `sns.kdeplot` or `sns.boxplot`.


In [None]:
# Age distribution — survived vs not survived


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# Option 1: kdeplot
sns.kdeplot(data=titanic.dropna(subset=['age']),
            x='age', hue='survived', fill=True)
plt.title('Age Distribution by Survival')
plt.show()

# Option 2: boxplot
sns.boxplot(data=titanic, x='survived', y='age')
plt.title('Age vs Survival')
plt.xticks([0, 1], ['Did Not Survive', 'Survived'])
plt.show()
```

</details>

**What does this tell you about age and survival?** *(double-click to edit)*

*Your answer here*

---
## Appendix — `displot` and `relplot` Reference

```python
# displot — figure-level distribution plots
sns.displot(data=df, x='num', col='cat', kind='hist', binwidth=10)
sns.displot(data=df, x='num', hue='cat', kind='kde', fill=True)
sns.displot(data=df, x='num', hue='cat', kind='ecdf')
g = sns.displot(...)
g.set(xlim=(0, 100))         # control axis limits
g.set_titles('{col_name}')   # panel titles

# relplot — figure-level scatter / line plots
sns.relplot(data=df, x='num', y='num2', col='cat', kind='scatter')
sns.relplot(data=df, x='num', y='num2', row='cat', col='cat2', hue='cat3', size='num3')

# Common EDA combination for categorical target
sns.countplot(data=df, x='cat', hue='target')
# Faceted version: catplot dodges the hue bars and keeps category order consistent
sns.catplot(data=df, x='other_cat', hue='target', col='cat', kind='count')
```
