# Advanced Seaborn — Part 1  v.ekc-c

Beyond the basics: **regplot**, **jointplot**, **violinplot**, **swarmplot**, **pointplot**, **kdeplot**, and **FacetGrid**.  
We'll apply these to a real Covid-19 dataset to practice investigation-driven visualization.

| Section | Topic |
|---------|-------|
| 1 | Setup & Data Prep |
| 2 | Regression Plots — regplot & jointplot |
| 3 | Distribution Comparisons — violin, swarm, point |
| 4 | KDE & Faceted Distributions |
| 5 | Open Exploration |
| Appendix | Advanced Seaborn Reference |


---
## 1. Setup & Data Prep

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


In [None]:
covid = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Python-Data-Cleaning-Cookbook/master/Chapter05/data/covidtotals.csv')
covid.head()


### Quick data prep — run this before exploring

We'll drop rows with missing values and convert `lastdate` to a datetime type.


In [None]:
# Drop rows with NAs, fix date type
covid.dropna(axis=0, inplace=True)
covid['lastdate'] = pd.to_datetime(covid['lastdate'])
print(f"Shape after cleaning: {covid.shape}")
covid.dtypes
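
Note that `dropna(axis=0)` drops a row if *any* column is missing, which can remove more countries than expected. A minimal sketch on a toy frame compares that with restricting the check to the columns you plan to plot. The values and the `subset` choice are illustrative assumptions, not taken from the Covid data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not the Covid dataset)
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D'],
    'gdp_per_capita': [1000.0, np.nan, 3000.0, 4000.0],
    'hosp_beds': [2.5, 1.0, np.nan, 3.0],
})

# Drop a row if ANY column is missing (what the cell above does)
strict = toy.dropna(axis=0)

# Drop a row only if one of the listed columns is missing
lenient = toy.dropna(axis=0, subset=['gdp_per_capita'])

print(len(toy), len(strict), len(lenient))
```

Comparing the row counts before and after is a quick way to see how much data a cleaning step costs.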


### 🔬 Warm-Up — Re-orient yourself

1. How many countries (rows) are left after cleaning?
2. How many world **regions** are there? Use `.value_counts()`.
3. What is the median `gdp_per_capita` across all countries?


In [None]:
# Warm-up: explore shape, regions, median GDP


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
print(f"Countries: {covid.shape[0]}")
print("\nRegion counts:")
print(covid['region'].value_counts())
print(f"\nMedian GDP per capita: ${covid['gdp_per_capita'].median():,.0f}")
```

</details>

---
## 2. Regression Plots — `regplot` & `jointplot`

### 📋 Board Reference

| Function | What it does | Best for |
|----------|-------------|----------|
| `sns.regplot(x, y)` | Scatter + OLS regression line + CI band | Showing linear relationship |
| `sns.jointplot(x, y)` | Scatter + marginal distributions on axes | Seeing joint AND marginal distributions |
| `sns.jointplot(..., hue='cat')` | Colored by group | Comparing joint distributions across groups |

**Key insight:** `regplot` = "is there a trend?" · `jointplot` = "where do the data cluster?"
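
To make "is there a trend?" concrete, the sketch below fits the same kind of straight line that `regplot` draws by default, using `np.polyfit` on synthetic data (the numbers are made up for illustration). The sign of the slope gives the direction of the trend, and the Pearson correlation gives a rough sense of its strength.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y rises with x, plus noise (illustrative only)
x = rng.uniform(0, 100, size=200)
y = 2.0 * x + 5.0 + rng.normal(0, 10, size=200)

# Degree-1 least-squares fit: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation: strength of the linear relationship, in [-1, 1]
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```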


In [None]:
# regplot — scatter + regression line
sns.regplot(data=covid, x='gdp_per_capita', y='total_deaths_pm');


In [None]:
# jointplot — scatter + marginal histograms
sns.jointplot(data=covid, x='gdp_per_capita', y='total_deaths_pm');


In [None]:
# Subset to three interesting regions for comparison
high_regions = covid.loc[covid.region.isin(['South America', 'Western Europe', 'North America'])]
sns.jointplot(data=high_regions, x='gdp_per_capita', y='total_deaths_pm', hue='region')
plt.xlim([0, 120000])
plt.ylim([-20, 1000]);


---
### 🔬 Explore 1 — Regression & Joint Plots

1. Use `sns.regplot` to plot `median_age` (x) vs `total_deaths_pm` (y).  
   Is there a positive or negative trend? Does it make sense?

2. Use `sns.jointplot` to plot `pop_density` (x) vs `total_cases_pm` (y).  
   Are denser countries hit harder?

3. Filter to **South America only** and make a `regplot` of `gdp_per_capita` vs `total_deaths_pm`.  
   How does the trend compare to the global picture?

4. **Bonus**: Use `sns.jointplot(..., kind='kde')` on `gdp_per_capita` vs `total_deaths_pm`.  
   What does the KDE joint plot show that the scatter doesn't?


In [None]:
# 1. regplot — median_age vs total_deaths_pm


In [None]:
# 2. jointplot — pop_density vs total_cases_pm


In [None]:
# 3. South America only: gdp vs deaths


In [None]:
# 4. Bonus: kind='kde' jointplot


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Countries with older populations and higher GDP tend to record more deaths per million. Age is a real risk factor, and wealthier countries likely also report deaths more completely.*

```python
# 1. Median age vs death rate
sns.regplot(data=covid, x='median_age', y='total_deaths_pm')
plt.title('Older populations → higher death rates?')
plt.show()

# 2. Population density vs cases per million
sns.jointplot(data=covid, x='pop_density', y='total_cases_pm')
plt.show()

# 3. South America only
sa = covid[covid.region == 'South America']
sns.regplot(data=sa, x='gdp_per_capita', y='total_deaths_pm')
plt.title('South America: GDP vs Death Rate')
plt.show()

# 4. KDE joint plot
sns.jointplot(data=covid, x='gdp_per_capita', y='total_deaths_pm', kind='kde')
plt.show()
```

</details>

---
## 3. Distribution Comparisons — Violin, Swarm, Point

### 📋 Board Reference

| Function | What it shows | Best for |
|----------|--------------|----------|
| `sns.violinplot(x, y)` | Distribution shape + median + IQR | Comparing shapes, not just centers |
| `sns.swarmplot(x, y)` | Every individual point, non-overlapping | Small-medium datasets — see all data |
| `sns.pointplot(x, y)` | Mean ± CI as connected dots | Comparing means across ordered categories |

**Tip:** Layer swarmplot ON TOP of boxplot/violin for a richer view.
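
For intuition about what `pointplot` summarizes, here is a rough sketch of a mean with a percentile bootstrap 95% interval, computed by hand with NumPy. By default `pointplot` shows a bootstrap confidence interval around the mean, but the helper name `bootstrap_mean_ci` and the toy numbers below are hypothetical, and seaborn's own implementation differs in detail.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, level=95, seed=0):
    """Return (mean, low, high): a percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

# Toy deaths-per-million values for one hypothetical region
sample = [120, 340, 95, 410, 280, 150, 600, 220]
mean, low, high = bootstrap_mean_ci(sample)
print(f"mean={mean:.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

With only a handful of countries per region, expect these intervals to be wide.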


In [None]:
# Boxplot by region — deaths per million
plt.figure(figsize=(5, 6))
sns.boxplot(y='region', x='total_deaths_pm', data=covid)
plt.tight_layout()
plt.show()


In [None]:
# Zoom in on the high-covid regions
high_regions = covid.loc[covid.region.isin(['South America', 'Western Europe', 'North America'])]

# violin — shows full distribution shape
sns.violinplot(data=high_regions, y='region', x='total_deaths_pm');


In [None]:
# swarmplot — shows every individual country
sns.swarmplot(data=high_regions, y='region', x='total_deaths_pm');


In [None]:
# pointplot — mean ± CI comparison
sns.pointplot(data=high_regions, x='region', y='total_deaths_pm');


---
### 🔬 Explore 2 — Distribution Plots

1. Make a **violinplot** comparing `total_cases_pm` across the 3 high-covid regions.  
   Which region has the widest spread?

2. Make a **swarmplot** of `total_deaths_pm` by region for the same 3 regions.  
   Can you identify any outlier countries? (Hint: look at the extremes.)

3. **Layer** a swarmplot on top of a boxplot (same axes) for `total_deaths_pm` by region.  
   Pass `ax=ax` to both plots.

4. **Bonus**: Make a `pointplot` of mean `gdp_per_capita` per region (all regions, not just 3).  
   Rotate x-axis labels. Which region has the highest average GDP?


In [None]:
# 1. violinplot — total_cases_pm by high-covid region


In [None]:
# 2. swarmplot — total_deaths_pm by region


In [None]:
# 3. Layer swarmplot on boxplot


In [None]:
# 4. Bonus: pointplot of GDP by all regions


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Layering swarmplot on boxplot is a powerful combo — you see both the summary stats AND the individual data points.*

```python
# 1. Violinplot
sns.violinplot(data=high_regions, y='region', x='total_cases_pm')
plt.title('Cases per Million by Region')
plt.show()

# 2. Swarmplot
sns.swarmplot(data=high_regions, y='region', x='total_deaths_pm')
plt.title('Every Country: Deaths per Million')
plt.show()

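# 3. Layered boxplot + swarm
# Note: on seaborn >= 0.13, palette without hue is deprecated;
# there, add hue='region', legend=False to the boxplot call.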
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=high_regions, y='region', x='total_deaths_pm', ax=ax, palette='pastel')
sns.swarmplot(data=high_regions, y='region', x='total_deaths_pm', ax=ax, color='black', size=4)
plt.title('Deaths per Million — Boxplot + Individual Points')
plt.tight_layout()
plt.show()

# 4. GDP by all regions
plt.figure(figsize=(8, 4))
sns.pointplot(data=covid, x='region', y='gdp_per_capita')
plt.xticks(rotation=45, ha='right')
plt.title('Mean GDP per Capita by Region')
plt.tight_layout()
plt.show()
```

</details>

---
## 4. KDE & Faceted Distributions

### 📋 Board Reference

| Function | When to use |
|----------|-------------|
| `sns.histplot(..., hue='cat')` | Overlapping histograms by group |
| `sns.kdeplot(..., hue='cat', fill=True)` | Smooth overlapping densities — cleaner than histplot for many groups |
| `sns.FacetGrid(df, col='cat').map(sns.kdeplot, 'x')` | Separate panel per group — avoids overlap |

**Rule of thumb:** More than 3 groups → use FacetGrid instead of overlapping KDE.
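
As a sketch of what a KDE curve is, the block below builds one by hand: a Gaussian bump on each observation, averaged. The bandwidth uses Scott's rule, which is also seaborn's default `bw_method`. The helper `gaussian_kde_1d` and the synthetic two-cluster sample are assumptions for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian KDE of `samples` at each point in `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    # Scott's rule: bandwidth = std * n^(-1/5)
    bw = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per observation, evaluated on the grid
    z = (grid[:, None] - samples[None, :]) / bw
    bumps = np.exp(-0.5 * z ** 2) / (bw * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(2)
# Synthetic bimodal sample: two clusters (illustrative only)
samples = np.concatenate([rng.normal(100, 20, 150), rng.normal(400, 40, 150)])

grid = np.linspace(-200, 800, 2001)
density = gaussian_kde_1d(samples, grid)

# A density should integrate to about 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(f"area under curve = {area:.3f}")
```

Because the curve is smoothed, it can extend past the observed data, including below zero for a variable that cannot be negative.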


In [None]:
# histplot with hue — gets messy with many groups
sns.histplot(high_regions, x='total_deaths_pm', hue='region');


In [None]:
# kdeplot: smoother, easier to compare shapes
sns.kdeplot(data=high_regions, x='total_deaths_pm', hue='region', fill=True);


In [None]:
# FacetGrid: one panel per group (stays readable as the number of groups grows)
g = sns.FacetGrid(high_regions, col='region', hue='region')
g.map(sns.kdeplot, 'total_deaths_pm', fill=True)
plt.tight_layout();


---
### 🔬 Explore 3 — KDE & Facets

1. Make a **kdeplot** of `total_cases_pm` for the 3 high-covid regions.  
   Is South America's distribution bimodal (two humps)?

2. Make a **FacetGrid** KDE of `gdp_per_capita` faceted by region.  
   Pick 4 regions of your choice. Add `sharey=False` to `FacetGrid` — why is this helpful?

3. Make a **histplot** + KDE (`kde=True`) of `median_age` for all countries.  
   What does the shape tell you about the world's countries?

4. **Bonus**: Use `sns.FacetGrid` with `row='region'` instead of `col` for the 3-region deaths plot.  
   Which layout is easier to read?


In [None]:
# 1. kdeplot cases_pm for 3 regions


In [None]:
# 2. FacetGrid GDP by 4 regions of your choice


In [None]:
# 3. histplot + KDE of median_age


In [None]:
# 4. Bonus: row vs col FacetGrid


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*sharey=False lets each panel have its own y-axis scale — crucial when groups have very different densities.*

```python
# 1. KDE of cases per million
sns.kdeplot(data=high_regions, x='total_cases_pm', hue='region', fill=True)
plt.title('Distribution of Cases per Million')
plt.show()

# 2. FacetGrid GDP — 4 regions
# Names must match covid['region'].unique() exactly; isin() silently skips any that don't
four = covid[covid.region.isin(['Western Europe', 'North America', 'Sub-Saharan Africa', 'East Asia'])]
g = sns.FacetGrid(four, col='region', hue='region', sharey=False)
g.map(sns.kdeplot, 'gdp_per_capita', fill=True)
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

# 3. median_age histogram + KDE
sns.histplot(data=covid, x='median_age', bins=20, kde=True)
plt.title('World Distribution of Median Age')
plt.show()

# 4. Bonus: row layout
g = sns.FacetGrid(high_regions, row='region', hue='region')
g.map(sns.kdeplot, 'total_deaths_pm', fill=True)
plt.tight_layout()
plt.show()
```

</details>

---
## 5. Open Exploration — Your Investigation

Pick a question about the Covid dataset and investigate it visually.  
Use at least **two different plot types** from today.

Some ideas:
- Does population density predict cases per million? Is the relationship different by region?
- Are there countries that are outliers — high GDP but also high deaths?
- Compare any two regions of your choice on multiple dimensions.
- Which region has the most internal variation in `total_deaths_pm`?
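
For questions about variation within groups, a numeric summary can back up what the plots suggest. A minimal sketch with `groupby` on toy data follows. The region labels and numbers are invented, not results from the Covid dataset, and `iqr` is a small helper defined here.

```python
import pandas as pd

# Toy data: invented values, for illustration only
toy = pd.DataFrame({
    'region': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'total_deaths_pm': [10, 20, 30, 40, 5, 100, 300, 700],
})

def iqr(s):
    """Interquartile range: 75th minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Standard deviation and IQR per region, largest spread first
spread = (
    toy.groupby('region')['total_deaths_pm']
       .agg(['std', iqr])
       .sort_values('std', ascending=False)
)
print(spread)
```

The IQR is less sensitive to a single extreme country than the standard deviation, so it is worth checking whether the two rankings agree.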


In [None]:
# Your investigation — plot 1


In [None]:
# Your investigation — plot 2


**What did you find?** *(double-click to edit)*

*Your story here*

---
## Appendix — Advanced Seaborn Quick Reference

```python
# Regression
sns.regplot(data=df, x='num', y='num')
sns.jointplot(data=df, x='num', y='num', hue='cat')
sns.jointplot(data=df, x='num', y='num', kind='kde')

# Distribution comparisons
sns.violinplot(data=df, x='cat', y='num')
sns.swarmplot(data=df, x='cat', y='num')
sns.pointplot(data=df, x='cat', y='num')

# Layer: boxplot + swarm
# (seaborn >= 0.13: pair palette with hue='cat', legend=False)
fig, ax = plt.subplots()
sns.boxplot(data=df, x='cat', y='num', ax=ax, palette='pastel')
sns.swarmplot(data=df, x='cat', y='num', ax=ax, color='black', size=3)

# KDE and facets
sns.kdeplot(data=df, x='num', hue='cat', fill=True)
sns.histplot(data=df, x='num', hue='cat', kde=True)

g = sns.FacetGrid(df, col='cat', hue='cat', sharey=False)
g.map(sns.kdeplot, 'num', fill=True)
g.add_legend()
plt.tight_layout()
```
