# EDA and Intro to Seaborn
v.ekc-c

Matplotlib gives us control; plotnine gives us grammar. **Seaborn** gives us *both* — a high-level API built on Matplotlib that produces publication-quality statistical graphics with very little code. We will also practice a full **Exploratory Data Analysis (EDA)** workflow.

**Sections:**
1. Setup
2. Seaborn Basics — scatter, bar, histogram
3. Seaborn for EDA — pairplot, boxplot, heatmap
4. Full EDA Walkthrough — Covid Data
5. Activity

---
## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
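import warnings
# Silence library warnings to keep the notebook output readable
# (note: this hides all warnings, including potentially useful ones)
warnings.filterwarnings('ignore')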

---
## 2. Seaborn Basics

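Most Seaborn plotting functions follow a consistent interface: `sns.plottype(data=df, x='col', y='col', ...)`. Each one returns the Matplotlib `Axes` it drew on. `sns.FacetGrid` is the exception: it is a class that builds a grid of subplots, and you then map a plotting function onto that grid.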

| Function | Plot type | Key arguments |
|---|---|---|
| `sns.scatterplot()` | Scatter | `hue`, `style`, `palette` |
| `sns.lineplot()` | Line | `hue` |
| `sns.countplot()` | Bar (counts) | `x` or `y` |
| `sns.barplot()` | Bar (aggregated) | `estimator`, `errorbar` |
| `sns.histplot()` | Histogram | `bins`, `kde` |
| `sns.FacetGrid()` | Facet grid | `col`, `row`, `hue` |

In [None]:
# Whenever we want to use seaborn for visualization
import seaborn as sns
sns.set_style("darkgrid")

In [None]:
iris = sns.load_dataset('iris')
iris

In [None]:
# To make a scatter plot
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width'); 

In [None]:
# Customizing your scatter plot: change the color of all the points to another color
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', color = 'r'); 

In [None]:
# Customizing your scatter plot: change the color of points to map to another variable (like an aesthetic map)
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species');

In [None]:
# Customizing your scatter plot: change the shape of points to map to another variable (like an aesthetic map)
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', style = 'species'); 

In [None]:
# Adjusting axis limits: scatterplot returns a Matplotlib Axes, so we can call set_xlim on it
ax = sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species')
ax.set_xlim(0, 8);

In [None]:
# adjusting color scales
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', 
                hue = 'species', palette = 'colorblind');

### ✏️ Check-in 1 — Seaborn Scatter

Using the `iris` dataset:

1. Make a scatter plot of `sepal_length` (x) vs `sepal_width` (y), mapping `species` to **both** `hue` and `style`.
2. Use the `'colorblind'` palette.
3. Limit the x-axis to `(4, 8)` by calling `.set_xlim(4, 8)` on the returned axes object.

In [None]:
# 1 & 2. Scatter with hue, style, and colorblind palette


In [None]:
# 3. Add x-axis limit


#### Hint

`sns.scatterplot()` returns a Matplotlib **Axes** object. Assign it to a variable (e.g. `ax = sns.scatterplot(...)`) and then call `ax.set_xlim(4, 8)`.

#### Answer

In [None]:
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width',
                hue='species', style='species', palette='colorblind');

In [None]:
ax = sns.scatterplot(data=iris, x='sepal_length', y='sepal_width',
                     hue='species', style='species', palette='colorblind')
ax.set_xlim(4, 8);

In [None]:
# Faceting
# Create a facet grid: one panel per species, colored by species
g = sns.FacetGrid(iris, col="species", hue = 'species')
# Map a scatter plot onto each panel of the grid
g.map(sns.scatterplot, "petal_length", "petal_width");

In [None]:
# Line plots (NOTE: for demonstration only; iris has no natural ordering such as time, so a line plot is not a good fit)
sns.lineplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species'); 

In [None]:
# barplots to just look at counts
sns.countplot(data = iris, x = 'species'); 

In [None]:
# barplots to look at averages of a numeric variable vs a categorical variable
sns.barplot(data = iris, x = 'species', y='petal_length'); 

In [None]:
# To remove error bars
sns.barplot(data = iris, x = 'species', y='petal_length', errorbar= None); 

In [None]:
# Use a different aggregate function (here the sum instead of the default mean)
sns.barplot(data = iris, x = 'species', y='petal_length', 
            errorbar= None, estimator=np.sum); 

In [None]:
# Histograms
sns.histplot(data = iris, x = 'petal_length');

### ✏️ Check-in 2 — Bar & Histogram

Using the `iris` dataset:

1. Make a **count plot** showing how many rows there are per `species`.
2. Make a **bar plot** showing the **median** `sepal_length` per `species` (no error bars). Use `estimator=np.median`.

In [None]:
# 1. Count plot by species


In [None]:
# 2. Bar plot of median sepal_length per species


#### Answer

In [None]:
sns.countplot(data=iris, x='species');

In [None]:
sns.barplot(data=iris, x='species', y='sepal_length',
            estimator=np.median, errorbar=None);

---
## 3. Seaborn for EDA

These three plot types are especially powerful for quickly exploring a new dataset:

| Plot | Function | Best for |
|---|---|---|
| Pairplot | `sns.pairplot(df)` | Pairwise relationships between all numeric columns |
| Boxplot | `sns.boxplot(data=df, x='cat', y='num')` | Distributions across categories |
| Heatmap | `sns.heatmap(corr_matrix, annot=True)` | Correlation structure |

In [None]:
# Pairplots show the relationship between numeric variables
sns.pairplot(data = iris);

In [None]:
# Can map other variables to aesthetic properties
sns.pairplot(data = iris, hue='species');

In [None]:
# With no x or y, boxplot treats the data as wide-form and draws one box per numeric column
sns.boxplot(data = iris);

In [None]:
# boxplots can also be used with a categorical x and a numerical y
sns.boxplot(data = iris, x = 'species',y='sepal_width');

In [None]:
# heatmaps can be used to study correlations
corrmat = iris.corr(numeric_only = True) # make correlation matrix
sns.heatmap(data = corrmat, annot = True);

### ✏️ Check-in 3 — EDA Plots

Using the `iris` dataset:

1. Create a **pairplot** with `species` mapped to `hue`.
2. Make a **boxplot** of `petal_length` by `species`.
3. Build a **correlation heatmap** with annotations. Which two numeric features are most strongly correlated?

In [None]:
# 1. Pairplot with hue=species


In [None]:
# 2. Boxplot: petal_length by species


In [None]:
# 3. Correlation heatmap


#### Answer

In [None]:
sns.pairplot(data=iris, hue='species');

In [None]:
sns.boxplot(data=iris, x='species', y='petal_length');

In [None]:
corrmat = iris.corr(numeric_only=True)
sns.heatmap(corrmat, annot=True);
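
Reading the strongest pair off a heatmap by eye works for a few columns, but it can also be done in code. The sketch below uses randomly generated data (the `toy` frame and its columns `a`, `b`, and `c` are hypothetical) and keeps only the upper triangle of the correlation matrix so that each pair appears once and the diagonal of 1s is ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is almost a linear function of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr(numeric_only=True)

# True strictly above the diagonal, False elsewhere
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep each pair once, as a Series indexed by (row name, column name)
pairs = corr.where(upper).stack().dropna()

# Use the absolute value so strong negative correlations also count
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, round(pairs[strongest_pair], 3))
```

The same steps should work on any correlation matrix produced by `DataFrame.corr`, including the `corrmat` computed above.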

---
## 4. Full EDA Walkthrough — Covid Data

Let's practice a complete EDA workflow on a real dataset. The steps are:

1. **Load** the data and preview it
2. **Describe** with `.describe()` and `.info()`
3. **Check** for missing values and duplicates
4. **Clean** as needed
5. **Visualize** to find patterns and outliers
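
The first three steps can be sketched on a tiny made-up table before applying them to real data. Everything in the `toy` frame below (column names and values) is hypothetical and chosen so that it contains both missing values and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values and one exact duplicate row
toy = pd.DataFrame({
    "country": ["A", "B", "C", "C", "D"],
    "cases": [100, 250, 80, 80, 4000],
    "beds": [2.1, np.nan, 3.4, 3.4, np.nan],
})

# 1. Load and preview
preview = toy.head()

# 2. Describe: summary statistics, then column types and non-null counts
summary = toy.describe()
toy.info()

# 3. Check: missing values per column and number of duplicated rows
na_counts = toy.isna().sum()
n_duplicates = toy.duplicated().sum()
print(na_counts)
print("duplicate rows:", n_duplicates)

# 4. Clean (one option): drop exact duplicate rows
cleaned = toy.drop_duplicates().reset_index(drop=True)
```

How to clean depends on the question being asked; dropping duplicates is only one possible choice here.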

In [None]:
# import covid data
covid = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Python-Data-Cleaning-Cookbook/master/Chapter05/data/covidtotals.csv')
covid.head()

In [None]:
# begin exploring the dataset
covid_desc = covid.describe()
covid_desc

In [None]:
covid.info()

In [None]:
# Check for na values
covid.isna()

In [None]:
# Summarize number of na's by column
nas = covid.isna().sum()
nas

In [None]:
# drop every column that contains at least one NA value (axis=1 means columns)
covid = covid.dropna(axis=1)
covid
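
Dropping whole columns is one of several options that `dropna` offers. As a sketch on made-up data (the `toy` frame and its columns are hypothetical), the cell below compares dropping columns, dropping rows, and keeping only columns that are mostly complete via `thresh`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "beds" has one missing value, "gdp" has three
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "cases": [100, 250, 80, 4000],
    "beds": [2.1, np.nan, 3.4, 1.0],
    "gdp": [np.nan, np.nan, np.nan, 9000.0],
})

# Drop any column containing a missing value (keeps all rows)
drop_cols = toy.dropna(axis=1)

# Drop any row containing a missing value (keeps all columns)
drop_rows = toy.dropna(axis=0)

# Keep only columns with at least 3 non-missing values
mostly_full = toy.dropna(axis=1, thresh=3)

print(list(drop_cols.columns))
print(len(drop_rows))
print(list(mostly_full.columns))
```

Which option fits depends on the data: dropping columns keeps every row but loses variables, while dropping rows keeps every variable but can discard most of the observations.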

In [None]:
# check if there are any duplicate rows
covid.duplicated().sum()

In [None]:
# begin visualizing for insights
sns.pairplot(data = covid);

In [None]:
# boxplots to look at distributions
plt.figure(figsize=(12,4))
sns.boxplot(data = covid, x = 'region', y = 'total_deaths_pm')
plt.xticks(rotation=45)
plt.show()
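
The individual points drawn beyond a boxplot's whiskers follow a simple rule: with the default `whis=1.5`, a value is flagged when it lies more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule on made-up numbers (the `values` series is hypothetical):

```python
import pandas as pd

# Hypothetical toy values with one obviously extreme observation
values = pd.Series([3, 4, 5, 5, 6, 7, 8, 40], name="deaths_pm")

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot draws as separate points
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

Applying the same steps within each group (for example with `groupby`) gives the rows behind the points seen in a grouped boxplot, which is often the next thing worth inspecting.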

In [None]:
# heatmaps to look at correlations
corrmatc = covid.corr(numeric_only = True) # make correlation matrix
sns.heatmap(data = corrmatc, annot = True);

---
## 5. Activity

1. What did the initial visualizations above reveal that was surprising or interesting? Create an additional Seaborn visualization to explore this.