# Advanced Plotting in Seaborn and How to Use It in EDA
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec23_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import covid data
covid = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Python-Data-Cleaning-Cookbook/master/Chapter05/data/covidtotals.csv')

### Initial inspection

In [None]:
covid.head()

In [None]:
covid.shape

In [None]:
covid.info()

`covid` is a DataFrame with 209 rows. Each row represents a location (country). It contains the following columns:

- iso_code - a unique three-letter identifier for each country
- lastdate - the date for when the report was last updated
- location - full name of the country
- total_cases - the cumulative number of confirmed COVID-19 cases in that location
- total_deaths - the cumulative number of confirmed COVID-19 deaths in that location
- total_cases_pm - the cumulative number of confirmed COVID-19 cases per million in that location
- total_deaths_pm - the cumulative number of confirmed COVID-19 deaths per million in that location
- population - population size
- pop_density - number of people per square kilometer
- median_age - median age of the population 
- gdp_per_capita - GDP per capita (economic indicator of wealth per person)
- hosp_beds - number of hospital beds per 1,000 people (an indicator of healthcare capacity)
- region - the broader region the location belongs to
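The per-million columns make locations of very different sizes comparable. Here is a minimal sketch of that scaling, assuming the usual definition (count divided by population, times one million). The `toy` DataFrame and its numbers are made up for illustration and are not taken from `covid`.

```python
import pandas as pd

# Made-up numbers for illustration only (not from the covid dataset)
toy = pd.DataFrame({
    'location': ['A', 'B'],
    'total_cases': [500, 2000],
    'population': [1_000_000, 10_000_000],
})

# Assumed definition: cases per million = cases * 1,000,000 / population
toy['total_cases_pm'] = toy['total_cases'] * 1_000_000 / toy['population']
print(toy)
```

Location B has more cases in total, but fewer cases per million once population is taken into account.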

### Preprocessing

In [None]:
# Check for na values
covid.isna()

In [None]:
# Summarize number of na's by column
nas = covid.isna().sum()
nas

In [None]:
# preview what dropping columns with na values would look like
# (not in place, so covid itself is unchanged)
covid.dropna(axis=1)

In [None]:
covid.shape

In [None]:
# drop rows with na values
covid.dropna(axis=0,inplace=True)

In [None]:
covid.shape
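The two `dropna` calls behave differently: without `inplace=True` the method returns a new DataFrame and leaves the original alone, while `inplace=True` modifies the original. A small sketch on a made-up DataFrame (the name `toy` and its values are hypothetical) shows the difference in shapes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'a' has one missing value
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Not in place: returns a new DataFrame without column 'a'
dropped_cols = toy.dropna(axis=1)
print(dropped_cols.shape)  # one column left in the returned copy
print(toy.shape)           # toy still has both columns

# In place: removes the row containing the missing value from toy itself
toy.dropna(axis=0, inplace=True)
print(toy.shape)
```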

In [None]:
# check if there are any duplicate rows
covid.duplicated().sum()

In [None]:
# check the data types to make sure they are appropriate
covid.dtypes

In [None]:
covid['lastdate'] = pd.to_datetime(covid['lastdate'])
covid.dtypes

### Summary statistics and Exploration

In [None]:
# begin exploring the dataset
covid_desc = covid.describe()
covid_desc

In [None]:
# What is the average death rate by region?
covid.groupby('region')['total_deaths_pm'].mean()

In [None]:
# Which country has the highest death count?
covid.loc[covid.total_deaths.idxmax(),'location']

In [None]:
# Which country has the highest death rate (deaths per million)?
covid.loc[covid.total_deaths_pm.idxmax(),'location']

### Visualization

In [None]:
# heatmaps to look at correlations
corrmatc = covid.corr(numeric_only = True) # make correlation matrix
sns.heatmap(data = corrmatc, annot = True);

Median age and GDP per capita are both correlated with deaths per million, which is interesting and worth a closer look.
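Reading one row of a heatmap by eye can be error-prone. As a rough sketch, you could also pull a single column out of the correlation matrix and sort it by absolute value. The `toy` data below is made up so the example runs on its own. With the real data you would start from `corrmatc` instead.

```python
import pandas as pd

# Made-up values for illustration only (not from the covid dataset)
toy = pd.DataFrame({
    'median_age': [20, 30, 40, 50],
    'gdp_per_capita': [1, 3, 2, 5],
    'total_deaths_pm': [10, 20, 30, 40],
})

corr = toy.corr(numeric_only=True)

# Correlations with deaths per million, strongest first (ignoring sign)
ranked = (corr['total_deaths_pm']
          .drop('total_deaths_pm')          # drop the self-correlation of 1
          .sort_values(key=abs, ascending=False))
print(ranked)
```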

In [None]:
# Let's look at the relationship with gdp
sns.scatterplot(data = covid, x = 'gdp_per_capita',y = 'total_deaths_pm');

In [None]:
# Where would the regression line be?
sns.regplot(data = covid, x = 'gdp_per_capita',y = 'total_deaths_pm');

In [None]:
# That's a little hard to see
sns.jointplot(data = covid, x = 'gdp_per_capita',y = 'total_deaths_pm');

In [None]:
# boxplots to look at deaths by region
plt.figure(figsize=(4,6))
sns.boxplot(data = covid, y = 'region', x = 'total_deaths_pm')
plt.show()

There are a few regions we might want to zoom in on...
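The regions in the next cell were picked by looking at the boxplot. If you wanted to choose them programmatically, one possible approach (an assumption here, not necessarily how the list below was chosen) is to rank regions by their median deaths per million. The `toy` data is made up so the sketch is self-contained.

```python
import pandas as pd

# Made-up values for illustration only (not from the covid dataset)
toy = pd.DataFrame({
    'region': ['R1', 'R1', 'R2', 'R2', 'R3', 'R3'],
    'total_deaths_pm': [5, 15, 100, 300, 40, 60],
})

# Median deaths per million in each region, then keep the two highest
medians = toy.groupby('region')['total_deaths_pm'].median()
top_regions = medians.nlargest(2).index.tolist()

# Same filtering pattern as in the lecture: keep rows whose region is in the list
high_toy = toy.loc[toy.region.isin(top_regions)]
print(top_regions)
print(high_toy)
```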

In [None]:
high_covid_regions = covid.loc[covid.region.isin(['South America','Western Europe','North America'])]
high_covid_regions.head()

In [None]:
# Let's redo the boxplot with these points
sns.boxplot(data = high_covid_regions, y = 'region',x='total_deaths_pm');

In [None]:
# To get more information about the distributions
sns.violinplot(data = high_covid_regions, y = 'region',x='total_deaths_pm');

In [None]:
# Show all the points in the distributions with a swarm plot
sns.swarmplot(data = high_covid_regions, y = 'region',x='total_deaths_pm');

In [None]:
# Notice a swarm plot is different from a scatter plot
sns.scatterplot(data = high_covid_regions, y = 'region',x='total_deaths_pm');

In [None]:
# Another way to compare categories (best if this is an ordinal variable)
sns.pointplot(data = high_covid_regions, x = 'region',y='total_deaths_pm');

In [None]:
# Let's look at that gdp relationship again
sns.jointplot(data = high_covid_regions, x = 'gdp_per_capita',y = 'total_deaths_pm', hue = 'region');
plt.xlim([0,120000]);
plt.ylim([-20,1000]);

In South America, deaths per million has an almost bimodal distribution.

In [None]:
# Focus on these distributions more (messy!)
sns.histplot(high_covid_regions, x = 'total_deaths_pm', hue = 'region');

In [None]:
# Better
sns.kdeplot(high_covid_regions, x = 'total_deaths_pm', hue = 'region', fill=True);

In [None]:
# Even better (but watch out for scales)
g = sns.FacetGrid(high_covid_regions, col="region", hue = 'region')
g.map(sns.kdeplot, "total_deaths_pm", fill = True)
plt.tight_layout();

In [None]:
# displot makes faceting easier
sns.displot(data = high_covid_regions, x = 'total_deaths_pm', hue = 'region', kind='kde', fill=True, col = 'region');

In [None]:
# relplot makes faceting easier when looking at relationships
sns.relplot(data = high_covid_regions, 
            x = 'total_deaths_pm', 
            y = 'total_cases_pm', 
            col = 'region', 
            kind='scatter');

In [None]:
na = covid.loc[covid.region == 'North America', :]
na

In [None]:
# A teaser for interactive plots
import plotly.express as px

fig = px.scatter(covid, x = 'gdp_per_capita', y = 'total_deaths_pm', hover_name='location')
# Set figure size
fig.update_layout(width=600, height=600)

## Group Activity: Using plotting methods in an EDA: Titanic Data

In [None]:
# set defined plot sizes and styles
plt.rcParams['figure.figsize'] = [6,3] # figures will be 6 inches wide and 3 inches tall
plt.rcParams['figure.dpi'] = 80 # default is 72 in webpages; use 80 to see figures at a higher resolution

In [None]:
# titanic data set is part of Seaborn
titanic = sns.load_dataset('titanic')

In [None]:
# look at the first few lines
titanic.head()

In [None]:
# describe only summarizes the numerical columns; for example, sex, class, embark_town, etc. are not included
titanic.describe()
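If you also want a summary of the non-numeric columns, `describe` accepts an `include` argument. A small sketch on a made-up DataFrame (the name `toy` and its values are hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    'age': [22, 38, 26],
    'sex': ['male', 'female', 'female'],
})

numeric_summary = toy.describe()           # default: numeric columns only
all_summary = toy.describe(include='all')  # every column; adds unique, top, freq rows

print(numeric_summary.columns.tolist())
print(all_summary)
```

For the text column, `top` is the most frequent value and `freq` is how often it appears. The numeric statistics are `NaN` for that column.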

In [None]:
titanic.info()

### Task
Use a heatmap to visualize the null values in the dataset.

In [None]:
plt.style.use('ggplot')

### Discussion Question
What do you notice about the heatmap? What information does it show? What information does it not show?

### Task
Use a heatmap to visualize the correlations between the numeric columns in the data.

### Discussion Question
Looking at the `survived` column, what do you notice? Is there anything else you notice about the heatmap overall?

### Task
Create subplots to show the count of each category for the variables `survived`, `pclass`, `sex`, `sibsp`, `parch`, `embark_town`, and `alone`. 

### Discussion Question
What observations can you make from the plots you created?