<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/02a-DataViz.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 2a -- DataViz

## Goals

* Tell a story with data
* Basic visualizations with seaborn
* Gentle introduction to pandas

## Primary reference

* [seaborn distributions tutorial](https://seaborn.pydata.org/tutorial/distributions.html)

# Anscombe's quartet

* Introduction to [seaborn](https://seaborn.pydata.org/) for statistical visualization
* Ref: [Anscombe's quartet](https://seaborn.pydata.org/examples/anscombes_quartet.html) -- seaborn.pydata.org

In [None]:
# Standard seaborn import (only needs to be done once)
import seaborn as sns

# Load the example dataset for Anscombe's quartet (a pandas dataframe)
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1});

## Irises, penguins and tips

seaborn has quite a few built-in datasets -- the classics

* [seaborn data](https://github.com/mwaskom/seaborn-data) -- github.com
* [Fisher's Irises](https://archive.ics.uci.edu/ml/datasets/iris) -- uci.edu
* [Palmer penguins dataset](https://github.com/allisonhorst/palmerpenguins) -- github.com

### Proper attribution is critically important!

* Always cite an authoritative source.
* Make sure you have permission to use the data.
* Know your data source!

[Palmer penguins dataset](https://github.com/allisonhorst/palmerpenguins) is a good example of a modern, nicely curated dataset -- an exemplar!

In [None]:
# load the "penguins" dataset from seaborn
penguins = sns.load_dataset("penguins")

# inspect the dataset (note: there are some NaNs)
penguins

# Histogram

* One way to look at the data: "bin" it and plot it
* Create some "bins" and count the data falling in each bin
* Then use a bar chart to compare the bin counts
* Refs: 
  * [Visualizing distributions of data](https://seaborn.pydata.org/tutorial/distributions.html)
  * [`seaborn.displot`](https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot) API reference docs
* Note: the closely related [`seaborn.distplot`](https://seaborn.pydata.org/generated/seaborn.distplot.html) still appears in a lot of older code, but it is deprecated
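
To make the "bin and count" idea concrete, here is a minimal sketch that uses only the Python standard library. The flipper lengths, the bin width and the starting edge are all made-up values for illustration; they are not taken from the penguins dataset, and seaborn chooses its own bin edges.

```python
from collections import Counter

# Hypothetical flipper lengths in mm (invented for illustration)
lengths = [181, 186, 195, 193, 190, 181, 195, 210, 215, 222]

binwidth = 10  # width of each bin in mm (an assumption for this example)
start = 180    # left edge of the first bin (an assumption for this example)

# Map each value to the left edge of the bin it falls in, then count per bin
counts = Counter(start + binwidth * ((x - start) // binwidth) for x in lengths)

# Text version of the bar chart: one row per bin, one mark per observation
for left in range(start, max(lengths) + 1, binwidth):
    print(f"[{left}, {left + binwidth}): {'#' * counts[left]}")
```

Changing `binwidth` in this sketch has the same effect as the `binwidth` argument of `sns.displot`: narrower bins show more detail but make the counts noisier.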



In [None]:
# It couldn't be easier than with seaborn
sns.displot(penguins, x="flipper_length_mm");


In [None]:
# Setting the bin width controls the resolution (narrower bins show more detail)
sns.displot(penguins, x="flipper_length_mm", binwidth=3);

# KDE

Kernel density estimation (KDE) is a method for estimating the probability distribution (a continuous function) from data.

* Ref: [05.13 Kernel Density Estimation](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.13-Kernel-Density-Estimation.ipynb) by VanderPlas -- github.com
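
As a rough sketch of the idea, a Gaussian KDE places one small bell curve on each observation and averages them. The data, the bandwidth of 5 mm and the helper name `make_kde` below are assumptions made for illustration. seaborn picks a bandwidth automatically, so its curves will not match this sketch exactly.

```python
import math

def make_kde(data, bandwidth):
    """Return a function that estimates the probability density at x."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Add up one Gaussian bump per observation, each centered on that observation
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

# Hypothetical flipper lengths in mm (invented for illustration)
lengths = [181, 186, 195, 193, 190, 181, 195, 210, 215, 222]
density = make_kde(lengths, bandwidth=5.0)  # bandwidth chosen by hand

# The estimate is a continuous function, so it can be evaluated anywhere
for x in (170, 190, 205, 220):
    print(f"density({x}) = {density(x):.4f}")

# Sanity check: the area under the curve should be close to 1
step = 0.5
grid = [150 + i * step for i in range(201)]  # 150 mm to 250 mm
area = sum(density(x) * step for x in grid)
print(f"area = {area:.3f}")
```

A larger bandwidth gives a smoother curve that can hide real structure, while a smaller one gives a bumpier curve that follows individual points.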

In [None]:
# KDE is built into seaborn.displot()
sns.displot(penguins, x="flipper_length_mm", kind="kde");

# Conditioning on other variables

Do features of a distribution differ across other variables in the dataset? 

For example: can you account for the bimodal distribution of flipper lengths?

Seaborn makes it very easy to plot multivariable graphics with dataframes using the `hue` parameter.

Look at the dataset in the dataframe to see how it works.

* Ref: [Conditioning on other variables](https://seaborn.pydata.org/tutorial/distributions.html#conditioning-on-other-variables) -- seaborn.pydata.org
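
The `hue` parameter splits the rows by the values of one column and draws each group separately. The same split can be sketched numerically with a pandas `groupby`. The tiny dataframe below is a hypothetical stand-in for the penguins data, with invented values.

```python
import pandas as pd

# Hypothetical miniature of the penguins dataframe (values invented for illustration)
mini = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 215.0, 222.0],
})

# Split the rows by species, then summarize flipper length within each group
summary = mini.groupby("species")["flipper_length_mm"].agg(["count", "mean", "min", "max"])
print(summary)
```

If the group summaries differ a lot, as they do in this made-up example, the combined distribution can show more than one peak.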

In [None]:
# Superpose histograms for each species by specifying the "hue" keyword
sns.displot(penguins, x="flipper_length_mm", hue="species");

In [None]:
# Changing the aesthetics of the visualization can make things easier to "see"
sns.displot(penguins, x="flipper_length_mm", hue="species", element="step");

In [None]:
# Or you can use a "stacked" bar chart with the "multiple" keyword argument
sns.displot(penguins, x="flipper_length_mm", hue="species", multiple="stack");

In [None]:
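# Plot densities instead of counts, then save the figure to a file
import matplotlib.pyplot as plt
from google.colab import files  # only available when running in Google Colab

sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density");

plt.savefig("flippers.png")     # write the current figure to disk
files.download("flippers.png")  # download the file from Colab to your computer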

In [None]:
# Draw one KDE per species instead of one histogram per species
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde");

In [None]:
# You can stack the KDEs, as was done with the histograms using stacked bar charts
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", multiple="stack");

In [None]:
# And you can enhance the aesthetics to help with visualizing the important differences
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", fill=True);

# Box plots

Also called box and whisker plots.

The plot shows the three quartile values of the distribution, along with the extremes. 

The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile.

Data that lie outside this range are displayed independently. 

The [interquartile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) measures statistical dispersion.  

IQR is the difference between 75th and 25th percentiles, or between upper and lower quartiles.

[This figure from the wikipedia article](https://en.wikipedia.org/wiki/Interquartile_range#/media/File:Boxplot_vs_PDF.svg) compares a Boxplot to a PDF (Probability Density Function).
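
The quantities in a box plot can be sketched by hand with the standard library. The values below are hypothetical. The `method="inclusive"` setting is one of several common quartile conventions, so other tools may give slightly different quartiles for small samples.

```python
import statistics

# Hypothetical measurements (invented for illustration)
values = [10, 12, 13, 15, 16, 18, 20, 21, 24, 48]

# The three quartiles: 25th, 50th (median) and 75th percentiles
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers may reach at most 1.5 IQRs beyond the lower and upper quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker stops at the most extreme data point inside the limits
inside = [v for v in values if lower_limit <= v <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the limits is drawn as an individual point
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```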


In [None]:
# One box plot of flipper length for each species
sns.catplot(x="species", y="flipper_length_mm", kind="box", data=penguins);

## Tips dataset

This dataset has some characteristics that become apparent in the visualizations.

But the optimal visualization depends on the story and the data.

In [None]:
# load the "tips" dataset
tips = sns.load_dataset("tips")
tips

In [None]:
# Q: What are the main "takeaways" from the visualization?
sns.catplot(x="day", y="total_bill", kind="box", data=tips);

* Skewed distributions
  * [Skewness](https://en.wikipedia.org/wiki/Skewness) -- wikipedia
  * Outliers all lie on the high side, and there are more of them than we saw with penguins
  * No values less than zero (as with penguin bills)
* More spending on the weekends
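
Skewness can also be checked numerically. The sketch below uses hypothetical bill amounts (not the tips dataset) and the moment-based definition of sample skewness, which is one of several definitions in use. A positive value indicates a long tail on the high side.

```python
import statistics

# Hypothetical total bills in dollars (invented for illustration)
bills = [10, 12, 13, 15, 16, 18, 20, 21, 24, 48]

mean = statistics.mean(bills)
median = statistics.median(bills)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
n = len(bills)
m2 = sum((x - mean) ** 2 for x in bills) / n
m3 = sum((x - mean) ** 3 for x in bills) / n
skewness = m3 / m2 ** 1.5

print(f"mean={mean}, median={median}, skewness={skewness:.2f}")
```

In this example the single large bill pulls the mean above the median, a typical sign of right skew.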

In [None]:
# Q1: Are the main takeaways as obvious here?
sns.displot(tips, x="total_bill", hue="day", kind="kde", fill=True);

In [None]:
# Q2: Are the main takeaways as obvious here?
sns.displot(tips, x="total_bill", hue="day", kind="hist", stat="density");

In [None]:
# Q3: Are the main takeaways as obvious here?
sns.displot(tips, x="total_bill", hue="day", kind="hist", stat="density", multiple="stack");

My (subjective) answers:

* A1: not bad, but too much overlap; 4 categories is almost too many
* A2: it's a mess
* A3: better than A2, still not great and not as good as A1
* In hindsight, the sequence of box plots seems most informative
* With penguins however, translucent KDEs seem most informative.