In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="custom.css">

In [None]:
# Filter out some warnings from statsmodels

import warnings
warnings.filterwarnings('ignore')

## `seaborn`

`seaborn` is a high-level visualization library that uses `matplotlib` under the hood and has strong support for `pandas`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#rc('figure', figsize=(12, 5)) #why do we do this?

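Side-effect: in `seaborn` versions before 0.8, importing it adjusts `matplotlib`'s style (and makes your plots prettier)! In later versions you get the same effect by calling `sns.set()`.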
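
As a minimal sketch of that style change, assuming a `seaborn` release where the theme is applied with an explicit `sns.set()` call rather than on import, we can look at one `matplotlib` setting before and after. The variable names are only illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Look at one matplotlib default before and after applying the seaborn theme
grid_before = plt.rcParams['axes.grid']
sns.set()  # applies seaborn's default theme to every later matplotlib plot
grid_after = plt.rcParams['axes.grid']

print('axes.grid before:', grid_before, '| after:', grid_after)
```

Because `sns.set()` changes global `matplotlib` settings, every plot drawn afterwards picks up the new look.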

In [None]:
import numpy as np

x = np.linspace(0, 2 * np.pi, 128, endpoint=True)
y_cos, y_sin = np.cos(x), np.sin(x)

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y_cos)
ax.plot(x, y_sin)
ax.set_ylim(-1.1, 1.1); #note the ';': what does it do?

## `seaborn` vs `matplotlib`

We can use the `hist` function to visualize the distribution of some data in `matplotlib`.

However, if you want to create something more advanced, it's a lot of work.

Consider a kernel density plot: you have to compute the KDE with `scipy` and plot the results yourself.

In [None]:
# what is a kernel density plot
from scipy import stats
data = np.random.gamma(2, size=200)

kernel = stats.gaussian_kde(data)
grid = np.linspace(-2, 12, num=100)
plt.plot(grid, kernel(grid));
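
To answer the question in the cell above: a kernel density estimate places a small Gaussian bump on every observation and averages the bumps. The sketch below computes this by hand and compares it with `scipy`. It assumes `gaussian_kde` is used with its default bandwidth (Scott's rule); `sample`, `h` and `manual_kde` are names made up for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(2, size=200)          # synthetic data, as in the cell above
grid = np.linspace(-2, 12, num=100)

n = sample.size
h = sample.std(ddof=1) * n ** (-1 / 5)   # Scott's rule bandwidth for 1D data

# One Gaussian bump per observation, evaluated on the grid and averaged
z = (grid[:, None] - sample[None, :]) / h
manual_kde = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

scipy_kde = stats.gaussian_kde(sample)(grid)
print(np.allclose(manual_kde, scipy_kde))
```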

`seaborn` has a handy function for this, called `kdeplot`:

In [None]:
sns.kdeplot(data);

The higher-level `distplot` plots a KDE and a histogram together:

In [None]:
sns.distplot(data, bins=10);

We can also __fit a distribution__ to our histogram.

A fit with a gamma distribution shows how good the KDE approximation is:

In [None]:
from scipy import stats
data = np.random.gamma(2, size=200)
sns.distplot(data, fit=stats.gamma);
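
The plot shows the fitted curve but not its parameters. As a sketch of what happens behind `fit=stats.gamma` (to our understanding, the distribution's own `fit` and `pdf` methods are used), we can do the fit ourselves and read off the numbers. The data are again a synthetic stand-in.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(2, size=200)  # synthetic stand-in for `data`

# Maximum-likelihood estimates of the three gamma parameters
shape, loc, scale = stats.gamma.fit(sample)

# Evaluate the fitted density on a grid, ready for plt.plot(fit_grid, fitted_pdf)
fit_grid = np.linspace(sample.min(), sample.max(), 200)
fitted_pdf = stats.gamma.pdf(fit_grid, shape, loc=loc, scale=scale)

print(f'shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}')
```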

This small example already shows that `seaborn` can make our life easier and has a lot of advanced functionality!

## `seaborn` and `pandas`

`seaborn` is even better when you use it with other libraries.

Use `pandas` to get your data into `seaborn` and use `matplotlib` to tweak your plots.

`seaborn` generally accepts either:
* a 1D array (such as a `numpy` array or a column of a DataFrame)
* DataFrame + indicators of which columns to plot
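
The two input styles can be sketched side by side. This example uses `sns.scatterplot` and a small synthetic DataFrame called `toy`, both chosen only for illustration; the two calls should draw the same points.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})

fig_in, (ax_arrays, ax_frame) = plt.subplots(1, 2, sharey=True)

# Style 1: pass the arrays themselves
sns.scatterplot(x=toy['a'].to_numpy(), y=toy['b'].to_numpy(), ax=ax_arrays)

# Style 2: pass the DataFrame and name the columns to plot
sns.scatterplot(data=toy, x='a', y='b', ax=ax_frame)
```

The DataFrame style has the extra benefit that the axis labels are taken from the column names.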

Let's load an example dataset. `iris` is a DataFrame:

In [None]:
import seaborn as sns

In [None]:
iris = sns.load_dataset('iris')
iris.drop(['sepal_length', 'petal_width'], axis=1, inplace=True)
iris.head()

We can visualize the relations between our numerical variables `sepal_width` and `petal_length` with a simple `pairplot`:

In [None]:
sns.pairplot(iris, height=3);  # `height` replaced `size` in seaborn 0.9

The `hue` argument usually names a categorical column; each category is drawn in its own color:

In [None]:
# pairplot returns a grid object; `fig` would still point at the earlier matplotlib figure
g = sns.pairplot(iris, hue='species', height=2.5)
g.tight_layout()

This simple visualization already gives us an idea how to classify species given our variables!

## `seaborn` and `matplotlib`

`seaborn` uses `matplotlib` as its visualization engine, and most of the objects it returns are `matplotlib` objects.

This means that we can prettify our plots using the `matplotlib` OOP syntax we learned earlier!

In [None]:
from scipy import stats
data = np.random.gamma(2, size=200)

ax = sns.distplot(iris['sepal_width'], axlabel='Sepal width [cm]')
ax.set_title('Distribution of sepal widths')
ax.set_yticks([0, 0.5, 1, 1.5]);

A lot of `seaborn`'s functions* accept an `ax` keyword to tell `seaborn` on which axes to plot: 


\* Except those working with FacetGrids

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True)
sns.distplot(iris['sepal_width'], kde=False, ax=axes[0])
sns.kdeplot(iris['sepal_width'], legend=False, ax=axes[1])

for ax in axes:
    ax.set_xlabel('Sepal width [cm]')

## Summary

`seaborn`:
* has high level plotting functions that can make our life easier;
* accepts arrays or `pandas` DataFrames as input;
* outputs `matplotlib` objects that we can tweak!

## Exercises (15 minutes)

- Plot the distribution of `iris['petal_length']` using `sns.distplot` with a KDE and rug plot but without a histogram.
- Use the argument `color` to set the color to red.
- Plot the relation between `sepal_width` and `petal_length` using `sns.jointplot`
- Plot the relation between `sepal_width` and `petal_length` using `sns.jointplot` as hexbins (hint: look at what the `kind` argument does)

## High and medium level functions

We've already seen that we can do interesting stuff with `sns.jointplot` and `sns.distplot`.

`sns.distplot` and `sns.jointplot` call other functions like `sns.kdeplot` and `sns.rugplot` to do the plotting for them.

`sns.jointplot` is an example of a high level function:

In [None]:
sns.jointplot(x='sepal_width', y='petal_length', data=iris);

But we can also use slightly lower level functions to do the plotting ourselves:

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

# the x= and y= keywords assume seaborn 0.11 or later
sns.kdeplot(x=iris['sepal_width'], y=iris['petal_length'], ax=ax) #note ax=ax
sns.rugplot(x=iris['sepal_width'], color='g', ax=ax)
sns.rugplot(y=iris['petal_length'], ax=ax);

## Regression plots

We've seen how to plot the relations between variables in `seaborn`, but we can also quantify these relations by doing __regression__.

Although `seaborn` is not the best package to do linear regression, it's convenient to quickly get information about relations between variables.

We can use `regplot` to draw a regression plot.

In [None]:
anscombe = sns.load_dataset('anscombe')
linear = anscombe[anscombe.dataset == 'I']
quadratic = anscombe[anscombe.dataset == 'II']
outlier = anscombe[anscombe.dataset == 'III']

`regplot` draws a scatterplot of two variables and plots the `y ~ x` regression line with the 95% confidence interval.

In [None]:
sns.regplot(x='x', y='y', data=linear);
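
`regplot` draws the fit but returns only the `matplotlib` axes, not the fitted coefficients. If we want the numbers, one option is to fit the line ourselves, sketched here with `np.polyfit`. The values are set I of Anscombe's quartet typed in by hand, so the example does not depend on `load_dataset`; the variable names are illustrative.

```python
import numpy as np

# Anscombe's quartet, set I
xs = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
ys = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])

# Least-squares straight line: ys is approximately slope * xs + intercept
slope, intercept = np.polyfit(xs, ys, deg=1)
print(f'y = {intercept:.2f} + {slope:.2f} x')
```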

For polynomial relations we can specify the polynomial `order`:

In [None]:
sns.regplot(x='x', y='y', data=quadratic, order=2);

For data with outliers we can use robust (Huber) regression:

In [None]:
sns.regplot(x='x', y='y', data=outlier, robust=True)

## Plotting residuals

Once the regression is done, it is often very helpful to see how the residuals behave as a function of `x` with `sns.residplot`.

In case of linear regression, they should be evenly distributed between the `y > 0` and `y < 0` half-planes.

In [None]:
sns.residplot(x='x', y='y', data=linear)

That is, evenly distributed, but also without long runs of consecutive points on the same side of zero!

In [None]:
sns.residplot(x='x', y='y', data=quadratic, scatter_kws={'s': 80})
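
The "not too many consecutive" criterion can be made concrete by counting how often the residuals change sign when ordered by `x`. This is only a rough sketch, not a formal test: sets I and II of Anscombe's quartet are typed in by hand, and the helper functions are made up for this example.

```python
import numpy as np

# Anscombe's quartet: sets I and II share the same x values
xs = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_set1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68])
y_set2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74])

def linear_residuals(x, y):
    """Residuals of a straight-line least-squares fit, ordered by x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    order = np.argsort(x)
    return (y - (slope * x + intercept))[order]

def sign_changes(residuals):
    """Count how often consecutive residuals switch sign."""
    signs = np.sign(residuals)
    return int(np.sum(signs[1:] != signs[:-1]))

for name, ys in [('I', y_set1), ('II', y_set2)]:
    print(name, sign_changes(linear_residuals(xs, ys)))
```

Few sign changes mean long runs on one side of zero, which hints that a straight line is the wrong model for that data set.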

## Distributions and categories
Distributions over categorical data call for different kinds of visualization techniques.

Usually one shows either:
* the distribution of observations
    * `stripplot()`, `boxplot()`,  `violinplot()`
* a statistical estimation to show a central tendency and confidence interval
    * `barplot()`, `countplot()`,  `pointplot()`

In [None]:
titanic = sns.load_dataset('titanic')
titanic.head()

## 1. Distribution of observations

Boxplots summarize a distribution: the box spans the first to the third quartile with a line at the median, the whiskers mark the range of the remaining data, and outliers are drawn as individual points.

In [None]:
sns.boxplot(x='class', y='age', data=titanic);
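
The numbers behind one box can be sketched with `numpy`. This assumes the default whisker rule of 1.5 times the interquartile range; the ages are synthetic stand-ins, not the titanic data.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(30, 12, size=300).clip(min=0)  # synthetic ages

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

# The whiskers end at the most extreme points that are not outliers
whisker_low, whisker_high = inside.min(), inside.max()

print(f'box: {q1:.1f} to {q3:.1f}, median {median:.1f}')
print(f'whiskers: {whisker_low:.1f} to {whisker_high:.1f}, {outliers.size} outliers')
```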

`sns.boxplot` also supports `hue`:

In [None]:
sns.boxplot(x='class', y='age', hue='sex', data=titanic);

And some more advanced options:

In [None]:
sns.boxplot(x='class', y='age', hue='sex', data=titanic,
            notch=True, showfliers=True);

## 2. Statistical estimation

`sns.barplot` can be used to compare averages between categories:

In [None]:
sns.barplot(x='class', y='age', hue='sex', data=titanic, ci=80);

We can ask for a higher confidence level, which gives wider intervals:

In [None]:
sns.barplot(x='class', y='age', hue='sex', ci=99, data=titanic);
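
The error bars come from bootstrapping: the data are resampled with replacement many times and the spread of the resampled averages gives the interval. A rough sketch of the idea for a single bar, with synthetic ages and a hypothetical helper `percentile_interval`:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(30, 12, size=100)  # synthetic ages for one category

# Resample with replacement and take the mean of every resample
n_boot = 1000
resamples = rng.choice(ages, size=(n_boot, ages.size), replace=True)
boot_means = resamples.mean(axis=1)

def percentile_interval(values, level):
    """Central interval that contains `level` percent of the values."""
    half_tail = (100 - level) / 2
    low, high = np.percentile(values, [half_tail, 100 - half_tail])
    return low, high

ci80 = percentile_interval(boot_means, 80)
ci99 = percentile_interval(boot_means, 99)
print('80% interval:', ci80)
print('99% interval:', ci99)
```

The 99% interval has to cover more of the resampled means, which is why its error bars are longer.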

Instead of looking at the average, we can also supply a different estimator:

In [None]:
sns.barplot(x='class', y='age', hue='sex', data=titanic, estimator=np.std);

## Summary

* `seaborn` has more functionality than we can cover in a day.
* We can choose between high and medium level plotting functions.
* There's a wide variety of plots: 
    * distribution plots, regression plots, estimation plots, etc.
* Browse `seaborn`'s [gallery](http://seaborn.pydata.org/examples/index.html) for more inspiration.