# Week 4 Discussion

## Infographic

* [Python Plotting for EDA](http://pythonplot.com/): Side-by-side comparison of the major visualization libraries.

## Links

* [The Python Visualization Landscape](https://www.youtube.com/watch?v=FytuB8nFHPQ): A recent talk about visualization libraries for Python.
* [A Dramatic Tour through Python's Data Visualization Landscape](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/): Examples that show why you should know matplotlib, but should use some other library to make most of your plots. From Oct 2016, so a little outdated.
* [matplotlib Artist Tutorial](https://matplotlib.org/users/artists.html): If you want a deeper understanding of matplotlib.

## Notes

How can we make plots in Python?

Package        | Family     | Depends On
---------------|------------|-----------
[matplotlib][] | matplotlib |
[seaborn][]    | matplotlib | matplotlib
[pandas][]     | matplotlib | matplotlib
[plotnine][]   | ggplot     | matplotlib
[ggpy][]       | ggplot     | matplotlib
[altair][]     | browser    | d3.js
[plotly][]     | browser    | d3.js
[mpld3][]      | browser    | d3.js
[bokeh][]      | browser    |
[bqplot][]     | browser    | jupyter
[vega][]       | browser    | jupyter + d3.js

[matplotlib]: https://matplotlib.org/
[seaborn]: https://seaborn.pydata.org/
[pandas]: http://pandas.pydata.org/pandas-docs/stable/visualization.html
[plotnine]: http://plotnine.readthedocs.io/
[ggpy]: http://yhat.github.io/ggpy/
[altair]: https://altair-viz.github.io/
[plotly]: https://plot.ly/python/
[mpld3]: http://mpld3.github.io/
[bokeh]: https://bokeh.pydata.org/
[bqplot]: https://github.com/bloomberg/bqplot
[vega]: https://github.com/vega/ipyvega

And more...

So what should you actually use?

__Seaborn__ is stable. __Plotnine__ is convenient if you already know ggplot.

Understanding __matplotlib__ is useful, but creating plots directly with matplotlib is painful. The most important thing to know is matplotlib's jargon:

* _Figure_: Container for plots.
* _Axes_: Container for components of a plot ("primitives"). In other words, this is a single plot.
* _Axis_: Container for components of an axis. This is a single axis.
* _Tick_: Container for components of a tick (the tick mark, grid line, and label). This is a single tick on an axis.

All of the containers and the primitives are called _Artists_.
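
To make the jargon concrete, here is a minimal sketch that builds one plot and walks down the container hierarchy. The data values are made up for illustration; the point is only that each object is an instance of matplotlib's `Artist` base class.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

# A Figure holding a single Axes (that is, a single plot).
fig, ax = plt.subplots()

# A primitive: the Line2D drawn inside the Axes (made-up data).
line, = ax.plot([1, 2, 3], [2, 4, 8])

# Each Axes has an x-axis and a y-axis; each Axis holds Tick objects.
xaxis = ax.xaxis
ticks = xaxis.get_major_ticks()

# Containers and primitives alike are Artists.
for obj in (fig, ax, xaxis, ticks[0], line):
    print(type(obj).__name__, isinstance(obj, Artist))
```

Most of the customization in the cells below goes through the Axes object (`ax.set(...)`, `ax.set_ylim(...)`), which is why the Pandas and Seaborn calls are written to return one.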

What kind of plots do we usually make?

First Feature | Second Feature | Plot
--------------|----------------|:----
categorical   |                | dot, <span style="color: #aaa">bar</span>, <span style="color: #aaa">pie</span>
categorical   | categorical    | dot, mosaic, <span style="color: #aaa">bar</span>
numerical     |                | box, density, histogram
numerical     | categorical    | box, density
numerical     | numerical      | line, scatter, smooth scatter
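
One way to read the table is as a lookup from feature types to plot kinds. The sketch below encodes it that way; the helper names `feature_kind` and `suggest_plots` and the `toy` DataFrame are hypothetical, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The table above, keyed by (first feature, second feature).
PLOTS = {
    ("categorical", None): ["dot", "bar", "pie"],
    ("categorical", "categorical"): ["dot", "mosaic", "bar"],
    ("numerical", None): ["box", "density", "histogram"],
    ("numerical", "categorical"): ["box", "density"],
    ("numerical", "numerical"): ["line", "scatter", "smooth scatter"],
}

def feature_kind(series):
    # Assumption: anything that is not numeric counts as categorical.
    return "numerical" if is_numeric_dtype(series) else "categorical"

def suggest_plots(df, first, second = None):
    # Returns an empty list for combinations the table does not cover.
    second_kind = None if second is None else feature_kind(df[second])
    return PLOTS.get((feature_kind(df[first]), second_kind), [])

# Made-up stand-in data with one categorical and one numerical column.
toy = pd.DataFrame({
    "category": ["toy", "herding", "hound"],
    "longevity": [12.5, 11.0, 10.2],
})
print(suggest_plots(toy, "longevity", "category"))
```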


In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import plotnine as gg
import seaborn as sns

dogs = pd.read_feather("data/dogs.feather")
dogs.head()

Dogs data from [Information is Beautiful](https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/).

Quick notes on Pandas:

### Dot Plots

Plot the number of dogs in each category.

In [None]:
# Pandas
counts = dogs["category"].value_counts()

ax = counts.plot(style = "o")
ax.set(title = "Dog Categories", xlabel = "Category", ylabel = "Count")

In [None]:
# Seaborn
counts = dogs["category"].value_counts()

ax = sns.stripplot(x = counts.index, y = counts)
ax.set(title = "Dog Categories", xlabel = "Category", ylabel = "Count")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

In [None]:
# Plotnine

p = gg.ggplot(dogs, gg.aes("category")) + gg.geom_point(stat = "count")
p + gg.labs(title = "Dog Categories", x = "Category", y = "Count")
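
For comparison, the same dot plot can be sketched with matplotlib alone, which gives a sense of the extra bookkeeping involved: the categories have to be placed on the x-axis by hand. The `toy` DataFrame is a made-up stand-in for `dogs`, so that the example runs without the data file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the dogs data; only the "category" column is needed.
toy = pd.DataFrame({
    "category": ["toy", "herding", "toy", "hound", "herding", "toy"]
})
counts = toy["category"].value_counts()

fig, ax = plt.subplots()

# Plot the counts at integer positions, drawing markers only.
positions = list(range(len(counts)))
ax.plot(positions, counts.to_numpy(), "o")

# Label each position with its category name.
ax.set_xticks(positions)
ax.set_xticklabels(list(counts.index), rotation = 45)
ax.set(title = "Dog Categories", xlabel = "Category", ylabel = "Count")
```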

### Box Plots

Plot the distribution of dog longevity, grouped by category.

In [None]:
# Pandas

ax = dogs.boxplot(by = "category", column = "longevity", rot = 45)
# Set title and axis labels.
ax.set(title = "Dog Longevity", xlabel = "Category", ylabel = "Years")
# Hide grouping title Pandas adds.
ax.get_figure().suptitle("")

# There is also .plot.box(), but it seems to be buggy.

In [None]:
# Seaborn

ax = sns.boxplot(x = "category", y = "longevity", data = dogs)
ax.set(title = "Dog Longevity", xlabel = "Category", ylabel = "Years")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

In [None]:
# Plotnine

p = gg.ggplot(dogs, gg.aes("category", "longevity")) + gg.geom_boxplot()
p + gg.labs(title = "Dog Longevity", x = "Category", y = "Years")

### Scatter Plots

Plot popularity against datadog score.

In [None]:
# Pandas

ax = dogs.plot.scatter(x = "datadog", y = "popularity")
ax.set(title = "Best in Show", xlabel = "DataDog Score", ylabel = "Popularity Rank")
# Reverse the y-axis so that rank 1 is at the top.
ylim = reversed(ax.get_ylim())
ax.set_ylim(ylim)

In [None]:
# Seaborn

ax = sns.regplot(x = "datadog", y = "popularity", data = dogs, fit_reg = False)
ax.set(title = "Best in Show", xlabel = "DataDog Score", ylabel = "Popularity Rank")
# Reverse the y-axis so that rank 1 is at the top.
ylim = reversed(ax.get_ylim())
ax.set_ylim(ylim)

In [None]:
# Plotnine

p = gg.ggplot(dogs, gg.aes("datadog", "popularity")) + gg.geom_point()
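# Reassign so the labels are kept in the plot that is displayed.
p = p + gg.labs(title = "Best in Show", x = "DataDog Score", y = "Popularity Rank")
# Reversed limits flip the y-axis so that rank 1 is at the top.
p + gg.ylim(95, -5)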
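
The Pandas and Seaborn cells above flip the y-axis by reading the current limits and setting them in reverse order. As a rough sketch, matplotlib's `Axes.invert_yaxis()` should give the same result in one call. The data here is made up, and the two side-by-side Axes exist only to compare the approaches.

```python
import matplotlib.pyplot as plt

# Two Axes with identical made-up data.
fig, (ax1, ax2) = plt.subplots(ncols = 2)
for ax in (ax1, ax2):
    ax.scatter([1, 2, 3], [10, 40, 90])

# Approach used in the cells above: swap the current limits.
bottom, top = ax1.get_ylim()
ax1.set_ylim(top, bottom)

# Alternative: ask matplotlib to invert the axis directly.
ax2.invert_yaxis()

print(ax1.get_ylim(), ax2.get_ylim())
```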

### Smooth Scatter Plots

Plot popularity against datadog score as a smooth scatter plot (or similar).

In [None]:
# Pandas

# `sharex = False` to fix a bug with xlabel.
ax = dogs.plot.hexbin(x = "datadog", y = "popularity", gridsize = 10, sharex = False)
ax.set(title = "Best in Show", xlabel = "DataDog Score", ylabel = "Popularity Rank")
# Reverse the y-axis so that rank 1 is at the top.
ylim = reversed(ax.get_ylim())
ax.set_ylim(ylim)

In [None]:
# Seaborn

g = sns.jointplot(x = "datadog", y = "popularity", data = dogs, kind = "hex", gridsize = 15, ylim = (95, -5))
g.set_axis_labels("DataDog Score", "Popularity Rank")

In [None]:
# Plotnine

# Doesn't have geom_hex() yet.
p = gg.ggplot(dogs, gg.aes("datadog", "popularity")) + gg.geom_bin2d(bins = 20)
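# Reassign so the labels are kept in the plot that is displayed.
p = p + gg.labs(title = "Best in Show", x = "DataDog Score", y = "Popularity Rank")
# Reversed limits flip the y-axis so that rank 1 is at the top.
p + gg.ylim(95, -5)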