# This Week

Data Visualization
* matplotlib - core plotting package for Python
* Seaborn - built on top of matplotlib and pandas

# matplotlib

matplotlib was initially intended to bring [MATLAB](https://www.mathworks.com/products/matlab/)-like graphics functionality into Python.  Today it is one of the most used (if not THE most used) Python graphics packages.  It was started by John Hunter (1968-2012), and is now led by Michael Droettboom.

#### Object-oriented interface
* Create a figure object and axes objects, and then operate on those
* Separates style from graph
* Can easily have multiple subplots

#### Objectives

* Understand the basic components of a plot
* Understand how to style a plot
* Give you enough information to use the [gallery](http://matplotlib.org/gallery#) on your own
* Reference for several standard plots
  * histogram, boxplot
  * scatter, line, hexbin
  * contour

#### References

This notebook is based on presentations from Monte Lunacek (National Renewable Energy Laboratory), which are based on some of the following content.

- [J.R. Johansson's tutorial](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb)
- [Matplotlib tutorial by Jake Vanderplas](http://jakevdp.github.io/mpl_tutorial/)
- [Nicolas P. Rougier's tutorial](http://www.loria.fr/~rougier/teaching/matplotlib/)
- [Making matplotlib look like ggplot](http://messymind.net/making-matplotlib-look-like-ggplot/)
- https://github.com/jakevdp/mpld3
-  [Harvard CS109 Data Science Class](http://nbviewer.ipython.org/github/cs109/content/blob/master/lec_03_statistical_graphs.ipynb)

#### matplotlib in the IPython Notebook

The default functionality for matplotlib is to render graphics in a new window. This is great when you are working at the command line, since Terminal (Mac) and Anaconda Prompt (Windows) cannot render graphics directly; they only render text.  However, the Jupyter Notebook runs in a web browser, which is perfectly capable of presenting all kinds of graphics.  To get matplotlib and the Notebook to play nicely together we use the following **"magic"**:    
`%matplotlib inline`     
This renders the graphics inside the Notebook, as opposed to throwing them into an independent window.  You don't need this magic when working at the command line or when working in a Python script (i.e., a `.py` file).

Note: "magics" add functionality to the IPython command line and the Jupyter Notebook. They are preceded by `%` or `%%`.  You saw the `%timeit` magic a couple weeks ago when we wanted to know how fast some code ran.

Sometimes you will see the following Jupyter Notebook magic:    
`%pylab inline`      
**Don't use this!!!**  This magic runs the following commands, plus a lot of other stuff:     
`from pylab import *`    
`from numpy import *`     
As you learned a few weeks ago, you *never* use the **`from xxx import *`** syntax. 
The result is that your namespace gets polluted. The code you write is not easily ported to a regular Python script (i.e., a `.py` file) since it is not clear what has been imported.

Read more about the evils of `pylab` [here](https://github.com/Carreau/posts/blob/master/10-No-PyLab-Thanks.ipynb).
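
To make the namespace problem concrete, here is a minimal sketch of one name clash. It assumes that `from numpy import *` replaces the built-in `sum` with numpy's `sum` (my understanding of what numpy exports). To keep the cell safe to run, it calls the two functions explicitly instead of doing the star import.

```python
import builtins
import numpy as np

values = [0, 1, 2, 3, 4]

# The built-in sum treats the second argument as a starting value.
builtin_result = builtins.sum(values, -1)   # (-1) + 0 + 1 + 2 + 3 + 4

# numpy's sum treats the second argument as an axis, so nothing is subtracted.
numpy_result = np.sum(values, -1)

print(builtin_result, numpy_result)
```

Same call, two different answers, and after a star import it is not obvious which `sum` you are getting.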

### Quick, easy, simple plots

You (almost) always want to start a Notebook that uses plotting with the following line.

In [None]:
# keep the graphics inside the Notebook
%matplotlib inline

Now you need to import matplotlib. Generally `pyplot` has everything you'll need.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Let's start with a simple equation that shows the relationship between radii and areas of circles. We'll generate 50 `x` values (radii of circles), and then compute the area of each circle. We then pass this data to the `plot` function and then `show` it.

In [None]:
x = np.arange(1, 51)
y = np.pi * (x**2)
plt.plot(x, y)
plt.show()

The `plot` function will draw a line plot by default. The following cell plots the same data as points.

In [None]:
plt.plot(x, y, 'o')
plt.show()

**Action**: [This website](http://matplotlib.org/api/markers_api.html) lists other markers offered by matplotlib. Swap out the `o` in the above cell with some of the other options to see the different markers you can use.

Another common plot is the histogram. 

In [None]:
plt.hist(np.random.normal(0, 1, 1000), alpha=0.5, histtype='stepfilled')
plt.hist(np.random.normal(2, 2, 1000), alpha=0.5, histtype='stepfilled')
plt.show()

**Action**: Change the `alpha` value in the above cell to other values between 0 and 1. What does the `alpha` setting do?

__Action__: Visualization (and programming) is an alternative way to learn (and understand) statistics. If you want to refresh your memory on the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), you can play with the parameters for [`np.random.normal`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html) and then render the plot again.

### Infrastructure

The examples above are for quick plots. Typically you'll want more fine-grained control over the plot to add stylistic elements.

#### The `figures` and `axes` objects

This is the typical workflow we will follow:
1. Create a blank figure
2. Add axes (a.k.a. subplots)
3. Fill each axis with the data plot

In [None]:
x = np.arange(1, 51) # same data as before
y = np.pi * x**2

fig = plt.figure()
ax = fig.add_subplot(1,1,1) # 1 row, 1 col, graphic 1
ax.plot(x, y)
plt.show()

**Note**: Matching up the three steps in the workflow to the code in the above cell:

    Step 1: fig = plt.figure()
    Step 2: ax = fig.add_subplot(1,1,1)
    Step 3: ax.plot(x, y)

Let's now make multiple subplots.

In [None]:
fig = plt.figure()

ax1 = fig.add_subplot(1,2,1) # 1 row, 2 cols, graphic 1
ax2 = fig.add_subplot(1,2,2) # 1 row, 2 cols, graphic 2

ax1.plot(x, y)

ax2.hist(np.random.normal(0, 1, 1000), alpha=0.5, histtype='stepfilled')
ax2.hist(np.random.normal(2, 2, 1000), alpha=0.5, histtype='stepfilled')

plt.show()

These are the same plots as earlier, just squished.

We can combine the first two steps of the workflow using the `plt.subplots()` command.

In [None]:
fig, ax = plt.subplots(2,3)

ax[0,0].plot(x, y)
ax[0,2].hist(np.random.normal(0, 1, 100), color="g")
ax[1,1].scatter(np.random.normal(0, 1, 10), np.random.normal(0, 1, 10), color="r")

plt.show()

**Note**: The first line in the above cell gives us our figure and an array of axes. The axes are accessed using numpy-style indexing. In the above example we only fill three of the six plots with data.
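
If the array of axes feels abstract, the sketch below pokes at it a little. The panel titles and the choice of which panel to hide are just for illustration.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3)

# the axes come back as a numpy array with one entry per subplot
print(type(axes).__name__, axes.shape)

# axes.flat visits every subplot in turn
for i, ax in enumerate(axes.flat):
    ax.set_title("panel " + str(i))

# an unused subplot can be hidden instead of left blank
axes[1, 2].set_visible(False)

# with no grid arguments you get back a single axes object, not an array
single_fig, single_ax = plt.subplots()
```

This is why the earlier cells that call `plt.subplots()` with no grid use `ax` directly, while the grid version needs an index such as `ax[0,0]`.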

The `subplot2grid` command allows for variable sized subplots.

In [None]:
fig = plt.figure()
ax1 = plt.subplot2grid((3,4), (0,0), colspan=4)
ax2 = plt.subplot2grid((3,4), (1,0), rowspan=2)
ax3 = plt.subplot2grid((3,4), (1,1))
ax4 = plt.subplot2grid((3,4), (2,1))
ax5 = plt.subplot2grid((3,4), (1,2), colspan=2, rowspan=2)
fig.tight_layout()
plt.show()

**Note**: Look at the second line above. The first argument is `(3,4)`, which indicates the overall shape of the grid (3 rows, 4 columns). The second argument is `(0,0)`, which is the (row, column) start point for the subplot. The third argument is `colspan=4`, which indicates how many columns this subplot will span. Can you interpret what is happening in the remaining lines of code?

Don't forget about the built-in help functionality.

In [None]:
help(plt.plot)

### Making Pretty Plots

In [None]:
# resetting the setup
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We'll use data on different car models.

In [None]:
cars = pd.read_csv('cars.csv')
cars.head()

In [None]:
cars.shape

The basic `scatter` plot. Notice that we follow the same workflow.

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(cars.wt, cars.mpg)
plt.show()

**Note**: We can interpret the above plot as saying that heavier cars (`wt`) tend to get worse gas mileage (`mpg`).

#### Changing style

We will use the Palettable package, which includes the Color Brewer framework developed by Cindy Brewer, to define our colors.
* [color brewer](http://colorbrewer2.org/) (if you have not seen Color Brewer before, take some time to click around on the website) 
* [palettable](https://jiffyclub.github.io/palettable/)

Install `palettable`
* Launch a new command line.
* Type: `conda install -c conda-forge palettable`

In [None]:
import palettable

colors = palettable.colorbrewer.qualitative.Set2_3.mpl_colors

We'll color the points based on a third variable: the number of engine cylinders (`cyl`).

In [None]:
fig, ax = plt.subplots(figsize=(6,5))
for i, cylinder in enumerate([4,6,8]):
    subset = cars[cars.cyl == cylinder]
    # draw on the axes object; assigning the result back to `ax` would overwrite it
    ax.scatter(subset.wt, subset.mpg, s=100, alpha=0.95, 
               edgecolor='none', c=colors[i])
plt.show()

**Action**: Go through the above cell line by line. Spend some time on this. This is a complicated cell for sure, but there is nothing particularly new here; it is combining stuff you've learned in previous weeks, or earlier in this Notebook. You can type `help(plt.scatter)` to see what all the arguments do; you can also just change the values to see what happens.

**Action**: Go to [colorbrewer2.org](http://colorbrewer2.org) and click through the menus to see that these colors match the website. Recall that the line that selected colors was:
    
    colors = palettable.colorbrewer.qualitative.Set2_3.mpl_colors

In the next few cells we are going to tweak the cars plot a bunch of times. Since we don't want to keep retyping the same code over and over again, we will make it into a function. Notice how the code in the following cell mirrors the code above.

In [None]:
def base_figure():
    fig, ax = plt.subplots(figsize=(6,5))
    for i, cylinder in enumerate([4,6,8]):
        subset = cars[cars.cyl == cylinder]
        ax.scatter(subset.wt, subset.mpg, s=100, alpha=0.95, 
                   edgecolor='none', c=colors[i])
    return fig, ax

Each of the cells below adds minor customizations. Be sure to notice what is different with each new image.

In [None]:
fig, ax = base_figure()

# remove the ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')  
plt.show()

In [None]:
fig, ax = base_figure()

# remove the ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')  

# remove the spines
ax.spines['top'].set_visible(False)  
ax.spines['right'].set_visible(False)
plt.show()

In [None]:
fig, ax = base_figure()

# remove the ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')  

# modify the spines
for s in ['bottom','left','top','right']:
    ax.spines[s].set_linewidth(0.75)
    ax.spines[s].set_color('0.8')    

plt.show()

In [None]:
fig, ax = base_figure()

# remove the ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')  

# modify the spines
for s in ['bottom','left','top','right']:
    ax.spines[s].set_linewidth(0.75)
    ax.spines[s].set_color('0.8')    

# set a background color and overlay some grid lines
ax.patch.set_facecolor('0.93')
ax.grid(True, 'major', color='0.98', linestyle='-', linewidth=1.0)
ax.set_axisbelow(True)

plt.show()

**Note**: Does the style of the above plot look familiar? This is a home-brewed version of the [ggplot style](https://plot.ly/ggplot2/) from R.

Let's put all of this ggplot styling into one function to define a custom transformation.

In [None]:
# put all the transformations together into one function
def ggplot(ax):
    
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    for s in ['bottom','left','top','right']:
        ax.spines[s].set_linewidth(0.75)
        ax.spines[s].set_color('0.8')
    
    ax.patch.set_facecolor('0.93')
    ax.grid(True, 'major', color='0.98', linestyle='-', linewidth=1.0)
    ax.set_axisbelow(True)   

__Aside__: I hope you noticed that the above function does not have a return. That might seem weird since I've been harping on this issue for a couple of weeks... "the last line of a function should ALMOST always be `return ...`." Notice that we are leveraging the "pass by reference" concept in this function (strictly speaking, Python passes a reference to the object, so changes made to the object itself are visible to the caller). We pass in an object and call it `ax`; then start hacking on `ax` (okay, that's kinda funny right, "hacking" and "ax"... seriously, stop rolling your eyes). Since the function modifies the very object we passed in, the changes we make to `ax` live on outside the function.
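
Here is a small sketch of that idea using a plain list instead of an axes object. The function names `mutate` and `rebind` are made up for this example.

```python
def mutate(items):
    # changes the object that the caller also holds
    items.append("changed")


def rebind(items):
    # only points the local name at a new list; the caller's list is untouched
    items = ["brand new list"]


data = ["original"]

mutate(data)
after_mutate = list(data)

rebind(data)
after_rebind = list(data)

print(after_mutate, after_rebind)
```

Our `ggplot(ax)` function behaves like `mutate`: it calls methods on the object it was handed, so no return is needed.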

With our two functions, we can now plot the pretty graphic with just two lines.

In [None]:
fig, ax = base_figure()
ggplot(ax)
plt.show()

#### Legends

In [None]:
# small change to our basic scatterplot function
def base_figure():
    fig, ax = plt.subplots(figsize=(6,5))
    for i, cylinder in enumerate([4,6,8]):
        subset = cars[cars.cyl == cylinder]
        ax.scatter(subset.wt, subset.mpg, s=100, alpha=0.95, 
                   edgecolor='none', c=colors[i],
                   label=str(cylinder)+' cyl')     # <---- adding a label
    return fig, ax

The following cell adds a legend based on the `label`.

In [None]:
fig, ax = base_figure()    # get the basic scatterplot
ggplot(ax)                 # make it ggplot like
ax.legend(loc='best')      # add a legend
plt.show()

#### Save Figures to Hard Drive

In [None]:
fig, ax = base_figure()
ggplot(ax)

ax.legend(loc='best', scatterpoints=1) # for a single point
ax.legend_.get_frame().set_linewidth(0)
ax.legend_.get_frame().set_alpha(0.5)

fig.savefig('scatter.png')
fig.savefig('scatter.pdf')

**Action**: Go to the directory where this Notebook is stored. You can open the two new files that you just created by double-clicking them.
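
`savefig` takes a few optional arguments that are handy when a figure is headed for a report. The sketch below uses stand-in data and writes to a temporary directory (both are my choices for the example, not part of the cars workflow). `dpi` controls the resolution and `bbox_inches='tight'` trims extra whitespace.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot([1, 2, 3], [1, 4, 9], label="stand-in data")
ax.legend(loc="best")

# write into a temporary directory so nothing in the Notebook folder is touched
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "scatter_hires.png")

# a higher dpi gives a sharper image; 'tight' trims the surrounding whitespace
fig.savefig(png_path, dpi=300, bbox_inches="tight")

print(png_path, os.path.getsize(png_path), "bytes")
```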

### Examples

The following is a selection of plots matplotlib offers (by no means comprehensive).

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

matplotlib lets you set many settings for the entire session so that you don't need to keep typing them in for each plot. Below we set the colors. Notice that we're using a different color scheme from above.

In [None]:
# set the Color Brewer colors as the default colors
import palettable
from cycler import cycler

colors = palettable.colorbrewer.qualitative.Set1_8.hex_colors
mpl.rcParams['axes.prop_cycle'] = cycler('color', colors)
colors
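
If you only want a setting to apply to one plot and not the whole session, `plt.rc_context` is one option. A minimal sketch, using line width as the example setting:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

width_before = mpl.rcParams["lines.linewidth"]

# settings passed to rc_context only apply inside the with-block
with plt.rc_context({"lines.linewidth": 4}):
    fig, ax = plt.subplots()
    (thick_line,) = ax.plot([0, 1, 2], [0, 1, 4])

# once the block ends the session-wide value is back
width_after = mpl.rcParams["lines.linewidth"]

print(thick_line.get_linewidth(), width_before, width_after)
```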

__Action__: Hex color is one of the color formats that matplotlib accepts. Copy and paste a couple of the text strings above into [this website](https://www.hexcolortool.com) to see what they are.
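
You can also decode a hex color by hand. The sketch below assumes the sample string `#E41A1C`, which I believe is the red at the start of the Set1 palette; swap in any string from the list above.

```python
import matplotlib.colors as mcolors

hex_color = "#E41A1C"   # assumed sample value; use any string from the list above

# each pair of hex digits is one channel (red, green, blue) on a 0-255 scale
r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))

# matplotlib stores the same color as fractions between 0 and 1
rgb_fraction = mcolors.to_rgb(hex_color)

# and can convert back to a (lowercase) hex string
round_trip = mcolors.to_hex(rgb_fraction)

print((r, g, b), rgb_fraction, round_trip)
```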

#### Line Graph

The following plot builds on the tools we have seen already, including the normal function. 

In [None]:
fig, ax = plt.subplots()
ggplot(ax)
ax.plot(np.random.normal(0,1,200).cumsum())
plt.show()
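
In case `cumsum` is new to you: it returns the running total of an array, which is what turns a series of random steps into a wandering line. A tiny sketch with made-up steps:

```python
import numpy as np

steps = np.array([1.0, -2.0, 3.0, 0.5])

# each entry is the sum of all the steps taken so far: 1, -1, 2, 2.5
walk = steps.cumsum()

print(walk)
```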

Let's put eight lines on one plot.

In [None]:
fig, ax = plt.subplots()
ggplot(ax)
for i in range(8):
    ax.plot(np.random.normal(0,1,200).cumsum())
plt.show()

__Action__: You're going through this slowly, and not just getting mesmerized by the pretty colors... right? There is code from previous weeks in these cells, along with the new stuff on plotting. Make sure you're comfortable with the old _and_ new stuff.

#### Histogram

In [None]:
x = np.random.normal(100, 20, size=300)

fig, ax = plt.subplots()
ggplot(ax)
ax.hist(x, alpha=0.5, bins=20)
plt.show()

In [None]:
fig, ax = plt.subplots()
ggplot(ax)
for i in range(3):
    x = np.random.normal(i, 1, 500)
    ax.hist(x, normed=True, alpha=0.5, histtype='stepfilled', bins=20)
plt.show()

Notice that this throws a deprecation warning (at least it does on my machine). If you read the warning, it tells you how to fix the problem: replace `normed` with `density`. On newer matplotlib releases, where `normed` has been removed, the cell raises an error instead. We make the fix in the following cell. Notice what has changed. Also, this issue will return later in this notebook.

In [None]:
fig, ax = plt.subplots()
ggplot(ax)
for i in range(3):
    x = np.random.normal(i, 1, 500)
    ax.hist(x, density=True, alpha=0.5, histtype='stepfilled', bins=20)
plt.show()
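
So what does `density=True` actually do? The sketch below uses numpy's `histogram` function (which, as far as I know, follows the same definition) to show that the bar heights are rescaled so that the total area of the bars is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)

counts, edges = np.histogram(x, bins=20)
density, _ = np.histogram(x, bins=20, density=True)
widths = np.diff(edges)

# the area of the bars (height times width) adds up to 1
total_area = (density * widths).sum()

# the same rescaling done by hand: count / (number of points * bin width)
manual = counts / (counts.sum() * widths)

print(total_area, np.allclose(manual, density))
```

This is what makes the three histograms comparable even if they had different numbers of observations.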

**Action**: How many histograms are in the above plot?

**Action**: Use `help(ax.hist)` to see what all the histogram arguments are.

#### Bar Charts

In [None]:
animals = ['dogs', 'cats', 'bats', 'rats', 'cows']
heights = [501, 607, 709, 650, 532]

fig, ax = plt.subplots()
ggplot(ax)

# create some bars
ax.bar(range(len(animals)), heights, color=colors, alpha=0.75)

# set the plot limits to have nice padding above the bars
ax.set_ylim(0,800)

# add labels above each bar
for x, y in enumerate(heights):
    plt.annotate(y, (x, y+20), ha='center')

# add labels below each bar    
ax.set_xticks(np.arange(len(animals)))
ax.set_xticklabels(animals)

plt.show()

**Action**: There is a lot happening in the above cell. 
- Comment out all the lines after `ax.bar(...)`. Rerun the cell.
- One by one uncomment the code chunks and see how the final plot is built up

#### Box Plot

The cell below makes boxplots. Notice that we're still following the same pattern every time: 

    0) Get some data
    1) Create a blank figure
    2) Add axes (a.k.a. subplots)
    3) Fill each axis with the data plot

In [None]:
# create some data
d1 = np.random.normal(20, 10, 300)
d2 = np.random.normal(40, 10, 300)
data = [d1, d2]

# make a plot
fig, ax = plt.subplots()
ggplot(ax)
ax.boxplot(data, widths=0.65)

plt.show()

__Note__: A box plot is an alternative way to summarize a dataset. The middle line is the median, the top of the "box" is the value of the observation at the 75th percentile, the bottom is the value at the 25th percentile. The length of the whiskers can vary from implementation to implementation. matplotlib uses this rule:

whis is set to 1.5 by default.
> In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to last datum less than Q3 + whis \* IQR). Similarly, the lower whisker will extend to the first datum greater than Q1 - whis \* IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points. Set this to an unreasonably high value to force the whiskers to show the min and max values. Alternatively, set this to an ascending sequence of percentile (e.g., [5, 95]) to set the whiskers at specific percentiles of the data. Finally, whis can be the string 'range' to force the whiskers to the min and max of the data. [more info can be found here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html)

This has less information than a histogram, but it is easier to plot these for comparisons.
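
The whisker rule quoted above is easier to digest with numbers. Here is a minimal sketch that applies the rule by hand with numpy to some stand-in data; it is my reading of the rule, not matplotlib's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(20, 10, 300)

q1, median, q3 = np.percentile(d, [25, 50, 75])
iqr = q3 - q1
whis = 1.5

# whiskers stop at the most extreme data points still inside the fences
upper_whisker = d[d <= q3 + whis * iqr].max()
lower_whisker = d[d >= q1 - whis * iqr].min()

# anything beyond the whiskers would be drawn as an individual point
outliers = d[(d > upper_whisker) | (d < lower_whisker)]

print(q1, median, q3, lower_whisker, upper_whisker, len(outliers))
```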

#### Error Bars

In [None]:
# create some data
x = np.arange(1, 51)
y = np.pi * x**2
xerr = np.random.normal(500, 500, size=len(x))

# make a plot
fig, ax = plt.subplots()
ggplot(ax)

ax.plot(x, y)
# error bar lengths must be non-negative, so take the absolute value
ax.errorbar(x, y, yerr=np.abs(xerr), fmt='.k')  # <--- adding this line gives us error bars
plt.show()

**Action**: Comment out the `errorbar` line in the above cell to see what the plot looks like "before and after."

Below we fill in the error range.

In [None]:
fig, ax = plt.subplots()
ggplot(ax)

ax.plot(x, y)
ad_low = y-abs(xerr)
ad_high = y+abs(xerr)
ax.fill_between(x, ad_low, ad_high, color='0.5', alpha=0.2)

plt.show()

#### Point Patterns

The cell below generates 20,000 points using the sklearn package.

In [None]:
from sklearn.datasets import make_blobs
X, membership = make_blobs(n_samples=20000, centers=2, 
                           random_state=37, cluster_std=4)
x = X[:,0]
y = X[:,1]

In [None]:
fig, ax = plt.subplots()
ggplot(ax)

ax.plot(x, y, 'o')
plt.show()

**Action**: What is the problem with plotting 20,000 points?

There are a number of approaches for dealing with overplotting.

In [None]:
# adjust the transparency of the points
fig, ax = plt.subplots()
ggplot(ax)

ax.plot(x, y, 'o', alpha=0.02)
plt.show()

**Note**: From the above cell it is becoming clearer that there are two clusters, and that each has a core with the density of points decreasing as you move away from the core.

**Action**: Play with the alpha level (0 to 1) in the above cell to see the impact.

A hexbin plot overlays a hexagonal grid on the points, and then counts the number of points falling in each cell. The plot then colors each cell based on the number of points in the cell.

In [None]:
# use hexbins to deal with over plotting
fig, ax = plt.subplots(figsize=(6,5))

ax.hexbin(x, y, gridsize=20)
plt.show()
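
The counting behind a hexbin plot is easier to see with square cells. The sketch below uses numpy's `histogram2d` on stand-in data (not the blobs from above) to count the points in a 20 by 20 grid. It uses squares, not hexagons, so treat it as an analogy.

```python
import numpy as np

rng = np.random.default_rng(37)
n_points = 20000
x = rng.normal(0, 4, n_points)
y = rng.normal(0, 4, n_points)

# counts[i, j] is the number of points that fall in grid cell (i, j)
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20)

print(counts.shape, int(counts.sum()), int(counts.max()))
```

Every point lands in exactly one cell, so the counts add up to the number of points. The plot just maps each count to a color.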

Below we make a few customizations to the hexbin plot to increase the interpretability of the data.

In [None]:
blues = plt.get_cmap('Blues')
fig, ax = plt.subplots(figsize=(8,5))

tmp = ax.hexbin(x, y, gridsize=40, cmap=blues)
fig.colorbar(tmp, ax=ax)
plt.show()

# Seaborn

> Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

> Drawing attractive figures is important. When making figures for yourself, as you explore a dataset, it's nice to have plots that are pleasant to look at. Visualizations are also central to communicating quantitative insights to an audience, and in that setting it's even more necessary to have figures that catch the attention and draw a viewer in.

> Matplotlib is highly customizable, but it can be hard to know what settings to tweak to achieve an attractive plot. Seaborn comes with a number of customized themes and a high-level interface for controlling the look of matplotlib figures.

Recall that we defined this plot earlier.

In [None]:
x = np.arange(1, 51)
y = np.pi * x**2

fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()

As you see above, the matplotlib styles are kinda boring and in some cases strange. In my opinion, the seaborn styles are generally an improvement.

In [None]:
import seaborn as sns

To get the seaborn styles in one shot, we use the `sns.set()` function. It is important to note that seaborn applies its defaults to the matplotlib settings. This means that after calling `sns.set()`, any plot you make with matplotlib will come out in the seaborn style.

In [None]:
sns.set()

Notice that the following cell makes the exact same calls to matplotlib as the plot above, but the style is much better... in fact it looks a lot like our `ggplot` style from earlier.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1) # 1 row, 1 col, graphic 1
ax.plot(x, y)
plt.show()

Seaborn has five preset themes: `darkgrid`, `whitegrid`, `dark`, `white` and `ticks`. The default is `darkgrid`. In the following cell we change the style.

In [None]:
sns.set_style("whitegrid")

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1) # 1 row, 1 col, graphic 1
ax.plot(x, y)
plt.show()

In [None]:
plt.hist(np.random.randn(1000), alpha=0.5, histtype='stepfilled')
plt.hist(0.75*np.random.randn(1000)+1, alpha=0.5, histtype='stepfilled')
plt.show()

**Note**: Notice that both the line plot and histogram take on the new style even though we only called `sns.set_style` once. When you set the style, it becomes the default for all subsequent plots until you set the style again.

**Action**: Go back up and test each of the five styles.  Which do you like best?

We'll repeat some of the earlier plots now that seaborn has been imported.

In [None]:
# not much here except getting the nicer style
line_data = pd.DataFrame(np.random.normal(0,1,size=[200,8]).cumsum(axis=0))
plt.plot(line_data)
plt.show()

We'll repeat an earlier histogram plot. The following cell just builds the data like before.

In [None]:
x = np.random.normal(100, 20, size=300)

**Action**: The next cell creates the histogram. Roll back up to see how this was coded in matplotlib.

In [None]:
sns.distplot(x)

__Note__: Recall the deprecation warning we saw earlier? This shows that seaborn is using matplotlib under the hood __and__ that seaborn has not updated their code to match the changes coming to matplotlib.

**Note**: Seaborn gives you some other stuff for free, like a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE) for the data.
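
If you are curious what a KDE is doing, here is a rough sketch with numpy: drop a small Gaussian bump on every observation and average the bumps. The bandwidth formula is a common rule of thumb that I picked for the example; I am not claiming it is the one seaborn uses.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(100, 20, size=300)

# rule-of-thumb bandwidth (an assumption; libraries offer several choices)
bandwidth = data.std(ddof=1) * len(data) ** (-1 / 5)

# evaluate the estimate on a grid that extends a bit past the data
grid = np.linspace(data.min() - 4 * bandwidth, data.max() + 4 * bandwidth, 400)

# one Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - data[None, :]) / bandwidth
kde = np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

# like a density histogram, the area under the curve is (approximately) 1
area = (kde * (grid[1] - grid[0])).sum()

print(bandwidth, area)
```

Plotting `grid` against `kde` with `ax.plot` gives a smooth curve similar in spirit to the one seaborn draws.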

In [None]:
# turn off the KDE
sns.distplot(x, kde=False)

We'll replicate our bar plot from earlier.

In [None]:
animals = ['dogs', 'cats', 'bats', 'rats', 'cows']
heights = [501, 607, 709, 650, 532]

fig, ax = plt.subplots()
ggplot(ax)

# create some bars
ax.bar(range(len(animals)), heights, color=colors, alpha=0.75)

# set the plot limits to have nice padding above the bars
ax.set_ylim(0,800)

# add labels above each bar
for x, y in enumerate(heights):
    plt.annotate(y, (x, y+20), ha='center')

# add labels below each bar    
ax.set_xticks(np.arange(len(animals)))
ax.set_xticklabels(animals)

plt.show()

We'll rebuild this using seaborn. 
- Put the data into a pandas dataframe. 
- Call seaborn's `barplot` function.
- We can still use matplotlib syntax to make tweaks to the plot.

For this plot seaborn gets us most of the way there, but we still need a little matplotlib syntax to finish it off. Also notice the x and y axis labels come for free.

In [None]:
animals = ['dogs', 'cats', 'bats', 'rats', 'cows']
heights = [501, 607, 709, 650, 532]
bar_data = pd.DataFrame(animals, columns=['animals'])
bar_data['heights'] = heights

plot = sns.barplot(x='animals', y='heights', data=bar_data)

# set the plot limits to have nice padding above the bars
plot.set_ylim(0,800)

# add labels above each bar
for x, y in enumerate(heights):
    plot.annotate(y, (x, y+20), ha='center')
plt.show()

In [None]:
d1 = np.random.normal(20, 10, 300)
d2 = np.random.normal(40, 10, 300)
data = pd.DataFrame(d1, columns=['data_1'])
data['data_2'] = d2

sns.boxplot(data=data)

**Note**: Seaborn tries to use the pandas data frame columns as appropriate (in this case making them the x-axis labels).

## Closing thoughts on Seaborn

So you may be asking, "why did he show us matplotlib if seaborn does all this great styling stuff?" The answer is that seaborn sits on top of matplotlib, and so you will sometimes need to get your hands dirty mucking around with matplotlib to get the graphic just right. This can be analogized to the relationship between numpy and pandas; pandas sits on top of numpy, and provides a much nicer interface to data manipulation, but you still need to understand how numpy works to do sophisticated computations in Python.

Seaborn is one of a few Python plotting libraries vying to fill the current gap in the language for a high-level plotting package. matplotlib does (almost) everything in terms of plotting, and that has kept it around.  But it is also cumbersome to use.  You can compare this to ggplot in R, which has made visualization easy. It appears that seaborn is winning the battle in Python, but if you really like ggplot you can [try this](http://ggplot.yhathq.com/).

# Test Yourself

1a) Use pandas to read in `iris.csv` that is in the `at_home` directory. Make a histogram of the `sepal_length` column using matplotlib. Do __NOT__ use the `ggplot` styling.

In [None]:
# Run this cell to turn off the seaborn style
sns.reset_orig()

1b) Make the same histogram using seaborn. Do __NOT__ include the kernel density estimation.

---

2) In your own words, what is the effect of this line?
  
       %matplotlib inline

3) Make some modern art using matplotlib.

- Turn off the seaborn style using `sns.reset_orig()`.
- Create a numpy array with 50 draws from a [uniform distribution](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.uniform.html), with a minimum value of 1 and max of 20. Call this `x`.
- Repeat the above and call it `y`.
- Do it again and call it `color`.
- Repeat the above one more time, but this time assume that each value is a radius, and compute its area. Call this one `area`.
- Read the docs on [matplotlib's scatterplot](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html). Plug in your four arrays to make a scatter plot. Set the `alpha` to 0.5. Hint: `area` is the size of the points and `color` is the color of the points.
- Bonus: [Turn off the x and y labels and ticks](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.tick_params.html?highlight=matplotlib%20axes%20axes%20tick_params#matplotlib.axes.Axes.tick_params) to make it really artsy.

This makes a pretty picture, but notice that you just made a four-dimensional plot. Imagine that each point is a U.S. state:
- percent of the population in poverty is on the x axis
- percent of the population with a college degree is on the y axis
- total population is the size of the dot
- population growth rate is represented by the dot color

4) Visualize some flower data

- Turn seaborn style back on using `sns.set()`
- Use pandas to read in the `iris.csv`.
- A [seaborn lmplot](https://seaborn.pydata.org/generated/seaborn.lmplot.html) will plot points and fit regression lines through subsets of the points. It does a ton of stuff in a single function call. Scroll to the bottom of the docs to see some examples.
  - put the Sepal length on the x axis
  - put the Sepal width on the y axis
  - subset the data by species