# Make Explanatory Visualizations

**Objectives**

- Understand what continuous and categorical variables are, and use the built-in plotting functionality of `pandas`
- Learn about the various types of `pandas` plots, which are built on `matplotlib`
- Use `matplotlib` to visualize distributions and relationships with continuous and categorical variables
- Imitate a real-world example

**What are categorical, discrete, and continuous variables?**  

* Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. For example, categorical predictors include gender, material type, and payment method.  
* Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric. For example, the number of customer complaints or the number of flaws or defects.  
* Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. For example, the length of a part or the date and time a payment is received.  
[Source](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/what-are-categorical-discrete-and-continuous-variables/)

In [None]:
## Imports.


In [None]:
# use the 'magic' symbol (%) to specify some non-python code (i.e., affects the underlying jupyter kernel).

'''
That line is only for Jupyter notebooks, and allows plt figures to show up in your notebook.
If you are using another editor, you'll use: 
 plt.show() 
at the end of all your plotting commands to have the figure pop up in another window.
'''

In [None]:
# Specify the 'plot style' we want to use with pandas and matplotlib

# "fast" looks the same as the default style (it only changes rendering-speed settings), so you don't necessarily have to set it.

In [None]:
# List of other available plot styles you can use instead of "fast".
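
The import and style cells above could be filled in roughly as follows. This is a minimal sketch that assumes the conventional aliases `np`, `pd`, and `plt`; the `%matplotlib inline` magic appears only as a comment because it works only inside Jupyter.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In a Jupyter notebook you would also run the magic below
# (it is not valid syntax in a plain .py file):
# %matplotlib inline

# Apply a plot style; "fast" is one of the styles shipped with matplotlib
plt.style.use("fast")

# Collect every style name that this matplotlib installation provides
available_styles = sorted(plt.style.available)
print(available_styles)
```

Any name in `available_styles` can be passed to `plt.style.use` in place of `"fast"`.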


## The Pandas built-in visualization tool
This is useful only for simple, quick-and-dirty plots. [Read the full documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html). For anything more complex you'll want to use a more robust visualization package such as `matplotlib`, `seaborn`, or `plotly`.  
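
As a quick illustration of the built-in tool, the sketch below plots a small DataFrame directly. The data and column names are made up for the example.

```python
import pandas as pd

# Hypothetical data: monthly sales for two products
df = pd.DataFrame(
    {"widgets": [3, 5, 4, 6], "gadgets": [2, 2, 5, 7]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# DataFrame.plot delegates to matplotlib and returns a matplotlib Axes,
# so the result can be customized further with Axes methods
ax = df.plot(kind="line", title="Monthly sales (made-up numbers)")
ax.set_ylabel("units sold")
```

Because the return value is an ordinary matplotlib axes object, everything covered later about `ax` methods also applies to plots created this way.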

"Under the hood, pandas plots graphs with the matplotlib library. This is usually pretty convenient since it allows you to just .plot your graphs, but since matplotlib is kind of a train wreck pandas inherits that confusion." [J. Soma](http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/)


## Intro to `matplotlib`

**Basic example**

Let's walk through a very simple example using two numpy arrays. You can also use lists, but most likely you'll be passing numpy arrays or pandas columns (which essentially also behave like arrays).

**The data we want to plot:**

**Basic Matplotlib Commands**

We can create a very simple line plot using the following (I encourage you to pause and use Shift+Tab along the way to check out the docstrings for the functions we are using).

In [None]:
# a very simple plot, to get started. Notice that NO PANDAS is required!
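
A first plot along these lines might look like the sketch below. The arrays `x` and `y` are made-up illustration data, not values from the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 11 evenly spaced x values from 0 to 5, and their squares
x = np.linspace(0, 5, 11)
y = x ** 2

# plt.plot draws a line through the (x, y) points
# and returns a list of the Line2D objects it created
lines = plt.plot(x, y)
```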


In [None]:
# You can have two 'plt' plots together (and let's add some color).


In [None]:
# Now add some labels, plus a little texture.
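
The two cells above (two lines on one plot, then labels) can be sketched together as follows. The data, colors, and label text are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# Two calls to plt.plot draw on the same axes;
# color, line style, and marker are keyword arguments
plt.plot(x, y, color="red", linestyle="--", marker="o", label="y = x^2")
plt.plot(y, x, color="green", linestyle="-", marker="*", label="y = sqrt(x)")

# Labels, a title, and a legend are added to the current axes
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.title("Two lines on one plot")
plt.legend()
```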


**Creating Multiplots on Same Canvas**

In [None]:
# The basic syntax goes like this: plt.subplot(nrows, ncols, plot_number)
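
Here is a small sketch of that syntax with one row and two columns. The data and format strings are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# Select the first panel of a 1-row, 2-column grid and draw on it
plt.subplot(1, 2, 1)
plt.plot(x, y, "r--")

# Select the second panel of the same grid and draw on it
plt.subplot(1, 2, 2)
plt.plot(y, x, "g*-")
```

Each `plt.subplot` call makes that panel the "current" axes, so the `plt.plot` call that follows it draws there.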


___
### Matplotlib Object Oriented Method
Now that we've seen the basics, let's break it all down with a more formal introduction of Matplotlib's Object Oriented API. This means we will instantiate figure objects and then call methods or attributes from that object.

In Matplotlib, the figure (an instance of the class `plt.Figure`) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class `plt.Axes`) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. We'll commonly use the variable name `fig` to refer to a figure instance, and `ax` to refer to an axes instance or group of axes instances. Once we have created an axes, we can use the `ax.plot` function to plot some data.
[Source](https://jakevdp.github.io/PythonDataScienceHandbook/04.01-simple-line-plots.html)

**The `.figure()` method**  
To begin we create a figure instance. Then we can add axes to that figure:

In [None]:
# Create Figure (empty canvas)

# Add set of axes to figure

# Plot on that set of axes
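
The three steps in the cell above could be written as in this sketch. The data and the axes position are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# Create Figure (empty canvas)
fig = plt.figure()

# Add a set of axes to the figure: [left, bottom, width, height],
# each given as a fraction of the figure size (0 to 1)
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Plot on that set of axes and label it
axes.plot(x, y, "b")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_title("Object-oriented plot")
```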


The code is a little more complicated, but the advantage is that we now have full control of where the plot axes are placed, and we can easily add more than one set of axes to the figure:

In [None]:
# Creates blank canvas

# Larger Figure Axes 1

# Insert Figure Axes 2
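
A sketch of a figure with a main plot and a smaller inset follows. The positions passed to `add_axes` are arbitrary example values.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# Creates blank canvas
fig = plt.figure()

# Larger axes that fill most of the figure
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes1.plot(x, y, "b")
axes1.set_title("Main axes")

# Smaller axes placed inside the larger ones
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])
axes2.plot(y, x, "r")
axes2.set_title("Inset axes")
```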


**The `subplots()` method**

The `plt.subplots()` function acts as a more automatic axes manager. It is somewhat more common than using `.figure()`.

`plt.subplots()` is a function that returns a tuple containing a figure and axes object(s). Thus when using `fig, ax = plt.subplots()` you unpack this tuple into the variables fig and ax. Having fig is useful if you want to change figure-level attributes or save the figure as an image file later (e.g. with fig.savefig('yourfilename.png')). You certainly don't have to use the returned figure object but many people do use it later so it's common to see. Also, all axes objects (the objects that have plotting methods), have a parent figure object anyway, thus:
```
fig, ax = plt.subplots()
```
is more concise than this:
```
fig = plt.figure()
ax = fig.add_subplot(111)
```
[Source](https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots-in-matplotlib-pyplot-python)

In [None]:
# Use similar to plt.figure() except use tuple unpacking to grab fig and axes

# Now use the axes object to add stuff to plot
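
A minimal sketch of this pattern, with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# plt.subplots() with no arguments returns one figure and one axes object
fig, ax = plt.subplots()

# Use the axes object to draw and to label the plot
ax.plot(x, y, "r")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("title")
```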


You can also specify the number of rows and columns when calling `subplots()`:

In [None]:
# Empty canvas of 1 by 2 subplots


In [None]:
# Axes is an array of axes to plot on


We can iterate through this array:

In [None]:
# Display the figure object 


A common issue with matplotlib is overlapping subplots or figures. We can use the **fig.tight_layout()** or **plt.tight_layout()** method, which automatically adjusts the positions of the axes on the figure canvas so that there is no overlapping content:
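
The sketch below creates a 1 by 2 grid, loops over the returned array of axes, and then calls `tight_layout`. The data and label text are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# Empty canvas of 1 by 2 subplots; axes is a NumPy array of two Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2)

# Iterate through the array and draw the same data on each axes
for ax in axes:
    ax.plot(x, y, "b")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("title")

# Adjust spacing so labels and titles of neighboring axes do not overlap
fig.tight_layout()
```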

### Figure size, aspect ratio and DPI

Matplotlib allows the aspect ratio, DPI and figure size to be specified when the Figure object is created. You can use the `figsize` and `dpi` keyword arguments. 
* `figsize` is a tuple of the width and height of the figure in inches
* `dpi` is the dots per inch (pixels per inch). 

For example:

The same arguments can also be passed to layout managers, such as the `subplots` function:
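
Both variants are sketched below with arbitrary example sizes. With `figsize=(8, 4)` and `dpi=100`, the canvas would nominally be 800 by 400 pixels.

```python
import matplotlib.pyplot as plt

# Figure size in inches (width, height) and resolution in dots per inch
fig = plt.figure(figsize=(8, 4), dpi=100)

# The same keyword arguments are accepted by plt.subplots
fig2, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 3), dpi=100)
```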

## Saving figures
Matplotlib can generate high-quality output in a number of formats, including PNG, JPG, EPS, SVG, PGF and PDF. 

To save a figure to a file we can use the `savefig` method in the `Figure` class:

In [None]:
# save as a fig

Here we can also optionally specify the DPI and choose between different output formats:
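
A sketch of saving the same figure several ways follows. The file names are hypothetical, and the files are written to a temporary directory so the example does not clutter the working folder.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()
ax.plot(x, x ** 2)

out_dir = tempfile.mkdtemp()

# Save with the default resolution
png_path = os.path.join(out_dir, "filename.png")
fig.savefig(png_path)

# Save again with an explicit DPI for a higher-resolution image
hires_path = os.path.join(out_dir, "filename_200dpi.png")
fig.savefig(hires_path, dpi=200)

# The output format is inferred from the file extension
svg_path = os.path.join(out_dir, "filename.svg")
fig.savefig(svg_path)
```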

____
## Legends, labels and titles

Now that we have covered the basics of how to create a figure canvas and add axes instances to the canvas, let's look at how to decorate a figure with titles, axis labels, and legends.

**Figure titles**

A title can be added to each axes instance in a figure. To set the title, use the `set_title` method of the axes instance:

In [None]:
# set the title

**Axis labels**

Similarly, with the methods `set_xlabel` and `set_ylabel`, we can set the labels of the X and Y axes:

In [None]:
# x and y labels
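
The title and axis-label calls might be used together as in this short sketch. The label text is placeholder wording.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()
ax.plot(x, x ** 2)

# Title for this axes
ax.set_title("Squares of x")

# Labels for the X and Y axes
ax.set_xlabel("x")
ax.set_ylabel("x squared")
```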

### Legends

You can use the **label="label text"** keyword argument when plots or other objects are added to the figure, and then call the **legend** method without arguments to add the legend to the figure: 

In [None]:
# add a legend

The **legend** method takes an optional keyword argument **loc** that specifies where in the figure the legend is drawn. The allowed values of **loc** are location strings such as `'upper right'` or their equivalent numerical codes. See the [documentation page](http://matplotlib.org/users/legend_guide.html#legend-location) for details. Some of the most common **loc** values are:

In [None]:
# Lots of options....


# .. many more options are available

# Try replacing the `loc` value with integers 1 through 10.
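
The legend workflow can be sketched as follows, with made-up curves. The comments list the numerical codes for a few common locations.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()

# The label keyword supplies the text shown in the legend
ax.plot(x, x ** 2, label="x**2")
ax.plot(x, x ** 3, label="x**3")

# Common loc values (string or numerical code):
#   "best" = 0 (matplotlib picks the position)
#   "upper right" = 1, "upper left" = 2
#   "lower left" = 3, "lower right" = 4
legend = ax.legend(loc="upper left")
```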


## Setting colors, linewidths, linetypes

Matplotlib gives you *a lot* of options for customizing colors, linewidths, and linetypes. 

There is the basic MATLAB-like syntax (which I would suggest you avoid, for clarity's sake):

### Colors with MATLAB-like syntax

With matplotlib, we can define the colors of lines and other graphical elements in a number of ways. First of all, we can use the MATLAB-like syntax where `'b'` means blue, `'g'` means green, etc. The MATLAB API for selecting line styles is also supported: for example, `'b.-'` means a blue line with dots:

In [None]:
# MATLAB style line color and style 
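
A short sketch of the format-string shorthand, using made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()

# "b.-" means: blue, dot markers, solid line
blue_line, = ax.plot(x, x ** 2, "b.-")

# "g--" means: green, dashed line, no markers
green_line, = ax.plot(x, x ** 3, "g--")
```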

### Colors with the color= parameter

We can also define colors by their names or RGB hex codes and optionally provide an alpha value using the `color` and `alpha` keyword arguments. Alpha indicates opacity.
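
For instance, the sketch below uses a color name, two hex codes, and an alpha value. The specific colors are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()

# Named color, drawn half-transparent
named, = ax.plot(x, x + 1, color="blue", alpha=0.5)

# RGB hex codes
purple, = ax.plot(x, x + 2, color="#8B008B")
orange, = ax.plot(x, x + 3, color="#FF8C00")
```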

### Line and marker styles

To change the line width, we can use the `linewidth` or `lw` keyword argument. The line style can be selected using the `linestyle` or `ls` keyword arguments:

In [None]:
# possible linestyle options: '-', '--', '-.', ':' (older matplotlib versions also accepted 'steps'; newer ones use drawstyle='steps')

# custom dash

# possible marker symbols: marker = '+', 'o', '*', 's', ',', '.', '1', '2', '3', '4', ...

# marker size and color
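
The options listed in the cell above could be combined as in this sketch. All values are arbitrary examples.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots(figsize=(12, 6))

# Line width, via the long and the short keyword
ax.plot(x, x + 1, color="red", linewidth=0.25)
ax.plot(x, x + 2, color="red", lw=2.0, linestyle="--")

# Custom dash pattern: line length, space length, line length, space length
dashed, = ax.plot(x, x + 3, color="black", lw=1.5)
dashed.set_dashes([5, 10, 15, 10])

# Marker symbol, size, and colors
ax.plot(
    x, x + 4, color="blue", lw=1, ls="-",
    marker="o", markersize=8,
    markerfacecolor="yellow", markeredgecolor="green", markeredgewidth=2,
)
```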

### Control over axis appearance

In this section we will look at controlling axis sizing properties in a matplotlib figure.

## Plot range

We can configure the ranges of the axes using the `set_ylim` and `set_xlim` methods of the axes object, or `axis('tight')` for automatically getting "tightly fitted" axes ranges:
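
The three options are compared side by side in the sketch below, using made-up data and arbitrary limits.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Default ranges chosen by matplotlib
axes[0].plot(x, x ** 2, x, x ** 3)
axes[0].set_title("default axes ranges")

# Ranges fitted tightly around the data
axes[1].plot(x, x ** 2, x, x ** 3)
axes[1].axis("tight")
axes[1].set_title("tight axes")

# Ranges set by hand
axes[2].plot(x, x ** 2, x, x ** 3)
axes[2].set_ylim([0, 60])
axes[2].set_xlim([2, 5])
axes[2].set_title("custom axes range")
```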

## Imitate a real-world example

Today we will reproduce this [example by FiveThirtyEight](https://fivethirtyeight.com/features/al-gores-new-movie-exposes-the-big-flaw-in-online-movie-ratings/):



In [None]:
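from IPython.display import Image  # assumption: the lost line displayed the image inline

url = 'https://fivethirtyeight.com/wp-content/uploads/2017/09/mehtahickey-inconvenient-0830-1.png'
Image(url=url)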

Using this data: https://github.com/fivethirtyeight/data/tree/master/inconvenient-sequel

Links
- [Strong Titles Are The Biggest Bang for Your Buck](http://stephanieevergreen.com/strong-titles/)
- [Remove to improve (the data-ink ratio)](https://www.darkhorseanalytics.com/blog/data-looks-better-naked)
- [How to Generate FiveThirtyEight Graphs in Python](https://www.dataquest.io/blog/making-538-plots/)

### Make fake prototypes

This helps us understand the problem.

In [None]:
# what styles are available in matplotlib? There's one for 538.


In [None]:
# Create fake data to replicate the blog post figure.


In [None]:
fake2 = pd.Series(
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     2, 2, 2, 
     3, 3, 3,
     4, 4,
     5, 5, 5,
     6, 6, 6, 6,
     7, 7, 7, 7, 7,
     8, 8, 8, 8,
     9, 9, 9, 9, 
     10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
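
A quick prototype from fake data like this might look as follows. This is only a sketch: the counts in `counts` are made up to give a U-shaped distribution, and the variable names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Switch to the FiveThirtyEight style (this changes the style for later plots too)
plt.style.use("fivethirtyeight")

# Hypothetical number of votes for each rating from 1 to 10
counts = {1: 24, 2: 3, 3: 3, 4: 2, 5: 3, 6: 4, 7: 5, 8: 4, 9: 4, 10: 18}

# Expand the counts into one entry per vote
fake = pd.Series([rating for rating, n in counts.items() for _ in range(n)])

# Count each rating, order by rating rather than frequency, and draw bars
rating_counts = fake.value_counts().sort_index()
ax = rating_counts.plot.bar(color="C1", width=0.9)
```

Calling `sort_index()` matters here: `value_counts()` orders by frequency, which would scramble the rating axis.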


### Annotate with text

In [None]:
fig, ax = plt.subplots()
fig.patch.set(facecolor="white")

# Set the bars

# Set the title and subtitle

# Set the x and y axes labels

# Fix the x and y axis tick marks and grid
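
One way the annotation steps could be filled in is sketched below. The percentages are invented, the bar color and text coordinates are guesses, and the title and subtitle strings are placeholders that approximate the article's chart, so check them against the original.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical percent of votes for each rating from 1 to 10
percent = pd.Series([38, 3, 2, 1, 2, 4, 6, 5, 5, 34], index=range(1, 11))

fig, ax = plt.subplots()
fig.patch.set(facecolor="white")
ax.set(facecolor="white")

# Set the bars
ax.bar(percent.index, percent.values, color="#ED713A", width=0.9)

# Set the title and subtitle as free text placed above the axes (data coordinates)
ax.text(x=-1.2, y=46, s="'An Inconvenient Sequel: Truth To Power' is divisive",
        fontweight="bold", fontsize=12)
ax.text(x=-1.2, y=43, s="IMDb ratings for the film as of Aug. 29", fontsize=11)

# Set the x and y axes labels
ax.set_xlabel("Rating", fontsize=9, fontweight="bold")
ax.set_ylabel("Percent of total votes", fontsize=9, fontweight="bold")

# Fix the x and y axis tick marks
ax.set_xticks(range(1, 11))
ax.set_yticks(range(0, 50, 10))
ax.set_yticklabels(["0", "10", "20", "30", "40%"])
```

Using `ax.text` instead of `ax.set_title` gives control over the exact position, which makes it possible to left-align a title and subtitle the way the article does.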


### Reproduce with real data

Using this dataset relies on us making two discoveries:

1) The dataset logs the ratings at different timestamps, and the logs are cumulative: each timestamp contains all of the ratings from earlier timestamps, with the new ones added on top.
2) The dataset logs a ratings breakdown for a bunch of different demographic groups per timestamp.

Once we realize these two things, we see that we only need one line of this dataset to make our graphic: the last line, which holds the ratings for all IMDb users at the very last timestamp.

In [None]:
# read the data from 538's github repo
# 'https://raw.githubusercontent.com/fivethirtyeight/data/master/inconvenient-sequel/ratings.csv'

In [None]:
# Convert timestamps strings to actual datetime objects

In [None]:
# Use the timestamp as the unique index identifier 
# so that we can select rows by timestamp

In [None]:
# grab only the rows corresponding to the last day

In [None]:
# get the demographic breakdowns for all IMDb users on the last day

In [None]:
# just grab the very last line (latest timestamp) of IMDb user ratings
# this should be the most up to date data from the dataset

In [None]:
# Grab only the percentage columns since we don't care about the raw 
# counts in making our graph

In [None]:
# Reset the index so that it's numeric again
# and rename the percent column for easy access in our plotting
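
The data-preparation steps in the cells above are sketched below on a tiny stand-in DataFrame, so the example runs without downloading anything. The column names (`timestamp`, `category`, `1_votes` to `10_votes`, `1_pct` to `10_pct`) and the category label `"IMDb users"` are assumptions about the real file's layout; verify them against the CSV before reusing this.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv (three timestamps, two groups each)
timestamps = ["2017-08-28 10:00:00", "2017-08-29 10:00:00", "2017-08-29 22:00:00"]
rows = []
for ts in timestamps:
    for category in ["Males", "IMDb users"]:
        row = {"timestamp": ts, "category": category}
        for n in range(1, 11):
            row[f"{n}_votes"] = n
            row[f"{n}_pct"] = round(100 * n / 55, 1)
        rows.append(row)
df = pd.DataFrame(rows)

# Convert timestamp strings to datetime objects and use them as the index
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp").sort_index()

# Grab only the rows logged on the last calendar day
last_day = df[df.index.normalize() == df.index.max().normalize()]

# Keep the breakdown for all IMDb users, then only its latest row
imdb_users = last_day[last_day["category"] == "IMDb users"]
final = imdb_users.tail(1)

# Keep the ten percentage columns, turn them into rows, and rename the column
pct_cols = [f"{n}_pct" for n in range(1, 11)]
plot_data = final[pct_cols].T.reset_index(drop=True)
plot_data.columns = ["percent"]
plot_data.index = plot_data.index + 1  # index now equals the rating, 1 to 10
```

With the real file, the stand-in construction would be replaced by a `pd.read_csv` call on the URL given in the first cell; the remaining steps would stay the same if the column assumptions hold.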

**generate the figure**

In [None]:
fig, ax = plt.subplots()

# Figure background color

# Set the bars

# Axes background color

# Set the title and subtitle

# Set the x and y axes labels

# Fix the x and y axis tick marks and grid
