# Matplotlib

## Introduction

In this notebook, we'll explore Matplotlib and the various visualizations it offers.

There is a fair bit of Python code used to render the charts.
If Python scares you, we suggest you don't go through this book (or simply run the cells and ignore the code).

## Imports

We'll start by importing the libraries we'll use. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

## Some Simple (and Not So Simple) Stuff

### Introduction

Let's start by exploring a simple line plot.

### The Data

First, let's create some data.
We'll use the `numpy` library (it is super fast and convenient). 

The code below creates a simple array of data by generating 100 random integers from 1 to 99 (the upper bound passed to `randint` is exclusive). We then print the type of the data object and the data itself so that you can see what it is:

In [None]:
data = np.random.randint(1,100,100)
print(type(data))
print(data)
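
One detail worth checking is the upper bound: `np.random.randint(low, high, size)` excludes `high`, so the call above can return 99 but never 100. Here is a minimal sketch to confirm this. The seed and the sample size of 10,000 are arbitrary choices made for illustration and are not part of the original cell:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable

# Same bounds as in the cell above, but a larger sample
sample = np.random.randint(1, 100, 10_000)
print(sample.min(), sample.max())  # never below 1, never above 99

# To allow 100 as well, raise the upper bound by one
inclusive = np.random.randint(1, 101, 10_000)
print(inclusive.max())
```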

### A Trick to Get Help

By the way, we'll use numpy quite frequently (and so will you when you get excited about Python). 
If you don't already know, let's take a look at a trick to learn some of these libraries.

In a notebook, you can prepend any Python function with a question mark (`?`) to get its documentation. 
For example, to see what the `np.random.randint` function does, we can simply run this:

In [None]:
?np.random.randint
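
The `?` prefix is a feature of IPython and Jupyter, so it will not work in a plain Python script. As a rough equivalent that works anywhere, the built-in `help()` function prints the same documentation, and `inspect.getdoc` returns it as a string. The sketch below prints only the first few lines to keep the output short:

```python
import inspect
import numpy as np

# help(np.random.randint) would print the full documentation in any Python session.
# inspect.getdoc returns the same docstring as a string, so we can slice it.
doc = inspect.getdoc(np.random.randint)
print("\n".join(doc.splitlines()[:5]))  # first few lines only
```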

### Creating a Simple Plot

There are many ways to create plots. 
In most of the other notebooks we've seen, we started with some data structure (e.g., a pandas `DataFrame`), but this time we'll focus mostly on the plotting library, so we'll be a bit more primitive.


We'll start by creating an empty plot (a `Figure` that contains an instance of `Axes`). 
The plotting library has a function called `subplots` that we can use.
This function returns a tuple, so we'll do a tuple assignment:

In [None]:
fig, ax = plt.subplots()
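
To make the tuple assignment concrete, here is a small sketch that inspects what `plt.subplots` returns. The variable names (`demo_fig`, `demo_ax`, `grid_fig`, `grid_axes`) and the 1x2 layout are made up for this example:

```python
import matplotlib.pyplot as plt

demo_fig, demo_ax = plt.subplots()
print(type(demo_fig).__name__)  # the figure: the whole canvas
print(type(demo_ax).__name__)   # the axes: the area where data is drawn

# When nrows/ncols ask for several plots, the second element is a NumPy array of axes
grid_fig, grid_axes = plt.subplots(nrows=1, ncols=2)
print(grid_axes.shape)
```

Most of the drawing methods used in this notebook (`plot`, `set_title`, `legend`) live on the axes object, while the figure is what gets rendered as a whole.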

That created a simple graphic with no content.
Obviously that is not our end goal, so let's continue decorating the simple graphic with some labels and a title:

In [None]:
ax.set_title("Some Numbers")
ax.set_ylabel("value")
ax.set_xlabel("array index")
fig

Nice... 
We've put the labels and title in. What about some numbers?

In [None]:
ax.plot(data, label='random')
fig

We now have plotted the numbers.
What if we wanted some more graphs, perhaps to compare data?

Let me create a few things (I've explained the code inline as comments below):

In [None]:
x = np.linspace(0,100,100)            # Create 100 evenly spaced values from 0 to 100 using numpy
ax.plot(x, label='increasing')        # Plot the values against their index
ax.plot(x, 100-x, label='decreasing') # Plot a formula based on the values (100-x)
ax.plot(x, x-x+50, label='constant')  # Plot another formula yielding 50 (constant)
ax.legend()                           # Let's add a legend to make the graph more readable
fig


## General Plotter

### A Heart

You can use `matplotlib` to render any shape... 
In fact, you can use it as an old-fashioned plotter.

Let's start by creating a heart:

In [None]:
t = np.arange(0,2*np.pi, 0.1) 
x = 16*np.sin(t)**3 
y = 13*np.cos(t)-5*np.cos(2*t)-2*np.cos(3*t)-np.cos(4*t) 
plt.plot(x,y) 
plt.show() 

### A Spiral

Or perhaps a spiral (source: matplotlib.org):

In [None]:
theta = np.arange(0, 8*np.pi, 0.1)
a = 1
b = .5

for dt in np.arange(0, 2*np.pi, np.pi/2.0):

    x = a*np.cos(theta + dt)*np.exp(b*theta)
    y = a*np.sin(theta + dt)*np.exp(b*theta)

    dt = dt + np.pi/4.0

    x2 = a*np.cos(theta + dt)*np.exp(b*theta)
    y2 = a*np.sin(theta + dt)*np.exp(b*theta)

    xf = np.concatenate((x, x2[::-1]))
    yf = np.concatenate((y, y2[::-1]))

    p1 = plt.fill(xf, yf)

plt.show()

### Let's Get Back to Business

That was fun. We don't typically create hearts or spirals at PayPal, but I think it is important to see that `matplotlib` is a generic drawing tool that allows you to draw pretty much anything!

However, we should get back to more typical PayPal tasks. 
What is different going forward is that we would typically start with a `DataFrame`. 

For those of you who are already fluent in `matplotlib`, you may want to take advantage of more low-level manipulation of the plots, but for most business cases, we are better off starting with a `DataFrame`.

## Some More Data

We will start by loading some data.
We'll use a dataset that we've used before, namely the 5000 sales transactions.

Let's load the data with the `read_csv` function from pandas.
The result is a `DataFrame`; we'll print the first 5 rows so that we can remember what was in the dataset:

In [None]:
sales = pd.read_csv('/data/5000-sales-records.csv')
sales.head()

### A Simple Pie Chart

Let's start by displaying a simple pie chart with the profits by region.

We'll convert the data into a simple pivot table where we aggregate the profit by region:

In [None]:
# Passing the string "sum" avoids the warning newer pandas versions raise for np.sum
profit_by_region = sales[["Region", "Total Profit"]] \
    .pivot_table(index="Region", aggfunc="sum")
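
To see what the pivot table computes, here is a sketch on a tiny made-up `DataFrame`. The `toy` data and its values are hypothetical and not taken from the sales file. It also shows that a `groupby` gives the same totals:

```python
import pandas as pd

# Hypothetical mini dataset with the same two columns
toy = pd.DataFrame({
    "Region": ["Asia", "Europe", "Asia", "Europe", "Asia"],
    "Total Profit": [10.0, 20.0, 5.0, 7.5, 1.0],
})

# One row per region, with the profits in that region added up
by_pivot = toy.pivot_table(index="Region", values="Total Profit", aggfunc="sum")

# The same aggregation expressed as a groupby
by_group = toy.groupby("Region")[["Total Profit"]].sum()

print(by_pivot)
print(by_group)
```

Either form works as input for the charts that follow. The pivot table is kept here because it matches the earlier notebooks.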

Creating a pie chart is simple.
We can access the `plot` property on the `DataFrame` and create a pie chart:

In [None]:
profit_by_region.plot.pie(y="Total Profit");

That is what we want, but what's up with the legends?

Often, as a data analyst, it is easy to get to a simple chart, but to make it look good may take a little research.

In this case, we can manipulate the legend's placement using a few parameters to the `legend()` function.

How do we do that? Well, if you haven't done it for some time, you probably have to look it up.
You can find structured documentation for matplotlib here (https://matplotlib.org/), but a simple Google search will probably be the fastest path to success in almost all cases.

Here is what I had to do to make it look a little nicer:

In [None]:
profit_by_region \
    .plot.pie(y="Total Profit") \
    .legend(loc="upper left", bbox_to_anchor=(1,1) );
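
A rough way to read those two parameters: `bbox_to_anchor=(1, 1)` names a point in axes coordinates (here the top-right corner of the plot area), and `loc="upper left"` says which corner of the legend box is pinned to that point. The sketch below uses a small made-up `DataFrame` (`toy_profit`, with hypothetical values) so that it runs without the sales file:

```python
import pandas as pd

# Hypothetical data, only for illustration
toy_profit = pd.DataFrame(
    {"Total Profit": [3.0, 2.0, 1.0]},
    index=["Asia", "Europe", "Africa"],
)

pie_ax = toy_profit.plot.pie(y="Total Profit")

# Pin the legend's upper-left corner to the top-right corner of the axes,
# which places the legend just outside the pie
legend = pie_ax.legend(loc="upper left", bbox_to_anchor=(1, 1))
```

Changing the anchor to, say, `(1, 0.5)` together with `loc="center left"` would center the legend vertically next to the chart.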

Truth is, with the labels on the slices, we probably don't need the legend at all:

In [None]:
l = profit_by_region.plot.pie(y="Total Profit").legend()
l.remove()
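
Another option, sketched here on made-up data, is to ask pandas not to draw the legend in the first place by passing `legend=False` to the plot call. The `toy_profit` frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical data, only for illustration
toy_profit = pd.DataFrame(
    {"Total Profit": [3.0, 2.0, 1.0]},
    index=["Asia", "Europe", "Africa"],
)

# legend=False skips the legend, so there is nothing to remove afterwards
no_legend_ax = toy_profit.plot.pie(y="Total Profit", legend=False)
print(no_legend_ax.get_legend())
```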

### A Simple Bar Chart

Let's render the data in a bar chart as well and see what we have to do to make that pretty.

Let's start by using the defaults:

In [None]:
profit_by_region.plot.bar();

OK, that is not too bad. Perhaps we want to display the numbers on the y-axis in millions of USD and remove the legend.

We already removed the legend on the pie chart, so let's start there and see if the same approach works on a bar chart as well!

In [None]:
l = profit_by_region.plot.bar().legend()
l.remove()

Great, good to see that the library is consistent.

Now, how do we change the y-axis to display Total Profit in million USD?
The formatting of the values on the y-axis is a bit trickier and requires some Python coding.

Without further ado, let me do it and explain the code in the comments.

In [None]:
# First, let's define a function that does the formatting.
# This function may look strange, but matplotlib calls a formatter function
# for each tick label, and we can supply our own. It takes two parameters.
def format_mill_usd(x, pos):
    'where x is the value and pos is the tick position'
    return '$%1.fM' % (x * 1e-6)

# Matplotlib also defines a wrapper for the formatter
from matplotlib.ticker import FuncFormatter

# Let's wrap the format_mill_usd function
formatter = FuncFormatter(format_mill_usd)

bar = profit_by_region.plot.bar()               # create the bar chart
bar.get_yaxis().set_major_formatter(formatter)  # set the formatter for the y axis
bar.get_legend().remove()                       # remove the legend
bar.set_ylabel("Profit in USD")
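
Since the formatter is just a function from a tick value to a string, it can be tried out directly before attaching it to a chart. The sketch below repeats the function from the cell above and adds a hypothetical variant, `format_mill_usd_1dp`, that keeps one decimal place. The sample values are made up:

```python
from matplotlib.ticker import FuncFormatter

def format_mill_usd(x, pos):
    """Format a tick value x (in USD) as whole millions; pos is the tick position."""
    return '$%1.fM' % (x * 1e-6)

def format_mill_usd_1dp(x, pos):
    """Hypothetical variant that keeps one decimal place."""
    return f'${x * 1e-6:.1f}M'

demo_formatter = FuncFormatter(format_mill_usd)

print(format_mill_usd(3_000_000, 0))
print(format_mill_usd_1dp(1_500_000, 0))
print(demo_formatter(40_000_000, 0))  # the wrapper is callable like the function it wraps
```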

As I look at the graph above, the very long labels on the x-axis look a bit off.

Perhaps as a final touch, we'll shorten the labels a bit:

In [None]:
bar.set_xticklabels(["Asia", "Australia", "Central America", "Europe", "Middle East", "North America", "Africa"])
bar.get_figure()
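
Passing a hard-coded list relies on the labels being in exactly the same order as the bars. A slightly more defensive sketch is to rename the index with a dictionary before plotting. The region names and values below are made up for illustration and may not match the names in the sales file:

```python
import pandas as pd

# Hypothetical data with long region names
toy_profit = pd.DataFrame(
    {"Total Profit": [3.0e6, 2.0e6, 1.0e6]},
    index=["Australia and Oceania", "Europe", "Sub-Saharan Africa"],
)

# Map only the labels we want to shorten
short_names = {
    "Australia and Oceania": "Australia",
    "Sub-Saharan Africa": "Africa",
}

# rename() only touches labels found in the mapping; the others stay unchanged
shortened = toy_profit.rename(index=short_names)
short_bar = shortened.plot.bar(legend=False)
print(list(shortened.index))
```

With this approach each short label is tied to its long name, so a change in row order cannot silently mislabel a bar.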

## Styles

### What are styles?

The graphics we have produced thus far used the default rendering style. 
You can change the style of the charts.

To see an exhaustive set of styles, check this reference https://matplotlib.org/3.3.1/gallery/style_sheets/style_sheets_reference.html.

### Try out styles

Let's start by trying out a few styles and creating a few charts.

Just so that you can play with the various styles, let's create a function that we can call below to render a chart with a particular style:

In [None]:
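def create_chart():
    data = np.random.randint(1,100,100)
    f, x = plt.subplots()
    x.plot(data)
    return f                      # return the figure so the caller can use it

def get_chart_with_style(style):
    plt.style.use(style)          # switch to the requested style
    create_chart()
    plt.show()                    # render the chart while the style is active
    plt.style.use('default')      # restore the default style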
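
As an alternative sketch, matplotlib also provides `plt.style.context`, a context manager that applies a style only inside a `with` block and restores the previous settings afterwards. The function name `chart_with_style_context` is made up for this example:

```python
import matplotlib.pyplot as plt
import numpy as np

def chart_with_style_context(style):
    """Create a line chart of random numbers using the given style."""
    with plt.style.context(style):
        fig, ax = plt.subplots()
        ax.plot(np.random.randint(1, 100, 100))
    return fig

before = plt.rcParams["axes.facecolor"]
styled_fig = chart_with_style_context("ggplot")
after = plt.rcParams["axes.facecolor"]

print(before == after)  # checks that the style did not leak outside the with block
```

The difference from the function above is that the previous settings are restored, whatever they were, instead of always resetting to `'default'`.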

Currently, we use a style called default. 
Let's just run the function with the default style first.

In [None]:
get_chart_with_style('default')

Let's try another style:

In [None]:
get_chart_with_style('Solarize_Light2')

That looks quite different.
Let's try another style:

In [None]:
get_chart_with_style('grayscale')

Yet another:

In [None]:
get_chart_with_style('ggplot')

Feel free to try out the other styles supported by `matplotlib`.

You can find the available styles by running the cell below:

In [None]:
plt.style.available

## A Few More Charts (in Rapid Fire Mode)

Let's try out a few more charts that are supported directly by `pandas`.

### Area Plot

In [None]:
area = profit_by_region.plot.area()
area.legend()

### Horizontal Bar Chart

In [None]:
profit_by_region.plot.barh()

### Hexbin

We don't have a good data set for this, so I (shamelessly) copied some code from the `matplotlib` documentation (see: https://matplotlib.org/gallery/statistics/hexbin_demo.html#sphx-glr-gallery-statistics-hexbin-demo-py):

In [None]:
# Fixing random state for reproducibility
np.random.seed(19680801)

n = 100000
x = np.random.standard_normal(n)
y = 2.0 + 3.0 * x + 4.0 * np.random.standard_normal(n)
xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

fig, axs = plt.subplots(ncols=2, sharey=True, figsize=(7, 4))
fig.subplots_adjust(hspace=0.5, left=0.07, right=0.93)
ax = axs[0]
hb = ax.hexbin(x, y, gridsize=50, cmap='inferno')
ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax))
ax.set_title("Hexagon binning")
cb = fig.colorbar(hb, ax=ax)
cb.set_label('counts')

ax = axs[1]
hb = ax.hexbin(x, y, gridsize=50, bins='log', cmap='inferno')
ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax))
ax.set_title("With a log color scale")
cb = fig.colorbar(hb, ax=ax)
cb.set_label('log10(N)')

plt.show()


### Histogram

Let's use a slightly different dataset for this to make it interesting. 
Let's look back at the sales transactions and see how the Unit Cost is distributed.

In [None]:
it = sales[["Unit Cost"]].plot.hist()
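
A histogram simply counts how many values fall into each of a set of equal-width bins (pandas uses 10 bins unless told otherwise). The sketch below uses made-up, normally distributed numbers rather than the sales data, and calls `np.histogram` to show the counts behind the picture. The seed, the distribution parameters, and `bins=20` are arbitrary choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
toy_costs = pd.DataFrame({"Unit Cost": rng.normal(loc=200, scale=50, size=1_000)})

# The numbers behind the chart: one count per bin, plus the bin edges
counts, edges = np.histogram(toy_costs["Unit Cost"], bins=20)
print(counts.sum())  # every value lands in exactly one bin
print(len(edges))    # there is one more edge than there are bins

# The same binning drawn as a chart
hist_ax = toy_costs.plot.hist(bins=20)
```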


### Density plot

Density plots are related to histograms. For a deeper discussion, see this article: https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html

In [None]:
profit_by_region.plot.density()

### Kernel Density Estimation (KDE) Diagram

KDE is a non-parametric way to estimate a random variable's probability density function (see Wikipedia for more details https://en.wikipedia.org/wiki/Kernel_density_estimation).

KDE plots are closely related to histograms but can be endowed with properties such as smoothness or continuity by choosing a suitable kernel.

Let's look at Unit Cost again and plot it in a KDE diagram.

In [None]:
it = sales[["Unit Cost"]].plot.kde()
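
To give a feel for what the estimate is, here is a hand-rolled sketch of a Gaussian KDE in plain NumPy: each data point contributes a small bell curve, and the curves are averaged. The sample data, the helper name `gaussian_kde_sketch`, and the bandwidth of 0.4 are assumptions made for illustration. pandas picks its bandwidth automatically (via SciPy), so its curve will generally differ in detail:

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    # Distance of every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)   # arbitrary seed
samples = rng.normal(size=200)   # made-up data
grid = np.linspace(-6, 6, 601)

density = gaussian_kde_sketch(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1; check with a simple Riemann sum
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows the individual data points more closely.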

### Line Chart

In [None]:
profit_by_region.plot.line()

### Scatter Chart

In [None]:
d = sales[["Unit Price", "Unit Cost"]].plot.scatter(x="Unit Price", y="Unit Cost")