# Advanced graphing

Most Matplotlib tutorials cover the same set of standard plots: bar charts, scatter plots, line graphs. My lectures cover more advanced applications of data visualizations. Let's create some of those.

For a quick refresher, read and run these cells.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
n = np.array([0,1,2,3,4,5])
xx = np.linspace(-0.75, 1., 100)
x = np.linspace(0, 5, 11)
y = x ** 2

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(12,3))

axes[0].scatter(xx, xx + 0.25*np.random.randn(len(xx)))
axes[0].set_title("scatter")

axes[1].step(n, n**2, lw=2)
axes[1].set_title("step")

axes[2].bar(n, n**2, align="center", width=0.5, alpha=0.5)
axes[2].set_title("bar")

axes[3].fill_between(x, x**2, x**3, color="green", alpha=0.5)
axes[3].set_title("fill_between")

## Dot chart

We looked at the dot chart (or dot plot) as a solution to bar charts with long bars and small differences. Conceptually, this is a scatter plot with one categorical axis and distinctive horizontal grid lines. (We can customize the existing grid lines, or just draw some new lines ourselves.)

*(Example copied from https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/#17.-Dot-Plot)*

In [None]:
%matplotlib inline

# Prepare data
import pandas as pd

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# average city mileage per manufacturer (.mean() works directly on the grouped numeric column)
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
df.head()

Let's start with just a scatter plot:

In [None]:
# Draw plot
fig, ax = plt.subplots()

# Draw the horizontal lines
# TODO

# Plot the data points
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

# Title, Label, Ticks and Xlim
ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})  # the data plotted is 'cty'

ax.set_xlabel('Miles Per Gallon')

ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})

ax.set_xlim(10, 27)  # set the horizontal axis to [10, 27]
# plt.show()

With this many data points, it's a little hard to read which is which. You can figure it out, but some sort of visual line will make it easier.

### Task

We could manipulate the regular grid lines, but that gets fiddly. Let's draw some lines instead. We'll use `ax.hlines`: a method to draw **H**orizontal lines on an axis.

In [None]:
help(ax.hlines)

Here is one way to do that with our example:
    
```python
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray',
          alpha=0.7, linewidth=1, linestyles='dashdot')
```
          
The first three parameters say where to draw the lines, and the rest is just optional styling. Copy that into the code above and re-run.

Edit any other parameters you want to make the graph more to your liking. When you're satisfied, move on to the next item.
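
For comparison, here is a minimal sketch of the other route mentioned above: styling the existing grid lines instead of drawing new ones. The labels and values are placeholders made up for illustration, not the MPG data, so the cell runs on its own.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (not the real MPG data)
labels = ["Maker A", "Maker B", "Maker C", "Maker D"]
values = [12.5, 16.0, 19.3, 24.1]
positions = list(range(len(labels)))

fig, ax = plt.subplots()
ax.scatter(x=values, y=positions, s=75, color='firebrick', alpha=0.7)

# One tick per category, so the grid gets one horizontal line per row
ax.set_yticks(positions)
ax.set_yticklabels(labels)

# Style only the horizontal grid lines (axis='y') and keep them behind the dots
ax.grid(axis='y', color='gray', alpha=0.7, linewidth=1, linestyle='dashdot')
ax.set_axisbelow(True)
```

One difference to keep in mind: grid lines always span the full width of the plot, while `ax.hlines` lets you choose where each line starts and stops with `xmin` and `xmax`.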

## Dot-dash plot

This is the scatterplot technique where the axis tick marks are drawn for each data point instead of some arbitrary interval. Doing that can reveal the marginal distribution of the data in addition to its joint distribution in the scatter.

At its core, a dot-dash plot is a normal scatter plot with customized tick marks and tick mark labels.

This example re-uses the MPG data from the dot plot, so you may need to re-run the 'Prepare data' cell if you've run other examples before getting here.

In [None]:
df_raw.head()

In [None]:
# Draw plot
fig, ax = plt.subplots()

# Plot the data points
ax.scatter(y=df_raw.cty, x=df_raw.hwy, s=50, color='blue', alpha=0.5)

ax.set_xlabel('Highway MPG')  # x is df_raw.hwy
ax.set_ylabel('City MPG')     # y is df_raw.cty

To customize where the tick marks are drawn, we use `ax.set_xticks` and `ax.set_yticks`.

In [None]:
xvals = df_raw.hwy.unique()  # using unique so we don't bother drawing multiple in the same place

In [None]:
# Draw plot
fig, ax = plt.subplots()

# Plot the data points
ax.scatter(y=df_raw.cty, x=df_raw.hwy, s=50, color='blue', alpha=0.5)

ax.set_xlabel('Highway MPG')  # x is df_raw.hwy
ax.set_ylabel('City MPG')     # y is df_raw.cty

ax.set_xticks(xvals);

That works, but looks ugly, because labels are drawn on each tick mark by default. (Also, this data set is all integers over a small range, so it's not as powerful a visual effect.)

We can't directly tell Matplotlib to draw labels on only some of the tick marks; that's just not how the graph objects are designed. Instead, for the ticks where we don't want a label, we draw `''`, the empty string.

One tidy way to do this is with a formatter function:

In [None]:
xtick_labels_we_want = [12, 20, 30, 37, 41, 44]

# based on https://matplotlib.org/stable/gallery/ticks_and_spines/tick_labels_from_values.html
def format_xticks(tick_val, tick_pos):
    # Matplotlib passes tick values as floats, so convert to int to show '12' rather than '12.0'
    if tick_val in xtick_labels_we_want:
        return str(int(tick_val))
    else:
        return ''
    

In [None]:
# Draw plot
fig, ax = plt.subplots()

# Plot the data points
ax.scatter(y=df_raw.cty, x=df_raw.hwy, s=50, color='blue', alpha=0.5)

ax.set_xlabel('Highway MPG')  # x is df_raw.hwy
ax.set_ylabel('City MPG')     # y is df_raw.cty

ax.set_xticks(xvals)

# tell Matplotlib to use our custom formatter when drawing the tick labels
ax.xaxis.set_major_formatter(format_xticks);
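
If a formatter function feels like too much machinery, the same "empty string" idea can be sketched with an explicit list of labels, one per tick. The data below is a small set of placeholder integers made up for illustration (not the MPG data), and the variable names are just for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder integer data for illustration only (not the MPG data set)
x_data = np.array([12, 15, 15, 20, 24, 27, 30, 30, 37])
y_data = np.array([9, 11, 12, 14, 17, 18, 21, 22, 26])

tick_positions = np.unique(x_data)   # one tick per distinct x value
labels_we_want = [12, 20, 30, 37]    # only these ticks get a visible label

# Build one label per tick: the value itself, or '' to leave the tick bare
tick_labels = [str(v) if v in labels_we_want else '' for v in tick_positions]

fig, ax = plt.subplots()
ax.scatter(x=x_data, y=y_data, s=50, color='blue', alpha=0.5)
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels)
```

The trade-off: this list is tied to the exact tick positions, so it has to be rebuilt whenever the ticks change, while a formatter function is consulted again for whatever ticks end up being drawn.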

### Task

Oops, we only did this for the horizontal axis. Do the same for the vertical, please! Then, remove the 'spines' from the axes. There's an example in the data visualization lab.

Make any other customizations you desire before moving on.

## Slopegraphs

Slopegraphs are like line charts where we only care about two (or more, but usually two) points for a series, such as the GDP of a country in 2000 and 2020.

In [None]:
# Data extracted from https://raw.githubusercontent.com/Thiagobc23/slope-charts-Matplotlib/main/data/UNdata_gdp.csv
# Example based on https://towardsdatascience.com/slope-charts-with-pythons-matplotlib-2c3456c137b8
raw_csv = '''"Country or Area","Year","Item","Value"
"Afghanistan","2019","Gross Domestic Product (GDP)","469.919090127469"
"Afghanistan","2018","Gross Domestic Product (GDP)","483.885874505381"
"Albania","2019","Gross Domestic Product (GDP)","5303.19782273234"
"Albania","2018","Gross Domestic Product (GDP)","5254.3847977623"
"Algeria","2019","Gross Domestic Product (GDP)","3975.51038119501"
"Algeria","2018","Gross Domestic Product (GDP)","4153.9572199454"
"Andorra","2019","Gross Domestic Product (GDP)","40887.4216465747"
"Andorra","2018","Gross Domestic Product (GDP)","41794.3985720242"
"Angola","2019","Gross Domestic Product (GDP)","2670.85073226766"
"Angola","2018","Gross Domestic Product (GDP)","3289.64337378359"
"Anguilla","2019","Gross Domestic Product (GDP)","25528.573567178"
"Anguilla","2018","Gross Domestic Product (GDP)","21755.9536326769"
"Antigua and Barbuda","2019","Gross Domestic Product (GDP)","17112.8211347326"
"Antigua and Barbuda","2018","Gross Domestic Product (GDP)","16672.7442395764"
"Argentina","2019","Gross Domestic Product (GDP)","10041.4633030642"
"Argentina","2018","Gross Domestic Product (GDP)","11719.0756778825"
'''

from io import StringIO  # lets you treat a string as a file-type object

df = pd.read_csv(StringIO(raw_csv))
df.head()

First, we plot a single line.

In [None]:
fig, ax = plt.subplots()

temp = df[df['Country or Area'] == 'Albania']
ax.plot(temp.Year, temp.Value)

To plot multiple countries, how about a loop?

In [None]:
countries = ["Afghanistan", "Algeria", "Angola", "Argentina"]

fig, ax = plt.subplots()

for item in countries:
    temp = df[df['Country or Area'] == item]
    ax.plot(temp.Year, temp.Value)

Too easy. But not quite there yet. The rest is just labeling the lines, adding heavier markers on the end points, and cleaning things up.

In [None]:
fig, ax = plt.subplots()


for item in countries:
    temp = df[df['Country or Area'] == item]

    # plot the lines AND emphasize the endpoint marker
    ax.plot(temp.Year, temp.Value, marker='o', markersize=5)
    
    # end label
    # TODO
    
    # start label
    ax.text(temp.Year.values[1]-0.02, temp.Value.values[1], item, ha='right')
    
# x limits and x ticks
ax.set_xlim(2017.5, 2019.5)
ax.set_xticks([2018, 2019])  # only want start and end labeled

# get y ticks, replace 1,000 with k, and draw the ticks

yticks = ax.get_yticks();
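# this line raises a warning (the tick positions were never fixed) but seems to work fine;
# calling ax.set_yticks(yticks) just before it should avoid the warning; feel free to try it out
# ax.set_yticklabels(['{}k'.format(i/1000) for i in yticks]);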

# the semicolon at the end of the last line just suppresses a bunch of text output
# it's not proper Python style, but it's accepted for working with Matplotlib
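
Another way to get the "k" labels without the warning from the commented-out `set_yticklabels` line is to reuse the formatter-function idea from the dot-dash plot. This is only a sketch: `thousands_formatter` is a made-up helper name, the two values are Albania's rows from the CSV above (rounded), and it assumes a Matplotlib version that accepts a plain function in `set_major_formatter`, as the dot-dash example does.

```python
import matplotlib.pyplot as plt

# 2018 and 2019 values for Albania, copied from the CSV above (rounded)
years = [2018, 2019]
values = [5254.38, 5303.20]

def thousands_formatter(tick_val, tick_pos):
    # Hypothetical helper: show 5000 as '5k' and 5250 as '5.25k'
    return '{:g}k'.format(tick_val / 1000)

fig, ax = plt.subplots()
ax.plot(years, values, marker='o', markersize=5)
ax.set_xticks(years)  # only want start and end labeled

# Matplotlib calls the function for each tick it draws, so no tick positions need fixing
ax.yaxis.set_major_formatter(thousands_formatter)
```

Because the formatter is applied to whatever ticks Matplotlib chooses, the labels stay correct even if the axis limits change later.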

### Task

Putting a label on both ends of the line helps with reading. Do that. Decide if you'd like to include the value of the data on one (or both) ends, and modify the `ax.text` calls accordingly.

If you label the points directly, the vertical axis seems kind of redundant, so clean that up.

Make any other customizations you desire before moving on.

## Next steps

In the time remaining for this topic, visit https://matplotlib.org/stable/gallery/index.html and identify two graphs that look interesting. Copy the example code into this notebook and customize it some. Some suggestions include [scatterplot with histograms](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_hist.html#sphx-glr-gallery-lines-bars-and-markers-scatter-hist-py), [box plot vs. violin plot](https://matplotlib.org/stable/gallery/statistics/boxplot_vs_violin.html#box-plot-vs-violin-plot-comparison), and [3d surface colormap](https://matplotlib.org/stable/gallery/mplot3d/surface3d.html#sphx-glr-gallery-mplot3d-surface3d-py).

------------
For more examples of advanced plots using Matplotlib, check out https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/.