# Data Visualization using matplotlib - understanding and interpreting data through visualization (e.g. seeing patterns)

## Karl N. Kirschner

A highly popular plotting library

(http://matplotlib.org, https://matplotlib.org/stable/gallery/index.html examples with code)

**Exports**: pdf, svg, ps, eps, jpg, png, bmp, gif


### For citing matplotlib:

Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), pp.90-95.

@Article{Hunter:2007,  
  Author    = {Hunter, J. D.},  
  Title     = {Matplotlib: A 2D graphics environment},
  Journal   = {Computing In Science \& Engineering},  
  Volume    = {9},  
  Number    = {3},  
  Pages     = {90--95},  
  abstract  = {Matplotlib is a 2D graphics package used for Python  
  for application development, interactive scripting, and  
  publication-quality image generation across user  
  interfaces and operating systems.},  
  publisher = {IEEE COMPUTER SOC},  
  doi       = {10.1109/MCSE.2007.55},  
  year      = 2007  
}

### Helpful documents
- https://github.com/matplotlib/cheatsheets

***
## Line plots (using plot)

In the **plot** function (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html):

1. the order of the data points matters, since consecutive points are connected by a line

2. you can pass either the y data points alone, or both the x and y data points

In [None]:
## Jupyter notebook - allows plot interactions
#%matplotlib notebook

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

print(matplotlib.__version__)
print(np.__version__)

In [None]:
## uncomment the following to test the help function
#help(plt.plot)

In [None]:
y_data = [0, 1, 4, 9, 16, 25]

plt.figure()  ## Create a new figure

plt.plot(y_data)

plt.show()

Notice that `y_data` is passed positionally rather than by keyword (i.e. not as `y=y_data`). The `plot` function takes its x- and y-data as positional arguments, as its signature shows.

In the above example, only y-data is given, so each x-value defaults to the list position (index) of the corresponding y-value.

`matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)`
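
The two call forms can be checked by inspecting the returned `Line2D` objects instead of the drawn figure. This is a minimal sketch; the names `line_y_only` and `line_xy` and the x-values 10 to 60 are illustrative and not part of the original notebook.

```python
import matplotlib.pyplot as plt

y_data = [0, 1, 4, 9, 16, 25]

plt.figure()

# y only: matplotlib fills in x = 0, 1, 2, ... (the list positions)
(line_y_only,) = plt.plot(y_data)

# x and y: both are passed positionally, x first
(line_xy,) = plt.plot([10, 20, 30, 40, 50, 60], y_data)

print(list(line_y_only.get_xdata()))
print(list(line_xy.get_xdata()))
```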

---
##### Sidenote:
Quick visit back to Numpy

In [None]:
numpy_data = np.arange(0, 6, 1)
numpy_data

This allows us to do mathematics over the entire array easily:

In [None]:
plt.figure()

plt.plot(numpy_data**2)

plt.show()

Let's show how one might do this using the built-in function range:

In [None]:
range_data = range(0, 6, 1)

plt.figure()
plt.plot(range_data**2)
plt.show() 

As seen, this doesn't work: a `range` object does not support arithmetic, so `range_data**2` raises a `TypeError`.

Possible solutions are to use
1. a loop
1. a lambda function (with `map`)
1. Numpy (note: this adds a dependency to your code)
1. a list comprehension

In [None]:
squared_values = []

## 1. a loop
# for value in range_data:
#     squared_values.append(value**2)

## 2. a lambda function with map
##    (filter is not suitable: it selects elements rather than transforming them)
# squared_values = list(map(lambda x: x**2, range_data))

## 3. Numpy 
# squared_values = np.array(range_data)**2

## 4. List comprehension
squared_values = [x**2 for x in range_data]

plt.figure()
plt.plot(squared_values)
plt.show()
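
The failure of `range_data**2`, and the difference between `map` and `filter`, can be verified without plotting. The sketch below uses only built-in functions; the names `mapped` and `filtered` are illustrative.

```python
range_data = range(0, 6, 1)

# range objects do not support arithmetic, so this raises a TypeError
try:
    range_data**2
except TypeError as error:
    print(f'TypeError: {error}')

# map applies the function to every element
mapped = list(map(lambda x: x**2, range_data))

# filter keeps the ORIGINAL elements for which the function result is truthy
# (0**2 is 0, which is falsy, so 0 is dropped)
filtered = list(filter(lambda x: x**2, range_data))

print(mapped)    # [0, 1, 4, 9, 16, 25]
print(filtered)  # [1, 2, 3, 4, 5]
```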

---
Okay, back to our original focus.

Let's add some additional information to the figure:

In [None]:
plt.figure()

plt.plot(y_data)

plt.title('The Square of 0-5', fontsize=24)

plt.xlabel('Integer', fontsize=20)
plt.ylabel('Square of Integer', fontsize=20)

plt.tick_params(axis='both', labelsize=14)

plt.show()

In [None]:
x_values = [0, 1, 2, 3, 4, 5]
y_values = [0, 1, 4, 9, 16, 25]

**Note**
- `linestyle`: ‘solid’, ‘dashed’, ‘dashdot’, ‘dotted’
- `marker`: https://matplotlib.org/stable/api/markers_api.html

In [None]:
plt.figure()

plt.plot(x_values, y_values,
         linestyle='dashdot', linewidth=2, 
         marker='v', markersize=20)

plt.show()

---
## Scatter Plots

For scatter plots:
1. need to provide both x and y data points (no connections via a line)
2. can color code regions on the plot

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

**Note**
- `s` is the marker size, given as an area in points squared (points**2)

In [None]:
plt.figure()

plt.scatter(x_values, y_values, s=50)

plt.show()
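
Because `s` is an area, doubling the visible width of a marker takes four times the `s` value. A small sketch, with illustrative names `small` and `large`, that reads the stored sizes back from the returned collections:

```python
import matplotlib.pyplot as plt

x_values = [0, 1, 2, 3, 4, 5]
y_values = [0, 1, 4, 9, 16, 25]

plt.figure()

small = plt.scatter(x_values, y_values, s=50)

# 4x the area gives markers that are 2x as wide
large = plt.scatter(x_values, y_values, s=4 * 50, alpha=0.3)

print(small.get_sizes(), large.get_sizes())
```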

***
## Colors

https://matplotlib.org/stable/tutorials/colors/colors.html

Now, we will create a larger data set and change the point color using different color specifications:

In [None]:
x_values = range(0, 1000, 1)
y_values = [x**2 for x in x_values]

In [None]:
plt.figure()

plt.scatter(x_values, y_values, facecolor=[[0.6, 0.2, 0.2]], s=50)
plt.show()

plt.scatter(x_values, y_values, facecolor='#9467bd', s=50)
plt.show()

plt.scatter(x_values, y_values, facecolor='b', s=50)
plt.show()

plt.scatter(x_values, y_values, facecolor='C5', s=50)
plt.show()

plt.scatter(x_values, y_values, facecolor='red', s=50)
plt.show()
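
All of the color specifications used above resolve to the same kind of RGBA value. As a quick check, one could convert them with the helper functions in `matplotlib.colors`; `'C5'` depends on the active color cycle, so its result is printed without an expected value.

```python
import matplotlib.colors as mcolors

# each specification resolves to an (red, green, blue, alpha) tuple
print(mcolors.to_rgba('b'))        # single-letter shorthand
print(mcolors.to_rgba('red'))      # named color
print(mcolors.to_rgba('#9467bd'))  # hex string
print(mcolors.to_rgba('C5'))       # sixth color of the current color cycle

# RGB floats (0-1) converted to a hex string
rgb_as_hex = mcolors.to_hex([0.6, 0.2, 0.2])
print(rgb_as_hex)                  # #993333
```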

***
## Additional style control over points

1. `facecolor`
2. `edgecolor`
3. `linewidth` - of the shape's edge
4. `alpha` - transparency of the shape and its edge

In [None]:
x_values = [1, 2, 3, 4, 7]
y_values = [1, 4, 9, 16, 49]

(Demo edgecolor and alpha)

In [None]:
plt.figure()

plt.scatter(x_values, y_values,
            marker='s', s=500, linewidth=5,
            facecolor='lightblue', edgecolor='purple',
            alpha=0.9)

plt.show()
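
Since `alpha` applies to both the face and the edge, one possible way to make only the face transparent is to put the transparency into the color itself as an RGBA value. This is a sketch of that idea, not the notebook's method; `face`, `edge` and `points` are illustrative names.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 7]
y_values = [1, 4, 9, 16, 49]

# RGBA colors: a faint face and a fully opaque edge
face = mcolors.to_rgba('lightblue', alpha=0.3)
edge = mcolors.to_rgba('purple', alpha=1.0)

plt.figure()

# the colors are wrapped in lists so that each is read as one color
points = plt.scatter(x_values, y_values,
                     marker='s', s=500, linewidth=5,
                     facecolor=[face], edgecolor=[edge])

print(points.get_facecolor())
print(points.get_edgecolor())
```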

***
## Colormaps (cmap)
- a preset collection of color combinations
- certain color combinations have advantages for viewing certain types of data

#### Colormaps categories
(To see the options for the following: https://matplotlib.org/stable/tutorials/colors/colormaps.html)
- **Sequential**: often used when the data has some ordering to it (e.g. temperature)
- **Diverging**: often used when the data's middle value is important (e.g. deviation around 0)
- **Cyclic**: often used for data whose endpoints wrap around (e.g. circular data such as the time of day)
- **Qualitative**: often used to represent data that lacks ordering or relationships (e.g. categorical labels)

In [None]:
x_values = range(0, 10000, 1)

Let's use Python3's list comprehension to determine some y data:

In [None]:
y_values = [x**2 for x in x_values]

In [None]:
plt.figure()

plt.scatter(x_values, y_values, c=y_values, cmap=plt.cm.prism, s=100)

plt.show()
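
How the `c` values become colors can be sketched in two steps: the values are scaled to the range 0 to 1 (by default linearly between their minimum and maximum), and the scaled value is looked up in the colormap. The example below does this by hand with `Normalize` and `plt.cm.plasma`; the `values` list is made up for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

values = [0, 25, 50, 100]

# step 1: scale the data values to the 0-1 range
norm = Normalize(vmin=min(values), vmax=max(values))
scaled = [float(norm(value)) for value in values]

# step 2: look up each scaled value in the colormap (returns an RGBA tuple)
cmap = plt.cm.plasma
colors = [cmap(value) for value in scaled]

print(scaled)
print(colors[0], colors[-1])
```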

## ticklabels and format
https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.ticklabel_format.html

Control whether the tick labels use scientific notation, and for which magnitudes:
- `style='sci'`
- `scilimits=(0, 0)` indicates that scientific notation is used for all numbers


In [None]:
plt.figure()

plt.scatter(x_values, y_values, c=y_values, cmap=plt.cm.prism, s=50)

plt.ticklabel_format(axis='both', style='sci', scilimits=(0, 0))

plt.show()
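
With `style='sci'`, the common exponent is shown once as an offset text at the end of the axis. A rough sketch that reads this text back is given below; it assumes the default (non-mathtext) tick formatter, and the exact exponent depends on the tick locations that matplotlib chooses.

```python
import matplotlib.pyplot as plt

x_values = range(0, 1000, 1)
y_values = [x**2 for x in x_values]

fig = plt.figure()
plt.plot(x_values, y_values)

# scilimits=(m, n): scientific notation is used outside the range 10**m to 10**n;
# (0, 0) applies it to all numbers
plt.ticklabel_format(axis='y', style='sci', scilimits=(0, 0))

# the shared exponent is only computed once the figure is drawn
fig.canvas.draw()
offset = plt.gca().yaxis.get_offset_text().get_text()
print(offset)
```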

***
## Output File Format and Resolution

Formats: png, pdf, ps, eps and svg

Recommended formats: svg and pdf. These are vector formats, so you don't need to worry about setting the resolution (dpi).

A png-formatted file is also a good option. The "dots-per-inch" (dpi) is important for this format. For print material, you need a minimum of **300 dpi**.

**Note**
- `figsize=[width, height]` is given in inches
- `c` is marker color (see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)

In [None]:
plt.figure(figsize=[6.4, 4.8], dpi=300)

plt.scatter(x_values, y_values, c=y_values, cmap=plt.cm.plasma, s=50)

plt.savefig('squares.png', bbox_inches='tight')

plt.show()
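
The relation between figure size, dpi and pixel count (pixels = inches x dpi) can be checked by saving the same figure twice and reading the images back. This sketch writes to a temporary directory; the file names are illustrative, and `bbox_inches='tight'` is left out on purpose because it crops the image and changes the pixel count.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig = plt.figure(figsize=[2, 1])  # 2 x 1 inches
plt.plot([0, 1, 2], [0, 1, 4])

with tempfile.TemporaryDirectory() as tmp_dir:
    low = os.path.join(tmp_dir, 'low.png')
    high = os.path.join(tmp_dir, 'high.png')

    fig.savefig(low, dpi=100)
    fig.savefig(high, dpi=300)

    # image arrays are shaped (height, width, channels)
    low_shape = plt.imread(low).shape[:2]
    high_shape = plt.imread(high).shape[:2]

print(low_shape)   # (100, 200)
print(high_shape)  # (300, 600)
```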

---
## Overlay of two data sets (using plot)

- write two consecutive `plt.plot` statements

Also covered:
1. `linestyle`
2. `legend`
3. `grid`

Here we will use Numpy to create evenly spaced floats from 0 up to (but not including) 1, and then use them in sine and cosine functions to create the data points:

In [None]:
x_data = np.arange(0.0, 1.0, 0.01)

y_data_human = 1 + np.sin(2 * np.pi * x_data)
y_data_alien = 1 + np.cos(2 * np.pi * x_data)
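
Note that `np.arange` excludes its stop value. If the end point should be part of the data, `np.linspace` is a common alternative; the comparison below is a small sketch with illustrative variable names.

```python
import numpy as np

# step-based: the stop value (1.0) is excluded
x_arange = np.arange(0.0, 1.0, 0.01)

# count-based: the stop value (1.0) is included
x_linspace = np.linspace(0.0, 1.0, 101)

print(len(x_arange), x_arange[-1])
print(len(x_linspace), x_linspace[-1])
```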

In [None]:
plt.figure()

plt.plot(x_data, y_data_human, color='black',
         linewidth=5, linestyle='solid', label='Human Signal')

plt.plot(x_data, y_data_alien, color='red',
         linewidth=5, linestyle='dashdot', label='Alien Signal')

legend = plt.legend(loc='upper center', shadow=True, fontsize='x-large')

plt.xlabel(xlabel='X-Axis (Unit)')
plt.ylabel(ylabel='Y-Axis (Unit)')

plt.title(label='TITLE')

plt.grid(True)

#plt.savefig("simple_matplotlib.png", bbox_inches='tight', dpi=300)
#plt.savefig("simple_matplotlib.svg")

plt.show()

***
## Formatting Issues
1. tick label `rotation` = n degrees
2. `labelpad` = n pts (space between tick labels and axis label)
3. tickmark `width` and `length` (via `tick_params`)
4. axes `linewidth` (via runtime configuration (`rc`))
5. `fontsize` = n pts
6. `fontstyle` = ['normal' | 'italic' | 'oblique']
7. `fontweight` = ['normal' | 'bold' | 'heavy' | 'light']

In [None]:
plt.figure()

plt.plot(x_data, y_data_human, color='black',
         linewidth=5, linestyle='solid', label='Human Signal')

plt.plot(x_data, y_data_alien, color='red',
         linewidth=5, linestyle='dashdot', label='Alien Signal')

legend = plt.legend(loc='upper center', shadow=True, fontsize='x-large',
                    frameon=True)

## new idea: labelpad and rotation
plt.xlabel(xlabel='X-Axis (unit)', fontsize=18, fontweight='bold', labelpad=15)
plt.ylabel(ylabel='Y-Axis (unit)', fontsize=18, fontweight='bold', labelpad=15)

plt.xticks(rotation=90, fontsize=18)
plt.yticks(fontsize=18)

## new idea: fontweight
plt.title(label='Legal vs. Illegal Space Aliens',
          fontsize=18, fontstyle='italic', fontweight='bold')

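## new idea: axes linewidth
##   (note: rc settings only affect axes created after this call,
##    so the thicker frame first appears in the following figures)
plt.rc('axes', linewidth=5)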

## new idea: major tick mark's length, width and color
plt.tick_params(which='major', length=10, width=5, colors='purple')

plt.show()
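
Runtime configuration (`rc`) values are read when an axes is created, and `plt.rc` changes them for the rest of the session. One way to limit a setting to a single figure is `plt.rc_context`, sketched below. The comparison assumes that `axes.linewidth` has not already been changed earlier in the session.

```python
import matplotlib.pyplot as plt

default_width = plt.rcParams['axes.linewidth']

# settings inside the context apply only to axes created within it
with plt.rc_context({'axes.linewidth': 5}):
    fig_thick, ax_thick = plt.subplots()
    thick = ax_thick.spines['left'].get_linewidth()

# axes created afterwards use the previous settings again
fig_plain, ax_plain = plt.subplots()
plain = ax_plain.spines['left'].get_linewidth()

print(thick, plain)
```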

---

## Controlling major tick mark range and minor ticks

In [None]:
from matplotlib.ticker import AutoMinorLocator

In [None]:
plt.figure()

plt.plot(x_data, y_data_human, color='black',
         linewidth=5, linestyle='solid', label='Human Signal')

plt.plot(x_data, y_data_alien, color='red',
         linewidth=5, linestyle='dashdot', label='Alien Signal')

legend = plt.legend(loc='upper center', shadow=False, fontsize='x-large',
                    frameon=True)

plt.xlabel(xlabel='X-Axis (unit)', fontsize=18, fontweight='bold', labelpad=15)
plt.ylabel(ylabel='Y-Axis (unit)', fontsize=18, fontweight='bold', labelpad=15)

plt.xticks(rotation=90, fontsize=18)
plt.yticks(fontsize=18)

plt.title(label='Legal vs. Illegal Space Aliens',
          fontsize=18, fontstyle='italic', fontweight='bold')

plt.rc('axes', linewidth=5)

## new ideas: tick range
plt.yticks(np.arange(min(y_data_human), max(y_data_human) + 2, 1.0))

## new ideas: major AND minor tick control
plt.minorticks_on()
plt.tick_params(axis='x', which='minor', direction='out')
plt.tick_params(which='both', width=5)
plt.tick_params(which='major', length=15, colors='purple')
plt.tick_params(which='minor', length=7.5, color='orange')

plt.show()
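
`plt.minorticks_on()` lets matplotlib choose the number of minor ticks. The imported `AutoMinorLocator` gives explicit control over how many subdivisions each major interval gets. A minimal sketch, where the subdivision counts 4 and 2 are arbitrary choices:

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import AutoMinorLocator

x_data = np.arange(0.0, 1.0, 0.01)

plt.figure()
ax = plt.gca()
ax.plot(x_data, 1 + np.sin(2 * np.pi * x_data), linewidth=5)

# split every major interval into 4 parts (i.e. 3 minor ticks in between)
ax.xaxis.set_minor_locator(AutoMinorLocator(4))
# split every major interval into 2 parts (i.e. 1 minor tick in between)
ax.yaxis.set_minor_locator(AutoMinorLocator(2))

minor_x = ax.xaxis.get_minorticklocs()
print(minor_x)
```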

***
## Predefined styles

In [None]:
plt.style.available[:]

In [None]:
plt.style.use('tableau-colorblind10')

plt.figure()

plt.plot(x_data, y_data_human, label='Human Signal', linewidth=5)
plt.plot(x_data, y_data_alien, label='Alien Signal', linewidth=5)

plt.show()
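
`plt.style.use` changes the style for all following figures. To apply a style to one figure only, `plt.style.context` can be used as a context manager, as in this sketch. The helper `cycle_colors` is a hypothetical convenience function written for this example.

```python
import matplotlib.pyplot as plt


def cycle_colors():
    """Return the colors of the currently active color cycle."""
    return plt.rcParams['axes.prop_cycle'].by_key()['color']


before = cycle_colors()

# the style is active only inside the with-block
with plt.style.context('tableau-colorblind10'):
    inside = cycle_colors()
    plt.figure()
    plt.plot([0, 1], [0, 1], linewidth=5)
    plt.plot([0, 1], [1, 0], linewidth=5)

after = cycle_colors()

print(inside[:2])
print(before == after)
```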

## Fine control of how the data points and lines overlap

- zorder: https://matplotlib.org/stable/gallery/misc/zorder_demo.html

Concerning the above graph, let's switch which line is in the foreground,
- i.e. put the blue line on top of the orange line

In [None]:
plt.figure()

plt.plot(x_data, y_data_human, label='Human Signal', linewidth=5, zorder=2)
plt.plot(x_data, y_data_alien, label='Alien Signal', linewidth=5, zorder=1)

plt.show()
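
The drawing order can also be inspected and changed after plotting, through the `get_zorder` and `set_zorder` methods of the returned line objects. A short sketch with illustrative names `first` and `second`:

```python
import matplotlib.pyplot as plt

plt.figure()
(first,) = plt.plot([0, 1], [0, 1], linewidth=5)
(second,) = plt.plot([0, 1], [1, 0], linewidth=5)

# lines default to zorder 2; with equal values the later line is drawn on top
print(first.get_zorder(), second.get_zorder())

# raising the first line's zorder draws it on top of the second line
first.set_zorder(second.get_zorder() + 1)
print(first.get_zorder())
```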

### Let's create some random-ish plots

In [None]:
time = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0,
        4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5,
        9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5,
        13.0, 13.5]

exp = [0.1185, 0.6524, 0.1291, 0.9445, 0.0272, 0.7598, 0.8159, 0.8003,
       0.5716, 0.6651, 0.9983, 0.1004, 0.8433, 0.0067, 0.8238, 0.3952,
       0.6452, 0.848, 0.1986, 0.9114, 0.7693, 0.5009, 0.211, 0.9227,
       0.0461, 0.2177, 0.9554, 0.613]

sim = [0.2255, 0.3052, 0.0744, 0.7611, 0.1183, 0.045, 0.2669, 0.177,
       0.2433, 0.2302, 0.0772, 0.0805, 0.6214, 0.1156, 0.1607, 0.042,
       0.1123, 0.383, 0.5566, 0.667, 0.5655, 0.4875, 0.0104, 0.4968,
       0.2639, 0.2197, 0.944, 0.2423]

print(f'Time: {time}\n')
print(f'Exp: {exp}\n')
print(f'Sim: {sim}\n')

In [None]:
plt.style.use('Solarize_Light2')

plt.figure(figsize=(15, 5))

plt.plot(time, exp, linewidth=5, linestyle='-', label='Experimental')
plt.plot(time, sim, linewidth=5, linestyle='--', label='Simulated')

plt.xlabel(xlabel='Time (seconds)', fontsize=20)
plt.ylabel(ylabel='Y-Axis (unit)', fontsize=20)

plt.title(label='Experimental and Simulated Results')

plt.grid(True)

plt.show()

## Bar plot

`bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)`

x : sequence of scalars -- The x coordinates of the bars. See also align for the alignment of the bars to the coordinates.

height : scalar or sequence of scalars -- The height(s) of the bars.

width : scalar or array-like, optional -- The width(s) of the bars (default: 0.8).
<br><br>

**Example**: Let's create a bar plot that does the following:
- shows the mean value of the above experimental and simulated data
- shows the standard deviation of those data

In [None]:
exp_average = np.mean(exp)
sim_average = np.mean(sim)

# sample standard deviation via ddof=1 (i.e. this reproduces LibreOffice's STDEV function)
exp_std = np.std(exp, ddof=1)
sim_std = np.std(sim, ddof=1)

print(f'Experimental Average: {exp_average:0.2f}')
print(f'Simulated Average: {sim_average:0.2f}')
print(f'Experimental Standard Deviation: {exp_std:0.2f}')
print(f'Simulated Standard Deviation: {sim_std:0.2f}')
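
The effect of `ddof` can be seen on a small, made-up data set: `ddof=0` (Numpy's default) divides by N, while `ddof=1` divides by N - 1, which matches `statistics.stdev` from the standard library.

```python
import statistics

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

population_std = np.std(data)      # ddof=0: divide by N
sample_std = np.std(data, ddof=1)  # ddof=1: divide by N - 1

print(f'Population: {population_std:0.3f}')            # 2.000
print(f'Sample: {sample_std:0.3f}')                    # 2.138
print(f'statistics: {statistics.stdev(data):0.3f}')    # 2.138
```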

**Note**
- standard deviation will be implemented via `yerr` (error values that correspond to the y data)
- formatting control over the error bars is done using `error_kw` (a dictionary that controls the error bars of the bar plot)

In [None]:
# means = [exp_average, sim_average]
# standard_dev = [exp_std, sim_std]

In [None]:
## Setup to show 1 bar graph within each category (i.e. 1 in exp. & 1 in sim.)
##   (change to 2, 3 or 4 to see what happens)
n_groups = 1
index = np.arange(n_groups)

bar_width = 0.3

plt.figure()

rects1 = plt.bar(x=index, height=exp_average, width=bar_width,
                 color='red', alpha=0.5, label='Experimental',           
                 yerr=exp_std,
                 error_kw={'ecolor': 'C1', 'alpha':0.2, 'elinewidth':4})

## to place this bar next to the above one: index+bar_width
rects2 = plt.bar(x=(index+bar_width), height=sim_average, width=bar_width,
                 color='blue', alpha=0.5, label='Simulated',
                 yerr=sim_std,
                 error_kw={'ecolor': 'C6', 'alpha':0.2, 'elinewidth':4})

plt.xlabel(xlabel='Methodology')
plt.ylabel(ylabel='Values')
plt.title(label='Experimental and Simulated Results')

plt.tick_params(
    axis='x',          ## changes apply only to the x-axis (option: x, y, both)
    which='both',      ## both major and minor ticks (major, minor, both)
    bottom=False,      ## ticks along the bottom edge are off
    top=False,         ## ticks along the top edge are off
    labelbottom=False) ## labels along the bottom edge are off

plt.legend()
plt.grid(False)

plt.show()
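
For more than one group, each group needs its own height value. The sketch below shows three groups with made-up means (the numbers and group names are hypothetical), and places one tick label under the center of each pair of bars.

```python
import matplotlib.pyplot as plt
import numpy as np

# hypothetical mean values for three groups
exp_means = [0.55, 0.60, 0.48]
sim_means = [0.30, 0.35, 0.28]

n_groups = len(exp_means)
index = np.arange(n_groups)
bar_width = 0.3

plt.figure()

exp_bars = plt.bar(x=index, height=exp_means, width=bar_width,
                   color='red', alpha=0.5, label='Experimental')

# shifted by one bar width to sit next to the experimental bars
sim_bars = plt.bar(x=index + bar_width, height=sim_means, width=bar_width,
                   color='blue', alpha=0.5, label='Simulated')

# center one tick label under each pair of bars
plt.xticks(ticks=index + bar_width / 2,
           labels=['Group 1', 'Group 2', 'Group 3'])
plt.legend()
```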

Now, let's create a bar graph that has customized xticks labels:

In [None]:
## bar graph - different x-axis labels
plt.figure()

x_position = [0, 1, 2, 3, 4, 5, 6]
y_axis = [0.5, 1.0, 2.0, 3.0, 2.0, 1.0, 0.5]

plt.bar(x=x_position, height=y_axis, color=['black', 'black', 'black',
                                            'orange', 'black', 'black',
                                            'red'])

bars_labels = ['Cis', '30.0', 'Gauche', '90.0', '120.0', '150.0', 'Anti']

## keyword form (note: the keyword is `labels`, not `label`)
#plt.xticks(ticks=x_position, labels=bars_labels)
plt.xticks(x_position, bars_labels)

plt.show()

---
## Advanced idea: Subplots
- subplot(nrows, ncols, index)
    
- https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html

### Functions for repetitive plotting

For simplicity, let's define some functions that plot.

In [None]:
def plot_line():
    ## note: in matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
    plt.style.use('seaborn-whitegrid')
    
    plt.plot(time, exp, linewidth=5, linestyle='-', label='Experimental')
    plt.plot(time, sim, linewidth=5, linestyle='--', label='Simulated')

    plt.xlabel(xlabel='Time (seconds)', fontsize=18, fontweight='bold')
    plt.ylabel(ylabel='Y-Axis (Unit)', fontsize=18, fontweight='bold')

    plt.yticks(fontsize=14)
    plt.xticks(time, rotation=90, fontsize=10)

    plt.title(label='Experimental and Simulated Results',
              fontsize=18, fontweight='bold')

    plt.grid(False)
    plt.legend()


def plot_scatter():
    x_values = [1, 2, 3, 4, 7]
    y_values = [1, 4, 9, 16, 49]

    ## note: in matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
    plt.style.use('seaborn-whitegrid')
    
    plt.scatter(x_values, y_values, edgecolor='dimgray', linewidth=5,
                facecolor='purple', alpha=0.5, s=500, marker='s', label='Tests')
    plt.legend()


def plot_bar():
    plt.xkcd()
    
    rects1 = plt.bar(x=index, height=exp_average, width=bar_width,
                 color='red', alpha=0.4, label='Experimental',
                 yerr=exp_std,
                 error_kw={'ecolor': 'red', 'alpha':0.3, 'elinewidth':4})
    rects2 = plt.bar(index + bar_width, sim_average, bar_width,
                 color='blue', alpha=0.4, label='Simulated',
                 yerr=sim_std,
                 error_kw={'ecolor': 'blue', 'alpha':0.3, 'elinewidth':4})

    plt.xlabel(xlabel='Methodology')
    plt.ylabel(ylabel='Values')
    plt.title(label='Results by Methodology')

    plt.tick_params(axis='x', which='both', bottom=False, top=False, 
                    labelbottom=False)
    plt.legend()

Now, let's create subplots:
- plot two graphs
- 2 rows by 1 column grid (i.e. two plots given in one column)

In [None]:
## needed for the plot1
plt.figure(figsize=(15, 5))

## (2rows, 1column, first subplot position)
plt.subplot(2, 1, 1)
plot_line()

## (2rows x 1column grid, second subplot position)
plt.subplot(2, 1, 2)
plot_scatter()

## adds extra height padding between the plots
plt.tight_layout(h_pad=3.0)
plt.show()

Now, something more complicated.

Let's create subplots:
- plot three graphs
    - two graphs stacked to the left, and
    - one graph placed to the right that spans the height of the left two graphs
- the left graphs use a 2 rows by 2 columns grid, while the right graph uses a 1 row by 2 columns grid

In [None]:
plt.figure(figsize=(11, 5))

## 2 rows, 2 columns, first subplot position (i.e. top, left position)
plt.subplot(2, 2, 1)
plot_line()

## 2 rows, 2 columns, third subplot position (i.e. bottom, left position)
plt.subplot(2, 2, 3)
plot_scatter()

## 1 row, 2 columns, second subplot position (i.e. right position, spanning top + bottom)
plt.subplot(1, 2, 2)
plot_bar()

plt.tight_layout(w_pad=3.0, h_pad=3.0)
plt.show()
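
Mixing grids works because each `plt.subplot` call only describes one cell of its own grid. Printing the resulting axes positions (in figure fractions) makes this visible. The sketch below uses empty axes and illustrative variable names.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(11, 5))

top_left = plt.subplot(2, 2, 1)     # cell 1 of a 2x2 grid
bottom_left = plt.subplot(2, 2, 3)  # cell 3 of a 2x2 grid
right = plt.subplot(1, 2, 2)        # cell 2 of a 1x2 grid

for name, ax in [('top left', top_left),
                 ('bottom left', bottom_left),
                 ('right', right)]:
    box = ax.get_position()
    print(f'{name}: x0={box.x0:0.2f}, y0={box.y0:0.2f}, '
          f'width={box.width:0.2f}, height={box.height:0.2f}')
```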

#### Adding a little bit more control (and some new data analysis ideas)

First, let's make some new data that will be interesting:

In [None]:
x_data = np.random.randn(100)
y_data = np.random.randn(100)

Reset to the default values (i.e. no xkcd) via the runtime configuration defaults function:

In [None]:
plt.rcdefaults()

Let's create subplots:
- plot three graphs (scatter, cumulative sum line and histogram) in a 2x2 uniform grid pattern

- 2 rows by 2 column grid

**sidenote**:
- Cumulative sum using Numpy's `cumsum`: https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html
- Histogram: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

In [None]:
fig = plt.figure()

fig.subplots_adjust(wspace=0.5, hspace=1.0)

## 2x2 grid for plots 
ax1 = fig.add_subplot(2, 2, 1) ## first plot
ax2 = fig.add_subplot(2, 2, 2) ## second plot
ax3 = fig.add_subplot(2, 2, 3) ## third plot

ax1.set_xlabel(xlabel='plot 1 xlabel')
ax1.set_ylabel(ylabel='plot 1 ylabel')
ax1.set_xlim(-5, 5)

ax2.set_xlabel(xlabel='plot 2 xlabel')
ax2.set_ylabel(ylabel='plot 2 ylabel')
ax2.set_xlim(0, 50)

ax3.set_xlabel(xlabel='plot 3 xlabel')
ax3.set_ylabel(ylabel='plot 3 ylabel')
ax3.set_xlim(-5, 5)

ax1.scatter(x_data, y_data, color='blue')

ax2.plot(x_data.cumsum(), color='red', linewidth=2, linestyle='solid')

ax3.hist(x_data, bins=20, color='green', alpha=0.3)

plt.show()
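
An alternative sketch of the same layout uses `plt.subplots`, which creates the whole grid at once and returns the axes as a Numpy array. Two things here are assumptions that are not in the original: the seeded random generator (the seed 42 is arbitrary, chosen only so that the data are repeatable) and the removal of the unused fourth axes.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatable data
x_data = rng.standard_normal(100)
y_data = rng.standard_normal(100)

# a 2x2 grid; axes is a 2x2 array indexed as axes[row, column]
fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0, 0].scatter(x_data, y_data, color='blue')
axes[0, 1].plot(x_data.cumsum(), color='red', linewidth=2, linestyle='solid')
axes[1, 0].hist(x_data, bins=20, color='green', alpha=0.3)

# the fourth cell is unused, so remove its empty axes
axes[1, 1].remove()

fig.tight_layout()
```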