<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/02c-DataViz.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 2c -- DataViz

Data visualization with matplotlib -- a core library that's been around for a while.

### Key topics

* Customizing matplotlib
* Multi-variable data visualization
* Faceting with FacetGrid

### Reading

* Chapter 4 of [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb) -- github
* Reference: [Building structured multi-plot grids](https://seaborn.pydata.org/tutorial/axis_grids.html) -- seaborn.pydata.org

In [None]:
# Standard imports
import matplotlib as mpl
import matplotlib.pyplot as plt

Before we dig into matplotlib, a short digression/reminder about Python, Numpy and Pandas.

## Numpy

* "NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers." 
* "In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size."
* "NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you."
* Quoted text is from Ref: [02.00-Introduction-to-Numpy](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb)


## Pandas

* "Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame."
* "DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data."
* "Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs."
* Pandas `Series` and `DataFrame` objects "build on the NumPy array structure and provide efficient access to the sorts of 'data munging' tasks that occupy much of a data scientist's time."

* Quoted text from [03.00-Introduction-to-Pandas.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb) by VanderPlas -- github
* [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) reference docs -- pydata.org
  * One-dimensional ndarray with axis labels
* [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) reference docs -- pydata.org
  * Two-dimensional, size-mutable, potentially heterogeneous tabular data.
  * Data structure also contains labeled axes (rows and columns).
  * Arithmetic operations align on both row and column labels. 
  * Can be thought of as a dict-like container for Series objects. 
  * The primary pandas data structure.
* BTW, you've seen pydata.org before: https://seaborn.pydata.org/
  * https://pydata.org/ -- The power of community! 


## Evenly spaced numbers

Say you want a list/array/series of 100 evenly spaced numbers between 0 and 10, inclusive.

In [None]:
# Python list

# 100 numbers, but not inclusive of 10
#x = [i / 10 for i in range(100)] 

# Inclusive of 10, but 101 numbers
#x = [i / 10 for i in range(101)] 

# This works (100 numbers from 0 to 10, inclusive), but...
#x = [i * 10 / 99 for i in range(100)]

#x

In [None]:
# Standard import for numpy
import numpy as np

# numpy array (easier than list, and maybe more readable?)
x = np.linspace(0, 10, 100)

# Easy conversion back to a Python list (one way)
# list(x)

# Python list (another way)
# x.tolist()

# Convert a list to a numpy array
# x = x.tolist()
# x = np.array(x)

#x
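
As a quick check of the points above, here is a minimal sketch comparing `np.linspace` with the equivalent list comprehension. The variable name `x_list` is only for illustration.

```python
import numpy as np

# np.linspace includes both endpoints by default (endpoint=True)
x = np.linspace(0, 10, 100)

# Equivalent list comprehension: 100 points means 99 equal steps
x_list = [i * 10 / 99 for i in range(100)]

print(len(x), x[0], x[-1])     # number of points, first and last value
print(np.allclose(x, x_list))  # do the two approaches agree?

# Note that the spacing is 10/99, not 0.1
print(x[1] - x[0])
```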

In [None]:
# Standard pandas import
import pandas as pd

# Convert a list or 1-D numpy array into a pandas Series object
# Note: you can give it a name
series = pd.Series(x, name="my_series")
series

# Convert a list or 1-D numpy array into a pandas DataFrame
# Note: you can name the column, but the syntax differs from Series
# If you forget the syntax, then look at the reference docs
df = pd.DataFrame(x, columns=['my_column'])
df

# Convert a Pandas Series into a pandas DataFrame (straightforward)
# df = pd.DataFrame(series)
# df

# Convert Series to a list with .to_list() (built-in method)
# series.to_list()

# But be careful: DataFrame doesn't have a .to_list() method
# Next line will throw an AttributeError
# df.to_list()

# Convert a dataframe column to a list (select column first)
#df["my_column"].to_list()

# Wait -- Didn't you just say that DataFrame doesn't have a .to_list() method?!
# The previous line works because of method chaining and because df["my_column"]
# is not a DataFrame. Uncomment the next line to see what I mean...
#type(df['my_column'])
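
The Series-versus-DataFrame distinction in the comments above can be checked directly. This is a small sketch using a 5-row frame (the shorter array is just to keep the output readable).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.linspace(0, 10, 5), columns=["my_column"])

# Single brackets select one column and return a Series
col = df["my_column"]
print(type(col))

# Double brackets (a list of column names) return a DataFrame
sub = df[["my_column"]]
print(type(sub))

# Only the Series has a .to_list() method
print(hasattr(df, "to_list"))
print(col.to_list())
```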

# Matplotlib basics

In [None]:
# Simple line plot
import numpy as np

x = np.linspace(0, 10, 100)

plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x)); # use ";" to suppress the printed return value of the last line

# If you're not using Colab or Jupyter, you may need to uncomment the next line.
# plt.show()

## Two interfaces

* MATLAB-style (legacy "stateful" API)
  * [matplotlib.pyplot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html) API docs 
  * Note the comment about the (discouraged) [pylab module](https://matplotlib.org/stable/api/index.html#module-pylab)
  * Warning about the pylab module!
    * "Since heavily importing into the global namespace may result in unexpected behavior, the use of pylab is strongly discouraged. Use matplotlib.pyplot instead."
    * The pylab module imports a bunch of things into the global namespace. This can cause unexpected behavior!
    * pylab was created (a long time ago) to mimic MATLAB, but "polluting" the global namespace is considered bad style nowadays.
* Object-oriented style (modern)
  * Ref: [API overview](https://matplotlib.org/stable/api/index.html)
  * figure object
  * axes object

Be aware that the two APIs are not quite interchangeable, which can be confusing.

In [None]:
# MATLAB-style

#plt.figure() # create a plot figure (optional)
#plt.figure(figsize=(10,10))  # Use it (for example) to set "figsize"

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
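
The MATLAB-style interface is "stateful" because pyplot keeps track of a current figure and current axes behind the scenes. The sketch below makes that state visible with `plt.gcf()` and `plt.gca()`; the names `ax_top` and `ax_bottom` are only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

ax_top = plt.subplot(2, 1, 1)     # becomes the current axes
plt.plot(x, np.sin(x))            # drawn on ax_top

ax_bottom = plt.subplot(2, 1, 2)  # now this is the current axes
plt.plot(x, np.cos(x))            # drawn on ax_bottom

# plt.gcf() and plt.gca() return the current figure and axes objects
fig = plt.gcf()
print(plt.gca() is ax_bottom)
print(len(fig.axes))
```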

**Object-oriented style**

With object-oriented style, you typically create an explicit figure and multiple axes objects with plt.subplots(). 

[plt.subplots()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) is a convenient wrapper that returns both the enclosing figure and the axes, in one call.

You can also create them directly.



In [None]:
# Object-oriented style.
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));

print(type(ax),'of length', len(ax))
print(type(ax[0]))
print(type(ax[1]))

### Q: Which one should you use?

A: the object-oriented API becomes more powerful with complex plots.

But in many cases, you can simply exchange plt.plot() with ax.plot()
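
The swap is less direct for labels, limits and titles, where the pyplot functions map to `set_` methods on the axes. A minimal sketch of the common translations:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))          # plt.plot()   -> ax.plot()
ax.set_xlabel("x")             # plt.xlabel() -> ax.set_xlabel()
ax.set_ylabel("sin(x)")        # plt.ylabel() -> ax.set_ylabel()
ax.set_xlim(0, 10)             # plt.xlim()   -> ax.set_xlim()
ax.set_ylim(-2, 2)             # plt.ylim()   -> ax.set_ylim()
ax.set_title("A Simple Plot"); # plt.title()  -> ax.set_title()
```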



In [None]:
fig, ax = plt.subplots(2, 2)

# Call plot() method on the appropriate object
ax[0, 0].plot(x, np.sin(x))
ax[1, 0].plot(x, np.cos(2 * x))
ax[0, 1].plot(x, np.sin(3 * x))
ax[1, 1].plot(x, np.cos(4 * x));

## Change the default style

See: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

Note: Since `plt` is used elsewhere, if you change the default style this way, the new style applies to every cell that uses `plt`, even the cells above (if you re-run them).

### **Important**: In general, running cells out of order can cause confusion and problems!!!

In other words: the cells in a Jupyter or Colab notebook do not behave like the cells in a spreadsheet.
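
One way to limit this kind of hidden state is to apply a style temporarily with `plt.style.context()`, which restores the previous settings when the `with` block ends. The sketch below uses the built-in `ggplot` style purely as an example.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
before = mpl.rcParams["axes.facecolor"]

# The style applies only inside the "with" block
with plt.style.context("ggplot"):
    inside = mpl.rcParams["axes.facecolor"]
    fig, ax = plt.subplots()
    ax.plot(x, np.sin(x))

# Back to the previous setting once the block ends
after = mpl.rcParams["axes.facecolor"]
print(before, inside, after)
```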


In [None]:
# Default figure and axes
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.axes()
print(type(fig))
print(type(ax))
print(fig.get_figwidth(), fig.get_figheight())
print(fig.get_size_inches())

# Note: "default" actually changes the default appearance in Colab!
# ...but not on the command line. Oh well:-)
# plt.style.use('default')
# fig = plt.figure()
# ax = plt.axes()
# print(type(fig))
# print(type(ax))
# print(fig.get_figwidth(), fig.get_figheight())
# print(fig.get_size_inches())

In [None]:
# Changing the default style
# Q: What happens if you rerun the previous cell?
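# Note: matplotlib 3.6 renamed the seaborn styles with a "seaborn-v0_8-" prefix
styles = plt.style.available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in styles
              else 'seaborn-whitegrid')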

fig = plt.figure()
ax = plt.axes()
print(fig)
print(ax)

In [None]:
# There are various options for improving readability -- "tight_layout()" is one
fig, ax = plt.subplots(2, 2)

print(fig)
print(ax)
print(ax.shape) # Works here (2x2 array). Q: Why would this throw an error with plt.subplots() (no arguments)?
#fig.tight_layout() # Q: What does this do?
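
The type returned by `plt.subplots()` depends on the grid size, which is worth seeing side by side. This sketch also shows the `squeeze=False` option, which always returns a 2-D array, and uses `tight_layout()` to adjust subplot spacing so labels are less likely to overlap.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

fig1, ax1 = plt.subplots()               # a single Axes object
fig2, ax2 = plt.subplots(2, 2)           # a 2x2 numpy array of Axes
fig3, ax3 = plt.subplots(squeeze=False)  # always a 2-D array, here 1x1

print(isinstance(ax1, Axes))
print(ax2.shape)
print(ax3.shape)

# .flat iterates over every Axes in the grid
for a in ax2.flat:
    a.set_xlabel("x")
    a.set_ylabel("y")

fig2.tight_layout()
```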

# Data visualization basics

* line plots (4.01)
  * customizing axes
  * basic labels and legends
* scatterplots (4.02)
* custom legends (4.06)

## Customizing line plots

* styling -- adjusting line colors and line type
* multiple plots
* labels -- legend

https://jakevdp.github.io/PythonDataScienceHandbook/04.01-simple-line-plots.html

In [None]:
# Multiple plots automatically use different colors
# Matplotlib automatically cycles through a set of default colors
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

In [None]:
# You can specify the colors -- with a variety of color encodings
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported

In [None]:
# And you can specify the line type with a keyword argument
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');

# You can use the following codes as shortcuts:
plt.plot(x, x + 4, linestyle='-')  # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted

In [None]:
# You can also use a shorthand to combine both color and line type
plt.plot(x, x + 0, '-g')  # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r');  # dotted red

### Customizing axes

In [None]:
# Set axis limits
plt.plot(x, np.sin(x))

plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);

In [None]:
# Reverse axis orientation
plt.plot(x, np.sin(x))

plt.xlim(-1, 11)
plt.ylim(1.5, -1.5);
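
With the object-oriented interface, an alternative to swapping the limits by hand is `ax.invert_yaxis()`. A minimal sketch, assuming the same sine curve as above:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(-1, 11)
ax.set_ylim(-1.5, 1.5)

# Flip the y axis without retyping the limits
ax.invert_yaxis()

print(ax.get_ylim())
print(ax.yaxis_inverted())
```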

In [None]:
# Adjust aspect ratio
plt.plot(x, np.sin(x))
plt.axis('equal');

In [None]:
# Each method has a return value that can be used for further customizing
lines = plt.plot(x, np.sin(x)) # returns a list of Line2D

print(type(lines),'of length', len(lines))
print(type(lines[0]))
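
As an illustration of "further customizing", the returned `Line2D` object has setter methods for most of its visual properties. The particular width, style and color below are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

lines = plt.plot(x, np.sin(x))  # list with one Line2D
line = lines[0]

# Restyle the line after it has been created
line.set_linewidth(3)
line.set_linestyle("--")
line.set_color("tab:orange")

print(line.get_linewidth(), line.get_linestyle(), line.get_color())
```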

### Labels and legends

In [None]:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal') # Sets aspect ratio to 1 (may ignore limits)

plt.legend();

In [None]:
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
       xlabel='x', ylabel='sin(x)',
       title='A Simple Plot');
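
Legends work the same way in the object-oriented interface: pass `label=` when plotting, then call `ax.legend()`. The sketch below combines that with `ax.set()`; the title text and legend placement are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), "-g", label="sin(x)")
ax.plot(x, np.cos(x), ":b", label="cos(x)")
ax.set(xlabel="x", title="sin and cos")

# ax.legend() collects the labels given to ax.plot()
legend = ax.legend(loc="upper right", frameon=True)
print([t.get_text() for t in legend.get_texts()])
```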

## Scatterplots

There are two ways to create scatterplots:

* `matplotlib.pyplot.plot`
  * better performance for large datasets
* `matplotlib.pyplot.scatter`
  * data-dependent styling

https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html

In [None]:
import matplotlib.pyplot as plt
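# Note: matplotlib 3.6 renamed the seaborn styles with a "seaborn-v0_8-" prefix
styles = plt.style.available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in styles
              else 'seaborn-whitegrid')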
import numpy as np

x = np.linspace(0, 10, 30)
y = np.sin(x)

plt.plot(x, y, 'o', color='black');

In [None]:
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);

`plt.scatter()` is the more powerful of the two, allowing you to create data-dependent styles.

In [None]:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis')
plt.colorbar();  # show color scale

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T

plt.scatter(features[0], features[1], alpha=0.2,
            s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

### Iris dataset from Scikit-Learn

* [sklearn.utils.Bunch](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html) API reference docs -- scikit-learn.org
* Bunch is a "Container object exposing keys as attributes" -- so you can use "dot" notation

In [None]:
# Note the form of the Iris dataset that we get from sklearn
type(iris)
#iris.keys()
#iris.DESCR # print this
#iris.data
#iris.feature_names
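
Because a Bunch exposes its keys as attributes, it is straightforward to assemble a labeled DataFrame from its pieces. This is one possible sketch; the column name `species` is a choice made here, not something the Bunch provides.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Key access and attribute ("dot") access return the same object
print(iris["data"] is iris.data)

# Build a DataFrame from the raw array and the feature names
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to species names
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.shape)
df.head()
```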

### Iris dataset from Seaborn

* `sns.pairplot` expects a "tidy" (long-form) dataframe
  * each column is a variable and each row is an observation
* [seaborn.pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) API reference docs -- seaborn.pydata.org

In [None]:
import seaborn as sns

iris = sns.load_dataset("iris")
iris

## Visualizing errors

Error bars will become valuable when we get into modeling.

https://jakevdp.github.io/PythonDataScienceHandbook/04.03-errorbars.html

## Density and contour plots

https://jakevdp.github.io/PythonDataScienceHandbook/04.04-density-and-contour-plots.html

## Histograms, binning, and density


* We've already covered this in more detail.

https://jakevdp.github.io/PythonDataScienceHandbook/04.05-histograms-and-binnings.html

## Customizing legends

Includes things like legends for point size.

https://jakevdp.github.io/PythonDataScienceHandbook/04.06-customizing-legends.html
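
One way to get a point-size legend is the `legend_elements()` method on the object returned by `scatter()`. This is a sketch with random data and is not necessarily the approach taken in the chapter linked above.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(50)
y = rng.randn(50)
sizes = 20 + 280 * rng.rand(50)  # marker areas in points^2

fig, ax = plt.subplots()
sc = ax.scatter(x, y, s=sizes, alpha=0.4)

# Build a few representative handles and labels from the marker sizes
handles, labels = sc.legend_elements(prop="sizes", num=4, alpha=0.4)
ax.legend(handles, labels, title="marker size", loc="upper right");
```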


## Customizing color bars

https://jakevdp.github.io/PythonDataScienceHandbook/04.07-customizing-colorbars.html

https://matplotlib.org/stable/tutorials/colors/colormaps.html

https://colorbrewer2.org/



## Customizing other things

* [04.08-Multiple-Subplots.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.08-Multiple-Subplots.ipynb)
* [04.09-Text-and-Annotation.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.09-Text-and-Annotation.ipynb)
* [04.10-Customizing-Ticks.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.10-Customizing-Ticks.ipynb)
* [04.11-Settings-and-Stylesheets.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.11-Settings-and-Stylesheets.ipynb)


### Other possibilities

* [3-D plotting](https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html)
* [Geographic data](https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html)
* [Further resources](https://jakevdp.github.io/PythonDataScienceHandbook/04.15-further-resources.html)
  * plotly, bokeh, vega, altair, and others

# Visualization with Seaborn

* [04.14-Visualization-With-Seaborn.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb)

### Plot aesthetics

In [None]:
import matplotlib.pyplot as plt
plt.style.use('default')

import numpy as np
import pandas as pd

# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)

# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

In [None]:
import seaborn as sns
sns.set()

# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

## Multidimensional data


### Scatterplot matrix

* [Seaborn scatterplot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html) -- seaborn.pydata.org

In [None]:
import seaborn as sns

# Alternative dataset (also has a "species" column):
# df = sns.load_dataset("penguins")
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species");

In [None]:
# Customizing with Seaborn PairGrid (histogram along the diagonal)
import seaborn as sns

iris = sns.load_dataset("iris")

g = sns.PairGrid(iris, hue="species")
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend();

## Multi-plot grids with FacetGrid

* A FacetGrid can be drawn with up to three dimensions: row, col, and hue. 
* The first two correspond to an array of axes
* The "hue" variable is a third dimension that might be distinguished by color.


Reference: [Building structured multi-plot grids](https://seaborn.pydata.org/tutorial/axis_grids.html) -- seaborn.pydata.org



In [None]:
summary = df.describe()
summary.transpose()

In [None]:
# Loading the tips dataset with Seaborn
tips = sns.load_dataset("tips")
tips

In [None]:
# Initialize the FacetGrid (but don't draw anything)
# In this case, we need two sets of axes for the
# two unique values of "time": "Dinner" and "Lunch"
g = sns.FacetGrid(tips, col="time")
g

In [None]:
# Histogram of tips, one histogram for each time
g = sns.FacetGrid(tips, col="time")
g.map(sns.histplot, "tip")

# Add a label for the "y" axis
g.axes[0,0].set_ylabel('count');

In [None]:
# Scatterplot: tip vs total bill,
# one chart for each sex
# use the grid to compare Male and Female tippers
g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(sns.scatterplot, "total_bill", "tip", alpha=.7)
g.add_legend();

In [None]:
# Faceting in multiple dimensions
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15))
grid.set_ylabels("histogram bin count");