[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek08.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week08.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week08.ipynb)

# Week 8: Plotting Data

We will continue discussing pandas, but now we will be primarily focused on visualization. 

## Time data

Dealing with time data is generally tricky. There are 
- time zones
- daylight saving time
- bizarre formatting
- inconsistent lengths (months have 28, 29, 30, or 31 days)

Moreover, time data is often not just a number (like an integer).

A common Python package for dealing with time data is `datetime`. 

`pandas` has tools to deal with time data.

Often one needs to give pandas a hint at how time data is formatted, and then pandas can reformat however is required. 
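As a small sketch of such a hint, `pd.to_datetime` accepts a `format` string. The date strings below are made up for illustration and are not taken from any data set used in these notes.

```python
import pandas as pd

# Made-up day/month/year strings; without a hint, 01/02 could be read either way
raw = pd.Series(["01/02/2021", "15/02/2021", "03/03/2021"])

# The format string tells pandas that the day comes first
dates = pd.to_datetime(raw, format="%d/%m/%Y")
print(dates)

# Once parsed, the dates can be written out in another format
iso_strings = dates.dt.strftime("%Y-%m-%d").tolist()
print(iso_strings)
```

Here the first entry is parsed as 1 February rather than 2 January, which is exactly the kind of ambiguity the hint resolves.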

There are four basic time-related concepts (taken directly from the pandas documentation referenced below):
1. Date times: A specific date and time with timezone support. Similar to `datetime.datetime` from the standard library.
2. Time deltas: An absolute time duration. Similar to `datetime.timedelta` from the standard library.
3. Time spans: A span of time defined by a point in time and its associated frequency.
4. Date offsets: A relative time duration that respects calendar arithmetic. Similar to `dateutil.relativedelta.relativedelta` from the `dateutil` package.
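The four items above can be sketched with one small object each. The specific dates and durations are arbitrary choices for illustration.

```python
import pandas as pd

stamp = pd.Timestamp("2021-01-31 12:00", tz="UTC")   # 1. date time
delta = pd.Timedelta(days=1, hours=6)                # 2. time delta
span = pd.Period("2021-01", freq="M")                # 3. time span (all of January 2021)
offset = pd.DateOffset(months=1)                     # 4. date offset

# A time delta is an absolute duration: exactly 30 hours later
later = stamp + delta
print(later)

# A time span knows where it starts and ends
print(span.start_time, span.end_time)

# A date offset respects the calendar: one month after 31 January
# lands on the last day of February
next_month = pd.Timestamp("2021-01-31") + offset
print(next_month)
```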

If you want to be overwhelmed by all that pandas can do with time data:

[check this out](https://pandas.pydata.org/docs/user_guide/timeseries.html).

In [None]:
import pandas as pd

We will look at the publicly available [HPSC's Covid-19 county data](https://data.gov.ie/dataset/covid-19-hpsc-county-statistics-historic-data1).

In [None]:
df = pd.read_csv(
	"data/COVID-19_HPSC_County_Statistics_Historic_Data.csv", 
	index_col='TimeStamp', parse_dates=True
)
df.head()

Let's split off two counties.

[Recall that `query` defaults to making a copy while slicing defaults to a view.]

In [None]:
df_gal = df.query("CountyName == 'Galway'")
df_gal.head()

In [None]:
df_gal["ConfirmedCovidCases"]

In [None]:
_ = df_gal["ConfirmedCovidCases"].plot()

Note that `pandas` already knows how to interpret the time data.

We can plot two counties manually.

In [None]:
df_don = df.query("CountyName == 'Donegal'")
df_don.head()

In [None]:
gal_v_don = pd.DataFrame({
	"Galway": df_gal["ConfirmedCovidCases"], 
	"Donegal": df_don["ConfirmedCovidCases"]
})
gal_v_don.head()

In [None]:
_ = gal_v_don.plot()

We can view specific years by accessing the `dt` attribute from the datetime data.

In [None]:
_ = gal_v_don.query("TimeStamp.dt.year == 2020").plot()
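As a hedged aside, here are two other ways to select a year when the index is a `DatetimeIndex`. The toy frame below is invented for illustration; only the index name `TimeStamp` mirrors the data above.

```python
import pandas as pd

# Toy frame with four consecutive days straddling the new year
idx = pd.date_range("2020-12-30", periods=4, freq="D")
toy = pd.DataFrame({"cases": [1, 2, 3, 4]}, index=idx)
toy.index.name = "TimeStamp"

# Partial string indexing: every row whose timestamp falls in 2020
by_label = toy.loc["2020"]

# Boolean mask: a DatetimeIndex exposes .year directly (no .dt needed)
by_mask = toy[toy.index.year == 2020]

print(by_label)
print(by_mask)
```

Both selections keep the rows for 30 and 31 December 2020.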

Cumulative data is nice, but the daily values are where the real drama is found!

We can use `diff` to take consecutive differences of our DataFrame. 

Since each row corresponds to a day, this will give us what we want.

In [None]:
_ = (gal_v_don
  .query("TimeStamp.dt.year == 2020")
  .diff()
  .plot()
)
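To see what `diff` does on a small scale, here is a sketch with a made-up cumulative series (the numbers are not from the HPSC data).

```python
import pandas as pd

days = pd.date_range("2020-03-01", periods=5, freq="D")
cumulative = pd.Series([10, 12, 15, 15, 21], index=days)

# Each entry minus the previous one; the first entry has no predecessor, so it is NaN
daily = cumulative.diff()
print(daily)

# Summing the daily values recovers the total growth of the cumulative count
total_growth = cumulative.iloc[-1] - cumulative.iloc[0]
print(daily.sum() == total_growth)
```

The leading `NaN` is skipped when plotting, so the daily curve simply starts one day later than the cumulative one.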

## Visualization with `Matplotlib`

In the background, [`Matplotlib`](https://matplotlib.org/) is running with `pandas` to make these nice plots. 

There are lots of [examples](https://matplotlib.org/stable/gallery/index.html) on Matplotlib's webpage.

### Plotting curves

We'll do the simplest plots: plotting curves of the form $y=f(x)$. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Let's start with $f(x) = \sin(x)$ for $0\leqslant x \leqslant 4\pi$.

In [None]:
xs = np.linspace(0, 4*np.pi, 100)
ys = np.sin(xs)

In [None]:
_ = plt.plot(xs, ys)

We can customize all sorts of plot properties. Here's an example that we won't carefully go through.

It is fairly self-explanatory what each line is doing.

In [None]:
zs = np.cos(xs)			# Add a cosine wave

# Another common way to plot is to use the object-oriented interface
fig, ax = plt.subplots()
ax.grid(True)
ax.set_title("Sine and Cosine Waves")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.plot(xs, ys, c='r', label="sin(x)")
ax.plot(xs, zs, c='b', label="cos(x)")
ax.legend()
plt.xticks([0, np.pi, 2*np.pi, 3*np.pi, 4*np.pi], ['0', '$\\pi$', '$2\\pi$', '$3\\pi$', '$4\\pi$'])
ax.xaxis.set_minor_locator(plt.MultipleLocator(np.pi/4))
plt.yticks([-1, -0.5, 0, 0.5, 1])
ax.yaxis.set_minor_locator(plt.MultipleLocator(0.25))
# plt.axis('equal')		# Make the x and y scales equal
_ = plt.show()

Color can be specified in many ways by using the `color` keyword, or just `c`. If nothing is provided, Matplotlib will cycle through a default list of colors.

In [None]:
xs = np.linspace(0, 10, 100)

# specify color by name
plt.plot(xs, xs + 0, color='blue', label='blue') 

# short color code (rgbcmyk)
plt.plot(xs, xs + 1, color='g', label='g') 

# Grayscale between 0 and 1
plt.plot(xs, xs + 2, color='0.75', label='0.75') 

# Hex code (RRGGBB from 00 to FF)
plt.plot(xs, xs + 3, color='#FFDD44', label='#FFDD44') 

# RGB tuple, values 0 to 1
plt.plot(xs, xs + 4, color=(1.0,0.2,0.3), label='(1.0,0.2,0.3)') 

# all HTML color names supported
plt.plot(xs, xs + 5, color='chartreuse', label='chartreuse')

_ = plt.legend()
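When no color is given, the colors come from Matplotlib's property cycle. A minimal sketch, assuming the style has not been changed, that inspects the cycle through `plt.rcParams`:

```python
import matplotlib.pyplot as plt

# The list of colors Matplotlib cycles through when none is specified
cycle = plt.rcParams["axes.prop_cycle"].by_key()["color"]
print(cycle)

# Three lines with no color argument pick up the first three colors in order
fig, ax = plt.subplots()
lines = [ax.plot([0, 1], [k, k + 1])[0] for k in range(3)]
used = [line.get_color() for line in lines]
print(used)
```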

The line style can be adjusted using the `linestyle` keyword.

In [None]:
plt.plot(xs, xs + 0, linestyle='-')  	# solid
plt.plot(xs, xs + 1, linestyle='--') 	# dashed
plt.plot(xs, xs + 2, linestyle='-.') 	# dashdot
plt.plot(xs, xs + 3, linestyle=':')		# dotted
_ = plt.show()

Line style and color are edited so often that they can be combined into a single format string and passed without a keyword.

In [None]:
plt.plot(xs, xs + 0, '-g')  	# solid green
plt.plot(xs, xs + 1, '--c') 	# dashed cyan
plt.plot(xs, xs + 2, '-.k') 	# dashdot black
plt.plot(xs, xs + 3, ':r')  	# dotted red
_ = plt.show()
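The same format string can also carry a marker. As a small sketch, the string `"o--r"` below asks for circle markers, a dashed line, and red. The variable name `ts` is just a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt

ts = np.linspace(0, 10, 20)

fig, ax = plt.subplots()
# marker 'o', line style '--', color 'r', all in one string
(line,) = ax.plot(ts, ts, "o--r")

print(line.get_marker(), line.get_linestyle(), line.get_color())
```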

We could talk more about all of the specific aspects that can be changed in a plot. 

But I would rather show some cool pictures instead. 😅

## Scatter plots

Let's take our sine and cosine plots and add noise.

In [None]:
xs = np.linspace(0, 4*np.pi, 100)
ys = np.sin(xs)
zs = np.cos(xs)
noise1 = np.random.normal(0, 0.1, 100)
noise2 = np.random.normal(0, 0.2, 100)

In [None]:
plt.scatter(xs, ys + noise1, c='r', label='sin(x)', marker='x')
plt.scatter(xs, zs + noise2, c='b', label='cos(x)', marker='o')
plt.legend()
_ = plt.show()

We can put our curves on this plot as well. 

In [None]:
plt.scatter(xs, ys + noise1, c='r', label='sin(x)', marker='x')
plt.scatter(xs, zs + noise2, c='b', label='cos(x)', marker='o')
plt.plot(xs, ys, c='black', label="sin(x)")
plt.plot(xs, zs, c='orange', label="cos(x)")
plt.legend()
_ = plt.show()

By use of color and marker size, we can record two additional variables in our scatter plot. 

(Of course, this can be somewhat harder to discern, but it is helpful in some contexts.)

In [None]:
rng = np.random.RandomState()
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)	# marker areas in points^2

plt.figure(figsize=(12,6))	# Set the size of the figure
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
_ = plt.show()

The `c` argument is automatically mapped to a color scale, which the `colorbar` command displays.

The `s` argument gives the marker size as an area in points squared (a point is 1/72 of an inch), not in pixels.
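For reference, the Matplotlib documentation specifies the `s` argument of `scatter` as an area in points squared, where a point is 1/72 of an inch. The sketch below converts one marker size into an approximate width in pixels. The values `s = 400` and `dpi=100` are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(dpi=100)

s = 400                              # marker area in points^2
width_pt = s ** 0.5                  # a marker about 20 points across
width_px = width_pt * fig.dpi / 72   # points to pixels at this figure's resolution

ax.scatter([0], [0], s=s)
print(width_pt, round(width_px, 1))
```

Since `s` is an area, doubling the width of a marker means multiplying `s` by four.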

#### The Iris data set

The [Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a standard data set used for testing and demonstrating methods.

There are 50 samples of each of three species of Iris. Each sample has four measurements: length and width of the petals and sepals.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
iris

We will map the `target` array to color. Since there are only three species, there will be three distinct colors.

We have three more dimensions we can display: $x$, $y$, and size. 

There are four measurements, so there are $4$ different sets of size $3$ we can consider. 

For each such set of $3$ measurements, there are essentially only $3$ different plots, one for each choice of the measurement shown as size, since interchanging $x$ and $y$ isn't really different.

In total, there are potentially $12$ distinct plots. Let's plot them all.
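The count of $12$ can be checked by brute force. This is only a sanity check of the counting argument, with measurements labeled $0$ to $3$; it is not the helper used for the plots.

```python
from itertools import combinations

plots = []
for subset in combinations(range(4), 3):     # 4 sets of 3 measurements
    for size in subset:                      # 3 choices for the size variable
        # The remaining two measurements are x and y (their order does not matter)
        x, y = [k for k in subset if k != size]
        plots.append((x, y, size))

print(len(plots))
```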

Here are some helper functions to organize.

In [None]:
def triple(i, j, s=False):
	v = list(range(4))
	v.remove(3 - j)
	u = v[:i] + v[i+1:] + [v[i]]
	if s:
		return ''.join(map(str, u))
	return u

def scaled(arr, N=200):
	m = arr.max()
	return N/m * arr

Here's how we will organize. The string `'ijk'` represents the list `[i, j, k]`. 

The first two entries of the list are the $x$ and $y$ values, and the last entry is the size.

In [None]:
np.array([[triple(i, j, s=True) for j in range(4)] for i in range(3)])

Let's plot!

In [None]:
features = iris.data.T
fig, axs = plt.subplots(3, 4, figsize=(12, 6))
for i in range(3):
    for j in range(4):
        a, b, c = triple(i, j)
        axs[i, j].scatter(features[a], features[b], alpha=0.5,
            s=scaled(features[c], N=100), c=iris.target, cmap='viridis')
        axs[i, j].set_xlabel(iris.feature_names[a])
        axs[i, j].set_ylabel(iris.feature_names[b])
        axs[i, j].set_title(iris.feature_names[c])
plt.tight_layout()      # Make the labels fit
_ = plt.show()

Actually, there's some mathematics (I mean data analysis) to do here to make this clear-cut.

I would do a principal component analysis (PCA).

Scikit-learn discusses that approach [in their tutorial](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). 
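As a rough sketch of what a PCA could look like here, the block below projects the four measurements onto two principal components using Scikit-learn's `PCA`. Keeping two components is an assumption made so that the result fits in a single scatter plot.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4-dimensional measurements onto the 2 directions of largest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(reduced.shape)
print(pca.explained_variance_ratio_)   # share of the variance kept by each component

fig, ax = plt.subplots()
ax.scatter(reduced[:, 0], reduced[:, 1], c=iris.target, cmap='viridis', alpha=0.5)
ax.set_xlabel("First principal component")
_ = ax.set_ylabel("Second principal component")
```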

[Foreshadowing]()

## Histograms

In a histogram, data are first grouped into *bins*, then each bin is drawn with a height given by the number of data points it contains.

If the data is somewhat continuous, then this is a discretization of the data. 

Bar charts are often used for histograms.

In [None]:
data1 = rng.normal(size=1000)
data2 = 4*rng.random(size=1000) - 2 
_ = plt.hist(data1, alpha=1)
# _ = plt.hist(data2, alpha=0.5)

In [None]:
_ = plt.hist(data1, bins=30, density=True, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='blue')
_ = plt.hist(data2, bins=30, density=True, alpha=0.5, histtype='stepfilled', color='salmon', edgecolor='red')
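The `density=True` option rescales the bar heights so that the total area of the bars is $1$, which makes samples of different sizes comparable. A small check with NumPy, where the generator name `gen` and the seed are arbitrary:

```python
import numpy as np

gen = np.random.default_rng(0)
sample = gen.normal(size=1000)

# With density=True the heights are counts / (number of samples * bin width)
heights, edges = np.histogram(sample, bins=30, density=True)

# Area of the bars: height times width, summed over all bins
area = (heights * np.diff(edges)).sum()
print(area)
```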

'Steelblue' reminds me of ... 

![](imgs/Zoolander.jpg)

Anyways, if you need histogram data without the histogram, you can use NumPy.

In [None]:
counts, bin_edges = np.histogram(data1, bins=10)
print(counts)
print(bin_edges)

## Bonus

### Animations

One can animate in Jupyter notebooks via Matplotlib.

I learned about this on [Stack Overflow](https://stackoverflow.com/questions/35532498/animation-in-ipython-notebook/46878531).

Below I've copied directly from one of the answers. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["animation.html"] = "jshtml"
import matplotlib.animation
import numpy as np

t = np.linspace(0,2*np.pi)
x = np.sin(t)

fig, ax = plt.subplots()
l, = ax.plot([0,2*np.pi],[-1,1])

animate = lambda i: l.set_data(t[:i], x[:i])

ani = matplotlib.animation.FuncAnimation(fig, animate, frames=len(t))
ani

### Grayscale image compression

In `Week01.ipynb`, we saw an example of image compression, but we consider an entirely different approach here.

In [None]:
from PIL import Image

image_file = "imgs/Zoolander.jpg"

img = np.asarray(Image.open(image_file).convert("L"))
fig, ax = plt.subplots(ncols=2, figsize=(12, 4))
ax[0].imshow(img, cmap='gray', vmin=0, vmax=255)
ax[0].axis("off")
ax[0].set_title("Blue Steel")
ax[1].hist(img.ravel(), bins=256)
ax[1].set_xlabel("Pixel value")
ax[1].set_ylabel("Count of pixels")
ax[1].set_title("Distribution of the pixel values")
_ = fig.suptitle("Original image of Derek Zoolander")

Now we compress the image.

In [None]:
n_bins = 10

from sklearn.preprocessing import KBinsDiscretizer
encoder = KBinsDiscretizer(
    n_bins=n_bins,
    encode="ordinal",
    strategy="uniform",
    random_state=0,
    subsample=200_000,
)
compressed_img = encoder.fit_transform(img.reshape(-1, 1)).reshape(img.shape).astype(np.uint8)

fig, ax = plt.subplots(ncols=2, figsize=(12, 4))
ax[0].imshow(compressed_img, cmap=plt.cm.gray)
ax[0].axis("off")
ax[0].set_title("Compressed Blue Steel")
ax[1].hist(compressed_img.ravel(), bins=256)
ax[1].set_xlabel("Pixel value")
ax[1].set_ylabel("Number of pixels")
ax[1].set_title("Distribution of the pixel values")
_ = fig.suptitle(f"Derek Zoolander compressed using {n_bins} bins and a uniform strategy")

This example was taken partially from the [Scikit-learn tutorial](https://scikit-learn.org/stable/auto_examples/cluster/plot_face_compress.html).
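To get a feel for what uniform binning does, here is a NumPy-only sketch on a stand-in image. It splits the range between the smallest and largest pixel value into equal-width bins, which is my understanding of the `"uniform"` strategy; it is not claimed to reproduce `KBinsDiscretizer` exactly. The names `toy_img` and `n_levels` are placeholders.

```python
import numpy as np

n_levels = 10
# Stand-in for a grayscale image: a 16 x 16 ramp of all values 0..255
toy_img = np.arange(256, dtype=np.uint8).reshape(16, 16)

# Equal-width bin edges between the minimum and maximum pixel value
edges = np.linspace(toy_img.min(), toy_img.max(), n_levels + 1)

# Replace each pixel by the index of the bin it falls into (0, ..., n_levels - 1)
codes = np.digitize(toy_img, edges[1:-1]).astype(np.uint8)

# Number of bits needed to store one bin index
bits = int(np.ceil(np.log2(n_levels)))
print(codes.min(), codes.max(), bits)
```

With $10$ bins each pixel needs $4$ bits instead of $8$, at the cost of merging nearby gray values.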

## Exercises

1. Use suitable functions and ranges to plot a circle of radius $3$ around the centre $(1,1)$.
2. Plot the rational function $f(x) = \frac{x^2 + x - 2}{x^3 + 6}$ and its derivative $f'(x)$ so that all interesting points (zeros, extreme values, inflection points, singularities, ...) are contained in the plot.
3. Plot $f(x) = x^2 \sin(\pi/x)$ for $x$ in the range $[-0.3, 0.3]$.