## CS102-4 - Further Computing

Prof. Götz Pfeiffer<br>
School of Mathematics, Statistics and Applied Mathematics<br>
NUI Galway

### 3. Aspects of Data Visualization

# Week 10: Scatter Plots and Histograms

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In a **line plot**, the computer draws a number of dots and then joins the dots.
If sufficiently many dots are drawn, the result can look like a smooth curve.
When the dots are not joined, you get a **scatter plot**.  
Scatter plots are more appropriate when joining the dots does not make any sense.

In [None]:
pi = pd.Series([3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9])
pi.plot()

In [None]:
plt.scatter(pi.index,pi.values, color='c')

In [None]:
plt.bar(pi.index, pi.values, color='m')

In [None]:
pi.plot.bar()

Recall the line-style and color arguments for line plotting:
* `'-'`: solid
* `'--'`: dashed
* `'-.'`: dashdot
* `':'`: dotted

* `'r'`: red
* `'g'`: green
* `'b'`: blue
* `'c'`: cyan
* `'m'`: magenta
* `'y'`: yellow
* `'k'`: black
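
A color code and a line-style code can be combined into a single format string. The following is a minimal sketch of that idea; the sample points `t` and the names `dashed` and `dotted` are chosen here purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 50)   # illustrative sample points

# One color code plus one line-style code per format string
dashed, = plt.plot(t, np.sin(t), 'r--', label="'r--'")  # red, dashed
dotted, = plt.plot(t, np.cos(t), 'k:', label="'k:'")    # black, dotted
plt.legend();
```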


## Simple Scatter Plots

* Instead of points being joined by line segments, in a scatter plot the points are represented individually with a dot, circle, or other shape.

* It turns out that the ``plt.plot``/``ax.plot`` function can produce scatter plots as well:

In [None]:
x = np.linspace(0, 10, 30)
y = np.sin(x)

plt.plot(x, y, 'o', color='black');

* The third argument in the function call is a character that represents the type of symbol used for the plotting.  
* The full list of available symbols can be seen in the documentation of ``plt.plot``.
* Most of the possibilities are fairly intuitive:

In [None]:
rng = np.random.RandomState()
plt.figure(figsize=(12, 6))
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend()
plt.xlim(0, 1.4);

* These character codes can be used together with line and color codes to plot points along with a line connecting them:

In [None]:
plt.plot(x, y, '-ok');

* Additional keyword arguments to ``plt.plot`` specify a wide range of properties of the lines and markers:

In [None]:
plt.plot(x, y, '-p', color='gray',
         markersize=16, linewidth=4,
         markerfacecolor='lightblue',
         markeredgecolor='g',
         markeredgewidth=1)
plt.ylim(-1.2, 1.2);

In [None]:
#?plt.plot

## Scatter Plots with ``plt.scatter``

* A second, more powerful method for creating scatter plots is the ``plt.scatter`` function:

In [None]:
plt.scatter(x, y, marker='o');

* The primary difference between ``plt.scatter`` and ``plt.plot`` is that ``plt.scatter`` can create scatter plots where the properties of each point (size, face color, edge color, etc.) are individually controlled or mapped to data.
* In order to better see the overlapping results, we'll also use the ``alpha`` keyword to adjust the transparency level:

In [None]:
rng = np.random.RandomState()
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.figure(figsize=(12,6))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5,
            cmap='viridis')
plt.colorbar();  # show color scale

* Notice that the color argument is automatically mapped to a color scale (shown here by the ``colorbar()`` command).
* Also note that the size argument ``s`` is given as the marker area in points squared (a point is 1/72 inch), not in pixels.
* In this way, the color and size of points can be used to convey information in the visualization, in order to visualize multidimensional data.
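
Because the ``s`` argument of ``plt.scatter`` measures marker area in points squared, doubling the width of a marker means quadrupling ``s``. The sketch below illustrates this under that assumption; the diameters and the names `diameters`, `areas` and `points` are made up for the example.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

diameters = [5, 10, 20]                 # desired marker widths, in points
areas = [d ** 2 for d in diameters]     # s is an area, so square the width

# Three markers side by side, each twice as wide as the previous one
points = ax.scatter([1, 2, 3], [1, 1, 1], s=areas)
ax.set_xlim(0, 4);
```

For comparison, the ``markersize`` keyword of ``plt.plot`` is a width in points, so ``s`` corresponds to ``markersize**2``.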

* For example, we might use the Iris data from Scikit-Learn, where each sample is one of three types of flowers that has had the size of its petals and sepals carefully measured:

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T

plt.figure(figsize=(12, 6))
plt.scatter(features[0], features[1], alpha=0.5,
            s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

* This scatter plot lets us explore four different dimensions of the data simultaneously:
the $(x, y)$ location of each point corresponds to the sepal length and width, the size of the point is related to the petal width, and the color is related to the particular species of flower.

## Histograms and Binnings

* Histograms are usually drawn as bar charts.
* In a histogram, data are first grouped into bins; each bin is then drawn as a bar whose height shows how many data points fall into it.
* A simple histogram can be a great first step in understanding a dataset.
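
The two steps, binning and then plotting, can also be carried out by hand. The following sketch uses a small made-up data set and hand-picked bin edges (`values` and `edges` are illustrative names, not part of the notes) to count the points per bin and draw the counts as a bar chart.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])   # made-up data
edges = np.array([0, 2, 4, 6, 8, 10])               # five bins of width 2

# Step 1: group the data into bins and count the points in each bin
counts, _ = np.histogram(values, bins=edges)

# Step 2: draw one bar per bin, centered on the bin and as wide as the bin
centers = (edges[:-1] + edges[1:]) / 2
plt.bar(centers, counts, width=np.diff(edges), edgecolor='k');
```

Each bin contains its left edge but not its right edge, except for the last bin, which contains both.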

In [None]:
data = np.random.randn(1000)
plt.hist(data);

The ``hist()`` function has many options to tune both the calculation and the display; 
here's an example of a more customized histogram:

In [None]:
plt.hist(data, bins=30, 
         density=True, 
         alpha=0.5,
         histtype='stepfilled', 
         color='steelblue',
         edgecolor='blue');
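
With ``density=True``, the bar heights are rescaled so that the total area of the bars is 1, which makes histograms of differently sized samples comparable. The following sketch checks this numerically on illustrative random data; the names `sample`, `heights`, `edges` and `total_area` are chosen here for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

sample = np.random.randn(1000)   # illustrative data

# plt.hist returns the bar heights, the bin edges and the drawn patches
heights, edges, patches = plt.hist(sample, bins=30, density=True)

# Area of each bar = height * width; the areas should add up to 1
widths = np.diff(edges)
total_area = np.sum(heights * widths)
print(total_area)
```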

In [None]:
#?plt.hist

* This combination of ``histtype='stepfilled'`` with some transparency ``alpha`` can be very useful when comparing histograms of several distributions:

In [None]:
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);

* If you would like to simply compute the histogram (that is, count the number of points in a given bin) and not display it, the ``np.histogram()`` function is available:

In [None]:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
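
The second return value of ``np.histogram`` holds the bin edges, so it has one more entry than the array of counts. As a small sketch on illustrative random data (`sample` is a name chosen for this example), each count can be paired with the interval it belongs to:

```python
import numpy as np

sample = np.random.randn(1000)   # illustrative data

counts, bin_edges = np.histogram(sample, bins=5)

# Consecutive edges delimit one bin: pair them up with the counts
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}): {n} points")
```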

## References

### `numpy`

* `random.randn`: [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html)


* `random.normal`: [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html)


* `histogram`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html)

### `matplotlib.pyplot`

* `figure`: [[doc]](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)


* `scatter`: [[doc]](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)


* `bar`: [[doc]](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)


* `hist`: [[doc]](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)

## Exercises

1. A new hit single enters the charts in week 10 at position 14.  Its positions in the subsequent 6 weeks
are 8, 6, 6, 2, 16, 18.  It then drops out of the charts, only to re-enter once more at position 17 in week 20.
How would you plot these data?

2. 50 students sit an exam and receive marks as follows: 
    $39, 75, 61, 80, 55, 43, 64, 32, 80, 40, 30, 61, 74, 78, 59, 79, 76,
    35, 68, 82, 41, 60, 31, 66, 80, 33, 49, 79, 44, 89, 75, 78, 35, 56,
    33, 40, 38, 81, 34, 61, 35, 30, 87, 63, 40, 31, 42, 32, 53, 71$.
    How would you plot these data?
       