## More plotting

- Descriptive statistics
- Other statistical plots

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Loading simple text data

Consider a simple text file like `data/data.txt`, which is located in the
`data` directory on the interface. The file has two columns of data like so:
```
6.930830303506971291e+01 1.950570203348928500e+00
5.076032587226054460e+01 1.968615715715650527e+00
6.285323760945500027e+01 1.343008070563283418e+00
7.068897251770047774e+01 1.602208201785916941e+00
...
```
This can be easily loaded as a numpy array like so:

In [None]:
# Change the path to suit your setup.
data = np.loadtxt('../data/data.txt')

In [None]:
print(len(data))
data[:2]

- Look at the documentation for `np.loadtxt`
- Since this has two columns, we can use `unpack=True` to extract the two columns

In [None]:
x, y = np.loadtxt('../data/data.txt', unpack=True)
print(len(x), len(y))
x[:2]
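To see what `unpack=True` changes without depending on the course file, here is a minimal sketch that reads a small in-memory table instead. The three rows of numbers and the names `text`, `rows`, `col0` and `col1` are made up for illustration.

```python
import io

import numpy as np

# Hypothetical two-column data held in memory instead of a file on disk.
text = io.StringIO("1.0 10.0\n2.0 20.0\n3.0 30.0\n")

# Default: one 2D array with a row per line of the file.
rows = np.loadtxt(text)
print(rows.shape)

# unpack=True transposes the result so each column can be assigned separately.
text.seek(0)  # rewind the in-memory file before reading it again
col0, col1 = np.loadtxt(text, unpack=True)
print(col0, col1)
```

With `unpack=True` the columns come back as separate 1D arrays, which is usually what the plotting functions expect.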

## Mean, std-dev, percentiles, ...

Now let us look at some simple statistics of this data.

In [None]:
np.mean(x)

In [None]:
np.mean(y)

In [None]:
# The default (ddof=0) divides by n: this is the biased estimator.
np.std(x)


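The biased estimator of the variance is:
$$\sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}$$
The unbiased estimator of the variance is:
$$s^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$
`np.std` returns the square root of the chosen estimator; `ddof` selects the divisor $n - \text{ddof}$.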

In [None]:
# ddof=1 divides by n - 1: the square root of the unbiased variance estimator.
np.std(x, ddof=1)
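As a quick check of the two estimators above, the following sketch computes both by hand and compares them with `np.std`. The array `sample` is a hypothetical data set chosen so that the arithmetic is easy to follow; it is not the course data.

```python
import numpy as np

# Hypothetical sample, used only for illustration.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(sample)
mean = sample.sum() / n

# Sum of squared deviations from the mean (the numerator in both formulas).
ss = ((sample - mean) ** 2).sum()

std_biased = np.sqrt(ss / n)          # divisor n, matches np.std(sample)
std_unbiased = np.sqrt(ss / (n - 1))  # divisor n - 1, matches np.std(sample, ddof=1)

print(std_biased, np.std(sample))
print(std_unbiased, np.std(sample, ddof=1))
```

The difference between the two shrinks as `n` grows, so the choice matters most for small samples.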

In [None]:
np.median(x)

### Percentiles

In [None]:
np.percentile(x, 50)

In [None]:
# Get multiple in one shot.
np.percentile(x, [25, 50, 75, 90])
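To build some intuition for what `np.percentile` returns, here is a small sketch on a hypothetical array of the integers 1 to 101, where the answers are easy to predict. The name `values` is an assumption made for this example.

```python
import numpy as np

# Hypothetical data: the integers 1, 2, ..., 101.
values = np.arange(1, 102)

q25, q50, q75 = np.percentile(values, [25, 50, 75])
print(q25, q50, q75)

# The 50th percentile is the median; 0 and 100 give the extremes.
print(q50 == np.median(values))
print(np.percentile(values, 0), np.percentile(values, 100))
```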

## Exercise

- For the data `x` loaded above
- Find the percentiles from 0 to 100 (in unit steps)
- Plot them


In [None]:
# Solution

## Scatter plots

Useful for getting a sense of the data and the relationship between the two variables.

In [None]:
plt.scatter(x, y);

In [None]:
# Can also show size and color.
# We construct a simple scalar to display.
s = (x - np.mean(x))**2 + (y - np.mean(y))**2
plt.scatter(x, y, c=s, s=s)
plt.grid();

## Histograms

Extremely useful to get a sense of the distribution of the data.

In [None]:
plt.hist(y);

In [None]:
# Warning: the default is just 10 bins; bins='auto' is often a better choice.
plt.hist(x, bins='auto');

In [None]:
# But if you know better you can specify the number of bins.
plt.hist(x, bins=20);

### Cumulative and relative histograms

- Hope you've seen this in your Statistics course?
- You get an empirical probability density function (PDF) through a histogram!
- But this has to be a relative histogram, i.e. use `density=True`
- A cumulative histogram is an approximation to a Cumulative Distribution Function (CDF)

In [None]:
# Relative histogram.
plt.hist(y, bins='auto', density=True);

In [None]:
plt.hist(y, bins='auto', cumulative=True, density=True);
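The claims about relative and cumulative histograms can be checked numerically with `np.histogram`, which computes the same bin heights without drawing anything. This is a sketch on hypothetical normally distributed data; the names `sample`, `density`, `edges` and `cdf` are assumptions made for this example.

```python
import numpy as np

# Hypothetical data: 1000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

# density=True scales the bar heights so that the total area is 1.
density, edges = np.histogram(sample, bins='auto', density=True)
widths = np.diff(edges)
area = np.sum(density * widths)
print(area)

# Accumulating the bar areas gives an approximation to the CDF,
# which rises monotonically and ends at 1.
cdf = np.cumsum(density * widths)
print(cdf[-1])
```

Note that it is the area of the bars, not their heights, that sums to one; individual heights can exceed 1 when the bins are narrow.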

## Box-plots

- Powerful plotting utility
- Shows the distribution of the data at a glance
- Typically shows the following:
   - Median, Q1, Q3, IQR
   - Q1 - 1.5*IQR, Q3 + 1.5*IQR
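
The quantities listed above can be computed directly with `np.percentile`. The sketch below uses a small hypothetical array with one obvious outlier; the names `lower_fence` and `upper_fence` are labels chosen for this example.

```python
import numpy as np

# Hypothetical data with a single large value.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Limits beyond which points are usually drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sample[(sample < lower_fence) | (sample > upper_fence)]
print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

By default Matplotlib draws each whisker to the most extreme data point that still lies inside these limits, so the whisker ends need not coincide with the limits themselves.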


In [None]:
# Note: tick_labels needs Matplotlib 3.9 or newer; older versions use labels=.
plt.boxplot(x, tick_labels=['x'], showmeans=True);

- Can show multiple boxplots together


In [None]:
plt.boxplot((x, y), tick_labels=['x', 'y'], showmeans=True, patch_artist=True);

## Violin plots!

- Combine PDFs and box plots.


In [None]:
plt.violinplot(x);

In [None]:
plt.violinplot(y);

## Pearson's correlation coefficient

$$r = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}{(n-1) s_x s_y}$$
$$r = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}
          {\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

- $-1 \leq r \leq 1$
- "Small" $|r|$ implies a weaker linear correlation

In [None]:
np.corrcoef(x, y)
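
Both forms of the formula above can be evaluated by hand and compared with the off-diagonal entry of the matrix that `np.corrcoef` returns. This is a sketch on hypothetical data; the arrays `a` and `b` are made up for illustration.

```python
import numpy as np

# Hypothetical, nearly linear data.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(a)

da = a - a.mean()
db = b - b.mean()

# First form: uses the sample standard deviations (ddof=1).
r_std = np.sum(da * db) / ((n - 1) * np.std(a, ddof=1) * np.std(b, ddof=1))

# Second form: sums of squared deviations only.
r_sums = np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2))

# np.corrcoef returns a 2x2 matrix; r is the off-diagonal entry.
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_std, r_sums, r_numpy)
```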

## Pie and bar charts

- In general, avoid pie charts; use bar charts or histograms instead.
- But here is how you can make one.

Consider this data:

| **Cancer**  | Lung | Breast | Colon | Prostate | Melanoma | Bladder |
|-------------|------|--------|-------|----------|----------|---------|
| **Numbers** | 42   |  50    |  32   |   55     |  9       |  12     |


In [None]:
# Solution
cancer = ['Lung', 'Breast', 'Colon', 'Prostate', 'Melanoma', 'Bladder']
numbers = [42, 50, 32, 55, 9, 12]
plt.pie(numbers, labels=cancer, shadow=True);

### Better to use a bar chart


In [None]:
# Use a separate name so that the data array `x` is not overwritten.
pos = np.arange(len(numbers))
plt.bar(pos, numbers, tick_label=cancer)
plt.grid();

## Exercise

- Load the data in `'sachin.txt'`, and `'kohli.txt'`
- These have the ODI runs for Sachin Tendulkar and Virat Kohli
- They have a single column each
- Load the data using `np.loadtxt`
- Show a histogram (relative) for each.
- Show them together, use the same approach as we did for boxplot.
- Hint: use `plt.legend` to set the legend
- Show the data using a boxplot
- Show the data using a violin plot.

Comment on the players' performance.

In [None]:
# Solution


## Interactivity with widgets

- Jupyter widgets or [IPyWidgets](https://ipywidgets.readthedocs.io/) make
  it very easy to add interactivity to a notebook.
- Already installed on the course website.
- Usually already installed in Anaconda.

Here we learn the bare minimum to get started. Read the docs to learn more.


In [None]:
from ipywidgets import interact

In [None]:
def sine_plot(k=1, phi=0.0):
    x = np.linspace(0, 2*np.pi, 1000)
    y = np.sin(x*k + phi)
    plt.plot(x, y)
    plt.grid()

In [None]:
sine_plot(k=2, phi=0.5)

In [None]:
interact(sine_plot, k=(1, 20), phi=(0, 2*np.pi))

You can also use the decorator syntax:

In [None]:
@interact(k=(1, 20), phi=(0, 2*np.pi))
def sine_plot(k=1, phi=0.0):
    x = np.linspace(0, 2*np.pi, 1000)
    y = np.sin(x*k + phi)
    plt.plot(x, y)
    plt.grid()

### Another example


In [None]:
@interact(name=['Krishna', 'Prabhu', 'Hassan'], n=(0, 10), x=(0.0, 1.0))
def f(option=True, hi='Hello', name='Krishna', n=1, x=0.1):
    print(option, hi, name, n, x)

### Aside: decorators

- Decorators are higher order functions
- Operate on a function and return a function


In [None]:
# Trivial example
def deco(f):
    return f

In [None]:
@deco
def g(x):
    pass

# Equivalent to defining g without the decorator and then rebinding it:
# g = deco(g)

In [None]:
g(1)

In [None]:
# More complex example
def deco1(f):
    def _my_f(x):
        print("Hello!")
        return f(x)
    return _my_f

In [None]:
@deco1
def g(x):
    return x + 1

In [None]:
g(1)
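
The wrapper in `deco1` only accepts a single argument. A slightly more general sketch forwards any arguments with `*args` and `**kwargs`, and uses `functools.wraps` from the standard library so that the decorated function keeps its name. The decorator `count_calls` and the function `add` are hypothetical names made up for this example.

```python
import functools


def count_calls(f):
    """Hypothetical decorator that counts how often f is called."""
    @functools.wraps(f)  # copy f's name and docstring onto the wrapper
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return f(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def add(a, b=1):
    return a + b


result1 = add(1)
result2 = add(2, b=3)
print(result1, result2, add.calls, add.__name__)
```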

## Summary

- Statistical quantities: mean, median, std. dev., percentiles
- Statistical plots: scatter plot, histogram, boxplot, bar/pie chart, violin plot
- Correlation coefficient
- Interactivity with `ipywidgets` and `interact`
- Decorators