# Statistics

In [9]:
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy import stats
%matplotlib



![The central dogma of statistics](figures/statistics.png)
From S. Skiena, *The Data Science Design Manual*, Texts in Computer Science, 2017

## Descriptive Statistics

We consider a sample (data set) composed of $n$ observations : $x_1, x_2,\ldots, x_i, \ldots, x_n$.

There are two main types of descriptive statistics :

* **Central tendency measures** : describe the center around which the data is distributed.
* **Variation or variability measures** : describe data spread, i.e. how far the measurements lie from the center.

The **mean** is a well-known centrality measure :

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$

This measure is meaningful for symmetric distributions without outliers. For example, consider the following data sets :

In [9]:
heights = [1.79, 1.65, 1.85, 1.72, 1.94, 1.87, 1.62, 1.80]
# outlier
weights = [80, 62, 57, 68, 90, 2000, 71]
# asymmetric distribution
grades = [20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

The **median** is preferred for **skewed** distributions or in the presence of outliers. It is the middle value of the sorted data set (the average of the two middle values when $n$ is even). For symmetric distributions, it is close to the arithmetic mean.
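To see the difference on the three data sets above, the following sketch compares both measures using only Python's standard `statistics` module. The `summary` dictionary is a name introduced here for illustration.

```python
import statistics

heights = [1.79, 1.65, 1.85, 1.72, 1.94, 1.87, 1.62, 1.80]
weights = [80, 62, 57, 68, 90, 2000, 71]   # contains an outlier
grades = [20] * 5 + [5] * 10               # asymmetric distribution

# Store (mean, median) for each data set.
summary = {}
for name, data in [("heights", heights), ("weights", weights), ("grades", grades)]:
    summary[name] = (statistics.mean(data), statistics.median(data))
    print(f"{name}: mean = {summary[name][0]:.3f}, median = {summary[name][1]:.3f}")
```

For `heights` the two values nearly coincide, whereas the single outlier in `weights` and the skew in `grades` pull the mean far away from the median.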

Compare the wealth per adult on [https://en.wikipedia.org/wiki/List_of_countries_by_wealth_per_adult](https://en.wikipedia.org/wiki/List_of_countries_by_wealth_per_adult). What do you conclude ?

The **geometric mean** is also a centrality measure :

$$
\Big(\prod_{i=1}^n x_i\Big)^{1/n}
$$

This measure has several applications such as :

* Computing an average interest rate.
* Averaging ratios: the geometric mean of 1/2 and 2/1 is 1.
* Averaging values measured on different scales: the same relative change in any component leads to the same relative change in the geometric mean, for example :

In [10]:
v = [(10, 100), (20, 100), (10, 200), (20, 200)]
arithmetic_means = [sum(pair)/2 for pair in v]
geometric_means = [math.sqrt(pair[0]*pair[1]) for pair in v]
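The scale-invariance claim can be checked directly. This is a minimal sketch in which `geometric_mean` is a helper defined here (not part of the original notebook); it works through logarithms to limit overflow on long products.

```python
import math

def geometric_mean(values):
    """n-th root of the product of positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

base = (10, 100)
first_doubled = (20, 100)    # small-scale component doubled
second_doubled = (10, 200)   # large-scale component doubled

# Relative change of each mean when one component is doubled.
gm_ratio_first = geometric_mean(first_doubled) / geometric_mean(base)
gm_ratio_second = geometric_mean(second_doubled) / geometric_mean(base)
am_ratio_first = (sum(first_doubled) / 2) / (sum(base) / 2)
am_ratio_second = (sum(second_doubled) / 2) / (sum(base) / 2)

print("geometric:", gm_ratio_first, gm_ratio_second)
print("arithmetic:", am_ratio_first, am_ratio_second)
print("mean of ratios 1/2 and 2/1:", geometric_mean([0.5, 2.0]))
```

Doubling either component multiplies the geometric mean by $\sqrt{2}$, while the arithmetic mean barely reacts to the small-scale component and is dominated by the large-scale one.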

**Standard deviation** ($\sigma$) is a common variability measure :

$$
\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}}
$$

where $\sigma^2 = V$ is the variance. Because the deviations are squared, this measure is very sensitive to outliers.
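As an illustration of that sensitivity, the sketch below applies the formula (with the $n - 1$ denominator) to the `weights` data set, with and without its outlier. `sample_std` is a helper written for this example; Python's `statistics.stdev` uses the same $n - 1$ convention and serves as a cross-check.

```python
import math
import statistics

def sample_std(values):
    """Standard deviation with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

weights = [80, 62, 57, 68, 90, 2000, 71]
weights_clean = [w for w in weights if w != 2000]  # drop the outlier

std_with_outlier = sample_std(weights)
std_without_outlier = sample_std(weights_clean)

print(f"with outlier:    {std_with_outlier:.1f}")
print(f"without outlier: {std_without_outlier:.1f}")
print(f"statistics.stdev agrees: {math.isclose(std_with_outlier, statistics.stdev(weights))}")
```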

In [12]:
# the lifespan of light bulbs: normal (left) and with zero variance (right).

Means and standard deviations complement each other in characterising a distribution. For example, together they allow the use of **Chebyshev's inequality** :

$$
P(|X - \mu| \geqslant k\sigma) \leqslant \frac{1}{k^2}
$$

This means that at least a fraction $1-1/k^2$ of the observations must lie in the interval $[\bar{x}-k\sigma, \bar{x}+k\sigma]$. For $k = 2$, at least $75\%$ of all the data must therefore lie in the interval $[\bar{x}-2\sigma, \bar{x}+2\sigma]$.
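The bound holds for any distribution, including strongly skewed ones. The sketch below checks it empirically on an exponential sample; the choice of distribution, the seed and the sample size are arbitrary assumptions made for this example.

```python
import random
import statistics

rng = random.Random(0)                      # fixed seed for reproducibility
sample = [rng.expovariate(1.0) for _ in range(10_000)]

mean = statistics.fmean(sample)
sigma = statistics.stdev(sample)

# Fraction of observations strictly within k standard deviations of the mean.
fractions = {}
for k in (2, 3, 4):
    inside = sum(1 for x in sample if abs(x - mean) < k * sigma)
    fractions[k] = inside / len(sample)
    print(f"k = {k}: observed {fractions[k]:.4f} >= bound {1 - 1 / k**2:.4f}")
```

The observed fractions are typically well above the bound: Chebyshev's inequality is a worst-case guarantee, not an estimate.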

## Distributions

In [11]:
fig, ax = plt.subplots()
n = 200
p = 0.5
x = np.arange(stats.binom.ppf(0.000000001, n, p), stats.binom.ppf(0.999999999, n, p))
ax.plot(x, stats.binom.pmf(x, n, p), linewidth=1.5)
ax.set_xlabel("X")
ax.set_ylabel("Probability")
ax.set_title("The Binomial Distribution of Coin Flips")
ax.set_xlim(65,135)


In [12]:
fig, ax = plt.subplots()
n = 1000
p = 0.001
x = np.arange(stats.binom.ppf(0.0000000000000001, n, p), stats.binom.ppf(0.999999999999999999, n, p))
ax.plot(x, stats.binom.pmf(x, n, p), linewidth=1.5)
ax.set_xlabel("X")
ax.set_xlim(0,5)
ax.set_ylabel("Probability")
ax.set_title("The Binomial Distribution of Lightbulb Burnouts")


### The Binomial Distribution

We consider a *random experiment* with two possible outcomes $P_1$ and $P_2$ with probabilities $p$ and $q = (1-p)$. The *binomial distribution* defines the probability that $P_1$ occurs exactly $x$ times after $n$ independent trials :

$$
P(X = x) = {n \choose x} p^x (1 - p)^{(n - x)}
$$

This function of $x$ is the **probability mass function** (**pmf**) of the **discrete random variable** $X$ under the *binomial distribution*.
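The formula translates almost literally into code. In this minimal sketch, `binomial_pmf` is a helper written for illustration; the parameters reuse the coin-flip ($n = 200$, $p = 0.5$) and light-bulb ($n = 1000$, $p = 0.001$) settings plotted earlier.

```python
from math import comb, isclose

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Coin flips: probability of exactly 100 heads in 200 fair tosses.
p_100_heads = binomial_pmf(100, 200, 0.5)

# Light bulbs: probability that none of 1000 bulbs burns out.
p_no_burnout = binomial_pmf(0, 1000, 0.001)

# A pmf must sum to 1 over all possible values of x.
total = sum(binomial_pmf(x, 200, 0.5) for x in range(201))

print(p_100_heads, p_no_burnout, isclose(total, 1.0))
```

The same values should be obtainable with `stats.binom.pmf(x, n, p)`, as used in the plotting cells.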

### The Multinomial Distribution

This distribution generalises the *binomial distribution*: each trial can lead to $k$ different outcomes instead of two. Each outcome is labelled $A_i$, with $i = 1,\ldots,k$, and occurs with probability $p_i$. The *probability mass function* gives the probability that, after $n$ independent trials, outcome $A_i$ occurs exactly $x_i$ times (with $\sum_{i=1}^k x_i = n$) :

$$
P(X = (x_1,\ldots,x_i,\ldots,x_k)) = n!\prod_{i=1}^k \Big(\frac{p_i^{x_i}}{x_i!}\Big)
$$
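As a sanity check of this formula, the sketch below evaluates it for a fair die and verifies that it reduces to the binomial case when $k = 2$. `multinomial_pmf` is a hypothetical helper written for this example, not a library function.

```python
from math import comb, factorial, isclose, prod

def multinomial_pmf(counts, probs):
    """n! * prod(p_i**x_i / x_i!) with n = sum of the counts."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)      # exact integer division at each step
    return coefficient * prod(p**x for p, x in zip(probs, counts))

# Fair die rolled 6 times: probability that each face appears exactly once.
p_each_face_once = multinomial_pmf([1] * 6, [1 / 6] * 6)

# With k = 2 outcomes the formula reduces to the binomial pmf.
p_multinomial = multinomial_pmf((3, 7), (0.3, 0.7))
p_binomial = comb(10, 3) * 0.3**3 * 0.7**7

print(p_each_face_once, isclose(p_multinomial, p_binomial))
```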

### The Uniform Continuous Distribution

Each value in the range $[a, b]$ is equally likely to occur. For $x \in [a, b]$, the **probability density function** is defined as :

$$
f(x) = \frac{1}{b-a}
$$

and $f(x) = 0$ otherwise.
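The density of a continuous uniform distribution is constant, $1/(b-a)$ on $[a, b]$, and its integral, the cumulative distribution function, grows linearly as $(x-a)/(b-a)$. The sketch below makes the distinction concrete; the two helper functions are written for this example and the bounds match the plotting cell that follows.

```python
def uniform_pdf(x, a, b):
    """Constant density 1/(b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """P(X <= x): rises linearly from 0 at a to 1 at b."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2, 7.6

# Midpoint Riemann sum of the density over [a, b]: should be close to 1.
steps = 10_000
width = (b - a) / steps
area = sum(uniform_pdf(a + (i + 0.5) * width, a, b) * width for i in range(steps))

print(uniform_pdf(5, a, b), uniform_cdf(5, a, b), area)
```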

In [30]:
fig, ax = plt.subplots()
a = 2
b = 7.6
x = np.linspace(a, b, 100)

ax.plot(x, stats.uniform.pdf(x, loc=a, scale=(b-a)), color='r', linewidth=1.5)
ax.set_xlabel("X")
ax.set_ylabel("Density")
ax.set_title("Uniform Distribution")
ax.set_xlim(0,10)

# random values
r = stats.uniform.rvs(loc=a, scale=(b-a), size=10000)
# density=True replaces the `normed` argument removed from recent matplotlib versions
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)


### The Normal Distribution

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi} } \; e^{-(x-\mu)^2/2\sigma^2}
$$

A typical example of an attribute following a normal distribution is *experimental error*, where small errors are more likely than big ones.

Other examples of attributes that are **bell-shaped** but not necessarily normal are **physical phenomena** such as :
* height,
* weight,
* lifespan,...
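The density formula given at the start of this section can be evaluated directly. In this minimal sketch, `normal_pdf` is a helper written for illustration, and the standard library's `statistics.NormalDist` is used only as a cross-check; the test point and parameters are arbitrary.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# The standard normal density peaks at x = mu with value 1 / sqrt(2 * pi).
peak = normal_pdf(0.0)

# Cross-check against the standard library at an arbitrary point.
ours = normal_pdf(1.3, mu=0.5, sigma=2.0)
reference = NormalDist(mu=0.5, sigma=2.0).pdf(1.3)

print(peak, ours, reference)
```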

In [31]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13,4.5))

# LEFT PLOT
x = np.linspace(stats.norm.ppf(0.0001), stats.norm.ppf(0.9999), 100)
ax1.plot(x, stats.norm.pdf(x), 'r-', linewidth=1.2)
ax1.fill_between(x, stats.norm.pdf(x), facecolor='red', alpha=0.25)
ax1.set_xlabel('x')
ax1.set_ylabel('Density')
ax1.set_title("pdf")
ax1.set_ylim(0,0.42)
ax1.set_xlim(-3.5, 3.5)

# RIGHT PLOT
ax2.plot(x, stats.norm.cdf(x), 'b-', linewidth=1.2)
ax2.fill_between(x, stats.norm.cdf(x), facecolor='blue', alpha=0.25)
ax2.set_xlabel('x')
ax2.set_ylabel('Cumulative probability')
ax2.set_title("cdf")
ax2.set_ylim(-0.005,1.05)
ax2.set_xlim(-3.5, 3.5)


In [40]:
# Note: stats.norm.pdf(x, mean, stdev)

x = np.linspace(stats.norm.ppf(0.00000000001), stats.norm.ppf(0.99999999999), 100000)
mean = 0
sigma = 1
pdf = stats.norm.pdf(x, mean, sigma)

fig, ax = plt.subplots()
ax.plot(x, pdf, 'b-')

ax.fill_between(x, pdf, color='b', alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x <-3*sigma), alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x <-2*sigma), alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x <-1*sigma), alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x > 3*sigma), alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x > 2*sigma), alpha=0.15)
ax.fill_between(x, pdf, color='b', where=( x > 1*sigma), alpha=0.15)

ax.set_ylim(0, 0.42)
ax.set_xlim(-3.5, 3.5)

#Hide y axis
ax.set_yticks([])

print('P(-sigma <= X <= sigma) = ', stats.norm.cdf(1*sigma, mean, sigma) - stats.norm.cdf(-1*sigma, mean, sigma))
print('P(-2*sigma <= X <= 2*sigma) = ', stats.norm.cdf(2*sigma, mean, sigma) - stats.norm.cdf(-2*sigma, mean, sigma))
print('P(-3*sigma <= X <= 3*sigma) = ', stats.norm.cdf(3*sigma, mean, sigma) - stats.norm.cdf(-3*sigma, mean, sigma))

P(-sigma <= X <= sigma) =  0.682689492137
P(-2*sigma <= X <= 2*sigma) =  0.954499736104
P(-3*sigma <= X <= 3*sigma) =  0.997300203937


## References

* **The Data Science Design Manual**, by Steven Skiena, 2017, Springer
* Python notebooks available at [http://data-manual.com/data](http://data-manual.com/data)
* Lectures slides available at [http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/](http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/)
* [Grinstead and Snell's Introduction to Probability](https://math.dartmouth.edu/~prob/prob/prob.pdf), The CHANCE Project, version dated 4 July 2006
* **Statistical Distributions**, by Catherine Forbes, Merran Evans, Nicholas Hastings, Brian Peacock, 4th Edition, 2011