# Stats/ML Study Sesh

## Python version

### Topics

- Basic statistics review
- Tableau Python integration

## Numerical measures

From [R numerical measures](http://www.r-tutor.com/elementary-statistics/numerical-measures).

- Mean
- Median
- Quartile/Percentile
- Range
- Interquartile range
- [Variance](http://www.r-tutor.com/elementary-statistics/numerical-measures/variance)
- [Standard deviation](http://www.r-tutor.com/elementary-statistics/numerical-measures/standard-deviation)
- [Covariance](http://www.r-tutor.com/elementary-statistics/numerical-measures/covariance)
- [Correlation coefficient](http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient)

## Probability distributions

From [R probability distributions](http://www.r-tutor.com/elementary-statistics/probability-distributions).

- [Binomial distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/binomial-distribution)
- [Poisson distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/poisson-distribution)
- [Continuous uniform distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/continuous-uniform-distribution)
- [Exponential distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/exponential-distribution)
- [Normal distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/normal-distribution)
- [Chi-squared distribution](http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution)

### Our list in Python

In [1]:
import numpy as np

my_list = [3, 6, 12, 17, 32, 49, 50, 90]

print('My list of numbers: ', my_list)

My list of numbers:  [3, 6, 12, 17, 32, 49, 50, 90]


We can list the variables currently defined in the Jupyter environment with the `whos` command. It shows everything defined by the `code` cells that have been run so far. Unfortunately, these variables cannot be used in markdown cells.

In [2]:
whos

Variable   Type      Data/Info
------------------------------
my_list    list      n=8
np         module    <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>


### Mean and median

In [4]:
# A readable style is to do only one thing per line
# This:
#
# a = np.some_calculation(arg)
# print("hello: ", a)
#
# is more readable than this:
#
# print("hello: ", np.some_calculation(arg))
#

print(my_list)

m = np.mean(my_list)
d = np.median(my_list)

print(f'Mean: {m}')
print('Median: ', d)

[3, 6, 12, 17, 32, 49, 50, 90]
Mean: 32.375
Median:  24.5


### Quartiles and percentiles

In [5]:
# If you need to calculate the same thing many times, use a loop:

print(my_list)

p_tiles = [25, 50, 75, 3, 9, 90, 99]

for i in p_tiles:
    p = np.percentile(my_list, i)
    print(f'{i}th percentile: {p}')

[3, 6, 12, 17, 32, 49, 50, 90]
25th percentile: 10.5
50th percentile: 24.5
75th percentile: 49.25
3th percentile: 3.63
9th percentile: 4.890000000000001
90th percentile: 61.99999999999999
99th percentile: 87.19999999999999
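
Several of the percentiles above (10.5, for example) are not values in the list. That is because `np.percentile` interpolates linearly between neighbouring values by default. The following is a minimal pure-Python sketch of that idea; the helper name `percentile_linear` is made up for illustration and is not part of NumPy.

```python
import math

def percentile_linear(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    data = sorted(values)
    # Fractional 0-based position of the percentile in the sorted data
    pos = (q / 100) * (len(data) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    # Move part of the way from the lower neighbour to the upper one
    fraction = pos - lower
    return data[lower] + fraction * (data[upper] - data[lower])

my_list = [3, 6, 12, 17, 32, 49, 50, 90]

q25 = percentile_linear(my_list, 25)
q50 = percentile_linear(my_list, 50)
q75 = percentile_linear(my_list, 75)

print(f'25th percentile: {q25}')
print(f'50th percentile: {q50}')
print(f'75th percentile: {q75}')
```

For the 25th percentile the position is 0.25 × 7 = 1.75, so the result lies three quarters of the way from 6 to 12.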


### Range

In [6]:
# In Python, 'range' is already the name of a built-in (not a keyword).
# NumPy calls this measure 'peak to peak' (ptp): max minus min

print(my_list)

r = np.ptp(my_list)
print('Range: ', r)

[3, 6, 12, 17, 32, 49, 50, 90]
Range:  87


### Interquartile Range

In [None]:
print(my_list)

# NumPy doesn't have a dedicated function for IQR,
# so subtract the 25th percentile from the 75th
q25, q75 = np.percentile(my_list, [25, 75])
iqr = q75 - q25

print('IQR: ', iqr)

### Variance

The **variance** is a numerical measure of how the data values are dispersed around the mean. The sample mean and variance are $\bar{x}$ and $s^2$, respectively. The population mean and variance are $\mu$ and $\sigma^2$.

To calculate the sample variance, subtract the mean from each value and square the result. Add up all the squares, then divide the total by one less than the number of observations. With $n$ as the total number of observations, it looks like this:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 +\;...\,+ (x_n - \bar{x})^2}{n-1}$$

In summation notation, the sample variance divides by $n-1$, while the population variance divides by the population size $N$:

$$s^2=\frac{1}{n-1}\sum_{i=1}^n (x_{i}-\bar{x})^2 \qquad \text{or} \qquad \sigma^2=\frac{1}{N}\sum_{i=1}^N (x_{i}-\mu)^2$$

In [9]:
print(my_list)

# NumPy defaults to ddof=0 (divide by n, the population variance);
# ddof=1 divides by n - 1, which gives the sample variance
variance = np.var(my_list, ddof=1)

print(f'Variance: {variance}')

[3, 6, 12, 17, 32, 49, 50, 90]
Variance: 873.9821428571429
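
To connect the formula to the number above, here is a rough sketch that computes the variance step by step in plain Python, without NumPy. The variable names are chosen for illustration only.

```python
my_list = [3, 6, 12, 17, 32, 49, 50, 90]

n = len(my_list)
mean = sum(my_list) / n

# Squared deviation of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in my_list]
total = sum(squared_deviations)

# Sample variance divides by n - 1 (same as ddof=1 in NumPy)
sample_variance = total / (n - 1)

# Population variance divides by n (same as ddof=0 in NumPy)
population_variance = total / n

print(f'Sample variance: {sample_variance}')
print(f'Population variance: {population_variance}')
```

The sample variance should agree with the `np.var(my_list, ddof=1)` result, and the population variance is always a little smaller because it divides by a larger number.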


### Standard deviation

The **standard deviation** of an observation variable is the square root of its variance.

Sample: $s = \sqrt{s^2}$.

Population: $\sigma = \sqrt{\sigma^2}$.

In [13]:
print(my_list)

std = np.std(my_list, ddof=1)

print(f'Std dev: {std}')

[3, 6, 12, 17, 32, 49, 50, 90]
Std dev: 29.5631889832126
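
As a quick check of the definition, the sketch below takes the square root of the sample variance by hand and compares it with `statistics.stdev` from the Python standard library, which also divides by $n-1$. Using the `statistics` module here is a choice made for this example; the notebook itself uses NumPy.

```python
import math
import statistics

my_list = [3, 6, 12, 17, 32, 49, 50, 90]

n = len(my_list)
mean = sum(my_list) / n

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = sum((x - mean) ** 2 for x in my_list) / (n - 1)

# Standard deviation is the square root of the variance
std_by_hand = math.sqrt(sample_variance)

# Standard-library sample standard deviation, for comparison
std_stdlib = statistics.stdev(my_list)

print(f'Std dev by hand: {std_by_hand}')
print(f'Std dev (statistics.stdev): {std_stdlib}')
```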
