# Basic prerequisites for machine learning

This notebook explains basic statistical concepts, such as how to calculate the standard deviation and the correlation coefficient.

## Content
- [Standard deviation](#Standard-deviation)
- [Correlation](#Correlation)
- [Linear regression](#Linear-regression)

In [2]:
# __future__ imports must come before any other statement; only needed on Python 2
from __future__ import division

import numpy as np
from scipy.stats import pearsonr

## Standard deviation

> Standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values. – [Wikipedia](https://en.wikipedia.org/wiki/Standard_deviation)

$$s = \sqrt{\frac{\Sigma{(x_i - \overline{x})^2}}{n-1}}$$

The formula above defines the sample standard deviation $s$ of $n$ values $x_i$ with mean $\overline{x}$. The steps are explained in greater detail below, using this example data: `[27, 28, 30, 45]`.

__1. Find the average of the data set: $\overline{x}$__

In [27]:
sd_data = [27, 28, 30, 45]
sd_avg = np.average(sd_data) # 32.5

__2. Take each value in the data set and subtract the mean from it: $x_i - \overline{x}$__

In [28]:
sd_mean_differences = [x - sd_avg for x in sd_data]
# [-5.5, -4.5, -2.5, 12.5]

__3. Square each of the differences: $(x_i - \overline{x})^2$__

In [29]:
sd_diff_squares = [np.square(x) for x in sd_mean_differences]
# [30.25, 20.25, 6.25, 156.25]

__4. Add the squared differences: $\Sigma{(x_i - \overline{x})^2}$__

In [30]:
sd_squares_sums = sum(sd_diff_squares)
# 213.0

__5. Divide that sum by the length of the data minus 1: $\frac{\Sigma{(x_i - \overline{x})^2}}{n-1}$__

In [31]:
sd_squares_div = sd_squares_sums / (len(sd_data) - 1)
# 71.0

__6. Take the square root of the previous step's result to get the final value: $\sqrt{\frac{\Sigma{(x_i - \overline{x})^2}}{n-1}}$__

In [39]:
np.sqrt(sd_squares_div)

8.426149773176359
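The six steps above can also be written with NumPy array arithmetic instead of list comprehensions. The following is a minimal sketch on the same example data; the variable names (`values`, `deviations`, `squared`, `variance`, `sample_sd`) are chosen here for illustration and are not part of the notebook.

```python
import numpy as np

# Same example data as in the steps above (names are illustrative).
values = np.array([27, 28, 30, 45], dtype=float)

deviations = values - values.mean()            # steps 1-2: subtract the mean
squared = deviations ** 2                      # step 3: square each difference
variance = squared.sum() / (values.size - 1)   # steps 4-5: sum, divide by n - 1
sample_sd = np.sqrt(variance)                  # step 6: take the square root

print(variance, sample_sd)
```

The intermediate `variance` is the sample variance, so the standard deviation is simply its square root.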

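The calculation can be done in one step with NumPy's `std` function. Passing `ddof=1` makes it divide by $n-1$, as in the formula above; the default (`ddof=0`) divides by $n$.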

In [45]:
np.std(sd_data, ddof=1)

8.426149773176359
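To see what the `ddof` argument changes, the sketch below compares the population form (division by $n$) with the sample form (division by $n-1$) on the same data, and cross-checks the sample value against `statistics.stdev` from the standard library. The variable names are illustrative assumptions.

```python
import statistics

import numpy as np

data = [27, 28, 30, 45]

population_sd = np.std(data)        # ddof=0 (default): divides by n
sample_sd = np.std(data, ddof=1)    # divides by n - 1, as in the steps above
stdlib_sd = statistics.stdev(data)  # standard-library sample standard deviation

print(population_sd, sample_sd, stdlib_sd)
```

The sample value is always at least as large as the population value, because the same sum of squares is divided by a smaller number.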

## Correlation

> A measure of the strength and direction of the linear relationship between two variables. – [Wikipedia](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)

$$r = \frac{1}{n-1}\left(\frac{\sum\limits_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{s_x s_y}\right)$$

Correlation here means the Pearson correlation coefficient, defined by the formula above, where $s_x$ and $s_y$ are the sample standard deviations of the x- and y-values. The calculation steps are explained below, using the following data: `[{'x': 10, 'y': 15}, {'x': 20, 'y': 28}, {'x': 6, 'y': 12}]`.

__1. Find the mean of all x-values and the mean of all y-values: $\overline{x}$, $\overline{y}$__

In [47]:
cr_data = [{'x': 10, 'y': 15}, {'x': 20, 'y': 28}, {'x': 6, 'y': 12}]
cr_x_mean = np.mean([d['x'] for d in cr_data])
cr_y_mean = np.mean([d['y'] for d in cr_data])

__2. Find the sample standard deviation of the x-values and of the y-values: $s_x$, $s_y$__

In [49]:
cr_x_std = np.std([d['x'] for d in cr_data], ddof=1)
cr_y_std = np.std([d['y'] for d in cr_data], ddof=1)

__3. For each (x, y) pair, subtract the x mean from x and the y mean from y, then multiply the two results: $(x_i-\overline{x})(y_i-\overline{y})$.__

In [57]:
cr_diff_products = [(d['x'] - cr_x_mean) * (d['y'] - cr_y_mean) for d in cr_data]

__4. Add up the results from the previous step: $\sum\limits_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})$.__

In [61]:
cr_product_sum = sum(cr_diff_products)

__5. Divide the sum by the product of the standard deviations of x and y: $\frac{\sum\limits_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{s_x s_y}$.__

In [62]:
cr_sum_div = cr_product_sum / (cr_x_std * cr_y_std)

__6. Divide the result from the previous step by the number of data pairs minus 1 to get the correlation coefficient (the same as multiplying by $\frac{1}{n-1}$): $\frac{1}{n-1}\left(\frac{\sum\limits_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{s_x s_y}\right)$.__

In [64]:
cr_sum_div / (len(cr_data) - 1)

0.99462397526908541
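The same six steps can be condensed with NumPy arrays. This is a minimal sketch under the assumption that the data is first unpacked into two arrays; the names `pairs`, `x`, `y`, `x_dev`, `y_dev`, `s_x`, `s_y` and `r` are illustrative and not part of the notebook.

```python
import numpy as np

# Same example data as above, unpacked into two arrays (names are illustrative).
pairs = [{'x': 10, 'y': 15}, {'x': 20, 'y': 28}, {'x': 6, 'y': 12}]
x = np.array([p['x'] for p in pairs], dtype=float)
y = np.array([p['y'] for p in pairs], dtype=float)

# Steps 1-2: deviations from the means and sample standard deviations.
x_dev = x - x.mean()
y_dev = y - y.mean()
s_x = x.std(ddof=1)
s_y = y.std(ddof=1)

# Steps 3-6: sum of products, divided by s_x * s_y and by n - 1.
r = (x_dev * y_dev).sum() / ((x.size - 1) * s_x * s_y)

print(r)
```

For this data the sum of products is 122, and the result matches the step-by-step value up to floating-point rounding.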

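The same calculation can be done with the `scipy.stats.pearsonr` function, which returns the correlation coefficient together with a p-value; index `[0]` selects the coefficient. The result agrees with the manual calculation up to floating-point rounding.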

In [68]:
pearsonr([d['x'] for d in cr_data], [d['y'] for d in cr_data])[0]

0.9946239752690853
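As a further cross-check that needs only NumPy, `np.corrcoef` returns the full correlation matrix for its inputs, and the off-diagonal entry is the Pearson coefficient of the two variables. The sketch below assumes the x- and y-values have been written out as plain lists; the names `x`, `y`, `corr_matrix` and `r` are illustrative.

```python
import numpy as np

# x- and y-values of the example data, written out as plain lists.
x = [10, 20, 6]
y = [15, 28, 12]

corr_matrix = np.corrcoef(x, y)  # 2x2 matrix: rows and columns are (x, y)
r = corr_matrix[0, 1]            # off-diagonal entry: correlation of x with y

print(corr_matrix)
print(r)
```

The diagonal entries are 1 because every variable is perfectly correlated with itself, and the matrix is symmetric.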

## Linear regression

_TODO_