<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Central-Tendency" data-toc-modified-id="Central-Tendency-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Central Tendency</a></span><ul class="toc-item"><li><span><a href="#Mean" data-toc-modified-id="Mean-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Mean</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check:-Coding-from-scratch" data-toc-modified-id="🧠-Knowledge-Check:-Coding-from-scratch-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>🧠 Knowledge Check: Coding from scratch</a></span></li></ul></li><li><span><a href="#Median" data-toc-modified-id="Median-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Median</a></span></li></ul></li><li><span><a href="#Dispersion" data-toc-modified-id="Dispersion-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dispersion</a></span><ul class="toc-item"><li><span><a href="#Variance-and-Standard-Deviation" data-toc-modified-id="Variance-and-Standard-Deviation-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Variance and Standard Deviation</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check:-Why-do-we-square-it?" 
data-toc-modified-id="🧠-Knowledge-Check:-Why-do-we-square-it?-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>🧠 Knowledge Check: Why do we square it?</a></span></li><li><span><a href="#Coding-it" data-toc-modified-id="Coding-it-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Coding it</a></span></li></ul></li><li><span><a href="#Visualizing" data-toc-modified-id="Visualizing-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Visualizing</a></span></li></ul></li><li><span><a href="#Covariance-&amp;-Correlation" data-toc-modified-id="Covariance-&amp;-Correlation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Covariance &amp; Correlation</a></span><ul class="toc-item"><li><span><a href="#Covariance-is-Variance-Between-2-Variables" data-toc-modified-id="Covariance-is-Variance-Between-2-Variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Covariance is Variance Between 2 Variables</a></span></li><li><span><a href="#Correlation-is-Covariance-Scaled" data-toc-modified-id="Correlation-is-Covariance-Scaled-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Correlation is Covariance Scaled</a></span></li></ul></li></ul></div>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Central Tendency

We want to "summarize" the data with a single representative value. Beware, though: very different datasets can share the same summary statistics, as the _[datasaurus set](https://www.autodeskresearch.com/publications/samestats)_ shows.

In [None]:
x = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60, 54, 54, 54, 55, 56, 57, 57, 
     58, 58, 60]
plt.hist(x, bins=5)
plt.title("Retirement Ages")
plt.show()

## Mean

In [None]:
x_mean = np.mean(x)
x_mean

In [None]:
plt.hist(x, bins=5)
plt.axvline(x_mean, color='red')
plt.title("Retirement Ages")
plt.show()

### 🧠 Knowledge Check: Coding from scratch

How would you code this from scratch?

In [None]:
def mean(x):
    pass

<details>
    <summary>Example Solution</summary> 
<code> 
def mean(x):
    return sum(x)/len(x)
</code>
</details>

## Median

In [None]:
x_median = np.median(x)
x_median

In [None]:
plt.hist(x, bins=5)
plt.axvline(x_mean, color='red')
plt.axvline(x_median, color='yellow')
plt.title("Retirement Ages")
plt.show()

> The median is more robust to outliers than the mean.

In [None]:
# Adding in an outlier
outlier = 120
x_mean_outlier = np.mean(x+[outlier])
print('Means:', x_mean_outlier, x_mean)

x_median_outlier = np.median(x+[outlier])
print('Medians:', x_median_outlier, x_median)

In [None]:
plt.hist(x, bins=5)
plt.axvline(x_mean_outlier, color='red')
plt.axvline(x_mean, color='red', linestyle='--')
plt.axvline(x_median_outlier, color='yellow')
plt.axvline(x_median, color='yellow', linestyle='--')
plt.title("Retirement Ages")
plt.show()

# Dispersion

Measures of dispersion tell you how much the values differ from one another. How would you measure that?

## Variance and Standard Deviation

Variance measures the spread of the values around the mean; the standard deviation is its square root. Below is our formula for the (population) variance:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(x_i-\bar{x})^2 $$

### 🧠 Knowledge Check: Why do we square it?

> So the positive and negative deviations don't cancel out.
> 
> Squaring corresponds to the L2-norm (as in least squares) if you're curious. Another option is the absolute value, which corresponds to the L1-norm (as in least absolute deviations (LAD), also called least absolute errors (LAE)).
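
The cancellation can be checked directly. The snippet below is a minimal sketch in plain Python; it reuses the retirement ages from earlier, and the names `ages`, `deviations`, `signed_total`, `mean_squared_dev`, and `mean_abs_dev` are illustrative assumptions rather than part of the notebook.

```python
# Retirement ages, copied from the earlier cell
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60,
        54, 54, 54, 55, 56, 57, 57, 58, 58, 60]

avg = sum(ages) / len(ages)
deviations = [a - avg for a in ages]

# Signed deviations cancel out (up to floating-point rounding)
signed_total = sum(deviations)

# Squared deviations (L2-style) cannot cancel; their average is the variance
mean_squared_dev = sum(d ** 2 for d in deviations) / len(ages)

# Absolute deviations (L1-style) cannot cancel either
mean_abs_dev = sum(abs(d) for d in deviations) / len(ages)

print("sum of signed deviations:", signed_total)
print("mean squared deviation:  ", mean_squared_dev)
print("mean absolute deviation: ", mean_abs_dev)
```

The first number should be essentially zero, while the other two are strictly positive, which is why either squaring or taking absolute values gives a usable measure of spread.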

### Coding it

In [None]:
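# An example of how this could be written
def variance(number_list):
    # Relies on the mean() function from the knowledge check above
    avg = mean(number_list)
    # Average squared deviation from the mean (population variance)
    var = sum((x_n - avg)**2 for x_n in number_list) / len(number_list)
    return var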

In [None]:
variance(x)
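
As a sanity check, a from-scratch population variance can be compared against a library implementation. The sketch below is illustrative only: `population_variance` and `ages` are names assumed here, and the comparison uses Python's standard `statistics` module. NumPy's `np.var` also divides by $N$ by default (`ddof=0`), so it should agree with the formula above; passing `ddof=1` gives the sample variance instead.

```python
import math
import statistics

# Retirement ages, copied from the earlier cell
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60,
        54, 54, 54, 55, 56, 57, 57, 58, 58, 60]

def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

pop_var = population_variance(ages)

# statistics.pvariance divides by N, matching the formula above
matches_library = math.isclose(pop_var, statistics.pvariance(ages))

# The standard deviation is the square root of the variance
pop_std = math.sqrt(pop_var)

# statistics.variance divides by N - 1 (sample variance), so it is slightly larger
sample_var = statistics.variance(ages)

print(matches_library, pop_var, pop_std, sample_var)
```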

## Visualizing

* Histograms
* Box plots
* Violin plots

In [None]:
import seaborn as sb

def draw_3_plots(nums):
    # One row with three panels: histogram, box plot, violin plot
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    fig.subplots_adjust(wspace=0.4)

    axes[0].hist(nums)
    axes[0].set_title("Histogram")
    axes[1].boxplot(nums)
    axes[1].set_title("Box plot")
    # Pass the data by keyword so seaborn treats it as a single variable
    sb.violinplot(x=nums, ax=axes[2])
    axes[2].set_title("Violin plot")
    plt.show()


In [None]:
draw_3_plots(x)

In [None]:
# With an outlier
draw_3_plots(x+[110])


# Covariance & Correlation

We'll cover this in more detail when we go over [linear regression](../../DataScienceBasics/LinearRegression/linear_regressions_and_simple_relatiohsips.ipynb).

## Covariance is Variance Between 2 Variables

$$\sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$
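
The formula above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `covariance` and the toy lists `xs`, `ys_up`, and `ys_down` are made up for this example.

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations (divides by n)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]    # rises with xs
ys_down = [10, 8, 6, 4, 2]  # falls as xs rises

print(covariance(xs, ys_up))    # 4.0  (positive: they move together)
print(covariance(xs, ys_down))  # -4.0 (negative: they move in opposite directions)
print(covariance(xs, xs))       # 2.0  (covariance of a variable with itself is its variance)
```

Note that the magnitude depends on the units of both variables, which makes raw covariances hard to compare across datasets.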

## Correlation is Covariance Scaled

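There are multiple ways to measure correlation, though a common one is Pearson's $r$ coefficient. It divides the covariance by the product of the two standard deviations, so the result always falls between $-1$ and $1$: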

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$
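
Pearson's $r$ can be sketched the same way. The code below is a hedged illustration of the formula above; `pearson_r` and the toy lists are hypothetical names and data chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: sum of paired deviation products, scaled by both spreads."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    numerator = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mu_x) ** 2 for x in xs) * sum((y - mu_y) ** 2 for y in ys)
    )
    if denominator == 0:
        # r is undefined when either variable has no spread
        raise ValueError("correlation is undefined for constant data")
    return numerator / denominator

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]
ys_down = [10, 8, 6, 4, 2]
ys_rescaled = [100 * y for y in ys_up]  # same relationship, different units

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
r_rescaled = pearson_r(xs, ys_rescaled)
print(r_up, r_down, r_rescaled)
```

Rescaling `ys_up` by 100 changes the covariance but leaves $r$ unchanged, which is the sense in which correlation is covariance scaled.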