# Basic Statistics

_Including content from the DATA 601 staff, Wikipedia, "Data Science from Scratch", and elsewhere._

A very simple distinction between probability and statistics is that probability is about the _future_ and statistics is about the _past_.

More formally...

> Probability is primarily a **theoretical branch of mathematics**, which studies the consequences of mathematical definitions. Statistics is primarily an **applied branch of mathematics**, which tries to **make sense of observations in the real world**.

from "Calculated Bets: Computers, Gambling, and Mathematical Modeling to Win" (http://www3.cs.stonybrook.edu/~skiena/jaialai/excerpts/node12.html)

In summary, probability theory enables us to **find the consequences of a given ideal world**, while statistical theory enables us to **measure the extent to which our world is ideal.**

We have talked about Tukey summaries already, but there are some additional ways to measure spread in our data:  variance and standard deviation.

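* Sample variance is defined as $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}$, where $\bar{x}$ is the mean of the $n$ observations.
* Standard deviation is defined as $\sqrt{variance}$, which puts the measure of spread back in the same units as the data.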
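
As a minimal sketch, the two definitions above can be written in plain Python. The function names and the sample values are illustrative assumptions, not part of the course material.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance: squared deviations from the mean, summed and divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("variance requires at least two values")
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def standard_deviation(xs):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values for illustration
print(variance(data))
print(standard_deviation(data))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same $n-1$ denominator, so they should agree with this sketch.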

## Correlation
When we talk about correlation, we're looking at how two variables relate to one another.

### Covariance
One such measure is _covariance_, $covariance(x, y)$, an analog of variance for pairs of values. It is computed as the dot product of the _de-meaned_ values (each variable's differences from its own mean), divided by $n-1$.

A covariance of 0 means there is no _linear_ relationship between the two variables (a non-linear one may still exist). A "large" positive covariance means that x tends to be large when y is large and small when y is small. Similarly, a "large" negative covariance means that x tends to be small when y is large, and vice versa.

Caveats: Covariance can be difficult to interpret because ...
* its units are the product of the two inputs' units, which rarely has an intuitive meaning
* its value is unbounded and depends on the scale of each variable, so knowing what counts as "large" is challenging
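
The following sketch computes covariance as described above and shows the scale problem: re-expressing one variable in different units changes the covariance even though the relationship is the same. The variable names and values (study hours and test scores) are made up for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Dot product of the de-meaned values, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]          # hypothetical study time, in hours
score = [52, 55, 61, 64, 68]     # hypothetical test scores, in points
minutes = [h * 60 for h in hours]  # the same study time, in minutes

print(covariance(hours, score))    # units: hours * points
print(covariance(minutes, score))  # 60 times larger, same underlying relationship
```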

### Correlation
Correlation solves this by computing the covariance and then dividing by the product of the two variables' standard deviations.

The correlation is thus _unitless_ and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation).
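
A minimal plain-Python sketch of this calculation follows. The data are made up, and returning 0 when a variable has no variation is a convention assumed here (the ratio is undefined in that case), not something the text specifies.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def standard_deviation(xs):
    return math.sqrt(covariance(xs, xs))  # variance is the covariance of x with itself

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x, sd_y = standard_deviation(xs), standard_deviation(ys)
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention: no variation, so report no correlation
    return covariance(xs, ys) / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]            # hypothetical study time, in hours
score = [52, 55, 61, 64, 68]       # hypothetical test scores
minutes = [h * 60 for h in hours]  # same study time, different units

print(correlation(hours, score))    # about 0.994
print(correlation(minutes, score))  # same value up to rounding: the units cancel out
```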

Caveats:

* Correlation can be sensitive to outliers, so consider removing them and re-examining the correlation.
* Simpson's Paradox can cause misinterpretations if the presence of _confounding_ variables isn't accounted for.
* Correlation looks for _linear_ relationships. It can miss non-linear relationships entirely (e.g. $x$ vs. $|x|$).
* It also tells us nothing about the magnitude of the relationship, so a perfect positive correlation may hold and yet be uninteresting because the effect is very small.
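
Three of these caveats can be checked with small made-up datasets. The sketch below is only an illustration: the `correlation` helper and all values are assumptions chosen to make each effect easy to see.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Non-linear relationship: y is fully determined by x, yet the correlation is ~0.
xs = [-2, -1, 0, 1, 2]
r_abs = correlation(xs, [abs(x) for x in xs])

# Magnitude: a perfect correlation with a tiny effect (y moves 0.001 per unit of x).
r_tiny = correlation([1, 2, 3, 4, 5], [5 + 0.001 * x for x in [1, 2, 3, 4, 5]])

# Outliers: five unrelated points, then the same points plus one extreme value.
base_x, base_y = [1, 2, 3, 4, 5], [3, 1, 2, 1, 3]
r_without = correlation(base_x, base_y)
r_with = correlation(base_x + [20], base_y + [20])

print(r_abs, r_tiny, r_without, r_with)
```

Plotting the data before trusting a single coefficient is usually the quickest way to catch all three situations.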

Lastly, the classic refrain: "correlation does not imply causation."

Note that if two variables are correlated, there are at least three possibilities for why:

* x could cause y
* y could cause x
* some third factor, z, could cause both
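
The third possibility can be illustrated with a small simulation. This is a sketch under a made-up data-generating process: `z` drives both `x` and `y`, and neither `x` nor `y` influences the other. Subtracting `z` directly works here only because the simulation adds it with a known coefficient of 1.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]   # the hidden third factor
x = [zi + rng.gauss(0, 1) for zi in z]    # x depends on z plus its own noise
y = [zi + rng.gauss(0, 1) for zi in z]    # y depends on z plus its own noise

r_xy = correlation(x, y)  # clearly positive, although x never affects y

# Remove z's contribution from each variable; only independent noise remains.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = correlation(x_resid, y_resid)   # close to zero

print(r_xy, r_resid)
```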

Randomized trials can help us feel much more secure in a causal assertion.  In general, early on in the EDA process, it's best to avoid causal language (h/t Elements of Data Analytic Style).

## Simpson's Paradox

_taken from "Data Science from Scratch"_

When examining correlations, we often run into "Simpson's Paradox", in which a correlation is misleading because a confounding variable has not been accounted for.

The key issue is that correlation is measuring the relationship between your two variables **all else being equal**. If your data classes are assigned at random, as they might be in a well-designed experiment, “all else being equal” might not be a terrible assumption. But when there is a deeper pattern to class assignments, “all else being equal” can be an awful assumption.

For example, imagine that you can identify all of your members as either East Coast data scientists or West Coast data scientists. You decide to examine which coast’s data scientists are friendlier:

```python
import pandas as pd

# Average number of friends per coast, ignoring degree
by_coast = pd.DataFrame({
    'coast': ['West Coast', 'East Coast'],
    'n_members': [101, 103],
    'avg_friends': [8.2, 6.5],
})
by_coast
```

| | coast | n_members | avg_friends |
|---|---|---|---|
| 0 | West Coast | 101 | 8.2 |
| 1 | East Coast | 103 | 6.5 |

When playing with the data you discover something very strange. If you only look at people with PhDs, the East Coast data scientists have more friends on average. And if you only look at people without PhDs, the East Coast data scientists also have more friends on average!

```python
# The same members, now split by coast and by degree
by_coast_and_degree = pd.DataFrame({
    'coast': ['West Coast', 'East Coast', 'West Coast', 'East Coast'],
    'degree': ['PhD', 'PhD', 'no PhD', 'no PhD'],
    'n_members': [35, 70, 66, 33],
    'avg_friends': [3.1, 3.2, 10.9, 13.4]
})
by_coast_and_degree
```

| | coast | degree | n_members | avg_friends |
|---|---|---|---|---|
| 0 | West Coast | PhD | 35 | 3.1 |
| 1 | East Coast | PhD | 70 | 3.2 |
| 2 | West Coast | no PhD | 66 | 10.9 |
| 3 | East Coast | no PhD | 33 | 13.4 |

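Once you account for the users’ degrees, the correlation goes in the opposite direction! Bucketing the data as East Coast/West Coast disguised the fact that the East Coast data scientists skew much more heavily toward PhD types. Because PhDs in this data have far fewer friends on average, the PhD-heavy East Coast ends up with the lower overall average even though it is ahead within each degree group.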
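
To see where the reversal comes from, the per-coast averages can be rebuilt from the coast-and-degree breakdown as member-weighted averages. The sketch below uses plain Python rather than pandas so it runs on its own. The figures are copied from the table above, and the helper names are illustrative.

```python
rows = [
    # (coast, degree, n_members, avg_friends)
    ("West Coast", "PhD", 35, 3.1),
    ("East Coast", "PhD", 70, 3.2),
    ("West Coast", "no PhD", 66, 10.9),
    ("East Coast", "no PhD", 33, 13.4),
]

def overall_average(coast):
    """Member-weighted average of avg_friends across degree groups for one coast."""
    members = sum(n for c, _, n, _ in rows if c == coast)
    total_friends = sum(n * avg for c, _, n, avg in rows if c == coast)
    return total_friends / members

def phd_share(coast):
    """Fraction of a coast's members who hold a PhD."""
    members = sum(n for c, _, n, _ in rows if c == coast)
    phds = sum(n for c, d, n, _ in rows if c == coast and d == "PhD")
    return phds / members

for coast in ("West Coast", "East Coast"):
    print(coast, round(overall_average(coast), 1), round(phd_share(coast), 2))
# West Coast 8.2 0.35
# East Coast 6.5 0.68
```

The weighted averages reproduce the 8.2 and 6.5 from the first table, and the PhD shares show how differently the two coasts are mixed.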