# Statistics
* Statistics refers to the mathematics and techniques with which we understand data

## By Simple Number of Data Points
```python
num_points = len(num_friends) # 204
largest_value = max(num_friends) # 100
smallest_value = min(num_friends) # 1
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0] # 1
second_smallest_value = sorted_values[1] # 1
second_largest_value = sorted_values[-2] # 49
```

## By Central Tendencies
* Mean (or average)
* Median (middle value)
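
The mean is used by later functions in these notes but is not defined here. A minimal sketch, assuming the usual arithmetic mean (the `sample` list is made up for illustration and is not the `num_friends` data):

```python
def mean(x):
    """arithmetic mean: the sum of the values divided by how many there are"""
    return sum(x) / len(x)

# hypothetical sample, only for illustration
sample = [1, 2, 3, 4, 10]
print(mean(sample))  # 4.0
```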

```python
# median in detail
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
```

* the mean is simpler to compute; the median requires sorting the data
* the mean is very sensitive to outliers, but the median is not
* a generalization of the median is the quantile: the value below which a given fraction of the data lies

``` python
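def quantile(x, p):
    """returns the value at the p-th quantile of x, with p between 0 and 1"""
    # clamp so that p = 1.0 maps to the last element instead of indexing past the end
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]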

quantile(num_friends, 0.10) # 1
quantile(num_friends, 0.25) # 3
quantile(num_friends, 0.75) # 9
quantile(num_friends, 0.90) # 13

```
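
The claim that the mean is sensitive to outliers while the median is not can be checked with a small experiment. This sketch uses the standard library's `statistics` module rather than the hand-written functions in these notes, and the data is invented for the purpose:

```python
from statistics import mean, median

values = [1, 2, 3, 4, 5]          # made-up data
with_outlier = values + [1000]    # the same data plus one extreme value

# without the outlier: mean is 3, median is 3
print(mean(values), median(values))

# with the outlier: mean jumps to about 169.17, median only moves to 3.5
print(mean(with_outlier), median(with_outlier))
```

A single extreme value drags the mean far from the bulk of the data, while the median barely shifts.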

* mode: the most common value(s)

```python

from collections import Counter

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)                 # maps each value to how often it occurs
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]

mode(num_friends) # 1 and 6

```

## By How Spread Out the Data Is: Dispersion
* range: max - min
 * depends only on the largest and smallest values, not on the rest of the data
* variance: roughly the average squared deviation from the mean

``` python
def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = sum(x) / len(x)   # arithmetic mean of x
    return [x_i - x_bar for x_i in x]

def variance(x):
    """sample variance; assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    # divide by n - 1 rather than n (Bessel's correction)
    return sum(d ** 2 for d in deviations) / (n - 1)
```

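 * why n-1?
  * the mean is itself estimated from the same sample, so the squared deviations from it tend to underestimate the true spread; dividing by n-1 instead of n corrects for this
  * https://en.wikipedia.org/wiki/Bessel%27s_correction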

* standard deviation: the square root of the variance, so it has the same units as the data
* the standard deviation also suffers from outliers
* a more robust alternative is the difference between the 75th and 25th percentile values
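
The last three points can be sketched as follows. The name `interquartile_range` and the `data` list are assumptions made for this illustration, and the helpers are redefined here only so the block runs on its own:

```python
import math

def variance(x):
    """sample variance, dividing by n - 1"""
    x_bar = sum(x) / len(x)
    return sum((x_i - x_bar) ** 2 for x_i in x) / (len(x) - 1)

def standard_deviation(x):
    """square root of the variance, in the same units as x"""
    return math.sqrt(variance(x))

def quantile(x, p):
    """value at the p-th quantile of x, with p between 0 and 1"""
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]

def interquartile_range(x):
    """difference between the 75th and 25th percentile values"""
    return quantile(x, 0.75) - quantile(x, 0.25)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
with_outlier = data + [100]       # the same data plus one extreme value

# the standard deviation grows many times over once the outlier is added
print(standard_deviation(data), standard_deviation(with_outlier))

# the interquartile range is 3 in both cases
print(interquartile_range(data), interquartile_range(with_outlier))
```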

## Correlation

relationship between variables

* Covariance: how two variables vary in tandem from their means
 * its unit is the product of the two variables' units, which makes the value hard to interpret

```python
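def covariance(x, y):
    """assumes x and y have the same length (at least two elements)"""
    n = len(x)
    # sum of products of paired deviations, i.e. their dot product
    return sum(dx * dy for dx, dy in zip(de_mean(x), de_mean(y))) / (n - 1)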

```

* Correlation: the covariance divided by the standard deviations of both variables
 * unitless
 * between -1 (perfect anti-correlation) and 1 (perfect correlation)
 * still vulnerable to outliers
 
```python
import math

def correlation(x, y):
    stdev_x = math.sqrt(variance(x))   # standard deviation of x
    stdev_y = math.sqrt(variance(y))   # standard deviation of y
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0 # if no variation, correlation is zero

```

* in the example plotted below, the correlation is 0.25 with the outlier included and 0.5 with it removed

<img style="float: left;" src="./images/ch05_statistics/1.png" width="400">
<img style="float: center;" src="./images/ch05_statistics/2.png" width="400">


  

## Simpson's Paradox

* correlations can be misleading when confounding variables are ignored
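
A minimal sketch of how this can happen, using invented numbers (not the data in the figures below). Within each group the correlation is perfectly negative, yet pooling the groups and ignoring the group label, which plays the role of the confounding variable here, yields a strongly positive correlation:

```python
import math

def correlation(x, y):
    """Pearson correlation; the n - 1 factors cancel, so they are omitted"""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    dx = [x_i - x_bar for x_i in x]
    dy = [y_i - y_bar for y_i in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# two hypothetical groups, each with a decreasing trend
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [11, 12, 13], [13, 12, 11]

within_a = correlation(group_a_x, group_a_y)   # about -1
within_b = correlation(group_b_x, group_b_y)   # about -1

# pooled data, with the group label dropped
pooled = correlation(group_a_x + group_b_x, group_a_y + group_b_y)

print(within_a, within_b, pooled)   # pooled is about 0.95
```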

<img style="float: left;" src="./images/ch05_statistics/3.png" width="400">
<img style="float: center;" src="./images/ch05_statistics/4.png" width="400">


<img src="./images/ch05_statistics/5.png" width="400">


## Some Other Correlational Caveats

* y = |x|
 * a clear relationship, yet the correlation is zero

```python
x = [-2, -1, 0, 1, 2]
y = [ 2, 1, 0, 1, 2]

```
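
A quick check of the y = |x| case, with the covariance written out without helper functions. Because the correlation is the covariance scaled by the standard deviations, a covariance of zero means a correlation of zero:

```python
def covariance(x, y):
    """sample covariance of two equal-length lists"""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y)) / (n - 1)

x = [-2, -1, 0, 1, 2]
y = [abs(x_i) for x_i in x]    # [2, 1, 0, 1, 2]

# contributions from negative and positive x cancel,
# so the result is 0 up to floating-point rounding
print(covariance(x, y))
```

Correlation only measures linear relationships, so it misses this one entirely.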

* correlation tells you nothing about how large the relationship is

```python
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

```
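
To make the point concrete, the sketch below takes x = [-2, -1, 0, 1, 2] with the y values above and computes both the correlation and the least-squares slope. The slope is added here only as one possible measure of the size of the relationship; it is not part of the original notes:

```python
import math

x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
dx = [x_i - x_bar for x_i in x]
dy = [y_i - y_bar for y_i in y]

sum_xy = sum(a * b for a, b in zip(dx, dy))
sum_xx = sum(a * a for a in dx)
sum_yy = sum(b * b for b in dy)

correlation = sum_xy / math.sqrt(sum_xx * sum_yy)   # essentially 1
slope = sum_xy / sum_xx                             # about 0.01

print(correlation, slope)
```

The variables are perfectly correlated, yet a one-unit change in x moves y by only about 0.01.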


## Correlation and Causation

* correlation is not causation

<img src="./images/ch05_statistics/6.png" width="800">


