# Descriptive and Inferential Statistics

### Population

A _population_ is a particular group of interest we want to study, such as “all seniors over the age of 65 in North America,” “all golden retrievers in Scotland,” or “current high school sophomores at Los Altos High School.”

### Sample

A _sample_ is a subset of the population that is ideally random and unbiased, which we use to infer attributes about the population. We often have to study samples because polling the entire population is not always possible. 

### Bias

The way to overcome bias is to select subjects (the students, in the example study) truly at random from the entire population, so that no one can voluntarily elect themselves into or out of the sample.
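As a minimal sketch of the random selection described above, the snippet below draws a simple random sample without replacement using Python's standard `random` module. The population of 1,000 numbered student IDs, the sample size of 50, and the seed are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical population: 1,000 student IDs (illustrative, not real data)
population = list(range(1, 1001))

# A seeded generator keeps the sketch reproducible; the seed is arbitrary
rng = random.Random(42)

# Every student has the same chance of selection, and nobody can opt in or out
sample = rng.sample(population, k=50)  # sampling without replacement
```

Because `sample()` draws without replacement, no student can appear in the sample twice.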

There are many types of bias, but they all have the same effect of distorting findings.

_Confirmation bias_ is gathering only data that supports your belief, which can even be done unknowingly. An example of this is following only social media accounts you politically agree with, reinforcing your beliefs rather than challenging them.

_Self-selection bias_ occurs when certain types of subjects are more likely to include themselves in the experiment. Walking onto a flight, asking the passengers whether they prefer that airline over others, and using the answers to rank customer satisfaction among all airlines would be silly. Why? Everyone on that flight already chose the airline, and many are likely repeat customers, so the sample has selected itself.

_Survival bias_ arises when a sample captures only the subjects that survived, while the ones that did not survive are never accounted for.


## Descriptive Statistics

### Mean and Weighted Mean

The _mean_ is the average of a set of values: add up the values and divide by the number of values.



In [2]:
# Number of pets each person owns
sample = [1, 3, 2, 5, 7, 0, 2, 3]

mean = sum(sample) / len(sample)
mean

2.875
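Written as a formula, the calculation above sums the $n$ values and divides by $n$. The symbol $\bar{x}$ for the mean of a sample is common notation assumed here, not something this section defines:

$$
\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1 + 3 + 2 + 5 + 7 + 0 + 2 + 3}{8} = \frac{23}{8} = 2.875
$$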

The mean we commonly use (above) gives equal importance to each value. But we can manipulate the mean and give each item a different weight:

$$
\text{Weighted mean} = \frac{(x_1 \cdot w_1) + (x_2 \cdot w_2) + (x_3 \cdot w_3) + \ldots + (x_n \cdot w_n)}{w_1 + w_2 + w_3 + \ldots + w_n}
$$

This can be helpful when we want some values to contribute to the mean more than others. 

In [3]:
# Three exams of .20 weight each and final exam of .40 weight
sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]

weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)
weighted_mean

81.4
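One consequence of dividing by the sum of the weights is that the weights do not have to add up to 1. The sketch below wraps the same calculation in a helper (the function name and the `scores` variable are illustrative choices) and compares fractional weights with plain ratios such as 1:1:1:2.

```python
def weighted_mean(values, weights):
    # Sum of value * weight, normalized by the total weight
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [90, 80, 63, 87]

# Weights expressed as fractions of the final grade
as_fractions = weighted_mean(scores, [0.20, 0.20, 0.20, 0.40])

# The same relative importance expressed as ratios: the final counts double
as_ratios = weighted_mean(scores, [1, 1, 1, 2])
```

Both calls describe the same relative importance, so both results come out at about 81.4, up to floating-point rounding.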

### Median

The median is the middlemost value in a set of ordered values. You sequentially order the values, and the median will be the centermost value. If you have an even number of values, you average the two centermost values.

The median can be preferable to the mean in outlier-heavy situations, such as income-related data. When your median is very different from your mean, you have a skewed dataset with outliers.
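To illustrate how a single outlier separates the two measures, here is a small sketch using the standard library's `statistics` module. The income figures are made up for the example.

```python
from statistics import mean, median

# Hypothetical annual incomes; the last value is a deliberate outlier
incomes = [32_000, 35_000, 38_000, 41_000, 45_000, 1_000_000]

mean_income = mean(incomes)      # pulled far upward by the outlier
median_income = median(incomes)  # average of the two centermost values

gap = mean_income - median_income
```

Here the mean is 198,500 while the median is 39,500, a gap that signals the skew described above.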

In [6]:
# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

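def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # index of the centermost (or upper centermost) value

    if n % 2 == 0:
        # Even count: average the two centermost values
        return (ordered[mid - 1] + ordered[mid]) / 2.0
    else:
        # Odd count: the single centermost value
        return ordered[mid]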

median(sample)

7
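The sample above has an odd number of values, so only one branch of the function runs. As a cross-check that also covers the even case, this sketch uses `statistics.median` from the standard library. The six-value `even_sample` is an illustrative variant made by dropping the last observation.

```python
from statistics import median

odd_sample = [0, 1, 5, 7, 9, 10, 14]
even_sample = [0, 1, 5, 7, 9, 10]  # illustrative: last observation dropped

odd_median = median(odd_sample)    # single centermost value
even_median = median(even_sample)  # average of 5 and 7
```

With an even count, the two centermost values are 5 and 7, so the median is 6.0.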

### Mode

The mode is the most frequently occurring value or set of values. It primarily becomes useful when your data is repetitive and you want to find which values occur most often.

When no value occurs more than once, there is no mode. When two values tie for the highest frequency, the dataset is considered _bimodal_.

In practice, the mode is not used much unless your data is repetitive, which is common with integers, categories, and other discrete variables.

In [8]:
# Number of pets each person owns
from collections import defaultdict

sample = [1, 3, 2, 5, 7, 0, 2, 3]

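def mode(values):
    # Count how many times each value occurs
    counts = defaultdict(int)

    for s in values:
        counts[s] += 1

    # Keep every value tied for the highest count (two values means bimodal)
    max_count = max(counts.values())
    modes = [v for v in set(values) if counts[v] == max_count]
    return modes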

mode(sample)

[2, 3]
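For comparison, the standard library offers the same tally through `collections.Counter` and `statistics.multimode` (available from Python 3.8). This is a sketch of an alternative, not a replacement for the function above. The `no_repeats` list is a hypothetical input added to show the edge case.

```python
from collections import Counter
from statistics import multimode

sample = [1, 3, 2, 5, 7, 0, 2, 3]

# Counter maps each value to how often it occurs
counts = Counter(sample)

# multimode returns every value tied for the highest count,
# in the order the values first appear in the data
modes = sorted(multimode(sample))

# Hypothetical input where nothing repeats: every value ties at one occurrence
no_repeats = multimode([4, 8, 15])
```

Note that when nothing repeats, `multimode` returns every value rather than an empty list. Treating that case as "no mode," as the text does, would need an explicit check that the highest count is greater than 1.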

### Variance and Standard Deviation

The _variance_ is a measure of how spread out our data is.

#### Population Variance and Standard Deviation

$$
\text{Population variance} = \frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \ldots + (x_n - \text{mean})^2}{N}
$$

More formally:

$$
\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}
$$

In [15]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

def variance(values):
    mean = sum(values) / len(values)
    _variance = sum((v - mean) ** 2 for v in values) / len(values)
    return _variance

variance(data)

21.387755102040813
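To make the formula's steps visible, the sketch below keeps the squared deviations in a list before averaging them, then cross-checks the result against `statistics.pvariance` from the standard library. The intermediate variable names are illustrative.

```python
from statistics import pvariance

data = [0, 1, 5, 7, 9, 10, 14]

mu = sum(data) / len(data)  # population mean

# One squared distance from the mean per observation
squared_deviations = [(x - mu) ** 2 for x in data]

manual_variance = sum(squared_deviations) / len(data)
library_variance = pvariance(data)  # population variance from the stdlib
```

The two results agree up to floating-point rounding, at roughly 21.39.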

So the variance for the number of pets owned by my office staff is 21.387755. OK, but what exactly does it mean?

This number is larger than any of our observations because we did a lot of squaring and summing, putting it in squared units rather than the units we started with. So how do we squeeze it back down to the scale we started with?

Let’s take the square root of the variance, which gives us the _standard deviation_.

This is the variance scaled back into a number expressed in terms of “number of pets,” which makes it a bit more meaningful:

$$
\sigma = \sqrt{\frac{\sum(x_i - \mu)^2}{N}}
$$

In [16]:
from math import sqrt

def std_dev(values):
    return sqrt(variance(values))

std_dev(data)

4.624689730353898
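Because the standard deviation is in the same units as the data, it can be compared with the observations directly. As one hedged illustration, the sketch below uses `statistics.pstdev` and lists the observations that fall within one standard deviation of the mean. The "within one standard deviation" check is an illustrative choice, not something the text prescribes.

```python
from statistics import mean, pstdev

data = [0, 1, 5, 7, 9, 10, 14]

center = mean(data)
spread = pstdev(data)  # population standard deviation, in "number of pets"

# Interval of one standard deviation on either side of the mean
low, high = center - spread, center + spread
within_one_sd = [x for x in data if low <= x <= high]
```

For this data the interval runs from about 1.95 to about 11.2 pets, and four of the seven observations (5, 7, 9, and 10) fall inside it.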