# Descriptive and Inferential Statistics

## Descriptive versus Inferential Statistics

The most commonly understood part of statistics is $Descriptive$ $Statistics$, which uses tools such as the mean, the median, the mode, charts, and bell curves to summarize the data we have.
$Inferential$ $Statistics$ is less intuitive: it tries to uncover attributes of a larger population, often based on a sample. In addition, this sample may have several problems; for example, it may not represent the actual population.

## Population, Samples and Bias

A population is a particular group of interest that we want to study, such as "all seniors over the age of 65 in North America", "all golden retrievers in Scotland", or "current high school sophomores at Los Altos High School". Notice how boundaries define each population: some of these boundaries are broad and capture a large population over a vast geography or age group, whereas others are highly specific, as in the last example. How you define a population therefore depends on what you are interested in studying.

If we are going to infer attributes about the population based on a sample, it is important that the sample be as random as possible so that we do not skew our conclusions. For example, if I want to find the average number of hours college students in the United States watch television per week, I could walk right outside my door and start polling random students walking by. However, there is a big problem here: this student sample is going to have a bias, meaning it skews our findings by overrepresenting a certain group at the expense of other groups. My study defined the population as college students in the United States, not college students at Arizona State University. Ideally, I should randomly poll college students at different universities all over the country, which would give me a more representative sample.
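The effect of polling only nearby students can be sketched with a small simulation. Everything here is made up for illustration: the three campuses, their average viewing hours, and the sample size are assumptions, not real survey data.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly TV hours for students on three made-up campuses.
campus_means = {"campus_a": 20, "campus_b": 10, "campus_c": 12}
population = [
    (campus, max(0.0, random.gauss(mu, 3)))
    for campus, mu in campus_means.items()
    for _ in range(10_000)
]

true_mean = mean(hours for _, hours in population)

# Convenience sample: only students from the campus right outside my door.
local_students = [hours for campus, hours in population if campus == "campus_a"]
convenience_mean = mean(random.sample(local_students, 200))

# Random sample: every student in the population is equally likely to be picked.
random_mean = mean(hours for _, hours in random.sample(population, 200))

print(f"population mean:         {true_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
```

Under these assumptions the convenience sample lands near the local campus average, while the random sample lands much closer to the population mean.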

Here is another example. I could encounter $self$-$selection$ bias if I collected this kind of information through social media such as Twitter and Facebook. Why? Because the people who choose to fill out the questionnaire are already active on social media, so they are more likely to watch TV or have Netflix, and thus to watch more hours than people who are not on these platforms.

Seen this way, bias seems inevitable, and in many cases it is: there can be many confounding variables, or factors we did not account for, that influence our study. This problem of data bias is expensive and difficult to overcome, and machine learning is especially vulnerable to it. The way to overcome it is to select students truly at random from the entire population, so that they cannot elect themselves into or out of the sample voluntarily. This is the most effective way to mitigate bias, but as you can imagine, it takes a lot of coordinated resources.

There are many types of bias, but they all have the same effect of distorting findings. One is confirmation bias: gathering only information that supports our beliefs, sometimes unknowingly. An example is following only those social media accounts you politically agree with, which reinforces your beliefs rather than challenging them.
Another type is survivorship bias, in which only living or surviving subjects are captured while those that did not survive are never accounted for.
An example that comes to mind involves Steve Jobs. He was said to be passionate and hot-tempered, and he also created one of the most valuable companies of all time. Some people therefore conclude that being passionate and hot-tempered correlates with success, but this is survivorship bias: many companies run by passionate leaders failed in obscurity rather than becoming a success story like Apple.

These stories remind us to always ask how the data was obtained, and then to scrutinize how that process could have biased it. As noted earlier, these problems with sampling and bias extend to machine learning as well. Whether it is linear regression, logistic regression, or neural networks, a sample of data is used to infer predictions. If that data is biased, it will steer the machine learning algorithm toward biased conclusions.

## Descriptive Statistics

### Mean and Weighted Mean

The mean is the average of a set of values. It is useful because it shows where the "center of gravity" lies for an observed set of values. It is computed in the same way for both samples and populations.

In [1]:
# Example 3-1 Calculating mean in Python
sample = [1, 3, 5, 7, 9, 4]

mean = sum(sample) / len(sample)
print(mean)

4.833333333333333


As noted above, there are two versions:
- $\bar{x} = \dfrac{x_1 + x_2 + x_3 + ... + x_n}{n}$

- $\mu = \dfrac{x_1 + x_2 + x_3 + ... + x_N}{N}$

They represent the same calculation; only the notation differs. The first is the mean of a sample, whereas the second is the mean of an entire population. Besides the symbol for the mean itself ($\bar{x}$ versus $\mu$), the symbol for the number of items also changes: a sample's size is written $n$, whereas a population's size is written $N$.

We can also apply a weight to each item through the weighted mean:
- $weighted$ $mean = \dfrac{x_1*w_1 + x_2*w_2 + x_3*w_3 + ... + x_n*w_n}{w_1 + w_2 + w_3 + ... + w_n}$

This is especially useful when we want some values to contribute to the mean more than others.

In [7]:
# Example 3-2 Calculating a weighted mean in Python
# Three exams with .20 weight each and final exam with .40 weight
sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]

weighted_mean = sum([s * w for s, w in zip(sample, weights)]) / sum(weights)
print(weighted_mean)

81.4


Weights don't have to be percentages, since any numbers passed in are normalized when we divide by the sum of the weights. The next example uses whole-number weights and produces the same result.

In [9]:
# Example 3-3 Calculating a weighted mean in Python with whole-number weights
# Three exams with weight 1 each and final exam with weight 2
sample = [90, 80, 63, 87]
weights = [1, 1, 1, 2]

weighted_mean = sum([s * w for s, w in zip(sample, weights)]) / sum(weights)
print(weighted_mean)

81.4
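To see why the two sets of weights agree, here is a minimal sketch that normalizes the whole-number weights explicitly. The names `raw_weights` and `normalized` are introduced only for this illustration.

```python
import math

sample = [90, 80, 63, 87]
raw_weights = [1, 1, 1, 2]

# Dividing each weight by the total turns the weights into proportions summing to 1.
total = sum(raw_weights)
normalized = [w / total for w in raw_weights]

# With proportions, no further division by the sum of weights is needed.
weighted_mean = sum(s * w for s, w in zip(sample, normalized))

print(normalized)
print(weighted_mean)
```

The normalized weights come out as the same proportions used in Example 3-2, which is why both examples produce the same weighted mean.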


### Median

The median is the middlemost value in a set of ordered values. If you have an even number of values, you average the two centermost values.

In [18]:
# Example 3-4 Calculating the median in Python
def median(values):
    values = sorted(values)
    n = len(values)
    if n % 2 == 0:
        mid = int(n/2) - 1
    else:
        mid = int(n/2)
    
    if n % 2 == 0:
        median = (values[mid] + values[mid + 1]) / 2
        return median
    else:
        median = values[mid]
        return median
    
sample = [0, 1, 5, 7, 9, 10, 14]

print(median(sample))

7


Recall that in Python $int(\dfrac{3}{2})$ returns 1, not 2, because converting to an integer drops the fractional part.

The median can be a helpful alternative to the mean when data is skewed by outliers, that is, values that are extremely high or low compared to the rest. When the median is quite different from the mean, the data is likely skewed.
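A quick sketch of this effect, using Python's built-in `statistics` module and a made-up list of values (the numbers are hypothetical and chosen only to make the outlier obvious):

```python
from statistics import mean, median

# Hypothetical values clustered between 30 and 40.
values = [30, 32, 35, 38, 40]
print(mean(values), median(values))

# Adding a single extreme value drags the mean far more than the median.
with_outlier = values + [1000]
print(mean(with_outlier), median(with_outlier))
```

Without the outlier the mean and median coincide; with it, the mean moves far from the bulk of the data while the median barely changes.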

Recall that the median is simply the 50% quantile, meaning that 50% of the ordered values fall below it. In the same way, there are other quantiles, such as the 25% and 75% quantiles; together with the median, these are called quartiles.
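As a rough illustration, the quartiles of the sample from Example 3-4 can be computed with `statistics.quantiles` (available since Python 3.8). Note that several interpolation conventions exist; the choice of `method="inclusive"` here is an assumption, and other conventions can give slightly different first and third quartiles.

```python
from statistics import median, quantiles

sample = [0, 1, 5, 7, 9, 10, 14]

# n=4 splits the ordered data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == median(sample))  # the middle quartile is the median
```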

### Mode

The mode is the most frequently occurring value (or set of values). It is mainly useful when your data is repetitive and you want to find which values occur most often. When no value occurs more than once, there is no mode. When two values occur with the same highest frequency, the dataset is said to be bimodal.

In [30]:
# Example 3-5 Calculating the mode in Python
from collections import defaultdict
def mode(values):
    # Build a dictionary mapping each value to its number of occurrences in the sample
    counts = defaultdict(lambda: 0)
    for s in values:
        counts[s] += 1
    
    # Highest number of occurrences
    max_count = max(counts.values())
    
    # Mode
    modes = [v for v in set(values) if counts[v] == max_count]
    
    return modes

sample = [1, 3, 2, 5, 7, 0, 2, 3]

mode(sample)

[2, 3]
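For comparison, here is a shorter sketch of the same idea using `collections.Counter` and `statistics.multimode` from the standard library (`multimode` requires Python 3.8 or later). The results are sorted only so that the order is predictable.

```python
from collections import Counter
from statistics import multimode

sample = [1, 3, 2, 5, 7, 0, 2, 3]

# Counter tallies the occurrences of each value in one pass.
counts = Counter(sample)
max_count = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == max_count)
print(modes)

# multimode returns every value that shares the highest count.
print(sorted(multimode(sample)))
```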

### Variance and Standard Deviation

#### Population Variance and Standard Deviation

Let's say we want to study the number of pets owned by members of the working staff. Note that we are treating the staff as our population, not a sample. To measure how spread out the data is, we first compute its mean and then subtract the mean from each value. Next, we square each difference, which gets rid of negative values and amplifies larger differences (squares are also mathematically easier to work with). Finally, we sum these squared differences and divide by the population size.

Here is the formula:

- $variance = \dfrac{(x_1 - mean)^2 + (x_2 - mean)^2 + (x_3 - mean)^2 + ... + (x_N - mean)^2}{N}$

- or alternatively, $\sigma^{2} = \dfrac{\sum{(x_i - \mu)^2}}{N}$

In [3]:
# Example 3-6 Calculating the variance in Python
data = [0, 1, 5, 7, 9, 10, 14]

def variance(values):
    # Use the function argument rather than the global list
    n = len(values)
    mean = sum(values) / n
    variance = sum([ (value - mean) ** 2 for value in values ]) / n
    return variance

print(variance(data))

21.387755102040813


Thus, the variance for the number of pets is about 21.39. But what exactly does that mean? It is reasonable to conclude that the higher the variance, the more spread out the data. But because we squared the differences, the variance is expressed in squared units rather than in the original unit (pets). To bring it back to the scale we started with, we take the square root of the variance, which gives the standard deviation.

Here is the full formula for the standard deviation:
- $\sigma = \sqrt{\sigma^{2}} = \sqrt{\dfrac{\sum{(x_i - \mu)^2}}{N}}$ 

In [5]:
# Example 3-7 Calculating the standard deviation in Python
from math import sqrt

data = [0, 1, 5, 7, 9, 10, 14]

def variance(values):
    # Use the function argument rather than the global list
    n = len(values)
    mean = sum(values) / n
    variance = sum([ (value - mean) ** 2 for value in values ]) / n
    return variance

def square_root(variance):
    return sqrt(variance)

print(square_root(variance(data)))

4.624689730353898


We get a standard deviation of about 4.62 pets. In this way, we express how spread out our data is in the original unit.
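As a cross-check, Python's standard `statistics` module provides `pvariance` and `pstdev`, which use the population formulas (dividing by $N$). This is only a sanity-check sketch on the same pet data, not a replacement for the hand-written functions.

```python
import math
from statistics import pstdev, pvariance

data = [0, 1, 5, 7, 9, 10, 14]

pop_variance = pvariance(data)  # divides by N
pop_std_dev = pstdev(data)      # square root of the population variance

print(pop_variance)
print(pop_std_dev)

# The standard deviation is the square root of the variance.
print(math.isclose(pop_std_dev, math.sqrt(pop_variance)))
```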

#### Sample Variance and Standard Deviation

To get the variance and standard deviation for a sample instead of a population, we have to make a small adjustment to our formulas.
The sample variance and sample standard deviation become:
- $sample$ $variance$ $= s^{2} = \dfrac{\sum{(x_i - \bar{x})^2}}{n - 1}$
- $sample$ $standard$ $deviation$ $= s = \sqrt{\dfrac{\sum{(x_i - \bar{x})^2}}{n - 1}}$

Thus, we divide the sum of squared differences by $n-1$ instead of $N$. Because the differences are measured from the sample mean $\bar{x}$ rather than the true population mean, dividing by $n$ would tend to underestimate the population variance. Dividing by $n-1$ reduces this bias and yields a slightly larger variance, which reflects the greater uncertainty in a sample.

In [7]:
# Example 3-8 Calculating standard deviation for a sample
from math import sqrt

data = [0, 1, 5, 7, 9, 10, 14]

def variance(values, is_sample: bool = False):
    # Use the function argument rather than the global list
    n = len(values)
    mean = sum(values) / n
    # Divide by n - 1 for a sample, by n for a population
    return sum((value - mean) ** 2 for value in values) / (n - (1 if is_sample else 0))

def standard_deviation(values, is_sample: bool = False):
    return sqrt(variance(values, is_sample))

print(variance(data, is_sample=True))  # Sample variance
print(standard_deviation(data, is_sample=True))  # Sample standard deviation

24.95238095238095
4.99523582550223


We get a higher variance and standard deviation solely because of the smaller denominator. This is the intended effect: a sample is an imperfect representation of the population, so the estimate of spread is widened to account for that uncertainty.
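The effect of the denominator can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a standard deviation of 2 (so a true variance of 4), and the sample size and number of trials are arbitrary choices for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_STD = 2.0     # assumed population standard deviation (true variance is 4.0)
SAMPLE_SIZE = 5
TRIALS = 20_000

def variance(values, is_sample=False):
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((v - mean) ** 2 for v in values)
    return squared_diffs / (n - 1 if is_sample else n)

divide_by_n_total = 0.0
divide_by_n_minus_1_total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0, TRUE_STD) for _ in range(SAMPLE_SIZE)]
    divide_by_n_total += variance(sample)
    divide_by_n_minus_1_total += variance(sample, is_sample=True)

avg_divide_by_n = divide_by_n_total / TRIALS
avg_divide_by_n_minus_1 = divide_by_n_minus_1_total / TRIALS

print(f"average when dividing by n:     {avg_divide_by_n:.3f}")
print(f"average when dividing by n - 1: {avg_divide_by_n_minus_1:.3f}")
```

Averaged over many small samples, dividing by $n$ comes out noticeably below the true variance of 4, while dividing by $n-1$ comes out close to it.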

### Normal Distribution

The normal distribution, also known as the $Gaussian$ $Distribution$, is a symmetrical bell-shaped distribution that has most of its mass around the mean, and its spread is defined by the standard deviation. The tails on either side become thinner as you move away from the mean.

#### Properties of a Normal Distribution

The normal distribution has the following properties:
- it's symmetrical, both sides are identically mirrored at the mean, which is the center;
- most mass is at the center around the mean;
- it has a spread that is specified by the standard deviation;
- the tails are the least likely outcomes and approach zero asymptotically but never touch it;
- it resembles many phenomena in nature and daily life, and, thanks to the central limit theorem, it even applies to problems that are not normally distributed.
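The last property can be sketched with a simulation: averages of samples drawn from a uniform (clearly non-normal) distribution pile up in a bell shape around the true mean. The sample size, number of trials, and histogram range below are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30
TRIALS = 5_000

# Each entry is the mean of 30 draws from a uniform distribution on [0, 1).
sample_means = [
    sum(random.random() for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(TRIALS)
]

print(f"mean of sample means:  {mean(sample_means):.4f}")
print(f"stdev of sample means: {stdev(sample_means):.4f}")

# Crude text histogram of the sample means between 0.3 and 0.7.
low, high, n_bins = 0.3, 0.7, 10
bins = [0] * n_bins
for m in sample_means:
    if low <= m < high:
        index = min(n_bins - 1, int((m - low) / (high - low) * n_bins))
        bins[index] += 1

for i, count in enumerate(bins):
    left_edge = low + i * (high - low) / n_bins
    print(f"{left_edge:.2f} {'#' * (count // 25)}")
```

The sample means should center near 0.5 with a spread near $\sqrt{1/12}/\sqrt{30} \approx 0.053$, and the histogram should look roughly bell-shaped even though the underlying data is uniform.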

#### The Probability Density Function

The standard deviation plays an important role in the normal distribution because it defines how spread out it is. It is one of the distribution's two parameters, alongside the mean. The probability density function (PDF) that creates the normal distribution is as follows:

- $f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} e^{-\dfrac{1}{2} \left(\dfrac{x-\mu}{\sigma}\right)^2}$

Just like the beta distribution, the normal distribution is continuous. This means that to retrieve a probability we need to integrate over a range of $x$ values to find an area.
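A minimal sketch of both ideas, using only the standard library: the normal PDF written out directly, and a simple midpoint-rule integration to turn a range of $x$ values into an area. The helper names `normal_pdf` and `integrate`, the standard normal parameters (mean 0, standard deviation 1), and the step count are assumptions made for this illustration.

```python
import math

def normal_pdf(x, mean, std_dev):
    # Height of the normal curve at x
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mean) / std_dev) ** 2
    return coefficient * math.exp(exponent)

def integrate(f, lower, upper, steps=10_000):
    # Midpoint rule: sum the areas of many thin rectangles under the curve
    width = (upper - lower) / steps
    return sum(f(lower + (i + 0.5) * width) for i in range(steps)) * width

# Height of the curve at the mean; this is a density, not a probability.
peak = normal_pdf(0.0, 0.0, 1.0)

# Probability of landing within one standard deviation of the mean.
area = integrate(lambda x: normal_pdf(x, 0.0, 1.0), -1.0, 1.0)

print(peak)
print(area)
```

The area between -1 and 1 comes out at roughly 0.68, which is the probability of observing a value within one standard deviation of the mean.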

#### The Cumulative Distribution Function

With the normal distribution, the vertical axis is not the probability but the likelihood (the probability density) of the data. To find a probability, we need to look at a given range and then find the area under the curve for that range.

There is a relationship between the probability density function (PDF) and the cumulative distribution function (CDF). The CDF is an S-shaped curve that gives the area under the PDF up to a given value of $x$. For example, if we capture the area from negative infinity to the mean of the normal distribution, the CDF shows a value of .5, or 50%. This is because of the symmetry of the normal distribution.

In [12]:
# Example 3-10 The normal distribution CDF in Python
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

x = norm.cdf(64.43, mean, std_dev)  # CDF up to 64.43
print(x)

0.5


We have a 50% probability of observing a value up to 64.43.

In [14]:
# Example 3-11 Getting a middle range probability using the CDF in Python
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

x = norm.cdf(66, mean, std_dev) - norm.cdf(62, mean, std_dev)
print(x)

0.4920450147062894


We have a 49.20% probability of observing a value between 62 and 66.
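For readers without SciPy, the same two probabilities can be reproduced with the standard library, using the identity that expresses the normal CDF through the error function `math.erf`. The helper name `normal_cdf` is introduced only for this sketch.

```python
import math

def normal_cdf(x, mean, std_dev):
    # Normal CDF expressed through the error function
    z = (x - mean) / (std_dev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

mean = 64.43
std_dev = 2.99

up_to_mean = normal_cdf(64.43, mean, std_dev)
between = normal_cdf(66, mean, std_dev) - normal_cdf(62, mean, std_dev)

print(up_to_mean)
print(between)
```

Subtracting one CDF value from another gives the area between the two points, which is why the middle-range probability needs two calls.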