### Introduction

Throughout this course we'll learn to summarize the distribution of a variable with a single value. Depending on the particular characteristics of a distribution, we'll see that we can summarize it using the mean, the weighted mean, the median, or the mode.

We'll also learn to measure the variability in a distribution. If we have a distribution A with the values [3, 3, 3, 3], and a distribution B with [30, 1, 15, 43], we can clearly see that there's much more variability (diversity) in B. We'll learn to quantify variability using measures like variance and standard deviation.

Once we understand the measures of variability, we can learn how to locate a value in a distribution and determine how it compares to other values. For instance, when we analyze salaries, we might want to find out whether a salary of $75,000 is common or extreme within a company. We'll learn to answer this question with precision using a z-score.

In this first mission, we'll discuss the mean in detail. We covered the mean briefly in the previous courses of the data science path, but here we revisit the concept in much more depth.

### The mean

Let's say we want to summarize the distribution below with a single value that is representative of the distribution as a whole.

`[0, 1, 4, 7, 8, 10]`


Intuitively, a summary value that's representative of the entire distribution should give equal weight to every value in it. We could sum all the values in the distribution and then divide the total by the number of values we added. This way, every value contributes equally to the result:


![image.png](attachment:image.png)


When we compute the summary value of a distribution in this way, we call the value the arithmetic mean, or the mean. For our distribution above, the mean is 5.
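The computation described above can be sketched in a few lines of Python. The variable names are illustrative only and are not part of the original material.

```python
# Sum the values, then divide by how many values there are.
distribution = [0, 1, 4, 7, 8, 10]

total = sum(distribution)   # 30
count = len(distribution)   # 6
mean = total / count

print(mean)  # 5.0
```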

We have good reasons to consider 5 a representative value for the distribution above. First, notice that 5 is right at the center of the distribution's range, which is 0 - 10 (0 is the minimum value in the distribution, and 10 is the maximum value):

![image.png](attachment:image.png)


Also, each value is fairly close to the mean. The nearest value to the mean is 4, which is just 1 unit away from the mean, while the farthest values are 0 and 10, located 5 units away from the mean.

![image.png](attachment:image.png)


Although the distance for each individual value varies, the sum of the distances of the values that are below the mean is equal to the sum of the distances of the values that are above the mean:

![image.png](attachment:image.png)
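As a quick check of this claim, the sketch below measures the distances on each side of the mean for the same distribution. The names `below` and `above` are assumptions chosen for this illustration.

```python
distribution = [0, 1, 4, 7, 8, 10]
mean = sum(distribution) / len(distribution)  # 5.0

# Distance of each value from the mean, split by side.
below = [mean - x for x in distribution if x < mean]  # [5.0, 4.0, 1.0]
above = [x - mean for x in distribution if x > mean]  # [2.0, 3.0, 5.0]

print(sum(below), sum(above))  # 10.0 10.0
```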

#### Exercise

In [6]:
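distribution = [0,2,3,3,3,4,13]

mean = sum(distribution) / len(distribution)
center = False  # the mean (4.0) is not the midpoint of the 0 - 13 range (6.5)
equal_distances = True  # below: 4 + 2 + 1 + 1 + 1 = 9; above: 13 - 4 = 9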

4.0

### The mean as a balance point

In the last exercise, we observed that for the distribution [0,2,3,3,3,4,13] the mean, 4, is not at the center of the 0 - 13 range:

![image.png](attachment:image.png)

As a consequence, we should avoid thinking of the mean as the center of a distribution's range. In some cases the mean coincides with the center of the range, but we've just seen that this doesn't hold for all distributions.
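To make the comparison concrete, here is a minimal sketch that computes both the mean and the center of the range for this distribution. The name `range_center` is an assumption used only for this illustration.

```python
distribution = [0, 2, 3, 3, 3, 4, 13]

mean = sum(distribution) / len(distribution)
# Midpoint between the minimum and the maximum value.
range_center = (min(distribution) + max(distribution)) / 2

print(mean)          # 4.0
print(range_center)  # 6.5
```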


We should think of the mean as the value located at the point in the distribution where the total distance of the values below the mean equals the total distance of the values above the mean. In our last exercise, we saw that this holds true for the distribution [0,2,3,3,3,4,13].


![image.png](attachment:image.png)

In fact, this is true for the distribution of any variable measured on an interval or ratio scale.
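One way to see why this holds in general is a short derivation. It is only a sketch, and it assumes a notation not yet introduced at this point: $x_1, x_2, \dots, x_N$ for the $N$ values and $\mu$ for their mean, so that $\sum_{i=1}^{N} x_i = N\mu$.

$$
\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - N\mu = N\mu - N\mu = 0
$$

Because the signed distances sum to zero, the positive ones (values above the mean) must exactly cancel the negative ones (values below the mean).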

To give students a better intuition for this property of the mean, it's common in the literature to describe the mean as the balance point of a lever.

![image.png](attachment:image.png)

If the total distances above and below the mean were equivalent to the forces exerted by the weights on the rod of the lever, then the same amount of force would be exerted on each side of the mean. This would keep the rod in perfect equilibrium:


![image-2.png](attachment:image-2.png)
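The lever analogy can be sketched numerically under a simplifying assumption: every value is a weight of one unit placed at its position on the rod. The helper `net_torque` is hypothetical, introduced only for this illustration.

```python
distribution = [0, 2, 3, 3, 3, 4, 13]
mean = sum(distribution) / len(distribution)

def net_torque(values, pivot):
    """Signed sum of distances from the pivot, one unit weight per value."""
    return sum(x - pivot for x in values)

print(net_torque(distribution, mean))  # 0.0
print(net_torque(distribution, 6.5))   # -17.5
```

With the pivot at the mean the two sides cancel out. With the pivot at the center of the range (6.5), the result is negative, meaning the rod would tip toward the lower values.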

Now that we've seen that the total distances below the mean equal the total distances above the mean, we'll check this rule by measuring the distances for 5000 different distributions in the exercise below.

- Generate 5000 different distributions, measure the total distances above and below the mean, and check whether they are equal. For each of the 5000 iterations of a for loop:

    - Set a seed using the seed() function from numpy.random. For the first iteration, the seed number should be 0, for the second iteration it should be 1, for the third it should be 2, and so on.
    - Randomly generate a distribution of integers using the randint() function from numpy.random. Pass the right arguments to randint() such that each distribution:
        - Has 10 values.
        - Contains values ranging from 0 to 1000.
    - Compute the mean of the distribution.
    - Measure the total distance above and below the mean.
        - Round off each distance to 1 decimal place using the round() function. This will prevent rounding errors at the 13th or 14th decimal place.
    - Compare the two sums. If they are equal, then increment a variable named equal_distances by 1. You'll need to define equal_distances outside the loop with a value of 0.

- At the end equal_distances should have a value of 5000. This will confirm that for each of the 5000 distributions the total distance of the values above the mean is equal to the total distance of the values below the mean.

In [13]:
import numpy as np
from numpy.random import randint, seed

equal_distances = 0
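for i in range(5000):
    seed(i)  # seed 0 for the first iteration, 1 for the second, and so on
    dist = randint(0, 1000, size = 10)  # 10 integers; the upper bound 1000 is exclusive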
    mean = round( dist.mean(), 1 )
    below = round( (mean - dist[dist < mean]).sum(), 1)
    above = round( (dist[dist > mean] - mean).sum(), 1)
    
    if below == above:
        equal_distances += 1

equal_distances    

5000
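As an alternative sketch, the same check can be written with NumPy's newer `Generator` API and a tolerance-based comparison instead of rounding. This is not the approach the exercise asks for. It assumes that `np.random.default_rng(i)` is an acceptable substitute for `seed(i)`, and it draws different numbers than `randint()` would.

```python
import numpy as np

equal_distances = 0
for i in range(5000):
    rng = np.random.default_rng(i)         # one seeded generator per iteration
    dist = rng.integers(0, 1001, size=10)  # 10 integers from 0 to 1000 inclusive
    mean = dist.mean()
    below = (mean - dist[dist < mean]).sum()
    above = (dist[dist > mean] - mean).sum()

    # Compare with a tolerance rather than rounding the sums.
    if np.isclose(below, above):
        equal_distances += 1

print(equal_distances)  # 5000
```

Using `np.isclose()` absorbs the tiny floating-point differences that the rounding step in the exercise is meant to hide.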

### Defining the mean algebraically

A very useful property of the mean is that it can be defined algebraically in a simple way. This is how we can define the mean for any population of N values (N is the number of values in the population):

![image.png](attachment:image.png)


By convention, the mean of a population is denoted with the Greek letter $μ$ (pronounced "mew"), so we can rewrite the formula above as:

![image.png](attachment:image.png)

Let's say the distribution `[0,2,3,3,3,4,13]` represents a population. The distribution has 7 values, so $N = 7$. Let's plug the values into the formula above:

![image.png](attachment:image.png)

Above, we computed the mean for a population. When we compute the mean for a sample, we use a slightly different notation to make that distinction clear. Instead of $μ$, we denote the sample mean using $\bar{x}$ (pronounced "x-bar"), and we use $n$ instead of $N$ to denote the number of values in the sample. This is how we could define the sample mean algebraically:

![image.png](attachment:image.png)
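For readers following along in plain text, the two definitions can be sketched in LaTeX as follows. The subscripted $x_1, x_2, \dots$ notation for the individual values is an assumption here and may differ from the notation used in the figures.

$$
\mu = \frac{x_1 + x_2 + \dots + x_N}{N}
\qquad
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
$$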

Let's say we sample three values from the distribution used above and end up with the sample [2,3,4]. Let's plug the values into the sample mean formula:

![image.png](attachment:image.png)
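Both computations can be reproduced with a short Python sketch. The variable names `mu` and `x_bar` are assumptions chosen to mirror the notation above.

```python
population = [0, 2, 3, 3, 3, 4, 13]
sample = [2, 3, 4]

# Population mean: sum of all N values divided by N.
mu = sum(population) / len(population)   # 28 / 7

# Sample mean: sum of the n sampled values divided by n.
x_bar = sum(sample) / len(sample)        # 9 / 3

print(mu, x_bar)  # 4.0 3.0
```

The arithmetic is identical in both cases. Only the notation changes, to signal whether the values form a population or a sample.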

Alternative notations:

![image.png](attachment:image.png)

### An alternative definition

