# Statistics

_This document is part of [doper](https://github.com/mwermelinger/doper),
a collection of domain-oriented programming exercises._

Statistics is the branch of mathematics that deals with collecting, analysing
and presenting data. Statistics inform most human activity,
whether it's scientific research, business management or government policy.
They also appear in advertisements and in the press,
sometimes to influence our opinion,
so it's increasingly important to have some knowledge of statistics.

The following exercises ask you to implement simple statistical measures
that summarise the collected data.
For all exercises, the collected data is a non-empty list of integers.
The integers can represent daily sales of a product, daily temperatures or
rainfall, salaries of surveyed people, etc.

If you're taking an algorithms and data structures course, you may wish to
measure the run-time of your functions and analyse their complexity.
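
One way to measure run-time is with Python's standard `timeit` module.
The sketch below is only an illustration: the helper name `time_function`
and the input sizes are assumptions, not part of the exercises.
It times the built-in `sorted` as a stand-in for one of your own functions.

```python
import timeit

def time_function(function, values: list, repeats: int = 100) -> float:
    """Return the average time, in seconds, of one call to function(values)."""
    # timeit.timeit returns the total time taken by `repeats` calls.
    total = timeit.timeit(lambda: function(values), number=repeats)
    return total / repeats

# Doubling the input size gives a rough idea of how the run-time grows.
for size in (1_000, 2_000, 4_000):
    data = list(range(size))
    print(size, time_function(sorted, data))
```

Timings vary between runs and machines, so compare how they grow
rather than their absolute values.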

## 1 Variability
A simple measure of how much the data varies is its **range**:
the difference between the largest and the smallest value.

There are at least two ways of implementing this measure:
with two passes over the data or with a single pass.

In [1]:
def range_of(values: list) -> int:
    """Return the difference between the largest and smallest values."""
    pass

assert range_of([5]) == 0               # range is 5 - 5
assert range_of([3, 1, -4, 2]) == 7     # range is 3 - -4
assert range_of([1, 2, 4, 8, 16]) == 15
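
If you want to compare your attempt against something, the two approaches
mentioned above could be sketched as follows. The function names
`range_two_passes` and `range_one_pass` are made up for this illustration,
and both assume a non-empty list, as stated in the introduction.

```python
def range_two_passes(values: list) -> int:
    """Return the range, using one pass to find the largest value
    and another pass to find the smallest."""
    return max(values) - min(values)

def range_one_pass(values: list) -> int:
    """Return the range, tracking the smallest and largest values
    in a single loop."""
    smallest = largest = values[0]      # assumes the list is non-empty
    for value in values:
        if value < smallest:
            smallest = value
        elif value > largest:
            largest = value
    return largest - smallest

assert range_two_passes([3, 1, -4, 2]) == range_one_pass([3, 1, -4, 2]) == 7
```

Both versions take time proportional to the length of the list;
the single loop only avoids traversing the data twice.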

## 2 Central tendency
There are three common measures to indicate the 'central' value in the data.

### 2.1 Mean
The **mean** is the sum of the values, divided by how many there are.
For example, the mean of 1, 2, 3 and 20 is (1 + 2 + 3 + 20) / 4 = 26 / 4 = 6.5.

In [2]:
def mean(values: list) -> float:
    """Return the sum of the values divided by the number of values."""
    pass

assert mean([1, 2, 3, 20]) == 6.5
assert mean([2, 20, 1, 3]) == 6.5   # the order doesn't matter
assert mean([5, -5]) == 0
assert mean([2, 2, 2]) == 2
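
The definition translates almost directly into Python.
This is one possible sketch, under the hypothetical name `mean_sketch`,
assuming a non-empty list so that the division is defined.

```python
def mean_sketch(values: list) -> float:
    """Return the sum of the values divided by the number of values."""
    # The / operator always produces a float, even if the division is exact.
    return sum(values) / len(values)

assert mean_sketch([1, 2, 3, 20]) == 6.5    # 26 / 4
```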

### 2.2 Median
The **median** is the middle value if we list the values in sorted order
(from lowest to highest, or highest to lowest, it doesn't matter).
For example, the median of 7, 5 and 15 is 7, because it's the middle value
of 5, 7, 15 and of 15, 7, 5.

If there's an even number of values, then there are two values in the middle.
In that case, the median is the mean of those values. For example, the median of
1, 2, 3 and 20 is the mean of 2 and 3, namely 2.5.

This exercise requires sorting.
Use Python's `sorted` function to avoid modifying the input list.

In [3]:
def median(values: list) -> float:
    """Return the middle of the sorted values, or the mean of both middle values."""
    pass

assert median([5]) == 5
assert median([7, 5, 15]) == 7
assert median([1, 2, 3, 20]) == 2.5
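
The two cases described above (odd and even number of values) might be
handled as in this sketch. The name `median_sketch` is hypothetical,
and the list is again assumed to be non-empty.

```python
def median_sketch(values: list) -> float:
    """Return the middle of the sorted values,
    or the mean of both middle values."""
    ordered = sorted(values)            # a new list: the input is unchanged
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:           # odd length: one middle value
        return ordered[middle]
    # Even length: the two middle values are at middle - 1 and middle.
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [1, 20, 3, 2]
assert median_sketch(data) == 2.5
assert data == [1, 20, 3, 2]            # the input list was not modified
```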

Unfortunately, the press, adverts and even reports tend to use the word
'average' instead of 'median' or 'mean'.

The statement "the average salary is X" is meaningless without clarification.
If the average is the median, then about half of the people earn less than X
and about half earn more. If the average is the mean, then it's likely that
many more than half of the people earn less than X, because even a few
very high earners pull the mean up.
As you've seen, the median of 1, 2, 3, 20 is 2.5 but the mean is 6.5,
more than double the median.
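
To see the effect on the example data, the following sketch uses the
`mean` and `median` functions of Python's standard `statistics` module
(rather than your own implementations) and counts how many values
fall below the mean.

```python
import statistics

values = [1, 2, 3, 20]
mean_value = statistics.mean(values)        # 6.5
median_value = statistics.median(values)    # 2.5

# Count the values strictly below the mean.
below_mean = sum(1 for value in values if value < mean_value)
print(f"{below_mean} of {len(values)} values are below the mean")
# prints: 3 of 4 values are below the mean
```

A single large value (20) is enough to put three quarters of this data
below the mean, while the median still splits the data in half.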

### 2.3 Mode
The **mode** is the most common value in the data.
For example, the mode of 2, 3, 2, 3, 0 and 2 is 2 because it occurs three times,
more often than the other values (0 occurs once and 3 occurs twice).

You can assume that exactly one value occurs more often than the others.
In other words, you don't have to deal with values like 1, 2, 3, 2 and 1,
which have more than one mode.

There are several ways of computing the mode, some involving
additional data structures. As before, do not modify the input list.

In [4]:
def mode(values: list) -> int:
    """Return the value that occurs most frequently."""
    pass

assert mode([5]) == 5
assert mode([2, 3, 2, 3, 0, 2]) == 2
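
As an illustration of two of the possible approaches, the sketch below
shows one version that uses an additional data structure
(`collections.Counter` from the standard library) and one that only
scans a sorted copy. The names `mode_by_counting` and `mode_by_sorting`
are hypothetical, and both rely on the assumptions above:
a non-empty list with exactly one mode.

```python
from collections import Counter

def mode_by_counting(values: list) -> int:
    """Return the most frequent value, using a table of counts."""
    counts = Counter(values)                # maps each value to its frequency
    value, _ = counts.most_common(1)[0]     # the (value, count) pair with the highest count
    return value

def mode_by_sorting(values: list) -> int:
    """Return the most frequent value, by finding the longest run
    of equal values in a sorted copy."""
    ordered = sorted(values)                # the input is unchanged
    best_value, best_count = ordered[0], 0
    current_value, current_count = ordered[0], 0
    for value in ordered:
        if value == current_value:
            current_count += 1
        else:                               # a new run of equal values starts
            current_value, current_count = value, 1
        if current_count > best_count:
            best_value, best_count = current_value, current_count
    return best_value

assert mode_by_counting([2, 3, 2, 3, 0, 2]) == 2
assert mode_by_sorting([2, 3, 2, 3, 0, 2]) == 2
```

The counting version visits each value once but needs extra memory for
the table; the sorting version needs a sorted copy, and sorting
dominates its run-time.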