In this checkpoint, we'll learn how to describe large datasets using summary statistics that give us a view of the central tendency and variance in datapoints across observed examples. Here are the concepts we'll cover:

* populations
* samples
* central tendency
* mean
* median
* mode
* bias
* variance
* standard deviation
* standard error
* generating summary statistics in Pandas

At the end of this checkpoint, you'll complete a challenge in which you conduct basic summary analysis of a dataset.


## Populations vs. samples

A major purpose of data science is to give us information about some group, known as a *population*. This population can be all the people living in a country, all the purchases made at a store, or any other unit from which information can be drawn. Often, it is difficult, prohibitively expensive, or simply impossible to get data from all members of a population. Imagine trying to get a questionnaire to every person in a country to learn about their product preferences. In practice, it can't be done!

Instead, we *randomly* extract a subset from the population (a random group of people, a random selection of purchases), called a *sample*, that we can study in detail to learn about the population as a whole.

For example, imagine we have a 100-pound bag of M&Ms and we want to know the percentage of green M&Ms in the bag. The bag of M&Ms would be our population. While it would certainly be possible to count every single M&M, it would take a very long time, not to mention being potentially quite messy. As an alternative, we could shake up the bag, pour out half a pound of M&Ms, and count the M&Ms in that sample. If our half pound of M&Ms was 8% green, then it is reasonable to estimate that roughly 8% of the whole 100-pound bag is green too.
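To see the idea in code, here is a minimal sketch of the M&M example using only the Python standard library. The population size, the 8% share of green candies, and the sample size are all made-up numbers chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 candies, 8% of them green (made-up numbers).
population = ["green"] * 8_000 + ["other"] * 92_000

# Draw a random sample without replacement, like pouring from a shaken bag.
sample = random.sample(population, 2_000)

population_share = population.count("green") / len(population)
sample_share = sample.count("green") / len(sample)

print(f"population: {population_share:.3f}, sample: {sample_share:.3f}")
```

The sample share will usually land close to, but not exactly on, the population share. How close depends on the sample size.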

Statisticians take data about a sample and reduce the complexity of that data into understandable and accurate summaries, known as *statistics*. Statisticians use the **sample** statistics to infer information about the entire **population** from which the sample is taken.


## Measures of central tendency

Statistics can describe either an individual variable or the relationships among two or more variables. A *variable* represents information about a particular measurable concept (temperature, price, size, etc.). Each measurement within a variable is called a *datapoint*. Let's make a dataframe in Python with one variable, `age`, that we can play with later on.
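The data itself isn't shown at this point, so the sketch below builds a small stand-in dataframe with pandas. The eight ages are invented for illustration and carry no special meaning; any list of numbers would work for the examples that follow.

```python
import pandas as pd

# Hypothetical ages, invented for illustration only.
df = pd.DataFrame({"age": [28, 42, 27, 24, 35, 54, 35, 37]})

print(df)
```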



The central tendency describes a point around which datapoints in a variable cluster. Central tendency can be measured in a number of ways. The most common measures are the *mean*, the *median*, and the *mode*.


### Mean

The mean represents the average value within a variable, and is computed as the sum of the individual datapoints in a variable `x` divided by the number of datapoints `n`. It is sometimes also referred to as the "expected value" of a variable.

```python
mean = sum(x) / n
```

Here are two ways you can compute the mean of our `age` data, first with built-in Python functionality and then with NumPy.

```python
# Using built-in Python functionality.
sum(df['age']) / len(df['age'])

# Using NumPy
import numpy as np

np.mean(df['age'])
```

The mean is easy to understand and commonly used, but it's sensitive to extreme values: one abnormally large value in a set of otherwise small values will cause the mean to become much larger.
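A quick way to see this sensitivity is to compute the mean before and after adding one extreme value. The sketch below uses a hypothetical list of ages and a deliberately implausible outlier of 400.

```python
import statistics

ages = [28, 42, 27, 24, 35, 54, 35, 37]  # hypothetical ages
with_outlier = ages + [400]              # one implausibly large value

mean_before = statistics.mean(ages)
mean_after = statistics.mean(with_outlier)

print(mean_before, mean_after)
```

A single added value more than doubles the mean here, even though the other eight values are unchanged.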


### Median

The *median* represents the middle value in a variable when the values are ordered from least to greatest. If there is an odd number of values in a variable, the median is the middle value. If there is an even number of values, the median is the average of the two middlemost values.
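The odd and even cases can be written out by hand. The function below is a minimal sketch for illustration (the name `median_by_hand` is made up here), not a replacement for the library functions.

```python
def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # order from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the average of the two middlemost values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_hand([3, 1, 2]))
print(median_by_hand([4, 1, 3, 2]))
```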

Here's how you can compute the median of our `age` data using the `statistics` module of the Python standard library or NumPy.

```python
# Vanilla Python, using the built-in statistics module.
import statistics

statistics.median(df['age'])

# Using NumPy.
import numpy as np

np.median(df['age'])
```

The median, like the mean, is easy to understand, and has the added benefit that it isn't sensitive to extreme values. However, the median has fewer useful mathematical properties than the mean, as we'll see later.
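To illustrate that robustness, the sketch below adds one extreme value to a hypothetical list of ages and compares the median before and after.

```python
import statistics

ages = [28, 42, 27, 24, 35, 54, 35, 37]  # hypothetical ages
with_outlier = ages + [400]              # one implausibly large value

median_before = statistics.median(ages)
median_after = statistics.median(with_outlier)

print(median_before, median_after)
```

For this particular data the median does not move at all, because the outlier only changes which position counts as the middle, not the values near it.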


### Mode

The *mode* represents the value in a variable that occurs the most frequently.

```python
# Return the mode using the statistics module.
import statistics
statistics.mode(df['age'])
```

If two or more values in a variable occur with equal frequency, there will be multiple modes. How the code above behaves in that case depends on your Python version: in Python 3.7 and earlier, `statistics.mode` raises a `StatisticsError`, while in Python 3.8 and later it silently returns the first mode it encounters. Either way, generating and inspecting a list of counts beforehand will show whether there is more than one mode to look for.
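One way to inspect the counts is with `collections.Counter` from the standard library. The sketch below uses a hypothetical list of ages in which two values tie for most frequent, and reports how many values share the highest count.

```python
from collections import Counter

# Hypothetical ages; 35 and 42 both appear twice.
ages = [28, 42, 27, 24, 35, 54, 35, 37, 42]

counts = Counter(ages)
highest = max(counts.values())

# How many distinct values occur with the highest frequency?
n_modes = sum(1 for c in counts.values() if c == highest)

print(counts.most_common(3))
print("number of modes:", n_modes)
```

If `n_modes` is greater than 1, a single-value answer from any mode function is hiding a tie.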

```python
# Generate a list of unique elements along with how often they occur.
(values, counts) = np.unique(df['age'], return_counts=True)

# The location in the values list of the most-frequently-occurring element.
ind = np.argmax(counts)

# The most frequent element.
values[ind]
```
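
The code above will handle data with multiple modes without raising an exception, but you'll get back just one mode. Because `np.unique` returns its values in sorted order and `np.argmax` returns the first maximum, that mode is the smallest one. If you want to push your understanding of Python, you can challenge yourself to revise it to give you all of the modes.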

### Quick note about bias

The mean calculated from a **sample** is an unbiased estimate of the **population** mean. The sample median and mode are commonly used in the same way to estimate the population median and mode, although they are not guaranteed to be unbiased for every distribution. An estimate is *"unbiased"* if, averaged across many representative samples, the sample estimates equal the population value. A *"biased"* estimate would, on average, be either higher or lower than the population value.

Unbiased estimates are useful because they let us use a small group of observations to make generalizations about a much larger group.
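A small simulation can make this concrete for the mean. The sketch below builds a hypothetical population of ages, draws many small random samples, and compares the average of the sample means to the population mean. The population, sample size, and number of samples are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 ages between 18 and 90.
population = [random.randint(18, 90) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw many small random samples and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

average_of_sample_means = statistics.mean(sample_means)

print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {average_of_sample_means:.2f}")
```

Individual sample means scatter above and below the population mean, but their average lands very close to it, which is the behavior we expect from an unbiased estimate.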



