# Sampling a population

<div class="alert alert-warning">

**In this notebook you will learn about the differences between descriptive and inferential statistics, between samples and populations, and between sample statistics and population parameters.**
    
</div>

We've learnt about visualising a set of data using a histogram and about describing the location and spread of data using descriptive statistics.

The purpose of **descriptive statistics** is to summarise what we know about a set of data. 

However, statistics covers much more than that. In fact, descriptive statistics is one of the smallest parts of statistics, and one of the least powerful. The bigger and more useful part of statistics, called **inferential statistics**, is the part that lets scientists make **inferences** about the wider world and lets them test their ideas about how the natural world works. The word "inference" means the forming of a conclusion from data.

But first, we need to be a bit more explicit about what it is that we’re drawing inferences from (known as the **sample**) and what it is that we’re drawing inferences about (known as the **population**).

## What is a population and what is a sample?

In almost all cases, what we have available to us as researchers is a sample of data. All of the datasets you've been examining over the last few weeks are samples.

As researchers, you and the wider scientific community are not particularly interested in the sample itself. What we are actually interested in is the population of individuals from which the sample was drawn. It is this population we want to say something about, not the single small sample drawn from it.

What is a population?

That really depends on the scientific question you want to answer. So there is no one-size-fits-all answer.

For example, in the field trips you collected data on the size of two-spot ladybirds in Edinburgh graveyards. If you want to understand how the density of Harlequin ladybirds affects the sizes of two-spot ladybirds, and that density varies between Edinburgh graveyards, then each graveyard is a separate population. However, if you want to understand how climate affects the size of two-spot ladybirds then your populations may be separate regions of a country or even different countries because climate varies on larger geographical scales than graveyards.

Irrespective of how we define the population, the critical point is that a sample is a subset of a population, and usually a tiny subset. Observational and experimental studies should be designed so that individuals are **randomly** sampled from a population, which means that every individual in the population has the same chance of being sampled. This is often easier said than done, but we don't have time in this course to discuss this thorny issue properly.
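As a rough illustration of what random sampling means in practice, the sketch below draws a simple random sample using Python's standard `random` module. The population of 500 numbered ladybirds and the sample of 20 are invented for the example; they are not data from the field trips.

```python
import random

# Hypothetical population: ID numbers for 500 ladybirds in one graveyard.
population_ids = list(range(1, 501))

# A fixed seed makes the example repeatable; real studies would not fix it.
rng = random.Random(42)

# sample() draws without replacement, so every individual has the same
# chance of being chosen and no individual is chosen twice.
sample_ids = rng.sample(population_ids, k=20)

print(sorted(sample_ids))
```

In the field, of course, the hard part is obtaining the list of individuals in the first place, which is one reason random sampling is easier said than done.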

## Population parameters and sample statistics

We've looked at lots of different datasets over the last few weeks. All of them contain data sampled from a population. We've plotted the data and calculated their descriptive statistics: i.e., means and standard deviations.

In other words, the **sample mean** and **sample standard deviation** are descriptive statistics of a sample.

The sample mean is usually denoted $\bar{x}$ (pronounced x-bar) and the sample standard deviation is usually denoted $s$. For example, in Exercise 3.2.1 researchers sampled 33 worker bees. We found their mean foraging lifespan to be $\bar{x}$ = 27.85 hours with a standard deviation of $s$ = 20.56 hours.
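For reference, here is a minimal sketch of how $\bar{x}$ and $s$ could be calculated with Python's standard `statistics` module. The six lifespans are invented for illustration and are not the worker bee data; the variable names `x_bar` and `s` are simply chosen to match the notation above.

```python
import statistics

# Hypothetical foraging lifespans in hours (invented for illustration).
lifespans = [12.0, 35.5, 8.2, 51.0, 27.3, 19.8]

x_bar = statistics.mean(lifespans)  # sample mean
s = statistics.stdev(lifespans)     # sample standard deviation (divides by n - 1)

print(f"x-bar = {x_bar:.2f} hours, s = {s:.2f} hours")
```

Note that `statistics.stdev` is the sample version of the standard deviation; `statistics.pstdev` is the version that treats the data as a whole population.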

In theory, if we measured the foraging lifespans of all worker bees in the world we would know the **population** mean foraging lifespan and the **population** standard deviation. 

The **population mean** and the **population standard deviation** are called **parameters**. The population mean is usually denoted $\mu$ (pronounced mew) and the population standard deviation is usually denoted $\sigma$ (pronounced sigma).

To conclude, **statistics** describe a sample and **parameters** describe the population the sample was taken from.

We don't know and, in most cases, will never know the population parameters because we can't measure every single individual in a population. But we do know the statistics of the small sample taken from the population. Inferential statistics is about drawing conclusions about the population parameters from the sample statistics.
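One way to get a feel for the distinction is to simulate a population on a computer, where, unlike in the real world, we can measure every individual. The sketch below does this under stated assumptions: the population of 100,000 values, its mean of 3.0 and its standard deviation of 0.5 are all invented, and the population is assumed to be normally distributed.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the example is repeatable

# Simulated population of 100,000 measurements (invented values).
population = [rng.gauss(3.0, 0.5) for _ in range(100_000)]

# Parameters: only available here because the population is simulated.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Statistics: calculated from a random sample of 50 individuals.
sample = rng.sample(population, k=50)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

print(f"population: mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"sample:     x-bar = {x_bar:.3f}, s = {s:.3f}")
```

The sample statistics should land close to, but not exactly on, the population parameters. That gap is what inferential statistics has to deal with.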

For example, we can use the sample mean to infer the population mean. The inferred population mean is an **estimate** of the true population mean, and is usually denoted $\hat{\mu}$ (pronounced mew-hat). We'll look at an example in the next Notebook.

The following table summarises all of this:

| | Population parameter (unknown) | Sample statistic (known) | Inferred (estimated) population parameter (known) |
| :--- | :---: | :---: | :---: |
| mean | $\mu$ (mew) | $\bar{x}$ (x-bar) | $\hat{\mu}$ (mew-hat) |
| standard deviation | $\sigma$ (sigma) | $s$ | $\hat{\sigma}$ (sigma-hat) |

Next week we'll look at another example of inferential statistics: inferring whether two population means differ by comparing the means of the samples taken from each population.

## The sample size *n*

There's one more piece of information we need to know about our sample.

The number of values in our sample is called the **sample size**. It is usually denoted by *n*.

For example, in the worker bee dataset we had a sample size of *n* = 33. 

The larger the sample size, the better our knowledge of the population the sample is drawn from, which means our estimates of the population parameters are better (more precise, to be precise).

<div class="alert alert-info">

This is demonstrated in the interactive plot in the code cell below for a sample of the masses (in kg) of Alaskan sockeye salmon.
    
Run the following code cell and move the slider at the top of the plot to change the sample size *n*.
    
</div>

```python
%matplotlib widget
from interactive_plots import Sample_size
Sample_size();
```

Clearly, larger sample sizes are better. With sample sizes below 30, our knowledge of the shape of the distribution of salmon masses is limited. As we increase the sample size above 30, the shape of the distribution becomes more clearly defined, which results in more precise estimates of the population parameters.
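The gain in precision can also be checked numerically. The sketch below is a simulation under stated assumptions, not the salmon dataset: the population of 50,000 masses is invented, and `spread_of_sample_means` is a hypothetical helper written for this example. For each sample size it draws 500 random samples and measures how much the sample means vary from one sample to the next.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the example is repeatable

# Simulated population of salmon masses in kg (invented values).
population = [rng.gauss(3.0, 0.5) for _ in range(50_000)]


def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample means from `repeats` samples of size n."""
    means = [statistics.mean(rng.sample(population, k=n)) for _ in range(repeats)]
    return statistics.stdev(means)


spreads = {n: spread_of_sample_means(n) for n in (5, 30, 200)}

for n, spread in spreads.items():
    print(f"n = {n:>3}: sample means vary by about {spread:.3f} kg")
```

The spread shrinks as *n* grows, so a sample mean from a large sample is likely to sit closer to the population mean than one from a small sample.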
 
In science, however, there is a trade-off between the better knowledge of the population that larger samples give us and the time, effort and cost of collecting them. Fortunately, we can design experimental and observational studies that balance the two.

## Next Notebook

[Estimating a population mean](3.5%20-%20Estimating%20a%20population%20mean.ipynb)