# Descriptive statistics: spread

<div class="alert alert-warning">

**In this notebook you will learn how to describe the spread of a set of numerical data in pandas using the standard deviation.**
    
</div>

The mean tells us about the average individual in a set of numerical data. The **spread** tells us how much individuals differ from one another, that is, how widely scattered the observations are around the centre.

The importance of describing **spread** is less obvious than that of describing the mean, but no less crucial. In biology, much of the variability we observe reflects real differences among individuals, and this variability demands measurement.

Measuring variability also gives us perspective. We can ask, "How large are the differences between groups compared with differences within groups?" Biologists also appreciate variation as the stuff of evolution - we wouldn’t be here without variation.

There are several ways of measuring spread. The most important one is the standard deviation. In this notebook we'll see how to calculate it in pandas; in the next notebook we'll see what it actually means.

## Darwin's finches again

Let's consider another example of Darwin's finches to calculate and compare measures of spread using pandas.

<div>
<img src="attachment:darwins_finches.jpg" width='50%' title="Reproduced from Grant, P.R. (1991). Natural Selection and Darwin’s Finches. Scientific American Vol. 265, pp. 82-87"/>
</div>

The file `Datasets/finches beak length.csv` contains the beak lengths of a sample of 100 Cactus finches and 100 Woodpecker finches. 

<div class="alert alert-info">
    
Run the code cell below to print the data and plot it as histograms.
</div>

In [None]:
import pandas as pd
import seaborn as sns

# Read in the Darwin finches beak length data.
beak_lengths = pd.read_csv('Datasets/finches beak length.csv')

# Print it to look at the data.
print(beak_lengths)

# Plot them in histograms. Seaborn knows to plot a separate histogram for each species.
g = sns.displot(beak_lengths)

# Add some annotation.
g.ax.set_xlabel('Beak length (mm)')
g.ax.set_ylabel('Number of finches')
g.ax.set_title('Frequency distributions of beak length\nof 100 Cactus and 100 Woodpecker finches')
g.legend.set_title('Finch species');

It looks as if the mean beak lengths of these Woodpecker and Cactus finches are similar, at around 10 mm. But it is clear that the beak lengths of Woodpecker finches are far more variable (more spread out around the mean) than those of Cactus finches. Woodpecker finch beak lengths vary from about 8 mm to 12 mm, whereas Cactus finch beak lengths vary only from about 9.5 mm to about 10.5 mm.
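Ranges like these can also be read off numerically rather than by eye. The following is a minimal sketch, assuming a small made-up DataFrame `toy` laid out like the finch file with one column per species. The values and the column names `Cactus` and `Woodpecker` are illustrative assumptions, not the real data.

```python
import pandas as pd

# Hypothetical toy data (NOT the real finch measurements), arranged
# with one column per species. The column names are assumed.
toy = pd.DataFrame({
    'Cactus': [9.6, 9.9, 10.0, 10.1, 10.4],
    'Woodpecker': [8.1, 9.2, 10.0, 10.8, 11.9],
})

# Smallest and largest beak length in each column.
lowest = toy.min()
highest = toy.max()

# The range is the distance between the two extremes.
value_range = highest - lowest
print(value_range)
```

The range depends only on the two most extreme values in each sample, so it is a fairly crude description of spread.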

But we want to be more precise in how we describe this variability.

First let's look at the mean to convince ourselves that the mean beak lengths of the two species are similar. 

<div class="alert alert-info">
    
Run the following code cell to calculate the sample means.
</div>

In [None]:
# Calculate the mean beak length of each species.
print(beak_lengths.mean())

Mean beak lengths are very similar, at about 10 mm as expected, with a difference of just 0.1 mm.
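Calling `.mean()` on a DataFrame returns one mean per column, so the difference between two species can be calculated directly from the result. This is a minimal sketch using made-up values. The column names `Cactus` and `Woodpecker` are assumptions and may not match those in the real file.

```python
import pandas as pd

# Hypothetical toy data (NOT the real finch measurements).
toy = pd.DataFrame({
    'Cactus': [9.6, 9.9, 10.0, 10.1, 10.4],
    'Woodpecker': [8.2, 9.2, 10.0, 10.8, 12.3],
})

# One mean per column, labelled by column name.
means = toy.mean()
print(means)

# Difference between the larger and the smaller mean.
difference = means.max() - means.min()
print(round(difference, 2))
```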

## Standard deviation

The standard deviation is the measure of spread most commonly used in science to describe a sample of data. It measures how far from the mean the measurements typically are. The standard deviation is large if most measurements are far from the mean, and small if most measurements lie close to the mean. It is usually represented by a lowercase $s$.

<div class="alert alert-info">

We use the pandas `.std()` method to find it.
    
Run the code below to calculate the standard deviation of beak lengths of the samples of Woodpecker and Cactus finches.
</div>

In [None]:
# Calculate the standard deviation of beak lengths for both species.
s = beak_lengths.std()

print(s)

The standard deviation of the sample of Cactus finches is 0.25 mm and the standard deviation of the sample of Woodpecker finches is 0.98 mm. 

So Woodpecker finches are more variable (the data are more spread out) than Cactus finches, as the histograms already suggested.

There are a couple of things to note here.
1. The standard deviation is reported to 2 decimal places, just like the mean, because the data were measured to 1 decimal place.
2. The standard deviation has the same units as the mean, that is, millimetres.
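To see how the standard deviation responds to spread, the sketch below uses two made-up samples (hypothetical columns `tight` and `wide`) that share the same mean but are scattered differently around it. It rounds the result with `.round(2)` and repeats the calculation by hand for one column. It relies on the pandas default of `ddof=1`, which divides by $n - 1$ to give the sample standard deviation.

```python
import pandas as pd

# Hypothetical toy samples (NOT real measurements): same mean,
# different spread around it.
toy = pd.DataFrame({
    'tight': [9.8, 9.9, 10.0, 10.1, 10.2],
    'wide': [8.0, 9.0, 10.0, 11.0, 12.0],
})

# Sample standard deviation of each column (pandas uses ddof=1 by default).
s = toy.std()

# Report to 2 decimal places.
print(s.round(2))

# The same calculation by hand for the 'wide' column:
# square the deviations from the mean, sum them, divide by n - 1,
# then take the square root.
deviations = toy['wide'] - toy['wide'].mean()
n = len(toy)
s_by_hand = ((deviations ** 2).sum() / (n - 1)) ** 0.5
print(round(s_by_hand, 2))
```

NumPy's `np.std` uses `ddof=0` by default, so it gives a slightly smaller value than pandas unless `ddof=1` is passed explicitly.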

It's always beneficial to see other explanations of scientific concepts. Before continuing with the exercise notebook, first watch this video about measures of centre and spread.
    
[![Simple Learning Pro video](https://img.youtube.com/vi/mk8tOD0t8M0/2.jpg)](https://www.youtube.com/watch?v=mk8tOD0t8M0)

## Exercise Notebook

[Descriptive statistics: spread](Exercises/3.2%20-%20Descriptive%20statistics%20-%20spread.ipynb)