# Descriptive statistics: location

<div class="alert alert-success">

**Before completing this week's Self-study Notebooks you must have completed the [Coding 3 - Working with data Notebook.](../Coding%20Practicals%20Notebooks/Coding%203%20-%20Working%20with%20data.ipynb)**
    
</div>

<div class="alert alert-warning">

**In this notebook you will learn how to describe the location, or centre, of a set of numerical data in pandas using the arithmetic mean.**
    
</div>

In the Coding 3 workshop we looked at visualising a set of data. Visualising data collected from an experiment or from observations is very important and almost always the first step in statistically analysing your data. But visualising data is not enough. We need to say something concrete about it. And that means describing it with numbers. 

These numbers are called **descriptive or summary statistics of a sample**.

This sounds complicated, but it's not. 

When describing numerical data with numbers we usually need two pieces of information: the **location** of the values and the **spread** of values.

The location of the values tells us where the data are centred. The importance of calculating the centre of a set of data seems obvious. How else do we address questions like "Which species is larger?" or "Which drug yielded the greatest response?"

The **average** or **mean** (more specifically, the **arithmetic mean**) is the most important descriptive statistic of the location of a set of numerical data. You know how to calculate a mean: add the values together and divide by the number of values.
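As a minimal sketch of that calculation, the code below works out a mean by hand in plain Python. The five values are made up for illustration and are not taken from any dataset in this course.

```python
# Five made-up measurements, for illustration only.
values = [8.0, 9.0, 10.0, 11.0, 12.0]

# Add the values together and divide by the number of values.
mean_by_hand = sum(values) / len(values)

print(mean_by_hand)
```

Here `sum(values)` adds the values together and `len(values)` counts how many there are.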

Other descriptive statistics of location include the **median** and the **mode**. We won't consider these further in this course.

We mentioned in the Coding 3 Practical that pandas provides lots of functions for reading in, analysing, manipulating and describing data. Pandas also provides functions for calculating means. Let's look at an example of Darwin's finches to see how this is done.

## Darwin's finches

Darwin's finches (also known as the Galápagos finches) are a group of about 18 species well known for their remarkable diversity in beak form and function.
<br>
<br>
<div>
<img src="attachment:darwins_finches.jpg" width='70%' title="Reproduced from Grant, P.R. (1991). Natural Selection and Darwin’s Finches. Scientific American Vol. 265, pp. 82-87"/>
</div>

The file `Datasets/finches beak width.csv` contains beak widths of a sample of 100 Cactus finches and 100 Vegetarian finches (shown in the above image).

<div class="alert alert-info">

Run the code cell below to read in and print the DataFrame. 
</div>

In [None]:
import pandas as pd

# Read in the Darwin finches beak width data.
beak_widths = pd.read_csv('Datasets/finches beak width.csv')

# Print it to look at the data.
print(beak_widths)

There are three columns. The first is an index and can be ignored. The second contains all the Cactus finches' beak widths and the third column contains all the Vegetarian finches' beak widths. 

As always, we plot the data to see what they look like.

<div class="alert alert-info">

Run the code cell below to plot histograms of beak widths of both species. 
</div>

In [None]:
import seaborn as sns

# Plot histograms of the data. Seaborn knows to plot a separate histogram for each species.
g = sns.displot(beak_widths)

# Add some annotation.
g.ax.set_xlabel('Beak width (mm)')
g.ax.set_ylabel('Number of finches')
g.ax.set_title('Frequency distributions of beak widths\nof 100 Cactus and 100 Vegetarian finches')
g.legend.set_title('Finch species');

If you get a `UserWarning`, just ignore it: it's a bug in seaborn's code.

There are two distributions plotted in this figure, blue for Cactus finches and orange for Vegetarian finches. Plotting distributions like this allows us to easily compare them by eye.

It's clear that Vegetarian finches have wider beaks than Cactus finches even though there is some variation within each species. 

What is the difference in mean beak widths between the two species? Eye-balling the distributions suggests that Vegetarian finches have a mean beak width of about 12 mm and Cactus finches of about 8 mm, so that's a difference of about 4 mm.

Eye-balling data is a useful technique to get a quick, but approximate, measure of where they are located. After that we should calculate the exact mean. We can then compare the exact mean with our eye-balled value to make sure we haven't made a mistake.

---

Notice in the above figure that there is a legend which tells us which histogram belongs to which species: blue for Cactus finches and orange for Vegetarian finches. Legends are essential when two or more sets of data are plotted in the same graph. In addition, we've added a title to the legend with the command

```python
g.legend.set_title('Finch species');
```

## Calculating means with pandas

Now let's calculate the exact mean beak widths using pandas. It's very simple. The mean of a sample of data is usually written as $\bar{x}$ or $\bar{y}$ (pronounced "x-bar" and "y-bar").

<div class="alert alert-info">

1. Look at the code below to try and understand what it is doing. 
2. Run the code.
</div>

In [None]:
# Calculate the mean beak widths of Cactus and Vegetarian finches and store the means in xbar.
xbar = beak_widths.mean()

print( xbar )

The code
```python
xbar = beak_widths.mean()
```
calculates the mean for each column (i.e., for each species) in the `beak_widths` DataFrame automatically. The results are stored in a variable called `xbar` which we've printed out. (You can ignore the `dtype: float64` bit.) 
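To see the column-by-column behaviour in isolation, here is a small sketch using a made-up DataFrame. The name `toy` and its six values are hypothetical and are not the real finch data; only the column names are borrowed from the finch dataset.

```python
import pandas as pd

# A tiny made-up DataFrame (not the real finch data) with two columns.
toy = pd.DataFrame({'Cactus': [8.0, 8.5, 9.0],
                    'Vegetarian': [12.0, 12.5, 13.0]})

# .mean() returns one mean per column, labelled with the column name.
toy_means = toy.mean()
print(toy_means)

# A single mean can be picked out using its column name.
print(toy_means['Cactus'])
```

Because each mean is labelled with its column name, an individual value can be looked up by name rather than by position.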

Storing the means in the variable `xbar` is handy because we can easily calculate the difference in mean beak widths between the two species.

<div class="alert alert-info">

Look at the code below to try and understand what it is doing. Then run it.
</div>

In [None]:
# The difference in mean beak widths of Vegetarian and Cactus finches

d = xbar['Vegetarian'] - xbar['Cactus']

print( d )

The difference is 4.130000000000003 mm, which is very close to the 4 mm we estimated by eye-balling the two histograms.

## Format your output with an f-string

All those digits after the decimal point are meaningless because beak width was only measured to one-tenth of a millimetre. We should round the output with an f-string to make it more readable.

The general rule of thumb is to report a mean to one more decimal place than the data. As the data were measured to 1 dp, we should report the mean to 2 dp.

<div class="alert alert-info">
    
Here's the f-string for doing that.

Look at the code below to try and understand what it is doing. Then run it.
</div>

In [None]:
print( f'Difference in mean beak widths = {d:.2f} mm')
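The part after the colon in `{d:.2f}` is a format specification. The sketch below shows how changing the number changes how many decimal places are printed. The variable `d_example` is a hypothetical stand-in that simply copies the difference reported earlier in this notebook.

```python
# A stand-in for d, copied from the difference printed earlier in this notebook.
d_example = 4.130000000000003

# ".2f" means two digits after the decimal point, in fixed-point notation.
two_dp = f'{d_example:.2f}'

# ".1f" means one digit after the decimal point.
one_dp = f'{d_example:.1f}'

print(two_dp)
print(one_dp)
```

Formatting only changes how the number is displayed; the value stored in the variable is left untouched.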

## Exercise Notebook

[Descriptive statistics: location](Exercises/3.1%20-%20Descriptive%20statistics%20-%20location.ipynb)