# Ladybird Analysis: Estimating the population mean size of your two-spot ladybirds

## Task 1: Read in and print your group's data

Using pandas, you are now going to read in the Excel spreadsheet you just created with your group mates.

1. Read in your Excel spreadsheet using the command 
```python
import pandas as pd  # only needed once per notebook session

pd.read_excel('ladybird_sizes_X_Y_Z.xlsx')
```
Replace X with your group name (GNU, YAK, FOX, or APE) and Y and Z with your groups' letters. For example, if your group is APE D and your partnering group is GNU C the filename is `ladybird_sizes_GNU_C_D.xlsx`.

2. Call the DataFrame something sensible, such as `ladybirds`.

3. Print the data to make sure it is okay.

In [None]:
# read in and print your ladybird size dataset

## Task 2: Plot your group's data

Plot your two-spot ladybird sizes in an annotated histogram in the following code cell. See [Coding 3 - Working with data](../Coding%20Practicals%20Notebooks/Coding%203%20-%20Working%20with%20data.ipynb#Visualising-data) for help.

**Only plot your group's data. We'll look at the other group's data next week.**

<div class="alert alert-success">

Note: In Task 1 you imported pandas and read in your spreadsheet. Jupyter Notebooks remember that you did this, so you do NOT need to import pandas or read in your spreadsheet again in this or any of the following code cells.
</div>

In [None]:
# annotated histogram of two-spot ladybird sizes.

## Task 3: Check for outliers

A histogram allows you to easily spot any outliers: data that are extremely far from the average. Perhaps the wrong species was measured. If you think any of the data are outliers, you'll need to go back to your spreadsheet in Teams, update the Excel spreadsheet and re-upload it to Noteable.

## Task 4: Eye-ball estimates of the mean and standard deviation

It is generally a good idea to estimate means and standard deviations by eye before calculating them on a computer. This is so you can check your eye-ball estimates with the actual values output by Python. If they don't match then you know something is wrong: either your estimates or the Python code.

Using your histogram, estimate the mean and standard deviation of ladybird sizes. Remember that a rough estimate of the standard deviation is given by this formula

$$s \approx \frac{\mathrm{max\ value} - \mathrm{min\ value}}{4}$$
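To see how this rule of thumb behaves, here is a minimal sketch that applies it to a small set of made-up sizes and compares the result with the standard deviation calculated by NumPy. The values in `sizes_mm` are hypothetical and are not real measurements; the variable names are only for illustration.

```python
import numpy as np

# Hypothetical ladybird sizes in mm, for illustration only
sizes_mm = np.array([4.1, 4.8, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4, 8.0])

# Rule of thumb: the standard deviation is roughly the range divided by four
s_rough = (sizes_mm.max() - sizes_mm.min()) / 4

# Sample standard deviation (ddof=1 divides by n - 1)
s_exact = sizes_mm.std(ddof=1)

print(f"rough estimate: {s_rough:.2f} mm")
print(f"calculated:     {s_exact:.2f} mm")
```

The two numbers will not agree exactly; the rule is only meant to tell you whether the calculated value is in the right ballpark.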


> Write your estimates here

## Task 5: Calculate the sample size, mean and standard deviation

Now, using Python code, calculate the sample size, mean and standard deviation of your data in the following code cell to the appropriate number of decimal places. (See Notebook [3.3 - Normal distribution](../Self-study%20Notebooks/3.3%20-%20Normal%20distribution.ipynb#Find-the-sample-size,-mean-and-standard-deviation))

How do they compare to your eye-ball estimates?
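As a rough guide, the calculation might look like the sketch below. It uses a small made-up DataFrame in place of your spreadsheet, and the column name `size_mm` is an assumption, so check what the column is called in your own file. One decimal place is used here only as an example; choose the number that matches the precision of your measurements.

```python
import pandas as pd

# Hypothetical data standing in for your spreadsheet; the column name
# 'size_mm' is an assumption and may differ in your own file
ladybirds = pd.DataFrame(
    {"size_mm": [4.1, 4.8, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4, 8.0]}
)

n = ladybirds["size_mm"].count()    # sample size (number of non-missing values)
xbar = ladybirds["size_mm"].mean()  # sample mean
s = ladybirds["size_mm"].std()      # sample standard deviation (divides by n - 1)

print(f"n = {n}, mean = {xbar:.1f} mm, standard deviation = {s:.1f} mm")
```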

In [None]:
# sample size, sample mean and sample standard deviation

## Task 6: Check if your data obey the 68-95-99.7% rule

Now you should check to see if your data are roughly normally distributed.

1. Check if roughly 68% of your data lie within one standard deviation of the mean using Python code.
2. Do you think your data are normally distributed?

<div class="alert alert-info">

To do this task you will need to calculate, using Python code, the range from the mean minus one standard deviation to the mean plus one standard deviation. Then count how many ladybirds had sizes within this range. Is that roughly 68% of your data? 
    
See the example in Notebook [3.3 Normal distribution](../Self-study%20Notebooks/3.3%20-%20Normal%20distribution.ipynb) for how to answer this Task.
</div>
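One possible way to do the count is sketched below, using made-up sizes rather than your own data. The Series `sizes` and its values are hypothetical; with your data you would use the size column of your DataFrame instead.

```python
import pandas as pd

# Hypothetical ladybird sizes in mm, for illustration only
sizes = pd.Series([4.1, 4.8, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4, 8.0])

xbar = sizes.mean()
s = sizes.std()

# Range from one standard deviation below to one above the mean
lower = xbar - s
upper = xbar + s

# between() returns True for each value inside the range; summing counts them
within = sizes.between(lower, upper).sum()
percent = 100 * within / sizes.count()

print(f"{within} of {sizes.count()} sizes ({percent:.0f}%) lie between "
      f"{lower:.1f} mm and {upper:.1f} mm")
```

With a small sample, the percentage can be some way from 68% even when the data come from a normal distribution, so treat the comparison as approximate.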

In [None]:
# check if roughly 68% of your data are within one standard deviation of the mean

## Task 7: Calculate the precision of your estimate of the population mean

Calculate the standard error of the mean and the 95% confidence interval of the mean. (See Notebook [3.5 - Estimating a population mean](../Self-study%20Notebooks/3.5%20-%20Estimating%20a%20population%20mean.ipynb#How-to-calculate-the-standard-error-of-the-mean-(SEM)))
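The sketch below shows one way this calculation could be laid out, again with made-up sizes. It assumes the approximate 95% confidence interval of the mean plus or minus two standard errors; if Notebook 3.5 uses a different multiplier, follow the notebook rather than this sketch.

```python
import numpy as np

# Hypothetical ladybird sizes in mm, for illustration only
sizes_mm = np.array([4.1, 4.8, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4, 8.0])

n = sizes_mm.size
xbar = sizes_mm.mean()
s = sizes_mm.std(ddof=1)  # sample standard deviation

# Standard error of the mean: s divided by the square root of n
sem = s / np.sqrt(n)

# Approximate 95% confidence interval (assumed multiplier of 2)
ci_lower = xbar - 2 * sem
ci_upper = xbar + 2 * sem

print(f"SEM = {sem:.2f} mm")
print(f"95% CI = {ci_lower:.1f} mm to {ci_upper:.1f} mm")
```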

In [None]:
# standard error and 95% confidence interval

## Task 8: Report your estimate of the population mean

Write a short sentence below reporting the estimate and precision of the population mean. (See Notebook [3.5 - Estimating a population mean](../Self-study%20Notebooks/3.5%20-%20Estimating%20a%20population%20mean.ipynb#Reporting-the-estimate-of-the-population-mean-and-its-standard-error))

> Report your estimate and precision

## Task 9: Simulate the sampling distribution of the population mean

In Notebook [3.5 - Estimating a population mean](../Self-study%20Notebooks/3.5%20-%20Estimating%20a%20population%20mean.ipynb#But-how-precise-is-the-estimate-of-the-population-mean?) we did a thought-experiment where we repeatedly selected two finches from a population of finches and measured their beak depths. For each sample of two, we estimated the mean beak depth of the entire population. We plotted the following histogram of all of these estimates of the population mean to create a sampling distribution of the population mean. 

![sampling_distribution_2.png](attachment:sampling_distribution_2.png)

Now you will perform a similar thought experiment, but instead of finch beak depths, you will work with ladybird sizes. You will use Python code to simulate sampling ladybird sizes from your graveyard. Simulation is a powerful technique used in various fields such as science, engineering, and administration. In statistics, it is used for hypothesis testing and provides an intuitive way to understand complex statistical concepts. Don't worry, we'll take you through this step by step with example code and the help of the demonstrators. 

Next week you will use simulation to see how we can test if two population means are the same or different. 

Here's what you will learn today:

1. How to conduct a simulation.
2. How to create a simulated sampling distribution.
3. How to convince yourself that the standard deviation of your simulated sampling distribution, known as the standard error, is equal to $\frac{\sigma}{\sqrt{n}}$.

Work through the following steps.

### Step 1. Create a statistical model of the sampling process

A thought experiment is a simplified representation, or model, of reality. Models can take various forms, like a paper aeroplane representing a real aeroplane or computer simulations of the weather.

The thought experiment you are going to do now involves estimating the population mean ladybird size based on a sample of two ladybirds. To perform this thought experiment you need to know how ladybird sizes are distributed in the population. You can't know this in reality because you can't measure the sizes of every single ladybird in your graveyard. So you have to **model** your population by **pretending** you know how ladybird sizes are distributed. 

This model will be a simplification of the actual distribution of ladybird sizes. But that doesn't matter. The purpose of this exercise is to demonstrate how model simulation is done and what we can learn from it.

For our model we make three reasonable assumptions about the population of ladybird sizes:

1. Ladybird sizes have a normal distribution. This is reasonable since many characteristics of living organisms are normally distributed.
2. The population mean ladybird size, denoted as $\mu$, is 6 mm.
3. The population standard deviation of ladybird size, denoted as $\sigma$, is 1 mm.

With these assumptions in place, you can proceed to simulate the process of taking samples from this simplified model of the ladybird population.

### Step 2. Simulate random sampling from the population

To perform simulation you need to generate random numbers (like rolling a dice many times). In this case you need to generate random numbers from a normal distribution. Each random number you generate will correspond to a single ladybird's size. 

Python doesn't do this by default, so you first need to import a function that generates random samples drawn from a normal distribution with this code:

```python
from numpy.random import normal
```

The following line of code simulates drawing `n` random numbers from a normal distribution of mean `mu` and standard deviation `sigma`. It stores the `n` randomly generated numbers in a NumPy array called `sizes`.

```python
sizes = normal(mu, sigma, n)
```

1. Write some code below that simulates sampling sizes of *n* = 2 ladybirds from a population of ladybirds with mean size $\mu$ = 6 mm and standard deviation $\sigma$ = 1 mm.
2. Print the simulated sample of the two ladybird sizes.
3. Run the code several times to convince yourself that on each run you generate two different random ladybird sizes.

Note that the generated numbers are printed to eight decimal places. In reality we wouldn't be able to measure ladybird sizes to that precision. But for the purposes of this exercise it doesn't matter how precise these numbers are.
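Putting the two lines above together, a minimal sketch of this step might look like the following. The variable names `mu`, `sigma` and `n` follow the text above.

```python
from numpy.random import normal

mu = 6     # population mean size in mm
sigma = 1  # population standard deviation in mm
n = 2      # number of ladybirds in one sample

# Draw n random sizes from a normal distribution
sizes = normal(mu, sigma, n)
print(sizes)
```

No random seed is set, so each run should print a different pair of sizes.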

In [None]:
# simulate sampling n = 2 ladybird sizes from a population distributed normally with mean mu = 6 mm and standard deviation sigma = 1 mm


### Step 3. Simulate many samples

In Step 2 you simulated a single sample of two ladybird sizes. To construct the sampling distribution you'll need to simulate thousands of samples. 

The following line of code simulates `m` samples of size `n` from a normal distribution of mean `mu` and standard deviation `sigma`.

```python
sizes = normal( mu, sigma, (n, m) )
```

1. Write some code below that simulates *m* = 10,000 samples of *n* = 2 ladybirds each from a population with mean size $\mu$ = 6 mm and standard deviation $\sigma$ = 1 mm.
2. Print the simulated samples. You should see two rows of numbers. Only the first and last three numbers of the 10,000 numbers in each row are printed.
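A minimal sketch of this step is shown below. The extra `print(sizes.shape)` line is not required by the task; it is included only to confirm that the array has `n` rows and `m` columns, so that each column is one sample of two ladybirds.

```python
from numpy.random import normal

mu = 6      # population mean size in mm
sigma = 1   # population standard deviation in mm
n = 2       # ladybirds per sample
m = 10_000  # number of samples

# n rows by m columns: each column is one simulated sample
sizes = normal(mu, sigma, (n, m))

print(sizes)
print(sizes.shape)
```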

In [None]:
# simulate m = 10,000 samples of n = 2 ladybird sizes from a population distributed normally with mean mu = 6 mm and standard deviation sigma = 1 mm

### Step 4. Calculate the sample means

Now calculate the 10,000 means of the 10,000 samples with the following line of code, then print them.

```python
xbars = sizes.mean(axis=0)
```

The `axis=0` part makes Python calculate the means of each column of `sizes`. (Alternatively, `axis=1` calculates the means in each row, but that's not what we want.)
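If the difference between the two axes is hard to picture, the small sketch below may help. It uses a tiny made-up array called `demo` (a hypothetical name, not part of the exercise) with two rows and three columns, the same layout as `sizes` but small enough to check by hand.

```python
import numpy as np

# Two rows, three columns: think of each column as one sample of two ladybirds
demo = np.array([[1.0, 2.0, 3.0],
                 [3.0, 4.0, 5.0]])

col_means = demo.mean(axis=0)  # one mean per column (three values)
row_means = demo.mean(axis=1)  # one mean per row (two values)

print(col_means)
print(row_means)
```

With `axis=0` you get one mean per column, which is what is needed here because each column of `sizes` holds one sample.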

In [None]:
# calculate the sample means

### Step 5. Plot the histogram of sample means

Use seaborn to plot a histogram of the sample means. 

<div class="alert alert-info">

You should see a histogram centred on 6 mm with a range from about 4 mm to 8 mm.
</div>

In [None]:
# plot the sampling distribution

### Step 6. Calculate the standard deviation of the distribution of sample means (i.e., the standard error) 

Well done, you've simulated the sampling distribution of the sample mean. 

The standard deviation of your simulated sampling distribution, known as the standard error, is found with the line of code

```python
sem = xbars.std()
```

<div class="alert alert-info">

Your answer should be about 0.71 mm
</div>
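A self-contained sketch of this step is given below. It repeats the simulation from the earlier steps so that it can be run on its own; in your notebook you can reuse the `xbars` you have already calculated.

```python
from numpy.random import normal

# Repeat the simulation: 10,000 samples of 2 ladybirds each
sizes = normal(6, 1, (2, 10_000))
xbars = sizes.mean(axis=0)

# Standard deviation of the sample means, i.e. the standard error
sem = xbars.std()
print(f"simulated standard error = {sem:.2f} mm")
```

Because the samples are random, the value will vary slightly from run to run.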

In [None]:
# calculate the standard error of the sampling distribution (the standard deviation of the distribution of sample means)

### Step 7. Compare the simulated standard error with the formula for the standard error

In Self-study notebook [3.5 - Estimating a population mean](../Self-study%20Notebooks/3.5%20-%20Estimating%20a%20population%20mean.ipynb#Standard-error-of-the-mean) we stated that the theoretical standard error of the sampling distribution equals the standard deviation of the population ($\sigma$) divided by the square root of the sample size ($n$):

$$ \mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$$

Using Python code, substitute the values of $\sigma$ and $n$ into this equation to calculate SEM. Your simulated standard error (which you calculated in Step 6) should be a close match to the value given by the formula.
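The substitution can be sketched as follows, using the model values $\sigma$ = 1 mm and $n$ = 2 from Step 1. The name `sem_formula` is only a suggestion.

```python
from math import sqrt

sigma = 1  # population standard deviation in mm
n = 2      # sample size

# Theoretical standard error: sigma divided by the square root of n
sem_formula = sigma / sqrt(n)
print(f"standard error from the formula = {sem_formula:.2f} mm")
```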

In [None]:
# calculate the standard error using the formula sigma/square root(n)

### Step 8. Repeat with different sample sizes

Try repeating the process from Step 3 onward changing the sample size (e.g., 5 ladybirds, 100 ladybirds, etc.) to see how the width of the sampling distribution changes.
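If you would rather not re-run the cells by hand, a loop is one way to compare several sample sizes at once. The sketch below is only a suggestion: the sample sizes 2, 5 and 100 are taken from the examples above, and the dictionary `results` is a hypothetical helper for storing the numbers.

```python
import numpy as np
from numpy.random import normal

mu = 6      # population mean size in mm
sigma = 1   # population standard deviation in mm
m = 10_000  # number of simulated samples

results = {}
for n in (2, 5, 100):
    # Simulate m samples of size n and take the mean of each sample
    xbars = normal(mu, sigma, (n, m)).mean(axis=0)

    simulated = xbars.std()           # standard error from the simulation
    theoretical = sigma / np.sqrt(n)  # standard error from the formula
    results[n] = (simulated, theoretical)

    print(f"n = {n:3d}: simulated = {simulated:.3f} mm, "
          f"formula = {theoretical:.3f} mm")
```

The standard error should shrink as `n` grows, which is the narrowing of the sampling distribution that this step asks you to look for.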

## Conclusion

What was the purpose of simulating the sampling distribution? 

You intuitively know that larger sample sizes result in more precise estimates of the average of the thing you are measuring. Averaging the sizes of one thousand ladybirds gives a more precise estimate of their sizes than, say, averaging just two ladybirds. 

Simulating the sampling distribution demonstrates how this intuition arises; it gives you a deeper insight into the behaviour and characteristics of the sample mean that you obtain from repeated sampling. It also helps you understand where theoretical formulae, such as $\mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$, come from.

Next week you'll be using simulation to test the hypothesis that differences in the size of two-spot ladybirds at different sites are caused by differences in the level of predation by Harlequin ladybirds.