# Ladybird Analysis: Estimating the population mean size of your two-spot ladybirds

<div class="alert alert-success">

For this solution I am using ladybird data collected from one of last year's groups.
</div>

## Task 1: Read in and print your group's data

Using pandas, you are now going to read in the Excel spreadsheet and give the resulting DataFrame a sensible name.

1. In the Self-study notebooks we read in CSV data files with the pandas command `pd.read_csv(filename)`. To read in Excel spreadsheets we use the command `pd.read_excel(filename)`. Do this now, calling the DataFrame something sensible, such as `ladybirds`.

2. Print the data to make sure it is okay.

In [None]:
# read and print your ladybird size dataset

import pandas as pd

ladybirds = pd.read_excel('ladybird_sizes_demo.xlsx')

ladybirds

## Task 2: Plot your group's data

Plot your two-spot ladybird sizes in an annotated histogram in the following code cell.

In [None]:
# Annotated histogram of two-spot ladybird sizes.

import seaborn as sns

g = sns.displot(ladybirds['low'])

# Add some useful annotation to help others understand what the graph contains
g.ax.set_xlabel('Size (mm)')
g.ax.set_ylabel('Number of ladybirds')
g.ax.set_title('Sizes of two-spot ladybirds collected from a cemetery');


## Task 3: Check for outliers

There are no outliers in these data.

## Task 4: Eye-ball estimates of the mean and standard deviation

Using your histogram, estimate the mean and standard deviation of ladybird sizes. Remember that a rough estimate of the standard deviation is given by this formula

$$s \approx \frac{\mathrm{max\ value} - \mathrm{min\ value}}{4}$$


mean is about 4.0 mm
s is about (6-3)/4 = 0.75 mm
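
The rule of thumb above can also be written as a small function. This is only a sketch: the name `rough_sd` and the example sizes are made up for illustration and are not the real ladybird data.

```python
def rough_sd(values):
    """Rule-of-thumb standard deviation: the range divided by 4."""
    return (max(values) - min(values)) / 4

# Made-up sizes (mm) for illustration only, not the real ladybird data.
example_sizes = [3.0, 3.5, 4.0, 4.5, 6.0]

print(f's is about {rough_sd(example_sizes):.2f} mm')
```

With a smallest value of 3 mm and a largest value of 6 mm this reproduces the hand calculation (6-3)/4 = 0.75 mm.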

## Task 5: Calculate the sample size, mean and standard deviation

Now, using Python code, calculate the sample size, mean and standard deviation of your data in the following code cell to the appropriate number of decimal places.

How do they compare to your eye-ball estimates?

In [None]:
# sample size, sample mean and sample standard deviation

n = ladybirds['low'].count()   # sample size
xbar = ladybirds['low'].mean() # sample mean
s = ladybirds['low'].std()     # sample standard deviation

print(f'sample size = {n} ladybirds')
print(f'mean = {xbar:.2f} mm')
print(f'st. dev. = {s:.2f} mm')

## Task 6: Check if your data obey the 68-95-99.7% rule

Now you should check to see if your data are roughly normally distributed.

1. Check if 68% of your data lie within one standard deviation of the mean using Python code.
    - Unless you have measured over a hundred ladybirds, you won't be able to check the 95% and 99.7% parts of the rule.
2. Do you think your data are normally distributed?

In [None]:
# check if 68% of your data are within one standard deviation of the mean

print(f'expected number of ladybirds within 1 st. dev. of the mean = {0.68*n:.1f}')

# Set tally of ladybird sizes within one standard deviation of the mean to zero.
count = 0

# Loop through ladybird sizes one at a time.
for size in ladybirds['low']:
    
    # If this ladybird's size is within one standard deviation of the mean, increment the tally by 1.
    if xbar - s < size < xbar + s:
        count += 1

print( f'within 1 st. dev. of the mean is between {xbar - s:.2f} mm and {xbar + s:.2f} mm' )
print( f'{count} ladybirds are within 1 st. dev. of the mean' )
print('perhaps not quite normally distributed')
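
The loop above counts sizes one at a time. An alternative, sketched here as an assumption about how you might prefer to write it, is to let pandas compare every size at once and then count the `True` values. The sizes below are made up for illustration; with the real data you would use `ladybirds['low']` instead.

```python
import pandas as pd

# Made-up sizes (mm) for illustration only, not the real ladybird data.
example_sizes = pd.Series([4.0, 4.5, 5.0, 5.5, 6.0])

xbar = example_sizes.mean()
s = example_sizes.std()

# Boolean Series: True where a size lies strictly within one st. dev. of the mean.
within = (example_sizes > xbar - s) & (example_sizes < xbar + s)

count = within.sum()       # True counts as 1, so the sum is the tally
fraction = within.mean()   # proportion of True values

print(f'{count} of {len(example_sizes)} sizes ({fraction:.0%}) are within 1 st. dev. of the mean')
```

Both approaches use strict inequalities, so they should give the same tally on the same data.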

## Task 7: Calculate the precision of your estimate of the population mean

Calculate the standard error of the mean and the 95% confidence interval of the mean.

In [None]:
import math

SEM = s / math.sqrt(n)   # the standard error of the mean for the ladybird sample

print(f'SEM = {SEM:.2f} mm')

lower_limit = xbar - 2 * SEM   # Lower limit of 95% CI
upper_limit = xbar + 2 * SEM   # Upper limit of 95% CI

print(f'lower limit = {lower_limit:.2f} mm')
print(f'upper limit = {upper_limit:.2f} mm')
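
The factor of 2 in the confidence limits is a rounded version of the normal-distribution multiplier for a 95% interval. The sketch below recovers that multiplier with `statistics.NormalDist` from Python's standard library, which this notebook does not otherwise use.

```python
from statistics import NormalDist

# z value that leaves 2.5% in each tail of a standard normal distribution
z = NormalDist().inv_cdf(0.975)

print(f'z = {z:.2f}')
```

Rounding this value up to 2 makes the interval very slightly wider. For small samples a multiplier based on the t-distribution would be somewhat larger still.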


## Task 8: Report your 95% confidence interval

Write a short sentence below reporting your confidence interval.

Mean ladybird size is 4.62 mm (95% CI: 4.43 - 4.80 mm)

Mean ladybird size is 4.62 mm (SEM: 0.09 mm)

## Task 9: Calculate the estimate and precision of the population mean of the other group's data

Repeat Tasks 5, 7 and 8 for the other group's data.

In [None]:
# sample size, sample mean and sample standard deviation

n = ladybirds['high'].count()   # sample size
xbar = ladybirds['high'].mean() # sample mean
s = ladybirds['high'].std()     # sample standard deviation

print(f'sample size = {n} ladybirds')
print(f'mean = {xbar:.2f} mm')
print(f'st. dev. = {s:.2f} mm')

SEM = s / math.sqrt(n)   # the standard error of the mean for the ladybird sample

print(f'SEM = {SEM:.2f} mm')

lower_limit = xbar - 2 * SEM   # Lower limit of 95% CI
upper_limit = xbar + 2 * SEM   # Upper limit of 95% CI

print(f'lower limit = {lower_limit:.2f} mm')
print(f'upper limit = {upper_limit:.2f} mm')


Mean ladybird size is 4.97 mm (95% CI: 4.76 - 5.18 mm)

Mean ladybird size is 4.97 mm (SEM: 0.11 mm)
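
Because Tasks 5, 7 and 8 are repeated for each group's data, one option is to wrap the calculations in a function. This is a sketch only: the name `summarise` and the example sizes are made up for illustration, and the interval uses the same approximate multiplier of 2 as the cells above.

```python
import math
import pandas as pd

def summarise(sizes):
    """Return sample size, mean, st. dev., SEM and approximate 95% CI limits."""
    n = sizes.count()
    xbar = sizes.mean()
    s = sizes.std()
    sem = s / math.sqrt(n)
    return n, xbar, s, sem, xbar - 2 * sem, xbar + 2 * sem

# Made-up sizes (mm) for illustration only, not the real ladybird data.
example_sizes = pd.Series([4.0, 5.0, 6.0])

n, xbar, s, sem, lower, upper = summarise(example_sizes)
print(f'Mean size is {xbar:.2f} mm (95% CI: {lower:.2f} - {upper:.2f} mm)')
```

With the real data you could call `summarise(ladybirds['low'])` and `summarise(ladybirds['high'])` in turn.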


## Task 10: Simulate the sampling distribution of the sample mean

### Step 1. Create a statistical model of the sampling process

Make the following three realistic assumptions about the population of ladybird sizes:

1. Ladybird sizes are normally distributed.
2. The population mean ladybird size is $\mu$ = 6 mm.
3. The population standard deviation of ladybird size is $\sigma$ = 1 mm.

### Step 2. Simulate random sampling from the population

1. Write some code below that simulates sampling $n$ = 2 ladybirds from a population with mean size $\mu$ = 6 mm and standard deviation $\sigma$ = 1 mm.
2. Print the simulated sample of the two ladybird sizes.
3. Run the code several times to convince yourself that on each run you generate two different random ladybird sizes.

In [None]:
from numpy.random import normal

mu = 6
sigma = 1
n = 2

sizes = normal(mu, sigma, n)

print(sizes)
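
Each run of the cell above gives different sizes because the random number generator starts from a different state. If you ever need repeatable output, for example when comparing results with a colleague, one approach is a seeded generator from `numpy.random.default_rng`. This is a sketch, and the seed value 1 is an arbitrary choice.

```python
import numpy as np

mu = 6      # population mean (mm)
sigma = 1   # population standard deviation (mm)
n = 2       # sample size

# The seed value 1 is arbitrary; any fixed integer gives repeatable draws.
rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)

sizes_a = rng_a.normal(mu, sigma, n)
sizes_b = rng_b.normal(mu, sigma, n)

print(sizes_a)
print(sizes_b)   # identical to sizes_a because both generators share a seed
```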

### Step 3. Simulate many samples

1. Write some code below that simulates $m$ = 10,000 samples of $n$ = 2 ladybirds each from a population with mean size $\mu$ = 6 mm and standard deviation $\sigma$ = 1 mm.
2. Print the simulated samples. You should see two rows of numbers. Only the first and last three numbers of the 10,000 numbers in each row are printed.

In [None]:
m = 10000

sizes = normal( mu, sigma, (n, m) )

print(sizes)

### Step 4. Calculate the sample means

Now calculate and print the means of all 10,000 samples with the following code.


In [None]:
xbars = sizes.mean(axis=0)

print(xbars)
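
To see why `axis=0` is the right choice, it can help to try it on an array small enough to check by hand. The values below are made up for illustration: each column is one sample of two ladybirds, mirroring the `(n, m)` layout used in Step 3.

```python
import numpy as np

# Two rows (n = 2 ladybirds per sample) and three columns (m = 3 samples).
# Made-up sizes (mm) chosen so the means are easy to check by hand.
tiny = np.array([[5.0, 6.0, 7.0],
                 [7.0, 8.0, 9.0]])

print(tiny.mean(axis=0))   # one mean per column, i.e. per sample
print(tiny.mean(axis=1))   # one mean per row, which is not what Step 4 needs
```

Averaging down the columns gives three sample means (6, 7 and 8 mm), whereas averaging along the rows gives only two numbers.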

### Step 5. Plot the histogram of sample means

In [None]:
g = sns.displot(xbars)

g.ax.set_xlabel('mean ladybird size (mm)')
g.ax.set_title('Mean ladybird size of samples of size 2');

### Step 6. Calculate the standard deviation of the distribution of sample means (i.e., the standard error) 

In [None]:
sem = xbars.std()

print(sem)

### Step 7. Compare the simulated standard error with the formula for the standard error

The theoretical standard error of the sampling distribution equals the standard deviation of the population ($\sigma$) divided by the square root of the sample size ($n$):

$$ \mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$$

Using Python code, substitute the values of $\sigma$ and $n$ into this equation to calculate SEM. Your simulated standard error (which you calculated in Step 6) should closely match the value given by the formula.

In [None]:
import math

print( sigma / math.sqrt(n) )
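
The same comparison can be repeated for several sample sizes to see how the standard error shrinks as $n$ grows. This is a sketch that goes slightly beyond the task: the sample sizes 2, 10 and 50 and the seed are arbitrary choices, and it uses a seeded generator from `numpy.random.default_rng` rather than the `normal` function imported earlier.

```python
import math
import numpy as np

mu = 6       # population mean (mm)
sigma = 1    # population standard deviation (mm)
m = 10000    # number of simulated samples

rng = np.random.default_rng(0)   # arbitrary seed, for repeatable output

results = {}
for n in (2, 10, 50):
    sizes = rng.normal(mu, sigma, (n, m))   # n rows, m columns
    simulated = sizes.mean(axis=0).std()    # st. dev. of the m sample means
    theoretical = sigma / math.sqrt(n)
    results[n] = (simulated, theoretical)
    print(f'n = {n:2d}: simulated SEM = {simulated:.3f}, formula SEM = {theoretical:.3f}')
```

For each sample size, the simulated and formula values should agree to within a few percent.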