# Ladybird Analysis

# Compare the mean sizes of low and high predated two-spot ladybird populations

<div class="alert alert-success">

# Part 1: Exploring your data
</div>

## Task 1.1: If you don't have your group's ladybird Excel spreadsheet on Noteable, upload it now

Follow these instructions to do this:
1. Go to Learn and click **Open Microsoft Teams classes** in the left-hand panel.
2. Log in to the **Biology 1A Variation (2023-2024) Team**.
3. Click on **Files** and locate your group's spreadsheet. For example, if your group is YAK E and your partnering group is YAK F then your spreadsheet is called `ladybird_sizes_YAK_E_F.xlsx`.
4. Hover your cursor over your spreadsheet name, click on the three dots and select Download. 
5. Return to the **Variation1/Ladybird Analysis/** browser tab running **Noteable**.
6. Click on **Upload** on the right, find your spreadsheet on your laptop, then click on the **blue Upload** button.
7. Make sure your spreadsheet is saved in the **Variation1/Ladybird Analysis** folder. That is, the same folder this Notebook is in.

## Task 1.2: Read in and print the low and high predation samples to check the data are okay

Using pandas, read in your Excel spreadsheet and store it in a DataFrame.

1. To read in Excel spreadsheets we use the command `pd.read_excel('filename.xlsx')`. Do this now, calling the DataFrame something sensible, such as `ladybirds`.

2. Print the data to make sure it is okay. You should see two columns headed `low` and `high`. You will probably see `NaN` repeated at the bottom of one of the columns. This isn't a problem; it's just because different numbers of ladybirds were measured in the two cemeteries.
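
The `NaN` padding can be reproduced without a spreadsheet. The sketch below builds a DataFrame from two columns of unequal length; the sizes are made-up values, not real measurements, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical ladybird sizes in mm (made-up values, not real measurements)
low = pd.Series([4.6, 5.1, 4.9, 5.3, 4.8])
high = pd.Series([5.0, 5.4, 5.2])

# Columns of unequal length are aligned by row index,
# so the shorter column is padded with NaN
ladybirds = pd.DataFrame({"low": low, "high": high})
print(ladybirds)

# count() ignores NaN, so it reports how many ladybirds were actually measured
print(ladybirds.count())
```

A spreadsheet read with `pd.read_excel` should show the same pattern whenever one column has fewer measurements than the other.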

In [None]:
# read and print your ladybird size dataset

## Task 1.3: Plot the samples in a histogram to see how they are distributed

Plot the distributions of the low and high predation samples as histograms in a single annotated graph. 

See [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#First-plot-the-data) for help.

In [None]:
# annotated histograms of samples of two-spot ladybird sizes from low and high predation cemeteries

## Task 1.4: The distributions might be clearer in a boxplot

Your `low` and `high` histograms will probably overlap quite a lot. This makes it hard to see if the means of the two samples are different.

If that is the case, a boxplot is probably a better way to visualise your data, as it hides individual data points and instead uses a five-number summary to describe the distribution of each sample.
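
As a rough illustration of a five-number summary, the sketch below calculates one with NumPy for a made-up sample. The sizes are hypothetical, and plotting libraries may draw whiskers and outliers using slightly different rules from the plain minimum and maximum shown here.

```python
import numpy as np

# Hypothetical ladybird sizes in mm (made-up values)
sizes = np.array([4.2, 4.6, 4.8, 4.9, 5.0, 5.1, 5.3, 5.6, 6.0])

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(sizes, [0, 25, 50, 75, 100])

print(f"min = {minimum}, Q1 = {q1}, median = {median}, Q3 = {q3}, max = {maximum}")
```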

Plot the distributions of the low and high predation samples in an annotated boxplot. 

See [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#First-plot-the-data) for help.

In [None]:
# a boxplot to visually compare ladybird sizes from low and high predation cemeteries 

## Task 1.5: What do the box and the various lines in a boxplot represent?

If you don't know, try searching online for the answer. Write your answer in the following markdown cell.

> Write your answer here. 

## Task 1.6: Eye-ball estimates of the means and standard deviations

It is generally a good idea to estimate means and standard deviations by eye before calculating them on a computer. This lets you check your eye-ball estimates against the values output by Python. If they don't match then you know something is wrong: either your estimates or your code.

Using your histograms or boxplots, estimate the means and standard deviations of ladybird sizes from both cemeteries. Remember that a rough estimate of the standard deviation is given by this formula:

$$s \approx \frac{\mathrm{max\ value} - \mathrm{min\ value}}{4}$$
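
To see how rough this rule is, the sketch below compares it with the calculated sample standard deviation for a made-up sample. The sizes are hypothetical; the point is only that the two numbers should be of similar magnitude, not identical.

```python
import numpy as np

# Hypothetical ladybird sizes in mm (made-up values)
sizes = np.array([4.2, 4.6, 4.8, 4.9, 5.0, 5.1, 5.3, 5.6, 6.0])

# Rough estimate: range divided by 4
rough_sd = (sizes.max() - sizes.min()) / 4

# Sample standard deviation (ddof=1 divides by n - 1)
calculated_sd = sizes.std(ddof=1)

print(f"rough estimate: {rough_sd:.2f} mm, calculated: {calculated_sd:.2f} mm")
```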


> Write your estimates here

## Task 1.7: Calculate the sample sizes, means and standard deviations

Now, using Python code, calculate the sample sizes, means and standard deviations of the two samples and print to the appropriate number of decimal places.

See Notebook [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#Sample-means-and-standard-deviations) for example code.

How do they compare to your eye-ball estimates?
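
One possible approach, sketched here on made-up data, uses the pandas methods `count()`, `mean()` and `std()`, all of which skip `NaN` values. The DataFrame contents and the choice of two decimal places are assumptions for illustration; the code in Notebook 4.2 may differ.

```python
import pandas as pd

# Hypothetical data in mm; the shorter column is padded with NaN
ladybirds = pd.DataFrame({
    "low": [4.6, 5.1, 4.9, 5.3, 4.8],
    "high": [5.0, 5.4, 5.2, float("nan"), float("nan")],
})

n = ladybirds.count()    # number of non-missing measurements per column
means = ladybirds.mean() # sample means, ignoring NaN
sds = ladybirds.std()    # sample standard deviations (pandas uses ddof=1 by default)

for column in ladybirds.columns:
    print(f"{column}: n = {n[column]}, mean = {means[column]:.2f} mm, "
          f"sd = {sds[column]:.2f} mm")
```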

In [None]:
# sample sizes, sample means and sample standard deviations of both samples

## Task 1.8: Calculate the *d*-statistic: the absolute difference in sample means

Using the sample means you just calculated, calculate, using Python code, the absolute difference in sample means. We will call this the *d*-statistic. 

See Notebook [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#The-test-statistic) for the code to do this.
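
As a minimal sketch, the absolute difference can be taken with Python's built-in `abs()`. The two means below are hypothetical placeholders; use the sample means you calculated in Task 1.7.

```python
# Hypothetical sample means in mm (replace with your own values)
mean_low = 4.94
mean_high = 5.20

# d-statistic: absolute difference in sample means, so it is never negative
d_observed = abs(mean_low - mean_high)

print(f"observed d-statistic = {d_observed:.2f} mm")
```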

In [None]:
# calculate your observed d-statistic: difference in sample means

<div class="alert alert-success">


# Part 2: How likely is the observed difference in your sample means (the *d*-statistic) if the null hypothesis were true? The *p*-value.
</div>

Having looked at your data and calculated the difference in the sample means, you next need to work out how likely that difference is assuming the null hypothesis were true.

If that difference is **likely** under the null hypothesis then you have insufficient evidence to reject the null hypothesis.

On the other hand, if that difference is **unlikely** under the null hypothesis then you have sufficient evidence to reject the null hypothesis. 

How likely the observed difference in sample means is under the null hypothesis is called a *p*-value. 

This is what you are going to calculate now.

## Task 2.1: Construct a statistical model of the null hypothesis

To calculate a *p*-value you first need to construct a **statistical model of the null hypothesis**. What this actually means is you will assume that ladybird sizes in the low and high predation populations have identical distributions. (See Notebook [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#Create-a-statistical-model-of-the-sampling-process-assuming-the-null-hypothesis-were-true) for an explanation of a statistical model.)

In your model, first let's assume that ladybird sizes are normally distributed. This is a reasonable assumption, as many biological measurements, such as body size, are approximately normally distributed.

Second, **you have to decide** what the mean ($\mu$) and standard deviation ($\sigma$) of this normal distribution will be. You should pick values that are close to the means and standard deviations of your samples. For example, you might set $\mu$ equal to the average of your two sample means and $\sigma$ to the average of your two sample standard deviations. The actual values you pick will not matter too much so don't spend too long on choosing values.

---

In the markdown cell below, state the mean ($\mu$) and standard deviation ($\sigma$) of the normal distribution of your statistical model.

> Write your mean (mu) and standard deviation (sigma) of your statistical model here.

## Task 2.2: Simulate a pair of samples from the low and high predation populations under the null hypothesis and calculate the *d*-statistic

You are now going to simulate the statistical model of the null hypothesis. To start with you will do this for a single pair of samples. This is to make sure your code is working correctly before doing the full simulation. One sample is simulated from the low predation population and the other sample from the high predation population. Even though we are assuming both populations have the same distribution, the simulated samples will be different due to the randomness of sampling. You will calculate the means of these two simulated samples and then calculate the difference in these means, i.e., you will calculate a simulated *d*-statistic. 

---

Follow these steps to perform a single simulation of the statistical model of the null hypothesis: 
1. Write code to simulate randomly drawing a pair of samples, one each from the low and high predation populations. Remember, the sample sizes (i.e., the number of ladybirds measured) of these simulated samples must match those of your actual samples from the cemeteries. 
2. Print out the simulated ladybird sizes from both simulated samples.
3. Calculate and print the mean ladybird sizes of both simulated samples.
4. Calculate and print the *d*-statistic of this pair of simulated samples.

You might find it helpful to copy, paste and adapt the code from Notebook [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#simpair).
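
The four steps above could be sketched as follows using NumPy's random number generator. The values of `mu`, `sigma`, `n_low` and `n_high` are hypothetical placeholders, and this is not necessarily how Notebook 4.2 does it.

```python
import numpy as np

# No seed is set, so each run produces different random samples
rng = np.random.default_rng()

# Hypothetical model parameters (mm) and sample sizes; replace with your own
mu, sigma = 5.0, 0.5
n_low, n_high = 30, 25

# Under the null hypothesis both samples come from the same normal distribution
sim_low = rng.normal(mu, sigma, n_low)
sim_high = rng.normal(mu, sigma, n_high)
print(sim_low)
print(sim_high)

# Means of the two simulated samples
sim_mean_low = sim_low.mean()
sim_mean_high = sim_high.mean()
print(f"simulated means: low = {sim_mean_low:.2f}, high = {sim_mean_high:.2f}")

# Simulated d-statistic: absolute difference in the simulated means
d_sim = abs(sim_mean_low - sim_mean_high)
print(f"simulated d-statistic = {d_sim:.2f}")
```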

<div class="alert alert-info">

Run the code several times to convince yourself that each time you run it you get different random samples with different sample means, resulting in different *d*-statistics.
</div>

In [None]:
# using your statistical model, simulate a pair of random samples from the low and high predation populations assuming the null hypothesis were true and calculate the d-statistic

## Task 2.3: Construct the sampling distribution of the *d*-statistic

Now construct and plot the sampling distribution of the *d*-statistic under the null hypothesis. Do this by simulating thousands of pairs of samples and calculating the *d*-statistic for each.

To construct and plot the sampling distribution, it may be helpful to copy, paste and adapt the code from [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#samplingdist).
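
One way to simulate many pairs at once, sketched under the same hypothetical parameters, is to draw a two-dimensional array with one row per simulated sample. The number of simulations (`n_sims`) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical model parameters (mm) and sample sizes; replace with your own
mu, sigma = 5.0, 0.5
n_low, n_high = 30, 25
n_sims = 10_000  # number of simulated pairs of samples

# Each row is one simulated sample drawn under the null hypothesis
sim_low = rng.normal(mu, sigma, size=(n_sims, n_low))
sim_high = rng.normal(mu, sigma, size=(n_sims, n_high))

# One d-statistic per simulated pair: absolute difference of the row means
d_sims = np.abs(sim_low.mean(axis=1) - sim_high.mean(axis=1))

print(d_sims[:5])
```

A histogram of `d_sims`, for example with matplotlib's `plt.hist(d_sims, bins=50)`, then shows the simulated sampling distribution.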

In [None]:
# construct and plot the sampling distribution of the d-statistic

## Task 2.4: Is your observed *d*-statistic likely or not under the null hypothesis?

Think about what the sampling distribution of the *d*-statistic means. What it tells you is this. If the null hypothesis were true (i.e., average ladybird sizes are the same in the low and high predation cemeteries) and you kept taking samples from both populations and calculating the difference in their means (the *d*-statistic), the histogram of all those differences would be the sampling distribution you just plotted.

If the null hypothesis were actually true, your observed *d*-statistic (the one you calculated from your data) would lie somewhere in this distribution.

If your observed *d*-statistic lies far into the tail of the sampling distribution, then your *d*-statistic is unlikely if the null hypothesis were true (because that is what we are assuming when we construct the sampling distribution). This suggests that the null hypothesis is not true and that you have sufficient evidence to reject it.

On the other hand, if your *d*-statistic lies roughly in the middle of the sampling distribution, then your *d*-statistic is quite likely if the null hypothesis were true. In that case you have insufficient evidence to reject the null hypothesis.

By eye-balling your constructed sampling distribution, take a guess at whether your observed *d*-statistic is likely or unlikely under the null hypothesis.

Write your answer below with a justification.

> Write your answer here.

## Task 2.5: Calculate the *p*-value of your *d*-statistic

You want to put a number (a probability) to how likely your *d*-statistic is if the null hypothesis were true. This is called a *p*-value.

Your *p*-value is the probability of obtaining a *d*-statistic at least as large as the one you observed if the null hypothesis were true. That's a bit of a mouthful. Put another way, it is the area of the upper tail of the sampling distribution to the right of your observed *d*-statistic.

---

Now calculate the *p*-value of your *d*-statistic. 

To calculate the *p*-value, it may be helpful to copy and paste the code from [4.2 - Comparing two population means](../Self-study%20Notebooks/4.2%20-%20Comparing%20two%20population%20means.ipynb#pvalue).
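
With a simulated sampling distribution, the upper-tail area can be approximated by the proportion of simulated *d*-statistics that are at least as large as the observed one. The sketch below assumes hypothetical values for the model parameters, sample sizes and observed *d*-statistic.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical values; replace with your own model, sample sizes and d-statistic
mu, sigma = 5.0, 0.5
n_low, n_high = 30, 25
d_observed = 0.26
n_sims = 10_000

# Simulated sampling distribution of the d-statistic under the null hypothesis
sim_low = rng.normal(mu, sigma, size=(n_sims, n_low))
sim_high = rng.normal(mu, sigma, size=(n_sims, n_high))
d_sims = np.abs(sim_low.mean(axis=1) - sim_high.mean(axis=1))

# p-value: fraction of simulated d-statistics at least as large as the observed one
p_value = np.mean(d_sims >= d_observed)

print(f"p-value = {p_value:.3f}")
```

Because the simulation is random, the estimated *p*-value will vary slightly from run to run.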

In [None]:
# calculate and print the p-value of your observed d-statistic under the null hypothesis

<div class="alert alert-success">

# Part 3: Two-sample *t*-test in practice
</div>

## Task 3.1: Perform a two-sample *t*-test

The *d*-statistic is the simplest, and most intuitive, measure of the difference between mean ladybird sizes in the low and high predation cemeteries. This is why we have taken you through the process of simulating its sampling distribution to calculate a *p*-value. However, from a practical point of view the *d*-statistic is not that useful. Instead we use the *t*-statistic.

The great thing about using the *t*-statistic is that we do not need to do any simulations to construct its sampling distribution to calculate a *p*-value. As we saw in the Self-study Notebooks, the sampling distribution of the *t*-statistic is already known; it has a mathematical formula that we can directly plug our data into to get a *p*-value. This means we can use statistical software to perform the statistical test for us and not have to go through the laborious process of coding it ourselves. 

---

Now perform a two-sample *t*-test on your data using Python code. To do this, copy, paste and adapt the code from [4.3 - Two sample *t*-test in practice](../Self-study%20Notebooks/4.3%20-%20Two%20sample%20t-test%20in%20practice.ipynb).
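
As one possible sketch, SciPy's `scipy.stats.ttest_ind` performs a two-sample *t*-test. The data below are made up, and whether Notebook 4.3 uses this function or these options is an assumption. Missing values are dropped first so that the `NaN` padding does not affect the result.

```python
import pandas as pd
from scipy import stats

# Hypothetical data in mm; the shorter column is padded with NaN
ladybirds = pd.DataFrame({
    "low": [4.6, 5.1, 4.9, 5.3, 4.8],
    "high": [5.0, 5.4, 5.2, float("nan"), float("nan")],
})

# Remove the NaN padding before testing
low = ladybirds["low"].dropna()
high = ladybirds["high"].dropna()

# By default ttest_ind assumes equal population variances
t_statistic, p_value = stats.ttest_ind(low, high)

print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")
```

Passing `equal_var=False` to `ttest_ind` gives Welch's *t*-test instead, which does not assume equal variances.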

In [None]:
# perform a two-sample t-test on your data

## Task 3.2: Reject or not reject your null hypothesis

At this point we could leave it there: we've stated our hypotheses, collected and analysed the data and calculated how likely our data are under the null hypothesis (i.e., the *p*-value). We could then leave it to other scientists to judge if our data support our biological hypothesis.

But scientists, like everyone else, like clear-cut answers: Do your data support your hypothesis or not?

Unfortunately there are rarely such clear-cut answers. But scientists have created an illusion of such. 

To create this illusion, we set a threshold value on our *p*-values. This threshold is a convention (i.e., has no scientific basis) and in the Biological and Medical sciences this threshold is 0.05. It even has a fancy name, **the significance level** (0.05 corresponds to a 95% confidence level), and is given the fancy Greek letter $\alpha$. This is how the illusion of a clear-cut answer works:

If our *p*-value is below 0.05 then we **reject the null hypothesis**. And we say "There is a **statistically significant** difference between mean ladybird sizes in low and high predation cemeteries."

If our *p*-value is above 0.05 then we **fail to reject the null hypothesis**. And we say "There is **no statistically significant difference** between mean ladybird sizes in low and high predation cemeteries."

What happens if your *p*-value is, say, 0.051? You've just missed the threshold. All that hard work collecting data and you end up with a boring, non-significant result that is unpublishable. This, of course, leads to scientists trying to find ways to make their *p*-values go below 0.05. This is quite easy to do: remove some data points, use other statistical tests, or even make data up (this does happen, although rarely). Much has been written lately in the scientific literature about why *p*-values promote poor scientific practices.

However, all the scientific literature you will read in your studies, and in your later careers, will contain statistical analyses with *p*-values. This is why you need to understand how they are calculated and what they mean.

---

Based on your *p*-value and a significance level of $\alpha = 0.05$, do you reject or fail to reject your null hypothesis that mean ladybird sizes are the same in cemeteries with low and high predation rates? Write your answer below.

Also see [4.3 - Two sample *t*-test in practice](../Self-study%20Notebooks/4.3%20-%20Two%20sample%20t-test%20in%20practice.ipynb#To-reject-or-not-reject-the-null-hypothesis) for more discussion about rejecting or not rejecting a null hypothesis.

> Do you reject or not reject the null hypothesis? Explain why.

## Task 3.3: Report the result of your test

There are three possible outcomes of your analysis.

1. You fail to reject the null hypothesis. This means you have insufficient evidence that mean ladybird sizes differ between Edinburgh cemeteries.

2. You reject the null hypothesis but mean ladybird sizes are smaller in the high predation cemetery than in the low predation cemetery. This means you have evidence that mean ladybird sizes differ between Edinburgh cemeteries, but the difference is in the opposite direction to that expected if Harlequin ladybirds prefer to eat small two-spot ladybirds. Something else must be causing this difference.

3. You reject the null hypothesis and mean ladybird sizes are larger in the high predation cemetery than in the low predation cemetery. This means you have evidence that mean ladybird sizes differ between Edinburgh cemeteries, and the difference is consistent with Harlequin ladybirds preferring to eat small two-spot ladybirds.

Report the outcome of your test in words, as you might write in a report.

See [4.3 - Two sample *t*-test in practice](../Self-study%20Notebooks/4.3%20-%20Two%20sample%20t-test%20in%20practice.ipynb#Reporting-the-result-of-the-test) for an example. 

> Report the outcome of your test.