# Tails of a distribution

<div class="alert alert-success">

**Before completing this week's Self-study Notebooks you must complete the [Ladybird Analysis 1 - Estimating a population mean Notebook.](../Ladybird%20Analysis%20Notebooks/Ladybird%20Analysis%201%20-%20Estimating%20a%20population%20mean.ipynb)**
    
</div>

<div class="alert alert-warning">

**In this notebook you will learn what the tails of a distribution are and how to calculate their area and their probability.**
    
</div>

This week you will learn how to perform your own statistical tests. These tests have the forbidding name "null hypothesis significance tests". But don't worry, performing and interpreting statistical tests is simple.

But first we need to introduce the concept of the tails of a distribution and their area.

---

The **tails** of a distribution are the parts of it that are farthest from the mean. The name comes from the way the data "tail off" as we move away from the centre. The tails are the blue areas shown in the student height histogram below. There is a lower tail to the left and an upper tail to the right.

![tails-2.png](attachment:tails-2.png)

The tails of a distribution aren't precisely defined. In other words, there is no specific place where you stop being in the middle of the distribution and start being in a tail.

## The tails of the Normal distribution

Last week we found that, for a sample of 125 US female college students, the mean height was 164.2 cm and the standard deviation was 6.6 cm.

We saw that roughly 68% of the students have heights within one standard deviation of the mean. That is, roughly 85 (125 x 0.68) students have heights between 157.6 cm and 170.9 cm.

![normal_rule68-2.png](attachment:normal_rule68-2.png)

According to the 68-95-99.7% rule, if data are normally distributed, 68% of the data lie within one standard deviation of the mean (the orange region). This means the remaining 32% of the data lie further than one standard deviation from the mean (the data in the two blue tails). As the Normal distribution is symmetrical, this means that 16% of the data lie below $\bar{x}-s$ and 16% lie above $\bar{x}+s$.
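The 16% figure is a rounded value. As a quick check, here is a minimal sketch that computes the tail areas of a standard Normal distribution using `NormalDist` from Python's built-in `statistics` module. The variable names (`z`, `lower_tail`, `upper_tail`, `middle`) are chosen for illustration and do not appear elsewhere in this notebook.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

# Area below one standard deviation under the mean
lower_tail = z.cdf(-1)

# Area above one standard deviation over the mean
upper_tail = 1 - z.cdf(1)

# Area within one standard deviation of the mean
middle = z.cdf(1) - z.cdf(-1)

print(f'Lower tail area = {lower_tail:.3f}')
print(f'Upper tail area = {upper_tail:.3f}')
print(f'Middle area     = {middle:.3f}')
```

Each tail comes out at roughly 0.159 and the middle at roughly 0.683, which the rule rounds to 16% and 68%.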

16% of 125 students is 20 students. So, in theory, 20 students will be shorter than 157.6 cm and 20 students will be taller than 170.9 cm. Let's test that theory.

<div class="alert alert-info">

Look at the code below to try and understand what it does. Then run it.
</div>

In [None]:
import pandas as pd

# Read in the students' heights
students = pd.read_csv('Datasets/college students.csv')

n = students['height'].count()   # The sample size
xbar = students['height'].mean() # The sample mean
s = students['height'].std()     # The sample standard deviation


# Tally of students shorter than the mean minus one standard deviation, starting at zero.
lower_tail_count = 0

# Tally of students taller than the mean plus one standard deviation, starting at zero.
upper_tail_count = 0


# Loop through the student heights one at a time.
for height in students['height']:

    # If a student is shorter than the mean minus one standard deviation, increment the lower tail tally by 1.
    if height < xbar - s:
        lower_tail_count += 1

    # If a student is taller than the mean plus one standard deviation, increment the upper tail tally by 1.
    if height > xbar + s:
        upper_tail_count += 1
        
print( f'{lower_tail_count} students are shorter than {xbar - s:.1f} cm' )
print( f'{upper_tail_count} students are taller than {xbar + s:.1f} cm' )

So 21 students are shorter than 157.6 cm and 17 are taller than 170.9 cm. That's not quite 20 in each of the two tails, but given that this is real data from a small sample, it's not bad.

Let's examine the code within the loop. We first test whether a student is shorter than the mean minus one standard deviation ($\bar{x}-s$) with the code
```python
    if height < xbar - s:
        lower_tail_count += 1
```
If they are, we increment the lower tail tally by 1. Either way, we then test whether the student is taller than the mean plus one standard deviation ($\bar{x}+s$) with the code
```python
    if height > xbar + s:
        upper_tail_count += 1
```
If they are, we increment the upper tail tally by 1. If neither condition is true, both tallies are left unchanged and the loop moves on to the next student.
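The same tallies can also be computed without an explicit loop. The sketch below is one possible alternative, not the method used in this notebook. It compares a whole pandas Series with a cut-off at once, which gives a Series of `True`/`False` values, and then sums them (each `True` counts as 1). The heights here are made up for illustration and are not the course dataset.

```python
import pandas as pd

# Hypothetical heights in cm, made up for illustration (not the course dataset)
heights = pd.Series([150.0, 158.0, 160.0, 163.0, 165.0, 166.0, 170.0, 180.0])

xbar = heights.mean()  # The sample mean
s = heights.std()      # The sample standard deviation

# Each comparison gives one True/False per student; summing counts the Trues.
lower_tail_count = (heights < xbar - s).sum()
upper_tail_count = (heights > xbar + s).sum()

print(f'{lower_tail_count} students are shorter than {xbar - s:.1f} cm')
print(f'{upper_tail_count} students are taller than {xbar + s:.1f} cm')
```

The loop version makes each step visible, which is why it is used above. The comparison version is shorter and is a common pandas idiom once the logic is familiar.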

## Tail area probability 
 

If I chose one of the 125 students at random (e.g., picked their name out of a hat), what is the probability that they are taller than 170.9 cm?

That's just the relative frequency, or proportion, of students taller than 170.9 cm. The number of students taller than 170.9 cm is stored in the variable `upper_tail_count`, and the total number of students is the sample size of 125, stored in the variable `n`.

This proportion is called the upper tail area probability.

<div class="alert alert-info">

Look at the code below to try and understand what it does. Then run it.
</div>

In [None]:
# Calculate upper tail area probability
p_upper = upper_tail_count / n

print(f'Probability a student is in the upper tail = {p_upper:.2f}')

So the probability of a randomly chosen student being taller than 170.9 cm is 0.14 or 14%, about a 1 in 7 chance.

What is the probability that a randomly chosen student is not within one standard deviation of the mean? From the 68-95-99.7% rule we expect that probability to be about 32%. Let's check that by adding the actual lower and upper tail area probabilities.

<div class="alert alert-info">

Look at the code below to try and understand what it does. Then run it.
</div>

In [None]:
# Calculate upper tail area probability
p_upper = upper_tail_count / n

# Calculate lower tail area probability
p_lower = lower_tail_count / n

print(f'p_lower = {p_lower:.3f}')
print(f'p_upper = {p_upper:.3f}')
print()

print(f'Probability a student is not within one st. dev. of the mean = {p_lower + p_upper:.3f}')

So there is a 30.4% chance of randomly picking a student who is **not** within one standard deviation of the mean. That's quite close to the theoretical probability of 32% we expect from the 68-95-99.7% rule.
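To put the observed and theoretical values side by side, here is a minimal sketch that uses the tail counts reported above (21 and 17 out of 125) and the unrounded Normal tail area from the built-in `statistics` module. The names `observed` and `theoretical` are chosen for illustration.

```python
from statistics import NormalDist

n = 125                # The sample size
lower_tail_count = 21  # Students shorter than the mean minus one standard deviation
upper_tail_count = 17  # Students taller than the mean plus one standard deviation

# Observed proportion of students in either tail
observed = (lower_tail_count + upper_tail_count) / n

# Theoretical probability of being more than one standard deviation from the mean.
# The Normal distribution is symmetrical, so the two tails have equal area.
theoretical = 2 * NormalDist().cdf(-1)

print(f'Observed two-tail proportion     = {observed:.3f}')
print(f'Theoretical two-tail probability = {theoretical:.3f}')
print(f'Expected students in the tails   = {theoretical * n:.1f}')
```

The unrounded theoretical value is about 31.7% rather than 32%, which corresponds to roughly 39.7 students in the two tails compared with the 38 observed.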

<div class="alert alert-success">

The tails of a distribution are a core concept in statistics. It is important that you understand what they mean so that you can understand the results of statistical tests, which we'll cover next.

</div>

## Exercise Notebook

[Tails of a distribution](Exercises/4.1%20-%20Tails%20of%20a%20distribution.ipynb)