# Summary Statistics and the Five-Number Summary

In this notebook, we will use Python to calculate summary statistics in general and the five-number summary in particular.

We will look at some simple examples first, and then work on some real datasets.

## Using NumPy Arrays

NumPy arrays can be used to store numerical data. In the example below, we calculate the mean and the five-number summary (minimum, first quartile, median, third quartile, and maximum) for an example dataset called `x`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([85, 90, 29, 72, 65, 72, 68, 97, 24, 35, 2, 70,  9, 82, 14, 38, 26, 52, 67, 33])

print("Mean for dataset 'x'.")
print(f"Mean = {x.mean()}")

x_min = x.min()
Q1 = np.percentile(x, 25)
Q2 = np.percentile(x, 50)
Q3 = np.percentile(x, 75)
x_max = x.max()

print("\nFive number summary for dataset 'x'.")
print(f"Min = {x_min}")
print(f"Q1 = {Q1}")
print(f"Median = {Q2}")
print(f"Q3 = {Q3}")
print(f"Max = {x_max}")
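As a minimal sketch of an alternative to the cell above, the five values can also be requested in a single `np.percentile` call by passing a list of percentiles, since the 0th and 100th percentiles are the minimum and maximum. The second half reproduces Q1 by hand, assuming NumPy's default linear interpolation between neighbouring sorted values; the names `summary`, `position`, and `q1_by_hand` are illustrative only.

```python
import numpy as np

x = np.array([85, 90, 29, 72, 65, 72, 68, 97, 24, 35, 2, 70, 9, 82, 14, 38, 26, 52, 67, 33])

# One call returns min, Q1, median, Q3 and max as a float array.
summary = np.percentile(x, [0, 25, 50, 75, 100])
print(summary)

# Reproduce Q1 by hand, assuming NumPy's default linear interpolation.
x_sorted = np.sort(x)
position = 0.25 * (len(x_sorted) - 1)  # fractional index into the sorted data
lower = int(np.floor(position))        # index of the value just below Q1
fraction = position - lower            # how far to move towards the next value
q1_by_hand = x_sorted[lower] + fraction * (x_sorted[lower + 1] - x_sorted[lower])
print(q1_by_hand)
```

For `x`, both approaches give Q1 = 28.25, which is not one of the data values. This is why a quartile can fall between two observations.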

## Exercise 1

A second dataset, `y`, is defined below as a NumPy array.

Using the example above as a model, find the mean and the five-number summary for this dataset and print out the values.

In [None]:
y = np.array([83,  2, 85,  3, 99,  0, 65,  2, 24, 98, 62,  0,  8, 43, 83, 24, 75, 16,  7, 37])

# Type your code here


## Exercise 2

Now that you have worked out the five-number summary for the two datasets `x` and `y`, how do the datasets differ? Try to identify two differences between them.


Difference 1:

Difference 2:

## Box Plots

A box plot gives you a visual representation of the five-number summary. It gives you a snapshot of the dataset, helping you to identify key features such as its center, spread, and skew.

The code below generates a box plot for the dataset `x`.

In [None]:
plt.boxplot(x)
plt.show()
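The sketch below shows how the parts of the plot relate to the five-number summary. The box runs from Q1 to Q3 and the line inside it marks the median. With Matplotlib's default setting, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box, and any points beyond that are drawn individually as outliers. The variable names here (`lower_fence`, `upper_fence`, `whisker_low`, `whisker_high`) are illustrative, not part of any library.

```python
import numpy as np

x = np.array([85, 90, 29, 72, 65, 72, 68, 97, 24, 35, 2, 70, 9, 82, 14, 38, 26, 52, 67, 33])

# The box spans the first and third quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Limits beyond which points are treated as outliers (default factor of 1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = x[(x >= lower_fence) & (x <= upper_fence)]
outliers = x[(x < lower_fence) | (x > upper_fence)]
whisker_low = inside.min()
whisker_high = inside.max()

print(f"IQR = {iqr}")
print(f"Whiskers from {whisker_low} to {whisker_high}")
print(f"Number of outliers = {outliers.size}")
```

For `x` there are no outliers, so the whiskers end at the minimum and maximum and the plot shows exactly the five-number summary.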

## Formatting Box Plots

We can also customize the plot. For example, we can give it a title, add gridlines and an x-axis label, and change the spacing of the tick marks on the y-axis.

In [None]:
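plt.boxplot(x)
plt.title("Box Plot for dataset 'x'")
plt.xticks([1], ['x'])           # label the single box, which sits at position 1
plt.yticks(range(0, 110, 10))    # tick marks every 10 units, from 0 to 100
plt.grid(axis='y')               # horizontal gridlines only
plt.show()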

## Multiple Box Plots on the Same Axes

When comparing datasets, it helps to draw their box plots on the same axes. To do this, put the datasets you want to plot in a list and pass the list to `plt.boxplot`. Let's do this for `x` and `y` side by side.


In [None]:
data_list = [x, y]
plt.boxplot(data_list)
plt.xticks([1, 2], ['x', 'y'])
plt.yticks(range(0, 110, 10))
plt.title("Box Plots for x and y")
plt.grid(axis='y')
plt.show()

## Exercise 3

For the two datasets below, `d1` and `d2`, print out the five-number summary for each dataset.

Then create side-by-side box plots to compare the datasets visually. Add a title and labels to make the plot clearer.

What do you notice about the box plots? Use your numbers and graphs to describe two ways in which the datasets differ.

In [None]:
d1 = np.array([46.58445004, 44.07671392, 46.39921661, 48.37888704, 46.78337129,
       45.3086478 , 41.96563946, 51.08916634, 52.66959907, 47.38920504,
       49.22702405, 60.78051231, 48.26781518, 51.31306374, 59.15027285,
       50.76734158, 40.85106059, 55.51203685, 48.03584999, 52.52697665,
       47.46062391, 50.14474262, 47.31792238, 43.98105114, 49.90104119])

d2 = np.array([56.62214897, 40.51212078, 73.35946238, 60.06651664, 57.7240614 ,
       46.02005184, 51.02818696, 67.36239918, 45.10682943, 62.3958202 ,
       60.77152019, 46.59952106, 36.90600902, 53.23142776, 46.06795209,
       59.29048824, 45.85408562, 43.42303302, 70.87260426, 54.21304038,
       73.42317902, 47.562131  , 68.03251023, 61.80320401, 58.40891008])

# Type your code here


Difference 1:

Difference 2:

## End of Notebook

Well done! You've made it to the end of this notebook. If you want to explore box plots and five-number summaries further, you can move on to the following notebook:

**Pandas_Box_Plots.ipynb**