# Descriptive Statistics

**Author:** 'Felipe Millacura'
    
**Date:** '29th January 2021'

## Learning Objectives

* Understand the meaning of the term "distribution"
* Understand the different measures of centre for a distribution - mean, mode, median
* Be able to identify unimodal, bimodal, uniform, left and right skewed distributions
* Know the correct measure of centre for a skewed distribution and/or in the presence of outliers

## Introduction

Descriptive Statistics aims to **summarize** and **present** data coming from a **total population** or a **sample** of it.

In this tutorial we will review the fundamental concepts in this area, which can be grouped into three main areas:

- Distributions
- Measures of centrality
- Measures of spread

Descriptive statistics is the first step in any data exploration. It allows us to see how data is distributed and to make hypotheses about it. 

## What is a distribution? 


A distribution is a list of all of the possible outcomes of a random variable along with either their corresponding frequency or probability values (we would speak of the **frequency distribution** or **probability distribution** respectively). 

Defined either way, the distribution gives insight into *how likely* or *how common* the various outcomes are.  


## Unimodal distributions


A **unimodal** distribution is one in which there is *a single* 'peak' or 'hump' in the data. We can think of a mode as being a local peak in the distribution, so 'unimodal' just means 'one-peaked'.


In [168]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [169]:
unimodal = pd.read_csv("data/unimodal.csv")

#unimodal.head()

In [170]:
px.histogram(unimodal, x="x", nbins = 40)

Here, we can see there is one clear peak in the distribution.


# Measures of centrality: mean, median and mode

Now we have our distribution, we can start making summaries from it. To start, we're going to look at three measures of the 'centre' of a probability distribution. Each of the measures has advantages and disadvantages, which we will talk about later. First, we'll define the measures.

We'll work with a data set tracking daily sales of air conditioning units in a small business. 


In [171]:
# read in dates as strings
air_con_sales = pd.read_csv("data/AirConSales.csv")

In [172]:
air_con_sales.head()

Unnamed: 0,Date,Units_sold
0,6/5/2018,0
1,6/6/2018,1
2,6/7/2018,0
3,6/8/2018,9
4,6/9/2018,3


In [34]:
# then convert column to Date afterwards
air_con_sales["Date"] = pd.to_datetime(air_con_sales["Date"])

In [173]:
air_con_sales.head()

Unnamed: 0,Date,Units_sold
0,6/5/2018,0
1,6/6/2018,1
2,6/7/2018,0
3,6/8/2018,9
4,6/9/2018,3


Next, we need the number of times each value occurs in our data. To get this, we group the rows by the values of a variable and count the rows in each group. For example, let's run `groupby` on the `Units_sold` variable of `air_con_sales`!


In [36]:
air_con_sales.groupby("Units_sold").count()

Unnamed: 0_level_0,Date
Units_sold,Unnamed: 1_level_1
0,3
1,2
2,2
3,9
4,8
6,3
8,1
9,1
10,2
11,3


In the first column we have the number of units sold, and in the second column, the number of dates on which that number of sales occurred. The entries in the second column can be thought of as **frequencies**, e.g. days on which $4$ units were sold occur with frequency $8$ in this dataset.


### Relative frequencies and empirical probabilities

How do frequencies relate to probabilities? A **frequency** is the number of times a given observation occurs in a data set. A **relative frequency** is the fraction of times a particular observation occurs. **Relative frequencies** correspond to **empirical probabilities**. 

The best way to understand this is via an example. Imagine we have an unbiased die, which we roll a few times (the code below simulates $9$ rolls), noting the results as we go. We know that the **theoretical probability** of each number is $1/6 \approx 0.166667$. However, the frequencies we actually rolled are:


In [174]:
from random import randint

In [175]:
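# a single roll of a six-sided die (randint includes both endpoints)
die1 = randint(1,6)

# nine rolls: range(1,10) yields 9 values
random_number = [randint(1,6) for i in range(1,10)] 

random_number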

[3, 6, 5, 4, 3, 6, 4, 5, 1]

In [176]:
import numpy as np

In [178]:
# NumPy's upper bound is exclusive, so use 7 to include the face 6
random_number_array = np.random.randint(1, 7, size=(1, 10))

random_number_array

array([[3, 4, 1, 3, 4, 1, 5, 3, 5, 2]])

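Counting how often each face appears gives us the **frequencies**: how many times we rolled a `1`, how many times a `2`, and so on.   

We then get the **relative frequencies** by dividing the observed frequencies by the total number of rolls. The cells below count the observed frequencies in two ways: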


In [179]:
# initializing dict to store frequency of each element
elements_count = {}
# iterating over the elements for frequency
for element in random_number:
   # checking whether it is in the dict or not
   if element in elements_count:
      # incrementing the count by 1
      elements_count[element] += 1
   else:
      # setting the count to 1
      elements_count[element] = 1
# printing the elements frequencies
for key, value in elements_count.items():
   print(f"{key}: {value}")

3: 2
6: 2
5: 2
4: 2
1: 1


In [180]:
(unique, counts) = np.unique(random_number, return_counts=True)

frequencies = np.asarray((unique, counts)).T


print(frequencies)

[[1 1]
 [3 2]
 [4 2]
 [5 2]
 [6 2]]


Dividing each of these counts by the total number of rolls gives the relative frequencies, and these are **empirical probabilities**, 'empirical' in the sense of 'obtained by observation'. 

*If all we knew about this die were the results of these $9$ rolls, the empirical probabilities would be our best estimate of the behaviour of the die*.
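
The division step can be sketched as follows. This is a minimal illustration using the nine rolls from the output above; the names `rolls` and `relative_freq` are introduced here only for the example.

```python
from collections import Counter

# The nine rolls shown in the output above
rolls = [3, 6, 5, 4, 3, 6, 4, 5, 1]

# Frequency: how many times each face was observed
counts = Counter(rolls)

# Relative frequency: frequency divided by the total number of rolls.
# A Counter returns 0 for faces that never appeared (here, the face 2).
n_rolls = len(rolls)
relative_freq = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, rel in relative_freq.items():
    print(f"{face}: {rel:.3f} (theoretical: {1/6:.3f})")
```

The relative frequencies always sum to $1$, but with so few rolls they can sit far from the theoretical value of $1/6$.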


So, let's look at the frequencies for the sales of air con units using `pandas` (dividing each by the total number of days would give the relative frequencies, or empirical probabilities):   

In [181]:
air_freq = air_con_sales.groupby("Units_sold").count()

In [182]:
air_freq

Unnamed: 0_level_0,Date
Units_sold,Unnamed: 1_level_1
0,3
1,2
2,2
3,9
4,8
6,3
8,1
9,1
10,2
11,3
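
To turn counts like these into relative frequencies, one option is `value_counts(normalize=True)`. The sketch below uses a small made-up `units_sold` Series (hypothetical values, not the `AirConSales` data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical daily sales figures, for illustration only
units_sold = pd.Series([0, 1, 3, 3, 4, 4, 4, 9], name="Units_sold")

# Frequencies: number of days on which each sales figure occurred
freq = units_sold.value_counts().sort_index()

# Relative frequencies: the same counts divided by the number of days
rel_freq = units_sold.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"frequency": freq, "relative_frequency": rel_freq}))
```

With the tutorial's data, the same two calls could be applied to `air_con_sales["Units_sold"]`.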


And let's plot the distribution:  

In [65]:
px.histogram(air_con_sales["Units_sold"], nbins=25)

Now we have our distribution, we can start looking at summaries of that distribution. 


## Mean

Now let's calculate the **mean** number of units sold per day. The mean is just another term for the average, defined as

$$\textrm{mean} = \frac{\textrm{sum of values}}{\textrm{no. of values}}$$

<br> 
The mean is given the symbol 

* $\mu$ for a population, and 
* $\bar{x}$ for a sample.

In [183]:
air_con_sales["Units_sold"].mean()

5.945945945945946


Note that there is no reason that the mean should be an integer. A floating point number, as we obtain here, is perfectly valid. So, on average, $5.95$ units are sold every day.


## Median

The **median** is the value for which one-half of the values in the data set lie below it, and one-half above. pandas makes this easy with the `median()` method!


In [185]:
air_con_sales["Units_sold"].median()

4.0


We had an odd number of values in the `AirConSales` CSV. What if we had an even number?
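
With an even number of values there is no single middle value, so the median is taken as the mean of the two middle values. A minimal sketch on made-up numbers (the Series below are hypothetical, not the sales data):

```python
import pandas as pd

# Made-up values, already sorted for readability
odd_count = pd.Series([1, 3, 4, 7, 9])       # 5 values
even_count = pd.Series([1, 3, 4, 7, 9, 12])  # 6 values

print(odd_count.median())   # the middle value: 4.0
print(even_count.median())  # the mean of 4 and 7: 5.5
```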


## Mode

The **mode** is the most likely value in the data set, i.e. the value that occurs most frequently. pandas provides a `mode()` method for this. It returns a `Series` rather than a single number, because several values can tie for most frequent.


In [62]:
air_con_sales["Units_sold"].mode()

0    3
dtype: int64


So, it turns out that $3$ is the most common daily sales figure. The plot of the distribution confirms that the sales value with the highest probability is $3$.


In [67]:
px.histogram(air_con_sales["Units_sold"], nbins=45)

## Outliers

We've seen the definition of outliers earlier in the course. Just a reminder: outliers are drawn as individual points on a box plot, which we can produce with `px.box()`.


In [68]:
px.box(air_con_sales['Units_sold'])


So, value $43$ is an outlier in the `Units_sold` distribution. It turns out there was a flash clearance sale that day, so it's not surprising that sales figures were much higher than usual! 

**Measures of centre and outliers**

Why mention outliers here? It turns out that **the three measures of distribution 'centre' show very different sensitivities to outliers!**


Let's investigate this!


##### Calculation of Mean:

In [186]:
units_sold = air_con_sales['Units_sold']

units_sold_wo = air_con_sales[air_con_sales['Units_sold'] <40]['Units_sold']

In [187]:
print("Considering outliers: " + str(units_sold.mean()))
print("Not considering outliers: " + str(units_sold_wo.mean()))

Considering outliers: 5.945945945945946
Not considering outliers: 4.916666666666667


##### Calculation of Median:

In [188]:
print("Considering outliers: " + str(units_sold.median()))
print("Not considering outliers: " + str(units_sold_wo.median()))

Considering outliers: 4.0
Not considering outliers: 4.0


##### Calculation of Mode:

In [189]:
print(units_sold.mode())
print(units_sold_wo.mode())

0    3
dtype: int64
0    3
dtype: int64



The key point is that **the mean is more heavily swayed by outliers than the median or the mode**. If we suspect our data set contains outliers, we should consider carefully which measure of centrality to use, and perhaps prefer the median to the mean. 


## Bimodal and Multimodal distributions


On the other hand, we might have a **bimodal** ('two-peaked') distribution.

In [190]:
bimodal =  pd.read_csv("data/bimodal.csv")

In [191]:
px.histogram(bimodal, x="x", nbins = 40)

Here, you can see the two peaks in the data. 

We can extend this to a **multimodal** distribution, featuring more than two peaks!

We need to be careful when applying measures of centrality to multimodal distributions. For example, let's compute the mean and the median of the example bimodal distribution above.


In [192]:
bimodal.mean()["x"]

12.567729996144596

In [193]:
bimodal.median()["x"]

12.089356552090251

So, the mean and median both lie around $12$ to $13$, i.e. in the 'dip' between the two peaks. Aside from the outer edges of the distribution, this is the region with the **lowest frequency** of values! Do the **mean** and **median** provide *useful summaries* of the data if they fall in this region? Probably not... 
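
The same effect shows up on a tiny made-up data set. This is only a sketch: the `values` below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up bimodal data: one cluster near 2, another near 22
values = pd.Series([1, 2, 2, 3, 21, 22, 22, 23])

mean = values.mean()      # 12.0
median = values.median()  # 12.0 (the mean of the two middle values, 3 and 21)

# How many observations lie within 5 units of the mean?
near_mean = int(((values - mean).abs() <= 5).sum())
print(mean, median, near_mean)
```

Both summaries land at $12$, yet no observation lies anywhere near that value.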

## Skewness

Not only can we describe the central tendencies of a distribution, as well as how many peaks it has, but we can describe how symmetrical it is. This is called **skewness**.   

**Skewness** in a distribution refers to **asymmetry**, to a tendency to be distorted to the left or right.


*Skew (definition): a bias towards one particular group or subject.*


### Left skewness

In a **left-skewed** distribution, the centrality measures **typically** fall in the order

$$\textrm{mean} \lt \textrm{median} \lt \textrm{mode}$$ 

Let's take a look at an example:


In [194]:
left_skewed = pd.read_csv("data/leftskew.csv")

If we calculate the descriptive stats we see it follows what we said above:  


In [195]:
left_skewed["x"].mean()

11.69079232975779

In [196]:
left_skewed["x"].median()

11.831301664112399

In [197]:
left_skewed["x"].mode().max()

12.7870403885633

And if we plot it, we can see it is **left skewed**: the tail is pointing towards the left.  

In [201]:
px.histogram(left_skewed, x="x", nbins = 20)

### Diversion - binning a continuous variable

In the above example, the variable `x` is a continuous numeric variable.   

How do we interpret the mode of a continuous variable? Essentially, every `x` value is (or has the potential to be) unique, and occurs with frequency $1$, so what does mode mean in this context?

We can `bin` a continuous variable to get around this problem: in short, this converts the variable from continuous to discrete, by replacing each value with the interval ('bin') it falls into. The mode is then the bin containing the most values. Here's a picture that might help to explain the process.

![](images/Binning_cut.png)



To do this, we set up a series of 'bins' into which we sort the values in the distribution, one-by-one. 
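
One way to bin values in pandas is `pd.cut`. The sketch below uses made-up measurements and bin edges (both are assumptions chosen for illustration), then finds the bin holding the most values.

```python
import pandas as pd

# Made-up continuous measurements, for illustration only
x = pd.Series([10.2, 11.7, 11.9, 12.1, 12.4, 12.6, 12.8, 13.1])

# Sort each value into one of four equal-width bins: (10, 11], (11, 12], ...
binned = pd.cut(x, bins=[10, 11, 12, 13, 14])

# Count the values in each bin and pick the fullest one
bin_counts = binned.value_counts().sort_index()
modal_bin = bin_counts.idxmax()

print(bin_counts)
print("Modal bin:", modal_bin)
```

Note that the modal bin depends on the bin edges chosen, so it is worth trying more than one bin width.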

### Right skewness

Right skewed data is the opposite of left skewed. For right skewed data, the tail should be pointing towards the right.  


In [202]:
right_skewed = pd.read_csv("data/rightskew.csv")

In [203]:
px.histogram(right_skewed, x="x", nbins = 20)

## Measure of skewness

We can compute the `skew()` of a distribution prior to further analyses. A **negative** value corresponds to left-skew, and a **positive** value to right-skew. The size of the value can be interpreted using the following table.

| Skewness                 | Classification     |
|--------------------------|--------------------|
| 0                        | symmetrical        |
| between -0.5 and 0.5     | fairly symmetrical |
| between -1.0 and -0.5    | moderately skewed  |
| between 0.5 and 1.0      | moderately skewed  |
| lower than -1.0          | highly skewed      |
| higher than 1.0          | highly skewed      |

**Note:** negative values refer to left-skewed data and positive values to right-skewed data.

In [204]:
right_skewed["x"].skew()

0.6761924294982851

In [205]:
left_skewed["x"].skew()

-1.0157021805771353

We should try not to use the mean as a measure of centrality for moderately and highly skewed distributions, preferring the median instead in these cases. So, before choosing a measure of centrality, ask:

* is the distribution significantly skewed?
* if significantly skewed, is it left- or right-skewed? 
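
These two questions can be answered from the skewness value alone. The helper below is a hypothetical sketch (`classify_skewness` is not a pandas function), and the treatment of values exactly equal to $0.5$ or $1.0$ is an assumption.

```python
def classify_skewness(skew: float) -> str:
    """Label a skewness value using the 0.5 and 1.0 thresholds."""
    magnitude = abs(skew)
    if magnitude == 0:
        return "symmetrical"
    if magnitude < 0.5:
        return "fairly symmetrical"
    direction = "left" if skew < 0 else "right"
    if magnitude <= 1.0:
        return f"moderately {direction}-skewed"
    return f"highly {direction}-skewed"

# The two skewness values computed above
print(classify_skewness(0.6761924294982851))   # moderately right-skewed
print(classify_skewness(-1.0157021805771353))  # highly left-skewed
```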


## The key message

Why does distribution shape and skewness matter? It will become clearer later, but the TL;DR (“Too Long; Didn't Read.”) version is that all statistical tests have assumptions about the underlying distribution of data. And if your data doesn't meet these assumptions, you will have to use a different statistical test. The only way you can ensure you're using the correct test is to look at your data carefully and assess the underlying distribution.    

We've been saying this a lot, but it bears repeating: 

**"Always visualise your data before performing any further statistical analyses!"**

# Recap


#### What is a distribution?

The distribution is a function that tells us either the *frequency* or *probability* of each outcome in the sample space of an experiment (we would speak of the **frequency distribution** or **probability distribution** respectively)

#### Define mean, median and mode

* The **mean** is the sum of all values divided by the number of values.
* The **mode** is the most likely value in the data set, i.e. the value that occurs most frequently.
* The **median** is the 'middle' value of the sorted data set.

#### What are unimodal, bimodal and multimodal distributions?

A **unimodal** distribution is one in which there is **a single** 'peak' or 'hump' in the data, a **bimodal** distribution has two humps, and **multimodal** data has more than two. 

#### What is the skewness of a distribution?

   **Skewness** in a distribution refers to **asymmetry**, to a tendency to be distorted to the left or right.


#### Which centrality measures should you be careful of for data with outliers, bimodal and multimodal data, and skewed data?

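The mean is strongly affected by the presence of outliers and skew, and may lie in a 'dip' in multimodal data. The median is robust to outliers and skew, but suffers the same difficulty in multimodal data. The mode is not swayed by outliers, which probably makes it the most reliable centrality measure overall, although a multimodal distribution has more than one. 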




## Measures of spread


### Learning Objectives

* Understand measures of spread - range, quartiles, interquartile range
* Be able to interpret spread and skew on a box plot
* Understand the definition of an outlier from the interquartile range
* Know the formula for calculating the sample variance


### Measures of spread


In the last lesson we discussed **centrality** of a distribution, and we saw that there are various measures we can use, each with their own advantages and disadvantages. We think of the centre, however defined, as the position around which the data distribution **spreads**, and now we'll move on to detail the latter concept. 

* How broad is the spread? 
* How do we measure it?  



### Range

The simplest measure of the spread of a distribution is the **range**. We define this as 

$$\textrm{range} = \textrm{max. value} -  \textrm{min. value}$$

i.e. the difference separating the maximum and minimum values in the dataset.

This time, we're going to look at the distribution of 'Accounting' and 'Management' salaries in the Tyrell Corporation. Let's load the CSV and have a look at the `head()` of the data. 


In [206]:
jobs = pd.read_csv("data/TyrellCorpJobs.csv")
jobs.head()


Unnamed: 0.1,Unnamed: 0,Position,Salary
0,1,Accounting,34397
1,2,Accounting,30359
2,3,Management,50036
3,4,Accounting,34716
4,5,Accounting,35786



So, `Accounting` and `Management` salaries are mixed together in a single dataset. Clearly, some grouping will be necessary if we want to examine *only* `Accounting` or `Management` data. 

First, let's try to get the range of the entire distribution:


In [128]:
jobs.describe()

Unnamed: 0.1,Unnamed: 0,Salary
count,175.0,175.0
mean,88.0,39382.8
std,50.662281,10095.80455
min,1.0,29459.0
25%,44.5,33038.5
50%,88.0,35009.0
75%,131.5,41989.5
max,175.0,91027.0


Here we get the minimum and maximum values in the `Salary` column (the `min` and `max` rows). We get the range itself by subtraction:


In [129]:
jobs['Salary'].max() - jobs['Salary'].min()

61568


We've broken a key rule, however! We haven't yet visualised the distribution...


In [207]:
px.histogram(jobs['Salary'], nbins = 20)


There's weak evidence that this may be a **bimodal** distribution, with two local maxima around 30,000 and 50,000. Let's plot salaries grouped by `Position`!


In [137]:
px.histogram(jobs, x="Salary", facet_col="Position")

Ah, so the two local maxima seem to come broadly from the two classes of position. Let's get separate ranges for these salaries grouped by position


In [141]:
accounting = jobs[jobs["Position"]=="Accounting"]

management = jobs[jobs["Position"]=="Management"]


The range of 'Management' salaries is nearly four times greater than that of 'Accounting' positions.
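
One way to obtain a range per group is with `groupby`. This is a minimal sketch on made-up salaries: the `jobs_demo` frame is hypothetical and is not the Tyrell data.

```python
import pandas as pd

# Made-up salaries, for illustration only (not the Tyrell data)
jobs_demo = pd.DataFrame({
    "Position": ["Accounting", "Accounting", "Accounting",
                 "Management", "Management", "Management"],
    "Salary": [30000, 34000, 36000, 40000, 52000, 64000],
})

# Range per group: maximum minus minimum Salary within each Position
salary_range = (
    jobs_demo.groupby("Position")["Salary"]
    .agg(lambda s: s.max() - s.min())
)
print(salary_range)
```

The same expression could be applied to the `jobs` data frame loaded above.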


In [208]:
print(accounting["Salary"].mean())
print(accounting["Salary"].median())


34115.01666666667
34150.0


In [209]:
print(management["Salary"].mean())
print(management["Salary"].median())


50876.145454545454
50300.0



# Quartiles and interquartile range

The **quartiles** of a distribution, $Q1$, $Q2$ and $Q3$, are the values that split the distribution into sections as follows: 

* $Q1$ = the value at which 25% of distribution is equal to or lower, and 75% higher,<br>
* $Q2$ = the value at which 50% of distribution is equal to or lower, and 50% higher, and <br> 
* $Q3$ = the value at which 75% of distribution is equal to or lower, and 25% higher.<br> 

We've already encountered $Q2$: it's just the **median**, so it's really $Q1$ and $Q3$ that are new here. The `describe()` method gives us the quartiles efficiently, and the `quantile()` method gives us any quantile we desire!

* *General quantile*, e.g. the $43\%$ quantile = the value at which 43% of the distribution is equal to or lower, and 57% higher  


In [158]:
management["Salary"].describe()

count       55.000000
mean     50876.145455
std      10940.555109
min      31661.000000
25%      43698.000000
50%      50300.000000
75%      57288.500000
max      91027.000000
Name: Salary, dtype: float64
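
For a quantile that `describe()` does not report, `quantile()` accepts any fraction between $0$ and $1$. A minimal sketch on made-up values (the Series `values` is hypothetical); pandas interpolates linearly between neighbouring data points by default.

```python
import pandas as pd

# Made-up values 1..100, chosen so the quantiles are easy to check
values = pd.Series(range(1, 101))

# Several quantiles at once: the three quartiles
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# A general quantile, here the 43% quantile
q43 = values.quantile(0.43)

print(q1, q2, q3, q43)
```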

We define the **interquartile range (IQR)** as

$$IQR = Q3 - Q1$$

Let's now summarise to get these values for each type of `Position`

In [210]:
management["Salary"].quantile(.75) - management["Salary"].quantile(.25)

13590.5

In [211]:
accounting["Salary"].quantile(.75) - accounting["Salary"].quantile(.25)

2874.5

In the `describe()` output above, the `mean` is self-explanatory, and `min`, `25%`, `50%`, `75%` and `max` correspond to the minimum, Q1, Q2 (median), Q3 and maximum values, respectively. 

The combination of values<br><br>

<center>minimum, Q1, median, Q3, maximum</center><br>

is known as the **five number summary** and is very commonly quoted for a distribution. 
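
A small helper can collect the five numbers in one go. This is a sketch only: `five_number_summary` is a hypothetical name, and the data are made up.

```python
import pandas as pd

def five_number_summary(s: pd.Series) -> pd.Series:
    """Return the minimum, Q1, median, Q3 and maximum of a numeric Series."""
    summary = s.quantile([0, 0.25, 0.5, 0.75, 1.0])
    summary.index = ["min", "Q1", "median", "Q3", "max"]
    return summary

# Made-up data, for illustration only
demo = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 15])
summary = five_number_summary(demo)

print(summary)
print("IQR:", summary["Q3"] - summary["Q1"])
```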

<hr>

# Box plots

Box plots (also known as 'box-and-whisker' or 'hinge' plots) were popularised by John Tukey (a pioneer of exploratory data analysis) in 1970, and they are an effective means to visualise the key measures of a distribution. 

<br>

In [213]:
px.box(management["Salary"])


* The central 'box' corresponds to the IQR of the distribution, the left-hand or lower edge marking Q1, and the right-hand or upper edge, Q3. 
* The median or Q2 is also marked by a line within the box. 
* The whiskers on either side of the box, also known as the **Tukey fences**, mark the positions beyond which data values are normally deemed to be outliers. 

Precise definitions do vary, but, for a data set $x$, the whiskers are positioned at
<br><br>
<center>Lower whisker: $\textrm{max}[ \; \textrm{min}(x), \; \textrm{Q1} - 1.5 \times \textrm{IQR} \;]$</center><br>
<center>Upper whisker: $\textrm{min}[ \; \textrm{max}(x), \; \textrm{Q3} + 1.5 \times \textrm{IQR} \; ]$</center><br>

and this is a very common definition. Data points **below the lower whisker** or **above the upper whisker** are deemed to be **outliers**, as you've seen earlier in the course.
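
The whisker positions and the resulting outliers can be computed directly. The sketch below uses made-up sales figures (the Series `sales` is hypothetical); plotting libraries may draw the whisker at the most extreme data point inside the fence rather than at the fence itself.

```python
import pandas as pd

# Made-up daily sales with one unusually large value
sales = pd.Series([0, 1, 3, 3, 4, 4, 4, 6, 9, 43])

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Whisker positions as defined above
lower_whisker = max(sales.min(), q1 - 1.5 * iqr)
upper_whisker = min(sales.max(), q3 + 1.5 * iqr)

# Outliers lie below the lower whisker or above the upper whisker
outliers = sales[(sales < lower_whisker) | (sales > upper_whisker)]
print(lower_whisker, upper_whisker, outliers.tolist())
```

Here only the value $43$ lies beyond the upper whisker, which sits at $9.25$.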



Spread of a distribution is evident in many ways in a box plot! The key is to think carefully about **how much of the data is present in each section of the plot**.


## Skew in box plots

How does **skew** of a distribution manifest in a box plot? Let's have a look at the box plot of a **heavily right skewed** distribution and see!


In [214]:
heavily_right_skewed = pd.read_csv("data/heavily_right_skewed.csv")

In [216]:
heavily_right_skewed.skew()

Unnamed: 0    0.000000
x             1.410243
dtype: float64

In [215]:
px.histogram(heavily_right_skewed, x="x")

In [167]:
px.box(heavily_right_skewed, x="x")

Hmm, so we see most of the values concentrated to the left of the plot, the median is also shifted leftwards in the central box, and we have a long whisker and many outliers to the right, corresponding to the rightwards pointing tail of the skewed distribution.



# Variance - a single number measure of spread!

So far we have the range, IQR, distance between the whiskers on a box plot; indeed, a wealth of measures of the spread of a distribution. Which should you use, and when? 

Ask a statistician which measure of spread they would quote first for a distribution and they would likely say the **variance**. This section will define that measure.  
<br> 
<div class='emphasis'>
The variance is a measure of how far each value in the data set is from the mean. 
</div>
<br> 

What does this really mean though? Let's suppose I collect the weight of a group of 50 people on a morning bus, around peak commuting and school time. Then suppose I collect the weights of the first 50 people to finish running the London Marathon. Coincidentally, the mean weight for each group is roughly the same. Does this give you an accurate picture of the data?

Anyone who has been squeezed into a commuter bus in the morning might know that there is a large variation in the weights of passengers: some will be larger, some will be small children on their way to school. In contrast, the weight of people running the London Marathon and finishing in the top 50 probably hovers around the lower end of the weight scale. 

As such, in our example, the weights of the bus passengers will be more "spread out" than the weights of the runners. This in itself is a more informative way of looking at the data. And this is where the variance and standard deviation come into play (we'll learn about the standard deviation next!).  


## Calculating variance

The formula for the variance $s^2$ of a **sample** of size $n$ elements is

$$s^2 = \frac{1}{n-1}\sum_i(x_i - \bar{x})^2$$

where $x_i$ is each of the data values in the sample and $\bar{x}$ is the sample mean. If we were calculating the variance of a **population** of size $N$ the formula would be slightly different (we would divide by $N$ rather than $n-1$, and use the population mean $\mu$ in place of $\bar{x}$), but this is a much rarer thing to do: we almost always deal with samples. You can find a description of the population variance [here](https://www.statisticshowto.datasciencecentral.com/population-variance/).


Capital sigma $\sum$ is used in mathematics to indicate the **sum** over a set of objects. For example, if we have a set of numbers

$$x = [3, 1, 9, 4, 12]$$
Then we can write the **sum** of the numbers using capital sigma notation as 

$$\sum_{i=1}^{5}{x_{i}}$$

Let's break this down to see what it means. 

* First, read the capital sigma with attached letter $i$ and numbers as *"the sum from i equals 1 to 5"*. 
* Next we say **what** we are summing: in this case, each number in the $x$ set. We indicate each member as $x_{i}$, so, the first member of the set is $x_{1}$ (which is $3$ in this case); the second, $x_{2}$ (which is $1$), and so on...

So what our sum is telling us to do is the following

$$
\begin{align}
\sum_{i=1}^{5}x_i & = x_1 + x_2 + x_3 + x_4 + x_5 \\
& = 3 + 1 + 9 + 4 + 12 \\
& = 29
\end{align}
$$

<blockquote class='task'>
**Task - 2 mins** Using the same set of numbers $x = \{3, 1, 9, 4, 12\}$ evaluate the sum 
$$\sum_{i=2}^{4}{x_{i}}$$
<details>
<summary>**Solution**</summary>
$$\sum_{i=2}^{4}{x_{i}} = 1 + 9 + 4 = 14$$
i.e. start at index 2 and sum up to index 4
</details>
</blockquote>

Sometimes, if we just want to indicate *"sum everything in the set $x$"*, we might omit the *lower and upper indices* (i.e. $1$ and $5$) and write something like this

$$\sum_{i}{x_i} = 29$$

it's understood here we mean *"go from the start to the end of the set"*. 

Finally, we can apply more complicated operations inside the sum, e.g. *"sum the squares of each of the numbers minus 1 in the set $x$"*

$$
\begin{align}
\sum_i(x_i^2-1) & = (x_1^2 - 1) + (x_2^2 - 1) + (x_3^2 - 1) + (x_4^2 - 1) + (x_5^2 - 1) \\
& = 8 + 0 + 80 + 15 + 143 \\
& = 246
\end{align}
$$

In Python, we can use the `sum()` function to do this for us.

```{python}
x = [3, 1, 9, 4, 12]
total = sum(x)  # 29
```
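
The other sums in this section translate to Python in the same way. A minimal sketch; the variable names are chosen here for illustration.

```python
x = [3, 1, 9, 4, 12]

# Sum over the whole set: x_1 + ... + x_5
total = sum(x)

# Sum from i = 2 to i = 4. Python indexes from 0, so x_2..x_4 is x[1:4]
partial = sum(x[1:4])

# Sum of (x_i^2 - 1) over the whole set
squares_minus_one = sum(xi**2 - 1 for xi in x)

print(total, partial, squares_minus_one)  # 29 14 246
```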

## Back to variance...

The **units of the variance** will be the square of whatever the units were of the original data. So, for example, if we were computing the variance of a set of people's heights, measured in metres, the variance will be in metres squared. 

The `var()` method in pandas computes the **sample** variance by default (it divides by $n-1$, i.e. `ddof=1`). Just think of it as the *'sample variance'* function, and pass `ddof=0` if you ever need the population variance. Be aware that NumPy's `np.var()` defaults to `ddof=0`, i.e. the population variance. 
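
The sample variance formula can be checked by hand against the library results. A minimal sketch on the small set of numbers used earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series([3, 1, 9, 4, 12])
n = len(x)
x_bar = x.mean()

# Sum of squared deviations from the mean
squared_deviations = ((x - x_bar) ** 2).sum()

# Sample variance: divide by n - 1
sample_var = squared_deviations / (n - 1)

# Population variance: divide by n
population_var = squared_deviations / n

print(sample_var, x.var())                    # pandas divides by n - 1 by default
print(population_var, np.var(x.to_numpy()))   # NumPy divides by n by default
```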

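Let's apply the `var()` function to salaries split by `Position`. The results (computed alongside the standard deviations in the next section) are approximately $1.20 \times 10^8$ for `Management` and $5.68 \times 10^6$ for `Accounting`.


So, the variance of `Management` salaries is approximately 21 times larger than that of `Accounting` salaries. This fits with all our findings above that the distribution of `Management` salaries is wider than `Accounting` salaries. 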

<hr>

# Standard deviation

The fact that the units of variance are the **square** of the units of the original data can make interpretation difficult. The fix for this is to take the **square root** of the variance: we call this value the **standard deviation**. 

<br>
<div class='emphasis'>

The standard deviation is, again, a quantity expressing by how much the members of a group differ from the mean value for the group. In other words, how spread out are the observations?

</div>





<br>
<div class='emphasis'>
Maths wise: 

The standard deviation $s$ for a sample is just the **square root of the sample variance**! This is nice, as the standard deviation is then measured in the same units as the original variable, making it easier to interpret. 

</div>
<br>

$$s = \sqrt{s^2}$$

The `std()` method in pandas returns the sample standard deviation of a data set. Let's try this with the `jobs` data, alongside `var()` for comparison:


In [219]:
management["Salary"].std()

10940.555108839875

In [222]:
accounting["Salary"].std()

2383.6756130588146

In [223]:
management["Salary"].var()

119695746.0895623

In [224]:
accounting["Salary"].var()

5681909.428291316
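
As a quick check, the square roots of the variances above should reproduce the standard deviations. The sketch below simply copies the four reported values into variables (the names are chosen here for illustration).

```python
import math

# Values reported in the outputs above
management_var = 119695746.0895623
accounting_var = 5681909.428291316
management_std = 10940.555108839875
accounting_std = 2383.6756130588146

# Square root of the variance gives the standard deviation
print(math.sqrt(management_var), management_std)
print(math.sqrt(accounting_var), accounting_std)

# Ratio of spreads, in squared units and in the original units
variance_ratio = management_var / accounting_var
std_ratio = management_std / accounting_std
print(variance_ratio, std_ratio)
```

The variance ratio is about $21$, while the ratio of standard deviations is only about $4.6$, because the standard deviation is back in the original salary units.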


## Recap


* What measures of the spread of a distribution were discussed in this lesson?

The range, the interquartile range (IQR), the variance and the standard deviation.

* How many **quartiles** does a distribution have? What are their names and into what proportions do they split the distribution?

There are three quartiles:<br>
$\textrm{Q1}$: splits the distribution as: $25\%$ equal to or below, $75\%$ above.<br>
$\textrm{Q2}$: splits the distribution as: $50\%$ equal to or below, $50\%$ above. This is just the median by another name.<br>
$\textrm{Q3}$: splits the distribution as: $75\%$ equal to or below, $25\%$ above.

* What is the definition of the IQR of a distribution?

$$\textrm{IQR} = \textrm{Q3} - \textrm{Q1}$$

* What is the five-number summary of a distribution?

minimum, Q1, median, Q3, maximum

* What are the main components of a boxplot? Where do outliers lie on a boxplot?

![](images/BoxPlotComponents.png)

Outliers are points lying beyond the 'whiskers', which lie at positions:

<center>Lower whisker: $\textrm{max}[ \; \textrm{min}(x), \; \textrm{Q1} - 1.5 \times \textrm{IQR} \;]$</center><br>
<center>Upper whisker: $\textrm{min}[ \; \textrm{max}(x), \; \textrm{Q3} + 1.5 \times \textrm{IQR} \; ]$</center>

* What is the equation for the sample variance?

Variance of a sample of size $n$ elements:
$$s^2 = \frac{1}{n-1}\sum_i(x_i - \bar{x})^2$$


* What is the definition of the sample standard deviation?


It is the square root of the variance
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_i(x_i - \bar{x})^2}$$
