Copyright 2025 Dale Bowman, Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Descriptive statistics

One of the most important components of a data science project is to examine your data using descriptive statistics. 
We call this phase of the project *Exploratory Data Analysis (EDA)*. 

There are two types of EDA, *graphical* EDA and *numerical* EDA. 
Graphical EDA involves plotting the features (variables) in the data to look for symmetry, skewness, and multimodality. 
**We will discuss numerical EDA in this session.**

## What you will learn

In the sections that follow you will learn about descriptive statistics, in particular numerical EDA, and how they can help us learn about our data and what types of analyses may be appropriate.  We will study the following:

- level of measure
- measures of central tendency
- measures of dispersion
- sampling

## When to use numerical EDA

Descriptive statistics, both numerical and graphical, are useful when you begin a data science project and you want to explore the data.  Often insights will be gained that can be useful in further analyses.

## Level of Measure

The type of features (variables) you have in the data determines which graphical and numerical techniques are appropriate. 
One characteristic of the features is their level of measure. 
There are four levels: nominal, ordinal, interval, and ratio. 
The characteristics of these levels are detailed in the table below.

| Level    | Description                              | Example                      |
|:---------|:-----------------------------------------|:-----------------------------|
| nominal  | categorical data that can be *named*     | eye color                    |
| ordinal  | categorical data with a natural ordering |  grades: A, B, C, D, F       |
| interval | numerical data without a true zero       | Fahrenheit temperature scale |
| ratio    | numerical data with a true zero          | Kelvin temperature scale     |
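
As a rough illustration of why the level of measure matters in practice, the sketch below stores a handful of made-up grades as an ordered categorical in `pandas`. The grade values and the names `grade_type` and `ordinal_grades` are assumptions for illustration only. Once the order is declared, comparisons and min/max become meaningful, which would not be the case for a nominal variable such as eye color.

```python
import pandas as pd

# Hypothetical grades, used only to illustrate an ordinal variable
grades = pd.Series(["B", "A", "F", "C", "B"])

# Declare the ordering explicitly, from lowest (F) to highest (A)
grade_type = pd.CategoricalDtype(categories=["F", "D", "C", "B", "A"], ordered=True)
ordinal_grades = grades.astype(grade_type)

# Ordered categories support min/max and comparisons
lowest = ordinal_grades.min()
highest = ordinal_grades.max()
at_least_c = int((ordinal_grades >= "C").sum())

print(lowest, highest, at_least_c)
```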


## Descriptive Statistics, Numerical EDA

Descriptive measures fall into one of two categories:

- Measuring central tendencies of a variable
- Measuring the spread of a variable

To explore these, we'll use the `iris` dataset:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| SepalLength | Ratio   | the sepal length (cm) |
| SepalWidth  | Ratio   | the sepal width (cm)  |
| PetalLength | Ratio   | the petal length (cm) |
| PetalWidth  | Ratio   | the petal width (cm)  |
| Species     | Nominal | the flower species    |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/iris">UCI Machine Learning Repository library
    </a></div>
<br>


We can calculate central tendency and spread using `pandas dataframes`.
Let's start by importing `pandas`:

- `import pandas as pd`

In [10]:
import pandas as pd

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="zq]t,,UR`GWj6?Hub9%a">pd</variable></variables><block type="importAs" id="o[3wY[w:R*b$p^Ow/ZZw" x="125" y="352"><field name="libraryName">pandas</field><field name="VAR" id="zq]t,,UR`GWj6?Hub9%a">pd</field></block></xml>

Now let's load a dataset into a dataframe:

- Set `dataframe` to with `pd` do `read_csv` using 
    - `"datasets/iris.csv"`
- `dataframe` (to display it)

In [12]:
dataframe = pd.read_csv('datasets/iris.csv')

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable><variable id="zq]t,,UR`GWj6?Hub9%a">pd</variable></variables><block type="variables_set" id="pspF!@hdz,LBbxbCt.T," x="97" y="189"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><value name="VALUE"><block type="varDoMethod" id="euk#bP:=-t?/^HkbQeAp"><mutation items="1"></mutation><field name="VAR" id="zq]t,,UR`GWj6?Hub9%a">pd</field><field name="MEMBER">read_csv</field><data>pd:read_csv</data><value name="ADD0"><block type="text" id="uXF*vA1y:R}D1bf_0r{N"><field name="TEXT">datasets/iris.csv</field></block></value></block></value></block></xml>

Now we're ready to calculate the measures of central tendency in the next section.

### Measures of Central Tendency

The measures of central tendency most commonly used to describe data are **mean, median and mode**. 


#### Mean
The **mean** is the numerical average of a variable's values. 
Let $X_1, X_2, \ldots, X_n$ represent the data;
then the mean is found as $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$

A dataframe will do this calculation for you:

- with `dataframe` do `mean` using
    -  freestyle `numeric_only=True`

In [17]:
dataframe.mean(numeric_only=True)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod" id="]I:]}wMu=(Gnu*RD$$~9" x="8" y="188"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">mean</field><data>dataframe:mean</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="6fr(WT?jqdM_FRQ2EDef"><field name="CODE">numeric_only=True</field></block></value></block></xml>

SepalLength    5.843333
SepalWidth     3.054000
PetalLength    3.758667
PetalWidth     1.198667
dtype: float64

This gave us the mean of each column, i.e. the mean of each variable in the dataframe.
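
To connect the formula for $\bar{X}$ with what `pandas` computes, here is a minimal sketch on a small made-up series (the values are illustrative, not taken from `iris`). It adds up the values, divides by $n$, and compares the result with `mean`.

```python
import pandas as pd

# Illustrative values, not from the iris dataset
values = pd.Series([4.0, 5.0, 6.0, 9.0])

# Apply the formula directly: sum the data and divide by n
n = len(values)
manual_mean = values.sum() / n

# pandas should give the same answer
pandas_mean = values.mean()

print(manual_mean, pandas_mean)
```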

#### Median
The **median** is the number in the middle of the data. 
By definition, one half of the data points are below the median and one half are above. 
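
One way to see what "the number in the middle" means is to sort the data and pick the middle position. The sketch below uses a hypothetical helper, `median_by_sorting`, and made-up values. It also handles an even number of points by averaging the two middle values, which is the usual convention and the one `pandas` follows.

```python
import pandas as pd

def median_by_sorting(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 3]       # illustrative values
even_data = [7, 1, 3, 5]   # illustrative values

print(median_by_sorting(odd_data), pd.Series(odd_data).median())
print(median_by_sorting(even_data), pd.Series(even_data).median())
```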

We can get this from a dataframe too:

- with `dataframe` do `median` using
    -  freestyle `numeric_only=True`

In [23]:
dataframe.median(numeric_only=True)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod" id="]I:]}wMu=(Gnu*RD$$~9" x="8" y="188"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">median</field><data>dataframe:median</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="T#YbS@FF(7Q:{su)^6*b"><field name="CODE">numeric_only=True</field></block></value></block></xml>

SepalLength    5.80
SepalWidth     3.00
PetalLength    4.35
PetalWidth     1.30
dtype: float64

Just like before we have a median for each variable.
Notice that the mean and the median are almost the same for the first two variables (`SepalLength` and `SepalWidth`) but noticeably different for the third variable (`PetalLength`).
What do you think that means?

#### Mode

The **mode** is the value in the data that shows up the most:

- with `dataframe` do `mode`

In [26]:
dataframe.mode()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod" id="]I:]}wMu=(Gnu*RD$$~9" x="8" y="188"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">mode</field><data>dataframe:mode</data></block></xml>

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.0,3.0,1.5,0.2,setosa
1,,,,,versicolor
2,,,,,virginica


The output here is a little more difficult to understand.
The first row gives us the mode for `SepalLength`, `SepalWidth`, `PetalLength`, and `PetalWidth`.
However `Species` has three modes: each of the species occurs **exactly** the same number of times in the data.
Because of that, `pandas` displays three rows here and fills the other columns of the second and third rows with `NaN` (Not a Number).
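
A tie like this is easier to see by counting how often each value occurs. The sketch below uses a tiny made-up stand-in for the `Species` column (six rows rather than the real data) and the standard `value_counts` and `mode` methods of a `pandas` series.

```python
import pandas as pd

# Tiny stand-in for the Species column; the real data has many more rows
species = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
)

# Count how many times each value occurs
counts = species.value_counts()

# mode returns every value that shares the highest count
modes = list(species.mode())

print(counts)
print(modes)
```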

#### Summary

The type of variable determines the measures of central tendency available.

For categorical data, the mode is the only measure of central tendency that can be computed unless the data are ordinal, in which case you can use either the mode or the median.

In the table below, X indicates where a variable type and a measure of central tendency can be used together.

|          | mode | median | mean |
|----------|------|--------|------|
| nominal  | X    |        |      |
| ordinal  | X    | X      |      |
| interval | X    | X      | X    |
| ratio    | X    | X      | X    |

### Example Categorical Data 

The grades in a large statistics course occurred with the following frequency.

| Grade     | F | D  | C  | B  | A  |
|:-----------|---|----|----|----|----|
| Frequency | 45 | 10 | 25 | 15 | 5 |

The mode of this data is the grade with the highest frequency, in this case F. 
Since this data is ordinal, we can also compute the median. 
There are a total of 100 grades, so the median sits in the middle of the ordered list, between the 50th and 51st grades. 
Counting up from F, the first 45 grades are F and the next 10 are D, so the 50th and 51st grades are both D. 
This puts the median grade at **D**.
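
The same reasoning can be sketched with a cumulative count. The code below rebuilds the frequency table as a `pandas` series (the variable names are illustrative) and finds the first grade, counting up from F, whose cumulative frequency reaches half of the total.

```python
import pandas as pd

# Frequency table from the example, ordered from lowest to highest grade
freq = pd.Series([45, 10, 25, 15, 5], index=["F", "D", "C", "B", "A"])

# Mode: the grade with the largest frequency
mode_grade = freq.idxmax()

# Median: first grade whose running total reaches the middle of the data
cumulative = freq.cumsum()
half = freq.sum() / 2
median_grade = cumulative[cumulative >= half].index[0]

print(mode_grade, median_grade)
```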

For numerical data, the mode, median and mean can all be used to measure central tendency. 
Sometimes one measure will be more useful than another. 
For example, when outliers exist in the data, the mean can be pulled towards the outliers. 
Think of measuring incomes where one of the incomes is that of a professional basketball player. 
The player's extremely high income is very different from most of the other incomes. 
It is called an *outlier* and will affect the mean more than the median. 
As a simple example consider the following incomes.

$30,000 ~~ 40,000 ~~ 50,000~~60,000~~4,000,000$

The mean of these incomes is 

$\frac{30000+40000+50000+60000+4000000}{5} = \$836,000.$

Notice that \\$4 million is more than 60 times the next highest value, \\$60 thousand.
As a result, the mean is pulled far above the four typical incomes, to a value between \\$60 thousand and \\$4 million.

In contrast, the median of the incomes is the number with 2 incomes below and 2
incomes above, $50,000, which is a much more reasonable estimate of the
central tendency of the majority of these incomes.

Let's take a closer look at this with dataframes:

- Set `salary` to with `pd` create `DataFrame` using
    - a list containing
        - 30000
        - 40000
        - 50000
        - 60000
        - 4000000
- `salary` (to display it)
    
**Note it is `create` and not `do`** because we are creating a dataframe using the values in the list.

This is what the lists look like:

![image.png](https://pbs.twimg.com/media/GpP3zHeXoAAAknk?format=png&name=small)

In [28]:
salary = pd.DataFrame([30000, 40000, 50000, 60000, 4000000])

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id=":CSTVHy4liG@-G^A}4h7">salary</variable><variable id="zq]t,,UR`GWj6?Hub9%a">pd</variable></variables><block type="variables_set" id="RsV~*dS!P8Pv3SGa%E]2" x="115" y="221"><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field><value name="VALUE"><block type="varCreateObject" id="u{cYr:fAgi#7d*n;fdPw"><mutation items="1"></mutation><field name="VAR" id="zq]t,,UR`GWj6?Hub9%a">pd</field><field name="MEMBER">DataFrame</field><data>pd:DataFrame</data><value name="ADD0"><block type="lists_create_with" id="vOJNH[,fhpmUQ~MDz|F("><mutation items="5"></mutation><value name="ADD0"><block type="math_number" id="/Q}*{r}_k(J.joOT=Zj#"><field name="NUM">30000</field></block></value><value name="ADD1"><block type="math_number" id="ir*bx9kvQ#;bc~+S.wGD"><field name="NUM">40000</field></block></value><value name="ADD2"><block type="math_number" id="`pL#YfpMk1E$_sx-!|sA"><field name="NUM">50000</field></block></value><value name="ADD3"><block type="math_number" id="F2G:~gdVD/C@Tt78iOLG"><field name="NUM">60000</field></block></value><value name="ADD4"><block type="math_number" id="{W-5iQqRNiA9jF6EIedc"><field name="NUM">4000000</field></block></value></block></value></block></value></block></xml>

Now try the following:

- `print` with `salary` do `median`  ( `print` in TEXT)
- `print` with `salary` do `mean`

In [30]:
print(salary.median())
print(salary.mean())

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id=":CSTVHy4liG@-G^A}4h7">salary</variable></variables><block type="text_print" id="`jn6J~kaejV!Cx!s{zP-" x="111" y="117"><value name="TEXT"><shadow type="text" id="X+4}*qR3]Q;+8cTr}T0?"><field name="TEXT">abc</field></shadow><block type="varDoMethod" id="Pl:^oR,*DCGup|yt@jhP"><mutation items="1"></mutation><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field><field name="MEMBER">median</field><data>salary:median</data></block></value><next><block type="text_print" id=",V)HMF*rq4(p2J4=%*g^"><value name="TEXT"><shadow type="text" id="tLFz1IMPkggv@6$7*S[n"><field name="TEXT">abc</field></shadow><block type="varDoMethod" id="{^HKlDM0I%-#.W.[A$z."><mutation items="1"></mutation><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field><field name="MEMBER">mean</field><data>salary:mean</data></block></value></block></next></block></xml>

0    50000.0
dtype: float64
0    836000.0
dtype: float64


As you can see, this matches the example above.
The mean is much larger than the median.

Now go back up to your list and take out the $4 million block (remember to take the blank spot out of your list as well) and run the mean/median again.

When you do this, you should find that the mean and the median are **exactly** the same.
What does this mean?
When they are the same, it suggests the data is **symmetric** around its center (like the plot below).
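
For reference, a plain `pandas` version of that check might look like the sketch below. It uses a series instead of the dataframe built with blocks, which is an assumption made to keep the example short.

```python
import pandas as pd

# The same incomes with the $4 million outlier removed
salary_no_outlier = pd.Series([30000, 40000, 50000, 60000])

median_value = salary_no_outlier.median()
mean_value = salary_no_outlier.mean()

# With the outlier gone, the two measures of center agree
print(median_value, mean_value)
```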

### Measures of Dispersion (spread)

Even when two different variables have similar means (or medians, or modes) they can still be quite different depending on how the data are spread out around the center. 
In Figure 1 below both distributions have the same mean (0) but different spreads. 
The red curve has most of its points close to the center while the blue curve has points spread further from the mean.


![spread2.png](attachment:spread2.png)

**Figure 1:** Two distributions with the same center but different spread.

One measure of dispersion that can be used with ordered categorical data (ordinal level) or numerical data (interval/ratio level) is the **five number summary**.
The five number summary is useful for comparing the center and spread of multiple variables. 
You use the numbers in the five number summary to construct a box and whiskers plot. 
The five numbers are: 

- minimum
- first quartile
- median
- third quartile
- maximum

The first quartile is the median of the values below the median and the third quartile is the median of the
values above the median.

To use a football analogy, quartiles are like the 4 quarters in a game, and the median is like halftime.

We can get the five number summary easily from `pandas`:

- with `dataframe` do `describe`

In [32]:
dataframe.describe()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod" id="O3QcP4Y+iT~?:8x5/Mld" x="8" y="188"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">describe</field><data>dataframe:describe</data></block></xml>

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


The rows `min`, `25%`, `50%`, `75%`, and `max` are the five number summary for each variable.
Note we've also got a few extras here: count, mean, and std.
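
The quartiles that `pandas` reports may not always match the "median of each half" definition exactly, because `quantile` interpolates linearly between data points by default. The sketch below compares the two approaches on made-up values; `quartiles_by_halves` is a hypothetical helper written for this illustration, not a `pandas` function.

```python
import statistics
import pandas as pd

def quartiles_by_halves(values):
    """First quartile, median, third quartile using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # values below the median
    upper = ordered[(n + 1) // 2 :]    # values above the median
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

data = [1, 3, 5, 7, 9, 11, 13]  # illustrative values

# Median-of-halves quartiles
by_halves = quartiles_by_halves(data)

# pandas five number summary using its default linear interpolation
interpolated = pd.Series(data).quantile([0, 0.25, 0.5, 0.75, 1.0])

print(by_halves)
print(interpolated.tolist())
```

On small datasets the two methods can give slightly different quartiles; on larger datasets the difference is usually minor.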

### Example Five Number Summary 

For the data shown below the minimum is 3, the median is 9 and the maximum is 22. 
We find the first quartile as the median of the lower five numbers: here it is 6. 
The third quartile is 13, the median of the numbers above the median. 
So the five number summary for this data is $\{ 3,6,9,13,22\}$.

![summary.png](attachment:summary.png)

Other measures of the spread for numerical data include the range, the interquartile range, and the variance. 

The **range** is simply the maximum value minus the minimum. 
When outliers are present they may inflate the range. 
In our income example, the range would be $4,000,000-30,000=3,970,000$, which is not representative of the spread of the majority of incomes. 

To reduce the effect of outliers on the measure of dispersion, the interquartile range is often used. 
The **interquartile range** is defined as the third quartile minus the first quartile.
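
To see how differently the two measures react to an outlier, the sketch below computes both for the five incomes from the earlier example. It relies on the `quantile` method with its default linear interpolation, so the quartiles here follow the `pandas` convention.

```python
import pandas as pd

# The five incomes from the earlier example
incomes = pd.Series([30000, 40000, 50000, 60000, 4000000])

# Range: maximum minus minimum, dominated by the outlier
value_range = incomes.max() - incomes.min()

# Interquartile range: third quartile minus first quartile
q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1

print(value_range, iqr)
```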

The most commonly used measures of dispersion for numerical data are the **variance** and its square root, the **standard deviation**. 
The variance measures the average squared difference of the data from the mean.
Squaring the differences may seem complicated but makes sense when you realize that the sum of the unsquared differences about the mean is always zero.

Again, let $X_1, X_2, \ldots, X_n$ be the data values whose variance you want to compute. 
The formula for the variance is given by $S^2 = \frac{\sum_{i=1}^n (X_i  - \bar{X})^2}{n-1}.$ 
The standard deviation is the square root of the variance.

When we did `describe` above, it gave us standard deviation already (`std`).
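
As a check on the formula for $S^2$, here is a minimal sketch on made-up values. It computes the deviations from the mean, confirms that they sum to zero, and compares the hand-computed variance with the `var` and `std` methods, which divide by $n-1$ by default.

```python
import pandas as pd

# Illustrative values, not from the iris dataset
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(values)

# Differences about the mean always sum to zero
deviations = values - values.mean()
deviation_sum = deviations.sum()

# Sample variance: sum of squared differences divided by n - 1
manual_var = (deviations ** 2).sum() / (n - 1)
manual_std = manual_var ** 0.5

print(deviation_sum, manual_var, values.var())
print(manual_std, values.std())
```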

## Sampling

The descriptive statistics discussed here all assume that the data we have is a **random sample** from some larger population. 
The population mean, $\mu$, and the population variance, $\sigma^2$, are unknown, and the sample is typically taken to gain information about them. 
The population mean and variance are **parameters** while the sample mean ($\bar{X}$) and sample variance ($S^2$) are called **statistics**. 
Since the sample mean and sample variance are computed from a random sample from the population, each time we take a different sample, we expect to get different values of the sample mean and sample variance. 

We would like to know how much difference there would be in, say, $\bar{X}$ over different samples. 
The *standard error* measures how much a statistic varies from sample to sample. 
For the sample mean, it is known that the variance of $\bar{X}$ is directly proportional to the population variance, $\sigma^2$, and inversely proportional to the sample size, $n$: the variance of $\bar{X}$ is $\sigma^2/n$. 
So we can reduce the variation in $\bar{X}$ by increasing our sample size, $n$. 
The standard error of $\bar{X}$ can be estimated by $\displaystyle \sqrt{\frac{S^2}{n}}.$
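
The estimate above takes only a line or two to compute. The sketch below applies it to ten made-up measurements (not a real draw from `iris`) and compares the result with the `sem` method, which is the built-in `pandas` estimate of the standard error of the mean.

```python
import math
import pandas as pd

# Ten made-up measurements standing in for a random sample
sample_values = pd.Series([5.1, 4.9, 6.3, 5.8, 6.0, 5.5, 6.7, 5.0, 6.1, 5.6])
n = len(sample_values)

# Estimated standard error: square root of (sample variance / n)
manual_se = math.sqrt(sample_values.var() / n)

# pandas provides the same estimate directly
pandas_se = sample_values.sem()

print(manual_se, pandas_se)
```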

The best way to begin to understand this is to sample some rows from your dataframe (remember the rows are just data points) and calculate the mean of that sample.

When we do this in `pandas`, the sample is just another (smaller) dataframe:

- Set `sample` to with `dataframe` do `sample`
    - using `10`
- `sample`

In [34]:
sample = dataframe.sample(10)

sample

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="X~V70-15t9L^b}[IjBoW">sample</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="variables_set" id="X2Q0lhlIuB;M-}%}kt0@" x="137" y="151"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field><value name="VALUE"><block type="varDoMethod" id="}THa7gMDOBGv~{O0J-}v"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">sample</field><data>dataframe:sample</data><value name="ADD0"><block type="math_number" id="x6aHWTYmPfX=;4W1C.io"><field name="NUM">10</field></block></value></block></value></block><block type="variables_get" id=".%[gOquwCx4nt[!vFu@%" x="139" y="250"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field></block></xml>

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
41,4.5,2.3,1.3,0.3,setosa
108,6.7,2.5,5.8,1.8,virginica
92,5.8,2.6,4.0,1.2,versicolor
62,6.0,2.2,4.0,1.0,versicolor
101,5.8,2.7,5.1,1.9,virginica
113,5.7,2.5,5.0,2.0,virginica
46,5.1,3.8,1.6,0.2,setosa
43,5.0,3.5,1.6,0.6,setosa
107,7.3,2.9,6.3,1.8,virginica
3,4.6,3.1,1.5,0.2,setosa


Rerun the sampling cell above a couple of times to see how it randomly draws 10 rows from the original dataframe.
You can get a sense of the randomness by looking at the index column on the left hand side: in the original dataframe it was ordered from 0 to 149.

Once you're ready, do this:

- with `sample` do `mean` using
    - freestyle `numeric_only=True` 

In [40]:
sample = dataframe.sample(10)

sample.mean(numeric_only=True)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="X~V70-15t9L^b}[IjBoW">sample</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="variables_set" id="x.L#2+Kmq01S1!FI|%*2" x="5" y="123"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field><value name="VALUE"><block type="varDoMethod" id="!L!ZRT9j}{ngeakBRIE!"><mutation items="1"></mutation><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><field name="MEMBER">sample</field><data>dataframe:sample</data><value name="ADD0"><block type="math_number" id="VWJ^n9kSRcY:)U5Kmi;8"><field name="NUM">10</field></block></value></block></value></block><block type="varDoMethod" id="GrD,tn=C@B=JZus4YvC;" x="20" y="202"><mutation items="1"></mutation><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field><field name="MEMBER">mean</field><data>sample:mean</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="G=gxt:RG*{bKe(}.JC?+"><field name="CODE">numeric_only=True</field></block></value></block></xml>

SepalLength    5.82
SepalWidth     2.99
PetalLength    4.01
PetalWidth     1.35
dtype: float64

Run this code cell a couple of times (trick: do Ctrl + Enter).

Notice that the means change each time, just a bit.

That variation in the mean across different samples is what the standard error of the mean measures.
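
Rerunning a cell by hand only shows a few samples. As a rough sketch, the simulation below repeats the sampling many times on a synthetic population. The population, its mean and spread, the seed values, and the helper `spread_of_sample_means` are all assumptions made for illustration. It compares how much the sample mean varies for samples of size 10 and size 40.

```python
import numpy as np
import pandas as pd

# Synthetic population (an assumption for illustration, not the iris data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=5.8, scale=0.8, size=10_000))

def spread_of_sample_means(n, repeats=500):
    """Draw `repeats` random samples of size n and return the
    standard deviation of their means."""
    means = [
        population.sample(n, random_state=seed).mean()
        for seed in range(repeats)
    ]
    return pd.Series(means).std()

spread_10 = spread_of_sample_means(10)
spread_40 = spread_of_sample_means(40)

# Theory suggests roughly sigma / sqrt(n) for each sample size
print(spread_10, 0.8 / np.sqrt(10))
print(spread_40, 0.8 / np.sqrt(40))
```

The larger samples should give sample means that vary less, in line with the $\sqrt{S^2/n}$ estimate of the standard error.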

## Check your knowledge

**Hover to see the correct answer.**

1.  What is the primary goal of **Exploratory Data Analysis (EDA)**?
- To build predictive models
- <div title="Correct answer"> To examine data using descriptive statistics</div>
- To clean and pre-process data
- To visualize data for presentations

2.  Which of the following is **numerical data without a true zero**?
- Nominal
- Ordinal
- <div title="Correct answer"> Interval</div>
- Ratio

3.  In the `iris` dataset, what is the **level of measure** for the 'Species' variable?
- Ratio
- Interval
- Ordinal
- <div title="Correct answer"> Nominal</div>

4.  Which of the following is **NOT** a measure of central tendency?
- Mean
- Median
- Mode
- <div title="Correct answer"> Variance</div>

5.  Consider the following dataset of incomes: \\$30,000, \\$40,000, \\$50,000, \\$60,000, \\$4,000,000. Which measure of central tendency would be a more reasonable estimate of the central tendency of the majority of these incomes, and why?
- The mean, because it includes all values.
- <div title="Correct answer"> The median, because it is less affected by outliers.</div>
- The mode, because it represents the most frequent income.
- Both mean and median would be equally useful.

6.  Which of the following is **not** one of the five numbers in a **five-number summary**?
- Minimum
- <div title="Correct answer"> Mean</div>
- Median
- Maximum

7.  What does a smaller standard error of the sample mean suggest?
- The sample mean is less accurate.
- The sample size is likely small.
- <div title="Correct answer"> The variation in sample means across different samples is smaller.</div>
- The population variance is very large.
