# Descriptive statistics

One of the most important components of a data science project is
examining your data using descriptive statistics.
This phase of the project is called *Exploratory Data Analysis (EDA)*.
There are two types of EDA: *graphical* and *numerical*.
Graphical EDA involves plotting the features (variables) in the data to look for
symmetry, skewness, and multimodality.
We discuss numerical EDA here.

## Level of Measure

The type of features (variables) you have in the data determines which
graphical and numerical techniques are appropriate. One characteristic
of the features is their level of measure. There are four levels:
nominal, ordinal, interval, and ratio. The characteristics of these
levels are detailed in the table below.

| Level    | Description                              | Example                      |
|:---------|:-----------------------------------------|:-----------------------------|
| nominal  | categorical data that can be *named*     | eye color                    |
| ordinal  | categorical data with a natural ordering | grades: A, B, C, D, F        |
| interval | numerical data without a true zero       | Fahrenheit temperature scale |
| ratio    | numerical data with a true zero          | Kelvin temperature scale     |


## Descriptive Statistics, Numerical EDA

Descriptive measures fall into one of two categories: they measure either
the central tendency or the spread of a variable.

### Measures of Central Tendency

The measures of central tendency most commonly used to describe data are
the mean, median, and mode. The *mean* is the numerical average of the
data values. Let $X_1, X_2, \ldots, X_n$ represent the data;
then the mean is found as $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$
The *median* is the number in the middle of the sorted data. One half of
the data points are below the median and one half are above. The
*mode* is the value in the data that occurs most often.
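As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The list `data` is a made-up example, not taken from the text.

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(data)      # sum of the values divided by their count
median_value = statistics.median(data)  # middle of the sorted values (average of the
                                        # two middle values when the count is even)
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

With an even number of values, as here, the median is the average of the two middle values.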

For categorical data, the mode is the only measure of central tendency
that can be computed unless the data are ordinal, in which case you can
use either the mode or the median.

### Example Categorical Data 

The grades in a large statistics course occurred with the following frequency.

| Grade     | A | B  | C  | D  | F  |
|:-----------|---|----|----|----|----|
| Frequency | 5 | 15 | 25 | 10 | 45 |

The mode of these data is the grade with the highest frequency, in this
case F. Since the data are ordinal, we can also compute the
median. There are 100 grades in total, so the median falls between the
50th and 51st grades in sorted order. Counting up from A, the cumulative
frequencies are 5, 20, 45, 55, and 100, so both the 50th and 51st grades
are D. This puts the median grade at
**D**.
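The same calculation can be sketched in code using cumulative frequencies. The helper `grade_at` is a hypothetical name introduced here for illustration; the grades and frequencies are those in the table above.

```python
from itertools import accumulate

grades = ["A", "B", "C", "D", "F"]   # ordered from best to worst
frequencies = [5, 15, 25, 10, 45]

# Mode: the grade with the largest frequency.
mode_grade = grades[frequencies.index(max(frequencies))]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))  # running totals: 5, 20, 45, 55, 100


def grade_at(position):
    """Return the grade at a 1-based position in the sorted list of grades."""
    for grade, running_total in zip(grades, cumulative):
        if position <= running_total:
            return grade
    raise ValueError("position is larger than the number of grades")


# Middle position(s): 50 and 51 for 100 grades (they coincide when total is odd).
lower_middle = grade_at((total + 1) // 2)
upper_middle = grade_at(total // 2 + 1)

print(mode_grade, lower_middle, upper_middle)
```

Because both middle positions fall in the same category, the median is well defined even though grades cannot be averaged.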

For numerical data, the mode, median, and mean can all be used to measure
central tendency. Sometimes one measure will be more useful than
another. For example, when outliers exist in the data, the mean can be
pulled towards the outliers. Think of measuring incomes where one of the
incomes is that of a professional basketball player. The player's
extremely high income is very different from most of the other
incomes. Such a value is called an *outlier*, and it affects the mean more
than the median. As a simple example, consider the following incomes.

$30,000 ~~ 40,000 ~~ 50,000~~60,000~~4,000,000$

The mean of these incomes is 

$\frac{30000+40000+50000+60000+4000000}{5} = \$836,000.$

The median of the incomes, by contrast, is the number with 2 incomes below
and 2 incomes above: \$50,000. This is a much more reasonable estimate of the
central tendency of the majority of these incomes.
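A short sketch with the five incomes above shows how differently the two measures respond to the outlier. It uses only the standard `statistics` module.

```python
import statistics

# The five incomes from the example, in dollars.
incomes = [30_000, 40_000, 50_000, 60_000, 4_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the 4,000,000 outlier
median_income = statistics.median(incomes)  # middle value, unaffected by the outlier

print(mean_income, median_income)
```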

### Measures of Dispersion (spread)

Even when two different variables have similar means (or medians, or
modes) they can still be quite different depending on how the data are
spread out around the center. In Figure 1 below both distributions have
the same mean (0) but different spreads. The red curve has most of its
points close to the center while the blue curve has points spread
further from the mean.



![spread2.png](attachment:spread2.png)

**Figure 1:** Two distributions with the same center but different
spread

One measure of dispersion that can be used with ordered categorical data
(ordinal level) or numerical data (interval/ratio level) is the five
number summary. The five number summary is useful for comparing the
center and spread of multiple variables. You use the numbers in the five
number summary to construct a box and whiskers plot. The five numbers
are: *minimum*, *first quartile*, *median*, *third
quartile*, and *maximum*. The first quartile is the median of the
values below the median and the third quartile is the median of the
values above the median.
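The quartile rule just described can be sketched as follows. The function name `five_number_summary` and the sample values are assumptions made for illustration. Other quartile conventions exist, so library functions may return slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the values below and above the median.
    When the count is odd, the median itself belongs to neither half.
    Assumes at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


# Hypothetical data, not taken from the text.
sample = [7, 1, 15, 4, 10, 2, 12, 6, 8]
summary = five_number_summary(sample)
print(summary)
```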

### Example Five Number Summary 

For the data shown below, the minimum is 3, the median is 9, and the
maximum is 22. We find the first quartile as the median of the lower five
numbers: here it is 6. The third quartile is 13, the median of the numbers
above the median. So the five number summary for this data is
$\{ 3,6,9,13,22\}$.

![summary.png](attachment:summary.png)

Other measures of spread for numerical data include the range, the
interquartile range, and the variance. The range is simply the maximum
value minus the minimum. When outliers are present, they may inflate the
range. In our income example, the range would be
$4000000-30000=3,970,000$, which is not representative of the spread of
the majority of incomes. To reduce the effect of outliers on the measure
of dispersion, the interquartile range is often used. It is defined as
the third quartile minus the first quartile.
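The contrast between the two measures can be sketched on a small data set with one outlier. The values below are hypothetical incomes in thousands of dollars (not the five incomes used earlier), and `quartiles` is a helper name introduced here that uses the median-of-halves rule.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(lower), statistics.median(upper)


# Hypothetical incomes in thousands of dollars, with one extreme value.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 4000]

data_range = max(incomes) - min(incomes)  # dominated by the outlier
q1, q3 = quartiles(incomes)
iqr = q3 - q1                             # spread of the middle half of the data

print(data_range, q1, q3, iqr)
```

Here the range is driven almost entirely by the single extreme value, while the interquartile range reflects only the middle half of the data.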

The most commonly used measures of dispersion for numerical data are the
variance and its square root, the standard deviation. The variance is
essentially the average squared difference between the data and the mean,
with $n-1$ rather than $n$ in the denominator.
Again, let $X_1, X_2, \ldots, X_n$ be the data values whose variance you
want to compute. The formula for the variance is given by
$S^2 = \frac{\sum_{i=1}^n (X_i  - \bar{X})^2}{n-1}.$ The standard
deviation is the square root of the variance.
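As a sketch, the formula can be applied step by step and checked against `statistics.variance` and `statistics.stdev`, which also use the $n-1$ denominator. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                                  # X-bar
squared_deviations = [(x - mean) ** 2 for x in data]  # (X_i - X-bar)^2
sample_variance = sum(squared_deviations) / (n - 1)   # S^2, divides by n - 1
sample_std = math.sqrt(sample_variance)               # S

# The standard library gives the same sample variance and standard deviation.
library_variance = statistics.variance(data)
library_std = statistics.stdev(data)

print(sample_variance, sample_std)
```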

## Sampling

The descriptive statistics discussed here all assume that the data we
have are a *random sample* from some larger population. The
population mean, $\mu$, and the population variance, $\sigma^2$, are
unknown, and the sample is typically taken to gain information about
them. The population mean and variance are *parameters*, while the
sample mean ($\bar{X}$) and sample variance ($S^2$) are called
*statistics*. Since the sample mean and sample variance are computed
from a random sample from the population, each time we take a different
sample we expect to get different values of the sample mean and sample
variance. We would like to know how much, say, $\bar{X}$ would differ
across samples. The *standard error* measures the variability of a
statistic from sample to sample. For the sample mean, the variance of
$\bar{X}$ is $\sigma^2/n$: it is directly proportional to the population
variance, $\sigma^2$, and inversely proportional to the sample
size. So we can reduce the variation in $\bar{X}$ by increasing our
sample size, $n$. Since $\sigma^2$ is unknown, the standard error of
$\bar{X}$ is estimated by
$\displaystyle \sqrt{\frac{S^2}{n}}.$
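The sketch below computes the estimated standard error for one sample and then uses a small simulation to illustrate how $\bar{X}$ varies across samples. The normal population, its parameters, the seed, and the function name `standard_error` are all assumptions made for illustration.

```python
import math
import random
import statistics


def standard_error(sample):
    """Estimate the standard error of the sample mean as sqrt(S^2 / n)."""
    return math.sqrt(statistics.variance(sample) / len(sample))


# Hypothetical data values.
data = [2, 4, 4, 4, 5, 5, 7, 9]
se_data = standard_error(data)

# Simulation: draw many samples from an assumed normal population with
# mean 0 and standard deviation sigma, and record each sample mean.
rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 2.0
n = 25
sample_means = [
    statistics.fmean([rng.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(2000)
]

observed_spread = statistics.stdev(sample_means)  # spread of X-bar across samples
theoretical_spread = sigma / math.sqrt(n)         # sigma / sqrt(n)

print(se_data, observed_spread, theoretical_spread)
```

The standard deviation of the simulated sample means should be close to $\sigma/\sqrt{n}$, which is the quantity the standard error estimates from a single sample.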