# Descriptive Statistics


![](descriptive_stats.png)


## Statistics

Statistics is a set of methods and measures for collecting, cleaning, and analyzing data and for making quantitative statements about the population the data was drawn from.

## Population

A population is a group of "all individuals, objects, or measurements whose properties are being studied" - https://openstax.org/books/introductory-statistics/pages/1-key-terms

## Sample

A sample is a subset of the population. Preferably, it is a random subset.

## Variable

"A characteristic of interest for each person or object in a population" - https://openstax.org/books/introductory-statistics/pages/1-key-terms

## Random Variable

A variable whose value is determined by a random process.

## Descriptive Statistics

- A set of methods and measures to describe a dataset.

- There is no uncertainty in descriptive statistics! It is not based on any theory of probability.

Before we dive into the most widely used measures, let's have a look at the type of variables that we encounter in datasets.

Image on: https://bolt.mph.ufl.edu/6050-6052/unit-4/

### Categorical, Ordinal and Metric Variables

#### Categorical Variables

The values of categorical variables (also called nominal variables) do not have a natural ordering.

#### Ordinal Variables

The values of ordinal variables have a natural ordering. However, the difference between the values is not measurable.

#### Metric Variables

The values of metric variables have a natural ordering and the distance between two values is measurable.

### Discrete vs. Continuous Variables

#### Discrete Variables

The values of a discrete variable are countable. They can be finite or infinite.

#### Continuous Variables

The values of a continuous variable are uncountable.
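The variable types above can be mirrored in pandas. The following is a minimal sketch on hypothetical toy data (the column names and values are invented for illustration), assuming that unordered and ordered `pd.Categorical` columns are used for categorical and ordinal variables:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df_types = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],   # categorical (nominal)
    "size": ["small", "large", "medium"],        # ordinal
    "body_mass_g": [3750.0, 5000.0, 3800.0],     # metric, continuous
    "n_eggs": [2, 1, 2],                         # metric, discrete
})

# Categorical variable: categories without an ordering
df_types["species"] = pd.Categorical(df_types["species"])

# Ordinal variable: categories with an explicit ordering
df_types["size"] = pd.Categorical(
    df_types["size"], categories=["small", "medium", "large"], ordered=True
)

# Ordered categoricals support comparisons and min/max
largest = df_types["size"].max()
bigger_than_small = (df_types["size"] > "small").tolist()
```

Note that differences such as "large minus small" are still undefined here, which matches the definition of an ordinal variable.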

---

# Challenge! Match each statistical method with its pandas function

In [9]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

### calculates median of a column

In [None]:
pd.Series.median()

### calculates arithmetic mean of a column

In [None]:
pd.Series.mean()

### calculates a (relative) contingency/cross table for two columns

In [None]:
pd.crosstab()

### calculates variance of a column

In [None]:
pd.Series.var()

### creates a (relative) frequency table for a column

In [None]:
pd.Series.value_counts()

### calculates quantiles of a column

In [None]:
pd.Series.quantile()

### calculates correlation between two columns

In [None]:
pd.Series.corr()

### calculates the minimum of a column

In [None]:
pd.Series.min()

### calculates various descriptive statistics for each column

In [None]:
pd.DataFrame.describe()
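The functions listed above are methods, so in practice they are called on an actual Series or DataFrame. A small sketch on hypothetical toy data (the frame `toy` and its columns are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "flag": ["yes", "no", "yes", "yes", "no"],
    "x": [1.0, 2.0, 3.0, 4.0, 10.0],
    "y": [2.0, 4.0, 6.0, 8.0, 20.0],
})

median_x = toy["x"].median()                     # 3.0
mean_x = toy["x"].mean()                         # 4.0
var_x = toy["x"].var()                           # sample variance (divides by n - 1)
min_x = toy["x"].min()                           # 1.0
quartiles_x = toy["x"].quantile([0.25, 0.5, 0.75])
corr_xy = toy["x"].corr(toy["y"])                # y is exactly 2 * x here

# normalize=True turns absolute counts into relative frequencies
rel_freq = toy["group"].value_counts(normalize=True)

# normalize="all" divides every cell by the total number of rows
cross = pd.crosstab(toy["group"], toy["flag"], normalize="all")

# describe() summarizes the numeric columns by default
summary = toy.describe()
```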

---

### Measures of Central Tendency/Location

Measures of central tendency answer the question of what a typical value of the variable is.

#### Mean

- Arithmetic mean: 
$\frac{\sum_{i=1}^n x_i}{n}$


- Weighted arithmetic mean - average of grouped/weighted series
$\frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$


- Geometric mean - average of multiplicative series
$\sqrt[n]{\prod_{i=1}^n x_i}$


Where $w_i$ is the weight assigned to the $i$-th observation of $x$.
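The three means can be computed with NumPy as a quick check. This is a minimal sketch with hypothetical values and weights; `np.average` divides the weighted sum by the sum of the weights, and the geometric mean is computed via logarithms, which assumes all values are positive:

```python
import numpy as np

x = np.array([2.0, 4.0, 8.0])   # hypothetical observations
w = np.array([1.0, 1.0, 2.0])   # hypothetical weights

# Arithmetic mean: sum of the values divided by their count
arithmetic_mean = x.sum() / len(x)          # 14 / 3

# Weighted arithmetic mean: weighted sum divided by the sum of the weights
weighted_mean = np.average(x, weights=w)    # (2 + 4 + 16) / 4 = 5.5

# Geometric mean: n-th root of the product, computed on the log scale
geometric_mean = np.exp(np.log(x).mean())   # cube root of 64 = 4
```

The geometric mean is the natural choice for multiplicative series such as growth factors, where the arithmetic mean would overstate the average growth.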

**Why do we concern ourselves with different concepts of the mean?**

#### Median

The value that divides the ordered sample into two groups of equal size: half of the observations are at most as large as the median, and half are at least as large.

#### Mode

The mode of a variable is the value that occurs most often in the dataset.

**Why do we concern ourselves with different concepts of centrality?**

http://krspiced.pythonanywhere.com/chapters/appendix/ethics_in_data_science/statistical_pitfalls/README.html

If the distribution is symmetric and has a single peak (unimodal), the mean, median, and mode coincide.
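For a skewed sample the measures of central tendency drift apart. A small sketch on hypothetical data (the values are invented for illustration), where a single large observation pulls the mean upward while the median and mode stay put:

```python
import pandas as pd

# Hypothetical right-skewed sample, invented for illustration only
incomes = pd.Series([20, 20, 25, 30, 35, 40, 300])

mean_income = incomes.mean()       # 470 / 7, roughly 67.1
median_income = incomes.median()   # 30.0
mode_income = incomes.mode()       # Series holding every most frequent value

print(mean_income, median_income, mode_income.tolist())
```

`Series.mode()` returns a Series rather than a single number because several values can be tied for most frequent.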

### Measures of Dispersion (or Variability)

Measures of dispersion describe how far the values of the variable spread around the measures of central tendency.

#### Median Absolute Deviation (MAD)

median($|x_i - \tilde{x}|$), where $\tilde{x}$ is the median of $x$.
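A minimal NumPy sketch of the MAD, assuming the common definition that takes absolute deviations from the median (the sample values are hypothetical). It also shows why the MAD is considered robust: a single extreme value barely affects it, while the standard deviation grows considerably.

```python
import numpy as np

# Hypothetical sample with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

center = np.median(x)                   # 3.0
abs_deviations = np.abs(x - center)     # [2, 1, 0, 1, 97]
mad = np.median(abs_deviations)         # 1.0

# For comparison: the standard deviation is dominated by the extreme value
std = x.std()
```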

#### Variance

$\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n}$

#### Standard Deviation

$\sqrt{Variance}$
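The variance formula above divides by $n$. As a sketch of how this relates to the library defaults (sample values are hypothetical): `np.var` divides by $n$ by default, whereas `pd.Series.var` divides by $n - 1$ unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with mean 5
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance exactly as in the formula: mean squared deviation from the mean
variance_by_hand = ((x - x.mean()) ** 2).mean()   # 32 / 8 = 4.0
std_by_hand = np.sqrt(variance_by_hand)           # 2.0

variance_numpy = np.var(x)                        # divides by n
variance_pandas = pd.Series(x).var()              # divides by n - 1: 32 / 7
variance_pandas_n = pd.Series(x).var(ddof=0)      # divides by n again
```

The difference between the two conventions shrinks as the sample grows, but it is worth checking which one a function uses before comparing results.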

#### Quantile

A generalization of the concept of the median. The q-quantiles of a variable are the values that divide the ordered sample into q groups of equal size. One commonly used case is the 4-quantiles, or quartiles, which divide the sample into 4 groups of equal size.

Other commonly used quantiles are quintiles (5-quantiles) and percentiles (100-quantiles).

#### Minimum

#### Maximum

#### Range

$Maximum - Minimum$

#### Interquartile Range (IQR)

$Q_{0.75} - Q_{0.25}$
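Quartiles, range, and IQR can be computed together. A short sketch on hypothetical data; note that `Series.quantile` interpolates linearly between observations by default, so other tools may give slightly different quartiles on small samples:

```python
import pandas as pd

# Hypothetical sample, invented for illustration only
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

quartiles = s.quantile([0.25, 0.5, 0.75])   # 3.0, 5.0, 7.0

value_range = s.max() - s.min()             # 9 - 1 = 8
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]   # 7 - 3 = 4
```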

#### Boxplot

A visual representation of the variability.
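A boxplot can be drawn with matplotlib as sketched below (the data is hypothetical). With matplotlib's default settings, the box spans the first to third quartile, the line inside marks the median, and points further than 1.5 times the IQR from the box are drawn individually as outliers.

```python
import matplotlib.pyplot as plt

# Hypothetical sample with one value far from the rest
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

fig, ax = plt.subplots()
box = ax.boxplot(data)   # returns a dict of the drawn line objects
ax.set_ylabel("value")
ax.set_title("Boxplot of a hypothetical sample")

# The drawn elements can be inspected, e.g. the median line and the outliers
median_line = box["medians"][0].get_ydata()
outliers = box["fliers"][0].get_ydata()
```

In a notebook the figure is typically rendered below the cell; when running as a script, call `plt.show()` at the end.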

### Measures of Symmetry and Curvature

#### Skewness

$\frac{1}{n}*\frac{\sum_{i=1}^n (x_i - \overline{x})^3}{Variance^{3/2}}$

#### Kurtosis

$\frac{1}{n}*\frac{\sum_{i=1}^n (x_i - \overline{x})^4}{Variance^2}$
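Both formulas can be evaluated directly with NumPy. This is a minimal sketch on hypothetical data that follows the formulas above literally, with the variance computed by dividing by $n$:

```python
import numpy as np

# Hypothetical right-skewed sample with mean 4
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

deviations = x - x.mean()
variance = (deviations ** 2).mean()                      # 50 / 5 = 10

# Skewness: third central moment divided by variance^(3/2)
skewness = (deviations ** 3).mean() / variance ** 1.5    # 36 / 10^1.5

# Kurtosis: fourth central moment divided by variance^2
kurtosis = (deviations ** 4).mean() / variance ** 2      # 278.8 / 100
```

A positive skewness indicates a longer right tail. The pandas methods `Series.skew()` and `Series.kurt()` apply bias corrections, and `kurt()` reports excess kurtosis (kurtosis minus 3), so their results will generally differ from these values.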

---

### Try to apply this to a new dataset
* Go through each feature and look at its descriptive statistics

In [5]:
path = 'path_to_penguins'
df = pd.read_csv(path, sep=';')
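One possible way to go through all features at once is sketched below. The helper `summarize_features` and the frame `demo` are hypothetical and only illustrate the idea; numeric columns are summarized with `describe()` and all other columns with relative frequency tables.

```python
import pandas as pd

def summarize_features(frame):
    """Return descriptive statistics for numeric and non-numeric columns.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Metric variables: count, mean, std, min, quartiles, max
    numeric_summary = frame.select_dtypes(include="number").describe()

    # Categorical variables: relative frequency table per column
    categorical_summary = {
        col: frame[col].value_counts(normalize=True)
        for col in frame.select_dtypes(exclude="number").columns
    }
    return numeric_summary, categorical_summary

# Hypothetical stand-in data; replace with the loaded DataFrame
demo = pd.DataFrame({"kind": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
numeric_summary, categorical_summary = summarize_features(demo)
```

With the dataset loaded above, the call would be `summarize_features(df)`.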

---