# Tools and Methods of Data Analysis
## Session 2 - Part 1

Niels Hoppe <niels.hoppe.extern@srh.de>

### Agenda 01/06

* Empirical Distributions
* Statistical Parameters
  - Measures of Central Tendency
  - Measures of Dispersion

### Empirical Distributions

* Qualitative variables
* Discrete quantitative variables
* Continuous quantitative variables

#### Distributions of Qualitative Variables

Variables on nominal or ordinal scale.

How often does a specific value / category occur?
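
A frequency table answers this question directly. The following is a minimal sketch on a made-up list of categories (the values are illustrative assumptions, not course data). `Series.value_counts` returns absolute counts and, with `normalize=True`, relative frequencies.

```python
import pandas as pd

# Hypothetical nominal data, chosen only for illustration
species = pd.Series(['bird', 'mammal', 'arthropod', 'bird'], name='species')

# Absolute frequencies: how often does each category occur?
abs_freq = species.value_counts()

# Relative frequencies: share of each category (shares sum to 1)
rel_freq = species.value_counts(normalize=True)

print(abs_freq)
print(rel_freq)
```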

#### Distributions of Discrete Variables

#### Distributions of Continuous Variables

### Caveats for the Presentation of Data

### Statistical Parameters

* Measures of Central Tendency
* Measures of Dispersion

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame([('bird', 2, 2),
                   ('mammal', 4, np.nan),
                   ('arthropod', 8, 0),
                   ('bird', 2, np.nan)],
                  index=('falcon', 'horse', 'spider', 'ostrich'),
                  columns=('species', 'legs', 'wings'))

### Measures of Central Tendency

* Mean, weighted mean
* Geometric mean
* Harmonic mean
* Median
* Mode

#### Mean, weighted mean



* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html
* https://datagy.io/pandas-weighted-average/

In [34]:
from IPython.display import display, Math
display(Math(rf'\frac{{\sum_{{i=1}}^{{N}} v_i}}{{N}}'))

$$\frac{\sum_{i=1}^{N} v_i}{N}$$

In [33]:
from IPython.display import display, Math
display(Math(rf'\frac{{\sum_{{i=1}}^{{N}} v_i \cdot w_i}}{{\sum_{{i=1}}^{{N}} w_i}}'))

$$\frac{\sum_{i=1}^{N} v_i \cdot w_i}{\sum_{i=1}^{N} w_i}$$

In [11]:
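# Use a separate name so the animal DataFrame `df` defined earlier stays available
weights_df = pd.DataFrame([(1,   0.5),
                           (0.5, 1),
                           (2,   1)], columns=('value', 'weight'))

# Unweighted mean of the values
mean = weights_df.loc[:, 'value'].mean()
# Weighted mean: sum(v_i * w_i) / sum(w_i)
wmean = np.average(a=weights_df['value'], weights=weights_df['weight'])
wmean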

1.2

#### Geometric Mean

* https://stackoverflow.com/questions/56465969/geometric-mean-in-dataframe
* https://www.geeksforgeeks.org/find-the-geometric-mean-of-a-given-pandas-dataframe/
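
The geometric mean of N positive values is the N-th root of their product, which equals the exponential of the arithmetic mean of their logarithms. The sketch below uses that identity with NumPy; the growth factors are made-up illustrative values.

```python
import numpy as np
import pandas as pd

# Hypothetical growth factors (all values must be positive)
factors = pd.Series([1.0, 2.0, 4.0])

# Geometric mean: exp of the arithmetic mean of the logarithms,
# equivalent to the N-th root of the product of all values
gmean = float(np.exp(np.log(factors).mean()))

print(gmean)
```

Working on logarithms avoids overflow when the product of many values becomes very large.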

#### Harmonic Mean

* https://stackoverflow.com/questions/67693636/how-to-calculate-harmonic-mean-in-pandas
* https://www.geeksforgeeks.org/python-statistics-harmonic_mean/
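
The harmonic mean is the number of values divided by the sum of their reciprocals. It is the appropriate average for rates such as speeds over equal distances. The following sketch computes it by hand on a pandas Series and cross-checks it against `statistics.harmonic_mean` from the standard library; the speeds are made-up illustrative values.

```python
import statistics

import pandas as pd

# Hypothetical speeds in km/h over two legs of equal distance
speeds = pd.Series([40.0, 60.0])

# Harmonic mean: N divided by the sum of the reciprocals
hmean_manual = len(speeds) / (1.0 / speeds).sum()

# Cross-check with the standard library implementation
hmean_stdlib = statistics.harmonic_mean(speeds.tolist())

print(hmean_manual, hmean_stdlib)
```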

#### Median

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html

In [5]:
df.loc[:,['legs', 'wings']].median()

legs     3.0
wings    1.0
dtype: float64

#### Mode

The **mode** of a set of values is the **value that appears most often**. A data set has several modes if multiple values share the highest frequency.

* https://en.wikipedia.org/wiki/Mode_(statistics)
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html

In [3]:
df.mode()

|   | species | legs | wings |
|---|---------|------|-------|
| 0 | bird    | 2.0  | 0.0   |
| 1 | NaN     | NaN  | 2.0   |


### Measures of Dispersion

* Range
* Quantiles, Quartiles, Interquartile Range
* Variance
* Standard Deviation
* Coefficient of Variation

#### Range

The **range** of a set of data is the difference between the largest and the smallest value.

In [None]:
# 'column' is a placeholder for a numeric column; avoid shadowing the built-in `range`
value_range = df['column'].max() - df['column'].min()

#### Quantiles, Quartiles, Interquartile Range

Quantile:

* A value that is greater than or equal to a specified percentage of all values (inclusive definition).
* A value that is strictly greater than a specified percentage of all values (exclusive definition).

Quartiles, deciles, and percentiles are special cases of quantiles:

* Quartile: Any of three values dividing the data into four equal parts.
* Decile: Any of nine values dividing the data into ten equal parts.
* Percentile: Any of 99 values dividing the data into 100 equal parts.

The following equalities hold:

* 25th percentile = 1st quartile (Q1)
* 50th percentile = 2nd quartile (Q2) = 5th decile (D5) = median
* 75th percentile = 3rd quartile (Q3)

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html

In [None]:
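# 'points' is a placeholder for a numeric column of df
# 25th and 75th percentile via NumPy (percentages from 0 to 100)
p25, p75 = np.percentile(df['points'], [25, 75])

# The three quartiles via pandas (fractions from 0 to 1)
quartiles = df['points'].quantile([0.25, 0.5, 0.75])

quartiles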
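
The equalities between percentiles, quartiles, deciles, and the median can be checked on a small example. This is a sketch on a made-up `points` series (an assumption for illustration); both `Series.quantile` and `np.percentile` use linear interpolation by default, so their results should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical 'points' values, used only for illustration
points = pd.Series([3, 7, 8, 12, 13, 14, 18, 21], name='points')

# Quartiles via pandas (fractions) and percentiles via NumPy (percentages)
q1, q2, q3 = points.quantile([0.25, 0.5, 0.75])
p25, p50, p75 = np.percentile(points, [25, 50, 75])

# 5th decile, expressed as the 0.5 quantile
d5 = points.quantile(0.5)

print(np.isclose(q1, p25), np.isclose(q3, p75))
print(np.isclose(q2, points.median()), np.isclose(q2, d5))
```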

#### Interquartile Range

In [None]:
q1, q3 = np.percentile(df['points'], [25, 75])
iqr = q3 - q1

#### Variance

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html

In [None]:
# 'column' is a placeholder; pandas computes the sample variance (ddof=1) by default
var = df['column'].var()
var
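
One detail worth knowing: pandas and NumPy use different defaults. `Series.var` divides by N - 1 (sample variance, `ddof=1`), whereas `np.var` divides by N (population variance, `ddof=0`). The sketch below shows both on made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, used only for illustration (mean is 5)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample variance: sum of squared deviations divided by N - 1 (pandas default)
sample_var = values.var()

# Population variance: sum of squared deviations divided by N
population_var = values.var(ddof=0)

# NumPy defaults to the population variance (ddof=0)
numpy_var = np.var(values.to_numpy())

print(sample_var, population_var, numpy_var)
```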

#### Standard Deviation

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
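
The standard deviation is the square root of the variance. A minimal sketch on made-up values follows; as with `var`, pandas uses `ddof=1` by default, and `ddof=0` gives the population standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, used only for illustration (mean is 5)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample standard deviation (pandas default, ddof=1)
sample_std = values.std()

# Population standard deviation (ddof=0)
population_std = values.std(ddof=0)

# The standard deviation equals the square root of the variance
matches_variance = bool(np.isclose(sample_std, np.sqrt(values.var())))

print(sample_std, population_std, matches_variance)
```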

#### Coefficient of Variation

$$CV = \frac{\sigma}{\mu}$$
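
The coefficient of variation is the standard deviation divided by the mean, which makes dispersion comparable across data sets with different scales. pandas has no dedicated method for it, so the sketch below computes the ratio directly on made-up values. Using the population standard deviation (`ddof=0`) here is an assumption; the sample version (`ddof=1`) is equally common.

```python
import pandas as pd

# Hypothetical measurements, used only for illustration (mean is 5)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Coefficient of variation: standard deviation relative to the mean
cv = values.std(ddof=0) / values.mean()

print(cv)
```

The ratio is only meaningful for data on a ratio scale with a nonzero mean.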