# Lesson 3

### Descriptive statistics. Qualitative and quantitative characteristics of a population. 

#### Graphical representation of data

Let's look at descriptive statistical characteristics using a dataset with data on hockey players as an example.

Let's use the data from the article 
<a href='https://habr.com/post/301340/'>"Height of hockey players: analyzing data from all world championships in this century"</a>.

Import the libraries and load the data:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('csv/hockey_players.csv', encoding='cp1251', parse_dates=['birth'])

We'll remove the duplicates:

In [None]:
df = df.drop_duplicates(['name', 'birth'])

This dataset could be considered a **general population** (the complete set of objects that share the characteristic under study) if it contained data on every hockey player who matches that characteristic.

If we randomly select a certain number of examples (observations) from the population, the resulting set is called a sample.

A **sample** is a randomly selected part of a general population.
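As an illustration, drawing a sample from a population can be sketched with the pandas `sample` method. The `population` frame below is made-up data standing in for the hockey dataset, and the sample size of 100 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical population: heights (cm) of 1,000 made-up players
rng = np.random.default_rng(seed=0)
population = pd.DataFrame(
    {'height': rng.normal(loc=183, scale=5, size=1000).round()}
)

# Simple random sample of 100 rows, drawn without replacement
sample = population.sample(n=100, random_state=0)

print(len(population), len(sample))
print(population['height'].mean(), sample['height'].mean())
```

The two means are usually close but not identical, because the sample carries only part of the information in the population.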

One of the basic concepts in probability theory is the **mathematical expectation** (expected value). It is denoted $M(X)$ (in statistics, it is denoted $\mu$).

The mathematical expectation is the mean value of a random variable: the value that the average of its observations approaches as the number of observations (also called measurements or trials) tends to infinity.

The arithmetic mean of a finite number of observations of a random variable is usually called the estimate of the mathematical expectation. As the number of trials of a stationary random process tends to infinity, this estimate tends to the mathematical expectation.
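The convergence of the estimate can be observed in a small simulation. This is only a sketch: the normal distribution, its expectation of 183 and its standard deviation of 5 are assumptions chosen for illustration, not properties of the hockey dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
true_mean = 183.0  # assumed mathematical expectation

# 100,000 independent draws from the assumed distribution
draws = rng.normal(loc=true_mean, scale=5.0, size=100_000)

# Estimate the expectation from progressively more observations
for n in (10, 1_000, 100_000):
    estimate = draws[:n].mean()
    print(n, round(estimate, 3), round(abs(estimate - true_mean), 3))
```

With more observations the estimate tends to lie closer to the assumed expectation, although for any single run the error does not have to shrink at every step.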

### Find the average height of the hockey players from the dataset using the formula:

$$M(X) = \frac{1}{n} \sum\limits_{i=1}^{n} x_i$$

where $x_i$ is the height of the $i$-th player and $n$ is the number of observations.

In [None]:
mean_height = df['height'].sum() / df['height'].count()
mean_height

Let's find the same value using the **mean** method:

In [None]:
df['height'].mean()

The values are equal (only the number of decimal places differs). The sample mean is an unbiased estimate of the mathematical expectation.

Another important characteristic of a sample is the **standard deviation**. It shows how widely the observations are scattered around the mean.

It can be calculated using the formula:

$$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}}$$

Let's calculate the standard deviation of the hockey players' height:

In [None]:
height_std = np.sqrt(((df['height'] - df['height'].mean())**2).sum() / df['height'].count())
height_std

Let's calculate the standard deviation again, but this time using **std**:

In [None]:
df['height'].std(ddof=0)

The **variance** is equal to the standard deviation squared:

$$\sigma^2 = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}$$

Let's calculate the variance of the hockey players' height:

In [None]:
height_variance = ((df['height'] - df['height'].mean())**2).sum() / df['height'].count()
height_variance

This variance estimate is **biased**. The following formula shows how an **unbiased variance estimate** is calculated:

$$\sigma^2_{st.} = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}$$

The difference between the biased and the unbiased variance estimate is that the sum of the squared differences between each value and the mean is divided not by $n$, but by $n - 1$.
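Why dividing by $n$ gives a biased estimate can be seen in a simulation. The sketch below assumes a normal distribution with a known variance of 25 and many small samples of size 5; all of these numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_var = 25.0  # assumed variance of the population
n = 5            # size of each small sample

# 200,000 independent samples, one per row
samples = rng.normal(loc=183.0, scale=np.sqrt(true_var), size=(200_000, n))

# Average each kind of estimate over all samples
biased_avg = samples.var(axis=1, ddof=0).mean()    # divisor n
unbiased_avg = samples.var(axis=1, ddof=1).mean()  # divisor n - 1

print(biased_avg, unbiased_avg)
```

On average the estimate with divisor $n$ comes out near $\frac{n-1}{n}\sigma^2 = 20$, below the assumed variance, while the estimate with divisor $n - 1$ averages close to 25.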

Let's calculate the unbiased variance of hockey players' height:

In [None]:
height_variance2 = ((df['height'] - df['height'].mean())**2).sum() / (df['height'].count() - 1)
height_variance2

The **var** method can be used to calculate a biased variance estimate:

In [None]:
df['height'].var(ddof=0)

The unbiased estimate is calculated in the same way, with **ddof** set to 1:

In [None]:
df['height'].var(ddof=1)

The **ddof** (Delta Degrees of Freedom) argument specifies how much to subtract from the number of observations $n$ in the divisor of the variance formula.
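One detail worth checking is the default value of **ddof**, which differs between libraries: pandas `var` and `std` default to `ddof=1`, while NumPy `np.var` and `np.std` default to `ddof=0`. The short sketch below uses five made-up heights whose sum of squared deviations is 74, so the two estimates are 14.8 and 18.5.

```python
import numpy as np
import pandas as pd

heights = pd.Series([178, 180, 183, 185, 189])  # made-up heights in cm

n = heights.count()
sum_sq = ((heights - heights.mean()) ** 2).sum()  # equals 74 for this data

biased = sum_sq / n          # divisor n, same as ddof=0
unbiased = sum_sq / (n - 1)  # divisor n - 1, same as ddof=1

# NumPy defaults to ddof=0, pandas defaults to ddof=1
print(biased, heights.var(ddof=0), np.var(heights.to_numpy()))
print(unbiased, heights.var(), np.var(heights.to_numpy(), ddof=1))
```

Passing **ddof** explicitly, as in the cells above, avoids depending on either default.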

The **mode** is the most frequent value in the sample. For a discrete distribution, a mode is any value $a_i$ whose probability $p_i$ is greater than the probabilities of the neighbouring values.

If two or more values in a sample occur with the same (maximum) frequency, the sample has two or more modes and is called **multimodal**.

A mode of an absolutely continuous distribution is any point of local maximum of the probability density.
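For a sample, the mode can be found with the pandas `mode` method, which returns every value that reaches the maximum frequency. The heights below are made up so that two values tie.

```python
import pandas as pd

heights = pd.Series([180, 183, 183, 185, 185, 188])  # made-up heights in cm

# Frequency of each distinct value
counts = heights.value_counts()

# All values that occur with the maximum frequency, in ascending order
modes = heights.mode()

print(counts)
print(modes.tolist())
```

Here both 183 and 185 occur twice, so the sample has two modes.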

The **median** is the value that divides the ordered sample into two parts so that half (50%) of the values lie below it. That is, half of the sample values are greater than the median, and half are not greater.

The **first quartile** is the value that 25% of the observations in the sample do not exceed.

The **second quartile** is the same as the median.

The **third quartile** is the value that 75% of the observations in the sample do not exceed.

A **quantile** is the same concept in general form: its level can be any value between 0 and 100%.

For example, the 40% quantile is the value that 40% of the observations do not exceed.

A **percentile** is a special case of a quantile whose level is a whole number of percent.

For example, the 40% quantile is the same as the 40th percentile.

A **decile** is a special case of a quantile whose level is a multiple of 10 percent.

For example, the 70% quantile is also the 7th decile.

The **interquartile distance** (interquartile range) is the difference between the 3rd and the 1st quartile.
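These quantities can be computed with the pandas `quantile` method, which takes the level as a fraction between 0 and 1 and by default interpolates linearly between neighbouring observations. The nine heights below are made up for illustration.

```python
import pandas as pd

# Made-up heights in cm, already sorted for readability
heights = pd.Series([176, 178, 180, 182, 183, 185, 186, 188, 191])

q1 = heights.quantile(0.25)      # first quartile
median = heights.quantile(0.5)   # second quartile, same as heights.median()
q3 = heights.quantile(0.75)      # third quartile
iqr = q3 - q1                    # interquartile distance

p40 = heights.quantile(0.4)      # 40% quantile, i.e. the 40th percentile
d7 = heights.quantile(0.7)       # 70% quantile, i.e. the 7th decile

print(q1, median, q3, iqr)
print(p40, d7)
```

For this data the quartiles fall exactly on observations (180, 183 and 186), so the interquartile distance is 6. The 40% and 70% quantiles fall between observations and are interpolated.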

##### Graphical representation of data

To get an idea of the distribution of the hockey players' height, let's build a histogram. By default the data is divided into 10 bins (the **bins** argument). The height of each bar corresponds to the number of observations in that bin:

In [None]:
plt.hist(df['height'])
plt.show()

Let's plot the histogram again, passing 20 as the **bins** argument to get a more detailed picture:

In [None]:
plt.hist(df['height'], bins=20)
plt.show()
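To see the numbers behind the bars, the same binning can be reproduced with `np.histogram`, which `plt.hist` uses to bin the data. The sketch below runs on made-up heights rather than on the hockey dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
heights = rng.normal(loc=183, scale=5, size=500).round()  # made-up heights in cm

# 10 equal-width bins between the minimum and the maximum value
counts, edges = np.histogram(heights, bins=10)
print(counts)  # number of observations in each bin
print(edges)   # 11 bin boundaries for 10 bins

# Narrower bins give a more detailed picture of the same data
counts20, edges20 = np.histogram(heights, bins=20)
print(counts20)
```

Whatever the number of bins, the counts always add up to the number of observations.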

Another way to represent data graphically is to draw a **boxplot**. To do this, import the **seaborn** library:

In [None]:
import seaborn as sns

In [None]:
# Passing the column as y draws the box vertically
sns.boxplot(y=df['height'])
plt.show()

In this plot, the height values run along the vertical axis. The line in the middle of the box marks the median, the bottom edge of the box is the 1st quartile, and the top edge is the 3rd quartile.

"Whiskers" extend up and down from the box. The cap at the end of the lower whisker lies at most 1.5 interquartile distances below the 1st quartile, and the cap at the end of the upper whisker lies at most 1.5 interquartile distances above the 3rd quartile. Each cap is drawn at the most extreme observation within that range.

Points beyond the whiskers mark outliers in the data: atypical observations that may also be errors.
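The boxplot elements can also be computed by hand. The sketch below assumes the common convention in which the whiskers end at the most extreme observations lying within 1.5 interquartile distances of the box. The heights are made up, with two deliberately extreme values.

```python
import pandas as pd

# Made-up heights in cm; 165 and 205 are deliberately extreme
heights = pd.Series([165, 176, 178, 180, 182, 183, 185, 186, 188, 191, 205])

q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Limits beyond which an observation counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside = heights[(heights >= lower_fence) & (heights <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything outside the limits is drawn as a separate point
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]

print(q1, q3, iqr)
print(lower_whisker, upper_whisker)
print(outliers.tolist())
```

For this data the quartiles are 179 and 187, the limits are 167 and 199, the whiskers end at 176 and 191, and the values 165 and 205 are reported as outliers.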