# Statistics basics

Most tools of the trade of a data scientist are based on, or stem directly from, the mathematical field of statistics. We will therefore review its basic concepts now. They are the foundation for a deeper understanding of the algorithms and ideas that follow.

To freshen things up, we will explore the theoretical concepts using example data. The _Iris data set_ is a popular choice in machine learning tutorials.

First some imports.

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [None]:
plt.rcParams['figure.figsize'] = (16.0, 8.0)
plt.style.use('ggplot')

## The Iris data set

![Iris versicolor](https://upload.wikimedia.org/wikipedia/commons/d/db/Iris_versicolor_4.jpg)

> The Iris flower data set or "Fisher's Iris data set" is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_ as an example of linear discriminant analysis.
> 
> The data set consists of 50 samples from each of three species of Iris (_Iris setosa_, _Iris virginica_ and _Iris versicolor_). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
> &mdash; ["Iris flower data set," Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)

Conveniently, the data set is part of the Python machine learning library [**scikit-learn**](http://scikit-learn.org/stable/) (`sklearn`).

In [None]:
from sklearn import datasets

iris = datasets.load_iris()

df = pd.DataFrame(iris['data'])
df.columns = iris['feature_names']
df['target'] = iris.target

The data is loaded from the copy that ships with scikit-learn (no download is needed) and a `pandas.DataFrame` is constructed. The DataFrame columns contain the four features (_sepal length_, _sepal width_, _petal length_ and _petal width_) as well as a `target` column encoding the true Iris species.

In [None]:
df.columns

In [None]:
# Let's see
df.head()

The rows seem to be sorted. To get a better overview, we will look at a random sample of 5 rows.

In [None]:
df.sample(5)

## Have a look at the data

Whenever working with a new data set, we start by getting a good overview of the data and its characteristics. This can be done by looking at summary statistics or by visualizing the data; usually both approaches are combined.

### Descriptive statistics

Looking at a few key figures is an easy way to get to know a new data set. Pandas has everything we need on board: it is easy to calculate means, standard deviations and other basic statistical properties.

In [None]:
def analyse_column(df, column_name):
    # Pick a column
    column = df[column_name]
    
    # Get unit from column name
    s = column_name.split()
    name = ' '.join(s[:2])
    unit = s[2].strip('()')
    
    # Some key figures
    count = column.count()
    mean = column.mean()
    std = column.std()
    
    # Output
    print(
        f'The sample (size={count}) of data in column "{name}" has a mean of {mean:.2f} {unit} with a standard deviation of {std:.2f} {unit}'
    )

In [None]:
analyse_column(df, 'sepal length (cm)')

In [None]:
analyse_column(df, 'petal length (cm)')

The example includes the following statistical figures:

- **Sample size** `column.count()`
- **Arithmetic mean** `column.mean()`
- **Standard deviation** `column.std()`

#### Exercise

Define a new function using one or more of the following statistical figures:


- `column.sem()`
- `column.median()`
- `column.quantile()`
- `column.var()`
- `column.min()`
- `column.max()`

You can find the [documentation online](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics).

In [None]:
# Your code









The built-in function `pandas.DataFrame.describe()` calculates and shows some of the discussed figures.

In [None]:
df.describe()

### Add-on: Separating target classes

We modify the already defined function `analyse_column` to allow calculating the descriptive statistical figures for just one of the target classes (_i.e._ Iris species).

In [None]:
def analyse_column_kind(df, column_name, kind=None):
    if kind is not None:
        df = df[df['target'] == kind]
        print(f'Species: {kind}:')
    else:
        print('For all species:')
        
    analyse_column(df, column_name)

In [None]:
analyse_column_kind(df, 'petal length (cm)', 1)

In [None]:
for kind in df['target'].unique():
    analyse_column_kind(df, 'sepal length (cm)', kind)

### Discussing the results

The different species of flowers show differences in the distribution and key figures of the provided features. This allows a classification. To get an even better understanding of the data, we will now plot some histograms.

## Data visualization

Let's have a look at the feature `sepal length` and plot the sample distribution for all three classes in a histogram.

In [None]:
feature = 'sepal length (cm)'

labels = [f'Species {kind}' for kind in df['target'].unique()]
hists = [df[df['target'] == kind][feature] for kind in df['target'].unique()]
plt.hist(hists, alpha=0.8, bins=np.linspace(3, 9, 20))
plt.legend(labels)
plt.xlabel(feature)
plt.ylabel('Number of entries per bin');

#### Note

The parameter `bins` is already familiar. Until now we always passed an integer giving the total number of bins. Now we use NumPy's `np.linspace(3, 9, 20)` to create an array of 20 bin boundaries (i.e. 19 bins) that are uniformly distributed over the interval from 3 to 9.
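
The relation between the number of boundaries and the number of bins can be checked directly. This is a small sketch using the same `np.linspace(3, 9, 20)` call as the histogram cell; the array `values` is made up for illustration:

```python
import numpy as np

edges = np.linspace(3, 9, 20)                  # equally spaced boundaries, both ends included
values = np.array([4.9, 5.1, 5.8, 6.3, 7.0])   # made-up measurements in cm
counts, _ = np.histogram(values, bins=edges)

print(len(edges), len(counts))  # number of boundaries, number of bins
print(np.diff(edges)[0])        # width of a single bin
```

Because `n` boundaries always enclose `n - 1` bins, the bin width here is 6 / 19 cm rather than a round number.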

-------

## Samples and Distributions

The Iris data set has a **sample size** of 150. Selecting only data for one Iris species results in a sample size of 50.

Let's have a look at two different samples of randomly generated numbers:

In [None]:
# large sample with 20k items
big = np.random.normal(loc=0.0, scale=10.0, size=20000)

# small sample with 1000 items
small = np.random.normal(loc=0.0, scale=10.0, size=1000)

In [None]:
plt.hist(big, bins=np.linspace(-40, 40, 100), color='dodgerblue', alpha=0.5, density=False)
plt.hist(small, bins=np.linspace(-40, 40, 100), alpha=0.5, density=False)
plt.xlabel('Normal distribution')
plt.ylabel('Number of entries per bin');

While the histograms of the large and the small sample look very different at first glance, the underlying **distribution** is exactly the same.

In this case the data is normally distributed, or "it follows a normal distribution". This distribution is typical for many processes in nature. The mean of the distribution, the so-called _expectation value_, is also its most probable value, while the width of the normal distribution is described by the _standard deviation_. The normal distribution is a symmetric distribution.

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=1, sharey=True)
axes[0].hist(big, bins=np.linspace(-40, 40, 100), color='dodgerblue', alpha=0.5, density=True)
axes[1].hist(small, bins=np.linspace(-40, 40, 100), alpha=0.5, density=True)
axes[0].set_ylabel('Number of entries per bin (normalized)')
axes[0].set_xlabel('large sample')
axes[1].set_xlabel('small sample');

To better compare the two samples, we normalize the histograms. With `density=True`, each bin entry is divided by the total number of entries and by the bin width, so that the area under each histogram is 1.
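
What the normalization does can be reproduced by hand with `np.histogram`, which uses the same `density` convention as `plt.hist`. This is a sketch with a freshly generated, seeded sample (`sample` is not one of the arrays from above):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=10.0, size=1000)
edges = np.linspace(-40, 40, 100)
widths = np.diff(edges)

counts, _ = np.histogram(sample, bins=edges)
density, _ = np.histogram(sample, bins=edges, density=True)

# Manual normalization: divide by the number of binned entries and by the bin width
manual = counts / (counts.sum() * widths)

print(np.allclose(density, manual))
print((density * widths).sum())  # area under the normalized histogram
```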

Besides the **arithmetic mean**, there is another central value describing the distribution. The **median** is the "middle" value after sorting all values, i.e. the median splits the sample into two halves. While the mean is usually the more familiar measure, the median has the advantage of being robust against outliers in the data sample.

**Percentiles** are the generalization of this concept:

> A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found. &mdash; ["Percentile," Wikipedia](https://en.wikipedia.org/w/index.php?title=Percentile&oldid=825250895)

Thus the median is equivalent to the 50th percentile.
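
A quick check of this equivalence with NumPy, using a small made-up array (`values` is purely illustrative):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(values)
p50 = np.percentile(values, 50)   # 50th percentile
p20 = np.percentile(values, 20)   # 20th percentile, linearly interpolated by default

print(median, p50, p20)
```

Note that the single large value 100 has no influence on the median here.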

In [None]:
# we add some outliers
outlier = [45, 50, 48, 35, 35, 35, 35, 5000]
big_outlier = np.append(big, outlier)
small_outlier = np.append(small, outlier)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=1, sharey=True)
axes[0].hist(big_outlier, bins=np.linspace(-40,40,100), color='dodgerblue', alpha=0.5, density=True)
axes[1].hist(small_outlier, bins=np.linspace(-40,40,100), alpha=0.5, density=True)

axes[0].axvline(np.mean(big_outlier), color='black', linestyle='solid', linewidth=2, label='Arithmetic mean')
axes[1].axvline(np.mean(small_outlier), color='black', linestyle='solid', linewidth=2, label='Arithmetic mean')
axes[0].axvline(np.median(big_outlier), color='darkgreen', linestyle='dashed', linewidth=2, label='Median')
axes[1].axvline(np.median(small_outlier), color='darkgreen', linestyle='dashed', linewidth=2, label='Median')

axes[0].legend()
axes[1].legend()

axes[0].set_ylabel('Number of entries per bin (normalized)')
axes[0].set_xlabel('large sample with outliers')
axes[1].set_xlabel('small sample with outliers');
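
The effect visible in the plot can also be expressed in numbers. The sketch below regenerates a seeded sample of 1000 values (so it does not reuse `small` from above) and appends the same outliers:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=10.0, size=1000)
outlier = [45, 50, 48, 35, 35, 35, 35, 5000]
sample_outlier = np.append(sample, outlier)

# How far do the two central values move when the outliers are added?
mean_shift = abs(np.mean(sample_outlier) - np.mean(sample))
median_shift = abs(np.median(sample_outlier) - np.median(sample))

print(f'Shift of the mean:   {mean_shift:.2f}')
print(f'Shift of the median: {median_shift:.2f}')
```

The mean is dragged away mainly by the single value 5000, while the median only moves by a few ranks in the sorted sample.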

The **standard deviation** as well as the **variance** describe the width of a distribution. In broad distributions the data points show a larger spread around the central value. The variance is the mean squared distance between each value and the mean.
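
As a worked example, the variance can be computed by hand as the mean squared distance from the mean. The array `values` is made up. One detail worth knowing: NumPy's `var` and `std` divide by N by default (`ddof=0`), whereas the pandas methods used earlier divide by N - 1 (`ddof=1`).

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = values.mean()
squared_dist = (values - mean) ** 2

var_population = squared_dist.sum() / len(values)        # divide by N (NumPy default)
var_sample = squared_dist.sum() / (len(values) - 1)      # divide by N - 1 (pandas default)

print(var_population, np.var(values))
print(var_sample, np.var(values, ddof=1))
```

For large samples the difference between the two conventions becomes negligible.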

The standard deviation is calculated as the square root of the variance. In a normally distributed data sample approximately 68% of data points are within one standard deviation (also called $\sigma$ (sigma)) around the central value and approximately 95% of data points within two standard deviations (2$\sigma$).

The **standard error of the mean** is calculated as the standard deviation divided by the square root of the sample size. It is a measure of uncertainty, indicating how accurately the mean of the sample represents the mean of the distribution. Unlike the standard deviation, the standard error of the mean shrinks with increasing sample size.
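
The shrinking of the standard error with sample size can be illustrated with two seeded samples. This is a sketch; `standard_error` is a hypothetical helper and the samples are regenerated here rather than taken from `big` and `small`:

```python
import numpy as np

rng = np.random.default_rng(2)
big_sample = rng.normal(loc=0.0, scale=10.0, size=20000)
small_sample = rng.normal(loc=0.0, scale=10.0, size=1000)

def standard_error(sample):
    """Standard deviation (with ddof=1) divided by the square root of the sample size."""
    return np.std(sample, ddof=1) / np.sqrt(len(sample))

sem_big = standard_error(big_sample)
sem_small = standard_error(small_sample)

print(f'std: {big_sample.std():.2f} vs {small_sample.std():.2f}')
print(f'sem: {sem_big:.3f} vs {sem_small:.3f}')
```

Both samples have nearly the same standard deviation, but the mean of the larger sample is determined much more precisely.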

In [None]:
plt.hist(big, bins=np.linspace(-40, 40, 100), color='dodgerblue', alpha=0.5, density=False)

plt.axvline(big.mean(), color='black', linestyle='solid', linewidth=2)
plt.axvline((big.mean() + big.std()), color='darkgreen', linestyle='solid', linewidth=2, label=r'1$\sigma$ (~68%)')
plt.axvline((big.mean() - big.std()), color='darkgreen', linestyle='solid', linewidth=2)
plt.axvline((big.mean() + 2 * big.std()), color='darkgreen', linestyle='dashed', linewidth=2, label=r'2$\sigma$ (~95%)')
plt.axvline((big.mean() - 2 * big.std()), color='darkgreen', linestyle='dashed', linewidth=2)

plt.legend()
plt.xlabel('Normal distribution')
plt.ylabel('Number of entries per bin');
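
The 68% and 95% figures can be verified by simply counting. This is a minimal sketch with a seeded sample generated in the cell itself:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(loc=0.0, scale=10.0, size=20000)
mean, std = sample.mean(), sample.std()

# Fraction of data points within one and two standard deviations of the mean
within_1 = np.mean(np.abs(sample - mean) <= std)
within_2 = np.mean(np.abs(sample - mean) <= 2 * std)

print(f'within 1 sigma: {within_1:.1%}')
print(f'within 2 sigma: {within_2:.1%}')
```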

-------

## Separation of two variables

The Iris data set includes four different feature variables: the length and width of the flowers' petals and sepals. We will now combine two of the features into one plot.

In [None]:
# 'sepal length (cm)'
# 'sepal width (cm)'
# 'petal length (cm)'
# 'petal width (cm)'
feature_A = 'sepal width (cm)'
feature_B = 'sepal length (cm)'
plt.scatter(
    df[feature_A],
    df[feature_B],
    c=df['target'],  # 'third axis' = color
    cmap='Set1',     # colormap
    s=100,           # dot size
);
plt.xlabel(feature_A)
plt.ylabel(feature_B);

The following overview plot shows all four length-width combinations. Including the length-length and width-width combinations as well would give six different plots in total.

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=2)

# Create four sub-plots
axes[0][0].scatter(df['sepal width (cm)'], df['sepal length (cm)'], c=df['target'], cmap=plt.cm.Set1, s=50)
axes[0][1].scatter(df['petal width (cm)'], df['sepal length (cm)'], c=df['target'], cmap=plt.cm.Set1, s=50)
axes[1][0].scatter(df['sepal width (cm)'], df['petal length (cm)'], c=df['target'], cmap=plt.cm.Set1, s=50)
axes[1][1].scatter(df['petal width (cm)'], df['petal length (cm)'], c=df['target'], cmap=plt.cm.Set1, s=50)

# Axis labels
axes[0][0].set_ylabel('sepal length (cm)')
axes[1][0].set_ylabel('petal length (cm)')
axes[1][0].set_xlabel('sepal width (cm)')
axes[1][1].set_xlabel('petal width (cm)')

# Axis ranges
axes[0][0].set_ylim(3.5, 8.5)
axes[0][1].set_ylim(3.5, 8.5)
axes[1][0].set_ylim(0, 8)
axes[1][1].set_ylim(0, 8)

# Remove axis labels for inner axes
plt.setp(axes[0][0].get_xticklabels(), visible=False)
plt.setp(axes[0][1].get_yticklabels(), visible=False)
plt.setp(axes[0][1].get_xticklabels(), visible=False)
plt.setp(axes[1][1].get_yticklabels(), visible=False)

fig.subplots_adjust(hspace=0)
plt.tight_layout()
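
The number of possible feature pairs can be enumerated with `itertools.combinations`. This is a small sketch in which the feature names are typed out by hand instead of being read from `df`:

```python
from itertools import combinations

features = [
    'sepal length (cm)',
    'sepal width (cm)',
    'petal length (cm)',
    'petal width (cm)',
]

# All unordered pairs of two different features
pairs = list(combinations(features, 2))

# Pairs that combine one length with one width
mixed = [(a, b) for a, b in pairs if ('length' in a) != ('length' in b)]

print(len(pairs), len(mixed))
for a, b in pairs:
    print(f'{a}  vs  {b}')
```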

### Discussion of the resulting plots

In this visualization we represent the three species using three different colors. The ability to distinguish between the flower species varies depending on the two features we combine in the scatter plot. 

As a general rule in machine learning, we try to find features that provide us with a good separation of at least two target classes.

## The science behind correlations

For a better understanding of the interaction of two features, we can explore their **correlation**. Usually understood as a measure of the linear relationship between two variables, correlation is an important measure, provided that it is applied correctly.

In [None]:
x = df[df['target'] == 2]['sepal width (cm)']
y = df[df['target'] == 2]['petal width (cm)']

# 2x2 grid
plt.figure(figsize=(10, 10))
gs1 = gridspec.GridSpec(2, 2, height_ratios=[1, 3], width_ratios=[3, 1])
gs1.update(wspace=0.0, hspace=0.0)

# scatter plot
ax2 = plt.subplot(gs1[2])
ax2.scatter(x, y ,s=50, color='dodgerblue')
ax2.set_xlabel('sepal width (cm)')
ax2.set_ylabel('petal width (cm)')

# upper histogram
ax0 = plt.subplot(gs1[0], sharex=ax2)
ax0.hist(x, color='dodgerblue')
plt.setp( ax0.get_xticklabels(), visible=False)

# right histogram
ax3 = plt.subplot(gs1[3], sharey=ax2)
ax3.hist(y, orientation='horizontal', color='dodgerblue')
plt.setp( ax3.get_yticklabels(), visible=False)

# Calculating the parameters of the covariance ellipses
cov = np.cov(x, y)
print('Covariance matrix\n', cov)
# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns)
ev, vecs = np.linalg.eig(cov)

# Rotation angle (in degrees) of the eigenvector belonging to ev[0]
angle = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0]))

# Draw ellipses
ax2.add_artist(
    mpl.patches.Ellipse(xy=(x.mean(), y.mean()),
                        width=2 * np.sqrt(ev[0]),
                        height=2 * np.sqrt(ev[1]),
                        angle=angle,
                        color='dodgerblue',
                        alpha=0.30))
ax2.add_artist(
    mpl.patches.Ellipse(xy=(x.mean(), y.mean()),
                        width=2 * 2 * np.sqrt(ev[0]),
                        height=2 * 2 * np.sqrt(ev[1]),
                        angle=angle,
                        color='dodgerblue',
                        alpha=0.30));

### Discussion of the resulting plot

The main part of the visualization is a scatter plot of the two features _petal width_ and _sepal width_. The histograms at the top and on the right side of the main plot show the one-dimensional distribution of each feature. The light blue ellipses show the one and two $\sigma$ contours of the two-dimensional distribution. Note that in two dimensions these contours enclose approximately 39% and 86% of a normal distribution, not the 68% and 95% known from the one-dimensional case.
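
How many points such ellipses enclose can be checked numerically. A point lies inside the $k\sigma$ ellipse if its Mahalanobis distance from the mean is at most $k$, and for a two-dimensional normal distribution the expected fraction is $1 - e^{-k^2/2}$. The sketch below uses simulated data with a made-up covariance matrix, not the Iris measurements:

```python
import numpy as np

rng = np.random.default_rng(4)
true_cov = np.array([[1.0, 0.6], [0.6, 2.0]])  # made-up covariance matrix
points = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=100000)

# Squared Mahalanobis distance of every point from the sample mean
centered = points - points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(points, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

inside_1 = np.mean(d2 <= 1.0)   # inside the one sigma ellipse
inside_2 = np.mean(d2 <= 4.0)   # inside the two sigma ellipse
expected_1 = 1 - np.exp(-0.5)
expected_2 = 1 - np.exp(-2.0)

print(f'1 sigma: {inside_1:.3f} (expected {expected_1:.3f})')
print(f'2 sigma: {inside_2:.3f} (expected {expected_2:.3f})')
```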

### Correlation coefficient (Pearson correlation)

The linear correlation between two variables can be quantified using the correlation coefficient (often denoted $\rho$ (rho)). A coefficient of 0 means _no linear correlation_, while a coefficient of +1 or -1 stands for maximal correlation or anti-correlation.

Here we use the function `pearsonr` included in the [**scipy**](https://www.scipy.org) package.

In [None]:
from scipy.stats import pearsonr

rho = pearsonr(x, y)[0]
print(f'The linear correlation of the two variables in the data sample is {rho:.3f}')
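
The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The following sketch computes it by hand and compares it with `np.corrcoef`; the arrays `x_demo` and `y_demo` are made up so that the cell does not depend on `x` and `y` from above:

```python
import numpy as np

x_demo = np.array([2.5, 3.0, 3.2, 2.8, 3.4, 3.0])
y_demo = np.array([1.8, 2.1, 2.3, 1.9, 2.4, 2.0])

# np.cov uses N - 1 in the denominator, so the standard deviations must use ddof=1 as well
cov_xy = np.cov(x_demo, y_demo)[0, 1]
rho_manual = cov_xy / (np.std(x_demo, ddof=1) * np.std(y_demo, ddof=1))
rho_numpy = np.corrcoef(x_demo, y_demo)[0, 1]

print(f'{rho_manual:.3f} {rho_numpy:.3f}')
```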

#### Warning

Use the correlation coefficient with caution! If there is a non-linear relation between the variables, the value of the coefficient is misleading. Plotting the data helps to spot such traps.
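
A simple illustration of such a trap, using made-up data: `y_demo` below is completely determined by `x_demo`, yet the linear correlation coefficient is practically zero because the relation is a symmetric parabola.

```python
import numpy as np

x_demo = np.linspace(-1, 1, 201)   # symmetric around zero
y_demo = x_demo ** 2               # perfectly dependent on x_demo, but not linearly

rho_demo = np.corrcoef(x_demo, y_demo)[0, 1]
print(f'Correlation coefficient: {rho_demo:.3f}')
```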

---

_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_