# Keywords

- Central tendency
- mean (arithmetic, geometric, and harmonic mean)
- median
- mode

# Introduction

To describe the <strong>[central tendency](https://en.wikipedia.org/wiki/Central_tendency)</strong> of data, this notebook covers:
- mean
- median
- mode

<p><a href="https://commons.wikimedia.org/wiki/File:Visualisation_mode_median_mean.svg#/media/File:Visualisation_mode_median_mean.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Visualisation_mode_median_mean.svg/1200px-Visualisation_mode_median_mean.svg.png" width="200px" alt="Visualisation mode median mean.svg"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Cmglee" title="User:Cmglee">Cmglee</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=38969094">Link</a></p>

In [1]:
import numpy as np

In [15]:
num_list = np.array([3,1,5,5,8,2,4])
n=num_list.shape[0]
print(num_list)

[3 1 5 5 8 2 4]


# Mean

## Arithmetic mean (AM)

- Sum of the sampled values divided by the number of samples. 

$$
    \bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i 
$$

Here, $\bar{x}$ is the <strong>sample mean</strong>.

In [51]:
(3 + 1 + 5 + 5 + 8 + 2 + 4)/n
#np.sum(num_list)/n

4.0

In [52]:
np.mean(num_list)

4.0

## Geometric mean (GM)

- The geometric mean is defined for sets of positive numbers.

$$
    \bar{x} = \Bigl(\prod_{i=1}^{n} x_i \Bigr)^{\frac{1}{n}}
$$

### Numpy

In [19]:
np.prod(num_list)**(1/n)

3.3565382864325617

### [scipy.stats.mstats.gmean](https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.gmean.html)

In [20]:
from scipy.stats.mstats import gmean
gmean(num_list)

3.356538286432562
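
The product in the definition can overflow or underflow for long arrays, so the geometric mean is often computed through logarithms instead. The following is a minimal sketch of that equivalent formulation, assuming all values are strictly positive; the names `values`, `gm_log`, and `gm_direct` are only illustrative.

```python
import numpy as np

values = np.array([3, 1, 5, 5, 8, 2, 4])

# Geometric mean as the exponential of the mean of the logarithms.
gm_log = np.exp(np.mean(np.log(values)))

# Direct definition: n-th root of the product (fine for short arrays).
gm_direct = np.prod(values) ** (1 / values.size)

print(gm_log, gm_direct)
```

Both expressions should agree up to floating-point rounding.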

## Harmonic mean (HM)

The harmonic mean is useful for averaging rates, i.e., quantities expressed per unit of something else, such as speed (distance per unit of time).

$$
  \bar{x}=\frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}
$$

### Numpy

In [57]:
n/np.sum(1/num_list)

2.6837060702875397

### [scipy.stats.hmean](https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.hmean.html)

In [58]:
from scipy.stats import hmean
hmean(num_list)

2.6837060702875397
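
For positive numbers, the three means always satisfy HM ≤ GM ≤ AM, with equality only when all values are identical. As a small sanity check, the sketch below recomputes the three means for the same sample directly from their definitions; the variable names are illustrative.

```python
import numpy as np

values = np.array([3, 1, 5, 5, 8, 2, 4])
n = values.size

am = np.sum(values) / n           # arithmetic mean
gm = np.prod(values) ** (1 / n)   # geometric mean
hm = n / np.sum(1 / values)       # harmonic mean

print("AM:", am)
print("GM:", gm)
print("HM:", hm)
print("HM <= GM <= AM:", hm <= gm <= am)
```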

## When to use GM?

Useful reference: "[You should summarize data with the geometric mean](https://medium.com/@JLMC/understanding-three-simple-statistics-for-data-visualizations-2619dbb3677a)" on Medium.

1. The geometric mean is appropriate for averaging growth rates and other multiplicative factors, such as percentage changes. Consider someone's salary:

1st year: 50000 USD  
2nd year: 60000 USD (120 %)   
3rd year: 81000 USD (135 %)

The geometric mean of the two growth factors (1.2 and 1.35) is approximately 1.273.

In [48]:
gmean_rate=gmean(np.array([1.2,1.35]))
print("GM: ",gmean_rate)
50000*gmean_rate*gmean_rate

GM:  1.2727922061357855


81000.0
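
To see why the geometric mean is the right average here, it can help to compare it with the arithmetic mean of the same growth factors. The sketch below derives the factors from the salary figures above and shows that only the geometric mean reproduces the final salary when applied twice; the variable names are illustrative.

```python
import numpy as np

salaries = np.array([50000, 60000, 81000])

# Year-over-year growth factors: 60000/50000 and 81000/60000.
factors = salaries[1:] / salaries[:-1]

gm_rate = np.prod(factors) ** (1 / factors.size)  # geometric mean of the factors
am_rate = np.mean(factors)                        # arithmetic mean of the factors

# Equivalent shortcut: root of the overall growth from first to last year.
overall_rate = (salaries[-1] / salaries[0]) ** (1 / factors.size)

print("Final salary using GM:", salaries[0] * gm_rate**2)
print("Final salary using AM:", salaries[0] * am_rate**2)
```

Applying the arithmetic mean of the factors twice overshoots the actual third-year salary, whereas the geometric mean recovers it.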

2. The geometric mean is less sensitive to a single outlier than the arithmetic mean. The GM still changes because of the outlier, but not by orders of magnitude.

In [31]:
num_list_mod=np.append(num_list,10000)
num_list_mod

array([    3,     1,     5,     5,     8,     2,     4, 10000])

In [32]:
print("AM without the outlier (10000):", np.mean(num_list))
print("AM with the outlier (10000):", np.mean(num_list_mod))
print("GM without the outlier (10000):", gmean(num_list))
print("GM with the outlier (10000):", gmean(num_list_mod))

AM without the outlier (10000): 4.0
AM with the outlier (10000): 1253.5
GM without the outlier (10000): 3.356538286432562
GM with the outlier (10000): 9.123367196696424


## When to use HM?

Use the harmonic mean to average rates such as speed (distance per unit of time) when each rate applies over the same amount of the numerator quantity, for example the same distance.

Example: I drive the first 1 km at 10 km/h and the second 1 km at 20 km/h.

- First part: 0.1 hour
- Second part: 0.05 hour

The harmonic mean is 2 / (1/10 + 1/20) = 2 / (0.1 + 0.05) ≈ 13.3 km/h.

In [73]:
print("Distance / harmonic mean:",2/13.3)
print("This value is almost the same as 0.1 + 0.05.")

Distance / harmonic mean: 0.15037593984962405
This value is almost the same as 0.1 + 0.05.
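
The small discrepancy above comes only from rounding the harmonic mean to 13.3. As a sketch without rounding, the average speed can be computed as total distance over total time and compared with the harmonic mean of the two speeds; the names `speeds` and `distances` are illustrative.

```python
import numpy as np

speeds = np.array([10.0, 20.0])    # km/h for each leg
distances = np.array([1.0, 1.0])   # km for each leg (equal distances)

total_time = np.sum(distances / speeds)          # hours
avg_speed = np.sum(distances) / total_time       # true average speed in km/h
hm_speed = speeds.size / np.sum(1 / speeds)      # harmonic mean of the speeds
am_speed = np.mean(speeds)                       # arithmetic mean, for comparison

print("Total distance / total time:", avg_speed)
print("Harmonic mean of speeds:", hm_speed)
print("Arithmetic mean of speeds:", am_speed)
```

The harmonic mean matches the true average speed because the two legs cover equal distances; the arithmetic mean (15 km/h) overestimates it.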


# [Median](https://en.wikipedia.org/wiki/Median)

- The median is the value separating the higher half from the lower half of a data sample.
- The advantage of the median compared to the mean is that it is not skewed as much by a small proportion of outliers (extremely large or small values).
- The median is defined for ordered one-dimensional data.

In [64]:
print(num_list)
print(np.median(num_list))

[3 1 5 5 8 2 4]
4.0


Below is an example.

<p><a href="https://commons.wikimedia.org/wiki/File:Finding_the_median.png#/media/File:Finding_the_median.png"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Finding_the_median.png/1200px-Finding_the_median.png" width="200px" alt="Finding the median.png"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Blythwood" title="User:Blythwood">Blythwood</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=50321138">Link</a></p>
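
To illustrate the robustness to outliers, the following sketch appends the same outlier used earlier (10000) to the sample and compares how the mean and the median react. With an even number of values, `np.median` returns the average of the two middle values.

```python
import numpy as np

values = np.array([3, 1, 5, 5, 8, 2, 4])
values_with_outlier = np.append(values, 10000)

median_before = np.median(values)
median_after = np.median(values_with_outlier)
mean_before = np.mean(values)
mean_after = np.mean(values_with_outlier)

print("Median without / with outlier:", median_before, median_after)
print("Mean without / with outlier:", mean_before, mean_after)
```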

# Mode

- The most frequent value in the data set.
- The mode is not necessarily unique because there might be several maxima in the distribution.

In [84]:
import statistics

print(num_list)
statistics.mode(num_list)

[3 1 5 5 8 2 4]


5
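
When several values tie for the highest frequency, a single-valued mode hides the other candidates. One possible way to list all modes, sketched here with `np.unique`, is to count each distinct value and keep those with the maximum count. The sample below is a hypothetical variant of `num_list` with an extra 2, so that 2 and 5 both occur twice.

```python
import numpy as np

# Hypothetical sample: 2 and 5 both appear twice.
values = np.array([3, 1, 5, 5, 8, 2, 4, 2])

# Distinct values (sorted) and how often each occurs.
unique_values, counts = np.unique(values, return_counts=True)

# Keep every value whose count equals the maximum count.
modes = unique_values[counts == counts.max()]

print(modes)
```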