# Calculating Descriptive Statistics
## Central Tendency

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

Now you have the lists x and x_with_nan. They’re almost identical, except that x_with_nan contains a nan value. It’s important to understand how the Python statistics routines behave when they encounter a not-a-number value (nan), because a single nan can propagate into the final result.

In [3]:
x

[8.0, 1, 2.5, 4, 28.0]

In [4]:
x_with_nan

[8.0, 1, 2.5, nan, 4, 28.0]

In [5]:
math.isnan(np.nan)

True

In [6]:
np.isnan(math.nan)

True
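Because nan is defined as not equal to any value, including itself, an equality check cannot be used to detect it. The following is a minimal sketch of that behavior using only the standard library; the names `value` and `cleaned` are illustrative and not part of the notebook.

```python
import math

value = math.nan

# nan never compares equal to anything, not even to itself
print(value == value)      # False
print(math.isnan(value))   # True

# Dropping nan values therefore relies on math.isnan(), not on ==
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
cleaned = [item for item in x_with_nan if not math.isnan(item)]
print(cleaned)             # [8.0, 1, 2.5, 4, 28.0]
```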

# Mean

The sample mean, also called the sample arithmetic mean or simply the average, is the arithmetic average of all the items in a dataset. The mean of a dataset 𝑥 is mathematically expressed as Σᵢ𝑥ᵢ/𝑛, where 𝑖 = 1, 2, …, 𝑛. In other words, it’s the sum of all the elements 𝑥ᵢ divided by the number of items in the dataset 𝑥.

In [7]:
mean = sum(x) / len(x)
mean

8.7

In [8]:
mean = statistics.mean(x)
mean

8.7

In [8]:
len(x)

5

In [7]:
len(x_with_nan)

6

In [9]:
sum(x)

43.5

In [10]:
sum(x_with_nan)

nan

fmean() was introduced in Python 3.8 as a faster alternative to mean(). It always returns a floating-point number, whereas mean() tries to preserve the type of its input.

In [9]:
mean = statistics.fmean(x)
mean

8.7
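The difference in return type is easier to see with data that is not already floating point. The sketch below uses two small illustrative lists (`ints` and `fracs` are assumptions, not data from the notebook) to compare the two functions.

```python
import statistics
from fractions import Fraction

ints = [1, 2, 3]
fracs = [Fraction(1, 2), Fraction(3, 2)]

# mean() keeps the input type where it can
print(statistics.mean(ints))    # 2
print(statistics.mean(fracs))   # 1

# fmean() converts everything to float first
print(statistics.fmean(ints))   # 2.0
print(statistics.fmean(fracs))  # 1.0
```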

In [10]:
mean = statistics.mean(x_with_nan)
mean

nan

In [11]:
mean = statistics.fmean(x_with_nan)
mean

nan

# Use NumPy

The function mean() and the method .mean() from NumPy return the same result as statistics.mean(). This is also the case when there are nan values among your data: the result is nan.

In [12]:
mean = np.mean(x)
mean

8.7
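The cell above shows only the function form on a clean list. As a small sketch, the method form and the nan case can be checked like this; the array names `y` and `y_with_nan` are introduced here for illustration.

```python
import math
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# The method form requires an ndarray rather than a plain list
y = np.array(x)
y_with_nan = np.array(x_with_nan)

print(y.mean())             # same value as np.mean(x)
print(np.mean(y_with_nan))  # nan
print(y_with_nan.mean())    # nan
```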

Often, a nan result isn’t what you want. If you prefer to ignore nan values, then you can use np.nanmean().

In [13]:
np.nanmean(x_with_nan)

8.7
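pandas, imported at the top of the notebook, takes the opposite default: its `.mean()` method skips nan values unless told otherwise. A minimal sketch, assuming the same data wrapped in a Series (the name `z_with_nan` is illustrative):

```python
import math
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
z_with_nan = pd.Series(x_with_nan)

# skipna=True is the default, so nan is ignored
print(z_with_nan.mean())

# Passing skipna=False restores the nan-propagating behavior
print(z_with_nan.mean(skipna=False))  # nan
```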

# Weighted Mean


In [12]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
wmean

6.95

In [13]:
sum(w)

1.0

In [15]:
wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
wmean

6.95
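Both cells above compute the weighted mean Σᵢ(𝑤ᵢ𝑥ᵢ) / Σᵢ𝑤ᵢ, in which each item 𝑥ᵢ contributes in proportion to its weight 𝑤ᵢ. With NumPy, the same calculation can be sketched with np.average(), which accepts a weights argument; the array names `y` and `w_arr` are introduced here for illustration.

```python
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

y = np.array(x)
w_arr = np.array(w)

# np.average() divides the weighted sum by the sum of the weights
wmean_np = np.average(y, weights=w_arr)
print(wmean_np)

# Equivalent element-wise form
wmean_manual = (w_arr * y).sum() / w_arr.sum()
print(wmean_manual)
```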

# Harmonic Mean

The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset: 𝑛 / Σᵢ(1/𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in the dataset 𝑥.

One variant of the pure Python implementation of the harmonic mean is this:

In [16]:
hmean = len(x) / sum(1 / item for item in x)
hmean

2.7613412228796843

In [17]:
hmean = statistics.harmonic_mean(x)
hmean

2.7613412228796843

It’s quite different from the value of the arithmetic mean for the same data x, which you calculated to be 8.7.
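A common case where the harmonic mean is the appropriate average is a rate measured over equal distances. The sketch below uses made-up numbers (the speeds and the distance are assumptions, not data from the notebook) and compares the harmonic mean against the average speed computed directly from total distance and total time.

```python
import statistics

# Hypothetical trip: the same distance driven at two different speeds
speeds = [40, 60]   # km/h, outbound and return
distance = 120      # km per leg

# Average speed from first principles: total distance / total time
total_time = sum(distance / speed for speed in speeds)
avg_speed = (distance * len(speeds)) / total_time

print(avg_speed)                         # 48.0
print(statistics.harmonic_mean(speeds))  # agrees with avg_speed
print(statistics.mean(speeds))           # 50, which overstates the speed
```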

# Geometric Mean
The geometric mean is the 𝑛-th root of the product of all 𝑛 elements 𝑥ᵢ in a dataset 𝑥: ⁿ√(Πᵢ𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛.

In [18]:
statistics.geometric_mean(x)

4.67788567485604
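The definition can also be written in pure Python. The following sketch shows two equivalent variants: the direct 𝑛-th root of the product, and the exponential of the mean of the logarithms, which tends to be safer for long datasets because the product itself is never formed. The names `gmean_prod` and `gmean_log` are illustrative.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

# Direct form: n-th root of the product (math.prod requires Python 3.8+)
gmean_prod = math.prod(x) ** (1 / len(x))

# Log form: exp of the arithmetic mean of the logarithms
gmean_log = math.exp(sum(math.log(item) for item in x) / len(x))

print(gmean_prod)
print(gmean_log)
print(statistics.geometric_mean(x))
```

All three values should agree up to floating-point rounding. Both variants assume strictly positive data.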

# Median
The sample median is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of elements 𝑛 of the dataset is odd, then the median is the value at the middle position: 0.5(𝑛 + 1). If 𝑛 is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5𝑛 and 0.5𝑛 + 1.

For example, if you have the data points 2, 4, 1, 8, and 9, then the median value is 4, which is in the middle of the sorted dataset (1, 2, 4, 8, 9). If the data points are 2, 4, 1, and 8, then the median is 3, which is the average of the two middle elements of the sorted sequence (2 and 4).

In [20]:
statistics.median(x)

4
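The odd and even rules described above can be sketched in pure Python as follows. The function name `median_sketch` is hypothetical and exists only for illustration; statistics.median() remains the function used in the notebook.

```python
import statistics

def median_sketch(data):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one data point")
    middle = n // 2
    if n % 2:
        # Odd length: the single middle element
        return ordered[middle]
    # Even length: the mean of the two middle elements
    return 0.5 * (ordered[middle - 1] + ordered[middle])

x = [8.0, 1, 2.5, 4, 28.0]
print(median_sketch(x))       # 4
print(median_sketch(x[:-1]))  # 3.25

# For even-length data, the statistics module can also return
# one of the two middle elements instead of their mean
print(statistics.median_low(x[:-1]))   # 2.5
print(statistics.median_high(x[:-1]))  # 4
```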

# Mode
The sample mode is the value in the dataset that occurs most frequently. If there isn’t a single such value, then the set is multimodal since it has multiple modal values. For example, in the set that contains the points 2, 3, 2, 8, and 12, the number 2 is the mode because it occurs twice, unlike the other items that occur only once.

In [22]:
u = [2, 3, 2, 8, 12]
statistics.mode(u)

2

In [26]:
[(u.count(item), item) for item in set(u)]

[(1, 8), (2, 2), (1, 3), (1, 12)]

In [23]:
max((u.count(item), item) for item in set(u))[1]

2

In [14]:
v = [12, 15, 12, 15, 21, 15, 12]

In [15]:
len(v)

7

In [28]:
statistics.mode(v)  

12

In [29]:
statistics.multimode(v)

[12, 15]
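Counting with u.count() rescans the list once per distinct value. As an alternative sketch, collections.Counter tallies all values in a single pass and makes it straightforward to collect every modal value. The function name `multimode_sketch` is hypothetical and is not part of the statistics module.

```python
from collections import Counter

def multimode_sketch(data):
    """Return all values that share the highest frequency."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Counter keeps first-seen order, so ties come out in data order
    return [value for value, count in counts.items() if count == highest]

u = [2, 3, 2, 8, 12]
v = [12, 15, 12, 15, 21, 15, 12]

print(multimode_sketch(u))  # [2]
print(multimode_sketch(v))  # [12, 15]
```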