# 6. QLS - Means

## Measures Of Central Tendency

Here we'll discuss ways to summarize a set of data using a single number to capture information about the distribution of data.

In [9]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import yfinance as yf

### Arithmetic Mean

It's used to summarize numerical data, and it's commonly referred to as "average". It's defined as the sum of the observations divided by the number of observations.

$\mu = \frac{\sum_{i=1}^{N} X_{i}}{N}$ where $X_{1}, X_{2}, ..., X_{N}$ are our observations.

In [10]:
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print("Mean of x1:", sum(x1), "/", len(x1), "=", np.mean(x1))
print("Mean of x2:", sum(x2), "/", len(x2), "=", np.mean(x2))

Mean of x1: 29 / 8 = 3.625
Mean of x2: 129 / 9 = 14.333333333333334


We can also use a weighted arithmetic mean, which lets us specify how much each observation counts toward the result.

For example, when computing the value of a portfolio, it's easier to say that 70% of the equities are of type X than to list every share.

The weighted arithmetic mean is defined as $\sum_{i=1}^{n} w_{i} X_{i}$ where $\sum_{i=1}^{n} w_{i} = 1$. In the standard arithmetic mean, we have $w_{i} = \frac{1}{n}$ for all $i$.
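As a minimal sketch of the weighted mean, the example below uses a hypothetical three-asset portfolio; the returns and weights are made-up numbers, not data from this lecture. It computes $\sum w_{i} X_{i}$ directly and checks the result against `np.average`, then confirms that equal weights of $\frac{1}{n}$ give back the ordinary mean.

```python
import numpy as np

# Hypothetical portfolio: returns of three holdings and their weights (made-up numbers)
asset_returns = np.array([0.10, 0.02, -0.05])
weights = np.array([0.70, 0.20, 0.10])  # weights must sum to 1

# Weighted arithmetic mean: sum of w_i * X_i
weighted_mean = np.sum(weights * asset_returns)

# np.average with the weights argument computes the same quantity
numpy_weighted_mean = np.average(asset_returns, weights=weights)

# With equal weights 1/n we recover the standard arithmetic mean
n = len(asset_returns)
equal_weighted_mean = np.sum(np.full(n, 1 / n) * asset_returns)

print("Weighted mean:", weighted_mean)
print("np.average:", numpy_weighted_mean)
print("Equal-weighted mean:", equal_weighted_mean, "np.mean:", np.mean(asset_returns))
```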

### Median

The median of a data set is the value that appears in the middle when the data are sorted in increasing or decreasing order. With an odd number $n$ of data points, the median is the value in position $\frac{n+1}{2}$. With an even number, the list splits into two halves and there is no middle item, so we define the median as the average of the values in positions $\frac{n}{2}$ and $\frac{n+2}{2}$.

The median is less affected by extreme values in the data than the arithmetic mean. It shows the value that splits the data in half, but not how much smaller/larger the other values are.
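The position rule can be sketched by hand. The helper `median_by_position` below is a hypothetical name introduced only for illustration; it sorts the data and picks the middle item (odd $n$) or averages the two middle items (even $n$), translating the 1-based positions into Python's 0-based indices.

```python
def median_by_position(values):
    """Median via the position rule: the middle item, or the average of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index (n - 1) // 2
        return ordered[(n - 1) // 2]
    # Even n: 1-based positions n / 2 and (n + 2) / 2 are 0-based indices n // 2 - 1 and n // 2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print("Median of x1 by position:", median_by_position(x1))  # 3.5
print("Median of x2 by position:", median_by_position(x2))  # 4
```

Adding the extreme value 100 moves the median only from 3.5 to 4, whereas it moves the mean from 3.625 to about 14.33.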

In [11]:
print("Median of x1:", np.median(x1))
print("Median of x2:", np.median(x2))

Median of x1: 3.5
Median of x2: 4.0


### Mode

The mode is the most frequently occurring value in a data set. It can be applied to non-numerical data. It is particularly useful when the possible values of the data are independent of one another.

For example, in the outcomes of a weighted die, coming up $6$ often doesn't mean it's likely to come up $5$; so knowing that the data has a mode of $6$ is more useful than knowing it has a mean of $4.5$.
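To make the die example concrete, here is a small sketch with ten made-up rolls chosen so that the mean is $4.5$ and the mode is $6$. It uses `mean` and `multimode` from Python's standard `statistics` module (`multimode` requires Python 3.8 or later).

```python
from statistics import mean, multimode

# Hypothetical outcomes of ten rolls of a die weighted toward 6 (made-up data)
rolls = [1, 2, 3, 4, 5, 6, 6, 6, 6, 6]

roll_mean = mean(rolls)        # 45 / 10 = 4.5, a value the die can never show
roll_modes = multimode(rolls)  # list of all most frequent values

print("Mean of rolls:", roll_mean)      # 4.5
print("Mode(s) of rolls:", roll_modes)  # [6]
```

The mean of 4.5 is not a possible outcome and says little about what the die tends to do, while the mode points directly at the face it favors.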

In [12]:
# scipy has a built-in mode function, but it returns exactly one value even if two values occur the same number of times, or if no value appears more than once
print("One mode of x1:", stats.mode(x1, keepdims=True)[0][0])

# So we will write our own
def mode(l):
    # Count the number of times each element appears in the list
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1

    # Return the elements that appear the most times
    maxcount = 0
    modes = set()
    for (key, value) in list(counts.items()):
        if value > maxcount:
            maxcount = value
            modes = {key}
        elif value == maxcount:
            modes.add(key)

    if maxcount > 1 or len(l) == 1:
        return list(modes)
    return 'No mode'

print("All of the modes of x1:", mode(x1))

One mode of x1: 2
All of the modes of x1: [2, 5]


For data that can take on many different values, such as returns, there may not be any value that appears more than once. In that case we can bin the values, as in a histogram, and find the mode of the data set in which each value is replaced with the name of its bin. That is, we find which bin the elements fall into most often.

In [13]:
# get return data for an asset and compute the mode of the data set
start = "2014-01-01"
end = "2015-01-01"

pricing = yf.download("SPY", start, end)
returns = pricing["Adj Close"].pct_change()[1:]
print("Mode of returns:", mode(returns))

# Since all the returns are different, we use a frequency distribution to get an alternative mode.
# np.histogram returns the frequency distribution over the bins as well as the endpoints of the bins
hist, bins = np.histogram(returns, 20) # breaks data into 20 bins
maxfreq = max(hist)
# Find all the bins that are hit with frequency maxfreq, then print the intervals corresponding to them
print("Mode of bins:", [(bins[i], bins[i+1]) for i, j in enumerate(hist) if j == maxfreq])

[*********************100%***********************]  1 of 1 completed
Mode of returns: No mode
Mode of bins: [(-0.001250095642141489, 0.0011115931768550524)]


### Geometric Mean

While the arithmetic mean averages using addition, the geometric mean uses multiplication:

$$G = \sqrt[n]{X_{1} X_{2} \cdots X_{n}}$$

For observations with $X_{i} > 0$, it can be rewritten as an arithmetic mean using logarithms, like so:

$$\ln G = \frac{\sum_{i=1}^{n} \ln X_{i}}{n}$$

For non-negative observations, the geometric mean is always less than or equal to the arithmetic mean, with equality only when all the observations are the same.
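A quick numerical check of both statements, as a sketch for strictly positive data: the geometric mean is recovered by exponentiating the arithmetic mean of the logs, and it does not exceed the arithmetic mean. The list `same` is a made-up example in which all observations are equal.

```python
import numpy as np
import scipy.stats as stats

x1 = [1, 2, 2, 3, 4, 5, 5, 7]

# Geometric mean via the log form: exponentiate the arithmetic mean of the logs
g_from_logs = np.exp(np.mean(np.log(x1)))
g_scipy = stats.gmean(x1)

# Arithmetic mean for comparison
a = np.mean(x1)

# Made-up data with identical observations, where the two means coincide
same = [3, 3, 3, 3]

print("Geometric mean from logs:", g_from_logs)
print("Geometric mean from scipy:", g_scipy)
print("Geometric <= arithmetic:", g_from_logs <= a)
print("Equal observations:", stats.gmean(same), np.mean(same))
```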

In [15]:
# use scipy's gmean function to compute the geometric mean
print("Geometric mean of x1:", stats.gmean(x1))
print("Geometric mean of x2:", stats.gmean(x2))

Geometric mean of x1: 3.0941040249774403
Geometric mean of x2: 4.552534587620071


What if we want to compute the geometric mean when we have negative observations? For asset returns this is easy to solve, because the values are always at least $-1$. We can add $1$ to a return $R_{t}$ to get $1+R_{t}$, which is the ratio of the asset's prices in two consecutive periods (as opposed to the percent change between them). This quantity is always non-negative, so we can calculate the geometric mean return:

$$R_{G} = \sqrt[T]{(1+R_{1}) ... (1+R_{T})} - 1$$

In [17]:
# add 1 to every value in the returns array and then calculate R_G
ratios = returns + np.ones(len(returns))
R_G = stats.gmean(ratios) - 1
print("Geometric mean of returns:", R_G)

Geometric mean of returns: 0.0005417539342906785


The geometric mean is defined so that if the rate of return over the whole time period were constant and equal to $R_{G}$, the final price of the asset would be the same as with the actual returns $R_{1}, ..., R_{T}$.

In [18]:
start = "2014-01-01"
end = "2015-01-01"

pricing = yf.download("SPY", start, end)
returns = pricing["Adj Close"].pct_change()[1:]

T = len(returns)
# Use .iloc for positional access; T is the position of the last price
init_price = pricing["Adj Close"].iloc[0]
final_price = pricing["Adj Close"].iloc[T]
print('Initial price:', init_price)
print('Final price:', final_price)
print('Final price as computed with R_G:', init_price*(1 + R_G)**T)

[*********************100%***********************]  1 of 1 completed
Initial price: 153.82884216308594
Final price: 176.22891235351562
Final price as computed with R_G: 176.22894731497104


### Harmonic Mean

The harmonic mean is much less well known than the previous two. It is defined as:

$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_{i}}}$$

As with the geometric mean, we can rewrite it to look like an arithmetic mean. The reciprocal of the harmonic mean is the arithmetic mean of the reciprocals of the observations:

$$\frac{1}{H} = \frac{\sum_{i=1}^{n} \frac{1}{X_{i}}}{n}$$

The harmonic mean of positive numbers $X_{i}$ is always at most the geometric mean (which in turn is at most the arithmetic mean), and they're equal only when all the observations are equal.
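The reciprocal form and the ordering of the means can be checked numerically. This is a sketch for strictly positive data, since the reciprocals are undefined at zero.

```python
import numpy as np
import scipy.stats as stats

x1 = np.array([1, 2, 2, 3, 4, 5, 5, 7], dtype=float)

# Harmonic mean via the reciprocal form: 1 / H is the arithmetic mean of 1 / X_i
h_from_reciprocals = 1 / np.mean(1 / x1)
h_scipy = stats.hmean(x1)

# The other two means, to check the ordering harmonic <= geometric <= arithmetic
g = stats.gmean(x1)
a = np.mean(x1)

print("Harmonic mean from reciprocals:", h_from_reciprocals)
print("Harmonic mean from scipy:", h_scipy)
print("Ordering holds:", h_from_reciprocals <= g <= a)
```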

In [19]:
print("Harmonic mean of x1:", stats.hmean(x1))
print("Harmonic mean of x2:", stats.hmean(x2))

Harmonic mean of x1: 2.5590251332825593
Harmonic mean of x2: 2.869723656240511


The harmonic mean is appropriate when the data can be phrased in terms of ratios. For example, with dollar-cost averaging, a fixed amount is spent on shares at regular intervals. The higher the price, the fewer shares are bought. The average price paid per share (total amount spent divided by total shares bought) is the harmonic mean of the prices.
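The dollar-cost averaging claim can be verified with a small sketch. The prices and the per-period budget below are made-up numbers, and fractional shares are assumed to be allowed.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical share prices on four purchase dates (made-up numbers)
prices = np.array([10.0, 20.0, 25.0, 50.0])
budget = 100.0  # fixed dollar amount spent on each date

# Shares bought on each date: fewer shares when the price is higher
shares_bought = budget / prices

# Average cost per share = total dollars spent / total shares bought
total_spent = budget * len(prices)
avg_cost_per_share = total_spent / shares_bought.sum()

print("Average cost per share:", avg_cost_per_share)       # about 19.05
print("Harmonic mean of prices:", stats.hmean(prices))     # same value
print("Arithmetic mean of prices:", prices.mean())         # 26.25
```

The average cost is well below the arithmetic mean of the prices because the fixed budget buys more shares when the price is low.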

## Point Estimates Can Be Deceiving

Means, by nature, hide a lot of information, since they collapse an entire distribution into one number. As a result, 'point estimates' can often disguise problems in the data. We should make sure that key information isn't lost in the process, and we should rarely use a mean without also referring to a measure of spread.
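As a minimal illustration, the two made-up return series below share the same mean but behave very differently. Standard deviation is used as the measure of spread here only as one possible choice.

```python
import numpy as np

# Two made-up return series with the same mean but very different spread
steady = np.array([0.01, 0.01, 0.01, 0.01])
volatile = np.array([0.25, -0.20, 0.15, -0.16])

print("Mean of steady:", steady.mean())
print("Mean of volatile:", volatile.mean())
print("Std of steady:", steady.std())
print("Std of volatile:", volatile.std())
```

Reporting only the mean would make these two series look identical, which is why a measure of spread should accompany it.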

## Underlying Distributions Can Be Wrong

Even when we use the right metrics for mean and spread, they can make no sense if the underlying distribution is not what we expect. For example, using standard deviation to measure the frequency of an event implicitly assumes normality. We should try not to assume distributions unless we have to, and when we do, we should check that the data fit the distribution we are assuming.
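One possible way to check a normality assumption is sketched below. The data are simulated from a heavy-tailed Student's t distribution purely for illustration. The Jarque-Bera test (`scipy.stats.jarque_bera`) and the 3-standard-deviation threshold are choices made for this example, not something the lecture prescribes.

```python
import numpy as np
import scipy.stats as stats

# Simulated heavy-tailed "returns" (Student's t with 3 degrees of freedom); illustration only
rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=5000)

# Frequency of moves beyond 3 standard deviations: observed vs. predicted under normality
z_scores = (sample - sample.mean()) / sample.std()
observed_freq = np.mean(np.abs(z_scores) > 3)
normal_freq = 2 * stats.norm.sf(3)  # two-sided tail probability of a normal distribution

# Jarque-Bera test: a small p-value is evidence against normality
jb_stat, jb_pvalue = stats.jarque_bera(sample)

print("Observed frequency of |z| > 3:", observed_freq)
print("Frequency predicted by a normal distribution:", normal_freq)
print("Jarque-Bera p-value:", jb_pvalue)
```

If the test rejects normality, frequencies derived from the standard deviation under a normal assumption should not be trusted for that data.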