<h1 style="color: rgb(0, 91, 94);">Bias</h1>

<hr style="border-top: 1px solid rgb(0, 91, 94);" />



In this notebook, you will learn about bias - statistical and cognitive.

In [None]:
# Numerical arrays.
import numpy as np

# Plots.
import matplotlib.pyplot as plt

<h2 style="color: rgb(0, 91, 94);">Cognitive Bias</h2>

<hr style="border-top: 1px solid rgb(0, 91, 94);" />

*The hope for informed gossip is that there are distinctive patterns in the errors people make. Systematic errors are known as biases, and they recur predictably in particular circumstances. When the handsome and confident speaker bounds onto the stage, for example, you can anticipate that the audience will judge his comments more favorably than he deserves. The availability of a diagnostic label for this bias—the halo effect—makes it easier to anticipate, recognize, and understand.*

-- Kahneman; Thinking Fast and Slow

<a style="color: #ff791e" href="https://github.com/ianmcloughlin/papers/raw/master/tversky-kahneman-heuristics-biases.pdf"><i>Judgment under Uncertainty: Heuristics and Biases;</i></a><br>Amos Tversky and Daniel Kahneman; Science, New Series, Vol. 185, No. 4157. (Sep. 27, 1974), pp. 1124-1131.

<a style="color: #ff791e" href="https://github.com/ianmcloughlin/papers/raw/master/tversky-kahneman-framing-of-decisions.pdf"><i>The Framing of Decisions and the Psychology of Choice;</i></a><br>Amos Tversky and Daniel Kahneman; Science, Vol. 211, 30 January 1981.

<a style="color: #ff791e" href="https://github.com/ianmcloughlin/papers/raw/master/kruger-dunning-ones-incompetence.pdf"><i>Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments;</i></a><br>Justin Kruger and David Dunning; Psychology, 2009, 1, 30-46.

<a style="color: #ff791e" href="https://plato.stanford.edu/entries/aristotle-rhetoric/"><i>Aristotle’s Rhetoric;</i></a><br>Stanford Encyclopedia of Philosophy

<a style="color: #ff791e" href="https://github.com/ianmcloughlin/papers/raw/master/lousville-logos-pathos-ethos-kairos.pdf"><i>Logos, Ethos, Pathos, Kairos;</i></a><br>University of Louisville Writing Center.

<h3 style="color: rgb(0, 91, 94);">Guessing Game</h3>

<hr style="border-top: 1px solid rgb(0, 91, 94);" />

Below is some code that you shouldn't try to understand for now.

The game is to give, in plain English, a precise rule that describes when the function returns True.

In [None]:
# This code is obfuscated on purpose.
import operator as o__

def test(L):
    return True if o__.__ge__(0b10001, len(L) * 2 + sum([i - 0b10 for i in L])) and all([o__.__ge__(L[::-1][i], L[::-1][i+0b1]) for i in range(len(L)-1)]) else False

So, we can repeatedly call the function as follows, with different lists of integers.

In [None]:
test([1, 2, 3, 4])

In [None]:
test([2, 3])

In [None]:
test([3, 2, 1])

In [None]:
test([1, 2, 3, 10])

In [None]:
test([1, 3, 2])

So what is the rule?

How confident are you that that is the rule?

Let's test your rule.

Tell me what the outputs of the following code should be.

In [None]:
test([1, 2, 3, 4])

In [None]:
test([2, 3, 5, 10])

Note that the point here is to think about your confidence in your answer and how you got confident.

<h3 style="color: #001a79;">Exercise 1</h3>

<hr style="border-top: 1px solid #001a79;" />

<i style="color: #001a79;">Remember to do these exercises in your own notebook in your assessment repository.</i>

Give three real-world examples of different types of cognitive bias.

<hr style="border-top: 1px solid #001a79;" />

<h2 style="color: rgb(0, 91, 94);">Statistical Bias</h2>

<hr style="border-top: 1px solid rgb(0, 91, 94);" />

Two of the common calculations you will find in the statistics literature are mean and standard deviation.

The mean is straightforward - it is the usual calculation that people call the average.

You take all of the numbers you have, add them up, and then divide by the number of them.
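
As a quick sketch of that recipe, the mean can be computed by hand and compared with numpy's built-in function. The values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
values = np.array([2.0, 4.0, 6.0, 8.0])

# Add the values up, then divide by how many there are.
manual_mean = values.sum() / len(values)

# numpy's built-in calculation should agree with the manual one.
print(manual_mean, np.mean(values))
```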

<h3 style="color: rgb(0, 91, 94);">Mean</h3>

Suppose you take a sample of values from a larger population of values.

In [None]:
# Generate a sample of 1000 values from a normal distribution.
x = np.random.normal(10.0, 1.0, 1000)
x


If you take the mean of the sample, it is a good estimate of the population average.

In [None]:
# We expect the mean of the sample to be close to the mean of the population.
x.mean()

What do we mean by a good estimate?

To investigate, let us take lots of samples.

In [None]:
# Let's run a simulation of taking 1000 samples of size 1000.
samples = np.random.normal(10.0, 1.0, (1000, 1000))
samples

In [None]:
# Get the mean of the first sample.
samples[0].mean()

In [None]:
# Calculate the mean of all samples.
sample_means = samples.mean(axis=1)
sample_means

Let's plot the means in a histogram.

In [None]:
plt.hist(sample_means);
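
One way to make "good estimate" concrete is to check that the sample means centre on the population mean, and to measure how much they vary from sample to sample. The sketch below repeats the simulation with a seeded generator (`np.random.default_rng`, with an arbitrary seed of 0) so the numbers are repeatable; the names `sim` and `sim_means` are illustrative and not part of the cells above.

```python
import numpy as np

# Seeded generator so that the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)

# 1000 samples of size 1000 from a normal distribution: mean 10.0, std 1.0.
sim = rng.normal(10.0, 1.0, (1000, 1000))

# One mean per sample.
sim_means = sim.mean(axis=1)

# The sample means should average out close to the population mean.
print(f'Average of the sample means: {sim_means.mean():0.4f}')

# Their spread should be close to sigma / sqrt(n), the standard error.
print(f'Spread of the sample means:  {sim_means.std():0.4f}')
print(f'sigma / sqrt(n):             {1.0 / np.sqrt(1000):0.4f}')
```

The sample mean is called unbiased because, on average, it neither over-estimates nor under-estimates the population mean.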

<h3 style="color: rgb(0, 91, 94);">Standard Deviation</h3>

The standard deviation is a different story.

First of all, the calculation is not as familiar.

It is designed to give a measure of how far the numbers are away from the mean in general.

The need for such a measure is seen in the following example of calculating the mean of two sets of numbers.

In [None]:
# A list of numbers - four small and one big.
numbers1 = np.array([1, 1, 1, 1, 10])
# Their mean.
np.mean(numbers1)

In [None]:
# A list of numbers - all close to each other.
numbers2 = np.array([2, 2, 3, 3, 4])
# Their mean.
np.mean(numbers2)

The example illustrates a common issue.

The mean on its own is limited as a summary of the data points.

That is why we use the standard deviation - it gives us a measure of the spread.
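
As a preview, the sketch below checks that the two arrays from the cells above share the same mean and then compares their spread using numpy's `np.std`, which is built up step by step in the rest of this section.

```python
import numpy as np

# The two arrays from the example above.
numbers1 = np.array([1, 1, 1, 1, 10])
numbers2 = np.array([2, 2, 3, 3, 4])

# The means are identical, so the mean cannot tell the arrays apart.
print(np.mean(numbers1), np.mean(numbers2))

# The standard deviations differ: numbers1 is far more spread out.
print(np.std(numbers1), np.std(numbers2))
```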

Let's see how it is calculated.

First we'll generate a sample.

In [None]:
# Generate a sample of values - note we can see the population standard deviation.
x = np.random.normal(10.0, 1.0, 1000)
x

Now, let us calculate the mean and investigate it.

In [None]:
# Calculate the mean.
x_mean = x.mean()

# Subtract the mean from each of the values.
zeroed = x - x_mean

In [None]:
# What do you think the mean of zeroed is?
zeroed.mean()

In [None]:
# This will give us a better view of it - correct to four decimal places.
print(f'{zeroed.mean():0.4f}')

So, subtracting the mean of the sample results in the mean being zero.

The standard deviation is an adjustment to the above calculation.

The goal is to summarise the zeroed array without losing information about the distance of each point from the mean.

Let's see if we can come up with a plot of the idea.

In [None]:
# Create a plot.
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the zeroed array, each value spaced out evenly along the x axis.
# Note the x axis is just the position of the value in the zeroed array.
ax.plot(range(len(zeroed)), zeroed, 'k.')

# Plot the y=0 line.
ax.axhline(y=0.0, color='grey', linestyle='-');

One idea is to take the average (vertical) distance each point is from the mean, zero.

We need to be careful - there are negative and positive values here.

Because we subtracted the mean, they sum to zero (up to floating-point rounding).

In [None]:
# Sum the array.
print(f'{zeroed.sum():0.4f}')
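
The reason the sum comes out as zero can be written in one line. Here $x_1, \ldots, x_n$ stand for the sample values and $\bar{x}$ for their mean; this notation is introduced here for illustration and is not used elsewhere in the notebook.

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$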

We can try instead taking the absolute value.

This makes sense, because we expect distances to be positive.

In [None]:
# Absolute values.
np.abs(zeroed)

In [None]:
# Average absolute value.
np.mean(np.abs(zeroed))

While this is a reasonable measure of spread, it is not typically the one used.

For a discussion, see this Cross Validated post: https://stats.stackexchange.com/q/118

Instead we usually square the values.

Note that squaring a number never gives a negative value.

It is also somewhat easier to work with analytically.

In [None]:
# Square the values.
np.square(zeroed)

Note that when you square numbers, bigger values get bigger.

Larger deviations from the mean will contribute relatively more to the standard deviation.

That is not necessarily a bad thing; it is just something to note.
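
To see the effect on a small scale, the sketch below takes the values of `numbers1` from earlier (four small values and one large one) and works out what share of the total each point contributes, first with absolute deviations and then with squared deviations. The names `abs_share` and `sq_share` are illustrative.

```python
import numpy as np

# Four small values and one large one.
values = np.array([1, 1, 1, 1, 10])

# Deviations from the mean.
deviations = values - values.mean()

# Each point's share of the total absolute deviation.
abs_share = np.abs(deviations) / np.abs(deviations).sum()

# Each point's share of the total squared deviation.
sq_share = np.square(deviations) / np.square(deviations).sum()

# The last entry is the large value: its share grows when we square.
print(abs_share)
print(sq_share)
```

With absolute values the large value accounts for half of the total, whereas with squares it accounts for four fifths.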

In [None]:
# Create a plot.
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the squared zeroed array, each value spaced out evenly along the x axis.
# Note the x axis is just the position of the value in the zeroed array.
ax.plot(range(len(zeroed)), np.square(zeroed), color='green', marker='.', linestyle='none')

# Plot the zeroed array, each value spaced out evenly along the x axis.
# Note the x axis is just the position of the value in the zeroed array.
ax.plot(range(len(zeroed)), zeroed, 'k.')

# Plot the y=0 line.
ax.axhline(y=0.0, color='grey', linestyle='-');

In [None]:
# Calculate the average squared result.
np.mean(np.square(zeroed))

Now, because we have squared the original values, we often take the square root of the answer.

In [None]:
# Calculate the square root of the average squared result.
np.sqrt(np.mean(np.square(zeroed)))

So, here is the full calculation.

In [None]:
# The full calculation using the original array.
np.sqrt(np.mean(np.square(x - np.mean(x))))
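
In symbols, writing $x_1, \ldots, x_n$ for the values in the array and $\bar{x}$ for their mean (notation introduced here for illustration), the calculation in the cell above is:

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$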

This common calculation is built into numpy.

Note this is very close (by design) to the second parameter we sent to `np.random.normal`.

In [None]:
# Note that the function is built into numpy.
x.std()
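
As a quick check, the manual calculation and the built-in one can be compared directly. The sketch below uses its own seeded array, named `sample` here to avoid clashing with `x`, and also passes `ddof=0` explicitly, which is the default for `np.std`.

```python
import numpy as np

# Seeded generator so that the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
sample = rng.normal(10.0, 1.0, 1000)

# The manual calculation from above.
manual = np.sqrt(np.mean(np.square(sample - np.mean(sample))))

# The built-in method, and the function with its default ddof made explicit.
print(np.isclose(manual, sample.std()))
print(np.isclose(manual, np.std(sample, ddof=0)))
```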

<h2 style="color: rgb(0, 91, 94);">Bessel's Correction</h2>

<hr style="border-top: 1px solid rgb(0, 91, 94);" />

The above calculation of the standard deviation has one flaw.

If you calculate the standard deviation of a sample, it is a biased estimator for the standard deviation of the population.

<h3 style="color: rgb(0, 91, 94);">Excel's Standard Deviation</h3>

This was a common issue in Excel in years gone by.

See the following warning on the official STDEV function documentation for Excel:

*Important: This function has been replaced with one or more new functions that may provide improved accuracy and whose names better reflect their usage. Although this function is still available for backward compatibility, you should consider using the new functions from now on, because this function may not be available in future versions of Excel.*

https://support.microsoft.com/en-us/office/stdev-function-51fecaaa-231e-4bbb-9230-33650a72c9b0

The functions replacing it are STDEV.S for a sample and STDEV.P for a population (the older population function was named STDEVP):

https://support.microsoft.com/en-us/office/stdev-s-function-7d69cf97-0c1f-4acf-be27-f3e83904cc23

https://support.microsoft.com/en-us/office/stdevp-function-1f7c1c88-1bec-4422-8242-e9f7dc8bb195

<h3 style="color: rgb(0, 91, 94);">Biased Estimators</h3>

The issue is apparent when we repeatedly draw samples from a population.

If we use the above calculation on each sample, we will systematically under-estimate the standard deviation.

Let us see if we can see this in a plot.

We will use small samples as the effect is clearer.

In [None]:
# Create 100000 samples of size 5 - standard deviation is 2.0.
samples = np.random.normal(0.0, 2.0, (100000, 5))
samples

In [None]:
# Calculate standard deviation without correction.
stdevs = samples.std(axis=1)
stdevs

In [None]:
# View a histogram - hopefully we can see the estimate is too small.
fig, ax = plt.subplots(figsize=(12, 6))

# Plot histogram.
plt.hist(stdevs, bins=100)

# Draw a vertical line where the actual standard deviation is.
plt.axvline(x=2.0, color='red');

It seems clear that the peak of the histogram is below the actual value.
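
Bias is a statement about the average of the estimates, so a number can back up the picture. The sketch below repeats the simulation with a seeded generator (an arbitrary seed of 0, via `np.random.default_rng`) and averages the uncorrected estimates; the names `small_samples` and `uncorrected` are illustrative.

```python
import numpy as np

# Seeded generator so that the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)

# 100000 samples of size 5 - population standard deviation is 2.0.
small_samples = rng.normal(0.0, 2.0, (100000, 5))

# Uncorrected standard deviation of each sample.
uncorrected = small_samples.std(axis=1)

# If the estimator were unbiased, this average would be close to 2.0.
print(f'Average uncorrected estimate:  {uncorrected.mean():0.4f}')
print('Population standard deviation: 2.0')
```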

<h3 style="color: rgb(0, 91, 94);">The Correction</h3>

Bessel's correction actually applies to the variance.

The variance of a sample is the square of the standard deviation.

It is what we get if we do not apply the `np.sqrt` function in the standard deviation calculation.

The correction is to multiply the calculation by $\frac{n}{n-1}$.
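
Written out, with $x_1, \ldots, x_n$ for the sample values and $\bar{x}$ for their mean (notation introduced here for illustration), the multiplication amounts to changing the divisor from $n$ to $n-1$:

$$\frac{n}{n-1} \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$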

In [None]:
# Uncorrected variance.
np.mean(np.square(x - np.mean(x)))

In [None]:
# Corrected variance.
np.mean(np.square(x - np.mean(x))) * (len(x) / (len(x) - 1.0))
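
numpy can apply the same correction through the `ddof` parameter of `var`: `ddof=1` divides by $n-1$ instead of $n$. The sketch below, which uses a seeded generator and illustrative variable names, checks that this matches the manual correction and then compares the average of each estimate with the population variance of $2.0^2 = 4.0$.

```python
import numpy as np

# Seeded generator so that the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)

# 100000 samples of size 5 - population variance is 2.0 squared, i.e. 4.0.
small_samples = rng.normal(0.0, 2.0, (100000, 5))

# Variance of each sample, without and with Bessel's correction.
var_uncorrected = small_samples.var(axis=1)
var_corrected = small_samples.var(axis=1, ddof=1)

# ddof=1 should agree with multiplying by n / (n - 1).
n = small_samples.shape[1]
print(np.allclose(var_corrected, var_uncorrected * n / (n - 1)))

# Compare the average of each estimate with the population variance.
print(f'Average uncorrected variance: {var_uncorrected.mean():0.4f}')
print(f'Average corrected variance:   {var_corrected.mean():0.4f}')
```

The uncorrected average should land near $\frac{n-1}{n} \times 4.0 = 3.2$, while the corrected average should land near 4.0.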

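The correction can be applied to the standard deviation directly, by taking the square root of the corrected variance, but this will still lead to a slightly biased estimate of the population standard deviation.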
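
A simulation can show what happens when the correction is carried over to the standard deviation, that is, when we take the square root of the corrected variance (`ddof=1` in numpy's `std`). The sketch below uses a seeded generator and illustrative variable names.

```python
import numpy as np

# Seeded generator so that the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)

# 100000 samples of size 5 - population standard deviation is 2.0.
small_samples = rng.normal(0.0, 2.0, (100000, 5))

# Standard deviation of each sample, without and with the correction.
std_uncorrected = small_samples.std(axis=1)
std_corrected = small_samples.std(axis=1, ddof=1)

# Compare the average of each estimate with the population value of 2.0.
print(f'Average uncorrected estimate: {std_uncorrected.mean():0.4f}')
print(f'Average corrected estimate:   {std_corrected.mean():0.4f}')
```

The corrected average sits closer to 2.0 than the uncorrected one but still falls below it, because taking a square root does not preserve unbiasedness.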

<h3 style="color: #001a79;">Exercise 2</h3>

<hr style="border-top: 1px solid #001a79;" />

<i style="color: #001a79;">Remember to do these exercises in your own notebook in your assessment repository.</i>

Show that the difference between the standard deviation calculations is greatest for small sample sizes.

<hr style="border-top: 1px solid rgb(0, 91, 94);" />

<h2 style="color: rgb(0, 91, 94);">End</h2>