Bias

In [None]:
# importing numerical arrays
import numpy as np
# importing plotting capabilities
import matplotlib.pyplot as plt

Cognitive Bias  
There are distinctive patterns in the errors that people make.  
Systematic errors are known as biases and recur predictably in particular circumstances.  
For example, when a handsome and confident speaker takes the stage, we can expect the audience to judge his comments more favourably than he deserves.

Exercise 1:  
Give three real-world examples of different types of cognitive bias.

References:  
https://www.towergateinsurance.co.uk/liability-insurance/cognitive-biases  
https://positivepsychology.com/cognitive-biases/

Example 1: Gender Bias  
This bias is our tendency to assign specific characteristics and behaviours to a particular gender without supporting evidence; such assumptions are stereotypes.  
Examples include claims such as "women are more caring than men" or "men have better salaries than women". When stated without concrete statistics or evidence, such claims can amount to misinformation, yet we readily accept them as fact.

Example 2: Anchoring Effect  
We tend to cling, or anchor, to the first piece of information presented to us without researching the matter further.  
This bias is common when purchasing an item. For example, during negotiations the seller sets the price of their car at €2,000, but the buyer haggles them down to €1,750.  
The buyer will feel they bagged a bargain, although the Latin phrase Caveat Emptor, meaning "let the buyer beware", applies.  
The seller can set the asking price above the car's actual worth compared to similar models advertised; they would not open at its actual worth of €1,500, because they would then be haggled down further.  
The buyer may also neglect to research the market in detail: they may simply choose a seller close to their town instead of looking further afield for a more affordable car.

Example 3: Confirmation Bias  
This bias is the tendency to search only for information that confirms our own beliefs and expectations regarding various societal matters.  
This gives the individual self-assurance and security but, as the saying goes, "that is only one side of the coin".  
It is biased because you focus only on your own agenda and do not consider the opinions and perspectives of others, which would help you understand societal matters better.

Statistical Bias  
The mean and the standard deviation are two common calculations in statistics.  
The mean is what is commonly known as the average: add up all the numbers you have, then divide the total by how many numbers there are.
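As a minimal sketch of that definition, the mean can be computed by hand and compared with NumPy's built-in function. The three values below are made up for illustration and do not come from the notebook's data.

```python
import numpy as np

# illustrative values (an assumption, not taken from the notebook)
values = np.array([2.0, 4.0, 9.0])

# add up all the numbers, then divide by how many there are
manual_mean = values.sum() / len(values)

print(manual_mean)       # 5.0
print(np.mean(values))   # 5.0
```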

Mean

In [None]:
# Generating a sample of 1000 values from a normal distribution
x = np.random.normal(10.0, 1.0, 1000)
x

If we take the mean of the sample, it is a good estimate of the population mean.

In [None]:
x.mean()

What is meant by a good estimate?  
We will take more samples and investigate this.

In [None]:
# Running a simulation of taking 1000 samples of size 1000
samples = np.random.normal(10.0, 1.0, (1000, 1000))
samples

In [None]:
# calculating the mean of the 1st sample at index 0
samples[0].mean()

In [None]:
# calculating the mean of all samples
sample_means = samples.mean(axis = 1)
sample_means

In [None]:
# plotting results on a histogram
plt.hist(sample_means)
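One informal reading of "a good estimate", offered as a sketch rather than a formal definition: across repeated samples, the sample means centre on the population mean of 10.0 and stay close to it. The seed and the names `sim` and `sim_means` below are illustrative assumptions, chosen so that the check is reproducible and does not overwrite the notebook's own variables.

```python
import numpy as np

# seeded generator so the numbers are reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=0)

# 1000 samples of size 1000 from a normal distribution with mean 10.0 and SD 1.0
sim = rng.normal(10.0, 1.0, (1000, 1000))
sim_means = sim.mean(axis=1)

# the sample means centre on the population mean ...
print(f'mean of the sample means: {sim_means.mean():0.4f}')
# ... and vary only a little from sample to sample
print(f'spread of the sample means: {sim_means.std():0.4f}')
print(f'largest distance from 10.0: {np.abs(sim_means - 10.0).max():0.4f}')
```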

Standard Deviation  
The calculation of the standard deviation is less familiar than that of the mean.  
It is designed to measure how far, in general, the numbers are from the mean.

In [None]:
# an array of numbers with one large number
numbers1 = np.array([1, 1, 1, 1, 10])
# the mean of the array (printed so that both results are displayed)
print(np.mean(numbers1))

# an array of numbers all close together
numbers2 = np.array([2, 2, 3, 3, 4])
# the mean of the array
print(np.mean(numbers2))

Both arrays have the same mean (2.8), even though their values are distributed very differently.  
The mean alone is therefore a limited summary of the data points, so we also use the standard deviation, which gives us a measure of their spread.
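A small sketch comparing the two arrays: it uses `np.std`, NumPy's built-in standard deviation, ahead of the step-by-step calculation, simply to show that the spread differs even when the means agree.

```python
import numpy as np

numbers1 = np.array([1, 1, 1, 1, 10])
numbers2 = np.array([2, 2, 3, 3, 4])

# the means are identical
print(np.mean(numbers1), np.mean(numbers2))

# the standard deviations are not: numbers1 is far more spread out
print(np.std(numbers1), np.std(numbers2))
```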

In [None]:
# Generating sample to calculate standard deviation
x = np.random.normal(10.0, 1.0, 1000)
x

# calculating the mean
x_mean = x.mean()
# subtracting the mean from each value
zeroed = x - x_mean
zeroed

# calculating the mean of zeroed
zeroed.mean()
print(f'{zeroed.mean():0.4f}')

Subtracting the sample mean from every value gives an array whose mean is 0 (up to floating-point rounding).  
The standard deviation builds on this zeroed array.  
We want to summarise the zeroed array without losing information about the distance of each point from the mean.

In [None]:
# creating a plot
fig, ax = plt.subplots(figsize = (12, 6))
# plotting the zeroed array
ax.plot(range(len(zeroed)), zeroed, 'k.')
# plotting the y = 0 line
ax.axhline(y = 0.0, color = 'grey', linestyle = '-')

We could take the average (vertical) distance of each point from the mean of 0.  
But the plot contains both positive and negative values, which cancel each other out when added together.

In [None]:
# Sum the array
print(f'{zeroed.sum():0.4f}')

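The sum is equal to 0 (up to rounding), because the positive and negative deviations cancel out.  
We expect distances to be positive, so one option is to take the absolute value of each deviation.  
The average absolute value gives a reasonable measure of spread, but the standard deviation squares the values instead: this also ensures positive values, and it makes larger values larger still.  
As a result, larger deviations from the mean contribute relatively more to the standard deviation.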

In [None]:
# Absolute values
np.abs(zeroed)

# Average absolute value
np.mean(np.abs(zeroed))

# Square the values to ensure positive values
np.square(zeroed)

In [None]:
# creating a plot
fig, ax = plt.subplots(figsize = (12, 6))

# plotting the squared zeroed array
ax.plot(range(len(zeroed)), np.square(zeroed), color = 'red', marker = '.', linestyle = 'none')

# plotting the zeroed array
ax.plot(range(len(zeroed)), zeroed, 'k.')

# plotting the y = 0 line
ax.axhline(y = 0.0, color = 'grey', linestyle = '-')

In [None]:
# calculating the averaged squared result
np.mean(np.square(zeroed))

# taking the square root of the answer
np.sqrt(np.mean(np.square(zeroed)))

# full calculation in one line, using numpy functions on the original array
np.sqrt(np.mean(np.square(x - np.mean(x))))

The result is close to 1.0, the second parameter (the standard deviation) passed to np.random.normal().

In [None]:
# Standard deviation function built in numpy
x.std()

Biased Estimators  
The standard deviation calculated from a sample in this way is a biased estimator of the standard deviation of the population.  
If we use the above calculation on each sample, we systematically underestimate the population standard deviation on average.  
We will use small samples, where the effect is large enough to see in a plot.

In [None]:
# creating 100,000 samples of size 5, where mean = 0.0 and standard deviation (SD) = 2.0
samples = np.random.normal(0.0, 2.0, (100000, 5))
samples

In [None]:
# calculating SD without correction
stdevs = samples.std(axis = 1)
stdevs

In [None]:
# creating a figure
fig, ax = plt.subplots(figsize = (12, 6))

# plotting a histogram
ax.hist(stdevs, bins = 100)

# drawing a vertical line where the actual SD is
ax.axvline(x = 2.0, color = 'red')

The peak of the histogram lies to the left of the red line, i.e. below the actual value of 2.0.
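The picture can be backed up with numbers. The sketch below repeats the simulation with a seeded generator (the seed and the names `small_samples` and `uncorrected` are illustrative assumptions) and reports the average uncorrected SD along with the share of samples that fall below the true value of 2.0.

```python
import numpy as np

# seeded generator so the numbers are reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=1)

# 100,000 samples of size 5, where mean = 0.0 and SD = 2.0
small_samples = rng.normal(0.0, 2.0, (100000, 5))

# SD of each sample without correction
uncorrected = small_samples.std(axis=1)

# on average the uncorrected SD comes out below the true value of 2.0
print(f'average uncorrected SD: {uncorrected.mean():0.4f}')
# and most individual samples underestimate it
print(f'fraction of samples below 2.0: {(uncorrected < 2.0).mean():0.4f}')
```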

Bessel's Correction  
This correction applies to the variance, which is the square of the standard deviation (SD).  
It's what we get if we do not apply the np.sqrt() function in our calculation for SD.  
The correction is to multiply the uncorrected variance by n/(n-1), where n is the sample size.

In [None]:
# Uncorrected variance
np.mean(np.square(x - np.mean(x)))

# Corrected variance
np.mean(np.square(x - np.mean(x))) * (len(x) / (len(x) - 1.0))
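NumPy can apply the same correction through the `ddof` ("delta degrees of freedom") argument of `np.var` and `np.std`, which changes the divisor from n to n - ddof. The sketch below checks the manual calculation, with the factor n/(n-1), against `ddof=1`; the seed and the name `data` are illustrative assumptions.

```python
import numpy as np

# seeded generator so the numbers are reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=2)
data = rng.normal(10.0, 1.0, 1000)
n = len(data)

# manual calculation: divide by n, then apply Bessel's correction
uncorrected_var = np.mean(np.square(data - np.mean(data)))
corrected_var = uncorrected_var * n / (n - 1)

# np.var divides by n by default and by n - 1 when ddof=1
print(np.isclose(uncorrected_var, np.var(data)))          # True
print(np.isclose(corrected_var, np.var(data, ddof=1)))    # True
```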

Exercise 2:  
Show that the difference between the corrected and uncorrected standard deviation calculations is greatest for small sample sizes.
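One possible starting point, as a sketch rather than a full answer: compute both versions of the standard deviation for samples of increasing size and compare them. The sample sizes, the seed and the name `ratios` are illustrative assumptions. Because the two calculations differ only in their divisor, their ratio is sqrt(n/(n-1)), which shrinks towards 1 as n grows.

```python
import numpy as np

# seeded generator so the numbers are reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=3)

ratios = {}
for n in [2, 5, 10, 100, 1000]:
    sample = rng.normal(0.0, 2.0, n)
    sd_uncorrected = sample.std()        # divides by n
    sd_corrected = sample.std(ddof=1)    # divides by n - 1
    ratios[n] = sd_corrected / sd_uncorrected
    print(f'n = {n:4d}  uncorrected = {sd_uncorrected:0.4f}  '
          f'corrected = {sd_corrected:0.4f}  ratio = {ratios[n]:0.4f}')
```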

References:  
https://web.pdx.edu/~newsomj/pa551/lecture4.htm  
https://s4be.cochrane.org/blog/2018/09/26/a-beginners-guide-to-standard-deviation-and-standard-error/  
https://towardsdatascience.com/using-standard-deviation-in-python-77872c32ba9b