In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, std

# stats60 specific
from code.utils import sample_density
figsize = (8,8)

## Measurements and Data (Chapter 6)

* We’ve talked about summaries of a list of numbers so far...
* All such numbers come from some **measurement**.
* For example, Stanford frosh SAT scores were **measured**
   when you took the SAT and earned your best score.
* The book uses the example of $K_{20}$, the US national prototype kilogram.

## Measurement


<h3>
No matter how carefully it was made, 
a measurement could have come out a bit differently.
</h3>

**But how much?**


- The best way to find out is to replicate the measurement.

- The SD of the replicates estimates the likely size of the chance error in a single measurement.

## Measurement model

The basic measurement model is

     measurement = exact value + chance error

### Greek notation

- Call the individual measurement $M$
- the exact value $\mu$
- the chance error $\epsilon$

The measurement model is
$$M = \mu + \epsilon$$

### Repeated measurements

- This is a situation in which an experiment is 
repeated several times.

- Produces a list with $n$ entries 

$$\text{measurement}_i = \text{exact value} + \text{chance error}_i$$

### Greek notation

- Call our list of measurements $[M_1, \dots, M_n]$. 
- Then, $$M_i = \mu + \epsilon_i$$

### Histogram of the measurements

- Our measurement model says that the only thing changing between measurements is the chance error.
- The histogram of the measurements will be the histogram of the *chance error*,
  shifted by the *exact value*.
- In standard units, the histogram of the *measurements*
   will look like the histogram of the *chance error*
   in standard units.
- If the normal curve fits the histogram of the *chance error*
   well, it will fit the histogram of the *measurements*
   equally well.
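
The claim about standard units can be checked with a small simulation. This is only a sketch: the exact value of 8, the error SD of 1.2, the seed, and the helper `standard_units` are illustrative assumptions, not part of the course code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

exact_value = 8.0                                # hypothetical exact value
chance_error = rng.normal(0.0, 1.2, size=1000)   # hypothetical chance errors
measurements = exact_value + chance_error        # M_i = mu + epsilon_i


def standard_units(x):
    """Convert a list of numbers to standard units: (x - mean) / SD."""
    return (x - x.mean()) / x.std()


# Adding a constant shifts the mean but leaves the SD alone, so the two
# lists agree entry by entry once both are in standard units.
same = np.allclose(standard_units(measurements), standard_units(chance_error))
print("Same in standard units:", same)
```

Because the two lists match in standard units, any curve that fits one histogram in standard units fits the other equally well.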

### Example: Weighing an apple

* Suppose we have an apple that weighs exactly 8 ounces.
* Experiment: weigh the apple 100 different times.
* If we know the exact weight of the apple is 8 ounces, we can find the chance errors.

### Histogram of apples

In [None]:
%%capture
apple = np.random.standard_normal(5000)*1.2 + 8
apple_fig = plt.figure(figsize=(10,10))
apple_ax = apple_fig.gca()
sample_density(apple, bins=25, ax=apple_ax, alpha=0.5, 
             facecolor='green')

apple_ax.set_yticks([])
apple_ax.set_xlabel('Weight of apple (ounces)', fontsize=20)
apple_ax.set_title('Mean: %0.1f, SD: %0.1f' % (apple.mean(), apple.std()),
                   fontsize=20)

In [None]:
apple_fig

### Histogram of chance error

In [None]:
%%capture
error = apple - 8
error_fig = plt.figure(figsize=(6,6))
error_ax = error_fig.gca()
sample_density(error, bins=25, ax=error_ax, alpha=0.5, 
             facecolor='red')

error_ax.set_yticks([])
error_ax.set_xlabel('Chance error (ounces)', fontsize=20)
error_ax.set_title('Mean: %0.1f, SD: %0.1f' % (error.mean(), error.std()),
                   fontsize=20)


In [None]:
error_fig

The chance error *averages out* to be about 0.

## Wisdom of crowds (crowdsourcing)

In 1906, while visiting a livestock fair, Sir Francis Galton stumbled upon an intriguing contest: 

[An ox was on display](http://en.wikipedia.org/wiki/The_Wisdom_of_Crowds#cite_note-1), and the villagers were invited to guess the animal's weight after it was slaughtered and dressed.

- ~800 participants. No one hit the exact mark: 1,198 pounds.
- However: the average was 1,197!
- The chance error in 800 measurements averaged out to close to 0!
- Villagers' bias seems very close to 0.
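
A rough simulation of the same idea is sketched below. These are simulated guesses, not Galton's data: the SD of 75 pounds for an individual guess is an arbitrary assumption, and the guesses are assumed to be unbiased and normally distributed.

```python
import numpy as np

rng = np.random.default_rng(1906)  # seed chosen only for reproducibility

true_weight = 1198.0   # dressed weight of the ox, in pounds (from the text)
n_guesses = 800        # approximate number of participants (from the text)
guess_sd = 75.0        # ASSUMPTION: spread of an individual guess, in pounds

# guess_i = exact value + chance error_i, with no bias
guesses = true_weight + rng.normal(0.0, guess_sd, size=n_guesses)

# How far off is a typical individual, and how far off is the crowd's average?
typical_individual_miss = np.abs(guesses - true_weight).mean()
miss_of_average = abs(guesses.mean() - true_weight)

print("Typical individual miss: %0.1f pounds" % typical_individual_miss)
print("Miss of the average:     %0.1f pounds" % miss_of_average)
```

Under these assumptions the average lands much closer to the true weight than a typical individual guess does, because the chance errors largely cancel.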

## Size of chance error

- The likely size (in absolute value) of a chance error in a single
measurement can be estimated by the SD of a sequence
of repeated measurements.
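
As a minimal sketch of this estimate, the simulation below weighs a hypothetical 8-ounce object 100 times with chance errors of SD 1.2 (both numbers are assumptions for illustration). The SD is computed from the replicates alone and then compared with the root-mean-square of the actual chance errors, which we can only compute here because the simulation knows the exact value.

```python
import numpy as np

rng = np.random.default_rng(2)  # seed chosen only for reproducibility

exact_value = 8.0   # hypothetical exact weight, in ounces
error_sd = 1.2      # hypothetical SD of the chance error, in ounces

replicates = exact_value + rng.normal(0.0, error_sd, size=100)

# What we can compute in practice: the SD of the repeated measurements.
estimated_error_size = replicates.std()

# What we can compute only in a simulation: the r.m.s. size of the
# actual chance errors, which requires knowing the exact value.
rms_error = np.sqrt(np.mean((replicates - exact_value) ** 2))

print("SD of replicates:        %0.2f" % estimated_error_size)
print("r.m.s. of chance errors: %0.2f" % rms_error)
```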

## Normality of chance errors

- Chance errors typically have a normal histogram, and the mean of many measurements is close to the exact value.
- To get rid of chance error: repeat the measurement, and take the average.
- <font color="red">But, how many observations before the average is accurate? </font>
- This is why the normal approximation is useful! More on this later...
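
One way to get a feel for the question is to simulate it. The sketch below repeats the whole experiment many times for several values of $n$ and records how much the average varies from one experiment to the next. The exact value, error SD, and number of repetitions are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # seed chosen only for reproducibility

exact_value, error_sd = 8.0, 1.2   # hypothetical values, in ounces
n_repeats = 2000                   # how many times each experiment is rerun

spread = {}
for n in (1, 10, 100, 1000):
    # Each row is one experiment consisting of n repeated measurements.
    samples = exact_value + rng.normal(0.0, error_sd, size=(n_repeats, n))
    averages = samples.mean(axis=1)
    # SD of the averages: the likely size of the chance error in an average.
    spread[n] = averages.std()
    print("n = %4d: SD of the average = %0.3f" % (n, spread[n]))
```

In this simulation the average becomes steadily less variable as $n$ grows, though it never becomes exactly equal to the exact value.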

## Outliers

- Not all individual measurements fit the normal curve.
- This could be because the histogram of measurements *shouldn’t*
   fit the normal curve exactly ...
- Or, an error was made in some of the measurements ...
- Usually impossible to tell which ...

## Bias or systematic error

Conceptual definition (excerpt from book):


- <h3> Bias affects all measurements the same way, pushing them 
in the same direction</h3>

- <h3> Chance errors change from measurement 
to measurement, sometimes up and sometimes down.</h3>


## Measurement model

In light of the possibility of bias, we should
rephrase our measurement model as:

     measurement = exact value + bias + chance error

### Greek notation

- Call the bias $B$. Then
$$M = \mu + B + \epsilon$$

## Repeated measurements

- Produces a list with $n$ entries 
$$\text{measurement}_i = \text{exact value} + \text{bias} + \text{chance error}_i$$

### Greek notation

- Call our list of measurements $[M_1, \dots, M_n]$. Then, 
$$M_i = \mu + B + \epsilon_i$$

## Take away

**A measurement has three parts:**

- **True value.** This is what we care about.
- **Chance error.** This is something unavoidable in the measurement process. With many measurements, this should average out.
- **Bias.** This is undesirable.

### Selection bias

Suppose that the heights of all female freshmen at U.S. colleges follow a normal
curve with average 68 inches and SD 3 inches.

- What would you expect the histogram to look like if you tried to 
measure the average height by choosing names of students 
at random from [http://stanfordwho.edu](http://stanfordwho.edu)?

- **What if you built your sample by measuring the heights
of all female basketball players?** We will assume that these players are drawn from women 72 inches or taller (which is, of course, a gross simplification).

In [None]:
%%capture
female_fig = plt.figure(figsize=(10,10))
female_ax = female_fig.gca()
female_sample = np.random.standard_normal(5000)*3+68
sample_density(female_sample,
             bins=30, ax=female_ax)
female_ax.set_title('Choosing from http://stanfordwho.edu', fontsize=20)

In [None]:
female_fig

In [None]:
female_ax.set_title('Mean: %0.1f, SD: %0.1f' % (female_sample.mean(), 
                                                female_sample.std()),
                   fontsize=20)
female_fig

In [None]:
%%capture
biased_sample = np.random.standard_normal(17000)*3+68
biased_sample = biased_sample[biased_sample > 72]
biased_fig = plt.figure(figsize=(10,10))
biased_ax = biased_fig.gca()
sample_density(biased_sample, 
             bins=25, ax=biased_ax, facecolor='gray')
biased_ax.set_title('Choosing from basketball players', fontsize=20)


In [None]:
biased_fig

In [None]:
biased_ax.set_title('Mean: %0.1f, SD: %0.1f' % (biased_sample.mean(), 
                                                biased_sample.std()),
                   fontsize=20)
biased_fig

## Take away from basketball example

- This way of measuring introduced **bias** in that
the average of the gray histogram is roughly 73, about 5 inches
larger than the true average of 68.

- It also changed the histogram of the **chance error**. It is now *skewed right.*

- Computing the mean from a biased sample generally won't tell
you much about the **exact value.**

- Would you say this bias is similar to the bias we discussed in the NFIP study?
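
The size of the bias and the direction of the skew can be checked numerically. This is a sketch using only NumPy under the same simplifying assumption as above (players are everyone taller than 72 inches); the sample size, seed, and the moment-based skewness measure are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # seed chosen only for reproducibility

true_average, sd, cutoff = 68.0, 3.0, 72.0   # inches, as in the example

population = rng.normal(true_average, sd, size=100_000)
selected = population[population > cutoff]   # the "basketball" sample

# Bias: how far the selected sample's average sits from the true average.
bias = selected.mean() - true_average

# Skewness as the average cubed value in standard units:
# positive means a longer right tail.
z = (selected - selected.mean()) / selected.std()
skewness = np.mean(z ** 3)

print("Average of selected sample: %0.1f" % selected.mean())
print("Bias: %0.1f inches" % bias)
print("Skewness: %0.2f" % skewness)
```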

In [None]:
%%capture
biased_error_fig = plt.figure(figsize=(10,10))
biased_ax = biased_error_fig.gca()
biased_error = biased_sample - biased_sample.mean()
sample_density(biased_error, 
             bins=25, ax=biased_ax, facecolor='gray')
biased_ax.set_title('Mean: %0.1f, SD: %0.1f' % (biased_error.mean(), 
                                                biased_error.std()),
                   fontsize=20)



In [None]:
biased_error_fig

## Problems encountered with bias

* Bias doesn’t disappear with repeated measurements $$\text{average}(\text{list with bias}) = \text{average}(\text{unbiased list}) + \text{bias}$$

### Greek notation

* In this notation $\bar{M} = \mu + B + \bar{\epsilon}$
* Becomes very worrying when trying to compare two averages.
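
A minimal numerical check of $\bar{M} = \mu + B + \bar{\epsilon}$ is sketched below, with hypothetical values $\mu = 8$, $B = 2$, and chance errors of SD 1.2. It also shows that taking more measurements shrinks $\bar{\epsilon}$ but leaves $B$ untouched.

```python
import numpy as np

rng = np.random.default_rng(5)  # seed chosen only for reproducibility

mu, B, error_sd = 8.0, 2.0, 1.2   # hypothetical exact value, bias, error SD

# Same chance errors, with and without bias.
eps = rng.normal(0.0, error_sd, size=100)
unbiased = mu + eps
biased = mu + B + eps

# average(list with bias) = average(unbiased list) + bias
print(np.isclose(biased.mean(), unbiased.mean() + B))

# More measurements do not help: the average stays about B away from mu.
gaps = {}
for n in (100, 10_000, 1_000_000):
    M = mu + B + rng.normal(0.0, error_sd, size=n)
    gaps[n] = M.mean() - mu
    print("n = %7d: average - exact value = %0.3f" % (n, gaps[n]))
```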

## Example: Weighing an apple and an orange with biased scales

* Suppose we also have an orange that weighs exactly 8 ounces.
* We weigh the apple and the orange 100 different times each.
* The scale we use for the orange has the same SD as the one we use for the apple but,
without our knowing it, the orange's scale is biased while the apple's scale is unbiased.
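
Before looking at histograms, here is a sketch of what happens to the comparison of averages. The bias of 2 ounces and error SD of 1.2 are assumptions for this illustration, and the variable names are not tied to the cells that follow.

```python
import numpy as np

rng = np.random.default_rng(6)  # seed chosen only for reproducibility

exact_weight = 8.0   # both fruits weigh exactly 8 ounces
scale_bias = 2.0     # ASSUMPTION: bias of the orange's scale, in ounces
error_sd = 1.2       # ASSUMPTION: same chance-error SD on both scales

apple_weights = exact_weight + rng.normal(0.0, error_sd, size=100)
orange_weights = exact_weight + scale_bias + rng.normal(0.0, error_sd, size=100)

# The true difference is 0, but the observed difference is about the bias.
observed_difference = orange_weights.mean() - apple_weights.mean()
print("Observed difference in averages: %0.2f ounces" % observed_difference)
```

Someone who did not know about the biased scale would conclude that the orange is heavier, even though both weigh the same.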

In [None]:
%%capture
bias = 2 
biased_apple = apple + bias
biased_fig = plt.figure(figsize=(8,8))
biased_ax = biased_fig.gca()
sample_density(biased_apple, bins=25, ax=biased_ax, alpha=0.5, 
             facecolor='orange')

biased_ax.set_yticks([])
biased_ax.set_xlabel('Weight of orange using biased scale (ounces)', fontsize=20)
biased_ax.set_title('Mean: %0.1f, SD: %0.1f' % (biased_apple.mean(), 
                                                        biased_apple.std()),
                   fontsize=20)

In [None]:
biased_fig

### Histogram of apples and oranges

In [None]:
%%capture
orange = np.random.standard_normal(5000)*1.4 + 8
sample_density(orange + bias, ax=apple_ax, bins=25,
            facecolor='orange')
apple_ax.set_title('')
apple_ax.set_xlabel('Weight (ounces)')


In [None]:
apple_fig

### Dealing with bias

* What if we weighed the apple half the time on one scale and half on the other?
* And did the same with the oranges ...
* Will the histogram still look like a normal curve?
* Would you say that this is similar to randomization in the NFIP study?

### Histogram of apples on two scales

In [None]:
%%capture
apple_random = apple + 2 * np.random.binomial(1, 0.5, size=apple.size)
orange_random = orange + 2 * np.random.binomial(1, 0.5, size=orange.size)
random_fig = plt.figure(figsize=(10,10))
random_ax = random_fig.gca()
sample_density(apple_random, ax=random_ax, bins=30,
            facecolor='green')
sample_density(orange_random, ax=random_ax, bins=30,
            facecolor='orange')
random_ax.set_xlabel('Weight (ounces)', fontsize=20)

In [None]:
random_fig

### Take-away from the randomization experiment


- Both histograms have had the average shifted by 1 (half the bias).

- The bias in the difference we saw when we used one scale exclusively
for oranges has been reduced.

- The weights of the apples and oranges are *confounded*
   by the difference between the scales.

- The variable that says which scale was used to make each weight is a *confounding variable*.

- Randomization allowed us to eliminate this confounder.

- Each histogram is now a mix of two normal curves, one for each scale, so it is close to a normal curve, but not quite.
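
These points can be checked numerically with a sketch like the one below. It assumes a bias of 2 ounces on one scale, an error SD of 1.2 for both fruits, and a fair coin flip to pick the scale for each weighing; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)  # seed chosen only for reproducibility

n, exact_weight, scale_bias, error_sd = 5000, 8.0, 2.0, 1.2

apple_true = exact_weight + rng.normal(0.0, error_sd, size=n)
orange_true = exact_weight + rng.normal(0.0, error_sd, size=n)

# Coin flip for each weighing: 1 means the biased scale was used.
apple_scale = rng.integers(0, 2, size=n)
orange_scale = rng.integers(0, 2, size=n)

apple_obs = apple_true + scale_bias * apple_scale
orange_obs = orange_true + scale_bias * orange_scale

# Both averages shift by about half the bias, so the shift cancels
# in the difference between them.
apple_shift = apple_obs.mean() - exact_weight
difference = orange_obs.mean() - apple_obs.mean()

print("Shift in apple average: %0.2f" % apple_shift)
print("Difference in averages: %0.2f" % difference)
print("SD before / after mixing scales: %0.2f / %0.2f"
      % (apple_true.std(), apple_obs.std()))
```

The price of randomizing is a larger SD, since each list now mixes weighings from two scales.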


### Other examples of measurements


* Opinion polls. (Does our measurement model work here? What is bias?)
* Weighing 100 different apples on Scale 1 instead of 1 apple 100 times. (Would you get the same SD? Would it be larger? Smaller?)
* The SAT test is a measurement of "aptitude". 
* An MRI scan is a measurement (actually, several hundred thousand measurements).
* Body weight is a measurement.