## Experiments (Chapter 1)

* Does smoking cause lung cancer?
* How can this be established – by comparing smokers to non-smokers?
* Is a vaccine effective against some infectious disease?

### Treatment and control groups

* In the smoking study: **treatment = smokers**, **control = non-smokers**.

* In a vaccine study: **treatment = patients who are vaccinated**, **control = non-vaccinated**.

* Ideally, the only difference between **treatment** and **control** is whether or not they receive the treatment.

### Randomized controlled experiments

* The best way to establish that smoking causes lung cancer is a **randomized controlled experiment**.
* In such a study, patients would be assigned at random to a smoking group or a non-smoking group.
* These experiments are not always possible: we can't force people to smoke.
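
Random assignment itself is easy to sketch in code. The snippet below is only an illustration, not the procedure used in any study discussed here: the number of subjects (`n_subjects`), the seed, and the even split are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_subjects = 10                       # hypothetical number of enrolled subjects
subject_ids = np.arange(n_subjects)   # subjects labelled 0, 1, ..., 9

# Shuffle the subjects, then split the shuffled list in half.
# Chance alone decides who lands in which group.
shuffled = rng.permutation(subject_ids)
treatment = np.sort(shuffled[: n_subjects // 2])
control = np.sort(shuffled[n_subjects // 2 :])

print("treatment:", treatment)
print("control:  ", control)
```

Because the split depends only on the shuffle, nothing about a subject (wealth, age, willingness to volunteer) can influence which group they end up in.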

### Why randomize?

* In the polio vaccine example described in the book, wealthy families, whose children were more vulnerable to polio, were also more likely to volunteer for vaccination.
* If **treatment** is assigned based on whoever volunteers, this could bias the experiment against the vaccine, i.e. its apparent effectiveness is diminished.

* This means there will be differences between **treatment** and **control** groups other than just the vaccine.

## NFIP study


Group | Size | Rate (per 100,000)
--- | --- | ---
Treatment | 225,000 | 25
Control | 725,000 | 54
No consent | 125,000 | 44

### Placebo effect and blinding

* Subjects in the **control** group should be given a "treatment" with no effect. That is, they should ideally be *blinded*.
* Why? So the response is not due to the idea of a vaccine, but the vaccine itself.
* In the vaccination example, children were given an injection of salt and water.
* This treatment is called a **placebo**.

## Double blinding

* If the doctors know who receives treatment and who receives placebo, they may also bias the results by their reporting.
* For example, polio diagnosis is not perfect. A doctor with an interest in the success of the vaccine may declare a treated child with mild polio healthy, or an untreated (placebo) child who is close to healthy as having mild polio.
* This bias may be conscious or unconscious on the doctors’ part.

## Double-blind study


Group | Size | Rate (per 100,000)
--- | --- | ---
Treatment | 200,000 | 28
Control | 200,000 | 71
No consent | 350,000 | 46

## Another advantage

- As we’ll see later, the other advantage of the randomized controlled experiment is that, apart from the treatment itself, the only source of difference in rates between **treatment** and **control** is randomness.
- We will compute the chances of seeing a difference in rate as large as (71 - 28) per 100,000 *assuming the vaccine has no effect*.
- **The chances will be very small.**
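
A rough version of this calculation can already be sketched. The approach below is an assumption of this example and not necessarily the method developed later: it converts the rates in the double-blind table into case counts, and then uses the fact that, with equal group sizes and no vaccine effect, each polio case is equally likely to fall in either group. The chance that the treatment group gets as few cases as observed is then a one-sided binomial tail.

```python
from math import comb

group_size = 200_000   # size of each randomized group (from the table)
rate_treatment = 28    # polio cases per 100,000, treatment group
rate_control = 71      # polio cases per 100,000, control group

# Convert rates per 100,000 into case counts.
cases_treatment = rate_treatment * group_size // 100_000
cases_control = rate_control * group_size // 100_000
total_cases = cases_treatment + cases_control

# Assumption: given the total number of cases, each case lands in the
# treatment group with probability 1/2 when the vaccine has no effect.
# Chance of seeing cases_treatment or fewer cases in the treatment group:
p_value = sum(comb(total_cases, k) for k in range(cases_treatment + 1)) / 2**total_cases

print(cases_treatment, cases_control, p_value)
```

The resulting probability is far below one in a million, which is the sense in which the chances are "very small".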

## Observational studies (Chapter 2)

* Unlike in randomized controlled experiments, in *observational studies* the subjects are assigned to **treatment** or **control** by an *uncontrolled* mechanism.
* In a smoking / lung cancer study, subjects choose to smoke or not.
* Very often the **treatment** and **control** groups differ by more than just the treatment.

## Smoking & socio-economic status

Smoking is related to socio-economic status. 

Smokers:
* tend to be in lower socio-economic status groups with less access to medical care;
* will tend to have higher incidence of some diseases based on this fact alone.

## Association is not causation

* In children, shoe size is associated with reading ability.
* However, having big feet does not cause children to score high on reading tests.

## Confounding

The big problem with observational studies is **confounding**.

- Confounding means there is a difference between the treatment and control groups, other than the treatment itself, which affects the response being studied.

- A confounder is a third variable, associated with exposure and with disease.


## Confounding in reading ability

In our example with shoe size and reading ability, **age is a confounder.**

- Both shoe size and reading ability are associated with age.
- As children age, their feet grow.
- As children age, their reading improves.
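
A small simulation makes the point concrete. Everything in it is made up for illustration: the age range, the coefficients, and the noise levels are assumptions, not data. Shoe size and reading score each depend on age only, yet they come out strongly correlated overall; among children of the same age the correlation essentially disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7000

# Hypothetical children aged 6 to 12 (upper bound of integers() is exclusive).
age = rng.integers(6, 13, size=n)

# Both variables are driven by age plus independent noise;
# neither has any direct effect on the other.
shoe_size = 0.5 * age + rng.normal(0, 1, size=n)
reading = 10 * age + rng.normal(0, 10, size=n)

# Correlation across all children: age pushes both variables up together.
overall_corr = np.corrcoef(shoe_size, reading)[0, 1]

# Correlation among 9-year-olds only: age is held fixed.
same_age = age == 9
within_corr = np.corrcoef(shoe_size[same_age], reading[same_age])[0, 1]

print(f"all ages:   {overall_corr:.2f}")
print(f"age 9 only: {within_corr:.2f}")
```

Holding the confounder fixed (here, looking within one age) is the same idea used later when the admissions data are broken down by major.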

### The problem with confounding

- It is generally impossible to rule out all possible confounding variables.

- Therefore, establishing a causal link between two observed variables, e.g. smoking and lung cancer, can be difficult.

- Fisher, one of the greatest statisticians, believed there was a confounding variable in the case of lung cancer and cigarettes.

- Modern medicine would say he was wrong...

## Sex bias in admissions to graduate school

A study from UC Berkeley:


Major | # (Male) | % (Male) | # (Female) | % (Female)
--- | --- | --- | --- | ---
A | 825 | 62 | 108 | 82
B | 560 | 63 | 25 | 68
C | 325 | 37 | 593 | 34
D | 417 | 33 | 375 | 35
E | 191 | 28 | 393 | 24
F | 373 | 6 | 341 | 7
Total | 2691 | **44** | 1835 | **35**

Female admission rates are close to or higher than male rates in every one of these majors,
but the overall female rate is lower.

**Is this evidence of bias against female applicants?**

No, but it is confusing. It is a phenomenon known as [Simpson's paradox](http://en.wikipedia.org/wiki/Simpson's_paradox).
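
The reversal can be checked directly from the table. The sketch below recomputes each group's overall admission rate as a size-weighted average of its own per-major rates; the variable names are illustrative. Because the per-major percentages in the table are rounded, the recomputed overall figures need not reproduce the Total row exactly.

```python
import numpy as np

majors = ["A", "B", "C", "D", "E", "F"]
male_n = np.array([825, 560, 325, 417, 191, 373])
male_rate = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
female_n = np.array([108, 25, 593, 375, 393, 341])
female_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])

# Overall rate = admitted applicants / all applicants, within each group.
male_overall = (male_n * male_rate).sum() / male_n.sum()
female_overall = (female_n * female_rate).sum() / female_n.sum()

# Majors in which the female rate is at least the male rate.
female_ahead = [m for m, fr, mr in zip(majors, female_rate, male_rate) if fr >= mr]

# Share of each group applying to the two easiest majors, A and B.
male_share_ab = male_n[:2].sum() / male_n.sum()
female_share_ab = female_n[:2].sum() / female_n.sum()

print(f"overall: male {male_overall:.1%}, female {female_overall:.1%}")
print("female rate >= male rate in majors:", female_ahead)
print(f"applying to A or B: male {male_share_ab:.1%}, female {female_share_ab:.1%}")
```

Each group's overall rate weights the per-major rates by where that group applied, so two groups with similar per-major rates can still have very different overall rates.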

### Females were applying to more competitive majors

```python
%%capture

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# `code.probability` is assumed to be a local module that accompanies this notebook.
from code.probability import Multinomial

# table[major, gender, outcome]: expected counts of (accept, deny),
# computed as group size times the admission (or rejection) rate.
table = np.zeros((6,2,2))
table[0,0] = [825*.62,825*.38]
table[0,1] = [108*.82,108*.18]
table[1,0] = [560*.63,560*.37]
table[1,1] = [25*.68,25*.32]
table[2,0] = [325*.37,325*.63]
table[2,1] = [593*.34,593*.66]
table[3,0]= [417*.33,417*.67]
table[3,1]= [375*.35,375*.65]
table[4,0] = [191*.28,191*.72]
table[4,1] = [393*.24,393*.76]
table[5,0] = [373*.06,373*.94]
table[5,1] = [341*.07,341*.93]

UCB = Multinomial(table,
                  labels=[['A','B','C','D','E','F'],
                          ['Male','Female'],
                          ['Accept', 'Deny']])

UCB_female = UCB.condition_margin(1, 'Female')

UCB_male = UCB.condition_margin(1, 'Male')
UCB_male.sample(10000)
male = np.squeeze(UCB_male.prob)
UCB_female.sample(1000)
female = np.squeeze(UCB_female.prob)

# Sum over outcomes to get the distribution of majors within each gender.
male_major = np.sum(male, 1)
female_major = np.sum(female, 1)

major_fig = plt.figure(figsize=(10,10))
major_ax = major_fig.gca()
major_ax.bar(range(female_major.shape[0]), 100 * female_major, alpha=0.5, facecolor='blue', label='Female')
major_ax.bar(range(male_major.shape[0]), 100 * male_major, alpha=0.5, facecolor='yellow', label='Male')
major_ax.legend()
# Fix the tick positions first so each label lines up with its bar.
major_ax.set_xticks(range(6))
major_ax.set_xticklabels(['A','B','C','D','E','F'])
major_ax.set_xlabel('Major', fontsize=20)
major_ax.set_ylabel('Percentage', fontsize=20)
major_ax.set_title('Breakdown of Major by Gender', fontsize=20)

# Sum over gender, then divide accepted by total to get each major's acceptance rate.
overall_rate = table.sum(1)
accept_rate = overall_rate[:,0] / overall_rate.sum(1)

accept_fig = plt.figure(figsize=(10,10))
accept_ax = accept_fig.gca()
accept_ax.bar(range(accept_rate.shape[0]), 100 * accept_rate, alpha=0.5, facecolor='red', label='Acceptance')
accept_ax.set_xticks(range(6))
accept_ax.set_xticklabels(['A','B','C','D','E','F'])
accept_ax.set_xlabel('Major', fontsize=20)
accept_ax.set_ylabel('Percentage', fontsize=20)
accept_ax.set_title('Acceptance Rate by Major', fontsize=20)
```




```python
major_fig
```

```python
accept_fig
```

### Confounder

- In this example, *Major* was a *confounder* for the relationship between *Gender* and *Admission status*.



### Weighted average

A clearer picture can be found by computing a weighted average of the 
admission rate. 

The average will be weighted by the total number of people applying to each major.

 $$\begin{aligned}
   \text{Male} &= \frac{0.62 \times 933 + 0.63 \times 585 + 0.37 \times 918}{4526} \\
   & \qquad  +  \ \frac{0.33 \times 792 + 0.28 \times 584 + 0.06 \times 714}{4526} \\
   &= 39 \% \\
   \end{aligned}
   $$
   
   $$
   \begin{aligned}
   \text{Female} &= \frac{0.82 \times 933 + 0.68 \times 585 + 0.34 \times 918}{4526} \\
   & \qquad + \  \frac{0.35 \times 792 + 0.24 \times 584 + 0.07 \times 714}{4526} \\
   &= 43 \%
   \end{aligned}$$
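
The same arithmetic can be reproduced in a few lines. This is a minimal sketch with illustrative variable names; it uses the per-major counts and rates from the table and weights both genders' rates by the same totals (male plus female applicants in each major).

```python
import numpy as np

male_n = np.array([825, 560, 325, 417, 191, 373])
female_n = np.array([108, 25, 593, 375, 393, 341])
male_rate = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
female_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])

# Common weights: total applicants per major (933, 585, 918, 792, 584, 714).
total_n = male_n + female_n

# Weighted averages of the admission rates, using the same weights for both.
male_weighted = np.average(male_rate, weights=total_n)
female_weighted = np.average(female_rate, weights=total_n)

print(f"male:   {male_weighted:.0%}")
print(f"female: {female_weighted:.0%}")
```

Using one set of weights for both genders removes the effect of men and women applying to different majors, which is why this comparison comes out differently from the raw overall rates.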

