# Inference and Statistical Models

This section is based partly on Freedman, D.A., 2009. [_Statistical Models: Theory and Practice_, Revised Edition](http://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850/), Cambridge University Press.

Statistical models, and regression in particular, are used primarily for three purposes:

1. _Description_: to summarize data
2. _Prediction_: to predict future data
3. _Causal Inference_: to predict what would happen in response to an intervention

It is straightforward to check whether a regression model is a good summary of _existing_ data, although there is some subtlety in determining whether the summary is _good enough_.  How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.

Prediction is harder than description because it involves _extrapolation_: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., _stationary_) or are its properties changing with time?

However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is _confounding_, especially when the data arise from _observational studies_ rather than _randomized, controlled experiments_. (_Natural experiments_ lie somewhere in between; there are few that approach the utility of
a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)

_Confounding_ happens when one factor or variable manifests in the outcome in a way that cannot be distinguished from the _treatment_.

_Stratification_ (e.g., _cross tabulation_) can help reduce confounding. So can modeling&mdash;in some cases, but not in others. 
For modeling to help, it is generally necessary for the structure of the model to 
correspond to how the data were actually generated.
Unfortunately, most models in science, especially social science, are chosen out of habit or computational
convenience, not because they have a basis in the science itself.
This often produces misleading results, and the misleading impression that those results have small
uncertainties.
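
A toy calculation can make the role of stratification concrete. The sketch below uses made-up counts (not data from any study): a binary confounder `z` raises the response and also makes treatment more likely, while treatment itself has no effect at all. All names and numbers are assumptions chosen for illustration.

```r
# Hypothetical counts: within z = 0, 80 untreated and 20 treated;
# within z = 1, 20 untreated and 80 treated.
z       <- rep(c(0, 0, 1, 1), times = c(80, 20, 20, 80))
treated <- rep(c(0, 1, 0, 1), times = c(80, 20, 20, 80))
y       <- 10 + 5 * z          # response depends on z only, not on treatment

# Crude comparison: treated mean minus untreated mean, ignoring z
crude <- mean(y[treated == 1]) - mean(y[treated == 0])

# Stratified comparison: the same difference computed within each level of z
withinStratum <- sapply(c(0, 1), function(s) {
    mean(y[treated == 1 & z == s]) - mean(y[treated == 0 & z == s])
})

crude           # 3
withinStratum   # 0 0
```

The crude comparison suggests a treatment effect of 3 units, but within each stratum of `z` the difference is exactly 0: the apparent effect is entirely due to the confounder.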


### The Method of Comparison
The basic method to mitigate confounding is to compare (at least) two groups, one that receives _treatment_ and a _control group_ that does not (or that gets a different treatment).
To minimize bias, the treatment group and the control group should be as similar as possible but for the fact that
one gets treatment and the other does not.

If subjects _self-select_ for treatment, that generally results in bias. So does allowing the experimenter flexibility to select the groups.
The best way to minimize bias, and to be able to quantify the uncertainty in the resulting inferences, is to assign subjects to treatment or control _randomly_.

For human subjects, the mere fact of receiving treatment&mdash;even a treatment with no real effect&mdash;can
produce changes in response. This is called _the placebo effect_.
For that reason, it is important that human subjects be _blind_ to whether they are treated or not, for instance,
by giving subjects in the control group a _placebo_.
That makes the treatment and control groups more similar.
Both groups receive something: the difference is in _what_ they
receive, rather than _whether_ they receive anything.

Also, subjective elements can deliberately or inadvertently enter the assessment of subjects' responses to treatment,
making it important for the people assessing the responses to be _blind_ to which subjects received treatment.
When neither the subjects nor the assessors know who was treated, the experiment is _double blind_.

See [SticiGui: Does Treatment Have an Effect?](http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm) for more discussion.

### Example: Smoking and Cancer
See Freedman, 2009.

Smokers get more heart attacks, lung cancer, and other diseases than non-smokers.
Is it because they smoke?

Most smokers are male, and gender matters for many of those diseases.
So does age, exposure to other environmental agents such as air pollution, etc.
How can we tell whether smoking is responsible for the increased morbidity and mortality?

### Example: HIP trial of the early 1960s
See Freedman, 2009.
The trial drew on the 700,000 members of a NY health plan: 62,000 women between ages 40 and 64 were
randomly assigned to be screened for breast cancer or not.
This is a controlled, randomized experiment.


### Example: Snow's study of the origins of Cholera
See [SticiGui](http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm#cholera).
This is a natural experiment, but a spectacularly good one. Indeed, it helped establish the germ theory
of disease.

### Example: Yule's study of the Causes of Pauperism
See Freedman, 2009.
This is a regression model applied to data from an observational study in an attempt to make causal inferences.

## The 2-sample problem 

Suppose we have a group of $N$ individuals who are randomized into two groups, a _treatment_ group of size $N_t$ and a _control_ group of size $N_c = N - N_t$.
Label the individuals from $1$ to $N$.
Let ${\mathcal T}$ denote the labels of individuals assigned to treatment and ${\mathcal C}$ denote 
the labels of those assigned to control.

For each of the $N$ individuals, we measure a quantitative (real-valued) response.
Each individual $i$ has two _potential responses_: the response $c_i$ the individual would have if assigned to
the control group, and the response $t_i$ the individual would have if assigned to the treatment group.

We assume that individual $i$'s response depends _only_ on that individual's assignment, and not on anyone else's assignment.
This is the assumption of _non-interference_. 
In some cases, this assumption is reasonable; in others, it is not.

For instance, imagine testing a vaccine for a communicable disease.
If you and I have contact, whether you get the disease might depend on whether I am vaccinated&mdash;and _vice versa_&mdash;since if the vaccine protects me from illness, I won't infect you.
Similarly, suppose we are testing the effectiveness of an advertisement for a product.
If you and I are connected and you buy the product, I might be more likely to buy it, even if I don't
see the advertisement.

Conversely, suppose that "treatment" is exposure to a carcinogen, and the response is whether the
subject contracts cancer.
On the assumption that cancer is not communicable, my exposure and your disease
status have no connection.

The _strong null hypothesis_ is that individual by individual, treatment makes no difference whatsoever: $c_i = t_i$ for all $i$.

If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.

We can find the _null distribution_ of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual $i$'s response would have been whether assigned to treatment or to control&mdash;namely, the same.

For instance, suppose we suspect that treatment tends to increase response: in general, $t_i \ge c_i$.
Then we might expect $\bar{c} = \frac{1}{N_c} \sum_{i \in {\mathcal C}} c_i$ to be less than
$\bar{t} = \frac{1}{N_t} \sum_{i \in {\mathcal T}} t_i$.
How large a difference between $\bar{c}$ and $\bar{t}$ would be evidence that treatment increases the response,
beyond what might happen by chance through the luck of the draw?

This amounts to asking whether the observed difference in means between the two groups is a high percentile
of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.

Because of how subjects are assigned to treatment or to control, all allocations of $N_t$ subjects to
treatment are equally likely.
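
For a very small experiment, the null distribution can be computed exactly by enumerating every allocation. The following sketch uses made-up responses for $N = 6$ individuals with $N_t = 3$; the variable names and the numbers are illustrative assumptions, not data from any study discussed here.

```r
# Hypothetical observed responses; under the strong null they do not depend on assignment
responses <- c(1, 2, 3, 4, 5, 6)
treated   <- c(4, 5, 6)              # labels of the individuals actually assigned to treatment
N_t       <- length(treated)

# Difference in means (treatment minus control) for a given set of treated labels
diffMeans <- function(idx) mean(responses[idx]) - mean(responses[-idx])

observed <- diffMeans(treated)

# Every way to choose N_t of the N individuals for treatment: one allocation per column
allocations <- combn(length(responses), N_t)
nullDist    <- apply(allocations, 2, diffMeans)

# One-sided P-value: fraction of allocations with a difference at least as large as observed
pValue <- mean(nullDist >= observed - 1e-12)   # small tolerance guards against rounding
pValue   # 0.05
```

Only 1 of the 20 equally likely allocations gives a difference in means as large as the observed one, so the exact one-sided $P$-value is $1/20 = 0.05$. Enumeration becomes infeasible quickly as $N$ grows, which is why the null distribution is usually approximated by simulation.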

One way to partition the $N$ subjects randomly into a group of size $N_c$ and a group of size $N_t$ is
to permute the $N$ subjects at random, then take the first $N_c$ in the permuted list to be the control
group, and the remaining $N_t$ to be the treatment group.
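
As a minimal sketch of that recipe in R (the sizes $N = 10$ and $N_c = 4$ are arbitrary choices for illustration), base R's `sample()` returns a random permutation of the labels, which can then be split:

```r
N   <- 10                     # total number of subjects (illustrative)
N_c <- 4                      # size of the control group (illustrative)

perm      <- sample(N)        # random permutation of the labels 1, ..., N
control   <- perm[1:N_c]      # first N_c labels in the permuted list
treatment <- perm[(N_c + 1):N]  # remaining N_t = N - N_c labels
```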

[Note: discussion of how to construct a random permutation. Issues with assigning random numbers to
all items of the list. Compare with Knuth's algorithm, also for computational burden.]
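
As one possible illustration of the two approaches the note mentions, the sketch below implements both by hand. The function names `knuthShuffle` and `sortShuffle` are hypothetical, introduced only for this example; in practice `sample()` already produces a random permutation. The Knuth (Fisher-Yates) shuffle needs $n-1$ swaps, whereas assigning a random number to every item and sorting costs on the order of $n \log n$ operations and, in principle, has to deal with ties among the random numbers.

```r
# Knuth (Fisher-Yates) shuffle: for i = n, n-1, ..., 2, swap element i with a
# uniformly chosen element among positions 1..i
knuthShuffle <- function(x) {
    n <- length(x)
    if (n < 2) return(x)
    for (i in n:2) {
        j <- sample.int(i, 1)          # uniform on 1, ..., i
        tmp <- x[i]; x[i] <- x[j]; x[j] <- tmp
    }
    x
}

# Alternative: attach a random number to every item, then sort by those numbers
sortShuffle <- function(x) x[order(runif(length(x)))]

knuthShuffle(1:10)
sortShuffle(1:10)
```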

### Student's $t$ statistic

The _mean_ of a list of numbers $\{x_j\}_{j=1}^n$ is
$$
   \bar{x} \equiv \frac{1}{n} \sum_{j=1}^n x_j.
$$

The _sample standard deviation_ of the list is
$$
   s(x)  \equiv \sqrt{ \frac{1}{n-1} \sum_{j=1}^n \left ( x_j - \bar{x} \right )^2 }.
$$
Note the $n-1$ in the denominator.
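
A quick check of these two definitions against R's built-in `mean()` and `sd()` (which also uses $n-1$ in the denominator), on an arbitrary made-up list:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # arbitrary illustrative numbers
n <- length(x)

xbar <- sum(x) / n                          # mean, from the definition
s    <- sqrt(sum((x - xbar)^2) / (n - 1))   # sample standard deviation, from the definition

all.equal(xbar, mean(x))   # TRUE
all.equal(s, sd(x))        # TRUE
```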

Student's $t$ statistic is [TO DO].
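
One common form, and the one that the `tStat` helper in the code later in this section computes, is the pooled two-sample statistic. As a sketch, write $s_c$ and $s_t$ for the sample standard deviations of the responses in the control and treatment groups (this notation is introduced here for illustration and is not from the original text). Then

$$
   s_p \equiv \sqrt{ \frac{(N_c - 1) s_c^2 + (N_t - 1) s_t^2}{N - 2} },
   \qquad
   T \equiv \frac{\bar{t} - \bar{c}}{s_p \sqrt{\frac{1}{N_t} + \frac{1}{N_c}}},
$$

where $N = N_c + N_t$.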

It is common to use Student's $t$ test in these circumstances, but the assumptions under which the $t$ statistic has Student's $t$ distribution are not met.
Those assumptions require the two groups to be independent random samples from normal distributions with the same mean and the same variance.
Here, the two groups arise from randomly partitioning a single fixed set of $N$ individuals, so they are not independent, and the responses need not be normally distributed.

Instead, we will construct a permutation test using Student's $t$ statistic.

## Gender Bias in Teaching Evaluations
MacNell, Driscoll, and Hunt (2014. [What's in a Name: Exposing Gender Bias in Student Ratings of Teaching](http://link.springer.com/article/10.1007%2Fs10755-014-9313-4), _Innovative Higher Education_) conducted a controlled, randomized experiment on
the effect of students' perception of instructors' gender on teaching evaluations
in an online course.
Students in the class did not know the instructors' true genders.

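MacNell et al. randomized 43 students in an online course into four groups: 8 students were told their
instructor was male when their instructor truly was female; 12 were told their instructor was male and their instructor truly was male; 12 were told their instructor was female and their instructor truly was female; and 11 were told their instructor was female when their instructor truly was male.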

<table>
<tr><th>Adjective</th><th>F - M</th></tr>
<tr><td>Caring</td><td>-0.47</td></tr>
<tr><td>Consistent</td><td>-0.57</td></tr>
<tr><td>Enthusiastic</td><td>-0.76</td></tr>
<tr><td>Fair</td><td>-0.47</td></tr>
<tr><td>Feedback</td><td>-0.46</td></tr>
<tr><td>Helpful</td><td>-0.35</td></tr>
<tr><td>Knowledgeable</td><td>-0.67</td></tr>
<tr><td>Praise</td><td>-0.61</td></tr>
<tr><td>Professional</td><td>-0.80</td></tr>
<tr><td>Prompt</td><td>-0.61</td></tr>
<tr><td>Respectful</td><td>-0.22</td></tr>
<tr><td>Responsive</td><td>-0.61</td></tr>
</table>



MacNell et al. graciously shared their data.
The data are coded as follows:

    Group
        3 (8 students)  - TA identified as male, true TA gender female
        4 (12 students) - TA identified as male, true TA gender male
        5 (12 students) - TA identified as female, true TA gender female
        6 (11 students) - TA identified as female, true TA gender male
    tagender   - 1 if true male, 0 if true female
    taidgender - 1 if identified as male, 0 if identified as female
    gender     - 1 if student is male, 2 if student is female

There are grades for 47 students but evaluations for only 43 (4 did not respond). 
The grades are not linked to the evaluations, per the IRB protocol.
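
The group coding above can be written down as a small lookup table, which makes it easy to check the group sizes. This is only a sketch: the data frame `coding` and its column `n` are names introduced here for illustration; the group numbers, counts, and the 0/1 codes are taken from the coding list above.

```r
# One row per experimental group, following the coding described above
coding <- data.frame(
    group      = c(3, 4, 5, 6),
    n          = c(8, 12, 12, 11),   # number of students with evaluations in each group
    tagender   = c(0, 1, 0, 1),      # 1 if the TA is truly male, 0 if truly female
    taidgender = c(1, 1, 0, 0)       # 1 if the TA was identified as male, 0 if as female
)

nTotal    <- sum(coding$n)                           # 43 students with evaluations
nIdMale   <- sum(coding$n[coding$taidgender == 1])   # 20 told their TA was male
nIdFemale <- sum(coding$n[coding$taidgender == 0])   # 23 told their TA was female
```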

    # The data are in a .csv file called "Macnell-RatingsData.csv" in the directory Data
    ratings <- read.csv("Data/Macnell-RatingsData.csv")  # read the .csv file into a data frame
    ratings[1:5,]     # first five rows
    summary(ratings)  # summary statistics for the data. Note the issue with "age"

    Unnamed: 0,group,professional,respect,caring,enthusiastic,communicate,helpful,feedback,prompt,consistent,fair,responsive,praised,knowledgeable,clear,overall,gender,age,tagender,taidgender
    1,3,5,5,4,4,4,3,4,4,4,4,4,4,3,5,4,2,1990,0,1
    2,3,4,4,4,4,5,5,5,5,3,4,5,5,5,5,4,1,1992,0,1
    3,3,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,2,1991,0,1
    4,3,5,5,5,5,5,3,5,5,5,5,3,5,5,5,5,2,1991,0,1
    5,3,5,5,5,5,5,5,5,3,4,5,5,5,5,5,5,2,1992,0,1

         group        professional      respect          caring      enthusiastic  
     Min.   :3.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
     1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.50   1st Qu.:4.000  
     Median :4.000   Median :5.000   Median :5.000   Median :4.00   Median :4.000  
     Mean   :4.465   Mean   :4.326   Mean   :4.326   Mean   :3.93   Mean   :3.907  
     3rd Qu.:6.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:4.500  
     Max.   :6.000   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
      communicate       helpful         feedback         prompt     
     Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
     1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000  
     Median :4.000   Median :4.000   Median :4.000   Median :4.000  
     Mean   :3.953   Mean   :3.744   Mean   :3.953   Mean   :3.977  
     3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
     Max.   :5.000   Max.   :5.000   Max.   :5.000   M

    # The grades are in a separate .csv file in the same directory
    grades <- read.csv("Data/Macnell-GradeData.csv")  # read the .csv file into a data frame
    grades[1:5,]      # first five rows
    summary(grades)   # summary statistics for the data

    Unnamed: 0,group,grade,tagender,taidgender
    1,3,77.4,0,1
    2,3,89.02,0,1
    3,3,53.5,0,1
    4,3,88.32,0,1
    5,3,90.02,0,1

         group           grade          tagender        taidgender    
     Min.   :3.000   Min.   :49.46   Min.   :0.0000   Min.   :0.0000  
     1st Qu.:3.500   1st Qu.:75.20   1st Qu.:0.0000   1st Qu.:0.0000  
     Median :5.000   Median :80.13   Median :1.0000   Median :0.0000  
     Mean   :4.532   Mean   :79.01   Mean   :0.5106   Mean   :0.4894  
     3rd Qu.:6.000   3rd Qu.:85.09   3rd Qu.:1.0000   3rd Qu.:1.0000  
     Max.   :6.000   Max.   :95.10   Max.   :1.0000   Max.   :1.0000  

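    # First test option: two groups.
    # Estimates a two-sided permutation P-value for the pooled two-sample t statistic.
    #   x: responses of the control group; y: responses of the treatment group
    #   iter: number of random permutations to draw
    simPermuTTest <- function(x, y, iter) {
        # t statistic comparing the last n elements of z with the first N-n elements
        tStat <- function(z, n) {
            N <- length(z)
            m <- N - n
            sPool <- sqrt(((m-1)*var(z[1:m]) + (n-1)*var(z[(m+1):N]))/(N-2))  # pooled SD
            (mean(z[(m+1):N]) - mean(z[1:m]))/(sPool*sqrt(1/n + 1/m))
        }
        n <- length(y)               # number of treated subjects
        z <- c(x, y)                 # pooled responses
        ts <- abs(tStat(z, n))       # observed test statistic
        # fraction of random permutations whose statistic is at least as extreme as the observed one
        sum(replicate(iter, abs(tStat(sample(z), n)) >= ts))/iter
    }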

    # It's good practice to set the seed of the random number generator, so that your work will be
    # reproducible. I'm using the date of this lecture as the seed.
    # Don't reset the seed repeatedly in your analysis! That compromises the pseudorandom behavior of the PRNG.
    # R uses the Mersenne Twister PRNG, which is good for statistics but not adequate for cryptography.
    set.seed(20150630)
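
To see what setting the seed buys, here is a small sketch. It re-seeds the generator twice only to demonstrate reproducibility; as the comment above says, this is not something to do in a real analysis.

```r
set.seed(20150630)
a <- runif(3)          # three pseudorandom numbers after seeding

set.seed(20150630)     # same seed again, for demonstration only
b <- runif(3)

identical(a, b)        # TRUE: the same seed reproduces the same sequence
```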

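    # Second test option: use TAs as their own controls.
    #   m1, f1: ratings of instructor 1 when identified as male and as female, respectively
    #   m2, f2: ratings of instructor 2 when identified as male and as female, respectively
    #   iter: number of random permutations to draw
    simPermuTTest2 <- function(m1, f1, m2, f2, iter) {
        # t statistic comparing the last n elements of z with the first N-n elements
        tStat <- function(z, n) {
            N <- length(z)
            m <- N - n
            sPool <- sqrt(((m-1)*var(z[1:m]) + (n-1)*var(z[(m+1):N]))/(N-2))  # pooled SD
            (mean(z[(m+1):N]) - mean(z[1:m]))/(sPool*sqrt(1/n + 1/m))
        }
        n1 <- length(f1)               # number of students assigned to instructor 1 when instructor 1
                                       # was identified as female
        N1 <- length(f1) + length(m1)  # total number of students assigned to instructor 1
        n2 <- length(f2)               # number of students assigned to instructor 2 when instructor 2
                                       # was identified as female
        z <- c(m1, f1, m2, f2)         # pooled responses
        N <- length(z)                 # total number of students
        ts <- abs(tStat(c(m1, f1), n1) + tStat(c(m2, f2), n2))  # observed test statistic
        # fraction of random permutations whose statistic is at least as extreme as the observed one
        sum(replicate(iter, {zp <- sample(z)        # permute all students
                             zp1 <- zp[1:N1]        # reassigned to instructor 1
                             zp2 <- zp[(N1+1):N]    # reassigned to instructor 2
                             abs(tStat(zp1, n1) + tStat(zp2, n2)) >= ts
                            }))/iter
    }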