# Design of Experiments

### Observational studies and experiments

#### Observational study:
- Collect data in a way that does not directly interfere with how the data arise. 
- Only correlation can be inferred.

#### Experiment:
- Randomly assign subjects to various treatments
- Causation can be inferred

Example: consider the relationship between using screens (mobile, tablet, laptop, etc.) at bedtime and attention span.

- **Observational Study:** We sample 2 types of people from the population: those who choose to use screens at bedtime and those who don't. Then we find the average attention span for each of the 2 groups and compare. Even if we find a difference between the average attention spans of these groups, we can't attribute this difference solely to bedtime screen use, because other variables that we didn't control for in this study could have contributed to the observed difference. For example, people who use screens at night might also use screens for longer periods during the day, and their attention span might be affected by the daytime usage as well.

- **Experiment:** We sample a group of people from the population and then randomly assign them to 2 groups: those who are asked to use screens at bedtime and those who are asked not to. The difference is that the decision to use or not use screens at bedtime is NOT left to the subject; it is imposed by the researcher. At the end, we compare the attention spans of the 2 groups. Variables that might also contribute to the outcome, called confounding variables, are most likely represented equally in the 2 groups due to random assignment. Therefore, if we find a difference between the 2 averages, we can make a causal statement attributing this difference to bedtime screen usage.
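The balancing effect of random assignment can be sketched with a small simulation. Everything here is hypothetical: the 200 simulated subjects, the `daytime_hours` confounder, and the group labels are illustrative assumptions, not data from a real study.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 200
# Hypothetical confounder: hours of daytime screen use per subject
subjects <- data.frame(
  id = seq_len(n),
  daytime_hours = runif(n, min = 0, max = 10)
)

# Random assignment: shuffle the two treatment labels across subjects
subjects$group <- sample(rep(c("screens", "no_screens"), each = n / 2))

# Each group receives exactly n / 2 subjects
group_sizes <- table(subjects$group)

# Mean daytime use per group; random assignment tends to make these similar
group_means <- tapply(subjects$daytime_hours, subjects$group, mean)

print(group_sizes)
print(round(group_means, 2))
```

Because the labels are shuffled independently of `daytime_hours`, neither group is systematically made up of heavier daytime users, which is what lets a difference in outcomes be attributed to the treatment.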

### Random sampling and random assignment

**Random sampling:** Occurs when the subjects are being selected for a study. If the subjects are selected randomly from the population, then the resulting sample is likely to be representative of the population, and therefore the study's results can be generalized to that population.

**Random assignment:** Only occurs in experimental settings where subjects are being assigned to various treatments. Random assignment allows for causal conclusions.

- **Random assignment + random sampling:** Causal and generalizable to the whole population. This is the ideal experiment but such studies are difficult to carry out, especially if the experimental units are humans, since it may be difficult to randomly sample people from the population and then impose treatments on them. 

- **Random assignment + no random sampling:** Causal but not generalizable. Experiments that rely on volunteers employ random assignment but not random sampling. These studies can be used to make causal conclusions, but the conclusions can only be applied to the sample and the results cannot be generalized to the population.

- **No random assignment + random sampling:** Not causal but generalizable. A study that uses no random assignment but does use random sampling is your typical observational study. Results can only be used to make association statements, but they can be generalized to the whole population.

- **No random assignment + no random sampling:** Neither causal nor generalizable. These studies can only be used to make non-causal association statements about the sample itself. This is NOT an ideal study.
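The four combinations above reduce to two independent yes/no questions. A minimal sketch of that lookup, where the helper `scope_of_inference()` is a hypothetical name introduced only for illustration:

```r
# Hypothetical helper: maps a study design to its scope of inference.
# random_assignment decides whether causal conclusions are possible;
# random_sampling decides whether results generalize to the population.
scope_of_inference <- function(random_sampling, random_assignment) {
  stopifnot(is.logical(random_sampling), is.logical(random_assignment))
  causal <- if (random_assignment) "causal" else "not causal (association only)"
  general <- if (random_sampling) {
    "generalizable to the population"
  } else {
    "not generalizable beyond the sample"
  }
  paste(causal, general, sep = "; ")
}

# Volunteers who are randomly split into treatment groups
print(scope_of_inference(random_sampling = FALSE, random_assignment = TRUE))

# A typical observational study built on a random sample
print(scope_of_inference(random_sampling = TRUE, random_assignment = FALSE))
```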

### Random sampling or random assignment:

One of the early studies linking smoking and lung cancer compared patients who are already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, proportions of smokers for patients with and without lung cancer were compared. 

**QUESTION:** Does this study employ random sampling and/or random assignment?

**ANSWER:** Neither random sampling nor random assignment. Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study. Random sampling is not employed because the study records the patients who are already hospitalized, so it wouldn't be appropriate to apply the findings back to the population as a whole. 


### Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia - for example, "an ostrich's eye is bigger than its brain" - into a computer. A randomly selected half of these subjects were told the information would be saved in the computer. The other half were told the items they typed would be erased. 

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

The results of the study **cannot** be generalized to all people and a causal link between believing information is stored and memory **can** be inferred based on these results. 

There is no random sampling since the subjects of the study were volunteers, so the results cannot be generalized to all people. However, due to random assignment, we are able to infer a causal link between the belief that information is stored and the ability to recall that same information.

### Simpson's Paradox

Often, when one mentions "a relationship between variables", we think of a relationship between just 2 variables (explanatory variable X and response variable Y). However, truly understanding the relationship between 2 variables might require considering other potentially related variables as well. If we don't, we might run into Simpson's Paradox.

Labelling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified. We use these labels only to keep track of which variable we suspect affects the other. 

We could study the relationship between 3 explanatory variables and a single response variable. This is often a more realistic scenario since most real world relationships are multivariable. 

For example, if we're interested in the relationship between calories consumed daily and heart health, we would probably also want to consider variables like the age and fitness level of the person. **Not considering an important variable when studying a relationship can result in what we call Simpson's Paradox, which illustrates the effect that the omission of an explanatory variable can have on the measure of association between another explanatory variable and the response variable.**

In other words, the inclusion of a 3rd variable in the analysis can change the apparent relationship between the other 2 variables. 


In [5]:
library(dplyr)

In [6]:
ucb_admit <- read.csv('datasets/UCB_ADMIT.csv')
cols <- c("Admit", "Gender", "Dept")
colnames(ucb_admit) <- cols
head(ucb_admit)

Admit,Gender,Dept
Admitted,Male,A
Admitted,Male,A
Admitted,Male,A
Admitted,Male,A
Admitted,Male,A
Admitted,Male,A


In [7]:
glimpse(ucb_admit)

Observations: 4,526
Variables: 3
$ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admitted, ...
$ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
$ Dept   <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, ...


### Number of males and females admitted

In [8]:
ucb_admission_counts <- count(ucb_admit, Gender, Admit)
ucb_admission_counts

Gender,Admit,n
Female,Admitted,557
Female,Rejected,1278
Male,Admitted,1198
Male,Rejected,1493


### Proportion of males admitted overall

Next we'll calculate the proportion of males and the proportion of females admitted by creating a new variable called prop (short for proportion), based on the counts calculated in the previous step, using mutate() from the dplyr package.

Proportions for each row of the data frame we created in the previous step can be calculated as n / sum(n). Note that since the data are grouped by gender, sum(n) will be calculated for males and females separately.

In [9]:
ucb_admission_counts %>%
    group_by(Gender) %>%
    mutate(prop = n / sum(n)) %>%
    filter(Admit == "Admitted")

Gender,Admit,n,prop
Female,Admitted,557,0.3035422
Male,Admitted,1198,0.4451877


It looks like about 45% of males were admitted versus only about 30% of females. But there's more to this story.

### Proportion of males admitted for each department

Finally, we'll make a table similar to the one we constructed earlier, except we'll first group the data by department. The goal is to compare the proportions of admitted male students across departments.

Proportions for each row of the data frame we create can be calculated as n / sum(n). Note that since the data are grouped by department and gender, sum(n) will be calculated for males and females separately for each department.

In [10]:
ucb_admission_counts <- ucb_admit %>%
    group_by(Dept, Gender, Admit) %>%
    count()

ucb_admission_counts

Dept,Gender,Admit,n
A,Female,Admitted,89
A,Female,Rejected,19
A,Male,Admitted,512
A,Male,Rejected,313
B,Female,Admitted,17
B,Female,Rejected,8
B,Male,Admitted,353
B,Male,Rejected,207
C,Female,Admitted,202
C,Female,Rejected,391


In [11]:
ucb_admission_counts  %>%
  group_by(Dept, Gender) %>%
  mutate(prop = n/sum(n))

Dept,Gender,Admit,n,prop
A,Female,Admitted,89,0.82407407
A,Female,Rejected,19,0.17592593
A,Male,Admitted,512,0.62060606
A,Male,Rejected,313,0.37939394
B,Female,Admitted,17,0.68
B,Female,Rejected,8,0.32
B,Male,Admitted,353,0.63035714
B,Male,Rejected,207,0.36964286
C,Female,Admitted,202,0.34064081
C,Female,Rejected,391,0.65935919
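The output above is cut off after department C, so the full picture is easier to see in one place. The sketch below assumes that R's built-in `UCBAdmissions` table (a 3-way Admit x Gender x Dept table in the datasets package) holds the same data as `UCB_ADMIT.csv`. Its overall counts do match the table shown earlier (1198 males and 557 females admitted), but the equivalence of the two sources is an assumption, not something the notebook confirms.

```r
# UCBAdmissions ships with base R: a 3-way table, Admit x Gender x Dept
tab <- UCBAdmissions

# Overall admission rate by gender, ignoring department
overall_counts <- apply(tab, c(1, 2), sum)        # Admit x Gender
overall_rate <- overall_counts["Admitted", ] / colSums(overall_counts)

# Admission rate by gender within each department
admitted <- tab["Admitted", , ]                   # Gender x Dept
applicants <- apply(tab, c(2, 3), sum)            # Gender x Dept
dept_rate <- admitted / applicants

print(round(overall_rate, 3))
print(round(dept_rate, 3))

# Departments where the female admission rate exceeds the male rate
female_higher <- names(which(dept_rate["Female", ] > dept_rate["Male", ]))
print(female_higher)
```

Overall, males are admitted at a higher rate, yet within departments A and B (as the table above already shows) females are admitted at a higher rate. Department is the third variable whose inclusion changes the apparent relationship between gender and admission.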


### Gapminder dataset

In [12]:
library(gapminder)


In [13]:
head(gapminder)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134


In [14]:
glimpse(gapminder)

Observations: 1,704
Variables: 6
$ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...


### Sampling strategies


Why do we sample in the first place? Why not try to collect data from the entire population of interest? You could try to take a census but it isn't easy. First, taking a census requires a lot more resources than collecting data from a sample of the population. Second, certain individuals in your population might be hard to locate or collect data from. If these individuals that are missed in the census are different from those in the rest of the population, the census data will be biased. 

For example, in the US census, illegal immigrants are often not recorded properly since they tend to be reluctant to fill out census forms out of concern that this information could be shared with immigration authorities. However, these individuals might have characteristics different from the rest of the population, and hence not getting information from them might result in unreliable data from geographical regions with high concentrations of illegal immigrants. Lastly, populations are constantly changing. Even if you do have the required resources and manage to collect data from everyone in the population, your population will be different tomorrow, and so the hard work required to collect such data may not pay off.

### Sampling is natural

If you think about it, sampling is actually quite natural. Think about something you are cooking. We taste, or in other words examine, a small part of what we're cooking to get an idea about the dish as a whole. After all, we would never eat a whole pot of soup just to check its taste. When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, what you're doing is simply exploratory analysis of the sample at hand. If you then generalize and conclude that your entire soup needs salt, that's making an inference. For your inference to be valid, the spoonful you tasted (your sample) needs to be representative of the entire pot (your population). If your spoonful comes only from the surface and the salt has collected at the bottom, what you tasted is probably not going to be representative of the whole pot. On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.

Sampling data is a bit different than sampling soup though. Let's introduce a few commonly used sampling methods: 

* Simple random sampling
* Stratified sampling
* Cluster sampling
* Multistage sampling

### Simple random sampling
We randomly select cases from the population, such that each case is equally likely to be selected. This is similar to randomly drawing names from a hat. 

### Stratified sampling
We first divide the population into homogeneous groups called strata and then we randomly sample from within each stratum. For example, if we wanted to make sure that people from low, medium, and high socioeconomic status are equally represented in a study, we would first divide our population into 3 groups as such and then sample from within each group. 

### Cluster sampling
In cluster sampling, we divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike strata in stratified sampling, are heterogeneous within themselves and each cluster is similar to the others, such that we can get away with sampling from just a few of the clusters. 

### Multistage sampling
Multistage sampling adds another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters and randomly sample a few clusters, but then, instead of taking every observation, we randomly sample observations from within those clusters.

...

Cluster and multistage sampling are often used for economic reasons. One might divide a city into geographic regions that are on average similar to each other and then sample randomly from a few randomly picked regions in order to avoid travelling to all regions.
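The difference between cluster and multistage sampling can be sketched in base R. The population below (20 regions of 50 households each) and the sample sizes are hypothetical values chosen for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population: 20 regions (clusters) with 50 households each
population <- data.frame(
  region = rep(paste0("region_", 1:20), each = 50),
  household = seq_len(20 * 50)
)

# Stage 1 (both designs): randomly pick a few clusters
chosen_regions <- sample(unique(population$region), size = 4)
in_chosen <- population[population$region %in% chosen_regions, ]

# Cluster sampling: keep every observation in the chosen clusters
cluster_sample <- in_chosen

# Multistage sampling: randomly sample 10 households within each chosen cluster
multistage_sample <- do.call(rbind, lapply(
  split(in_chosen, in_chosen$region),
  function(d) d[sample(nrow(d), size = 10), ]
))

print(nrow(cluster_sample))     # 4 regions x 50 households
print(nrow(multistage_sample))  # 4 regions x 10 households
```

Both designs visit only the 4 chosen regions; multistage sampling further reduces the work by surveying a subset of households in each.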

### Examples

### Sampling strategies, determine which...

A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.

What sampling strategy has this company used?

**Stratified sampling**

### Sampling strategies, choose worst...

A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.

Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.

Which approach would likely be the least effective for selecting the schools where the survey will be conducted?

**Cluster sampling, where each cluster is a neighborhood. This sampling strategy would be a bad idea because each neighborhood has a unique socioeconomic status. A good study would collect information about every neighborhood.**

### Sampling in R

In [22]:
library(openintro)
library(dplyr)

In [23]:
# load county data
data(county)

In [26]:
head(county)

name,state,pop2000,pop2010,fed_spend,poverty,homeownership,multiunit,income,med_income
Autauga County,Alabama,43671,54571,6.068095,10.6,77.5,7.2,24568,53255
Baldwin County,Alabama,140415,182265,6.139862,12.2,76.7,22.6,26469,50147
Barbour County,Alabama,29038,27457,8.752158,25.0,68.0,11.1,15875,33219
Bibb County,Alabama,20826,22915,7.122016,12.6,82.9,6.6,19918,41770
Blount County,Alabama,51024,57322,5.13091,13.4,82.0,3.7,21070,45549
Bullock County,Alabama,11714,10914,9.973062,25.3,76.9,9.9,20289,31602


In [28]:
summary(county)

                name           state         pop2000           pop2010       
 Washington County:  30   Texas   : 254   Min.   :     67   Min.   :     82  
 Jefferson County :  25   Georgia : 159   1st Qu.:  11210   1st Qu.:  11104  
 Franklin County  :  24   Virginia: 134   Median :  24608   Median :  25857  
 Jackson County   :  23   Kentucky: 120   Mean   :  89623   Mean   :  98233  
 Lincoln County   :  23   Missouri: 115   3rd Qu.:  61766   3rd Qu.:  66699  
 Madison County   :  19   Kansas  : 105   Max.   :9519338   Max.   :9818605  
 (Other)          :2999   (Other) :2256   NA's   :3                          
   fed_spend          poverty     homeownership     multiunit    
 Min.   :  0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:  6.964   1st Qu.:11.0   1st Qu.:69.50   1st Qu.: 6.10  
 Median :  8.669   Median :14.7   Median :74.60   Median : 9.70  
 Mean   :  9.991   Mean   :15.5   Mean   :73.26   Mean   :12.33  
 3rd Qu.: 10.857   3rd Qu.:19.0   3rd Qu.:78.4

In [29]:
glimpse(county)

Observations: 3,143
Variables: 10
$ name          <fct> Autauga County, Baldwin County, Barbour County, Bibb ...
$ state         <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama,...
$ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 112...
$ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 118...
$ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910, 9.9...
$ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, 20.3,...
$ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4,...
$ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3,...
$ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916, 2057...
$ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659, 3840...


In [25]:
# since DC is not a state by definition, we're going to remove it;
# we'll also drop unused factor levels for good measure,
# so the District of Columbia is removed completely
# from our data frame

county_noDC <- county %>%
    filter(state != "District of Columbia") %>%
    droplevels()

head(county_noDC)

name,state,pop2000,pop2010,fed_spend,poverty,homeownership,multiunit,income,med_income
Autauga County,Alabama,43671,54571,6.068095,10.6,77.5,7.2,24568,53255
Baldwin County,Alabama,140415,182265,6.139862,12.2,76.7,22.6,26469,50147
Barbour County,Alabama,29038,27457,8.752158,25.0,68.0,11.1,15875,33219
Bibb County,Alabama,20826,22915,7.122016,12.6,82.9,6.6,19918,41770
Blount County,Alabama,51024,57322,5.13091,13.4,82.0,3.7,21070,45549
Bullock County,Alabama,11714,10914,9.973062,25.3,76.9,9.9,20289,31602


Suppose our limited resources require that we collect data from only 150 of the over 3000 counties in the United States. One option is to take a simple random sample.

In [30]:
# simple random sample of 150 counties

county_srs <- county_noDC %>%
    sample_n(size = 150)

glimpse(county_srs)

Observations: 150
Variables: 10
$ name          <fct> Monroe County, Lincoln County, Clark County, Douglas ...
$ state         <fct> West Virginia, Idaho, South Dakota, Colorado, Ohio, M...
$ pop2000       <dbl> 14583, 4044, 4143, 175766, 23072, 55099, 98890, 5071,...
$ pop2010       <dbl> 13502, 5208, 3691, 285465, 23770, 62500, 99892, 4772,...
$ fed_spend     <dbl> 10.735891, 5.403610, 14.566784, 2.109113, 8.442575, 8...
$ poverty       <dbl> 13.3, 15.3, 13.1, 2.9, 20.8, 11.5, 13.7, 19.1, 23.2, ...
$ homeownership <dbl> 84.9, 75.4, 81.5, 82.5, 80.2, 76.4, 79.8, 78.1, 69.4,...
$ multiunit     <dbl> 1.7, 2.1, 9.6, 14.9, 5.9, 10.9, 12.2, 12.2, 9.4, 3.2,...
$ income        <dbl> 18927, 19011, 23909, 42418, 18003, 24282, 22529, 2322...
$ med_income    <dbl> 39574, 45714, 43894, 99198, 33407, 44659, 48618, 3544...


However, if we wanted to obtain an equal number of counties from each state, that is, 3 counties per state, a simple random sample won't guarantee that. We can confirm this by counting the number of counties per state:

In [31]:
county_srs %>% 
    group_by(state) %>%
    count()

state,n
Alabama,4
Arizona,1
Arkansas,6
California,2
Colorado,4
Florida,2
Georgia,8
Idaho,4
Illinois,4
Indiana,5


If we instead want to sample 3 counties per state to make up our sample of 150 counties, we should use stratified sampling.

In [34]:
# stratified sample of 150 counties
# each state is a stratum

county_str <- county_noDC %>%
    group_by(state) %>%
    sample_n(size = 3)

glimpse(county_str)

Observations: 150
Variables: 10
$ name          <fct> Lawrence County, Lauderdale County, Washington County...
$ state         <fct> Alabama, Alabama, Alabama, Alaska, Alaska, Alaska, Ar...
$ pop2000       <dbl> 34803, 87966, 18097, 7208, 13913, 30711, 69423, 33489...
$ pop2010       <dbl> 34339, 92709, 17581, 7523, 13592, 31275, 71518, 37220...
$ fed_spend     <dbl> 6.652960, 8.106128, 7.829987, 12.917187, 12.702840, 3...
$ poverty       <dbl> 13.6, 17.7, 19.7, 19.7, 10.9, 6.5, 34.4, 20.0, 13.5, ...
$ homeownership <dbl> 78.7, 73.0, 83.0, 53.7, 59.2, 64.0, 76.3, 72.0, 77.7,...
$ multiunit     <dbl> 5.1, 14.7, 2.6, 19.4, 25.9, 32.2, 5.2, 7.7, 6.3, 11.4...
$ income        <dbl> 19370, 22341, 18824, 21278, 26413, 34923, 12294, 1564...
$ med_income    <dbl> 40516, 39345, 36431, 55217, 60776, 75517, 30184, 4168...


In [35]:
county_str %>% 
    group_by(state) %>%
    count()

state,n
Alabama,3
Alaska,3
Arizona,3
Arkansas,3
California,3
Colorado,3
Connecticut,3
Delaware,3
Florida,3
Georgia,3
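The samples above will differ on every run because no random seed is set. A minimal base-R sketch of the same contrast with a fixed seed follows. It uses a hypothetical toy data frame (three "states" of unequal size) rather than the `county` data, and the seed value is arbitrary.

```r
set.seed(2024)  # assumption: any fixed seed makes the draws repeatable

# Hypothetical stand-in for the county data: 3 states of unequal size
toy <- data.frame(
  state = rep(c("A", "B", "C"), times = c(60, 25, 5)),
  county_id = seq_len(90)
)

# Simple random sample: the number of rows per state is left to chance
srs <- toy[sample(nrow(toy), size = 9), ]

# Stratified sample: exactly 3 rows from every state (each state is a stratum)
strat <- do.call(rbind, lapply(
  split(toy, toy$state),
  function(d) d[sample(nrow(d), size = 3), ]
))

print(table(srs$state))
print(table(strat$state))
```

Calling `set.seed()` before `sample_n()` in the cells above should make those draws repeatable in the same way, since dplyr's sampling functions use R's random number generator.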
