---
title: "Lab 6: External Validity"
subtitle: "**Due:** Monday, February 26, 11:59 PM"
author:
  - "**Name:** Your name here"
  - "**Mac ID:** The first half of your Mac email address"
format:
  pdf:
    documentclass: article
    fontsize: 12pt
    urlcolor: blue
    highlight-style: nord
    number-sections: true
    geometry:
      - left=1in
      - right=1in
      - top=1in
      - bottom=1in
header-includes:
  - \usepackage{setspace}
  - \doublespacing
  - \usepackage{float}
  - \floatplacement{figure}{t}
  - \floatplacement{table}{t}
  - \usepackage{flafter}
  - \usepackage{ragged2e}
  - \usepackage{booktabs}
  - \usepackage{amsmath}
  - \usepackage{url}
---
```{r setup, include=FALSE}
# Global options for the knitting behavior of all subsequent code chunks
knitr::opts_chunk$set(echo = TRUE)
# Packages
library(tidyverse)
library(DeclareDesign)
# Add extra packages here if needed
```
# External Validity
In the readings for this week, [Coppock et al. (2018)](https://doi.org/10.1073/pnas.1808083115) mention how the correspondence in effects between representative and convenience samples depends on the distribution of individual treatment effects.
The following design simulates a model with **heterogeneous treatment effects**, and compares the result of survey experiments conducted with a representative and convenience sample.
```{r}
# Parameters
N = 1000     # population size
n = 100      # sample size
effect = 0.5

# Model
model = declare_model(
  N = N,
  U = rnorm(N),
  X = runif(N), # observed covariate
  potential_outcomes(
    # interaction represents heterogeneity
    Y ~ Z * effect * X + U
  )
)
```
We are also specifying `X` as an observed covariate that moderates the treatment effect, something like digital literacy. It's generated by random draws from a uniform distribution between 0 and 1 (hence the `runif` function). If `X` is 1, the unit experiences the full effect. If it is 0, the effect disappears. The numbers in between scale the treatment effect accordingly. This is a way to simulate heterogeneous treatment effects.
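To see the heterogeneity concretely, here is a quick base-R sketch (outside of DeclareDesign, using the same `effect` value and `runif` draws as the model above): each unit's individual effect is `effect * X`, so effects range from 0 up to `effect`, and average out to roughly `effect / 2`.

```r
# Illustrative sketch: individual treatment effects implied by the model.
set.seed(123)
effect = 0.5
X = runif(1000)  # same covariate distribution as in the model
ite = effect * X # each unit's individual treatment effect

range(ite) # varies between roughly 0 and 0.5
mean(ite)  # averages out near effect / 2 = 0.25
```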
The inquiry is standard fare:
```{r}
# Inquiry
inquiry = declare_inquiry(
  ATE = mean(Y_Z_1 - Y_Z_0)
)
```
Then we have to compare two data strategies: a survey experiment with a random sample and one with a convenience sample. At this point our research design branches into two paths, since each data strategy will also have its own analogous answer strategy. They are essentially two different designs, but we can recycle some components.
This is how it looks for the representative sample:
```{r}
# Data strategy
r_sampling = declare_sampling(S = complete_rs(N, n = n))
assignment = declare_assignment(Z = complete_ra(N))
measurement = declare_measurement(Y = reveal_outcomes(Y ~ Z))

# Answer strategy
estimator = declare_estimator(
  Y ~ Z,
  inquiry = "ATE"
)
```
Then put everything together:
```{r}
r_design = model + inquiry +
  r_sampling + assignment +
  measurement + estimator
```
To create our convenience sample, we need a custom sampling function in which units with higher `X` are more likely to be drawn.
```{r}
convenience_sampling = function(data) {
  # Draw n units, with sampling probability proportional to X
  id = sample(data$ID, size = n, prob = data$X)
  data$S = ifelse(data$ID %in% id, 1, 0)
  data[data$S == 1, ]
}
```
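A quick check on a toy data frame (purely illustrative, run outside of DeclareDesign) confirms what the handler does: it returns exactly `n` rows, all flagged with `S = 1`, drawn with probability proportional to `X`.

```r
# Illustrative sketch: apply the handler to a mock data frame with the
# two columns it expects (ID and X). n is the sample size from above.
n = 100
toy = data.frame(ID = 1:1000, X = runif(1000))

sampled = convenience_sampling(toy)
nrow(sampled)       # exactly n rows are drawn
all(sampled$S == 1) # every retained unit is flagged as sampled
```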
Then we pass this custom function to `declare_sampling`.
```{r}
c_sampling = declare_sampling(handler = convenience_sampling)
```
And now we can create a separate design for our convenience sample:
```{r}
c_design = model + inquiry +
  c_sampling + assignment +
  measurement + estimator
```
Then we can diagnose each design:
```{r}
# remember to replace with student number
set.seed(123)
r_diag = diagnose_design(r_design)
c_diag = diagnose_design(c_design)
```
And we can use the following code to fetch the bias and RMSE of each design.
```{r}
diagnosands = rbind(
  r_diag$diagnosands_df %>%
    select(design, bias, rmse),
  c_diag$diagnosands_df %>%
    select(design, bias, rmse)
)
diagnosands
```
::: {.callout-note}
## **Task 1**
Which design is better in terms of bias and RMSE? What explains this?
:::
::: {.callout-note}
## **Task 2**
What happens to the bias and RMSE of both designs as the sample size `n` *increases* but the population `N` remains constant? What happens when the sample size decreases? What explains this?
_**Hint:**_ *Show two results, one with a larger sample and one with a smaller sample.*
:::
::: {.callout-note}
## **Task 3**
What happens to the bias and RMSE when the population and sample sizes are the same? What explains this?
_**Hint:**_ *It may be faster to compute this by choosing a number in between the original population and sample sizes rather than making them both equal to 1,000.*
:::
# Answers
## Task 1
Work on your answer here.
## Task 2
Work on your answer here.
## Task 3
Work on your answer here.