## Part 1: Connecting Regression to Random Experiments
### Effect of Awareness Campaigns on MDW Knowledge of their Rights in Hong Kong (Boittin and Mo et al. 2024)
#### Load Packages

In [None]:
# RUN THIS CELL
# Load packages
library(testthat)
library(tidyverse) %>% suppressMessages()

#### Read in data and take a look!
Luckily, our professor ran this study, so we have access to the data. This is only a small part of the data collected. Let's take a peek at the first ten rows. 

* Each row is the response of one migrant domestic worker (MDW) participant. 
* `treated` indicates whether they were administered the poster treatment. Takes on `1` if treated, `0` if not. 
* `score` indicates their score on the knowledge test. Takes on a value between 0 and 1, with 1 being a perfect score. 

In [None]:
# load in data
hkdat <- read.csv("hk_data_section.csv") %>% select(-starts_with("knowledge"))

# look at first ten rows
head(hkdat, 10)

### Random assignment to Treatment (`treated`)
Migrant domestic workers (MDWs) in HK were randomly assigned to an awareness-raising treatment (being shown a poster). Then, the authors measured each MDW's knowledge of their own rights. 

The authors want to know the effect of the awareness raising campaigns on knowledge of rights. 

#### What columns correspond to the treatment assignment and knowledge test?

* treatment: `treated` - takes on 1 for treated, and 0 for control
* knowledge: `score` - ranges from 0 to 1, with 1 being a perfect score

In [None]:
# how many were treated?
table(hkdat$treated) # equal!

In [None]:
# what was the average overall score for the entire group?
# (the entire sample scored on average 70% on the knowledge test)
hkdat %>% summarise(avg_score = mean(score, na.rm = TRUE))

### What's the effect of being exposed to an awareness raising poster?
**Method 1**: Calculate the average treatment effect (ATE) as the difference between the mean knowledge score of the treated group and the mean knowledge score of the control group. 

Reviewing our functions `filter()`, `summarise()`, and `mean()`

In [None]:
# let's now calculate the average score for the treated group
score_treated <- hkdat %>% filter(treated == 1) %>% 
    summarise(avg_score = mean(score, na.rm = TRUE)) %>%
    # extract the single value from the data frame
    pull()

# knowledge score for migrant workers who were treated
score_treated

In [None]:
# let's now calculate the average score for the control group
score_control <- hkdat %>% filter(treated == 0) %>% 
    summarise(avg_score = mean(score, na.rm = TRUE)) %>%
    # extract the single value from the data frame
    pull()

# knowledge score for migrant workers who were in the control
score_control

In [None]:
# average treatment effect is calculated by taking the difference in means
score_treated - score_control

### Let's see this on a graph!

In [None]:
# Plotting the data
hkdat %>% ggplot(aes(x = treated, y = score)) +
    # add points
    geom_point() + 
    # add regression line
    geom_smooth(method = "lm", se = FALSE)

**Method 2**: Now let's do the same thing using `lm()`. The coefficient on `treated` should equal the difference in means from Method 1. 

In [None]:
lm(score~treated, data = hkdat) %>% summary()

Regression also tells us whether the difference is statistically significant, and here it is not. We therefore cannot rule out that the posters had no effect on migrant domestic workers' knowledge of their rights. 
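
To see why the two methods agree, here is a minimal sketch on simulated data (not the study data). The data frame `toy`, the sample size, and the effect size of 0.05 are all made up for illustration. With a 0/1 treatment variable, the `lm()` coefficient on `treated` matches the difference in group means, and the p-value can be read from the coefficient table of `summary()`.

```r
# Simulated data (NOT the study data): 100 control and 100 treated units
set.seed(42)
toy <- data.frame(treated = rep(c(0, 1), each = 100))
# Hypothetical scores with an arbitrary true effect of 0.05
toy$score <- 0.6 + 0.05 * toy$treated + rnorm(200, sd = 0.1)

# Method 1: difference in group means
diff_means <- mean(toy$score[toy$treated == 1]) -
    mean(toy$score[toy$treated == 0])

# Method 2: coefficient on `treated` from a linear regression
toy_mod <- lm(score ~ treated, data = toy)
slope <- unname(coef(toy_mod)["treated"])

# The two estimates agree up to floating-point error
all.equal(diff_means, slope)

# p-value for the treatment coefficient, taken from the summary table
p_value <- summary(toy_mod)$coefficients["treated", "Pr(>|t|)"]
p_value
```

The same two steps (compare `coef()` to the difference in means, then look at the `Pr(>|t|)` column) apply to the model fitted on `hkdat`.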

How confident are we that awareness-raising posters are not a worthwhile investment? Why? What would change if we had not been able to randomize?


_Your answer here_

Return to ppt!

---
## Part 2: group_by() tutorial

We could be skeptics. How can we check that randomization worked? These two groups (treatment and control) should be almost the same on every other factor besides the treatment status. 

In [None]:
# View the starting dataset: hkdat
hkdat %>% head()

**Example:** We have information on nationality. 

Steps:
1. For the control group, find the distribution of nationality (e.g. how many in the control group are from Indonesia, the Philippines, etc.).
2. For the treatment group, find the distribution of nationality (e.g. how many in the treated group are from Indonesia, the Philippines, etc.).

If treatment assignment was random, the counts in each nationality (i.e. distribution) should be similar. 

**Step 1)** Now let's look at nationality for the control group...

In [None]:
# How do I subset to just the control group? .... filter!
#hkdat %>% filter(treated ==0) 

Now we count the number of observations in each nationality group, which is recorded in the column `nationality`.

Functions we use:
* `filter()` to isolate the control group
* `group_by()` to tell R we are interested in knowing information for each `nationality`
* `summarise()` to reduce to one row per group (i.e. each nationality). 
* What are we interested in learning about for each nationality? The count, and we can find this using `n()`!

In [None]:
# find the counts of each nationality 
hkdat %>% filter(treated ==0) %>%
    # group_by nationality (what we want to see the summary statistics of)
    group_by(nationality) %>%
    # take the summary statistic
    summarise(count = n())

**Step 2)** Now let's do the same thing for the treated group. We repeat the same steps:

Functions we use:
* `filter()` to isolate the *treatment* group
* `group_by()` to tell R we are interested in knowing information for each `nationality`
* `summarise()` to reduce to one row per group (i.e. each nationality). 
* What are we interested in learning about for each nationality? The count, and we can find this using `n()`!

In [None]:
# find the counts of each nationality 
hkdat %>% filter(treated ==1) %>%
    # group_by nationality (what we want to see the summary statistics of)
    group_by(nationality) %>%
    # take the summary statistic
    summarise(count = n())

**How do these distributions compare?** They're almost the same because random assignment worked! We could repeat this for any other variable in the dataset and should again find similar distributions across the two groups. In the problem sets, you may need to add a sentence explaining why randomization leads to similar distributions. 
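
The two-step comparison above can also be sketched as a single table by grouping on two columns at once. The example below uses made-up data (not `hkdat`); the data frame `toy` and the covariate `group` are hypothetical. It also uses `mutate()`, which we have not covered yet, to turn counts into within-arm shares so the two arms are easier to compare.

```r
library(dplyr)

# Made-up data (NOT the study data); `group` is a hypothetical covariate
set.seed(1)
toy <- data.frame(
    treated = sample(rep(c(0, 1), each = 50)),            # random assignment
    group   = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Count each covariate level within each arm,
# then convert the counts to shares within the arm
balance <- toy %>%
    group_by(treated, group) %>%
    summarise(count = n(), .groups = "drop") %>%
    group_by(treated) %>%
    mutate(share = count / sum(count)) %>%
    ungroup()

balance
```

If assignment is random, the `share` column should look similar for `treated == 0` and `treated == 1`, up to chance variation.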

## Practice Using group_by()

**Q1:** We have information on education

Steps:

1. For the control group, find the distribution of education (i.e. how many in the control group are at each educational level?)
2. For the treatment group, find the distribution of education  (i.e. how many in the treatment group are at each educational level?)
3. Compare the distributions 

In [None]:
# find the counts of each education level in the control group
hkdat_edu_control <- NULL # YOUR CODE HERE

hkdat_edu_control

In [None]:
. = ottr::check("tests/Q1a.R")

In [None]:
# find the counts of each education level in the treatment group
hkdat_edu_treated <- NULL # YOUR CODE HERE

hkdat_edu_treated 

In [None]:
. = ottr::check("tests/Q1b.R")

How do the distributions compare? Are they about the same? What does that say about randomization?

_Your text here_

**Q2 Challenge**\
Use `group_by()` to compare the average age in the treatment and control groups. If randomization worked, the averages should be about the same. Are they?

* Think carefully: what are you grouping by? Which column is this information stored in?
* What summary statistic are you interested in?

In [None]:
age_comparison <- NULL # YOUR CODE HERE

age_comparison

In [None]:
. = ottr::check("tests/Q2.R")

How do the averages compare? Are they about the same? What does that say about randomization?

_Your answer here_

---

## Coding Progress Check/Review
This is for your own review/practice! These are the functions we have learned so far: 

* filter()
* select()
* summarise()
* group_by()
* mean()
* n()
* nrow()
* read.csv()
* lm()
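
As a quick refresher, here is a minimal sketch that touches each function in the list once. It runs on a small made-up data frame, not on the study data; the names `toy`, `arm`, `x`, `y`, and the temporary file path are all hypothetical.

```r
library(dplyr)

# Made-up data written to a temporary CSV so read.csv() has a file to read
toy <- data.frame(
    id  = 1:6,
    arm = c("a", "a", "a", "b", "b", "b"),
    x   = c(1, 2, 3, 4, 5, 6),
    y   = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

dat <- read.csv(path)    # read a CSV file into a data frame
nrow(dat)                # number of rows

small <- dat %>%
    filter(x > 1) %>%    # keep only rows that meet a condition
    select(arm, x, y)    # keep only some columns

by_arm <- small %>%
    group_by(arm) %>%                          # one group per value of `arm`
    summarise(avg_y = mean(y), count = n())    # one row per group

by_arm

fit <- lm(y ~ x, data = dat)    # linear regression of y on x
coef(fit)
```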


**Q3)** Read in the dataset that is stored in a file called "hk_data_section.csv" and store it in `my_dat` (this is the same data, just having you practice it!)

In [None]:
my_dat <- NULL # YOUR CODE HERE

In [None]:
. = ottr::check("tests/Q3.R")

**Q4)** Filter your dataset (`my_dat`) to the rows where `ud_freedom_movement` is "Never" and store the result in a new dataframe called `no_fom` (stands for no freedom of movement). This condition subsets to respondents who reported that they do not have freedom of movement. 

In [None]:
# YOUR SOLUTION HERE
no_fom <- NULL # YOUR CODE HERE

# display
head(no_fom)

In [None]:
. = ottr::check("tests/Q4.R")

**Q5)** From `no_fom`, select the columns:
* treated
* score
* educ
* nationality
* ud_freedom_movement

And store in a dataframe called `no_fom_simp` (no freedom of movement simplified)

In [None]:
no_fom_simp <- NULL # YOUR CODE HERE

# display
head(no_fom_simp)

In [None]:
. = ottr::check("tests/Q5.R")

**Q6)** Using `no_fom`, find the average knowledge score of those who have no freedom of movement. Do you expect this to be lower or higher than the knowledge score of the full sample (which we found during the demo to be 70%)? Is the result surprising?

In [None]:
# YOUR ANSWER HERE
avg_knowledge <- NULL # YOUR CODE HERE

# display
avg_knowledge

In [None]:
. = ottr::check("tests/Q6.R")

**Q7)** Find the average knowledge score **for each nationality** using the dataset `no_fom` (those who have no freedom of movement). Name the column of averages `avg_knowledge_score`. What nationality seems to be the most aware? 

*(Extra/Challenge: Once the test passes, you can try grouping by two variables, `nationality` and `treated`, to see how the treatment effect appears to vary across nationalities. If you do this, the check/test will no longer pass, but it's interesting!)*

In [None]:
# YOUR ANSWER HERE
avg_knowledge_by_nat <- NULL # YOUR CODE HERE

# display
avg_knowledge_by_nat

In [None]:
. = ottr::check("tests/Q7.R")

**Q8)** Find the average treatment effect of the awareness-raising posters (`treated`) on the knowledge scores (`score`) for this sub-population (`no_fom`) using linear regression (`lm()`). In the model $score = \alpha + \beta_1 \cdot treated + \varepsilon$, which values in the output correspond to $\alpha$ and $\beta_1$, and which of them is the treatment effect?

    lm(dv ~ iv, data = df)

In [None]:
# YOUR ANSWER HERE
mod <- NULL # YOUR CODE HERE

# Display
mod %>% summary()

In [None]:
. = ottr::check("tests/Q8.R")

How would you interpret _$\alpha$_ and _$\beta_1$_?

_Your solution here_

**Final Q:** So what do you think: do awareness-raising posters work?