# Section 4: Simple Linear Regression

In [None]:
# RUN THIS CELL
# Load packages
library(testthat)
library(tidyverse) %>% suppressMessages()

Load dataset

In [None]:
# read in dataset
cho <- read.csv("cho_rep_clean.csv") %>% rename(catholic = catholic2)

# display first 6 rows
head(cho)

## The Dataset

How many rows are in the dataset? What does each row represent?

In [None]:
n_countries <- nrow(cho)
n_countries

### Variables
- What is the variable that represents human trafficking severity? 
- What is the variable that represents legal prostitution?

_Your answer here_

Filter to the countries with legalized prostitution. 

In [None]:
cho_legal <- cho %>% filter(prostitutionlaw == 1)
cho_legal

Filter to the countries without legalized prostitution. 

In [None]:
# store under a separate name so the legalized subset above is not overwritten
cho_illegal <- cho %>% filter(prostitutionlaw == 0)
cho_illegal

## Simple Linear Regression
A simple linear regression will compare these two groups. 
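
With a 0/1 independent variable, the regression slope is simply the difference between the two group means, which is what "comparing these two groups" amounts to here. The sketch below checks this on a small made-up data frame. The names `toy`, `legal`, and `flows`, and all the values, are hypothetical and not taken from the Cho data.

```r
# Made-up toy data: `legal` is a 0/1 indicator, `flows` is a numeric outcome.
# These values are illustrative only and are NOT from cho_rep_clean.csv.
toy <- data.frame(
  legal = c(0, 0, 0, 1, 1, 1),
  flows = c(1, 2, 3, 2, 3, 4)
)

# Difference between the mean outcome of the two groups
mean_legal <- mean(toy$flows[toy$legal == 1])
mean_not_legal <- mean(toy$flows[toy$legal == 0])
diff_in_means <- mean_legal - mean_not_legal

# Slope from a simple linear regression of flows on legal
toy_mod <- lm(flows ~ legal, data = toy)
slope <- unname(coef(toy_mod)["legal"])

# The slope matches the difference in group means
all.equal(slope, diff_in_means)
```

In the same toy model, the intercept equals the mean of the `legal == 0` group.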

### Visual
The code below plots a graph with legal status of prostitution on the X axis, and human trafficking severity on the Y axis. The blue line is the line of best fit extracted from the linear regression. 

Does it look like there is a significant difference in human trafficking flows between countries with legalized prostitution and those without? Does this suggest that legalizing prostitution could increase human trafficking?

In [None]:
cho %>%
  ggplot(aes(x = prostitutionlaw, y = htflowsunodc)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(text = element_text(size = 20))

### If I have a dataset, can I retrieve the line of best fit between two variables myself?
Yes, you can!

To run the linear regression yourself, all you need to know is the **dependent variable, the independent variable, and the dataset name**. (No, you do not need to know the math to fit the line yourself; R will do it for you! If you are interested in the math, let me know and I can point you to some resources/online videos.)

We use the `lm` function to run a simple linear regression. The `lm` function takes in...

    lm(dv ~ iv, data = df)

It's that simple! So for the cho data, it would be...

In [None]:
# EXAMPLE: Bivariate Equation
mod_legal <- lm(htflowsunodc ~ prostitutionlaw, data = cho)

summary(mod_legal)

#### Interpretation
A one-unit increase in the legal status of prostitution (moving from 0, not legalized, to 1, legalized) is associated with a 0.70 increase in the human trafficking index. This is a correlation, not evidence of causation.

---
## Your turn! Let's practice using and interpreting linear regression with two variables
Note: This is real data, so the relationships you're observing reflect the real world. Isn't that cool?

### Q1) Democracy and Human Trafficking?
Run a regression where we investigate: Does democracy have a relationship with human trafficking flows?

- IV: `democracy`
- DV: `htflowsunodc`

$$htflowsunodc = \alpha + \beta_1 democracy + \epsilon_i$$

Visually, it looks like there is a relationship.

In [None]:
# RUN, No need to edit
cho %>%
  ggplot(aes(x = democracy, y = htflowsunodc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme(text = element_text(size = 20))

Your task: Use `lm` to estimate the relationship. 

In [None]:
# YOUR ANSWER HERE
mod1 <- NULL # YOUR CODE HERE

summary(mod1)

In [None]:
. = ottr::check("tests/Q1.R")

#### Interpretation
How would you interpret $\alpha$ and $\beta_1$? Your value for $\alpha$ should be 2.02, and $\beta_1$ should be 0.77. Are democracies associated with higher rates of trafficking inflow?

_Replace this text._

### Q2) Democracy and Legalized Prostitution?

Run a regression where we investigate: Does democracy have a relationship with legalized prostitution?

- IV: `democracy`
- DV: `prostitutionlaw`

$$prostitutionlaw = \alpha + \beta_1 democracy + \epsilon_i$$

Visually, it looks like there is a relationship. 

In [None]:
# RUN, No need to edit
cho %>%
  ggplot(aes(x = democracy, y = prostitutionlaw)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme(text = element_text(size = 20))

Your task: Use `lm` to estimate the relationship.

In [None]:
# YOUR ANSWER HERE
mod2 <- NULL # YOUR CODE HERE

summary(mod2)

In [None]:
. = ottr::check("tests/Q2.R")

#### Interpretation
There is a relationship!

Because prostitution law is a binary variable (takes on 0/1 values), you can interpret this as: 

* $\alpha = 0.20755$: Non-democracies have an average probability of 0.21 for legalizing prostitution. 
* $\beta_1 = 0.3899$: On average, relative to non-democracies, democracies are associated with 0.39 higher probability of legalized prostitution.
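
To see where these two readings come from, plug each value of `democracy` into the fitted line. This is a worked sketch using the coefficients quoted above and assuming `democracy` is coded 0/1:

$$\widehat{prostitutionlaw} = 0.20755 + 0.3899 \times democracy$$

$$democracy = 0: \quad 0.20755 + 0.3899 \times 0 = 0.20755 \approx 0.21$$

$$democracy = 1: \quad 0.20755 + 0.3899 \times 1 = 0.59745 \approx 0.60$$

The gap between the two predicted values is $\beta_1$, about 0.39.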

Are democracies, on average, associated with a higher probability of legalizing prostitution? Which coefficient tells you this: $\alpha$ or $\beta_1$? What if $\beta_1$ were negative?

_Replace this text_

### Q3) Wait a minute...Could democracy be driving this relationship?

In other words, we could theorize that democracies are more likely to legalize prostitution and also more likely to record human trafficking inflows.

Could the relationship we see (between legalized prostitution and human trafficking) arise simply because democracies document human trafficking more proactively? If so, legalizing prostitution doesn't actually make a difference, and democracy is the alternative explanation for why we see this relationship.

More on this next time. 

In [None]:
cho %>%
  ggplot(aes(x = prostitutionlaw, y = htflowsunodc)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(text = element_text(size = 20))

In your own words, why is democracy an alternative explanation to this relationship?

_Your answer here_

---
### Extra Time
Explore more relationships in the code chunk below using the `lm` function:

1. Are countries with higher GDP per capita (`gdp_pc_const_ppp_ln`) associated with lower rates of trafficking (`htflowsunodc`)? What is the estimated relationship?
2. Are countries with higher shares of Catholics (`catholic`) associated with lower rates of legalization (`prostitutionlaw`)? What is the estimated relationship?
3. Are countries in Western Europe (`reg_west_europe`) associated with higher rates of legalization (`prostitutionlaw`) relative to the rest of the countries?
4. What about Sub-Saharan Africa (`reg_ssa`)?
5. What about Latin America (`reg_latam`)?

In [None]:
# YOUR CODE HERE



### Preview: Accounting for Democracy

The cell below is the original regression evaluating the association between legalizing prostitution and human trafficking.

    lm(dv ~ iv, data = df)

In [None]:
# RUN CELL DO NOT CHANGE
lm(htflowsunodc ~ prostitutionlaw, data = cho) %>% summary()

The cell below accounts for democracy as a potential alternative explanation. We add in democracy on the right hand side as a "control" variable. 

*The cell below reads: What is the association between prostitutionlaw and human trafficking flows, holding democracy constant?*

Example code:

    lm(dv ~ iv + control, data = df)
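
As a minimal sketch of why adding a control can change a coefficient, consider the made-up data below. The `toy` data frame, its column names, and its values are hypothetical and not the Cho data. Here `flows` is built to depend only on `democracy`, but `legal` is more common among democracies, so the regression without the control picks up a slope on `legal` anyway.

```r
# Made-up toy data, NOT the Cho dataset.
# flows is constructed to depend only on democracy: flows = 1 + 2 * democracy.
toy <- data.frame(
  democracy = c(0, 0, 0, 0, 1, 1, 1, 1),
  legal     = c(0, 0, 0, 1, 0, 1, 1, 1)
)
toy$flows <- 1 + 2 * toy$democracy

# Without the control: legal appears to be associated with flows
naive <- lm(flows ~ legal, data = toy)
coef(naive)["legal"]

# With the control: holding democracy constant, the slope on legal vanishes
adjusted <- lm(flows ~ legal + democracy, data = toy)
round(coef(adjusted), 10)
```

In this toy example the slope on `legal` drops from 1 to 0 once `democracy` is included. Real data rarely behave this cleanly: a coefficient may shrink, stay roughly the same, or even grow when a control is added.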

In [None]:
# RUN CELL DO NOT CHANGE
lm(htflowsunodc ~ prostitutionlaw + democracy, data = cho) %>% summary()

Look at the value for `prostitutionlaw` in the `Estimate` column for both regression outputs above. This represents $\beta_1$. How does this change before and after including democracy? Why?

_Your answer here_

---
## Summary
We are using linear regression (via the `lm` function) to quantify the relationship between X and Y. As of right now, this is purely a correlational relationship, not a causal relationship. 

This relationship is written as: 
$$Y= \alpha + \beta_1 X + \epsilon_i$$

$\alpha$ is interpreted as: The predicted value of Y when X = 0. \
$\beta_1$ is interpreted as: A one unit increase in X is associated with a $\beta_1$ unit increase in Y. 
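
Both interpretations can be checked directly on a fitted model. The sketch below uses a small made-up data frame (`toy`, `x`, and `y` are hypothetical) together with the base R functions `coef()` and `predict()`.

```r
# Made-up toy data lying exactly on the line y = 1 + 0.5 * x
toy <- data.frame(x = c(0, 1, 2, 3, 4))
toy$y <- 1 + 0.5 * toy$x

fit <- lm(y ~ x, data = toy)
alpha <- unname(coef(fit)["(Intercept)"])
beta1 <- unname(coef(fit)["x"])

# Predicted Y at X = 0, X = 2, and X = 3
preds <- predict(fit, newdata = data.frame(x = c(0, 2, 3)))

# alpha is the prediction at X = 0
all.equal(unname(preds[1]), alpha)

# A one-unit increase in X (from 2 to 3) changes the prediction by beta1
all.equal(unname(preds[3] - preds[2]), beta1)
```

The same two checks work for any model fit with `lm(dv ~ iv, data = df)`, as long as the column name in `newdata` matches the independent variable.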

## Next Time

Since we can theorize alternative explanations for this relationship, is there a way to account for them in the regression?

Yes! We can try to isolate the effect of legalizing prostitution by adding "controls". More next time!

## What you need to know for Assignment #2
Already covered:
* Direction of Correlation
* Independent variable, dependent variables
* How do you interpret $Y = \alpha + \beta_1 X$

Next time: 
* “Control” variables
* `lm()` with control variables
