## Causal Inference
# School of Information, University of Michigan 
## Week 3

### Resources:
- Course Manual, which can be found in Coursera
- [Instrumental Variables & Randomized Encouragement Trials: Driving Engagement of Learners](assets/MediumArticle.pdf)

## Part 1

### Background

Researchers at Coursera are interested in figuring out whether a certain learning style can cause a learner to be more engaged and thus more likely to ultimately complete a course.

### Data

The data file lecture3.csv contains 6 variables for 49,909 learners on the online learning platform Coursera. Below are the descriptions of each variable in the data:

- *paid_enroll*: dummy variable that equals 1 if a learner has paid for enrollment, 0 otherwise
- *prv_wk_nbr*: the most recent course week a learner has completed, as a measure of how far the learner is into the class (e.g. if a learner most recently completed week 2 of a course, this variable is equal to 2)
- *prv_wk_min*: the minutes a learner spent in the previous week on the platform
- *message*: equal to 1 if a learner is in the treatment group (i.e. he/she received a message), 0 otherwise
- *binge*: equal to 1 if a learner has binged, 0 otherwise (bingeing behavior is defined as completing and starting consecutive weeks of a course on the same day)
- *complete*: dummy variable that is equal to 1 if a learner completed the next week in the course, 0 otherwise


In [1]:
# Import statements. Run this cell.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from linearmodels import IV2SLS

In [2]:
#Uploading data for assignment. Run this cell.
data_coursera = pd.read_csv('assets/lecture3.csv')

#Show the first five rows of the dataframe.
data_coursera.head()

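| | paid_enroll | prv_wk_nbr | prv_wk_min | message | binge | complete |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 193 | 0 | 1 | 1.0 |
| 1 | 0 | 5 | 194 | 0 | 1 | 1.0 |
| 2 | 1 | 1 | 45 | 0 | 0 | 1.0 |
| 3 | 1 | 4 | 118 | 0 | 0 | 1.0 |
| 4 | 0 | 5 | 247 | 0 | 1 | 1.0 |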


## Questions

We are interested in investigating whether bingeing, defined as completing and starting consecutive weeks of a course on the same day, increases the likelihood of completing the following week in a course.

**Note**: You can refer to the manual for the methods we use in the assignment if you need to. 

**Use the data_coursera dataframe uploaded above to answer the questions below unless otherwise specified.**

**1.** Using robust standard errors in the statsmodels module, regress the variable *complete* on *binge*. Assign the coefficient in front of *binge* to the variable `binge_coeff1_1` and ensure that its data type is float. (Round to two decimal places.) (1 pt)

In [3]:
# YOUR CODE HERE

#Regress complete on binge and fit model: 
m = smf.ols('complete~binge', data = data_coursera).fit()

#Print old model summary: 
print(m.summary())

#Now, let's calculate robust standard error
ols_robust = m.get_robustcov_results(cov_type= 'HC1')

#Print robust standard error model summary: 
print(ols_robust.summary())

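#Hence, assign the coefficient on the binge variable (0.4619 in the summary above) to binge_coeff1_1.
#Robust standard errors change the standard errors only, not the point estimate, so we can read it from m: 

binge_coeff1_1 = round(float(m.params['binge']), 2)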

print(f'Binge Coefficient rounded to 2 decimal points: {binge_coeff1_1}')
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:               complete   R-squared:                       0.258
Model:                            OLS   Adj. R-squared:                  0.258
Method:                 Least Squares   F-statistic:                 1.735e+04
Date:                Tue, 23 Feb 2021   Prob (F-statistic):               0.00
Time:                        22:52:04   Log-Likelihood:                -4289.1
No. Observations:               49808   AIC:                             8582.
Df Residuals:                   49806   BIC:                             8600.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4937      0.003    151.014      0.0

In [4]:
# Hidden Tests, checking value of binge_coeff1_1.

**2.** Now run the regression (using robust standard errors) one more time with additional controls: *paid_enroll*, *prv_wk_nbr*, *prv_wk_min*. Assign the coefficient in front of *binge* to the variable `binge_coeff1_2` and ensure that its data type is float. (Round to two decimal places.) (1 pt)

In [5]:
# YOUR CODE HERE
#Regress complete on binge, paid_enroll, prv_wk_nbr, prv_wk_min and fit model: 
m_2 = smf.ols('complete~binge+paid_enroll+prv_wk_nbr+prv_wk_min', data = data_coursera).fit()

#Print old model summary: 
print(m_2.summary())

#Now, let's calculate robust standard error
ols_robust_2 = m_2.get_robustcov_results(cov_type= 'HC1')

#Print robust standard error model summary: 
print(ols_robust_2.summary())

#Hence, assign the coefficient on the binge variable (0.3172 in the summary above) to binge_coeff1_2: 

binge_coeff1_2 = round(float(m_2.params['binge']), 2)

print(f'Binge Coefficient rounded to 2 decimal points: {binge_coeff1_2}')
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:               complete   R-squared:                       0.334
Model:                            OLS   Adj. R-squared:                  0.334
Method:                 Least Squares   F-statistic:                     6243.
Date:                Tue, 23 Feb 2021   Prob (F-statistic):               0.00
Time:                        22:52:04   Log-Likelihood:                -1611.9
No. Observations:               49808   AIC:                             3234.
Df Residuals:                   49803   BIC:                             3278.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.4317      0.004     96.942      

In [6]:
# Hidden Tests, checking value of binge_coeff1_2.

When the point estimate we are interested in (i.e., the coefficient on the variable *binge*) changes drastically once further covariates are included, that is worrisome for causal inference purposes (remember the regression sensitivity analysis). Furthermore, the positive correlation between bingeing and completion could simply be the result of self-selection: more motivated learners are both inherently more likely to complete and more likely to binge.

To overcome this problem, researchers at Coursera decided to run a randomized encouragement trial. They randomly split their learners into two groups. The treatment group received a message immediately after completing a week of material. The goal of the message was to encourage learners to start the next week right away (see below). The control group didn’t receive the message.

<img src="assets/Congratulations.png" alt="Treatment Message" style="width: 500px;"/>
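The self-selection concern described above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the Coursera data: the `motivation` variable, the coefficients, and the continuous outcome are all assumptions made for illustration. It shows how a confounder that drives both bingeing and completion inflates the naive regression coefficient, and how the coefficient moves once the confounder is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved confounder: motivated learners binge more AND complete more.
motivation = rng.normal(size=n)
binge = (motivation + rng.normal(size=n) > 0).astype(float)

# Assumed true causal effect of bingeing (illustrative value, not an estimate).
true_effect = 0.10
# Simplification: a continuous outcome instead of the binary `complete`.
complete = 0.5 + true_effect * binge + 0.2 * motivation + rng.normal(scale=0.1, size=n)

def ols_coefs(y, X):
    """Return OLS coefficients (intercept first) using least squares."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression: outcome on binge only (omits the confounder).
naive = ols_coefs(complete, binge)[1]
# Adjusted regression: outcome on binge and the confounder.
adjusted = ols_coefs(complete, np.column_stack([binge, motivation]))[1]

print(f"true effect:        {true_effect:.3f}")
print(f"naive coefficient:  {naive:.3f}")
print(f"adjusted coefficient: {adjusted:.3f}")
```

In real data the confounder is usually unobserved, so the adjusted regression is not available; that is the gap the randomized encouragement design is meant to close.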


## Part 2

### Questions 

We will be using the binary variable *message* as our instrument to investigate the impact of bingeing on completion of the following week’s lecture.

**1.** Since messages were randomly assigned, we know that the independence assumption is satisfied. What does the exclusion restriction mean in this context? (2 pts)

**Note**: This question will be manually graded. 

YOUR ANSWER HERE

The exclusion restriction in this context means that the *message* variable cannot directly affect completion of the following week's lecture. The instrument "message" may affect completion only through a single channel, namely the treatment "binge". 

**2.** Let’s look at the first stage relationship.

**2a.** Using robust standard errors in the statsmodels module, regress variable *binge* on variable *message*. Assign the results (using the `.get_robustcov_results()` method) to the variable `robust_reg2_2a`. (1 pt)

In [7]:
# YOUR CODE HERE

#Regress treatment variable "binge" on instrument variable "message" (first stage) and fit model: 
m_3 = smf.ols('binge~message', data = data_coursera).fit()

#Print old model summary: 
print(m_3.summary())

#Now, let's calculate robust standard error
robust_reg2_2a = m_3.get_robustcov_results(cov_type= 'HC1')

#Print robust standard error model summary: 
print(robust_reg2_2a.summary())

robust_reg2_2a
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:                  binge   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     830.7
Date:                Tue, 23 Feb 2021   Prob (F-statistic):          3.52e-181
Time:                        22:52:05   Log-Likelihood:                -16399.
No. Observations:               49909   AIC:                         3.280e+04
Df Residuals:                   49907   BIC:                         3.282e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.8243      0.002    387.409      0.0

<statsmodels.regression.linear_model.OLSResults at 0x7fd3a848c5c0>

In [8]:
# Hidden Tests, checking the coefficients and standard errors of robust_reg2_2a.

**2b.** Do we have a strong first stage? Explain. (1 pt)

**Note**: This question will be manually graded.

YOUR ANSWER HERE

Yes. The first-stage coefficient on "message" is positive (0.0867) and its t-statistic is 28.822, far above the rule-of-thumb threshold of the square root of 10 (about 3.16), which corresponds to a first-stage F-statistic above 10. This indicates a strong first stage. 

Another point to note is that instrument strength (relevance) is a separate condition from the exclusion restriction. A weak first stage would not by itself mean that "message" affects "complete" other than through "binge". It would, however, make the IV estimate imprecise, bias it toward the OLS estimate, and magnify the bias from even a small violation of the exclusion restriction. 
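As a quick arithmetic check of the rule of thumb: with a single instrument, the first-stage F-statistic equals the square of the instrument's t-statistic. The sketch below uses the t-statistic of 28.822 quoted above; the threshold of 10 is the conventional rule of thumb, not a value taken from the regression output.

```python
import math

# t-statistic on `message` in the first stage, as quoted in the answer above.
t_stat = 28.822

# With one instrument, the first-stage F-statistic is the squared t-statistic.
f_stat = t_stat ** 2

# Conventional rule of thumb for a strong instrument: F > 10, i.e. |t| > sqrt(10).
rule_of_thumb_f = 10
print(f"F = t^2 = {f_stat:.1f}")
print(f"sqrt(10) = {math.sqrt(rule_of_thumb_f):.2f}")
print(f"strong first stage by rule of thumb: {f_stat > rule_of_thumb_f}")
```

The squared t-statistic matches the F-statistic of 830.7 reported in the first-stage summary.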

**3.**  Let’s look at the intention-to-treat effect.

**3a.** Calculate the “intention-to-treat” (ITT) effect by running the reduced form regression. That is, using robust standard errors, regress *complete* on *message*. Based on your regression results, how much does receiving a message change the likelihood of completing the next week? Assign this number (the coefficient in front of variable *message*) to the variable `l_change2_3` and ensure that its data type is float. (Round to two decimal places.) (1 pt)

In [9]:
# YOUR CODE HERE
#Regress outcome variable "complete" on instrumental variable "message" (ITT) and fit model: 
m_4 = smf.ols('complete~message', data = data_coursera).fit()

#Print old model summary: 
print(m_4.summary())

#Now, let's calculate robust standard error
robust_reg2_4 = m_4.get_robustcov_results(cov_type= 'HC1')

#Print robust standard error model summary: 
print(robust_reg2_4.summary())

#Hence, the change in the likelihood of completing the next week from receiving the message
#is the coefficient on message (0.0113 in the summary above): 
l_change2_3 = round(float(m_4.params['message']), 2)

print(f"Coefficient in front of variable message: {l_change2_3}")
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:               complete   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     17.05
Date:                Tue, 23 Feb 2021   Prob (F-statistic):           3.65e-05
Time:                        22:52:05   Log-Likelihood:                -11725.
No. Observations:               49808   AIC:                         2.345e+04
Df Residuals:                   49806   BIC:                         2.347e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.8896      0.002    458.907      0.0

In [10]:
# Hidden Tests, checking value of l_change2_3

**3b.** Based on the p-value, can you conclude that it is significant at the 5% level? Explain. (i.e. report BOTH the p-value AND the decision rule based on p-value to determine if the coefficient differs from 0 at the 5% significance level) (1 pt) 

**Note**: This question will be manually graded.

YOUR ANSWER HERE

Yes. The p-value on the *message* coefficient is reported as 0.000, which is below 0.05. By the decision rule (reject the null hypothesis when the p-value is smaller than the significance level), we reject the null hypothesis that receiving the message has no effect on completion, so the coefficient is statistically different from 0 at the 5% significance level.

**4.** The ITT doesn’t take into account that some users may not comply with the treatment assignment. With heterogeneous treatment effects, we have an additional assumption: monotonicity. This means that there are “no defiers” in the population. Explain what “no defiers” means in this context. (2 pts) 

**Note**: This question will be manually graded.

YOUR ANSWER HERE

In this context, "no defiers" means that there is no learner who would binge if they did not receive the message but would not binge if they did receive it. 
This assumption ensures that the instrument "message" can only push learners in one direction: it may move some learners from not bingeing to bingeing, but it never discourages anyone from bingeing.  
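To make the four compliance types concrete, the sketch below classifies a learner by their two potential bingeing outcomes: whether they would binge without the message and whether they would binge with it. The function name `compliance_type` is hypothetical and only illustrates the definitions. In practice only one of the two potential outcomes is observed for each learner.

```python
def compliance_type(binge_without_message: int, binge_with_message: int) -> str:
    """Classify a learner by their potential bingeing behavior under each assignment."""
    types = {
        (1, 1): "always-taker",  # binges regardless of the message
        (0, 0): "never-taker",   # never binges regardless of the message
        (0, 1): "complier",      # binges only when encouraged by the message
        (1, 0): "defier",        # binges only when NOT sent the message
    }
    return types[(binge_without_message, binge_with_message)]

# Enumerate all four combinations of potential outcomes.
for without_msg in (0, 1):
    for with_msg in (0, 1):
        label = compliance_type(without_msg, with_msg)
        print(f"binge without message={without_msg}, with message={with_msg}: {label}")
```

Monotonicity rules out the `(1, 0)` row, so the population consists only of always-takers, never-takers, and compliers.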

**5.** Assuming the no defiers assumption is satisfied, calculate the share of “always-takers,” which is given by the probability of bingeing when assigned to not receive a message. (You can calculate this value by dividing the number of learners who binged without receiving a message by the total number of learners in the no message group.) Assign the value to the variable `at_share2_5` and ensure that its data type is float. (Round to two decimal places.) (1 pt)

In [11]:
# YOUR CODE HERE
data_coursera.head()

#Calculate share of always_takers (Means they binge even when the message is not received) : 
#Ratio = Number of learners binged without receiving a message/Total learners in no message group

#Number of learners binged without receiving a message: 
binged_no_msg = data_coursera['binge'][(data_coursera['message']==0) & (data_coursera['binge']==1)].count()

#Total learners in no message group: 
no_msg_group = data_coursera['message'][data_coursera['message']==0].count()

#Hence, share of always_takers: 
at_share2_5 = round(binged_no_msg/no_msg_group, 2)

#raise NotImplementedError()

In [12]:
print(f"Share of always_takers: {at_share2_5}")

Share of always_takers: 0.82


In [13]:
# Hidden Tests, checking value of at_share2_5

**6.** Similarly, calculate the share of “never-takers”, which is given by the probability of not bingeing when assigned to receiving a message. Assign the value to the variable `nt_share2_6` and ensure that its data type is float. (Round to two decimal places.) (1 pt)

In [14]:
# YOUR CODE HERE
#Calculate share of never_takers (Means they do not binge even when they receive the message) : 
#Ratio = Number of learners who did not binge after receiving a message/Total learners in message group

#Number of learners not binged after receiving a message: 
no_binged_msg_recvd = data_coursera['binge'][(data_coursera['message']==1) & (data_coursera['binge']==0)].count()

#Total learners in message group: 
msg_recvd_group = data_coursera['message'][data_coursera['message']==1].count()

#Hence, share of never_takers: 
nt_share2_6 = round(no_binged_msg_recvd/msg_recvd_group, 2)
#raise NotImplementedError()

In [15]:
print(f"Share of never_takers: {nt_share2_6}")

Share of never_takers: 0.09


In [16]:
# Hidden Tests, checking value of nt_share2_6.

**7.** The ITT effect divided by the difference in compliance rates between the two groups (i.e. the effect of the instrument on the treatment) captures the causal effect of bingeing on compliers, that is, learners who binged as a result of the experiment. In other words, the IV estimate we are interested in is equal to the reduced form divided by the first stage. Calculate the IV estimate manually, that is, divide the reduced form coefficient from Part 2, Question 3a (rounded to two decimal places) by the first stage coefficient from Part 2, Question 2a (rounded to two decimal places). Assign the value to the variable `answer2_7` and ensure that its data type is float. (Round to two decimal places.) (1 pt)
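The arithmetic behind this ratio (often called the Wald estimator) can be sketched with the numbers reported in this notebook: a reduced-form coefficient of 0.0113, a first-stage coefficient of 0.0867, an always-taker share of 0.82, and a never-taker share of 0.09. The variable names are illustrative. Under monotonicity, the share of compliers is one minus the always-taker and never-taker shares, and it should agree with the first-stage coefficient.

```python
# Values reported in this notebook (unrounded regression coefficients, rounded shares).
itt = 0.0113          # reduced form: coefficient on message in complete ~ message
first_stage = 0.0867  # first stage: coefficient on message in binge ~ message
always_takers = 0.82  # share bingeing without a message
never_takers = 0.09   # share not bingeing despite a message

# Under monotonicity (no defiers), everyone else is a complier.
complier_share = 1 - always_takers - never_takers
print(f"complier share: {complier_share:.2f} (first stage: {first_stage:.2f})")

# Wald / IV estimate: reduced form divided by first stage.
wald_unrounded = itt / first_stage
# The assignment asks for the ratio of the coefficients rounded to two decimals.
wald_rounded_inputs = round(itt, 2) / round(first_stage, 2)

print(f"IV estimate, unrounded inputs: {wald_unrounded:.2f}")
print(f"IV estimate, rounded inputs:   {wald_rounded_inputs:.2f}")
```

Rounding the inputs before dividing shifts the ratio noticeably here because both coefficients are small, so the two versions need not agree to two decimal places.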

In [17]:
# YOUR CODE HERE

#IV estimate = intention-to-treat (reduced form) / first stage

#ITT is already known = l_change2_3

#First stage: the coefficient on message from the regression of binge on message (0.0867), rounded to 2 decimal places: 
first_stage_message = round(float(m_3.params['message']), 2)

answer2_7 = round(l_change2_3/first_stage_message,2)

#raise NotImplementedError()

In [18]:
print(f" IV Estimate : {answer2_7}")

 IV Estimate : 0.11


In [19]:
# Hidden Tests, checking value of answer2_7.

**8.** In order to obtain a measure of the precision of our IV estimate we want to use the 2SLS method. Using robust standard errors, run a two-stage least squares regression, where the outcome variable is *complete*, the instrumented variable is *binge* and the instrument is the variable *message*. Use the IV2SLS module from the linearmodels library. Assign the results using the `.fit()` method to the variable `iv2sls2_8`. (2 pts)

**Note**: Be sure to remove any NAs from the dataframe before proceeding. 

In [20]:
data_coursera.columns

Index(['paid_enroll', 'prv_wk_nbr', 'prv_wk_min', 'message', 'binge',
       'complete'],
      dtype='object')

In [21]:
# YOUR CODE HERE

#Remove any null values from the dataset (copy so that adding a column does not modify a view of data_coursera): 
data_coursera_2 = data_coursera.dropna().copy()

#Add constant: 
data_coursera_2['const'] = 1


#Dependent variable = complete
#Endogenous variable = binge
#Exogenous variable = const only (no additional controls are used in this question)
#Instrument = message


#Now, we can plug in above variables: 
iv2sls2_8 = IV2SLS(dependent = data_coursera_2['complete'],
                   exog= data_coursera_2['const'], #data_coursera_2[['const', 'paid_enroll', 'prv_wk_nbr', 'prv_wk_min']],
                   endog = data_coursera_2['binge'],
                   instruments= data_coursera_2['message']).fit(cov_type= 'robust')

#Print summary: 
print(iv2sls2_8.summary)

#Alternatively, we can write the same model with the 2SLS formula interface (here 'const' is the only exog variable): 
#alternatively = IV2SLS.from_formula('complete ~ const + [binge ~ message]', data = data_coursera_2).fit(cov_type= 'robust')

#raise NotImplementedError()

                          IV-2SLS Estimation Summary                          
Dep. Variable:               complete   R-squared:                      0.1213
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1213
No. Observations:               49808   F-statistic:                    19.395
Date:                Tue, Feb 23 2021   P-value (F-stat)                0.0000
Time:                        22:52:06   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.7862     0.0248     31.690     0.0000      0.7376      0.8348
binge          0.1254     0.0285     4.4040     0.00

In [22]:
# Hidden Tests, checking the coefficients and standard errors of iv2sls2_8.
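To see why 2SLS and the manual ratio target the same quantity, the following sketch runs both on synthetic data using only NumPy. Everything in it is an assumption for illustration (the data-generating process, the `motivation` confounder, the coefficients, and the continuous outcome); it does not use the Coursera data or `linearmodels`. With one binary instrument and a constant, the second-stage coefficient from manual 2SLS equals the reduced form divided by the first stage.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Randomized instrument (message) and a hypothetical unobserved confounder.
message = rng.integers(0, 2, size=n).astype(float)
motivation = rng.normal(size=n)

# The message nudges bingeing; motivation drives it as well.
binge = (0.8 * motivation + 0.5 * message + rng.normal(size=n) > 0).astype(float)
# Simplified continuous outcome with an assumed effect of 0.10 for bingeing.
complete = 0.5 + 0.10 * binge + 0.2 * motivation + rng.normal(scale=0.1, size=n)

def ols(y, x):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Wald estimator: reduced form divided by first stage.
first_stage = ols(binge, message)[1]
reduced_form = ols(complete, message)[1]
wald = reduced_form / first_stage

# Manual 2SLS: regress the outcome on the first-stage fitted values.
intercept, slope = ols(binge, message)
binge_hat = intercept + slope * message
tsls = ols(complete, binge_hat)[1]

print(f"first stage:  {first_stage:.4f}")
print(f"Wald ratio:   {wald:.4f}")
print(f"manual 2SLS:  {tsls:.4f}")
```

The point estimates coincide, but the standard errors from a manually run second stage are not valid because they ignore that `binge_hat` is itself estimated. That is the reason for using a dedicated estimator such as `IV2SLS` to measure precision.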