**Due Date: Monday, March 11th, 11:59pm**

- Save a copy of the notebook to your Google Drive. You can do that by going to the menu and clicking `File` > `Save` > `SAVE A COPY IN DRIVE`.
- Fill out the missing parts and run the code modules.
- Answer the questions (if any) in a separate document or by adding a new `Text` block inside the Colab.
- Download the solved notebook by going to the menu and clicking `File` > `Download .ipynb`.
- Make sure the downloaded version is showing your solutions.
- Upload your solutions to BruinLearn (under "Colab Assignment #6: Causal Inference").

In [1]:
import numpy as np
import pandas as pd

np.random.seed(0)

We are going to work with data from an A/B experiment (a randomized experiment) that tracks a new feature release for a product. The data contains the `treatment` variable (1 if the feature is enabled, 0 if it is disabled), the `engagement` of the users (in minutes), and some other characteristics of the users.

First, download the data.

In [2]:
!wget -O experiment_data.csv https://www.dropbox.com/s/i288i9my64ee4mh/ab_experiment_engagement_data.csv?dl=0

--2024-03-05 20:02:26--  https://www.dropbox.com/s/i288i9my64ee4mh/ab_experiment_engagement_data.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.2.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.2.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/i288i9my64ee4mh/ab_experiment_engagement_data.csv [following]
--2024-03-05 20:02:27--  https://www.dropbox.com/s/raw/i288i9my64ee4mh/ab_experiment_engagement_data.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc946a04f3aba3075832fe46bde0.dl.dropboxusercontent.com/cd/0/inline/COjM88lhBFlaUC4LRknpGVCJLBBYmbm9YpwMKdglEV-pUdQHexjdfetWD_5ERSJ6D_SM3_7ooCIwt8c1MnXOr7IREypkjDrMBmjTZfsBQhqI7qj2yvdKaAp7nIwWDJJauok/file# [following]
--2024-03-05 20:02:27--  https://uc946a04f3aba3075832fe46bde0.dl.dropboxusercontent.com/cd/0/inline/COjM88lhBFlaUC4LRknpGVCJLBBYmbm9YpwMKdglEV-pUdQHexjdfe

Load the data.

In [3]:
experiment_data = pd.read_csv("./experiment_data.csv")

print("The first three rows of the data are:")
experiment_data.head(3)

The first three rows of the data are:


| | treatment | age | friend_cnt | engagement |
|---|---|---|---|---|
| 0 | 0.0 | 37.056209 | 47.0 | 21.036701 |
| 1 | 0.0 | 31.600629 | 38.0 | 17.321933 |
| 2 | 0.0 | 33.914952 | 51.0 | 22.148228 |


Check some of the statistics in the dataset.

In [4]:
experiment_data.describe()

| | treatment | age | friend_cnt | engagement |
|---|---|---|---|---|
| count | 1800.0 | 1800.0 | 1800.0 | 1800.0 |
| mean | 0.444444 | 29.944523 | 39.706111 | 20.873576 |
| std | 0.497042 | 3.933893 | 7.806994 | 5.295282 |
| min | 0.0 | 17.815428 | 15.0 | 0.105721 |
| 25% | 0.0 | 27.258271 | 34.0 | 17.191072 |
| 50% | 0.0 | 29.964266 | 40.0 | 20.791688 |
| 75% | 1.0 | 32.495479 | 45.0 | 24.44043 |
| max | 1.0 | 42.683899 | 70.0 | 38.892067 |


##Estimate Average Treatment Effect (ATE)

Compute the average treatment effect for enabling the feature for the users.

In [5]:
# TODO: Assign the right values to control_mean_engagement,
#       treatment_mean_engagement, and ate
control_data = experiment_data[experiment_data.treatment == 0]
treatment_data = experiment_data[experiment_data.treatment == 1]

control_mean_engagement = np.mean(control_data.engagement)
treatment_mean_engagement = np.mean(treatment_data.engagement)
ate = treatment_mean_engagement - control_mean_engagement
# END OF TODO

print("Control average outcome: ", control_mean_engagement)
print("Treatment average outcome: ", treatment_mean_engagement)
print("Estimated ATE is: ", ate)

Control average outcome:  18.473860162402563
Treatment average outcome:  23.873220452686102
Estimated ATE is:  5.399360290283539


Test if the observed average treatment effect is statistically significant.

In [6]:
# TODO: Assign the right values to p_val
from scipy import stats

t_stat, p_val = stats.ttest_ind(treatment_data.engagement,
                                control_data.engagement,
                                equal_var = False)

# END OF TODO

print("p-value for comparing the mean outcome between two groups is: ", p_val)

p-value for comparing the mean outcome between two groups is:  7.234859301282946e-116


What can you say based on the previous computations about the relationship between having the feature enabled and engagement among the users?

The estimated average treatment effect indicates that users with the feature enabled spent about 5.399 more minutes engaged, on average, than users without it (23.873 vs. 18.474 minutes).

The p-value (about 7.2e-116) is far below a significance level of 0.05, so the observed difference in mean engagement between the treatment and control groups is statistically significant. Because treatment was randomly assigned, this difference can be interpreted as the causal effect of enabling the feature.
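
The difference-in-means estimate and Welch's test used above can also be computed by hand, which makes the standard error and confidence interval explicit. The following is a minimal sketch on synthetic data, not the assignment data: the arrays `control` and `treated`, their sizes, and the helper `diff_in_means` are all hypothetical names chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two engagement columns (NOT the assignment data).
control = rng.normal(loc=18.5, scale=4.5, size=1000)
treated = rng.normal(loc=23.9, scale=4.5, size=800)


def diff_in_means(treated, control, alpha=0.05):
    """Difference in means with a Welch (unequal-variance) confidence interval."""
    ate = treated.mean() - control.mean()
    # Variance of each group mean, using the unbiased sample variance.
    v_t = treated.var(ddof=1) / len(treated)
    v_c = control.var(ddof=1) / len(control)
    se = np.sqrt(v_t + v_c)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v_t + v_c) ** 2 / (
        v_t ** 2 / (len(treated) - 1) + v_c ** 2 / (len(control) - 1)
    )
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return ate, se, (ate - t_crit * se, ate + t_crit * se)


ate, se, ci = diff_in_means(treated, control)
print("Estimated ATE:", ate)
print("Standard error:", se)
print("95% confidence interval:", ci)
```

Dividing `ate` by `se` gives the same t-statistic that `stats.ttest_ind(..., equal_var=False)` reports for the same two arrays.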

##Linear Regression

This time, let's use a linear regression model from `statsmodels` to estimate the average treatment effect. The regression model should only have the treatment variable as a feature.

In [9]:
# TODO: fit a simple linear regression to the data
import statsmodels.api as sm
y = experiment_data['engagement']
X = experiment_data['treatment']

X_with_intercept = sm.add_constant(X)

model = sm.OLS(y, X_with_intercept).fit()
# END OF TODO

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             engagement   R-squared:                       0.257
Model:                            OLS   Adj. R-squared:                  0.256
Method:                 Least Squares   F-statistic:                     621.5
Date:                Tue, 05 Mar 2024   Prob (F-statistic):          4.60e-118
Time:                        20:23:23   Log-Likelihood:                -5286.7
No. Observations:                1800   AIC:                         1.058e+04
Df Residuals:                    1798   BIC:                         1.059e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         18.4739      0.144    127.942      0.0

What can you say about the estimated effect and the statistical significance based on these results? Do the results from the linear regression model match our previous results?

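We obtained a treatment coefficient of 5.399, which indicates that users with the feature enabled show about 5.399 more minutes of engagement, on average, than users without the feature enabled.

The p-value associated with the treatment coefficient is very low (reported as 0.000), indicating statistical significance.

The results from the linear regression match our previous results: the treatment coefficient equals the difference in means computed earlier (5.399), and the intercept (18.4739) equals the control group's mean engagement.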
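
The agreement between the regression and the difference in means is not a coincidence: with a single 0/1 regressor, ordinary least squares fits the control mean as the intercept and the difference in group means as the slope. The sketch below checks this on synthetic data using plain NumPy least squares rather than `statsmodels`; the variable names and the simulated effect size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical randomized 0/1 assignment and a simulated outcome.
treatment = rng.integers(0, 2, size=n)
engagement = 18.0 + 5.0 * treatment + rng.normal(0, 4, size=n)

# Design matrix with an intercept column, analogous to sm.add_constant.
X = np.column_stack([np.ones(n), treatment])
beta, *_ = np.linalg.lstsq(X, engagement, rcond=None)
intercept, slope = beta

control_mean = engagement[treatment == 0].mean()
treated_mean = engagement[treatment == 1].mean()

print("OLS intercept:", intercept, "| control mean:", control_mean)
print("OLS slope:", slope, "| difference in means:", treated_mean - control_mean)
```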

##Linear Regression with Additional Covariates

This time, let's add the other features in the dataset (i.e., age and friend count) into the linear regression model and check how the results change.

In [11]:
# TODO: fit a linear regression with all the provided covariates to the data
import statsmodels.api as sm
y = experiment_data['engagement']
X = experiment_data[['treatment', 'age', 'friend_cnt']]

X_with_intercept = sm.add_constant(X)

model = sm.OLS(y, X_with_intercept).fit()
# END OF TODO

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             engagement   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     3607.
Date:                Tue, 05 Mar 2024   Prob (F-statistic):               0.00
Time:                        20:25:51   Log-Likelihood:                -3799.4
No. Observations:                1800   AIC:                             7607.
Df Residuals:                    1796   BIC:                             7629.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.7387      0.425     18.219      0.0

- How does the estimated ATE change with the covariates added to the model? Explain your findings.
- How does the confidence interval for ATE change in the new model? Explain your findings.

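The treatment coefficient changes only slightly, from 5.399 to 5.465, once the covariates are added. Because treatment was randomly assigned, it should be roughly independent of age and friend count, so adjusting for them is not expected to shift the point estimate much; the new estimate also falls well inside the previous model's confidence interval.

The 95% confidence interval for the treatment coefficient was [4.975, 5.824] in the previous model and is [5.279, 5.651] in the model with covariates. The interval is narrower (a width of about 0.37 versus 0.85), indicating increased precision: age and friend count explain much of the variation in engagement (R-squared rises from 0.257 to 0.858), which lowers the residual variance and therefore the standard error of the treatment estimate.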
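
The precision gain from adding covariates can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not a re-analysis of the assignment data: the data-generating process, the coefficient values, and the helper `ols_coef_and_se` are hypothetical. The helper computes the classical (nonrobust) OLS standard errors, the square roots of the diagonal of `sigma^2 * (X'X)^-1`.

```python
import numpy as np


def ols_coef_and_se(X, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(2)
n = 2000
# Hypothetical randomized treatment and one prognostic covariate.
treatment = rng.integers(0, 2, size=n).astype(float)
friend_cnt = rng.normal(40, 8, size=n)
engagement = 5.0 * treatment + 0.5 * friend_cnt + rng.normal(0, 2, size=n)

ones = np.ones(n)
b_simple, se_simple = ols_coef_and_se(
    np.column_stack([ones, treatment]), engagement
)
b_adj, se_adj = ols_coef_and_se(
    np.column_stack([ones, treatment, friend_cnt]), engagement
)

print("Unadjusted: coef =", b_simple[1], "se =", se_simple[1])
print("Adjusted:   coef =", b_adj[1], "se =", se_adj[1])
```

Both models should recover a treatment coefficient near the simulated value of 5, while the adjusted model reports a smaller standard error because the covariate absorbs outcome variance that would otherwise sit in the residual.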

