In [None]:
version = "REPLACE_PACKAGE_VERSION"

## Experiment Design and Analysis
## School of Information, University of Michigan

## Part 1:
- 1. What are experiments?
- 2. Experimental Design
- 3. Lab vs. Field Experiments
- 4. Online Field Experiments

## Overview
### The objective of this part is to:

- Apply theory of experiment design and knowledge of analysis techniques to real experiment data.

### Resources:
- StatsModels
    - We recommend using a python library called [StatsModels](https://www.statsmodels.org/stable/index.html) for data analysis

- Dataset used in this part: Fixed-Price Auction data [download csv file](assets/Part1_data.csv)
    - Source for dataset: [Chen, Y., et al. Sealed bid auctions with ambiguity: Theory and experiments. (2007).](https://www.sciencedirect.com/science/article/pii/S0022053107000178)

In [113]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats
#you may or may not use all of the above libraries, and that is OK!
data = pd.read_csv('assets/Part1_data.csv') #Data for this part

## Part A

Suppose subjects were randomly assigned to two treatment groups. We want to know if the randomization was properly applied to these groups. In other words, we want to know if the proportion of participants in these demographic groups are different between the two treatments.

1. To determine if the randomization worked, for each of the two treatments, modify the following ```stats_calculator``` function so that it can input the ```data``` dataframe and tabulates the mean, standard deviation, minimum and maximum of the following variables: female, age, number of siblings, white, asian, african american, hispanic, and other ethnicities. (6 points)


In [29]:
import numpy as np
def stats_calculator(provided_data):
    """
    Write the function so that it fills-in the mean, standard deviation, 
    minimum and maximum of the following variables: Female, Age, 
    Number of siblings, White, Asian, African American, Hispanic, and Other ethnicities.
    
    It should return a dataframe with these calculations based on the partially-completed dataframe below.
    
    """


Your function should return a dataframe with each of the variables and their completed statistics. Check that it does:

In [30]:
stats_calculator(data)

Unnamed: 0,variable,mean,std. dev.,max,min
0,female,0.64,0.48,1.0,0.0
1,age,22.51,3.49,31.0,18.0
2,siblings,1.64,1.2,5.0,0.0
3,white,0.47,0.5,1.0,0.0
4,asian,0.27,0.44,1.0,0.0
5,african,0.11,0.31,1.0,0.0
6,hispanic,0.08,0.25,1.0,0.0
7,other,0.08,0.27,1.0,0.0


In [31]:
"""Check that the function above outputs the (rounded) statistics"""
assert stats_calculator(data).iloc[0][1] == 0.64, "Part A #1 female mean value differs"
assert stats_calculator(data).iloc[1][2] == 3.49, "Part A #1 age std. dev value differs"

In [32]:
"""Hidden test Part A: Check function abv outputs (rounded) statistics"""
# Hidden tests

'Hidden test Part A: Check function abv outputs (rounded) statistics'

In [33]:
"""Part A: Check function abv outputs (rounded) statistics"""
# Hidden tests

'Part A: Check function abv outputs (rounded) statistics'

## Part B (4 points)

We can also use a more objective measure to identify if our treatment groups were properly randomized.

1. Using a __t-test__ (make sure you use the _correct_ type of t-test) and the ```data``` dataframe again, analyze the differences between the two treatment groups (__k1_8_exp_lot__ and __k1_8_lot_exp__) for the female, age, and hispanic demographic variables by completing the following ```objective_randomization``` function. (4 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [34]:
def objective_randomization(provided_data):
    """
    
    Complete the function that takes the provided data and runs a t-test on the 
    female, age, and hispanic demographic variables between the two treatments
    and outputs the results in the following partially-completed dataframe.
    Round your results to the nearest hundredth.
    Tip: you can choose to use either the statsmodels stats library or the scipy stats library to calculate the t-statistic and p-value.
    
    """

    

Your function should return a dataframe with each of the variables and their completed t-statistic and p-value across the treatments. 

Check that it does:

In [38]:
"""Check that the function above outputs the required statistics"""
result = objective_randomization(data)
assert abs(result.iloc[0][1]) == 0.23, "checking the value of the female t-statistic"

In [39]:
"""Part B # 1: Check function abv outputs required statistics"""
# Hidden tests

'Part B # 1: Check function abv outputs required statistics'

## Part 2:


## Overview
### The objective of this part is to:

- Apply theory of experiment design and knowledge of analysis techniques to real experiment data.


### Resources:
- StatsModels, Scipy.stats, Numpy
    - We recommend using two python libraries called [StatsModels](https://www.statsmodels.org/stable/index.html) and [Scipy.stats](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) for data analysis. For this dataset, we'll be using [Numpy](https://numpy.org/devdocs/reference/index.html) as well.

- Optional Reading: [Holt C.A, & Laury S.K. Risk Aversion and Incentive Effects. (2002).](https://www.jstor.org/stable/3083270)


Datasets used in this part
- Trust data [download csv file](assets/Part2_data1.csv)
- Fixed-Price Auction data [download csv file](assets/Part2_data2.csv)
    - Source for dataset: [Chen, Y., et al. Sealed bid auctions with ambiguity: Theory and experiments. (2007).](https://www.sciencedirect.com/science/article/pii/S0022053107000178)

In [40]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats
#you may or may not use all of the above libraries, and that is OK!

trust_data = pd.read_csv('assets/Part2_data1.csv') #Trust Game data for this part
fpa_data = pd.read_csv('assets/Part2_data2.csv') #First-price auction data for this part - this is the same dataset from last part

In [41]:
#uncomment the below line to view readme files for this dataset (includes explanation of variable names)
!cat assets/Part2_data1_readme.md
!cat assets/Part2_data2_readme.md

#uncomment the below line to view snippet of csv file
trust_data.head()
fpa_data.head()

### Assignment Topic: Data analysis of a laboratory experiment on trust

### Background:
We upload data files from laboratory experiments conducted at the University of Michigan.

Subjects are grouped in pairs, paired with one assigned the role of an investor and another a recipient.

    - The investor holds a set amount of money and can choose to give any fraction of that amount to the recipient – or none.

    - The amount given is multiplied by a set amount and the recipient can choose to give any fraction of the multiplied amount back to the investor – or none.

The data given was collected from an experiment involving the Trust Game and it contains decisions from the “investors” and “recipients.”

### Data:
The Trust Game data has the following variables:
   - Period: period which the game was held
   - group #: pair the player was in
   - player #: order or role the player had
   - player role: first if investor, second if recipient
   - decision type: INVEST if investor, RETURN

Unnamed: 0,treatment,session,period,subject,disttype,highdist,lowdist,group,v,b,...,irritation,moodswings,withdrawal,major,sdmajor,major1,major2,major3,major4,major5
0,k1_8_exp_lot,061018_1,1,1,Low,0,1,1,48,40,...,0,0,0,2,Electrical Engineering - Signal Processing (st...,0,1,0,0,0
1,k1_8_exp_lot,061018_1,1,2,High,1,0,4,76,15,...,0,0,0,4,public health,0,0,0,1,0
2,k1_8_exp_lot,061018_1,1,3,High,1,0,3,73,53,...,0,0,0,5,german and film and video studies,0,0,0,0,1
3,k1_8_exp_lot,061018_1,1,4,High,1,0,4,74,51,...,1,0,0,5,spanish,0,0,0,0,1
4,k1_8_exp_lot,061018_1,1,5,Low,0,1,1,72,52,...,0,1,0,2,Engineering,0,1,0,0,0


## Part A (3 points)

1. For the Trust Game, subjects are grouped in pairs, paired with one assigned the role of an investor and another a recipient. Let's examine  the correlation between the amounts the investors invest and the amounts the recipients return. Complete the function below to return the correlation coefficient. (3 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [43]:
def inv_rec_corrcoef(provided_data):
    """ Later in this problem set, you will be modeling OLS regressions on your data. For now, we'll calculate just
    the correlation coefficient using numpy. If needed, refer to the numpy documentation linked above.
    """
    # YOUR CODE HERE
    # raise NotImplementedError()
    # return answer
    
    

Your function should return a float with the correct coefficient value. Check that it does:

In [44]:
assert type(inv_rec_corrcoef(trust_data)) == np.float64

In [45]:
"""Part A #1: Checking value of correlation coefficient"""
# Hidden tests

'Part A #1: Checking value of correlation coefficient'

## Part B (4 points)

For the first-price auctions experiment, there are ten experimental sessions, with eight subjects per session. In this context, subjects are tasked with completing auction and lottery (Holt-Laury 2002) tasks in two orders. In five of the ten sessions, subjects first complete a lottery task, followed by 30 rounds of auctions. In the other five sessions, subjects first complete 30 rounds of auctions, followed by a lottery task. At the end of each session, subjects complete a demographics survey. The data sets extract the first period auction data for each treatment.

In this case, say that the control for the first-price auction experiment is the order in which subjects complete the lottery task followed by the auction task (k1_8_lot_exp) and the outcome variable we want to measure is the bid-value ratio (b/v).

1. Using differences-in-means, what is the average treatment effect for the first-price auction experiment? (4 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [92]:
def ate_fpa_payoff(provided_data):
    """
    Write the function to manually check the differences in means of bid-value ratios across the different groups explained above.
    To do this, please create a new dataframe column called 'bidval_ratio' in the provided data.
    Your function should output a dataframe with the following columns: 'lot_auc_mean', 'auc_lot_mean', 'diff in means'.
    """
    # YOUR CODE HERE


Your function should return a dataframe with the correct values and columns. Check that it does:

In [93]:
assert isinstance (ate_fpa_payoff(fpa_data), pd.core.frame.DataFrame), "checking your data is in a dataframe"

In [94]:
assert ate_fpa_payoff(fpa_data).columns[0] == 'lot_auc_mean', "checking df column names"
assert ate_fpa_payoff(fpa_data).columns[1] == 'auc_lot_mean', "checking df column names"
assert ate_fpa_payoff(fpa_data).columns[2] == 'diff in means', "checking df column names"

In [95]:
"Part B: lot_auc_mean value"
# Hidden tests

'Part B: lot_auc_mean value'

In [96]:
"Part B: auc_lot_mean value"
# Hidden tests

'Part B: auc_lot_mean value'

## Part C (12 points)

Continuing with the ```fpa_data``` dataset from last part, we would expect subjects to bid a certain fraction of their value in a first-price sealed bid auction depending on their risk attitudes (e.g., risk neutral, risk averse). Let's explore what effect gender has on bid-value ratios when controlled with risk. This time, let's calculate this average treatment effect using an ordinary least-squares regression.

1. Using the ```fpa_data``` dataframe and an ordinary least-squares regression model, complete the ```ols_riskfemale_on_bidvalue``` function to evaluate how subjects’ risk attitudes and gender (in the form of the _female_ variable) affect their bid/value ratio. For now, we'll just return a summary view of our data. (2 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [116]:
def ols_riskgender_on_bidvalue(provided_data):

    """
    The easiest way to evaluate how subjects' risk attitudes and gender affect their bid/value ratio is to run an OLS linear
    regression on your fpa_data dataframe. Use the statsmodels library to run an OLS linear regression, and return the summary
    view of your results.

    """
    
   

In [None]:
# >>> Y = duncan_prestige.data['income']
# >>> X = duncan_prestige.data['education']
# >>> X = sm.add_constant(X)
# >>> model = sm.OLS(Y,X)
# >>> results = model.fit()
# >>> results.params

Your function should return a summary view of your results. Check that it does:

2. Now, modify the ols_riskgender_on_bidvalue function to access the model's coefficients (parameters) and associated p-values, instead of printing out the entire summary view. For now, we won't worry about rounding. (3 points)

In [None]:
def ols_riskgender_on_bidvalue(provided_data):

    """
    The easiest way to evaluate how subjects' risk attitudes and gender affect their bid/value ratio is to run an OLS linear
    regression on your data dataframe. Use the statsmodels library to run an OLS linear regression, and this time return the
    the coefficients and the p-values for your model.

    """

   

Your function should return a raw tuple of your results in pandas Series form. Check that it does:

In [None]:
ols_riskgender_on_bidvalue(fpa_data)

In [None]:
"checking your return value is a tuple of type pandas series"
assert isinstance (ols_riskgender_on_bidvalue(fpa_data)[0], pd.core.series.Series)
assert isinstance (ols_riskgender_on_bidvalue(fpa_data)[1], pd.core.series.Series)

In [None]:
"checking the value of 'const' for both values"
# Hidden tests

3. Now, let's make our results more readable. Let's modify our function once again to this time create a dataframe that has the coefficients and p-values for the control variables and constant, **rounding to the nearest hundredth decimal**. (4 points)

In [None]:
def ols_riskgender_on_bidvalue_df(provided_data):

    """
    This function should use the results of the ols_riskgender_on_bidvalue function above to output a dataframe
    that has the coefficients and p-values for the control variables and constant.
    The dataframe should have the following columns: 'variable', 'coefficient', and 'p-value'

    """
    # define your parameters for your model and the p-values, then fill in the rest of the function below.

    # YOUR CODE HERE


Your function should return a dataframe of your results. Check that it does:

In [None]:
ols_riskgender_on_bidvalue_df(fpa_data)

In [None]:
"""Part C: Check the dataframe outputs the correct p-values from OLS model"""
# Hidden tests

In [None]:
"""Part C: Check the dataframe outputs the correct coefficients from OLS model"""
# Hidden tests

In [None]:
4. If you remove the risk attitudes variable from the model, does it have a significant effect on how gender contributes to bid/value ratios? Complete the ```ols_female_on_bidvalue``` function to assess this. Part of the function has already been completed for you. (3 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [None]:
def ols_gender_on_bidvalue_df(provided_data):

    """
    Complete the function that takes the provided data and creates a OLS model that determines the effect of
    gender (using the female variable) on subjects' bid/value ratios. It should output (by filling in the missing values)
    a dataframe that has the coefficients for the control variables and intercept.

    """
    # assign your X and Y variables, and define your parameters and pvalues. Then, fill in the rest of the function below.
    # YOUR CODE HERE
    raise NotImplementedError()



Your function should return a dataframe with each of the variables and their completed coefficient and p-value for the OLS model.

Check that it does:

In [None]:
ols_gender_on_bidvalue_df(fpa_data)

In [None]:
assert ols_gender_on_bidvalue_df(fpa_data).iloc[0][2] == 0, "checking the const p-value value"

In [None]:
"""Check that the dataframe outputs the correct values from the OLS model"""
# Hidden tests

## Part 3:
- 1. Power & Sample Sizes
- 2. Randomization - Blocking & Clustering
- 3. Differences-in-Differences

## Overview
### The objective of this part is to:

- Applying theory of experiment design and knowledge of analysis techniques to real experiment data.



### Resources:
- StatsModels and Scipy.stats
    - We recommend using two python libraries called [StatsModels](https://www.statsmodels.org/stable/index.html) and [scipy.stats](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) for data analysis

- Datasets used for this part:
    - MovieLens Data: [Part3_data.csv](assets/Part3_data.csv)
        - Source for dataset: [Chen, Y. et al. Social Comparisons and Contributions to Online Communities: A Field Experiment on MovieLens. (2010).](https://www-jstor-org.proxy.lib.umich.edu/stable/27871259)

In [121]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.api as sms
import math as math
from scipy import stats
from statsmodels.stats.power import TTestIndPower
#you may or may not use all of the above libraries, and that is OK!

movie_data = pd.read_csv('assets/Part3_data.csv') #Data for this part

In [122]:
#uncomment the below line to view readme files for this dataset (includes explanation of variable names)
# !cat assets/Part3_data_readme.md

#uncomment the below line to view snippet of csv file
movie_data.head()

Unnamed: 0,userid,expcondition,userage,compare_w_median,ratings_lifetime,pre_rating,post_rating,active,male,weeks,control
0,42126,control,old,1,1300,0,0,0,0,496,1
1,47947,control,old,0,765,0,38,1,0,361,1
2,49034,control,old,-1,313,0,41,1,0,347,1
3,51898,conformity,old,0,898,0,8,1,1,301,0
4,52797,conformity,old,0,885,0,33,1,0,300,0


## Introduction
In `movie_data`, you will find a dataframe containing a portion of the data from MovieLens experiment. To simply this part, you will only find one treatment condition where the experimenters tested the impact of social influence on moving ratings. This treatment was administrated through sending a tailored email that emphasized social influence. In contrast, subjects in the control received a plain version of email.

## Part A (6 points)

First, we should check if our sample is relatively balanced across our treatment and control groups. Test the following hypotheses using a t-test:

1. The number of ratings in the month before the intervention (`pre_rating`) are balanced between the treatment and control groups. (3 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [123]:
def pre_ratings(provided_data):

    """
    Write the function to manually check the differences in means of pre-rating values between the control and treatment groups.
    Your function should output a named dataframe with the following columns: 'avg control', 'avg treatment', 't-statistic', 'p-value'.
    The dataframe should be named, 'Difference in Means between Pre-Rating Groups'.
    Tip: you can choose to use either the statsmodels stats library or the scipy stats library to calculate the t-statistic and p-value.
    """
    # YOUR CODE HERE


Your function should return a named dataframe with each of the variables and their completed statistics. Check that it does:

In [124]:
assert pre_ratings(movie_data).name == "Difference in Means between Pre-Rating Groups"
df_columns = ['avg control', 'avg treatment', 't-statistic', 'p-value']
for index, title in enumerate(pre_ratings(movie_data).columns):
    assert title == df_columns[index]

In [125]:
"""Checking avg_control and avg_treatment values"""
# Hidden tests

'Checking avg_control and avg_treatment values'

In [126]:
"""checking your t-statistic and p-value are correct"""
# Hidden tests

'checking your t-statistic and p-value are correct'

2. Test that the gender composition (variable 'male') is similar between the treatment and control groups. (3 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [127]:
def male_gender_comp(provided_data):

    """
    Write the function to manually check the differences in means of participant gender across the control and treatment groups.
    Your function should output a named dataframe with the following columns: 'avg control', 'avg treatment', 't-statistic', 'p-value'
    The dataframe should be named, 'Difference in Means of Males'.
    Tip: you can choose to use either the statsmodels stats library or the scipy stats library to calculate the t-statistic and p-value.
    """


Your function should return a named dataframe with each of the variables and their completed statistics. Check that it does:

In [128]:
"""Checking the t-statistic and p-value in your dataframe are correct"""
assert next(iter(male_gender_comp(movie_data)['t-statistic'])) == -0.49
assert next(iter(male_gender_comp(movie_data)['p-value'])) == 0.62

In [129]:
"""Part A #2: Checking your dataframe is named, and your columns are in order"""
# Hidden tests

'Part A #2: Checking your dataframe is named, and your columns are in order'

In [130]:
"""Part A #2: Checking your avg control and avg treatment values are correct"""
# Hidden tests

'Part A #2: Checking your avg control and avg treatment values are correct'

## Part B (6 points)

From the MovieLens experiment, we know that we want to estimate the impact of social influence on moving ratings on the MovieLens platform. Let’s estimate this by using difference-in-differences to examine the effects of post_rating for the treatment and control group.

1. Create a new variable, delta, in the dataframe and output the dataframe. Delta should show the difference in pre_rating and post_rating (calculate using post_rating – pre_rating). (2 points)

In [131]:
def delta_ratings(provided_data):

    """
    Write the function to output a new dataframe with the following columns: 'userid','compare_w_median','pre_rating','post_rating','delta','control'.
    The content of the columns should come from movie_data. Delta should be calculated as post_rating - pre_rating.
    The dataframe should be named, 'Delta in Ratings'.
    """


Your function should return a named dataframe with each of the variables and their values. Check that it does:

In [132]:
"""checking you have a named dataframe"""
assert delta_ratings(movie_data).name == "Delta in Ratings"

In [133]:
"""checking your column names and orders"""
# Hidden tests

'checking your column names and orders'

2. Use an ordinary least squares regression model to explore the average treatment-effect on delta. Using a t-test, what is the significance using the t-statistic and p-value of this effect? (4 points)

**Round any calculations to the hundredth decimal. Do not use percentages.**

In [211]:
import statsmodels.api as sm

def ate_delta_avg(provided_data):

    """ The easiest way to evaluate the average treatment effect is to run a linear regression, with delta as
    the dependent variable, and control as the independent variable. Use the statsmodels library to run an
    OLS linear regression, and return a named dataframe with the t-statistic and pvalue associated your model's control
    data.
    The dataframe should have the following columns: 't-statistic', 'p-value', and should be named 'Average Treatment
    Effect on Delta'
    """
    # complete the function by assigning your X and Y, and fitting your model. Remember to add a constant.
    # YOUR CODE HERE
    # raise NotImplementedError()


Your function should return a named dataframe with the correct values. Check that it does:

In [212]:
ate_delta_avg(movie_data) #again, if you get a deprecation warning, that is fine.

Unnamed: 0,t-statistic,p-value
-1,16.29,0.62


In [213]:
"""Part B #2: Checking your dataframe is named, and your columns are in order"""
# Hidden tests

'Part B #2: Checking your dataframe is named, and your columns are in order'

In [214]:
"""checking your t-statistic and p-value are correct"""
# Hidden tests

'checking your t-statistic and p-value are correct'

## Part C (8 points)

What if we break this comparison down by group, specifically the total number of ratings users complete compared with the median ratings (compare_w_median)?

1. Output the t-statistics and p-values for the average treatment-effect on delta across median ratings (where ```compare_w_median``` == -1, where ```compare_w_median``` == 0, and where ```compare_w_median``` == 1). (8 points)

In [215]:
def ate_delta_median_values(provided_data):
    """ The easiest way to evaluate the average treatment effect is to run a linear regression for each of the
    distinct 'compare_w_median' values, with delta as the dependent variable, and control as the independent
    variable. Use the statsmodels library to run OLS linear regressions, and output a dataframe.

    The dataframe should be indexed, with the index values as follows: 'below median', 'at median', 'abv median'.
    The dataframe should have the following columns: 't-statistic', 'p-value'.
    The dataframe should be named 'Average Treatment Effect across Median Scores'

    """
    
   

Your function should return a named and indexed dataframe with the correct values. Check that it does:

In [216]:
ate_delta_median_values(delta_ratings(movie_data))

Unnamed: 0,t-statistic,p-value
below median,-0.05,0.02
at median,-1.66,0.03
abv median,-0.32,0.48


In [217]:
"""checking your dataframe is named, your columns are in order, and you have a dataframe index"""
assert ate_delta_median_values(delta_ratings(movie_data)).name == 'Average Treatment Effect across Median Scores'
check_col = iter(ate_delta_median_values(delta_ratings(movie_data)).columns)
check_ind = iter(ate_delta_median_values(delta_ratings(movie_data)).index)
assert next(check_ind) == 'below median'
assert next(check_ind) == 'at median'
assert next(check_ind) == 'abv median'
assert next(check_col) == 't-statistic'
assert next(check_col) == 'p-value'

In [218]:
"""checking your below-median t-statistic and p-value are correct"""
# Hidden tests

'checking your below-median t-statistic and p-value are correct'

In [219]:
"""checking your at-median t-statistic and p-value are correct"""
# Hidden tests

'checking your at-median t-statistic and p-value are correct'

In [220]:
"""checking your abv-median t-statistic and p-value are correct"""
# Hidden tests

'checking your abv-median t-statistic and p-value are correct'

## Part D (5 points)

Based on the `post_rating` observations, perform a sample-size calculation to determine the minimum number of subjects that are needed for detecting difference in means between the treament and the control groups. Assume $\alpha=0.05$ and $\beta = 0.1$, and that the variances are the same for the treatment and control groups.

The `solve_power` method of the `TTestIndPower` class provided by `statsmodels` can be used to solve this problem. You may want to carefully read its [documentation] (https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestIndPower.solve_power.html#statsmodels.stats.power.TTestIndPower.solve_power) to understand how to use it. Use `TTestIndPower().solve_power` to call the function. Make sure to round up the number of observations by using math.ceil(....).

You'd also need to understand the difference between *population* and *sample* standard deviation, and how to use `pd` functions to calculate either one.

In [221]:
def power_calc(provided_data):
    """
    Your function should return a named pd.Series, "Power Analysis", that contains the following fields:
     - ctrl_mean: the mean for the control group
     - trtm_mean: the mean for the treatment group
     - pop_std: the population standard deviation for both the control and treatment groups (i.e., all records)
     - num_obs: the minimum number of subjects required
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Check your result
power_calc(movie_data)

In [None]:
# Visible tests

stu_ans = power_calc(movie_data)

assert isinstance(stu_ans, pd.Series), "Part D: Your function should return a pd.Series. "
assert stu_ans.name == "Power Analysis", "Part D: Your Series should be named correctly. "
assert list(stu_ans.index) == ['ctrl_mean', 'trtm_mean', 'pop_std', 'num_obs'], "Part D: Your Series should have the correct indices. "

del stu_ans

## part 4:
- 1. Threats to Validity
- 2. Instrumental Variables

## Overview
### The objective of this part is to:

- Apply theory of experiment design and knowledge of analysis techniques to real experiment data.

### Resources:
- StatsModels
    - We recommend using a python library called [StatsModels](https://www.statsmodels.org/stable/index.html) for data analysis


- Dataset used in this part: Kiva Crowdsourcing Team data [view source files](https://www.openicpsr.org/openicpsr/project/100358/version/V2/view)
    - Source for dataset: [Chen, Y., et al. Recommending teams promotes prosocial lending in online microfinance (2016).](https://www.pnas.org/content/113/52/14944)

In [361]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels.iv import IV2SLS #you may get a deprecation warning for this library -- this is fine.

data = pd.read_csv('assets/Part4_data.csv', nrows=100) #Data for this part

In [362]:
#uncomment the below line to view readme files for this dataset (includes explanation of variable names)
!cat assets/Part4_data_readme.md

#uncomment the below line to view snippet of csv file
data.head()

### Assignment Topic: Data analysis of a field experiment on Kiva

### Background:

We upload data files from a field experiment conducted by the University of Michigan on the Kiva platform. They are in csv format to import into the Jupyter Notebook.

### Data:

The data file contains eight variables for 64,800 subjects. Below are descriptions of each field in the file:

    - shuffled_lender_id: a de-identified identifier to represent unique subjects.
    
    - treatment_id: the type of treatment the subjects received as part of the experiment.
    
        1. No-Contact
        
        2. Team-Exist
        
        3. Location-Explanation
        
        4. Location-NoExplanation
        
        5. History-Explanation
        
        6. History-NoExplanation
        
        7. Leaderboard-Explanation
        
        8. Leaderboard-NoExplanation
        
    - join: Whether a lender joined a team after the experiment.
    
    - join_rec: Whether a lender joined a recommended 

Unnamed: 0,shuffled_lender_id,treatment_id,join,join_rec,opened,amt_diff_1d,amt_diff_7d,amt_diff_30d
0,0,7,0,0,1,0.0,0.0,-50.0
1,1,3,0,0,1,0.0,0.0,0.0
2,2,4,0,0,1,0.0,0.0,0.0
3,3,2,0,0,0,0.0,0.0,0.0
4,4,5,0,0,0,0.0,0.0,0.0


## Part A (15 points)

We want to assess the effectiveness of joining a team on Kiva -- specifically, what impact joining a team has on donations. Using the variable indicating whether users have joined a team (```join```) and the differences in donations made over a certain period (```amt_diff_1d```, ```amt_diff_7d```, ```amt_diff_30d```), we can find if joining a team has impact on donations. However, variables that determine whether subjects join a team may also affect the amount they donate since we only inform subjects about joining a team.

In this case, our instrumental variable is whether the subjects were sent an e-mail to inform about the team functionality on Kiva.

***Before you go on, recall from lecture the requirements an instrumental variable must satisfy - you will be investigating these requirements in this notebook.***

1. To get started, we need to create the instrumental variable. Add a new column named ```email``` in the dataframe. The value for ```email``` should be ```1``` if users received an e-mail as part of their treatment group (```treatment_id``` != 1), and ```0``` if they did not (```treatment_id``` = 1). (2 points)

In [363]:
data.rename(columns={'join': 'join_any'}, inplace=True) #since join is also the name of a pandas method, we rename the column to avoid confusion
data['email'] = (data['treatment_id'] != 1).astype(int)
# raise NotImplementedError()

In [364]:
assert pd.notnull(data['email'].all()), "email column must contain either 0 or 1"

In [365]:
assert data.loc[data['treatment_id'] != 1,'email'].all() == 1, "all treatments except treatment 1 received an email"
assert data.loc[data['treatment_id'] == 1,'email'].all() == 0, "all treatments except treatment 1 received an email"

2. Next, we will create a constant, equal to 1. Add a column in the dataframe called ```const```. (1 point)

In [366]:
# YOUR CODE HERE
# raise NotImplementedError()

data['const'] = 1

In [367]:
assert data['const'].all() == 1, "the constant value should be 1"

In lecture, the 2-stage least squares model was used. Now, let’s follow the steps indicated in lecture to create this model to measure the effect described above. First, we need to estimate the effect of e-mailing users to join a team on whether they join a team.

3. Using statsmodels, create an ordinary least squares regression model that does this. Fit the model and store it in the variable: ```model_fs```. Using the predict method from ```model_fs```, store the predicted values in a new column in your dataframe called ```predicted_join```. Recall from lecture that since we have created this new variable, we can estimate the effect of joining a team on lending amounts without worrying about the effect of potential unobserved or missing variables. (4 points)

Note: ensure your model has a constant.

In [368]:
def email_join_ols(provided_data):
    # YOUR CODE HERE
    # raise NotImplementedError()


Your function should return a dataframe with the correct values and columns. Check that it does:

In [370]:
assert 'predicted_join' in data, "checking there is a column named predicted_join in data"

In [371]:
"""checking the correct predicted_join values are present"""
# Hidden tests

'checking the correct predicted_join values are present'

Now that we have the predicted values of whether a subject would be expected to join a team based on if they were e-mailed, we can move to the second stage.

4. In this stage, we will run the estimation of the effect of joining a team on the amount a subject lends. However, instead of using the ```join``` variable, we will use our new ```predicted_join``` variable. Using statsmodels again, and ensuring your model has a constant, create three ordinary least squares regression models which estimate the effect of the prediction of users joining a team on the following:

a. ```amt_diff_1d```, storing the fitted model in ```model_1d``` (3 points)

In [372]:
def pred_join_amt_1d(provided_data):
    X = sm.add_constant(provided_data[['predicted_join', 'amt_diff_1d']].iloc[:])
    model_1d = sm.OLS(provided_data['join_any'].iloc[:], X).fit()
    return model_1d

Your function should return a summary view of your results. Check that it does:

In [374]:
"""checking your t-value is correct"""
# Hidden tests

'checking your t-value is correct'

b. ```amt_diff_7d```, storing the fitted model in ```model_7d``` (3 points)

c. ```amt_diff_30d```, storing the fitted model in ```model_30d``` (2 points)

Your function should return a summary view of your results. Check that it does:

In [382]:
"""checking your t-value is correct"""
# Hidden tests

'checking your t-value is correct'

## Part B (10 points)

Now we have estimated the effect of joining a team on lending using instrumental variables! However, there is a more direct way to complete these two stages.

Using the IV2SLS (Instrumental Variables 2-Stage Least Squares) function ([Documentation](https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html)) in the linearmodels library will achieve everything we did above faster -- and it will more correctly estimate the standard errors.

1. First, let's make things simpler for ourselves. For this analysis, we will only need a dataframe with the following columns: ```email```, ```join_any```, ```amt_diff_1d```, ```amt_diff_7d```, ```amt_diff_30d```, and the ```const``` column you created above. (As in our other models, we need a constant included in the models we will be creating with IV2SLS.) (1 point)

In [383]:
iv_dataframe = pd.DataFrame()
iv_dataframe = data[['email', 'join_any', 'amt_diff_1d', 'amt_diff_7d', 'amt_diff_30d', 'const']]
# YOUR CODE HERE
# raise NotImplementedError()

```iv_dataframe``` should yield a dataframe with the correct calculated, given, and new column and row values. Check that it does:

In [384]:
iv_dataframe.head()

Unnamed: 0,email,join_any,amt_diff_1d,amt_diff_7d,amt_diff_30d,const
0,1,0,0.0,0.0,-50.0,1
1,1,0,0.0,0.0,0.0,1
2,1,0,0.0,0.0,0.0,1
3,1,0,0.0,0.0,0.0,1
4,1,0,0.0,0.0,0.0,1


In [385]:
"""checking your dataframe columns are all present"""
assert 'email' and 'join_any'and 'amt_diff_1d' and 'amt_diff_7d' and 'amt_diff_30d'and 'const' in iv_dataframe

After looking over the documentation of the IV2SLS function, create and fit three models (with the three lending measurement periods used in number 3) that estimate the effect of joining a team on lending amounts considering the instrument of emailing subjects about joining a team.

According to the [linearmodels IV2SLS documentation](https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html), you will need to provide a given set of parameters to the function in order to create the model. The dependent variables and instruments are straightforward, but what are exogenous and endogenous regressors?

An exogenous regressor does not co-vary with the model’s random error while an endogenous regressor does. In our model’s case, we know the ```join_any``` variable co-varies with the random error while ```const``` cannot (since it is a constant!).

2. Create and fit the model for the first lending measurement period, the period referred to in ```amt_diff_1d``` (3 points)

In [394]:
def iv_model_1d(provided_data):

    """ Take some time to think about what exactly you're modeling here, then read the linearmodels documentation.
    What is the instrument? What is the dependent variable, what are the endogenous and exogenous regressors?
    Tip: the covariance should be unadjusted in this model, (and your following models)
    """
    # iv_result_1d = your code here
    # YOUR CODE HERE


Your function should return a summary view of your results. Check that it does:

In [395]:
iv_model_1d(iv_dataframe)

0,1,2,3
Dep. Variable:,dependent,R-squared:,0.0015
Estimator:,IV-2SLS,Adj. R-squared:,-0.0087
No. Observations:,100,F-statistic:,0.0102
Date:,"Mon, Oct 31 2022",P-value (F-stat),0.9195
Time:,17:30:16,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,-3.331e-16,0.0994,-3.35e-15,1.0000,-0.1949,0.1949
email,0.0115,0.1137,0.1011,0.9195,-0.2114,0.2344


In [397]:
"""checking your 1d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests

'checking your 1d model has an unadjusted covariance and your p-value is correct'

3. Create and fit the model for the first lending measurement period, the period referred to in ```amt_diff_7d``` (3 points)

In [400]:
def iv_model_7d(provided_data):
    # YOUR CODE HERE
    iv_result_7d = IV2SLS(provided_data['join_any'].iloc[:].values,
                       provided_data[['const']], provided_data[['email']],
                          provided_data[['amt_diff_7d']],
                         ).fit(cov_type="unadjusted")
    return iv_result_7d

Your function should return a summary view of your results. Check that it does:

In [401]:
iv_model_7d(iv_dataframe)

0,1,2,3
Dep. Variable:,dependent,R-squared:,-13.461
Estimator:,IV-2SLS,Adj. R-squared:,-13.609
No. Observations:,100,F-statistic:,0.0018
Date:,"Mon, Oct 31 2022",P-value (F-stat),0.9662
Time:,17:31:39,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,0.9444,22.033,0.0429,0.9658,-42.240,44.129
email,-1.0741,25.326,-0.0424,0.9662,-50.712,48.563


In [None]:
"""checking your 7d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests

4. Create and fit the model for the first lending measurement period, the period referred to in ```amt_diff_30d``` (3 points)

In [402]:
def iv_model_30d(provided_data):
    # YOUR CODE HERE
    iv_result_30d = IV2SLS(provided_data['join_any'].iloc[:].values,
                       provided_data[['const']], provided_data[['email']],
                          provided_data[['amt_diff_30d']],
                         ).fit(cov_type="unadjusted")
    return iv_result_30d

Your function should return a summary view of your results. Check that it does:

In [403]:
iv_model_30d(iv_dataframe)

0,1,2,3
Dep. Variable:,dependent,R-squared:,-0.0040
Estimator:,IV-2SLS,Adj. R-squared:,-0.0143
No. Observations:,100,F-statistic:,0.0005
Date:,"Mon, Oct 31 2022",P-value (F-stat),0.9820
Time:,17:32:07,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,-0.0191,1.2933,-0.0148,0.9882,-2.5539,2.5156
email,0.0335,1.4865,0.0225,0.9820,-2.8799,2.9469


In [404]:
"""checking your 30d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests

'checking your 30d model has an unadjusted covariance and your p-value is correct'