# A/B Testing the Udacity Website

In these exercises, we'll be analyzing data on user behavior from an experiment run by Udacity, the online education company. More specifically, we'll be looking at a test Udacity ran to improve the onboarding process on their site.

Udacity's test is an example of an "A/B" test, in which some portion of users visiting a website (or using an app) are randomly selected to see a new version of the site. An analyst can then compare the behavior of users who see a new website design to users seeing the normal website to estimate the effect of rolling out the proposed changes to all users. While this kind of experiment has its own name in industry (A/B testing), to be clear it's just a randomized experiment, and so everything we've learned about potential outcomes and randomized experiments applies here.
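
To make the link to potential outcomes concrete, here is a minimal simulation sketch. Everything in it is hypothetical (the baseline rate, the noise level, and the effect size are made up for illustration and are not Udacity's numbers). It only shows that, under random assignment, a simple difference in means recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 100_000
true_effect = -0.02  # hypothetical treatment effect, chosen for illustration

# Potential outcomes: y0 if a user sees the old site, y1 if they see the new one.
y0 = rng.normal(loc=0.10, scale=0.05, size=n_users)
y1 = y0 + true_effect

# Randomly assign roughly half of users to treatment.
treated = rng.random(n_users) < 0.5

# We only ever observe one potential outcome per user.
observed = np.where(treated, y1, y0)

# Difference in means between the two arms estimates the ATE.
estimate = observed[treated].mean() - observed[~treated].mean()
print(f"true ATE: {true_effect:.4f}, estimated ATE: {estimate:.4f}")
```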

(Udacity has generously provided the data from this test under an Apache open-source license, and you can find their [original writeup here](https://www.kaggle.com/tammyrotem/ab-tests-with-python/notebook). If you're interested in learning more about A/B testing in particular, it seems only fair while we use their data to flag that they have a full course on the subject [here](https://www.udacity.com/course/ab-testing--ud257).)

## Udacity's Test

The test [is described by Udacity as follows](https://www.kaggle.com/tammyrotem/ab-tests-with-python/notebook): 

At the time of this experiment, Udacity courses had two options on the course overview page: "start free trial", and "access course materials".

**Current Conditions Before Change**

- If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.
- If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

**Description of Experimented Change**

- In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course.
- If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.
- At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This [screenshot](images/udacity_checkyoureready.png) shows what the experiment looks like.

**Udacity's Hope is that...**:

> this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time -- without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.



## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/ids720_specific/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_abtesting.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```python
assert set(results.keys()) == {
    "ex4_avg_oec",
    "ex5_avg_guardrail",
    "ex7_ttest_pvalue",
    "ex9_ttest_pvalue_clicks",
    "ex10_num_obs",
    "ex11_guard_ate",
    "ex11_guard_pvalue",
    "ex11_oec_ate",
    "ex11_oec_pvalue",
    "ex14_se_treatment",
}
```


### Submission Limits

Please remember that you are **only allowed THREE submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.

## Import the Data

### Exercise 1

Begin by importing Udacity's data on user behavior [here.](https://github.com/nickeubank/MIDS_Data/tree/master/udacity_AB_testing) 

There are TWO datasets for this test: one for the control data (users who saw the original design), and one for treatment data (users who saw the experimental design). Udacity decided to show their test site to half of visitors, so there are roughly the same number of users appearing in each dataset (though this is not a requirement of A/B tests).

Please remember to load the data directly from github to assist the autograder.

In [29]:
import pandas as pd

url_control = "https://github.com/nickeubank/MIDS_Data/raw/refs/heads/master/udacity_AB_testing/control_data.csv"
url_treatment = "https://github.com/nickeubank/MIDS_Data/raw/refs/heads/master/udacity_AB_testing/experiment_data.csv"

control = pd.read_csv(url_control)
treatment = pd.read_csv(url_treatment)

print(control.head())
print(treatment.head())

          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7723     687        134.0      70.0
1  Sun, Oct 12       9102     779        147.0      70.0
2  Mon, Oct 13      10511     909        167.0      95.0
3  Tue, Oct 14       9871     836        156.0     105.0
4  Wed, Oct 15      10014     837        163.0      64.0
          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7716     686        105.0      34.0
1  Sun, Oct 12       9288     785        116.0      91.0
2  Mon, Oct 13      10480     884        145.0      79.0
3  Tue, Oct 14       9867     827        138.0      92.0
4  Wed, Oct 15       9793     832        140.0      94.0


### Exercise 2

Explore the data. Can you identify the unit of observation of the data (e.g. what is represented by each row)?

To be clear, the columns represent stages in a user funnel:

- Some number of users arrive at the website and are counted as Pageviews,
- Some portion of those users then click to enroll (and are counted as `Clicks`),
- Some portion of those users then actually enroll in the free trial (after seeing an informational popup, in the case of treatment individuals),
- Finally some portion of those users end up paying at the end of the free trial period.

(Note this is not the only way that A/B test data can be collected and/or reported — this is just what Udacity provided, presumably to help address privacy concerns.)

Answer:

In the raw datasets provided, the unit of observation is a day within a specific experimental arm. Instead of seeing individual user-level data, each row represents a daily aggregate of user interactions.

## Pick your measures

### Exercise 3

The easiest way to analyze this data is to stack it into a single dataset where each observation is a day-treatment-arm (so you should end up with two rows per day, one for those who were in the treatment group, and one for those who were in the control group). Note that currently nothing in the data identifies whether a given observation is a treatment group observation or a control group observation, so you'll want to make sure to add a "treatment" indicator variable.

The variables in the data are:

- Pageviews: number of unique users visiting homepage
- Clicks: number of those users clicking "Start Free Trial"
- Enrollments: Number of people enrolling in trial
- Payments: Number of people who eventually pay for the service. Note the `Payments` column reports payments for the users who first visited the site on the reported date, not payments occurring on the reported date.

In [30]:
# create a 'treatment' column and stack the data
control["treatment"] = 0
treatment["treatment"] = 1

df = pd.concat([control, treatment], axis=0).reset_index(drop=True)

print(df.head())
print(f"Total rows: {len(df)}")
print(f"Control group rows: {len(control)}")
print(f"Treatment group rows: {len(treatment)}")

          Date  Pageviews  Clicks  Enrollments  Payments  treatment
0  Sat, Oct 11       7723     687        134.0      70.0          0
1  Sun, Oct 12       9102     779        147.0      70.0          0
2  Mon, Oct 13      10511     909        167.0      95.0          0
3  Tue, Oct 14       9871     836        156.0     105.0          0
4  Wed, Oct 15      10014     837        163.0      64.0          0
Total rows: 74
Control group rows: 37
Treatment group rows: 37
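
One way to confirm the stacked data has the intended day-treatment-arm structure is to check that every date appears exactly once per arm. The sketch below uses only the first two days from the previews above; the names `control_small`, `treatment_small`, and `stacked` are illustrative and not part of the assignment.

```python
import pandas as pd

# First two days of each arm, copied from the previews above.
control_small = pd.DataFrame(
    {"Date": ["Sat, Oct 11", "Sun, Oct 12"], "Pageviews": [7723, 9102]}
)
treatment_small = pd.DataFrame(
    {"Date": ["Sat, Oct 11", "Sun, Oct 12"], "Pageviews": [7716, 9288]}
)

# Add the indicator while stacking, without modifying the original frames.
stacked = pd.concat(
    [control_small.assign(treatment=0), treatment_small.assign(treatment=1)],
    ignore_index=True,
)

# Each date should show up in both arms, and no (Date, treatment) pair should repeat.
arms_per_day = stacked.groupby("Date")["treatment"].nunique()
has_duplicates = stacked.duplicated(subset=["Date", "treatment"]).any()

print(arms_per_day)
print(f"Duplicate day-arm rows: {has_duplicates}")
```

The same two checks can be run on the full `df` built above.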


### Exercise 4

Given Udacity's goals, what outcome are they hoping will be impacted by their manipulation?

Or, to ask the same question in the language of the Potential Outcomes Framework, what is their $Y$?

Or to ask the same question in the language of Kohavi, Tang and Xu, what is their *Overall Evaluation Criterion (OEC)*?

(I'm only asking one question, I'm just trying to phrase it using different terminologies we've encountered to help you see how they all fit together)

When you feel like you have your answer, please compute it. Store the average value of the variable in `results` under the key `ex4_avg_oec`. **Please round your answer to 4 decimal places.**

NOTE: You'll probably notice you have two choices to make when it comes to actually computing the OEC. 

- You could probably imagine either computing a ratio or a difference of two things — please calculate the difference.
- You may also be unsure whether to normalize by `Clicks`. Normalizing by clicks will help account for variation that comes from day-to-day variation in users, so it's a good thing to do. With infinite data, you'd expect to get the same results without normalizing by `Clicks` (since on average the same share of users are in each arm of the experiment), but for finite data it's a good strategy. Note that this is only ok because users make the choice to click or not *before* they see different versions of the website (it is "pre-treatment").

Just to make sure you're on track, your measure should have an average value of *about* 9%.
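
As a worked example of the two choices above (difference rather than ratio, then normalizing by `Clicks`), here is the arithmetic for a single row. The numbers are the first control-group row from the data preview (Sat, Oct 11); the variable names are illustrative.

```python
# First control-group row from the preview above (Sat, Oct 11).
clicks, enrollments, payments = 687, 134.0, 70.0

# Difference: users who enrolled in the trial but did not end up paying.
raw_difference = enrollments - payments

# Normalizing by Clicks turns the count into a share of users who clicked,
# which removes day-to-day variation in traffic.
normalized = raw_difference / clicks

print(f"raw difference: {raw_difference:.0f}")
print(f"normalized by clicks: {normalized:.4f}")
```

For this one day the normalized value is a little over 9%, in line with the sanity check above.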

In [31]:
# calculate the OEC: share of users who clicked, enrolled, and then did not pay
df["oec"] = (df["Enrollments"] - df["Payments"]) / df["Clicks"]

results = {}
results["ex4_avg_oec"] = round(df["oec"].mean(), 4)

print(results["ex4_avg_oec"])

0.0941


### Exercise 5

Given Udacity's goals, what outcome are they hoping will *not* be impacted by their manipulation? In other words, what do they want to measure to ensure their treatment doesn't have unintended negative consequences that might be really costly to their operation?

Note that this isn't how Kohavi, Tang, and Xu use the term "guardrail metrics" (they usually use it to refer to things we measure to ensure the experiment is working the way it should). Some people also use "guardrail metrics" for outcomes that could be impacted even if the experiment is working correctly, but which the organization considers important enough to track to make sure they aren't harmed.

Again, please normalize by `Clicks`. Store the average value of this guardrail metric as `ex5_avg_guardrail` and **round your answer to 4 decimal places.**

In [32]:
# calculate guardrail metric
df["guardrail"] = df["Payments"] / df["Clicks"]

results["ex5_avg_guardrail"] = round(df["guardrail"].mean(), 4)

print(results["ex5_avg_guardrail"])

0.1158


## Validating The Data

### Exercise 6

Whenever you are working with experimental data, the first thing you want to do is verify that users actually were randomly sorted into the two arms of the experiment. In this data, half of users were supposed to be shown the old version of the site and half were supposed to see the new version.

`Pageviews` tells you how many unique users visited the welcome site we are experimenting on. `Pageviews` is what is sometimes called an "invariant" or "guardrail" variable, meaning that it shouldn't vary across treatment arms. After all, people have to visit the site before they get a chance to see the treatment, so there's no way that being assigned to treatment or control should affect the number of pageviews assigned to each group.

"Invariant" variables are also an example of what are known as "pre-treatment" variables, because pageviews are determined before users are manipulated in any way. That makes them analogous to gender or age in experiments where you have demographic data: a person's age and gender are determined before they experience any manipulations, so the value of any pre-treatment attributes should be the same across the two arms of our experiment. This is what we've previously called "checking for balance." If pre-treatment attributes aren't balanced, then we may worry our attempt to randomly assign people to different groups failed. Kohavi, Tang and Xu call this a "trust-based guardrail metric" because it helps us determine if we should trust our data.

To test the quality of the randomization, calculate the average number of pageviews for the treated group and for the control group. Do they look similar?


In [33]:
avg_pageviews_control = df[df["treatment"] == 0]["Pageviews"].mean()
avg_pageviews_treatment = df[df["treatment"] == 1]["Pageviews"].mean()
difference = abs(avg_pageviews_treatment - avg_pageviews_control)

print(
    f"Average Pageviews - Control: {avg_pageviews_control: .2f}, Treatment: {avg_pageviews_treatment: .2f}"
)
print(f"Difference: {difference: .2f}")

Average Pageviews - Control:  9339.00, Treatment:  9315.14
Difference:  23.86


Answer:

Yes, the average pageviews for the treatment group and the control group look similar.

### Exercise 7

"Similar" is a tricky concept -- obviously, we expect *some* differences across groups since users were *randomly* divided across treatment arms. The question is whether the differences between groups are larger than we'd expect to emerge given our random assignment process. To evaluate this, let's use a `ttest` to test the statistical significance of the differences we see. 

**Note**: Remember that scipy functions are written for numpy arrays and don't always handle `pandas` objects well, so if you use a scipy function, pass the numpy vectors underlying your data with the `.values` attribute (e.g. `df.my_column.values`).

Does the difference in `pageviews` look statistically significant?

Store the resulting p-value in `ex7_ttest_pvalue` **rounded to four decimal places.**
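
As a small illustration of the note above, the sketch below runs `scipy.stats.ttest_ind` on numpy arrays pulled out of a DataFrame with `.values`. It uses only the first five days of `Pageviews` from the previews above, so the p-value will not match the one for the full data; the frame name `small` is illustrative.

```python
import pandas as pd
from scipy import stats

# First five days of Pageviews for each arm, copied from the previews above.
small = pd.DataFrame(
    {
        "Pageviews": [7723, 9102, 10511, 9871, 10014, 7716, 9288, 10480, 9867, 9793],
        "treatment": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    }
)

# Extract plain numpy arrays before handing the data to scipy.
control_vals = small.loc[small["treatment"] == 0, "Pageviews"].values
treatment_vals = small.loc[small["treatment"] == 1, "Pageviews"].values

# Two-sample t-test (scipy's default assumes equal variances in the two groups).
t_stat, p_value = stats.ttest_ind(treatment_vals, control_vals)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A large p-value here means the difference in average pageviews is well within what random assignment alone would produce.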

In [34]:
from scipy import stats

# prepare data
control_pageviews = df[df["treatment"] == 0]["Pageviews"].values
treatment_pageviews = df[df["treatment"] == 1]["Pageviews"].values

# perform t-test
t_stat, p_value = stats.ttest_ind(control_pageviews, treatment_pageviews)

results["ex7_ttest_pvalue"] = round(p_value, 4)

print(results["ex7_ttest_pvalue"])

0.8877


Answer:

Because the p-value > 0.05, we fail to reject the null hypothesis that the two groups have the same average pageviews. The difference in pageviews between the two groups is not statistically significant, which is consistent with successful randomization and balanced groups.

### Exercise 8

`Pageviews` is not the only "pre-treatment" variable in this data we can use to evaluate balance/use as a guardrail metric. What other measure is pre-treatment? Review the description of the experiment if you're not sure.

Answer:

"Clicks" is the second pre-treatment variable. Since the decision to click happens before the user sees the new version of the site, the number of Clicks should be balanced across both groups, making it a valid guardrail metric to check for randomization bias.

### Exercise 9

Check if the other pre-treatment variable is also balanced. Store the p-value of your test of difference in `results` under the key `"ex9_ttest_pvalue_clicks"` **rounded to four decimal places.**

In [35]:
# prepare data
control_clicks = df[df["treatment"] == 0]["Clicks"].values
treatment_clicks = df[df["treatment"] == 1]["Clicks"].values

# perform t-test
t_stat_clicks, p_value_clicks = stats.ttest_ind(treatment_clicks, control_clicks)

results["ex9_ttest_pvalue_clicks"] = round(p_value_clicks, 4)

print(results["ex9_ttest_pvalue_clicks"])

0.9264


Answer:

Because the p-value > 0.05, we fail to reject the null hypothesis that the two groups have the same average clicks. The difference in clicks between the two groups is not statistically significant, which is consistent with successful randomization and balanced groups.

## Estimating the Effect of Experiment

### Exercise 10

Now that we've validated our randomization, our next task is to estimate our treatment effect. First, though, there's an issue with your data you've been able to largely ignore until now, but which you should get a grip on before estimating your treatment effect — can you tell what it is and what you should do about it?

Store the number of observations in your data *after* you've addressed this in `ex10_num_obs` (this is mostly meant as a way to sanity check your answer with the autograder).

In [36]:
df.tail(20)

           Date  Pageviews  Clicks  Enrollments  Payments  treatment       oec  guardrail
54  Tue, Oct 28       9396     736        162.0     120.0          1  0.057065   0.163043
55  Wed, Oct 29       9262     727        201.0      96.0          1  0.144429   0.132050
56  Thu, Oct 30       9308     728        207.0      67.0          1  0.192308   0.092033
57  Fri, Oct 31       8715     722        182.0     123.0          1  0.081717   0.170360
58   Sat, Nov 1       8448     695        142.0     100.0          1  0.060432   0.143885
59   Sun, Nov 2       8836     724        182.0     103.0          1  0.109116   0.142265
60   Mon, Nov 3       9359     789          NaN       NaN          1       NaN        NaN
61   Tue, Nov 4       9427     743          NaN       NaN          1       NaN        NaN
62   Wed, Nov 5       9633     808          NaN       NaN          1       NaN        NaN
63   Thu, Nov 6       9842     831          NaN       NaN          1       NaN        NaN


In [37]:
df_cleaned = df.dropna(subset=["Payments"])

print(df.shape)
print(df_cleaned.shape)

(74, 8)
(46, 8)


In [38]:
results["ex10_num_obs"] = len(df_cleaned)

print(results["ex10_num_obs"])

46
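
A direct way to spot this kind of issue, rather than scanning the tail by eye, is to count missing values per column. The sketch below uses four treatment rows copied from the output above (two complete, two with missing `Enrollments` and `Payments`); the frame name `tail_rows` is illustrative.

```python
import numpy as np
import pandas as pd

# Four treatment-arm rows copied from the df.tail() output above.
tail_rows = pd.DataFrame(
    {
        "Date": ["Sat, Nov 1", "Sun, Nov 2", "Mon, Nov 3", "Tue, Nov 4"],
        "Clicks": [695, 724, 789, 743],
        "Enrollments": [142.0, 182.0, np.nan, np.nan],
        "Payments": [100.0, 103.0, np.nan, np.nan],
    }
)

# Count missing values in each column.
missing_per_column = tail_rows.isna().sum()
print(missing_per_column)

# Keep only rows where the outcome columns are observed.
complete_rows = tail_rows.dropna(subset=["Enrollments", "Payments"])
print(f"Rows before: {len(tail_rows)}, after: {len(complete_rows)}")
```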



### Exercise 11

Now that we've established we have good balance (meaning we think randomization was likely successful), we can evaluate the effects of the experiment. Test whether the OEC and the metric you *don't* want affected have different average values in the control group and treatment group. 

Because we've randomized, this is a consistent estimate of the Average Treatment Effect of Udacity's website change.

Calculate the difference in means in your OEC and guardrail metrics using a simple t-test. Store the resulting effect estimates in `ex11_oec_ate` and `ex11_guard_ate` and p-values in `ex11_oec_pvalue` and `ex11_guard_pvalue`. **Please round all answers to 4 decimal places.** Report your ATE in *percentage points*, where `1` denotes 1 percentage point.


In [39]:
def calculate_ate_results(data, metric_name):
    # separate control and treatment groups
    control_vals = data[data["treatment"] == 0][metric_name].values
    treatment_vals = data[data["treatment"] == 1][metric_name].values

    # calculate ATE in percentage points
    ate = (treatment_vals.mean() - control_vals.mean()) * 100

    # calculate p-value
    t_stat, p_val = stats.ttest_ind(treatment_vals, control_vals)

    return round(ate, 4), round(p_val, 4)


# calculate for OEC
results["ex11_oec_ate"], results["ex11_oec_pvalue"] = calculate_ate_results(
    df_cleaned, "oec"
)

# calculate for guardrail
results["ex11_guard_ate"], results["ex11_guard_pvalue"] = calculate_ate_results(
    df_cleaned, "guardrail"
)

print(f"OEC ATE: {results['ex11_oec_ate']}, P-value: {results['ex11_oec_pvalue']}")
print(
    f"Guardrail ATE: {results['ex11_guard_ate']}, P-value: {results['ex11_guard_pvalue']}"
)

OEC ATE: -1.5888, P-value: 0.1319
Guardrail ATE: -0.4897, P-value: 0.5928


### Exercise 12

Do you feel that Udacity achieved their goal? Did their intervention cause them any problems? If they asked you "What would happen if we rolled this out to everyone?" what would you say?

As you answer this question, a small additional question: up until this point you've (presumably) been reporting the default p-values from the tools you are using. These, as you may recall from stats 101, are two-tailed p-values. Do those seem appropriate for your OEC?

Answer:

1. Did Udacity achieve their goal?

- Udacity's primary goal was to reduce the number of "frustrated students" (those who enroll in the free trial but leave before paying) without significantly reducing the number of students who eventually pay.

- OEC (Frustration Rate): The output shows an ATE of -1.5888, meaning the point estimate is a reduction in the frustration rate of about 1.59 percentage points. However, the p-value is 0.1319. Since this is greater than 0.05, the reduction is not statistically significant at the 5% level.

- Guardrail (Net Conversion): Output shows an ATE of -0.4897 with a p-value of 0.5928. This suggests that the intervention did not have a statistically significant negative impact on the number of paying students, which is good.

- Conclusion: Strictly speaking, they did not achieve their goal in a statistically significant way. While the frustration rate trended downward, we cannot be confident the change wasn't just due to random noise.

2. Did their intervention cause any problems?

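- Not that we can detect. The guardrail estimate is small (-0.4897 percentage points) and its p-value is high (0.5928), so there is no evidence that the intervention reduced the share of users who end up paying. Keep in mind that a high p-value is an absence of evidence of harm, not proof that there is none.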

3. What would you say if they asked "What would happen if we rolled this out to everyone?"

- While the prompt seems to be moving the metrics in the right direction (reducing frustration without hurting revenue), the current data does not provide enough evidence to be confident this result would hold at scale. Before rolling out, we might consider:

    - Running the experiment longer to increase statistical power.

    - Refining the intervention to make the time-commitment warning more impactful.

4. Are two-tailed p-values appropriate for your OEC?

- Because Udacity has a directional hypothesis (they specifically want to reduce the number of frustrated students), one could argue that a one-tailed test is appropriate.

- If we used a one-tailed test, our OEC p-value would be halved ($0.1319 / 2 \approx 0.066$), which is valid here because the estimated effect is in the hypothesized direction. While this is much closer to the 0.05 threshold, it still doesn't quite reach conventional significance.
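
The halving argument can be checked directly with the `alternative` argument of `scipy.stats.ttest_ind` (available in recent scipy versions). This is a sketch on the first five days of data from the previews above, not the full cleaned dataset, so the p-values differ from those reported in Exercise 11.

```python
import numpy as np
from scipy import stats

# First five days of each arm, copied from the previews above.
clicks_c = np.array([687, 779, 909, 836, 837])
enroll_c = np.array([134.0, 147.0, 167.0, 156.0, 163.0])
pay_c = np.array([70.0, 70.0, 95.0, 105.0, 64.0])

clicks_t = np.array([686, 785, 884, 827, 832])
enroll_t = np.array([105.0, 116.0, 145.0, 138.0, 140.0])
pay_t = np.array([34.0, 91.0, 79.0, 92.0, 94.0])

oec_control = (enroll_c - pay_c) / clicks_c
oec_treatment = (enroll_t - pay_t) / clicks_t

# Default two-sided test: is the treatment mean different from the control mean?
t_two, p_two = stats.ttest_ind(oec_treatment, oec_control)

# One-sided test: is the treatment mean *lower* than the control mean?
t_one, p_one = stats.ttest_ind(oec_treatment, oec_control, alternative="less")

print(f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```

Because the estimated effect is negative (the hypothesized direction), the one-sided p-value is exactly half the two-sided one. If the estimate had the opposite sign, the one-sided p-value would instead be larger than 0.5.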

### Exercise 13

One of the magic things about experiments is that all you have to do is compare averages to get an average treatment effect. However, you *can* do other things to try and increase the statistical power of your experiments, like add controls in a linear regression model. 

As you likely know, a bivariate regression is exactly equivalent to a t-test, so let's start by re-estimating the effect of treatment on your OEC using a linear regression. Can you replicate the results from your t-test? They shouldn't just be close—they should be numerically equivalent (i.e. exactly the same to the limits of floating point number precision). 
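
To see why the two approaches agree, the sketch below fits the bivariate regression by hand with numpy and compares it to scipy's pooled-variance t-test. It uses the first five days of data from the previews above, and it computes OLS directly rather than through `statsmodels` so that every step is visible; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# OEC for the first five days of each arm: (Enrollments - Payments) / Clicks.
oec_control = np.array([64 / 687, 77 / 779, 72 / 909, 51 / 836, 99 / 837])
oec_treatment = np.array([71 / 686, 25 / 785, 66 / 884, 46 / 827, 46 / 832])

# Stack outcomes and build the design matrix [intercept, treatment indicator].
y = np.concatenate([oec_control, oec_treatment])
d = np.concatenate([np.zeros(len(oec_control)), np.ones(len(oec_treatment))])
X = np.column_stack([np.ones_like(d), d])

# OLS coefficients, residual variance, and classical (nonrobust) standard errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = len(y) - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_ols = beta[1] / se[1]
p_ols = 2 * stats.t.sf(abs(t_ols), dof)

# Pooled-variance two-sample t-test (scipy's default).
t_test, p_test = stats.ttest_ind(oec_treatment, oec_control)

print(f"OLS:    coef = {beta[1]:.6f}, t = {t_ols:.6f}, p = {p_ols:.6f}")
print(f"t-test: diff = {oec_treatment.mean() - oec_control.mean():.6f}, "
      f"t = {t_test:.6f}, p = {p_test:.6f}")
```

The equivalence relies on both methods assuming equal variances in the two groups; a Welch t-test (`equal_var=False`) would not match the nonrobust OLS standard errors exactly.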

In [40]:
import statsmodels.api as sm

# prepare data for regression
y = df_cleaned["oec"]
X = df_cleaned["treatment"]

# add constant
X = sm.add_constant(X)

# fit the OLS model
model_bivariate = sm.OLS(y, X).fit()

print(model_bivariate.summary())
print(round(model_bivariate.params["treatment"] * 100, 4))
print(round(model_bivariate.pvalues["treatment"], 4))

                            OLS Regression Results                            
Dep. Variable:                    oec   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     2.356
Date:                Sun, 01 Feb 2026   Prob (F-statistic):              0.132
Time:                        22:39:51   Log-Likelihood:                 89.832
No. Observations:                  46   AIC:                            -175.7
Df Residuals:                      44   BIC:                            -172.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1021      0.007     13.948      0.0

### Exercise 14

Now add indicator variables for the date of each observation. Do the standard errors on your `treatment` variable change? If so, in what direction?

Store your new standard error in `ex14_se_treatment`. Round your answer to 4 decimal places.

You should have found that your standard errors decreased by about 30\%. This is why, although just comparing means *works*, if you have additional variables, adding them to your analysis can be helpful (all the usual rules for model specification apply; for example, you still want to be careful about overfitting, which one could argue is maybe part of what's happening here).

In many other cases, the effect of adding controls is likely to be larger. The date indicators we added to our data are perfectly balanced between treatment and control, so we aren't adding a lot of data to the model by adding them as variables. They're accounting for some day-to-day variation (presumably in the types of people coming to the site), but they aren't controlling for any residual baseline differences the way a control like "gender" or "age" might (since those kinds of individual-level attributes will never be perfectly balanced across treatment and control).
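
One way to build intuition for what the date indicators do: because every date appears once in each arm, a regression with a full set of date indicators compares treatment and control *within* each day, which makes it equivalent to a paired t-test on the daily differences. The sketch below checks this on the first five days of data from the previews above, again using plain numpy for the regression; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# OEC for the first five days of each arm: (Enrollments - Payments) / Clicks.
oec_control = np.array([64 / 687, 77 / 779, 72 / 909, 51 / 836, 99 / 837])
oec_treatment = np.array([71 / 686, 25 / 785, 66 / 884, 46 / 827, 46 / 832])
n_days = len(oec_control)

# Design matrix: treatment indicator plus one indicator per day.
# (Using all day indicators with no intercept gives the same treatment
# coefficient as an intercept plus all-but-one day indicators.)
y = np.concatenate([oec_control, oec_treatment])
d = np.concatenate([np.zeros(n_days), np.ones(n_days)])
day_dummies = np.vstack([np.eye(n_days), np.eye(n_days)])
X = np.column_stack([d, day_dummies])

# OLS with classical standard errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = len(y) - X.shape[1]  # 2 * n_days - (n_days + 1) = n_days - 1
sigma2 = resid @ resid / dof
se_treatment = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
t_fixed_effects = beta[0] / se_treatment

# Paired t-test on the same-day differences.
t_paired, p_paired = stats.ttest_rel(oec_treatment, oec_control)

print(f"date-indicator regression: t = {t_fixed_effects:.6f}")
print(f"paired t-test:             t = {t_paired:.6f}")
```

Note the residual degrees of freedom equal the number of days minus one, which matches the pattern in the full model (23 days in the cleaned data leave 22 residual degrees of freedom).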

In [41]:
import statsmodels.api as sm

# 1. Create dummy variables for the Date column
df_with_dummies = pd.get_dummies(df_cleaned, columns=["Date"], drop_first=True)

# 2. Define independent variables (X)
date_cols = [col for col in df_with_dummies.columns if col.startswith("Date_")]
X_multi = df_with_dummies[["treatment"] + date_cols]

# 3. Add the constant for the intercept
X_multi = sm.add_constant(X_multi)

# 4. Define the dependent variable (y)
y = df_with_dummies["oec"]

# 5. Fit the model (ensure data types are float for statsmodels)
model_multi = sm.OLS(y.astype(float), X_multi.astype(float)).fit()

# 6. Extract and store the Standard Error for the treatment variable
results["ex14_se_treatment"] = round(model_multi.bse["treatment"], 4)

print(f"New Standard Error for Treatment: {results['ex14_se_treatment']}")
print(f"Bivariate SE for Treatment: {model_bivariate.bse['treatment']:.4f}")

New Standard Error for Treatment: 0.0066
Bivariate SE for Treatment: 0.0104


### Exercise 15

Does this result have any impact on the recommendations you would offer Udacity?

In [42]:
print(model_multi.summary())

                            OLS Regression Results                            
Dep. Variable:                    oec   R-squared:                       0.806
Model:                            OLS   Adj. R-squared:                  0.602
Method:                 Least Squares   F-statistic:                     3.962
Date:                Sun, 01 Feb 2026   Prob (F-statistic):           0.000978
Time:                        22:39:58   Log-Likelihood:                 126.29
No. Observations:                  46   AIC:                            -204.6
Df Residuals:                      22   BIC:                            -160.7
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                0.1079      0.016  

Yes, these results significantly impact the recommendations offered to Udacity. 

While the initial t-test suggested the intervention might not be effective, the refined regression analysis provides the statistical evidence needed to support a rollout.

1. From Insignificance to Statistical Confidence

- Previous Finding: In the simple comparison (Exercise 11), the p-value for the OEC (Frustration Rate) was 0.1319, which failed to meet the standard 0.05 threshold for significance.

- New Finding: By adding date indicators, the p-value for the treatment effect dropped to 0.025. This means we can now reject the null hypothesis and conclude that the intervention had a statistically significant effect.

2. Increased Precision (Reduced Noise)

- The standard error for the treatment effect decreased from 0.0104 to 0.0066.

- This precision gain occurred because the date indicators "soaked up" the day-to-day variation in user behavior that was previously treated as random noise. By accounting for which day an observation occurred, the model could more accurately isolate the specific impact of the treatment.

3. Final Recommendation for Udacity

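- Roll Out the Change: The intervention reduced the "Frustration Rate" (students who enroll in the free trial but never pay) by approximately 1.6 percentage points (coef = -0.0159).

- Business Impact: Since the previous checks showed no significant negative impact on the number of paying students (the Guardrail Metric), Udacity should implement this feature. It improves the student experience and coaching efficiency without detectably hurting the bottom line.

- Caveat: The date-indicator model spends 23 degrees of freedom on only 46 observations (leaving 22 residual degrees of freedom), so, as the exercise notes, overfitting is a concern and the result is worth confirming with more data.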

In [43]:
results

{'ex4_avg_oec': np.float64(0.0941),
 'ex5_avg_guardrail': np.float64(0.1158),
 'ex7_ttest_pvalue': np.float64(0.8877),
 'ex9_ttest_pvalue_clicks': np.float64(0.9264),
 'ex10_num_obs': 46,
 'ex11_oec_ate': np.float64(-1.5888),
 'ex11_oec_pvalue': np.float64(0.1319),
 'ex11_guard_ate': np.float64(-0.4897),
 'ex11_guard_pvalue': np.float64(0.5928),
 'ex14_se_treatment': np.float64(0.0066)}

In [44]:
assert set(results.keys()) == {
    "ex4_avg_oec",
    "ex5_avg_guardrail",
    "ex7_ttest_pvalue",
    "ex9_ttest_pvalue_clicks",
    "ex10_num_obs",
    "ex11_guard_ate",
    "ex11_guard_pvalue",
    "ex11_oec_ate",
    "ex11_oec_pvalue",
    "ex14_se_treatment",
}