In [1]:
import pandas as pd
import psycopg2 as pg2
from sqlalchemy import create_engine

engine = create_engine('postgresql://testuser:testpass@localhost:5432/postgresql_analysis')

con = pg2.connect(host='localhost',
                  user='testuser',
                  password='testpass',
                  database='postgresql_analysis')
con.autocommit = True
cur = con.cursor()

In [2]:
def select(query):
    return pd.read_sql(query, con)

### Experiment Analysis

Experimentation, also known as A/B testing or split testing, is considered the gold standard for establishing causality. Much data analysis work involves establishing correlations: one thing is more likely to happen when another thing also happens, whether that be an action, an attribute, or a seasonal pattern.

#### The Data Set

We will use a data set for a mobile game from the fictional Tanimura Studios. There are four tables.

- The *game_users* table contains records for people who downloaded the mobile game, along with the date and country. 
- The *game_actions* table contains records for things the users did in the game. 
- The *game_purchases* table tracks purchases of in-game currency in US dollars.
- The *exp_assignment* table contains records of which variant users were assigned to for a particular experiment. 

#### Experiment with Binary Outcomes: The Chi-Squared Test

As you might expect, a binary outcome experiment has only two outcomes: either an action is taken or it isn’t. Either a user completes a registration flow or they don’t. A consumer clicks on a website ad or they don’t. A student graduates or they don’t. For these types of experiments, we calculate the proportion of each variant that completes the action. The numerator is the number of completers, while the denominator is all units that were exposed. This metric is also described as a rate: completion rate, click-through rate, graduation rate, and so on.

To determine whether the rates in the variants are statistically different, we can use the **chi-squared test**, which is a statistical test for categorical variables. Data for a chi-squared test is often shown in the form of a contingency table, which shows the frequency of observations at the intersection of two attributes. This looks just like a pivot table to those who are familiar with that type of table.

https://www.mathsisfun.com/data/chi-square-test.html

In [3]:
query_01 = """
        SELECT a.variant
            ,count(case when b.user_id is not null then a.user_id end) as completed 
            ,count(case when b.user_id is null then a.user_id end) as not_completed
        FROM exp_assignment a
        LEFT JOIN game_actions b on a.user_id = b.user_id
        and b.action = 'onboarding complete'
        WHERE a.exp_name = 'Onboarding'
        GROUP BY 1
        """

select(query_01)

Unnamed: 0,variant,completed,not_completed
0,variant 1,38280,11995
1,control,36268,13629


In [4]:
query_02 = """
        SELECT a.variant
            ,count(a.user_id) as total_cohorted
            ,count(b.user_id) as completions
            ,count(b.user_id) * 1.0 / count(a.user_id) as pct_completed
        FROM exp_assignment a
        LEFT JOIN game_actions b on a.user_id = b.user_id
        and b.action = 'onboarding complete'
        WHERE a.exp_name = 'Onboarding'
        GROUP BY 1
        """

select(query_02)

Unnamed: 0,variant,total_cohorted,completions,pct_completed
0,variant 1,50275,38280,0.761412
1,control,49897,36268,0.726857


<img align="left" width="331" alt="Screen Shot 2022-04-26 at 4 03 44 PM" src="https://user-images.githubusercontent.com/73784742/165252143-d4c1a89b-db4f-4940-91eb-57d5bdd19e55.png">

We can see that variant 1 did indeed have more completions than the control experience, with 76.14% completing compared to 72.69%. 

But is this difference statistically significant, allowing us to reject the hypothesis that there is no difference? For this, we plug our results into an online calculator and confirm that the completion rate for variant 1 was significantly higher at a 95% confidence level than the completion rate for the control. Variant 1 can be declared the winner.
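
As a rough cross-check on the online calculator, the same test can be sketched in plain Python from the counts returned by query_01. This is a minimal sketch, assuming a Pearson chi-squared test without a continuity correction, so the statistic may differ slightly from what a particular calculator reports. The function name `chi_squared_2x2` is made up for illustration.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if variant and completion were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: variant 1, control. Columns: completed, not_completed (from query_01).
observed = [[38280, 11995], [36268, 13629]]
chi2, p_value = chi_squared_2x2(observed)

print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")
```

The p-value is compared against 0.05 because the text uses a 95% confidence level.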

#### Experiments with Continuous Outcomes: The t-Test

Many experiments seek to improve continuous metrics rather than binary outcomes. Continuous metrics can take on a range of values. Examples include amount spent by customers, time spent on page, and days an app is used. Ecommerce sites often want to increase sales, and so they might experiment on product pages or checkout flows. Content sites may test layout, navigation, and headlines to try to increase the number of stories read. A company running an app might run a remarketing campaign to remind users to come back to the app.

For these and other experiments with continuous success metrics, the goal is to figure out whether the average values in each variant differ from each other in a statistically significant way. The relevant statistical test is the **two-sample t-test**, which determines whether we can reject the null hypothesis that the averages are equal at a defined confidence level, usually 95%. The statistical test has three inputs, all of which are straightforward to calculate with SQL: the mean, the standard deviation, and the count of observations.

---

We will consider whether that new flow increased user spending on in-game currency.

In [5]:
query_03 = """
        SELECT variant
            ,count(user_id) as total_cohorted
            ,avg(amount) as mean_amount
            ,stddev(amount) as stddev_amount
        FROM
        (
            SELECT a.variant
            ,a.user_id
            ,sum(coalesce(b.amount,0)) as amount
            FROM exp_assignment a
            LEFT JOIN game_purchases b on a.user_id = b.user_id
            WHERE a.exp_name = 'Onboarding'
            GROUP BY 1,2
        ) a
        GROUP BY 1
        """

select(query_03)

Unnamed: 0,variant,total_cohorted,mean_amount,stddev_amount
0,variant 1,50275,3.687589,19.220194
1,control,49897,3.781218,18.940378


https://www.evanmiller.org/ab-testing/t-test.html#!3.781/18.94/49897;3.688/19.22/50275@95

There is no significant difference between the control and variant groups at a 95% confidence level. The “variant 1” group appears to have increased onboarding completion rates but not the amount spent.
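
The three inputs from query_03 (mean, standard deviation, count) are enough to reproduce the test in code. The following is a minimal sketch, assuming Welch's unequal-variance form of the t statistic and a normal approximation for the p-value, which is reasonable only because each group has roughly 50,000 users. The linked calculator may use a slightly different formulation, and `welch_t_from_stats` is a hypothetical helper name.

```python
import math

def welch_t_from_stats(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from summary statistics (Welch's form)."""
    # Standard error of the difference between the two means
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = (mean1 - mean2) / se
    # Assumption: with samples this large, the t distribution is close to
    # the standard normal, so the two-sided p-value is erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value

# Summary statistics from query_03: control first, then variant 1
t_stat, p_value = welch_t_from_stats(
    3.781218, 18.940378, 49897,
    3.687589, 19.220194, 50275,
)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")
```

For small samples the normal approximation would not be appropriate, and a proper t distribution with the Welch degrees of freedom should be used instead.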

---

Another question we might consider is whether variant 1 affected spending among those users who completed the onboarding. Those who don’t complete the onboarding never make it into the game and therefore don’t even have the opportunity to make a purchase.

In [6]:
query_04 = """
        SELECT variant
            ,count(user_id) as total_cohorted
            ,avg(amount) as mean_amount
            ,stddev(amount) as stddev_amount
        FROM
        (
            SELECT a.variant
            ,a.user_id
            ,sum(coalesce(b.amount,0)) as amount
            FROM exp_assignment a
            LEFT JOIN game_purchases b on a.user_id = b.user_id
            JOIN game_actions c on a.user_id = c.user_id
            and c.action = 'onboarding complete'
            WHERE a.exp_name = 'Onboarding'
            GROUP BY 1,2
        ) a
        GROUP BY 1
        """

select(query_04)

Unnamed: 0,variant,total_cohorted,mean_amount,stddev_amount
0,variant 1,38280,4.843091,21.899284
1,control,36268,5.202146,22.048994


https://www.evanmiller.org/ab-testing/t-test.html#!5.202/22.049/36268;4.843/21.899/38280@95

The average for the control group is statistically significantly higher than that for variant 1 at a 95% confidence level. This result may seem perplexing, but it illustrates why it is so important to agree on the success metric for an experiment up front. Variant 1 had a positive effect on onboarding completion and so can be judged a success by that metric. It did not have an effect on the overall spending level. This could be due to a mix shift: the additional users who made it through onboarding in variant 1 were less likely to pay. If the underlying hypothesis was that increasing onboarding completion rates would increase revenue, then the experiment should not be judged a success, and the product managers should come up with some new ideas to test.

#### Challenges with Experiments and Options for Rescuing Flawed Experiments

- Variant Assignment

Restrict the entities included by excluding those that shouldn’t be eligible via JOINs and WHERE conditions. After doing this, you should check to make sure that the resulting population is a large enough sample to produce significant results.

Careful data profiling can check whether entities have been assigned to multiple variants or whether users with high or low engagement prior to the experiment are clustered in a particular variant.
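
One such profiling check, finding entities assigned to more than one variant, can be sketched as follows. The rows here are hypothetical and stand in for `(user_id, variant)` pairs pulled from *exp_assignment* for a single experiment.

```python
from collections import defaultdict

# Hypothetical (user_id, variant) rows for one experiment
assignments = [
    (1, "control"),
    (2, "variant 1"),
    (3, "control"),
    (3, "variant 1"),  # user 3 appears in both variants
    (4, "variant 1"),
    (4, "variant 1"),  # duplicate row, same variant: not a conflict
]

# Collect the distinct variants seen for each user
variants_by_user = defaultdict(set)
for user_id, variant in assignments:
    variants_by_user[user_id].add(variant)

# Users with more than one distinct variant were assigned inconsistently
multi_assigned = sorted(
    user_id for user_id, variants in variants_by_user.items() if len(variants) > 1
)
print(multi_assigned)  # [3]
```

Whether such users should be dropped or kept in their first-assigned variant is a judgment call that depends on the experiment.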

- Outliers

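Statistical tests on continuous metrics are sensitive to outliers, such as a handful of users who spend far more than everyone else. In most cases, we are more interested in whether a treatment has an effect across a range of individuals, and thus adjusting for these outliers can make an experiment result more meaningful. One option, used in the query below, is to convert the continuous metric into a binary outcome: whether a user purchased at all, regardless of the amount.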

In [9]:
query_05 = """
        SELECT a.variant
            ,count(distinct a.user_id) as total_cohorted
            ,count(distinct b.user_id) as purchasers
            ,count(distinct b.user_id) * 1.0 / count(distinct a.user_id) as pct_purchased
        FROM exp_assignment a
        LEFT JOIN game_purchases b on a.user_id = b.user_id
        JOIN game_actions c on a.user_id = c.user_id
        and c.action = 'onboarding complete'
        WHERE a.exp_name = 'Onboarding'
        GROUP BY 1
        """

select(query_05)

Unnamed: 0,variant,total_cohorted,purchasers,pct_purchased
0,control,36268,4988,0.137532
1,variant 1,38280,4981,0.13012


The percentage of users who purchased in the control group is 13.7%, compared to 13.0% for variant 1. The conversion rate is statistically significantly higher for the control group. In this case, even though the rate of purchasing was higher for the control group, on a practical level we may be willing to accept this small decline if we believe that more users completing the onboarding process has other benefits. More players might boost rankings, for example, and players who enjoy the game may spread it to their friends via word of mouth, both of which can help growth and may then lead to attracting other new players who will become purchasers.

- Time Boxing

Experiments are often run over the course of several weeks. This means that individuals who enter the experiment earlier have a longer window in which to complete actions associated with the success metric. To control for this, we can apply time boxing—imposing a fixed length of time relative to the experiment entry date and considering actions only during that window.

In [10]:
query_06 = """
        SELECT variant
            ,count(user_id) as total_cohorted
            ,avg(amount) as mean_amount
            ,stddev(amount) as stddev_amount
        FROM
        (
            SELECT a.variant
                ,a.user_id
                ,sum(coalesce(b.amount,0)) as amount
            FROM exp_assignment a
            LEFT JOIN game_purchases b on a.user_id = b.user_id 
            and b.purch_date <= a.exp_date + interval '7 days'
            WHERE a.exp_name = 'Onboarding'
            GROUP BY 1,2
        ) a
        GROUP BY 1
        """

select(query_06)

Unnamed: 0,variant,total_cohorted,mean_amount,stddev_amount
0,variant 1,50275,1.351688,5.612986
1,control,49897,1.369382,5.766338
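
The window rule itself can be illustrated outside SQL. This is a small sketch on hypothetical rows that mirrors the join condition in query_06: a purchase counts only if it happened no later than seven days after the user's experiment entry date.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=7)

# Hypothetical experiment entry dates keyed by user_id
exp_dates = {1: date(2020, 1, 1), 2: date(2020, 1, 20)}

# Hypothetical (user_id, purch_date, amount) rows
purchases = [
    (1, date(2020, 1, 3), 5.0),    # inside user 1's window
    (1, date(2020, 1, 25), 10.0),  # outside user 1's window, ignored
    (2, date(2020, 1, 22), 2.0),   # inside user 2's window
]

# Start every cohorted user at zero so non-purchasers are still counted
amount_in_window = {user_id: 0.0 for user_id in exp_dates}
for user_id, purch_date, amount in purchases:
    # Same upper bound as the SQL: purch_date <= exp_date + 7 days
    if purch_date <= exp_dates[user_id] + WINDOW:
        amount_in_window[user_id] += amount

print(amount_in_window)  # {1: 5.0, 2: 2.0}
```

Like the query, the sketch applies only an upper bound, which assumes no purchases are recorded before the entry date.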


- Repeated Exposure Experiment

Measuring repeated exposure experiments is trickier than measuring one-and-done experiments due to novelty effects and regression to the mean. A *novelty effect* is the tendency for behavior to change just because something is new, not because it is necessarily better. *Regression to the mean* is the tendency for phenomena to return to an average level over time. As an example, changing any part of a user interface tends to increase the number of people who interact with it, whether it is a new button color, logo, or placement of functionality. Initially the metrics look good, because the click-through rate or engagement goes up. This is the novelty effect. But over time, users get used to the change, and they tend to click or use the functionality at rates that return closer to the baseline. This is the regression to the mean. 

The important question to answer when running this kind of experiment is whether the new baseline is higher (or lower) than the previous one. One solution is to allow passage of a long enough time period, in which you might expect regression to happen, before evaluating the results. In some cases, this will be a few days; in others, it might be a few weeks or months.

#### When Controlled Experiments Aren’t Possible: Alternative Analyses

- Pre-/Post-Analysis

A pre-/post-analysis compares either the same or similar populations before and after a change. The measurement of the population before the change is used as the control, while the measurement after the change is used as the variant or treatment.

Pre-/post-analysis works best when there is a clearly defined change that happened on a well-known date, so that the before and after groups can be cleanly divided. In this type of analysis, you will need to choose how long to measure before and after the change, but the periods should be equal or close to equal. 

Imagine that the onboarding flow for our mobile game includes a step in which the user can check a box to indicate whether they want to receive emails with game news. This had always been checked by default, but a new regulation requires that it now be unchecked by default. On January 27, 2020, the change was released into the game, and we would like to find out if it had a negative effect on email opt-in rates. To do this, we will compare the two weeks before the change to the two weeks after the change and see whether the opt-in rate is statistically significantly different. We could use one-week or three-week periods, but two weeks is chosen because it is long enough to allow for some day-of-week variability and also short enough to restrict the number of other factors that could otherwise affect users’ willingness to opt in.

In [11]:
query_07 = """
        SELECT 
        case when a.created between '2020-01-13' and '2020-01-26' then 'pre'
             when a.created between '2020-01-27' and '2020-02-09' then 'post'
             end as variant
            ,count(distinct a.user_id) as cohorted
            ,count(distinct b.user_id) as opted_in
            ,count(distinct b.user_id) * 1.0 / count(distinct a.user_id) as pct_optin
            ,count(distinct a.created) as days
        FROM game_users a
        LEFT JOIN game_actions b on a.user_id = b.user_id 
        and b.action = 'email_optin'
        WHERE a.created between '2020-01-13' and '2020-02-09'
        GROUP BY 1
        """

select(query_07)

Unnamed: 0,variant,cohorted,opted_in,pct_optin,days
0,post,27617,11220,0.406271,14
1,pre,24662,14489,0.587503,14


We can see that the users who went through the onboarding flow before the change had a much higher email opt-in rate—58.75%, compared to 40.63% afterward. 
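
To check whether a gap this size is statistically significant, the counts from query_07 can be fed into a two-proportion z-test. The sketch below is one possible formulation, assuming a pooled standard error for the test and an unpooled one for an approximate 95% interval on the difference. The helper name `two_proportion_z_test` is made up for illustration.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2, z_crit=1.96):
    """Two-sided z-test for a difference in proportions, plus an approximate CI."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled standard error under the null hypothesis of equal rates
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    # Two-sided p-value from the standard normal distribution;
    # for very large |z| this can underflow to 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the interval around the difference
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci

# Counts from query_07: pre (opted_in, cohorted), then post
z, p_value, (ci_low, ci_high) = two_proportion_z_test(14489, 24662, 11220, 27617)

print(f"z = {z:.1f}, significant at 95%: {p_value < 0.05}")
print(f"approx. 95% interval for the drop: {ci_low:.3f} to {ci_high:.3f}")
```

Because the pre and post groups were not randomly assigned, a significant result here shows that the rates differ, not that the change alone caused the difference.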

- Natural Experiment Analysis

A *natural experiment* occurs when entities end up with different experiences through some process that approximates randomness. One group receives the normal or control experience, and another receives some variation that may have a positive or negative effect.

For this type of analysis to have validity, we must be able to clearly determine which entities were exposed. Additionally, a control group that is as similar as possible to the exposed group is needed.

As an example in the video game data set, imagine that, during the time period of our data, users in Canada were accidentally given a different offer on the virtual currency purchase page the first time they looked at it: an extra zero was added to the number of virtual coins in each package. So, for example, instead of 10 coins the user would receive 100 game coins, or instead of 100 coins they would receive 1,000 game coins, and so on. The question we would like to answer is whether Canadians converted to buyers at a higher rate than other users. Rather than compare to the entire user base, we will compare only to users in the United States.

In [12]:
query_08 = """
        SELECT a.country
            ,count(distinct a.user_id) as total_cohorted
            ,count(distinct b.user_id) as purchasers
            ,count(distinct b.user_id) * 1.0 / count(distinct a.user_id) as pct_purchased
        FROM game_users a
        LEFT JOIN game_purchases b on a.user_id = b.user_id
        WHERE a.country in ('United States','Canada')
        GROUP BY 1
        """

select(query_08)

Unnamed: 0,country,total_cohorted,purchasers,pct_purchased
0,Canada,20179,5011,0.248327
1,United States,45012,4958,0.110148


The share of users in Canada who purchased is in fact higher: 24.83%, compared to 11.01% of those in the United States. The conversion rate in Canada is statistically significantly higher at a 95% confidence level.

- Analysis of Populations Around a Threshold

We can leverage the idea that subjects on either side of the threshold value are likely quite similar to each other. So instead of comparing the entire populations that did and did not receive the reward or intervention, we can compare only those that were close to the threshold both on the positive and the negative side. The formal name for this is regression discontinuity design (RDD).
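
The selection step can be sketched on hypothetical data. The threshold, bandwidth, and rows below are all invented for illustration, and the comparison is a simple difference in rates near the cutoff rather than a full regression discontinuity model.

```python
THRESHOLD = 100  # hypothetical cutoff at which the intervention is granted
BANDWIDTH = 10   # hypothetical distance from the cutoff to include

# Hypothetical (running_value, outcome) rows; outcome is 1 if the
# subject later took the action of interest, else 0
subjects = [
    (55, 0), (88, 0), (92, 1), (95, 0), (99, 1),
    (100, 1), (103, 1), (107, 0), (109, 1), (140, 1),
]

# Keep only subjects close to the threshold on either side
below = [o for v, o in subjects if THRESHOLD - BANDWIDTH <= v < THRESHOLD]
above = [o for v, o in subjects if THRESHOLD <= v < THRESHOLD + BANDWIDTH]

rate_below = sum(below) / len(below)
rate_above = sum(above) / len(above)

print(f"just below: n={len(below)}, rate={rate_below:.2f}")
print(f"just above: n={len(above)}, rate={rate_above:.2f}")
```

Choosing the bandwidth is a trade-off: a narrow band keeps the two groups similar but leaves fewer subjects to compare.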

#### Conclusion

Data profiling can be useful in tracking down issues that occur. When randomized experiments are not possible, a variety of other techniques are available, and SQL can be used to create synthetic control and variant groups.