In [1]:
import pandas as pd
import psycopg2 as pg2
from sqlalchemy import create_engine

engine = create_engine('postgresql://testuser:testpass@localhost:5432/postgresql_analysis')

con = pg2.connect(host='localhost',
                  user='testuser',
                  password='testpass',
                  database='postgresql_analysis')
con.autocommit = True
cur = con.cursor()

In [2]:
def select(query):
    """Run a SQL query on the open connection and return the result as a DataFrame."""
    return pd.read_sql(query, con)

### Experiment Analysis

Experimentation, also known as A/B testing or split testing, is considered the gold standard for establishing causality. Much data analysis work involves establishing correlations: one thing is more likely to happen when another thing also happens, whether that be an action, an attribute, or a seasonal pattern. Experiments go a step further: because units are randomly assigned to variants, a difference in outcomes can be attributed to the change being tested.

#### The Data Set

We will use a data set for a mobile game from the fictional Tanimura Studios. There are four tables.

- The *game_users* table contains records for people who downloaded the mobile game, along with the date and country. 
- The *game_actions* table contains records for things the users did in the game. 
- The *game_purchases* table tracks purchases of in-game currency in US dollars.
- The *exp_assignment* table contains records of which variant users were assigned to for a particular experiment. 

#### Experiment with Binary Outcomes: The Chi-Squared Test

As you might expect, a binary outcome experiment has only two outcomes: either an action is taken or it isn’t. Either a user completes a registration flow or they don’t. A consumer clicks on a website ad or they don’t. A student graduates or they don’t. For these types of experiments, we calculate the proportion of each variant that completes the action. The numerator is the number of completers, while the denominator is all units that were exposed. This metric is also described as a rate: completion rate, click-through rate, graduation rate, and so on.

To determine whether the rates in the variants are statistically different, we can use the **chi-squared test**, which is a statistical test for categorical variables. Data for a chi-squared test is often shown in the form of a contingency table, which shows the frequency of observations at the intersection of two attributes. This looks just like a pivot table to those who are familiar with that type of table.

https://www.mathsisfun.com/data/chi-square-test.html
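
For reference, the statistic behind the test compares each observed cell count $O_{ij}$ in the contingency table with the count $E_{ij}$ that would be expected if variant and outcome were independent. The symbols $R_i$ (row total), $C_j$ (column total), and $N$ (grand total) are notation introduced here for illustration:

$$
\chi^2 = \sum_{i}\sum_{j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i\,C_j}{N}
$$

For a 2×2 table the statistic has one degree of freedom, so a value above about 3.84 corresponds to significance at the 95% level.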

In [6]:
query_01 = """
        SELECT a.variant
            ,count(case when b.user_id is not null then a.user_id end) as completed 
            ,count(case when b.user_id is null then a.user_id end) as not_completed
        FROM exp_assignment a
        LEFT JOIN game_actions b on a.user_id = b.user_id
        and b.action = 'onboarding complete'
        WHERE a.exp_name = 'Onboarding'
        GROUP BY 1
        """

select(query_01)

| | variant | completed | not_completed |
|---|---|---|---|
| 0 | variant 1 | 38280 | 11995 |
| 1 | control | 36268 | 13629 |


In [8]:
query_02 = """
        SELECT a.variant
            ,count(a.user_id) as total_cohorted
            ,count(b.user_id) as completions
            ,count(b.user_id) * 1.0 / count(a.user_id) as pct_completed
        FROM exp_assignment a
        LEFT JOIN game_actions b on a.user_id = b.user_id
        and b.action = 'onboarding complete'
        WHERE a.exp_name = 'Onboarding'
        GROUP BY 1
        """

select(query_02)

| | variant | total_cohorted | completions | pct_completed |
|---|---|---|---|---|
| 0 | variant 1 | 50275 | 38280 | 0.761412 |
| 1 | control | 49897 | 36268 | 0.726857 |


<img align="left" width="331" alt="Screen Shot 2022-04-26 at 4 03 44 PM" src="https://user-images.githubusercontent.com/73784742/165252143-d4c1a89b-db4f-4940-91eb-57d5bdd19e55.png">

We can see that variant 1 did indeed have more completions than the control experience, with 76.14% completing compared to 72.69%. 

But is this difference statistically significant, allowing us to reject the null hypothesis that there is no difference? To find out, we plug the completed and not-completed counts into an online chi-squared calculator, which confirms that the completion rate for variant 1 was significantly higher than the completion rate for the control at a 95% confidence level. Variant 1 can be declared the winner.
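
The same check can also be done directly in the notebook. The following is a minimal sketch, not necessarily the method the online calculator uses: it computes the Pearson chi-squared statistic without a continuity correction, and the function name `chi_squared_2x2` is hypothetical. The counts are copied from the output of `query_01`.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test for a 2x2 table (no continuity correction).

    `table` is [[a, b], [c, d]]: one row per variant, one column per outcome.
    Returns (statistic, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if variant and outcome were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: variant 1, control. Columns: completed, not_completed (from query_01).
chi2_stat, p_value = chi_squared_2x2([[38280, 11995], [36268, 13629]])
print(f"chi-squared = {chi2_stat:.2f}, p-value = {p_value:.3g}")
```

The statistic comes out far above the 3.84 threshold for one degree of freedom, which agrees with the conclusion that the difference is significant at the 95% level.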

#### Experiments with Continuous Outcomes: The t-Test

Many experiments seek to improve continuous metrics rather than binary outcomes. Continuous metrics can take on a range of values. Examples include amount spent by customers, time spent on page, and days an app is used. Ecommerce sites often want to increase sales, and so they might experiment on product pages or checkout flows. Content sites may test layout, navigation, and headlines to try to increase the number of stories read. A company running an app might run a remarketing campaign to remind users to come back to the app.

For these and other experiments with continuous success metrics, the goal is to figure out whether the average values in each variant differ from each other in a statistically significant way. The relevant statistical test is the **two-sample t-test**, which determines whether we can reject the null hypothesis that the averages are equal at a chosen confidence level, usually 95%. The statistical test has three inputs, all of which are straightforward to calculate with SQL: the mean, the standard deviation, and the count of observations.
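
As a sketch of how the three inputs combine, the unequal-variance (Welch) form of the test statistic is shown below. Choosing this form is an assumption made here for illustration; $\bar{x}_k$, $s_k$, and $n_k$ denote the mean, standard deviation, and count of observations for variant $k$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$

The larger the absolute value of $t$, the less plausible it is that the two averages are equal.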

---

We will consider whether the new onboarding flow (variant 1) increased user spending on in-game currency.

In [9]:
query_03 = """
        SELECT variant
            ,count(user_id) as total_cohorted
            ,avg(amount) as mean_amount
            ,stddev(amount) as stddev_amount
        FROM
        (
            SELECT a.variant
            ,a.user_id
            ,sum(coalesce(b.amount,0)) as amount
            FROM exp_assignment a
            LEFT JOIN game_purchases b on a.user_id = b.user_id
            WHERE a.exp_name = 'Onboarding'
            GROUP BY 1,2
        ) a
        GROUP BY 1
        """

select(query_03)

| | variant | total_cohorted | mean_amount | stddev_amount |
|---|---|---|---|---|
| 0 | variant 1 | 50275 | 3.687589 | 19.220194 |
| 1 | control | 49897 | 3.781218 | 18.940378 |


https://www.evanmiller.org/ab-testing/t-test.html#!3.781/18.94/49897;3.688/19.22/50275@95

There is no significant difference between the control and variant groups at a 95% confidence level. The “variant 1” group appears to have increased onboarding completion rates but not the amount spent.
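
The result can be reproduced in the notebook from the summary statistics alone. This is a minimal sketch under two assumptions that are not from the original analysis: it uses the unequal-variance (Welch) statistic, and it approximates the p-value with the normal distribution, which is reasonable only because each group has tens of thousands of users. The function name `welch_t_test` is hypothetical.

```python
import math

def welch_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic from summary statistics (unequal variances).

    Returns (t, p_value). The two-sided p-value uses a normal
    approximation, which assumes both samples are large.
    """
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = (mean_a - mean_b) / standard_error
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return t, p_value

# Mean, standard deviation, and count from query_03: control first, then variant 1.
t_stat, p_value = welch_t_test(3.781218, 18.940378, 49897,
                               3.687589, 19.220194, 50275)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```

A p-value above 0.05 here is consistent with the conclusion that spending does not differ significantly between the two groups.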

---

Another question we might consider is whether variant 1 affected spending among those users who completed the onboarding. Those who don’t complete the onboarding never make it into the game and therefore don’t even have the opportunity to make a purchase.

In [10]:
query_04 = """
        SELECT variant
            ,count(user_id) as total_cohorted
            ,avg(amount) as mean_amount
            ,stddev(amount) as stddev_amount
        FROM
        (
            SELECT a.variant
            ,a.user_id
            ,sum(coalesce(b.amount,0)) as amount
            FROM exp_assignment a
            LEFT JOIN game_purchases b on a.user_id = b.user_id
            JOIN game_actions c on a.user_id = c.user_id
            and c.action = 'onboarding complete'
            WHERE a.exp_name = 'Onboarding'
            GROUP BY 1,2
        ) a
        GROUP BY 1
        """

select(query_04)

| | variant | total_cohorted | mean_amount | stddev_amount |
|---|---|---|---|---|
| 0 | variant 1 | 38280 | 4.843091 | 21.899284 |
| 1 | control | 36268 | 5.202146 | 22.048994 |


https://www.evanmiller.org/ab-testing/t-test.html#!5.202/22.049/36268;4.843/21.899/38280@95

The average for the control group is statistically significantly higher than that for variant 1 at a 95% confidence level. This result may seem perplexing, but it illustrates why it is so important to agree on the success metric for an experiment up front. Variant 1 had a positive effect on onboarding completion and so can be judged a success on that metric. It did not have a significant effect on the overall spending level. This could be due to a mix shift: the additional users who made it through onboarding in variant 1 were less likely to pay. If the underlying hypothesis was that increasing onboarding completion rates would increase revenue, then the experiment should not be judged a success, and the product managers should come up with some new ideas to test.
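
The mix shift can be seen in the numbers already computed. The sketch below relies on the point made earlier that users who do not complete onboarding cannot purchase, so the average over all cohorted users should be roughly the completion rate multiplied by the average among completers. The figures are copied from the query outputs above; the dictionary layout and key names are illustrative only.

```python
# Summary figures copied from the outputs of query_02, query_03, and query_04.
groups = {
    "variant 1": {"cohorted": 50275, "completed": 38280,
                  "mean_all": 3.687589, "mean_completers": 4.843091},
    "control":   {"cohorted": 49897, "completed": 36268,
                  "mean_all": 3.781218, "mean_completers": 5.202146},
}

implied_mean = {}
for name, g in groups.items():
    completion_rate = g["completed"] / g["cohorted"]
    # Non-completers are assumed to spend nothing, so they add zero to the average.
    implied_mean[name] = completion_rate * g["mean_completers"]
    print(f"{name}: completion rate {completion_rate:.4f}, "
          f"implied mean {implied_mean[name]:.4f}, "
          f"reported mean {g['mean_all']:.4f}")
```

Variant 1 has the higher completion rate but the lower average among completers, and the two effects roughly offset each other in the overall average.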