## Why is A/B testing important?

- No guessing
- Provides accurate answers quickly
- Allows rapid iteration on ideas
- It is one of the few statistically sound ways to establish causal relationships

## A/B test process

1. Develop a hypothesis about your product or business
2. Randomly assign users to two different groups
3. Expose:
    - Group 1 to the current product rules
    - Group 2 to a product that tests the hypothesis
4. Pick whichever performs better according to a set of KPIs
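
The four steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not part of the original notes: the user IDs, the group labels `A`/`B`, and the simulated conversion probabilities are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical user base: 10,000 user IDs
users = pd.DataFrame({'uid': np.arange(10_000)})

# Step 2: randomly assign each user to control (A) or treatment (B)
users['group'] = rng.choice(['A', 'B'], size=len(users))

# Step 3: simulate exposure; these conversion probabilities are invented
p_convert = np.where(users['group'] == 'A', 0.10, 0.12)
users['converted'] = rng.random(len(users)) < p_convert

# Step 4: compare the KPI (here, conversion rate) between the two groups
kpi_by_group = users.groupby('group')['converted'].mean()
print(kpi_by_group)
```

In practice the comparison in step 4 would normally be backed by a significance test rather than a raw difference between the two rates.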

## Where can A/B testing be used?

Users + ideas -> A/B test
- Testing impact of drugs
- Incentivizing spending
- Driving user growth
- Many others

## Key Performance Indicators (KPIs)

- A/B Tests: Measure impact of changes on KPIs
- KPIs: metrics important to an organization
    - ex. For a drug company, these may be remission rates of a cancer, or the likelihood of a particular side effect
    - ex. For a mobile game, it may be something like revenue or play time per user

### How to identify KPIs
Experience + Domain Knowledge + Exploratory Data Analysis
- Experience & Knowledge: what is important to a business
- Exploratory Analysis: what metrics and relationships impact these KPIs

## Example: meditation app

### Services
- Paid Subscription
- In-app purchases

### Goals/KPIs
- Maintain high free -> paid conversion rate
- The app is growing quickly and we are motivated to maintain a strong free-trial to paying user conversion rate

### Dataset 1: User demographics

In [31]:
import pandas as pd
# load customer_demographics
cd = pd.read_csv("/Users/jay/Desktop/Personal-Projects/ab_testing_dataset/user_demographics_v1.csv")

# load purchase_data
subs = pd.read_csv("/Users/jay/Desktop/Personal-Projects/ab_testing_dataset/purchase_data_v1.csv")

In [32]:
cd.head()

Unnamed: 0,uid,reg_date,device,gender,country,age
0,54030035.0,2017-06-29T00:00:00Z,and,M,USA,19
1,72574201.0,2018-03-05T00:00:00Z,iOS,F,TUR,22
2,64187558.0,2016-02-07T00:00:00Z,iOS,M,USA,16
3,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41
4,99231338.0,2017-03-26T00:00:00Z,iOS,M,FRA,59


In [33]:
subs.head()

Unnamed: 0,date,uid,sku,price
0,2017-07-10,41195147,sku_three_499,499
1,2017-07-15,41195147,sku_three_499,499
2,2017-11-12,41195147,sku_four_599,599
3,2017-09-26,91591874,sku_two_299,299
4,2017-12-01,91591874,sku_four_599,599


In [34]:
subs = subs.rename(columns={'date': 'subs_date'})

In [35]:
subs.head()

Unnamed: 0,subs_date,uid,sku,price
0,2017-07-10,41195147,sku_three_499,499
1,2017-07-15,41195147,sku_three_499,499
2,2017-11-12,41195147,sku_four_599,599
3,2017-09-26,91591874,sku_two_299,299
4,2017-12-01,91591874,sku_four_599,599


In [36]:
subs.loc[subs.uid == 92513925.0]

Unnamed: 0,subs_date,uid,sku,price
5370,2017-10-20,92513925,sku_three_499,499
9003,2017-05-29,92513925,sku_two_299,299
9004,2017-08-23,92513925,sku_four_599,599
9005,2018-03-26,92513925,sku_six_1299,299


In [37]:
# merge customer_demographics (left) and customer_subscriptions (right)
sub_data_demo = cd.merge(
                  #right dataframe
                  subs,
                  # join type
                  how='inner',
                  # columns to match
                  on=['uid'])

In [38]:
sub_data_demo.head()

Unnamed: 0,uid,reg_date,device,gender,country,age,subs_date,sku,price
0,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41,2017-10-20,sku_three_499,499
1,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41,2017-05-29,sku_two_299,299
2,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41,2017-08-23,sku_four_599,599
3,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41,2018-03-26,sku_six_1299,299
4,16377492.0,2016-10-16T00:00:00Z,and,M,BRA,20,2018-03-17,sku_one_199,199
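
To make the effect of `how='inner'` concrete, here is a small sketch on made-up frames. The rows and the names `demo` and `purchases` are assumptions for illustration, not the real dataset. An inner join keeps only `uid` values present in both tables, so users without purchases disappear and users with several purchases appear once per purchase.

```python
import pandas as pd

# Toy demographics: three users (uid stored as float, as in cd above)
demo = pd.DataFrame({'uid': [1.0, 2.0, 3.0],
                     'country': ['USA', 'BRA', 'FRA']})

# Toy purchases: user 1 bought twice, user 3 once, user 2 never
purchases = pd.DataFrame({'uid': [1, 1, 3],
                          'price': [499, 299, 199]})

inner = demo.merge(purchases, how='inner', on=['uid'])
left = demo.merge(purchases, how='left', on=['uid'])

print(len(inner))  # 3 rows: user 2 is dropped
print(len(left))   # 4 rows: user 2 is kept with a NaN price
```

If non-purchasing users matter for a KPI such as conversion rate, a left join is probably the safer starting point.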


### Group Data: .groupby()
- by: fields to group by
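- axis: axis=0 (the default) splits the rows into groups using the values in the by columns; axis=1 groups the columns instead (this parameter is deprecated in recent pandas versions)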
- as_index: as_index=True will use group labels as index

sub_data_grp = sub_data_demo.groupby(by=['country', 'device'],
                                     axis=0,
                                     as_index=False)

### Aggregate data: .agg()
- pass the name of an aggregation function to agg():

sub_data_grp.price.agg('mean')
- pass a list of names of aggregation functions:

sub_data_grp.price.agg(['mean','median'])
- pass a dictionary of column names and aggregation functions:

sub_data_grp.agg({'price':['mean','max','min'],
                  'age':['mean','max','min']})
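
The three calling styles can be tried on a tiny made-up frame. The data and the names `toy` and `toy_grp` are assumptions for illustration only, and the sketch omits `axis` because the default behaviour is what is wanted here.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['USA', 'USA', 'BRA', 'BRA'],
    'device':  ['iOS', 'iOS', 'and', 'and'],
    'price':   [199, 499, 299, 599],
    'age':     [20, 30, 40, 50],
})

toy_grp = toy.groupby(by=['country', 'device'], as_index=False)

# Single function name, applied to one column
mean_price = toy_grp.price.agg('mean')

# Dictionary form: different statistics per column
summary = toy_grp.agg({'price': ['mean', 'max'], 'age': ['mean']})

print(mean_price)
print(summary)
```

Note that the dictionary form with lists of functions returns two-level column labels such as `('price', 'max')`.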

In [39]:
def truncated_mean(data):
    """Compute the mean excluding outliers (values outside the 10th-90th percentiles)"""
    top_val = data.quantile(.9)
    bot_val = data.quantile(.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    mean = trunc_data.mean()
    return(mean)
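
A custom function like this can be passed to `agg()` alongside built-in names. The sketch below repeats the function so it runs on its own; the frame `toy` and its values are made up to show how one extreme age distorts the plain mean but not the truncated one.

```python
import pandas as pd

def truncated_mean(data):
    """Compute the mean excluding values outside the 10th-90th percentiles."""
    top_val = data.quantile(.9)
    bot_val = data.quantile(.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    return trunc_data.mean()

# Made-up ages: group 'b' contains one extreme value
toy = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 5,
    'age':   [20, 21, 22, 23, 24, 20, 21, 22, 23, 99],
})

# The custom function's name becomes the column label in the result
result = toy.groupby('group').age.agg(['mean', truncated_mean])
print(result)
```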

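In [42]:
from datetime import timedelta

# The original version of this cell raised
# "TypeError: Cannot compare type 'Timestamp' with type 'str'"
# because the date columns were still strings, so parse them first.
# tz_localize(None) drops the UTC marker so reg_date compares with naive dates.
purchase_data = sub_data_demo.copy()
purchase_data['reg_date'] = pd.to_datetime(purchase_data.reg_date).dt.tz_localize(None)
purchase_data['subs_date'] = pd.to_datetime(purchase_data.subs_date)

# Most recent date in the data (assumed; same value as used later in these notes)
current_date = pd.to_datetime('2018-03-17')

# Compute max_purchase_date
max_purchase_date = current_date - timedelta(days=28)

# Filter to only include users who registered before our max date
purchase_data_filt = purchase_data[purchase_data.reg_date < max_purchase_date]

# Filter to contain only purchases within the first 28 days of registration
purchase_data_filt = purchase_data_filt[(purchase_data_filt.subs_date <=
                        purchase_data_filt.reg_date + timedelta(days=28))]

# Output the mean price paid per purchase
print(purchase_data_filt.price.mean())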

In [2]:
# # Set the max registration date to be one month before today
# max_reg_date = current_date - timedelta(days=28)

# # Find the month 1 values
# month1 = np.where((purchase_data.reg_date < max_reg_date) &
#                  (purchase_data.date < purchase_data.reg_date + timedelta(days=28)),
#                   purchase_data.price,
#                   np.nan)
                 
# # Update the value in the DataFrame
# purchase_data['month1'] = month1

# # Group the data by gender and device 
# purchase_data_upd = purchase_data.groupby(by=['gender', 'device'], as_index=False) 

# # Aggregate the month1 and price data 
# purchase_summary = purchase_data_upd.agg(
#                         {'month1': ['mean', 'median'],
#                         'price': ['mean', 'median']})

# # Examine the results 
# print(purchase_summary)
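
Since the cell above is commented out, here is a runnable sketch of the same `np.where` idea on made-up rows. The column names follow the commented code; the three purchases and the value of `current_date` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from datetime import timedelta

# Made-up purchases; column names follow the commented-out cell
purchase_data = pd.DataFrame({
    'reg_date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-03-10']),
    'date':     pd.to_datetime(['2018-01-10', '2018-02-20', '2018-03-12']),
    'price':    [499, 299, 199],
})

current_date = pd.to_datetime('2018-03-17')
max_reg_date = current_date - timedelta(days=28)

# Keep the price only for purchases made in the first 28 days
# by users who registered at least 28 days before current_date
purchase_data['month1'] = np.where(
    (purchase_data.reg_date < max_reg_date) &
    (purchase_data.date < purchase_data.reg_date + timedelta(days=28)),
    purchase_data.price,
    np.nan)

print(purchase_data)
```

Only the first row qualifies: the second purchase falls outside the 28-day window, and the third user registered too recently to have a complete first month.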

## Exploratory Data Analysis

### Example: Week Two Conversion Rate

- Week 2 Conversion Rate: the share of users who subscribe in the second week after the free trial
- Users must have:
    - Completed the free trial
    - Not subscribed in the first week
    - Had a full second week to subscribe or not

#### Using the Timedelta class
- Lapse Date: Date the trial ends for a given user

In [7]:
# import pandas as pd
# from datetime import timedelta

In [8]:
# # Define the most recent date in our data
# current_date = pd.to_datetime('2018-03-17')

# # The last date a user could lapse and still be included
# max_lapse_date = current_date - timedelta(days=14)

# # Filter down to only eligible users
# conv_sub_data = sub_data_demo[sub_data_demo.lapse_date < max_lapse_date]

#### Date Differences
- Step 1: Filter to the relevant set of users
- Step 2: Calculate the time between a user's lapse and subscription dates
- Step 3: Convert the *sub_time* from a *timedelta* to an int

In [12]:
# # How many days passed before the user subscribed
# sub_time = conv_sub_data.subscription_date - conv_sub_data.lapse_date

# # Save this value in our dataframe
# conv_sub_data['sub_time'] = sub_time

# # Extract the days field from the sub_time
# conv_sub_data['sub_time'] = conv_sub_data.sub_time.dt.days
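
Steps 2 and 3 can be checked on a small made-up frame. The dates below are assumptions for illustration; the column names follow the commented cell above.

```python
import pandas as pd

# Made-up lapse and subscription dates; NaT marks a user who never subscribed
conv_sub_data = pd.DataFrame({
    'lapse_date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-05']),
    'subscription_date': pd.to_datetime(['2018-01-04', '2018-01-11', None]),
})

# Subtracting two datetime columns yields a timedelta column
sub_time = conv_sub_data.subscription_date - conv_sub_data.lapse_date

# .dt.days extracts whole days; missing values become NaN
conv_sub_data['sub_time'] = sub_time.dt.days
print(conv_sub_data.sub_time.tolist())  # [3.0, 10.0, nan]
```

Because of the missing value, the resulting column is float rather than int.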

#### Conversion rate calculation

In [13]:
# # filter to users who did not subscribe during the first week
# conv_base = conv_sub_data[(conv_sub_data.sub_time.isnull())
#                          | (conv_sub_data.sub_time > 7)]
# total_users = len(conv_base)

In [14]:
# total_subs = np.where(conv_base.sub_time.notnull()
#                      & (conv_base.sub_time <= 14), 1, 0)
# total_subs = sum(total_subs)

In [15]:
# conversion_rate = total_subs / total_users
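
The whole calculation can be run end to end on made-up values. This sketch assumes `sub_time` holds days from lapse to subscription, with NaN for users who never subscribed. It uses `isnull()` so that those users stay in the denominator.

```python
import numpy as np
import pandas as pd

# Made-up days-to-subscribe after lapse; NaN means the user never subscribed
conv_sub_data = pd.DataFrame({'sub_time': [3, 10, 12, np.nan, np.nan, 20]})

# Denominator: users who did not subscribe during the first week
conv_base = conv_sub_data[conv_sub_data.sub_time.isnull()
                          | (conv_sub_data.sub_time > 7)]
total_users = len(conv_base)

# Numerator: of those, users who subscribed by day 14
# (comparisons against NaN are False, so non-subscribers are not counted)
total_subs = int((conv_base.sub_time <= 14).sum())

conversion_rate = total_subs / total_users
print(total_users, total_subs, conversion_rate)  # 5 2 0.4
```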

#### Plotting Time Series Data

In [16]:
# # Group the data and aggregate first_week_purchases
# user_purchases = user_purchases.groupby(by=['reg_date', 'uid']).agg({'first_week_purchases': ['sum']})

# # Reset the indexes
# user_purchases.columns = user_purchases.columns.droplevel(level=1)
# user_purchases.reset_index(inplace=True)

# # Find the average number of purchases per day by first-week users
# user_purchases = user_purchases.groupby(by=['reg_date']).agg({'first_week_purchases': ['mean']})
# user_purchases.columns = user_purchases.columns.droplevel(level=1)
# user_purchases.reset_index(inplace=True)

# # Plot the results
# user_purchases.plot(x='reg_date', y='first_week_purchases')
# plt.show()
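
The two-stage aggregation can be reproduced on made-up rows. The data and the names `per_user` and `per_day` are assumptions for illustration. This variant uses named aggregation, which keeps the column labels flat and so avoids the `droplevel` step.

```python
import pandas as pd

# Made-up purchase rows: one row per purchase event
user_purchases = pd.DataFrame({
    'reg_date': pd.to_datetime(['2018-01-01'] * 3 + ['2018-01-02'] * 2),
    'uid': [1, 1, 2, 3, 4],
    'first_week_purchases': [1, 1, 0, 1, 0],
})

# Stage 1: total first-week purchases per user
per_user = (user_purchases
            .groupby(['reg_date', 'uid'], as_index=False)
            .agg(first_week_purchases=('first_week_purchases', 'sum')))

# Stage 2: average of those per-user totals for each registration date
per_day = (per_user
           .groupby('reg_date', as_index=False)
           .agg(first_week_purchases=('first_week_purchases', 'mean')))

print(per_day)
```

Calling `per_day.plot(x='reg_date', y='first_week_purchases')` would then draw the series, provided matplotlib is installed.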

#### Pivoting the data

In [17]:
# # Pivot the data
# country_pivot = pd.pivot_table(user_purchases_country, 
#                                values=['first_week_purchases'], 
#                                columns=['country'], index=['reg_date'])
# print(country_pivot.head())

In [18]:
# # Pivot the data
# device_pivot = pd.pivot_table(user_purchases_device, 
#                               values=['first_week_purchases'], 
#                               columns=['device'], index=['reg_date'])
# print(device_pivot.head())
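
A pivot of this kind can be tried on a small made-up long-format frame. The values below are assumptions for illustration; the frame name follows the commented cell above.

```python
import pandas as pd

# Made-up long-format data: one row per (reg_date, country)
user_purchases_country = pd.DataFrame({
    'reg_date': pd.to_datetime(['2018-01-01', '2018-01-01',
                                '2018-01-02', '2018-01-02']),
    'country': ['USA', 'BRA', 'USA', 'BRA'],
    'first_week_purchases': [1.0, 0.5, 2.0, 1.5],
})

# Passing values as a string (not a list) keeps the column labels flat
country_pivot = pd.pivot_table(user_purchases_country,
                               values='first_week_purchases',
                               columns='country',
                               index='reg_date')

print(country_pivot)
```

When `values` is given as a list, pandas returns two-level column labels. In both cases `reg_date` becomes the index, so `reset_index()` is likely needed before plotting with `x='reg_date'`.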

#### Examining Different cohorts

In [19]:
# # Plot the average first week purchases for each country by registration date
# country_pivot.plot(x='reg_date', y=['USA', 'CAN', 'FRA', 'BRA', 'TUR', 'DEU'])
# plt.show()