# A/B test

This notebook performs an A/B test on the dataset `AB_Test_Results.csv`, which comes from [Kaggle](https://www.kaggle.com/datasets/sergylog/ab-test-data).

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import mannwhitneyu

## Load data

In [2]:
df = pd.read_csv("Data/AB_Test_Results.csv")
df.head()

Unnamed: 0,USER_ID,VARIANT_NAME,REVENUE
0,737,variant,0.0
1,2423,control,0.0
2,9411,control,0.0
3,7311,control,0.0
4,6174,variant,0.0


Create a "BUY" column that flags whether a row contains a purchase: 1 if "REVENUE" > 0, else 0.

In [3]:
# Convert the boolean mask to an integer 0/1 flag
df['BUY'] = (df['REVENUE'] > 0).astype(int)
df.head()

Unnamed: 0,USER_ID,VARIANT_NAME,REVENUE,BUY
0,737,variant,0.0,0
1,2423,control,0.0,0
2,9411,control,0.0,0
3,7311,control,0.0,0
4,6174,variant,0.0,0
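A 0/1 flag like this is convenient because its mean is the share of rows with a purchase, which gives a conversion rate per group. The following is a minimal sketch on made-up rows; the `toy` frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the columns of AB_Test_Results.csv.
toy = pd.DataFrame({
    "USER_ID": [1, 2, 3, 4, 5, 6],
    "VARIANT_NAME": ["control", "control", "control",
                     "variant", "variant", "variant"],
    "REVENUE": [0.0, 2.5, 0.0, 0.0, 0.0, 1.2],
})

# 1 if the row has positive revenue, else 0.
toy["BUY"] = (toy["REVENUE"] > 0).astype(int)

# The mean of a 0/1 column is the fraction of rows equal to 1.
conversion_by_group = toy.groupby("VARIANT_NAME")["BUY"].mean()
print(conversion_by_group)
```

In this toy example each group has one purchase out of three rows, so both conversion rates come out equal.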


## Sorting data

In [4]:
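# sort_values returns a new DataFrame, so df itself is left unchanged.
# Within each user, rows are ordered from highest to lowest revenue.
df_sorted = df.sort_values(['USER_ID', 'REVENUE'], ascending=[True, False])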
df_sorted.head()

Unnamed: 0,USER_ID,VARIANT_NAME,REVENUE,BUY
2406,2,control,0.0,0
3479,2,control,0.0,0
7076,2,control,0.0,0
4145,3,variant,0.0,0
5377,3,control,0.0,0
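Sorting by "REVENUE" in descending order within each "USER_ID" matters if duplicate users are later collapsed to a single row: with `drop_duplicates(..., keep='first')`, the row that survives is then the one with the highest revenue. The notebook does not run that step in this cell, so the following is only a sketch on hypothetical data.

```python
import pandas as pd

# Hypothetical data: user 7 has three rows, user 9 has two.
toy = pd.DataFrame({
    "USER_ID": [7, 7, 7, 9, 9],
    "VARIANT_NAME": ["control", "control", "control", "variant", "variant"],
    "REVENUE": [0.0, 3.0, 1.0, 0.0, 0.0],
})

# Same ordering as above: user ascending, revenue descending.
toy_sorted = toy.sort_values(["USER_ID", "REVENUE"], ascending=[True, False])

# keep="first" retains the top row per user, i.e. the highest revenue.
one_row_per_user = toy_sorted.drop_duplicates(subset="USER_ID", keep="first")
print(one_row_per_user)
```

Without the descending sort, `keep="first"` would retain whichever row happened to appear first in the file, which could discard a purchase.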


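We sorted the data by "USER_ID" and, within each user, by "REVENUE" in descending order, so that each user's highest-revenue row comes first. The sorted output still contains repeated "USER_ID" values (for example, user 2 appears three times), so no duplicate rows have been dropped at this point. Next, we add the columns "RANK_REVENUE" and "RANK_CONVERSION", which hold the ranks of "REVENUE" and "BUY" across the whole dataset, with tied values sharing the average of their ranks. This step prepares the dataset for the Mann-Whitney U test, which compares the pooled ranks of the two groups.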

In [5]:
# Rank over the pooled data (control and variant together); ties get the average rank
df_sorted['RANK_REVENUE'] = df_sorted['REVENUE'].rank(method='average')
df_sorted['RANK_CONVERSION'] = df_sorted['BUY'].rank(method='average')

df_sorted

Unnamed: 0,USER_ID,VARIANT_NAME,REVENUE,BUY,RANK_REVENUE,RANK_CONVERSION
2406,2,control,0.0,0,4924.5,4924.5
3479,2,control,0.0,0,4924.5,4924.5
7076,2,control,0.0,0,4924.5,4924.5
4145,3,variant,0.0,0,4924.5,4924.5
5377,3,control,0.0,0,4924.5,4924.5
...,...,...,...,...,...,...
2998,9996,control,0.0,0,4924.5,4924.5
1064,9998,control,0.0,0,4924.5,4924.5
6883,10000,variant,0.0,0,4924.5,4924.5
8921,10000,control,0.0,0,4924.5,4924.5
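To illustrate what `method='average'` does and how pooled ranks relate to the Mann-Whitney U test, here is a small sketch on made-up values. For the first sample, the U statistic can be computed from its rank sum as U1 = R1 - n1(n1 + 1)/2, and the result should agree with `scipy.stats.mannwhitneyu`. The lists `a` and `b` are hypothetical and stand in for the revenue of two groups.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical samples with many ties at zero, like the revenue column.
a = [0.0, 0.0, 2.0, 5.0]
b = [0.0, 1.0, 0.0]

# Pool both samples and rank them together; ties share the average rank.
pooled = pd.DataFrame({
    "group": ["a"] * len(a) + ["b"] * len(b),
    "value": a + b,
})
pooled["rank"] = pooled["value"].rank(method="average")

# U statistic of sample a from its rank sum: U1 = R1 - n1 * (n1 + 1) / 2.
n1 = len(a)
r1 = pooled.loc[pooled["group"] == "a", "rank"].sum()
u1_from_ranks = r1 - n1 * (n1 + 1) / 2

# Compare with SciPy's statistic for the first sample.
u1_scipy = mannwhitneyu(a, b, alternative="two-sided").statistic
print(u1_from_ranks, u1_scipy)
```

The four zeros occupy ranks 1 to 4 and therefore each receive rank 2.5, which is the same mechanism that produces the value 4924.5 for every zero-revenue row in the table above.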


In [7]:
df_control = df_sorted[df_sorted.VARIANT_NAME == 'control'].copy()
df_variant = df_sorted[df_sorted.VARIANT_NAME == 'variant'].copy()

## Any intersection between control and variant groups?

If the same user appears in both the control and variant groups, that user's behavior cannot be attributed to a single treatment, and the two samples are no longer independent. Before running any analysis, we therefore need to identify users who appear in both groups and remove them.

In [8]:
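# Inner join on USER_ID keeps only users present in both groups.
# A user with several rows per group appears once per control/variant row pair.
df_overlap = pd.merge(df_control, df_variant, on="USER_ID")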
df_overlap

Unnamed: 0,USER_ID,VARIANT_NAME_x,REVENUE_x,BUY_x,RANK_REVENUE_x,RANK_CONVERSION_x,VARIANT_NAME_y,REVENUE_y,BUY_y,RANK_REVENUE_y,RANK_CONVERSION_y
0,3,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
1,3,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
2,10,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
3,18,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
4,25,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
...,...,...,...,...,...,...,...,...,...,...,...
2496,9979,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
2497,9982,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
2498,9996,control,0.0,0,4924.5,4924.5,variant,6.46,1,9971.0,9924.5
2499,10000,control,0.0,0,4924.5,4924.5,variant,0.00,0,4924.5,4924.5
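Because a merge like this repeats a user once for every control/variant row pair, its row count can overstate the number of affected users. One possible way to count the distinct overlapping users and exclude them is sketched below on hypothetical data; the `toy` frame and the names `overlap_ids` and `clean` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical data: user 1 appears in both groups, user 3 twice in one group.
toy = pd.DataFrame({
    "USER_ID": [1, 1, 2, 3, 3, 4],
    "VARIANT_NAME": ["control", "variant", "control",
                     "variant", "variant", "control"],
    "REVENUE": [0.0, 0.0, 1.5, 0.0, 2.0, 0.0],
})

# Number of distinct variants each user was exposed to.
variants_per_user = toy.groupby("USER_ID")["VARIANT_NAME"].nunique()

# Users seen under more than one variant are the overlap.
overlap_ids = variants_per_user[variants_per_user > 1].index

# Keep only rows of users who saw exactly one variant.
clean = toy[~toy["USER_ID"].isin(overlap_ids)]
print(list(overlap_ids))
print(clean)
```

Note that user 3 is kept here: repeated rows within a single group are a separate question from overlap between groups.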
