# A/B Testing - Analysis of the Effectiveness of Two Landing Page Variants

## Import dependencies

In [50]:
import pandas as pd
import plotly.express as px

## Load and visualize data

### Load data

In [2]:
data = pd.read_csv('../../data/ab_data_tourist.csv')
data.head()

Unnamed: 0,user_id,date,group,purchase,price
0,851104,2021-01-21,A,0,0
1,804228,2021-01-12,A,0,0
2,661590,2021-01-11,B,0,0
3,853541,2021-01-08,B,0,0
4,864975,2021-01-21,A,1,150000


Add a tour destination feature derived from the tour price (each price corresponds to exactly one destination).

In [32]:
# Each tour price maps to exactly one destination.
PRICE_TO_DESTINATION = {
    100000: 'Thailand',
    60000: 'Turkey',
    200000: 'Maldives',
    10000: 'St. Petersburg',
    150000: 'Kamchatka',
}


def get_tour_type(price):
    # Return the destination for a price, or '' when no tour was bought (price 0).
    return PRICE_TO_DESTINATION.get(price, '')


data['destination'] = data['price'].apply(get_tour_type)

### Exploratory data analysis

#### Preliminary data analysis

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   294478 non-null  int64 
 1   date      294478 non-null  object
 2   group     294478 non-null  object
 3   purchase  294478 non-null  int64 
 4   price     294478 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 11.2+ MB


All columns have the correct data type except for the 'date' column, which is stored as a generic object. We need to convert it to the datetime type to perform operations with dates.

In [4]:
data['date'] = pd.to_datetime(data['date'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   user_id   294478 non-null  int64         
 1   date      294478 non-null  datetime64[ns]
 2   group     294478 non-null  object        
 3   purchase  294478 non-null  int64         
 4   price     294478 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 11.2+ MB


Check the duration of the test interval for both groups.

In [5]:
group_a_start = data[data['group'] == 'A']['date'].dt.date.min()
group_a_end = data[data['group'] == 'A']['date'].dt.date.max()
group_b_start = data[data['group'] == 'B']['date'].dt.date.min()
group_b_end = data[data['group'] == 'B']['date'].dt.date.max()

print(f'Group A test interval: {group_a_start} - {group_a_end}')
print(f'Group B test interval: {group_b_start} - {group_b_end}')

Group A test interval: 2021-01-02 - 2021-01-24
Group B test interval: 2021-01-02 - 2021-01-24


Test intervals are identical for both test groups. No action is required to equalize the test intervals.
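
The same check can be written more compactly with a single group-by. The sketch below runs on a tiny made-up frame (the `toy` rows are illustrative and not taken from the dataset) and assumes the `date` column has already been converted with `pd.to_datetime`.

```python
import pandas as pd

# Illustrative rows only; in the notebook, `data` would take the place of `toy`.
toy = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2021-01-02', '2021-01-24', '2021-01-02', '2021-01-24']),
})

# First and last observation date per group.
intervals = toy.groupby('group')['date'].agg(['min', 'max'])

# True when all groups share the same start date and the same end date.
same_interval = bool(
    intervals['min'].nunique() == 1 and intervals['max'].nunique() == 1
)

print(intervals)
print('Identical intervals:', same_interval)
```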

Check for missing values.

In [6]:
data.isnull().sum()

user_id     0
date        0
group       0
purchase    0
price       0
dtype: int64

There are no missing values in the data.

Check whether any users ended up in both groups during the test.

In [7]:
user_group_count = data.groupby('user_id')['group'].count().reset_index()
users_in_both_groups = user_group_count[user_group_count['group'] != 1]
print('Number of users present in both groups:', users_in_both_groups.shape[0])

# We need to delete these users from the dataset.
print('Total test data:', data.shape[0])
users_in_both_groups = users_in_both_groups['user_id'].to_list()
data = data[~data['user_id'].isin(users_in_both_groups)]
print('Total test data after deletion:', data.shape[0])

Number of users present in both groups: 3894
Total test data: 294478
Total test data after deletion: 286690
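
Note that `count()` counts rows per user, so the filter above flags every user with more than one record, whether or not those records belong to different groups. The row counts show that 7788 rows were removed for 3894 users, so each flagged user has exactly two records. A stricter check, sketched below on made-up rows (`toy` and its values are hypothetical), counts the distinct groups per user with `nunique()`. Whether it would change the figures reported above cannot be told from the output shown.

```python
import pandas as pd

# Hypothetical rows: user 1 appears in both groups, user 2 twice in group A.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'group':   ['A', 'B', 'A', 'A', 'B'],
})

# Number of distinct groups each user was assigned to.
groups_per_user = toy.groupby('user_id')['group'].nunique()

# Users seen in more than one group.
mixed_users = groups_per_user[groups_per_user > 1].index

# Keep only users that belong to a single group.
clean = toy[~toy['user_id'].isin(mixed_users)]

print('Users in both groups:', len(mixed_users))
print('Rows kept:', clean.shape[0])
```

With this variant a repeat visitor who stays in one group (user 2 in the toy data) is kept rather than dropped.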


#### Basic data analysis

For each group, count the number of website visits, the number of purchases and the total amount of all purchases.

In [21]:
visits_a = data[data['group'] == 'A'].shape[0]
visits_b = data[data['group'] == 'B'].shape[0]

num_purchases_a = data[data['group'] == 'A']['purchase'].sum()
num_purchases_b = data[data['group'] == 'B']['purchase'].sum()

total_purchase_a = data[data['group'] == 'A']['price'].sum()
total_purchase_b = data[data['group'] == 'B']['price'].sum()

print('Group\t\tVisits\t\tPurchases\tSum')
print(f'A\t\t{visits_a}\t\t{num_purchases_a}\t\t{total_purchase_a}')
print(f'B\t\t{visits_b}\t\t{num_purchases_b}\t\t{total_purchase_b}')

Group		Visits		Purchases	Sum
A		143293		17220		1396120000
B		143397		17025		1510100000


The groups A and B are balanced based on the number of visits: 143293 vs 143397.

Group A shows a slightly higher number of purchases (17220 vs 17025), whereas group B shows a higher total amount of all purchases (1510100000 vs 1396120000).

Calculate conversion rate and average bill for both groups.

In [29]:
conversion_a = num_purchases_a / visits_a * 100
conversion_b = num_purchases_b / visits_b * 100

average_bill_a = total_purchase_a / num_purchases_a
average_bill_b = total_purchase_b / num_purchases_b

print('Group\t\tConversion rate\t\tAverage bill')
print(f'A\t\t{conversion_a.round(2)}\t\t\t{average_bill_a.round(2)}')
print(f'B\t\t{conversion_b.round(2)}\t\t\t{average_bill_b.round(2)}')

Group		Conversion rate		Average bill
A		12.02			81075.49
B		11.87			88698.97


Based on the numbers obtained above:
* Group A has a slightly higher conversion rate (12.02 % vs 11.87 %, a difference of about 0.15 percentage points)
* Group B has a higher average bill (88698.97 vs 81075.49, about 9 % higher)
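
To make the comparison explicit, both differences can be computed directly from the totals printed earlier. This is a small sketch: the names `diff_pp` and `bill_lift_pct` are introduced here for illustration. Because it works from unrounded rates, the conversion gap comes out marginally below the 0.15 obtained from the rounded figures.

```python
# Totals copied from the summary table (visits, purchases, sum) printed earlier.
visits_a, visits_b = 143293, 143397
num_purchases_a, num_purchases_b = 17220, 17025
total_purchase_a, total_purchase_b = 1396120000, 1510100000

# Conversion rate in percent for each group.
conversion_a = num_purchases_a / visits_a * 100
conversion_b = num_purchases_b / visits_b * 100

# Absolute gap between the two rates, in percentage points.
diff_pp = conversion_a - conversion_b

# Average bill per purchase and the relative lift of B over A, in percent.
average_bill_a = total_purchase_a / num_purchases_a
average_bill_b = total_purchase_b / num_purchases_b
bill_lift_pct = (average_bill_b / average_bill_a - 1) * 100

print(f'Conversion difference (A - B): {diff_pp:.3f} percentage points')
print(f'Average bill lift (B vs A): {bill_lift_pct:.1f} %')
```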

Calculate purchasing power for each of the destinations in both groups.

In [36]:
purchasing_power = data.groupby(['group', 'destination', 'price'])['purchase'].sum().reset_index()

purchasing_power[purchasing_power['purchase'] != 0]

Unnamed: 0,group,destination,price,purchase
1,A,Kamchatka,150000,3430
2,A,Maldives,200000,1691
3,A,St. Petersburg,10000,5096
4,A,Thailand,100000,1807
5,A,Turkey,60000,5196
7,B,Kamchatka,150000,3388
8,B,Maldives,200000,1671
9,B,St. Petersburg,10000,5118
10,B,Thailand,100000,5141
11,B,Turkey,60000,1707


The per-destination breakdown shows that groups A and B are more or less similar for Kamchatka, Maldives and St. Petersburg, but mirror each other for Thailand and Turkey: group A has roughly three times as many Turkey purchases as Thailand purchases (5196 vs 1807), while group B has roughly three times as many Thailand purchases as Turkey purchases (5141 vs 1707). Since a Thailand tour costs 100000 and a Turkey tour 60000, this shift is consistent with group B's higher average bill.
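
A side-by-side view makes the mirrored pattern easier to see. The sketch below rebuilds the purchase counts from the table above and pivots them so that each destination has one column per group; the `B_to_A` ratio column is an addition made here for illustration.

```python
import pandas as pd

destinations = ['Kamchatka', 'Maldives', 'St. Petersburg', 'Thailand', 'Turkey']

# Purchase counts copied from the per-destination table above.
purchases = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'destination': destinations * 2,
    'purchase': [3430, 1691, 5096, 1807, 5196,
                 3388, 1671, 5118, 5141, 1707],
})

# One row per destination, one column per group.
by_destination = purchases.pivot(
    index='destination', columns='group', values='purchase'
)

# Ratio of group B purchases to group A purchases for each destination.
by_destination['B_to_A'] = by_destination['B'] / by_destination['A']

print(by_destination.round(2))
```

Ratios close to 1 indicate destinations where the two groups behave alike; values far from 1 single out Thailand and Turkey.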

### Analyze data in terms of metric stabilization

Aggregate the data by group and date. For each group and day, calculate the number of visits, the number of purchases and the total amount of purchases.

Then add the daily conversion rate and the daily average bill.

In [61]:
group_metrics = data.groupby(['group', 'date']).agg({
    'user_id': 'count',
    'purchase': 'sum',
    'price': 'sum'
}).reset_index().rename(columns={'user_id': 'visits', 'price': 'total_amount'})

group_metrics['conversion_rate'] = group_metrics['purchase'] / group_metrics['visits'] * 100
group_metrics['average_bill'] = group_metrics['total_amount'] / group_metrics['purchase']

print(group_metrics.head())

  group       date  visits  purchase  total_amount  conversion_rate  \
0     A 2021-01-02    2813       354      29170000        12.584429   
1     A 2021-01-03    6494       738      61420000        11.364336   
2     A 2021-01-04    6481       787      63050000        12.143188   
3     A 2021-01-05    6330       780      63460000        12.322275   
4     A 2021-01-06    6518       750      62460000        11.506597   

   average_bill  
0  82401.129944  
1  83224.932249  
2  80114.358323  
3  81358.974359  
4  83280.000000  


Based on the data above, we will calculate cumulative statistics for each group: the cumulative conversion rate and the cumulative average bill.

In [65]:
# Take explicit copies so that adding columns does not trigger SettingWithCopyWarning.
group_metrics_a = group_metrics[group_metrics['group'] == 'A'].copy()
group_metrics_b = group_metrics[group_metrics['group'] == 'B'].copy()

# Running totals of visits, purchases and revenue for group A.
group_metrics_a[['visits_cum', 'purchase_cum', 'total_amount_cum']] = \
    group_metrics_a[['visits', 'purchase', 'total_amount']].cumsum()
group_metrics_a

Unnamed: 0,group,date,visits,purchase,total_amount,conversion_rate,average_bill,visits_cum,purchase_cum,total_amount_cum
0,A,2021-01-02,2813,354,29170000,12.584429,82401.129944,2813,354,29170000
1,A,2021-01-03,6494,738,61420000,11.364336,83224.932249,9307,1092,90590000
2,A,2021-01-04,6481,787,63050000,12.143188,80114.358323,15788,1879,153640000
3,A,2021-01-05,6330,780,63460000,12.322275,81358.974359,22118,2659,217100000
4,A,2021-01-06,6518,750,62460000,11.506597,83280.0,28636,3409,279560000
5,A,2021-01-07,6493,780,65770000,12.012937,84320.512821,35129,4189,345330000
6,A,2021-01-08,6602,785,64440000,11.890336,82089.171975,41731,4974,409770000
7,A,2021-01-09,6538,780,64620000,11.930254,82846.153846,48269,5754,474390000
8,A,2021-01-10,6575,743,57750000,11.30038,77725.437416,54844,6497,532140000
9,A,2021-01-11,6593,776,57130000,11.770059,73621.134021,61437,7273,589270000


In [8]:
data

Unnamed: 0,user_id,date,group,purchase,price
0,851104,2021-01-21,A,0,0
1,804228,2021-01-12,A,0,0
2,661590,2021-01-11,B,0,0
3,853541,2021-01-08,B,0,0
4,864975,2021-01-21,A,1,150000
...,...,...,...,...,...
294473,751197,2021-01-03,A,0,0
294474,945152,2021-01-12,A,0,0
294475,734608,2021-01-22,A,0,0
294476,697314,2021-01-15,A,0,0
