# A/B Test Challenge



---

#### What is an A/B Test? 

An A/B test is a decision-support and research methodology that lets you measure the impact of a change in a product (e.g., a digital product). For this challenge you will analyse the data from an A/B test performed on a digital product in which a new set of sponsored ads was introduced.


#### Measure of success

Metrics are needed to measure the success of your product. They are typically split into the following categories: 

- __Engagement-based metrics:__ number of users, number of downloads, number of active users, user retention, etc.

- __Revenue and monetization metrics:__ ads and affiliate links, subscription-based, in-app purchases, etc.

- __Technical metrics:__ service level indicators (uptime of the app, downtime of the app, latency).



---

## Metrics understanding

In this part you must analyse the metrics involved in the test. We will focus on the following metrics:

- Activity level + Daily active users (DAU).

- Click-through rate (CTR)

### Activity level

In the following part you must perform every calculation you consider necessary in order to answer the following questions:

- How many activity levels can you find in the dataset? (An activity level of zero means no activity.)

- How many users are there at each activity level?

- How many activity levels are there per day, and how many records are there for each activity level?

At the end of this section you must provide your conclusions about the _activity level_ of the users.

__Dataset:__ `activity_pretest.csv`

In [2]:
# your-code
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy import inspect
import pandas as pd
import numpy as np

import seaborn as sns


In [3]:
df = pd.read_csv('./data/activity_pretest.csv')
# How many activity levels can you find in the dataset? (An activity level of zero means no activity.)
# Keep only active records; level 0 (no activity) is excluded from here on.
df1 = df[df.activity_level != 0]
df1['activity_level'].unique()  # 20 non-zero activity levels
df1

| | userid | dt | activity_level |
|---|---|---|---|
| 909125 | 428070b0-083e-4c0e-8444-47bf91e99fff | 2021-10-01 | 1 |
| 909126 | 93370f9c-56ef-437f-99ff-cb7c092d08a7 | 2021-10-01 | 1 |
| 909127 | 0fb7120a-53cf-4a51-8b52-bf07b8659bd6 | 2021-10-01 | 1 |
| 909128 | ce64a9d8-07d9-4dca-908d-5e1e4568003d | 2021-10-01 | 1 |
| 909129 | e08332f0-3a5c-4ed2-b957-87e464e89b97 | 2021-10-01 | 1 |
| ... | ... | ... | ... |
| 1859995 | 200d65e6-b1ce-4a47-8c2b-946db5c5a3a0 | 2021-10-31 | 20 |
| 1859996 | 535dafe4-de7c-4b56-acf6-aa94f21653bc | 2021-10-31 | 20 |
| 1859997 | 0428ca3c-e666-4ef4-8588-3a2af904a123 | 2021-10-31 | 20 |
| 1859998 | a8cd1579-44d4-48b3-b3d6-47ae5197dbc6 | 2021-10-31 | 20 |


In [4]:
# How many users are there at each activity level?
# Note: count() counts records (rows) per level, not distinct users.
df1.groupby(["activity_level"]).count().sort_values(["userid"], ascending=False)


| activity_level | userid | dt |
|---|---|---|
| 5 | 49227 | 49227 |
| 2 | 49074 | 49074 |
| 18 | 48982 | 48982 |
| 10 | 48943 | 48943 |
| 16 | 48934 | 48934 |
| 12 | 48911 | 48911 |
| 19 | 48901 | 48901 |
| 6 | 48901 | 48901 |
| 11 | 48832 | 48832 |
| 9 | 48820 | 48820 |


In [5]:
# How many activity levels are there per day, and how many records per activity level?
df2 = df1.groupby(["activity_level", "dt"]).size()  # records per (activity level, day)
#df1.groupby(["dt"]).count().sort_values(["activity_level"], ascending=False)
#df1.groupby(["activity_level"]).count()
df1  # note: this displays df1; the per-day counts are stored in df2


# TODO: convert dt to datetime; check!

| | userid | dt | activity_level |
|---|---|---|---|
| 909125 | 428070b0-083e-4c0e-8444-47bf91e99fff | 2021-10-01 | 1 |
| 909126 | 93370f9c-56ef-437f-99ff-cb7c092d08a7 | 2021-10-01 | 1 |
| 909127 | 0fb7120a-53cf-4a51-8b52-bf07b8659bd6 | 2021-10-01 | 1 |
| 909128 | ce64a9d8-07d9-4dca-908d-5e1e4568003d | 2021-10-01 | 1 |
| 909129 | e08332f0-3a5c-4ed2-b957-87e464e89b97 | 2021-10-01 | 1 |
| ... | ... | ... | ... |
| 1859995 | 200d65e6-b1ce-4a47-8c2b-946db5c5a3a0 | 2021-10-31 | 20 |
| 1859996 | 535dafe4-de7c-4b56-acf6-aa94f21653bc | 2021-10-31 | 20 |
| 1859997 | 0428ca3c-e666-4ef4-8588-3a2af904a123 | 2021-10-31 | 20 |
| 1859998 | a8cd1579-44d4-48b3-b3d6-47ae5197dbc6 | 2021-10-31 | 20 |
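
The three activity-level questions can each be answered with a single `groupby`. The following is a minimal sketch on a made-up toy frame, assuming the same column names (`userid`, `dt`, `activity_level`) that appear in the output above; the values are invented for illustration and are not taken from `activity_pretest.csv`.

```python
import pandas as pd

# Toy stand-in for activity_pretest.csv (values are made up for illustration).
toy = pd.DataFrame({
    "userid": ["a", "b", "c", "a", "b", "c"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 1, 2, 1, 1, 0],
})

# Question 1: distinct activity levels, counting level 0 (no activity).
n_levels = toy["activity_level"].nunique()

# Question 2: distinct users observed at each activity level.
users_per_level = toy.groupby("activity_level")["userid"].nunique()

# Question 3: distinct levels per day, and records per (day, level) pair.
levels_per_day = toy.groupby("dt")["activity_level"].nunique()
records_per_day_level = (
    toy.groupby(["dt", "activity_level"]).size().unstack(fill_value=0)
)

print(n_levels)
print(users_per_level)
print(levels_per_day)
print(records_per_day_level)
```

Using `nunique()` rather than `count()` matters whenever a user can appear more than once at the same level, since `count()` counts rows and not distinct users.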


### Daily active users (DAU)

![ab_test](./img/user_activity_ab_testinG.JPG)


Daily active users (DAU) refers to the number of users that are active on a given day (an activity level of zero means no activity). You must calculate this metric and provide your insights about it.

__Dataset:__ `activity_pretest.csv`

In [8]:
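# your-code
df1.describe()
# DAU = number of users active per day, i.e. the count of userid per day.
# 'count' counts records, so it equals the number of users only if each user has one record per day.
# The rows shown in the output have about 30,600 active users per day.
daudf = df1.groupby('dt') \
       .agg({'userid':'count', 'activity_level':'mean'}) \
       .reset_index()
# Note: despite its name, dau_mean is the average of the daily mean activity level, not the average DAU.
dau_mean = daudf["activity_level"].mean()
dau_mean
daudf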

| | dt | userid | activity_level |
|---|---|---|---|
| 0 | 2021-10-01 | 30634 | 10.265881 |
| 1 | 2021-10-02 | 30775 | 10.267782 |
| 2 | 2021-10-03 | 30785 | 10.225045 |
| 3 | 2021-10-04 | 30599 | 10.288473 |
| 4 | 2021-10-05 | 30588 | 10.26566 |
| 5 | 2021-10-06 | 30639 | 10.218545 |
| 6 | 2021-10-07 | 30637 | 10.246565 |
| 7 | 2021-10-08 | 30600 | 10.29732 |
| 8 | 2021-10-09 | 30902 | 10.266488 |
| 9 | 2021-10-10 | 30581 | 10.28521 |
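
As a cross-check on the definition of DAU, the sketch below counts distinct active users per day on a made-up toy frame. It assumes the same `userid`, `dt` and `activity_level` columns as above; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the activity data (values are made up for illustration).
toy = pd.DataFrame({
    "userid": ["a", "b", "c", "a", "b", "c"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 3, 5, 2, 0, 0],
})

# A user is active on a day if their activity level is above zero.
active = toy[toy["activity_level"] > 0]

# DAU: number of distinct active users for each day.
dau = active.groupby("dt")["userid"].nunique().rename("dau")

print(dau)
print("average DAU:", dau.mean())
```

If every user has exactly one record per day, `nunique()` and `count()` give the same result; `nunique()` is the safer choice when that is not guaranteed.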


### Click-through rate (CTR)

![ab_test](./img/ad_click_through_rate_ab_testing.JPG)

Click-through rate (CTR) is the percentage of the ads shown to a user on a given day that the user clicked. You must analyse this metric (e.g., average CTR per day) and provide your insights about it.

__Dataset:__ `ctr_pretest.csv`

In [15]:
# your-code
df3=pd.read_csv('./data/ctr_pretest.csv')

df3.describe()
ctr_mean = df3["ctr"].mean()
ctr_mean

33.00024155646148
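
The overall mean hides any day-to-day variation. A minimal sketch of the "average CTR per day" analysis follows, on a made-up toy frame. The `ctr` column is the one used above; the `userid` and `dt` columns are assumptions about the layout of `ctr_pretest.csv`, and the values are invented.

```python
import pandas as pd

# Toy stand-in for ctr_pretest.csv (column layout assumed, values made up).
toy_ctr = pd.DataFrame({
    "userid": ["a", "b", "a", "b"],
    "dt": ["2021-10-01", "2021-10-01", "2021-10-02", "2021-10-02"],
    "ctr": [30.0, 34.0, 32.0, 36.0],
})
toy_ctr["dt"] = pd.to_datetime(toy_ctr["dt"])

# Per-day mean, spread and number of records of the CTR.
daily_ctr = toy_ctr.groupby("dt")["ctr"].agg(["mean", "std", "count"])

print(daily_ctr)
```

Looking at the daily mean alongside its standard deviation makes it easier to judge whether the pretest CTR is stable over time.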

---

## Pretest metrics 

In this section you will analyse the metrics using the datasets that include the results for the test and control groups, but only for the pretest data (i.e., prior to November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups prior to the start of the experiment. You must try different approaches (i.e., __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

In [21]:
# your-code
from statsmodels.stats.weightstats import ztest
from scipy import stats
df4=pd.read_csv('./data/activity_all.csv')
df4_x= df4[df4.activity_level != 0]
df4_DAU=df4_x.groupby(['dt','groupid']).agg({'userid':'count', 'activity_level':'mean'}).reset_index()
df4_DAU["dt"] = pd.to_datetime(df4_DAU["dt"]) 

# Pretest = df4_DAU before 2021-11-01; compare group 1 with group 0. H0: the means are equal; H1: the means are different (NO!!)
mask =  df4_DAU['dt'] < '2021-11-1'
df_DAU_pretest = df4_DAU.loc[mask]

group_0_mean = df_DAU_pretest.loc[df_DAU_pretest['groupid'] == 0,'activity_level'].mean()
group_1_mean=df_DAU_pretest.loc[df_DAU_pretest['groupid'] == 1,'activity_level'].mean()
alpha = 0.05
print(f'Hypothesis mean: {group_0_mean}',
      f'\nSample mean: {group_1_mean}',
      f'\nProbability threshold: {alpha}')
Z_score, p_value = ztest(df_DAU_pretest.loc[df_DAU_pretest['groupid'] == 1,'activity_level'], value=group_0_mean)#alternative='smaller')

print(f'Z_score: {Z_score}', f'\np-value: {p_value}')


Hypothesis mean: 10.254766009835997 
Sample mean: 10.257895924766082 
Probability threshold: 0.05
Z_score: 0.3821029513866412 
p-value: 0.7023850027341372


In [20]:
group_0_mean

10.254766009835997

In [22]:
t_value, p_value = stats.ttest_1samp(df_DAU_pretest.loc[df_DAU_pretest['groupid'] == 1,'activity_level'], group_0_mean)
print('p_value: ', p_value)

p_value:  0.7050784751630224


In [24]:
df5=pd.read_csv('./data/ctr_all.csv')
# Pretest = df5 before 2021-11-01; compare group 1 with group 0. H0: the means are equal; H1: the means are different (NO!!)
df5["dt"] = pd.to_datetime(df5["dt"]) 
mask2 =  df5['dt'] < '2021-11-1'
df5_pretest = df5.loc[mask2]

group_0_mean2 = df5_pretest.loc[df5_pretest['groupid'] == 0,'ctr'].mean()

alpha = 0.05
Z_score, p_value = ztest(df5_pretest.loc[df5_pretest['groupid'] == 1,'ctr'], value=group_0_mean2)#alternative='smaller')

print(f'Z_score: {Z_score}', f'\np-value: {p_value}')

Z_score: -0.534881018112348 
p-value: 0.5927321349138212


In [25]:
t_value, p_value = stats.ttest_1samp(df5_pretest.loc[df5_pretest['groupid'] == 1,'ctr'], group_0_mean2)
print('p_value: ', p_value)

p_value:  0.5927323848018218
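
The cells above run one-sample tests: group 1 is compared against the group 0 mean treated as a fixed, known value, which ignores the sampling variability of group 0. A two-sample test accounts for the variability of both groups. The sketch below is a standard-library illustration of a two-sided, two-sample z-test with unpooled variances; the helper name `two_sample_ztest` and the sample values are hypothetical and not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_ztest(sample_a, sample_b):
    """Two-sided z-test for a difference in means (unpooled variances)."""
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    z = (mean(sample_b) - mean(sample_a)) / se
    # Two-sided p-value from the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up daily mean activity levels for the control and test groups.
control = [10.1, 10.3, 10.2, 10.4, 10.0, 10.2]
treatment = [10.2, 10.3, 10.1, 10.4, 10.1, 10.3]

z, p = two_sample_ztest(control, treatment)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

The libraries already imported in this notebook offer two-sample variants as well: `ztest` from statsmodels accepts a second sample through its `x2` argument, and `scipy.stats.ttest_ind(a, b, equal_var=False)` runs Welch's t-test. With only about one observation per day and group, the t-test is the more cautious choice; the z-test is a large-sample approximation.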


---

## Experiment metrics 

In this section you must perform the same analysis as in the previous section, but using the data generated during the experiment (i.e., from November 1st, 2021 onwards). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups during the experiment. You must try different approaches (i.e., __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

In [26]:
# your-code
# Test = all DAU data from 2021-11-01 onwards; compare group 1 with group 0. H0: the means are equal; H1: the means are different (NO!!)


mask3 =  df4_DAU['dt'] >= '2021-11-1'
df4_DAU_exp = df4_DAU.loc[mask3]

group_0_mean3 = df4_DAU_exp.loc[df4_DAU_exp['groupid'] == 0,'activity_level'].mean()

alpha = 0.05
Z_score, p_value = ztest(df4_DAU_exp.loc[df4_DAU_exp['groupid'] == 1,'activity_level'], value=group_0_mean3)#alternative='smaller')

print(f'Z_score: {Z_score}', f'\np-value: {p_value}')


Z_score: -0.23986299777142084 
p-value: 0.8104364670304072


In [27]:
t_value, p_value = stats.ttest_1samp(df4_DAU_exp.loc[df4_DAU_exp['groupid'] == 1,'activity_level'], group_0_mean3)
print('p_value: ', p_value)


p_value:  0.8121238443351877


In [28]:
# Test = all CTR data from 2021-11-01 onwards; compare group 1 with group 0. H0: the means are equal; H1: the means are different (NO!!)


mask4 =  df5['dt'] >= '2021-11-1'
df5_exp = df5.loc[mask4]

group_0_mean4 = df5_exp.loc[df5_exp['groupid'] == 0,'ctr'].mean()

alpha = 0.05
Z_score, p_value = ztest(df5_exp.loc[df5_exp['groupid'] == 1,'ctr'], value=group_0_mean4)#alternative='smaller')

print(f'Z_score: {Z_score}', f'\np-value: {p_value}')






Z_score: 2706.0742317177355 
p-value: 0.0


In [29]:
t_value, p_value = stats.ttest_1samp(df5_exp.loc[df5_exp['groupid'] == 1,'ctr'], group_0_mean4)
print('p_value: ', p_value)

p_value:  0.0
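
A p-value of 0.0 says the difference is detectable, but not how large it is. A confidence interval for the difference in means, together with the relative lift, helps with that. The sketch below uses only the standard library; the helper name `diff_in_means_ci` and the CTR values are hypothetical and not taken from `ctr_all.csv`.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(control, treatment, confidence=0.95):
    """Difference in means (treatment - control) with a normal-approximation CI."""
    diff = mean(treatment) - mean(control)
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    # Critical value of the standard normal for the requested confidence level.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se)


# Made-up CTR values (in percent) for the control and test groups.
control_ctr = [32.5, 33.1, 33.4, 32.8, 33.0, 33.2]
treatment_ctr = [37.9, 38.3, 37.6, 38.1, 38.4, 37.7]

diff, (low, high) = diff_in_means_ci(control_ctr, treatment_ctr)
relative_lift = diff / mean(control_ctr)

print(f"difference: {diff:.2f} points, 95% CI: ({low:.2f}, {high:.2f})")
print(f"relative lift: {relative_lift:.1%}")
```

An interval that excludes zero agrees with a significant test result, and its width shows how precisely the effect is estimated.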


---

## Conclusions

Please provide your conclusions after the analyses and your recommendation whether we may or may not implement the changes in the digital product.

# your-conclusions
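The A/B test shows a statistically significant positive change in CTR for the test group during the experiment (p-value of 0.0 in both the z-test and the t-test), whereas the groups did not differ significantly before the experiment (p-value of about 0.59). This change should therefore be taken into account.

On the other hand, the change in activity level is not statistically significant (p-value of about 0.81 during the experiment), so it is not necessary to take this change into account.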



---