## TODOs
- Check if there is any user in A and B
- Remove 0s (condition is that revenue es > 0)

### Imports

In [1]:
import sqlite3
import warnings

import pandas as pd
from pandas.core.common import SettingWithCopyWarning
import altair as alt
from scipy import stats
from statsmodels.stats import weightstats

In [2]:
# This will remove unnecessary warnings that jupyter always rise
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

### Constants

In [3]:
DB_FILE = 'Data scientist exercise.db'
DB_SELECT = 'SELECT * from access_log'

# Column names
EVENT_TYPE = 'event_type'
REVENUE = 'revenue'
VARIANT = 'variant'
CITY = 'city'
USER_ID = 'user_id'

# Event type values
EVENT_TYPE_P_VIEW = 'property_view'
EVENT_TYPE_F_ADDED = 'property_favorite_added'
EVENT_TYPE_B_REQUEST = 'booking_request'

### Connection and data read

In [4]:
conn = sqlite3.connect(DB_FILE)
data = pd.read_sql(DB_SELECT, con=conn)

In [5]:
# Para no "ensuciar" el dataframe original y leer la bbdd todo el rato, trabajaremos con la copia por ahora
df = data.copy()

### Data Exploratory and Cleanse

Since we are interested in studying the *Booking Requests*, we will filter it now and later we will use the global dataframe to get other variables.

In [23]:
df_br = df[df[EVENT_TYPE]==EVENT_TYPE_B_REQUEST].reset_index(drop=True).copy()
df_br.head()

Unnamed: 0,datetime,user_id,variant,city,event_type,revenue
0,2021-08-01 00:50:26,556943737,A,rome,booking_request,211.234257
1,2021-08-01 05:02:36,630002484,B,madrid,booking_request,
2,2021-08-01 08:39:23,741334523,B,madrid,booking_request,288.934075
3,2021-08-01 16:29:02,715418235,A,madrid,booking_request,212.030443
4,2021-08-01 16:30:38,107297881,A,rome,booking_request,175.026215


We have two groups, A/B, one is control and the other is treatment, and roll is assigned per user. Treatment group have higher fees that it is included in the shopping cart. We need to test if the **Conversion Rate** is affected by the increase in price but the extra revenue earned actually compensates it. We are going to assume that the fee charged is exactly the revenue we get.

In [24]:
print('Number of records with negative or zero revenue for Booking Requests: {}'.format(len(df_br[df_br[REVENUE]<=0])))

Number of records with negative or zero revenue for Booking Requests: 22


In [25]:
print('Number of records for Booking Requests group A: {}'.format(len(df_br[df_br[VARIANT]=='A'])))
print('Number of records for Booking Requests group B: {}'.format(len(df_br[df_br[VARIANT]=='A'])))

Number of records for Booking Requests group A: 226
Number of records for Booking Requests group B: 226


In [27]:
df_br_na = df_br[df_br[REVENUE].isna()]
print('Number of NA records for Booking Requests: {}'.format(len(df_br_na)))
alt.Chart(df_br_na).mark_bar().encode(
    x=alt.X('count({})'.format(VARIANT)),
    y=alt.Y(VARIANT)
)

Number of NA records for Booking Requests: 16


In [31]:
df_br_zero = df_br[df_br[REVENUE]==0]
print('Number of zero records for Booking Requests: {}'.format(len(df_br_zero)))
alt.Chart(df_br_zero).mark_bar().encode(
    x=alt.X('count({})'.format(VARIANT)),
    y=alt.Y(VARIANT)
)

Number of zero records for Booking Requests: 22


In [29]:
alt.Chart(df_br_na).mark_bar().encode(
    x=alt.X('count({})'.format(CITY)),
    y=alt.Y(CITY)
)

In [30]:
alt.Chart(df_br_zero).mark_bar().encode(
    x=alt.X('count({})'.format(CITY)),
    y=alt.Y(CITY)
)

After some data exploration we find out that there are 16 Booking Requests with *null* revenue and 22 zeros, and it does not seem to be related with the city. Therefore, since we have *12+7=19* records from group **A** and *9+10=19* records from group **B**, the solution proposed in this case is to remove all the Null and zero values from the study since it will not affect the distribution of the data.

In [32]:
df_br = df_br[df_br[REVENUE].notnull()]
df_br = df_br[df_br[REVENUE]>0]

In [34]:
df_to_work_with = df_br.groupby([USER_ID, VARIANT]).agg({REVENUE: 'sum'}, axis=1).reset_index()

In [35]:
selection = alt.selection_multi(fields=[VARIANT], bind='legend')
alt.Chart(df_to_work_with).mark_bar().encode(
    x=alt.X(REVENUE, bin=True),
    y=alt.Y('count()', stack=None),
    color=alt.Color(VARIANT),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.1))
).add_selection(selection)

Even though our data seem to be already a Normal Distribution of known mean and standard deviation, it is better if we standarize it to be normal distributions with mean 0 and standard deviation 1

In [63]:
df_A = df_to_work_with[df_to_work_with[VARIANT]=='A'].copy()
df_B = df_to_work_with[df_to_work_with[VARIANT]=='B'].copy()

In [37]:
A_MEAN = df_A[REVENUE].mean()
B_MEAN = df_B[REVENUE].mean()

A_STD = df_A[REVENUE].std()
B_STD = df_B[REVENUE].std()

df_A[REVENUE] = (df_A[REVENUE] - A_MEAN)/A_STD
df_B[REVENUE] = (df_B[REVENUE] - B_MEAN)/B_STD

revenue_A = df_A[REVENUE].values
revenue_B = df_B[REVENUE].values

In [38]:
df_new = df_A.append(df_B, ignore_index=True)
selection = alt.selection_multi(fields=[VARIANT], bind='legend')
alt.Chart(df_new).mark_bar().encode(
    x=alt.X(REVENUE, bin=True),
    y=alt.Y('count()', stack=None),
    color=alt.Color(VARIANT),
    opacity=alt.condition(selection, alt.value(1), alt.value(0))
).add_selection(selection)

In [39]:
TOTAL_USERS = len(df[USER_ID].unique())
TOTAL_USERS_A = len(df.loc[df[VARIANT]=='A',USER_ID].unique())
TOTAL_USERS_B = len(df.loc[df[VARIANT]=='B',USER_ID].unique())
USERS_WITH_BR = len(df_br[USER_ID].unique())
USERS_WITH_BR_A = len(df_br.loc[df_br[VARIANT]=='A',USER_ID].unique())
USERS_WITH_BR_B = len(df_br.loc[df_br[VARIANT]=='B',USER_ID].unique())
CVR_A = USERS_WITH_BR_A/TOTAL_USERS_A
CVR_B = USERS_WITH_BR_B/TOTAL_USERS_B



In [40]:
CVR_A

0.24182242990654207

In [41]:
CVR_B

0.175990675990676

## Two sampled Z-Test

In [66]:
_, pval = weightstats.ztest(df_A[REVENUE].values, df_B[REVENUE].values, value=df_A[REVENUE].mean()-df_B[REVENUE].mean(), 
                            alternative='two-sided')
print('P-value={}'.format(pval))

P-value=1.0
