## An A/B test analyzes whether an experimental group, to which a change has been applied, is statistically different from a control group, to which the change has not been applied.
## For example, we can use an A/B test to analyze a price elasticity of demand problem, that is, whether a change in price affects the quantity of a good demanded. Similarly, running an A/B test on a promotion in a subset of stores, to evaluate its impact on profits before launching the policy broadly, can minimize potential losses.



# First example: Randomized experiment
### Setting: a company wants to drive more customers to download its mobile app and register for the loyalty program.
### Strategy: replace the link to the app store with a button
### Evaluation: randomized experiment
### Expected result: more clicks through to the download app page

### In the randomized experiment we control for the following variables:
### i) to control for user characteristics, we need to guarantee that all kinds of users have a chance to visit the website during the experiment, so the experiment runs for at least 24 hours and at most a week
### ii) to control for repeat visits, we use the IP address as the unit of diversion. Using the IP address (or a proxy for it, such as a cookie) we count each person only once. Here we split at the IP level: each user is assigned to the treatment or control group based on their IP address, so a returning user sees the same version of the website (we do not want a user to see both versions)
### iii) to control for people who already have an account, we only consider visitors without an account
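
The unit-of-diversion idea in (ii) can be sketched as follows. This is only an illustration, not how the dataset was produced (there the group is derived from `ServerID`); the helper name `assign_group`, the salt string, and the 50/50 split are all assumptions.

```python
import hashlib

def assign_group(ip_address: str, salt: str = "button-test", treatment_share: float = 0.5) -> str:
    """Deterministically map an IP address to 'treatment' or 'control'.

    Hypothetical helper: hashing the IP (plus an experiment-specific salt)
    means a returning visitor always lands in the same group.
    """
    digest = hashlib.sha256(f"{salt}:{ip_address}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1)
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"

# The same IP always gets the same version of the site
first_visit = assign_group("39.13.114.2")
second_visit = assign_group("39.13.114.2")
print(first_visit, second_visit)
```

Because the assignment depends only on the IP address and the salt, no lookup table of previous visitors is needed.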

In [None]:
#let's import all the needed libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, ttest_rel

In [2]:
# load the grocery website data
grossery_web_data_orig = pd.read_csv('grocerywebsiteabtestdata.csv')
print(grossery_web_data_orig.head())

   RecordID   IP Address  LoggedInFlag  ServerID  VisitPageFlag
0         1  39.13.114.2             1         2              0
1         2    13.3.25.8             1         1              0
2         3  247.8.211.8             1         1              0
3         4  124.8.220.3             0         3              0
4         5  60.10.192.7             0         2              0


## Preparing the data for analysis

In [3]:

#i) each IP address should be counted only once, so we consolidate the data to one row per IP address
grossery_web_consolidated_data = grossery_web_data_orig.groupby(['IP Address', 'LoggedInFlag','ServerID'])['VisitPageFlag'].sum().reset_index(name='sum_VisitPageFlag')
# if an IP address has more than one visit to the page, we set its flag to one
grossery_web_consolidated_data['sum_VisitPageFlag'].max() # the max is 4
grossery_web_consolidated_data['visitFlag'] = np.where(grossery_web_consolidated_data['sum_VisitPageFlag'] != 0,1 ,0 )

#ii) we create the control and treatment group
grossery_web_consolidated_data['Group'] = np.where(grossery_web_consolidated_data['ServerID'] == 1, 'treatment' ,'control' )

#iii) remove users who already have an account (logged-in visitors)
grossery_web_consolidated_data = grossery_web_consolidated_data[grossery_web_consolidated_data['LoggedInFlag'] != 1]

grossery_web_consolidated_data.head()

Unnamed: 0,IP Address,LoggedInFlag,ServerID,sum_VisitPageFlag,visitFlag,Group
0,0.0.108.2,0,1,0,0,treatment
2,0.0.111.8,0,3,0,0,control
4,0.0.163.1,0,2,0,0,control
7,0.0.181.9,0,1,1,1,treatment
11,0.0.20.3,0,1,0,0,treatment


In [5]:
# We perform a test of means (Welch's t-test, since equal_var=False). It helps us decide whether the
# difference observed in the sample reflects a real difference in the population rather than random variation.
treatment = grossery_web_consolidated_data[grossery_web_consolidated_data['Group']=='treatment']
control = grossery_web_consolidated_data[grossery_web_consolidated_data['Group']=='control']

ttest_ind(treatment['visitFlag'], control['visitFlag'], equal_var = False)

Ttest_indResult(statistic=11.879472502167134, pvalue=1.781696815610413e-32)

The p-value is less than 0.05, so the difference in means is statistically significant: there is evidence to reject the null hypothesis that the two groups have the same mean. The t statistic is positive, so the treatment group has the higher rate. This suggests that the new version of the site (replacing the link to the app store with a button) improves the click-through rate for our download app page.
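
To make the test statistic concrete, the sketch below recomputes Welch's t (what `ttest_ind` reports with `equal_var=False`) by hand on simulated 0/1 data. The click rates, sample sizes, and seed are made up for illustration and are not the notebook's data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Simulated click flags (assumed rates, not the real data)
sim_treatment = rng.binomial(1, 0.23, size=5000)
sim_control = rng.binomial(1, 0.19, size=10000)

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

manual_t = welch_t(sim_treatment, sim_control)
scipy_t = ttest_ind(sim_treatment, sim_control, equal_var=False).statistic
print(manual_t, scipy_t)  # the two values should agree
```

The standard error uses each group's own variance and sample size, which is why the test does not require the two groups to have equal variances.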

In [7]:

pd.crosstab(grossery_web_consolidated_data.Group, grossery_web_consolidated_data.visitFlag, margins=True, margins_name="Total", normalize='index')

Group,visitFlag=0,visitFlag=1
control,0.814043,0.185957
treatment,0.767455,0.232545
Total,0.798477,0.201523


The previous table shows that the click-through rate of the treatment group (which sees the new version of the website) is about 4.7 percentage points higher than that of the control group (which sees the original website): 23.3% versus 18.6%.
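
The size of the effect can be read off the table in two ways, as an absolute difference and as a change relative to the control rate. A small sketch, with the two rates copied from the crosstab output (the variable names are illustrative):

```python
# Click-through rates copied from the crosstab output
control_rate = 0.185957
treatment_rate = 0.232545

absolute_lift = treatment_rate - control_rate  # difference between the two rates
relative_lift = absolute_lift / control_rate   # change relative to the control rate

print(f"absolute lift: {absolute_lift:.4f}")
print(f"relative lift: {relative_lift:.1%}")
```

The absolute lift is roughly 4.7 percentage points, which corresponds to a relative increase of roughly 25% over the control rate.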

# Second example: Matched pair design
## It is usually used when the number of observations is low and the concern about bias is great. For each treatment unit, one or more controls are selected. Each treatment store is matched to control stores that have very similar values of the control variables (for example: sales volume, products sold, and state).


### The treatment stores data is in the file cherry-product-treatment-stores-.csv

In [10]:

treatment_data = pd.read_csv('cherry-product-treatment-stores-.csv')
treatment_data.head()

Unnamed: 0,Store ID,City,State,Zip Code,Category Sales,Product Count,Size
0,112,Laguna Niguel,CA,92677,2692.66,6,Large
1,734,Los Angeles,CA,90007,7125.88,5,Large
2,1183,La Mesa,CA,91942,3166.64,5,Large
3,1594,San Jose,CA,95118,6875.2,6,Large
4,3064,Danville,CA,94526,84.31,3,Large


### The data for all stores is in the file newproductcontroldata.csv

In [11]:

all_data = pd.read_csv('newproductcontroldata.csv')
all_data.head()

Unnamed: 0,Store ID,City,State,Zip Code,Category Sales,Product Count,Size
0,1,ALABASTER,AL,35007,18.88,1,Large
1,2,BIRMINGHAM,AL,35209,44125.66,6,Large
2,3,DECATUR,AL,35601,46627.92,5,Large
3,4,HUNTSVILLE,AL,35806,26658.48,6,Large
4,5,MOBILE,AL,36606,1863.6,3,Large


### Let's look at stores in California

In [12]:

# so let's limit the control stores to California
ca_data = all_data[all_data['State']=='CA']

### We want to use the variables size and product count to match control stores and treatment stores. 

### First, we encode the categorical variable size to numeric.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Work on a copy so that adding a column to the filtered frame does not raise SettingWithCopyWarning
ca_data = ca_data.copy()

# Fit a single encoder on the sizes of both tables so that the same label
# is mapped to the same code in the control and the treatment data
enc = OrdinalEncoder()
enc.fit(pd.concat([ca_data[['Size']], treatment_data[['Size']]]))
ca_data['size_cat'] = enc.transform(ca_data[['Size']]).ravel()
treatment_data['size_cat'] = enc.transform(treatment_data[['Size']]).ravel()
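
By default `OrdinalEncoder` assigns codes in sorted (alphabetical) label order, so a label such as `Large` would receive a smaller code than `Small`. If the codes should follow the actual size ordering, one option is to state the order explicitly. The sketch below uses a toy frame and pandas only; the label set `Small`/`Medium`/`Large` is an assumption, since only `Large` is visible in the outputs shown here.

```python
import pandas as pd

# Toy data; the real label set of the 'Size' column is an assumption here
toy_stores = pd.DataFrame({'Size': ['Large', 'Small', 'Medium', 'Large']})

size_order = ['Small', 'Medium', 'Large']  # assumed ordering, smallest first
toy_stores['size_cat'] = pd.Categorical(
    toy_stores['Size'], categories=size_order, ordered=True
).codes

print(toy_stores)
```

With an explicit order, the distance between two codes reflects how far apart the sizes are, which matters once the codes are fed into a distance metric.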

### Now we perform the matching. We select five control matches per treatment store. The distance metric is the Euclidean distance (L2), so for each treatment store we look for the controls with the minimum distance in terms of size and product count.

In [18]:
from sklearn.metrics import pairwise_distances
K = 5  # 1:K matching: number of control matches per treatment store
features = ['size_cat', 'Product Count']

# candidate controls: California stores that are not treatment stores
all_controls_in_ca = ca_data[~ca_data['Store ID'].isin(treatment_data['Store ID'])]

id_treatment, id_control, sales_control = [], [], []
for i in range(len(treatment_data)):
    # Euclidean distance from treatment store i to every candidate control
    sim = pairwise_distances(treatment_data[features].iloc[[i]].to_numpy(),
                             all_controls_in_ca[features].to_numpy(), metric='l2')
    index_match = sim.argsort()[0, :K]  # positions of the K closest controls
    id_treatment.extend([treatment_data['Store ID'].iloc[i]] * K)
    id_control.extend(all_controls_in_ca['Store ID'].iloc[index_match])
    sales_control.extend(all_controls_in_ca['Category Sales'].iloc[index_match])

sales_control = np.array(sales_control)
matched_pair = pd.DataFrame({'Treatment Store': id_treatment, 'Control Store': id_control})
matched_pair

Unnamed: 0,Treatment Store,Control Store
0,112,2656
1,112,3506
2,112,1608
3,112,1171
4,112,5010
5,734,104
6,734,2654
7,734,2652
8,734,2550
9,734,2549
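
The nearest-control selection can also be illustrated on toy data with NumPy alone. The feature values below are made up, and `k` is set to 2 only to keep the output small; this is a sketch of the idea, not a replacement for the cell above.

```python
import numpy as np

# Toy feature matrices: columns are (size_cat, product_count); values are made up
toy_treatment = np.array([[0, 6], [0, 3]])
toy_controls = np.array([[0, 6], [0, 5], [1, 3], [0, 2], [0, 9]])
k = 2  # matches per treatment unit

# Pairwise Euclidean distances, shape (n_treatment, n_controls)
diff = toy_treatment[:, None, :] - toy_controls[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))

# Column positions of the k closest controls for each treatment row;
# a stable sort keeps the original order among tied distances
nearest = np.argsort(distances, axis=1, kind='stable')[:, :k]
print(nearest)
```

With only two coarse features, many controls are tied at the same distance, so which of them end up among the matches depends on how the sort breaks ties.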


### In the randomized controlled trial we compared the control and treatment groups using a t-test. In a matched pair design, we compare each treatment unit to one or more control units, so the comparison is unit by unit instead of group by group. Instead of comparing the mean and variance of the two groups (treatment and control), we look at the mean and variance of the differences within each matched pair and test whether the mean difference is statistically different from zero.
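
The equivalence described here can be checked numerically: a paired t-test gives the same result as a one-sample t-test on the pairwise differences against zero. The sales figures below are simulated for illustration (20 made-up pairs) and are not the store data.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

rng = np.random.default_rng(1)
# Simulated sales for 20 matched pairs (made-up numbers)
sim_control_sales = rng.normal(5000, 1500, size=20)
sim_treatment_sales = sim_control_sales + rng.normal(100, 400, size=20)

paired = ttest_rel(sim_treatment_sales, sim_control_sales)
# Same test expressed as a one-sample t-test on the pairwise differences
on_differences = ttest_1samp(sim_treatment_sales - sim_control_sales, popmean=0.0)

print(paired.statistic, on_differences.statistic)  # identical up to rounding
```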

In [20]:
ttest_rel(treatment_data.loc[treatment_data.index.repeat(K)]['Category Sales'],sales_control)

Ttest_relResult(statistic=-1.4533405618403128, pvalue=0.15250410633507738)

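### The p-value (0.15) is above 0.05, so we cannot reject the null hypothesis that the mean difference between matched pairs is zero. From this test, we conclude that there is no evidence that the introduction of the new product increases sales.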