Still to do
- BigQuery integration
- Writing the cache folders etc
- Review doc

## A/B Testing eCommerce Site with BigQuery

This report explores how we can analyse and visualise the results of an A/B test. If you want to run this notebook, the installation instructions are in [Readme.md](/Readme.md).

### Introduction

A/B testing is an experimental technique for determining whether a new design brings an improvement, according to a chosen metric. The idea is to compare an existing version of a feature on a website (A) with a new one (B) by randomly splitting traffic between them and comparing metrics on each split. A feature can be a whole web page or simply the colour of a button.

In this report we will learn how to pull in the data from BigQuery and analyse the results of an A/B test for an online eCommerce platform that wants to test its landing pages and drive higher conversion rates.

### Context

A company selling hotel rooms across Spain wants to know which landing page is the better one to send potential customers to. Does page A or page B give a better conversion rate?

- **Page A** lands the customer on the homepage of the site, with a main search bar, a view of all the regions in Spain and a few best-selling hotel options.
- **Page B** lands the customer on a city or region page related to the customer's search query, for example “hotels in Mallorca”.

In order to determine which page will give the better conversion rate, we will split the incoming traffic 50:50 to both pages.
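
The report does not say how the 50:50 split was implemented. As a minimal sketch of one common approach, assuming each session has a stable identifier, the session id can be hashed so that the same session always lands on the same page. The function name `assign_landing_page` and the salt value are hypothetical.

```python
import hashlib


def assign_landing_page(session_id: str, salt: str = "landing-page-test") -> str:
    """Deterministically assign a session to page 'A' or 'B' (roughly 50:50)."""
    # Hash the salted session id so the assignment is stable across requests.
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).hexdigest()
    # Interpret the hash as an integer: even values go to A, odd values to B.
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Hypothetical session ids, for illustration only.
assignments = [assign_landing_page(str(i)) for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"Share of sessions sent to page B: {share_b:.3f}")
```

Changing the salt reshuffles the assignment, which is useful when a later experiment should not reuse the same groups.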

Our evaluation metrics will include:
- **Conversion rate**
- **Average time per session**


First let's import the libraries we will require:

In [160]:
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build
import httplib2
import pandas as pd
import json
import numpy as np
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import plotly.offline as py
import plotly.graph_objs as go
import cufflinks as cf
from cufflinks import tools

from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

In [161]:
df = pd.read_csv('ab_test_data.csv')
df.head()

| Unnamed: 0 | Session Id | Conversion | Session Time | Landing Page |
|---|---|---|---|---|
| 0 | 9036 | 0 | 6.73 | A |
| 1 | 4129 | 0 | 1.75 | A |
| 2 | 3976 | 0 | 1.46 | A |
| 3 | 5052 | 0 | 4.4 | A |
| 4 | 8929 | 0 | 5.85 | A |


Let's extract the data for each landing page into arrays for use later in the analysis.

In [162]:
session_time_A = df.loc[df['Landing Page'] == 'A','Session Time'].values
session_time_B = df.loc[df['Landing Page'] == 'B','Session Time'].values

conversions_A = df.loc[df['Landing Page'] == 'A','Conversion'].values
conversions_B = df.loc[df['Landing Page'] == 'B','Conversion'].values

### What Are The Average Values per Landing Page?

Let's visualise the average conversion rate and session time for each landing page.

In [174]:
fig = df.groupby('Landing Page')[['Conversion', 'Session Time']].mean().iplot(
    asFigure=True, kind='bar',
    subplots=True, subplot_titles=True,
    legend=False)

fig.layout.height = 400
fig.layout.width = 800
fig.layout.template = 'plotly'

fig.layout.yaxis2.ticksuffix = 's'

fig.show()

Visually, it appears that landing page B has a slightly higher conversion rate, while landing page A may have a longer average session time.

Let's see if these insights hold true when put through statistical analysis.

### Conversion Rate Test Analysis

**H0:** Landing pages A and B show no difference in average conversion rate

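To evaluate the conversion rate result, we will use the two-sample proportions z-test (`proportions_ztest`) from the statsmodels package. For two samples, the two-sided version of this test is equivalent to a chi-squared test of independence without continuity correction.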

The two inputs are:
- `count` - the number of successes (conversions) in each sample.
- `nobs` - the number of observations (sessions) in each sample.

In [165]:
count = np.array([conversions_A.sum(), conversions_B.sum()])
nobs = np.array([len(conversions_A), len(conversions_B)])
stat, p = proportions_ztest(count, nobs)

print('Statistics=%.3f, p=%.3f' % (stat,p))

Statistics=-2.916, p=0.004


The p-value is 0.004, so we can reject the null hypothesis at the 1% significance level. There is a statistically significant difference between the conversion rates of landing pages A and B. The negative test statistic (the rate of A minus the rate of B) indicates that landing page B has the higher rate.
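
To make the test statistic less of a black box, here is a minimal sketch of the pooled two-proportion z-test written out by hand with only the standard library. To the best of my understanding this matches the default behaviour of `proportions_ztest` for two samples, but treat that as an assumption. The function name `two_proportion_ztest` and the counts are hypothetical and are not taken from `ab_test_data.csv`.

```python
import math


def two_proportion_ztest(count_a, nobs_a, count_b, nobs_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = count_a / nobs_a
    p_b = count_b / nobs_b
    # Under H0 both groups share one rate, so pool the conversions.
    p_pool = (count_a + count_b) / (nobs_a + nobs_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs_a + 1 / nobs_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_ztest(count_a=100, nobs_a=1000, count_b=130, nobs_b=1000)
print(f"z={z:.3f}, p={p_value:.3f}")
```

The sign of `z` follows the order of the arguments: it is negative when the first group converts at a lower rate than the second.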

In [197]:
print('Landing page B has an average conversion rate higher than page A by:')
print(round((conversions_B.sum()/len(conversions_B)-conversions_A.sum()/len(conversions_A))*100,2),'% points')

Landing page B has an average conversion rate higher than page A by:
6.4 % points
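
A point estimate in percentage points says nothing about its uncertainty. As a hedged sketch, the difference can be accompanied by an approximate 95% Wald confidence interval, and the absolute difference can be distinguished from the relative lift. The helper `diff_in_proportions_ci` and the counts below are hypothetical; they are not the values from the dataset.

```python
import math


def diff_in_proportions_ci(count_a, nobs_a, count_b, nobs_b, z_crit=1.96):
    """Wald interval for (rate_B - rate_A); z_crit=1.96 gives roughly 95% coverage."""
    p_a = count_a / nobs_a
    p_b = count_b / nobs_b
    diff = p_b - p_a
    # Unpooled standard error: each group keeps its own variance estimate.
    se = math.sqrt(p_a * (1 - p_a) / nobs_a + p_b * (1 - p_b) / nobs_b)
    return diff, diff - z_crit * se, diff + z_crit * se


# Hypothetical counts, for illustration only.
diff, lower, upper = diff_in_proportions_ci(100, 1000, 130, 1000)
relative_lift = diff / (100 / 1000)  # difference relative to page A's rate

print(f"Absolute difference: {diff * 100:.2f} percentage points")
print(f"Approx. 95% CI: {lower * 100:.2f} to {upper * 100:.2f} percentage points")
print(f"Relative lift over page A: {relative_lift * 100:.1f}%")
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the corresponding significance level.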


### Average Session Time Test Analysis

**H0:** Landing pages A and B show no difference in average session time

To evaluate the session time result, we will use the independent two-sample t-test (`ttest_ind`) from the scipy package. By default it assumes that the two groups have equal variance.

In [169]:
stat, p = ttest_ind(session_time_A, session_time_B)
print('Statistics=%.3f, p=%.3f' % (stat, p))

Statistics=1.877, p=0.061


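The p-value in this case is 0.061, so we can reject the null hypothesis at the 10% significance level but not at the 5% level. The evidence for a difference in session time is therefore weaker than the evidence for a difference in conversion rate.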
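
Session times on the two pages need not have the same variance. Welch's t-test drops the equal-variance assumption, and in scipy it is available as `ttest_ind(a, b, equal_var=False)`. The sketch below, using only the standard library, shows how the Welch statistic and its degrees of freedom are computed. The function name `welch_t` and the sample values are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical session times, for illustration only.
times_a = [6.7, 1.8, 1.5, 4.4, 5.9]
times_b = [5.1, 2.3, 1.2, 3.9, 4.8]

t, df = welch_t(times_a, times_b)
print(f"t={t:.3f}, df={df:.1f}")
```

The p-value then comes from a t distribution with `df` degrees of freedom, which is what scipy computes internally when `equal_var=False`.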

In [203]:
print('Landing page A has an average session time higher than page B by:')
print(round((session_time_A.mean()/session_time_B.mean()-1)*100,2),'%')

Landing page A has an average session time higher than page B by:
7.19 %


- Overall, customers convert significantly better on landing page B, while the average session time is higher on landing page A, although that difference is significant only at the 10% level.
- One possible explanation is that page A lets customers browse the homepage and spend more time looking for the page or hotel they are interested in, whereas page B takes them deeper into the funnel, directly to the area they searched for. This would make page B more relevant and more likely to convert.