# A/B Testing

## What is A/B testing?

Although well-chosen data visualizations can help uncover interesting *correlations* in our data, they cannot demonstrate *causality*. Fortunately, we can set up **randomized experiments** in order to do this. We'll look at one of the simplest versions, the **A/B test**.

We'll be working with data from an experiment conducted by Scott W. H. Young from Montana State University in 2013. We won't replicate the entire research project; instead, we'll focus on the parts that help us develop better intuition about how A/B tests can be useful in general.

You can find the dataset [here](https://scholarworks.montana.edu/xmlui/handle/1/3507), and the article [here](https://quod.lib.umich.edu/w/weave/12535642.0001.101?view=text;rgn=main#N3) if you want to learn more (note that our results won't necessarily match the published ones because of differences in how we define the response variables).

The experiment tested how slight modifications to the University library's homepage would affect user engagement with the "Interact" category, which offered person-to-person assistance and support on topics related to the library. Users had neglected this category in the past, and management's main goal was to increase clicks and user retention.

## Exploring our options

This was the original homepage:

![alt](data/images/interact_original.png "Interact, original")

Management decided that they wanted to test several variants of the name of the category, to see which one would attract the most users. The alternatives they considered were:

* **Interact** (the default category, hence the "control"), indexed as `/index.php`
* **Connect** (variant 1), indexed as `/index2.php`
* **Learn** (variant 2), indexed as `/index3.php`
* **Help** (variant 3), indexed as `/index4.php`
* **Services** (variant 4), indexed as `/index5.php`

They asked the web design team to come up with alternative homepages. This is the relevant part of the homepage for variant 1:

![alt](data/images/connect_original.png "Connect, original")

For variant 2:

![alt](data/images/learn_original.png "Learn, original")

For variant 3:

![alt](data/images/help_original.png "Help, original")

For variant 4:

![alt](data/images/services_original.png "Services, original")

As you can see, this test is a *multi-branched* experiment. The term "A/B test" is usually reserved for tests with a single treatment group (one control and one treatment); tests with several treatment groups are called multi-branched tests. However, the logic is exactly the same, only with more than one variant to try.

We've just defined our treatment variables. To keep things simple, our response variable will be the click-through rate. This rate tells you the clicks a link received as a percentage of the total number of clicks on the page. So if the homepage received, say, 500 clicks in a given time period, and the link got 20 clicks, then that link's click-through rate was 20/500 = 4%.
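
The arithmetic above can be written as a small function. This is only a minimal sketch; the name `click_through_rate` is ours and does not come from the dataset or the article.

```python
def click_through_rate(link_clicks, total_clicks):
    """Return a link's clicks as a percentage of all clicks on the page."""
    if total_clicks <= 0:
        raise ValueError("total_clicks must be positive")
    return 100 * link_clicks / total_clicks

# The example from the text: 20 clicks on the link out of 500 on the page
print(click_through_rate(20, 500))  # 4.0
```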

There are tools that make running an A/B test on a webpage very streamlined and efficient, like [Google Optimize](https://support.google.com/optimize/answer/6211930?hl=en), [CrazyEgg](https://www.crazyegg.com/ab-testing) and [Matomo](https://matomo.org/docs/ab-testing/). We won't use any of those here, but you are free to look into them if you'd like.

## The control group

The team collected data between May 29, 2013 and June 18, 2013 (a three-week period) with CrazyEgg. Users were randomly assigned to one of the five alternatives (either control or one of the four variants) when they visited the webpage.
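
To get a feel for what random assignment means, here is a minimal sketch that simulates it. It is not how CrazyEgg works internally; we simply assume that each visitor is equally likely to see any of the five pages, and the function name `assign_variant` is hypothetical.

```python
import random
from collections import Counter

# The five homepages listed earlier (control plus four variants)
VARIANTS = ['/index.php', '/index2.php', '/index3.php', '/index4.php', '/index5.php']

def assign_variant(rng):
    """Pick one of the five homepages uniformly at random for a new visitor."""
    return rng.choice(VARIANTS)

rng = random.Random(42)  # fixed seed so the simulation is reproducible
assignments = Counter(assign_variant(rng) for _ in range(10_000))
for page, visitors in sorted(assignments.items()):
    print(page, visitors)
```

With enough simulated visitors, each page receives roughly one fifth of the traffic, so the groups are comparable in everything except the label they see.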

Let's see how users interacted with each link on the homepage:

In [None]:
import pandas as pd
# Interact (control)
interact = pd.read_csv('data/interact.csv')
# Connect
connect = pd.read_csv('data/connect.csv')
# Learn
learn = pd.read_csv('data/learn.csv')
# Help
help_data = pd.read_csv('data/help.csv')
# Services
services = pd.read_csv('data/services.csv')

These files contain clicks per link for each variant. Let's examine the control group metrics (Interact) and plot them as a pie chart. The columns that interest us are `Name` and `No. clicks`:

In [None]:
interact.head(10)

The total number of clicks on the control page was:

In [None]:
sum(interact['No. clicks'])

In [None]:
# Keep only the page elements flagged as visible
interact = interact[interact['Visible?']==True]

In [None]:
# Pie chart: share of clicks for the main categories, with the rest grouped as "Others"
categories = ['FIND', 'Search', 'REQUEST', 'INTERACT']
interact_reduced = interact[interact['Name'].isin(categories)]
interact_reduced = interact_reduced.groupby('Name')['No. clicks'].sum()
others = pd.Series(interact['No. clicks'].sum() - interact_reduced.sum(), index=['Others'])
# Series.append was removed in pandas 2.0; pd.concat does the same job
interact_reduced = pd.concat([interact_reduced, others])
interact_reduced.plot.pie(figsize=(5, 5), autopct='%1.1f%%', pctdistance=1.3, labeldistance=1.5)
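
Since we will repeat this computation for every variant, it could be wrapped in a helper. The sketch below is one possible way to do it; the function name `click_shares` and the toy data are ours (the numbers are made up and are not the experiment's data), and the column names are assumed to match the CSV files.

```python
import pandas as pd

def click_shares(df, categories, clicks_col='No. clicks', name_col='Name'):
    """Return each category's percentage of all clicks, plus an 'Others' row."""
    total = df[clicks_col].sum()
    per_category = df[df[name_col].isin(categories)].groupby(name_col)[clicks_col].sum()
    others = pd.Series(total - per_category.sum(), index=['Others'])
    return pd.concat([per_category, others]) / total * 100

# Hypothetical toy data, NOT the real experiment data
toy = pd.DataFrame({
    'Name': ['FIND', 'Search', 'REQUEST', 'INTERACT', 'footer link'],
    'No. clicks': [40, 30, 20, 5, 5],
})
shares = click_shares(toy, ['FIND', 'Search', 'REQUEST', 'INTERACT'])
print(shares.round(1))
```

Calling `click_shares(interact, categories)` on the real data would give the same percentages that the pie chart displays.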

Let's have a look at a heatmap of click actions, which allows us to visualize exactly where and how frequently users interacted with various parts of a webpage:

![alt](data/images/interact_heatmap.jpg "Interact, heatmap")

## Variant 1 (Connect)

Let's do the same with variant 1:

In [None]:
sum(connect['No. clicks'])

In [None]:
# Keep only the page elements flagged as visible
connect = connect[connect['Visible?']==True]

In [None]:
# Pie chart: share of clicks for the main categories, with the rest grouped as "Others"
categories = ['FIND', 'Search', 'REQUEST', 'CONNECT']
connect_reduced = connect[connect['Name'].isin(categories)]
connect_reduced = connect_reduced.groupby('Name')['No. clicks'].sum()
others = pd.Series(connect['No. clicks'].sum() - connect_reduced.sum(), index=['Others'])
# Series.append was removed in pandas 2.0; pd.concat does the same job
connect_reduced = pd.concat([connect_reduced, others])
connect_reduced.plot.pie(figsize=(5, 5), autopct='%1.1f%%')

The Connect link receives a larger share of clicks than Interact did in the control group, so this variant drives our click-through rate up. The heatmap is:

![alt](data/images/connect_heatmap.jpg "Connect, heatmap")

Let's see if the other variants have good results as well.

## Variant 2 (Learn)

In [None]:
sum(learn['No. clicks'])

In [None]:
# Keep only the page elements flagged as visible
learn = learn[learn['Visible?']==True]

In [None]:
# Pie chart: share of clicks for the main categories, with the rest grouped as "Others"
categories = ['FIND', 'Search', 'REQUEST', 'LEARN']
learn_reduced = learn[learn['Name'].isin(categories)]
learn_reduced = learn_reduced.groupby('Name')['No. clicks'].sum()
others = pd.Series(learn['No. clicks'].sum() - learn_reduced.sum(), index=['Others'])
# Series.append was removed in pandas 2.0; pd.concat does the same job
learn_reduced = pd.concat([learn_reduced, others])
learn_reduced.plot.pie(figsize=(5, 5), autopct='%1.1f%%')

Now the heatmap:

![alt](data/images/learn_heatmap.jpg "Learn, heatmap")


## Variant 3 (Help)

In [None]:
sum(help_data['No. clicks'])

In [None]:
# Keep only the page elements flagged as visible
help_data = help_data[help_data['Visible?']==True]

In [None]:
# Pie chart: share of clicks for the main categories, with the rest grouped as "Others"
categories = ['FIND', 'Search', 'REQUEST', 'HELP']
help_data_reduced = help_data[help_data['Name'].isin(categories)]
help_data_reduced = help_data_reduced.groupby('Name')['No. clicks'].sum()
others = pd.Series(help_data['No. clicks'].sum() - help_data_reduced.sum(), index=['Others'])
# Series.append was removed in pandas 2.0; pd.concat does the same job
help_data_reduced = pd.concat([help_data_reduced, others])
help_data_reduced.plot.pie(figsize=(5, 5), autopct='%1.1f%%')

The click-through rate for Help is higher than that of the control group but lower than that of variant 1 (Connect). The heatmap is:

![alt](data/images/help_heatmap.jpg "Help, heatmap")

## Variant 4 (Services)

In [None]:
sum(services['No. clicks'])

In [None]:
# Keep only the page elements flagged as visible
services = services[services['Visible?']==True]

In [None]:
# Pie chart: share of clicks for the main categories, with the rest grouped as "Others"
categories = ['FIND', 'Search', 'REQUEST', 'SERVICES']
services_reduced = services[services['Name'].isin(categories)]
services_reduced = services_reduced.groupby('Name')['No. clicks'].sum()
others = pd.Series(services['No. clicks'].sum() - services_reduced.sum(), index=['Others'])
# Series.append was removed in pandas 2.0; pd.concat does the same job
services_reduced = pd.concat([services_reduced, others])
services_reduced.plot.pie(figsize=(5, 5), autopct='%1.1f%%')

This is the heatmap:

![alt](data/images/services_heatmap.jpg "Services, heatmap")

Services performs at almost the same level as Connect. The final results are:

| Variant       | Text     | Click-through rate (%) |
|---------------|----------|------------------------|
| Control group | Interact | 1.8                    |
| Variant 1     | Connect  | 3.6                    |
| Variant 2     | Learn    | 1.4                    |
| Variant 3     | Help     | 2.4                    |
| Variant 4     | Services | 3.5                    |

From what we see in this table, the best options, in order, are Connect, Services, Help, Interact and Learn. It could be argued that the difference between Services and Connect is too small to justify preferring one over the other. Tests of statistical significance address precisely this concern, which is why A/B tests routinely use them. You'll learn about these statistical tests in future cases.
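
As a small preview, here is a minimal sketch of one common choice for comparing two rates, the two-proportion z-test. The click counts are hypothetical placeholders chosen only to mimic rates of 3.6% and 3.5%; they are not values from the experiment. The test also treats every click as an independent trial, which is a simplifying assumption.

```python
import math

def two_proportion_z_test(clicks_a, total_a, clicks_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = clicks_a / total_a
    p_b = clicks_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal
    p_pool = (clicks_a + clicks_b) / (total_a + total_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (36/1000 vs. 35/1000), NOT the real data
z, p_value = two_proportion_z_test(36, 1000, 35, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

With totals this small, a gap of 0.1 percentage points produces a large p-value, so it could easily be due to chance. Whether the same holds for the real experiment depends on the actual click totals.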