# A/B Testing for ShoeFly.com

Our favorite online shoe store, ShoeFly.com, is running an A/B test. It has two versions of an ad, which it has placed in emails and in banner ads on Facebook, Twitter, and Google. We want to know how the two ads are performing on each platform and on each day of the week, so we will analyze the data using aggregate measures.

- Analyzing Ad Sources

In [5]:
import pandas as pd

ad_clicks = pd.read_csv('ad_clicks.csv')
print(ad_clicks.head())

                                user_id  ... experimental_group
0  008b7c6c-7272-471e-b90e-930d548bd8d7  ...                  A
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557  ...                  B
2  00f5d532-ed58-4570-b6d2-768df5f41aed  ...                  A
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439  ...                  B
4  012137e6-7ae7-4649-af68-205b4702169c  ...                  B

[5 rows x 5 columns]


We want to know which ad platform is getting the most views.

How many views (i.e., rows of the table) came from each utm_source?

In [6]:
most_views = ad_clicks.groupby('utm_source').user_id.count().reset_index()
print(most_views)

  utm_source  user_id
0      email      255
1   facebook      504
2     google      680
3    twitter      215



If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed. Let's make a new column called is_click, which is True if ad_click_timestamp is not null and False otherwise.

In [7]:
ad_clicks['is_click'] = ~ad_clicks.ad_click_timestamp.isnull()
print(ad_clicks.head())

                                user_id utm_source  ... experimental_group is_click
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google  ...                  A     True
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook  ...                  B    False
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter  ...                  A    False
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google  ...                  B    False
4  012137e6-7ae7-4649-af68-205b4702169c   facebook  ...                  B    False

[5 rows x 6 columns]
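
The same flag can also be written with `notna()`, which avoids the negation. The sketch below uses a tiny made-up frame (the rows and the timestamp value are hypothetical, not taken from ad_clicks.csv) to show that both forms give the same column.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from ad_clicks.csv.
sample = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'ad_click_timestamp': ['7:18', None, None],
})

# True where a timestamp is present, i.e. the ad was clicked.
sample['is_click'] = sample.ad_click_timestamp.notna()

# Compare with the negated isnull() form used in the notebook.
same = sample['is_click'].equals(~sample.ad_click_timestamp.isnull())
print(sample)
print(same)
```

Either spelling works; `notna()` simply reads a little closer to the sentence "the timestamp is not null".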



We want to know the percent of people who clicked on ads from each utm_source. We'll start by grouping by utm_source and is_click and counting the number of user_id values in each group.

In [8]:
clicks_by_source = ad_clicks.groupby(['utm_source', 'is_click']).user_id.count().reset_index()
print(clicks_by_source)

  utm_source  is_click  user_id
0      email     False      175
1      email      True       80
2   facebook     False      324
3   facebook      True      180
4     google     False      441
5     google      True      239
6    twitter     False      149
7    twitter      True       66


Now let's pivot the data so that the columns are is_click (either True or False), the index is utm_source, and the values are user_id.

In [10]:
clicks_pivot = clicks_by_source.pivot(
    columns='is_click',
    index='utm_source',
    values='user_id'
)
print(clicks_pivot)

is_click    False  True
utm_source             
email         175    80
facebook      324   180
google        441   239
twitter       149    66
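
In this data every source has both clicked and non-clicked rows, so the pivot has no gaps. If a source had no clicks at all, the pivoted cell would be NaN. A minimal sketch of one way to handle that, using hypothetical counts in which 'twitter' has no clicked row:

```python
import pandas as pd

# Hypothetical counts for illustration: 'twitter' has no clicked row.
counts = pd.DataFrame({
    'utm_source': ['email', 'email', 'twitter'],
    'is_click': [False, True, False],
    'user_id': [3, 1, 4],
})

pivot = counts.pivot(columns='is_click', index='utm_source', values='user_id')

# The missing (twitter, True) cell is NaN after pivoting; treat it as 0 clicks.
pivot = pivot.fillna(0)
pivot['percent_clicked'] = pivot[True] / (pivot[True] + pivot[False])
print(pivot)
```

Filling with 0 is an assumption that a missing combination really means "no such users"; it would not be appropriate if rows were missing for another reason.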


Let's create a new column in clicks_pivot called percent_clicked which is equal to the percent of users who clicked on the ad from each utm_source.

Was there a difference in click rates for each source?

In [11]:
clicks_pivot['percent_clicked'] = clicks_pivot[True] / (clicks_pivot[True] + clicks_pivot[False])
print(clicks_pivot)

is_click    False  True  percent_clicked
utm_source                              
email         175    80         0.313725
facebook      324   180         0.357143
google        441   239         0.351471
twitter       149    66         0.306977
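
As a cross-check on the pivot approach, the click rate can also be computed in one step, because the mean of a boolean column is the fraction of True values. The sketch below uses a small hypothetical frame rather than the real ad_clicks data.

```python
import pandas as pd

# Hypothetical views for illustration; not the real ad_clicks data.
views = pd.DataFrame({
    'utm_source': ['email', 'email', 'email', 'email', 'google', 'google'],
    'is_click': [True, False, False, False, True, False],
})

# The mean of a boolean column is the share of True values,
# so this gives the click rate per source.
click_rate = views.groupby('utm_source').is_click.mean()
print(click_rate)
```

The pivot version keeps the raw counts visible next to the rate, which is useful when group sizes differ; the one-step version is shorter when only the rate is needed.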


- Analyzing an A/B Test

The column experimental_group tells us whether the user was shown Ad A or Ad B.
Were approximately the same number of people shown both ads?

In [13]:
number_of_people = ad_clicks.groupby('experimental_group').user_id.count().reset_index()
print(number_of_people)

  experimental_group  user_id
0                  A      827
1                  B      827


Using the column is_click that we defined earlier, check whether a greater percentage of users clicked on Ad A or Ad B.

In [14]:
number_of_As_and_Bs = ad_clicks.groupby(['experimental_group', 'is_click']).user_id.count().reset_index()
print(number_of_As_and_Bs)

  experimental_group  is_click  user_id
0                  A     False      517
1                  A      True      310
2                  B     False      572
3                  B      True      255
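
These counts can be turned into click rates the same way as for the ad sources. The sketch below rebuilds the table from the counts printed above so that it runs on its own; the name `ab_pivot` is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the table printed above.
number_of_As_and_Bs = pd.DataFrame({
    'experimental_group': ['A', 'A', 'B', 'B'],
    'is_click': [False, True, False, True],
    'user_id': [517, 310, 572, 255],
})

# One row per ad, one column per is_click value.
ab_pivot = number_of_As_and_Bs.pivot(
    columns='is_click',
    index='experimental_group',
    values='user_id'
)

# Share of users in each group who clicked.
ab_pivot['percent_clicked'] = ab_pivot[True] / (ab_pivot[True] + ab_pivot[False])
print(ab_pivot)
```

This works out to 310 / 827, about 37.5%, for Ad A and 255 / 827, about 30.8%, for Ad B.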


The click rate might vary by day of the week. Let's start by creating two DataFrames, a_clicks and b_clicks, which contain only the results for group A and group B, respectively.

In [16]:
a_clicks = ad_clicks[ad_clicks.experimental_group == 'A'].reset_index()
b_clicks = ad_clicks[ad_clicks.experimental_group == 'B'].reset_index()
print(a_clicks)
print(b_clicks)

     index                               user_id  ... experimental_group is_click
0        0  008b7c6c-7272-471e-b90e-930d548bd8d7  ...                  A     True
1        2  00f5d532-ed58-4570-b6d2-768df5f41aed  ...                  A    False
2        5  013b0072-7b72-40e7-b698-98b4d0c9967f  ...                  A    False
3        6  0153d85b-7660-4c39-92eb-1e1acd023280  ...                  A    False
4        7  01555297-d6e6-49ae-aeba-1b196fdbb09f  ...                  A    False
..     ...                                   ...  ...                ...      ...
822   1643  fceb13ea-fd8c-446a-a61f-f977d404330a  ...                  A    False
823   1646  fd7d06ea-38b5-4ed9-acc9-777047db8c56  ...                  A    False
824   1647  fe570a20-448f-40ed-930b-8482b8a7c231  ...                  A     True
825   1649  fe8b5236-78f6-4192-9da6-a76bba67cfe6  ...                  A    False
826   1652  ff3af0d6-b092-4c4d-9f2e-2bdd8f7c0732  ...                  A     True

[827 rows x 7 columns]

For each group (a_clicks and b_clicks), we will calculate the percent of users who clicked on the ad by day.

In [17]:
clicks_by_day_A = a_clicks.groupby(['day', 'is_click']).user_id.count().reset_index()
clicks_by_day_B = b_clicks.groupby(['day', 'is_click']).user_id.count().reset_index()
clicks_A_pivot = clicks_by_day_A.pivot(
  columns = 'is_click',
  index = 'day',
  values = 'user_id'
)
clicks_A_pivot['percent_clicked'] = clicks_A_pivot[True] / (clicks_A_pivot[True] + clicks_A_pivot[False])
clicks_B_pivot = clicks_by_day_B.pivot(
  columns = 'is_click',
  index = 'day',
  values = 'user_id'
)
clicks_B_pivot['percent_clicked'] = clicks_B_pivot[True] / (clicks_B_pivot[True] + clicks_B_pivot[False])
print(clicks_A_pivot)
print(clicks_B_pivot)

is_click       False  True  percent_clicked
day                                        
1 - Monday        70    43         0.380531
2 - Tuesday       76    43         0.361345
3 - Wednesday     86    38         0.306452
4 - Thursday      69    47         0.405172
5 - Friday        77    51         0.398438
6 - Saturday      73    45         0.381356
7 - Sunday        66    43         0.394495
is_click       False  True  percent_clicked
day                                        
1 - Monday        81    32         0.283186
2 - Tuesday       74    45         0.378151
3 - Wednesday     89    35         0.282258
4 - Thursday      87    29         0.250000
5 - Friday        90    38         0.296875
6 - Saturday      76    42         0.355932
7 - Sunday        75    34         0.311927
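
To make the day-by-day comparison easier to read, the two tables can be placed side by side. The sketch below re-enters the counts printed above so it runs on its own; the names `comparison` and `days_b_ahead`, and the column labels `not_clicked` and `clicked`, are introduced here for illustration.

```python
import pandas as pd

days = ['1 - Monday', '2 - Tuesday', '3 - Wednesday', '4 - Thursday',
        '5 - Friday', '6 - Saturday', '7 - Sunday']

# Counts copied from the two tables printed above.
a_counts = pd.DataFrame({
    'not_clicked': [70, 76, 86, 69, 77, 73, 66],
    'clicked': [43, 43, 38, 47, 51, 45, 43],
}, index=days)
b_counts = pd.DataFrame({
    'not_clicked': [81, 74, 89, 87, 90, 76, 75],
    'clicked': [32, 45, 35, 29, 38, 42, 34],
}, index=days)

# Click rate per day for each ad, side by side.
comparison = pd.DataFrame({
    'A': a_counts.clicked / (a_counts.clicked + a_counts.not_clicked),
    'B': b_counts.clicked / (b_counts.clicked + b_counts.not_clicked),
})
comparison['difference'] = comparison['A'] - comparison['B']

# Days on which Ad B had the higher click rate.
days_b_ahead = comparison[comparison['difference'] < 0].index.tolist()
print(comparison)
print(days_b_ahead)
```

A negative difference marks a day on which Ad B outperformed Ad A; with these counts that happens only on Tuesday.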


Comparing the two ads, Ad A had a higher click rate overall (310 of 827 users, about 37.5%, versus 255 of 827, about 30.8%, for Ad B) and on every day of the week except Tuesday. We recommend using Ad A over Ad B.