## A/B Testing for ShoeFly.com
Our favorite online shoe store, ShoeFly.com, is running an A/B test. They have two different versions of an ad, which they have placed in emails and in banner ads on Facebook, Twitter, and Google. They want to know how the two ads are performing on each platform and on each day of the week. Help them analyze the data using aggregate measures.

In [9]:
import pandas as pd
import numpy as np

In [10]:
ad_clicks = pd.read_csv('C:/Users/ushai/Dropbox/Data Science/CodeCademy/P-R-O-J-E-C-T-S/P.6. Pandas - A,B Testing for ShoeFly.com - Aggregation/ad_clicks.csv')

## Analyzing Ad Sources

#### 1) Examine the first few rows of ad_clicks.

In [11]:
ad_clicks.head()

Unnamed: 0,user_id,utm_source,day,ad_click_timestamp,experimental_group
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B


#### 2) Your manager wants to know which ad platform is getting you the most views. How many views (i.e., rows of the table) came from each utm_source?

In [12]:
source = ad_clicks.groupby('utm_source').user_id.count().reset_index()

In [13]:
source

Unnamed: 0,utm_source,user_id
0,email,255
1,facebook,504
2,google,680
3,twitter,215
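As a cross-check on the grouping above, `value_counts()` should give the same per-source totals, sorted from most to least views. The sketch below rebuilds one row per view from the counts printed above; the `views` frame and its integer `user_id` values are placeholders for illustration, not the real data.

```python
import pandas as pd

# Per-source view counts copied from the output above.
counts = {'email': 255, 'facebook': 504, 'google': 680, 'twitter': 215}

# Rebuild one row per view (placeholder frame, not the real ad_clicks data).
views = pd.DataFrame({
    'utm_source': [src for src, n in counts.items() for _ in range(n)],
})
views['user_id'] = range(len(views))  # placeholder IDs, not the real UUIDs

# Same approach as the cell above: group, then count user_id per group.
by_groupby = views.groupby('utm_source').user_id.count()

# value_counts() returns the same counts, already sorted in descending order.
by_value_counts = views['utm_source'].value_counts()
top_source = by_value_counts.idxmax()
```

On these counts, `top_source` is the platform with the most views.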


#### 3) If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed. Create a new column called is_click, which is True if ad_click_timestamp is not null and False otherwise.


In [16]:
# .isnull() is True where the timestamp is missing (NaN); ~ inverts it, so is_click is True only where a timestamp exists
ad_clicks['is_click'] = ~ad_clicks.ad_click_timestamp.isnull()

In [17]:
ad_clicks.head()

Unnamed: 0,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A,True
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B,False
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A,False
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B,False
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B,False
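An equivalent way to write this is `.notna()`, which avoids the `~` inversion. The sketch below uses only the two relevant columns from the five rows shown above; the `sample` frame is a stand-in for `ad_clicks`.

```python
import numpy as np
import pandas as pd

# The five rows shown above, reduced to the columns that matter here.
sample = pd.DataFrame({
    'utm_source': ['google', 'facebook', 'twitter', 'google', 'facebook'],
    'ad_click_timestamp': ['7:18', np.nan, np.nan, np.nan, np.nan],
})

# notna() is True wherever a timestamp is present.
sample['is_click'] = sample['ad_click_timestamp'].notna()

# The inverted isnull() form used above gives the same result.
is_click_inverted = ~sample['ad_click_timestamp'].isnull()
```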


#### 4) We want to know the percent of people who clicked on ads from each utm_source. Start by grouping by utm_source and is_click and counting the number of user_id's in each of those groups. Save your answer to the variable clicks_by_source.

In [18]:
clicks_by_source = ad_clicks.groupby(['utm_source', 'is_click']).user_id.count().reset_index()

In [19]:
clicks_by_source

Unnamed: 0,utm_source,is_click,user_id
0,email,False,175
1,email,True,80
2,facebook,False,324
3,facebook,True,180
4,google,False,441
5,google,True,239
6,twitter,False,149
7,twitter,True,66


#### 5) Now let's pivot the data so that the columns are is_click (either True or False), the index is utm_source, and the values are user_id.


Save your results to the variable clicks_pivot.

In [36]:
clicks_pivot = clicks_by_source.pivot(columns='is_click',
                                      index='utm_source',
                                      values='user_id')

In [37]:
clicks_pivot

is_click,False,True
utm_source,Unnamed: 1_level_1,Unnamed: 2_level_1
email,175,80
facebook,324,180
google,441,239
twitter,149,66
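If the intermediate `clicks_by_source` table is not needed, `pd.crosstab` can go from row-level data to the same source-by-click layout in one call. This is a sketch rather than the notebook's method: the `views` frame is rebuilt from the counts in `clicks_by_source` from step 4 and stands in for `ad_clicks`.

```python
import pandas as pd

# (utm_source, is_click) -> number of users, copied from clicks_by_source in step 4.
counts = {
    ('email', False): 175, ('email', True): 80,
    ('facebook', False): 324, ('facebook', True): 180,
    ('google', False): 441, ('google', True): 239,
    ('twitter', False): 149, ('twitter', True): 66,
}

# Rebuild one row per user (stand-in for ad_clicks).
rows = [key for key, n in counts.items() for _ in range(n)]
views = pd.DataFrame(rows, columns=['utm_source', 'is_click'])

# Rows are sources, columns are is_click (False, True), cells are user counts.
clicks_table = pd.crosstab(views['utm_source'], views['is_click'])
```

The `True` column holds the users who clicked, so `clicks_table[True]['email']` is the number of email clicks.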


#### 6) Create a new column in clicks_pivot called percent_clicked which is equal to the percent of users who clicked on the ad from each utm_source.


Was there a difference in click rates for each source?

In [42]:
clicks_pivot['percent_clicked'] = (clicks_pivot[True] / (clicks_pivot[True] + clicks_pivot[False])) * 100

In [44]:
np.round(clicks_pivot,2)

is_click,False,True,percent_clicked
utm_source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
email,175,80,31.37
facebook,324,180,35.71
google,441,239,35.15
twitter,149,66,30.7
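Because `is_click` is boolean, its mean within a group is the share of users who clicked, so the percentage can also be computed without pivoting. A minimal sketch, assuming row-level data rebuilt from the counts in `clicks_by_source` from step 4 (the `views` frame is a stand-in for `ad_clicks`):

```python
import pandas as pd

# (utm_source, is_click) -> number of users, copied from clicks_by_source in step 4.
counts = {
    ('email', False): 175, ('email', True): 80,
    ('facebook', False): 324, ('facebook', True): 180,
    ('google', False): 441, ('google', True): 239,
    ('twitter', False): 149, ('twitter', True): 66,
}
rows = [key for key, n in counts.items() for _ in range(n)]
views = pd.DataFrame(rows, columns=['utm_source', 'is_click'])

# Mean of a boolean column = fraction of True values; scale to a percentage.
percent_clicked = (views.groupby('utm_source')['is_click'].mean() * 100).round(2)
```

This gives clicks divided by total views for each source, the same quantity as `True / (True + False)` in the pivot.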


## Analyzing an A/B Test

#### 7) The column experimental_group tells us whether the user was shown Ad A or Ad B. Were approximately the same number of people shown each ad?

In [45]:
experimental_group = ad_clicks.groupby('experimental_group').user_id.count().reset_index()

In [46]:
experimental_group

Unnamed: 0,experimental_group,user_id
0,A,827
1,B,827


#### 8) Using the column is_click that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B.

In [55]:
clicks = ad_clicks.groupby(['experimental_group', 'is_click']).user_id.count().reset_index()

In [56]:
clicks

Unnamed: 0,experimental_group,is_click,user_id
0,A,False,517
1,A,True,310
2,B,False,572
3,B,True,255


In [57]:
# Clicks pivot
clicks_pivot = clicks.pivot(columns = 'is_click', index = 'experimental_group', values = 'user_id')

In [58]:
clicks_pivot

is_click,False,True
experimental_group,Unnamed: 1_level_1,Unnamed: 2_level_1
A,517,310
B,572,255


In [59]:
clicks_pivot['percent_clicked'] = (clicks_pivot[True] / (clicks_pivot[True] + clicks_pivot[False])) * 100

In [60]:
clicks_pivot = np.round(clicks_pivot, 2)

In [61]:
clicks_pivot

is_click,False,True,percent_clicked
experimental_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,517,310,37.48
B,572,255,30.83


#### 9) The Product Manager for the A/B test thinks that the clicks might have changed by day of the week. Start by creating two DataFrames: a_clicks and b_clicks, which contain only the results for A group and B group, respectively.

In [20]:
a_clicks = ad_clicks[ad_clicks.experimental_group == 'A']

In [30]:
a_clicks.head()

Unnamed: 0,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A,True
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A,False
5,013b0072-7b72-40e7-b698-98b4d0c9967f,facebook,1 - Monday,,A,False
6,0153d85b-7660-4c39-92eb-1e1acd023280,google,4 - Thursday,,A,False
7,01555297-d6e6-49ae-aeba-1b196fdbb09f,google,3 - Wednesday,,A,False


In [31]:
b_clicks = ad_clicks[ad_clicks.experimental_group == 'B']

In [32]:
b_clicks.head()

Unnamed: 0,user_id,utm_source,day,ad_click_timestamp,experimental_group,is_click
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B,False
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B,False
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B,False
9,01a210c3-fde0-4e6f-8efd-4f0e38730ae6,email,2 - Tuesday,15:21,B,True
10,01adb2e7-f711-4ae4-a7c6-29f48457eea1,google,3 - Wednesday,,B,False


#### 10) For each group (a_clicks and b_clicks), calculate the percent of users who clicked on the ad by day.

### a_clicks by day

In [33]:
# Group by 'is_click' and 'day', then count users in each group.
a_clicks_by_day = a_clicks.groupby(['is_click','day']).user_id.count().reset_index()

In [34]:
a_clicks_by_day

Unnamed: 0,is_click,day,user_id
0,False,1 - Monday,70
1,False,2 - Tuesday,76
2,False,3 - Wednesday,86
3,False,4 - Thursday,69
4,False,5 - Friday,77
5,False,6 - Saturday,73
6,False,7 - Sunday,66
7,True,1 - Monday,43
8,True,2 - Tuesday,43
9,True,3 - Wednesday,38


In [36]:
# Pivot
a_clicks_by_day_pivot = a_clicks_by_day.pivot(columns = 'is_click', index = 'day', values = 'user_id')

In [37]:
a_clicks_by_day_pivot

is_click,False,True
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1 - Monday,70,43
2 - Tuesday,76,43
3 - Wednesday,86,38
4 - Thursday,69,47
5 - Friday,77,51
6 - Saturday,73,45
7 - Sunday,66,43


In [62]:
a_clicks_by_day_pivot['percent_clicked'] = (a_clicks_by_day_pivot[True] / (a_clicks_by_day_pivot[True] + a_clicks_by_day_pivot[False])) * 100


In [63]:
a_clicks_by_day_pivot = np.round(a_clicks_by_day_pivot, 2)

In [64]:
a_clicks_by_day_pivot

is_click,False,True,percent_clicked
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1 - Monday,70,43,38.05
2 - Tuesday,76,43,36.13
3 - Wednesday,86,38,30.65
4 - Thursday,69,47,40.52
5 - Friday,77,51,39.84
6 - Saturday,73,45,38.14
7 - Sunday,66,43,39.45


### b_clicks by day

In [39]:
b_clicks_by_day = b_clicks.groupby(['is_click','day']).user_id.count().reset_index()

In [40]:
b_clicks_by_day

Unnamed: 0,is_click,day,user_id
0,False,1 - Monday,81
1,False,2 - Tuesday,74
2,False,3 - Wednesday,89
3,False,4 - Thursday,87
4,False,5 - Friday,90
5,False,6 - Saturday,76
6,False,7 - Sunday,75
7,True,1 - Monday,32
8,True,2 - Tuesday,45
9,True,3 - Wednesday,35


In [41]:
b_clicks_by_day_pivot = b_clicks_by_day.pivot(columns = 'is_click', index = 'day', values = 'user_id')

In [42]:
b_clicks_by_day_pivot

is_click,False,True
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1 - Monday,81,32
2 - Tuesday,74,45
3 - Wednesday,89,35
4 - Thursday,87,29
5 - Friday,90,38
6 - Saturday,76,42
7 - Sunday,75,34


In [65]:
b_clicks_by_day_pivot['percent_clicked'] = (b_clicks_by_day_pivot[True] / (b_clicks_by_day_pivot[True] + b_clicks_by_day_pivot[False])) * 100
b_clicks_by_day_pivot = np.round(b_clicks_by_day_pivot, 2)
b_clicks_by_day_pivot

is_click,False,True,percent_clicked
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1 - Monday,81,32,28.32
2 - Tuesday,74,45,37.82
3 - Wednesday,89,35,28.23
4 - Thursday,87,29,25.0
5 - Friday,90,38,29.69
6 - Saturday,76,42,35.59
7 - Sunday,75,34,31.19


### A and B Comparison

In [66]:
a_clicks_by_day_pivot, b_clicks_by_day_pivot

(is_click       False  True  percent_clicked
 day                                        
 1 - Monday        70    43            38.05
 2 - Tuesday       76    43            36.13
 3 - Wednesday     86    38            30.65
 4 - Thursday      69    47            40.52
 5 - Friday        77    51            39.84
 6 - Saturday      73    45            38.14
 7 - Sunday        66    43            39.45,
 is_click       False  True  percent_clicked
 day                                        
 1 - Monday        81    32            28.32
 2 - Tuesday       74    45            37.82
 3 - Wednesday     89    35            28.23
 4 - Thursday      87    29            25.00
 5 - Friday        90    38            29.69
 6 - Saturday      76    42            35.59
 7 - Sunday        75    34            31.19)
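The A and B pipelines above repeat the same percentage calculation, so it can be factored into a small helper and the two results placed side by side. In this sketch, `add_percent_clicked`, `comparison`, and `A_minus_B` are illustrative names that do not appear in the notebook; the counts are copied from the two pivots above.

```python
import pandas as pd

def add_percent_clicked(pivot):
    """Return a copy of a False/True count table with a percent_clicked column."""
    out = pivot.copy()
    out['percent_clicked'] = (out[True] / (out[True] + out[False]) * 100).round(2)
    return out

days = ['1 - Monday', '2 - Tuesday', '3 - Wednesday', '4 - Thursday',
        '5 - Friday', '6 - Saturday', '7 - Sunday']

# Counts copied from a_clicks_by_day_pivot and b_clicks_by_day_pivot above.
a_counts = pd.DataFrame({False: [70, 76, 86, 69, 77, 73, 66],
                         True: [43, 43, 38, 47, 51, 45, 43]}, index=days)
b_counts = pd.DataFrame({False: [81, 74, 89, 87, 90, 76, 75],
                         True: [32, 45, 35, 29, 38, 42, 34]}, index=days)

# One column per ad, plus the difference in percentage points.
comparison = pd.DataFrame({
    'A': add_percent_clicked(a_counts)['percent_clicked'],
    'B': add_percent_clicked(b_counts)['percent_clicked'],
})
comparison['A_minus_B'] = (comparison['A'] - comparison['B']).round(2)

# Days on which Ad B had the higher click rate.
days_b_ahead = comparison.index[comparison['A_minus_B'] < 0].tolist()
```

A single table with a difference column makes the day-by-day comparison easier to read than two separately printed DataFrames.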

## Conclusion

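Ad A outperformed Ad B overall: 37.48% of users shown Ad A clicked, versus 30.83% for Ad B, with both groups the same size (827 users each). By day of the week, Ad A had the higher click rate on every day except Tuesday (36.13% vs. 37.82%).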
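The notebook stops at comparing percentages. One possible follow-up, which is not part of the original analysis, is a two-proportion z-test to gauge whether the gap between the ads could plausibly be chance. The sketch below uses only the click and view totals from step 8 and the standard library; it assumes users were assigned to groups independently.

```python
import math

# Totals from the experimental_group pivot in step 8.
clicks_a, views_a = 310, 827
clicks_b, views_b = 255, 827

rate_a = clicks_a / views_a
rate_b = clicks_b / views_b

# Pooled click rate under the null hypothesis that both ads perform the same.
pooled = (clicks_a + clicks_b) / (views_a + views_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))

# z statistic for the difference in click rates.
z = (rate_a - rate_b) / std_err

# Two-sided p-value from the standard normal CDF, written with math.erf.
normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
p_value = 2 * (1 - normal_cdf)
```

A small `p_value` would suggest the overall difference is unlikely to be noise alone. The per-day comparisons rest on far fewer users each and would need their own checks.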