## Copyright 2021 Parker Dunn parker_dunn@outlook.com
  
#### Alternate: pgdunn@bu.edu & pdunn91@gmail.com   
#### July 13th, 2021

### Codecademy - A/B Testing for ShoeFly.com

Skill Path: Analyze Data with Python  
Section: Data Manipulation with Pandas  
Topic: Aggregates in Pandas

### Assignment Context

This mini-project was a way to review generating aggregate statistics from data in a pandas DataFrame. I have now learned the basics of pandas DataFrames and how to create and view aggregate statistics from them. My previous mini-project, "Petal Power Inventory", demonstrates the basic skills, and this mini-project covers some of the fundamentals of generating and viewing statistics.

### Assignment Description

"Our favorite online shoe store, ShoeFly.com is performing an A/B test. They have two different versions of an ad, which they have placed in emails, as well as in banner ads on Facebook, Twitter, and Google. They want to know how the two ads are performing on each of the different platforms on each day of the week. Help them analyze the data using aggregate measures."

In [1]:
# import codecademylib
# Data provided for the assignment in "ad_clicks.csv"

"*import codecademylib*" was included in the assignment template. That package cannot be imported here, so the line is commented out.

The contents of *codecademylib* were not explicitly provided. The webpage and its contents can be downloaded, but the contents of the package are not clearly specified.

I was able to open and generate a copy of *ad_clicks.csv*, so that file can still be imported and the script can be run here.

In [4]:
import pandas as pd

ad_clicks = pd.read_csv("ad_clicks.csv")

print(ad_clicks.head(10))

                                user_id utm_source            day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google   6 - Saturday   
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook     7 - Sunday   
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter    2 - Tuesday   
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google    2 - Tuesday   
4  012137e6-7ae7-4649-af68-205b4702169c   facebook     7 - Sunday   
5  013b0072-7b72-40e7-b698-98b4d0c9967f   facebook     1 - Monday   
6  0153d85b-7660-4c39-92eb-1e1acd023280     google   4 - Thursday   
7  01555297-d6e6-49ae-aeba-1b196fdbb09f     google  3 - Wednesday   
8  018cea61-19ea-4119-895b-1a4309ccb148      email     1 - Monday   
9  01a210c3-fde0-4e6f-8efd-4f0e38730ae6      email    2 - Tuesday   

  ad_click_timestamp experimental_group  
0               7:18                  A  
1                NaN                  B  
2                NaN                  A  
3                NaN                  B  
4                NaN          

In [5]:
# Which ad platform is getting ShoeFly the most views?
# i.e. How many views come from each utm_source?

source_views = ad_clicks.groupby("utm_source").user_id.count().reset_index()
print(source_views)

  utm_source  user_id
0      email      255
1   facebook      504
2     google      680
3    twitter      215


In [6]:
# "If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed."
# creating a new column for this information

print(ad_clicks.iloc[0].ad_click_timestamp)
print(type(ad_clicks.iloc[0].ad_click_timestamp))

print(ad_clicks.iloc[1].ad_click_timestamp)
print(type(ad_clicks.iloc[1].ad_click_timestamp))

# Original solution:
# bool_click = lambda timestamp: True if (type(timestamp) == str) else False
# ad_clicks["is_click"] = ad_clicks.ad_click_timestamp.apply(bool_click)

# Revised solution (which is a little more efficient)
ad_clicks["is_click"] = ~ad_clicks.ad_click_timestamp.isnull()

print(ad_clicks.head(10))

7:18
<class 'str'>
nan
<class 'float'>
                                user_id utm_source            day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google   6 - Saturday   
1  009abb94-5e14-4b6c-bb1c-4f4df7aa7557   facebook     7 - Sunday   
2  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter    2 - Tuesday   
3  011adc64-0f44-4fd9-a0bb-f1506d2ad439     google    2 - Tuesday   
4  012137e6-7ae7-4649-af68-205b4702169c   facebook     7 - Sunday   
5  013b0072-7b72-40e7-b698-98b4d0c9967f   facebook     1 - Monday   
6  0153d85b-7660-4c39-92eb-1e1acd023280     google   4 - Thursday   
7  01555297-d6e6-49ae-aeba-1b196fdbb09f     google  3 - Wednesday   
8  018cea61-19ea-4119-895b-1a4309ccb148      email     1 - Monday   
9  01a210c3-fde0-4e6f-8efd-4f0e38730ae6      email    2 - Tuesday   

  ad_click_timestamp experimental_group  is_click  
0               7:18                  A      True  
1                NaN                  B     False  
2                NaN                  A     F
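
The original and revised solutions above should produce the same column. A minimal sketch on a toy DataFrame (the values below are a hypothetical stand-in for *ad_clicks.csv*, and `"15:03"` is a made-up timestamp) compares them, along with `notnull()`, which pandas provides as the direct inverse of `isnull()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for ad_clicks: two rows with a click time, two without.
# "15:03" is a made-up timestamp used only for illustration.
demo = pd.DataFrame({"ad_click_timestamp": ["7:18", np.nan, np.nan, "15:03"]})

# Original approach: a timestamp is a str, a missing value is a float NaN.
via_apply = demo.ad_click_timestamp.apply(lambda ts: isinstance(ts, str))

# Revised approach: invert the null mask.
via_isnull = ~demo.ad_click_timestamp.isnull()

# Equivalent spelling without the inversion.
via_notnull = demo.ad_click_timestamp.notnull()

print(via_notnull.tolist())
```

The vectorized versions avoid calling a Python function once per row, which is why the revised solution tends to be faster on larger files.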

In [9]:
# Need to figure out the percent of people who clicked on ads from each utm_source

clicks_by_source = ad_clicks.groupby(["utm_source","is_click"]).user_id.count().reset_index()
print(clicks_by_source.head(10),"\n")

clicks_pivot = clicks_by_source.pivot(columns="is_click",index="utm_source",values="user_id").reset_index()
# only used "user_id" as values because that's what was counted
print(clicks_pivot)

  utm_source  is_click  user_id
0      email     False      175
1      email      True       80
2   facebook     False      324
3   facebook      True      180
4     google     False      441
5     google      True      239
6    twitter     False      149
7    twitter      True       66 

is_click utm_source  False  True
0             email    175    80
1          facebook    324   180
2            google    441   239
3           twitter    149    66


Percentage of views that turned into clicks
- Email: about 31% of views resulted in clicks
- Facebook: about 36% of views resulted in clicks
- Google: about 35% of views resulted in clicks
- Twitter: about 31% of views resulted in clicks

Facebook and Google had the greatest total ad views (a.k.a. greatest number of times being displayed to customers) and also had the highest rates of views turning into clicks on the ShoeFly.com ad.

In [11]:
# Adding a percentage statistic to "clicks_pivot" for the rate of users clicking on the ad

print(clicks_pivot.columns,"\n")
clicks_pivot["percent_clicked"] = clicks_pivot[True] / (clicks_pivot[False] + clicks_pivot[True])
# My original solution had the True and False as strings not booleans
# NOTE: using True/False in this context seems unusual but works
print(clicks_pivot)

Index(['utm_source', False, True, 'percent_clicked'], dtype='object', name='is_click') 

is_click utm_source  False  True  percent_clicked
0             email    175    80         0.313725
1          facebook    324   180         0.357143
2            google    441   239         0.351471
3           twitter    149    66         0.306977


In [13]:
# Were approximately the same number of people shown both ads?


exp_group = ad_clicks.groupby("experimental_group").user_id.count().reset_index()
print(exp_group)

  experimental_group  user_id
0                  A      827
1                  B      827


In [15]:
# Check if a greater percentage of users clicked on Ad A or Ad B

clicks_by_ad = ad_clicks.groupby(["experimental_group","is_click"]).user_id.count().reset_index()
clicks_by_ad_pivot = clicks_by_ad.pivot(columns="is_click", index="experimental_group",values="user_id").reset_index()
print(clicks_by_ad_pivot)

is_click experimental_group  False  True
0                         A    517   310
1                         B    572   255


In [16]:
# The product manager has asked for information about each ad based on the day of the week

a_clicks = ad_clicks[ad_clicks.experimental_group == "A"].reset_index(drop=True)
b_clicks = ad_clicks[ad_clicks.experimental_group == "B"].reset_index(drop=True)
print(a_clicks.head(10), "\n")
print(b_clicks.head(10), "\n")

                                user_id utm_source            day  \
0  008b7c6c-7272-471e-b90e-930d548bd8d7     google   6 - Saturday   
1  00f5d532-ed58-4570-b6d2-768df5f41aed    twitter    2 - Tuesday   
2  013b0072-7b72-40e7-b698-98b4d0c9967f   facebook     1 - Monday   
3  0153d85b-7660-4c39-92eb-1e1acd023280     google   4 - Thursday   
4  01555297-d6e6-49ae-aeba-1b196fdbb09f     google  3 - Wednesday   
5  018cea61-19ea-4119-895b-1a4309ccb148      email     1 - Monday   
6  01fb228a-9d28-4cde-932c-59b933fa763b      email     7 - Sunday   
7  02405d93-9c33-4034-894a-b9523956a3ad    twitter    2 - Tuesday   
8  0254b59f-082d-4a5a-913d-4f2bba267768     google     5 - Friday   
9  041deef8-b242-4114-afd0-e584784ec9f0     google  3 - Wednesday   

  ad_click_timestamp experimental_group  is_click  
0               7:18                  A      True  
1                NaN                  A     False  
2                NaN                  A     False  
3                NaN            

In [18]:
a_clicks_by_day = a_clicks.groupby(["day","is_click"]).user_id.count().reset_index()
b_clicks_by_day = b_clicks.groupby(["day","is_click"]).user_id.count().reset_index()
a_clicks_by_day.rename(columns={"user_id" : "clicks_count"},inplace=True)
b_clicks_by_day.rename(columns={"user_id" : "clicks_count"},inplace=True)

a_clicks_by_day_pivot = a_clicks_by_day.pivot(columns="is_click",index="day",values="clicks_count").reset_index()
b_clicks_by_day_pivot = b_clicks_by_day.pivot(columns="is_click",index="day",values="clicks_count").reset_index()
print(a_clicks_by_day_pivot)


is_click            day  False  True
0            1 - Monday     70    43
1           2 - Tuesday     76    43
2         3 - Wednesday     86    38
3          4 - Thursday     69    47
4            5 - Friday     77    51
5          6 - Saturday     73    45
6            7 - Sunday     66    43


In [21]:
# Percentage of views that resulted in a click, per day, for each ad
a_clicks_by_day_pivot["percentage"] = (
    a_clicks_by_day_pivot[True]
    / (a_clicks_by_day_pivot[True] + a_clicks_by_day_pivot[False])
    * 100
)

b_clicks_by_day_pivot["percentage"] = (
    b_clicks_by_day_pivot[True]
    / (b_clicks_by_day_pivot[True] + b_clicks_by_day_pivot[False])
    * 100
)
print(a_clicks_by_day_pivot,"\n")
print(b_clicks_by_day_pivot)

is_click            day  False  True  percentage
0            1 - Monday     70    43   38.053097
1           2 - Tuesday     76    43   36.134454
2         3 - Wednesday     86    38   30.645161
3          4 - Thursday     69    47   40.517241
4            5 - Friday     77    51   39.843750
5          6 - Saturday     73    45   38.135593
6            7 - Sunday     66    43   39.449541 

is_click            day  False  True  percentage
0            1 - Monday     81    32   28.318584
1           2 - Tuesday     74    45   37.815126
2         3 - Wednesday     89    35   28.225806
3          4 - Thursday     87    29   25.000000
4            5 - Friday     90    38   29.687500
5          6 - Saturday     76    42   35.593220
6            7 - Sunday     75    34   31.192661
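
The two tables are easier to compare side by side. One possible approach, sketched here on counts retyped from the output above, is to merge them on `day` and subtract. The column names `no_click`, `click`, and `a_minus_b` are my own choices for this example, not names from the assignment.

```python
import pandas as pd

days = ["1 - Monday", "2 - Tuesday", "3 - Wednesday", "4 - Thursday",
        "5 - Friday", "6 - Saturday", "7 - Sunday"]

# Counts retyped from the two pivot tables printed above.
a_by_day = pd.DataFrame({"day": days,
                         "no_click": [70, 76, 86, 69, 77, 73, 66],
                         "click": [43, 43, 38, 47, 51, 45, 43]})
b_by_day = pd.DataFrame({"day": days,
                         "no_click": [81, 74, 89, 87, 90, 76, 75],
                         "click": [32, 45, 35, 29, 38, 42, 34]})

# Same percentage calculation as in the cell above.
for frame in (a_by_day, b_by_day):
    frame["percentage"] = frame["click"] / (frame["click"] + frame["no_click"]) * 100

# One row per day with both ads' percentages and their difference.
comparison = a_by_day[["day", "percentage"]].merge(
    b_by_day[["day", "percentage"]], on="day", suffixes=("_a", "_b")
)
comparison["a_minus_b"] = comparison.percentage_a - comparison.percentage_b
print(comparison.round(2))
```

A negative `a_minus_b` marks a day on which ad B had the higher click percentage.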


The overall number of views and clicks does not seem to vary much across the days of the week.

The click rate does not change much either. I did not test whether the day-to-day variations are statistically significant, in particular the Wednesday dip for ad A and the Tuesday peak for ad B.

Ad B shows greater variation in click percentage throughout the week. Given that spread, the high of 37.8% on Tuesday and the low of 25.0% on Thursday look more like ordinary variation over the course of a week than a statistically significant difference on those two days.

Ad A shows little variation in general, apart from the dip in click percentage on Wednesday. The dip is not extremely large, but because the other days are so consistent, the Wednesday result might be significant.
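
No significance test was part of the assignment. As a rough sketch of how one could be run, the block below applies a pooled two-proportion z-test using only the standard library. The function name `two_proportion_z_test` is hypothetical, and the choice of test is my assumption rather than anything the course prescribed. The counts come from the tables above.

```python
import math

def two_proportion_z_test(clicks_1, views_1, clicks_2, views_2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = clicks_1 / views_1
    p2 = clicks_2 / views_2
    pooled = (clicks_1 + clicks_2) / (views_1 + views_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_1 + 1 / views_2))
    z = (p1 - p2) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Ad A: Wednesday (38 clicks in 124 views) vs. the other six days combined.
z_wed, p_wed = two_proportion_z_test(38, 124, 310 - 38, 827 - 124)

# Overall: ad A (310 clicks in 827 views) vs. ad B (255 clicks in 827 views).
z_ab, p_ab = two_proportion_z_test(310, 827, 255, 827)

print(f"Ad A, Wednesday vs. other days: z = {z_wed:.2f}, p = {p_wed:.3f}")
print(f"Ad A vs. ad B overall:          z = {z_ab:.2f}, p = {p_ab:.4f}")
```

Run on these counts, the Wednesday comparison gives a p-value a little under 0.1, while the overall A vs. B comparison gives one below 0.01. Picking out Wednesday after looking at the data makes the first number optimistic, so it should be read as a rough guide only.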

### Conclusion:

In terms of Ad A vs Ad B, I would recommend Ad A based on the data. Overall, 310 of 827 views of Ad A (about 37.5%) led to a click, compared with 255 of 827 (about 30.8%) for Ad B. Ad A also had the higher click percentage on every day of the week except Tuesday, when Ad B was slightly ahead (37.8% vs 36.1%). That held even on Wednesday, Ad A's weakest day.