# Data Science Online
## Part 7: Rocket Fuel and Consumer Response

<img src="images/ab_test.PNG" style="width: 900px; height: 250px;" />

*In this notebook, we will further explore the Rocket Fuel case study by examining how consumer response differs by time of day, day of week, and how many ads they saw.*

### Table of Contents


2 - [Application: Rocket Fuel by Day and Time](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a - [Example: Consumer Response vs. Day of Week ](#subsection2a)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b - [Exercise: Consumer Response vs. Hour of Day](#subsection2b)

3 - [Application: Rocket Fuel by Total Ads](#section3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a - [Histograms](#subsection3a)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b - [Consumer Response vs. Total Ads Seen](#subsection3b)


In [1]:
# dependencies
from datascience import *
import numpy as np
import pandas as pd  # used below for pd.read_csv
%matplotlib inline
from scripts.exec_ed_scripts import *
import seaborn as sns

# make the plots bigger
sns.set(rc={'figure.figsize':(11.7,8.27)})

# 2. Application: Rocket Fuel by Day and Time <a id="section2"></a>

### 2a. Example: Consumer Response vs. Day of Week <a id="subsection2a"></a>

Let's answer the question of how *conversions* changed on *different days of the week* for the *different test groups*.

Remember, our `ads` data looks like this:

In [2]:
# load the rocket fuel data
ads = pd.read_csv('https://raw.githubusercontent.com/ds-modules/exec_ed/master/data/rocketfuel_data_renamed.csv', index_col=0)

# display the first five rows
ads.head()

| user id | test group | converted | total ads | most ads day | most ads hour |
|---|---|---|---|---|---|
| 1069124 | ad | 0 | 130 | 1:Mon | 20 |
| 1119715 | ad | 0 | 93 | 2:Tues | 22 |
| 1144181 | ad | 0 | 21 | 2:Tues | 18 |
| 1435133 | ad | 0 | 355 | 2:Tues | 10 |
| 1015700 | ad | 0 | 276 | 5:Fri | 14 |


#### Conversion Rate vs. Day of Week
Let's start with a slightly simpler problem: comparing the *average rate of conversion* for different days of the week (regardless of test group). 

First, we select the columns we need.

Then, we group by day. For our collection function, we want to calculate the conversion rate. Because the conversion rate is just the number of people who converted divided by the total number of people, and because `"converted"` is equal to 1 only when a person converted, we can get the conversion rate by taking the average of `"converted"`.

In [None]:
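# group by day and average only "converted"; averaging every column would
# fail on the text column "test group" in recent pandas versions
day_rates = ads.groupby("most ads day", as_index=False)["converted"].mean()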
day_rates
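
To see why the mean of a 0/1 column is a conversion rate, here is a minimal sketch on a made-up five-user table. The values are hypothetical and are not taken from the Rocket Fuel data.

```python
import pandas as pd

# hypothetical toy data: 1 means the user converted, 0 means they did not
toy = pd.DataFrame({
    "most ads day": ["1:Mon", "1:Mon", "1:Mon", "2:Tues", "2:Tues"],
    "converted": [1, 0, 0, 1, 1],
})

# conversion rate by hand: number of converters divided by number of users
manual_rate = toy["converted"].sum() / len(toy)

# the mean of a 0/1 column gives the same number
mean_rate = toy["converted"].mean()

# the same idea per day: group by day, then average "converted"
toy_day_rates = toy.groupby("most ads day", as_index=False)["converted"].mean()

print(manual_rate, mean_rate)
print(toy_day_rates)
```

Both rates agree because summing a column of 0s and 1s counts the converters, and dividing by the number of rows is exactly what the mean does.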

Finally, we can look at the values in a bar graph.

In [None]:
sns.barplot(x="most ads day", y="converted", data=day_rates);

#### Conversion Rate vs Day of Week and Test Group

Now, we want to return to our original question: how does conversion rate differ between different days of the week for the two different test groups? 

To answer our question, we only need to know three pieces of information: which test group a user was in, the day on which they saw the most ads, and whether or not they converted. We'll use a table transformation function to select only the three columns we need.

Next, we'll need to apply `pivot` so we can have one set of categories as the rows and the other set of categories as the columns. Our call to `pivot` can be broken down like this:

1. The first two arguments give the two sets of categories: "test group" and "most ads day"
2. The values will be calculated from "converted"
3. Our collection function will be `np.average` to calculate the rate at which each day-test group pair converted.
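
The three steps above can be sketched with pandas' `pivot_table` on a small made-up table. This is an illustration under assumptions rather than the notebook's own solution: the rows are hypothetical, the group label `control` is a placeholder, and `aggfunc="mean"` stands in for `np.average`.

```python
import pandas as pd

# hypothetical toy data with the three columns the question needs
toy = pd.DataFrame({
    "test group": ["ad", "ad", "control", "control", "ad", "control"],
    "most ads day": ["1:Mon", "1:Mon", "1:Mon", "2:Tues", "2:Tues", "2:Tues"],
    "converted": [1, 0, 0, 0, 1, 1],
})

# rows are days, columns are test groups, and each cell is the mean of
# "converted" (the conversion rate) for that day and test group pair
day_pivot = toy.pivot_table(
    index="most ads day",
    columns="test group",
    values="converted",
    aggfunc="mean",
)
print(day_pivot)
```

Each cell of `day_pivot` answers one version of the question, for example "what share of `ad` users whose busiest day was Monday converted?"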

In [None]:
sns.barplot(x="most ads day", y="converted", hue="test group", data=ads)

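The bar plot above visualizes our findings. `most ads day` goes on the horizontal axis since that column holds the categories we want to compare, and `hue="test group"` splits each day into one bar per test group. The height of each bar is the average of `converted`, which is the conversion rate.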

<div class="alert-warning">
    <b>QUESTION:</b> On which days is advertising the most effective? When is it least effective? What are some possible limitations to the conclusions we can draw from this graph?
    </div>

**ANSWER:** 

### 2b. EXERCISE: Consumer Response vs Hour of Day <a id="subsection2b"></a>
Next, we want to see whether the conversion rate changes for each test group by *time of day*.

The process will be almost identical to that in 2a for comparing consumer response to day of the week, with any references to day of week changed to hour of day. Try walking through the process step-by-step, and refer to 2a for guidance as needed.

Start by creating the Table `ads_hour` that contains the three columns we need from `ads`: `"most ads hour"`, `"test group"` and `"converted"`.

Finally, create a horizontal bar graph by calling `barh` on your `hour_pivot` pivot table. Remember: the argument given to `barh` is the name of the column with the different categories we want to compare (in this case, the categories are the hours).

In [None]:
# make a bar plot
sns.barplot(x="most ads hour", y="converted", hue="test group", data=ads)

# 3. Application: Rocket Fuel and Total Ads <a id="section3"></a>

One final question we want to explore is how the *total number of ads* a user saw relates to how *likely they were to convert*. 

The "total ads" variable is a little different from the other variables we've grouped by so far. The days of the week and the hours of the day are coded as numbers, but they act a bit like categorical data in some ways. 

- They have very limited possible values (1-7 for days of week, 0-23 for hours of day)
- We can put them in order to some extent (day 1: Monday comes before day 2: Tuesday), but the ordering breaks down for the first and last values (day 7: Sunday comes before day 1: Monday)

In contrast, the count of total ads is definitely *numerical data*. "Total ads" can take on many, many values and we can unequivocally say that one total ad value is greater or less than another.

In this section, we'll talk a bit about ways to visualize numerical data, as well as how (and why) to treat numerical data as categorical for analysis.

### 3a. Histograms <a id="subsection3a"></a>

A *histogram* is a visualization useful for numerical data. It shows the *distribution* of the values in a column of numerical data: that is, it shows all the different possible values and how often those values occur.

Histograms are constructed using the `hist` method. As an example, here's the histogram of the variable "most ads hour", which gives the hours of the day when users saw the most ads.

In [None]:
ads.hist("most ads hour", bins=24);

The horizontal axis gives the different possible hours of the day. The axis is divided into **bins**: intervals that contain one or more possible values. For instance, a narrow bin could hold all users that saw the most ads in hour 7, and a wider bin could encompass all users that saw the most ads in hour 7, 8, or 9.

The height of each bar shows how many users fell into the corresponding bin. So, a tall bar means that lots of users saw most of their ads during that bin's hour-of-day interval.

This histogram shows us that most users saw the most ads around hour 15 (3PM) and very few users saw many ads between hour 0 (midnight) and hour 6 (6 AM).
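
To make the idea of bins concrete, here is a small sketch that counts the same values with narrow and wide bins using `np.histogram`. The ten hour values are made up for illustration.

```python
import numpy as np

# hypothetical "most ads hour" values for ten users
hours = np.array([7, 7, 8, 9, 9, 9, 14, 15, 15, 20])

# narrow bins: one bin per hour, with edges at 0, 1, ..., 24
narrow_counts, narrow_edges = np.histogram(hours, bins=24, range=(0, 24))

# wide bins: three hours per bin, with edges at 0, 3, ..., 24
wide_counts, wide_edges = np.histogram(hours, bins=8, range=(0, 24))

print(narrow_counts[7])  # users whose hour falls in [7, 8)
print(wide_counts[2])    # users whose hour falls in [6, 9)
```

Every user is counted exactly once either way. Wider bins only pool neighboring hours into the same bar, which smooths the picture at the cost of detail.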

<br/>
<div class="alert-warning">
    <b>EXERCISE:</b> Use `hist` to create the histogram for the "total ads" column in the `ads` table. What does the histogram tell us about the distribution of "total ads"?
    </div>

In [None]:
ads.hist("total ads", bins=100)

### 3b. Consumer Response vs. Total Ads Seen <a id="subsection3b"></a>

We want to answer the question of how the *total number of ads seen* relates to the average *conversion rate*. 

We'll start the same way as we did to answer our questions about day of week and hour of day: create a new table with only the columns from `ads` that we will need for this question.

In [None]:
# from ads, select total ads, test group, and converted
# ads is a pandas DataFrame, so select columns with a list of names
ads_tot = ads[["total ads", "test group", "converted"]]
ads_tot

<div class="alert-warning">
    <b>EXERCISE:</b> From our histogram in 3a, we can see that while some users saw as many as 2000+ ads total, almost all users saw between 1 and about 200 ads. To avoid skewing our analysis by including conversion rates for very rare "total ads" counts, let's select only the rows where there are fewer than 211 ads.

Hint: with a pandas DataFrame, index with a boolean mask built from the `"total ads"` column
</div>

In [3]:
# keep the rows with fewer than 211 ads; .copy() lets us add columns
# to ads_small later without a pandas SettingWithCopyWarning
ads_small = ads[ads["total ads"] < 211].copy()
ads_small.head()

| user id | test group | converted | total ads | most ads day | most ads hour |
|---|---|---|---|---|---|
| 1069124 | ad | 0 | 130 | 1:Mon | 20 |
| 1119715 | ad | 0 | 93 | 2:Tues | 22 |
| 1144181 | ad | 0 | 21 | 2:Tues | 18 |
| 1496843 | ad | 0 | 17 | 7:Sun | 18 |
| 1448851 | ad | 0 | 21 | 2:Tues | 19 |


"total ads" is now restricted to 210 possible values, far fewer than the original range. Let's try redrawing the histogram.

In [None]:
# create a histogram for the distribution of total ads
ads_small.hist("total ads", bins=50);

The large majority of the values are still concentrated on the low end of the possible values, but we've reduced the *skew* a bit: you can now actually see the bars on the right side of the histogram.

The histogram also gives us clues for how to visualize the "average conversion rates per test group and total ads count" that we need to answer our question. Since some values of total ads are pretty rare, we will want to group similar "total ads" counts together and calculate the average conversion rate for the group, rather than for each individual count.

Note that this also seems like a reasonable assumption: we assume that someone who saw 147 ads will behave similarly to someone who saw 145 or 149.

We can make the numerical "total ads" into categorical ranges of total ad counts with a function called `round_down_nearest_10` that takes in a number and rounds it down to the nearest multiple of 10, so each output labels the range of ten counts that contains that number. `apply` is a pandas Series method that will apply `round_down_nearest_10` to each value in the "total ads" column.

In [11]:
import math

def round_down_nearest_10(i):
    # e.g. 93 -> 90 and 130 -> 130
    return math.floor(i / 10) * 10

In [12]:
ads_small["total ad range"] = ads_small["total ads"].apply(round_down_nearest_10)
ads_small.head()




| user id | test group | converted | total ads | most ads day | most ads hour | total ad range |
|---|---|---|---|---|---|---|
| 1069124 | ad | 0 | 130 | 1:Mon | 20 | 130 |
| 1119715 | ad | 0 | 93 | 2:Tues | 22 | 90 |
| 1144181 | ad | 0 | 21 | 2:Tues | 18 | 20 |
| 1496843 | ad | 0 | 17 | 7:Sun | 18 | 10 |
| 1448851 | ad | 0 | 21 | 2:Tues | 19 | 20 |
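
The same binning can also be sketched without `apply`, using floor division on the whole column at once. This is an alternative shown for illustration, not the notebook's own method. The five values are the "total ads" counts from the rows displayed above, and the `pd.cut` call is an optional extra that labels each range explicitly.

```python
import pandas as pd

# the "total ads" values from the five rows displayed above
total_ads = pd.Series([130, 93, 21, 17, 21], name="total ads")

# floor-divide by 10, then multiply back: 93 // 10 is 9, and 9 * 10 is 90
ad_range = (total_ads // 10) * 10

# optional: label each value with an explicit interval such as [90, 100)
edges = list(range(0, 221, 10))
ad_interval = pd.cut(total_ads, bins=edges, right=False)

print(ad_range.tolist())
print(ad_interval.tolist())
```

The floor-division result matches the "total ad range" column above. The interval labels carry the same information but make the upper edge of each range visible.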




<div class="alert-warning">
    <b>EXERCISE:</b> Now, we want to create a pivot table where the columns are the test groups, the rows are the total ads, and the values are calculated from "converted" and collected using `np.average`.
    </div>

Finally, use `barh` to create a bar plot of the data in `total_pivot`. The total ad ranges should be on the vertical axis.

In [None]:
sns.barplot(x="total ad range", y="converted", hue="test group",
           data=ads_small)

<div class="alert-warning">
    <b>QUESTION:</b> What can you infer from the plot? In what region is advertising most effective?
    </div>

**ANSWER:**

<div class="alert-warning">
    <b>QUESTION:</b> What do the above figures imply for the design of the next campaign assuming that consumer response would be similar?
    </div>

**ANSWER:** 

#### References

- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series
- "For loop" section adapted from materials by Kelly Chen at [UC Berkeley Data Modules](https://github.com/ds-modules/core-resources)

Author: Keeley Takimoto