# Data Science: Bridging Principles and Practice
## Part 5: Numerical Data and Histograms

<img src="images/ab_test.PNG" style="width: 900px; height: 250px;" />

*In this notebook, we will learn about visualizing numerical data in histograms. We will then apply these concepts to the Rocket Fuel case to look at how the number of ads seen relates to the likelihood a user will convert.*

### Table of Contents


<ol start="5">
    <li><a href="#section 5">Numerical Data: Rocket Fuel and Total Ads</a>
        <ol type=a>
            <br>
            <li><a href="#subsection 5a">Histograms</a></li>
            <br>
            <li><a href="#subsection 5b">Consumer Response vs. Total Ads Seen</a></li>
        </ol>
    </li>
    </ol>
    

In [None]:
# dependencies
import pandas as pd
%matplotlib inline
from scripts.exec_ed_scripts import *
import seaborn as sns

# make the plots bigger
sns.set(rc={'figure.figsize':(11.7,8.27)})

import warnings
warnings.filterwarnings('ignore')

In [None]:
# load the rocket fuel data
ads = pd.read_csv('data/rocketfuel_data_renamed.csv', index_col=0)

# display the first five rows
ads.head()

##  5. Numerical Data: Rocket Fuel and Total Ads <a id="section 5"></a>

One question we want to explore is how the *total number of ads* a user saw relates to how *likely they were to convert*. 

The "total ads" variable is a little different from the other variables we've grouped by so far. Take the hours of the day, for example: they are coded as numbers, but in some ways they act like categorical data.

- They have discrete, limited possible values (0 to 23)
- We can put them in order to some extent (hour 1 comes before hour 2), but the ordering breaks down for the first and last values (hour 23 comes before hour 0)

In contrast, the count of total ads is definitely *numerical data*. "Total ads" can take on many, many values and we can unequivocally say that one total ad value is greater or less than another. 

In this section, we'll talk a bit about ways to visualize numerical data, as well as how (and why) to treat numerical data as categorical for analysis.

## 5a. Histograms <a id="subsection 5a"></a>

A *histogram* is a visualization useful for numerical data. It shows the *distribution* of the values in a column of numerical data: that is, it shows all the different possible values and how often those values occur.

Histograms are constructed using the `hist` method, called on the DataFrame itself. `hist` takes one argument: the name of the column whose distribution you want to see. You can also add an optional argument called `bins`, which sets how many bins (vertical bars) are displayed.

As an example, here's the histogram of the variable "most ads hour", which gives the hours of the day when users saw the most ads.

In [None]:
# show the distribution of ads by hour in 24 bins
ads.hist("most ads hour", bins=24);

The horizontal axis gives the different possible hours of the day. The axis is divided into **bins**: intervals that contain one or more possible values. For instance, a narrow bin could hold all users that saw the most ads in hour 7, and a wider bin could encompass all users that saw the most ads in hour 7, 8, or 9.

The height of each bar shows how many users fell into the corresponding bin. So, a tall bar means that lots of users saw most of their ads during that bin's hour-of-day interval.

This histogram shows us that the most common peak hour was around hour 15 (3 PM), and that very few users saw most of their ads between hour 0 (midnight) and hour 6 (6 AM).
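
To make the idea of bins concrete, here is a minimal sketch that counts values per bin by hand with NumPy. The `hours` array is made-up toy data, not the Rocket Fuel data, and `np.histogram` is used only to illustrate the kind of counting a histogram plot performs.

```python
import numpy as np

# toy data (hypothetical): the hour in which each of 10 users saw the most ads
hours = np.array([7, 7, 8, 9, 15, 15, 15, 16, 20, 23])

# narrow bins: one bin per hour, with edges at 0, 1, ..., 24
narrow_counts, narrow_edges = np.histogram(hours, bins=24, range=(0, 24))

# wider bins: each bin spans three hours, with edges at 0, 3, ..., 24
wide_counts, wide_edges = np.histogram(hours, bins=8, range=(0, 24))

print(narrow_counts[7])  # users whose busiest hour was exactly 7
print(wide_counts[2])    # users whose busiest hour was 6, 7, or 8
```

Fewer, wider bins give a smoother but coarser picture; more, narrower bins show more detail but can leave some bars nearly empty.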

<div class="alert alert-warning">
    <b>EXERCISE:</b> Use <code>hist</code> to create the histogram for the "total ads" column in the <code>ads</code> table. Set the number of bins to 100.
    </div>

In [None]:
# create a histogram for the total ads column
ads.hist("...", bins=100);

<div class="alert alert-warning"><p><b>QUESTION</b>: what does the histogram tell us about the distribution of "total ads"?</p>

<p>Hint: the plot is automatically sized to show all possible values of "total ads" on the horizontal axis. If there's a part of the histogram on the horizontal axis that looks like it doesn't have any values, it either means:
        </p>
<ul>
    <li>there are no values in that bin, but there are values in a bin to the right of that bin</li>
    <li>there are values in that bin, but there are so few that the bin can't be seen</li>
    </ul>

    </div>

**ANSWER**: *Write your answer here*

## 5b. Consumer Response vs. Total Ads Seen <a id="subsection 5b"></a>

We want to answer the question of how the *total number of ads seen* relates to the average *conversion rate*. 

From our histogram in Part a, we can see that while some users saw as many as 2000+ ads total, almost all users saw between 1 and about 200 ads. To avoid skewing our analysis by including conversion rates for very rare "total ads" counts, let's select only the rows where the user saw fewer than 211 ads.

In [None]:
# use boolean indexing to keep rows with fewer than 211 ads;
# .copy() gives an independent table we can safely add columns to later
ads_small = ads[ads["total ads"] < 211].copy()
ads_small.head()
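
The row selection above relies on boolean indexing. As a minimal sketch of how that works, here is the same pattern on a tiny made-up table; the `toy` DataFrame and its values are hypothetical and only reuse the "total ads" column name.

```python
import pandas as pd

# toy table (hypothetical values), reusing the "total ads" column name
toy = pd.DataFrame({"total ads": [3, 45, 210, 211, 1500]})

# comparing a column to a number gives one True/False value per row
mask = toy["total ads"] < 211

# indexing with the mask keeps only the rows marked True;
# .copy() makes the result an independent table
toy_small = toy[mask].copy()

print(mask.tolist())
print(toy_small["total ads"].tolist())
```

Note that the comparison is strict, so a user with exactly 211 ads is dropped while a user with 210 is kept.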

"total ads" is now restricted to 210 possible values, far fewer than the original range. Let's try redrawing the histogram.

<div class="alert alert-warning">

<b>EXERCISE:</b> Re-draw the histogram for the `total ads` column using the `ads_small` DataFrame. Set the number of bins equal to 50.

</div>

In [None]:
# create a histogram for the distribution of total ads
ads_small.hist(...);

The large majority of the values are still concentrated on the low end of the possible values, but we've reduced the *skew* a bit: you can now actually see most of the bars on the right side of the histogram.

The histogram also gives us clues about how to visualize the average conversion rate per test group and total ads count, which is what we need to answer our question. Since some values of total ads are pretty rare, we will want to group similar "total ads" counts together and calculate the average conversion rate for each group, rather than for each individual count.

Note that this also seems like a reasonable assumption: we assume that someone who saw 147 ads will behave similarly to someone who saw 145 or 149.

We can turn the numerical "total ads" values into categorical ranges of total ad counts with a function called `round_down_nearest_10`, which rounds each ad count down to the nearest multiple of 10. The `apply` method runs `round_down_nearest_10` on each value in the "total ads" column, and we store the results in a new column called "total ad range".

In [None]:
# create a new column with total ads rounded down to the nearest 10
ads_small["total ad range"] = ads_small["total ads"].apply(round_down_nearest_10)
ads_small.head()
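
The actual `round_down_nearest_10` is imported from the course scripts and is not shown in this notebook. As a rough illustration of what such a function might do, here is a minimal sketch under the assumption that ad counts are non-negative whole numbers; the name `round_down_sketch` is hypothetical and is not the course implementation.

```python
def round_down_sketch(count):
    """Round a non-negative whole number down to a multiple of 10."""
    # floor division drops the ones digit; multiplying by 10 restores the scale
    return (count // 10) * 10

# counts in the same decade land in the same range
for count in [145, 147, 149, 150, 9]:
    print(count, "->", round_down_sketch(count))
```

Under this sketch, users who saw 145, 147, or 149 ads all fall in the 140 range, which matches the assumption that they behave similarly.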


Now that the total ad data is divided into categories, we can visualize it on a bar plot. The following code should look familiar from the last notebook on categorical data.

In [None]:
# make a bar plot of the conversion rates by ranges of ads seen
sns.barplot(x="total ad range", y="converted", hue="test group",
           data=ads_small)
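
By default, `sns.barplot` sets each bar's height to the mean of the `y` column within that category. Assuming "converted" is coded as 0/1, that mean is the conversion rate. The sketch below computes the same kind of summary with `groupby` on made-up data; the `toy` table, its group labels, and its outcomes are all hypothetical.

```python
import pandas as pd

# toy data (hypothetical): group labels and outcomes are made up
toy = pd.DataFrame({
    "total ad range": [0, 0, 0, 0, 10, 10, 10, 10],
    "test group": ["ad", "ad", "psa", "psa", "ad", "ad", "psa", "psa"],
    "converted": [1, 0, 0, 0, 1, 1, 1, 0],
})

# mean of a 0/1 column = share of users who converted in each group
rates = toy.groupby(["total ad range", "test group"])["converted"].mean()
print(rates)
```

Looking at the numbers behind the bars can help when two bars are close in height or when a range contains very few users.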

<div class="alert alert-warning">
    <b>QUESTION:</b> What can you infer from the plot? For what range is advertising most effective?
    </div>

**ANSWER:**

<div class="alert alert-warning">
    <b>QUESTION:</b> What do the above figures imply for the design of the next campaign assuming that consumer response would be similar?
    </div>

**ANSWER:** 

#### References

- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, Berkeley Haas Case Series


Author: Keeley Takimoto