# Berkeley Program on Data Science and Analytics
## Module IV, Part I: Introduction to Data Science in Python

<img src="images/berkeley_img-4-1.jpg" style="width: 700px; height: 300px;" />


### Table of Contents

[Welcome to Jupyter Notebooks](#section 0)<br>

[The Data: Rocket Fuel ad campaign](#section case)<br>

1 - [Python](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a - [Expressions](#subsection 1a)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  [Errors](#subsection error)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b - [Names](#subsection 1b)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; c - [Functions](#subsection 1c)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; d - [Sequences](#subsection 1d)



2 - [Tables](#section 2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a - [Attributes](#subsection 2a)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b - [Transformations](#subsection 2b)

3 - [Problem: Rocket Fuel Costs, Benefits, and Efficacy](#section 3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a - [Conversion Proportions](#subsection 3a)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b - [Benefit](#subsection 3b)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; c - [ROI](#subsection 3c)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; d - [Opportunity Cost](#subsection 3d)

## Welcome to Jupyter  <a id='section 0'></a>

Welcome to the Jupyter Notebook! **Notebooks** are documents that can contain text, code, visualizations, and more. We'll be using them in this module to apply many of the concepts mentioned earlier in the course.

A notebook is composed of rectangular sections called **cells**. There are 2 kinds of cells: markdown and code. A **markdown cell**, such as this one, contains text. A **code cell** contains code in Python, a programming language that we will be using for the remainder of this module. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys.

To run a code cell once it's been selected, 
- press Shift-Enter, or
- click the Run button in the toolbar at the top of the screen. 

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.

In [None]:
# run this cell
print("Hello World!")

You'll notice that many code cells contain lines of blue text that start with a `#`. These are *comments*. Comments often contain helpful information about what the code does or what you are supposed to do in the cell. The leading `#` tells the computer to ignore them.

#### Editing

You can edit a Markdown cell by clicking it twice. Text in Markdown cells is written in [**Markdown**](https://daringfireball.net/projects/markdown/), a formatting syntax for plain text, so you may see some funky symbols when you edit a text cell. 

Once you've made your changes, you can exit text editing mode by running the cell. Edit the next cell to fix the misspelling.

Welcome to Module IV on Dtaa Science and Analytics!

Code cells can be edited any time after they are highlighted. Try editing the next code cell to print your name.

In [None]:
# edit the code to print your name
print("Hello: my name is NAME")

#### Saving and Loading

Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing Control-S, clicking the floppy disk icon in the toolbar at the top of the page, or going to the File menu and selecting "Save and Checkpoint".

The next time you open the notebook, it will look the same as when you last saved it.

**Note:** after loading a notebook you will see all the outputs (graphs, computations, etc.) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined. The easiest way is to highlight the cell where you left off work, then go to the Cell menu at the top of the screen and click "Run all above". You can also use this menu to run all cells in the notebook by clicking "Run all".

#### Completing the Notebooks

As you navigate the notebooks, you'll see cells with bold, all-capitalized headings that need to be filled in to complete the notebook. There are three types:
- **EXERCISE** cells require you to write code to solve a problem related to a case study
- **QUESTION** cells ask you to write short answers, often related to analyzing a graph or the result of a computation in the case study
- **PRACTICE** cells provide spaces to try out new coding skills at your own pace, unrelated to the case study. Since each coding skill taught in these notebooks is necessary for analyzing the cases, practice cells are a good way to get comfortable before applying those skills to real data.

## The Data: Rocket Fuel Ad Campaign <a id='section case'></a>

[Rocket Fuel Inc.](https://rocketfuel.com/programmatic-marketing-platform/) (NASDAQ: FUEL) works in digital advertising, offering a "Programmatic Marketing Platform" that claims to optimize digital marketing through big data and machine learning techniques.

In 2015, Rocket Fuel ran a trial ad campaign for handbag manufacturer TaskBella. TaskBella was interested in answering two questions:

1. Would the campaign be successful?
2. If the campaign was successful, how much of that success could be attributed to the ads?

With the second question in mind, they agreed to run an **A/B test**. The majority of the people exposed to Rocket Fuel's content delivery network would see TaskBella's handbag ad (the **experimental group**). But, a small portion of people (the **control group**) would instead see a Public Service Announcement (PSA) in the exact size and place the ad would normally be. One PSA example is below:

<img src="images/smokey_bear_psa.PNG" style="width: 700px; height: 300px;" />

In this notebook, we'll duplicate some of their analysis. First, we'll look at whether there is any difference between the two groups.

Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later. 

Note: this cell MUST be run in order for most of the rest of the notebook to work.

In [None]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

**Tables** are fundamental ways of organizing and displaying data. Run the next cell to load the Rocket Fuel case data.

In [None]:
# run this cell
ads = Table.read_table('data/rocketfuel_data_renamed.csv')
ads

This table, which we've named `ads`, is organized into six **columns**: one for each *category* of information collected about each user:

| user id                             | test group                                                                                                        | converted                                | total ads                                           | most ads day                                                     | most ads hour                                                        |
|-------------------------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------|-----------------------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------|
| The unique identifier for that user |  Which testing group the user was in: "ad"- where users saw the ads (the experimental group) or "psa"- where users saw the PSAs (the control)| Whether or not the user bought a handbag | The total number of ads (or PSAs) seen by that user | The day of the week on which the user saw the most ads (or PSAs) | The hour of the day during which the user saw the most ads (or PSAs) |

You can also think about the table in terms of its **rows**. Each row represents all the experimental information collected about a particular user. By default only the first ten rows are shown. Can you see how many rows there are in total?

The data in `ads` broadly falls into two types: numbers and text. *Numerical data* shows up green in code cells and can be positive, negative, or include a decimal.

In [None]:
# Numerical data

4

87623000983

-667

3.14159

Text data (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.

In [None]:
# Strings
"a"

"Hi there!"

"We hold these truths to be self-evident, that all men are created equal."

# this is a string, NOT numerical data
"3.14159"

# 1. Python <a id='section 1'></a>

### 1a. Expressions <a id='subsection 1a'></a>
**Python** is a programming language: a way for us to communicate with the computer and give it instructions. Just like any language, Python has a *vocabulary* made up of words it can understand, and a *syntax* giving the rules for how to structure communication. These bits of communication are called **expressions**: they tell the computer what to do with the data we give it.

Here's an example of an expression. 

In [None]:
# an expression
14 + 20

When you run the cell, the computer **evaluates** the expression and prints the result. Note that only the last line in a code cell will be printed, unless you explicitly tell the computer you want to print the result.

In [None]:
# more expressions. what gets printed and what doesn't?
100 / 10

print(4.3 + 10.98)

33 - 9 * (40000 + 1)

884

Many basic arithmetic operations are built in to Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). 

The computer evaluates arithmetic according to the PEMDAS order of operations (just like you probably learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.

In [None]:
# before you run this cell, can you say what it should print?
4 - 2 * (1 + 6 / 3)
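
As an illustration of the order of operations, the sketch below breaks a different expression into one step per rule. The expression and the `step_...` names are made up for this example. It uses `**`, Python's exponent operator.

```python
# Evaluate 2 + 3 * (8 - 2) ** 2 / 4 one PEMDAS step at a time.
step_parentheses = 8 - 2               # parentheses first: 6
step_exponent = step_parentheses ** 2  # then exponents: 36
step_multiply = 3 * step_exponent      # multiplication and division, left to right: 108
step_divide = step_multiply / 4        # 27.0
step_add = 2 + step_divide             # addition last: 29.0

# The one-line expression gives the same value as the step-by-step version.
one_line = 2 + 3 * (8 - 2) ** 2 / 4
print(step_add, one_line)
```

Note that `/` always produces a number with a decimal point, which is why the result is `29.0` rather than `29`.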

#### A Note on Errors <a id="subsection error"></a>

Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

You should see something like this (minus our annotations):

<img src="images/error.jpg"/>

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.
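
If you are curious, the sketch below shows one way to trigger the same kind of error on purpose without stopping the notebook. It uses Python's built-in `compile` function, which checks the syntax of a piece of text without running it. The names `broken_source` and `caught` are made up for this example, and the exact wording of the message depends on your Python version.

```python
# Source text with the same mistake as the cell above: no closing parenthesis.
broken_source = 'print("This line is missing something."'

try:
    # compile() checks the syntax of the text without running it.
    compile(broken_source, "<example cell>", "exec")
    caught = None
except SyntaxError as err:
    # Python refused the text because its structure is not legal.
    caught = err

# Show the kind of error and the message Python attached to it.
print(type(caught).__name__)
print(caught.msg)
```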

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, you can usually find out what it means by searching for the error message online or posting on Piazza.)

**PRACTICE:**
If you're new to Python and coding, one of the best ways to get comfortable is to practice. Try writing and running different expressions in the cell below using numbers and the arithmetic operators `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). See if you can generate different error messages and figure out what they mean.

In [None]:
# Optional: try out different arithmetic operations


### 1b. Names <a id='subsection 1b'></a>
Sometimes, the values you work with can get cumbersome- maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.

We can name values using what's called an *assignment* statement.

In [None]:
# assigns 442 to x
x = 442

The assignment statement has three parts. On the left is the *name* (`x`). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.

You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it.

In [None]:
# print the value of x
x

You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.

In [None]:
y = 50 * 2 + 1
y

We can then use these names as if they were numbers.

In [None]:
x - 42

In [None]:
x + y

**PRACTICE:**

In [None]:
# Optional: experiment with assigning names and doing arithmetic operations with named variables


**EXERCISE:** Before Rocket Fuel can evaluate the effectiveness of the ad campaign, they need to know how much it cost.

The *total number of advertisements* was $14597182$. The *CPM* was $\$9$. Use these numbers to assign the correct values to `total_ads`, `cpm`, and `cost_per_ad`.

Note: for the third variable, we want the cost *for each ad*. What do we need to do to the CPM to get the per-ad cost?

In [None]:
# replace the ... with the total number of ads
total_ads = ...
total_ads

In [None]:
# replace the ... with the cost per thousand ads
cpm = ...
cpm

In [None]:
# replace the ... with an expression to calculate the cost per ad
cost_per_ad = ...
cost_per_ad

Then, calculate the overall cost by multiplying the number of ads by how much each ad cost. Assign this value to the name `cost`.

Hint: you can do the calculation using only `total_ads`, `cost_per_ad`, and the `*` multiplication operator; no numbers are needed. Your answer should be a six-digit number (before the decimal).

In [None]:
# replace the ... with an expression to calculate the cost of the ad campaign
cost = ...
cost

### 1c. Functions <a id='subsection 1c'></a>
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it.

In [None]:
# a built-in function 
round

Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number.

In [None]:
# a call expression using round
round(1988.74699)
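
`round` can also take a second argument giving the number of decimal places to keep, which this notebook uses later when printing percentages. The sketch below shows both forms. The variable names are made up for this example. One detail that may be surprising: when a number is exactly halfway between two whole numbers, Python rounds to the nearest *even* whole number.

```python
# One argument: round to the nearest whole number.
whole = round(1988.74699)           # 1989

# Two arguments: round to the given number of decimal places.
two_places = round(1988.74699, 2)   # 1988.75

# Exact halves go to the nearest even whole number.
half_to_two = round(2.5)            # 2
half_to_four = round(3.5)           # 4

print(whole, two_places, half_to_two, half_to_four)
```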

A function may also be called on more than one value (called *arguments*). For instance, the `min` function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [None]:
min(9, -34, 0, 99)

**PRACTICE:** 
* The `abs` function takes one argument (just like `round`)
* The `max` function takes one or more arguments (just like `min`)

Try calling `abs` and `max` in the cell below. What does each function do?

Also try calling each function *incorrectly*, such as with the wrong number of arguments. What kinds of error messages do you see?

In [None]:
# replace the ... with calls to abs and max
...

#### Dot Notation
Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the `math` module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`.

In [None]:
# a call expression with the factorial function from the math module
math.factorial(5)
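
Dot notation works the same way for everything a module provides, not just `factorial`. As a small sketch, the cell below uses `math.floor`, `math.ceil`, and the constant `math.pi`. The variable names are made up for this example. Notice that `math.pi` is a value rather than a function, so it is written without parentheses.

```python
import math

# math.floor rounds down and math.ceil rounds up to a whole number.
rounded_down = math.floor(3.7)   # 3
rounded_up = math.ceil(3.2)      # 4

# math.pi is a named value stored in the module, not a function.
circle_area = math.pi * 2 ** 2   # area of a circle with radius 2

print(rounded_down, rounded_up, circle_area)
```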

**PRACTICE:**  `math` also has a function called `sqrt` that takes one argument and returns the square root. Call `sqrt` on 16 in the next cell.

In [None]:
# use math.sqrt to get the square root of 16
...

### 1d. Sequences <a id='subsection 1d'></a>

Working with big data, we want to be able to work with many values at the same time rather than manipulating each data point individually. We can do this using *sequences*: collections of data, all sharing the same type (e.g. numerical). 

The sequence we'll work with the most is an **array**. Arrays are made using the `make_array` function. 

As an example, we might look at prices for a TaskBella handbag at different stores.

In [None]:
# make an array
prices = make_array(105.99, 99.99, 119.95, 130, 124.99)

prices

You can retrieve items in an array by **indexing**. To index an item, put the numerical position of the item in square brackets next to the name of the array.

In [None]:
# get the item in position 1
prices[1]

When we ask for the item in position 1, we get $99.99$. This is because arrays are *zero-indexed*: the index starts counting at zero. So, the first item in the array is at position 0, the second item is at position 1, and so on.
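
The sketch below spells out zero-indexing using a plain Python list with the same five prices. A list is used here only so the example runs on its own; it is a stand-in for the `prices` array, and the names `prices_list`, `first`, `last`, and `out_of_range` are made up for this example.

```python
# The same five prices, stored in a plain Python list.
prices_list = [105.99, 99.99, 119.95, 130, 124.99]

# Position 0 holds the first item.
first = prices_list[0]                      # 105.99

# With 5 items, the last valid position is 5 - 1 = 4.
last = prices_list[len(prices_list) - 1]    # 124.99

# Asking for position 5 is an error, because counting started at 0.
try:
    prices_list[5]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, out_of_range)
```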

**PRACTICE:** Try indexing different items from the `prices` array.

In [None]:
# practice indexing


#### Element-wise operations
In some cases, we may want to do calculations on each individual item in the array to return a new array of the same length.

We can do the *same operation* on every array item using arithmetic operators. This is called an **element-wise** operation. For instance, we might want to calculate the price for $5$ handbags bought at each of the different stores.

In [None]:
# multiply each price by 5
prices * 5

We can also use operators on two arrays of the same length to operate on each pair of corresponding elements. For example, we might multiply our `prices` array by an array of tax rates for each store to get the amount of sales tax.

In [None]:
tax_rates = make_array(0.095, 0.11, 0.087, 0.1, 0.084)

# multiply each price by its corresponding tax
prices * tax_rates
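
To make "element-wise" concrete, here is a sketch of the same two calculations written out item by item with plain Python lists. This is only an illustration of what the array operators do for you, not how arrays work internally. The names `prices_list`, `tax_list`, `five_bags`, and `sales_tax` are made up for this example.

```python
# Plain-list copies of the two arrays above.
prices_list = [105.99, 99.99, 119.95, 130, 124.99]
tax_list = [0.095, 0.11, 0.087, 0.1, 0.084]

# Same operation on every item: multiply each price by 5.
five_bags = [price * 5 for price in prices_list]

# Pair up corresponding items, then multiply each pair.
sales_tax = [price * rate for price, rate in zip(prices_list, tax_list)]

# Both results have one entry per store, just like the original lists.
print(len(five_bags), len(sales_tax))
print(five_bags)
print(sales_tax)
```

With arrays, `prices * 5` and `prices * tax_rates` express the same idea in a single short expression.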

#### Reductions
In other cases, we might want to *reduce* an array of numbers to a single value using a particular function. Some examples of reduction functions are `sum`, `min`, `max`, `average`, and `median`. Many array functions come from the *Numpy* module. Just like with the `math` module, we can call functions from the Numpy module using dot notation. Numpy is abbreviated as `np`.

In [None]:
# get the average handbag price
np.average(prices)

In [None]:
# get the lowest sales tax rate
np.min(tax_rates)
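
The sketch below runs several reductions on a plain Python list of the same prices. To keep the example self-contained, it uses Python's built-in `sum`, `min`, and `max` and the standard `statistics` module rather than Numpy, so treat it as a stand-in for the `np` functions. The variable names are made up for this example.

```python
import statistics

# The same five prices, stored in a plain Python list.
prices_list = [105.99, 99.99, 119.95, 130, 124.99]

# Each reduction turns the whole list into a single value.
total = sum(prices_list)
lowest = min(prices_list)                     # 99.99
highest = max(prices_list)                    # 130
average = statistics.mean(prices_list)
middle = statistics.median(prices_list)       # 119.95

print(total, lowest, highest, average, middle)
```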

**PRACTICE:** Use the `prices` and `tax_rates` arrays to try some operations. Try adding, subtracting, multiplying, or dividing an array by a number, or doing element-wise operations with the two arrays (or with one array with itself).

In [None]:
# Optional: practice manipulating arrays


## 2. Tables <a id='section 2'></a>

The last section covered four basic concepts of Python: expressions, names, functions, and sequences. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills.

Let's look at our `ads` table again.

In [None]:
# display the ads table
ads

### 2a. Table Attributes <a id='subsection 2a'></a>

Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Table attributes are accessed using dot notation. But, since an attribute doesn't perform an operation on the table, there are no parentheses (like there would be in a call expression).

Attributes you'll use frequently include `num_rows` and `num_columns`, which give the number of rows and columns in the table, respectively.

In [None]:
# get the number of columns
ads.num_columns
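
The difference between an attribute and a call expression can be sketched with a tiny, hypothetical class. `TinyTable` below is made up for this example and is not how the `datascience` Table is actually built; it only shows why one name needs parentheses and the other doesn't.

```python
class TinyTable:
    """A made-up, minimal table used only to illustrate dot notation."""

    def __init__(self, rows):
        self.rows = rows
        # An attribute: a stored piece of information about the table.
        self.num_rows = len(rows)

    def first_row(self):
        # A function attached to the table: it does work when called.
        return self.rows[0]


tiny = TinyTable(["row a", "row b", "row c"])

row_count = tiny.num_rows      # attribute: no parentheses
first = tiny.first_row()       # call expression: parentheses required

print(row_count, first)
```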

**PRACTICE:** Use `num_rows` to get the number of rows in our `ads` table.

In [None]:
# get the number of rows in ads


### 2b. Table Transformation <a id='subsection 2b'></a>

Suppose we want to answer the question of whether the ad campaign was profitable. Not all of our columns are relevant to this question, like `"most ads hour"`. We can save computational resources and avoid confusion by *transforming* our table before we start work.

#### Subsetting columns with `select` and `drop`
The `select` function is used to get a table containing only particular columns. `select` is called on a table using dot notation and takes one or more arguments: the name or names of the column or columns you want.

In [None]:
# make a new table with only the user ids and total ads
ads.select("user id", "total ads")

If instead you need all columns except a few, the `drop` function can get rid of specified columns. `drop` works very similarly to `select`: call it on the table using dot notation, then give it the name or names of what you want to drop.

In [None]:
# drop the total ads column
ads.drop("total ads")

**PRACTICE:** Create a table that only contains the columns "user id", "test group", and "most ads hour" two different ways- once using `select`, and once using `drop`.

In [None]:
# use select
...

In [None]:
# use drop
...

#### Filtering rows with `where`
Some analysis questions only deal with a subset of rows. How often did users convert when they saw the most ads during business hours (8AM-5PM)? What was the total number of ads seen by the control group? Was the conversion rate greater when users saw the most ads on Monday than on Tuesday?

The **`where`** function allows us to choose certain rows based on two arguments:
- A column label
- A condition that each row should match, called the _predicate_ 

In other words, we call the `where` function like so: `table_name.where(column_name, predicate)`.


In [None]:
# get rows with users who saw the most ads on Monday
ads.where("most ads day", are.equal_to("1:Mon"))

There are many types of predicates, but some of the more common ones are:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
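
One way to think about a predicate is as a yes-or-no test applied to the value in each row. The sketch below imitates that idea in plain Python with a few made-up rows. The functions `above`, `between`, and `where` here are simplified stand-ins written for this example, not the `datascience` library's own code. Note that `between` keeps the lower bound and leaves out the upper bound, matching the description in the table above.

```python
def above(threshold):
    # Build a test that is True for values strictly above the threshold.
    return lambda value: value > threshold

def between(low, high):
    # Build a test that includes the lower bound and excludes the upper bound.
    return lambda value: low <= value < high

def where(rows, column, predicate):
    # Keep only the rows whose value in the given column passes the test.
    return [row for row in rows if predicate(row[column])]

# A few made-up rows, each with a user id and a total ad count.
rows = [
    {"user id": 1, "total ads": 2},
    {"user id": 2, "total ads": 10},
    {"user id": 3, "total ads": 600},
]

heavy_viewers = where(rows, "total ads", above(500))
light_viewers = where(rows, "total ads", between(2, 10))

print(heavy_viewers)
print(light_viewers)
```

Here `light_viewers` contains the user with 2 ads but not the user with 10 ads.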


In [None]:
# example 2: get all rows that had more than 500 total ads
ads.where("total ads", are.above(500))

**EXERCISE:** 
Oftentimes, we want to calculate statistics separately for the control and experimental groups. Create two tables, one containing only rows where the user was in the `"experiment"` group and one with only rows where the user was in the `"control"` group.

Hint: use the `are.equal_to` predicate.

In [None]:
# users in the experiment group
experiment = ads.where(..., ...)

experiment

In [None]:
# users in the control group
control = ads.where(..., ...)
control

## 3. Problem: Rocket Fuel Costs, Benefits, and Efficacy <a id='section 3'></a>

We now have everything we need to start analyzing the Rocket Fuel case. In this section, we'll explore four questions:

* Was the campaign effective? Did more users convert as a result of seeing an ad?
* How much more money did TaskBella make as a result of running the campaign (ignoring advertising costs)?
* Was the campaign profitable (what was the ROI)?
* What was the opportunity cost of including a control group? How much more could TaskBella have made with a smaller control group or no control group at all?

### 3a. Did more users convert as a result of the ad campaign? <a id='subsection 3a'></a>

We're interested in seeing if the buying behavior of users differed between the control and experimental groups. The two groups are very different in size, so it isn't fair to compare the number of people who converted in each group. Instead, we're going to look at the *proportion* of people in each group who bought a bag.

For both groups, the proportion will be calculated as:
$$\frac{\text{number of people in group who converted}}{\text{total number of people in group}}$$
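
To see why raw counts can mislead when groups differ in size, consider a sketch with made-up numbers (these are not taken from the Rocket Fuel data). The larger group has more converts, yet the smaller group converts at a higher rate.

```python
# Made-up group sizes and convert counts, for illustration only.
large_group_size = 1000
large_group_converts = 30
small_group_size = 100
small_group_converts = 5

# Proportion = converts in the group / people in the group.
large_group_proportion = large_group_converts / large_group_size   # 0.03
small_group_proportion = small_group_converts / small_group_size   # 0.05

# Comparing counts and comparing proportions give opposite answers.
more_converts_in_large = large_group_converts > small_group_converts        # True
higher_rate_in_small = small_group_proportion > large_group_proportion      # True

print(large_group_proportion, small_group_proportion)
print(more_converts_in_large, higher_rate_in_small)
```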

Let's start with the control group. Getting the number of people in the control group is easy: we can just call `num_rows` on our `control` table from 2b.

In [None]:
# number of users in control group
num_control = control.num_rows
num_control

Next, we need a table with only users in the control group who converted. We can get this with a call to `where` on our table of control group users.

In [None]:
# table with only converting control group users
ctrl_converts = control.where("converted", are.equal_to(1))
ctrl_converts

From this new table, we can get the number of converting control group users by again using `num_rows`.

In [None]:
# number of people in the ctrl_converts table
num_ctrl_converts = ctrl_converts.num_rows
num_ctrl_converts

Finally, we can plug the number of control group converts and the total number of control group people into our formula to find the proportion.

In [None]:
# proportion of control group users who converted
ctrl_convert_proportion = (num_ctrl_converts / num_control)
ctrl_convert_proportion

**EXERCISE:** Find the proportion of people in the *experiment* group who converted. You can follow the exact same steps as we did above for the control group; in all steps the code will be identical except for the variable and table names.

Step 1: Get the number of people in the experiment group using the `experiment` table and `num_rows`.


In [None]:
# number of people in the experiment (ad) group
num_exper = ...
num_exper

Step 2: Use `where` on the `experiment` table to create a table with only the experiment group users who converted.

In [None]:
# use "where" to get only the experiment group users who converted
exper_converts = experiment.where(..., ...)
exper_converts

Step 3: Get the number of converted experiment group users using the table you just created and `num_rows`.

In [None]:
# count the number of converting experimental group members
num_exper_converts = ...
num_exper_converts

Step 4: Plug the values from step 1 and step 3 into the formula to calculate the proportion.

$$\frac{\text{number of people in group who converted}}{\text{total number of people in group}}$$

Hint: you don't have to type any numbers here; you can just use the names of the two variables you just created.

In [None]:
# the proportion of people in the experimental group that converted
exper_convert_proportion = ...
exper_convert_proportion

The next cell will print the values you calculated as percents of the control and experiment groups that converted, rounded to two decimal places. 

In [None]:
print("Control Group: {} % converted".format(round(ctrl_convert_proportion * 100, 2))) 
print("Experiment Group: {} % converted".format(round(exper_convert_proportion * 100, 2)))

**QUESTION:** Was the campaign effective? Was a user who saw the ad more likely to buy a bag than a user who didn't see the ad?

**ANSWER:** 

### 3b. How much more money did TaskBella make as a result of running the campaign (ignoring advertising costs)? <a id='subsection 3b'></a>

Here we're looking for the benefit of the campaign: the expected financial impact from the conversions resulting from the ads (excluding all advertising costs).

The formula for the benefit is as follows:

$$ (\text{value of a converted user}) * (\text{number of users in the experiment group}) * (\text{proportion of converting experiment group users} - \text{proportion of converting control group users}) $$

That is, we are looking for the number of people in the experiment group who bought a handbag and *wouldn't have bought one if they'd been in the control group*: the people whose conversion was the result of the ad campaign. This is why we subtract the control group conversion proportion from the experiment group conversion proportion.
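
As a rough sketch of how the pieces of the formula fit together, the cell below works through it with small made-up numbers. None of these values or names come from the Rocket Fuel data; they are chosen only to keep the arithmetic easy to follow.

```python
# Made-up inputs, for illustration only.
value_per_convert = 10    # dollars earned per converted user
group_size = 1000         # users who saw the ad
ad_rate = 0.05            # proportion converting with the ad
baseline_rate = 0.03      # proportion converting without the ad

# Extra conversion rate that can be credited to the ad: about 0.02.
extra_rate = ad_rate - baseline_rate

# Extra converts in the ad group: about 20 of the 1000 users.
extra_converts = group_size * extra_rate

# Dollar value of those extra converts: about 200.
example_benefit = value_per_convert * extra_converts

print(extra_rate, extra_converts, example_benefit)
```

Of the 50 users expected to convert in this made-up ad group, about 30 would have converted anyway, so only the remaining 20 count toward the benefit.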

We already have most of the parts of this formula- we just need to assemble them.

First, TaskBella estimates the value of a converted user to be $\$40$. In the following cell, assign `40` to the name `convert_val`.

In [None]:
# dollar value of converted user
convert_val = ...

Next, let's get the difference in conversion proportions for the experiment and control groups: 

$$\text{proportion of converting experiment group users} - \text{proportion of converting control group users}$$

You can do this easily by using the variables you just calculated: `exper_convert_proportion` and `ctrl_convert_proportion`.

In [None]:
# the difference between the experiment conversion proportion and the control conversion proportion
proportion_diff = ...
proportion_diff

Lastly, plug all the appropriate values into the benefit formula to get the benefit.

Hint: the number of users in the experiment group is saved as `num_exper`.

In [None]:
benefit = ...
benefit

### 3c. What was the Return on Investment (ROI)? <a id='subsection 3c'></a>

In 3a and 3b we saw that advertising resulted in a higher percentage of converting users and a positive benefit. But, would using the campaign still increase profits when advertising costs are accounted for?

Recall that back in part 1b we calculated the advertising costs and named them `cost`.

In [None]:
# the cost of the campaign
cost

**EXERCISE:** Calculate the ROI as 

$$\frac{\text{benefit} - \text{cost}}{\text{cost}}$$
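
Because the numerator of the fraction is a subtraction, parentheses matter when the formula is written on one line. The sketch below uses made-up values and the hypothetical names `gain` and `spend` to show how leaving them out changes the result.

```python
# Made-up values, for illustration only.
gain = 150
spend = 100

# Without parentheses, the division happens first: 150 - (100 / 100).
without_parentheses = gain - spend / spend     # 149.0

# With parentheses, the subtraction happens first: (150 - 100) / 100.
with_parentheses = (gain - spend) / spend      # 0.5

print(without_parentheses, with_parentheses)
```

Only the second version matches the fraction above: a gain of 150 on a spend of 100 is a 50% return.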

In [None]:
# calculate the ROI
# remember to mind your order of operations
roi = ...
roi

### 3d. What was the opportunity cost of including a control group? <a id='subsection 3d'></a>

As we saw in 3b, having a control group is important to get a baseline with which to compare the experimental data. However, any users assigned to the control group are not seeing TaskBella's advertising, eating into profits.

We can calculate the *opportunity cost* of the control group as:

$$(\text{value of converted user}) * (\text{number of users in control group}) * (\text{proportion of experiment group users who converted} - \text{proportion of control group users who converted})$$

In other words, the opportunity cost is the additional amount of money users in the control group would have spent if they had seen the ads *purely as a result of seeing the ads*. Note that this is almost the same formula as for the benefit in 3b, except with the control group instead of the experiment group.

**EXERCISE:** Use `convert_val`, `num_control`, and `proportion_diff` to calculate the opportunity cost.

In [None]:
opp_cost = ...
opp_cost

**QUESTION:** Was the ad campaign profitable when all the costs are accounted for? Why or why not?

**ANSWER:**

#### References

- Sections of "Intro to Jupyter", "Table Transformation" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources)
- "A Note on Errors" subsection and "error" image adapted from materials by Chris Hench and Mariah Rogers for the Medieval Studies 250: Text Analysis for Graduate Medievalists [data science module](https://github.com/ds-modules/MEDST-250).
- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series

Author: Keeley Takimoto (ktakimoto@berkeley.edu, github: ktakimoto)