# Data Science Online
## Part 2: Introduction to Python

<img src="images/berkeley_img-4-1.jpg" style="width: 700px; height: 300px;" />

*In this notebook, we'll cover some basics of how to send instructions to your computer in the Python programming language.*

### Table of Contents

<a href='#section case'>The Data: Rocket Fuel ad campaign</a>

1.  <a href='#section 1'>The Python Programming Language</a>

    a. <a href='#subsection 1a'>Expressions</a> and <a href='#subsection error'>Errors</a>

    b. <a href='#subsection 1b'>Names</a>

    c. <a href='#subsection 1c'>Functions</a>

    d. <a href='#subsection 1d'>Sequences</a>

## The Data: Rocket Fuel Ad Campaign <a id='section case'></a>

[Rocket Fuel Inc.](https://rocketfuel.com/programmatic-marketing-platform/) (NASDAQ: FUEL) works in digital advertising, offering a "Programmatic Marketing Platform" that claims to optimize digital marketing through big data and machine learning techniques.

In 2015, Rocket Fuel ran a trial ad campaign for handbag manufacturer TaskBella. TaskBella was interested in answering two questions:

1. Would the campaign be successful?
2. If the campaign was successful, how much of that success could be attributed to the ads?

With the second question in mind, they agreed to run an **A/B test**. The majority of the people exposed to Rocket Fuel's content delivery network would see TaskBella's handbag ad (the **experimental group**). But, a small portion of people (the **control group**) would instead see a Public Service Announcement (PSA) in the exact size and place the ad would normally be. One PSA example is below:

<img src="images/smokey_bear_psa.PNG" style="width: 700px; height: 300px;" />

We can duplicate the Rocket Fuel analysis in a Jupyter notebook. But first, we need to learn a bit about how to talk to a computer.

Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later. A **code package** is a collection of code, already written by others, that we can use to analyze data.

<div class="alert alert-info"><b>Note:</b> this cell MUST be run in order for most of the rest of the notebook to work.</div>

In [None]:
# dependencies: THIS CELL MUST BE RUN
import pandas as pd
import numpy as np
import math

**DataFrames** are fundamental ways of organizing and displaying data in tables. Run the next cell to load the Rocket Fuel case data into a DataFrame.

In [None]:
# run this cell
ads = pd.read_csv('https://raw.githubusercontent.com/ds-modules/exec_ed/master/data/rocketfuel_data_renamed.csv', index_col=0)

# display the first five rows
ads.head()

This table, which we've named `ads`, is organized into six **columns**: one for each *category* of information collected about each user:

| user id                             | test group                                                                                                        | converted                                | total ads                                           | most ads day                                                     | most ads hour                                                        |
|-------------------------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------|-----------------------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------|
| The unique identifier for that user |  Which testing group the user was in: "ad"- where users saw the ads (the experimental group) or "psa"- where users saw the PSAs (the control)| Whether or not the user bought a handbag (1 if they did, 0 if they didn't)| The total number of ads (or PSAs) seen by that user | The day of the week on which the user saw the most ads (or PSAs) | The hour of the day during which the user saw the most ads (or PSAs) |

You can also think about the table in terms of its **rows**. Each row represents all the experimental information collected about a particular user. By default, `ads.head()` shows only the first five rows. Can you see how many rows there are in total?
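
For a quick look at a table's overall size, pandas DataFrames carry a `shape` attribute, and `head` accepts the number of rows to show. The sketch below uses a tiny made-up table; the values and the name `toy` are purely illustrative and are not taken from the Rocket Fuel data.

```python
import pandas as pd

# a small, made-up table with the same kinds of columns as `ads`
toy = pd.DataFrame({
    "user id": [101, 102, 103, 104],
    "test group": ["ad", "psa", "ad", "ad"],
    "converted": [0, 0, 1, 0],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# head(2) shows only the first two rows
first_two = toy.head(2)
print(first_two)
```

The same attribute on the real table, `ads.shape`, should report its total number of rows and columns.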

The data in `ads` broadly falls into two types: numbers and text. *Numerical data* shows up green in code cells and can be positive, negative, or include a decimal.

In [None]:
# Numerical data

4

87623000983

-667

3.14159

Text data (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.

In [None]:
# Strings
"a"

"Hi there!"

"We hold these truths to be self-evident, that all men are created equal."

# to the computer this is a string, NOT numerical data
"3.14159"

<div class="alert alert-warning"><p><b>QUESTION:</b> Take another look at the data in the `ads` DataFrame. Which columns have numerical data? Which columns have string data?</p>
<p>The next cell has the names of all the columns. Inside the quotation marks next to each name, replace the ellipses with "numerical" if the column has numerical data or "string" if the column has text data. The first one has been filled in as an example.</p></div>

In [None]:
user_id = "numerical"
test_group = "..."
converted = "..."
total_ads = "..."
most_ads_day = "..."
most_ads_hour = "..."
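
If you are ever unsure whether Python treats a value as numerical data or as a string, the built-in `type` function can tell you. The short sketch below uses illustrative names (`number`, `text`, `as_number`) that are not part of the Rocket Fuel data.

```python
# a number and a string that merely looks like a number
number = 3.14159
text = "3.14159"

print(type(number))   # <class 'float'>
print(type(text))     # <class 'str'>

# float() converts a numeric-looking string into numerical data
as_number = float(text)
print(as_number == number)   # True
```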

## 1. Python <a id='section 1'></a>

**Python** is a programming language: a way for us to communicate with the computer and give it instructions. Just like any language, Python has a *vocabulary* made up of words it can understand, and a *syntax* giving the rules for how to structure communication.

Like natural human languages, Python has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a few months.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

In this notebook, we're going to learn a few of those rules.

### 1a. Expressions <a id='subsection 1a'></a>
A piece of communication in Python is called an **expression**. It tells the computer what to do with the data we give it.

Here's an example of an expression. 

In [None]:
# an expression
14 + 20

When you run the cell, the computer **evaluates** the expression and prints the result. Note that only the last line in a code cell will ever be printed, unless you explicitly tell the computer you want to print the result.

In [None]:
# more expressions. When you run the cell, what gets printed and what doesn't?
100 / 10

print(4.3 + 10.98)

33 - 9 * (40000 + 1)

884

Many basic arithmetic operations can be used in Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). 

The computer evaluates arithmetic according to the PEMDAS order of operations (just like you may have learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.
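
As a worked illustration of these rules, the sketch below evaluates the same numbers with and without parentheses. It uses `**`, Python's exponent operator, and the names `no_parens` and `with_parens` are just labels chosen for this example.

```python
# exponent first, then multiplication, then addition
no_parens = 2 + 3 * 4 ** 2      # 4 ** 2 = 16, 3 * 16 = 48, 2 + 48 = 50

# parentheses first, then the exponent, then multiplication
with_parens = (2 + 3) * 4 ** 2  # 2 + 3 = 5, 4 ** 2 = 16, 5 * 16 = 80

print(no_parens, with_parens)
```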

<div class="alert alert-warning"><b>EXERCISE:</b> According to the PEMDAS order of operations, what should the next cell print? Work it out in your head or on paper, then run the cell to check your answer.</div>

In [None]:
# before you run this cell, can you say what it should print?
4 - 2 * (1 + 6 / 3)

<div class="alert alert-info"><b>OPTIONAL:</b> If you're new to programming, use the next cell to practice arithmetic operations on numbers using operators like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division).</div>

In [None]:
# practice doing arithmetic in Python


#### A Note on Errors <a id="subsection error"></a>

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

You should see something like this (minus our annotations):

<img src="images/error.jpg"/>

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without fully deciphering it. 

<div class="alert alert-warning"><b> EXERCISE</b>: Scroll back up to the cell that generated an error. Fix the error, and re-run the cell. </div>

### 1b. Names <a id='subsection 1b'></a>
Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.

We can name values using what's called an *assignment* statement.

In [None]:
# assigns 442 to x
x = 442

The assignment statement has three parts. On the left is the *name* (`x`). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.

You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it.

In [None]:
# print the value of x
x

You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.

In [None]:
y = 50 * 2 + 1
y

We can then use these names as if they were numbers.

In [None]:
x - 42

In [None]:
x + y
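
One detail worth seeing in action: a name holds the *result* of an expression, not the expression itself. In the sketch below (the names `width` and `area` are made up for illustration), changing `width` later does not change `area`.

```python
width = 10
area = width * 3   # the expression is evaluated now, so area is 30

width = 2          # reassigning width does not touch area
print(area)        # still 30
```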

<div class="alert alert-warning"><p><b>EXERCISE:</b> Before Rocket Fuel can evaluate the effectiveness of the ad campaign, they need to know how much it cost.</p>

<p>The *total number of advertisements* was $14597182$. The *CPM* (cost per thousand ads) was $\$9$. Use these numbers to assign the correct values to `total_ads`, `cpm`, and `cost_per_ad`.</p>

<p>Note: for the third variable, we want the cost *for each ad*. What do we need to do to the CPM to get the per-ad cost?</p></div>

In [None]:
# replace the ... with the total number of ads
total_ads = ...
total_ads

In [None]:
# replace the ... with the cost per thousand ads
cpm = ...
cpm

In [None]:
# replace the ... with an expression to calculate the cost per ad
cost_per_ad = ...
cost_per_ad

<div class="alert alert-warning"> <p><b>EXERCISE</b>: Then, calculate the overall cost by multiplying the number of ads by how much each ad cost. Assign this value to the name `cost`.</p>

<p>Hint: you can do the calculation using only `total_ads`, `cost_per_ad`, and the `*` multiplication operator, with no numbers needed. Your answer should be a six-digit number (before the decimal).</p></div>

In [None]:
# replace the ... with an expression to calculate the cost of the ad campaign
cost = ...
cost

### 1c. Functions <a id='subsection 1c'></a>
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it.

In [None]:
# a built-in function 
round

Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number.

In [None]:
# a call expression using round
round(1988.74699)

A function may also be called on more than one value (called *arguments*). For instance, the `min` function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [None]:
min(9, -34, 0, 99)
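
Two further illustrations, offered as a small sketch: `round` accepts an optional second argument giving the number of decimal places to keep, and arguments can be names or expressions as well as plain numbers. The names `a`, `b`, `rounded_price`, and `smallest` are invented for this example.

```python
# round with a second argument: the number of decimal places to keep
rounded_price = round(1988.74699, 2)
print(rounded_price)   # 1988.75

# arguments can be names or expressions as well as plain numbers
a = 12
b = 7
smallest = min(a, b, a - b)
print(smallest)        # 5
```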

<div class= "alert alert-warning"><p><b>PRACTICE:</b></p>
<ul>
    <li>The `abs` function takes one argument (just like `round`)</li>
    <li>The `max` function takes one or more arguments (just like `min`)</li>
    </ul>

<p>Try calling `abs` and `max` in the cell below. What does each function do?</p>

<p>Also try calling each function *incorrectly*, such as with the wrong number of arguments. What kinds of error messages do you see?</p></div>

In [None]:
# replace the ... with calls to abs and max
...

#### Dot Notation
Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the `math` module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`.

In [None]:
# a call expression with the factorial function from the math module
math.factorial(5)
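
Dot notation reaches anything stored in a module, not only functions. As a brief sketch, the `math` module also holds the named value `math.pi` and the function `math.floor`; the name `floor_value` below is just a label for this example.

```python
import math

# math.pi is a named value stored in the module, not a function
print(math.pi)

# math.floor rounds down to the nearest whole number
floor_value = math.floor(7.9)
print(floor_value)   # 7
```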

<div class="alert alert-warning"><b>PRACTICE:</b> `math` also has a function called `sqrt` that takes one argument and returns the square root. Call `sqrt` on 16 in the next cell.</div>

In [None]:
# use math.sqrt to get the square root of 16
root_16 = ...

# show the result of the expression
root_16

### 1d. Sequences <a id='subsection 1d'></a>

Working with big data, we want to be able to work with many values at the same time rather than manipulating each data point individually. We can do this using *sequences*: collections of data, all sharing the same type (e.g. numerical). 

The sequence we'll work with the most is an **array**. Arrays are made using the `np.array` function from the NumPy package we imported earlier, which takes a list of values enclosed in square brackets.

As an example, we might look at prices for a TaskBella handbag at different stores.

In [None]:
# make an array
prices = np.array([105.99, 99.99, 119.95, 130, 124.99])

prices

You can retrieve items in an array by **indexing**. To index an item, put the numerical position of the item in square brackets next to the name of the array.

In [None]:
# get the item in position 1
prices[1]

When we ask for the item in position 1, we get $99.99$. This is because arrays are *zero-indexed*: the index starts counting at zero. So, the first item in the array is at position 0, the second item is at position 1, and so on.
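
To make the counting concrete, here is a small sketch on a made-up array (the name `temps` and its values are invented for illustration). The built-in `len` function reports how many items an array holds.

```python
import numpy as np

# a made-up array of four temperatures
temps = np.array([61, 64, 58, 70])

first = temps[0]    # position 0 holds the first item, 61
third = temps[2]    # position 2 holds the third item, 58
count = len(temps)  # the array has 4 items, at positions 0 through 3

print(first, third, count)
```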

<div class="alert alert-warning"><b>PRACTICE:</b> Try indexing different items from the `prices` array.</div>

In [None]:
# index the last item in the array
last_price = ...

# show the last item in the array
last_price

#### Element-wise operations
In some cases, we may want to do calculations on each individual item in the array to return a new array of the same length.

We can do the *same operation* on every array item using arithmetic operators. This is called an **element-wise** operation. For instance, we might want to calculate the price for $5$ handbags bought at each of the different stores.

In [None]:
# multiply each price by 5
prices * 5

We can also use operators on two arrays of the same length to operate on each pair of corresponding elements. For example, we might multiply our `prices` array by an array of tax rates for each store to get the amount of sales tax.

In [None]:
tax_rates = np.array([0.095, 0.11, 0.087, 0.1, 0.084])

# multiply each price by its corresponding tax
prices * tax_rates
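
Element-wise operations can also be chained. The sketch below uses two made-up stores with round numbers (the names `toy_prices`, `toy_rates`, `toy_tax`, and `toy_totals` are illustrative only) so the arithmetic is easy to follow by hand: it computes the tax for each store, then adds it to the price.

```python
import numpy as np

# made-up prices and tax rates for two stores
toy_prices = np.array([100.0, 200.0])
toy_rates = np.array([0.10, 0.05])

# element-wise multiplication, then element-wise addition
toy_tax = toy_prices * toy_rates      # tax owed at each store
toy_totals = toy_prices + toy_tax     # price plus tax at each store

print(toy_tax)
print(toy_totals)
```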

#### Reductions
In other cases, we might want to *reduce* an array of numbers to a single value using a particular function. Some examples of reduction functions are `sum`, `min`, `max`, `average`, and `median`. Many array functions come from the *NumPy* module. Just like with the `math` module, we can call functions from the NumPy module using dot notation. We imported NumPy under the abbreviation `np`.

In [None]:
# get the average handbag price
np.average(prices)

In [None]:
# get the lowest sales tax rate
np.min(tax_rates)
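
Two of the other reductions listed above, `np.sum` and `np.median`, work the same way. The sketch below applies them to a made-up array of daily sales counts (the name `daily_sales` and its values are invented for illustration).

```python
import numpy as np

# made-up counts of handbags sold on five days
daily_sales = np.array([3, 5, 2, 8, 4])

total_sold = np.sum(daily_sales)       # 3 + 5 + 2 + 8 + 4 = 22
typical_day = np.median(daily_sales)   # middle value of 2, 3, 4, 5, 8 is 4.0

print(total_sold, typical_day)
```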

<div class="alert alert-warning"> <b>EXERCISE:</b> Use the `prices` and `tax_rates` arrays to try some operations. <ol>
    <li>Replace the first set of ellipses with an expression to add 10 to each price.</li>
    <li>Replace the second set of ellipses with an expression to divide all tax rates in half.</li>
    <li>Replace the last set of ellipses to create a variable `max_price` that is the largest price in the `prices` array. Use the function `np.max`.</li>
    </ol>
    </div>

In [None]:
#  practice manipulating arrays
price_plus_10 = ...

taxes_reduced_by_half = ...

max_price = ...

### Conclusion

In this notebook, we covered expressions, names, functions, and sequences in Python: what they are and how to use them. We got some practice using Python to calculate some important numbers for the Rocket Fuel analysis. 

In the next notebook, we'll learn how to manipulate and transform data in a DataFrame.

#### References

- "A Note on Errors" subsection and "error" image adapted from materials by Chris Hench and Mariah Rogers for the Medieval Studies 250: Text Analysis for Graduate Medievalists [data science module](https://github.com/ds-modules/MEDST-250).
- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series

Author: Keeley Takimoto