In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

# Lab 04: Functions and Visualizations

Welcome to Lab 04.

This week, we'll learn about functions, table methods such as `apply`, and how to generate visualizations.

Recommended Reading:

* [Applying a Function to a Column](https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html)
* [Visualizations](https://inferentialthinking.com/chapters/07/Visualization.html)

First, set up the notebook by running the cell below.

In [None]:
# Just run this cell
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

## 1. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

### `def`
It always starts with `def` (short for **def**ine):

```python
def
```

### Name
Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:

```python
    def to_percentage
```


### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

```python
    def to_percentage(proportion)
```

If we want our function to take more than one argument, we add a comma between each argument name. Note that if we had zero arguments, we'd still place the parentheses `()` after the name. 

We put a colon after the signature to tell Python the signature is complete. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

```python
    def to_percentage(proportion):
```


### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

```python
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
```
    


### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented with a tab**.  Any line that is *not* indented, and is instead left-aligned with the `def` statement, is considered outside the function. 

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the arguments that are defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them into the function as arguments!

Now, let's give a name to the number we multiply a proportion by to get a percentage:

```python
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
```

### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  You can think of the value that is returned by the function as the output of the function.

We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

```python
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
```

`return` only makes sense in the context of a function, and **can never be used outside of a function**. `return` is usually the last line of the function because Python stops executing the body of a function once it hits a `return` statement; any lines after it would never run.

**Note:** Using `return` inside a function tells Python what value the function evaluates to, or outputs. A function is not required to have a return statement if its job is not to evaluate to a value. For example, the `print` function does not return a useful value because its job is to print a value out to the screen. Viewing a value on the screen and outputting a value, so it can be saved or used further, are two very different concepts.
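
To make the difference between returning and printing concrete, here is a minimal sketch. The function names `double_and_return` and `double_and_print` are made up for this illustration and are not used elsewhere in the lab.

```python
# Hypothetical helper names, used only for this illustration.

def double_and_return(x):
    """Returns twice x, so the caller can save or reuse the result."""
    return 2 * x

def double_and_print(x):
    """Prints twice x, but does not return it."""
    print(2 * x)

saved = double_and_return(4)    # saved is 8 and can be used in later code
nothing = double_and_print(4)   # displays 8, but nothing is Python's special value None

saved + 1                       # works, because saved is a number
```

Trying `nothing + 1` would fail, because a function without a `return` statement hands back `None` rather than a number.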

#### Question 1.1.

Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.


In [None]:
def ...
    """ ... """
    ... = ...
    return ...

twenty_percent = ...
twenty_percent

In [None]:
grader.check("q1_1")

Like you’ve done with built-in functions in previous labs (`max`, `abs`, etc.), you can pass in named values as arguments to your function.

#### Question 1.2.

Remember, functions can be used on values assigned to variables. Use the `to_percentage` function to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

**Note:** You don't need to define `to_percentage` again. Like other named values and the built-in functions (like `max` and `sum`), functions stick around after you define them.

In [None]:
a_proportion = 2**(.5)/2
a_percentage = ...
a_percentage

In [None]:
grader.check("q1_2")

### Scope
Here's something important about functions: the names assigned *within* a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even if you created a variable called `factor` and defined `factor = 100` inside of the body of the `to_percentage` function and then called `to_percentage`, `factor` would not have a value assigned to it outside of the body of `to_percentage`. 

**Terminology:** We say that the **scope** of these types of variables is **local to the function**, meaning that they are only accessible to code inside the function.

For example, `factor` was assigned a value within the function `to_percentage`, and is a local variable to that function. Referencing `factor` using code that is not within the body of the `to_percentage` function will produce the error shown below.

![](error.png)

It's as if you never had a named object called `factor`.

**Note:** The variables defined in the signature of the function are considered to be created *within* the function as well, and as such are local to the function and will not be accessible outside of the function.
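
Local scope can be sketched in code as follows. The function `to_permille` and the name `scale` are hypothetical, chosen so the example doesn't interfere with your own `to_percentage`. The `try`/`except` lines are not something this lab requires; they are only there so the cell finishes running instead of stopping at the error.

```python
# Hypothetical function, used only to illustrate local scope.

def to_permille(proportion):
    """Converts a proportion to parts per thousand."""
    scale = 1000              # scale is local to to_permille
    return proportion * scale

result = to_permille(0.5)     # the call works and returns a value

try:
    scale                     # scale was never defined out here...
    scale_visible_outside = True
except NameError:
    scale_visible_outside = False   # ...so Python raises a NameError
```

After running this, `scale_visible_outside` is `False`: the name `scale` existed only while `to_permille` was running.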

#### Question 1.3.

As we've seen with built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".) You can use as many lines inside of the function to do this as you’d like.

**Hint:** To remove all the "a"s from a string, you can use `that_string.replace("a", "")`.  The `.replace` method for strings returns a new string, so you can call `replace` multiple times, one after the other. 
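
As a small illustration of calling `replace` several times in a row, here is a sketch on a made-up string that has nothing to do with vowels.

```python
# A made-up example string; the same pattern works for any characters.
phone = "919-555-0100"

no_dashes = phone.replace("-", "")                              # one call
no_dashes_or_fives = phone.replace("-", "").replace("5", "")    # two calls, chained

# The original string is unchanged, because replace returns a new string.
phone
```

Each `replace` call acts on the string produced by the call before it, so the order of the calls reads left to right.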


In [None]:
def disemvowel(a_string):
    """Removes all vowels from a string"""
    ...

# An example call to your function is provided.
# It's helpful to run an example call from
# time to time while you're writing a function,
# to see how it currently works. Feel free to add
# additional test calls or edit this one
# to make sure your function works as intended

disemvowel("can you read this without vowels?")

In [None]:
grader.check("q1_3")

### Looking up information about a function

If you ever forget how to use a particular function, you can use Python to help you learn more about functions. By starting a code cell with a question mark (`?`) and then the name of a function, you can run the cell to learn all about it.

In [None]:
?max

Notice that this command shows you the entire `Docstring` (the portion of the function inside the triple quotes, right underneath the signature). This is why writing a helpful `Docstring` is valuable, especially if you think others are going to use your code in the future. It's common to show the signature of the function, including the names of the arguments, as a hint to the user so they can infer which arguments to include and in which order. This even works with user defined functions, like `disemvowel`.

In [None]:
?disemvowel

### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the jam filling.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.
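
Here is a minimal sketch of one function calling another, on an unrelated task. Both function names are hypothetical and are not part of this lab's questions.

```python
def to_celsius(fahrenheit):
    """Converts a temperature from Fahrenheit to Celsius."""
    return (fahrenheit - 32) * 5 / 9

def is_freezing(fahrenheit):
    """Returns True if a Fahrenheit temperature is at or below 0 degrees Celsius."""
    # Reuse to_celsius instead of repeating the conversion formula here.
    return to_celsius(fahrenheit) <= 0

is_freezing(20)   # True
is_freezing(50)   # False
```

If the conversion formula ever needed fixing, it would only have to change in one place, inside `to_celsius`.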

#### Question 1.4.

Write a function called `num_non_vowels`.  It should take a string as its argument and return an integer number.  That number should be the number of characters in the argument string that *aren't* vowels. You should use the `disemvowel` function you wrote above inside of the `num_non_vowels` function.

**Hint:** Remember, the function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def num_non_vowels(a_string):
    """The number of characters in a string, minus the vowels."""
    ...

# Try calling your function yourself to 
# make sure the output is what you expect.

num_non_vowels('try me out with a few phrases')

In [None]:
grader.check("q1_4")

Functions can also encapsulate code that *displays output* instead of computing a value. For example, if you call `print` inside a function, and then call that function, something will get printed.

The `movies_by_year` dataset in the textbook has information about movie sales in recent years.  Suppose you'd like to display the year with the 5th-highest total gross movie sales, printed in a human-readable way.  

**One way to do this could be the block of code below.** Read through it carefully so you understand the operations this code completes, and then run the cell to verify it works.

In [None]:
movies_by_year = Table.read_table('movies_by_year.csv')
fifth_from_top_movie_year = movies_by_year.sort('Total Gross', descending=True).column('Year').item(5-1)
print('Year number', 5, 'for total gross movie sales was', fifth_from_top_movie_year)

**Oops!** After writing this, you realize you also wanted to print out the 2$^\text{nd}$ and 3$^\text{rd}$ highest years.  Instead of copying and editing your code for these additional cases, you decide to put the code into a function.  Since the rank will vary from call to call, you should make that an argument (input) to your function and name it `k`.

#### Question 1.5.

Write a function called `print_kth_top_movie_year`.  It should take a single argument, the rank of the year (like 2, 3, or 5 in the above examples).  It should **`print`** out a message exactly like the one above.  

**Note:** Your function shouldn't have a `return` statement.


In [None]:
def print_kth_top_movie_year(k):
    ...
    ...
    print(...)
    
# Example calls to your function
print_kth_top_movie_year(2)
print_kth_top_movie_year(3)

In [None]:
grader.check("q1_5")

### OPTIONAL: `interact`

One interesting Python library is [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/8.1.0/index.html), in particular the [`interact`](https://ipywidgets.readthedocs.io/en/8.1.0/examples/Using%20Interact.html) function. It allows you to create interactive elements in your notebook, like sliders and dropdown menus, and use the selections as inputs to a function. Running the cell below will create a dropdown menu for the elements in the array `np.arange(1, 10)` and display the results of passing the selected element to the function `print_kth_top_movie_year`. Run it and see how it works.

In [None]:
# interact also allows you to pass in an array for a function argument. 
# It will then present a dropdown menu of options.

interact(print_kth_top_movie_year, k=np.arange(1, 10));

Running the cell below will create a slider that users can use to select the value to be used as the argument `k`.

In [None]:
# You can also create a slider to select values

interact(print_kth_top_movie_year, k=widgets.IntSlider(min=1, max=10, step=1, value=5));

This topic is not required for you to learn, but we include it anyways in case you are inspired to learn more about the `interact` library when building your own notebooks in the future.

## 2. Functions and CEO Incomes

Let's take a look at the 2019 compensation of North Carolina CEOs. The following table contains information on the top 61 highest paid CEOs for companies headquartered in North Carolina. The data was compiled from the [AFL-CIO Executive Paywatch database](https://aflcio.org/paywatch/highest-paid-ceos), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies.

We've copied the raw data from the AFL-CIO database into a file called `nc-ceo-pay.csv`.

In [None]:
raw_compensation = Table.read_table('nc-ceo-pay.csv')
raw_compensation

We want to compute the average of the CEOs' pay. If you were to attempt to use the `np.average` function on this column, you'd run into an error:

```python
np.average(raw_compensation.column('Total Pay'))
```

![](error02.png)

You would specifically see a `TypeError`. The more cryptic parts of this error message are explaining that there's an issue using the `add` function because the data doesn't match the expected type. It found `strings`, which is unusual because we'd expect salary information to be numerical information.

Let's examine why this error occurred by looking at the values in the `Total Pay` column.     

#### Question 2.1.

Set `total_pay_type` to the type of the first value in the "Total Pay" column of the `raw_compensation` Table. Given the error message above, we might expect the type to be a string, or `str`.

**Reminder:** You can use the `type` function on this element to determine its type. 

In [None]:
total_pay_type = ...
total_pay_type

In [None]:
grader.check("q2_1")

#### Question 2.2.

You should have found that the values in the `Total Pay` column are strings. It doesn't make sense to take the average of string values, so we need to convert them to numbers if we want to do this. 

Extract the first value in the column `Total Pay`.  It's [Brian Moynihan's](https://en.wikipedia.org/wiki/Brian_Moynihan) pay in 2019, in dollars.  Call it `brian_moynihan_pay_string`.


In [None]:
brian_moynihan_pay_string = ...
brian_moynihan_pay_string

In [None]:
grader.check("q2_2")

#### Question 2.3.

Convert `brian_moynihan_pay_string` to a number of *dollars*. Since number of dollars is numerical, your goal should be to convert the `string` to a `float`, which you can do using the function named `float`. However, attempting to just use this function on the string above will result in an error:

![](error03.png)

That's because the `float` function expects the string you provide to look like a number, made up of the digits `0-9` (optionally with a decimal point). The dollar sign (`$`) and commas (`,`) are causing this issue. The trailing space at the end of the string might be unsightly, but it will not prevent the `float` function from running. 

We must "clean" these strings so they only contain the correct values. Doing so by hand, one row at a time, would be incredibly time consuming and tedious, so we should figure out how to use Python commands to help.

**Some hints**, as this question requires multiple steps:
- The string method `strip` will be useful for removing the dollar sign; it removes a specified character from the start or end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"`. 
- The string method `replace` from earlier in this lab will be useful for removing the commas; remember this method can remove any character by replacing it with an empty string `''`
- You'll also need the function `float`, which converts a string that looks like a number to an actual number.  
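
The behavior of `float` and `strip` described in these hints can be checked on a few made-up strings (none of them come from the CEO table). The `try`/`except` lines are only there so the cell keeps running past the error; the lab doesn't require you to use them.

```python
# Made-up strings, not taken from the CEO table.
float("12.5 ")                 # works: surrounding spaces are ignored
float("45%".strip("%"))        # works once the percent sign is stripped

try:
    float("1,250")             # the comma is not allowed
    comma_ok = True
except ValueError:
    comma_ok = False           # float raised a ValueError
```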


In [None]:
brian_moynihan_pay = ...
brian_moynihan_pay

In [None]:
grader.check("q2_3")

To compute the average pay of North Carolina CEOs, we would need to do this for every CEO in the Table. That would involve running a version of this code 61 times, which seems very inefficient for us to do "manually", even with copying and pasting.

Functions can be used to help speed up this process. First, define a new function to give the expression that converts "total pay" strings to numeric values a name that we can reference.  Later in this lab, we'll see the payoff: we can call that function on every pay string in the Table at once using a single command; no copy and paste needed!

#### Question 2.4.

Copy the expression you used to compute `brian_moynihan_pay`, and use it as the return expression of the function below. But make sure you replace the specific name used in the previous question (`brian_moynihan_pay_string`) with the more generic name `pay_string` that is specified in the signature of the `convert_pay_string_to_number` function. This function should work for any string found in the `Total Pay` column, which means dollar signs (`$`) and commas (`,`) should be removed!

**Remember**: When working with functions, in general, you should not reference any variable outside of the function. Usually, you want to be working with the arguments that are passed into it, such as `pay_string` for this function. If you're using `brian_moynihan_pay_string` within your function, you're referencing an outside variable!

In [None]:
def convert_pay_string_to_number(pay_string):
    """Converts a pay string like '$100,000,000 ' to a number of dollars."""
    ...

# An example function call to see if your function works
convert_pay_string_to_number('$123,456 ')

In [None]:
grader.check("q2_4")

Again, the benefit of using a function called `convert_pay_string_to_number` is that it can convert *any* string with this format to a float representing dollars, not just a specific row of the Table.

We can call our function just like we call the built-in functions we've seen already throughout the course. The function you've written takes one argument (a string) and it returns a float.

In [None]:
# We can also compute Susan DeVore's pay in the same way using this function
convert_pay_string_to_number(raw_compensation.where('Name', are.containing('Susan')).column("Total Pay").item(0))

So, what else have we gained by defining the `convert_pay_string_to_number` function?

## 3. `apply`ing Functions

In many ways Python treats functions as it treats any other named object. Just like integers, strings, and floats, functions can be used in assignment statements and as inputs to other functions. 

For example, if we want to we can create a new name for the built-in function `max`.

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around.

In [None]:
max(2, 6)

### Careful!
Because Python treats functions like any other type of named object, you can accidentally overwrite the function like you could any other piece of data.

Let's look at what happens when we assign the built-in function `max` a non-function value, like an integer. You'll notice that a `TypeError` will occur when you try running `max(2, 6)` on line 2 in the cell below. That's because in line 1, `max` was redefined to no longer be the `max` function that you know and love, but rather the integer with a value of `6`. Things like integers and strings are not callable like functions, which explains the error message you get when you run the cell below. 

![](error04.png)

**Hint**: If you ever try running a function and get this error message, look out for any functions that might have been renamed (likely by accident) somewhere else in your notebook.
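
The same mistake can be reproduced safely with a name of our own, so the real `max` stays intact. The name `biggest` is hypothetical, and the `try`/`except` lines are only there to capture the error message instead of stopping the cell.

```python
# Hypothetical name; using our own name avoids breaking the real max.
biggest = max
biggest(2, 6)          # 6, because biggest is another name for max

biggest = 6            # oops: biggest now names an integer, not a function

try:
    biggest(2, 6)      # calling an integer fails
    still_callable = True
except TypeError as error:
    still_callable = False
    message = str(error)   # the text of the TypeError
```

If you do overwrite a built-in such as `max` in your notebook, restarting the kernel brings the original back; running `del max` also removes your version so the built-in is visible again.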

### Why is this useful?

It may seem odd to give a function a new name, but this feature in Python does have its uses. Since functions are treated just like other data types, it's possible to pass functions as arguments to other functions, just like you might pass an `int` or `str` object. 

Here's a simple but not-so-practical example: we can make an array of functions by providing function names as arguments to the `make_array` function.

In [None]:
make_array(max, np.average, are.equal_to)

#### Question 3.1.

Make an array containing any 3 other functions you've seen.  Call it `some_functions`.


In [None]:
some_functions = ...
some_functions

In [None]:
grader.check("q3_1")

Working with functions as values can lead to some funny-looking code. For example, see if you can figure out why the following code works. Check your explanation with a classmate or your instructor.

In [None]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)

### The `apply` method

In this course, the most useful example of passing functions as inputs to other functions is the Table method named `apply`.

`apply` calls a function many times, once on *each* element in a column of a table.  It returns an *array* of the results of applying the provided function on each element.  

Run the cell below to use `apply` to convert every CEO's pay to a number, using the function you defined earlier in this lab.

In [None]:
raw_compensation.apply(convert_pay_string_to_number, 'Total Pay')

Here's an illustration of what that did:

<img src="apply.png"/>

**Note:** We didn’t write `raw_compensation.apply(convert_pay_string_to_number(), 'Total Pay')` or `raw_compensation.apply(convert_pay_string_to_number('Total Pay'))`. 

We provided only the name of the function, with **no parentheses**, and the name of the column to use it on. The `apply` method will call the function you specify, in this case `convert_pay_string_to_number`, using each value in the specified column, in this case `'Total Pay'`, automatically for you! It's a very quick operation, especially compared to going through one element at a time and calling the function.
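
To see why the function is passed without parentheses, here is a plain-Python sketch of the idea behind `apply`. This is not the `datascience` library's actual code: `apply_sketch` and `add_exclamation` are hypothetical names, a list stands in for the array, and the `for` loop is a feature this lab doesn't otherwise rely on.

```python
# A plain-Python sketch of the idea behind apply. This is NOT the
# datascience library's actual code; the names here are hypothetical.

def apply_sketch(function, values):
    """Calls function once on each value and collects the results."""
    results = []
    for value in values:
        results.append(function(value))   # the call happens here, with parentheses
    return results

def add_exclamation(text):
    """A small stand-in function to apply."""
    return text + "!"

# Pass the function's name without parentheses, just as with apply.
apply_sketch(add_exclamation, ["hi", "bye"])
```

Writing `apply_sketch(add_exclamation(), ...)` would try to call `add_exclamation` immediately, with no argument, before `apply_sketch` ever received it.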

#### Question 3.2.

Using `apply`, make a table that's a copy of `raw_compensation` with one additional column called `Total Pay ($)`.  That column should contain the result of applying `convert_pay_string_to_number` to the `Total Pay` column (as we did above).  Call the new table `compensation`.


In [None]:
compensation = raw_compensation.with_column(
    'Total Pay ($)',
    ...
    )
compensation

In [None]:
grader.check("q3_2")

Now that we have all the pays as numbers, we can learn more about them through computation.

#### Question 3.3.

Compute the average total pay of the CEOs in the dataset.


In [None]:
average_total_pay = ...
average_total_pay

In [None]:
grader.check("q3_3")

#### Question 3.4.

Companies pay executives in a variety of ways: 
* in salary,
* by granting stock and/or options,
* with other equity in the company, or
* with ancillary benefits (like private jets).
  
Compute the proportion of each CEO's Total Pay that was salary. Assign your answer, which should be an array of numbers (one for each CEO in the data set), to the name `salary_proportion`.

In [None]:
salary_proportion = ...
salary_proportion

In [None]:
grader.check("q3_4")

### Why is `apply` useful?

Some of the built-in functions and operators and almost all of the functions in the `numpy` library work equally well with single object inputs or an array of objects. You don't need to use a method like `apply` to call the function on each element in the array because they automatically work on each element of an array. These types of functions are called **vectorized** in data science language. However, there are many functions that aren't written in a way that they work on each element in an array automatically. The `disemvowel` function in this lab is one example; it will work on a single string, but it is not written to work on an array of strings.  The `apply` method allows you to use any function on every element in an array without knowing how to properly **vectorize** your functions.
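
The contrast between vectorized and non-vectorized functions can be sketched as follows. The function `remove_dashes` is a hypothetical stand-in for any function built from string methods, and the `try`/`except` lines and the bracketed loop on the last line go beyond what this lab requires; they are only here to keep the example short.

```python
import numpy as np

proportions = np.array([0.1, 0.25, 0.5])

# Arithmetic is vectorized: it works on one number or on every element of an array.
0.25 * 100
proportions * 100

# A function built from string methods is not vectorized.
def remove_dashes(text):       # hypothetical example function
    return text.replace("-", "")

words = np.array(["a-b", "c-d"])
try:
    remove_dashes(words)       # arrays have no .replace method
    works_on_array = True
except AttributeError:
    works_on_array = False

# Calling it once per element does work, which is the job apply does for a column.
[remove_dashes(word) for word in words]
```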

#### Question 3.5.

Extend the table so it includes a new column labeled `'Salary (%)'` which contains the **proportions** contained in the array `salary_proportion`; you will convert these proportions to percentages in the next question. 

Call this new table `ceo_salary_percent`.

In [None]:
ceo_salary_percent = ...
ceo_salary_percent

In [None]:
grader.check("q3_5")

<!-- BEGIN QUESTION -->

#### Question 3.6.

Use the `set_format` Table method to format the `'Salary (%)'` column in `ceo_salary_percent` so the values display as percentages out of 100 instead of proportions out of 1. As a reminder, the syntax for `set_format` is:

```python
Table.set_format(column_or_columns, formatter)
```

Forget how to use the formatter? Revisit the example in [Chapter 6: Tables](https://inferentialthinking.com/chapters/06/Tables.html)

In [None]:
...
ceo_salary_percent

<!-- END QUESTION -->

## 4. Histograms

Earlier, we computed the average pay among the CEOs in our 61-CEO dataset. The average doesn't tell us everything about the amounts CEOs are paid, though. Maybe just a few CEOs make the bulk of the money, even among these 61.

We can use a histogram method to display the distribution of a set of numbers. The table method `hist` takes a single argument, the name of a column of numbers. It produces a histogram of the numbers in that column.

<!-- BEGIN QUESTION -->

#### Question 4.1.

Make a histogram of the total pay of the CEOs in `compensation`. Your bins should start at $\$ 0$ and end at $ \$ 25,000,000$ and be $ \$ 1,000,000$ wide. There is no automatic grader check on this question, so check with a classmate to make sure you have the right plot.

**Note:** Because your units are dollars (not thousands or millions of dollars) each bar has a very large number for its width. That means that the bars will have very small numbers for their heights, since the area of each bar is still going to be between 0 and 100 percent.
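
To see why the heights come out so small, here is the arithmetic for a single bar. The 10 percent figure is made up for illustration and is not taken from the CEO data.

```python
# Hypothetical bar: suppose 10 percent of CEOs fall in one bin.
percent_in_bin = 10
bin_width = 1000000            # each bin is $1,000,000 wide

# Area of a bar (in percent) = height * width, so height = percent / width.
bar_height = percent_in_bin / bin_width   # units: percent per dollar
bar_height
```

The height works out to about 0.00001 percent per dollar, which is why the vertical axis of the histogram shows such tiny numbers.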

In [None]:
...

<!-- END QUESTION -->

#### Question 4.2.

How many CEOs made more than $10 million in total pay? Compute the value using code (`where` and `num_rows` will be your friend!), then check that the value you found is consistent with what you see in the histogram.


In [None]:
num_ceos_more_than_10_million = ...
num_ceos_more_than_10_million

In [None]:
grader.check("q4_2")

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `lab04_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `lab04_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)