# Lab 4: Functions and Visualizations

Welcome to Lab 4! This week, we'll learn about functions, table methods such as `apply`, and how to generate visualizations! 

Recommended Reading:

* [Applying a Function to a Column](https://www.inferentialthinking.com/chapters/08/1/applying-a-function-to-a-column.html)
* [Visualizations](https://www.inferentialthinking.com/chapters/07/visualization.html)

First, set up the notebook by running the cell below.

In [16]:
import numpy as np
import pandas as pd

# These lines set up graphing capabilities.
import matplotlib.pyplot as plt

## 1. Functions

In this section, we'll look at campaign spending from the 2016 U.S. House of Representatives elections. All of this data comes from the [FEC](https://classic.fec.gov/disclosurehs/hsnational.do).

We've copied the raw data from the FEC into a file called `campaign_spending_2016.csv`. This data contains the candidate's name, their party, their state, whether they are an incumbent or not, how much they raised, and how much they spent.

In [17]:
campaign_spending = pd.read_csv('campaign_spending_2016.csv')
campaign_spending.head()

Unnamed: 0,name,party,state,incumbent_challenge_full,raised,spent
0,"COX, JOHN R.",REP,AK,Challenger,$0,$0
1,"DUNBAR, FORREST",DEM,AK,Challenger,$550.00,"$7,563.97"
2,"LEDOUX, GABRIELLE R",REP,AK,Challenger,$0,$0
3,"CHESNUT, DEBRA SUE",DEM,AK,Challenger,$0,$0
4,"VONDERSAAR, FRANK J",DEM,AK,Challenger,$0,$0


We want to compute the average raised in 2016. Try running the cell below.

In [18]:
np.average(campaign_spending["raised"])

TypeError: unsupported operand type(s) for /: 'str' and 'int'

You should see an error. Let's examine why this error occurred by looking at the values in the "raised" column. Check the data frame's `dtypes` attribute and you will see that "raised" is of type `object`, which is how pandas typically stores strings. But we can't take the average of strings! We will need to convert these strings to numbers that we can actually use.

In [None]:
campaign_spending.dtypes
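
As a small illustration of the difference between an `object` column and a numeric one, the sketch below builds two made-up Series (`toy_text` and `toy_numbers` are hypothetical names, not part of the lab data). It only shows the problem; it does not perform the conversion that Question 1 asks for.

```python
import pandas as pd

# Hypothetical toy data, not taken from campaign_spending_2016.csv.
toy_text = pd.Series(["$1.50", "$20"], dtype=object)  # dollar amounts stored as text
toy_numbers = pd.Series([1.5, 20.0])                  # the same amounts stored as floats

print(toy_text.dtype)     # object: each entry is a Python string
print(toy_numbers.dtype)  # float64: each entry is a number

# Averaging only makes sense for the numeric Series.
toy_mean = toy_numbers.mean()
print(toy_mean)
```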

**Question 1.** It doesn't make sense to take the average of string values, so we need to convert them to numbers if we want to do this. See if you can figure out a way to do it using [this](https://stackoverflow.com/a/32465968) for help.

In [None]:
campaign_spending['raised'] = ...
campaign_spending['spent'] = ...
# Leave these lines of code to check your work.
# You should see no $ in raised or spent.
# raised and spent should both be floats now.
print(campaign_spending.head())
print(campaign_spending.dtypes)

Now re-run this code to get the average raised and average spent in 2016 House races.

To make it look nicer, we are going to `round` the numbers to two decimal places. See [this](https://www.tutorialspoint.com/python/number_round.htm) if you want more details on how to round numbers.

In [None]:
print("Amount raised was $", round(np.average(campaign_spending["raised"]), 2))
print("Amount spent was $", round(np.average(campaign_spending["spent"]), 2))
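
For reference, here is a minimal sketch of `round` on its own. The amount is a made-up number chosen for illustration, not a value from the dataset.

```python
# Hypothetical dollar amount, chosen only for illustration.
average_amount = 1234.5678

# The second argument is the number of digits to keep after the decimal point.
rounded_amount = round(average_amount, 2)
print("Rounded amount is $", rounded_amount)
```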

Notice how we had to do this for both "raised" and "spent". Imagine we had to do this for hundreds of columns, such as in a monthly campaign finance report. That would take a lot of time.

This is where functions come in.  First, we'll define a new function, giving a name to the expression that converts strings to numeric values.  Later in this lab we'll see the payoff: we can call that function on every string in the dataset at once.

**Question 2.** Copy the expression you used in Question 1 as the `return` expression of the function below, but replace the specific "raised" column with the generic `dollar_string` name specified in the first line of the `def` statement.

*Hint*: When dealing with functions, you should generally not be referencing any variable outside of the function. Usually, you want to be working with the arguments that are passed into it, such as `dollar_string` for this function.

In [None]:
def convert_string_to_number(dollar_string):
    """Converts a string like '$550' to a number of dollars."""
    return ...

Running that cell doesn't convert any particular string. Instead, it creates a function called `convert_string_to_number` that can convert any string with the right format to a number representing dollars.

We can call our function just like we call the built-in functions we've seen. It takes one argument (a string column from a data frame) and it returns a numeric column.

In [None]:
# Re-load the data so "raised" and "spent" are in the string format.
campaign_spending = pd.read_csv('campaign_spending_2016.csv')
# Now let's run the function.
campaign_spending['raised'] = convert_string_to_number(campaign_spending['raised'])
campaign_spending['spent'] = convert_string_to_number(campaign_spending['spent'])
campaign_spending.head()
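
`convert_string_to_number` works on a whole column at once. For a function written to handle a single value, pandas also offers `Series.apply`, which calls the function on each entry in turn. A minimal sketch, where the helper `shout` and the Series `toy_parties` are both made up for illustration:

```python
import pandas as pd

def shout(text):
    """Returns text in upper case followed by an exclamation mark (hypothetical helper)."""
    return text.upper() + "!"

# Hypothetical toy column, not taken from the lab data.
toy_parties = pd.Series(["dem", "rep"])

# apply calls shout once per entry and collects the results in a new Series.
shouted = toy_parties.apply(shout)
print(shouted.tolist())
```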

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor

Note that `return` inside a function gives the function call a value, while `print`, which we have used before, is a function that returns `None` and just displays a value in the output. The two are **very** different.
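
The same parts (`def`, name, signature, documentation, body, `return`) appear in any function. As a sketch, here is a second function built the same way; `minutes_to_hours` is a hypothetical example and not part of the lab.

```python
def minutes_to_hours(minutes):
    """Converts a number of minutes to a number of hours."""
    minutes_per_hour = 60
    return minutes / minutes_per_hour

# Calling the function gives back the value that follows return.
half_hour = minutes_to_hours(30)
print(half_hour)
```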

**Question 3.** Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

twenty_percent = ...
twenty_percent

Like the built-in functions, you can use named values as arguments to your function.

**Question 4.** Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Just like other named things, functions stick around after you define them.

In [None]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

Here's something important about functions: the names assigned within a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside  the body of the `to_percentage` function up above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

In [None]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor
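
The same behavior can be sketched without stopping the notebook by catching the error. The names `double` and `local_multiplier` are hypothetical and chosen only for this illustration.

```python
def double(x):
    """Returns twice its argument."""
    local_multiplier = 2  # this name exists only while double is running
    return x * local_multiplier

result = double(4)

# Referring to local_multiplier outside the function raises a NameError.
try:
    local_multiplier
    local_name_leaked = True
except NameError:
    local_name_leaked = False

print(result, local_name_leaked)
```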

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

**Question 5.** Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".)

*Hint:* To remove all the "a"s from a string, you can use `that_string.replace("a", "")`.  The `.replace` method for strings returns another string, so you can call `replace` multiple times, one after the other. 
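
As a sketch of chaining `replace` calls, the example below strips some non-vowel characters from a made-up string. The string and variable names are hypothetical.

```python
# Hypothetical example string; the characters removed here are not vowels.
phone = "555-867-5309 x12"

# Each replace returns a new string, so the next replace can be called on its result.
digits_only = phone.replace("-", "").replace(" ", "").replace("x", "")
print(digits_only)
```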

In [None]:
def disemvowel(a_string):
    ...
    ...

# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
disemvowel("Can you read this without vowels?")

##### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the sprinkles.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.
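
The recipe idea can be sketched with spaces instead of vowels. Both functions below are hypothetical examples; the second one is defined in terms of the first.

```python
def remove_spaces(text):
    """Returns a copy of text with all space characters removed."""
    return text.replace(" ", "")

def num_non_spaces(text):
    """The number of characters in text that aren't spaces."""
    # Reuse remove_spaces rather than repeating its logic here.
    return len(remove_spaces(text))

count = num_non_spaces("a b  c")
print(count)
```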

**Question 5.** Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  The number should be the number of characters in the argument string that aren't vowels.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def num_non_vowels(a_string):
    """The number of characters in a string, minus the vowels."""
    ...

# Try calling your function yourself to make sure the output is what
# you expect.

Functions can also encapsulate code that *does things* rather than just computing values.  For example, if you call `print` inside a function, and then call that function, something will get printed.

##### Print is not the same as Return
Let's look at an example of a function that prints a value but does not return it.

In [None]:
def print_number_five():
    print(5)

In [None]:
print_number_five()

However, if we try to use the output of `print_number_five()`, we see that we get an error when we try to add the number 5 to it!

In [None]:
print_number_five_output = print_number_five()
print_number_five_output + 5

It may seem that `print_number_five()` is returning a value, 5. In reality, it just displays the number 5 to you without giving you the actual value! If your function prints out a value without returning it and you try to use it, you will run into errors so be careful!
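
A side-by-side sketch of the difference, using two hypothetical functions: a function with no `return` statement hands back the special value `None`, while one that returns a number can be used in arithmetic.

```python
def show_seven():
    print(7)   # displays 7, but the call itself evaluates to None

def give_seven():
    return 7   # the call evaluates to the number 7

shown = show_seven()
given = give_seven()

print(shown is None)  # the printing function returned nothing usable
total = given + 5     # works, because given is a number
print(total)
```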

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the number 1 or the text "the"!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max

## 3. Histograms
Earlier, we computed the average amounts raised and spent by candidates in our dataset.  The average doesn't tell us everything about the amounts raised and spent, though.  Maybe just a few campaigns spend and raise the bulk of the money.

We can use a *histogram* to display more information about a set of numbers.  The matplotlib function `plt.hist` takes the numbers to plot (a single column, or a list of columns to draw side by side) and produces a histogram of them.
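
To make concrete what a histogram counts, here is a sketch using `np.histogram`, which computes the number of values falling in each bin without drawing anything. The values and bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical values and bin edges, not taken from the lab data.
toy_amounts = [1, 2, 2, 3, 8]
bin_edges = [0, 5, 10]  # two bins: 0 up to 5, and 5 up to 10

# counts[i] is how many values fall in the i-th bin.
counts, edges = np.histogram(toy_amounts, bins=bin_edges)
print(counts)
```

A plotted histogram draws one bar per bin, with each bar's height given by the corresponding count.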

The code below produces a histogram of the amounts raised and spent.

In [None]:
# For help, see
# https://medium.com/python-pandemonium/data-visualization-in-python-histogram-in-matplotlib-dce38f49f89c
legend = ['Raised', 'Spent']
plt.hist([campaign_spending["raised"], campaign_spending["spent"]], color=['orange', 'green'])
plt.xlabel("Amount Raised/Spent")
plt.ylabel("Frequency")
plt.legend(legend)
plt.title('Amount Raised and Spent by 2016 Congressional Candidates')
plt.ticklabel_format(style='plain')
plt.xticks(rotation=45)
plt.show()

**Question 6.** Add comments to the above code block noting what each line of code is doing.

**Question 7.** Looking at the histogram, how many campaigns raised more than \$6,000,000? Using code, how many campaigns raised more than \$6,000,000?

In [None]:
...

**Question 8.** Come up with a better way to display this data, while still using a histogram or multiple histograms. 

In [None]:
## Insert your better graph(s) here

**Question 9.** See examples of different types of visualizations [here](https://matplotlib.org/gallery/index.html). Make a visualization. Tell a story with the data and the visualization. You can do anything but a histogram. Make it nice. Make it interesting.

In [None]:
## Insert your graph(s) here

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

**To turn in your lab, you will need to submit a PDF through Canvas. You can download a notebook by opening it, turning Edit mode on, then navigating to File -> Download as -> PDF.**