# Lab 4: Functions and Visualizations

---

Welcome to Lab 4! This week, we'll learn about functions, table methods such as `apply`, and how to generate visualizations!

Recommended Reading:

* [Applying a Function to a Column](https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html)
* [Visualizations](https://inferentialthinking.com/chapters/07/Visualization.html)



# Before You Begin: Disable AI Assistance

To ensure you learn the concepts and complete this assignment based on your own understanding, you are *strongly* encouraged to turn off Google Colab's built-in generative AI features before you begin. This will also help you prepare for the midterms and final project.

**Follow these steps:**
1.  Go to the **Edit** menu at the top of the page.
2.  Click on **Notebook settings**.
3.  Check the box next to **"Hide generative AI features"**.
4.  Click **Save**.

This will prevent Google Gemini from suggesting or writing code for you, allowing you to focus on solving the problems yourself.


# Set-up
First, set up the notebook by running the cell below.

In [None]:
import numpy as np
import pandas as pd

# These lines set up graphing capabilities.
import seaborn as sns

## 1. Functions

In this question, we'll look at campaign spending from the 2016 U.S. House of Representatives elections. All of this data comes from the [FEC](https://classic.fec.gov/disclosurehs/hsnational.do).

We've copied the raw data from the FEC into a file called `campaign_spending_2016.csv`. This data contains the candidate's name, their party, their state, whether they are an incumbent or not, how much they raised, and how much they spent.

In [None]:
campaign_spending = pd.read_csv('https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab4/campaign_spending_2016.csv')
campaign_spending.head()

Unnamed: 0,name,party,state,incumbent_challenge_full,raised,spent
0,"COX, JOHN R.",REP,AK,Challenger,$0,$0
1,"DUNBAR, FORREST",DEM,AK,Challenger,$550.00,"$7,563.97"
2,"LEDOUX, GABRIELLE R",REP,AK,Challenger,$0,$0
3,"CHESNUT, DEBRA SUE",DEM,AK,Challenger,$0,$0
4,"VONDERSAAR, FRANK J",DEM,AK,Challenger,$0,$0


We want to compute the average raised in 2016. Try running the cell below.

In [None]:
np.average(campaign_spending["raised"])

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

You should see an error. Let's examine why this error occurred by looking at the values in the "raised" column. Check the `dtypes` attribute and you will see that `raised` is of type `object`. In pandas, this is how string columns are typically stored (more technically, columns with mixed types are also stored with the `object` dtype). But we can't take the average of strings! We will need to convert these strings to numbers that we can actually use.

In [None]:
campaign_spending.dtypes

name                        object
party                       object
state                       object
incumbent_challenge_full    object
raised                      object
spent                       object
dtype: object
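
To see why text cannot be averaged, here is a minimal sketch using two plain Python strings copied from the table above. The variable names are only for illustration and are not part of the lab's code.

```python
# Two dollar amounts stored as text, in the same format as the "raised" column.
first_amount = "$550.00"
second_amount = "$7,563.97"

# "+" on strings glues the text together instead of adding the amounts.
combined = first_amount + second_amount
print(combined)

# Python also refuses to read the text as a number while the "$" is present.
try:
    float(first_amount)
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```

Until the extra characters are removed, Python treats these values as text rather than as amounts of money.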

**Question 1.** It doesn't make sense to take the average of string values, so we need to convert them to numbers if we want to do this. See if you can figure out a way to do it using [this](https://stackoverflow.com/a/32465968) for help.

In [None]:
campaign_spending['raised'] = ...
campaign_spending['spent'] = ...
# Leave these lines of code to check your work.
# You should see no $ in raised or spent.
# raised and spent should both be floats now.
print(campaign_spending.head())
print(campaign_spending.dtypes)

Now re-run this code to get the average raised and average spent in 2016 House races.

To make it look nicer, we are going to `round` the numbers to two decimal places. See [this](https://www.tutorialspoint.com/python/number_round.htm) if you want more details on how to round numbers.

In [None]:
print("Amount raised was $", round(np.average(campaign_spending["raised"]), 2))
print("Amount spent was $", round(np.average(campaign_spending["spent"]), 2))
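
The rounding step used above can be tried on its own. This is a small sketch with a made-up number (not a result from the dataset). The f-string at the end is an optional alternative that also adds a thousands separator.

```python
# A made-up value standing in for an average; it is not from the dataset.
average_raised = 1234.5678

# round(number, 2) keeps two digits after the decimal point.
rounded = round(average_raised, 2)
print("Amount raised was $", rounded)

# Optional alternative: an f-string that formats the number as money.
formatted = f"Amount raised was ${average_raised:,.2f}"
print(formatted)
```

Note that `print` with a comma inserts a space between the dollar sign and the number, while the f-string version does not.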

Notice how we had to do this for both "raised" and "spent". Imagine we had to do this for 100s of columns, such as a monthly campaign finance report. That would take a lot of time.

This is where functions come in.  First, we'll define a new function, giving a name to the expression that converts strings to numeric values.  Later in this lab we'll see the payoff: we can call that function on every string in the dataset at once.

**Question 2.** Copy the expression you used in Question 1 as the `return` expression of the function below, but replace the specific "raised" column with the generic `dollar_string` name specified in the first line of the `def` statement.

*Hint*: When dealing with functions, you should generally not be referencing any variable outside of the function. Usually, you want to be working with the arguments that are passed into it, such as `dollar_string` for this function.

In [None]:
def convert_string_to_number(dollar_string):
    """Converts a string like '$550' to a number of dollars."""
    return ...

Running that cell doesn't convert any particular string. Instead, it creates a function called `convert_string_to_number` that can convert any string with the right format to a number representing dollars.

We can call our function just like we call the built-in functions we've seen. It takes one argument (a string column from a data frame) and it returns a numeric column.

In [None]:
# Re-load the data so "raised" and "spent" are in the string format.
campaign_spending = pd.read_csv('https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab4/campaign_spending_2016.csv')
# Now let's run the function.
campaign_spending['raised'] = convert_string_to_number(campaign_spending['raised'])
campaign_spending['spent'] = convert_string_to_number(campaign_spending['spent'])
campaign_spending.head()

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor

Note that `return` inside a function gives the function a value, while `print`, which we have used before, is a function which has no `return` value and just prints a certain value out to the console. The two are **very** different.

##### Triple quotation marks

In the `to_percentage` function, we use triple quotation marks. A string written this way at the top of a function is known as a docstring: a special kind of string used for documentation. Docstrings don't affect how the function runs but are there to describe what the function does. In this case, the docstring explains that the function converts a proportion to a percentage.

In addition to docstrings, you can also use regular comments (with #) to provide further details about specific parts of the code, making it easier for others to understand its logic.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        # Define the conversion factor from proportion to percentage
        factor = 100
        #  Multiply the proportion by 100 to convert it into a percentage
        return proportion * factor
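
One practical difference between a docstring and a `#` comment is that Python keeps the docstring attached to the function. The sketch below repeats the definition shown above and reads the docstring back through the `__doc__` attribute.

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    # Define the conversion factor from proportion to percentage
    factor = 100
    # Multiply the proportion by 100 to convert it into a percentage
    return proportion * factor

# The docstring is stored on the function; the # comments are not.
docstring = to_percentage.__doc__
print(docstring)
```

Calling `help(to_percentage)` in a notebook cell displays the same text together with the function's signature.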

**Question 3.** Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

twenty_percent = ...
twenty_percent

Like the built-in functions, you can use named values as arguments to your function.

**Question 4.** Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Just like other named things, functions stick around after you define them.

In [None]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

Here's something important about functions: the names assigned within a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside  the body of the `to_percentage` function up above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

In [None]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor
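
The same behavior can be checked without stopping the notebook on an error. The following is a minimal sketch with a hypothetical function, `to_proportion`, and a local name, `divisor`, neither of which appears elsewhere in this lab. It catches the `NameError` so the cell still runs to the end.

```python
def to_proportion(percentage):
    """Converts a percentage to a proportion."""
    divisor = 100  # this name exists only while the function is running
    return percentage / divisor

half = to_proportion(50)
print(half)

# Outside the function, the name divisor was never defined.
try:
    divisor
    divisor_is_visible = True
except NameError:
    divisor_is_visible = False
print(divisor_is_visible)
```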

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

**Question 5.** Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".)

*Hint:* To remove all the "a"s from a string, you can use `that_string.replace("a", "")`.  The `.replace` method for strings returns another string, so you can call `replace` multiple times, one after the other.
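
The chaining described in the hint can be seen on an unrelated string. This sketch uses a made-up phone number and removes dashes and spaces rather than vowels.

```python
phone_number = "555-123 4567"  # a made-up example string

# Each .replace returns a new string, so the calls can be chained.
digits_only = phone_number.replace("-", "").replace(" ", "")
print(digits_only)

# The original string is left unchanged.
print(phone_number)
```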

In [None]:
def disemvowel(a_string):
    ...
    ...

# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
disemvowel("Can you read this without vowels?")

##### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the sprinkles.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.
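
As a small illustration of one function building on another, the sketch below reuses `to_percentage` from earlier in this lab inside a second function. The name `share_as_percentage` is hypothetical and chosen only for this example.

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor

def share_as_percentage(part, whole):
    """Hypothetical helper: the percentage of whole that part makes up."""
    # The multiplication by 100 is not repeated here; it is delegated.
    return to_percentage(part / whole)

quarter = share_as_percentage(1, 4)
print(quarter)
```

If the conversion ever needed to change, only `to_percentage` would have to be edited.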

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.

**Question 5.** Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  The number should be the number of characters in the argument string that aren't vowels.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def num_non_vowels(a_string):
    """The number of characters in a string, minus the vowels."""
    ...

# Try calling your function yourself to make sure the output is what
# you expect.

Functions can also encapsulate code that *does things* rather than just computing values.  For example, if you call `print` inside a function, and then call that function, something will get printed.

##### Print is not the same as Return
Let's look at an example of a function that prints a value but does not return it.

In [None]:
def print_number_five():
    print(5)

In [None]:
print_number_five()

However, if we try to use the output of `print_number_five()`, we see that we get an error when we try to add the number 5 to it!

In [None]:
print_number_five_output = print_number_five()
print_number_five_output + 5

It may seem that `print_number_five()` is returning a value, 5. In reality, it just displays the number 5 to you without giving you the actual value! If your function prints out a value without returning it and you try to use it, you will run into errors so be careful!
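
The difference can be made visible by placing a printing function next to a returning one. In this sketch, `return_number_five` is a hypothetical counterpart added only for comparison.

```python
def print_number_five():
    print(5)

def return_number_five():
    return 5

# A function without a return statement hands back the special value None.
printed_output = print_number_five()
print(printed_output is None)

# A returned value can be used in further computations.
returned_output = return_number_five()
total = returned_output + 5
print(total)
```

The error in the cell above arises because Python cannot add `None` and `5`.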

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the number 1 or the text "the"!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max
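
Because functions are values, they can also be passed to other functions as arguments. The following is a minimal sketch; `apply_to_pair` is a hypothetical helper written only for this illustration.

```python
our_name_for_max = max

# Both names refer to the very same function.
same_function = our_name_for_max is max
print(same_function)

def apply_to_pair(function, first, second):
    """Hypothetical helper: calls the given function on two values."""
    return function(first, second)

largest = apply_to_pair(max, 2, 6)
smallest = apply_to_pair(min, 2, 6)
print(largest, smallest)
```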

## 3. Histograms
Earlier, we computed the average amounts raised and spent by candidates in our dataset.  The average doesn't tell us everything about the amounts raised and spent, though.  Maybe just a few campaigns spend and raise the bulk of money.

We can use a *histogram* to display more information about a set of numbers.  The seaborn function `histplot` takes a data frame and the name of a column of numbers, and it produces a histogram of the numbers in that column.
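
Before plotting, it may help to see what a histogram counts. The sketch below uses made-up whole-number amounts and a made-up bin width, and tallies how many values fall into each bin using only the standard library. It illustrates the idea and is not how seaborn is implemented.

```python
from collections import Counter

amounts = [5, 12, 18, 25, 31, 47]  # made-up values, not from the dataset
bin_width = 10                     # made-up width of each bin

# Integer division assigns each amount to a bin: 0-9 is bin 0, 10-19 is bin 1, ...
bin_counts = Counter(amount // bin_width for amount in amounts)

for bin_index in sorted(bin_counts):
    start = bin_index * bin_width
    print(f"{start} to {start + bin_width - 1}: {bin_counts[bin_index]}")
```

A histogram draws one bar per bin, with the bar's height equal to the count.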

The below code produces a histogram of the amount raised and spent.

In [None]:
#Reshape the DataFrame from wide to long format:
# - In wide format, 'raised' and 'spent' are separate columns.
# - In long format, these values are combined into a single 'Amount' column,
# with a new 'Type' column indicating whether the amount is 'raised' or 'spent'.
df_long = campaign_spending.melt(value_vars=["raised", "spent"],
                                 var_name="Type",
                                 value_name="Amount")
df_long.head()
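
The reshaping is easier to follow on a tiny table. This sketch applies the same `melt` call to a made-up data frame with two hypothetical candidates; the names and amounts are invented for illustration.

```python
import pandas as pd

# Made-up amounts for two hypothetical candidates, in wide format.
toy_wide = pd.DataFrame({
    "name": ["Candidate A", "Candidate B"],
    "raised": [100.0, 250.0],
    "spent": [80.0, 300.0],
})

# Same arguments as the call above: stack "raised" and "spent" into one column.
toy_long = toy_wide.melt(value_vars=["raised", "spent"],
                         var_name="Type",
                         value_name="Amount")
print(toy_long)
```

Each original row becomes two rows, one per type. Because no `id_vars` are passed, columns such as `name` are not carried over to the long table.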

In [None]:
ax = sns.histplot(
    data=df_long,
    x="Amount",
    hue="Type",
    palette={"raised": "orange", "spent": "green"},
    multiple="dodge"
)

ax.set(
    xlabel="Amount Raised/Spent",
    ylabel="Frequency",
    title="Amount Raised and Spent by 2016 Congressional Candidates"
)

ax.ticklabel_format(style='plain', axis='x')

for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_ha('right')

**Question 6.** Add comments to the above code block noting what each line of code is doing.

**Question 7.** Looking at the histogram, how many campaigns raised more than \$6,000,000?

Using code, how many campaigns raised more than \$6,000,000?

In [None]:
...

**Question 8.** Come up with a better way to display this data, while still using a histogram or multiple histograms.

*Hint:* Check the `binwidth` argument in seaborn.

In [None]:
## Insert your better graph(s) here

**Question 9.** See examples of different types of visualizations [here](https://seaborn.pydata.org/examples/index.html). Make a visualization using the data you already have loaded. Tell a story with the data and the visualization. You can do anything but a histogram.

In [None]:
## Insert your graph(s) here

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

All assignments in the course will be distributed as notebooks like this one, and you will submit your work as a PDF.  

# How to Convert your Colab notebook to a PDF and download it

Follow these instructions exactly to make sure your notebook is correctly converted to a PDF and saved to your computer.

---

## 1. Check everything is in order and all the code runs

Before starting the conversion process, make sure your notebook is complete and error-free.

1. **Open your notebook** in Google Colab.
2. **Save your work**:
   - Go to **File → Save** or press `Ctrl + S` (Windows) / `Cmd + S` (Mac).
   - This ensures that all your recent changes are stored.
3. **Run all the cells** to confirm there are no errors:
   - In the top menu, select **Runtime → Run all**.
   - Colab will execute every cell in order.
4. **Watch for errors**:
   - If you see any red error messages, fix them before proceeding.
   - A notebook with errors will **not** convert to PDF correctly.
5. Once all cells run without errors, you can proceed to the next step.

---

## 2. Make sure your notebook is saved in Google Drive and named correctly
Before converting, confirm that your notebook is stored in your Google Drive inside the `Colab Notebooks` folder.

1. Look at the top-left corner of the Colab page — you’ll see the notebook’s current name next to two yellow circle icons.
2. Rename the file by clicking directly on the name, typing the new name, and pressing **Enter**.  
  - Please rename the notebook to `LASTNAME_FIRSTNAME_LAB#.ipynb`. So for this lab, I would call it `Alberto_Stefanelli_Lab4.ipynb`.
  - The name must end with `.ipynb`.
3. Ensure the notebook is in your Google Drive (not in Colab’s temporary session storage):  
   - In the menu, click **File → Locate in Drive**.  
   - This will open the folder in Drive where the notebook is stored.  
   - If it’s not in `My Drive/Colab Notebooks`, move it there for easier access.

---

## 3. Install the required tools

We wrote some code (see below) to automatically convert your notebook to PDF. When you run the provided code cell, the first step will install some essential pieces of software (i.e., Pandoc and LaTeX) inside your Colab environment. You do not need to understand exactly what is happening.

**Important:**
- This installation will take between 2 and 5 minutes.
- Do **not** close or refresh the Colab page while it runs.

---

## 4. Mount your Google Drive in Colab

After installing the requirements, the code will ask Colab to **mount** your Google Drive. You can use either your personal or your Yale account. This is needed because the notebook you are converting must be saved in Drive before it can be converted to PDF.

You will see a pop-up with a **link**:
1. Click the link.
2. Sign in to your Google account (use the same account where your Colab notebook is saved).
3. If prompted, copy the long **authorization code** provided.
  - Paste that code into the input box in Colab and press **Enter**.
4. This will connect your Google Drive to Colab.
5. If the link does not appear, make sure your browser is not blocking pop-ups.

---

## 5. Enter your notebook’s file name

The code will now ask you to enter your notebook’s exact file name.

1. Type the **full name** of your notebook, including the `.ipynb` ending.  
  - Example: `Alberto_Stefanelli_Lab4.ipynb`
2. Make sure the name matches exactly, including capitalization and underscores.
3. Press **Enter**.
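
To illustrate why the exact name matters, here is a hedged sketch of how a file name turns into the input and output paths. It assumes the `My Drive/Colab Notebooks` location described above and uses `pathlib` from the standard library. It is an illustration only and is not the code the conversion cell runs.

```python
from pathlib import PurePosixPath

notebook_name = "Alberto_Stefanelli_Lab4.ipynb"  # the example name from above

# A name without the .ipynb ending would not point at the notebook file.
if not notebook_name.endswith(".ipynb"):
    raise ValueError("The file name must end with .ipynb")

# Assumed folder: where Drive's "Colab Notebooks" folder appears once mounted.
notebook_folder = PurePosixPath("/content/drive/MyDrive/Colab Notebooks")
input_path = notebook_folder / notebook_name

# The PDF gets the same name, with the ending swapped.
output_path = input_path.with_suffix(".pdf")
print(output_path)
```

A typo or a missing ending in the name produces a path that does not exist, which is why the conversion fails in that case.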

---

## 6. Convert the notebook to PDF and download it

The code will convert the notebook file to a PDF. After the PDF is created, your browser will show a download pop-up or automatically save the file to your Downloads folder. You can now open the PDF with any PDF reader. Once you have your PDF and have made sure everything is in order, you can upload it to Canvas.

---

## 7. If you see an error:

Double-check that:

- All cells run without errors.
- Google Drive mounted without errors.
- The notebook is saved to Google Drive, not locally or on GitHub.
- You entered the correct notebook file name, ending in `.ipynb`.
- The conversion completed without any errors.
- The PDF downloaded to your computer (downloads and pop-ups are not blocked by your browser).


**Fallback:** If the PDF export method below fails (for example, due to LaTeX or pandoc errors), you can use https://convert.ploomber.io/ as a fallback option. However, I strongly suggest trying the method below first and using this fallback only as a last resort.

**If you run into any issues, please reach out for help.**




In [None]:
# Install requirements
!apt-get -qq update
!apt-get install -y pandoc texlive-xetex texlive-fonts-recommended texlive-plain-generic

from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

# Ask for the notebook name
notebook_name = input(
    "Enter your notebook’s exact file name,\n"
    "exactly as shown in the top-left corner of the Colab page (next to the two yellow circle icons): "
)

# Build paths
input_path = f"/content/drive/MyDrive/Colab Notebooks/{notebook_name}"
output_path = input_path.replace(".ipynb", ".pdf")

# Convert to PDF
!jupyter nbconvert --to pdf "{input_path}"

# Download the PDF
files.download(output_path)




