## Week 12 Assignment - W200 Python Fundamentals for Data Science, UC Berkeley MIDS

Write code in this Jupyter Notebook to solve the following problems. Please upload this **Notebook** with your solutions to your GitHub repository in your SUBMISSIONS/week_12 folder by 11:59PM PST the night before class.

This homework assignment is Week 12, which corresponds to the Unit #11 async. If you turn in anything on ISVC, please do so under the Week 12 Assignment category. (Apologies for the confusion.)

## Objectives

- Explore and get insights from a real dataset using pandas
- Practice the use of pandas for: exploratory analysis, information gathering and discovery
- Use matplotlib for plotting charts from the data

## Data files

In this assignment you will apply what you are learning to answer questions about campaign contributions in the Democratic presidential primary race. We will use the csv file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. You will need to download and save the csv in the same folder as this notebook. This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

**REMEMBER: DO NOT PUT THIS DATA IN YOUR GITHUB REPO!**

Documentation for this data file can be found here: https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing

## General Guidelines:

- This is a **real** dataset, so it contains errors and other peculiarities to work through
- The data is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
- If you make assumptions, please annotate them in your answer
- While we've left one code/markdown cell for you after each question as a placeholder, some of your answers will require multiple cells to fully respond
- Double click the markdown cells where it says YOUR ANSWER HERE to enter your written answers; if you need more cells to write answers in please make them markdown cells (rather than code cells)

## Setup

First, run the two cells below. 

The first cell loads the data into a pandas DataFrame named 'contrib'. Note that we define a custom date parser to speed up the import (we could have pandas guess the date format, but this can make the load much slower).

The second cell subsets the data to look at the primary period.  Otherwise, we would see general election donations which would make it harder to draw conclusions about the primaries.  We will analyze through May 2016.
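As a rough illustration of the two ideas above (parsing dates with an explicit format, then keeping only rows up to a cutoff date), here is a minimal sketch on a made-up DataFrame. The column names `receipt_dt` and `amount` and all values are hypothetical, and the month abbreviations assume an English locale. It uses `pd.to_datetime` with a `format` string rather than the `date_parser` argument used in the setup cell.

```python
import pandas as pd

# Toy data (hypothetical); dates use the same day-month-year layout as '%d-%b-%y'
toy = pd.DataFrame({
    "receipt_dt": ["15-Mar-16", "01-Jun-16", "20-Apr-16"],
    "amount": [50.0, 25.0, 100.0],
})

# Passing an explicit format avoids per-value format guessing
toy["receipt_dt"] = pd.to_datetime(toy["receipt_dt"], format="%d-%b-%y")

# Keep only rows on or before the cutoff date
cutoff = pd.Timestamp(2016, 5, 31)
primary = toy[toy["receipt_dt"] <= cutoff].copy()
print(primary)
```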

In [None]:
# import the data
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

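# Create a date parser to pass to read_csv.
# pd.datetime was deprecated and later removed from pandas, so use the
# standard-library datetime module directly.
import datetime
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')

contrib = pd.read_csv('./P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)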

# Note - for now, it is okay to ignore the warning about mixed types.  

In [None]:
# Subset data to primary period 
print(contrib.shape)

contrib = contrib[contrib['contb_receipt_dt'] <= datetime.datetime(2016, 5, 31)].copy()
print(contrib.shape)

## 1. Data Exploration

**1a.** First, let's take a preliminary look at our data:
- Print the *shape* of the data. What does this tell you about the number of variables and rows you have?
- Print a list of column names. 
- Review the documentation for this dataset (linked above). Do you have all of the columns you expect to have?
- Sometimes variable names are not clear unless we read the documentation. In your own words, based on the documentation, what information does the "election_tp" variable contain?

In [None]:
# YOUR CODE HERE

- 1a YOUR ANSWER HERE

**1b.** Print out the first 5 observations from the dataset to manually look through some of your data.

In [1]:
# YOUR CODE HERE

**1c.** When working with a new dataset, it is important to explore and sanity check your variables. Pick **three** variables from the dataset above and run some quick sanity checks. For example, you may want to examine the maximum and minimum values, a frequency count, or something else. Use markdown cells to explain if your sanity checks "pass" your scrutiny or if you have concerns about the integrity of your data. 
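To make "sanity check" a little more concrete, the sketch below runs a few typical checks on a tiny invented DataFrame. The columns `amount` and `city` and their values are hypothetical and are not taken from the assignment data; which checks make sense for the real variables is up to you.

```python
import pandas as pd

# Toy data (hypothetical) with a negative amount and a missing city
toy = pd.DataFrame({
    "amount": [50.0, -25.0, 100.0, 2700.0],
    "city": ["OAKLAND", "BERKELEY", "OAKLAND", None],
})

# Numeric variable: summary statistics (count, mean, min, max, quartiles)
amount_summary = toy["amount"].describe()

# Numeric variable: how many values are negative?
n_negative = (toy["amount"] < 0).sum()

# Categorical variable: frequency count, including missing values
city_counts = toy["city"].value_counts(dropna=False)

# Categorical variable: number of missing entries
n_missing = toy["city"].isna().sum()

print(amount_summary)
print(n_negative, n_missing)
print(city_counts)
```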

In [2]:
# YOUR CODE HERE

- 1c YOUR ANSWER HERE

## 2. Exploring Campaign Contributions

Let's investigate the donations to the candidates.

**2a.** 
Create a table that shows the total number of donations to each candidate. Hint: use "groupby" as shown in async unit 11.07.

When presenting data in a table, it is best to sort the data in a meaningful way. This makes it easier for your reader to examine what you've done and to get insights out of your tables. Use "sort_values" to sort the data so that candidates with the largest number of donations appear on top. From now on, all tables you present in this assignment (and course) should be sorted.
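The pattern described above (group, count, then sort with the largest group on top) might look like the following on a toy table. The column names `candidate` and `amount` and the values are invented for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): one row per donation
toy = pd.DataFrame({
    "candidate": ["A", "B", "A", "C", "A", "B"],
    "amount": [10.0, 20.0, 15.0, 5.0, 150.0, 40.0],
})

# Count rows per group, then sort so the largest count appears first
counts = (
    toy.groupby("candidate")["amount"]
    .count()
    .sort_values(ascending=False)
)
print(counts)
```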

Which candidate received the largest number of contributions (variable 'contb_receipt_amt')?

In [3]:
# YOUR CODE HERE

- 2a YOUR ANSWER HERE

**2b.** Now, create a table that shows the total **value** of donations to each candidate. Which candidate raised the most money in California?

In [None]:
# YOUR CODE HERE

- 2b YOUR ANSWER HERE

**2c.** Combining your tables
- What is the "type" of the two tables you printed above? Show whether they are Series or DataFrames.
- Convert any Series to pandas DataFrames.
- Update the variable (column) names to accurately describe what is shown
- Merge together your tables to show the *count* and the *value* of donations to each candidate in one table. Use the "join" function.
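The steps in this list can be sketched on two small invented Series that share an index. The names `n_donations` and `total_value` are hypothetical labels chosen for this example, not required column names.

```python
import pandas as pd

# Two toy Series (hypothetical) indexed by the same group labels
idx = pd.Index(["A", "B"], name="candidate")
counts = pd.Series([3, 2], index=idx, name="amount")
totals = pd.Series([175.0, 60.0], index=idx, name="amount")

# Check the type of each object
print(type(counts), type(totals))

# Convert each Series to a DataFrame and give the column a descriptive name
counts_df = counts.to_frame().rename(columns={"amount": "n_donations"})
totals_df = totals.to_frame().rename(columns={"amount": "total_value"})

# join aligns the two tables on their shared index
combined = counts_df.join(totals_df)
print(combined)
```

Renaming before the join matters here: both Series carry the same name, and `join` raises an error when the two tables have overlapping column names and no suffix is given.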

In [None]:
# YOUR CODE HERE

**2d.** Calculate and add a new variable to your table that shows the average $ per donation.

In [None]:
# YOUR CODE HERE

**2e.** There are several interesting conclusions you can draw from the table you have created. Please comment on the results of your data analysis in a short paragraph. What have you learned about campaign contributions in California?

- 2e YOUR ANSWER HERE

## 3. Exploring Donor Occupation

In part 2 above, we saw that simple data analysis can give us insight into the candidates' campaigns. Now let's quickly look to see what *kind* of person is donating to each campaign using the "contbr_occupation" variable.

**3a.** Subset your data to create a data frame with only donations for Hillary Clinton. Then use value_counts() and head() to display the top 5 occupations (contbr_occupation) for her donors. Note: we are just interested in the count of donations, not the value of those donations.
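As a small illustration of the subset, `value_counts()`, `head()` pattern, here is a sketch on invented data. The columns `name` and `occupation` and all values are hypothetical, and the example keeps the top 2 rather than the top 5 to stay short.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "name": ["A", "A", "A", "A", "A", "A", "B"],
    "occupation": ["TEACHER", "ENGINEER", "TEACHER", "RETIRED",
                   "TEACHER", "ENGINEER", "TEACHER"],
})

# Boolean mask keeps only the rows for one group
subset = toy[toy["name"] == "A"]

# value_counts sorts by frequency (descending); head keeps the first n rows
top = subset["occupation"].value_counts().head(2)
print(top)
```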

In [None]:
# YOUR CODE HERE

- 3a YOUR ANSWER HERE

**3b.** Imagine that you want to do the previous operation on several candidates.  To keep your work neat, you want to take the work you did on the Clinton-subset and wrap it in a function that you can apply to other subsets of the data.  Specifically, write a function called get_donors() that takes a DataFrame as input, and outputs a Series containing the counts for the top 5 occupations contained in that DataFrame.

In [None]:
def get_donors(df):
    """This function takes a dataframe that contains a variable named contbr_occupation.
    It outputs a Series containing the counts for the 5 most common values of that
    variable."""
    
    # YOUR CODE HERE

**3c.** Now try running your function on subsets of the dataframe corresponding to three candidates:

1. Hillary Clinton
2. Bernie Sanders
3. Donald Trump

In [None]:
# YOUR CODE HERE

**3d.** Finally, use a groupby to divide the entire dataset by candidate.  Call .apply(get_donors) on your groupby object, which will apply the function you wrote to each subset of your data.  Look at your output and marvel at what Pandas can do in just one line!
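To show the general shape of what `groupby(...).apply(...)` returns when the applied function produces a Series, here is a sketch on toy data. The function `top_items` and the columns `group` and `item` are hypothetical stand-ins, not the assignment's function or variables. Depending on your pandas version, this call may emit a warning about how grouping columns are handled.

```python
import pandas as pd

def top_items(df):
    """Return counts of the 2 most common values in the (hypothetical) 'item' column."""
    return df["item"].value_counts().head(2)

# Toy data (hypothetical)
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B"],
    "item": ["x", "x", "y", "z", "z", "z", "x"],
})

# apply runs top_items once per group and stitches the results together;
# because each group returns differently labeled counts, the result is a
# Series indexed by (group, item)
result = toy.groupby("group").apply(top_items)
print(result)
```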

In [None]:
# YOUR CODE HERE

**3e.** Comment on your findings in a short paragraph.

- 3e YOUR ANSWER HERE

**3f.** Think about your findings in section 3 vs. your findings in section 2 of this assignment. Do you have any new insights into the results you got in section 2, now that you see the top occupations for each candidate?

- 3f YOUR ANSWER HERE


## 4. Plotting Data

There is an important element that we have not yet explored in this dataset - time.

**4a.** Please create a single line chart with the following elements:
- Show the date on the x-axis
- Show the contribution amount on the y-axis
- Include a title
- Include axis labels
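The chart elements listed above map onto a handful of matplotlib calls. The sketch below uses a few invented points; the columns `day` and `value` and the label text are hypothetical, and the choice of what to plot from the real data is left to you.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical), sorted by date so the line is drawn left to right
toy = pd.DataFrame({
    "day": pd.to_datetime(["2016-01-01", "2016-01-03", "2016-01-02"]),
    "value": [10.0, 30.0, 20.0],
}).sort_values("day")

fig, ax = plt.subplots()
ax.plot(toy["day"], toy["value"])      # single line: date on x, value on y
ax.set_title("Toy values over time")   # chart title
ax.set_xlabel("Date")                  # x-axis label
ax.set_ylabel("Value")                 # y-axis label
fig.autofmt_xdate()                    # tilt date tick labels so they do not overlap
```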

In [None]:
# YOUR CODE HERE

**4b.** This chart is messy (and you should make better plots for your project). While there are better ways we can show this data, what conclusions can you draw from just your basic plot?

- 4b YOUR ANSWER HERE

**4c.** Brainstorm: If you were going to improve on this plot looking at donations over time, what could you display that would be more useful? You do not need to do any plotting for this question.

- 4c YOUR ANSWER HERE

## If you have feedback for this homework, please submit it using the link below:

http://goo.gl/forms/74yCiQTf6k