# Assignment 4: Visualization

In this assignment we will introduce popular visualization tools commonly used in Python. We will use data from the US Federal Election Commission.

One of the most popular plotting libraries today is [Matplotlib](https://matplotlib.org/). According to the documentation, "Matplotlib tries to make easy things easy and hard things possible." For the most basic plots, line graphs, single bar plots etc., matplotlib has functions for just these things.

As plots get more complex, many data scientists use [Seaborn](https://seaborn.pydata.org/), a wrapper around matplotlib that improves on its theme, and has convenience functions for more complex plots that tend to be common. You can see a ton of examples in Seaborn's [example gallery](https://seaborn.pydata.org/examples/index.html).

With Matplotlib and Seaborn you often have to manually move labels, adjust axes, add data labels, etc. An alternative is declarative visualization. Akin to the difference between imperative approaches like Pandas vs declarative approaches like SQL for manipulating tabular data, declarative visualizations allow users to specify what visualization they want over their data by describing the connection between data and visual encodings.
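To make the imperative vs declarative contrast concrete, here is a minimal sketch that computes the same per-party total twice: once as a chain of pandas steps, and once as a SQL query run through Python's built-in `sqlite3`. The table name `toy` and all values are made up for illustration; they are not part of the election dataset.

```python
import sqlite3

import pandas as pd

# Toy data (hypothetical values, not from the FEC files).
toy = pd.DataFrame({
    "party": ["DEM", "REP", "DEM", "REP"],
    "receipts": [100, 200, 300, 400],
})

# Imperative style: spell out each step (group, select a column, sum).
by_party_pandas = toy.groupby("party")["receipts"].sum().reset_index()

# Declarative style: describe the result and let the engine decide how to get it.
conn = sqlite3.connect(":memory:")
toy.to_sql("toy", conn, index=False)
by_party_sql = pd.read_sql_query(
    "SELECT party, SUM(receipts) AS receipts FROM toy GROUP BY party ORDER BY party",
    conn,
)
conn.close()

print(by_party_pandas)
print(by_party_sql)
```

Both results hold the same totals. Declarative visualization applies the same idea to plots: you describe which column maps to which visual channel rather than issuing drawing commands.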

In this assignment we will use a Python declarative visualization library, [Altair](https://altair-viz.github.io/), itself built atop the popular declarative visualization tool [Vega](https://vega.github.io/vega/), to demonstrate how to get some effective visualizations in relatively few lines of code.

Before we begin, make sure you have chosen the right virtual environment (iap-data-venv) to run this notebook in. How to do this depends on your IDE. 

## Data Overview
Let's briefly describe the data before starting our visualizations.

In [None]:
import pandas as pd
cands = pd.read_csv("../datasets/elections/cand_summary.txt", delimiter="|")
cands["CAND_OFFICE"] = cands.CAND_ID.str[:1]
pacs = pd.read_csv("../datasets/elections/pac_summary.txt", delimiter="|")
dist_pop = pd.read_csv("../datasets/elections/dist_pop.txt", delimiter="|")

The ``cands`` table contains information about candidates in each election year. We are primarily interested in visualizing election funding information, so let us remove unneeded information.

In [None]:
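# Remove rows for US territories and other non-state codes.
# .copy() avoids a SettingWithCopyWarning when we add a column below.
cands = cands[~cands.CAND_OFFICE_ST.isin(["AS", "GU", "MP", "US", "DC", "MH", "PR", "VI"])].copy()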

# Add a column for CAND_OFFICE ('P', 'H' or 'S', for President, House, and Senate respectively)
cands["CAND_OFFICE"] = cands.CAND_ID.str[:1]

# We keep candidate id, name, state, office, year, party, total funding, funding from individual contributions.
cands = cands[['CAND_ID', 'CAND_NAME', 'CAND_OFFICE_ST', 'CAND_OFFICE', 'ELECTION_CYCLE_YR', 'PTY_AFFILIATION', 'TTL_RECEIPTS', 'TTL_INDIV_CONTRIB']]
cands

The `pacs` table contains information about Political Action Committees (PACs), which are defined as:
> (in the US) an organization that raises money privately to influence elections or legislation, especially at the federal level.

Let's again remove unneeded information.

In [None]:
# We keep PAC id, name, type, election year, and total funding.
pacs = pacs[["CMTE_ID", "CMTE_NM", "CMTE_TP", "ELECTION_CYCLE_YR", "TTL_RECEIPTS"]]

In [None]:
pacs.info()

Finally, the `dist_pop` table contains the population of US congressional districts.

In [None]:
dist_pop

In [None]:
dist_pop.info()

## Part 1: Matplotlib + Seaborn

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

### Viz: Funding of House Candidates in 2018.
In this first visualization task, we want to compare individual contributions and total funding for House candidates in 2018. We will walk you through this task, but the later ones will be more open-ended.
We can get the 2018 house candidates as follows, sorted by total fundraising receipts.

In [None]:
house_2018 = cands[(cands.ELECTION_CYCLE_YR == 2018) &
                   (cands.CAND_OFFICE == "H") &
                   (cands.PTY_AFFILIATION.isin(["REP", "DEM"]))] \
            .sort_values("TTL_RECEIPTS", ascending=False)
house_2018

Now let's start making some plots. We'll start by looking at the ratio of individual contributions to total receipts for all 2018 House candidates using a scatterplot. Here is a first attempt:

In [None]:
plt.scatter(house_2018.TTL_RECEIPTS, house_2018.TTL_INDIV_CONTRIB)

# Increase the size of the figure for visibility
plt.gcf().set_size_inches(10,6) # gcf() means get current figure.

It's worth thinking about what is going on here. Matplotlib has a default **figure** that starts with one **axes** object. When using the ``plt.`` functions we edit the default axis or the default figure (depending on what property we are manipulating). Alternatively we can fetch the axis or figure object and operate on that directly. This is useful if you want to create multiple plots in the same figure. An example of multiple plots on the same figure is [here](https://matplotlib.org/devdocs/gallery/subplots_axes_and_figures/subplots_demo.html).
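As a minimal sketch of working with explicit figure and axes objects rather than the defaults, the following creates one figure holding two axes and draws on each separately. The data values and titles are made up for illustration.

```python
import matplotlib.pyplot as plt

# Toy data (hypothetical values).
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Ask for one figure holding a 1x2 grid of axes, instead of relying on the defaults.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Each axes object is drawn on and labeled independently.
ax_left.scatter(x, y)
ax_left.set_title("Scatter")
ax_right.plot(x, y)
ax_right.set_title("Line")

# Figure-level properties are set on the figure object.
fig.suptitle("Two axes on one figure")
```

Here `plt.gca()` and `plt.gcf()` are not needed, because `plt.subplots` hands back the objects directly.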

There are a couple of issues with the figure above: It has no title, no axis labels, and our area of interest is pretty hard to see. Let's fix that.

In [None]:
# Get the current axes object
ax = plt.gca()

# Create Scatterplot
ax.scatter(house_2018.TTL_RECEIPTS, house_2018.TTL_INDIV_CONTRIB)

# Set the title
ax.set_title("2018 House Individual Contributions vs Total Fundraising")

# Set the axis labels
ax.set_xlabel("TTL_RECEIPTS ($)")
ax.set_ylabel("TTL_INDIV_CONTRIB ($)")

# Adjust the axis limits 
ax.set_xlim((0, 20000000))
ax.set_ylim((0, 20000000))

# Increase figure size on the current figure
plt.gcf().set_size_inches(10,8)

This is looking better, but who are these outliers who receive tons of money but almost none from individuals?

The first thing we can do is add multiple series on the same axis to see which party these house candidates belong to. We'll split our data into two sets, one for Republicans and one for Democrats. We will then add both scatterplots to the same axes object and plot them over each other.

We will also add a line y=x to better see how much the individual contributions differ from the total funding.

In [None]:
# Get the current figure and axes
ax = plt.gca()

# Split rep and dem.
rep_2018 = house_2018[house_2018.PTY_AFFILIATION =="REP"]
dem_2018 = house_2018[house_2018.PTY_AFFILIATION =="DEM"]

# TODO: Make scatter plot colored red for republicans.
# TODO: Make scatter plot colored blue for democrats.
# TODO: Add line y=x.


# Set the title
ax.set_title("2018 House Individual Contributions vs Total Fundraising")

# Set the axis labels
ax.set_xlabel("TTL_RECEIPTS ($)")
ax.set_ylabel("TTL_INDIV_CONTRIB ($)")

#Add a legend
plt.gcf().legend()

#Adjust the axis limits 
ax.set_xlim((0, 20000000))
ax.set_ylim((0, 20000000))

#Increase figure size
plt.gcf().set_size_inches(10,8)

Now we can see that there are a number of politicians raising a ton of money with almost none of it coming from individual contributors. But who exactly are those outliers? One thing we could do is to add data labels for some of these points (see the example [here](https://matplotlib.org/3.1.1/tutorials/text/annotations.html)). We'll come back to this later.
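For reference, data labels in Matplotlib are typically added with `Axes.annotate`. The sketch below labels points on a toy dataset; the column names, the values, and the 10% outlier rule are all assumptions chosen for illustration, not something derived from the election data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical candidates and amounts).
toy = pd.DataFrame({
    "cand": ["A", "B", "C", "D"],
    "total": [10, 20, 30, 40],
    "indiv": [9, 18, 2, 35],
})

fig, ax = plt.subplots()
ax.scatter(toy["total"], toy["indiv"])

# Hypothetical outlier rule: less than 10% of funding comes from individuals.
outliers = toy[toy["indiv"] < 0.1 * toy["total"]]

# Label each outlier, offsetting the text by 5 points so it does not cover the marker.
for row in outliers.itertuples():
    ax.annotate(row.cand, xy=(row.total, row.indiv),
                xytext=(5, 5), textcoords="offset points")
```

Labeling only a filtered subset keeps the plot readable; labeling every point on a dense scatterplot usually produces overlapping text.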

Even with these simple plots this is starting to get pretty verbose. Matplotlib tends to be a manual process in this way. However, one advantage is the ability to do nearly anything you want. 

Seaborn is a wrapper around matplotlib that tends to be a bit easier to use and more easily produces cleaner plots without much manipulation. In the next few steps we'll look at how to show how the total amount of money raised changes each election cycle. Fortunately this data goes back to 1980!

But first we'll revisit the same plot, to see how we would do the same thing in Seaborn. Using the [sns.relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html) documentation, make a scatter plot in which Republicans and Democrats are colored differently.

In [None]:
# Set the Seaborn default theme on Matplotlib
sns.set()

# TODO: Use sns.relplot to automatically split the data into separate series and color them accordingly.

# Adjust the axis limits using Matplotlib
plt.xlim((0, 20000000))
plt.ylim((0, 20000000))

# Increase figure size
plt.gcf().set_size_inches(10,8)

Instead of plotting Democrats and Republicans atop one another, we can also use a "Facet Grid" to split out different categories into different plots. Take a look at [sns.FacetGrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) and split the two plots.

In [None]:
# TODO: Create facet grid object. You'll only need the 'col' and 'hue' option.

# TODO: Use map to draw each figure.

#Adjust the axis limits 
plt.xlim((0, 20000000))
plt.ylim((0, 20000000))

#Increase figure size
plt.gcf().set_size_inches(10,8)

### Viz: Spending Over Time.

In this task, we will look at how campaign contributions change over time using Seaborn.

In [None]:
# Get house candidates before 2020 (years with complete data).
house_cands = cands[(cands.CAND_OFFICE == 'H') & (cands.ELECTION_CYCLE_YR < 2020) & (cands.PTY_AFFILIATION.isin(["REP", "DEM"]))]

Take a look at the documentation for [sns.barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html). Use it to show the evolution of campaign spending over time for both parties combined. 

Readability guidelines:
* Set the figure size to (10, 8).
* Rotate the x labels by 45 degrees using [plt.xticks](https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.xticks.html)

In [None]:
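# numeric_only=True sums only the dollar columns and skips text columns such as names.
contrib_sum = house_cands.groupby("ELECTION_CYCLE_YR").sum(numeric_only=True).reset_index()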
contrib_sum
# TODO: evolution of election spending.

We can see the rapid increase in election spending over the past 40 years. Although the US population has grown by about 50% and prices have roughly tripled since 1980, spending on elections has far outpaced both.
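One way to check a claim like this is to divide the nominal growth in spending by the combined growth in population and prices. The sketch below uses the rough factors quoted above (1.5x population, 3x prices) as defaults. The function name and the 9x nominal growth figure are hypothetical placeholders, so substitute the ratio you read off your own plot.

```python
def real_per_capita_growth(nominal_growth, population_growth=1.5, price_growth=3.0):
    """Growth factor that remains after dividing out population and price growth."""
    return nominal_growth / (population_growth * price_growth)


# Spending would need to grow 1.5 * 3.0 = 4.5x just to keep pace.
break_even = 1.5 * 3.0

# Hypothetical example: a 9x nominal increase is a 2x increase per person in real terms.
print(break_even, real_per_capita_growth(9.0))
```

Any nominal growth above the 4.5x break-even means spending grew faster than population and prices combined.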

Next, create a grouped bar plot using Seaborn that compares Democrat and Republican spending on **House** Campaigns since 1980. Use [seaborn's grouped barplot example](https://seaborn.pydata.org/examples/grouped_barplot.html) for inspiration. Use the same readability guidelines as above.

In [None]:
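# numeric_only=True sums only the dollar columns and skips text columns such as names.
contrib_sum = house_cands.groupby(["ELECTION_CYCLE_YR", "PTY_AFFILIATION"]).sum(numeric_only=True).reset_index()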
# TODO: Grouped bar plot

**Q: Since we are tracking data over time, what might be a different plot that could be a useful way to view this data?**

Now change the above plot to the type we just decided on. Feel free to use Google or DuckDuckGo to find documentation.

In [None]:
# TODO: Create your plot here

### Viz: Sources of Funding

Election funding can come from the candidate's own campaign or from PACs. For each year since 2000, we would like to visualize the proportion of funding from campaigns (senate, house, presidential) and from PACs according to their types.

Let's take a look at how we would accomplish this. The first thing we have to do is manipulate our data so that it contains what we want. Remember that we want to separate out spending on House, Senate, and presidential elections, as well as spending by the different types of political action committees (PACs), per year.

You've already done enough pandas manipulation so we'll just give you the code to gather the data in one dataframe. You don't need to read the details of the code, just read the output. In the next section of this tutorial, we'll see how to avoid most of this manipulation.

In [None]:
#Sum up and get only the fields we are interested in
election_spending = cands[(cands.ELECTION_CYCLE_YR >= 2000) & (cands.ELECTION_CYCLE_YR < 2020)]. \
                    groupby(["ELECTION_CYCLE_YR", "CAND_OFFICE"])\
                    .sum().reset_index()

# We will change the column name to match with the pacs table.
election_spending = election_spending[["ELECTION_CYCLE_YR", "CAND_OFFICE", "TTL_RECEIPTS"]].rename({"CAND_OFFICE":"SPENDING_TYPE"}, axis="columns")


pac_spending = pacs[pacs.ELECTION_CYCLE_YR < 2020].groupby(["ELECTION_CYCLE_YR", "CMTE_TP"]).sum().reset_index()

#Get the top 3 Pacs
top_pacs = pac_spending.groupby("CMTE_TP").sum().sort_values("TTL_RECEIPTS", ascending=False).index[:3]
pac_spending = pac_spending[pac_spending.CMTE_TP.isin(top_pacs)]
pac_spending = pac_spending[["ELECTION_CYCLE_YR", "CMTE_TP", "TTL_RECEIPTS"]].rename({"CMTE_TP":"SPENDING_TYPE"}, axis="columns")

total_spending = pd.concat([election_spending, pac_spending])


#Rename to meaningful spending types
spending_type_map = {"H":"House", "S":"Senate", "P":"President", "Q":"Standard PAC", "O":"Super PAC", "Y":"Party PAC"}
total_spending.SPENDING_TYPE = total_spending.SPENDING_TYPE.apply(lambda x: spending_type_map[x])

# This will create rows for election cycle years, and columns for each distinct spending type.
total_spending = total_spending.pivot(index="ELECTION_CYCLE_YR", columns="SPENDING_TYPE", values="TTL_RECEIPTS")

total_spending

Once your data is in the above format (x-axis values in the index and each stack element as a column), the built-in pandas plot function lets you make stacked bar plots as follows:

In [None]:
total_spending.plot.bar(stacked=True)
plt.gcf().set_size_inches(15,10)
plt.legend()
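The reshaping step that makes this work is `pivot`, which turns one row per (year, type) pair into one row per year with one column per type. Here is a minimal sketch on toy data; the column names and values are made up for illustration.

```python
import pandas as pd

# Long format: one row per (year, kind) pair. Values are hypothetical.
long_form = pd.DataFrame({
    "year": [2000, 2000, 2002, 2002],
    "kind": ["House", "Senate", "House", "Senate"],
    "receipts": [1.0, 2.0, 3.0, 4.0],
})

# Wide format: years become the index, each kind becomes its own column.
wide_form = long_form.pivot(index="year", columns="kind", values="receipts")
print(wide_form)
```

Calling `wide_form.plot.bar(stacked=True)` on the result would draw one bar per year, with one stacked segment per column.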

One interesting thing to see here is that Super PACs did not contribute to elections before 2010. [According to Wikipedia](https://en.wikipedia.org/wiki/Political_action_committee#Super_PACs), they came into existence in 2010 after a court decision.

Seaborn allows you to do many more things, as shown in the [gallery](https://seaborn.pydata.org/examples/index.html), but for complex visualizations, Altair and Vega-Lite are likely more convenient.

## Part 2: Altair & Vega-Lite

With matplotlib alone, even simple plots require a lot of hand-written code to look presentable. Seaborn simplifies matplotlib significantly, but it (1) does not easily support advanced techniques like interactivity, (2) gets increasingly tedious for complex visualizations, and (3) often requires matplotlib-style manual adjustments for best readability.

Altair further simplifies visualizations. It is to seaborn what SQL is to Pandas. Despite a steeper learning curve, it aims to allow for more sophisticated and better-looking visualizations with less code. From the documentation:

> The key idea is that you are declaring links between data columns and visual encoding channels, such as the x-axis, y-axis, color, etc. The rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising range of simple to sophisticated plots and visualizations can be created using a relatively concise grammar.

Altair is a Python wrapper around Vega-Lite, a declarative visualization grammar.

### Basics

In [None]:
import altair as alt
from vega_datasets import data
from altair import datum
alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

house_2018 = cands[(cands.ELECTION_CYCLE_YR == 2018) &
                   (cands.CAND_OFFICE == "H") & (cands.TTL_RECEIPTS < 20000000)]

We'll start by revisiting the initial scatterplot. We'll start with the full code then walk through it.

In [None]:
alt.Chart(house_2018).mark_point().encode(
    x="TTL_RECEIPTS:Q",
    y="TTL_INDIV_CONTRIB:Q",
    color="PTY_AFFILIATION:N",
).transform_filter(
    (datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM")
).properties(
    title={
        "text": "Individual Contrib vs Total Receipts"
    }
)

There is a bunch going on here so let's break it down a step at a time.

The code starts with `alt.Chart(data)`, which specifies that you want to draw a figure using the provided data. Under the hood, the `Chart` object is compiled to a JSON specification, which is then rendered by the underlying library (Vega-Lite).

Then you specify a `mark` to indicate what the data should look like. Because we are doing a scatter plot, we want individual points, so we use `mark_point()`. We'll see other marks later in the tutorial.

In [None]:
chart = alt.Chart(house_2018).mark_point()
chart

This is not very useful as it stacks all points in one. We need to specify more visual elements using the `encode` function. Possible encodings include x-axis, y-axis, color, interactive tooltip, labels on points, etc.

Let's start by specifying an x-axis.

In [None]:
# Q indicates that the variable is quantitative.
chart = chart.encode(x="TTL_RECEIPTS:Q")
chart

The data is now spread through the x-axis. Let's add a y-axis.

In [None]:
chart = chart.encode(y="TTL_INDIV_CONTRIB:Q")
chart

We now want to color points by party affiliation. We once again use the `encode()` function.

In [None]:
# N means nominal: unordered categorical variable.
chart = chart.encode(color="PTY_AFFILIATION:N")
chart

That's a lot of parties; we only want republicans and democrats. Instead of using pandas to filter the dataframe, we can filter individual points using `transform_filter`. When your data is in pandas format already, you should likely just use pandas since you are more familiar with it. But transforms are more general as they operate across data formats. So we show an example here.

In [None]:
chart = chart.transform_filter((datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM"))
chart

For best practice, let's add a title. We do so by using the `properties` function.

In [None]:
chart = chart.properties(
    title={
        "text": "Individual Contrib vs Total Receipts"
    }
)
chart

Note that the readability of the plot is taken care of by the library's default configurations, which are fairly decent in most cases.

Let's add a few more features to our plot.

Suppose we wanted to add a y=x line to the plot above. Let's first draw it in isolation. Use [this example from the documentation](https://altair-viz.github.io/gallery/simple_line_chart.html) to draw a single y=x line.

In [None]:
df = pd.DataFrame({
    'X': [0, 20000000],
    'Y': [0, 20000000],
})

line = None
# TODO: Draw line. Store the resulting chart in the 'line' variable. You'll need a Chart, a mark, and a few encodings.

Altair allows placing multiple plots in the same figure using the `+` operation.

In [None]:
full_chart = chart + line
full_chart

Likewise, you may want to split the figure by party affiliation. You do that by using facets.

In [None]:
multi_chart = chart.facet("PTY_AFFILIATION")
multi_chart

Finally, suppose we wanted to interactively identify outliers by hovering over the points with our mouse. Adding interactivity is trivial by adding a `tooltip` to the encoding.

In [None]:
chart = chart.encode(
    tooltip=["CAND_NAME", "CAND_OFFICE_ST"] # Specify the columns you want to see.
)

chart

We've now explored a small subset of the things you can do with Altair. So here's a chance for you to play around.

### Practice 1: Spending over Time by Party
Try recreating the grouped bar chart showing spending over time by each party. Make it interactive by showing the exact value for a bar when you hover over it. It doesn't have to match the earlier chart exactly, but it should be close. For inspiration, start from [this example in the documentation](https://altair-viz.github.io/gallery/grouped_bar_chart.html). Also feel free to use Google and Stack Overflow.

In [None]:
# Use pandas for simple filtering.
house_cands = cands[(cands.CAND_OFFICE == 'H') & (cands.ELECTION_CYCLE_YR < 2020) & (cands.PTY_AFFILIATION.isin(["REP", "DEM"]))]

# TODO: Recreate the plot showing spending by party over time without using pandas aggregation.

### Practice 2: Share of Receipts by Campaign Type
Create a stacked bar chart showing the proportion of the funding that comes from presidential, senate or house campaigns. You do not need to show PAC data here. Start from [this example in the documentation](https://altair-viz.github.io/gallery/stacked_bar_chart.html).

In [None]:
all_cands = cands

# TODO: Create stacked bar plot showing the funding of all three campaign types.

### Map Visualization
There are many more things we can do with Altair. Take a look at [Altair's gallery](https://altair-viz.github.io/gallery/index.html) for more examples.

Let's say we want to see which states had Senate races in recent election cycles that raised a disproportionate amount of money per capita. Senate seats are up for election every 6 years, so looking at data from 2012 up to (but not including) 2018 ensures that all Senate seats have had an election.

We'll create a choropleth map, that is, a map that is shaded relative to some statistic.

Let's first compute the per capita receipts in each state. You only need to look at the output; don't worry about the manipulation.

In [None]:
# Use dist_pop to get state population.
state_pop = dist_pop[["state", "population"]].groupby("state").sum().reset_index()
# Select Senate candidates in the chosen cycles, then total their receipts per state.
senate_cands = cands[(cands.ELECTION_CYCLE_YR >= 2012) &
                                 (cands.ELECTION_CYCLE_YR < 2018) &
                                 (cands.CAND_OFFICE=="S")]
senate_receipts = senate_cands.groupby("CAND_OFFICE_ST").agg({"TTL_RECEIPTS": "sum"}).reset_index()

# Merge
senate_pop = pd.merge(left=senate_receipts, right=state_pop, left_on="CAND_OFFICE_ST", right_on="state")

# Compute per capita receipts
senate_pop["PER_CAPITA_RECEIPTS"] = senate_pop.TTL_RECEIPTS / senate_pop.population

# Add state id: needed for map
# Mapping of postal code to the state id used in the geographic data
states = {'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AS': 'American Samoa','AZ': 'Arizona','CA': 'California','CO': 'Colorado','CT': 'Connecticut','DC': 'District of Columbia','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','GU': 'Guam','HI': 'Hawaii','IA': 'Iowa','ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana','MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota','MO': 'Missouri','MP': 'Northern Mariana Islands','MS': 'Mississippi','MT': 'Montana','NA': 'National','NC': 'North Carolina','ND': 'North Dakota','NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada','NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania','PR': 'Puerto Rico','RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee','TX': 'Texas','UT': 'Utah','VA': 'Virginia','VI': 'Virgin Islands','VT': 'Vermont','WA': 'Washington','WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'}
state_to_id = {"Alabama":"1","Alaska":"2","Arizona":"4","Arkansas":"5","California":"6","Colorado":"8","Connecticut":"9","Delaware":"10","District of Columbia":"11","Florida":"12","Georgia":"13","Hawaii":"15","Idaho":"16","Illinois":"17","Indiana":"18","Iowa":"19","Kansas":"20","Kentucky":"21","Louisiana":"22","Maine":"23","Maryland":"24","Massachusetts":"25","Michigan":"26","Minnesota":"27","Mississippi":"28","Missouri":"29","Montana":"30","Nebraska":"31","Nevada":"32","New Hampshire":"33","New Jersey":"34","New Mexico":"35","New York":"36","North Carolina":"37","North Dakota":"38","Ohio":"39","Oklahoma":"40","Oregon":"41","Pennsylvania":"42","Rhode Island":"44","South Carolina":"45","South Dakota":"46","Tennessee":"47","Texas":"48","Utah":"49","Vermont":"50","Virginia":"51","Washington":"53","West Virginia":"54","Wisconsin":"55","Wyoming":"56","Puerto Rico":"72"}

# Map each postal code to the numeric state id used by the geographic data
senate_pop["STATE_ID"] = senate_pop.CAND_OFFICE_ST.apply(lambda x: state_to_id[states[x]])

senate_pop
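One thing to keep in mind with `pd.merge` is that its default `how="inner"` silently drops rows that have no match on the other side. A quick way to check for that, sketched here on toy data with made-up values and a deliberately unmatched code `"ZZ"`, is to run a left merge with `indicator=True` and look for `left_only` rows.

```python
import pandas as pd

# Toy data (hypothetical values); "ZZ" has no population entry on purpose.
receipts = pd.DataFrame({
    "CAND_OFFICE_ST": ["MA", "NY", "ZZ"],
    "TTL_RECEIPTS": [10.0, 20.0, 5.0],
})
population = pd.DataFrame({"state": ["MA", "NY"], "population": [7, 20]})

# A left merge keeps every receipts row; the _merge column says where each row matched.
checked = pd.merge(receipts, population, left_on="CAND_OFFICE_ST", right_on="state",
                   how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "CAND_OFFICE_ST"].tolist()
print(unmatched)
```

An empty list means the inner merge lost nothing. Otherwise the listed codes would be missing from the map.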

Starting from [the example in Altair's documentation](https://altair-viz.github.io/gallery/choropleth.html), draw a map shaded by per capita receipts. Make the map interactive by showing a state's name, population and total receipts when you hover over it.

In [None]:
# Fetch the state geography data
state_geo = alt.topo_feature(data.us_10m.url, 'states')

# TODO: Make choropleth map.
# Tip: Use senate_pop and STATE_ID in the transform lookup.
# Tip: Use state_geo above instead of 'counties' in the example.

### Visualizations from your own research!
Think of something interesting you'd like to visualize from your own research! It doesn't have to be complex. If you don't have an applicable dataset, visualize something interesting from the election dataset instead.