# Lab 5: Visualization

Visualizations help you explore a dataset, discover anomalies or errors, find areas of interest, explain ideas convincingly to others, and check your assumptions. In this lab we will introduce two popular visualization tools commonly used in Python.

In the in-class portion of this lab we will use some of the FEC data from lab 1. Since the point of this lab is visualization and not running complex queries over data, we will provide code for the transformations in this lab.

One of the most popular plotting libraries today is [Matplotlib](https://matplotlib.org/). According to the documentation, "Matplotlib tries to make easy things easy and hard things possible." For the most basic plots (line graphs, single bar plots, etc.), Matplotlib has functions for just these things.

As plots get more complex, many data scientists use [Seaborn](https://seaborn.pydata.org/), a wrapper around matplotlib that improves on its theme, and has convenience functions for more complex plots that tend to be common. You can see a ton of examples in Seaborn's [example gallery](https://seaborn.pydata.org/examples/index.html).

With Matplotlib and Seaborn you often have to manually move labels, adjust axes, add data labels, etc. An alternative is declarative visualization. Akin to the difference between imperative approaches like Pandas vs declarative approaches like SQL for manipulating tabular data, declarative visualizations allow users to specify what visualization they want over their data by describing the connection between data and visual encodings.

In this lab we will use a Python declarative visualization library, [Altair](https://altair-viz.github.io/), itself built atop the popular declarative visualization tool [Vega](https://vega.github.io/vega/), to demonstrate how to get some effective visualizations in relatively few lines of code.

While it is difficult to be comprehensive in a short lecture, the idea behind this exercise is to give an introduction to the flavor of some of these tools so that you can try things out on your own later. As usual, useful resources include the documentation and StackOverflow.

## Part 1: Python Plotting Libraries (Matplotlib, Seaborn, & Pandas)

In the first part of this lab, we will load our data and start creating some simple visualizations, much like the questions we were answering in Lab 1. While we can't explore all of the things you can do with these libraries, we will try to give you a flavor of what is possible. 

We start by importing packages and loading our datasets.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
cands = pd.read_csv("data/cand_summary.txt", delimiter="|")
#Removing US territories
cands = cands[~cands.CAND_OFFICE_ST.isin(["AS", "GU", "MP", "US", "DC", "MH", "PR", "VI"])]
#Add a column for CAND_OFFICE ('P', 'H' or 'S', for President, House, and Senate respectively)
cands["CAND_OFFICE"] = cands.CAND_ID.str[:1]
pacs = pd.read_csv("data/pac_summary.txt", delimiter="|")

The ``cands`` table is the same as the cand_summary table from lab 1 but includes an additional column for the election cycle year. For example, 2015-2016 is represented in the dataset by an ``ELECTION_CYCLE_YR`` of 2016. The data goes back to 1980. Similarly, the ``pacs`` table contains summary information on political action committees dating back to 2000.
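The year-to-cycle convention described above can be sketched in a couple of lines. This is only an illustration, assuming two-year cycles that end in even years; `cycle_year` is a hypothetical helper and not part of the lab's data pipeline.

```python
def cycle_year(calendar_year: int) -> int:
    """Map a calendar year to the even year that closes its two-year cycle."""
    # Odd years roll forward by one; even years already close a cycle
    return calendar_year + (calendar_year % 2)


# Both years of the 2015-2016 cycle map to the same value
print(cycle_year(2015), cycle_year(2016))
```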

In [None]:
cands.dtypes

In [None]:
pacs.dtypes

We can get the 2018 House candidates as follows, sorted by total fundraising receipts.

In [None]:
house_2018 = cands[(cands.ELECTION_CYCLE_YR == 2018) &
                   (cands.CAND_OFFICE == "H")] \
            .sort_values("TTL_RECEIPTS", ascending=False)
house_2018.describe()

Now let's start making some plots! We'll start by looking at the ratio of individual contributions to total receipts for all 2018 House candidates using a scatterplot.

In [None]:
plt.scatter(house_2018.TTL_RECEIPTS, house_2018.TTL_INDIV_CONTRIB)

#Increase the size of the figure for visibility
plt.gcf().set_size_inches(10,6)

It's worth thinking about what is going on here. Matplotlib has a default **figure** that starts with one **axes** object. When we use the ``plt.`` functions, we edit the current axes or the current figure (depending on which property we are manipulating). Alternatively, we can fetch the axes or figure object and operate on it directly. This is useful if you want to create multiple plots in the same figure. An example of multiple plots on the same figure is [here](https://matplotlib.org/devdocs/gallery/subplots_axes_and_figures/subplots_demo.html).
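As a minimal sketch of working with the objects directly, the snippet below creates one figure that holds two axes and draws on each independently. The numbers are made up for illustration and are not FEC data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up values purely for illustration (not FEC data)
receipts = [1.0, 2.5, 4.0, 6.0]
indiv = [0.8, 1.0, 3.5, 2.0]

# One figure holding two axes side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Each axes object is drawn on and titled independently
ax_left.scatter(receipts, indiv)
ax_left.set_title("Scatter")

ax_right.bar(range(len(receipts)), receipts)
ax_right.set_title("Bar")

# Figure-level properties live on the figure, not on either axes
fig.suptitle("Two axes, one figure")
```

Holding on to `fig` and the axes objects avoids relying on whichever figure happens to be "current" when a ``plt.`` function is called.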

There are a couple of issues with the figure above: It has no title, no axis labels, and our area of interest is pretty hard to see. Let's fix that.

In [None]:
#Get the current axes object
ax = plt.gca()

#Create Scatterplot
ax.scatter(house_2018.TTL_RECEIPTS, house_2018.TTL_INDIV_CONTRIB)

#Set the title
ax.set_title("2018 House Individual Contributions vs Total Fundraising")

#Set the axis labels
ax.set_xlabel("TTL_RECEIPTS ($)")
ax.set_ylabel("TTL_INDIV_CONTRIB ($)")

#Adjust the axis limits 
ax.set_xlim((0, 20000000))
ax.set_ylim((0, 20000000))

#Increase figure size on the current figure
plt.gcf().set_size_inches(10,8)

This is looking better, but who are these outliers who receive tons of money but almost none of it from individuals?

The first thing we can do is add multiple series on the same axis to see which party these house candidates belong to. We'll split our data into two sets, one for Republicans and one for Democrats. We will then add both scatterplots to the same axes object and plot them over each other. 

In [None]:
#Get the current figure and axes
ax = plt.gca()

#Split out Republicans and Democrats (third parties are dropped)
rep_2018 = house_2018[house_2018.PTY_AFFILIATION =="REP"]
dem_2018 = house_2018[house_2018.PTY_AFFILIATION =="DEM"]

# Note that we are adding a label here, we are not getting this directly from the data
# We also manually set the color of each.
ax.scatter(rep_2018.TTL_RECEIPTS, rep_2018.TTL_INDIV_CONTRIB, label="REP", color="RED")
ax.scatter(dem_2018.TTL_RECEIPTS, dem_2018.TTL_INDIV_CONTRIB, label="DEM", color="BLUE")

# We can also add the line y=x, where all contributions come from individuals,
# so we can see how far some candidates deviate from it.
# The "--" designates a dashed line
ax.plot([0, 20000000], [0, 20000000], "--")


#Set the title
ax.set_title("2018 House Individual Contributions vs Total Fundraising")

#Set the axis labels
ax.set_xlabel("TTL_RECEIPTS ($)")
ax.set_ylabel("TTL_INDIV_CONTRIB ($)")

#Add a legend
plt.gcf().legend()

#Adjust the axis limits 
ax.set_xlim((0, 20000000))
ax.set_ylim((0, 20000000))

#Increase figure size
plt.gcf().set_size_inches(10,8)

Now we can see that there are a number of politicians raising a ton of money with almost none of it coming from individual contributors. But who exactly are those outliers? One thing we could do is to add data labels for some of these points (see the example [here](https://matplotlib.org/3.1.1/tutorials/text/annotations.html)). We'll come back to this later.

Even with these simple plots, this is starting to get pretty verbose. Matplotlib tends to be a manual process in this way. However, one advantage is the ability to do nearly anything you want.
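A rough sketch of the data-label idea with `ax.annotate` follows. The candidate names and dollar amounts are invented, and the 20% cutoff for calling a point an outlier is an arbitrary assumption for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical candidates with made-up totals (not FEC data)
toy = pd.DataFrame({
    "CAND_NAME": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "TTL_RECEIPTS": [5e6, 12e6, 15e6, 3e6],
    "TTL_INDIV_CONTRIB": [4e6, 1e6, 14e6, 2.5e6],
})

fig, ax = plt.subplots()
ax.scatter(toy.TTL_RECEIPTS, toy.TTL_INDIV_CONTRIB)

# Call a candidate an outlier when under 20% of receipts come from individuals
# (the 20% cutoff is an arbitrary choice for this sketch)
outliers = toy[toy.TTL_INDIV_CONTRIB / toy.TTL_RECEIPTS < 0.2]

# Place each label a few points up and to the right of its marker
for row in outliers.itertuples():
    ax.annotate(row.CAND_NAME,
                xy=(row.TTL_RECEIPTS, row.TTL_INDIV_CONTRIB),
                xytext=(5, 5), textcoords="offset points")
```

Labeling only the filtered rows keeps the plot readable; annotating every point would quickly clutter a scatterplot of this size.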

Seaborn is a wrapper around matplotlib that tends to be a bit easier to use and more easily produces cleaner plots without much manipulation. In the next few steps we'll look at how to show how the total amount of money raised changes each election cycle. Fortunately this data goes back to 1980!

But first we'll revisit the same plot, to see how we would do the same thing in Seaborn.

Note that we can just pass the data to Seaborn and select which columns we want, rather than selecting out columns or datasets ahead of time. The axis labels are taken care of automatically.

One nice thing about Seaborn is that it's a wrapper around Matplotlib so once you produce a plot, you can still manipulate it with the same standard set of Matplotlib functions. Below we adjust the axis limits.

In [None]:
#Set the Seaborn default theme on Matplotlib
sns.set()

#Note that by using hue, Seaborn automatically splits the data into separate series and colors them accordingly.
sns.relplot(x="TTL_RECEIPTS", y="TTL_INDIV_CONTRIB", hue="PTY_AFFILIATION",
            data=house_2018[house_2018.PTY_AFFILIATION.isin(["REP","DEM"])])

#Adjust the axis limits using Matplotlib
plt.xlim((0, 20000000))
plt.ylim((0, 20000000))

#Increase figure size
plt.gcf().set_size_inches(10,8)

Instead of plotting Democrats and Republicans atop one another, we can also use a ``FacetGrid`` to split out different categories into different plots. Let's look at Senate candidates this time.


In [None]:
major_party_senate_2018 = cands[(cands.ELECTION_CYCLE_YR == 2018) & 
                                             (cands.CAND_OFFICE == "S") & 
                                             (cands.PTY_AFFILIATION.isin(["REP", "DEM"]))]
fgrid = sns.FacetGrid(major_party_senate_2018, col="PTY_AFFILIATION")

# Since we have already declared our facet to be "PTY_AFFILIATION"
# Seaborn knows to create two plots side by side
fgrid = fgrid.map(plt.scatter, "TTL_RECEIPTS", "TTL_INDIV_CONTRIB")

#Adjust the axis limits 
plt.xlim((0, 20000000))
plt.ylim((0, 20000000))

#Increase figure size
plt.gcf().set_size_inches(10,8)

Ok. Now let's see how campaign contributions change over time using Seaborn. The first thing we have to do is manipulate our data to get total campaign contributions for the last 40 years.

In [None]:
#Sum the numeric columns per election cycle, across all offices
contrib_sum = cands.groupby("ELECTION_CYCLE_YR").sum(numeric_only=True)
contrib_sum

In [None]:
# reset_index turns the grouped index (ELECTION_CYCLE_YR) back into a regular column for ease of use in Seaborn
contrib_sum = cands.groupby("ELECTION_CYCLE_YR").sum(numeric_only=True).reset_index()

sns.barplot(x="ELECTION_CYCLE_YR", y="TTL_RECEIPTS", data=contrib_sum)
plt.gcf().set_size_inches(10,8)

#Rotate the labels on the x axis for readability
plt.xticks(rotation=45)

We can see the rapid increase in spending on elections over the past 40 years. Although the US population has grown by about 50% and prices have roughly tripled since 1980, spending on elections has far outpaced both.

Next, create a grouped bar plot using Seaborn that compares Democrat and Republican spending on **House** Campaigns since 1980. How many years did Republicans spend more than Democrats? (You may start by copying the code for the above plot)

In [None]:
# Create your plot here

Since we are tracking data over time, what might be a different plot that could be a useful way to view this data?

**Clicker Question**

Now change the above plot to instead be of the type we just decided.

In [None]:
# Create your plot here

Political action committees (PACs) are organizations that raise money to support political campaigns or political ideas. These include party committees (like the Democratic National Committee), committees organized around ideas or particular issues, or special interest groups (like the American Medical Association).

Say we are now interested not just in the total amount of money raised during an election cycle, but also in where that money comes from over time. How is it distributed across both candidates' campaigns and different types of political action committees (PACs)?


What might be a good way of visualizing this?

**Clicker Question**

Let's take a look at how we would accomplish this in Matplotlib/Seaborn. The first thing we have to do is manipulate our data so that it contains what we want. Remember that we want to separate out spending on House, Senate, and presidential elections, as well as spending by different types of political action committees (PACs) per year. Unfortunately the PAC data only goes back to the 2000 election cycle, so we'll settle for looking at the last 20 years instead of the last 40.

In this case we will show how to use the plotting convenience functions of pandas, that internally use Matplotlib.
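As a small sketch of what "internally use Matplotlib" means in practice, the pandas plotting call below hands back an ordinary Matplotlib axes object, so the same Matplotlib methods used earlier still apply. The `toy` DataFrame is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.axes
import pandas as pd

# Made-up totals for illustration (not FEC data)
toy = pd.DataFrame({"House": [1, 2], "Senate": [3, 4]}, index=[2016, 2018])

# DataFrame.plot.bar draws with Matplotlib and returns the axes it drew on
ax = toy.plot.bar(stacked=True)

# Any Matplotlib axes method can be applied afterwards
ax.set_ylabel("Receipts (made-up units)")
```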

Below we combine the ``cands`` and ``pacs`` datasets. For visual clarity, we'll look at only the top 3 types of PACs.

In [None]:
#Sum up and get only the fields we are interested in
election_spending = cands[cands.ELECTION_CYCLE_YR >= 2000]. \
                    groupby(["ELECTION_CYCLE_YR", "CAND_OFFICE"])\
                    .sum(numeric_only=True).reset_index()

#Rename CAND_OFFICE to SPENDING_TYPE so these rows can be stacked with the PAC rows
election_spending = election_spending[["ELECTION_CYCLE_YR", "CAND_OFFICE", "TTL_RECEIPTS"]].rename({"CAND_OFFICE":"SPENDING_TYPE"}, axis="columns")


pac_spending = pacs.groupby(["ELECTION_CYCLE_YR", "CMTE_TP"]).sum(numeric_only=True).reset_index()

#Get the top 3 PAC types by total receipts
top_pacs = pac_spending.groupby("CMTE_TP").sum(numeric_only=True).sort_values("TTL_RECEIPTS", ascending=False).index[:3]
pac_spending = pac_spending[pac_spending.CMTE_TP.isin(top_pacs)]
pac_spending = pac_spending[["ELECTION_CYCLE_YR", "CMTE_TP", "TTL_RECEIPTS"]].rename({"CMTE_TP":"SPENDING_TYPE"}, axis="columns")

total_spending = pd.concat([election_spending, pac_spending])


#Rename to meaningful spending types
spending_type_map = {"H":"House", "S":"Senate", "P":"President", "Q":"Standard PAC", "O":"Super PAC", "Y":"Party PAC"}
total_spending.SPENDING_TYPE = total_spending.SPENDING_TYPE.apply(lambda x: spending_type_map[x])
total_spending

Now that we have our data in the proper format, we will create our plot by first creating a pivot table.

A pivot table summarizes a dataset in a grid: each row and each column represents a category, and each cell holds the value for that row and column combination. For instance, below we create rows for each ``ELECTION_CYCLE_YR`` and columns for each ``SPENDING_TYPE``. The cell corresponding to ``2016`` and ``House`` holds the total for all records where ``ELECTION_CYCLE_YR`` is 2016 and ``SPENDING_TYPE`` is ``House``. Because we already summed the data above, ``pivot`` only has to reshape it.
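The sketch below shows the reshaping on a tiny made-up table, and contrasts `pivot` (reshape only) with `pivot_table` (aggregate while reshaping). The values are invented and stand in for the real totals.

```python
import pandas as pd

# Long-form data: one row per (year, spending type) pair; values are made up
long_form = pd.DataFrame({
    "ELECTION_CYCLE_YR": [2016, 2016, 2018, 2018],
    "SPENDING_TYPE": ["House", "Senate", "House", "Senate"],
    "TTL_RECEIPTS": [10, 20, 30, 40],
})

# pivot only reshapes, so it needs exactly one row per (year, type) pair
wide = long_form.pivot(index="ELECTION_CYCLE_YR",
                       columns="SPENDING_TYPE",
                       values="TTL_RECEIPTS")

# pivot_table aggregates while reshaping, so repeated pairs are allowed
with_dupes = pd.concat([long_form, long_form.iloc[[0]]])
summed = with_dupes.pivot_table(index="ELECTION_CYCLE_YR",
                                columns="SPENDING_TYPE",
                                values="TTL_RECEIPTS",
                                aggfunc="sum")
print(wide)
print(summed)
```

Passing `values=` gives flat column names. When `values` is omitted, pandas keeps the value column name as an extra column level, which is why the reordering step in the lab refers to columns as `("TTL_RECEIPTS", x)` tuples.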


In [None]:
#This will create rows for election cycle years, and columns for each distinct spending type.
total_spending_pivot = total_spending.pivot(index="ELECTION_CYCLE_YR", columns="SPENDING_TYPE")

#Reorder columns for plotting prettiness
column_order = ['Party PAC', 'Standard PAC', "Super PAC", "House", "Senate", "President"]
total_spending_pivot = total_spending_pivot.reindex([("TTL_RECEIPTS", x) for x in column_order], axis = 1)
total_spending_pivot

In [None]:
total_spending_pivot.plot.bar(y = "TTL_RECEIPTS", stacked=True)
plt.gcf().set_size_inches(15,10)
plt.legend()

## Part 2: Declarative Visualizations (Altair & Vega-Lite)

In the first part of the reversed lecture we looked at some of the most popular plotting libraries in Python. You will often find plots made with these libraries in academic papers and online.

In the preceding section, you may have felt that a lot of this is hardcoded. In each case we had to massage our data into a form that was close to what we wanted to plot in pandas, before passing it to the plotting library. We had to manually adjust axes, and aggregate ahead of time.

An alternative to this approach is the idea of **Declarative visualization**. The difference in style is expressed in the Altair [documentation](https://altair-viz.github.io/getting_started/overview.html). 

    "The key idea is that you are declaring links between data columns and visual encoding channels, such as the x-axis, y-axis, color, etc. The rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising range of simple to sophisticated plots and visualizations can be created using a relatively concise grammar."
    
We do this by giving input data, then declaring a set of transformations to get to visual encoding channels.

Altair is a Python wrapper around Vega-Lite, a declarative visualization grammar.

In [None]:
import altair as alt
from vega_datasets import data
from altair import datum
alt.renderers.enable('notebook')
alt.data_transformers.disable_max_rows()

We'll start by revisiting the scatterplot example from above, this time looking at Senate candidates. Altair renders faster when it has fewer points to draw, so we'll give it Senate candidates only to reduce the size of our dataset.

In [None]:
senate_cands = cands[cands.CAND_OFFICE=="S"]
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q",
           color="PTY_AFFILIATION:N"). \
transform_filter(
        ((datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM")))


There is a bunch going on here so let's break it down a step at a time.

```
alt.Chart(senate_cands).mark_point()
```
This line declares that we are creating a chart over some set of data, in our case the Senate candidate dataset. We also specify the kind of mark we are going to make on our plot. In this case it is a "point", but we might also want a bar, a circle, an area, etc. We will look at some other marks later.

We then encode our data using visual channels. Altair can often infer which data type to use, but we can also state it explicitly (recall Monday's lecture on visual data types: Quantitative, Ordinal, Nominal, and Temporal). As a reminder of the difference between these, see the [encoding data types](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types) section of the Altair documentation. Below we'll add one mapping from column to visual encoding at a time, and observe the impact.

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode()

Since we have not specified any encoding for our data, all our points are plotted atop each other. This is obviously not very useful.

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q")

The Q above tells Altair that this is a Quantitative value.

Here we plot the total receipts along the x-axis. Note that labeling is taken care of automatically. We start to see that most Senate candidates are clustered near the low end of total receipts, with only a few reaching beyond the $20 million mark. One thing we might be interested in at this point is creating a histogram, so let's do that. We will first create bins on total receipts, then add a count (with a log axis).

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode(alt.X("TTL_RECEIPTS:Q", bin=True))

This by itself is not very useful without a y-axis. Let's add one for the count of these values, with a log scale, to create a histogram.

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode(alt.X("TTL_RECEIPTS:Q", bin=True),
          alt.Y("count()", scale=alt.Scale(type="log")))

Now we can see that the mark type is probably not appropriate for what we are trying to show. We'll switch from "points" to "bars" to get a histogram of our data. We can see that most candidates raise between 0 and $10 million.

In [None]:
alt.Chart(senate_cands).mark_bar(). \
    encode(alt.X("TTL_RECEIPTS:Q", bin=True),
          alt.Y("count()", scale=alt.Scale(type="log")))

We'll get back to the remaining visual encodings for our dataset now. We'll encode the total individual contributions along the y-axis.

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q")

We now have all our Senate candidates plotted, comparing total receipts and total individual contributions. We will add back color to indicate party affiliation.

In [None]:
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q",
           color="PTY_AFFILIATION:N")

Whoa, while that split the parties up by color, there are a bunch of small parties cluttering up our plot.

We can filter it down to get just Republicans and Democrats. While we could perform this manipulation in Pandas, here we will show how to add transformations to data in Altair.

In [None]:
senate_cands = cands[cands.CAND_OFFICE=="S"]
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q",
           color="PTY_AFFILIATION:N"). \
transform_filter(
        ((datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM")))


We can also split it into two plots by party, by using a facet chart.

In [None]:
senate_cands = cands[cands.CAND_OFFICE=="S"]
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q",
          color="PTY_AFFILIATION:N"). \
    facet(column="PTY_AFFILIATION:N").\
transform_filter(
        ((datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM")))


Remember how we were trying to find out just who some of these outliers were for House candidates? Altair makes this kind of lookup trivial with tooltips, shown here on our Senate data. Conceptually, we are adding another visual encoding channel to the output with the candidate name. Let's add a tooltip for the candidate and election cycle to see which candidates are outliers.

In [None]:
senate_cands = cands[cands.CAND_OFFICE=="S"]
alt.Chart(senate_cands).mark_point(). \
    encode(x="TTL_RECEIPTS:Q",
           y="TTL_INDIV_CONTRIB:Q",
          color="PTY_AFFILIATION:N",
          tooltip=["CAND_NAME", "ELECTION_CYCLE_YR"]). \
    facet(column="PTY_AFFILIATION:N").\
transform_filter(
        ((datum.PTY_AFFILIATION == "REP") | (datum.PTY_AFFILIATION == "DEM")))


We've now explored a small subset of the things you can do with Altair. So here's a chance for you to play around.

**Clicker Question**

Let's say we want to create a plot that shows how much Republicans and Democrats each spent on House elections going back to 1980. What kind of plot might you create?

Try other marks in the following cell to find one you think is appropriate for the data. See the set of marks in the [Altair documentation](https://altair-viz.github.io/user_guide/marks.html).

In [None]:
#Play around with different marks for this dataset. Which ones make sense to you?

house_cands = cands[(cands.CAND_OFFICE == "H") & (cands.PTY_AFFILIATION.isin(["REP", "DEM"]))]
alt.Chart(house_cands).mark_point().encode(
    x="ELECTION_CYCLE_YR:O",
    y="sum(TTL_RECEIPTS)",
    color="PTY_AFFILIATION:N")

There are a couple more cool things you can do with Altair/Vega. In particular, visualizations with maps and interactive visualizations.

Let's say we want to see which states in the last three election cycles had Senate races that raised a disproportionate amount of money per capita. Senate elections are every 6 years so looking at the previous 6 years of data ensures that all Senate seats have had an election.

We'll start by creating a choropleth map, that is, a map that is shaded relative to some statistic. Here let's look at which states' Senate candidates raised the most money per capita.

In [None]:
# First we'll add our state population data to our table.
# we can do this in Altair too, but for convenience we'll do it outside in pandas.
dist_pop = pd.read_csv("data/dist_pop.txt", delimiter="|")

# Add the state population to the senate table with a merge
state_pop = dist_pop[["state", "population"]].groupby("state").sum().reset_index()
senate_pop = pd.merge(left=cands[(cands.ELECTION_CYCLE_YR >= 2014) &
                                 (cands.ELECTION_CYCLE_YR <= 2018) &
                                 (cands.CAND_OFFICE=="S")],
                           right=state_pop, left_on="CAND_OFFICE_ST", right_on="state")
senate_pop

We'll create our map using a built-in dataset containing mappings of states to their geographies. Much of this code is adapted from Altair's [example gallery](https://altair-viz.github.io/gallery/index.html).


In [None]:
#Fetch the state geography data
state_geo = alt.topo_feature(data.us_10m.url, 'states')

#state_geo only contains an opaque numeric state id.
#We need to map our postal code to this state id; the two dictionaries below do this

#Mapping of postal code to state name, then state name to the id used in the geographic data
states = {'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AS': 'American Samoa','AZ': 'Arizona','CA': 'California','CO': 'Colorado','CT': 'Connecticut','DC': 'District of Columbia','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','GU': 'Guam','HI': 'Hawaii','IA': 'Iowa','ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana','MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota','MO': 'Missouri','MP': 'Northern Mariana Islands','MS': 'Mississippi','MT': 'Montana','NA': 'National','NC': 'North Carolina','ND': 'North Dakota','NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada','NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania','PR': 'Puerto Rico','RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee','TX': 'Texas','UT': 'Utah','VA': 'Virginia','VI': 'Virgin Islands','VT': 'Vermont','WA': 'Washington','WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'}
state_to_id = {"Alabama":"1","Alaska":"2","Arizona":"4","Arkansas":"5","California":"6","Colorado":"8","Connecticut":"9","Delaware":"10","District of Columbia":"11","Florida":"12","Georgia":"13","Hawaii":"15","Idaho":"16","Illinois":"17","Indiana":"18","Iowa":"19","Kansas":"20","Kentucky":"21","Louisiana":"22","Maine":"23","Maryland":"24","Massachusetts":"25","Michigan":"26","Minnesota":"27","Mississippi":"28","Missouri":"29","Montana":"30","Nebraska":"31","Nevada":"32","New Hampshire":"33","New Jersey":"34","New Mexico":"35","New York":"36","North Carolina":"37","North Dakota":"38","Ohio":"39","Oklahoma":"40","Oregon":"41","Pennsylvania":"42","Rhode Island":"44","South Carolina":"45","South Dakota":"46","Tennessee":"47","Texas":"48","Utah":"49","Vermont":"50","Virginia":"51","Washington":"53","West Virginia":"54","Wisconsin":"55","Wyoming":"56","Puerto Rico":"72"}

#Set the state id in our data, looked up from the postal code
senate_pop["STATE_ID"]=senate_pop.CAND_OFFICE_ST.apply(lambda x: state_to_id[states[x]])

#Get the per capita receipts for each state
senate_pop["PER_CAPITA_RECEIPTS"]=senate_pop.TTL_RECEIPTS/senate_pop.population

# Get the sum of per capita and total receipts in our senate data
agg_senate_pop = senate_pop[["STATE_ID", "CAND_OFFICE_ST", "PER_CAPITA_RECEIPTS", "TTL_RECEIPTS"]].\
                groupby("STATE_ID").agg({"CAND_OFFICE_ST":"first", "PER_CAPITA_RECEIPTS":"sum", "TTL_RECEIPTS":"sum"}) .reset_index()

#Plots the map from the given dataset, and uses a map projection
per_capita_spending = alt.Chart(state_geo).mark_geoshape().project(
    type='albersUsa'
#looks up the state in our dataset
).transform_lookup(
    lookup='id',
    # Fetches only the columns of interest
    from_=alt.LookupData(agg_senate_pop, 'STATE_ID', ["PER_CAPITA_RECEIPTS", "CAND_OFFICE_ST", "TTL_RECEIPTS"])
).encode(color="PER_CAPITA_RECEIPTS:Q", tooltip=["CAND_OFFICE_ST:N","TTL_RECEIPTS:Q"])
per_capita_spending

Finally, Altair makes using interactions straightforward. We have already seen a bit of this with tooltips when hovering over states above. But you can also perform interactive filtering on datasets.

Let's create a scatterplot of our Senate candidates, and then we will allow users to filter over different regions to see which parties have low ratios of individual contributions to total receipts. We create a selector for our data that will allow us to filter data points. We'll also zoom into our area of interest by setting the scale on the x and y axis.

In [None]:
#Create our interaction
brush = alt.selection(type="interval", resolve="global")

# Get only Republicans and Democrats
base = alt.Chart(senate_pop[senate_pop.PTY_AFFILIATION.isin(["REP", "DEM"])])

#Create the scatterplot with our selection
points = base.mark_point(clip=True).encode(
    alt.X("TTL_RECEIPTS:Q", scale=alt.Scale(domain=(1,20_000_000))),
    alt.Y("TTL_INDIV_CONTRIB:Q", scale=alt.Scale(domain=(1,20_000_000))),
    color=alt.condition(brush, "PTY_AFFILIATION:N", alt.value("lightgray"))).add_selection(brush)

#Filter our bar chart by the selection
count = base.mark_bar().encode(
    x="PTY_AFFILIATION:N",
    color="PTY_AFFILIATION:N",
    y="sum(TTL_RECEIPTS)").transform_filter(brush)

#This concatenates plots side by side. You could stack them vertically using &
points | count

Now we can filter the data in the right bar plot by dragging our mouse over points on the scatter plot. You can try it above.

Of course there is a ton of functionality of these visualization libraries that we did not have the time to get into. For the take home portion of the lab, you will create your own visualization over a dataset of your choice. This will give you the chance to explore these libraries on your own and try new things.

We're looking forward to seeing the visualizations you create!