# Charting with Altair

This notebook covers some basic data visualization principles and how to implement them using [Altair](https://altair-viz.github.io/index.html), a popular visualization library for Python. Start by importing both Pandas and Altair. We use an alias for Altair, just like Pandas. In this case the alias is `alt`.

In [1]:
import pandas as pd
import altair as alt

## Loading and prepping the data

Next let's load our data. In this notebook we will be using contributions given in support of or in opposition to California Proposition 30, which calls for an income tax increase for those making more than $2 million a year. Revenue from the tax increase will be used for electric vehicle incentive programs and wildfire prevention. Once the data is loaded, take a look at the columns in the dataframe using the `.info()` method. Pay close attention to the data types, because they determine which visualizations you will be able to make with the data.

In [2]:
data_url = "https://gist.githubusercontent.com/esagara/8e1b7b0421bbefc394c60b0aeaa20b9b/raw/253dffacaed0e88d95836ce77504ec92ebf181e9/prop30_contributions.csv"
contribs = pd.read_csv(data_url)
contribs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   filer_id                  152 non-null    int64  
 1   date_filed                152 non-null    object 
 2   filing_id                 152 non-null    int64  
 3   id                        120 non-null    float64
 4   date_received             152 non-null    object 
 5   contributor_committee_id  14 non-null     float64
 6   contributor_firstname     110 non-null    object 
 7   contributor_lastname      152 non-null    object 
 8   contributor_employer      104 non-null    object 
 9   contributor_occupation    110 non-null    object 
 10  contributor_city          120 non-null    object 
 11  contributor_state         120 non-null    object 
 12  contributor_zip           120 non-null    float64
 13  amount                    152 non-null    float64
 14  is_loan   

Some of these columns have the wrong data type. The most notable columns are those with date information in them - **date_filed** and **date_received**. We should convert those columns into a data type that Pandas recognizes as dates. We can do that using a Pandas function - `to_datetime` - on each column.

In [4]:
date_cols = [
    'date_filed',
    'date_received'
]

for date_col in date_cols:
    contribs[date_col] = pd.to_datetime(contribs[date_col])
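If you want to see the conversion on its own, here is a minimal sketch using a small made-up dataframe (the rows and the `toy` name are assumptions for illustration, not part of the contributions file). It also passes `errors="coerce"`, which turns values Pandas cannot parse into `NaT` rather than raising an error.

```python
import pandas as pd

# Hypothetical rows standing in for the real contributions data.
toy = pd.DataFrame({
    "date_filed": ["2022-01-31", "2022-02-15", "not a date"],
    "date_received": ["2022-01-10", "2022-01-12", "2022-01-14"],
    "amount": [3000000.0, 51000.0, 51000.0],
})

# Convert each date column; unparseable strings become NaT instead of raising.
for col in ["date_filed", "date_received"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)
# Count how many values in date_filed could not be parsed.
print(toy["date_filed"].isna().sum())
```

Counting the `NaT` values afterwards is a quick way to spot rows that need cleaning before charting.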

Confirm the data types have been changed correctly.

In [5]:
contribs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   filer_id                  152 non-null    int64         
 1   date_filed                152 non-null    datetime64[ns]
 2   filing_id                 152 non-null    int64         
 3   id                        120 non-null    float64       
 4   date_received             152 non-null    datetime64[ns]
 5   contributor_committee_id  14 non-null     float64       
 6   contributor_firstname     110 non-null    object        
 7   contributor_lastname      152 non-null    object        
 8   contributor_employer      104 non-null    object        
 9   contributor_occupation    110 non-null    object        
 10  contributor_city          120 non-null    object        
 11  contributor_state         120 non-null    object        
 12  contributor_zip       

Now that the dates are fixed, we can see a couple of things in our data. The **amount** column is the only real numeric data we have to chart, so we know our visualizations will focus on that field. The date columns can be used as numeric or categorical data. The rest of the columns are all strings, or objects in Pandas-speak, which means they are categorical data. We are now ready to begin exploring our data with visualizations.
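One way to list which columns fall into each group is `select_dtypes`. This is a minimal sketch on a made-up dataframe (the `toy` rows are assumptions); the same calls should work on `contribs`.

```python
import pandas as pd

# Hypothetical rows with one column of each kind.
toy = pd.DataFrame({
    "name": ["Committee A", "Committee B"],
    "position": ["support", "oppose"],
    "date_received": pd.to_datetime(["2022-01-10", "2022-10-04"]),
    "amount": [3000000.0, 25000.0],
})

# Numeric columns are candidates for quantitative axes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Datetime columns are candidates for time axes.
date_cols = toy.select_dtypes(include="datetime").columns.tolist()
# Everything else is treated here as categorical.
text_cols = [c for c in toy.columns if c not in numeric_cols + date_cols]

print(numeric_cols, date_cols, text_cols)
```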

## Bar charts

Let's start with a simple bar chart. We already know Lyft is a top contributor. Let's take a look at who is filing the forms using the **name** and **amount** fields.

We begin by creating a `Chart` and telling Altair what data we want to visualize.

In [6]:
alt.Chart(contribs)

SchemaValidationError: Invalid specification

        altair.vegalite.v4.api.Chart, validating 'required'

        'mark' is a required property
        

alt.Chart(...)

Altair is throwing an error because we did not give it enough information to draw a chart. We need to specify the type of chart. Altair calls these marks; for a bar chart we want to use the `mark_bar` method. We also need to specify which fields will go on which axes using the `encode` method. Let's put the **amount** column on the Y-axis and the **name** column on the X-axis.

In [7]:
alt.Chart(contribs).mark_bar().encode(x='name',y='amount')

This chart doesn't look right. We shouldn't have a negative value for a single bar like that. This is probably because Altair is trying to chart each contribution individually rather than sum them up. We can correct that pretty easily, because Altair recognizes the `sum` function when charting.
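To make the aggregation concrete, here is a rough Pandas equivalent of what `sum(amount)` computes for each name. The rows are made up for illustration, and this is only a sketch of the idea, not how Altair does the work internally.

```python
import pandas as pd

# Hypothetical contributions, including one negative amount.
toy = pd.DataFrame({
    "name": ["Committee A", "Committee A", "Committee B", "Committee B"],
    "amount": [100.0, 250.0, 500.0, -200.0],
})

# One total per name, which is what each bar should represent.
totals = toy.groupby("name", as_index=False)["amount"].sum()
print(totals)
```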

In [8]:
alt.Chart(contribs).mark_bar().encode(x='name',y='sum(amount)')

That's much better. You can see our scales have changed because Altair is now adding up contributions by filer name. Let's make some improvements to the chart by adding the following features:

- Change it from a vertical bar chart to a horizontal bar chart
- Sort by amount
- Color by the filer's position on the proposition

Changing a bar chart from vertical to horizontal is fairly easy. Altair will need some additional information from us to sort and color the chart.

First let's change the orientation of the chart.

In [9]:
alt.Chart(contribs).mark_bar().encode(x='sum(amount)',y='name')

Now let's look at sorting. We want to sort by **amount** along the vertical axis. To do this we will need some special syntax - the `alt.Y` class. In it we specify the name of the column to use for the vertical axis, just like before, and add one more piece of information using the `sort` keyword. Setting `sort` to `-x` tells Altair to order the bars by the value on the x-axis, here the summed amount, in descending order.

In [10]:
alt.Chart(contribs).mark_bar().encode(
    x='sum(amount)',
    y=alt.Y('name', sort="-x")
)

With that sorted out we can now work on coloring the bars by the filer's position on Prop. 30. This can be done by adding a third keyword argument to the encode function.

In [11]:
alt.Chart(contribs).mark_bar().encode(
    x='sum(amount)',
    y=alt.Y('name', sort="-x"),
    color='position'
)

We've covered some of the basics for creating a bar chart. Keep in mind the main thing that dictates this is a bar chart is the `mark_bar` method. This means many of the features we implemented - sorting, coloring and aggregating data - can be used on any type of chart, not just bar charts. With that in mind, let's move on to line charts and explore some other options.

## Line charts

Line charts are ideal for showing change over time. Time data - years, months, days, etc. - is best displayed along the horizontal axis while numeric values are displayed along the vertical axis. This means our date fields - particularly the **date_received** column - can come into play here. Let's do a simple visualization charting out contributions over time. As before we have to create the chart and tell Altair what data to use. We also have to specify the axes using the `encode` method. However, we will be using the `mark_line` method instead of the `mark_bar` method.

In [12]:
alt.Chart(contribs).mark_line().encode(x='date_received', y='sum(amount)')

Notice that we are still summing up the contribution amounts. This makes for a really nice graphic, but it also has the unintended consequence of hiding negative amounts - typically cases where the filer has paid back a loan. This is something that should be thought about and probably discussed with others before publishing a graphic based on it, but we will continue for now.
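Here is a small sketch of how a daily sum can hide a negative amount. The rows are made up: on one day a filer receives a contribution and also records a negative entry, such as a loan repayment.

```python
import pandas as pd

# Hypothetical rows: two entries on the same day, one of them negative.
toy = pd.DataFrame({
    "date_received": pd.to_datetime(["2022-03-01", "2022-03-01", "2022-03-02"]),
    "amount": [30000.0, -20000.0, 5000.0],
})

# Compare the daily total with the smallest single entry on each day.
daily = toy.groupby("date_received")["amount"].agg(total="sum", smallest="min")
print(daily)
```

The total for the first day is positive, so a chart of daily sums would give no hint that a negative entry exists.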

There are a few issues we should address, however. The first is the labels on the horizontal axis. They are long and kind of cumbersome. We can fix that to some extent by making sure Altair knows that the **date_received** column is temporal data. This is done by adding `:T` after the column name.

In [13]:
alt.Chart(contribs).mark_line().encode(x='date_received:T', y='sum(amount)')

That's better. The graphic is a bit easier to read now.

The other issue we need to fix is we cannot distinguish contributions raised in support of Prop. 30 from those opposed to it. One approach is to split those up so each position has its own line in the chart. This can be done using the `color` keyword argument in the `encode` function as we did above.

In [14]:
alt.Chart(contribs).mark_line().encode(x='date_received:T', y='sum(amount)', color='position')

Better, but still difficult to read. We can try another approach called faceting. Faceting is used to create multiple versions of the visualization based on different subsets of the data. To do this we are going to drop the `color` setting and add another function - `facet`. The only argument we will pass into the `facet` function is the name of the field we want to split the data up by - **position**.

In [15]:
alt.Chart(contribs).mark_line().encode(x='date_received:T', y='sum(amount)').facet('position')

In [16]:
alt.Chart(contribs).mark_line().encode(x='date_received:T', y='sum(amount)', color='position').facet('name')

And that's a much more readable presentation of the data. Notice that our negative values returned. In this case it looks like someone from the opposition camp paid back a loan. This means our initial line chart displaying all contributions would have had some serious accuracy issues.

## Bonus section: Line charts displaying cumulative sums

The charts above may look ok, but are they telling us the story we want to know? Does it make more sense to show how much money was raised on a given day, or a running total? To graph the running totals we need to do some work in Pandas, namely calculating the cumulative sum of the contributions by **date_received** and **position**. To begin we should sort the data by **date_received** to make sure Pandas calculates the running total correctly. We use the `inplace=True` argument so Pandas knows to sort the existing dataframe rather than return a new sorted dataframe. Notice there will not be any output in this case.

In [17]:
contribs.sort_values('date_received', inplace=True)
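An alternative, sketched here on made-up rows, is to assign the sorted result to a new variable and ask for a stable sort with `kind="mergesort"`. A stable sort keeps rows that share the same date in their original order, which can make the running totals easier to follow.

```python
import pandas as pd

# Hypothetical rows with repeated dates, deliberately out of order.
toy = pd.DataFrame({
    "date_received": pd.to_datetime(
        ["2022-10-04", "2022-01-10", "2022-10-04", "2022-01-10"]
    ),
    "amount": [25000.0, 3000000.0, 10000.0, 0.0],
})

# Returns a new sorted dataframe and leaves toy untouched.
ordered = toy.sort_values("date_received", kind="mergesort")
print(ordered)
```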

Next we need to calculate the cumulative sum using a `groupby` with the `cumsum` function. We are going to assign the output to a new field called `running_total`.

In [18]:
contribs['running_total'] = contribs.groupby('position')['amount'].cumsum()
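To see what a grouped `cumsum` does, here is a minimal sketch on made-up rows. Each position keeps its own running total, even though the rows for the two positions are interleaved.

```python
import pandas as pd

# Hypothetical contributions, sorted by date with a stable sort.
toy = pd.DataFrame({
    "date_received": pd.to_datetime(
        ["2022-01-10", "2022-01-12", "2022-01-12", "2022-01-14"]
    ),
    "position": ["support", "oppose", "support", "oppose"],
    "amount": [100.0, 40.0, 50.0, -10.0],
}).sort_values("date_received", kind="mergesort")

# Running total within each position, in row order.
toy["running_total"] = toy.groupby("position")["amount"].cumsum()
print(toy)
```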

Let's check our work.

In [19]:
cummulative_contribs = contribs[['date_received', 'position', 'amount', 'running_total']]
cummulative_contribs

Unnamed: 0,date_received,position,amount,running_total
0,2022-01-10,support,3000000.0,3000000.0
1,2022-01-10,support,0.0,3000000.0
2,2022-01-10,support,0.0,3000000.0
3,2022-01-12,support,0.0,3000000.0
4,2022-01-12,support,51000.0,3051000.0
...,...,...,...,...
148,2022-10-04,oppose,25000.0,11651336.0
150,2022-10-04,oppose,10000.0,11661336.0
146,2022-10-04,oppose,20891.5,11682227.5
147,2022-10-04,oppose,5000.0,11687227.5


That's a start in the right direction. The issue now is that we have duplicate entries for some days. Altair may not chart what we want if we leave those duplicates in. What we want is the maximum **running_total** for each given day. We can get there by doing a `groupby` on **date_received** and **position** and excluding the **amount** field from the calculation.

In [20]:
cummulative_contribs = cummulative_contribs[['date_received', 'position', 'running_total']].groupby(['date_received', 'position'], as_index=False).max()
cummulative_contribs

Unnamed: 0,date_received,position,running_total
0,2022-01-10,support,3000000.0
1,2022-01-12,support,3051000.0
2,2022-01-14,support,3102000.0
3,2022-01-18,support,3152000.0
4,2022-01-21,support,3352000.0
...,...,...,...
60,2022-09-30,support,47091237.0
61,2022-10-01,support,47131237.0
62,2022-10-02,oppose,11019600.0
63,2022-10-03,oppose,11621336.0
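One caveat worth checking: taking the maximum gives the end-of-day total only when every amount that day is zero or positive. If a day includes a negative amount, the last value of the day may be the better choice. The sketch below uses made-up rows to show the difference; whether it matters for the real data is something to verify.

```python
import pandas as pd

# Hypothetical day with two contributions followed by a negative entry.
toy = pd.DataFrame({
    "date_received": pd.to_datetime(["2022-10-04"] * 3),
    "position": ["oppose"] * 3,
    "amount": [100.0, 30.0, -10.0],
})
toy["running_total"] = toy.groupby("position")["amount"].cumsum()

keys = ["date_received", "position"]
# Highest running total reached during the day.
by_max = toy.groupby(keys, as_index=False)["running_total"].max()
# Running total after the final entry of the day.
by_last = toy.groupby(keys, as_index=False)["running_total"].last()

print(by_max)
print(by_last)
```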


Now we are ready for some charting. Let's do the same chart as we did above, but using the **running_total** field instead of the **amount** field.

In [21]:
alt.Chart(cummulative_contribs).mark_line().encode(x='date_received:T', y='running_total').facet('position')