# Missouri General Election Results

This notebook walks through wrangling, analysis and visualization of *unofficial* results of the statewide Missouri General Election, which will be held on November 3, 2020.

## Dependencies 

First, we import the necessary Python modules (or parts of them). Here's the list of what we need from Python's standard library:

In [2]:
import os
import xml.etree.ElementTree as et

Here are the modules from third-party packages (previously installed via [pip](https://pip.pypa.io/en/stable/installing/)).

In [3]:
import pandas as pd
import altair as alt

And here we'll introduce another third-party package called [Requests](https://requests.readthedocs.io/en/master/), which is very useful for managing HTTP requests and responses.

[HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP) stands for **H**yper**t**ext **T**ransfer **P**rotocol, and it is the foundation of any data exchange on the Web. It sets the rules for how clients like your preferred web browser (e.g., Firefox, Safari, Chrome) communicate with web servers that host web pages and other resources you might fetch.

The HTTP flow starts when a client sends a request to a server. All requests have the following parts:

1. A [method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) indicating the user's desired action.
2. A path to a resource, indicated by a Uniform Resource Locator (aka, [URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL)).
3. Optional headers that convey additional information to the server.

For more background, check out ["How the Web works"](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works) and other resources on the Mozilla Developer Network.
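
To make those three parts concrete, here is a minimal sketch using `urllib.request.Request` from the standard library (rather than Requests, which we import next). The URL and header value are placeholders, and nothing is sent over the network; the request object is only built and inspected.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical URL and header, used only to illustrate the parts of a request.
req = Request(
    "https://example.com/results?year=2020",
    headers={"Accept": "application/xml"},
    method="GET",
)

method = req.get_method()            # 1. the method
path = urlsplit(req.full_url).path   # 2. the path portion of the URL
accept = req.get_header("Accept")    # 3. an optional header

print(method, path, accept)
```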

In [4]:
import requests

## Accessing results data

For every statewide election, the Missouri Secretary of State publishes county-level results. Initial results are posted after polls close at 7 p.m. on election night, with updated results posted every few minutes until all precincts have reported.

Here is the URL:

In [5]:
results_url = "https://enrarchives.sos.mo.gov/apfeed/apfeed.asmx/GetElectionResults"

In order to access the feed, you need a key provided by the SoS Elections Division. To keep the key hidden from public view, we can store it in a variable in our shell environment so that it won't show up in our notebook when we publish it.

Go to your launcher and launch a terminal session. Then you can set the environment variable:

```bash
export ACCESS_KEY="<paste the access key here>"
```

Note that the environment variable only exists for that terminal session (and for programs started from it), and it will be gone once the session is closed. There are a lot of tools and tricks for storing project-specific environment variables and loading them as needed. My preference is [direnv](https://direnv.net/).

We can now access this environment variable via [`os`](https://docs.python.org/3/library/os.html), Python's built-in module for interacting with your computer's underlying operating system.

In [13]:
access_key = os.environ.get('ACCESS_KEY')
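
`os.environ.get` returns `None` rather than raising an error when the variable is missing, so a forgotten key would only surface later as a failed request. One way to catch this early is a small guard like the sketch below. `require_env` is a hypothetical helper (not part of `os`), and the dictionary passed in stands in for the real environment.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
demo_key = require_env("ACCESS_KEY", environ={"ACCESS_KEY": "not-a-real-key"})
print(demo_key)
```

In the notebook itself, the call would simply be `require_env("ACCESS_KEY")`.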

I can fetch this data using the popular [Requests](https://requests.readthedocs.io/en/master/) library, which provides a simple interface for sending [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP) requests and processing the responses.

Here's how I make a `GET` request (the most common HTTP [request method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)) to the XML feed URL. The access key is passed in via a query parameter.

In [14]:
r = requests.get(
    results_url, params={'AccessKey': access_key}
)

This method call returns a [`requests.Response`](https://requests.readthedocs.io/en/master/api/#requests.Response), which I've called `r`.
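
The `params` dictionary ends up encoded in the query string of the requested URL. The sketch below imitates that step with `urllib.parse.urlencode` from the standard library. The base URL and key are placeholders, and this illustrates the idea rather than the code Requests itself runs.

```python
from urllib.parse import urlencode

# Placeholder values, not the real feed URL or a real key.
base_url = "https://example.com/feed"
params = {"AccessKey": "not-a-real-key"}

# Join the base URL and the encoded parameters with a question mark.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

With a real response, `r.url` holds the final URL that was requested. Since it includes the access key, it is best not to print it in a notebook you plan to publish.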

Here's how we check the general status of the response.

In [15]:
r.ok

False

Uh oh, it's not okay.

We can further investigate by checking the response's [status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

In [9]:
r.status_code

403

And the reason for the status code, if provided by the server.

In [10]:
r.reason

'Forbidden'
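
Status codes are grouped into ranges, and `r.ok` is `False` for any code of 400 or above. As a small aside, the standard library's `http.HTTPStatus` can look up a code's standard phrase without making a request:

```python
from http import HTTPStatus

status = HTTPStatus(403)
print(status.value, status.phrase)

# Codes from 400 through 499 signal a problem with the client's request.
is_client_error = 400 <= status < 500
print(is_client_error)
```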

As is often the case, some web servers are configured to block requests coming from specific kinds of [user agents](https://developer.mozilla.org/en-US/docs/Glossary/user_agent) (i.e., software making a request on behalf of a user).

However, we can bypass this issue by modifying the [`'User-Agent'` header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) of the request. The default is `'python-requests/2.23.0'`, but we can substitute what a web browser would send.

Much like the parameters, the headers can be passed in as a `dict` through a keyword argument in the `.get` method call:

In [18]:
r = requests.get(
    results_url,
    params={'AccessKey': access_key},
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:73.0) Gecko/20100101 Firefox/73.0'}
)

Now let's check the status again.

In [19]:
r.ok

True

The actual data we're after is in the `.content` attribute of the response.

In [20]:
xml = r.content

## Preparing results data for analysis

Missouri's election results (and election results published by most other U.S. election authorities) are provided in [XML](https://developer.mozilla.org/en-US/docs/Web/XML/XML_introduction) format. XML stands for E**x**tensible **M**arkup **L**anguage, and it's closely related to HTML (aka, **H**yper**t**ext **M**arkup **L**anguage), another foundational technology of the Web.

HTML and XML both share the concept of [elements](https://developer.mozilla.org/en-US/docs/Glossary/element) which afford structure to documents. 

Whereas HTML has predefined tags like [`<h1>` through `<h6>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Heading_Elements) for headings and [`<p>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p) for paragraphs, XML allows data providers to define their own tags, which affords more semantic annotation of the document's content. In that sense, the tags in XML become sort of like the column headers of a data table: they communicate how we should interpret the inner content.

At the top of this notebook, we imported [`ElementTree`](https://docs.python.org/3.8/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) under the alias `et` (this is part of Python's built-in [XML](https://docs.python.org/3.8/library/xml.etree.elementtree.html) package). In that module is a function named [`fromstring`](https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.fromstring), which parses a string of XML (i.e., the content of our response).

In [21]:
root = et.fromstring(xml)

The `fromstring` function returns the top-level or "root" [`Element`](https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element) of the XML. This is the outermost element of the XML tree.
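
`fromstring` accepts either `bytes` (which is what `.content` gives us) or `str`, and it raises `ParseError` when the payload is not well-formed XML, for example if the server sent back an HTML error page instead. A minimal sketch of guarding against that follows; `parse_feed` is a hypothetical helper and the payloads are made up.

```python
import xml.etree.ElementTree as et


def parse_feed(payload):
    """Return the root element, or None if the payload is not well-formed XML."""
    try:
        return et.fromstring(payload)
    except et.ParseError:
        return None


good = parse_feed(b"<ElectionResults></ElectionResults>")
bad = parse_feed("<html><body>Forbidden")  # unclosed tags, so not well-formed

print(good.tag, bad)
```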

Again, the name of each tag indicates the kind of data the element contains, sort of like a column header in a tabular data format. We can access this label like this:

In [26]:
type(root)

xml.etree.ElementTree.Element

In [24]:
print(root.tag)

ElectionResults


We can also access the element's "attributes", which is a dictionary of all the stuff that's inside the first part of the tag. In this case, there's just one attribute: `'LastUpdated'`, which tells us the date and time the results were last updated.

In [23]:
print(root.attrib['LastUpdated'])

10/27/2020 04:51:42 PM
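
Attribute values are always strings. If we wanted to compare or sort on this timestamp, we could convert it to a `datetime`. The format string below is an assumption inferred from the single value shown above (month/day/year with a 12-hour clock), not something documented for the feed.

```python
from datetime import datetime

last_updated_text = "10/27/2020 04:51:42 PM"

# Assumed format: month/day/year, 12-hour clock with an AM/PM marker.
last_updated = datetime.strptime(last_updated_text, "%m/%d/%Y %I:%M:%S %p")

print(last_updated.isoformat())
```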


The `Element` class has two other methods we'll make use of:

- [`.find`](https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find), which returns the *first* subelement that matches the string you pass in.
- [`.findall`](https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall), which returns *all* of the matching subelements in a list.
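
Here is a quick sketch of the difference between the two, using a small made-up document. It borrows the tag names seen in this notebook, but the nesting and values are invented for illustration and may not match the real feed.

```python
import xml.etree.ElementTree as et

# Made-up sample; not actual feed content.
sample = """
<ElectionResults>
  <TypeRace>
    <Type>Federal</Type>
    <Race><RaceTitle>U.S. President and Vice President</RaceTitle></Race>
  </TypeRace>
  <TypeRace>
    <Type>State of Missouri</Type>
    <Race><RaceTitle>Governor</RaceTitle></Race>
    <Race><RaceTitle>Attorney General</RaceTitle></Race>
  </TypeRace>
</ElectionResults>
""".strip()

sample_root = et.fromstring(sample)

first_type = sample_root.find("TypeRace")     # only the first match
all_types = sample_root.findall("TypeRace")   # every match, as a list
missing = sample_root.find("NoSuchTag")       # None when nothing matches

# A path starting with './/' searches all descendants, not just direct children.
titles = [el.text for el in sample_root.findall(".//RaceTitle")]

print(first_type.find("Type").text, len(all_types), missing, titles)
```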

Unlike CSV and other tabular data formats, XML has a nested structure with elements inside other elements. We can't read XML directly into a `pandas.DataFrame` (though I did just run across [this](https://pypi.org/project/pandas-read-xml/) third-party package, which looks quite promising).

Instead, we need to reformat the data from a nested structure into a two-dimensional structure that resembles a data table. We need to "flatten" the XML document's hierarchy such that each row in our data includes a single observation with an observed value for each variable (i.e., each column).

Essentially, we want each row to contain the number of votes counted for each ballot option in a given county and given contest.

Let's start by defining an empty list to hold all of our rows.

In [15]:
rows = []

Now let's work from the outside in.

In [45]:
for type_race in root.findall('.//TypeRace'):
    row = {'type': type_race.find('Type').text}
    for race in type_race.findall('Race'):
        row['race'] = race.find('RaceTitle').text.strip()
    print(row)

{'type': 'Federal', 'race': 'U.S. President and Vice President'}
{'type': 'State of Missouri', 'race': 'Attorney General'}
{'type': 'US Representative ', 'race': 'U.S. Representative - District 8'}
{'type': 'State Senate ', 'race': 'State Senator - District 33'}
{'type': 'State House ', 'race': 'State Representative - District 163'}
{'type': 'Circuit Court', 'race': 'Circuit Judge - Circuit 42 Division 2'}
{'type': 'Missouri Supreme Court', 'race': 'Missouri Supreme Court'}
{'type': 'Eastern Appellate Court ', 'race': 'Court of Appeals - Eastern District 1'}
{'type': 'Southern Appellate Court ', 'race': 'Court of Appeals - Southern District'}
{'type': 'Western Appellate Court ', 'race': 'Court of Appeals - Western District'}
{'type': 'Circuit Court', 'race': 'Associate Circuit Judge - Circuit 31 - Div 26'}
{'type': 'Ballot Issues', 'race': 'Constitutional Amendment 3'}
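
Note that the loop above reuses one `row` dictionary per `TypeRace` and prints it only after the inner loop finishes, so just the last race of each type appears in the output. A minimal sketch of collecting one row per race instead is shown below. It runs against a small made-up document that uses only the tags seen so far, and `sample_rows` is a hypothetical name chosen to avoid clobbering `rows`.

```python
import xml.etree.ElementTree as et

# Made-up sample; not actual feed content.
sample = """
<ElectionResults>
  <TypeRace>
    <Type>State of Missouri</Type>
    <Race><RaceTitle>Governor</RaceTitle></Race>
    <Race><RaceTitle> Attorney General </RaceTitle></Race>
  </TypeRace>
</ElectionResults>
""".strip()

sample_root = et.fromstring(sample)
sample_rows = []

for type_race in sample_root.findall(".//TypeRace"):
    type_name = type_race.find("Type").text.strip()
    for race in type_race.findall("Race"):
        # Build a fresh dict for every race so earlier rows are not overwritten.
        sample_rows.append(
            {"type": type_name, "race": race.find("RaceTitle").text.strip()}
        )

for row in sample_rows:
    print(row)
```

The same pattern extends inward: each additional level of nesting adds another loop and another key on the row.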


And then inspect a single candidate element.

We can list all of the elements inside the candidate elements.

Now I can load this data into a `pandas.DataFrame`. In so doing, I will specify the column headers. For the sake of consistency, I will reuse the element tags from the XML:

I'll take a look at the columns, their positions, non-null counts and data-types:

And take a peek at the actual data: