# Choropleths

A **choropleth** is a map in which areas are colored according to some statistic or variable of interest. Perhaps the most familiar example of a choropleth is the presidential election map, which shows the percentage in each county who voted for the Democratic or Republican candidate. In this graphic, the areal units are counties, and the statistic of interest is the percentage who voted for the Democratic (or Republican) candidate.

![](https://github.com/dlsun/pods/blob/master/12-Geospatial-Data/img/2016election.png?raw=1)

In this notebook, you will learn how to make choropleths like the one above.

## Shapefiles

The shapefile format is a data format for geometric objects, such as points, lines, and polygons. A shapefile can be used to describe the boundaries of a lake, the course of a river, or the boundaries of a county.

You can find shapefiles for most geographic entities online. For example, the [U.S. Census Bureau](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html) maintains shapefiles for boundaries of states, counties, and congressional districts in the United States. Shapefiles for international data can be found [at the Natural Earth website](https://www.naturalearthdata.com/downloads/110m-cultural-vectors/).

The U.S. county shapefiles (at resolution 1:500,000) are located at https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip. If you look inside the zip archive, you'll notice that "shapefile" is something of a misnomer: the format refers not to a single file but to a collection of files, all of which share the same filename but have different extensions. The main extensions are:

- `.shp` - shape format, which stores the geometric objects
- `.shx` - shape index format, which indexes the objects to make them quickly searchable
- `.dbf` - attribute format, which stores additional metadata about each object
- `.prj` - projection format, which stores the coordinate reference system of the geometries
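To see the "one name, many extensions" layout concretely, here is a minimal sketch that lists the members of a zip archive and groups them by extension. It builds a tiny in-memory archive with a hypothetical base name (`toy_counties`) and empty placeholder files, so it only illustrates the naming convention and does not produce a valid shapefile.

```python
import io
from pathlib import PurePosixPath
from zipfile import ZipFile

# Build a stand-in archive in memory (hypothetical name, empty placeholder files)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    for ext in (".shp", ".shx", ".dbf", ".prj"):
        archive.writestr(f"toy_counties{ext}", b"")

# List the members, as you might do for a downloaded shapefile archive
with ZipFile(buffer) as archive:
    members = archive.namelist()

# All members share one base name and differ only in their extension
stems = {PurePosixPath(name).stem for name in members}
extensions = sorted(PurePosixPath(name).suffix for name in members)
print(stems)
print(extensions)
```

For a real download, you would pass the path of the saved `.zip` file to `ZipFile` instead of the in-memory buffer.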

`GeoPandas` makes it easy to read a shapefile into a `GeoDataFrame`.

In [None]:
!pip install --upgrade geopandas

In [None]:
import pandas as pd

import geopandas as gpd

from matplotlib import pyplot as plt

In [None]:
zipfile = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip"

df_counties = gpd.read_file(zipfile)

df_counties

The **geometry** column contains the "patches," such as `Polygon`s, used to construct maps. We can plot the county boundaries as in the previous notebook.

In [None]:
df_counties.boundary.plot(figsize=(12, 12));

Let's adjust the axes to zoom in on the continental U.S.

In [None]:
df_counties.boundary.plot(figsize=(12, 12));
plt.xlim(-130, -65);
plt.ylim(22, 50);

## Making Choropleths

A map graphic is a collection of patches. A "patch" is simply a 2-dimensional object with an edge color and face color. Examples of patches include circles, rectangles, and polygons. Since areal units---like countries, states, and counties---are polygons in general, the most important type of patch for making a choropleth is the `Polygon`. A `Polygon` is specified by a list of its vertices; see the **geometry** column. Thus, one way to create a choropleth is to draw the `Polygon` for each county, one at a time, setting the face color of each patch to an appropriate color based on the data value for that county. This requires that we maintain a color map that maps data values to colors.
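To make the patch-by-patch approach concrete, here is a minimal sketch using `matplotlib` directly. The three regions and their data values are hypothetical toy shapes, not real counties; `Normalize` plus a colormap plays the role of the mapping from data values to colors described above.

```python
from matplotlib import pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.patches import Polygon

# Hypothetical toy regions: name -> (list of vertices, data value in [0, 1])
toy_regions = {
    "A": ([(0, 0), (1, 0), (1, 1), (0, 1)], 0.05),
    "B": ([(1, 0), (2, 0), (2, 1), (1, 1)], 0.40),
    "C": ([(0, 1), (2, 1), (1, 2)], 0.80),
}

# The color map: Normalize rescales values to [0, 1], the colormap turns them into RGBA
norm = Normalize(vmin=0, vmax=1)
cmap = plt.get_cmap("Blues")

fig, ax = plt.subplots(figsize=(4, 4))
face_colors = {}
for name, (vertices, value) in toy_regions.items():
    face_colors[name] = cmap(norm(value))
    # Draw one polygon at a time, setting its face color from the data value
    ax.add_patch(Polygon(vertices, facecolor=face_colors[name], edgecolor="black"))

ax.set_xlim(0, 2)
ax.set_ylim(0, 2)
ax.set_aspect("equal")
```

In practice you rarely need to write this loop yourself, since plotting libraries do the same bookkeeping for you.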

For example, suppose we want to color each county according to the fraction of the county's area that is water. We first compute this fraction using the **ALAND** and **AWATER** columns.

In [None]:
df_counties["frac_water"] = df_counties["AWATER"] / (df_counties["AWATER"] + df_counties["ALAND"])
df_counties

`GeoPandas` makes it easy to create choropleth maps. Simply call the `plot` method with the `column` argument set to the name of the column whose values should determine the colors.

In [None]:
df_counties.plot(column='frac_water', figsize=(12, 12))
plt.xlim(-130, -65);
plt.ylim(22, 50);

We can control the colormap with the `cmap` argument. A list of the available colormaps can be found at the [Matplotlib website](https://matplotlib.org/users/colormaps.html). Since we are displaying the fraction of each county that is water, we'll use a blue color map (`Blues`). Notice that the counties that border major bodies of water tend to have a higher fraction of their area as water, which should make sense!

In [None]:
df_counties.plot(column='frac_water', cmap = "Blues", figsize=(12, 12));
plt.xlim(-130, -65);
plt.ylim(22, 50);

## Making Choropleths with Outside Data

In the above example, we made a choropleth from data that was already in the shapefile. But in general, the shapefile only contains minimal metadata about each areal unit. Suppose the data that we want to visualize resides in a separate file. For example, suppose we want to plot the 2016 presidential election results by county (specifically "per_dem", the percentage in the county who voted for the Democratic candidate, Hillary Clinton).

In [None]:
df_election = pd.read_csv("https://dlsun.github.io/pods/data/election2016.csv")
df_election

We will need to merge `df_election` with the `df_counties` `DataFrame` that we defined above. But what do we merge the `DataFrame`s on? It turns out that every county in the United States is assigned a unique ID called a [FIPS code](https://www.census.gov/library/reference/code-lists/ansi.html). The FIPS code appears in `df_election` as **combined_fips** and in `df_counties` as **GEOID**. Let's take a look at these columns.

In [None]:
df_election["combined_fips"]

In [None]:
df_counties["GEOID"]

Notice that `df_counties` treats the FIPS code as a string (so every FIPS code is exactly 5 digits, with a leading zero if necessary). On the other hand, `df_election` treats the FIPS code as an integer. If we want to join the two, we will have to cast them to the same type. It is probably easier to convert the string to an integer than vice versa.

In [None]:
df_counties["GEOID"] = df_counties["GEOID"].astype(int)
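Either direction of the conversion works. The sketch below shows both on a few illustrative FIPS codes (the `Series` names are hypothetical): casting strings to integers drops the leading zero, while going from integers back to strings requires padding to 5 digits with `str.zfill`.

```python
import pandas as pd

# A few illustrative FIPS codes, stored both ways
fips_as_str = pd.Series(["01001", "06079", "48201"])
fips_as_int = pd.Series([1001, 6079, 48201])

# String -> integer: the leading zero simply disappears
str_to_int = fips_as_str.astype(int)

# Integer -> string: pad with zeros on the left to restore the 5-digit form
int_to_str = fips_as_int.astype(str).str.zfill(5)

print(str_to_int.tolist())
print(int_to_str.tolist())
```

Keeping the codes as zero-padded strings is the safer choice if you later need to split them into state and county parts.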

Now we can merge `df_election` and `df_counties` to add the 2016 election results to our GeoDataFrame.

In [None]:
df_all = df_counties.merge(df_election,
                           how="left",
                           left_on="GEOID", right_on="combined_fips")
df_all

One quick check is to make sure that `df_all` has the same number of rows as `df_counties`. This seems to be the case.
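This check can also be automated. The sketch below uses two hypothetical toy frames (`toy_counties` and `toy_election`, with made-up values) and two standard options of `DataFrame.merge`: `validate` raises an error if the keys are not unique, and `indicator` adds a `_merge` column recording which rows found a match.

```python
import pandas as pd

# Hypothetical toy data: three areal units, only two of which have election results
toy_counties = pd.DataFrame({"GEOID": [1001, 1003, 72001]})
toy_election = pd.DataFrame({"combined_fips": [1001, 1003],
                             "per_dem": [0.25, 0.45]})

merged = toy_counties.merge(
    toy_election,
    how="left",
    left_on="GEOID", right_on="combined_fips",
    validate="one_to_one",  # fail loudly if a key is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# A left join on unique keys should preserve the number of rows
assert len(merged) == len(toy_counties)

# Rows marked "left_only" are the ones that will have missing election data
print(merged["_merge"].value_counts())
```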

Now we can make a choropleth as before. There is just one catch. When we left-joined `df_counties` to `df_election`, some of the FIPS codes could not be matched. Therefore, these counties will be missing election data. Unfortunately, `matplotlib`'s color maps do not handle missing values gracefully, so we will have to handle these manually.

Let's first take a look at which states the counties with missing data were in.

In [None]:
missing_data = df_all[df_all.per_dem.isnull()].STATEFP.value_counts()

missing_data

[A list of FIPS State Codes can be found here](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code). The "states" with more than one county missing election data are all outlying territories: Puerto Rico (72), Northern Mariana Islands (69), Virgin Islands (78), American Samoa (60), and Guam (66). The two remaining states, each with exactly one county that could not be joined, are Alaska (02) and Hawaii (15), neither of which appears on our map of the continental United States. We could therefore fill the missing values in `df_all` with an arbitrary value in the middle of the data range without affecting the appearance of the map. But since we're not displaying any of these states or territories anyway, we'll simply remove all of their counties from the data frame.

In [None]:
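# .copy() makes df_all an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
df_all = df_all[~df_all["STATEFP"].isin(missing_data.index)].copy()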
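Dropping rows is not the only option. As a hedged alternative, recent versions of `GeoPandas` accept a `missing_kwds` argument to `plot` that styles areas with missing values separately. The sketch below uses a hypothetical three-square `GeoDataFrame` (`toy`) with made-up values rather than the real county data.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import box

# Hypothetical toy data: three unit squares, the middle one has no election data
toy = gpd.GeoDataFrame(
    {"per_dem": [0.2, np.nan, 0.7]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Areas with a missing value are drawn in light grey instead of being colored by the colormap
ax = toy.plot(column="per_dem", cmap="RdBu", vmin=0, vmax=1,
              edgecolor="black", missing_kwds={"color": "lightgrey"})
```

This keeps every area on the map while making it obvious which ones lack data.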

Now we'll make a choropleth map of the percentage of votes cast for the Democratic candidate (Hillary Clinton) by county, using the `RdBu` colormap, which maps the smallest value to red and the largest to blue. (Since we removed counties that aren't in the continental U.S., we don't have to adjust the axis limits anymore.)

In [None]:
df_all.plot(column = "per_dem", cmap = "RdBu", figsize = (12, 12));

We can add a color bar legend with `legend=True`.

In [None]:
df_all.plot(column = "per_dem", cmap = "RdBu", legend=True, figsize = (12, 12));

Here is some additional formatting of the legend.

In [None]:
from mpl_toolkits.axes_grid1 import make_axes_locatable

fig, ax = plt.subplots(1, 1)

divider = make_axes_locatable(ax)

cax = divider.append_axes("right", size="5%", pad=0.1)

df_all.plot(column = "per_dem", cmap = "RdBu", ax=ax, legend=True, cax=cax);

Notice that the color scale is calculated automatically from the range of the data, so the midpoint of the colormap does not necessarily correspond to an even split. As a result, some counties that went for Clinton may actually be colored red!
One way to fix this problem is to plot a different statistic: the difference between the percentage of votes for Clinton and the percentage of votes for Trump. We will also manually set the minimum and maximum of the color scale so that a difference of 0 falls in the middle.

Finally, we'll change the projection by changing the CRS; note the change in scale on the axes.

In [None]:
df_all["per_diff"] = df_all["per_dem"] - df_all["per_gop"]

ax = df_all.to_crs("EPSG:3082").plot(
    column="per_diff",
    figsize=(12, 12),
    cmap="RdBu", vmin=-0.9, vmax=0.9)
ax.set_xlim(-0.7e6, 4.2e6)
ax.set_ylim(0.55e7, 0.9e7)
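To see why symmetric limits matter, the sketch below compares where a difference of 0 lands on the color scale under two normalizations. The asymmetric range (-0.3 to 0.9) is a hypothetical stand-in for limits taken automatically from the data; the symmetric range matches the `vmin` and `vmax` used above.

```python
from matplotlib import pyplot as plt
from matplotlib.colors import Normalize

cmap = plt.get_cmap("RdBu")

# Hypothetical limits taken from the data: the range is not symmetric around 0
auto_norm = Normalize(vmin=-0.3, vmax=0.9)

# Manually chosen symmetric limits, as in the plot above
sym_norm = Normalize(vmin=-0.9, vmax=0.9)

# Position of a tie (difference of 0) on the [0, 1] color scale
tie_auto = float(auto_norm(0.0))  # about 0.25, well inside the red half
tie_sym = float(sym_norm(0.0))    # 0.5, the neutral midpoint of RdBu

# A county with a modest Democratic margin still gets a reddish color under auto_norm
reddish = cmap(float(auto_norm(0.1)))
print(tie_auto, tie_sym, reddish)
```

With symmetric limits, the sign of the difference alone determines whether a county falls on the red or the blue side of the colormap.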

## Problems with Choropleths

> "Oh, I love those beautiful red areas, that middle of the map.  There’s just a little blue here and a little blue there.  Everything else — everything else is bright red."
>
> -- Donald Trump

Choropleths can be misleading because they violate the *area principle*---the principle that the area on a graph should represent the magnitude of the data being presented. Since the geographic size of a county is often irrelevant to the data being presented, choropleths can easily be misinterpreted.

For example, President Donald Trump cites the large amount of red area on the 2016 electoral map as an indication of overwhelming support for him. However, the total amount of red area is a statistic that conflates two unrelated quantities: the geographic size of a county and the depth of support for Trump. In fact, most of the red area is in rural parts of the country where few people live. From the choropleth, one would never guess that more people actually voted for his opponent Hillary Clinton! She received strong support from urban areas, but unfortunately for her, cities can barely be seen on a map---despite being home to a majority of Americans.
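A small calculation shows how far the two quantities can diverge. The numbers below are hypothetical, chosen only to illustrate the arithmetic, and the column names (`area_sq_mi`, `total_votes`) are assumptions rather than columns from the election data used above.

```python
import pandas as pd

# Hypothetical numbers for illustration only; not real election data
toy = pd.DataFrame({
    "county": ["rural_1", "rural_2", "city"],
    "area_sq_mi": [5000, 4000, 500],
    "total_votes": [10_000, 8_000, 200_000],
    "per_dem": [0.30, 0.35, 0.65],
})

# Share of the map's area covered by counties where the Democratic share is below 50%
# (treating the race as a two-way contest for simplicity)
is_red = toy["per_dem"] < 0.5
red_area_share = toy.loc[is_red, "area_sq_mi"].sum() / toy["area_sq_mi"].sum()

# Share of all votes cast that went to the Democratic candidate
dem_vote_share = (toy["per_dem"] * toy["total_votes"]).sum() / toy["total_votes"].sum()

print(f"Red share of area: {red_area_share:.1%}")
print(f"Democratic share of votes: {dem_vote_share:.1%}")
```

In this toy example, roughly 95% of the area is red even though the Democratic candidate wins about 62% of the votes, which is exactly the kind of gap a choropleth hides.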

It is important to be cautious when designing and interpreting choropleths. (Here is an interesting [*NY Times* article about mapping election results](https://www.nytimes.com/interactive/2016/11/01/upshot/many-ways-to-map-election-results.html).)