# The Milwaukee Journal Sentinel and Potholes

**Story:** [Race gap found in pothole patching](https://web.archive.org/web/20081223094123/http://www.jsonline.com/news/milwaukee/32580034.html)

**Authors:** Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel

**Topics:** Census Data, Geocoding, QGIS Spatial Joins, Linear Regression

**Datasets**

* Pothole fill data from the Milwaukee [Department of Public Works](https://city.milwaukee.gov/dpw)
    - **2007-2010 POTHOLES.xls**: Pothole data, July 2007-July 2010
    - **2010-2013 POTHOLES.xls**: Pothole data, July 2010-July 2013
    - **2013-2017 POTHOLES.xls**: Pothole data, July 2013-July 2017
* **tl_2013_55_tract.zip:** 2013 census tract boundaries from the [US Census Bureau](https://www.census.gov/cgi-bin/geo/shapefiles/index.php)
* **addresses_geocoded.csv:** a large selection of addresses in Milwaukee, geocoded by [Geocod.io](https://geocod.io)
* **R12216099_SL140.csv:** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)
    - Table A04001, Hispanic or Latino by Race
    - `R12216099.txt` is the data dictionary
* **R12216226_SL140.csv:** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)
    - Table A14006, 2013 Median Household income
    - Data dictionary [is here](https://www.socialexplorer.com/data/ACS2013_5yr/metadata/?ds=SE&table=A14006)

# What's the story?

We're trying to figure out if the **time it took Milwaukee to fill potholes** is related to the racial makeup of a census tract.

# Do your imports

You'll also want to set pandas to display **up to 200 columns at a time**.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
pd.set_option('display.max_columns', 200)

# Create a dataframe of potholes from only 2013

The Department of Public Works delivered multiple files to us, so we'll need to filter them down to just 2013.

**Editorial decision:** When we say "a pothole from 2013," what exactly do we mean?

In [3]:
# An .xls workbook is not a CSV, so use read_excel (legacy .xls files need the xlrd package)
df = pd.read_excel('2010-2013 POTHOLES.xls')
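Because the files run July to July, 2013 straddles two of them. The sketch below stacks two frames and keeps only 2013 rows. The toy frames stand in for the real workbooks, which you would load with `pd.read_excel`. It assumes that "a pothole from 2013" means a request *entered* during 2013, which is only one possible answer to the editorial question above.

```python
import pandas as pd

# Toy stand-ins for '2010-2013 POTHOLES.xls' and '2013-2017 POTHOLES.xls'.
# The date strings here are made up; the real files may store dates differently.
older = pd.DataFrame({
    'EnterDt': ['2012-12-30 08:00', '2013-01-02 09:15'],
    'ResolvDt': ['2013-01-03 10:00', '2013-01-04 14:30'],
})
newer = pd.DataFrame({
    'EnterDt': ['2013-08-01 07:45', '2014-01-05 11:00'],
    'ResolvDt': ['2013-08-03 16:00', '2014-01-06 09:00'],
})

# Stack the two files into one dataframe
potholes = pd.concat([older, newer], ignore_index=True)

# Parse the date columns; unparseable values become NaT instead of raising
potholes['EnterDt'] = pd.to_datetime(potholes['EnterDt'], errors='coerce')
potholes['ResolvDt'] = pd.to_datetime(potholes['ResolvDt'], errors='coerce')

# Assumption: "from 2013" means the request was entered during 2013
potholes_2013 = potholes[potholes['EnterDt'].dt.year == 2013]
print(potholes_2013)
```

Filtering on `ResolvDt` instead would answer a slightly different question (potholes *filled* in 2013), so it is worth stating which one you chose.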

## What is the maximum and minimum `EnterDt` and `ResolvDt`?

Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.

* **Tip:** Missing data might be a headache
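One way to run this check, sketched on a toy frame with a deliberately missing resolution date. `.min()` and `.max()` skip missing values, so it helps to count them separately.

```python
import pandas as pd

# Toy data: the second pothole has no resolution date
potholes = pd.DataFrame({
    'EnterDt': pd.to_datetime(['2013-01-02', '2013-06-15', '2013-12-30']),
    'ResolvDt': pd.to_datetime(['2013-01-04', None, '2014-01-02']),
})

for col in ['EnterDt', 'ResolvDt']:
    print(col)
    print('  earliest:', potholes[col].min())
    print('  latest:  ', potholes[col].max())
    print('  missing: ', potholes[col].isna().sum())
```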

## Calculate how long it took to fill potholes in 2013

Save it into a new column.

* **Tip:** It's possible to subtract two dates
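A small sketch of the subtraction on toy data. The column name `wait` is a placeholder, not something from the original dataset.

```python
import pandas as pd

potholes = pd.DataFrame({
    'EnterDt': pd.to_datetime(['2013-01-02 09:00', '2013-06-15 12:00']),
    'ResolvDt': pd.to_datetime(['2013-01-04 15:00', '2013-06-15 18:00']),
})

# Subtracting two datetime columns gives a timedelta column
potholes['wait'] = potholes['ResolvDt'] - potholes['EnterDt']
print(potholes['wait'])
```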

### Hrm, well, I think we need that difference to be an integer

If your new column isn't an integer, create _another_ column that is.

* **Tip:** Just like you might use `.str.strip()` on a string column, if your column is a timedelta (which is what subtracting two dates gives you) you can use `.dt.components` to get the days, hours, minutes, seconds, etc. of the column.
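As an illustration, here is a toy timedelta series turned into whole days. Note that this truncates rather than rounds, so a 6-hour fill counts as 0 days.

```python
import pandas as pd

# Toy wait times
wait = pd.Series(pd.to_timedelta(['2 days 06:00:00',
                                  '0 days 06:00:00',
                                  '11 days 23:59:00']))

# .dt.components is a dataframe with one column per unit (days, hours, ...)
parts = wait.dt.components
print(parts[['days', 'hours', 'minutes']])

# .dt.days is a shortcut for the whole-day part
whole_days = wait.dt.days
print(whole_days)
```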

# Read in your race-related census data

* **Tip:** This will probably give you an error the first time you try, because the Census Bureau doesn't think about the rest of the world.
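The error is most likely an encoding problem. A sketch of the usual workaround, using a tiny in-memory file instead of the real export: the column names `Geo_FIPS` and `Geo_NAME` are assumptions about what the Social Explorer file contains, and `latin-1` is a guess at its encoding.

```python
import io
import pandas as pd

# A tiny latin-1 encoded stand-in for the CSV (\xf1 is "ñ" in latin-1,
# which is not valid UTF-8 on its own)
raw = (b'Geo_FIPS,Geo_NAME\n'
       b'55079000101,Census Tract 1.01\n'
       b'55079000102,Ca\xf1on Tract\n')

# Name the encoding explicitly, and read tract codes as text so they
# can be matched against codes from other files later
census = pd.read_csv(io.BytesIO(raw),
                     encoding='latin-1',
                     dtype={'Geo_FIPS': str})
print(census)
```

With the real file you would pass the filename in place of `io.BytesIO(raw)`.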

## Create new dataframe describing the racial makeup of each census tract

Each row should contain:

* The census tract number
* The percent of the population that is White
* The percent of the population that is Black
* The percent of the population that is Hispanic
* The percent of the population that is a minority (non-White)

You can either calculate and filter on the original dataframe, or create a whole new dataframe.

* **Tip:** You'll want to read the data dictionary for the Census data.
* **Tip:** The column names don't exactly match the data dictionary, but you can figure it out.
* **Tip:** If you aren't sure which column is the actual census tract number, you can keep multiples for now.
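A sketch of the percentage calculations on made-up numbers. Every column name here (`total_pop`, `white_non_hisp`, and so on) is a readable placeholder; the real file uses coded names that you will need to look up in the data dictionary. It also assumes "White" means non-Hispanic White, which is how a Hispanic-or-Latino-by-race table would typically split it.

```python
import pandas as pd

# Toy census rows with placeholder column names
census = pd.DataFrame({
    'Geo_FIPS': ['55079000101', '55079000102'],
    'total_pop': [4000, 2500],
    'white_non_hisp': [1000, 2000],
    'black_non_hisp': [2400, 250],
    'hispanic': [400, 125],
})

# Build a new dataframe with one percentage column per group
race = pd.DataFrame({
    'Geo_FIPS': census['Geo_FIPS'],
    'pct_white': census['white_non_hisp'] / census['total_pop'] * 100,
    'pct_black': census['black_non_hisp'] / census['total_pop'] * 100,
    'pct_hispanic': census['hispanic'] / census['total_pop'] * 100,
})

# Minority here is everyone who is not non-Hispanic White
race['pct_minority'] = 100 - race['pct_white']
print(race)
```

Tracts with a population of zero will produce missing or infinite percentages, so check for those before modeling.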

# Combine this census data with your street address data

Create a new dataframe by merging your street addresses with your census data. It would be nice to merge on census tract code, but your address data does **not** include census tracts! **This is where things get a little crazy.**

### Adding census tract codes to addresses

We went ahead and geocoded many addresses in Milwaukee to create `addresses_geocoded.csv`. You'll need to add a new column using QGIS that adds in the address's census tract.

The QGIS command **Join attributes by location** will merge datasets that geographically overlap each other. If you have one layer that's a list of lat/lon points and one layer that's a shapefile with census tract information, **join attributes by location** can create a new layer of lat/lon points that also has census tract information.

You can export the result by right-clicking the layer, selecting **Export As**, and making sure the output is another CSV.

### Merging

Now you'll have three datasets:

* Addresses and pothole fill times
* Census data with census tract codes
* Addresses with census tract codes

You'll need to merge them together to create one whole dataset.
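The chain of merges might look like the sketch below. All three frames are toy data, and the column names (`address`, `wait_days`, `GEOID`, `pct_minority`) are assumptions. Use whatever your own files call the shared columns.

```python
import pandas as pd

# 1. Addresses and pothole fill times
potholes = pd.DataFrame({
    'address': ['100 N MAIN ST', '200 W OAK AV', '300 S ELM ST'],
    'wait_days': [2, 9, 4],
})

# 2. Addresses with census tract codes (the QGIS export)
addresses = pd.DataFrame({
    'address': ['100 N MAIN ST', '200 W OAK AV'],
    'GEOID': ['55079000101', '55079000102'],
})

# 3. Census data with census tract codes
race = pd.DataFrame({
    'GEOID': ['55079000101', '55079000102'],
    'pct_minority': [75.0, 20.0],
})

# Potholes -> tract codes -> census data.
# The inner merge drops potholes whose address was never geocoded.
merged = (potholes
          .merge(addresses, on='address', how='inner')
          .merge(race, on='GEOID', how='left'))
print(merged)
```

Tract codes need to be the same type on both sides (strings here), and it is worth comparing row counts before and after each merge to see how many potholes were lost.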

# Linear regression

Using the `statsmodels` package, run a linear regression to find the coefficient relating percent minority and pothole fill times.

Translate that into the form **"every X percentage point change in the minority population translates to a Y change in pothole fill times"**

Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?
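A minimal sketch of the regression using the `statsmodels` formula interface, on made-up numbers. The column names `wait_days` and `pct_minority` are placeholders for whatever you called them in your merged dataframe.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, not real results
df = pd.DataFrame({
    'pct_minority': [10, 30, 50, 70, 90],
    'wait_days': [2.4, 3.6, 4.4, 5.6, 6.4],
})

# Ordinary least squares: wait_days explained by pct_minority
model = smf.ols('wait_days ~ pct_minority', data=df).fit()

# The coefficient is the change in wait_days per 1 percentage point
slope = model.params['pct_minority']
print(f'Each 1 point increase in minority share: {slope:.3f} more days')

# Scaling up often reads better in a sentence
print(f'Each 10 point increase in minority share: {slope * 10:.2f} more days')
```

On real data, `model.summary()` also shows the p-values and R-squared, which help you judge how much weight the coefficient can bear.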

# Other methods of explanation

While the regression is technically correct, it just doesn't sound very nice. What other options do we have?

## How does the average wait to fill a pothole differ between majority-white and majority-minority census tracts?

You'll need to create a new column to specify whether the census tract is majority White or not.
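One way to build that column and compare the groups, sketched on toy data. It assumes "majority White" means more than 50 percent White, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data, one row per pothole
df = pd.DataFrame({
    'pct_white': [80, 65, 30, 10],
    'wait_days': [2, 4, 5, 9],
})

# Label each row by the makeup of its census tract
df['tract_type'] = np.where(df['pct_white'] > 50,
                            'majority white',
                            'majority minority')

# Average wait within each group
avg_wait = df.groupby('tract_type')['wait_days'].mean()
print(avg_wait)
```

Fill times are often skewed by a few very long waits, so comparing `.median()` alongside `.mean()` is a reasonable extra check.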

## How does the average wait time to fill a pothole change as more minorities live in an area?

* **Tip:** Use `pd.cut` or `pd.qcut` to split the percent minority (or white) into a few different bins.
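A sketch with `pd.cut` and four equal-width bins on toy data. The bin edges and labels are arbitrary choices for illustration; `pd.qcut` would instead pick edges so each bin holds roughly the same number of rows.

```python
import pandas as pd

# Toy data, one row per pothole
df = pd.DataFrame({
    'pct_minority': [5, 20, 45, 60, 85, 95],
    'wait_days': [2, 3, 4, 5, 7, 9],
})

# Split percent minority into four fixed-width bins
df['minority_bin'] = pd.cut(df['pct_minority'],
                            bins=[0, 25, 50, 75, 100],
                            labels=['0-25', '25-50', '50-75', '75-100'],
                            include_lowest=True)

# Average wait in each bin, in bin order
avg_by_bin = df.groupby('minority_bin', observed=True)['wait_days'].mean()
print(avg_by_bin)
```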

# Bonus: Income

`R12216226_SL140.csv` contains income data for each census tract in Wisconsin. Add it into your analysis.

If you run a multivariate regression also including income, how does this change things?
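To see what "changes things" can look like, here is a sketch comparing the simple and multivariate models on made-up numbers. The column `median_income_k` (median household income in thousands of dollars) is a placeholder name, and the toy wait times were constructed from both variables, so the pattern it shows is built in rather than discovered.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: wait_days was constructed as
# 3 + 0.04 * pct_minority - 0.02 * median_income_k
df = pd.DataFrame({
    'pct_minority': [10, 30, 50, 70, 90, 40],
    'median_income_k': [80, 40, 60, 30, 35, 70],
    'wait_days': [1.8, 3.4, 3.8, 5.2, 5.9, 3.2],
})

# Race alone
simple = smf.ols('wait_days ~ pct_minority', data=df).fit()

# Race and income together
multi = smf.ols('wait_days ~ pct_minority + median_income_k', data=df).fit()

print('race only:       ', simple.params['pct_minority'])
print('race with income:', multi.params['pct_minority'])
```

If the race coefficient shrinks once income is included, part of the apparent race gap was tracking income differences between tracts. Whether that happens with the real data is what the bonus question asks you to find out.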