# American Streets named after MLK

I used data from [OpenAddresses](https://openaddresses.io). To run the Python files included in this repo, you'll need to download the four US datasets from the OpenAddresses website (Midwest, Northeast, South, and West).

Unzipped, they add up to about 10 GB, so I downloaded one region at a time, ran the corresponding Python file, then deleted the original data before moving on to the next region.

Note: OpenAddresses does not have a complete collection of US addresses. The image below (from their website) shows the areas included. Unfortunately, there is no legend explaining what the different shades of green mean. My best guess is that light green indicates areas with partial coverage.


![](images/data_map.png)

## Replicating data processing

The Python files (located in `py_files`) process the original datasets and create new CSV files. To minimize the changes required, move the downloaded data into the main folder of the repo and name the four data folders "west", "midwest", "northeast", and "south".

## Understanding data processing
### Compiling data files
This dataset is broken up into 1050 CSV files. I use glob to make a list of file names. Later, we will loop through each file and concatenate results into a single dataframe for each region.

![](images/demo1.png)
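For readers who cannot see the screenshot, here is a minimal sketch of how a file list can be built with `glob`. The helper name `list_region_csvs` and the assumption that CSVs may sit in nested subfolders of each region folder are mine, not taken from the repo's scripts.

```python
import glob
import os

def list_region_csvs(region_dir):
    """Return a sorted list of every CSV path found under region_dir."""
    # '**' with recursive=True also descends into nested subfolders
    # (assumed layout, e.g. west/us/ca/some_city.csv)
    pattern = os.path.join(region_dir, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))

# Assumes the "west" folder sits next to this script, as described above
west_files = list_region_csvs("west")
print(len(west_files), "CSV files found")
```

If the folder is missing, `glob` simply returns an empty list, so a count of 0 is a quick hint that the data is not where the script expects it.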

### Filling NaNs (part 1)
Initial exploration showed that many 'REGION' and 'CITY' values are missing from the data. However, this information can often be recovered from folder and file names. One catch: some of the CSV files are named by county rather than by city. After investigating, I found that many of the files named after a city have no entries at all in the 'CITY' column. Because of this, we only use the file name as the city name when every row in that file is missing a 'CITY' value.

![](images/demo2.png)
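As a rough illustration of pulling names out of a path, the sketch below assumes a layout like `west/us/ca/fresno.csv`, where the parent folder is the state and the file name is the city or county. Both the layout and the helper name `names_from_path` are assumptions for illustration; check them against your own download.

```python
from pathlib import Path

def names_from_path(csv_path):
    """Guess (region, city) from a path such as 'west/us/ca/fresno.csv'."""
    path = Path(csv_path)
    region = path.parent.name.upper()      # folder name, e.g. 'ca' -> 'CA'
    city = path.stem.replace("_", " ")     # file name without '.csv'
    return region, city

print(names_from_path("west/us/ca/san_francisco.csv"))
# ('CA', 'san francisco')
```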

### Looping through data files

In the remaining code, you'll see three main steps (we'll get to the 'little' steps next). First, we create a dataframe with the same columns as the original data. Second, we loop through each CSV, extract some info, and add it to the aforementioned dataframe. Third, we save our new dataframe as a CSV. 

![](images/demo3.png)
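The three steps can be sketched roughly as follows. This is not the repo's exact code: the `COLUMNS` list is my assumption about the OpenAddresses CSV layout, `process_region` and its `extract` argument are hypothetical names, and the sketch collects results in a list and concatenates once instead of growing a dataframe inside the loop.

```python
import pandas as pd

# Assumed OpenAddresses column layout; adjust if your files differ
COLUMNS = ["LON", "LAT", "NUMBER", "STREET", "UNIT", "CITY",
           "DISTRICT", "REGION", "POSTCODE", "ID", "HASH"]

def process_region(csv_paths, out_path, extract=lambda chunk: chunk):
    """Read each CSV, keep what `extract` returns, and save one combined CSV."""
    frames = []
    for path in csv_paths:
        chunk = pd.read_csv(path, dtype=str)   # read every column as text
        frames.append(extract(chunk))          # the 'little' steps go here
    if frames:
        combined = pd.concat(frames, ignore_index=True)
    else:
        # No input files: fall back to an empty frame with the expected columns
        combined = pd.DataFrame(columns=COLUMNS)
    combined.to_csv(out_path)
    return combined
```

Passing the per-file logic in as `extract` keeps the loop itself short; the filtering, de-duplication, and NaN filling described next would live inside that function.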

### Now for the little steps...
#### Filter streets
After converting the 'STREET' column to lowercase, we keep only the rows whose street name contains either "luther" or "mlk".

String methods are picky. I chose "luther" instead of "martin luther king" to minimize the cases in which an extra space, a typo, or stray punctuation might cause us to miss a street. This means we will take steps later to eliminate 'Luther' streets that may not be connected to MLK. 
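A minimal sketch of this filter, using a made-up sample frame. The function name `filter_luther_streets` is hypothetical, and `na=False` is my addition so that rows with a missing street name count as non-matches instead of raising an error.

```python
import pandas as pd

def filter_luther_streets(df):
    """Keep rows whose lowercased street name contains 'luther' or 'mlk'."""
    street = df["STREET"].str.lower()
    # 'luther|mlk' is a regex alternation; na=False skips missing names
    mask = street.str.contains("luther|mlk", na=False)
    return df[mask].assign(STREET=street[mask])

sample = pd.DataFrame({"STREET": ["Martin Luther King Jr Blvd", "MLK Dr",
                                  "Main St", "Luther Ln", None]})
print(filter_luther_streets(sample)["STREET"].tolist())
# ['martin luther king jr blvd', 'mlk dr', 'luther ln']
```

Note that "Luther Ln" survives this step, which is exactly the kind of street the later cleanup has to remove.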

#### Drop duplicates
This pandas method is aptly named. Our original data includes multiple addresses for the same streets. Because we are just interested in street names, we can drop all but the first instance of each street name.

*As I was typing this, I discovered a mistake. In line 37, I should have set `subset` equal to `['STREET', 'CITY']`. As it stands, I have potentially dropped "duplicates" where the street name is the same but the city is different. Unfortunately, I have already deleted the data from my computer, so my dataset will remain flawed until I have time to re-download the data and re-run the code. Fortunately, I have fixed it in time for YOUR dataset to be better than mine.*
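To make the difference concrete, here is a small made-up example comparing the two `subset` choices. The city names are invented for illustration only.

```python
import pandas as pd

streets = pd.DataFrame({
    "STREET": ["mlk blvd", "mlk blvd", "mlk blvd"],
    "CITY":   ["Fresno", "Fresno", "Oakland"],
})

# Street name only: the Oakland row is wrongly treated as a duplicate
by_street = streets.drop_duplicates(subset=["STREET"])

# Street name and city: one row survives per city
by_street_and_city = streets.drop_duplicates(subset=["STREET", "CITY"])

print(len(by_street), len(by_street_and_city))
# 1 2
```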

#### Filling NaNs (part 2)
As previously explained, the 'CITY' column is often left blank if the CSV file only contains data from one city. I use an if statement to fill in the 'CITY' column with the file name. 
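The if statement might look something like the sketch below. The helper name `fill_city_from_filename` and the sample frames are hypothetical; the key point is that the file name is used only when the whole 'CITY' column is empty.

```python
import pandas as pd

def fill_city_from_filename(df, file_city):
    """Fill 'CITY' with file_city only when the whole column is empty."""
    if df["CITY"].isna().all():
        df = df.assign(CITY=file_city)
    return df

# Every CITY value missing: the file name is used
empty = pd.DataFrame({"STREET": ["mlk blvd"], "CITY": [None]})
print(fill_city_from_filename(empty, "fresno")["CITY"].tolist())

# Some CITY values present: the file may be a county, so nothing is filled
partial = pd.DataFrame({"STREET": ["a st", "b st"], "CITY": ["Fresno", None]})
print(fill_city_from_filename(partial, "fresno county")["CITY"].isna().sum())
```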

## Let's check out the results

```python
import pandas as pd

# Load each of the 4 regions
west = pd.read_csv('luther_df_west.csv', index_col=0)
northeast = pd.read_csv('luther_df_NE.csv', index_col=0)
midwest = pd.read_csv('luther_df_MW.csv', index_col=0)
south = pd.read_csv('luther_df_south.csv', index_col=0)
```

```python
# Concatenate, and drop columns we don't need
df = pd.concat([west, northeast, midwest, south], axis=0, ignore_index=True)
df.drop(['NUMBER', 'UNIT', 'ID', 'HASH'], axis=1, inplace=True)
df.shape
```

```
(1935, 7)
```

### Excluding other Luthers
We already discussed the possibility of Luther streets that aren't connected to MLK, so now we have to do something about it. If we require the name to include 'king', we will eliminate the streets that are abbreviated to MLK. Below is my approach to resolve that. 

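```python
# Streets that spell out the name, e.g. "martin luther king"
is_king = df.STREET.str.contains('king')
print(df[is_king].shape)

# Streets that use the abbreviation
is_mlk = df.STREET.str.contains('mlk')
print(df[is_mlk].shape)

# Combine the masks with | so a street matching both is kept only once
df = df[is_king | is_mlk].reset_index(drop=True)
```

```
(1112, 7)
(163, 7)
```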


## Visualizing Results
There are a variety of ways we can examine the results, but I chose to create a scatter map. This was a simple process because the 'LAT' and 'LON' columns have no missing values.

If you look at the 'CITY' column, you'll notice it needs a few cleaning steps before we could use it to group streets. It would be interesting to see which city has the most streets named after MLK.
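I did not do this cleanup, but a first pass might look like the sketch below. It assumes the only problems are stray whitespace and inconsistent capitalization, which is almost certainly too optimistic for the real data. The function name `streets_per_city` and the sample rows are made up.

```python
import pandas as pd

def streets_per_city(df):
    """Count distinct street names per (CITY, REGION) after light cleaning."""
    cleaned = df.assign(
        CITY=df["CITY"].str.strip().str.title(),
        REGION=df["REGION"].str.strip().str.upper(),
    ).dropna(subset=["CITY"])
    # Group on city AND region so same-named cities in different states stay apart
    return (cleaned.groupby(["CITY", "REGION"])["STREET"]
                   .nunique()
                   .sort_values(ascending=False))

sample = pd.DataFrame({
    "STREET": ["mlk blvd", "martin luther king dr", "mlk blvd", "mlk ave"],
    "CITY":   [" fresno", "FRESNO", "Fresno ", "oakland"],
    "REGION": ["ca", "CA", "ca", "CA"],
})
print(streets_per_city(sample))
```

Rows with a missing 'REGION' are silently dropped by `groupby`, so a real analysis would want to fill those in first.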

As this was a quick project, I opted to work with the Latitude/Longitude features, which required no extra cleaning. 

```python
import plotly.graph_objects as go

fig = go.Figure(data=go.Scattergeo(
    lon=df['LON'],
    lat=df['LAT'],
    text=df['CITY'],
    mode='markers',
))

fig.update_layout(
    title="A (Limited) Glimpse of American Streets named after MLK",
    geo_scope='usa',
)
fig.show()
```