# Population density in King County, WA per zipcode

In the main notebook **EDA.ipynb** we came up with a very simple definition of "central neighbourhood": we just used a list of Seattle zip codes found on the internet. Here, we explore a more advanced approach that uses data on the population density of each zip code area. Furthermore, we use the _plotly_ library (in particular the choropleth_mapbox function) to create a visualisation.

## Data about population density in Washington State

We can find data about the population density for each zipcode in the USA at: <br>
[http://zipatlas.com/us/zip-code-comparison/population-density.htm](http://zipatlas.com/us/zip-code-comparison/population-density.htm) <br>
<br>
As we are only interested in data from King County, we can select Washington state: <br>
[http://zipatlas.com/us/wa/zip-code-comparison/population-density.htm](http://zipatlas.com/us/wa/zip-code-comparison/population-density.htm)

## Data about zip codes in Washington state

There is a GitHub repository that provides geographical data, i.e. the zip code boundaries for each of the 50 states: <br>
[https://github.com/OpenDataDE/State-zip-code-GeoJSON](https://github.com/OpenDataDE/State-zip-code-GeoJSON) <br>
<br>
For Washington state, the url is: <br>
[https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/wa_washington_zip_codes_geo.min.json](https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/wa_washington_zip_codes_geo.min.json)

--------------------

## Data analysis

To begin with, we import the required modules and take a look at the population density data:

In [None]:
import pandas as pd
import plotly.express as px
import json

df = pd.read_csv('data/washington_pop-density_by_zipcode.csv')
df

As you can see, I removed some columns from the original website dataset that are not necessary for our purposes and split both the lat/lon values and the city/state values into separate columns.
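
The preprocessing itself is not part of this notebook. A minimal sketch of how such a split could be done with pandas is given below. The raw column names `location` and `city_state` and all values are assumptions for illustration and may differ from the actual website export:

```python
import pandas as pd

# Toy rows with placeholder values; column names are assumed, not taken from the website.
raw = pd.DataFrame({
    "zip_code": ["00001", "00002"],
    "location": ["47.6, -122.3", "47.2, -122.4"],
    "city_state": ["Seattle, Washington", "Tacoma, Washington"],
})

# Split "lat, lon" into two numeric columns.
lat_lon = raw["location"].str.split(",", expand=True)
raw["lat"] = lat_lon[0].str.strip().astype(float)
raw["lon"] = lat_lon[1].str.strip().astype(float)

# Split "city, state" into two text columns.
city_state = raw["city_state"].str.split(",", n=1, expand=True)
raw["city"] = city_state[0].str.strip()
raw["state"] = city_state[1].str.strip()

# Drop the combined columns that are no longer needed.
tidy = raw.drop(columns=["location", "city_state"])
print(tidy)
```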

In [None]:
df.info()

In [None]:
df.describe()

Now, let's get some insight into the zip code boundary data:

In [None]:
# Open the GeoJSON file and parse it into a Python object;
# the with-block closes the file automatically
with open('data/wa_washington_zip_codes_geo.min.json') as f:
    data = json.load(f)

In [None]:
type(data)

In [None]:
data.keys()

The zip code boundary data is stored in a nested dictionary. The first key, 'type', has the value 'FeatureCollection'. The second key, 'features', holds a list with 598 entries in total. These entries carry the actual information and are themselves nested dictionaries and lists.

In [None]:
data['type']

In [None]:
data['features']

In [None]:
len(data['features'])

Each zip code area has its own _properties_ dictionary by which we can address it. For the first area it is:

In [None]:
print(data["features"][0]["properties"])

As our population data provides zip codes, we will use them to link the two datasets. Within the "properties" dictionary, the zip code is stored under the key **'ZCTA5CE10'**. For the first zip code area it has the value '98822'. <br>
So, the dictionary keys stay fixed, but the values change from one zip code area to the next.

In [None]:
print(data["features"][1]["properties"])
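
To make the nesting more tangible, here is a small sketch on a toy GeoJSON object. It has the same layout as the real file, but the geometry is made up and only the 'ZCTA5CE10' property is included. It shows how the zip codes of all features could be collected in one pass:

```python
import json

# A tiny stand-in for the real file: same nesting, made-up geometry.
toy_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"ZCTA5CE10": "98822"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
    {"type": "Feature",
     "properties": {"ZCTA5CE10": "98101"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 1], [2, 1], [2, 2], [1, 1]]]}}
  ]
}
""")

# Collect the zip code of every feature via the shared 'ZCTA5CE10' key.
geo_zip_codes = [
    feature["properties"]["ZCTA5CE10"] for feature in toy_geojson["features"]
]
print(geo_zip_codes)
```

On the real data, the same list comprehension with `data` in place of `toy_geojson` should yield one zip code per feature.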

## Visualisation

Now we can map the combined datasets using _plotly choropleth mapbox_. Documentation of the parameters can be found here: [https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth_mapbox.html](https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth_mapbox.html). Note that we assign our population density dataset to *data_frame* and the dataset containing the geographical information about the zip code areas (i.e. the polygons) to the parameter *geojson*. <br>
To connect the two via zip code, the parameters *featureidkey* and *locations* are important: *featureidkey* points to the zip code inside each GeoJSON feature, and *locations* names the zip code column of the data frame.

In [None]:
fig = px.choropleth_mapbox(data_frame=df.iloc[1:], 
                            geojson=data,
                            featureidkey="properties.ZCTA5CE10",
                            locations="zip_code",
                            color="people_per_square-mile",
                            hover_name="zip_code", 
                            hover_data=["city", "lat", "lon", "population", "people_per_square-mile"],                            
                            center={"lat": 47.604569, "lon": -122.335359},
                            mapbox_style="open-street-map", 
                            zoom=9)

fig.update_layout(title="Population density per square mile",
                    legend_title="People per square mile", 
                    margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
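
If areas stay blank on the map, it can help to check whether every zip code in the data frame actually has a matching polygon. The helper below is a hypothetical sketch, not part of plotly. It compares both sides as strings, since the GeoJSON stores zip codes as text while a CSV column may have been read as integers:

```python
def unmatched_zip_codes(zip_codes, geojson, property_key="ZCTA5CE10"):
    """Return the zip codes that have no feature with a matching property value."""
    known = {str(f["properties"][property_key]) for f in geojson["features"]}
    return sorted({str(z) for z in zip_codes} - known)


# Toy GeoJSON with null geometries; only the properties matter for this check.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ZCTA5CE10": "98822"}, "geometry": None},
        {"type": "Feature", "properties": {"ZCTA5CE10": "98101"}, "geometry": None},
    ],
}

# 99999 is a placeholder zip code without a polygon in the toy data.
missing = unmatched_zip_codes([98101, 98822, 99999], toy_geojson)
print(missing)
```

With the real data, the call would be `unmatched_zip_codes(df["zip_code"], data)`. If the column turns out to hold integers, casting it with `df["zip_code"].astype(str)` before plotting is one way to rule out a type mismatch.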

To work towards a proper metric for "central neighbourhood", let's have a look at the distribution of the density and plot a histogram:

In [None]:
fig2 = px.histogram(df, x="people_per_square-mile")
fig2.show()


As we can see, we have an outlier at 50k. So let's remove it from our figure (in fact, that is what has been done in the first plot as well).

In [None]:
fig3 = px.histogram(df.iloc[1:], x="people_per_square-mile")
fig3.show()
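
Note that `df.iloc[1:]` only removes the outlier if it sits in the first row. A position-independent alternative, sketched here on toy data with placeholder zip codes and made-up densities, is to drop the row that holds the maximum value:

```python
import pandas as pd

# Toy frame; the outlier is deliberately not in the first row.
toy = pd.DataFrame({
    "zip_code": ["A", "B", "C"],
    "people_per_square-mile": [120.0, 50000.0, 80.0],
})

# Drop the row with the maximum density, wherever it sits in the frame.
col = "people_per_square-mile"
without_outlier = toy.drop(index=toy[col].idxmax())
print(without_outlier)
```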


Now we can define a threshold for being central in terms of population density using percentiles. Note that for the task in the main Python notebook we only consider King County, so we still need to apply a filter, as the data in this notebook covers the whole of Washington state.

In [None]:
p75, p80, p85, p90, p95 = df['people_per_square-mile'].quantile([0.75, 0.80, 0.85, 0.90, 0.95])

In [None]:
print(p75, p80, p85, p90, p95)
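
The King County filter and the resulting "central" flag could look roughly like the sketch below. The zip codes, the densities and the list `king_county_zips` are placeholders. The choice of the 75th percentile is only an example, not a recommendation:

```python
import pandas as pd

# Toy data with placeholder zip codes and made-up densities.
toy = pd.DataFrame({
    "zip_code": ["A", "B", "C", "D", "E"],
    "people_per_square-mile": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Hypothetical list of King County zip codes; "E" lies outside the county.
king_county_zips = ["A", "B", "C", "D"]

# Keep only King County rows, then use a percentile as the threshold.
king = toy[toy["zip_code"].isin(king_county_zips)].copy()
threshold = king["people_per_square-mile"].quantile(0.75)

# Flag zip codes at or above the threshold as "central".
king["is_central"] = king["people_per_square-mile"] >= threshold
central_zips = king.loc[king["is_central"], "zip_code"].tolist()
print(threshold, central_zips)
```

Computing the percentile after filtering means the threshold reflects only King County rather than the whole state.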