# Python Mapping
This notebook gathers the data for a Python mapping exercise.<br>
The goal is to explore different mapping technologies in Python using NYC Open Data and US Census American Community Survey (ACS) data.

The plan is as follows:

- Create a basic overview of how to create GIS maps in Python
  - Obtain geographic data
  - Obtain census data (census api)
  - Merge DataFrame with GeoDataFrame
  - Static map
  - Static map w/ basemap
  - Choropleth map
  - Choropleth map with Graduated Symbology
  - Interactive map
  - Interactive map with toggle between two different maps
  - Web map deployment?

We will use the <a href="https://pypi.org/project/census/">census package</a> in Python to get:
- Population
- Median Age
- Median Income
- Poverty Level - will develop percent poverty level
- Households with car - will develop percent with vehicle
- Education - will develop percent college degree

We will work with the `state_zipcode()` function within the `census` package.<br>
We can grab the <a href="https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk/about_data">MODZCTA shapefiles</a> from NYC Open Data and build the list of ZCTAs used to pull the census data from the API.<br>
We will rename the Census variables accordingly, develop the percent features and merge with the GeoDataFrame.

Then we can get to mapping!!!

In [18]:
# import packages
import pandas as pd
import numpy as np
import geopandas as gpd
import contextily as ctx
import census
from us import states

# import census api key
from src.config import CENSUS_API

## Load Data
### MODZCTA Data
We need to download the shapefile from NYC Open Data for mapping, and construct a ZCTA list to pass into the Census API.

In [19]:
# load downloaded shapefile
gdf=gpd.read_file("./data/shp/Modified Zip Code Tabulation Areas (MODZCTA)_20240418/geo_export_bdb2fc16-3964-47c7-a04d-4d106b707aaf.shp")
# format column names
gdf.columns = [col.lower() for col in gdf.columns]
# preview
gdf.head()

Unnamed: 0,modzcta,label,zcta,pop_est,geometry
0,10001,"10001, 10118","10001, 10119, 10199",23072.0,"POLYGON ((-73.98774 40.74407, -73.98819 40.743..."
1,10002,10002,10002,74993.0,"POLYGON ((-73.99750 40.71407, -73.99709 40.714..."
2,10003,10003,10003,54682.0,"POLYGON ((-73.98864 40.72293, -73.98876 40.722..."
3,10026,10026,10026,39363.0,"MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ..."
4,10004,10004,10004,3028.0,"MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ..."


Let's grab the list of zip codes for NYC from the shapefile.<br>
We will need these to pass into the Census API.

In [20]:
# split the comma-separated ZCTAs and strip the space left after each comma
zip_list = list(gdf['zcta'].str.split(',').explode().str.strip())
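
Because the `zcta` column packs several codes into one comma-separated string (for example `"10001, 10119, 10199"`), whitespace around each code needs care when splitting. Below is a minimal standard-library sketch of the same explode step that also keeps a ZCTA-to-MODZCTA lookup for merging back later. The helper name `explode_zctas` and the sample rows are illustrative, not part of the notebook.

```python
def explode_zctas(rows):
    """Return (modzcta, zcta) pairs from rows holding comma-separated ZCTAs."""
    pairs = []
    for row in rows:
        for code in row["zcta"].split(","):
            code = code.strip()  # drop the space left after each comma
            if code:             # skip empty pieces such as a trailing comma
                pairs.append((row["modzcta"], code))
    return pairs


# sample rows copied from the preview above (illustrative subset)
sample_rows = [
    {"modzcta": "10001", "zcta": "10001, 10119, 10199"},
    {"modzcta": "10002", "zcta": "10002"},
]

pairs = explode_zctas(sample_rows)
zip_list_clean = [zcta for _, zcta in pairs]           # codes to send to the API
zcta_to_modzcta = {zcta: mod for mod, zcta in pairs}   # lookup for merging back
```

The lookup matters because census data comes back per ZCTA while the shapefile has one row per MODZCTA, so several census rows can belong to a single polygon.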

### Census Data

In [21]:
# set api key
c = census.Census(CENSUS_API)

We will construct a dictionary of variables with the desired column name as the key, and the actual census variable name as the value.<br>
This will allow us to easily rename the columns.

In [41]:
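# dictionary of variables: desired column name -> census variable name
# NOTE: check the B15003 codes against the ACS variable list before reuse;
# in ACS table B15003, _021E is associate's, _022E bachelor's and _023E master's degree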
var_dict = {
  'name': 'NAME',
  'population': 'B01003_001E',
  'median_age': 'B01002_001E',
  'median_household_income': 'B19013_001E',
  'poverty_level': 'B17001_002E',
  'total_households': 'B08201_001E',
  'total_households_no_vehicle': 'B08201_002E',
  'pop_25_older': 'B15003_001E',
  'pop_25_older_hs_grad': 'B15003_017E',
  'pop_25_older_associates': 'B15003_019E',
  'pop_25_older_bachelors': 'B15003_020E',
  'pop_25_older_graduate': 'B15003_021E',
}

# get list of values for api call
variables=list(var_dict.values())

The Census API limits how much can be requested in a single call (at most 50 variables), so we will request one ZCTA at a time and loop through the zip codes to get data for all of them.

In [25]:
# initialize list
all_data = []

# api call
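# one request per ZCTA; `zcta` avoids shadowing the built-in `zip`
for zcta in zip_list:
  data=c.acs5.state_zipcode(fields=variables,
                     state_fips=states.NY.fips,  # the FIPS attribute, not a string literal
                     year=2020,
                     zcta=zcta)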
  all_data.extend(data)
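
The Census API accepts at most 50 variables per request. The 12 variables used here fit in one call, but a longer list would have to be split into batches and the results joined per ZCTA. Here is a rough sketch of the batching step only, using a hypothetical helper `chunk_fields`; merging the batches afterwards is left out.

```python
def chunk_fields(fields, size=50):
    """Split a list of variable names into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [fields[i:i + size] for i in range(0, len(fields), size)]


# made-up variable names, only to exercise the helper
fake_fields = [f"VAR_{i:03d}" for i in range(120)]
batches = chunk_fields(fake_fields)
batch_sizes = [len(batch) for batch in batches]
```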

Let's create a DataFrame from the results.

In [26]:
# create dataframe
df_acs_2020=pd.DataFrame(all_data)
df_acs_2020.head()

Unnamed: 0,NAME,B01003_001E,B01002_001E,B19013_001E,B17001_002E,B08201_001E,B08201_002E,B15003_001E,B15003_017E,B15003_019E,B15003_020E,B15003_021E,zip code tabulation area
0,ZCTA5 10001,25026.0,36.1,96787.0,2798.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0,10001
1,ZCTA5 10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10119
2,ZCTA5 10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10199
3,ZCTA5 10002,74363.0,44.8,35607.0,20257.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0,10002
4,ZCTA5 10003,54671.0,31.9,129981.0,4040.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0,10003


Export the DataFrame to avoid calling the API again in the future.

In [27]:
df_acs_2020.to_pickle("./data/df_acs_2020.pkl")

In [42]:
# import pickled df
df_acs_2020=pd.read_pickle("./data/df_acs_2020.pkl")
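
The export and import cells can be folded into one "load if cached, otherwise build and save" step, so that rerunning the notebook never hits the API by accident. This sketch uses the standard-library `pickle` module; `load_or_build` and its `build` callback are assumed names for illustration.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if the file exists, else build, cache and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = build()  # e.g. the API loop; only runs on a cache miss
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With pandas, the same pattern would use `pd.read_pickle` and `DataFrame.to_pickle` in place of the `pickle` calls.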

In [43]:
df_acs_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   NAME                      214 non-null    object 
 1   B01003_001E               214 non-null    float64
 2   B01002_001E               214 non-null    float64
 3   B19013_001E               214 non-null    float64
 4   B17001_002E               214 non-null    float64
 5   B08201_001E               214 non-null    float64
 6   B08201_002E               214 non-null    float64
 7   B15003_001E               214 non-null    float64
 8   B15003_017E               214 non-null    float64
 9   B15003_019E               214 non-null    float64
 10  B15003_020E               214 non-null    float64
 11  B15003_021E               214 non-null    float64
 12  zip code tabulation area  214 non-null    object 
dtypes: float64(11), object(2)
memory usage: 21.9+ KB


In [44]:
# drop the last column ('zip code tabulation area'); the ZCTA is already in NAME
df_acs_2020=df_acs_2020.iloc[:,:-1]

In [45]:
# rename columns
df_acs_2020.columns = var_dict.keys()
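
Assigning `var_dict.keys()` to the columns works only while the column order matches the dictionary order. A hedged alternative is to invert `var_dict` and rename by Census code. It is shown here on a plain dict shaped like the records the API loop collects. `rename_record` is a made-up helper, and the dictionary is trimmed to three entries for brevity.

```python
# trimmed copy of var_dict: desired name -> Census variable code
var_dict_small = {
    "name": "NAME",
    "population": "B01003_001E",
    "median_age": "B01002_001E",
}

# invert to Census code -> desired name
code_to_name = {code: name for name, code in var_dict_small.items()}


def rename_record(record, mapping):
    """Rename the keys found in `mapping` and drop every other key."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# one record shaped like the API output, with keys in a different order
raw = {
    "B01002_001E": 36.1,
    "NAME": "ZCTA5 10001",
    "zip code tabulation area": "10001",
    "B01003_001E": 25026.0,
}
renamed = rename_record(raw, code_to_name)
```

On a DataFrame, `df.rename(columns=code_to_name)` applies the same mapping by name rather than by position.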

In [46]:
df_acs_2020.head()

Unnamed: 0,name,population,median_age,median_household_income,poverty_level,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
0,ZCTA5 10001,25026.0,36.1,96787.0,2798.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0
1,ZCTA5 10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ZCTA5 10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ZCTA5 10002,74363.0,44.8,35607.0,20257.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0
4,ZCTA5 10003,54671.0,31.9,129981.0,4040.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0


## Feature Creation

We need to create a few features that will make the data easier to visualize.

1) `% poverty level` - divide the number of people below the poverty level by the population
2) `% households with car` - subtract households without a car from total households, then divide the result by total households
3) `% college degree` - sum the counts for an associate's degree and higher, then divide by the total population 25 and older
4) Remove the `ZCTA5 ` prefix from the `name` column, and rename the column to `zcta`
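
The three ratios above can be checked by hand on a single record. This sketch uses plain Python rather than pandas and takes the ZCTA 10001 values from the table above; `safe_ratio` and `add_percent_features` are illustrative names. Returning `None` for a zero denominator stands in for the `NaN` that pandas would produce for empty ZCTAs.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if not denominator:
        return None
    return numerator / denominator


def add_percent_features(rec):
    """Return a copy of `rec` with the three percent features added."""
    out = dict(rec)
    out["perc_poverty_level"] = safe_ratio(rec["poverty_level"], rec["population"])
    out["perc_hh_w_vehicle"] = safe_ratio(
        rec["total_households"] - rec["total_households_no_vehicle"],
        rec["total_households"],
    )
    college = (
        rec["pop_25_older_associates"]
        + rec["pop_25_older_bachelors"]
        + rec["pop_25_older_graduate"]
    )
    out["perc_college_degree"] = safe_ratio(college, rec["pop_25_older"])
    return out


# values for ZCTA 10001 taken from the table above
zcta_10001 = {
    "population": 25026.0,
    "poverty_level": 2798.0,
    "total_households": 13311.0,
    "total_households_no_vehicle": 11290.0,
    "pop_25_older": 19550.0,
    "pop_25_older_associates": 427.0,
    "pop_25_older_bachelors": 1004.0,
    "pop_25_older_graduate": 579.0,
}
features = add_percent_features(zcta_10001)
```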

In [47]:
# create % poverty level
df_acs_2020['perc_poverty_level'] = df_acs_2020['poverty_level'] / df_acs_2020['population']

In [48]:
# create % with car
df_acs_2020['perc_hh_w_vehicle'] = (df_acs_2020['total_households'] - df_acs_2020['total_households_no_vehicle']) / df_acs_2020['total_households']

In [49]:
# create % college degree
df_acs_2020['perc_college_degree'] = \
    (df_acs_2020['pop_25_older_associates'] + df_acs_2020['pop_25_older_bachelors'] + df_acs_2020['pop_25_older_graduate']) / df_acs_2020['pop_25_older']

In [50]:
# remove the 'ZCTA5 ' prefix (6 characters) from name col
df_acs_2020['name']=[name[6:] for name in df_acs_2020['name']]
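
Slicing off the first six characters assumes every value starts with `ZCTA5 `. A slightly more defensive variant, sketched with an assumed helper `strip_zcta_prefix`, removes the prefix only when it is present, so rerunning the cell does not truncate codes that were already cleaned.

```python
def strip_zcta_prefix(name, prefix="ZCTA5 "):
    """Return `name` without the leading prefix; leave other strings untouched."""
    if name.startswith(prefix):
        return name[len(prefix):]
    return name


names = ["ZCTA5 10001", "ZCTA5 10119", "10002"]
cleaned = [strip_zcta_prefix(name) for name in names]
```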

In [51]:
# rename name to zcta
df_acs_2020.rename(columns={'name':'zcta'}, inplace=True)

In [52]:
pd.set_option('display.max_columns',None)
df_acs_2020.head()

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate,perc_poverty_level,perc_hh_w_vehicle,perc_college_degree
0,10001,25026.0,36.1,96787.0,2798.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0,0.111804,0.151829,0.102813
1,10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,
2,10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,
3,10002,74363.0,44.8,35607.0,20257.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0,0.272407,0.158153,0.129772
4,10003,54671.0,31.9,129981.0,4040.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0,0.073897,0.216353,0.092291
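
The `-666666666.0` entries in the preview are not real medians. The Census Bureau uses that value as an annotation code when an estimate cannot be computed, which is typical for ZCTAs with few or no sample observations. One possible cleaning step, sketched on a plain dict record with an assumed helper name `mask_sentinels`, replaces the code with `None` so it cannot distort a map's color scale.

```python
ACS_MISSING = -666666666  # annotation code seen in the median columns above


def mask_sentinels(record, sentinel=ACS_MISSING):
    """Return a copy of `record` with sentinel values replaced by None."""
    return {key: (None if value == sentinel else value) for key, value in record.items()}


# values for ZCTA 10119 taken from the preview above
zcta_10119 = {
    "zcta": "10119",
    "population": 0.0,
    "median_age": -666666666.0,
    "median_household_income": -666666666.0,
}
masked = mask_sentinels(zcta_10119)
```

On a DataFrame, `df.replace(-666666666, np.nan)` would apply the same substitution across all columns.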


Export the cleaned DataFrame for the mapping notebook.

In [53]:
df_acs_2020.to_pickle("./data/df_acs_2020_cleaned.pkl")