# Data Gathering
This notebook gathers the data for a Python mapping exercise.<br>
The idea is to explore different mapping technologies in Python using <b>NYC Open Data</b> and <b>US Census American Community Survey</b> data.

We will use the <a href="https://pypi.org/project/census/">census package</a> in Python to get:
- Population
- Median Age
- Median Income
- Poverty Level - will develop percent poverty level
- Households with car - will develop percent with vehicle
- Education - will develop percent college degree
- Racial - will develop percent race

We will work with the `state_zipcode()` function within the `census` package.<br>
We can grab <a href="https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk/about_data">MODZCTA shapefiles</a> from NYC Open Data and list the ZCTAs to pull the census data with the API.<br>
We will rename the Census variables accordingly, develop the percent features and merge with the GeoDataFrame.

Then we can get to mapping!!!

In [2]:
# import packages
import pandas as pd
import geopandas as gpd
import logging
import census
from us import states

# import census api key
from src.config import CENSUS_API

## Load Data
### MODZCTA Data
We need to download the shapefile from NYC Open Data for mapping, and build a ZCTA list to pass to the Census API.

In [3]:
# load downloaded shapefile
# can write code to programmatically download this after, not important right now
gdf=gpd.read_file("./data/shp/Modified Zip Code Tabulation Areas (MODZCTA)_20240418/geo_export_bdb2fc16-3964-47c7-a04d-4d106b707aaf.shp")
# format column names
gdf.columns = [col.lower() for col in gdf.columns]
# preview
gdf.head()

Unnamed: 0,modzcta,label,zcta,pop_est,geometry
0,10001,"10001, 10118","10001, 10119, 10199",23072.0,"POLYGON ((-73.98774 40.74407, -73.98819 40.743..."
1,10002,10002,10002,74993.0,"POLYGON ((-73.99750 40.71407, -73.99709 40.714..."
2,10003,10003,10003,54682.0,"POLYGON ((-73.98864 40.72293, -73.98876 40.722..."
3,10026,10026,10026,39363.0,"MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ..."
4,10004,10004,10004,3028.0,"MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ..."


In [4]:
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   modzcta   178 non-null    object  
 1   label     177 non-null    object  
 2   zcta      178 non-null    object  
 3   pop_est   178 non-null    float64 
 4   geometry  178 non-null    geometry
dtypes: float64(1), geometry(1), object(3)
memory usage: 7.1+ KB


Let's grab the list of zip codes for NYC from the shapefile.<br>
We will need these to pass into the census api.

In [5]:
# create zip_list
zip_list=list(gdf['zcta'].str.split(',').explode())
# remove leading spaces
zip_list=[z.replace(' ','') for z in zip_list]
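
A plain-Python sketch of the same parsing, handy for checking the logic without the shapefile. The sample strings mimic the `zcta` column shown above; `parse_zcta_list` is a hypothetical helper name, and de-duplicating is an added assumption (the cell above keeps any duplicates).

```python
def parse_zcta_list(zcta_strings):
    """Split comma-separated ZCTA strings into a flat, de-duplicated list."""
    seen = set()
    result = []
    for raw in zcta_strings:
        for part in raw.split(','):
            code = part.strip()  # remove surrounding spaces
            if code and code not in seen:
                seen.add(code)
                result.append(code)
    return result

# sample values shaped like the `zcta` column
sample = ["10001, 10119, 10199", "10002", "10003"]
zip_list_sample = parse_zcta_list(sample)
print(zip_list_sample)
```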

### Census Data

In [6]:
# set api key
c = census.Census(CENSUS_API)

We will construct a dictionary of variables with the desired column name as the key, and the actual census variable name as the value.<br>
This will allow us to easily rename the columns.

In [7]:
# dictionary of variables
var_dict = {
    'name': 'NAME',
    'population': 'B03002_001E',
    'median_age': 'B01002_001E',
    'median_household_income': 'B19013_001E',
    'poverty_level': 'B17001_002E',
    'white': 'B03002_003E',
    'black': 'B03002_004E',
    'american_indian_alaskan': 'B03002_005E',
    'asian': 'B03002_006E',
    'nhpi': 'B03002_007E',
    'other': 'B03002_008E',
    'two_or_more': 'B03002_009E',
    'hispanic': 'B03002_012E',
    'total_households': 'B08201_001E',
    'total_households_no_vehicle': 'B08201_002E',
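    'pop_25_older': 'B15003_001E',
    'pop_25_older_hs_grad': 'B15003_017E',
    # NOTE: double-check the three codes below against the ACS B15003 table shell.
    # _019E and _020E appear to be "some college" categories and _021E associate's,
    # with bachelor's at _022E and graduate degrees at _023E-_025E.
    'pop_25_older_associates': 'B15003_019E',
    'pop_25_older_bachelors': 'B15003_020E',
    'pop_25_older_graduate': 'B15003_021E',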
}

# get list of values for api call
variables=list(var_dict.values())

# define other search variables
# state fips
ny_fips=states.NY.fips
# year
year = 2020

The census api can only call 50 at a time, so we will loop through the zip codes to get data for all.

In [8]:
# # initialize list
# all_data = []

# # api call
# try:
#   for zip_code in zip_list:
#     data=c.acs5.state_zipcode(fields=variables,
#                       state_fips=ny_fips,
#                       year=year,
#                       zcta=zip_code)
#     all_data.extend(data)
# except KeyError as e:
#   logging.error(f"KeyError for ZIP code {zip_code}: {e}")
# except Exception as e:
#   logging.error(f"Unexpected error for ZIP code {zip_code}: {e}")
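
Note that the `try` above wraps the whole loop, so the first failing zip code ends the run. A minimal sketch of per-zip error handling follows. `fetch_one` and `fake_fetch` are hypothetical stand-ins for the `c.acs5.state_zipcode(...)` call, so the example runs without an API key.

```python
import logging

def fetch_all(zip_codes, fetch_one):
    """Call fetch_one for each zip code, logging failures instead of stopping."""
    all_data, failed = [], []
    for zip_code in zip_codes:
        try:
            all_data.extend(fetch_one(zip_code))
        except Exception as e:
            logging.error("Error for ZIP code %s: %s", zip_code, e)
            failed.append(zip_code)
    return all_data, failed

# hypothetical fetcher that fails for one zip code
def fake_fetch(zip_code):
    if zip_code == "00000":
        raise KeyError("no data")
    return [{"NAME": f"ZCTA5 {zip_code}", "B03002_001E": 1.0}]

rows, failed = fetch_all(["10001", "00000", "10002"], fake_fetch)
print(len(rows), failed)
```

Keeping the list of failed zip codes makes it possible to retry only those instead of repeating the whole run.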

Let's create a dataframe.

In [9]:
# # create dataframe
# df_acs_2020=pd.DataFrame(all_data)
# df_acs_2020.head()

Export the df to avoid calling the API in the future.

In [10]:
# df_acs_2020.to_pickle("./data/df_acs_2020.pkl")

## START ITERATING OVER CODE HERE, DON'T RUN API OVER AND OVER

In [11]:
# import pickled df
df_acs_2020=pd.read_pickle("./data/df_acs_2020.pkl")

In [12]:
df_acs_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   NAME                      214 non-null    object 
 1   B03002_001E               214 non-null    float64
 2   B01002_001E               214 non-null    float64
 3   B19013_001E               214 non-null    float64
 4   B17001_002E               214 non-null    float64
 5   B03002_003E               214 non-null    float64
 6   B03002_004E               214 non-null    float64
 7   B03002_005E               214 non-null    float64
 8   B03002_006E               214 non-null    float64
 9   B03002_007E               214 non-null    float64
 10  B03002_008E               214 non-null    float64
 11  B03002_009E               214 non-null    float64
 12  B03002_012E               214 non-null    float64
 13  B08201_001E               214 non-null    float64
 14  B08201_002

In [13]:
# remove last col
df_acs_2020=df_acs_2020.iloc[:,:-1]

In [14]:
# rename columns
df_acs_2020.columns = var_dict.keys()
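
Assigning `var_dict.keys()` positionally only works if the columns come back in exactly the dictionary's order. A sketch of a name-based alternative, inverting the dictionary and using `DataFrame.rename`; the tiny frame and the shortened `var_dict_small` are made up for illustration.

```python
import pandas as pd

var_dict_small = {
    'name': 'NAME',
    'population': 'B03002_001E',
    'median_age': 'B01002_001E',
}
# invert the mapping: census code -> friendly name
code_to_name = {code: name for name, code in var_dict_small.items()}

# columns deliberately out of order, plus one extra column
df = pd.DataFrame({
    'B01002_001E': [36.1],
    'NAME': ['ZCTA5 10001'],
    'B03002_001E': [25026.0],
    'extra': ['x'],
})
# keep only the requested codes, then rename by name rather than position
df_renamed = df[list(code_to_name)].rename(columns=code_to_name)
print(list(df_renamed.columns))
```

Selecting by code first also drops any extra column the API returns, without relying on it being last.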

In [15]:
df_acs_2020.head()

Unnamed: 0,name,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
0,ZCTA5 10001,25026.0,36.1,96787.0,2798.0,13641.0,1536.0,11.0,5201.0,63.0,107.0,542.0,3925.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0
1,ZCTA5 10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ZCTA5 10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ZCTA5 10002,74363.0,44.8,35607.0,20257.0,16476.0,5776.0,375.0,31011.0,0.0,388.0,1182.0,19155.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0
4,ZCTA5 10003,54671.0,31.9,129981.0,4040.0,37168.0,2738.0,38.0,8238.0,69.0,146.0,1542.0,4732.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0


## Data Cleaning
1) Remove the `ZCTA5` prefix from the `name` column
2) Rename `name` to `zcta`


In [16]:
# remove ZCTA from name col
df_acs_2020['name']=[name[6:] for name in df_acs_2020['name']]

In [17]:
# rename name to zcta
df_acs_2020.rename(columns={'name':'zcta'}, inplace=True)

In [18]:
df_acs_2020.head()

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
0,10001,25026.0,36.1,96787.0,2798.0,13641.0,1536.0,11.0,5201.0,63.0,107.0,542.0,3925.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0
1,10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,10002,74363.0,44.8,35607.0,20257.0,16476.0,5776.0,375.0,31011.0,0.0,388.0,1182.0,19155.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0
4,10003,54671.0,31.9,129981.0,4040.0,37168.0,2738.0,38.0,8238.0,69.0,146.0,1542.0,4732.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0


Why do we have some -666666666.0 values?<br>
Does this have something to do with the secondary zip code set?<br>
Let's extract them and find out.

In [19]:
# function to extract secondary zip
def extract_secondary_zip(zip_codes):
  zip_list = zip_codes.strip().split(',')
  return [zip_code.strip() for zip_code in zip_list[1:]]

# apply function
gdf['secondary_zip']=gdf['zcta'].apply(extract_secondary_zip)

# set of zips that are secondary
secondary_zip_set = set(zip_code for zip_codes in gdf['secondary_zip'] for zip_code in zip_codes)

Let's check to see how the `secondary_zip` column handles those that do not have secondary zips.

In [20]:
gdf[gdf['modzcta'] == '10002']

Unnamed: 0,modzcta,label,zcta,pop_est,geometry,secondary_zip
1,10002,10002,10002,74993.0,"POLYGON ((-73.99750 40.71407, -73.99709 40.714...",[]


Well, it just creates a blank list. I guess that is fine?

Let's try joining `df_acs_2020` to `gdf`. We will have to do a few things first:

- If there is a record in `secondary_zip` column, then copy the row and create a new one
- Iterate over each `secondary_zip` if there is more than one, so that each gets its own row with all the same info

In [21]:
new_records = []  # List to store new records

for idx, row in gdf.iterrows():
    primary_zip = row['zcta'].split(',')[0].strip()  # Extract primary zip code
    secondary_zips = row['secondary_zip']
    
    # If there are secondary zip codes, create new records
    if secondary_zips:
        for sec_zip in secondary_zips:
            # Create a copy of the row and update the 'zcta' column with the secondary zip
            new_row = row.copy()
            new_row['zcta'] = sec_zip.strip()
            new_records.append(new_row)

# Create a new DataFrame from the list of new records
new_df = pd.DataFrame(new_records)
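
The same expansion can be sketched with `str.split` and `explode`, which produces one row per ZCTA (primary and secondary alike) without a Python loop. The small frame below is invented sample data, not the shapefile.

```python
import pandas as pd

df = pd.DataFrame({
    'modzcta': ['10001', '10002'],
    'zcta': ['10001, 10119, 10199', '10002'],
    'pop_est': [23072.0, 74993.0],
})
# turn each comma-separated string into a list, then one row per list element
expanded = (
    df.assign(zcta=df['zcta'].str.split(','))
      .explode('zcta', ignore_index=True)
)
# remove the leading spaces left over from the split
expanded['zcta'] = expanded['zcta'].str.strip()
print(expanded[['modzcta', 'zcta']])
```

On a GeoDataFrame, check how your geopandas version treats `explode` on a non-geometry column before relying on this.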

In [22]:
# check row numbers
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   modzcta        178 non-null    object  
 1   label          177 non-null    object  
 2   zcta           178 non-null    object  
 3   pop_est        178 non-null    float64 
 4   geometry       178 non-null    geometry
 5   secondary_zip  178 non-null    object  
dtypes: float64(1), geometry(1), object(4)
memory usage: 8.5+ KB


In [23]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, 0 to 173
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   modzcta        37 non-null     object 
 1   label          37 non-null     object 
 2   zcta           37 non-null     object 
 3   pop_est        37 non-null     float64
 4   geometry       37 non-null     object 
 5   secondary_zip  37 non-null     object 
dtypes: float64(1), object(5)
memory usage: 2.0+ KB


Ok, now let's concatenate it to the `gdf` and remove the secondary zips from the original `zcta` records

In [24]:
# concatenate gdf with secondary zip code df
gdf_full=pd.concat([gdf,new_df],ignore_index=True)
# drop anything after primary in zcta column
gdf_full['zcta'] = gdf_full['zcta'].str.split(',').str[0].str.strip()
# drop secondary zip col
gdf_full.drop(columns='secondary_zip',inplace=True)


In [25]:
gdf_full.head()

Unnamed: 0,modzcta,label,zcta,pop_est,geometry
0,10001,"10001, 10118",10001,23072.0,"POLYGON ((-73.98774 40.74407, -73.98819 40.743..."
1,10002,10002,10002,74993.0,"POLYGON ((-73.99750 40.71407, -73.99709 40.714..."
2,10003,10003,10003,54682.0,"POLYGON ((-73.98864 40.72293, -73.98876 40.722..."
3,10026,10026,10026,39363.0,"MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ..."
4,10004,10004,10004,3028.0,"MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ..."


In [26]:
gdf_full.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   modzcta   215 non-null    object  
 1   label     214 non-null    object  
 2   zcta      215 non-null    object  
 3   pop_est   215 non-null    float64 
 4   geometry  215 non-null    geometry
dtypes: float64(1), geometry(1), object(3)
memory usage: 8.5+ KB


`gdf_full` has 215 records (or zip codes) and `df_acs_2020` has 214. That's better than before, but I wonder what the one extra is.<br>
There is one without a label, let's check that one out.

In [27]:
gdf_full[gdf_full['label'].isna()]

Unnamed: 0,modzcta,label,zcta,pop_est,geometry
177,99999,,99999,0.0,"MULTIPOLYGON (((-74.21417 40.55659, -74.21409 ..."


Ah yes, the old 99999 zip code, how does this even exist? Does it exist in the original `gdf`?

In [28]:
gdf[gdf['modzcta'] == '99999']

Unnamed: 0,modzcta,label,zcta,pop_est,geometry,secondary_zip
177,99999,,99999,0.0,"MULTIPOLYGON (((-74.21417 40.55659, -74.21409 ...",[]


What about in the `df_acs_2020`?

In [29]:
df_acs_2020[df_acs_2020['zcta'] == '99999']

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
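
A set difference answers the same question in one step. A sketch with made-up ZCTA lists standing in for `gdf_full['zcta']` and `df_acs_2020['zcta']`:

```python
# invented stand-ins for the two zcta columns
gdf_zctas = ['10001', '10002', '10119', '99999']
acs_zctas = ['10001', '10002', '10119']

# codes present on one side only
only_in_gdf = sorted(set(gdf_zctas) - set(acs_zctas))
only_in_acs = sorted(set(acs_zctas) - set(gdf_zctas))
print(only_in_gdf, only_in_acs)
```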


Ok, it's only in the `gdf`, so I am going to drop it.

In [30]:
# drop 99999 from gdf_full
gdf_full=gdf_full[gdf_full['modzcta'] != '99999']

In [31]:
# check info
gdf_full.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 214 entries, 0 to 214
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   modzcta   214 non-null    object  
 1   label     214 non-null    object  
 2   zcta      214 non-null    object  
 3   pop_est   214 non-null    float64 
 4   geometry  214 non-null    geometry
dtypes: float64(1), geometry(1), object(3)
memory usage: 10.0+ KB



Ok good, now we have an expanded `gdf_full` (one row per zip code, 214 in total) that we can join `df_acs_2020` to, and then group by `modzcta`.<br>
This should allow us to account for all zip codes and aggregate values accordingly.

In [32]:
gdf_full.merge(df_acs_2020, on='zcta', how='left')

Unnamed: 0,modzcta,label,zcta,pop_est,geometry,population,median_age,median_household_income,poverty_level,white,...,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
0,10001,"10001, 10118",10001,23072.0,"POLYGON ((-73.98774 40.74407, -73.98819 40.743...",25026.0,36.1,96787.0,2798.0,13641.0,...,107.0,542.0,3925.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0
1,10002,10002,10002,74993.0,"POLYGON ((-73.99750 40.71407, -73.99709 40.714...",74363.0,44.8,35607.0,20257.0,16476.0,...,388.0,1182.0,19155.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0
2,10003,10003,10003,54682.0,"POLYGON ((-73.98864 40.72293, -73.98876 40.722...",54671.0,31.9,129981.0,4040.0,37168.0,...,146.0,1542.0,4732.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0
3,10026,10026,10026,39363.0,"MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ...",38937.0,35.4,64716.0,7921.0,8126.0,...,95.0,1439.0,7607.0,15362.0,11413.0,27757.0,4583.0,681.0,2989.0,1137.0
4,10004,10004,10004,3028.0,"MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...",3310.0,38.4,204949.0,93.0,2214.0,...,13.0,104.0,193.0,1822.0,1347.0,2899.0,55.0,2.0,77.0,22.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,11369,11369,11371,34118.0,"POLYGON ((-73.88258 40.75585, -73.88296 40.757...",0.0,-666666666.0,-666666666.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
210,11411,11411,11411,20930.0,"POLYGON ((-73.73222 40.68512, -73.73309 40.685...",20473.0,45.1,104269.0,768.0,370.0,...,146.0,202.0,1328.0,6183.0,732.0,15295.0,3272.0,958.0,2660.0,1543.0
211,11429,11429,11429,31780.0,"MULTIPOLYGON (((-73.71050 40.72723, -73.71051 ...",27808.0,40.9,82532.0,2448.0,623.0,...,1398.0,1220.0,3990.0,8006.0,1737.0,19941.0,3830.0,793.0,3295.0,1824.0
212,11433,11433,11451,36489.0,"POLYGON ((-73.79437 40.68691, -73.79478 40.687...",0.0,-666666666.0,-666666666.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
df_acs_2020[df_acs_2020['zcta'] == '11451']

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
208,11451,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
# view zip codes in secondary zip set
df_acs_2020[df_acs_2020['zcta'].isin(secondary_zip_set)]

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
1,10119,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,10199,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,10271,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,10278,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,10279,96.0,-666666666.0,-666666666.0,0.0,96.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,40.0,96.0,0.0,0.0,56.0,0.0
21,10165,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,10167,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,10168,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24,10169,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,10170,0.0,-666666666.0,-666666666.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


If we isolate the zip codes that were combined for each MODZCTA, we can see many of them have 0.0 population and bizarre negative values for other Census metrics. Let's start by dropping any records with zero population, which removes most of the -666666666.0 values.

In [35]:
# drop zip codes with zero population
df_acs_2020=df_acs_2020[df_acs_2020['population'] != 0]

Check secondary zip code list after dropping 0 population

In [36]:
df_acs_2020[df_acs_2020['zcta'].isin(secondary_zip_set)]

Unnamed: 0,zcta,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
12,10279,96.0,-666666666.0,-666666666.0,0.0,96.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,40.0,96.0,0.0,0.0,56.0,0.0
68,10162,1240.0,40.1,96555.0,0.0,867.0,0.0,0.0,45.0,0.0,0.0,33.0,295.0,622.0,405.0,904.0,230.0,0.0,0.0,56.0
85,10314,89938.0,41.7,90306.0,8109.0,55052.0,2925.0,141.0,16228.0,28.0,58.0,1832.0,13674.0,31363.0,3502.0,63621.0,17385.0,3250.0,7453.0,3988.0
97,11005,2249.0,85.1,75742.0,202.0,2132.0,0.0,0.0,112.0,0.0,0.0,0.0,5.0,1609.0,447.0,2249.0,300.0,59.0,425.0,0.0
98,11040,41523.0,43.9,132767.0,1359.0,17481.0,340.0,74.0,16369.0,70.0,554.0,1207.0,5428.0,13078.0,634.0,29547.0,5802.0,869.0,2846.0,1885.0
146,11424,40.0,42.5,-666666666.0,0.0,13.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,13.0,0.0,0.0,0.0
162,11357,40118.0,46.8,82858.0,2343.0,22430.0,179.0,0.0,9894.0,7.0,168.0,485.0,6955.0,14437.0,1542.0,29968.0,6680.0,1462.0,3232.0,2039.0
164,11360,18892.0,50.9,84356.0,1015.0,10335.0,164.0,15.0,6125.0,0.0,18.0,167.0,2068.0,8293.0,1368.0,14856.0,2666.0,689.0,1469.0,1016.0
189,11411,20473.0,45.1,104269.0,768.0,370.0,18332.0,14.0,81.0,0.0,146.0,202.0,1328.0,6183.0,732.0,15295.0,3272.0,958.0,2660.0,1543.0
205,11429,27808.0,40.9,82532.0,2448.0,623.0,19337.0,156.0,1057.0,27.0,1398.0,1220.0,3990.0,8006.0,1737.0,19941.0,3830.0,793.0,3295.0,1824.0


Ok, still two with -666666666.0 in `median_household_income`, let's drop those records as well.

In [37]:
# drop -666666666.0 in median_household_income
df_acs_2020=df_acs_2020[df_acs_2020['median_household_income'] != -666666666.0]
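
An alternative to dropping rows is to treat -666666666 as missing. In ACS data this value is a placeholder the Census Bureau uses when an estimate cannot be computed (for example, too few sample observations), not a real measurement. A sketch on invented rows, where `ACS_MISSING` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

ACS_MISSING = -666666666  # ACS placeholder for "estimate not available"

df = pd.DataFrame({
    'zcta': ['10001', '10279', '11424'],
    'population': [25026.0, 96.0, 40.0],
    'median_age': [36.1, -666666666.0, 42.5],
    'median_household_income': [96787.0, -666666666.0, -666666666.0],
})
# swap the placeholder for NaN so pandas statistics skip it
cleaned = df.replace(ACS_MISSING, np.nan)
print(cleaned.isna().sum())
```

This keeps the valid counts for small ZCTAs such as 10279 while still excluding the placeholder from medians and means.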

Ok, after dropping, let's join.

In [38]:
gdf_full_merge=gdf_full.merge(df_acs_2020, on='zcta', how='left')
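
To see which rows found no ACS match in a left merge, `merge` accepts `indicator=True`, which adds a `_merge` column. A sketch on invented frames:

```python
import pandas as pd

left = pd.DataFrame({'zcta': ['10001', '10119', '10002'],
                     'modzcta': ['10001', '10001', '10002']})
right = pd.DataFrame({'zcta': ['10001', '10002'],
                      'population': [25026.0, 74363.0]})

# _merge is 'both' for matched rows and 'left_only' for unmatched ones
merged = left.merge(right, on='zcta', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'zcta'].tolist()
print(unmatched)
```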

Let's check out descriptive statistics to verify.

In [39]:
gdf_full_merge.describe()

Unnamed: 0,pop_est,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,asian,nhpi,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
count,214.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0,182.0
mean,44482.116822,46626.027473,38.501648,81860.368132,7844.945055,14914.214286,9981.142857,77.362637,6652.725275,15.285714,431.824176,1142.901099,13410.571429,17713.56044,9616.763736,33013.642857,6543.615385,1174.884615,3329.043956,2134.631868
std,26385.529318,26418.183687,5.799471,38458.586349,7451.362623,14351.505861,14664.566406,108.683383,8540.053524,32.617077,850.350101,837.436414,14544.502941,9858.538912,7590.200413,17941.830825,5076.961277,812.495816,2170.575018,1385.600642
min,3028.0,1240.0,26.6,23337.0,0.0,221.0,0.0,0.0,24.0,0.0,0.0,0.0,5.0,622.0,107.0,904.0,48.0,0.0,0.0,0.0
25%,23072.0,26679.0,35.0,57074.0,2373.25,3503.0,966.5,0.0,1419.5,0.0,84.5,520.0,3998.75,9843.0,2513.5,19139.75,2771.5,586.5,1662.75,1107.0
50%,39048.0,42193.0,38.0,75371.0,4800.0,10562.5,3234.5,38.0,4142.5,0.0,222.0,970.0,7884.0,15985.5,9212.0,30273.0,5609.5,1104.0,2944.5,1873.0
75%,63903.75,66736.0,41.15,97486.75,12202.0,21391.75,14050.25,102.5,8288.25,14.75,484.75,1618.0,16219.5,26225.5,14840.0,46477.75,8956.75,1608.5,4894.25,3015.25
max,112425.0,108661.0,85.1,250001.0,31939.0,65032.0,82872.0,696.0,58722.0,263.0,9701.0,3856.0,81431.0,40821.0,30099.0,73573.0,28522.0,5547.0,10249.0,6362.0


Ok, all the descriptive statistics look reasonable.<br>
Let's look at the info.

In [40]:
gdf_full_merge.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   modzcta                      214 non-null    object  
 1   label                        214 non-null    object  
 2   zcta                         214 non-null    object  
 3   pop_est                      214 non-null    float64 
 4   geometry                     214 non-null    geometry
 5   population                   182 non-null    float64 
 6   median_age                   182 non-null    float64 
 7   median_household_income      182 non-null    float64 
 8   poverty_level                182 non-null    float64 
 9   white                        182 non-null    float64 
 10  black                        182 non-null    float64 
 11  american_indian_alaskan      182 non-null    float64 
 12  asian                        182 non-null    float64 
 1

So, the `zcta` with the bizarre values, after being dropped, naturally do not have any ACS data.<br>
I think in this case, I will leave in the geometry and just color them grey for no data?<br>
However, the geometry is on `modzcta`, so I will need to aggregate the ACS data from the secondary zip codes into the `modzcta`.<br>
Let's try a groupby in the `gdf_full_merge` on `modzcta` and see how that works.

In [41]:
gdf_full_merge_groupby=gdf_full_merge.groupby('modzcta').agg({'label':'first',
                                      'geometry':'first',
                                      'population':'sum',
                                      'median_age':'median', 
                                      'median_household_income':'median',
                                      'poverty_level':'sum',
                                      'white':'sum',
                                      'black':'sum',
                                      'american_indian_alaskan':'sum', 
                                      'asian':'sum', 
                                      'nhpi':'sum', 
                                      'other':'sum',
                                      'two_or_more':'sum', 
                                      'hispanic':'sum',
                                      'total_households':'sum',
                                      'total_households_no_vehicle':'sum',
                                      'pop_25_older':'sum', 
                                      'pop_25_older_hs_grad':'sum', 
                                      'pop_25_older_associates':'sum',
                                      'pop_25_older_bachelors':'sum',
                                      'pop_25_older_graduate':'sum'}).reset_index()
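
Two caveats about this aggregation: a median of ZCTA medians is not the MODZCTA's true median, and `sum` returns 0 for a group whose values are all missing. Below is a sketch of one possible mitigation: a population-weighted mean as a rough stand-in for the combined median, plus `min_count=1` so all-missing groups stay NaN. The data is invented and the weighting is an assumption for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'modzcta': ['A', 'A', 'B'],
    'population': [100.0, 300.0, np.nan],
    'median_age': [30.0, 40.0, np.nan],
})

# weight each ZCTA's median by its population
df['age_x_pop'] = df['median_age'] * df['population']

# min_count=1 keeps a group as NaN when every value in it is missing
grouped = df.groupby('modzcta')[['population', 'age_x_pop']].sum(min_count=1)
grouped['median_age_weighted'] = grouped['age_x_pop'] / grouped['population']
print(grouped)
```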

In [42]:
gdf_full_merge_groupby.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 22 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   modzcta                      177 non-null    object  
 1   label                        177 non-null    object  
 2   geometry                     177 non-null    geometry
 3   population                   177 non-null    float64 
 4   median_age                   177 non-null    float64 
 5   median_household_income      177 non-null    float64 
 6   poverty_level                177 non-null    float64 
 7   white                        177 non-null    float64 
 8   black                        177 non-null    float64 
 9   american_indian_alaskan      177 non-null    float64 
 10  asian                        177 non-null    float64 
 11  nhpi                         177 non-null    float64 
 12  other                        177 non-null    float64 
 13  two_o

In [43]:
gdf_full_merge_groupby.head()

Unnamed: 0,modzcta,label,geometry,population,median_age,median_household_income,poverty_level,white,black,american_indian_alaskan,...,other,two_or_more,hispanic,total_households,total_households_no_vehicle,pop_25_older,pop_25_older_hs_grad,pop_25_older_associates,pop_25_older_bachelors,pop_25_older_graduate
0,10001,"10001, 10118","POLYGON ((-73.98774 40.74407, -73.98819 40.743...",25026.0,36.1,96787.0,2798.0,13641.0,1536.0,11.0,...,107.0,542.0,3925.0,13311.0,11290.0,19550.0,1307.0,427.0,1004.0,579.0
1,10002,10002,"POLYGON ((-73.99750 40.71407, -73.99709 40.714...",74363.0,44.8,35607.0,20257.0,16476.0,5776.0,375.0,...,388.0,1182.0,19155.0,33790.0,28446.0,58942.0,8972.0,1293.0,3897.0,2459.0
2,10003,10003,"POLYGON ((-73.98864 40.72293, -73.98876 40.722...",54671.0,31.9,129981.0,4040.0,37168.0,2738.0,38.0,...,146.0,1542.0,4732.0,25158.0,19715.0,38411.0,1710.0,476.0,1661.0,1408.0
3,10004,10004,"MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...",3310.0,38.4,204949.0,93.0,2214.0,149.0,0.0,...,13.0,104.0,193.0,1822.0,1347.0,2899.0,55.0,2.0,77.0,22.0
4,10005,10005,"POLYGON ((-74.00783 40.70309, -74.00786 40.703...",8664.0,30.4,184681.0,653.0,6079.0,174.0,0.0,...,0.0,368.0,529.0,4649.0,4256.0,6698.0,278.0,3.0,76.0,28.0


Ok, great now I should be able to do feature creation and get to mapping on those features!!!

## Feature Creation

Need to create a few features that will make it easier to visualize data.

1) `% poverty level` - take poverty level number and divide by population
2) `% households with car` - take households without a car, subtract from households, then divide result by households
3) `% college degree` - sum up associates degree and higher, divide this number by total population 25 and older and remove columns.
4) `% race columns` - create percentage columns for each
5) Drop unneeded columns
6) Reorder columns
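
As a worked example of the formulas above, here is a sketch using the ZCTA 10001 figures from the earlier preview. `safe_ratio` is a hypothetical helper that returns `None` instead of dividing by zero.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

# values copied from the ZCTA 10001 row shown earlier
row = {
    'population': 25026.0, 'poverty_level': 2798.0,
    'total_households': 13311.0, 'total_households_no_vehicle': 11290.0,
    'pop_25_older': 19550.0, 'pop_25_older_associates': 427.0,
    'pop_25_older_bachelors': 1004.0, 'pop_25_older_graduate': 579.0,
}

poverty_perc = safe_ratio(row['poverty_level'], row['population'])
vehicle_perc = safe_ratio(
    row['total_households'] - row['total_households_no_vehicle'],
    row['total_households'],
)
degree_cols = ('pop_25_older_associates', 'pop_25_older_bachelors',
               'pop_25_older_graduate')
college_perc = safe_ratio(sum(row[c] for c in degree_cols), row['pop_25_older'])

print(round(poverty_perc, 3), round(vehicle_perc, 3), round(college_perc, 3))
```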


In [44]:
# poverty level
gdf_full_merge_groupby['poverty_level_perc'] = gdf_full_merge_groupby['poverty_level'] / gdf_full_merge_groupby['population']

In [45]:
# households with car
gdf_full_merge_groupby['hh_w_vehicle_perc'] = (gdf_full_merge_groupby['total_households'] - gdf_full_merge_groupby['total_households_no_vehicle']) / gdf_full_merge_groupby['total_households']

In [48]:
# college degree cols
col_degree=['pop_25_older_associates','pop_25_older_bachelors', 'pop_25_older_graduate']
# create college degree col
gdf_full_merge_groupby['college_degree'] = gdf_full_merge_groupby[col_degree].sum(axis=1)
# create percent college degree
gdf_full_merge_groupby['college_degree_perc'] = gdf_full_merge_groupby['college_degree'] / gdf_full_merge_groupby['pop_25_older']
# remove cols
gdf_full_merge_groupby.drop(columns=col_degree,axis=1,inplace=True)

In [49]:
# other race cols
other_race_cols = ['american_indian_alaskan','nhpi','two_or_more','other']
# create other race col
gdf_full_merge_groupby['other_races_sum'] = gdf_full_merge_groupby[other_race_cols].sum(axis=1)
# drop cols
gdf_full_merge_groupby.drop(columns=other_race_cols,axis=1,inplace=True)

Create percentage race columns

In [50]:
race_cols=['white', 'black', 'asian', 'hispanic','other_races_sum']
for col in race_cols:
  gdf_full_merge_groupby[col + '_perc']=gdf_full_merge_groupby[col]/gdf_full_merge_groupby['population']

Export df for mapping notebook.

In [52]:
gdf_full_merge_groupby.to_pickle("./data/gdf_full_merge_groupby.pkl")