# Census and GeoJSON Data EDA

The goal of this notebook is to obtain and organize the following county-level data:

- nominal data: state, county, fips
- census data: 
    - total population
    - ethnic population(s)
    - voting statistics
    - median income
    - educational attainment
- geographic data (from GeoJSON): 
    - census area
    - latitude/longitude

The statistics gathered in this notebook will only need to be updated once the 2020 Census information is released to the public.

In [33]:
# standard EDA
import numpy as np
import pandas as pd

# processing geodata
import geopandas as gp
import pickle
from scipy import sparse
from shapely.geometry import asShape, Polygon

# opening external coordinates
import json

# opening urls
from urllib.request import urlopen

# 1. import census data from `census.gov`

2019 population estimates can be collected from [census.gov](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/asrh/). For the most current estimates, we will only save data from `YEAR == 12` ([data dictionary](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf)).

In [4]:
pop_df = pd.read_csv(
    '../data/external/cc-est2019-alldata.csv',
    encoding='latin-1',        # to avoid unicode error
    dtype={'STATE':'str',      # these are FIPS codes
           'COUNTY':'str'}
)

# reduce memory usage
# SUMLEV is the summary level; every row is 050 (county), so the column can be dropped
# keep only the 7/1/2019 estimate (YEAR == 12) and the all-ages total (AGEGRP == 0)
pop_df = pop_df.drop(columns='SUMLEV')
pop_df = pop_df.loc[(pop_df['YEAR'] == 12) & (pop_df['AGEGRP'] == 0)]
pop_df = pop_df.drop(columns=['YEAR', 'AGEGRP'])

pop_df.head()

Unnamed: 0,STATE,COUNTY,STNAME,CTYNAME,TOT_POP,TOT_MALE,TOT_FEMALE,WA_MALE,WA_FEMALE,BA_MALE,...,HWAC_MALE,HWAC_FEMALE,HBAC_MALE,HBAC_FEMALE,HIAC_MALE,HIAC_FEMALE,HAAC_MALE,HAAC_FEMALE,HNAC_MALE,HNAC_FEMALE
209,1,1,Alabama,Autauga County,55869,27092,28777,20878,21729,5237,...,778,687,89,93,40,27,15,19,16,11
437,1,3,Alabama,Baldwin County,223234,108247,114987,94810,100388,9486,...,5144,4646,268,281,264,197,69,65,55,35
665,1,5,Alabama,Barbour County,24686,13064,11622,6389,5745,6311,...,509,408,63,50,61,26,1,0,14,8
893,1,7,Alabama,Bibb County,22394,11929,10465,8766,8425,2941,...,291,253,32,19,6,15,5,1,17,3
1121,1,9,Alabama,Blount County,57826,28472,29354,27258,28154,516,...,2794,2516,76,58,67,66,18,21,34,21


In [5]:
pop_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 209 to 716357
Data columns (total 77 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   STATE         3142 non-null   object
 1   COUNTY        3142 non-null   object
 2   STNAME        3142 non-null   object
 3   CTYNAME       3142 non-null   object
 4   TOT_POP       3142 non-null   int64 
 5   TOT_MALE      3142 non-null   int64 
 6   TOT_FEMALE    3142 non-null   int64 
 7   WA_MALE       3142 non-null   int64 
 8   WA_FEMALE     3142 non-null   int64 
 9   BA_MALE       3142 non-null   int64 
 10  BA_FEMALE     3142 non-null   int64 
 11  IA_MALE       3142 non-null   int64 
 12  IA_FEMALE     3142 non-null   int64 
 13  AA_MALE       3142 non-null   int64 
 14  AA_FEMALE     3142 non-null   int64 
 15  NA_MALE       3142 non-null   int64 
 16  NA_FEMALE     3142 non-null   int64 
 17  TOM_MALE      3142 non-null   int64 
 18  TOM_FEMALE    3142 non-null   int64 
 19  WA

Notice that county names provided by the US census contain descriptive terms, such as 'County', whereas the NYTimes data does not.

In [6]:
# remove descriptive terms from county names, will use on other dataframes
# note: modifies `data` in place and also returns it
def remove_county_terms(data, county_col):
    county_terms = ['County', 'Parish', 'Municipality']
    for term in county_terms:
        # regex=False: treat the term as literal text, not a pattern
        data[county_col] = data[county_col].str.replace(' ' + term, '', regex=False)
    return data

In [7]:
# rename columns to better-match nytimes data (and personal preference)
pop_df.rename(
    columns={
        'STATE':'state_fips',            # will combine and drop later
        'COUNTY':'county_fips',
        'STNAME':'state',
        'CTYNAME':'county',
        'TOT_POP':'total_pop'
    }, inplace=True
)
pop_df.columns = pop_df.columns.str.lower()

# nytimes fips is 5-digit combo of state and county fips
pop_df['fips'] = pop_df['state_fips'] + pop_df['county_fips']
pop_df = pop_df.drop(columns=['county_fips'])

pop_df = remove_county_terms(pop_df, 'county')

pop_df.head()

Unnamed: 0,state_fips,state,county,total_pop,tot_male,tot_female,wa_male,wa_female,ba_male,ba_female,...,hwac_female,hbac_male,hbac_female,hiac_male,hiac_female,haac_male,haac_female,hnac_male,hnac_female,fips
209,1,Alabama,Autauga,55869,27092,28777,20878,21729,5237,6000,...,687,89,93,40,27,15,19,16,11,1001
437,1,Alabama,Baldwin,223234,108247,114987,94810,100388,9486,10107,...,4646,268,281,264,197,69,65,55,35,1003
665,1,Alabama,Barbour,24686,13064,11622,6389,5745,6311,5595,...,408,63,50,61,26,1,0,14,8,1005
893,1,Alabama,Bibb,22394,11929,10465,8766,8425,2941,1822,...,253,32,19,6,15,5,1,17,3,1007
1121,1,Alabama,Blount,57826,28472,29354,27258,28154,516,462,...,2516,76,58,67,66,18,21,34,21,1009


## make rows for New York City, Kansas City, and Joplin

Since the NYTimes dataset treats `New York City`, `Kansas City`, and `Joplin` [as their own entities](https://github.com/nytimes/covid-19-data#geographic-exceptions), we need to add them to our population dataframe.

### New York City

`New York City` is the combination of these five counties, [which are coterminous with the five boroughs](https://en.wikipedia.org/wiki/New_York_City#Boroughs):

- Bronx (36005)
- Kings (36047)
- New York (36061)
- Queens (36081)
- Richmond (36085)

We will arbitrarily assign the `fips` as `36NYC`.

In [14]:
boroughs = ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond']
nyc_fips = ['36005', '36047', '36061', '36081', '36085']

In [15]:
def custom_county_maker(source_df, using='fips', method='sum', state_fips='36',
    state='New York', state_abbr='NY', county='New York City', fips=nyc_fips, 
    end_fips='36NYC'):
    # collapse several county rows of source_df into one custom row;
    # weights for method='mean' come from the global pop_df
    
    cols = source_df.select_dtypes(include='number').columns
    if using == 'fips':
        temp_df = source_df.set_index('fips').loc[fips, :]
        ref_df = pop_df.set_index('fips').loc[fips, 'total_pop']
    elif using == 'county':
        # this branch only supports the NYC boroughs listed above
        temp_df = source_df.loc[source_df['state']==state]\
                  .set_index('county').loc[boroughs, :]
        ref_df = pop_df.loc[pop_df['state']==state]\
                 .set_index('county').loc[boroughs, 'total_pop']
    temp_df = temp_df.select_dtypes(include='number')
    if method == 'sum':
        temp_df = pd.DataFrame(
            [temp_df.sum()],
            columns=cols
        )
    elif method == 'mean':
        # population-weighted mean across the selected counties
        temp_df = pd.DataFrame(
            [np.average(temp_df.values, axis=0, weights=ref_df)],
            columns=cols
        )
    # fill the non-numeric columns, guessing each column's role from its name
    for c in source_df.select_dtypes(exclude='number').columns:
        if 'state_fips' in c.lower():
            temp_df[c] = state_fips
        elif source_df[c].map(len).mean() == 2:    # two-letter state abbreviation
            temp_df[c] = state_abbr
        elif 'county' in c.lower():
            temp_df[c] = county
        elif 'state' in c.lower():
            temp_df[c] = state
        elif 'fips' in c.lower():
            temp_df[c] = end_fips
    return temp_df

In [16]:
nyc_pop_df = custom_county_maker(pop_df, using='fips', method='sum')
nyc_pop_df

Unnamed: 0,total_pop,tot_male,tot_female,wa_male,wa_female,ba_male,ba_female,ia_male,ia_female,aa_male,...,hiac_male,hiac_female,haac_male,haac_female,hnac_male,hnac_female,state_fips,state,county,fips
0,8336817,3978439,4358378,2145238,2247804,1046937,1247761,57897,58600,597346,...,65574,67225,21187,22500,8967,9468,36,New York,New York City,36NYC


In [17]:
pop_df = pd.concat([pop_df, nyc_pop_df], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
pop_df.tail()

Unnamed: 0,state_fips,state,county,total_pop,tot_male,tot_female,wa_male,wa_female,ba_male,ba_female,...,hwac_female,hbac_male,hbac_female,hiac_male,hiac_female,haac_male,haac_female,hnac_male,hnac_female,fips
3138,56,Wyoming,Teton,23464,12142,11322,11567,10718,101,71,...,1578,25,23,105,81,16,15,12,7,56039
3139,56,Wyoming,Uinta,20226,10224,10002,9753,9524,77,75,...,840,17,23,82,111,3,12,8,2,56041
3140,56,Wyoming,Washakie,7805,3963,3842,3759,3618,25,19,...,489,7,9,54,59,7,8,4,2,56043
3141,56,Wyoming,Weston,6927,3624,3303,3392,3062,32,16,...,120,5,8,23,20,6,4,2,0,56045
3142,36,New York,New York City,8336817,3978439,4358378,2145238,2247804,1046937,1247761,...,922001,257362,294492,65574,67225,21187,22500,8967,9468,36NYC


In [18]:
# nh* columns = not Hispanic, one race alone; h_* columns = Hispanic (any race)
pop_df['white'] = (pop_df['nhwa_male']+pop_df['nhwa_female'])
pop_df['black'] = (pop_df['nhba_male']+pop_df['nhba_female'])
pop_df['asian'] = (pop_df['nhaa_male']+pop_df['nhaa_female'])
pop_df['hispanic'] = (pop_df['h_male']+pop_df['h_female'])

In [19]:
pop_cols = ['state', 'county', 'state_fips', 'fips', 'total_pop', 'white', 'black', 'asian', 'hispanic']
pop_df = pop_df[pop_cols]
pop_df = pop_df.sort_values(by='fips')
pop_df.tail()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic
3137,Wyoming,Sweetwater,56,56037,42343,33561,502,410,6772
3138,Wyoming,Teton,56,56039,23464,19000,145,378,3554
3139,Wyoming,Uinta,56,56041,20226,17657,126,92,1871
3140,Wyoming,Washakie,56,56043,7805,6417,38,55,1108
3141,Wyoming,Weston,56,56045,6927,6236,45,113,285


### Kansas City and Joplin

Kansas City and Joplin both refer to cities that cross county borders in Missouri. Therefore, we have to get our information from [census.gov quickfacts](https://www.census.gov/quickfacts).

We'll use `29KAN` and `29JOP` as the `fips` for these two cities.

In [20]:
mo_pop_df = pd.DataFrame(
    [['Missouri',
      'Kansas City',
      '29',
      '29KAN',
      495_327,
      int(0.601*495_327),
      int(0.290*495_327),
      int(0.027*495_327),
      int(0.102*495_327)],
     ['Missouri',
      'Joplin',
      '29',
      '29JOP',
      50_925,
      int(0.876*50_925),
      int(0.032*50_925),
      int(0.019*50_925),
      int(0.048*50_925)
     ]]
    , columns=pop_cols)
mo_pop_df

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic
0,Missouri,Kansas City,29,29KAN,495327,297691,143644,13373,50523
1,Missouri,Joplin,29,29JOP,50925,44610,1629,967,2444
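
The counts in `mo_pop_df` come from multiplying each city's total population by a QuickFacts percentage and truncating with `int`. A minimal sketch of that arithmetic, using a hypothetical helper `share_to_count` (not part of the notebook) and the Kansas City figures from the cell above:

```python
def share_to_count(total, share):
    """Turn a population share (0-1) into a whole-person count, truncating like int()."""
    return int(share * total)

kc_total = 495_327
kc_shares = {'white': 0.601, 'black': 0.290, 'asian': 0.027, 'hispanic': 0.102}

# one count per group, matching the Kansas City row of mo_pop_df
kc_counts = {group: share_to_count(kc_total, share) for group, share in kc_shares.items()}
print(kc_counts)
```

Because each share comes from a separate QuickFacts line, the four counts are not guaranteed to sum to the city total.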


In [21]:
pop_df = pd.concat([pop_df, mo_pop_df], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
pop_df.tail()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic
3140,Wyoming,Uinta,56,56041,20226,17657,126,92,1871
3141,Wyoming,Washakie,56,56043,7805,6417,38,55,1108
3142,Wyoming,Weston,56,56045,6927,6236,45,113,285
3143,Missouri,Kansas City,29,29KAN,495327,297691,143644,13373,50523
3144,Missouri,Joplin,29,29JOP,50925,44610,1629,967,2444


In [32]:
pop_df = pop_df.sort_values(by='fips')
pop_df.to_csv('../data/processed/pop_df.csv', index=False)
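
Because `fips` is stored as text, it is worth reading it back as a string whenever `pop_df.csv` is reloaded. The sketch below uses a two-row in-memory CSV rather than the real file to show what can happen to leading zeros when no `dtype` is given:

```python
from io import StringIO
import pandas as pd

csv_text = "fips,total_pop\n01001,55869\n01003,223234\n"

# default parsing: an all-digit fips column is read as integers, dropping the leading zero
as_numbers = pd.read_csv(StringIO(csv_text))

# forcing str keeps the five-character codes intact
as_strings = pd.read_csv(StringIO(csv_text), dtype={'fips': str})

print(as_numbers['fips'].tolist())
print(as_strings['fips'].tolist())
```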

# 2. import geojson for boundaries and census areas

In [9]:
# https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

geo_df = gp.read_file('../data/external/cb_2018_us_county_20m/cb_2018_us_county_20m.shp')
geo_df = geo_df.sort_values(by='GEOID').reset_index(drop=True)
geo_df['ALAND'] = geo_df['ALAND'] / 1e6    # square meters -> square kilometers
geo_df = geo_df[['STATEFP', 'GEOID', 'ALAND', 'geometry']]
geo_df.rename(columns={'STATEFP': 'state_fips', 'GEOID': 'fips', 'ALAND': 'area_land'}, inplace=True)
geo_df.head()

Unnamed: 0,state_fips,fips,area_land,geometry
0,1,1001,1539.602123,"POLYGON ((-86.91759 32.66417, -86.71339 32.661..."
1,1,1003,4117.546676,"POLYGON ((-88.02632 30.75336, -87.94455 30.827..."
2,1,1005,2292.144655,"POLYGON ((-85.73573 31.62449, -85.66565 31.786..."
3,1,1007,1612.167481,"POLYGON ((-87.42194 33.00338, -87.31854 33.006..."
4,1,1009,1670.103911,"POLYGON ((-86.96336 33.85822, -86.92439 33.909..."


## add city areas to `geo_df`

Keep in mind we need to include geo data for three cities included in the NYTimes data (New York City, Kansas City, Joplin). 

GeoJSON data for these three areas were compiled from [Nominatim](https://nominatim.openstreetmap.org/) and [polygons](http://polygons.openstreetmap.fr/):
- Search for the area at [Nominatim](https://nominatim.openstreetmap.org/).
- Select `details` from the relevant entry.
- Copy the numeric `code` under `OSM`, ignoring "relation". E.g., for New York City, copy `175905`.
- Search for the `code` at [polygons](http://polygons.openstreetmap.fr/).
- For our purposes, GeoJSONs were selected according to the following criteria: (1) sparsity of vertices (`NPoints`) and (2) accuracy of shape.
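
The vertex-count criterion can be checked before settling on a file. The sketch below counts the coordinate pairs in a GeoJSON geometry dictionary. `count_vertices` is a hypothetical helper and the square is toy data, not one of the downloaded files; it assumes the file holds a bare `Polygon` or `MultiPolygon` geometry.

```python
def count_vertices(geometry):
    """Count coordinate pairs in a GeoJSON Polygon or MultiPolygon geometry."""
    if geometry['type'] == 'Polygon':
        polygons = [geometry['coordinates']]
    elif geometry['type'] == 'MultiPolygon':
        polygons = geometry['coordinates']
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")
    # each polygon is a list of rings; each ring is a list of [lon, lat] pairs
    return sum(len(ring) for polygon in polygons for ring in polygon)

square = {
    'type': 'Polygon',
    'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
}
print(count_vertices(square))
```

Note that a closed ring repeats its first point, so the count includes that duplicate.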

In [10]:
# new york city, ny
with open('../data/external/nyc.txt') as f:
    nyc_json = json.load(f)

# kansas city, mo
with open('../data/external/kcm.txt') as f:
    kcm_json = json.load(f)

# joplin, mo
with open('../data/external/jm.txt') as f:
    jm_json = json.load(f)

In [11]:
add_to_gdf = gp.GeoDataFrame(
    [['29', '29JOP', 98.61, asShape(jm_json).buffer(0)],
     ['29', '29KAN', 815.55, asShape(kcm_json).buffer(0)],
     ['36', '36NYC', 777.95, asShape(nyc_json).buffer(0)]], 
    columns=geo_df.columns
)

add_to_gdf

Unnamed: 0,state_fips,fips,area_land,geometry
0,29,29JOP,98.61,"MULTIPOLYGON (((-94.53408 37.04073, -94.53397 ..."
1,29,29KAN,815.55,"POLYGON ((-94.74800 39.24300, -94.74800 39.274..."
2,36,36NYC,777.95,"POLYGON ((-74.26300 40.49400, -74.26200 40.513..."


In [12]:
geo_df = pd.concat([geo_df, add_to_gdf], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
geo_df = geo_df.sort_values(by='fips').reset_index(drop=True)
geo_df.shape

(3223, 4)

In [23]:
# we don't want to double-count the nyc boroughs

geo_df = geo_df[~geo_df['fips'].isin(nyc_fips)]
geo_df.shape

(3218, 4)

## find neighbors (for clustering later)

In [24]:
# https://gis.stackexchange.com/a/281676

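def county_neighbors(g):
    # for each county in g, list the fips of every other county whose geometry
    # intersects it; g is one state's rows, so cross-state neighbors are not found
    neighbor_matrix = []
    
    for i, row in g.iterrows():
        neighbors = g[g['geometry'].intersects(row['geometry'])]['fips'].tolist()
        neighbors.remove(row['fips'])    # a county always intersects itself
        neighbor_matrix.append(neighbors)
    
    g['neighbors'] = neighbor_matrix
    return g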

In [25]:
geo_df = geo_df.groupby(by='state_fips').apply(county_neighbors)
geo_df.head()

Unnamed: 0,state_fips,fips,area_land,geometry,neighbors
0,1,1001,1539.602123,"POLYGON ((-86.91759 32.66417, -86.71339 32.661...","[01021, 01047, 01051, 01085, 01101]"
1,1,1003,4117.546676,"POLYGON ((-88.02632 30.75336, -87.94455 30.827...","[01025, 01053, 01097, 01099, 01129]"
2,1,1005,2292.144655,"POLYGON ((-85.73573 31.62449, -85.66565 31.786...","[01011, 01045, 01067, 01109, 01113]"
3,1,1007,1612.167481,"POLYGON ((-87.42194 33.00338, -87.31854 33.006...","[01021, 01065, 01073, 01105, 01117, 01125]"
4,1,1009,1670.103911,"POLYGON ((-86.96336 33.85822, -86.92439 33.909...","[01043, 01055, 01073, 01095, 01115, 01127]"
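
Since `intersects` is symmetric, every neighbor relationship should appear in both directions. A small sanity check along those lines is sketched here against a plain dictionary rather than `geo_df`. `asymmetric_pairs` is a hypothetical helper, and the neighbor lists are deliberately shortened toy data.

```python
def asymmetric_pairs(neighbors):
    """Return (a, b) pairs where a lists b as a neighbor but b does not list a."""
    return [
        (a, b)
        for a, listed in neighbors.items()
        for b in listed
        if a not in neighbors.get(b, [])
    ]

toy = {
    '01001': ['01021', '01047'],
    '01021': ['01001'],
    '01047': [],            # missing the link back to 01001
}
print(asymmetric_pairs(toy))
```

On the real data, the input could be built with `dict(zip(geo_df['fips'], geo_df['neighbors']))`; an empty result would mean the lists are consistent.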


## find centroids

We will use `shapely` to calculate the [centroid](https://en.wikipedia.org/wiki/Centroid) coordinates for the counties (in case we wish to plot bubble maps).

In [26]:
def centroid(df):
    centroids = df['geometry'].centroid
    return [c.coords[0] for c in centroids]

geo_df['lon'], geo_df['lat'] = zip(*geo_df.pipe(centroid))
geo_df.head()

Unnamed: 0,state_fips,fips,area_land,geometry,neighbors,lon,lat
0,1,1001,1539.602123,"POLYGON ((-86.91759 32.66417, -86.71339 32.661...","[01021, 01047, 01051, 01085, 01101]",-86.643648,32.538666
1,1,1003,4117.546676,"POLYGON ((-88.02632 30.75336, -87.94455 30.827...","[01025, 01053, 01097, 01099, 01129]",-87.722603,30.729584
2,1,1005,2292.144655,"POLYGON ((-85.73573 31.62449, -85.66565 31.786...","[01011, 01045, 01067, 01109, 01113]",-85.387579,31.868235
3,1,1007,1612.167481,"POLYGON ((-87.42194 33.00338, -87.31854 33.006...","[01021, 01065, 01073, 01105, 01117, 01125]",-87.125115,32.996421
4,1,1009,1670.103911,"POLYGON ((-86.96336 33.85822, -86.92439 33.909...","[01043, 01055, 01073, 01095, 01115, 01127]",-86.568495,33.98143
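
For reference, the centroid of a simple polygon is an area-weighted average of its vertices. The sketch below applies the standard shoelace-based formula to a single ring in plain Python. It only illustrates the idea (`ring_centroid` is a hypothetical helper, not how `shapely` is implemented) and, like the cell above, it treats longitude/latitude as planar coordinates.

```python
def ring_centroid(ring):
    """Planar centroid of a closed ring given as [(x, y), ...] with first == last."""
    area2 = 0.0          # twice the signed area of the ring
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(ring_centroid(unit_square))
```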


In [31]:
with open('../data/processed/geo_df.p', 'wb') as f:
    pickle.dump(geo_df, f, protocol=pickle.HIGHEST_PROTOCOL)

## merge with `pop_df` to begin building `dem_df`

In [28]:
dem_df = pop_df.merge(geo_df[['fips', 'area_land', 'lon', 'lat', 'neighbors']], on='fips')
dem_df.head()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,lon,lat,neighbors
0,Alabama,Autauga,1,1001,55869,41215,11098,646,1671,1539.602123,-86.643648,32.538666,"[01021, 01047, 01051, 01085, 01101]"
1,Alabama,Baldwin,1,1003,223234,185747,19215,2346,10534,4117.546676,-87.722603,30.729584,"[01025, 01053, 01097, 01099, 01129]"
2,Alabama,Barbour,1,1005,24686,11235,11807,116,1117,2292.144655,-85.387579,31.868235,"[01011, 01045, 01067, 01109, 01113]"
3,Alabama,Bibb,1,1007,22394,16663,4719,46,623,1612.167481,-87.125115,32.996421,"[01021, 01065, 01073, 01105, 01117, 01125]"
4,Alabama,Blount,1,1009,57826,50176,872,163,5582,1670.103911,-86.568495,33.98143,"[01043, 01055, 01073, 01095, 01115, 01127]"
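
The merge above is an inner join, so any `fips` present in only one frame is dropped silently (the five NYC boroughs, for instance, were removed from `geo_df` earlier). One way to list such rows is an outer merge with `indicator=True`. The frames below are toy data, not the notebook's dataframes.

```python
import pandas as pd

left = pd.DataFrame({'fips': ['36005', '36NYC', '56045'], 'total_pop': [1, 2, 3]})
right = pd.DataFrame({'fips': ['36NYC', '56045', '72001'], 'area_land': [10.0, 20.0, 30.0]})

# the _merge column records whether each key came from left, right, or both
check = left.merge(right, on='fips', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['fips', '_merge']]
print(unmatched)
```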


# 3. add 2016 general election data

Mask use has become politically charged, so it is worth examining how it relates to county-level voting patterns. County results are taken from [github.com/tonmcg](https://github.com/tonmcg). Alaska results are taken from [RRH Elections](https://rrhelections.com/index.php/2018/02/02/alaska-results-by-county-equivalent-1960-2016/).

In [34]:
with urlopen('https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-16/master/2016_US_County_Level_Presidential_Results.csv') as response:
    elect_df = pd.read_csv(
        response,
        encoding='latin-1',        # to avoid unicode error
        dtype={
            'votes_dem':'int',
            'votes_gop':'int',
            'total_votes':'int',
            'combined_fips':'str'},
        index_col=0
    )
elect_df.head()

Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips
0,93003,130413,246588,0.377159,0.52887,37410,15.17%,AK,Alaska,2013
1,93003,130413,246588,0.377159,0.52887,37410,15.17%,AK,Alaska,2016
2,93003,130413,246588,0.377159,0.52887,37410,15.17%,AK,Alaska,2020
3,93003,130413,246588,0.377159,0.52887,37410,15.17%,AK,Alaska,2050
4,93003,130413,246588,0.377159,0.52887,37410,15.17%,AK,Alaska,2060


In [35]:
elect_df.rename(
    columns={
        'county_name':'county',
        'combined_fips':'fips',
    }, inplace=True
)

elect_df = remove_county_terms(elect_df, 'county')
elect_df['fips'] = elect_df['fips'].apply('{0:0>5}'.format)        # https://stackoverflow.com/a/23836353
elect_df = elect_df[['state_abbr', 'county', 'fips', 'votes_dem', 'votes_gop', 'total_votes']]
elect_df = elect_df.sort_values(by='fips')
elect_df.head()

Unnamed: 0,state_abbr,county,fips,votes_dem,votes_gop,total_votes
29,AL,Autauga,1001,5908,18110,24661
30,AL,Baldwin,1003,18409,72780,94090
31,AL,Barbour,1005,4848,5431,10390
32,AL,Bibb,1007,1874,6733,8748
33,AL,Blount,1009,2150,22808,25384
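
The format string `'{0:0>5}'` right-aligns the value in a field of width 5 and fills the gap with zeros, which restores a leading zero that a FIPS code may have lost. A quick check, with `str.zfill` shown as an equivalent for string input:

```python
pad = '{0:0>5}'.format

print(pad('1001'))         # 01001
print(pad('56045'))        # 56045 (already five characters, unchanged)
print('1001'.zfill(5))     # 01001
```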


## add New York City, Kansas City, and Joplin election data

In [36]:
nyc_elect_df = custom_county_maker(elect_df, using='fips', method='sum')
nyc_elect_df['state_abbr'] = 'NY'
nyc_elect_df['county'] = 'New York City'
nyc_elect_df['fips'] = '36NYC'
nyc_elect_df

Unnamed: 0,votes_dem,votes_gop,total_votes,state_abbr,county,fips
0,1969920,461174,2490750,NY,New York City,36NYC


In [37]:
# estimate Joplin by summing the two counties it spans: Jasper (29097) and Newton (29145)
jop_fips = ['29097', '29145']
jop_elect_df = custom_county_maker(elect_df, using='fips', method='sum', state_abbr='MO', county='Joplin', fips=jop_fips, end_fips='29JOP')
jop_elect_df

Unnamed: 0,votes_dem,votes_gop,total_votes,state_abbr,county,fips
0,15553,55604,74685,MO,Joplin,29JOP


In [38]:
# https://en.wikipedia.org/wiki/2016_United_States_presidential_election_in_Missouri

kan_elect_df = pd.DataFrame(
    [['MO', 'Kansas City', '29KAN', 97735, 24654, 128601]]
    , columns=elect_df.columns)
kan_elect_df

Unnamed: 0,state_abbr,county,fips,votes_dem,votes_gop,total_votes
0,MO,Kansas City,29KAN,97735,24654,128601


In [39]:
elect_df = pd.concat([elect_df, nyc_elect_df, jop_elect_df, kan_elect_df], ignore_index=True)
elect_df.tail()

Unnamed: 0,state_abbr,county,fips,votes_dem,votes_gop,total_votes
3139,WY,Washakie,56043,532,2911,3715
3140,WY,Weston,56045,294,2898,3334
3141,NY,New York City,36NYC,1969920,461174,2490750
3142,MO,Joplin,29JOP,15553,55604,74685
3143,MO,Kansas City,29KAN,97735,24654,128601


## add alaska elections data

Data taken from [RRH Elections](https://rrhelections.com/index.php/2018/02/02/alaska-results-by-county-equivalent-1960-2016/).

In [41]:
ak_elect_df = pd.read_excel('../data/external/2016 AK Gen Official.xlsx', sheet_name='By CE')
ak_elect_df = ak_elect_df.iloc[0:29, 0:12]
ak_elect_df.head()

Unnamed: 0,ED/Muni,Municipality Code,Registered Voters,Times Counted,"Castle, Darrell L.","Clinton, Hillary","De La Fuente, Roque","Johnson, Gary","Stein, Jill","Trump, Donald J.",Write-in 60,ED Total
0,Ketchikan Gateway,Ketchikan,10512,4283,48,1295,13,339,84,2354,104,4237
1,Prince of Wales-Hyder,Prince of Wales-Hyder,4630,1831,67,666,29,93,65,831,59,1810
2,Sitka,Sitka,7218,2787,38,1261,18,145,78,1146,73,2759
3,Petersburg,Petersburg,2741,1078,12,334,7,64,37,577,32,1063
4,Wrangell,Wrangell,1731,764,7,177,3,35,13,512,13,760


In [42]:
ak_elect_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ED/Muni               29 non-null     object
 1   Municipality Code     29 non-null     object
 2   Registered Voters     29 non-null     object
 3   Times Counted         29 non-null     object
 4   Castle, Darrell L.    29 non-null     object
 5   Clinton, Hillary      29 non-null     object
 6   De La Fuente, Roque   29 non-null     object
 7   Johnson, Gary         29 non-null     object
 8   Stein, Jill           29 non-null     object
 9   Trump, Donald J.      29 non-null     object
 10  Write-in 60           29 non-null     object
 11  ED Total              29 non-null     object
dtypes: object(12)
memory usage: 2.8+ KB


In [43]:
ak_elect_df.rename(
    columns={
        'Trump, Donald J. ':'votes_gop',
        'Clinton, Hillary ':'votes_dem'
    }, inplace=True
)
ak_elect_df = ak_elect_df[['ED/Muni', 'votes_gop', 'votes_dem', 'ED Total']]
ak_elect_df[['votes_gop', 'votes_dem', 'ED Total']] = ak_elect_df[['votes_gop', 'votes_dem', 'ED Total']].astype(int)
ak_elect_df = ak_elect_df.sort_values(by='ED/Muni')
ak_elect_df.head()

Unnamed: 0,ED/Muni,votes_gop,votes_dem,ED Total
22,Aleutians East,198,121,369
24,Aleutians West,260,493,846
19,Anchorage,39942,32130,81678
12,Bethel,809,2178,3933
25,Bristol Bay,180,99,316


In [44]:
print(len(ak_elect_df))
print(len(elect_df[elect_df['state_abbr'] == 'AK']))

29
29


In [45]:
elect_df.loc[elect_df['state_abbr'] == 'AK', ['votes_gop', 'votes_dem', 'total_votes']] = ak_elect_df[['votes_gop', 'votes_dem', 'ED Total']].values
elect_df.tail()

Unnamed: 0,state_abbr,county,fips,votes_dem,votes_gop,total_votes
3139,WY,Washakie,56043,532,2911,3715
3140,WY,Weston,56045,294,2898,3334
3141,NY,New York City,36NYC,1969920,461174,2490750
3142,MO,Joplin,29JOP,15553,55604,74685
3143,MO,Kansas City,29KAN,97735,24654,128601


In [46]:
elect_df.to_csv('../data/processed/elect_df.csv', index=False)

In [37]:
# elect_df = pd.read_csv('../data/processed/elect_df.csv', dtype={'fips': str})

In [47]:
dem_df = dem_df.merge(elect_df[['fips', 'votes_gop', 'votes_dem', 'total_votes']], on='fips', how='left')
dem_df.tail()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,lon,lat,neighbors,votes_gop,votes_dem,total_votes
3135,Wyoming,Sweetwater,56,56037,42343,33561,502,410,6772,27005.754244,-108.882788,41.659439,"[56007, 56013, 56023, 56035, 56041]",12153.0,3233.0,16661.0
3136,Wyoming,Teton,56,56039,23464,19000,145,378,3554,10351.784301,-110.589071,43.935211,"[56013, 56023, 56029, 56035]",3920.0,7313.0,12176.0
3137,Wyoming,Uinta,56,56041,20226,17657,126,92,1871,5391.631764,-110.547578,41.287818,"[56023, 56037]",6154.0,1202.0,8053.0
3138,Wyoming,Washakie,56,56043,7805,6417,38,55,1108,5798.138762,-107.680187,43.904516,"[56003, 56013, 56017, 56019, 56025, 56029]",2911.0,532.0,3715.0
3139,Wyoming,Weston,56,56045,6927,6236,45,113,285,6210.804116,-104.567368,43.840251,"[56005, 56009, 56011, 56027]",2898.0,294.0,3334.0
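
The vote columns display as floats (`12153.0`), most likely because a left merge fills unmatched rows with `NaN`, which forces an integer column to `float64`. If integer values are preferred, one option is pandas' nullable `Int64` dtype. A sketch on toy data:

```python
import pandas as pd

pops = pd.DataFrame({'fips': ['00001', '00002'], 'total_pop': [100, 200]})
votes = pd.DataFrame({'fips': ['00001'], 'total_votes': [40]})

merged = pops.merge(votes, on='fips', how='left')
print(merged['total_votes'].dtype)      # float64, because fips 00002 has no match

# nullable integers keep whole numbers and mark the gap as <NA>
merged['total_votes'] = merged['total_votes'].astype('Int64')
print(merged)
```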


# 4. add income data

Median income statistics are taken from table S1901 on [data.census.gov](https://data.census.gov/cedsci/table?q=s1901&tid=ACSST1Y2018.S1901). The file loaded below is the 2018 ACS 5-Year Estimates version (`ACSST5Y2018`), although the link points to the 1-Year table. According to the data table, the median household income is reported in column `S1901_C01_012E`.

In [48]:
income_df = pd.read_csv('../data/external/ACSST5Y2018.S1901_data_with_overlays_2020-07-16T134009.csv',
                        usecols=['GEO_ID', 'NAME', 'S1901_C01_012E'])
income_df = income_df.drop(0, axis=0)
income_df.rename(
    columns={
        'GEO_ID':'fips',
        'S1901_C01_012E':'median_income'
    }, inplace=True
)
income_df['median_income'] = income_df['median_income'].astype(float)
income_df['fips'] = income_df['fips'].str[-5:]
income_df.head()

Unnamed: 0,fips,NAME,median_income
1,1001,"Autauga County, Alabama",58786.0
2,1003,"Baldwin County, Alabama",55962.0
3,1005,"Barbour County, Alabama",34186.0
4,1007,"Bibb County, Alabama",45340.0
5,1009,"Blount County, Alabama",48695.0


In [49]:
income_df[income_df['median_income'].isna()]

Unnamed: 0,fips,NAME,median_income
1817,35039,"Rio Arriba County, New Mexico",


Rio Arriba Income statistics taken from [datausa.io](https://datausa.io/profile/geo/rio-arriba-county-nm#:~:text=Median%20household%20income%20in%20Rio%20Arriba%20County%2C%20NM%20is%20%2433%2C422.)

In [50]:
income_df.loc[income_df['fips'] == '35039', 'median_income'] = 33_422    # .loc, since .at does not accept a boolean mask

In [51]:
income_df['county'], income_df['state'] = zip(*income_df['NAME'].str.split(', ').tolist())
income_df = income_df.drop('NAME', axis=1)
income_df = remove_county_terms(income_df, 'county')
income_df['median_income'] = income_df['median_income'].astype(int)
income_df.head()

Unnamed: 0,fips,median_income,county,state
1,1001,58786,Autauga,Alabama
2,1003,55962,Baldwin,Alabama
3,1005,34186,Barbour,Alabama
4,1007,45340,Bibb,Alabama
5,1009,48695,Blount,Alabama


In [54]:
income_df2 = pd.DataFrame(
    [['29KAN', 52405, 'Kansas City', 'Missouri'],
     ['29JOP', 42782, 'Joplin', 'Missouri'],
     ['36NYC', 60762, 'New York City', 'New York']]
    , columns=income_df.columns)
income_df2

Unnamed: 0,fips,median_income,county,state
0,29KAN,52405,Kansas City,Missouri
1,29JOP,42782,Joplin,Missouri
2,36NYC,60762,New York City,New York


In [55]:
income_df = pd.concat([income_df, income_df2], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
income_df.tail(5)

Unnamed: 0,fips,median_income,county,state
3218,72151,16013,Yabucoa Municipio,Puerto Rico
3219,72153,14954,Yauco Municipio,Puerto Rico
3220,29KAN,52405,Kansas City,Missouri
3221,29JOP,42782,Joplin,Missouri
3222,36NYC,60762,New York City,New York


In [56]:
income_df.to_csv('../data/processed/income_df.csv', index=False)

In [39]:
# income_df = pd.read_csv('../data/processed/income_df.csv', dtype={'fips': str})

In [57]:
dem_df = dem_df.merge(income_df[['fips', 'median_income']], on='fips', how='left')
dem_df.head()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,lon,lat,neighbors,votes_gop,votes_dem,total_votes,median_income
0,Alabama,Autauga,1,1001,55869,41215,11098,646,1671,1539.602123,-86.643648,32.538666,"[01021, 01047, 01051, 01085, 01101]",18110.0,5908.0,24661.0,58786
1,Alabama,Baldwin,1,1003,223234,185747,19215,2346,10534,4117.546676,-87.722603,30.729584,"[01025, 01053, 01097, 01099, 01129]",72780.0,18409.0,94090.0,55962
2,Alabama,Barbour,1,1005,24686,11235,11807,116,1117,2292.144655,-85.387579,31.868235,"[01011, 01045, 01067, 01109, 01113]",5431.0,4848.0,10390.0,34186
3,Alabama,Bibb,1,1007,22394,16663,4719,46,623,1612.167481,-87.125115,32.996421,"[01021, 01065, 01073, 01105, 01117, 01125]",6733.0,1874.0,8748.0,45340
4,Alabama,Blount,1,1009,57826,50176,872,163,5582,1670.103911,-86.568495,33.98143,"[01043, 01055, 01073, 01095, 01115, 01127]",22808.0,2150.0,25384.0,48695


# 5. add educational attainment data

Educational attainment statistics are taken from table S1501 on [data.census.gov](https://data.census.gov/cedsci/table?tid=ACSST5Y2018.S1501&g=0400000US04) (2018 ACS 5-Year Estimates).

- `S1501_C01_006E` -- population 25 years and over
- `S1501_C01_007E` -- less than 9th grade
- `S1501_C01_008E` -- some high school
- `S1501_C01_009E` -- high school or GED
- `S1501_C01_010E` -- some college
- `S1501_C01_011E` -- associate's
- `S1501_C01_012E` -- bachelor's
- `S1501_C01_013E` -- graduate or professional

In [60]:
edu_cols = [f'S1501_C01_{i:03d}E' for i in range(6, 14)]    # S1501_C01_006E ... S1501_C01_013E
edu_col_names = ['pop25', 'no_hs', 'some_hs', 'hs', 'some_college', 'associates', 'bachelors', 'graduate']
edu_dict = dict(zip(edu_cols, edu_col_names))
edu_dict.update({'GEO_ID':'fips'})

edu_df = pd.read_csv('../data/external/ACSST5Y2018.S1501_data_with_overlays_2020-07-18T170455.csv',
                     usecols=['GEO_ID', 'NAME']+edu_cols)
edu_df = edu_df.drop(0, axis=0)
edu_df[edu_cols] = edu_df[edu_cols].astype(int)
edu_df.rename(
    columns=edu_dict,
    inplace=True
)
edu_df['fips'] = edu_df['fips'].str[-5:]
edu_df.head()


Unnamed: 0,fips,NAME,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate
1,1001,"Autauga County, Alabama",37166,956,3248,12119,7554,2998,5903,4388
2,1003,"Baldwin County, Alabama",146989,3978,10332,40579,32266,13759,30431,15644
3,1005,"Barbour County, Alabama",18173,1490,3411,6486,3287,1279,1417,803
4,1007,"Bibb County, Alabama",15780,903,1747,7471,2938,908,1197,616
5,1009,"Blount County, Alabama",39627,2967,4894,13489,8492,4775,3217,1793


In [61]:
edu_df['county'], edu_df['state'] = zip(*edu_df['NAME'].str.split(', ').tolist())
edu_df = edu_df.drop('NAME', axis=1)
edu_df = remove_county_terms(edu_df, 'county')
edu_df.head()

Unnamed: 0,fips,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate,county,state
1,1001,37166,956,3248,12119,7554,2998,5903,4388,Autauga,Alabama
2,1003,146989,3978,10332,40579,32266,13759,30431,15644,Baldwin,Alabama
3,1005,18173,1490,3411,6486,3287,1279,1417,803,Barbour,Alabama
4,1007,15780,903,1747,7471,2938,908,1197,616,Bibb,Alabama
5,1009,39627,2967,4894,13489,8492,4775,3217,1793,Blount,Alabama
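
A quick consistency check on these columns: the seven attainment categories should add up to `pop25`. The sketch below verifies this for the first two rows shown above, copied by hand into plain dictionaries.

```python
levels = ['no_hs', 'some_hs', 'hs', 'some_college', 'associates', 'bachelors', 'graduate']

rows = [
    {'fips': '01001', 'pop25': 37166, 'no_hs': 956, 'some_hs': 3248, 'hs': 12119,
     'some_college': 7554, 'associates': 2998, 'bachelors': 5903, 'graduate': 4388},
    {'fips': '01003', 'pop25': 146989, 'no_hs': 3978, 'some_hs': 10332, 'hs': 40579,
     'some_college': 32266, 'associates': 13759, 'bachelors': 30431, 'graduate': 15644},
]

# True for a row when its categories sum exactly to the 25-and-over population
checks = {row['fips']: sum(row[level] for level in levels) == row['pop25'] for row in rows}
print(checks)
```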


In [62]:
nyc_edu_df = custom_county_maker(edu_df, using='fips', method='sum')
nyc_edu_df

Unnamed: 0,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate,fips,county,state
0,5923498,565345,523873,1421617,815961,379457,1292814,924431,36NYC,New York City,New York


In [63]:
mo_edu_df = pd.DataFrame(
    [['29KAN', 325065, 11373, 22302, 82996, 73203, 23673, 69682, 41836, 'Kansas City', 'Missouri'],
     ['29JOP', 33571, 779, 2580, 10582, 8462, 2576, 5759, 2833, 'Joplin', 'Missouri']]
    , columns=edu_df.columns)
mo_edu_df

Unnamed: 0,fips,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate,county,state
0,29KAN,325065,11373,22302,82996,73203,23673,69682,41836,Kansas City,Missouri
1,29JOP,33571,779,2580,10582,8462,2576,5759,2833,Joplin,Missouri


In [64]:
edu_df = pd.concat([edu_df, mo_edu_df, nyc_edu_df], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
edu_df.tail()

Unnamed: 0,fips,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate,county,state
3218,72151,23916,4975,2245,5972,3636,2645,3706,737,Yabucoa Municipio,Puerto Rico
3219,72153,25976,4977,2259,8182,2381,1791,4902,1484,Yauco Municipio,Puerto Rico
3220,29KAN,325065,11373,22302,82996,73203,23673,69682,41836,Kansas City,Missouri
3221,29JOP,33571,779,2580,10582,8462,2576,5759,2833,Joplin,Missouri
3222,36NYC,5923498,565345,523873,1421617,815961,379457,1292814,924431,New York City,New York


In [65]:
edu_df.to_csv('../data/processed/edu_df.csv', index=False)

In [41]:
# edu_df = pd.read_csv('../data/processed/edu_df.csv')

In [66]:
dem_df = dem_df.merge(edu_df[['fips']+edu_col_names], on='fips', how='left')
dem_df.tail()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,...,total_votes,median_income,pop25,no_hs,some_hs,hs,some_college,associates,bachelors,graduate
3135,Wyoming,Sweetwater,56,56037,42343,33561,502,410,6772,27005.754244,...,16661.0,73008,28333,633,1916,9433,6994,3114,4298,1945
3136,Wyoming,Teton,56,56039,23464,19000,145,378,3554,10351.784301,...,12176.0,83831,17164,457,501,2272,3219,868,6488,3359
3137,Wyoming,Uinta,56,56041,20226,17657,126,92,1871,5391.631764,...,8053.0,58235,12915,288,646,5176,3420,1390,1356,639
3138,Wyoming,Washakie,56,56043,7805,6417,38,55,1108,5798.138762,...,3715.0,53426,5662,181,409,1717,1434,701,849,371
3139,Wyoming,Weston,56,56045,6927,6236,45,113,285,6210.804116,...,3334.0,52867,5014,129,260,1796,1334,534,676,285


# 6. add mask usage statistics

In [67]:
with urlopen('https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv') as response:
    mask_df = pd.read_csv(response)
mask_df.head()

Unnamed: 0,COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
0,1001,0.053,0.074,0.134,0.295,0.444
1,1003,0.083,0.059,0.098,0.323,0.436
2,1005,0.067,0.121,0.12,0.201,0.491
3,1007,0.02,0.034,0.096,0.278,0.572
4,1009,0.053,0.114,0.18,0.194,0.459


In [68]:
mask_df.rename(columns={'COUNTYFP':'fips'}, inplace=True)
mask_df['fips'] = mask_df['fips'].apply('{0:0>5}'.format)
mask_df.columns = mask_df.columns.str.lower()
mask_df.head()

Unnamed: 0,fips,never,rarely,sometimes,frequently,always
0,1001,0.053,0.074,0.134,0.295,0.444
1,1003,0.083,0.059,0.098,0.323,0.436
2,1005,0.067,0.121,0.12,0.201,0.491
3,1007,0.02,0.034,0.096,0.278,0.572
4,1009,0.053,0.114,0.18,0.194,0.459
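County FIPS codes are five-character strings, so a code that was parsed as an integer loses its leading zero and will not match a string key such as `'01001'` in a later merge. The sketch below shows two ways to get padded codes back; the toy CSV text and the frame names `as_text` and `as_int` are made up for illustration and only mimic the shape of the mask-use file.

```python
import io
import pandas as pd

# Toy CSV in the same shape as the mask-use file (illustrative subset of columns).
csv_text = "COUNTYFP,NEVER,ALWAYS\n1001,0.053,0.444\n56045,0.142,0.374\n"

# Option 1: read the code column as text from the start, then pad to 5 characters.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'COUNTYFP': 'str'})
as_text['fips'] = as_text['COUNTYFP'].str.zfill(5)

# Option 2: the column was already parsed as integers; convert to text, then pad.
as_int = pd.read_csv(io.StringIO(csv_text))
as_int['fips'] = as_int['COUNTYFP'].astype(str).str.zfill(5)

print(as_text['fips'].tolist())
print(as_int['fips'].tolist())
```

Both options should give the same padded strings; `str.zfill(5)` does the same job here as the `'{0:0>5}'.format` call used in the cell above.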


In [69]:
mask_df.select_dtypes(exclude='number').columns

Index(['fips'], dtype='object')

In [70]:
nyc_mask_df = custom_county_maker(mask_df, using='fips', method='mean')
nyc_mask_df

Unnamed: 0,never,rarely,sometimes,frequently,always,fips
0,0.029802,0.022852,0.057863,0.137004,0.75248,36NYC


In [71]:
# estimates from averaging counties
kan_fips = ['29095', '29047', '29165', '29037']
jop_mask_df = custom_county_maker(mask_df, using='fips', method='mean', fips=jop_fips, end_fips='29JOP')
kan_mask_df = custom_county_maker(mask_df, using='fips', method='mean', fips=kan_fips, end_fips='29KAN')

In [72]:
mask_df = pd.concat([mask_df, nyc_mask_df, jop_mask_df, kan_mask_df], ignore_index=True)    # DataFrame.append was removed in pandas 2.0
mask_df.tail()

Unnamed: 0,fips,never,rarely,sometimes,frequently,always
3140,56043,0.204,0.155,0.069,0.285,0.287
3141,56045,0.142,0.129,0.148,0.207,0.374
3142,36NYC,0.029802,0.022852,0.057863,0.137004,0.75248
3143,29JOP,0.143838,0.141946,0.14473,0.187351,0.383135
3144,29KAN,0.029815,0.058889,0.092879,0.203477,0.614758


In [73]:
mask_df.to_csv('../data/processed/mask_df.csv', index=False)

In [74]:
# mask_df = pd.read_csv('../data/processed/mask_df.csv')

In [75]:
dem_df = dem_df.merge(mask_df, on='fips', how='left')
dem_df.tail()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,...,hs,some_college,associates,bachelors,graduate,never,rarely,sometimes,frequently,always
3135,Wyoming,Sweetwater,56,56037,42343,33561,502,410,6772,27005.754244,...,9433,6994,3114,4298,1945,0.061,0.295,0.23,0.146,0.268
3136,Wyoming,Teton,56,56039,23464,19000,145,378,3554,10351.784301,...,2272,3219,868,6488,3359,0.095,0.157,0.16,0.247,0.34
3137,Wyoming,Uinta,56,56041,20226,17657,126,92,1871,5391.631764,...,5176,3420,1390,1356,639,0.098,0.278,0.154,0.207,0.264
3138,Wyoming,Washakie,56,56043,7805,6417,38,55,1108,5798.138762,...,1717,1434,701,849,371,0.204,0.155,0.069,0.285,0.287
3139,Wyoming,Weston,56,56045,6927,6236,45,113,285,6210.804116,...,1796,1334,534,676,285,0.142,0.129,0.148,0.207,0.374


In [76]:
# population density (people per unit of land area)
dem_df['pop_density'] = dem_df['total_pop'] / dem_df['area_land']
# shares of the total population
for col in ['white', 'black', 'asian', 'hispanic', 'total_votes']:
    dem_df['per_' + col] = dem_df[col] / dem_df['total_pop']
# education score: mean attainment level over pop25, from 0 (no_hs) to 6 (graduate)
dem_df['education'] = (dem_df['some_hs'] + 2*dem_df['hs']
                       + 3*dem_df['some_college'] + 4*dem_df['associates']
                       + 5*dem_df['bachelors'] + 6*dem_df['graduate']) / dem_df['pop25']
# Republican share of the two-party vote
dem_df['per_gop'] = dem_df['votes_gop'] / (dem_df['votes_gop'] + dem_df['votes_dem'])
# mask score: mean response level, from 0 (never) to 4 (always)
dem_df['mask'] = (dem_df['rarely'] + 2*dem_df['sometimes']
                  + 3*dem_df['frequently'] + 4*dem_df['always'])
dem_df.head()

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,...,always,pop_density,per_white,per_black,per_asian,per_hispanic,per_total_votes,education,per_gop,mask
0,Alabama,Autauga,1,1001,55869,41215,11098,646,1671,1539.602123,...,0.444,36.287947,0.737708,0.198643,0.011563,0.029909,0.441408,3.174487,0.754018,3.003
1,Alabama,Baldwin,1,1003,223234,185747,19215,2346,10534,4117.546676,...,0.436,54.215293,0.832073,0.086076,0.010509,0.047188,0.421486,3.329113,0.798123,2.968
2,Alabama,Barbour,1,1005,24686,11235,11807,116,1117,2292.144655,...,0.491,10.769826,0.455116,0.478287,0.004699,0.045248,0.420886,2.38062,0.528359,2.928
3,Alabama,Bibb,1,1007,22394,16663,4719,46,623,1612.167481,...,0.572,13.890616,0.744083,0.210726,0.002054,0.02782,0.39064,2.459823,0.78227,3.348
4,Alabama,Blount,1,1009,57826,50176,872,163,5582,1670.103911,...,0.459,34.624193,0.867707,0.01508,0.002819,0.096531,0.438972,2.606581,0.913855,2.892
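The `education` and `mask` columns are weighted averages in which each category is weighted by its ordinal rank, starting at 0. As a rough check, the sketch below recomputes the mask score for Autauga (fips 1001) and the education score for Baldwin (fips 1003) from the raw values shown in earlier outputs; the variable names are chosen here for illustration. The results should agree with the table values 3.003 and 3.329113.

```python
import pandas as pd

# Values copied from the outputs above: mask shares for Autauga (fips 1001)
# and education counts for Baldwin (fips 1003).
mask_row = pd.Series({'never': 0.053, 'rarely': 0.074, 'sometimes': 0.134,
                      'frequently': 0.295, 'always': 0.444})
edu_row = pd.Series({'no_hs': 3978, 'some_hs': 10332, 'hs': 40579,
                     'some_college': 32266, 'associates': 13759,
                     'bachelors': 30431, 'graduate': 15644})
pop25 = 146989    # Baldwin population aged 25 and over

# Weights are the ordinal rank of each category: 0, 1, 2, ...
mask_weights = pd.Series(range(len(mask_row)), index=mask_row.index)
edu_weights = pd.Series(range(len(edu_row)), index=edu_row.index)

# The mask shares already sum to 1; the education counts are divided by pop25.
mask_score = (mask_row * mask_weights).sum()
edu_score = (edu_row * edu_weights).sum() / pop25

print(round(mask_score, 3))
print(round(edu_score, 6))
```

Because the lowest category has weight 0, `never` and `no_hs` drop out of the sums, which is why they do not appear in the formulas in the cell above.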


In [77]:
dem_df[dem_df.isna().any(axis=1)]

Unnamed: 0,state,county,state_fips,fips,total_pop,white,black,asian,hispanic,area_land,...,always,pop_density,per_white,per_black,per_asian,per_hispanic,per_total_votes,education,per_gop,mask
81,Alaska,Kusilvak Census Area,2,2158,8314,281,32,35,249,44221.226622,...,0.347,0.188009,0.033798,0.003849,0.00421,0.029949,,2.080314,,2.987
548,Hawaii,Kalawao,15,15005,86,23,0,7,1,31.057603,...,0.85,2.769048,0.267442,0.0,0.081395,0.011628,,3.144928,,3.717
2410,South Dakota,Oglala Lakota,46,46102,14177,669,55,18,579,5422.402157,...,0.355,2.614524,0.047189,0.00388,0.00127,0.040841,,2.571686,,2.616


In [78]:
# NaNs appear only in the vote columns, so impute 0.5 as a placeholder until we find better data
# (note: fillna(0.5) touches every column, so any NaN raw vote counts also become 0.5)
dem_df = dem_df.fillna(0.5)
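If the raw vote counts should stay missing rather than become 0.5, one option is to restrict the fill to the share columns. This is only a sketch on a toy frame: the column names follow `dem_df`, but the values and the `share_cols` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow dem_df, values are made up for illustration.
toy = pd.DataFrame({
    'fips': ['00001', '00002'],
    'total_votes': [1000.0, np.nan],
    'per_total_votes': [0.4, np.nan],
    'per_gop': [0.6, np.nan],
})

# Fill only the share columns with the placeholder 0.5;
# raw counts stay NaN so they are not mistaken for real tallies.
share_cols = ['per_total_votes', 'per_gop']
toy[share_cols] = toy[share_cols].fillna(0.5)

print(toy)
```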

In [79]:
# drop the individual NYC counties in one pass
dem_df = dem_df[~dem_df['fips'].isin(nyc_fips)]

## save results to csv

In [80]:
dem_df.to_csv('../data/processed/dem_df.csv', index=False)

# Future Work: import Puerto Rico census data

To do:
- find detailed demographic data for Puerto Rico
- find a way to incorporate Puerto Rico into the Altair map

In [None]:
# with urlopen('https://www2.census.gov/programs-surveys/popest/tables/2010-2019/municipios/totals/prm-est2019-annres.xlsx') as response:
#     pr_df = pd.read_excel(response, header=3)
pr_df = pd.read_excel('data/prm-est2019-annres.xlsx', header=3)
pr_df = pr_df[['Unnamed: 0', 2019]]
pr_df.rename(
    columns={
        'Unnamed: 0':'county',
        2019:'total_pop'
    }, inplace=True
)
pr_df = pr_df[~pr_df['total_pop'].isna()]
pr_df['total_pop'] = pr_df['total_pop'].astype('int')
pr_df.head()

In [None]:
# raw string so the regex escapes are not treated as string escapes
pr_df['county'] = [s[0] if len(s) > 0 else s for s in pr_df['county'].str.findall(r"\.([\w\s]+) Municipio\,.+")]
pr_df = pr_df.iloc[1:]          # removing the territory as a whole from the table
pr_df.head()

We also need to add `fips` codes for all of the municipios.

### import Puerto Rico `fips`

In [None]:
from requests_html import HTMLSession    # third-party package: requests-html

sess = HTMLSession()
res = sess.get('https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county')
table = res.html.find('table.wikitable > tbody > tr')
# puerto rico is fips 72
pr_fips = [[tr.find('td')[1].text, tr.find('td')[0].text] for tr in table[1:] if tr.find('td')[0].text[:2] == '72']
pr_fips_df = pd.DataFrame(pr_fips)
pr_fips_df.rename(
    columns={
        0:'county',
        1:'fips'
    }, inplace=True
)
pr_fips_df.head()

In [None]:
# raw string so the regex escapes are not treated as string escapes
pr_fips_df['county'] = [s[0] if len(s) > 0 else s for s in pr_fips_df['county'].str.findall(r"([\w\s]+) Municipality")]
pr_fips_df.head()

In [None]:
len(list(set(pr_fips_df['county']) - set(pr_df['county'])))

In [None]:
pr_df = pr_df.merge(pr_fips_df, on='county')
pr_df['state'] = 'Puerto Rico'
pr_df.head()

In [None]:
# append the Puerto Rico rows once (the original appended pr_df twice, duplicating every municipio)
pop_df = optimize(pd.concat([pop_df, pr_df], ignore_index=True))
pop_df.tail()

## check county names against NYTimes data

We eventually need to merge with the NYTimes data, so let's see how well the two datasets match:

In [6]:
with urlopen('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv') as response:
    nyt_df = pd.read_csv(
        response,
        dtype={'fips':'str'}
    )
nyt_df.head()

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061,1,0
1,2020-01-22,Snohomish,Washington,53061,1,0
2,2020-01-23,Snohomish,Washington,53061,1,0
3,2020-01-24,Cook,Illinois,17031,1,0
4,2020-01-24,Snohomish,Washington,53061,1,0


In [7]:
county_diffs = list(set(nyt_df['fips']) - set(pop_df['fips']))
len(county_diffs)

84

In [8]:
sorted([str(f) for f in county_diffs])

['69110',
 '69120',
 '72001',
 '72003',
 '72005',
 '72007',
 '72009',
 '72011',
 '72013',
 '72015',
 '72017',
 '72019',
 '72021',
 '72023',
 '72025',
 '72027',
 '72029',
 '72031',
 '72033',
 '72035',
 '72037',
 '72039',
 '72041',
 '72043',
 '72045',
 '72047',
 '72049',
 '72051',
 '72053',
 '72054',
 '72055',
 '72057',
 '72059',
 '72061',
 '72063',
 '72065',
 '72067',
 '72069',
 '72071',
 '72073',
 '72075',
 '72077',
 '72079',
 '72081',
 '72083',
 '72085',
 '72087',
 '72089',
 '72091',
 '72093',
 '72095',
 '72097',
 '72099',
 '72101',
 '72103',
 '72105',
 '72107',
 '72109',
 '72111',
 '72113',
 '72115',
 '72117',
 '72119',
 '72121',
 '72123',
 '72125',
 '72127',
 '72129',
 '72131',
 '72133',
 '72135',
 '72137',
 '72139',
 '72141',
 '72143',
 '72145',
 '72147',
 '72149',
 '72151',
 '72153',
 '78010',
 '78020',
 '78030',
 'nan']

As expected, the census county data is missing all 78 municipios of [Puerto Rico](https://www.census.gov/data/tables/time-series/demo/popest/2010s-total-puerto-rico-municipios.html) (state FIPS prefix `72`), as well as two areas in the Northern Mariana Islands (prefix `69`) and three in the US Virgin Islands (prefix `78`), so we need to append that data to `pop_df`. The remaining `nan` entry comes from NYTimes rows that carry no `fips` code.
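A quick way to summarize a list of unmatched codes like the one above is to group them by their two-character state prefix. The sketch below does this with an anti-join on toy stand-ins for `nyt_df` and `pop_df`; the frames `nyt_toy` and `pop_toy` and their contents are illustrative only.

```python
import pandas as pd

# Toy stand-ins for nyt_df and pop_df; only the fips column matters here.
nyt_toy = pd.DataFrame({'fips': ['53061', '72001', '72003', '69110', '78010', None]})
pop_toy = pd.DataFrame({'fips': ['53061', '17031']})

# Drop rows without a code, then keep codes that appear only on the NYTimes side.
nyt_codes = nyt_toy.dropna(subset=['fips']).drop_duplicates()
merged = nyt_codes.merge(pop_toy, on='fips', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'fips']

# The first two characters of a county FIPS code identify the state or territory.
by_prefix = missing.str[:2].value_counts().sort_index()
print(by_prefix)
```

The `indicator=True` argument adds a `_merge` column that marks each row as `left_only`, `right_only`, or `both`, which makes it possible to inspect unmatched rows from either side without building sets by hand.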

In [None]:
county_diffs = list(set(nyt_df['county']) - set(dem_df['county']))
len(county_diffs)

In [None]:
county_diffs

The NYTimes dataset omits diacritical marks from county names. While it would be easier to replace the diacritics in our data with their unaccented counterparts, we will preserve them in our final dataframe in the interest of cultural accuracy. This will be handled when we merge `pop_df` with `nyt_df` in the other notebook.
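One possible way to reconcile the two spellings is to compare names on an accent-free key while keeping the accented spelling as the stored value. The following is a minimal sketch using only the standard library, not the approach taken in the other notebook; the helper `strip_marks` and the example name lists are hypothetical.

```python
import unicodedata

def strip_marks(name: str) -> str:
    """Return name with combining diacritical marks removed."""
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical example names; the census-style spelling keeps its diacritics.
census_names = ['Doña Ana', 'Añasco Municipio', 'Cook']
nyt_names = ['Dona Ana', 'Anasco Municipio', 'Cook']

# Map the accent-free key back to the original accented spelling.
lookup = {strip_marks(n): n for n in census_names}
restored = [lookup.get(strip_marks(n), n) for n in nyt_names]
print(restored)
```

Matching on `fips` remains the more robust join key wherever a code is available; a name-based key like this is mainly useful for spotting and repairing spelling differences.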