## data
### for grabbing / processing (census) data

In [1]:
from gerrytools.data import *
import geopandas as gpd
import pandas as pd
import us

_Note: When calling any of the functions below, you may occasionally get an error that looks like this:_
```
ValueError: Unexpected response (URL: ...): Sorry, the system is currently undergoing maintenance or is busy.  Please try again later. 
```
_This is due to a Census API issue and can't be fixed on our end. Usually, running the function again works like a charm!_
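Since rerunning the call usually works, the retry can be automated. The following is a minimal sketch, not part of `gerrytools`: `retry_on_busy` is a hypothetical helper, and `flaky_fetch` is a stand-in for a real API call. It assumes the failure surfaces as a `ValueError`, as in the message above.

```python
import time


def retry_on_busy(fetch, *args, attempts=3, delay=2.0, **kwargs):
    """Call fetch(*args, **kwargs), retrying when it raises ValueError."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(*args, **kwargs)
        except ValueError:
            # Give up and re-raise after the final attempt.
            if attempt == attempts:
                raise
            time.sleep(delay)


# Stand-in for a flaky API call (hypothetical): fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("Sorry, the system is currently undergoing maintenance or is busy.")
    return "ok"


result = retry_on_busy(flaky_fetch, attempts=3, delay=0)
```

In a notebook, the same wrapper could be used as `retry_on_busy(census20, us.states.MA, table="P3", geometry="tract")`. Note that catching every `ValueError` also retries on genuine argument errors, so a small `attempts` value is advisable.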

### census
Uses the US Census Bureau's API to retrieve 2020 Decennial Census PL 94-171 data at the stated geometry level. The five tables are:
 * P1: Race
 * P2: Hispanic or Latino, and Not Hispanic or Latino by Race
 * P3: Race for the Population 18 Years and Over (Race by VAP)
 * P4: Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
 * P5: Group Quarters Population by Major Group Quarters Type

In [2]:
us.states.MA

<State:Massachusetts>

In [3]:
%%time
# this should take < 5s
df = census20(
    us.states.MA, 
    table="P3", # Table from which we retrieve data, defaults to "P1"
    columns={}, # mapping Census column names from the table to human-readable names, if desired
    geometry="tract", # data granularity, one of "block" (default), "block group", or "tract"
)

df[["GEOID20", "VAP20", "WHITEVAP20", "BLACKVAP20", "ASIANVAP20", "OTHVAP20"]].head()

CPU times: user 52.6 ms, sys: 4.01 ms, total: 56.6 ms
Wall time: 2.55 s


| | GEOID20 | VAP20 | WHITEVAP20 | BLACKVAP20 | ASIANVAP20 | OTHVAP20 |
|---|---|---|---|---|---|---|
| 0 | 25001012601 | 2657 | 1868 | 153 | 122 | 172 |
| 1 | 25001012602 | 4564 | 2444 | 517 | 147 | 547 |
| 2 | 25001012700 | 4059 | 3445 | 119 | 49 | 144 |
| 3 | 25001012800 | 3464 | 2971 | 86 | 42 | 84 |
| 4 | 25001012900 | 3568 | 3011 | 101 | 47 | 103 |


In [4]:
# The `variables()` function produces the default mapping that `census20()` uses
# to map Census column names to human-readable ones
mapping = variables("P3")
mapping


{'P3_001N': 'VAP20',
 'P3_003N': 'WHITEVAP20',
 'P3_004N': 'BLACKVAP20',
 'P3_005N': 'AMINVAP20',
 'P3_006N': 'ASIANVAP20',
 'P3_007N': 'NHPIVAP20',
 'P3_008N': 'OTHVAP20',
 'P3_011N': 'WHITEBLACKVAP20',
 'P3_012N': 'WHITEAMINVAP20',
 'P3_013N': 'WHITEASIANVAP20',
 'P3_014N': 'WHITENHPIVAP20',
 'P3_015N': 'WHITEOTHVAP20',
 'P3_016N': 'BLACKAMINVAP20',
 'P3_017N': 'BLACKASIANVAP20',
 'P3_018N': 'BLACKNHPIVAP20',
 'P3_019N': 'BLACKOTHVAP20',
 'P3_020N': 'AMINASIANVAP20',
 'P3_021N': 'AMINNHPIVAP20',
 'P3_022N': 'AMINOTHVAP20',
 'P3_023N': 'ASIANNHPIVAP20',
 'P3_024N': 'ASIANOTHVAP20',
 'P3_025N': 'NHPIOTHVAP20',
 'P3_027N': 'WHITEBLACKAMINVAP20',
 'P3_028N': 'WHITEBLACKASIANVAP20',
 'P3_029N': 'WHITEBLACKNHPIVAP20',
 'P3_030N': 'WHITEBLACKOTHVAP20',
 'P3_031N': 'WHITEAMINASIANVAP20',
 'P3_032N': 'WHITEAMINNHPIVAP20',
 'P3_033N': 'WHITEAMINOTHVAP20',
 'P3_034N': 'WHITEASIANNHPIVAP20',
 'P3_035N': 'WHITEASIANOTHVAP20',
 'P3_036N': 'WHITENHPIOTHVAP20',
 'P3_037N': 'BLACKAMINASIANVAP20',
 'P

In [5]:
# Census column P3_001N is total Voting Age Population
mapping['P3_001N']

'VAP20'
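A mapping of this shape is also what the `columns` argument of `census20()` accepts. The sketch below shows one way to build a smaller custom mapping and a reverse lookup from it. It uses only the first few P3 entries copied from the output above; in a notebook you would start from `variables("P3")` instead. How `census20()` treats a partial mapping (for example, whether it restricts the columns retrieved) is not shown here and should be checked against the library.

```python
# First few entries of the default P3 mapping, copied from the output above.
default_mapping = {
    "P3_001N": "VAP20",
    "P3_003N": "WHITEVAP20",
    "P3_004N": "BLACKVAP20",
    "P3_005N": "AMINVAP20",
    "P3_006N": "ASIANVAP20",
    "P3_007N": "NHPIVAP20",
    "P3_008N": "OTHVAP20",
}

# Keep only the columns of interest.
wanted = {"VAP20", "WHITEVAP20", "BLACKVAP20"}
custom = {code: name for code, name in default_mapping.items() if name in wanted}

# Invert the mapping to look up a Census code by its readable name.
by_name = {name: code for code, name in default_mapping.items()}
```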

### acs5
Uses the US Census Bureau's API to retrieve 5-year population estimates from the American Community Survey (ACS) for the provided state, geometry level, and year.

In [6]:
%%time 
# this should take < 1 min
acs5_df = acs5(
    us.states.MA,
    geometry="block group", # data granularity, either "tract" (default) or "block group"
    year=2019, # Year for which data is retrieved. Defaults to 2019, i.e. 2015-19 ACS 5-year
)
acs5_df[["BLOCKGROUP10", "TOTPOP19", "WHITE19", "BLACK19", "ASIAN19", "OTH19"]].head()

CPU times: user 523 ms, sys: 37.2 ms, total: 561 ms
Wall time: 19.5 s


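| | BLOCKGROUP10 | TOTPOP19 | WHITE19 | BLACK19 | ASIAN19 | OTH19 |
|---|---|---|---|---|---|---|
| 0 | 250173173012 | 571 | 340 | 15 | 137 | 0 |
| 1 | 250173531012 | 1270 | 660 | 311 | 93 | 0 |
| 2 | 250173222002 | 2605 | 2315 | 61 | 96 | 21 |
| 3 | 250251101035 | 1655 | 1077 | 242 | 82 | 0 |
| 4 | 250251101032 | 659 | 158 | 225 | 0 | 0 |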


### cvap
Uses the US Census Bureau's API to retrieve the 2019 5-year CVAP (Citizen Voting Age Population) data for the provided state at the specified geometry. Please note that the geometries are from the **2010 Census**.

In [7]:
%%time
# this should take < 15s
cvap_df = cvap(
    us.states.MA,
    geometry="block group", # data granularity, either "tract" (default) or "block group"
    year=2019
)
cvap_df.head()

CPU times: user 5.22 s, sys: 510 ms, total: 5.73 s
Wall time: 7.39 s


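| | BLOCKGROUP10 | CVAP19 | CVAP19e | NHCVAP19 | NHCVAP19e | NHAMINCVAP19 | NHAMINCVAP19e | NHASIANCVAP19 | NHASIANCVAP19e | NHBLACKCVAP19 | ... | NHWHITEASIANCVAP19e | NHWHITEBLACKCVAP19 | NHWHITEBLACKCVAP19e | NHBLACKAMINCVAP19 | NHBLACKAMINCVAP19e | NHOTHCVAP19 | NHOTHCVAP19e | HCVAP19 | HCVAP19e | POCCVAP19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250010101001 | 790 | 175 | 775 | 171 | 0 | 12 | 0 | 12 | 0 | ... | 12 | 20 | 27 | 0 | 12 | 0 | 12 | 15 | 23 | 35 |
| 1 | 250010101002 | 420 | 120 | 410 | 120 | 0 | 12 | 4 | 16 | 20 | ... | 12 | 10 | 15 | 0 | 12 | 0 | 12 | 10 | 20 | 45 |
| 2 | 250010101003 | 640 | 153 | 620 | 154 | 0 | 12 | 20 | 19 | 10 | ... | 36 | 0 | 12 | 0 | 12 | 0 | 12 | 20 | 18 | 85 |
| 3 | 250010101004 | 360 | 148 | 360 | 148 | 0 | 12 | 4 | 16 | 4 | ... | 12 | 0 | 12 | 0 | 12 | 0 | 12 | 0 | 12 | 10 |
| 4 | 250010101005 | 515 | 139 | 510 | 136 | 0 | 12 | 0 | 12 | 0 | ... | 18 | 0 | 12 | 0 | 12 | 0 | 12 | 10 | 16 | 20 |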


### estimating cvap
The `estimatecvap2010()` function wraps the above `cvap()` and `acs5()` functions to help users pull CVAP estimates from 2019 (on 2010 geometries) forward to 2020 (on 2020 geometries). To use it, supply a base GeoDataFrame with the 2020 geometries on which you want CVAP estimates, and specify the demographic groups whose CVAP is to be estimated. For each group, specify a triple $(X, Y, Z)$, where $X$ is the old CVAP column for that group, $Y$ is the old VAP column for that group, and $Z$ is the new VAP column for that group, which must be an existing column on `base`. The estimated new CVAP for that group is then computed as $(X / Y) \cdot Z$ for each new geometry.
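As a worked example of this ratio, with made-up numbers that are not drawn from any dataset: suppose a group has $X = 450$ CVAP and $Y = 500$ VAP attributed to a geometry in the 2019 data, and $Z = 600$ VAP in that geometry in 2020. The estimate is then

$$\frac{X}{Y} \cdot Z = \frac{450}{500} \cdot 600 = 0.9 \cdot 600 = 540.$$

This only illustrates the arithmetic; how the 2019 figures are carried from 2010 geometries onto 2020 geometries is handled inside the function and is not shown here.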

In [9]:
%%time
base = gpd.read_file("data/al_bg/") # Load AL 2020 block-group shapefile
acs5_cvap19 = acs5(us.states.AL, year=2019) # Get CVAP19 estimates from ACS
cvap_cvap19 = cvap(us.states.AL, year=2019) # Get CVAP19 estimates from ACS Special Tabulation

CPU times: user 4.2 s, sys: 212 ms, total: 4.41 s
Wall time: 15.8 s


#### Tips for picking $X$, $Y$, and $Z$:

$X$ should be any CVAP column returned by either `acs5()` or `cvap()`, so anything from the following list:

In [10]:
print([col for col in pd.concat([acs5_cvap19, cvap_cvap19]) if "CVAP" in col])

['WHITECVAP19', 'BLACKCVAP19', 'AMINCVAP19', 'ASIANCVAP19', 'NHPICVAP19', 'OTHCVAP19', '2MORECVAP19', 'NHWHITECVAP19', 'HCVAP19', 'CVAP19', 'POCVAP19', 'CVAP19e', 'NHCVAP19', 'NHCVAP19e', 'NHAMINCVAP19', 'NHAMINCVAP19e', 'NHASIANCVAP19', 'NHASIANCVAP19e', 'NHBLACKCVAP19', 'NHBLACKCVAP19e', 'NHNHPICVAP19', 'NHNHPICVAP19e', 'NHWHITECVAP19e', 'NHWHITEAMINCVAP19', 'NHWHITEAMINCVAP19e', 'NHWHITEASIANCVAP19', 'NHWHITEASIANCVAP19e', 'NHWHITEBLACKCVAP19', 'NHWHITEBLACKCVAP19e', 'NHBLACKAMINCVAP19', 'NHBLACKAMINCVAP19e', 'NHOTHCVAP19', 'NHOTHCVAP19e', 'HCVAP19e', 'POCCVAP19']


Note that the `acs5()` method returns columns like `BLACKCVAP19` or `HCVAP19` (Black-alone CVAP and Hispanic CVAP, respectively), while the `cvap()` method returns columns like `NHBLACKCVAP19` (Non-Hispanic Black-alone CVAP). There are also columns like `NHWHITEBLACKCVAP19`, which refers to all Non-Hispanic citizens of voting age who self-identified as both Black and White. However, since your choice of $Y$ is restricted to single-race or ethnicity columns (see below), we recommend only estimating CVAP for single-race or ethnicity columns, like `BLACKCVAP19`, `HCVAP19`, or `NHBLACKCVAP19`.

$Y$ should be one of the VAP columns in the following list:

In [11]:
print([col for col in pd.concat([acs5_cvap19, cvap_cvap19]) if "VAP" in col and "CVAP" not in col])

['WHITEVAP19', 'BLACKVAP19', 'AMINVAP19', 'ASIANVAP19', 'NHPIVAP19', 'OTHVAP19', '2MOREVAP19', 'NHWHITEVAP19', 'HVAP19', 'VAP19']
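Comparing the two printed lists suggests that a CVAP column and its VAP counterpart differ only in the `CVAP` / `VAP` part of the name. The sketch below pairs each candidate $X$ with a $Y$ under that naming assumption and flags the ones with no counterpart. `matching_vap` is a hypothetical helper, and the column lists are copied from the outputs above.

```python
# A few CVAP columns from the list of X candidates above.
cvap_columns = [
    "WHITECVAP19", "BLACKCVAP19", "HCVAP19",
    "NHBLACKCVAP19", "NHWHITEBLACKCVAP19",
]

# The full list of VAP columns printed above.
vap_columns = [
    "WHITEVAP19", "BLACKVAP19", "AMINVAP19", "ASIANVAP19", "NHPIVAP19",
    "OTHVAP19", "2MOREVAP19", "NHWHITEVAP19", "HVAP19", "VAP19",
]


def matching_vap(cvap_col, available):
    """Return the VAP column whose name matches cvap_col, or None."""
    candidate = cvap_col.replace("CVAP", "VAP")
    return candidate if candidate in available else None


# Map each X to its Y; None marks a CVAP column with no same-named VAP column.
pairs = {x: matching_vap(x, vap_columns) for x in cvap_columns}
```

Columns that map to `None` have no same-named VAP column in the list, so using them as $X$ means choosing a $Y$ by hand.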


Lastly, choose $Z$ to match your choice of $Y$ (say, `BVAP20` to match `BLACKVAP19`). However, in some cases it is reasonable to choose a $Z$ that is a close but imperfect match. For example, setting $(X, Y, Z) = $ `(BLACKCVAP19, BLACKVAP19, APBVAP20)` (where `APBVAP20` counts all people of voting age who selected Black alone or in combination with other Census-defined races) would allow one to estimate the 2020 CVAP of people who selected Black alone or in combination with other races.

One final note: there are some instances in which, due to small Census reporting discrepancies, the `acs5()` and the `cvap()` methods disagree on CVAP19 estimates (this might happen for total `CVAP19` or `HCVAP19`, for example). In these cases we default to the `acs5()` numbers.
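To see where the two sources disagree, one can compare column totals for the CVAP columns they share. This is a sketch on made-up toy frames; `compare_totals` is a hypothetical helper, not a `gerrytools` function. In a notebook, it could be applied to `acs5_cvap19` and `cvap_cvap19`, provided both were pulled at the same geometry.

```python
import pandas as pd


def compare_totals(left, right):
    """Sum the CVAP columns present in both frames and report the difference."""
    shared = [c for c in left.columns if c in right.columns and "CVAP" in c]
    return pd.DataFrame({
        "left": left[shared].sum(),
        "right": right[shared].sum(),
    }).assign(diff=lambda d: d["left"] - d["right"])


# Toy frames with made-up values (not real Census figures).
acs_toy = pd.DataFrame({
    "TRACT10": ["a", "b"],
    "CVAP19": [100, 200],
    "HCVAP19": [10, 20],
})
cvap_toy = pd.DataFrame({
    "TRACT10": ["a", "b"],
    "CVAP19": [105, 200],
    "HCVAP19": [10, 15],
    "NHCVAP19": [95, 185],
})

report = compare_totals(acs_toy, cvap_toy)
```

Rows of `report` with a nonzero `diff` are the columns on which the two sources disagree.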

In [12]:
acs5_cvap19.columns

Index(['TRACT10', 'TOTPOP19', 'WHITE19', 'BLACK19', 'AMIN19', 'ASIAN19',
       'NHPI19', 'OTH19', '2MORE19', 'NHISP19', 'WHITEVAP19', 'BLACKVAP19',
       'AMINVAP19', 'ASIANVAP19', 'NHPIVAP19', 'OTHVAP19', '2MOREVAP19',
       'NHWHITEVAP19', 'HVAP19', 'WHITECVAP19', 'BLACKCVAP19', 'AMINCVAP19',
       'ASIANCVAP19', 'NHPICVAP19', 'OTHCVAP19', '2MORECVAP19',
       'NHWHITECVAP19', 'HCVAP19', 'VAP19', 'CVAP19', 'POCVAP19'],
      dtype='object')

In [13]:
estimates = estimatecvap2010(
    base,
    us.states.AL,
    groups=[ # (Old CVAP, Old VAP, new VAP)
        ("WHITECVAP19", "WHITEVAP19", "WVAP20"),
        ("BLACKCVAP19", "BLACKVAP19", "BVAP20"),
    ],
    ceiling=1, # see below
    zfill=0.1, # see below
    geometry10="tract"
)

Removing the following columns: HISP20, NAMELSAD20, LOGRECNO, INTPTLAT20, CIFSN, COUNTYFP20, STUSAB, BLKGRPCE20, FUNCSTAT20, MTFCC20, AWATER20, SUMLEV, FILEID, CHARITER, STATEFP20, ALAND20, TRACTCE20, GEOCODE, INTPTLON20


100%|██████████| 1181/1181 [00:01<00:00, 734.44it/s]
100%|██████████| 1181/1181 [00:05<00:00, 209.64it/s]


The `ceiling` parameter sets the CVAP / VAP ratio above which the estimate is capped. With `ceiling=1`, whenever a geometry has more CVAP19 than VAP19, the CVAP20 estimate is capped at 100\% of the VAP20. The `zfill` parameter controls what happens when a geometry has 0 CVAP19. With `zfill=0.1`, 10\% of the VAP20 is estimated to be CVAP.
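The effect of these two parameters can be sketched on a toy table. This is an illustration of the behavior described above, not the library's implementation: `estimate_cvap` is a hypothetical function, the numbers are made up, and the real function may treat edge cases (such as zero VAP19) differently.

```python
import numpy as np
import pandas as pd


def estimate_cvap(old_cvap, old_vap, new_vap, ceiling=1.0, zfill=0.1):
    """Scale new VAP by the old CVAP / VAP ratio, applying zfill and ceiling."""
    old_cvap = np.asarray(old_cvap, dtype=float)
    old_vap = np.asarray(old_vap, dtype=float)
    new_vap = np.asarray(new_vap, dtype=float)

    # Division by zero yields inf or nan here; both cases are handled below.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = old_cvap / old_vap

    # Zero old CVAP: fall back to the zfill share of new VAP.
    ratio = np.where(old_cvap == 0, zfill, ratio)
    # Ratio above the ceiling: cap the estimate at 100% of new VAP.
    ratio = np.where(ratio > ceiling, 1.0, ratio)
    return ratio * new_vap


# Three toy geometries: an ordinary one, one with CVAP19 > VAP19, one with 0 CVAP19.
toy = pd.DataFrame({
    "CVAP19": [90, 120, 0],
    "VAP19": [100, 100, 50],
    "VAP20": [200, 80, 40],
})
toy["CVAP20_EST"] = estimate_cvap(toy["CVAP19"], toy["VAP19"], toy["VAP20"])
```

The first row is scaled by its ratio of 0.9, the second is capped at its full VAP20 because the ratio exceeds the ceiling, and the third receives 10\% of its VAP20 through `zfill`.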

We can see that our estimate for Black-alone Citizen Voting Age Population in Alabama in 2020 is 970,120, down slightly from 970,239 in 2019.

In [14]:
print(estimates.columns)

Index(['BVAP20', 'APBPOP20', 'NWBHPOP20', 'ASIANPOP20', 'AMINVAP20',
       'OTHERVAP20', 'APAVAP20', 'APAMIPOP20', 'AMINPOP20', 'NHPIVAP20',
       'APAMIVAP20', '2MOREVAP20', 'APBVAP20', 'WVAP20', 'BPOP20', 'GEOID20',
       'OTHERPOP20', 'HVAP20', '2MOREPOP20', 'WPOP20', 'TOTPOP20', 'NHPIPOP20',
       'APAPOP20', 'NWBHVAP20', 'ASIANVAP20', 'DOJBVAP20', 'VAP20', 'geometry',
       'TRACT10', 'WHITECVAP20_EST', 'BLACKCVAP20_EST'],
      dtype='object')


In [15]:
print(f"AL BLACKCVAP20: {estimates.BLACKCVAP20_EST.sum()}")
print(f"AL BLACKCVAP19: {acs5_cvap19.BLACKCVAP19.sum()}")

AL BLACKCVAP20: 970120.3645540088
AL BLACKCVAP19: 970239


We can also estimate Black CVAP in Alabama using `APBVAP20`, which counts Alabamians of voting age who identified as Black alone or in combination with other races. This bumps the estimate up to around 1,007,363.

In [16]:
estimates = estimatecvap2010(
    base,
    us.states.AL,
    groups=[
        # Changing the new VAP column from BVAP20 -> APBVAP20
        ("BLACKCVAP19", "BLACKVAP19", "APBVAP20"),
    ],
    ceiling=1,
    zfill=0.1,
    geometry10="tract"
)

Removing the following columns: HISP20, NAMELSAD20, LOGRECNO, INTPTLAT20, CIFSN, COUNTYFP20, STUSAB, BLKGRPCE20, FUNCSTAT20, MTFCC20, AWATER20, SUMLEV, FILEID, CHARITER, STATEFP20, ALAND20, TRACTCE20, GEOCODE, INTPTLON20


100%|██████████| 1181/1181 [00:01<00:00, 718.55it/s]
100%|██████████| 1181/1181 [00:05<00:00, 206.67it/s]


In [19]:
print(f"AL APBCVAP20 estimate: {estimates.BLACKCVAP20_EST.sum()}")

AL APBCVAP20 estimate: 1007362.5586538106
