## data
### for grabbing / processing (census) data

In [1]:
from evaltools.data import *
import geopandas as gpd
import pandas as pd
import us

_Note: Sometimes, when calling any of the functions below, you may get an error that looks like this:_
```
ValueError: Unexpected response (URL: ...): Sorry, the system is currently undergoing maintenance or is busy.  Please try again later. 
```
_This is due to a Census API issue and can't be fixed on our end. Usually, running the function again works like a charm!_
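
Since rerunning usually works, the retry can be automated. The helper below is a minimal sketch, not part of `evaltools`: `retry_on_value_error`, `attempts`, and `wait_seconds` are hypothetical names, and it assumes the failure surfaces as the `ValueError` shown above.

```python
import time


def retry_on_value_error(fetch, attempts=3, wait_seconds=2.0):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper, not part of evaltools. It retries on any
    ValueError, which is broader than the Census maintenance error alone.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ValueError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(wait_seconds)  # brief pause before trying again


# Usage sketch (needs evaltools and network access, so left commented out):
# df = retry_on_value_error(lambda: census(us.states.MA, table="P3"))
```

Because it catches every `ValueError`, a genuine input mistake would also be retried before being raised, so keep `attempts` small.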

### census
Uses the US Census Bureau's API to retrieve 2020 Decennial Census PL 94-171 data at the stated geometry level. The five tables are:
 * P1: Race
 * P2: Hispanic or Latino, and Not Hispanic or Latino by Race
 * P3: Race for the Population 18 Years and Over (Race by VAP)
 * P4: Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
 * P5: Group Quarters Population by Major Group Quarters Type

In [2]:
%%time
# this usually takes a few seconds
df = census(us.states.MA, 
            table="P3", # Table from which we retrieve data, defaults to "P1"
            columns={}, # mapping Census column names from the table to human-readable names, if desired
            geometry="tract", # data granularity, one of "block" (default), "block group", or "tract"
           )
df.head()

CPU times: user 80 ms, sys: 12.3 ms, total: 92.3 ms
Wall time: 8.31 s


Unnamed: 0,GEOID20,VAP20,WHITEVAP20,BLACKVAP20,AMINVAP20,ASIANVAP20,NHPIVAP20,OTHVAP20,WHITEBLACKVAP20,WHITEAMINVAP20,...,BLACKAMINNHPIOTHVAP20,BLACKASIANNHPIOTHVAP20,AMINASIANNHPIOTHVAP20,WHITEBLACKAMINASIANNHPIVAP20,WHITEBLACKAMINASIANOTHVAP20,WHITEBLACKAMINNHPIOTHVAP20,WHITEBLACKASIANNHPIOTHVAP20,WHITEAMINASIANNHPIOTHVAP20,BLACKAMINASIANNHPIOTHVAP20,WHITEBLACKAMINASIANNHPIOTHVAP20
0,25027710602,3046,1836,202,24,110,0,470,47,24,...,0,0,0,0,0,0,0,0,0,0
1,25027710400,1800,1253,64,5,101,0,202,22,23,...,0,0,0,0,0,0,0,0,0,0
2,25027710500,2522,1373,196,14,73,3,473,26,19,...,0,0,0,0,0,0,0,0,0,0
3,25027710601,2394,1211,250,24,63,1,494,32,21,...,0,0,0,0,0,0,0,0,0,0
4,25027710700,1498,814,144,17,35,0,288,14,23,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# The `variables()` function produces the default mapping that `census()` uses 
# to map Census column-names to human-readable ones
mapping = variables("P3")

In [4]:
# Census column P3_001N is total Voting Age Population
mapping['P3_001N']

'VAP20'

### acs5
Uses the US Census Bureau's API to retrieve 5-year population estimates from the American Community Survey (ACS) for the provided state, geometry level, and year.

In [5]:
%%time 
# this should take < 1 min
acs5_df = acs5(us.states.MA,
               geometry="block group", # data granularity, either "tract" (default) or "block group"
               year=2019, # Year for which data is retrieved. Defaults to 2019, i.e. 2015-19 ACS 5-year
              )
acs5_df.head()

CPU times: user 988 ms, sys: 90.1 ms, total: 1.08 s
Wall time: 35.6 s


Unnamed: 0,BLOCKGROUP10,TOTPOP19,WHITE19,BLACK19,AMIN19,ASIAN19,NHPI19,OTH19,2MORE19,NHISP19,...,AMINCVAP19,ASIANCVAP19,NHPICVAP19,OTHCVAP19,2MORECVAP19,NHWCVAP19,HCVAP19,VAP19,CVAP19,POCVAP19
0,250173173012,571,340,15,0,137,0,0,0,492,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,482,0.0,482.0
1,250173531012,1270,660,311,0,93,0,0,41,1105,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1181,0.0,1181.0
2,250173222002,2605,2315,61,0,96,0,21,18,2511,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2059,0.0,2059.0
3,250251101035,1655,1077,242,0,82,0,0,131,1532,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1413,0.0,1413.0
4,250251101032,659,158,225,0,0,0,0,0,383,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,421,0.0,421.0


### cvap
Uses the US Census Bureau's API to retrieve the 2019 5-year CVAP (Citizen Voting Age Population) data for the provided state at the specified geometry. Please note that the geometries are from the **2010 Census**.

In [6]:
%%time
# this should take < 15s
cvap_df = cvap(us.states.MA,
               geometry="block group", # data granularity, either "tract" (default) or "block group"
              )
cvap_df.head()

CPU times: user 9.09 s, sys: 709 ms, total: 9.8 s
Wall time: 9.84 s


Unnamed: 0,BLOCKGROUP10,CVAP19,CVAP19e,NHCVAP19,NHCVAP19e,NHAICVAP19,NHAICVAP19e,NHACVAP19,NHACVAP19e,NHBCVAP19,...,NHAWCVAP19e,NHBWCVAP19,NHBWCVAP19e,NHAIBCVAP19,NHAIBCVAP19e,NHOTHCVAP19,NHOTHCVAP19e,HCVAP19,HCVAP19e,POCCVAP19
0,250010101001,790,175,775,171,0,12,0,12,0,...,12,20,27,0,12,0,12,15,23,35
1,250010101002,420,120,410,120,0,12,4,16,20,...,12,10,15,0,12,0,12,10,20,45
2,250010101003,640,153,620,154,0,12,20,19,10,...,36,0,12,0,12,0,12,20,18,85
3,250010101004,360,148,360,148,0,12,4,16,4,...,12,0,12,0,12,0,12,0,12,10
4,250010101005,515,139,510,136,0,12,0,12,0,...,18,0,12,0,12,0,12,10,16,20


### estimating cvap
This function wraps the above `cvap()` and `acs5()` functions to help users pull forward CVAP estimates from 2019 (on 2010 geometries) to estimates for 2020 (on 2020 geometries). To use it, supply a base geodataframe `base` with the 2020 geometries on which you want CVAP estimates, and specify the demographic groups whose CVAP is to be estimated. For each group, specify a triple $(X, Y, Z)$, where $X$ is the old CVAP column for that group, $Y$ is the old VAP column for that group, and $Z$ is the new VAP column for that group, which must be an existing column on `base`. The estimated new CVAP for that group is then $(X / Y) \cdot Z$ for each new geometry.
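
As a worked example with made-up numbers (not taken from any real geometry): suppose the old data give $X = 450$ and $Y = 500$ for a group, and the new geometry has $Z = 600$. The estimate would then be

$$
\frac{X}{Y} \cdot Z = \frac{450}{500} \cdot 600 = 540.
$$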

In [9]:
%%time
base = gpd.read_file("data/al_bg/") # Load AL 2020 block-group shapefile
acs5_cvap19 = acs5(us.states.AL) # Get CVAP19 estimates from ACS
cvap_cvap19 = cvap(us.states.AL) # Get CVAP19 estimates from ACS Special Tabulation

CPU times: user 4.26 s, sys: 314 ms, total: 4.57 s
Wall time: 46.7 s


#### Tips for picking $X$, $Y$, and $Z$:

$X$ should be any CVAP column returned by either `acs5()` or `cvap()`, so anything from the following list:

In [10]:
print([col for col in pd.concat([acs5_cvap19, cvap_cvap19]) if "CVAP" in col])

['WCVAP19', 'BCVAP19', 'AMINCVAP19', 'ASIANCVAP19', 'NHPICVAP19', 'OTHCVAP19', '2MORECVAP19', 'NHWCVAP19', 'HCVAP19', 'CVAP19', 'POCVAP19', 'CVAP19e', 'NHCVAP19', 'NHCVAP19e', 'NHAICVAP19', 'NHAICVAP19e', 'NHACVAP19', 'NHACVAP19e', 'NHBCVAP19', 'NHBCVAP19e', 'NHNHPICVAP19', 'NHNHPICVAP19e', 'NHWCVAP19e', 'NHAIWCVAP19', 'NHAIWCVAP19e', 'NHAWCVAP19', 'NHAWCVAP19e', 'NHBWCVAP19', 'NHBWCVAP19e', 'NHAIBCVAP19', 'NHAIBCVAP19e', 'NHOTHCVAP19', 'NHOTHCVAP19e', 'HCVAP19e', 'POCCVAP19']


Note that the `acs5()` method returns things like `BCVAP19` or `HCVAP19` (Black-alone CVAP and Hispanic CVAP, respectively), while the `cvap()` method returns things like `NHBCVAP19` (Non-Hispanic Black-alone CVAP). There are also columns like `NHBWCVAP19`, which refers to all Non-Hispanic citizens of voting age who self-identified as Black and White. However, since your choice of $Y$ is restricted to single-race or ethnicity columns (see below), we recommend only estimating CVAP for single-race or ethnicity columns, like `BCVAP19`, `HCVAP19`, or `NHBCVAP19`.

$Y$ should be one of the VAP columns in the following list:

In [11]:
print([col for col in pd.concat([acs5_cvap19, cvap_cvap19]) if "VAP" in col and "CVAP" not in col])

['WVAP19', 'BVAP19', 'AMINVAP19', 'ASIANVAP19', 'NHPIVAP19', 'OTHVAP19', '2MOREVAP19', 'NHWVAP19', 'HVAP19', 'VAP19']


Lastly, one should choose $Z$ to match one's choice for $Y$ (say, `BVAP20` to match `BVAP19`). However, in some cases it is reasonable to choose a $Z$ that is a close but imperfect match. For example, setting $(X, Y, Z) = $ `(BCVAP19, BVAP19, APBVAP20)` (where $Z = $ `APBVAP20` refers to all people of voting age who selected Black alone or in combination with other Census-defined races) would allow one to estimate the 2020 CVAP of people who selected Black alone or in combination with other races.

One final note: there are some instances in which, due to small Census reporting discrepancies, the `acs5()` and the `cvap()` methods disagree on CVAP19 estimates (this might happen for total `CVAP19` or `HCVAP19`, for example). In these cases we default to the `acs5()` numbers.

In [13]:
estimates = estimatecvap(base,
                         us.states.AL,
                         groups=[ # (Old CVAP, Old VAP, new VAP)
                             ("WCVAP19", "WVAP19", "WVAP20"),
                             ("BCVAP19", "BVAP19", "BVAP20"),
                         ],
                         ceiling=1, # see below
                         zfill=0.1, # see below
                         geometry10="tract"
                        )

100%|██████████████████████████████████████| 1181/1181 [00:08<00:00, 137.21it/s]
100%|███████████████████████████████████████| 1181/1181 [00:14<00:00, 82.88it/s]


The `ceiling` parameter sets the threshold at which the CVAP / VAP ratio is capped at 1. With `ceiling=1`, whenever a geometry has more CVAP19 than VAP19, the CVAP20 estimate is capped at 100\% of the VAP20. The `zfill` parameter controls what happens when a geometry has 0 CVAP19. With `zfill=0.1`, 10\% of the VAP20 is estimated to be CVAP.
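
The arithmetic for a single geometry can be sketched as follows. This is an illustration of the rules as described here, not the code `estimatecvap()` actually runs: the function name `estimate_new_cvap` and the input numbers are hypothetical, and treating any ratio above `ceiling` as exactly 1 is an assumed reading of that parameter.

```python
def estimate_new_cvap(old_cvap, old_vap, new_vap, ceiling=1.0, zfill=0.1):
    """Estimate new CVAP as (old CVAP / old VAP) * new VAP for one geometry.

    Hypothetical sketch of the rules described in the text, not the
    evaltools implementation.
    """
    if old_cvap == 0:
        # No old CVAP recorded: assume a fixed share of the new VAP.
        ratio = zfill
    elif old_cvap > ceiling * old_vap:
        # Ratio exceeds the ceiling (including old_vap == 0): cap at 100%.
        ratio = 1.0
    else:
        ratio = old_cvap / old_vap
    return ratio * new_vap


print(estimate_new_cvap(450, 500, 600))  # ordinary case: ratio 0.9
print(estimate_new_cvap(520, 500, 600))  # more CVAP than VAP: capped
print(estimate_new_cvap(0, 40, 50))      # zero CVAP: zfill share
```

Checking for zero CVAP first means the division only happens when the old VAP is positive, so the sketch never divides by zero.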

We can see that our estimate for Black-alone Citizen Voting Age Population in Alabama in 2020 is 970,120, down slightly from 970,239 in 2019.

In [15]:
print(f"AL BCVAP20: {estimates.BCVAP20_EST.sum()}")
print(f"AL BCVAP19: {acs5_cvap19.BCVAP19.sum()}")

AL BCVAP20: 970120.3645540088
AL BCVAP19: 970239


We can also estimate Black CVAP in Alabama using `APBVAP20`, the voting age population of Alabamians who identified as Black alone or in combination with other races. This bumps up the estimate to around 1,007,363.

In [17]:
estimates = estimatecvap(base,
                         us.states.AL,
                         groups=[
                             # Changing the new VAP column from BVAP20 -> APBVAP20
                             ("BCVAP19", "BVAP19", "APBVAP20"),
                         ],
                         ceiling=1,
                         zfill=0.1,
                         geometry10="tract"
                        )

100%|██████████████████████████████████████| 1181/1181 [00:08<00:00, 138.90it/s]
100%|███████████████████████████████████████| 1181/1181 [00:14<00:00, 84.24it/s]


In [18]:
print(f"AL APBCVAP20 estimate: {estimates.BCVAP20_EST.sum()}")

AL APBCVAP20 estimate: 1007362.5586538106
