# Enrich Standard Geographies

In the previous notebook, we enriched standard geographies using little more than their unique identifiers. We did not look up those geographies, though. The Geoenrichment module can also look them up, and the results can optionally be used in conjunction with the `enrich` method.

## Example Use Case - Variable Variance

Just as before, we are going to retrieve high-variance variables, but this time we will also look up the unique identifiers for all the US Census Block Groups in Seattle.

### Create a Country

Just as in the previous example Geoenrichment notebooks, our analysis starts by identifying the country we are going to work with and instantiating an `arcgis.geoenrichment.Country` object that references an `arcgis.gis.GIS` source to use for analysis.

In [112]:
import os

from arcgis.geoenrichment import Country
from arcgis.gis import GIS

gis_agol = GIS(os.getenv('AGOL_URL'), username=os.getenv('AGOL_USERNAME'), password=os.getenv('AGOL_PASSWORD'))
usa = Country('usa', gis=gis_agol)

usa

<Country - United States (GIS @ https://bateam.maps.arcgis.com version:10.1)>

### Selecting Data to Start

Just as before, we are using Pandas data frame filtering to identify a subset of variables to focus on from the thousands available.

In [113]:
# data collections of interest
collections = [
    'occupation', 'Wealth', 'financial', 'educationalattainment', 'language',
    'healthinsurancecoverage', 'veterans', 'yearmovedin', 'yearbuilt',
    'population', 'housingcosts',
]

# variables whose name contains 'cy', limited to the collections above
enrich_vars = usa.enrich_variables[
    usa.enrich_variables.name.str.lower().str.contains('cy')
    & usa.enrich_variables.data_collection.isin(collections)
].drop_duplicates('name').reset_index(drop=True)

enrich_vars

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name,description,vintage,units
0,TOTHH_CY,2021 Total Households,Wealth,Wealth.TOTHH_CY,Wealth_TOTHH_CY,2021 Total Households (Esri),2021,count
1,HINC0_CY,2021 HH Income <$15000,Wealth,Wealth.HINC0_CY,Wealth_HINC0_CY,"2021 Household Income less than $15,000 (Esri)",2021,count
2,HINC15_CY,2021 HH Income $15000-24999,Wealth,Wealth.HINC15_CY,Wealth_HINC15_CY,"2021 Household Income $15,000-$24,999 (Esri)",2021,count
3,HINC25_CY,2021 HH Income $25000-34999,Wealth,Wealth.HINC25_CY,Wealth_HINC25_CY,"2021 Household Income $25,000-$34,999 (Esri)",2021,count
4,HINC35_CY,2021 HH Income $35000-49999,Wealth,Wealth.HINC35_CY,Wealth_HINC35_CY,"2021 Household Income $35,000-$49,999 (Esri)",2021,count
...,...,...,...,...,...,...,...,...
92,OCCFARM_CY,2021 Occupation: Farm/Fish/Forestry,occupation,occupation.OCCFARM_CY,occupation_OCCFARM_CY,2021 Occupation: Farming/Fishing/Forestry (Esri),2021,count
93,OCCCONS_CY,2021 Occupation: Construction/Extraction,occupation,occupation.OCCCONS_CY,occupation_OCCCONS_CY,2021 Occupation: Construction/Extraction (Esri),2021,count
94,OCCMAIN_CY,2021 Occupation: Maintenance/Repair,occupation,occupation.OCCMAIN_CY,occupation_OCCMAIN_CY,2021 Occupation: Installation/Maintenance/Repa...,2021,count
95,OCCPROD_CY,2021 Occupation: Production,occupation,occupation.OCCPROD_CY,occupation_OCCPROD_CY,2021 Occupation: Production (Esri),2021,count


### Get the Geographic Level

Just as in the previous notebook, we are retrieving `levels` and using the `level_name` column to discover valid values for the `enrich` method's `standard_geography_level` parameter.

In [114]:
usa.levels

Unnamed: 0,level_name,singular_name,plural_name,alias,level_id,admin_level
0,block_groups,Block Group,Block Groups,Block Groups,US.BlockGroups,
1,tracts,Census Tract,Census Tracts,Census Tracts,US.Tracts,
2,places,Place,Places,Cities and Towns (Places),US.Places,
3,zip5,ZIP Code,ZIP Codes,ZIP Codes,US.ZIP5,Admin4
4,csd,County Subdivision,County Subdivisions,County Subdivisions,US.CSD,
5,counties,County,Counties,Counties,US.Counties,Admin3
6,cbsa,CBSA,CBSAs,CBSAs,US.CBSA,
7,cd,Congressional District,Congressional Districts,Congressional Districts,US.CD,
8,dma,DMA,DMAs,DMAs,US.DMA,
9,states,State,States,States,US.States,Admin2
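
The `level_id` values in this table (for example `US.Places` and `US.BlockGroups`) are the same strings passed as layer identifiers to `standard_geography_query` later in this notebook. A minimal sketch of looking one up by `level_name` follows. It uses a small hand-built frame that stands in for `usa.levels`, and the helper name `level_id_for` is hypothetical, not part of the `arcgis` API.

```python
import pandas as pd

# Stand-in for usa.levels, copied from a few rows of the output above.
levels = pd.DataFrame(
    {
        "level_name": ["block_groups", "tracts", "places"],
        "level_id": ["US.BlockGroups", "US.Tracts", "US.Places"],
    }
)


def level_id_for(levels_df: pd.DataFrame, level_name: str) -> str:
    """Return the level_id for a level_name, failing loudly if it is missing."""
    match = levels_df.loc[levels_df["level_name"] == level_name, "level_id"]
    if match.empty:
        raise KeyError(f"unknown level_name: {level_name!r}")
    return match.iloc[0]


places_layer = level_id_for(levels, "places")
block_group_layer = level_id_for(levels, "block_groups")
print(places_layer, block_group_layer)
```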


## Retrieve Seattle Block Groups

You may often already have the unique identifiers for the standard geographies you need. If you instead need to retrieve those falling within a larger area, you can use `standard_geography_query`. The most versatile parameter of this function is `geoquery`. Because the function is little more than a wrapper around the standard geography query REST endpoint, the options for `geoquery` are explained in more detail in the [geographyQuery parameter documentation](https://developers.arcgis.com/rest/geoenrichment/api-reference/standard-geography-query.htm#geographyQuery). We can start by seeing what is returned when searching for `seattle`.

In [157]:
from arcgis.geoenrichment import standard_geography_query

standard_geography_query('usa', layers='US.Places', geoquery='seattle')

Unnamed: 0,DatasetID,Hierarchy,DataLayerID,AreaID,AreaName,MajorSubdivisionName,MajorSubdivisionAbbr,MajorSubdivisionType,CountryAbbr,Score,ObjectId
0,USA_ESRI_2021,census,US.Places,5363000,Seattle city,Washington,WA,State,US,100,1


Since only one location is returned, we can use the same query to retrieve the block groups within it by populating the `sub_geography_layer` and `return_sub_geography` parameters.

In [155]:
bg_df = standard_geography_query('usa', layers='US.Places', geoquery='seattle', sub_geography_layer='US.BlockGroups', return_sub_geography=True)

bg_df.info()
bg_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   DatasetID             478 non-null    object
 1   Hierarchy             478 non-null    object
 2   DataLayerID           478 non-null    object
 3   AreaID                478 non-null    object
 4   AreaName              478 non-null    object
 5   MajorSubdivisionName  478 non-null    object
 6   MajorSubdivisionAbbr  478 non-null    object
 7   MajorSubdivisionType  478 non-null    object
 8   CountryAbbr           478 non-null    object
 9   Score                 478 non-null    int32 
 10  ObjectId              478 non-null    int64 
dtypes: int32(1), int64(1), object(9)
memory usage: 39.3+ KB


Unnamed: 0,DatasetID,Hierarchy,DataLayerID,AreaID,AreaName,MajorSubdivisionName,MajorSubdivisionAbbr,MajorSubdivisionType,CountryAbbr,Score,ObjectId
0,USA_ESRI_2021,census,US.BlockGroups,530330009001,530330009.001,Washington,WA,State,US,100,1
1,USA_ESRI_2021,census,US.BlockGroups,530330009002,530330009.002,Washington,WA,State,US,100,2
2,USA_ESRI_2021,census,US.BlockGroups,530330010001,530330010.001,Washington,WA,State,US,100,3
3,USA_ESRI_2021,census,US.BlockGroups,530330010002,530330010.002,Washington,WA,State,US,100,4
4,USA_ESRI_2021,census,US.BlockGroups,530330011001,530330011.001,Washington,WA,State,US,100,5
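
Each block group `AreaID` is a 12-character Census GEOID: a 2-digit state FIPS code, a 3-digit county code, a 6-digit tract code and a 1-digit block group number. Keeping the column as text (the `object` dtype reported above) preserves leading zeros for states such as California (`06`). As an illustration, the identifiers can be split into their parts. The frame below is a stand-in built from the rows shown above, and the new column names are arbitrary choices.

```python
import pandas as pd

# Stand-in for bg_df, using identifiers from the output above.
bg_ids = pd.DataFrame({"AreaID": ["530330009001", "530330009002", "530330010001"]})

# Slice the fixed-width GEOID into its components.
bg_ids["state_fips"] = bg_ids["AreaID"].str[:2]
bg_ids["county_fips"] = bg_ids["AreaID"].str[2:5]
bg_ids["tract"] = bg_ids["AreaID"].str[5:11]
bg_ids["block_group"] = bg_ids["AreaID"].str[11:]

print(bg_ids)
```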


## Enrich

Now we can use the retrieved block groups as input to the `enrich` method to achieve the same results as before.

In [162]:
enrich_df = usa.enrich(bg_df, enrich_variables=enrich_vars, standard_geography_level='block_groups', standard_geography_id_column='AreaID')

enrich_df.info()
enrich_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Columns: 106 entries, std_geography_level to SHAPE
dtypes: float64(99), geometry(1), int32(1), object(5)
memory usage: 394.1+ KB


Unnamed: 0,std_geography_level,std_geography_name,std_geography_id,source_country,aggregation_method,population_to_polygon_size_rating,apportionment_confidence,has_data,tothh_cy,hinc0_cy,hinc15_cy,hinc25_cy,hinc35_cy,hinc50_cy,hinc75_cy,hinc100_cy,hinc150_cy,hinc200_cy,medhinc_cy,avghinc_cy,pci_cy,agginc_cy,agghinc_cy,hincbasecy,avgia15_cy,avgia25_cy,avgia35_cy,avgia45_cy,avgia55_cy,avgia65_cy,avgia75_cy,di0_cy,di15_cy,di25_cy,di35_cy,di50_cy,di75_cy,di100_cy,di150_cy,di200_cy,meddi_cy,aggdi_cy,avgdi_cy,dibase_cy,nw0_cy,nw15_cy,nw35_cy,nw50_cy,nw75_cy,nw100_cy,nw150_cy,nw250_cy,nw500_cy,mednw_cy,avgnw_cy,aggnw_cy,nwbase_cy,val0_cy,val50k_cy,val100k_cy,val150k_cy,val200k_cy,val250k_cy,val300k_cy,val400k_cy,val500k_cy,val750k_cy,val1m_cy,medval_cy,avgval_cy,valbase_cy,wlthindxcy,nohs_cy,somehs_cy,hsgrad_cy,ged_cy,smcoll_cy,asscdeg_cy,bachdeg_cy,graddeg_cy,educbasecy,civlbfr_cy,occbase_cy,occmgmt_cy,occbus_cy,occcomp_cy,occarch_cy,occssci_cy,occssrv_cy,occlegl_cy,occeduc_cy,occent_cy,occhtch_cy,occhlth_cy,occprot_cy,occfood_cy,occbldg_cy,occpers_cy,occsale_cy,occadmn_cy,occfarm_cy,occcons_cy,occmain_cy,occprod_cy,occtran_cy,SHAPE
0,US.BlockGroups,530330009.001,530330009001,USA,Query:US.BlockGroups,2.191,2.576,1,427.0,5.0,9.0,6.0,15.0,23.0,38.0,87.0,64.0,180.0,170189.0,223215.0,90027.0,95338297.0,95312752.0,427.0,194956.0,206112.0,252013.0,256656.0,221484.0,233196.0,151389.0,6.0,10.0,10.0,18.0,40.0,57.0,105.0,80.0,101.0,129577.0,65863919.0,154248.0,427.0,10.0,4.0,1.0,7.0,6.0,14.0,26.0,35.0,82.0,1265258.0,4055452.0,1731678000.0,427.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,5.0,89.0,106.0,146.0,944575.0,980753.0,365.0,347.0,14.0,7.0,68.0,16.0,76.0,24.0,245.0,392.0,842.0,652.0,640.0,65.0,69.0,51.0,16.0,30.0,44.0,26.0,83.0,42.0,73.0,0.0,0.0,4.0,0.0,5.0,78.0,30.0,0.0,4.0,4.0,0.0,16.0,"{""rings"": [[[-122.28001399933198, 47.719146999..."
1,US.BlockGroups,530330009.002,530330009002,USA,Query:US.BlockGroups,2.191,2.576,1,484.0,18.0,0.0,16.0,43.0,33.0,28.0,87.0,86.0,173.0,157438.0,201371.0,83094.0,97469646.0,97463751.0,484.0,84603.0,160487.0,236318.0,266698.0,248283.0,189735.0,118104.0,18.0,7.0,26.0,37.0,39.0,53.0,126.0,79.0,99.0,118861.0,68671709.0,141884.0,484.0,29.0,7.0,4.0,13.0,9.0,14.0,29.0,45.0,88.0,1027184.0,3525640.0,1706410000.0,484.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,9.0,55.0,55.0,112.0,1368304.0,1378993.0,407.0,306.0,0.0,0.0,53.0,3.0,92.0,51.0,340.0,329.0,868.0,641.0,600.0,145.0,46.0,30.0,2.0,51.0,21.0,33.0,71.0,11.0,49.0,2.0,4.0,1.0,14.0,5.0,39.0,42.0,0.0,28.0,0.0,0.0,6.0,"{""rings"": [[[-122.27643999897352, 47.712159999..."
2,US.BlockGroups,530330010.001,530330010001,USA,Query:US.BlockGroups,2.191,2.576,1,365.0,30.0,26.0,38.0,15.0,68.0,20.0,63.0,57.0,48.0,80917.0,114774.0,45535.0,41892358.0,41892358.0,365.0,46111.0,119495.0,139072.0,127931.0,107521.0,120643.0,93309.0,34.0,39.0,26.0,39.0,52.0,41.0,83.0,26.0,25.0,70252.0,31948492.0,87530.0,365.0,97.0,19.0,11.0,24.0,19.0,25.0,37.0,44.0,35.0,121626.0,554688.0,202461100.0,365.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,35.0,111.0,23.0,0.0,592342.0,597849.0,186.0,93.0,0.0,31.0,84.0,3.0,107.0,36.0,211.0,129.0,601.0,559.0,471.0,42.0,20.0,58.0,7.0,18.0,25.0,12.0,44.0,32.0,23.0,8.0,10.0,1.0,17.0,12.0,52.0,56.0,0.0,13.0,5.0,4.0,12.0,"{""rings"": [[[-122.2937919998337, 47.7119679995..."
3,US.BlockGroups,530330010.002,530330010002,USA,Query:US.BlockGroups,2.191,2.576,1,423.0,24.0,18.0,32.0,10.0,42.0,48.0,88.0,72.0,89.0,116390.0,153452.0,57176.0,64951670.0,64910405.0,423.0,71790.0,153356.0,137631.0,212072.0,185267.0,129133.0,95142.0,28.0,29.0,20.0,25.0,57.0,61.0,109.0,44.0,50.0,95608.0,47570949.0,112461.0,423.0,63.0,14.0,7.0,17.0,16.0,21.0,31.0,43.0,72.0,495597.0,2344729.0,991820400.0,423.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,20.0,139.0,74.0,62.0,716727.0,786489.0,309.0,214.0,7.0,11.0,36.0,24.0,103.0,63.0,202.0,335.0,781.0,652.0,633.0,96.0,44.0,74.0,17.0,60.0,1.0,7.0,63.0,46.0,47.0,19.0,9.0,16.0,4.0,3.0,60.0,44.0,0.0,10.0,0.0,5.0,8.0,"{""rings"": [[[-122.2908140002252, 47.7067909999..."
4,US.BlockGroups,530330011.001,530330011001,USA,Query:US.BlockGroups,2.191,2.576,1,572.0,28.0,6.0,22.0,46.0,145.0,59.0,103.0,70.0,93.0,90113.0,123594.0,52562.0,70695519.0,70695519.0,572.0,55967.0,130616.0,149904.0,135472.0,120098.0,110664.0,73967.0,30.0,17.0,28.0,95.0,121.0,72.0,115.0,46.0,48.0,73517.0,53699638.0,93880.0,572.0,87.0,26.0,14.0,35.0,26.0,40.0,60.0,94.0,80.0,245368.0,1228677.0,702803500.0,572.0,0.0,0.0,0.0,0.0,0.0,0.0,34.0,48.0,172.0,97.0,0.0,635901.0,643519.0,351.0,136.0,53.0,12.0,122.0,5.0,145.0,43.0,403.0,237.0,1020.0,791.0,767.0,122.0,23.0,117.0,50.0,12.0,0.0,0.0,55.0,17.0,42.0,0.0,9.0,82.0,0.0,28.0,34.0,65.0,0.0,42.0,0.0,25.0,44.0,"{""rings"": [[[-122.30163299958835, 47.706698999..."
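
Both the query result and the enriched frame report 478 rows, which suggests every requested block group came back. In a pipeline it can be worth checking this explicitly by comparing identifiers. A minimal sketch with stand-in frames follows. The helper `missing_ids` is hypothetical, the column names `AreaID` and `std_geography_id` are taken from the outputs above, and one identifier is deliberately left out of the stand-in output to show what a gap looks like.

```python
import pandas as pd

# Stand-in for the identifiers sent to enrich.
requested = pd.DataFrame({"AreaID": ["530330009001", "530330009002", "530330010001"]})

# Stand-in for the enriched output, with one block group deliberately omitted.
enriched = pd.DataFrame(
    {
        "std_geography_id": ["530330009001", "530330010001"],
        "tothh_cy": [427.0, 365.0],
    }
)


def missing_ids(requested_df: pd.DataFrame, enriched_df: pd.DataFrame) -> list:
    """Return requested identifiers that have no row in the enriched output."""
    returned = set(enriched_df["std_geography_id"])
    return sorted(set(requested_df["AreaID"]) - returned)


print(missing_ids(requested, enriched))
```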


## Calculate Variance

Variance can now be calculated for the retrieved variables to identify those with exceedingly high variance. Follow-on analysis can then use feature selection or feature reduction to address covariance between variables before modeling.

Just as in the previous notebook, we can evaluate the variance and select the top variables.

In [163]:
# get just the enrich value columns
enrich_cols = [c for c in enrich_df if c in usa.enrich_variables.name.str.lower().values]
enrich_df = enrich_df.set_index('std_geography_id').loc[:,enrich_cols]

# get top 20 highest variance columns
top20 = enrich_df.var(ddof=0).sort_values(ascending=False).iloc[:20]
top20.name = 'variance'

# add human readable names
ev = usa.enrich_variables
ev.index = ev.name.str.lower()
top20_df = ev.join(top20, how='right').loc[:,['name', 'alias', 'variance']]

top20_df.info()
top20_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 58 entries, aggdi_cy to pci_cy
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      58 non-null     object 
 1   alias     58 non-null     object 
 2   variance  58 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.8+ KB


Unnamed: 0,name,alias,variance
aggdi_cy,AGGDI_CY,2021 Aggregate Disposable Income,2509823000000000.0
aggdi_cy,AGGDI_CY,2021 Aggregate Disposable Income,2509823000000000.0
agghinc_cy,AGGHINC_CY,2021 Aggregate HH Income,5218319000000000.0
agghinc_cy,AGGHINC_CY,2021 Aggregate HH Income,5218319000000000.0
agghinc_cy,AGGHINC_CY,2021 Aggregate HH Income,5218319000000000.0


## Continuing Analysis

From here, a variety of techniques can be used, but with so many income and net worth variables, covariance needs to be addressed before any subsequent modeling steps. Using the Geoenrichment module dramatically streamlines getting to this point, though. It provides easy access to thousands of demographic variables for modeling and analysis directly in Python, making it easy to integrate with data engineering pipelines.
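
As one illustration of addressing the overlap between variables, highly correlated columns can be pruned before modeling. The sketch below uses made-up numbers and a hypothetical helper, `drop_correlated`, with an arbitrary threshold of 0.95. It is one simple option among many (principal component analysis is another), not the method used in this notebook.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop any column whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of columns is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Made-up values: avghinc_cy is an exact multiple of medhinc_cy, so it is redundant.
demo = pd.DataFrame(
    {
        "medhinc_cy": [50.0, 80.0, 120.0, 170.0],
        "avghinc_cy": [100.0, 160.0, 240.0, 340.0],
        "occfarm_cy": [3.0, 0.0, 4.0, 1.0],
    }
)

reduced = drop_correlated(demo)
print(list(reduced.columns))
```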