# Enrich Using Standard Geographies

Starting off, we import a few required Python resources. While there are quite a few, the import of note is `from arcgis.geoenrichment import Country`. We are going to use this object to discover available data and perform our analysis.

In [1]:
import os
from pathlib import Path
import sys

from arcgis.features import GeoAccessor
from arcgis.geoenrichment import Country
from arcgis.gis import GIS
import pandas as pd
from sklearn.pipeline import make_pipeline

Next, we need some test data to work with. Here, we access two files containing pickled exports of Spatially Enabled Pandas Data Frames. One contains postal codes (ZIP Codes) in Portland, Oregon, and the other contains block groups in Portland.

In [2]:
# paths to input data
dir_prj = Path.cwd().parent
dir_data = dir_prj/'data'
dir_raw = dir_data/'raw'

# import the two preprocessors from the examples
sys.path.append(str(dir_prj/'src'))
from ba_samples.preprocessing import EnrichStandardGeography, KeepOnlyEnrichColumns

# specifically, the data being used for this example - pickled dataframes
postal_codes_pth = dir_raw/'postal_codes.pkl'
block_groups_pth = dir_raw/'block_groups.pkl'
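
As a rough illustration of what a pickled Data Frame export involves, the following sketch writes a small toy table to a temporary file and reads it back. The toy values and the `toy_postal_codes.pkl` file name are made up for this example and are not the files used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy stand-in for a postal code table (hypothetical values); IDs are strings
toy_df = pd.DataFrame({"ID": ["83801", "83803"], "NAME": ["Athol", "Bayview"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name, written to a throwaway directory
    toy_pth = Path(tmp_dir) / "toy_postal_codes.pkl"

    # export, then load again just like pd.read_pickle is used in this notebook
    toy_df.to_pickle(toy_pth)
    restored_df = pd.read_pickle(toy_pth)

# the round trip keeps values and dtypes, so string identifiers stay strings
print(restored_df.equals(toy_df))
```

One reason to prefer pickle here is that column data types survive the round trip, which matters for identifiers that need to stay strings.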

For the first example, we are going to get a list of standard geography codes to use from the postal code data, just a list of zip codes.

In [3]:
postal_code_df = pd.read_pickle(postal_codes_pth)
postal_code_lst = list(postal_code_df['ID'])

print(postal_code_lst)

['83801', '83803', '83810', '83814', '83815', '83833', '83835', '83842', '83854', '83858', '83861', '83869', '83876']


For the second example, we are going to use the Data Frame directly and tell the `enrich` function which column to look in when we invoke it.

In [4]:
block_groups_df = pd.read_pickle(block_groups_pth)

block_groups_df.info()
block_groups_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       66 non-null     object
 1   NAME     66 non-null     object
 2   IDField  66 non-null     object
dtypes: object(3)
memory usage: 1.7+ KB


| | ID | NAME | IDField |
|---|---|---|---|
| 0 | 160550001001 | 160550001.001 | 17660 |
| 1 | 160550003011 | 160550003.011 | 17660 |
| 2 | 160550003012 | 160550003.012 | 17660 |
| 3 | 160550003021 | 160550003.021 | 17660 |
| 4 | 160550004011 | 160550004.011 | 17660 |


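Now, we need a connection to a Business Analyst source. This is accomplished by instantiating a `GIS` object instance. Here, the `'pro'` keyword tells the object to use the local ArcGIS Pro installation. To demonstrate the ability to use ArcGIS Online for geoenrichment instead, instantiate `GIS` with valid credentials, for example read from environment variables.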

In [5]:
gis_local = GIS('pro')
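
For the ArcGIS Online case, a minimal sketch of collecting credentials from environment variables might look like the following. The helper `gis_kwargs_from_env` and the variable names `ESRI_GIS_URL`, `ESRI_GIS_USERNAME` and `ESRI_GIS_PASSWORD` are assumptions made for this example, not names used by this notebook or required by the `arcgis` package.

```python
import os
from typing import Mapping, Optional


def gis_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect GIS connection settings from environment variables.

    The variable names are placeholders; use whatever names your environment defines.
    """
    env = os.environ if env is None else env

    kwargs = {
        "url": env.get("ESRI_GIS_URL", "https://www.arcgis.com"),
        "username": env.get("ESRI_GIS_USERNAME"),
        "password": env.get("ESRI_GIS_PASSWORD"),
    }

    # fail early with a readable message instead of a confusing login error
    missing = [key for key, val in kwargs.items() if not val]
    if missing:
        raise KeyError(f"missing connection settings: {', '.join(missing)}")

    return kwargs


# demonstrate with an explicit mapping instead of the real environment
demo_kwargs = gis_kwargs_from_env(
    {"ESRI_GIS_USERNAME": "demo_user", "ESRI_GIS_PASSWORD": "demo_pass"}
)
print(sorted(demo_kwargs))

# with the arcgis package available, the connection would then be created with
# gis_agol = GIS(**gis_kwargs_from_env())
```

Keeping the lookup in one small function makes it easy to see which settings are missing before any connection is attempted.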

## Standard Geography From a List

To enrich, we start by creating a `Country` object instance. As part of the constructor, we need to tell the object what Business Analyst source to use in the `gis` parameter. In this case, we are telling the object to use ArcGIS Pro with Business Analyst and local data for the United States.

In [6]:
usa_local = Country('usa', gis=gis_local)

usa_local

<Country - United States 2021 ('local')>

Next, we need to get some enrich variables to use. We can discover what is available using the `enrich_variables` property of the country object to retrieve a Pandas Data Frame of variables available for the country. From these tens of thousands of variables, we can prune this down to a manageable subset.

In [7]:
ev = usa_local.enrich_variables
kv = ev[
    (ev.data_collection.str.lower().str.contains('key'))
    & (ev.name.str.lower().str.endswith('cy'))
].reset_index(drop=True)

kv

| | name | alias | data_collection | enrich_name | enrich_field_name |
|---|---|---|---|---|---|
| 0 | TOTPOP_CY | 2021 Total Population | KeyUSFacts | KeyUSFacts.TOTPOP_CY | KeyUSFacts_TOTPOP_CY |
| 1 | GQPOP_CY | 2021 Group Quarters Population | KeyUSFacts | KeyUSFacts.GQPOP_CY | KeyUSFacts_GQPOP_CY |
| 2 | DIVINDX_CY | 2021 Diversity Index | KeyUSFacts | KeyUSFacts.DIVINDX_CY | KeyUSFacts_DIVINDX_CY |
| 3 | TOTHH_CY | 2021 Total Households | KeyUSFacts | KeyUSFacts.TOTHH_CY | KeyUSFacts_TOTHH_CY |
| 4 | AVGHHSZ_CY | 2021 Average Household Size | KeyUSFacts | KeyUSFacts.AVGHHSZ_CY | KeyUSFacts_AVGHHSZ_CY |
| 5 | MEDHINC_CY | 2021 Median Household Income | KeyUSFacts | KeyUSFacts.MEDHINC_CY | KeyUSFacts_MEDHINC_CY |
| 6 | AVGHINC_CY | 2021 Average Household Income | KeyUSFacts | KeyUSFacts.AVGHINC_CY | KeyUSFacts_AVGHINC_CY |
| 7 | PCI_CY | 2021 Per Capita Income | KeyUSFacts | KeyUSFacts.PCI_CY | KeyUSFacts_PCI_CY |
| 8 | TOTHU_CY | 2021 Total Housing Units | KeyUSFacts | KeyUSFacts.TOTHU_CY | KeyUSFacts_TOTHU_CY |
| 9 | OWNER_CY | 2021 Owner Occupied HUs | KeyUSFacts | KeyUSFacts.OWNER_CY | KeyUSFacts_OWNER_CY |
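
To make the filter logic easier to follow, here is a small sketch applying the same two conditions to a toy table. The rows in `toy_ev`, including the `EXAMPLE_CY` variable and `OtherCollection` collection, are invented for illustration and only mimic the `name` and `data_collection` columns shown above.

```python
import pandas as pd

# toy stand-in for the enrich_variables Data Frame (invented rows)
toy_ev = pd.DataFrame(
    {
        "name": ["TOTPOP_CY", "TOTPOP_FY", "MEDHINC_CY", "EXAMPLE_CY"],
        "data_collection": ["KeyUSFacts", "KeyUSFacts", "KeyUSFacts", "OtherCollection"],
    }
)

# condition 1: the data collection name contains 'key', ignoring case
is_key = toy_ev["data_collection"].str.lower().str.contains("key")

# condition 2: the variable name ends with 'cy' (current year), ignoring case
is_current_year = toy_ev["name"].str.lower().str.endswith("cy")

# keep only the rows satisfying both conditions and renumber the index
toy_kv = toy_ev[is_key & is_current_year].reset_index(drop=True)

print(toy_kv["name"].tolist())
```

Naming the two boolean masks separately makes it easier to inspect each condition on its own before combining them.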


Finally, we are ready to enrich the postal codes using the resources above.

In [8]:
pipe = make_pipeline(
    EnrichStandardGeography(usa_local, enrich_variables=kv, standard_geography_level='zip5', return_geometry=False),
    KeepOnlyEnrichColumns(usa_local, id_column='id_field', keep_geometry=False)
)

trn_df = pipe.fit_transform(postal_code_lst)

trn_df.info()
trn_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, 83801 to 83876
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   keyusfacts_totpop_cy   13 non-null     float64
 1   keyusfacts_gqpop_cy    13 non-null     float64
 2   keyusfacts_divindx_cy  13 non-null     float64
 3   keyusfacts_tothh_cy    13 non-null     float64
 4   keyusfacts_avghhsz_cy  13 non-null     float64
 5   keyusfacts_medhinc_cy  13 non-null     float64
 6   keyusfacts_avghinc_cy  13 non-null     float64
 7   keyusfacts_pci_cy      13 non-null     float64
 8   keyusfacts_tothu_cy    13 non-null     float64
 9   keyusfacts_owner_cy    13 non-null     float64
 10  keyusfacts_renter_cy   13 non-null     float64
 11  keyusfacts_vacant_cy   13 non-null     float64
 12  keyusfacts_medval_cy   13 non-null     float64
 13  keyusfacts_avgval_cy   13 non-null     float64
 14  keyusfacts_popgrw10cy  13 non-null     float64
 15  keyusf

| id_field | keyusfacts_totpop_cy | keyusfacts_gqpop_cy | keyusfacts_divindx_cy | keyusfacts_tothh_cy | keyusfacts_avghhsz_cy | keyusfacts_medhinc_cy | keyusfacts_avghinc_cy | keyusfacts_pci_cy | keyusfacts_tothu_cy | keyusfacts_owner_cy | keyusfacts_renter_cy | keyusfacts_vacant_cy | keyusfacts_medval_cy | keyusfacts_avgval_cy | keyusfacts_popgrw10cy | keyusfacts_hhgrw10cy | keyusfacts_famgrw10cy | keyusfacts_dpop_cy | keyusfacts_dpopwrk_cy | keyusfacts_dpopres_cy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83801 | 8497.0 | 0.0 | 13.6 | 3142.0 | 2.7 | 62413.0 | 78746.0 | 30156.0 | 3563.0 | 2660.0 | 482.0 | 421.0 | 382542.0 | 391184.0 | 2.07 | 2.09 | 1.92 | 6502.0 | 1962.0 | 4540.0 |
| 83803 | 656.0 | 0.0 | 10.5 | 341.0 | 1.92 | 43344.0 | 61732.0 | 29472.0 | 659.0 | 264.0 | 76.0 | 318.0 | 391333.0 | 403113.0 | 1.76 | 1.8 | 1.56 | 499.0 | 145.0 | 354.0 |
| 83810 | 1090.0 | 4.0 | 11.7 | 479.0 | 2.27 | 60088.0 | 89006.0 | 38340.0 | 579.0 | 406.0 | 72.0 | 100.0 | 437903.0 | 464383.0 | 0.94 | 0.96 | 0.76 | 762.0 | 246.0 | 516.0 |
| 83814 | 28546.0 | 609.0 | 21.2 | 12454.0 | 2.24 | 57487.0 | 80524.0 | 34960.0 | 15054.0 | 6855.0 | 5599.0 | 2600.0 | 377722.0 | 508589.0 | 1.86 | 1.91 | 1.72 | 38178.0 | 23011.0 | 15167.0 |
| 83815 | 38614.0 | 617.0 | 24.1 | 15247.0 | 2.49 | 56872.0 | 73431.0 | 29323.0 | 16393.0 | 9649.0 | 5598.0 | 1146.0 | 308198.0 | 374194.0 | 1.79 | 1.78 | 1.51 | 37345.0 | 15891.0 | 21454.0 |


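The enriched output can be passed on to scikit-learn preprocessing. Because `PowerTransformer` returns a NumPy array, the following subclass rebuilds a Data Frame with the original column names and index.

In [45]:
import numpy as np
from scipy.special import boxcox
from sklearn.preprocessing import PowerTransformer as _PowerTransformer
from sklearn.utils.validation import check_is_fitted

class PowerTransformer(_PowerTransformer):
    def transform(self, X: pd.DataFrame):

        check_is_fitted(self)

        # remember the labels before validation converts X to a NumPy array
        columns, index = X.columns, X.index

        # NOTE: _check_input, _yeo_johnson_transform and _scaler are private
        # scikit-learn members and may change between versions
        X = self._check_input(X, in_fit=False, check_positive=True, check_shape=True)

        transform_function = {
            "box-cox": boxcox,
            "yeo-johnson": self._yeo_johnson_transform,
        }[self.method]
        for i, lmbda in enumerate(self.lambdas_):
            with np.errstate(invalid="ignore"):  # hide NaN warnings
                X[:, i] = transform_function(X[:, i], lmbda)

        if self.standardize:
            X = self._scaler.transform(X)

        # put the array back into a Pandas DataFrame
        Xt_df = pd.DataFrame(X, columns=columns, index=index)

        return Xt_df

ss = PowerTransformer()

# the transformer has to be fitted before transform can be called
res = ss.fit(trn_df).transform(trn_df)

res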

In [36]:
from sklearn.preprocessing import PowerTransformer

class PowerTransformerDF(PowerTransformer):

    def transform(self, X: pd.DataFrame):

        # call the parent transform
        Xt_arr = super().transform(X)

        # put the array back into a Pandas DataFrame
        Xt_df = pd.DataFrame(Xt_arr, columns=X.columns, index=X.index)

        return Xt_df

In [37]:
pt = PowerTransformerDF(method='yeo-johnson')

res = pt.fit_transform(trn_df)

res

  loglike = -n_samples / 2 * np.log(x_trans.var())


array([[ 0.16855042, -1.11350457, -0.54217628,  0.11286603,  1.15896583,
         0.24090778, -0.18358702,  0.        , -0.04677022,  0.18431223,
        -0.06240979, -0.60302857,  0.06434243, -0.46008614,  0.50360584,
         0.50700755,  0.5253881 ,  0.07725866,  0.01926198,  0.13649968],
       [-1.19606641, -1.11350457, -1.47718988, -1.11876269, -1.77935893,
        -1.99284792, -1.68652843,  0.        , -1.06741558, -1.16113791,
        -0.96699852, -0.81475061,  0.17348101, -0.33543968, -0.08582943,
        -0.06192985, -0.18216615, -1.20874058, -1.22239441, -1.20442689],
       [-0.96889112, -0.14835307, -1.06063985, -0.95810529, -0.79336842,
        -0.07251376,  1.24661914,  0.        , -1.13045506, -0.95473799,
        -0.99126772, -1.43920363,  0.75550625,  0.26417162, -1.5413854 ,
        -1.59827938, -1.62841907, -1.02855037, -0.9930183 , -1.0430459 ],
       [ 1.04167961,  1.45241314,  0.63891596,  1.13405084, -0.89616209,
        -0.40967738,  0.03227582,  0.        ,  

In [51]:
res.shape == trn_df.shape

True

(13, 20)
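
Note the result of `fit_transform` above is still a NumPy array, even though the shapes match. As far as I can tell, this is because `PowerTransformer.fit_transform` computes its result internally rather than calling the overridden `transform`. The sketch below illustrates the pattern with `ToyScaler`, a made-up stand-in class (not scikit-learn code), and shows one way around it: override both methods.

```python
import numpy as np
import pandas as pd


class ToyScaler:
    """Made-up stand-in for a transformer whose fit_transform bypasses transform."""

    def fit(self, X):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def fit_transform(self, X):
        # computes the result directly and never calls self.transform
        self.fit(X)
        return np.asarray(X, dtype=float) - self.mean_


class ToyScalerDF(ToyScaler):
    """Return a Data Frame from both transform and fit_transform."""

    @staticmethod
    def _to_frame(arr, X):
        # reuse the column names and index of the input Data Frame
        return pd.DataFrame(arr, columns=X.columns, index=X.index)

    def transform(self, X):
        return self._to_frame(super().transform(X), X)

    def fit_transform(self, X):
        return self._to_frame(super().fit_transform(X), X)


toy_df = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]}, index=["83801", "83803"])
toy_res = ToyScalerDF().fit_transform(toy_df)

print(type(toy_res).__name__)
```

Newer scikit-learn releases (1.2 and later) also offer `set_output(transform="pandas")` on transformers, which may make a subclass like this unnecessary.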

## Standard Geography From a Data Frame

For the sake of efficiency, it is possible to provide the dataframe directly as input to the `enrich` method, and use the `standard_geography_id_column` parameter to specify the column. Also, in the following example, we are going to use a local source (ArcGIS Pro with Business Analyst and the USA data pack) and consolidate the steps above.

Previously, we loaded a dataframe from disk to use for enrichment, consisting of all the block groups in Portland, OR. The values we will be using are located in the `ID` column.

In [13]:
block_groups_df.head()

| | ID | NAME | IDField |
|---|---|---|---|
| 0 | 160550001001 | 160550001.001 | 17660 |
| 1 | 160550003011 | 160550003.011 | 17660 |
| 2 | 160550003012 | 160550003.012 | 17660 |
| 3 | 160550003021 | 160550003.021 | 17660 |
| 4 | 160550004011 | 160550004.011 | 17660 |
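
Since the identifiers are matched as text, it can be worth checking the `ID` column before enriching. The sketch below is one possible check, assuming 12-character block group identifiers like the ones shown above. The rows in `toy_bg_df` are invented, including one value stored as an integer and one that has lost its leading zero.

```python
import pandas as pd

# toy block group identifiers (invented): one integer, one missing its leading zero
toy_bg_df = pd.DataFrame({"ID": [160550001001, "160550003011", "60371234001"]})

# cast everything to text and left-pad with zeros to 12 characters
toy_bg_df["ID"] = toy_bg_df["ID"].astype(str).str.zfill(12)

# confirm every identifier now consists of exactly 12 digits
all_valid = toy_bg_df["ID"].str.fullmatch(r"\d{12}").all()

print(toy_bg_df["ID"].tolist(), all_valid)
```

Lost leading zeros are a common side effect of reading identifiers from formats that infer numeric types, so a check like this is cheap insurance.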


Next, we create a `Country` object instance using the `'pro'` keyword to indicate using the local ArcGIS Pro instance.

In [14]:
usa_local = Country('usa', gis=GIS('pro'))

usa_local

<Country - United States 2021 ('local')>

Just as before, we can filter to just the key current year variables. Although in this case they are identical, there are small variations in available variables between local and ArcGIS Online, so I always apply the filter on variables from the same source I will be using for enrichment.

In [15]:
ev = usa_local.enrich_variables
kv = ev[
    (ev.data_collection.str.lower().str.contains('key'))
    & (ev.name.str.lower().str.endswith('cy'))
].reset_index(drop=True)

kv

| | name | alias | data_collection | enrich_name | enrich_field_name |
|---|---|---|---|---|---|
| 0 | TOTPOP_CY | 2021 Total Population | KeyUSFacts | KeyUSFacts.TOTPOP_CY | KeyUSFacts_TOTPOP_CY |
| 1 | GQPOP_CY | 2021 Group Quarters Population | KeyUSFacts | KeyUSFacts.GQPOP_CY | KeyUSFacts_GQPOP_CY |
| 2 | DIVINDX_CY | 2021 Diversity Index | KeyUSFacts | KeyUSFacts.DIVINDX_CY | KeyUSFacts_DIVINDX_CY |
| 3 | TOTHH_CY | 2021 Total Households | KeyUSFacts | KeyUSFacts.TOTHH_CY | KeyUSFacts_TOTHH_CY |
| 4 | AVGHHSZ_CY | 2021 Average Household Size | KeyUSFacts | KeyUSFacts.AVGHHSZ_CY | KeyUSFacts_AVGHHSZ_CY |
| 5 | MEDHINC_CY | 2021 Median Household Income | KeyUSFacts | KeyUSFacts.MEDHINC_CY | KeyUSFacts_MEDHINC_CY |
| 6 | AVGHINC_CY | 2021 Average Household Income | KeyUSFacts | KeyUSFacts.AVGHINC_CY | KeyUSFacts_AVGHINC_CY |
| 7 | PCI_CY | 2021 Per Capita Income | KeyUSFacts | KeyUSFacts.PCI_CY | KeyUSFacts_PCI_CY |
| 8 | TOTHU_CY | 2021 Total Housing Units | KeyUSFacts | KeyUSFacts.TOTHU_CY | KeyUSFacts_TOTHU_CY |
| 9 | OWNER_CY | 2021 Owner Occupied HUs | KeyUSFacts | KeyUSFacts.OWNER_CY | KeyUSFacts_OWNER_CY |


Just double checking, we can quickly see the correct value we need to use for United States Census Block Groups is `block_groups`.

In [16]:
usa_local.levels

| | level_name | alias | level_id | id_field | name_field | singular_name | plural_name | admin_level |
|---|---|---|---|---|---|---|---|---|
| 0 | block_groups | Block Groups | US.BlockGroups | ID | NAME | Block Group | Block Groups | Admin11 |
| 1 | tracts | Census Tracts | US.Tracts | ID | NAME | Census Tract | Census Tracts | Admin10 |
| 2 | places | Cities and Towns (Places) | US.Places | ID | NAME | Place | Places | Admin9 |
| 3 | zip5 | ZIP Codes | US.ZIP5 | ID | NAME | ZIP Code | ZIP Codes | Admin4 |
| 4 | csd | County Subdivisions | US.CSD | ID | NAME | County Subdivision | County Subdivisions | Admin7 |
| 5 | counties | Counties | US.Counties | ID | NAME | County | Counties | Admin3 |
| 6 | cbsa | CBSAs | US.CBSA | ID | NAME | CBSA | CBSAs | Admin5 |
| 7 | cd | Congressional Districts | US.CD | ID | NAME | Congressional District | Congressional Districts | Admin8 |
| 8 | dma | DMAs | US.DMA | ID | NAME | DMA | DMAs | Admin6 |
| 9 | states | States | US.States | ID | NAME | State | States | Admin2 |


Now, we are ready to enrich using ArcGIS Pro with Business Analyst and the USA data pack.

In [None]:
bg_enrich_df = usa_local.enrich(block_groups_df, enrich_variables=kv, standard_geography_level='block_groups', standard_geography_id_column='ID')

bg_enrich_df.info()
bg_enrich_df.head()

## Consolidating into a Function

All of the above steps, if consolidated into a succinct function containing only the logic necessary to run, can look like the following. Although some of the imports below are redundant, I include everything needed in this cell so you can easily copy it and modify it to suit your needs.

In [16]:
from typing import Iterable, Optional, Union

from arcgis.gis import GIS
from arcgis.geoenrichment import Country
import pandas as pd

def enrich_std_geo(
    country: Country, 
    geographies: Iterable, 
    std_geo_lvl: str, 
    enrich_vars: Optional[Iterable] = None, 
    std_geo_col: Optional[str] = None
) -> pd.DataFrame:
    
    # if no enrich variables provided, get the current year key variables
    if enrich_vars is None:
        enrich_vars = country.enrich_variables[
            (country.enrich_variables.data_collection.str.lower().str.contains('key'))
            & (country.enrich_variables.name.str.lower().str.endswith('cy'))
        ].reset_index(drop=True)
        
    # invoke enrich method against standard geographies
    enrich_df = country.enrich(geographies, 
                               enrich_variables=enrich_vars, 
                               standard_geography_level=std_geo_lvl, 
                               standard_geography_id_column=std_geo_col)
    
    return enrich_df

## Switching Sources

Using the above function, we can accomplish all the steps detailed above in one succinct step. As is evident below, switching between sources requires little more than swapping out the `Country` instance. Everything else works the same, and _this is the point_. This API is designed to enable you to easily work with Business Analyst data.

In [17]:
local_fn_enrich_df = enrich_std_geo(usa_local, geographies=postal_code_lst, std_geo_lvl='zip5')

local_fn_enrich_df.info()
local_fn_enrich_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   id_field               13 non-null     object  
 1   area_desc              13 non-null     object  
 2   ta_desc                13 non-null     object  
 3   names                  13 non-null     object  
 4   has_data               13 non-null     int64   
 5   aggregation_method     13 non-null     object  
 6   keyusfacts_totpop_cy   13 non-null     float64 
 7   keyusfacts_gqpop_cy    13 non-null     float64 
 8   keyusfacts_divindx_cy  13 non-null     float64 
 9   keyusfacts_tothh_cy    13 non-null     float64 
 10  keyusfacts_avghhsz_cy  13 non-null     float64 
 11  keyusfacts_medhinc_cy  13 non-null     float64 
 12  keyusfacts_avghinc_cy  13 non-null     float64 
 13  keyusfacts_pci_cy      13 non-null     float64 
 14  keyusfacts_tothu_cy    13 non-null     float

| | id_field | area_desc | ta_desc | names | has_data | aggregation_method | keyusfacts_totpop_cy | keyusfacts_gqpop_cy | keyusfacts_divindx_cy | keyusfacts_tothh_cy | ... | keyusfacts_vacant_cy | keyusfacts_medval_cy | keyusfacts_avgval_cy | keyusfacts_popgrw10cy | keyusfacts_hhgrw10cy | keyusfacts_famgrw10cy | keyusfacts_dpop_cy | keyusfacts_dpopwrk_cy | keyusfacts_dpopres_cy | SHAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83801 | 83801 | TA from geography Layer: 83801 | Athol | 1 | BlockApportionment:US.BlockGroups;PointsLayer:... | 8497.0 | 0.0 | 13.6 | 3142.0 | ... | 421.0 | 382542.0 | 391184.0 | 2.07 | 2.09 | 1.92 | 6502.0 | 1962.0 | 4540.0 | {"rings": [[[-116.69531754199994, 48.075561486... |
| 1 | 83803 | 83803 | TA from geography Layer: 83803 | Bayview | 1 | BlockApportionment:US.BlockGroups;PointsLayer:... | 656.0 | 0.0 | 10.5 | 341.0 | ... | 318.0 | 391333.0 | 403113.0 | 1.76 | 1.8 | 1.56 | 499.0 | 145.0 | 354.0 | {"rings": [[[-116.49664843299996, 47.991467237... |
| 2 | 83810 | 83810 | TA from geography Layer: 83810 | Cataldo | 1 | BlockApportionment:US.BlockGroups;PointsLayer:... | 1090.0 | 4.0 | 11.7 | 479.0 | ... | 100.0 | 437903.0 | 464383.0 | 0.94 | 0.96 | 0.76 | 762.0 | 246.0 | 516.0 | {"rings": [[[-116.40476999999998, 47.612660000... |
| 3 | 83814 | 83814 | TA from geography Layer: 83814 | Coeur D Alene | 1 | BlockApportionment:US.BlockGroups;PointsLayer:... | 28546.0 | 609.0 | 21.2 | 12454.0 | ... | 2600.0 | 377722.0 | 508589.0 | 1.86 | 1.91 | 1.72 | 38178.0 | 23011.0 | 15167.0 | {"rings": [[[-116.58028267699996, 47.744349415... |
| 4 | 83815 | 83815 | TA from geography Layer: 83815 | Coeur D Alene | 1 | BlockApportionment:US.BlockGroups;PointsLayer:... | 38614.0 | 617.0 | 24.1 | 15247.0 | ... | 1146.0 | 308198.0 | 374194.0 | 1.79 | 1.78 | 1.51 | 37345.0 | 15891.0 | 21454.0 | {"rings": [[[-116.69478052499994, 47.725640161... |


In [18]:
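# Country backed by ArcGIS Online; the environment variable names are placeholders for your own settings
gis_agol = GIS(url=os.getenv('ESRI_GIS_URL'), username=os.getenv('ESRI_GIS_USERNAME'), password=os.getenv('ESRI_GIS_PASSWORD'))
usa_agol = Country('usa', gis=gis_agol)

agol_fn_enrich_df = enrich_std_geo(usa_agol, geographies=postal_code_lst, std_geo_lvl='zip5')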

agol_fn_enrich_df.info()
agol_fn_enrich_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 29 columns):
 #   Column                             Non-Null Count  Dtype   
---  ------                             --------------  -----   
 0   std_geography_level                13 non-null     object  
 1   std_geography_name                 13 non-null     object  
 2   std_geography_id                   13 non-null     object  
 3   source_country                     13 non-null     object  
 4   aggregation_method                 13 non-null     object  
 5   population_to_polygon_size_rating  13 non-null     float64 
 6   apportionment_confidence           13 non-null     float64 
 7   has_data                           13 non-null     int64   
 8   totpop_cy                          13 non-null     int64   
 9   gqpop_cy                           13 non-null     int64   
 10  divindx_cy                         13 non-null     float64 
 11  tothh_cy                           13 non-null 

| | std_geography_level | std_geography_name | std_geography_id | source_country | aggregation_method | population_to_polygon_size_rating | apportionment_confidence | has_data | totpop_cy | gqpop_cy | ... | vacant_cy | medval_cy | avgval_cy | popgrw10_cy | hhgrw10_cy | famgrw10_cy | dpop_cy | dpopwrk_cy | dpopres_cy | SHAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US.ZIP5 | Athol | 83801 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | 8497 | 0 | ... | 421 | 382542 | 391184 | 2.07 | 2.09 | 1.92 | 6502 | 1962 | 4540 | {"rings": [[[-116.69531754168734, 48.075561485... |
| 1 | US.ZIP5 | Bayview | 83803 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | 656 | 0 | ... | 318 | 391333 | 403113 | 1.76 | 1.8 | 1.56 | 499 | 145 | 354 | {"rings": [[[-116.49664843313857, 47.991467236... |
| 2 | US.ZIP5 | Cataldo | 83810 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | 1090 | 4 | ... | 100 | 439516 | 465172 | 0.94 | 0.96 | 0.76 | 762 | 246 | 516 | {"rings": [[[-116.40476999992698, 47.612659999... |
| 3 | US.ZIP5 | Coeur D Alene | 83814 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | 28546 | 609 | ... | 2600 | 377722 | 508589 | 1.86 | 1.91 | 1.72 | 38109 | 22942 | 15167 | {"rings": [[[-116.58028267664488, 47.744349414... |
| 4 | US.ZIP5 | Coeur D Alene | 83815 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | 38614 | 617 | ... | 1146 | 308198 | 374194 | 1.79 | 1.78 | 1.51 | 37345 | 15891 | 21454 | {"rings": [[[-116.69790643782714, 47.729762324... |
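
The two outputs use different column names (for instance `keyusfacts_totpop_cy` versus `totpop_cy`) and a few values differ slightly between sources. One way to compare them, sketched below, is to bring both to a common index and common column names and subtract. The small tables here contain a handful of values copied from the outputs above; stripping the `keyusfacts_` prefix is an assumption that happens to work for these columns and may not hold for every variable.

```python
import pandas as pd

# a few values copied from the local (ArcGIS Pro) output above
local_df = pd.DataFrame(
    {
        "id_field": ["83801", "83810", "83814"],
        "keyusfacts_totpop_cy": [8497.0, 1090.0, 28546.0],
        "keyusfacts_medval_cy": [382542.0, 437903.0, 377722.0],
        "keyusfacts_dpop_cy": [6502.0, 762.0, 38178.0],
    }
)

# the same geographies and variables copied from the ArcGIS Online output above
agol_df = pd.DataFrame(
    {
        "std_geography_id": ["83801", "83810", "83814"],
        "totpop_cy": [8497, 1090, 28546],
        "medval_cy": [382542, 439516, 377722],
        "dpop_cy": [6502, 762, 38109],
    }
)

# index both by geography identifier and align the column names
local_cmp = local_df.set_index("id_field").rename(
    columns=lambda col: col.replace("keyusfacts_", "")
)
agol_cmp = agol_df.set_index("std_geography_id").astype(float)

# difference per geography and variable (ArcGIS Online minus local)
diff_df = agol_cmp - local_cmp

# keep only the geographies where at least one variable differs
changed = diff_df[(diff_df != 0).any(axis=1)]

print(changed)
```

For these sample rows, only the median home value for 83810 and the daytime population for 83814 differ, which is consistent with the two sources carrying slightly different data vintages or apportionment methods, though the outputs alone do not say which.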
