# Enrich Introspection

Whether working with locally installed data or connected to a GIS object instance, the first step in performing data enrichment is figuring out what is possible, what you *can* do. This begins with figuring out which countries are available to you, and what variables are available in the country you are interested in working in.

In [1]:
from arcgis.gis import GIS
from arcgis.geoenrichment import Country, get_countries

## Discovering Countries

Data enrichment is available across the entire ArcGIS Platform. As a result, there are a variety of data sources available both local and Web GIS. Local enrichment requires ArcGIS Pro with Business Analyst and local data to be installed. Enrichment with a Web GIS requires access to an ArcGIS Enterprise environment with Business Analyst or access to ArcGIS Online.

If working in an environment with ArcGIS Pro with the Business Analyst extension and local data, you can explicitly specify this when discovering countries.

In [2]:
get_countries('local')

Unnamed: 0,iso2,iso3,country_name,vintage,country_id,data_source_id
0,CA,CAN,Canada,2020,CAN_ESRI_2019,LOCAL;;CAN_ESRI_2019
1,US,USA,United States,2019,USA_ESRI_2019,LOCAL;;USA_ESRI_2019
2,US,USA,United States,2020,USA_ESRI_2020,LOCAL;;USA_ESRI_2020


The function is intelligent as well. If you do not explicitly set a source, the function first attempts to use local resources. If this is not possible, the function tries to utlize the `active_gis` a `GIS` object instance already instantiated in the session.

In [3]:
get_countries()  # uses local since I have this environment configured

Unnamed: 0,iso2,iso3,country_name,vintage,country_id,data_source_id
0,CA,CAN,Canada,2020,CAN_ESRI_2019,LOCAL;;CAN_ESRI_2019
1,US,USA,United States,2019,USA_ESRI_2019,LOCAL;;USA_ESRI_2019
2,US,USA,United States,2020,USA_ESRI_2020,LOCAL;;USA_ESRI_2020


Both of these are overridden when explicitly passing in a source so it is possible to interrogate any source explicitly.

In [4]:
gis = GIS()  # anonymous connection to ArcGIS Online
get_countries(gis)

Unnamed: 0,iso2,iso3,country_name,country_id,alt_name,continent
0,AL,ALB,Albania,ALB_MBR_2019,ALBANIA,Europe
1,DZ,DZA,Algeria,DZA_MBR_2019,ALGERIA,Africa
2,AD,AND,Andorra,AND_MBR_2019,ANDORRA,Europe
3,AO,AGO,Angola,AGO_MBR_2019,ANGOLA,Africa
4,AR,ARG,Argentina,ARG_MBR_2020,ARGENTINA,South America
...,...,...,...,...,...,...
131,UY,URY,Uruguay,URY_MBR_2020,URUGUAY,South America
132,UZ,UZB,Uzbekistan,UZB_MBR_2020,UZBEKISTAN,Asia
133,VE,VEN,Venezuela,VEN_MBR_2020,"VENEZUELA, BOLIVARIAN REPUBLIC OF",South America
134,VN,VNM,Vietnam,VNM_MBR_2020,VIET NAM,Asia


## Working in a Country

Currently the `Country` object is how most data is organized in Business Analyst. Consequently, to enrich data, the first step is creating a `Country` object instance using a source.

In [5]:
usa_gis = Country('USA', source=gis)

usa_gis

<Country - USA (GIS @ https://www.arcgis.com version:9.1) >

A `Country` instance, just like with country discovery, tries to use local and then an active GIS instance. Since I have a local environment configured, it references local data for the most current year's data available (2020).

In [6]:
usa_lcl = Country('USA')

usa_lcl

<Country - USA 2020 (local) >

With a local source, frequently due to modeling against older datasets, there is a need to explicitly access older data. This is supported using the `year` parameter.

In [7]:
usa_lcl_2019 = Country('USA', source='local', year=2019)

usa_lcl_2019

<Country - USA 2019 (local) >

## Enrich Variable Introspection

Once a country is instantiated, data available for enrichment can be discovered. Whether the source is local or a GIS instance, the process is the same. Also, there are some differences in the specific columns returned based on the source due to differences in what is returned from the underlying interrogation methods.

In [8]:
usa_gis.enrich_variables

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name,description,vintage,units
0,AGE0_CY,2020 Population Age <1,1yearincrements,1yearincrements.AGE0_CY,F1yearincrements_AGE0_CY,2020 Total Population Age <1 (Esri),2020,count
1,AGE1_CY,2020 Population Age 1,1yearincrements,1yearincrements.AGE1_CY,F1yearincrements_AGE1_CY,2020 Total Population Age 1 (Esri),2020,count
2,AGE2_CY,2020 Population Age 2,1yearincrements,1yearincrements.AGE2_CY,F1yearincrements_AGE2_CY,2020 Total Population Age 2 (Esri),2020,count
3,AGE3_CY,2020 Population Age 3,1yearincrements,1yearincrements.AGE3_CY,F1yearincrements_AGE3_CY,2020 Total Population Age 3 (Esri),2020,count
4,AGE4_CY,2020 Population Age 4,1yearincrements,1yearincrements.AGE4_CY,F1yearincrements_AGE4_CY,2020 Total Population Age 4 (Esri),2020,count
...,...,...,...,...,...,...,...,...
18394,MOEMEDYRMV,2019 Median Year Householder Moved In MOE (ACS...,yearmovedin,yearmovedin.MOEMEDYRMV,yearmovedin_MOEMEDYRMV,2019 Median Year Householder Moved into Unit M...,2015-2019,count
18395,RELMEDYRMV,2019 Median Year Householder Moved In REL (ACS...,yearmovedin,yearmovedin.RELMEDYRMV,yearmovedin_RELMEDYRMV,2019 Median Year Householder Moved into Unit R...,2015-2019,count
18396,ACSOWNER,2019 Owner Households (ACS 5-Yr),yearmovedin,yearmovedin.ACSOWNER,yearmovedin_ACSOWNER,2019 Owner Households (ACS 5-Yr),2015-2019,count
18397,MOEOWNER,2019 Owner Households MOE (ACS 5-Yr),yearmovedin,yearmovedin.MOEOWNER,yearmovedin_MOEOWNER,2019 Owner Households MOE (ACS 5-Yr),2015-2019,count


In [9]:
usa_lcl.enrich_variables

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,CHILD_CY,2020 Child Population,AgeDependency,AgeDependency.CHILD_CY,AgeDependency_CHILD_CY
1,WORKAGE_CY,2020 Working-Age Population,AgeDependency,AgeDependency.WORKAGE_CY,AgeDependency_WORKAGE_CY
2,SENIOR_CY,2020 Senior Population,AgeDependency,AgeDependency.SENIOR_CY,AgeDependency_SENIOR_CY
3,CHLDDEP_CY,2020 Child Dependency Ratio,AgeDependency,AgeDependency.CHLDDEP_CY,AgeDependency_CHLDDEP_CY
4,AGEDEP_CY,2020 Age Dependency Ratio,AgeDependency,AgeDependency.AGEDEP_CY,AgeDependency_AGEDEP_CY
...,...,...,...,...,...
16869,MOEMEDYRMV,2018 Median Year Householder Moved In MOE (ACS...,yearmovedin,yearmovedin.MOEMEDYRMV,yearmovedin_MOEMEDYRMV
16870,RELMEDYRMV,2018 Median Year Householder Moved In REL (ACS...,yearmovedin,yearmovedin.RELMEDYRMV,yearmovedin_RELMEDYRMV
16871,ACSOWNER,2018 Owner Households (ACS 5-Yr),yearmovedin,yearmovedin.ACSOWNER,yearmovedin_ACSOWNER
16872,MOEOWNER,2018 Owner Households MOE (ACS 5-Yr),yearmovedin,yearmovedin.MOEOWNER,yearmovedin_MOEOWNER


In [10]:
usa_lcl_2019.enrich_variables

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,CHILD_CY,2020 Child Population,AgeDependency,AgeDependency.CHILD_CY,AgeDependency_CHILD_CY
1,WORKAGE_CY,2020 Working-Age Population,AgeDependency,AgeDependency.WORKAGE_CY,AgeDependency_WORKAGE_CY
2,SENIOR_CY,2020 Senior Population,AgeDependency,AgeDependency.SENIOR_CY,AgeDependency_SENIOR_CY
3,CHLDDEP_CY,2020 Child Dependency Ratio,AgeDependency,AgeDependency.CHLDDEP_CY,AgeDependency_CHLDDEP_CY
4,AGEDEP_CY,2020 Age Dependency Ratio,AgeDependency,AgeDependency.AGEDEP_CY,AgeDependency_AGEDEP_CY
...,...,...,...,...,...
16869,MOEMEDYRMV,2018 Median Year Householder Moved In MOE (ACS...,yearmovedin,yearmovedin.MOEMEDYRMV,yearmovedin_MOEMEDYRMV
16870,RELMEDYRMV,2018 Median Year Householder Moved In REL (ACS...,yearmovedin,yearmovedin.RELMEDYRMV,yearmovedin_RELMEDYRMV
16871,ACSOWNER,2018 Owner Households (ACS 5-Yr),yearmovedin,yearmovedin.ACSOWNER,yearmovedin_ACSOWNER
16872,MOEOWNER,2018 Owner Households MOE (ACS 5-Yr),yearmovedin,yearmovedin.MOEOWNER,yearmovedin_MOEOWNER


### Usefulness of a DataFrame

Since a Pandas DataFrame, this facilitates quick discovery of different variable combinations.

#### Unique Variables

Since variables can be repeated due to being used in multiple Data Collections, we can easily remove duplicates using the functionality of the DataFrame.

In [11]:
usa_lcl.enrich_variables.drop_duplicates('name').reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,CHILD_CY,2020 Child Population,AgeDependency,AgeDependency.CHILD_CY,AgeDependency_CHILD_CY
1,WORKAGE_CY,2020 Working-Age Population,AgeDependency,AgeDependency.WORKAGE_CY,AgeDependency_WORKAGE_CY
2,SENIOR_CY,2020 Senior Population,AgeDependency,AgeDependency.SENIOR_CY,AgeDependency_SENIOR_CY
3,CHLDDEP_CY,2020 Child Dependency Ratio,AgeDependency,AgeDependency.CHLDDEP_CY,AgeDependency_CHLDDEP_CY
4,AGEDEP_CY,2020 Age Dependency Ratio,AgeDependency,AgeDependency.AGEDEP_CY,AgeDependency_AGEDEP_CY
...,...,...,...,...,...
14660,RELRMV1989,2018 RHHs/Moved In: 1989/Before REL (ACS 5-Yr),yearmovedin,yearmovedin.RELRMV1989,yearmovedin_RELRMV1989
14661,ACSMEDYRMV,2018 Median Year Householder Moved In (ACS 5-Yr),yearmovedin,yearmovedin.ACSMEDYRMV,yearmovedin_ACSMEDYRMV
14662,MOEMEDYRMV,2018 Median Year Householder Moved In MOE (ACS...,yearmovedin,yearmovedin.MOEMEDYRMV,yearmovedin_MOEMEDYRMV
14663,RELMEDYRMV,2018 Median Year Householder Moved In REL (ACS...,yearmovedin,yearmovedin.RELMEDYRMV,yearmovedin_RELMEDYRMV


#### Unique Current Year Variables

One of the things I like to do is grab all the current year demographics. Due to the naming convention, all of these variables' names end with `CY`. This enables us to quickly find them.

In [12]:
uniq_df = usa_lcl.enrich_variables.drop_duplicates('name').reset_index(drop=True)

cy_df = uniq_df[uniq_df['name'].str.endswith('CY')].reset_index(drop=True)

cy_df

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,CHILD_CY,2020 Child Population,AgeDependency,AgeDependency.CHILD_CY,AgeDependency_CHILD_CY
1,WORKAGE_CY,2020 Working-Age Population,AgeDependency,AgeDependency.WORKAGE_CY,AgeDependency_WORKAGE_CY
2,SENIOR_CY,2020 Senior Population,AgeDependency,AgeDependency.SENIOR_CY,AgeDependency_SENIOR_CY
3,CHLDDEP_CY,2020 Child Dependency Ratio,AgeDependency,AgeDependency.CHLDDEP_CY,AgeDependency_CHLDDEP_CY
4,AGEDEP_CY,2020 Age Dependency Ratio,AgeDependency,AgeDependency.AGEDEP_CY,AgeDependency_AGEDEP_CY
...,...,...,...,...,...
1318,NHSPASN_CY,2020 Non-Hispanic Asian Pop,raceandhispanicorigin,raceandhispanicorigin.NHSPASN_CY,raceandhispanicorigin_NHSPASN_CY
1319,NHSPPI_CY,2020 Non-Hispanic Pacific Islander Pop,raceandhispanicorigin,raceandhispanicorigin.NHSPPI_CY,raceandhispanicorigin_NHSPPI_CY
1320,NHSPOTH_CY,2020 Non-Hispanic Other Race Pop,raceandhispanicorigin,raceandhispanicorigin.NHSPOTH_CY,raceandhispanicorigin_NHSPOTH_CY
1321,NHSPMLT_CY,2020 Non-Hispanic Multiple Race Pop,raceandhispanicorigin,raceandhispanicorigin.NHSPMLT_CY,raceandhispanicorigin_NHSPMLT_CY


#### Current Year Sample Variables

Frequently, when demonstrating analysis ad-hoc quickly, I take advantage of Data Collections to grab a few useful current year variables.

In [13]:
var_df = usa_lcl.enrich_variables

smpl_df = uniq_df[
    (uniq_df['name'].str.endswith('CY'))
    & ((uniq_df['data_collection'] == 'incomebyage') | (uniq_df['data_collection'] == 'population'))
].drop_duplicates('name').reset_index(drop=True)

smpl_df

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,A15I0_CY,2020 HH Inc <$15000/HHr 15-24,incomebyage,incomebyage.A15I0_CY,incomebyage_A15I0_CY
1,A15I15_CY,2020 HH Inc $15K-24999/HHr 15-24,incomebyage,incomebyage.A15I15_CY,incomebyage_A15I15_CY
2,A15I25_CY,2020 HH Inc $25K-34999/HHr 15-24,incomebyage,incomebyage.A15I25_CY,incomebyage_A15I25_CY
3,A15I35_CY,2020 HH Inc $35K-49999/HHr 15-24,incomebyage,incomebyage.A15I35_CY,incomebyage_A15I35_CY
4,A15I50_CY,2020 HH Inc $50K-74999/HHr 15-24,incomebyage,incomebyage.A15I50_CY,incomebyage_A15I50_CY
...,...,...,...,...,...
90,IA75BASECY,2020 HH Income Base: HHr 75+,incomebyage,incomebyage.IA75BASECY,incomebyage_IA75BASECY
91,MEDHHR_CY,2020 Median Age of Householder,incomebyage,incomebyage.MEDHHR_CY,incomebyage_MEDHHR_CY
92,IA65UBASCY,2020 HHs by Inc Base: HHr 65+,incomebyage,incomebyage.IA65UBASCY,incomebyage_IA65UBASCY
93,MEDIA65UCY,2020 Median HH Inc: HHr 65+,incomebyage,incomebyage.MEDIA65UCY,incomebyage_MEDIA65UCY


## Enrichment

The Python API already includes a wrapper around the enrich REST endpoint. The ArcGIS Pro GeoProcessing Enrich Layer tool currently is the only way to enrich data using local resources. Each requires the variable input in a slightly different format.

### Local Enrichment

Enrichment varialbes must be concantenated into a single semicolon separated string for input into the Enrich Layer GeoProcessing tool. This can be accommplished easily enough using a little string concantenation in Python.

In [14]:
var_df = usa_lcl.enrich_variables

smpl_df = uniq_df[
    (uniq_df['name'].str.endswith('CY'))
    & ((uniq_df['data_collection'] == 'incomebyage') | (uniq_df['data_collection'] == 'population'))
].drop_duplicates('name').reset_index(drop=True)

enrich_str = ';'.join(smpl_df.enrich_name)

print(enrich_str)

incomebyage.A15I0_CY;incomebyage.A15I15_CY;incomebyage.A15I25_CY;incomebyage.A15I35_CY;incomebyage.A15I50_CY;incomebyage.A15I75_CY;incomebyage.A15I100_CY;incomebyage.A15I150_CY;incomebyage.A15I200_CY;incomebyage.MEDIA15_CY;incomebyage.AVGIA15_CY;incomebyage.AGGIA15_CY;incomebyage.IA15BASECY;incomebyage.A25I0_CY;incomebyage.A25I15_CY;incomebyage.A25I25_CY;incomebyage.A25I35_CY;incomebyage.A25I50_CY;incomebyage.A25I75_CY;incomebyage.A25I100_CY;incomebyage.A25I150_CY;incomebyage.A25I200_CY;incomebyage.MEDIA25_CY;incomebyage.AVGIA25_CY;incomebyage.AGGIA25_CY;incomebyage.IA25BASECY;incomebyage.A35I0_CY;incomebyage.A35I15_CY;incomebyage.A35I25_CY;incomebyage.A35I35_CY;incomebyage.A35I50_CY;incomebyage.A35I75_CY;incomebyage.A35I100_CY;incomebyage.A35I150_CY;incomebyage.A35I200_CY;incomebyage.MEDIA35_CY;incomebyage.AVGIA35_CY;incomebyage.AGGIA35_CY;incomebyage.IA35BASECY;incomebyage.A45I0_CY;incomebyage.A45I15_CY;incomebyage.A45I25_CY;incomebyage.A45I35_CY;incomebyage.A45I50_CY;incomebyage.A45

### Web GIS Enrichment

Enriching using the Python API calling the rest endpoint requires slightly different syntax, a semicolon separated list of variable names.

In [15]:
var_df = usa_gis.enrich_variables

smpl_df = uniq_df[
    (uniq_df['name'].str.endswith('CY'))
    & ((uniq_df['data_collection'] == 'incomebyage') | (uniq_df['data_collection'] == 'population'))
].drop_duplicates('name').reset_index(drop=True)

enrich_str = ';'.join(smpl_df.name)

print(enrich_str)

A15I0_CY;A15I15_CY;A15I25_CY;A15I35_CY;A15I50_CY;A15I75_CY;A15I100_CY;A15I150_CY;A15I200_CY;MEDIA15_CY;AVGIA15_CY;AGGIA15_CY;IA15BASECY;A25I0_CY;A25I15_CY;A25I25_CY;A25I35_CY;A25I50_CY;A25I75_CY;A25I100_CY;A25I150_CY;A25I200_CY;MEDIA25_CY;AVGIA25_CY;AGGIA25_CY;IA25BASECY;A35I0_CY;A35I15_CY;A35I25_CY;A35I35_CY;A35I50_CY;A35I75_CY;A35I100_CY;A35I150_CY;A35I200_CY;MEDIA35_CY;AVGIA35_CY;AGGIA35_CY;IA35BASECY;A45I0_CY;A45I15_CY;A45I25_CY;A45I35_CY;A45I50_CY;A45I75_CY;A45I100_CY;A45I150_CY;A45I200_CY;MEDIA45_CY;AVGIA45_CY;AGGIA45_CY;IA45BASECY;A55I0_CY;A55I15_CY;A55I25_CY;A55I35_CY;A55I50_CY;A55I75_CY;A55I100_CY;A55I150_CY;A55I200_CY;MEDIA55_CY;AVGIA55_CY;AGGIA55_CY;IA55BASECY;A65I0_CY;A65I15_CY;A65I25_CY;A65I35_CY;A65I50_CY;A65I75_CY;A65I100_CY;A65I150_CY;A65I200_CY;MEDIA65_CY;AVGIA65_CY;AGGIA65_CY;IA65BASECY;A75I0_CY;A75I15_CY;A75I25_CY;A75I35_CY;A75I50_CY;A75I75_CY;A75I100_CY;A75I150_CY;A75I200_CY;MEDIA75_CY;AVGIA75_CY;AGGIA75_CY;IA75BASECY;MEDHHR_CY;IA65UBASCY;MEDIA65UCY;AVGIA65UCY


## Next Step - Enrich Method

The next step is updating the Python API `enrich` method to support both local and remote resources, and handle the variable introspection dataframes as input. This will enable a single workflow for discovering what data is available and retrieving the data whether the data is being retrieved from a local or GIS source.