# Modeling Module

The modeling module provides a single Python API that lets data scientists take advantage of the capabilities of ArcGIS as part of demographic and geographic data science workflows.

This first cell is largely a bunch of bubblegum and duct tape tying all the things together so I can use this notebook for prototyping and testing.

In [1]:
import importlib
import os
from pathlib import Path
import re
import sys

from dotenv import load_dotenv, find_dotenv
import pandas as pd

# load the "autoreload" extension so that code can change, & always reload modules so that as you change code in src, it gets loaded
%load_ext autoreload
%autoreload 2

# load environment variables from .env
load_dotenv(find_dotenv())

dir_src = Path.cwd().parent.parent/'src'

sys.path.insert(0, str(dir_src))
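
Re-running the setup cell inserts the same directory into `sys.path` again each time. The following is a minimal sketch of a guard against that; `add_to_sys_path` is a hypothetical helper name, not part of the `modeling` module, and it assumes the only goal is to keep a single entry at the front of the path.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> str:
    """Put ``directory`` at the front of sys.path exactly once (hypothetical helper)."""
    dir_str = str(directory.resolve())
    # drop any earlier copy so re-running the cell does not stack duplicates
    sys.path[:] = [p for p in sys.path if p != dir_str]
    sys.path.insert(0, dir_str)
    return dir_str


# same relative location the notebook uses for its source directory
dir_src = Path.cwd().parent.parent / 'src'
added = add_to_sys_path(dir_src)
added_again = add_to_sys_path(dir_src)
```

Calling the helper twice leaves one entry for the directory, still in first position.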

In [2]:
from arcgis.gis import GIS
from modeling import get_countries, Country

In [4]:
usa = Country('USA')

usa

<modeling.Country - USA (local 2020)>

Next, it is extremely useful to be able to discover the hierarchical geographic levels available in a country.

In [5]:
usa.levels

Unnamed: 0,geo_name,geo_alias,col_id,col_name,feature_class_path
0,block_groups,Block Groups,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
1,census_tracts,Census Tracts,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
2,cities_and_towns_places,Cities and Towns (Places),ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
3,zip_codes,ZIP Codes,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
4,county_subdivisions,County Subdivisions,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
5,counties,Counties,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
6,cbsas,CBSAs,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
7,congressional_districts,Congressional Districts,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
8,dmas,DMAs,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
9,states,States,ID,NAME,D:\arcgis\ba_data\us_2020\Data\Demographic Dat...
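
Because the output of `levels` is displayed as an ordinary pandas dataframe, a single level can be looked up by its `geo_name`. The sketch below rebuilds a few rows of the output above by hand (leaving out the truncated `feature_class_path` column) rather than calling the module, so it runs on its own.

```python
import pandas as pd

# hand-copied subset of the levels output shown above; feature_class_path is left out
levels = pd.DataFrame(
    {
        'geo_name': ['block_groups', 'census_tracts', 'counties', 'states'],
        'geo_alias': ['Block Groups', 'Census Tracts', 'Counties', 'States'],
        'col_id': ['ID', 'ID', 'ID', 'ID'],
        'col_name': ['NAME', 'NAME', 'NAME', 'NAME'],
    }
)

# index by geo_name so one level can be retrieved by its machine-friendly name
county_level = levels.set_index('geo_name').loc['counties']

county_alias = county_level['geo_alias']
county_id_column = county_level['col_id']
```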


It is also extremely useful to be able to discover the enrichment variables, along with all the ways to reference them.

In [11]:
ev = usa.enrich_variables

ev

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,AGE0_CY,2020 Population Age <1,1yearincrements,1yearincrements.AGE0_CY,F1yearincrements_AGE0_CY
1,AGE1_CY,2020 Population Age 1,1yearincrements,1yearincrements.AGE1_CY,F1yearincrements_AGE1_CY
2,AGE2_CY,2020 Population Age 2,1yearincrements,1yearincrements.AGE2_CY,F1yearincrements_AGE2_CY
3,AGE3_CY,2020 Population Age 3,1yearincrements,1yearincrements.AGE3_CY,F1yearincrements_AGE3_CY
4,AGE4_CY,2020 Population Age 4,1yearincrements,1yearincrements.AGE4_CY,F1yearincrements_AGE4_CY
...,...,...,...,...,...
9222,ACSRMV2000,2018 RHHs/Moved In: 2000-2009 (ACS 5-Yr),yearmovedin,yearmovedin.ACSRMV2000,yearmovedin_ACSRMV2000
9223,ACSRMV1990,2018 RHHs/Moved In: 1990-1999 (ACS 5-Yr),yearmovedin,yearmovedin.ACSRMV1990,yearmovedin_ACSRMV1990
9224,ACSRMV1989,2018 RHHs/Moved In: 1989/Before (ACS 5-Yr),yearmovedin,yearmovedin.ACSRMV1989,yearmovedin_ACSRMV1989
9225,ACSMEDYRMV,2018 Median Year Householder Moved In (ACS 5-Yr),yearmovedin,yearmovedin.ACSMEDYRMV,yearmovedin_ACSMEDYRMV
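
In the rows shown above, the different ways of referencing a variable follow a visible pattern: `enrich_name` joins the data collection and the name with a period, and `enrich_field_name` joins them with an underscore, gaining an `F` prefix when the collection starts with a digit. The sketch below reproduces that pattern for illustration only. `reference_names` is a hypothetical helper, and the rule is inferred from the displayed rows, not taken from the module's source.

```python
def reference_names(data_collection: str, name: str) -> dict:
    """Rebuild the reference strings seen in the enrich_variables output (inferred pattern)."""
    enrich_name = f'{data_collection}.{name}'
    enrich_field_name = f'{data_collection}_{name}'
    # field names shown above gain an 'F' prefix when they would start with a digit
    if enrich_field_name[0].isdigit():
        enrich_field_name = f'F{enrich_field_name}'
    return {'enrich_name': enrich_name, 'enrich_field_name': enrich_field_name}


age_refs = reference_names('1yearincrements', 'AGE0_CY')
moved_refs = reference_names('yearmovedin', 'ACSRMV2000')
```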


For instance, if we are only interested in key variables for the current year (2020 in this dataset), we can reduce the table to just these variables. This is typically a very useful way to get started and see what a preliminary exploratory modeling effort can discover.

In [12]:
ev[
    (ev.data_collection.str.lower().str.contains('key'))  # get the key variables
    & (ev.name.str.endswith('CY'))                     # just current year (2020) variables
].reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,TOTPOP_CY,2020 Total Population,KeyUSFacts,KeyUSFacts.TOTPOP_CY,KeyUSFacts_TOTPOP_CY
1,GQPOP_CY,2020 Group Quarters Population,KeyUSFacts,KeyUSFacts.GQPOP_CY,KeyUSFacts_GQPOP_CY
2,DIVINDX_CY,2020 Diversity Index,KeyUSFacts,KeyUSFacts.DIVINDX_CY,KeyUSFacts_DIVINDX_CY
3,TOTHH_CY,2020 Total Households,KeyUSFacts,KeyUSFacts.TOTHH_CY,KeyUSFacts_TOTHH_CY
4,AVGHHSZ_CY,2020 Average Household Size,KeyUSFacts,KeyUSFacts.AVGHHSZ_CY,KeyUSFacts_AVGHHSZ_CY
5,MEDHINC_CY,2020 Median Household Income,KeyUSFacts,KeyUSFacts.MEDHINC_CY,KeyUSFacts_MEDHINC_CY
6,AVGHINC_CY,2020 Average Household Income,KeyUSFacts,KeyUSFacts.AVGHINC_CY,KeyUSFacts_AVGHINC_CY
7,PCI_CY,2020 Per Capita Income,KeyUSFacts,KeyUSFacts.PCI_CY,KeyUSFacts_PCI_CY
8,TOTHU_CY,2020 Total Housing Units,KeyUSFacts,KeyUSFacts.TOTHU_CY,KeyUSFacts_TOTHU_CY
9,OWNER_CY,2020 Owner Occupied HUs,KeyUSFacts,KeyUSFacts.OWNER_CY,KeyUSFacts_OWNER_CY
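
The filter above can be wrapped in a small function so the same selection is reusable across countries. This is a sketch run against a hand-built sample of rows shown in this notebook. `current_year_key_variables` is a hypothetical name, not a function in the `modeling` module.

```python
import pandas as pd


def current_year_key_variables(ev: pd.DataFrame, collection_pattern: str = 'key') -> pd.DataFrame:
    """Keep rows whose data collection contains the pattern and whose name ends in CY."""
    in_collection = ev['data_collection'].str.lower().str.contains(collection_pattern, regex=False)
    is_current_year = ev['name'].str.endswith('CY')
    return ev[in_collection & is_current_year].reset_index(drop=True)


# small sample copied from the outputs above, standing in for usa.enrich_variables
sample_ev = pd.DataFrame(
    {
        'name': ['AGE0_CY', 'TOTPOP_CY', 'MEDHINC_CY', 'ACSMEDYRMV'],
        'data_collection': ['1yearincrements', 'KeyUSFacts', 'KeyUSFacts', 'yearmovedin'],
    }
)

key_names = current_year_key_variables(sample_ev)['name'].tolist()
```

Passing `regex=False` treats the pattern as a literal substring, which is all this filter needs.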


## Using ArcGIS Online

While it is common to work with a primary country locally, many large corporations that have a successful model running for their primary country, typically the United States, then want to start experimenting with their international markets. The `modeling` module provides a single interface to explore these hypotheticals without having to install the data locally or learn a new API or interface.

Using the `GIS` object instance connected to ArcGIS Online, we can investigate what countries are available.

In [8]:
agol = GIS(
    os.getenv('ESRI_GIS_URL'),
    username=os.getenv('ESRI_GIS_USERNAME'),
    password=os.getenv('ESRI_GIS_PASSWORD')
)

cntry_df = get_countries(agol)

cntry_df

Unnamed: 0,iso2,iso3,country_name,country_id,alt_name,continent
0,AL,ALB,Albania,ALB_MBR_2019,ALBANIA,Europe
1,DZ,DZA,Algeria,DZA_MBR_2019,ALGERIA,Africa
2,AD,AND,Andorra,AND_MBR_2019,ANDORRA,Europe
3,AO,AGO,Angola,AGO_MBR_2019,ANGOLA,Africa
4,AR,ARG,Argentina,ARG_MBR_2020,ARGENTINA,South America
...,...,...,...,...,...,...
131,UY,URY,Uruguay,URY_MBR_2020,URUGUAY,South America
132,UZ,UZB,Uzbekistan,UZB_MBR_2020,UZBEKISTAN,Asia
133,VE,VEN,Venezuela,VEN_MBR_2020,"VENEZUELA, BOLIVARIAN REPUBLIC OF",South America
134,VN,VNM,Vietnam,VNM_MBR_2020,VIET NAM,Asia
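
The connection cell reads its URL and credentials from environment variables, and `os.getenv` returns `None` for any that are missing. Below is a sketch of a guard that fails early with a clear message. `require_env` is a hypothetical helper, the variable names are the ones used in this notebook, and the values in the demonstration dictionary are placeholders.

```python
import os


def require_env(names, environ=None) -> dict:
    """Return the requested variables, raising if any are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    values = {name: environ.get(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise EnvironmentError(f"missing environment variables: {', '.join(missing)}")
    return values


# demonstration against a plain dict so the real environment is left untouched
fake_environ = {
    'ESRI_GIS_URL': 'https://www.arcgis.com',  # placeholder value
    'ESRI_GIS_USERNAME': 'example_user',       # placeholder value
}

try:
    require_env(['ESRI_GIS_URL', 'ESRI_GIS_USERNAME', 'ESRI_GIS_PASSWORD'], fake_environ)
    error_message = ''
except EnvironmentError as err:
    error_message = str(err)
```

In the notebook, the returned dictionary could supply the arguments to `GIS` once `load_dotenv` has run.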


With more than 130 countries available, rather than scanning the entire dataframe, we can simply check whether `GBR`, the ISO3 code for Great Britain, is available.

In [9]:
cntry_df.iso3.str.contains('GBR').any()

True

In [10]:
cntry_df[cntry_df.iso3.str.contains('GBR')]

Unnamed: 0,iso2,iso3,country_name,country_id,alt_name,continent
129,GB,GBR,United Kingdom,GBR_MBR_2019,UNITED KINGDOM,Europe
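
Note that `str.contains` performs a substring match (by default, a regular expression match), which works here because ISO3 codes are all three characters long. An exact comparison states the intent more directly. A sketch on a hand-built sample of the rows shown above:

```python
import pandas as pd

# a few rows copied from the country listing above
cntry_df = pd.DataFrame(
    {
        'iso3': ['ALB', 'ARG', 'GBR', 'VNM'],
        'country_name': ['Albania', 'Argentina', 'United Kingdom', 'Vietnam'],
    }
)

# exact equality instead of substring matching
has_gbr = bool(cntry_df['iso3'].eq('GBR').any())
gbr_rows = cntry_df[cntry_df['iso3'] == 'GBR']
gbr_name = gbr_rows['country_name'].iloc[0]
```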


Since it is, we can create a `Country` object instance just like before and find the key facts for Great Britain. Each country has a different set of key facts, depending on what is available. A few more columns are also displayed, simply because more columns are available from the REST endpoint than are easily available through local introspection. This is something we are working to expose locally as well.

In [11]:
gbr = Country('GBR', agol)

gbr_ev = gbr.enrich_variables

gbr_ev[
    (gbr_ev.data_collection.str.lower().str.contains('keyfacts'))
    & (gbr_ev.name.str.endswith('CY'))
].reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,description,vintage,units
0,TOTPOP_CY,2019 Total Population,KeyFacts,2019 Total Population,2019,count
1,POPDENS_CY,2019 Population Density (per sq. km),KeyFacts,2019 Population Density (Population per Square...,2019,count
2,POPPRM_CY,2019 Population Per Mill,KeyFacts,2019 Population Per Mill,2019,count
3,MALES_CY,2019 Total Male Population,KeyFacts,2019 Total Male Population,2019,count
4,FEMALES_CY,2019 Total Female Population,KeyFacts,2019 Total Female Population,2019,count
5,TOTHH_CY,2019 Total Households,KeyFacts,2019 Total Households,2019,count
6,AVGHHSZ_CY,2019 Average Household Size,KeyFacts,2019 Average Household Size,2019,count
7,PAGE01_CY,2019 Total Population Age 0-14,KeyFacts,2019 Total Population Age 0-14,2019,count
8,PAGE02_CY,2019 Total Population Age 15-29,KeyFacts,2019 Total Population Age 15-29,2019,count
9,PAGE03_CY,2019 Total Population Age 30-44,KeyFacts,2019 Total Population Age 30-44,2019,count


## ArcGIS Enterprise (Business Analyst Server)

Within an organization, once more than a few data scientists are working with the data, it begins to make sense to migrate local analysis workflows to Business Analyst Server, as part of an ArcGIS Enterprise installation, instead of using ArcGIS Pro with Business Analyst or ArcGIS Online. ArcGIS Pro with Business Analyst does not scale out very well, technically or fiscally. ArcGIS Online does not scale fiscally, since every enrichment run costs credits. Thus, it makes sense to be able to move quickly to Business Analyst Server.

In [12]:
prtl = GIS(
    os.getenv('ESRI_PORTAL_URL'),
    username=os.getenv('ESRI_PORTAL_USERNAME'),
    password=os.getenv('ESRI_PORTAL_PASSWORD')
)

get_countries(prtl)

Unnamed: 0,iso2,iso3,country_name,country_id,alt_name,continent
0,US,USA,United States,USA_ESRI_2020,UNITED STATES,North America


In [13]:
usa = Country('USA', prtl)

usa

<modeling.Country - USA (GIS at https://geoai-ent.bd.esri.com/portal/ logged in as jmccune)>

In [14]:
usa_ev = usa.enrich_variables

usa_ev

Unnamed: 0,name,alias,data_collection,description,vintage,units
0,AGE0_CY,2020 Population Age <1,1yearincrements,2020 Total Population Age <1 (Esri),2020,count
1,AGE1_CY,2020 Population Age 1,1yearincrements,2020 Total Population Age 1 (Esri),2020,count
2,AGE2_CY,2020 Population Age 2,1yearincrements,2020 Total Population Age 2 (Esri),2020,count
3,AGE3_CY,2020 Population Age 3,1yearincrements,2020 Total Population Age 3 (Esri),2020,count
4,AGE4_CY,2020 Population Age 4,1yearincrements,2020 Total Population Age 4 (Esri),2020,count
...,...,...,...,...,...,...
37,MOEMEDYRMV,2018 Median Year Householder Moved In MOE (ACS...,yearmovedin,2018 Median Year Householder Moved into Unit M...,2014-2018,count
38,RELMEDYRMV,2018 Median Year Householder Moved In REL (ACS...,yearmovedin,2018 Median Year Householder Moved into Unit R...,2014-2018,count
39,ACSOWNER,2018 Owner Households (ACS 5-Yr),yearmovedin,2018 Owner Households (ACS 5-Yr),2014-2018,count
40,MOEOWNER,2018 Owner Households MOE (ACS 5-Yr),yearmovedin,2018 Owner Households MOE (ACS 5-Yr),2014-2018,count


In [15]:
usa_ev[
    (usa_ev.data_collection.str.lower().str.contains('key'))
    & (usa_ev.name.str.endswith('CY'))
].reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,description,vintage,units
0,TOTPOP_CY,2020 Total Population,KeyUSFacts,2020 Total Population (Esri),2020,count
1,GQPOP_CY,2020 Group Quarters Population,KeyUSFacts,2020 Group Quarters Population (Esri),2020,count
2,DIVINDX_CY,2020 Diversity Index,KeyUSFacts,2020 Diversity Index (Esri),2020,count
3,TOTHH_CY,2020 Total Households,KeyUSFacts,2020 Total Households (Esri),2020,count
4,AVGHHSZ_CY,2020 Average Household Size,KeyUSFacts,2020 Average Household Size (Esri),2020,count
5,MEDHINC_CY,2020 Median Household Income,KeyUSFacts,2020 Median Household Income (Esri),2020,currency
6,AVGHINC_CY,2020 Average Household Income,KeyUSFacts,2020 Average Household Income (Esri),2020,currency
7,PCI_CY,2020 Per Capita Income,KeyUSFacts,2020 Per Capita Income (Esri),2020,currency
8,TOTHU_CY,2020 Total Housing Units,KeyUSFacts,2020 Total Housing Units (Esri),2020,count
9,OWNER_CY,2020 Owner Occupied HUs,KeyUSFacts,2020 Owner Occupied Housing Units (Esri),2020,count


## Flexibility Illustrated

Each of the examples above used a different data source. To illustrate how easy it is to move between them, here is the same workflow for the United States using ArcGIS Online and then local data.

In [16]:
usa = Country('USA', agol)
usa_ev = usa.enrich_variables
usa_ev[
    (usa_ev.data_collection.str.lower().str.contains('key'))
    & (usa_ev.name.str.endswith('CY'))
].reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,description,vintage,units
0,TOTPOP_CY,2020 Total Population,KeyUSFacts,2020 Total Population (Esri),2020,count
1,GQPOP_CY,2020 Group Quarters Population,KeyUSFacts,2020 Group Quarters Population (Esri),2020,count
2,DIVINDX_CY,2020 Diversity Index,KeyUSFacts,2020 Diversity Index (Esri),2020,count
3,TOTHH_CY,2020 Total Households,KeyUSFacts,2020 Total Households (Esri),2020,count
4,AVGHHSZ_CY,2020 Average Household Size,KeyUSFacts,2020 Average Household Size (Esri),2020,count
5,MEDHINC_CY,2020 Median Household Income,KeyUSFacts,2020 Median Household Income (Esri),2020,currency
6,AVGHINC_CY,2020 Average Household Income,KeyUSFacts,2020 Average Household Income (Esri),2020,currency
7,PCI_CY,2020 Per Capita Income,KeyUSFacts,2020 Per Capita Income (Esri),2020,currency
8,TOTHU_CY,2020 Total Housing Units,KeyUSFacts,2020 Total Housing Units (Esri),2020,count
9,OWNER_CY,2020 Owner Occupied HUs,KeyUSFacts,2020 Owner Occupied Housing Units (Esri),2020,count


In [17]:
usa = Country('USA', 'local')
usa_ev = usa.enrich_variables
usa_ev[
    (usa_ev.data_collection.str.lower().str.contains('key'))
    & (usa_ev.name.str.endswith('CY'))
].reset_index(drop=True)

Unnamed: 0,name,alias,data_collection,enrich_name,enrich_field_name
0,TOTPOP_CY,2020 Total Population,KeyUSFacts,KeyUSFacts.TOTPOP_CY,KeyUSFacts_TOTPOP_CY
1,GQPOP_CY,2020 Group Quarters Population,KeyUSFacts,KeyUSFacts.GQPOP_CY,KeyUSFacts_GQPOP_CY
2,DIVINDX_CY,2020 Diversity Index,KeyUSFacts,KeyUSFacts.DIVINDX_CY,KeyUSFacts_DIVINDX_CY
3,TOTHH_CY,2020 Total Households,KeyUSFacts,KeyUSFacts.TOTHH_CY,KeyUSFacts_TOTHH_CY
4,AVGHHSZ_CY,2020 Average Household Size,KeyUSFacts,KeyUSFacts.AVGHHSZ_CY,KeyUSFacts_AVGHHSZ_CY
5,MEDHINC_CY,2020 Median Household Income,KeyUSFacts,KeyUSFacts.MEDHINC_CY,KeyUSFacts_MEDHINC_CY
6,AVGHINC_CY,2020 Average Household Income,KeyUSFacts,KeyUSFacts.AVGHINC_CY,KeyUSFacts_AVGHINC_CY
7,PCI_CY,2020 Per Capita Income,KeyUSFacts,KeyUSFacts.PCI_CY,KeyUSFacts_PCI_CY
8,TOTHU_CY,2020 Total Housing Units,KeyUSFacts,KeyUSFacts.TOTHU_CY,KeyUSFacts_TOTHU_CY
9,OWNER_CY,2020 Owner Occupied HUs,KeyUSFacts,KeyUSFacts.OWNER_CY,KeyUSFacts_OWNER_CY
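
Since only the second argument to `Country` changes between sources, the filtering step can be written once and applied to whichever dataframe comes back. The sketch below uses hand-built stand-ins for the ArcGIS Online and local results, not live calls, and checks that both yield the same variable names. `key_variable_names` is a hypothetical helper.

```python
import pandas as pd


def key_variable_names(ev: pd.DataFrame) -> list:
    """Names of current-year key variables, using only columns every source provides."""
    mask = (
        ev['data_collection'].str.lower().str.contains('key', regex=False)
        & ev['name'].str.endswith('CY')
    )
    return ev.loc[mask, 'name'].tolist()


# hand-built stand-ins for the Country('USA', agol) and Country('USA', 'local') results
online_ev = pd.DataFrame(
    {
        'name': ['AGE0_CY', 'TOTPOP_CY', 'PCI_CY'],
        'data_collection': ['1yearincrements', 'KeyUSFacts', 'KeyUSFacts'],
        'units': ['count', 'count', 'currency'],
    }
)
local_ev = pd.DataFrame(
    {
        'name': ['AGE0_CY', 'TOTPOP_CY', 'PCI_CY'],
        'data_collection': ['1yearincrements', 'KeyUSFacts', 'KeyUSFacts'],
        'enrich_name': ['1yearincrements.AGE0_CY', 'KeyUSFacts.TOTPOP_CY', 'KeyUSFacts.PCI_CY'],
    }
)

names_by_source = {
    'online': key_variable_names(online_ev),
    'local': key_variable_names(local_ev),
}
sources_agree = names_by_source['online'] == names_by_source['local']
```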
