# Data Preparation Pipeline Using ArcGIS GeoEnrichment

Esri makes available a huge number of demographic variables, over 14,000, that can be added to any location. These variables can be used to build machine learning models, for example to cluster customers or to forecast performance. Because attaching Esri data to locations is geographic in nature, the process is referred to as GeoEnrichment.

GeoEnrichment can be accessed in a variety of ways across the ArcGIS suite of software products. For data science, one of the most useful is the ArcGIS Python API, which can add demographic variables to location data for analysis and modeling. Further, to streamline data preparation workflows, GeoEnrichment can be wrapped in a custom scikit-learn estimator and used as a step in a complete data pipeline. To examine this workflow, we will first step through the process of performing GeoEnrichment, and then see how to integrate it into a custom estimator.

In [1]:
from arcgis.gis import GIS, Item
import arcgis.geoenrichment as geoenrichment

In [2]:
user_id = 'jmccune_geoai'
customer_item_id = '47d2cb05d9c1494797293b62ba167211'

## Create a GIS

The first step is creating a GIS object instance, which manages the connection to our Esri Web GIS. This Web GIS can be either an ArcGIS Enterprise deployment or an organization in ArcGIS Online. Either way, being logged in is a prerequisite for accessing GeoEnrichment. Although this example prompts for my password, automated data pipelines have a few options for making this step hands-off, [detailed in the documentation](https://developers.arcgis.com/python/guide/working-with-different-authentication-schemes/#Storing-your-credentialls-locally).

In [3]:
gis = GIS(username=user_id)
gis
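For an unattended pipeline, the password prompt has to be replaced by something that reads credentials at run time. The following is only a minimal sketch of one way to do that with the standard library. The `load_credentials` helper, the `ARCGIS_USERNAME` and `ARCGIS_PASSWORD` environment variable names, and the `[arcgis]` section in the INI file are all hypothetical choices made for illustration, not something the ArcGIS Python API defines.

```python
import configparser
import os
from pathlib import Path


def load_credentials(config_path, section="arcgis"):
    """Return (username, password) from environment variables or an INI file.

    The variable names and the INI layout are assumptions for this sketch.
    """
    # Environment variables take precedence so secrets can be injected at run time.
    username = os.environ.get("ARCGIS_USERNAME")
    password = os.environ.get("ARCGIS_PASSWORD")
    if username and password:
        return username, password

    # Fall back to a local INI file that is kept out of version control.
    # Interpolation is disabled so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(Path(config_path)):
        raise FileNotFoundError(f"No credentials file found at {config_path}")
    return parser[section]["username"], parser[section]["password"]


# Intended use (not executed here, since it needs a real account):
# username, password = load_credentials("arcgis.ini")
# gis = GIS(username=username, password=password)
```

Checking the environment first keeps the same code usable both on a workstation, with a local file, and on a scheduler or CI system, where secrets are usually provided as environment variables.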

## Create a SpatialDataFrame for Demonstration

To demonstrate how GeoEnrichment works, we need data - specifically data with _locations_. In this case, we are collecting the first three customers from a demonstration customer dataset hosted on ArcGIS Online, and creating a [SpatialDataFrame](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.features.toc.html#spatialdataframe) to work with.

In [4]:
customer_layer = Item(gis, customer_item_id).layers[0]

id_list = customer_layer.query(return_ids_only=True)
enrich_id_list = [str(id) for id in id_list['objectIds'][:3]]
enrich_id_string = ','.join(enrich_id_list)
print('We\'re going to use just three features for this demonstration, identified by the first three object identifiers - {}.'.format(enrich_id_string))

customer_sdf = customer_layer.query(object_ids=enrich_id_string).df
customer_sdf

We're going to use just three features for this demonstration, identified by the first three object identifiers - 1,2,3.


| | CITY | CUSTOMER_CLASS | Customer_Spending | DMA | Distance | FIRSTNAME | Join_Count | LASTNAME | OBJECTID | PAYMETHOD | ... | Store_ID | TARGET_FID | X_Long | Y_Lat | ZIP | ZIP4 | description | test | time_of_day | SHAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Prattville | Steady | 3527.8 | Montgomery (Selma) AL | 57691.556784 | JIM | 2 | BROWN | 1 | MC | ... | | 1 | -86.497305 | 32.474348 | 36067 | 2816 | Island hemp skirt - The Island Hemp Skirt brin... | | | `{'x': -9628835.9624, 'y': 3825738.4985999987, ...` |
| 1 | Prattville | Steady | 2667.1 | Montgomery (Selma) AL | 57691.556784 | CARL | 2 | ATKINS | 2 | MC | ... | | 2 | -86.478971 | 32.481213 | 36067 | 1814 | Vintage logo pkt t-shirt - Keep it on the down... | | | `{'x': -9626795.0102, 'y': 3826644.4267000034, ...` |
| 2 | Prattville | Steady | 2897.6 | Montgomery (Selma) AL | 57691.556784 | JOHN | 2 | ASHBY | 3 | PP | ... | | 3 | -86.457961 | 32.485113 | 36067 | 2110 | Solimar pants - In case your travel plans coin... | | | `{'x': -9624456.2475, 'y': 3827159.0424999967, ...` |


## Identify GeoEnrichment variables

Although there are over 14,000 variables available for GeoEnrichment from Esri, we are only focusing on those describing median disposable income for our customer points. Granted, you can use anything you want. These are simply the variables I chose for this demonstration.

For our analysis, we need the variable names combined into a single list for input into the GeoEnrichment `enrich` method. First, though, we interrogate [GeoEnrichment](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.geoenrichment.html) to get a dataframe with more information on these variables, to ensure they are what we are interested in.

In [5]:
# get a country to work with for GeoEnrichment
usa = geoenrichment.Country.get('US')

# get a Data Frame with all the data variables available for the selected country
factors_df = usa.data_collections
print('There are {:,} data variables available from Esri.'.format(len(factors_df.index)))

# filter out just the variables for disposable income
disposableincome_factors_df = factors_df[factors_df.index == 'disposableincome'].copy()
print('For just disposable income there are {:,} variables!'.format(len(disposableincome_factors_df.index)))

# from these, filter out just those describing median disposable income
mediandi_factors_df = disposableincome_factors_df[disposableincome_factors_df['alias'].str.contains('Median')].copy()

# add a column holding just the variable names as they will appear in the GeoEnrichment results
mediandi_factors_df['out_column'] = mediandi_factors_df['analysisVariable'].apply(lambda val: val.split('.')[1])

print('For this demonstration, we are using {} variables describing median disposable income.'.format(len(mediandi_factors_df)))

mediandi_factors_df

There are 14,630 data variables available from Esri.
For just disposable income there are 104 variables!
For this demonstration, we are using 8 variables describing median disposable income.


| dataCollectionID | analysisVariable | alias | fieldCategory | vintage | out_column |
|---|---|---|---|---|---|
| disposableincome | disposableincome.MEDDI_CY | 2018 Median Disposable Income | 2018 Disposable Income (Esri) | 2018 | MEDDI_CY |
| disposableincome | disposableincome.MEDDIA15CY | 2018 Median Disposable Inc: HHr 15-24 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA15CY |
| disposableincome | disposableincome.MEDDIA25CY | 2018 Median Disposable Inc: HHr 25-34 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA25CY |
| disposableincome | disposableincome.MEDDIA35CY | 2018 Median Disposable Inc: HHr 35-44 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA35CY |
| disposableincome | disposableincome.MEDDIA45CY | 2018 Median Disposable Inc: HHr 45-54 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA45CY |
| disposableincome | disposableincome.MEDDIA55CY | 2018 Median Disposable Inc: HHr 55-64 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA55CY |
| disposableincome | disposableincome.MEDDIA65CY | 2018 Median Disposable Inc: HHr 65-74 | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA65CY |
| disposableincome | disposableincome.MEDDIA75CY | 2018 Median Disposable Inc: HHr 75+ | 2018 Disposable Income by Age (Esri) | 2018 | MEDDIA75CY |


For the `enrich` method, we describe the data we want using the values from the `analysisVariable` column above. These values need to be combined into a Python list. If the list is going to be reused (_very likely_), it can easily be saved in a configuration file or simply hard coded as a variable near the top of a script.

In [6]:
enrich_variable_list = list(mediandi_factors_df['analysisVariable'].values)
enrich_variable_list

['disposableincome.MEDDI_CY',
 'disposableincome.MEDDIA15CY',
 'disposableincome.MEDDIA25CY',
 'disposableincome.MEDDIA35CY',
 'disposableincome.MEDDIA45CY',
 'disposableincome.MEDDIA55CY',
 'disposableincome.MEDDIA65CY',
 'disposableincome.MEDDIA75CY']
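Saving the list for reuse can be as simple as writing it to a small JSON file. The sketch below assumes a file layout with a single `analysis_variables` key. The key name and the helper functions `save_variables`, `load_variables`, and `output_column_names` are hypothetical and only illustrate the idea.

```python
import json
from pathlib import Path

# The eight variable names listed above.
ANALYSIS_VARIABLES = [
    "disposableincome.MEDDI_CY",
    "disposableincome.MEDDIA15CY",
    "disposableincome.MEDDIA25CY",
    "disposableincome.MEDDIA35CY",
    "disposableincome.MEDDIA45CY",
    "disposableincome.MEDDIA55CY",
    "disposableincome.MEDDIA65CY",
    "disposableincome.MEDDIA75CY",
]


def save_variables(variables, path):
    """Write the variable names to a JSON configuration file."""
    payload = {"analysis_variables": list(variables)}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_variables(path):
    """Read the variable names back from the JSON configuration file."""
    return json.loads(Path(path).read_text())["analysis_variables"]


def output_column_names(variables):
    """Strip the data collection prefix: 'disposableincome.MEDDI_CY' -> 'MEDDI_CY'."""
    return [name.split(".", 1)[1] for name in variables]
```

Because the result column names can be derived from the variable names themselves, the configuration file only needs to store the one list.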

## Perform GeoEnrichment

For GeoEnrichment, we only need to identify the locations to enrich, using geometry objects. The GeoEnrichment `enrich` method accepts only a limited number of columns, and since we can easily join the results back to the original data after receiving the response, we send only the index and the geometry column.

In [7]:
sdf_for_enrich = customer_sdf[['SHAPE']].copy()
sdf_for_enrich

| | SHAPE |
|---|---|
| 0 | `{'x': -9628835.9624, 'y': 3825738.4985999987, ...` |
| 1 | `{'x': -9626795.0102, 'y': 3826644.4267000034, ...` |
| 2 | `{'x': -9624456.2475, 'y': 3827159.0424999967, ...` |


In [8]:
enrich_df = geoenrichment.enrich(
    study_areas=sdf_for_enrich, # only send the geometry
    analysis_variables=enrich_variable_list,
    return_geometry=False  # already have the geometry locally, so do not repeat
)

# some cleanup to ensure the index column matches our original data
enrich_df.set_index('ID', drop=True, inplace=True)  # index to match with original data
enrich_df.index = enrich_df.index.astype(customer_sdf.index.dtype)  # so the join will work later
enrich_df

| ID | HasData | MEDDIA15CY | MEDDIA25CY | MEDDIA35CY | MEDDIA45CY | MEDDIA55CY | MEDDIA65CY | MEDDIA75CY | MEDDI_CY | OBJECTID_0 | aggregationMethod | areaType | bufferRadii | bufferUnits | bufferUnitsAlias | sourceCountry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30717 | 51239 | 61686 | 70704 | 52912 | 42742 | 27632 | 51445 | 1 | BlockApportionment:US.BlockGroups | RingBuffer | 1 | esriMiles | Miles | US |
| 1 | 1 | 30717 | 50397 | 54461 | 57858 | 45039 | 39248 | 27789 | 46602 | 2 | BlockApportionment:US.BlockGroups | RingBuffer | 1 | esriMiles | Miles | US |
| 2 | 1 | 36263 | 47177 | 54538 | 57423 | 48567 | 46279 | 34101 | 48455 | 3 | BlockApportionment:US.BlockGroups | RingBuffer | 1 | esriMiles | Miles | US |


When we created the DataFrame of analysis variables earlier, we added a column named `out_column` identifying the analysis variable column names. These values can be used to filter the DataFrame returned by GeoEnrichment so that it shows only the data columns we are interested in.

In [9]:
# only show the columns with the actual variables we want added
enrich_df[mediandi_factors_df['out_column']]

| ID | MEDDI_CY | MEDDIA15CY | MEDDIA25CY | MEDDIA35CY | MEDDIA45CY | MEDDIA55CY | MEDDIA65CY | MEDDIA75CY |
|---|---|---|---|---|---|---|---|---|
| 0 | 51445 | 30717 | 51239 | 61686 | 70704 | 52912 | 42742 | 27632 |
| 1 | 46602 | 30717 | 50397 | 54461 | 57858 | 45039 | 39248 | 27789 |
| 2 | 48455 | 36263 | 47177 | 54538 | 57423 | 48567 | 46279 | 34101 |


Using the index field, we can now join our added variables back to our original data for further analysis.

In [10]:
customer_enrich_sdf = customer_sdf.join(enrich_df[mediandi_factors_df['out_column']])
customer_enrich_sdf

| | CITY | CUSTOMER_CLASS | Customer_Spending | DMA | Distance | FIRSTNAME | Join_Count | LASTNAME | OBJECTID | PAYMETHOD | ... | time_of_day | SHAPE | MEDDI_CY | MEDDIA15CY | MEDDIA25CY | MEDDIA35CY | MEDDIA45CY | MEDDIA55CY | MEDDIA65CY | MEDDIA75CY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Prattville | Steady | 3527.8 | Montgomery (Selma) AL | 57691.556784 | JIM | 2 | BROWN | 1 | MC | ... | | `{'x': -9628835.9624, 'y': 3825738.4985999987, ...` | 51445 | 30717 | 51239 | 61686 | 70704 | 52912 | 42742 | 27632 |
| 1 | Prattville | Steady | 2667.1 | Montgomery (Selma) AL | 57691.556784 | CARL | 2 | ATKINS | 2 | MC | ... | | `{'x': -9626795.0102, 'y': 3826644.4267000034, ...` | 46602 | 30717 | 50397 | 54461 | 57858 | 45039 | 39248 | 27789 |
| 2 | Prattville | Steady | 2897.6 | Montgomery (Selma) AL | 57691.556784 | JOHN | 2 | ASHBY | 3 | PP | ... | | `{'x': -9624456.2475, 'y': 3827159.0424999967, ...` | 48455 | 36263 | 47177 | 54538 | 57423 | 48567 | 46279 | 34101 |
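The join only lines rows up correctly when the index of the enriched results matches the index of the original data in both values and dtype, which is why the cleanup step casts the index before joining. The sketch below reproduces that step with small stand-in DataFrames instead of real service output. It assumes the `ID` column arrives as strings, which the cast suggests but which is not stated here, and the `customers` and `enriched` frames are made up for illustration using values from the tables above.

```python
import pandas as pd

# Stand-in for the original customer data, with the default integer index.
customers = pd.DataFrame({"LASTNAME": ["BROWN", "ATKINS", "ASHBY"]})

# Stand-in for an enrich response whose ID column is assumed to hold strings.
enriched = pd.DataFrame(
    {"ID": ["0", "1", "2"], "MEDDI_CY": [51445, 46602, 48455]}
)

# Use ID as the index, then cast it to the dtype of the original index
# so that both sides of the join are keyed the same way.
enriched = enriched.set_index("ID")
enriched.index = enriched.index.astype(customers.index.dtype)

# Join only the column of interest back onto the original rows.
joined = customers.join(enriched[["MEDDI_CY"]])
print(joined)
```

Without the cast, the string and integer keys would not be treated as equal, so the enriched values would not line up with the rows they belong to.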


# Integrate GeoEnrichment into a Data Pipeline

While useful on its own, the real power of accessing GeoEnrichment through the Python API comes from combining it into a larger data preparation pipeline in scikit-learn by extending `BaseEstimator`. Everything below can be extracted. If the GIS authentication is automated to read the password from a configuration file, everything below can be used as a standalone part of a larger data preparation workflow.

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin
from arcgis.gis import GIS, Item
import arcgis.geoenrichment as geoenrichment

In [12]:
analysis_variables = ['disposableincome.MEDDI_CY', 'disposableincome.MEDDIA15CY', 'disposableincome.MEDDIA25CY', 'disposableincome.MEDDIA35CY', 
                      'disposableincome.MEDDIA45CY', 'disposableincome.MEDDIA55CY', 'disposableincome.MEDDIA65CY', 'disposableincome.MEDDIA75CY']
user_id = 'jmccune_geoai'

In [13]:
# again, could be hard coded, but I'm not going to show you my password...
gis = GIS(username=user_id)
gis

Enter password: ········


In [14]:
class ArcgisGeoenricher(BaseEstimator, TransformerMixin):
    
    def __init__(self, analysis_variables):
        self.analysis_variables = analysis_variables
        
    def fit(self, X, y=None):
        # nothing to learn, enrichment is a stateless lookup
        return self
    
    def transform(self, X, y=None):
        # create the lightweight SpatialDataFrame for sending
        sdf_for_enrich = X[['SHAPE']].copy()
        
        # call GeoEnrichment REST API
        enrich_df = geoenrichment.enrich(
            study_areas=sdf_for_enrich, # only send the geometry
            analysis_variables=self.analysis_variables,
            return_geometry=False  # already have the geometry locally, so do not repeat
        )
        
        # some cleanup to ensure the index column matches the input data
        enrich_df.set_index('ID', drop=True, inplace=True)  # index to match with input data
        enrich_df.index = enrich_df.index.astype(X.index.dtype)  # so the join will work later
        
        # result column names are the part after the data collection prefix
        out_columns = [variable.split('.')[1] for variable in self.analysis_variables]
        
        # join to the input data (not a notebook global) and return the result
        return X.join(enrich_df[out_columns])

Now, test it out on the original, not yet enriched, customer data to see if it all works...

In [15]:
arcgis_geoenricher = ArcgisGeoenricher(analysis_variables)
arcgis_geoenricher.fit_transform(customer_sdf)

| | CITY | CUSTOMER_CLASS | Customer_Spending | DMA | Distance | FIRSTNAME | Join_Count | LASTNAME | OBJECTID | PAYMETHOD | ... | time_of_day | SHAPE | MEDDI_CY | MEDDIA15CY | MEDDIA25CY | MEDDIA35CY | MEDDIA45CY | MEDDIA55CY | MEDDIA65CY | MEDDIA75CY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Prattville | Steady | 3527.8 | Montgomery (Selma) AL | 57691.556784 | JIM | 2 | BROWN | 1 | MC | ... | | `{'x': -9628835.9624, 'y': 3825738.4985999987, ...` | 51445 | 30717 | 51239 | 61686 | 70704 | 52912 | 42742 | 27632 |
| 1 | Prattville | Steady | 2667.1 | Montgomery (Selma) AL | 57691.556784 | CARL | 2 | ATKINS | 2 | MC | ... | | `{'x': -9626795.0102, 'y': 3826644.4267000034, ...` | 46602 | 30717 | 50397 | 54461 | 57858 | 45039 | 39248 | 27789 |
| 2 | Prattville | Steady | 2897.6 | Montgomery (Selma) AL | 57691.556784 | JOHN | 2 | ASHBY | 3 | PP | ... | | `{'x': -9624456.2475, 'y': 3827159.0424999967, ...` | 48455 | 36263 | 47177 | 54538 | 57423 | 48567 | 46279 | 34101 |
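The point of wrapping GeoEnrichment in an estimator is to use it as one step in a scikit-learn `Pipeline`. The following is a minimal sketch of that wiring, under several assumptions. `StubEnricher` is a hypothetical stand-in for `ArcgisGeoenricher` that adds placeholder zeros instead of calling the GeoEnrichment service, so the example runs offline, and `ColumnSelector` is a hypothetical second step. The mixin is listed before `BaseEstimator` here, which is the order current scikit-learn documentation recommends.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class StubEnricher(TransformerMixin, BaseEstimator):
    """Hypothetical offline stand-in for ArcgisGeoenricher."""

    def __init__(self, analysis_variables):
        self.analysis_variables = analysis_variables

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        out = X.copy()
        for variable in self.analysis_variables:
            # Placeholder value only; a real enricher would add Esri data here.
            out[variable.split(".", 1)[1]] = 0
        return out


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical step that keeps only the columns needed for modeling."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[list(self.columns)]


# Chain enrichment and column selection into a single preparation pipeline.
pipeline = Pipeline([
    ("enrich", StubEnricher(["disposableincome.MEDDI_CY"])),
    ("select", ColumnSelector(["Customer_Spending", "MEDDI_CY"])),
])

# Stand-in customer data with the two columns the steps rely on.
customers = pd.DataFrame({
    "Customer_Spending": [3527.8, 2667.1, 2897.6],
    "SHAPE": [None, None, None],
})

features = pipeline.fit_transform(customers)
print(features)
```

Swapping `StubEnricher` for the real estimator should leave the rest of the pipeline unchanged, since both expose the same `fit` and `transform` interface.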
