# Part 4 - What to enrich with? (What are Data Collections and Analysis Variables?)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Collections-and-GeoEnrichment-coverage" data-toc-modified-id="Data-Collections-and-GeoEnrichment-coverage-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Collections and GeoEnrichment coverage</a></span><ul class="toc-item"><li><span><a href="#List-Countries-with-GeoEnrichment-Data" data-toc-modified-id="List-Countries-with-GeoEnrichment-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>List Countries with GeoEnrichment Data</a></span></li><li><span><a href="#Data-Collections-for-U.S." data-toc-modified-id="Data-Collections-for-U.S.-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data Collections for U.S.</a></span><ul class="toc-item"><li><span><a href="#Unique-Data-Collections-for-U.S." data-toc-modified-id="Unique-Data-Collections-for-U.S.-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Unique Data Collections for U.S.</a></span></li><li><span><a href="#Data-Collections-by-Socio-demographic-Factors" data-toc-modified-id="Data-Collections-by-Socio-demographic-Factors-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Data Collections by Socio-demographic Factors</a></span></li><li><span><a href="#Analysis-variables-for-Data-Collections" data-toc-modified-id="Analysis-variables-for-Data-Collections-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Analysis variables for Data Collections</a></span></li></ul></li><li><span><a href="#Data-Collections-for-Another-Country" data-toc-modified-id="Data-Collections-for-Another-Country-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Collections for Another Country</a></span><ul class="toc-item"><li><span><a href="#Unique-Data-Collections-for-New-Zealand" data-toc-modified-id="Unique-Data-Collections-for-New-Zealand-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Unique Data Collections for New Zealand</a></span></li><li><span><a href="#Data-Collections-for-Socio-demographic-Factors" 
data-toc-modified-id="Data-Collections-for-Socio-demographic-Factors-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Data Collections for Socio-demographic Factors</a></span></li><li><span><a href="#Analysis-variables-for-Data-Collections" data-toc-modified-id="Analysis-variables-for-Data-Collections-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Analysis variables for Data Collections</a></span></li></ul></li><li><span><a href="#Perform-Enrichment-using-Data-Collections-and-Analysis-Variables" data-toc-modified-id="Perform-Enrichment-using-Data-Collections-and-Analysis-Variables-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Perform Enrichment using Data Collections and Analysis Variables</a></span><ul class="toc-item"><li><span><a href="#Enrich-using-Data-Collections" data-toc-modified-id="Enrich-using-Data-Collections-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Enrich using Data Collections</a></span></li><li><span><a href="#Enrich-using-Analysis-Variables" data-toc-modified-id="Enrich-using-Analysis-Variables-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Enrich using Analysis Variables</a></span></li></ul></li></ul></li><li><span><a href="#Enriching-Spatially-Enabled-Dataframes" data-toc-modified-id="Enriching-Spatially-Enabled-Dataframes-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Enriching Spatially Enabled Dataframes</a></span><ul class="toc-item"><li><span><a href="#Visualize-on-a-Map" data-toc-modified-id="Visualize-on-a-Map-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Visualize on a Map</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Data Collections and GeoEnrichment coverage

As described earlier, a data collection is a preassembled list of attributes that will be used to enrich the input features. Collection attributes can describe various types of information, such as demographic characteristics and geographic context of the locations or areas submitted as input features. 

Some data collections (such as default) can be used in all supported countries. Other data collections may only be available in one country or a subset of countries. The [Data Browser](https://doc.arcgis.com/en/esri-demographics/data/data-browser.htm) can be used to examine the global listing of variables and the associated datasets for each country.

<div class="embed-container"><iframe frameborder="false" scrolling="no" width="800" height="600" marginheight="0" marginwidth="0" 
src="https://la.arcgis.com/databrowser/index.html"></iframe></div>

### List Countries with GeoEnrichment Data

The `get_countries()` method queries the countries for which GeoEnrichment data is available and returns a list of `Country` objects, which you can query further for their properties. This list can also be viewed [here](https://developers.arcgis.com/rest/geoenrichment/api-reference/geoenrichment-coverage.htm).

In [1]:
# Import Libraries
from arcgis.gis import GIS
from arcgis.geoenrichment import Country, enrich, get_countries

In [2]:
# Create a GIS Connection
# Replace the placeholders with your own credentials; never commit real ones
gis = GIS('https://baqa.mapsqa.arcgis.com', username='<your_username>', password='<your_password>')
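
Credentials are best kept out of notebook cells. The following is a minimal sketch of one way to do that, reading them from environment variables and falling back to a prompt. The variable names `ARCGIS_URL`, `ARCGIS_USERNAME` and `ARCGIS_PASSWORD` and the helper `load_credentials` are hypothetical names chosen for this example, not something the ArcGIS API for Python looks for.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ, prompt=getpass):
    """Return (url, username, password) without hard-coding secrets.

    ARCGIS_URL, ARCGIS_USERNAME and ARCGIS_PASSWORD are hypothetical
    variable names used only in this sketch.
    """
    url = env.get("ARCGIS_URL", "https://www.arcgis.com")
    username = env["ARCGIS_USERNAME"]  # raises KeyError if it is not set
    # Fall back to an interactive prompt when no password is exported.
    password = env.get("ARCGIS_PASSWORD") or prompt(f"Password for {username}: ")
    return url, username, password


# Demonstration with an in-memory mapping instead of the real environment.
demo_env = {"ARCGIS_USERNAME": "demo_user", "ARCGIS_PASSWORD": "not-a-real-secret"}
url, username, password = load_credentials(env=demo_env)
print(url, username)
```

The returned values can then be passed on as `GIS(url, username=username, password=password)`, mirroring the connection cell above.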

In [3]:
countries = get_countries()
print("Number of countries for which GeoEnrichment data is available: " + str(len(countries)))

#print a few countries for a sample
countries[0:10]

Number of countries for which GeoEnrichment data is available: 144


[<Country - Albania (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Algeria (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Andorra (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Angola (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Anguilla (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Argentina (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Armenia (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Aruba (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Australia (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>,
 <Country - Austria (GIS @ https://baqa.mapsqa.arcgis.com version:9.2)>]

### Data Collections for U.S.

The `data_collections` property of a `Country` object lists its available data collections and analysis variables under each data collection as a Pandas dataframe.

To discover the data collections for a particular country, first get a reference to the country using the `Country.get()` method, and then fetch the data collections from its `data_collections` property. Once we know the data collection we would like to use, we can look at the `analysisVariable`s available in that data collection.

In [4]:
# Get US as a country
usa = Country.get('US')
type(usa)

arcgis.geoenrichment.enrichment.Country

In [5]:
usa_df = usa.data_collections

# print a few rows of the DataFrame
usa_df.head()

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
1yearincrements,1yearincrements.AGE0_CY,2021 Population Age <1,2021 Age: 1 Year Increments (Esri),2021
1yearincrements,1yearincrements.AGE1_CY,2021 Population Age 1,2021 Age: 1 Year Increments (Esri),2021
1yearincrements,1yearincrements.AGE2_CY,2021 Population Age 2,2021 Age: 1 Year Increments (Esri),2021
1yearincrements,1yearincrements.AGE3_CY,2021 Population Age 3,2021 Age: 1 Year Increments (Esri),2021
1yearincrements,1yearincrements.AGE4_CY,2021 Population Age 4,2021 Age: 1 Year Increments (Esri),2021


In [6]:
usa_df.shape

(18632, 4)

#### Unique Data Collections for U.S.

Each data collection and analysis variable has a unique ID. When calling the `enrich()` method (explained earlier in this guide), these IDs can be passed in the `data_collections` and `analysis_variables` parameters.

As an example, here we see a subset of the data collections for US showing 2 different data collections and multiple analysis variables for each collection.

In [7]:
usa_df.iloc[500:600,:]

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
1yearincrements,1yearincrements.FAGE75_FY,2026 Females Age 75,2026 Age: 1 Year Increments (Esri),2026
1yearincrements,1yearincrements.FAGE76_FY,2026 Females Age 76,2026 Age: 1 Year Increments (Esri),2026
1yearincrements,1yearincrements.FAGE77_FY,2026 Females Age 77,2026 Age: 1 Year Increments (Esri),2026
1yearincrements,1yearincrements.FAGE78_FY,2026 Females Age 78,2026 Age: 1 Year Increments (Esri),2026
1yearincrements,1yearincrements.FAGE79_FY,2026 Females Age 79,2026 Age: 1 Year Increments (Esri),2026
...,...,...,...,...
5yearincrements,5yearincrements.MEDAGE_CY,2021 Median Age,2021 Age: 5 Year Increments (Esri),2021
5yearincrements,5yearincrements.MALES_CY,2021 Male Population,2021 Age: 5 Year Increments (Esri),2021
5yearincrements,5yearincrements.MALE0_CY,2021 Males Age 0-4,2021 Age: 5 Year Increments (Esri),2021
5yearincrements,5yearincrements.MALE5_CY,2021 Males Age 5-9,2021 Age: 5 Year Increments (Esri),2021


The table above shows 2 different data collections (1yearincrements and 5yearincrements). Since these are `Age` data collections, the `analysisVariable`s for these collections are similar. `vintage` shows the year that the demographic data represents. For example, a vintage of 2020 means that the data represents the year 2020.
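
To get a feel for how collections and vintages relate, the rows can be grouped with ordinary Pandas operations. The sketch below rebuilds a handful of rows from the tables above by hand (so it runs without a GIS connection) and counts the analysis variables per collection and per vintage. Storing `vintage` as text is an assumption of this sketch.

```python
import pandas as pd

# A few rows copied from the tables above; the real dataframe has 18,632 rows.
# The vintage is stored as text here, which is an assumption of this sketch.
rows = [
    ("1yearincrements", "1yearincrements.AGE0_CY", "2021 Population Age <1",
     "2021 Age: 1 Year Increments (Esri)", "2021"),
    ("1yearincrements", "1yearincrements.AGE1_CY", "2021 Population Age 1",
     "2021 Age: 1 Year Increments (Esri)", "2021"),
    ("1yearincrements", "1yearincrements.FAGE75_FY", "2026 Females Age 75",
     "2026 Age: 1 Year Increments (Esri)", "2026"),
    ("5yearincrements", "5yearincrements.MEDAGE_CY", "2021 Median Age",
     "2021 Age: 5 Year Increments (Esri)", "2021"),
    ("5yearincrements", "5yearincrements.MALES_CY", "2021 Male Population",
     "2021 Age: 5 Year Increments (Esri)", "2021"),
]
columns = ["dataCollectionID", "analysisVariable", "alias", "fieldCategory", "vintage"]
sample_df = pd.DataFrame(rows, columns=columns).set_index("dataCollectionID")

# Number of analysis variables in each data collection.
vars_per_collection = sample_df.groupby(level="dataCollectionID").size()

# Number of analysis variables for each vintage within a collection.
vars_per_vintage = (
    sample_df.reset_index().groupby(["dataCollectionID", "vintage"]).size()
)

print(vars_per_collection)
print(vars_per_vintage)
```

The same two `groupby` calls should work on the full `usa_df`, since it shares the index name and column names used here.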

Let's get a list of unique data collections that are available for U.S. 

In [8]:
usa_df.index.nunique()

149

*United States has 149 unique data collections.* Here are the first 10 data collections.


In [9]:
list(usa_df.index.unique())[:10]

['1yearincrements',
 '5yearincrements',
 'ACS_Housing_Summary_rep',
 'ACS_Population_Summary_rep',
 'Age',
 'AgeDependency',
 'Age_50_Profile_rep',
 'Age_by_Sex_Profile_rep',
 'Age_by_Sex_by_Race_Profile_rep',
 'AtRisk']

Looking at `fieldCategory` is a great way to understand what a data collection is about. A `fieldCategory` value typically combines the vintage, a descriptive topic name, and the data source in parentheses, for example `2021 Age: 1 Year Increments (Esri)`. However, to query a data collection, its unique ID (`dataCollectionID`) must be used.

Let's look at the `fieldCategory` column for a few data collections in US.

In [10]:
usa_df.fieldCategory.unique()[:10]

array(['2021 Age: 1 Year Increments (Esri)',
       '2026 Age: 1 Year Increments (Esri)',
       '2010 Age: 1 Year Increments (U.S. Census)',
       '2021 Age: 5 Year Increments (Esri)',
       '2026 Age: 5 Year Increments (Esri)',
       '2010 Age: 5 Year Increments (U.S. Census)',
       '2015-2019 Age: 5 Year Increments (ACS)',
       '2015-2019 Key Demographic Indicators (ACS)',
       '2015-2019 Mortgage Status (ACS)', '2015-2019 Contract Rent (ACS)'],
      dtype=object)
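
The `fieldCategory` strings printed above follow a visible pattern: an optional vintage, a topic, and an optional source in parentheses. As a rough sketch, the parts can be pulled apart with a regular expression. The pattern is inferred from the sample values in this guide only, and `split_field_category` is a hypothetical helper, not part of `arcgis.geoenrichment`.

```python
import re

# Pattern inferred from the values printed above: "<vintage> <topic> (<source>)".
# Vintage and source are optional, because values such as
# "Age: 5 Year Increments" appear without them elsewhere in this guide.
FIELD_CATEGORY = re.compile(
    r"^(?:(?P<vintage>\d{4}(?:-\d{4})?)\s+)?"
    r"(?P<topic>.+?)"
    r"(?:\s+\((?P<source>[^()]*)\))?$"
)


def split_field_category(value):
    """Split a fieldCategory string into vintage, topic and source parts."""
    match = FIELD_CATEGORY.match(value)
    if match is None:
        return {"vintage": None, "topic": value, "source": None}
    return match.groupdict()


for category in [
    "2021 Age: 1 Year Increments (Esri)",
    "2015-2019 Mortgage Status (ACS)",
    "Age: 5 Year Increments",
]:
    print(split_field_category(category))
```

Values that do not fit the inferred pattern are returned unchanged as the topic, so the helper does not raise on unexpected input.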

#### Data Collections by Socio-demographic Factors

You can filter the `data_collections` dataframe to get collections for a specific factor using Pandas expressions. Let's look at data collections for different socio-demographic factors such as `Age`, `Population`, and `Income`.

__Data Collections for Age__

In [11]:
Age_Collections = usa_df['fieldCategory'].str.contains('Age', na=False)
usa_df[Age_Collections].fieldCategory.unique()

array(['2021 Age: 1 Year Increments (Esri)',
       '2026 Age: 1 Year Increments (Esri)',
       '2010 Age: 1 Year Increments (U.S. Census)',
       '2021 Age: 5 Year Increments (Esri)',
       '2026 Age: 5 Year Increments (Esri)',
       '2010 Age: 5 Year Increments (U.S. Census)',
       '2015-2019 Age: 5 Year Increments (ACS)',
       '2015-2019 Females by Age of Children and Employment Status (ACS)',
       '2015-2019 Households by Type and Size and Age (ACS)',
       '2021 Age Dependency (Esri)', '2026 Age Dependency (Esri)',
       '2021 Income by Age (Esri)', '2026 Income by Age (Esri)',
       '2021 Disposable Income by Age (Esri)', 'Age: 5 Year Increments',
       '2021 Net Worth by Age (Esri)',
       '2010 Households by Age of Householder (U.S. Census)',
       '2021 Age by Sex by Race (Esri)', '2026 Age by Sex by Race (Esri)',
       '2010 Age by Sex by Race (U.S. Census)',
       '2010 Housing by Age of Householder (U.S. Census)',
       '2015-2019 Income by Age (ACS)'], dtype=object)

__Data Collections for Population__

In [12]:
Pop_Collections = usa_df['fieldCategory'].str.contains('Population', na=False)
usa_df[Pop_Collections].fieldCategory.unique()

array(['2015-2019 Population 65+ by Relationship and Household Type (ACS)',
       '2010 Population (U.S. Census)',
       '2015-2019 Population by Language Spoken at Home (ACS)',
       '2021 Daytime Population (Esri)',
       '2021 Population by Generation (Esri)',
       '2026 Population by Generation (Esri)',
       '2021 Population Time Series (Esri)', '2021 Population (Esri)',
       '2010 Group Quarters Population (U.S. Census)',
       '2010 Population by Relationship and Household Type (U.S. Census)'],
      dtype=object)

__Data Collections for Income__

In [13]:
Income_Collections = usa_df['fieldCategory'].str.contains('Income', na=False)
Income_Collections.index.unique()

Index(['1yearincrements', '5yearincrements', 'ACS_Housing_Summary_rep',
       'ACS_Population_Summary_rep', 'Age', 'AgeDependency',
       'Age_50_Profile_rep', 'Age_by_Sex_Profile_rep',
       'Age_by_Sex_by_Race_Profile_rep', 'AtRisk',
       ...
       'transportation', 'travelMPI', 'unitsinstructure',
       'urbanizationgroupsNEW', 'vacant', 'vehiclesavailable', 'veterans',
       'women', 'yearbuilt', 'yearmovedin'],
      dtype='object', name='dataCollectionID', length=149)

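Note that the cell above calls `.index.unique()` on the boolean mask itself. The mask has one entry per row of `usa_df`, so this returns every `dataCollectionID` (all 149) rather than only the income-related ones. To get the matching collections, apply the mask to the dataframe first.

As mentioned earlier, using a data collection's unique ID (`dataCollectionID`) is the best way to further query a data collection. Let's look at the `dataCollectionID` for various Income data collections.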

In [14]:
usa_df[Income_Collections].index.unique()

Index(['ACS_Population_Summary_rep', 'Age_50_Profile_rep', 'AtRisk',
       'Community_Profile_rep',
       'Demographic_and_Income_Comparison_Profile_rep',
       'Demographic_and_Income_Profile_rep', 'Disposable_Income_Profile_rep',
       'Graphic_Profile_rep', 'Health', 'Household_Income_Profile_rep',
       'KeyUSFacts', 'Market_Profile_rep', 'Policy',
       'Retail_MarketPlace_Profile_rep', 'Wealth', 'basicFactsForMobileApps',
       'disposableincome', 'foodstampsSNAP', 'householdincome', 'households',
       'incomebyage', 'population'],
      dtype='object', name='dataCollectionID')
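
The difference between the index of a boolean mask and the index of the filtered dataframe is easy to try on a small hand-built frame. In this sketch the IDs and category strings are taken from outputs in this guide, but which category belongs to which ID is simplified for illustration.

```python
import pandas as pd

# Toy stand-in for usa_df. The IDs and category strings appear in the outputs
# above, but the pairing between them is an assumption made for illustration.
toy_df = pd.DataFrame(
    {
        "fieldCategory": [
            "2021 Income by Age (Esri)",
            "2021 Disposable Income by Age (Esri)",
            "2021 Population (Esri)",
            None,  # a missing category, which na=False treats as "no match"
        ]
    },
    index=pd.Index(
        ["incomebyage", "disposableincome", "population", "vacant"],
        name="dataCollectionID",
    ),
)

mask = toy_df["fieldCategory"].str.contains("Income", na=False)

# The mask has one entry per row, so its index lists every collection.
all_ids = list(mask.index.unique())

# Indexing the dataframe with the mask keeps only the matching rows.
income_ids = list(toy_df[mask].index.unique())

print(all_ids)
print(income_ids)
```

Passing `na=False` matters when a category is missing: without it the mask would contain missing values, which cannot be used directly for boolean indexing.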

#### Analysis variables for Data Collections
Once we know the data collection we would like to use, we can look at all the unique variables available in that data collection using its unique ID. Let's discover `analysisVariable`s for some of the data collections.

__Analysis variables for `Age` data collection__

In [15]:
usa_df.loc['Age']['analysisVariable'].unique()

array(['Age.MALE0', 'Age.MALE5', 'Age.MALE10', 'Age.MALE15', 'Age.MALE20',
       'Age.MALE25', 'Age.MALE30', 'Age.MALE35', 'Age.MALE40',
       'Age.MALE45', 'Age.MALE50', 'Age.MALE55', 'Age.MALE60',
       'Age.MALE65', 'Age.MALE70', 'Age.MALE75', 'Age.MALE80',
       'Age.MALE85', 'Age.FEM0', 'Age.FEM5', 'Age.FEM10', 'Age.FEM15',
       'Age.FEM20', 'Age.FEM25', 'Age.FEM30', 'Age.FEM35', 'Age.FEM40',
       'Age.FEM45', 'Age.FEM50', 'Age.FEM55', 'Age.FEM60', 'Age.FEM65',
       'Age.FEM70', 'Age.FEM75', 'Age.FEM80', 'Age.FEM85'], dtype=object)

Analysis variables are typically represented as `dataCollectionID.<analysis variable name>` as seen above.
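
As a small illustration of that naming scheme, the sketch below splits variable IDs into their collection and variable name and groups them by collection. `split_variable_id` is a hypothetical helper, and splitting on the first dot assumes that collection IDs themselves never contain a dot.

```python
def split_variable_id(variable_id):
    """Split '<dataCollectionID>.<name>' into (collection, name).

    Assumes the collection ID contains no dot, so the first dot separates
    the two parts.
    """
    collection, _, name = variable_id.partition(".")
    if not collection or not name:
        raise ValueError(
            f"expected '<dataCollectionID>.<name>', got {variable_id!r}"
        )
    return collection, name


# Variable IDs taken from outputs shown earlier in this guide.
variables = ["Age.MALE0", "Age.FEM45", "1yearincrements.AGE0_CY"]

by_collection = {}
for variable in variables:
    collection, name = split_variable_id(variable)
    by_collection.setdefault(collection, []).append(name)

print(by_collection)
```

The short name after the dot is also what appears as a column name in the enrichment results later in this guide (for example `FEM45`).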

__Analysis variables for `Age_by_Sex_by_Race_Profile_rep` data collection__

In [16]:
usa_df.loc['Age_by_Sex_by_Race_Profile_rep']['analysisVariable'].unique()

array(['Age_by_Sex_by_Race_Profile_rep.MEDMAGE_CY',
       'Age_by_Sex_by_Race_Profile_rep.MEDFAGE_CY',
       'Age_by_Sex_by_Race_Profile_rep.TOTPOP_FY',
       'Age_by_Sex_by_Race_Profile_rep.TOTHH_FY',
       'Age_by_Sex_by_Race_Profile_rep.POPGRWCYFY',
       'Age_by_Sex_by_Race_Profile_rep.HHGRWCYFY',
       'Age_by_Sex_by_Race_Profile_rep.MEDAGE_FY',
       'Age_by_Sex_by_Race_Profile_rep.MEDMAGE_FY',
       'Age_by_Sex_by_Race_Profile_rep.MEDFAGE_FY',
       'Age_by_Sex_by_Race_Profile_rep.TOTPOP10',
       'Age_by_Sex_by_Race_Profile_rep.MEDAGE10',
       'Age_by_Sex_by_Race_Profile_rep.MEDMAGE10',
       'Age_by_Sex_by_Race_Profile_rep.MEDFAGE10',
       'Age_by_Sex_by_Race_Profile_rep.TOTHH10'], dtype=object)

__Analysis variables for `DaytimePopulation` data collection__

In [17]:
usa_df.loc['DaytimePopulation']['analysisVariable'].unique()

array(['DaytimePopulation.DPOP_CY', 'DaytimePopulation.DPOPWRK_CY',
       'DaytimePopulation.DPOPRES_CY', 'DaytimePopulation.DPOPDENSCY'],
      dtype=object)

### Data Collections for Another Country

Let's look at data collections for New Zealand. [Data Browser](https://doc.arcgis.com/en/esri-demographics/data/data-browser.htm) can be used to examine the entire global listing of variables, and associated datasets for New Zealand.

As with the U.S., first get a reference to the country using the `Country.get()` method, and then fetch the data collections from its `data_collections` property. Once we know the data collection we would like to use, we can look at the `analysisVariable`s available in that data collection.

In [18]:
# Get New Zealand as a country
nz = Country.get('New Zealand')
type(nz)

arcgis.geoenrichment.enrichment.Country

In [19]:
nz_df = nz.data_collections

# print a few rows of the DataFrame
nz_df.head()

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
15YearIncrements,15YearIncrements.PAGE01_CY,2020 Total Population Age 0-14,2020 Population Totals (MBR),2020
15YearIncrements,15YearIncrements.PAGE02_CY,2020 Total Population Age 15-29,2020 Population Totals (MBR),2020
15YearIncrements,15YearIncrements.PAGE03_CY,2020 Total Population Age 30-44,2020 Population Totals (MBR),2020
15YearIncrements,15YearIncrements.PAGE04_CY,2020 Total Population Age 45-59,2020 Population Totals (MBR),2020
15YearIncrements,15YearIncrements.PAGE05_CY,2020 Total Population Age 60+,2020 Population Totals (MBR),2020


In [20]:
nz_df.shape

(193, 4)

#### Unique Data Collections for New Zealand

Let's get a list of unique data collections that are available for New Zealand.

In [21]:
nz_df.index.unique()

Index(['15YearIncrements', 'EducationalAttainment', 'Gender',
       'HouseholdTotals', 'HouseholdsbyIncome', 'HouseholdsbyType', 'KeyFacts',
       'KeyGlobalFacts', 'MaritalStatus', 'PopulationTotals',
       'PurchasingPower', 'Spending'],
      dtype='object', name='dataCollectionID')

New Zealand has 12 unique data collections.

We can look at the `fieldCategory` column to understand each category better.

In [22]:
nz_df.fieldCategory.unique()

array(['2020 Population Totals (MBR)',
       '2020 Male Population Totals (MBR)',
       '2020 Female Population Totals (MBR)',
       '2020 Educational Attainment (MBR)', '2020 Household Totals (MBR)',
       '2020 Households by Income (MBR)', '2020 Households by Type (MBR)',
       '2020 Marital Status (MBR)', '2020 Purchasing Power (MBR)',
       'Key Demographic Indicators', 'Age: 5 Year Increments',
       '2020 Food & Beverage Expenditures (MBR)',
       '2020 Alcoholic Beverage Expenditures (MBR)',
       '2020 Tobacco Expenditures (MBR)',
       '2020 Clothing Expenditures (MBR)',
       '2020 Footwear Expenditures (MBR)',
       '2020 Furniture & Furnishing Expenditures (MBR)',
       '2020 Household Textiles Expenditures (MBR)',
       '2020 Household Appliances Expenditures (MBR)',
       '2020 Household Utensils Expenditures (MBR)',
       '2020 House & Garden Expenditures (MBR)',
       '2020 Household Maintenance Expenditures (MBR)',
       '2020 Medical Products & Suppl

Looking at `fieldCategory` is a great way to clearly understand what the data collection is about. However, to query a data collection its unique ID (`dataCollectionID`) must be used.

#### Data Collections for Socio-demographic Factors

New Zealand has fewer data collections than the U.S. Let's look at the data collections for Key Facts, Education, and Spending.

__Data Collection for Key Facts__

In [23]:
nz_df.loc['KeyGlobalFacts']

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
KeyGlobalFacts,KeyGlobalFacts.TOTPOP,Total Population,Key Demographic Indicators,
KeyGlobalFacts,KeyGlobalFacts.TOTHH,Total Households,Key Demographic Indicators,
KeyGlobalFacts,KeyGlobalFacts.AVGHHSZ,Average Household Size,Key Demographic Indicators,
KeyGlobalFacts,KeyGlobalFacts.TOTMALES,Male Population,Age: 5 Year Increments,
KeyGlobalFacts,KeyGlobalFacts.TOTFEMALES,Female Population,Age: 5 Year Increments,


__Data Collection for Education__

In [24]:
nz_df.loc['EducationalAttainment']

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
EducationalAttainment,EducationalAttainment.EDUC01A_CY,2020 Pop 15+/Edu: No Qualification,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC02A_CY,2020 Pop 15+/Edu: Level 1,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC03A_CY,2020 Pop 15+/Edu: Level 2,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC04A_CY,2020 Pop 15+/Edu: Level 3,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC05A_CY,2020 Pop 15+/Edu: Level 4,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC06B_CY,2020 Pop 15+/Edu: Level 5 Diploma,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC07A_CY,2020 Pop 15+/Edu: Level 6 Diploma,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC08A_CY,2020 Pop 15+/Edu: Bachelor Degree,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC09A_CY,2020 Pop 15+/Edu: Post-graduate and Honours de...,2020 Educational Attainment (MBR),2020
EducationalAttainment,EducationalAttainment.EDUC10A_CY,2020 Pop 15+/Edu: Master's Degree,2020 Educational Attainment (MBR),2020


__Data Collection for Spending__

In [25]:
nz_df.loc['Spending']

dataCollectionID,analysisVariable,alias,fieldCategory,vintage
Spending,Spending.CS01_CY,2020 Food & Beverage: Total,2020 Food & Beverage Expenditures (MBR),2020
Spending,Spending.CS01PRM_CY,2020 Food & Beverage: Per Mill,2020 Food & Beverage Expenditures (MBR),2020
Spending,Spending.CSPC01_CY,2020 Food & Beverage: Per Capita,2020 Food & Beverage Expenditures (MBR),2020
Spending,Spending.CS01IDX_CY,2020 Food & Beverage: Index,2020 Food & Beverage Expenditures (MBR),2020
Spending,Spending.CS02_CY,2020 Alcoholic Beverage: Total,2020 Alcoholic Beverage Expenditures (MBR),2020
...,...,...,...,...
Spending,Spending.CS19IDX_CY,2020 Personal Care: Index,2020 Personal Care Expenditures (MBR),2020
Spending,Spending.CS20_CY,2020 Personal Effects: Total,2020 Jewelry & Personal Effects Expenditures (...,2020
Spending,Spending.CS20PRM_CY,2020 Personal Effects: Per Mill,2020 Jewelry & Personal Effects Expenditures (...,2020
Spending,Spending.CSPC20_CY,2020 Personal Effects: Per Capita,2020 Jewelry & Personal Effects Expenditures (...,2020


#### Analysis variables for Data Collections
Once we know the data collection we would like to use, we can look at all the unique variables available in that data collection using its unique ID. Let's discover `analysisVariable`s for some of the data collections we looked at earlier.

__Analysis variables for `KeyGlobalFacts` data collection__

In [26]:
nz_df.loc['KeyGlobalFacts']['analysisVariable'].unique()

array(['KeyGlobalFacts.TOTPOP', 'KeyGlobalFacts.TOTHH',
       'KeyGlobalFacts.AVGHHSZ', 'KeyGlobalFacts.TOTMALES',
       'KeyGlobalFacts.TOTFEMALES'], dtype=object)

__Analysis variables for `EducationalAttainment` data collection__

In [27]:
nz_df.loc['EducationalAttainment']['analysisVariable'].unique()

array(['EducationalAttainment.EDUC01A_CY',
       'EducationalAttainment.EDUC02A_CY',
       'EducationalAttainment.EDUC03A_CY',
       'EducationalAttainment.EDUC04A_CY',
       'EducationalAttainment.EDUC05A_CY',
       'EducationalAttainment.EDUC06B_CY',
       'EducationalAttainment.EDUC07A_CY',
       'EducationalAttainment.EDUC08A_CY',
       'EducationalAttainment.EDUC09A_CY',
       'EducationalAttainment.EDUC10A_CY',
       'EducationalAttainment.EDUC11A_CY',
       'EducationalAttainment.EDUC12A_CY',
       'EducationalAttainment.EDUC13_CY'], dtype=object)

__Analysis variables for `Spending` data collection__

In [28]:
nz_df.loc['Spending']['analysisVariable'].unique()

array(['Spending.CS01_CY', 'Spending.CS01PRM_CY', 'Spending.CSPC01_CY',
       'Spending.CS01IDX_CY', 'Spending.CS02_CY', 'Spending.CS02PRM_CY',
       'Spending.CSPC02_CY', 'Spending.CS02IDX_CY', 'Spending.CS03_CY',
       'Spending.CS03PRM_CY', 'Spending.CSPC03_CY', 'Spending.CS03IDX_CY',
       'Spending.CS04_CY', 'Spending.CS04PRM_CY', 'Spending.CSPC04_CY',
       'Spending.CS04IDX_CY', 'Spending.CS05_CY', 'Spending.CS05PRM_CY',
       'Spending.CSPC05_CY', 'Spending.CS05IDX_CY', 'Spending.CS06_CY',
       'Spending.CS06PRM_CY', 'Spending.CSPC06_CY', 'Spending.CS06IDX_CY',
       'Spending.CS07_CY', 'Spending.CS07PRM_CY', 'Spending.CSPC07_CY',
       'Spending.CS07IDX_CY', 'Spending.CS08_CY', 'Spending.CS08PRM_CY',
       'Spending.CSPC08_CY', 'Spending.CS08IDX_CY', 'Spending.CS09_CY',
       'Spending.CS09PRM_CY', 'Spending.CSPC09_CY', 'Spending.CS09IDX_CY',
       'Spending.CS10_CY', 'Spending.CS10PRM_CY', 'Spending.CSPC10_CY',
       'Spending.CS10IDX_CY', 'Spending.CS11_CY', 'S

### Perform Enrichment using Data Collections and Analysis Variables

Data collections can be used to enrich various study areas. Data collection IDs and analysis variable IDs can be passed to the `enrich()` method through its `data_collections` and `analysis_variables` parameters. Details about enriching study areas can be found in the __Enriching Study Areas__ section.

Let's look at a few similar examples of GeoEnrichment here.

#### Enrich using Data Collections

__Enrich with `Age` data collection__

Here we see an address being enriched by data from `Age` data collection.

In [29]:
# Enriching a single address given as single-line input
age_coll = enrich(study_areas=["380 New York St Redlands CA 92373"], 
                       data_collections=['Age'])

In [30]:
age_coll

Unnamed: 0,ID,OBJECTID,sourceCountry,X,Y,areaType,bufferUnits,bufferUnitsAlias,bufferRadii,aggregationMethod,...,FEM45,FEM50,FEM55,FEM60,FEM65,FEM70,FEM75,FEM80,FEM85,SHAPE
0,0,1,US,-117.19479,34.057265,RingBuffer,esriMiles,Miles,1,BlockApportionment:US.BlockGroups,...,366,382,369,342,315,264,162,100,125,"{""rings"": [[[-117.19479001927878, 34.071773611..."


In [31]:
age_coll.columns

Index(['ID', 'OBJECTID', 'sourceCountry', 'X', 'Y', 'areaType', 'bufferUnits',
       'bufferUnitsAlias', 'bufferRadii', 'aggregationMethod',
       'populationToPolygonSizeRating', 'apportionmentConfidence', 'HasData',
       'MALE0', 'MALE5', 'MALE10', 'MALE15', 'MALE20', 'MALE25', 'MALE30',
       'MALE35', 'MALE40', 'MALE45', 'MALE50', 'MALE55', 'MALE60', 'MALE65',
       'MALE70', 'MALE75', 'MALE80', 'MALE85', 'FEM0', 'FEM5', 'FEM10',
       'FEM15', 'FEM20', 'FEM25', 'FEM30', 'FEM35', 'FEM40', 'FEM45', 'FEM50',
       'FEM55', 'FEM60', 'FEM65', 'FEM70', 'FEM75', 'FEM80', 'FEM85', 'SHAPE'],
      dtype='object')

When a data collection is specified without specific analysis variables, all variables under the data collection are used for enrichment as can be seen above.

__Enrich with `Health` data collection__

Here we see a zip code being enriched by data from Health data collection.

In [32]:
redlands = usa.subgeographies.states['California'].zip5['92373']

In [33]:
redlands_df = enrich(study_areas=[redlands], data_collections=['Health'] )

In [34]:
redlands_df

Unnamed: 0,ID,OBJECTID,StdGeographyLevel,StdGeographyName,StdGeographyID,sourceCountry,aggregationMethod,populationToPolygonSizeRating,apportionmentConfidence,HasData,...,POP85_CY,POP18UP_CY,POP21UP_CY,MEDAGE_CY,HHU18C10,MEDHINC_CY,S27_BUS,S27_SALES,S27_EMP,SHAPE
0,0,1,US.ZIP5,Redlands,92373,US,Query:US.ZIP5,2.191,2.576,1,...,1096,27998,26855,41.3,4076,83235,205,298882,4310,"{""rings"": [[[-117.23452000005528, 34.065769999..."


In [35]:
redlands_df.columns

Index(['ID', 'OBJECTID', 'StdGeographyLevel', 'StdGeographyName',
       'StdGeographyID', 'sourceCountry', 'aggregationMethod',
       'populationToPolygonSizeRating', 'apportionmentConfidence', 'HasData',
       ...
       'POP85_CY', 'POP18UP_CY', 'POP21UP_CY', 'MEDAGE_CY', 'HHU18C10',
       'MEDHINC_CY', 'S27_BUS', 'S27_SALES', 'S27_EMP', 'SHAPE'],
      dtype='object', length=433)

#### Enrich using Analysis Variables

Data can also be enriched with only selected analysis variables of a data collection. In this example, we will look at the analysis variables of the `Age` data collection and then use a few of them to `enrich()` a study area.

In [36]:
# Unique analysis variables for Age data collection
usa = Country.get('US')
usa.data_collections.loc['Age']['analysisVariable'].unique()

array(['Age.MALE0', 'Age.MALE5', 'Age.MALE10', 'Age.MALE15', 'Age.MALE20',
       'Age.MALE25', 'Age.MALE30', 'Age.MALE35', 'Age.MALE40',
       'Age.MALE45', 'Age.MALE50', 'Age.MALE55', 'Age.MALE60',
       'Age.MALE65', 'Age.MALE70', 'Age.MALE75', 'Age.MALE80',
       'Age.MALE85', 'Age.FEM0', 'Age.FEM5', 'Age.FEM10', 'Age.FEM15',
       'Age.FEM20', 'Age.FEM25', 'Age.FEM30', 'Age.FEM35', 'Age.FEM40',
       'Age.FEM45', 'Age.FEM50', 'Age.FEM55', 'Age.FEM60', 'Age.FEM65',
       'Age.FEM70', 'Age.FEM75', 'Age.FEM80', 'Age.FEM85'], dtype=object)

Now, we will enrich our study area with the `Age.FEM45`, `Age.FEM55`, and `Age.FEM65` variables.

In [37]:
enrich(study_areas=["380 New York St Redlands CA 92373"], 
       analysis_variables=["Age.FEM45","Age.FEM55","Age.FEM65"])

Unnamed: 0,ID,OBJECTID,sourceCountry,X,Y,areaType,bufferUnits,bufferUnitsAlias,bufferRadii,aggregationMethod,populationToPolygonSizeRating,apportionmentConfidence,HasData,FEM45,FEM55,FEM65,SHAPE
0,0,1,US,-117.19479,34.057265,RingBuffer,esriMiles,Miles,1,BlockApportionment:US.BlockGroups,2.191,2.576,1,366,369,315,"{""rings"": [[[-117.19479001927878, 34.071773611..."


## Enriching Spatially Enabled Dataframes

One of the most common use cases for GeoEnrichment is enriching data that already exists in feature layers. A Spatially Enabled DataFrame (SeDF) brings the data from a layer into a dataframe, which can then be GeoEnriched.

Let's look at an example using an existing layer from a COVID-19 dataset. This feature layer includes the latest COVID-19 Cases, Recovered and Deaths data for the U.S. at the county level.

In [38]:
# Get the layer
gis = GIS(set_active=False)
covid_item = gis.content.get('628578697fb24d8ea4c32fa0c5ae1843')
print(covid_item)
covid_layer = covid_item.layers[0]
covid_layer

<Item title:"COVID-19 Cases US" type:Feature Layer Collection owner:CSSE_covid19>


<FeatureLayer url:"https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/ncov_cases_US/FeatureServer/0">

We can query the layer as a dataframe and then use the dataframe for enrichment.

In [39]:
covid_df = covid_layer.query(as_df=True)
covid_df.shape

(3271, 19)

In [40]:
covid_df.head()

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
0,1,Alabama,US,2021-08-27 20:21:31,32.539527,-86.644082,8550,,117,,Autauga,1001,"Autauga, Alabama, US",15303.656769,,,84001001,USA,"{""x"": -86.64408226999996, ""y"": 32.539527450000..."
1,2,Alabama,US,2021-08-27 20:21:31,30.72775,-87.722071,32054,,377,,Baldwin,1003,"Baldwin, Alabama, US",14358.92382,,,84001003,USA,"{""x"": -87.72207057999998, ""y"": 30.727749910000..."
2,3,Alabama,US,2021-08-27 20:21:31,31.868263,-85.387129,2948,,65,,Barbour,1005,"Barbour, Alabama, US",11941.991412,,,84001005,USA,"{""x"": -85.38712859999998, ""y"": 31.868263000000..."
3,4,Alabama,US,2021-08-27 20:21:31,32.996421,-87.125115,3385,,71,,Bibb,1007,"Bibb, Alabama, US",15115.655979,,,84001007,USA,"{""x"": -87.12511459999996, ""y"": 32.996420640000..."
4,5,Alabama,US,2021-08-27 20:21:31,33.982109,-86.567906,8413,,143,,Blount,1009,"Blount, Alabama, US",14548.81887,,,84001009,USA,"{""x"": -86.56790592999994, ""y"": 33.982109180000..."


To showcase GeoEnrichment, we will create a subset of the original data and then `enrich()` the subset.

In [41]:
# Create subset
test_df = covid_df.iloc[:100].copy()
test_df.shape

(100, 19)

In [42]:
# Check geometry
test_df.spatial.set_geometry('SHAPE')
test_df.spatial.geometry_type

['point', None]

A dataframe can be passed as a value to `study_areas` parameter of the `enrich()` method. Here we are enriching our dataframe with specific variables from `Age` data collection.

In [43]:
# Enrich dataframe
new_df = enrich(study_areas=test_df, 
       analysis_variables=["Age.FEM45","Age.FEM55","Age.FEM65"])

Exception: {'code': 10020007, 'message': "Parameter 'studyareas' is invalid.", 'details': []}

In [None]:
new_df.head()

In [None]:
new_df.drop(['OBJECTID_0', 'ID','Last_Update'], axis=1, inplace=True)

In [None]:
# Check shape
new_df.shape

We can see that enrichment resulted in 91 records and 31 columns. Enrichment information is not available for some areas in our dataframe, so we get 91 records instead of 100. GeoEnrichment adds some columns alongside the analysis variables we requested, which gives 31 columns; dropping the duplicate and unnecessary columns brings the count down to 28.
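
When fewer records come back than were submitted, it helps to know which ones are missing. A minimal sketch of one way to find them is an anti-join between the input and the enriched result. It uses hand-built frames and assumes a key column (`FIPS` here) is carried through to the enriched dataframe; the `FEM65` values are made up.

```python
import pandas as pd

# Toy input and result frames. FIPS is used as the join key on the assumption
# that input columns are carried through to the enriched result. The county
# names come from the table above; the FEM65 values are made up.
input_df = pd.DataFrame(
    {
        "FIPS": ["1001", "1003", "1005", "1007"],
        "Admin2": ["Autauga", "Baldwin", "Barbour", "Bibb"],
    }
)
enriched_df = pd.DataFrame(
    {"FIPS": ["1001", "1005", "1007"], "FEM65": [10, 20, 30]}
)

# A left merge with indicator=True labels rows that have no match on the right.
merged = input_df.merge(
    enriched_df[["FIPS"]].drop_duplicates(),
    on="FIPS",
    how="left",
    indicator=True,
)
missing = merged.loc[merged["_merge"] == "left_only", ["FIPS", "Admin2"]]

print(missing)
```

The same pattern could be applied to `test_df` and `new_df`, provided both share a column that uniquely identifies each record.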

### Visualize on a Map

Let's visualize the enriched dataframe on a map. We will use `FEM65` column to classify our data for plotting on the map.

In [None]:
covid_map = gis.map('USA', 4)
covid_map

In [None]:
# Plot on a map
covid_map.remove_layers()
new_df.spatial.plot(map_widget=covid_map,
                    renderer_type='c',  # for class breaks renderer,
                    method='esriClassifyNaturalBreaks',  # classification algorithm,
                    class_count=5,  # choose the number of classes,
                    col='FEM65',  # numeric column to classify,
                    cmap='viridis',  # color map to pick colors from for each class,
                    alpha=0.7)

## Conclusion

In this part of the `arcgis.geoenrichment` module guide series, you saw how the `data_collections` property of a `Country` object lists its available data collections and analysis variables. You explored different data collections and their analysis variables, and then used them to enrich study areas. Towards the end, you saw how Spatially Enabled DataFrames can be enriched.

In the subsequent pages, you will learn about Generating Reports and Standard Geography Queries.