# Enrich

While retrieving geographies is a good start, the real fuel for analysis is the demographic variables that quantitatively describe the people living in these geographic areas. Enrichment enables selecting and retrieving demographic variables for analysis.

Just as in the previous notebook, this starts with importing resources.

In [1]:
from pathlib import Path
import sys

project_parent = Path('./').absolute().parent
sys.path.append(str(project_parent/'src'))

from dm import Country, util


## Get Geographic Areas for Analysis Within an Area of Interest

Just as before, we want to be able to programmatically say, "_I want the lowest level of geographic areas in my metro area, Victoria._" Since we may or may not be intimately familiar with the resources available in Canada, we can use a few introspection techniques and tools along the way to get what we need.

First, we create an instance of the `Country` object and introspectively discover which geographic levels are available for analysis in Canada.

In [2]:
can = Country('CAN')

can.geographies

Unnamed: 0,geo_name,geo_alias,col_id,col_name,feature_class_path
0,disseminationareas,DisseminationAreas,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
1,censustracts,CensusTracts,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
2,fsas,FSAs,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
3,censussubdivisions,CensusSubdivisions,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
4,feds,FEDs,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
5,cmacas,CMACAs,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
6,censusdivisions,CensusDivisions,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
7,provinceterritories,ProvinceTerritories,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...
8,country,Country,ID,NAME,D:\ba_data\can\Data\Demographic Data\CAN_ESRI_...


Now, we can use the `get_names` method to see what is returned if we are looking for the Victoria Census Metropolitan Area and Census Agglomeration (CMACA).

In [3]:
metro_df = can.cmacas.get_names('victoria')

metro_df

0    Victoriaville
1         Victoria
Name: geo_name, dtype: object

Unlike searching for Seattle or Vancouver in the previous notebook, simply searching for Victoria returns multiple records, since Victoriaville also matches. The docstring includes a hint for how to start creating a custom query, and both the `get_names` and `get` methods accept an explicit query string for cases such as this.

In [4]:
metro_df = can.cmacas.get(query_string="NAME = 'Victoria'")

metro_df

Unnamed: 0,ID,NAME,SHAPE
0,935,Victoria,"{""rings"": [[[-123.54234999996063, 48.314970000..."
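
Why the plain search matched two records can be illustrated with pandas alone. The sketch below is not how `get_names` is implemented (that behavior is only assumed here for illustration); it contrasts a case-insensitive substring match with an exact match on a toy Series holding the two names returned above.

```python
import pandas as pd

# Toy stand-in for the CMACA names returned above; not read from the real data.
names = pd.Series(['Victoriaville', 'Victoria'], name='geo_name')

# A case-insensitive substring match finds both records...
partial = names[names.str.contains('victoria', case=False)]

# ...while an exact comparison, in the spirit of NAME = 'Victoria', keeps only one.
exact = names[names == 'Victoria']

print(partial.tolist())
print(exact.tolist())
```

An exact query is the safer choice whenever one place name is a prefix of another.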


Using the returned Spatially Enabled DataFrame, we can get the lowest level of geography, Dissemination Areas, for analysis. Once retrieved, we can investigate the properties, preview the table, and view the geography to ensure the results are what we expect.

In [5]:
lvl0_df = metro_df.level(0).get()

lvl0_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   ID      561 non-null    object  
 1   NAME    561 non-null    object  
 2   SHAPE   561 non-null    geometry
dtypes: geometry(1), object(2)
memory usage: 13.3+ KB


In [6]:
lvl0_df.head()

Unnamed: 0,ID,NAME,SHAPE
0,59170115,59170115,"{""rings"": [[[-123.36419999982431, 48.469490000..."
1,59170116,59170116,"{""rings"": [[[-123.36292999940484, 48.467920000..."
2,59170117,59170117,"{""rings"": [[[-123.36586000023107, 48.462860000..."
3,59170118,59170118,"{""rings"": [[[-123.36382999992844, 48.460949999..."
4,59170119,59170119,"{""rings"": [[[-123.36376000050655, 48.458879999..."


In [7]:
wm00 = lvl0_df.spatial.plot()
wm00.basemap = 'dark-gray-vector'
wm00

MapView(layout=Layout(height='400px', width='100%'))

## Enrich

From here, we can begin the process of enrichment. First, though, we need to decide which enrichment variables to use. Using an introspection property on the `Country` object, `enrich_variables`, we can investigate what is available.

In [8]:
e_vars = can.enrich_variables

e_vars

Unnamed: 0,name,alias,type,vintage,data_collection,enrich_str,enrich_field_name
0,A16AITFNAT,2016 First Nations Single Ident,COUNT,2016,AboriginalIdentity,AboriginalIdentity.A16AITFNAT,AboriginalIdentity_A16AITFNAT
1,A16AITIDT,2016 Aboriginal Identity,COUNT,2016,AboriginalIdentity,AboriginalIdentity.A16AITIDT,AboriginalIdentity_A16AITIDT
2,A16AITIDX,2016 Aboriginal Identities,COUNT,2016,AboriginalIdentity,AboriginalIdentity.A16AITIDX,AboriginalIdentity_A16AITIDX
3,A16AITINUK,2016 Inuk Single Identity,COUNT,2016,AboriginalIdentity,AboriginalIdentity.A16AITINUK,AboriginalIdentity_A16AITINUK
4,A16AITMETI,2016 Metis Single Identity,COUNT,2016,AboriginalIdentity,AboriginalIdentity.A16AITMETI,AboriginalIdentity_A16AITMETI
...,...,...,...,...,...,...,...
2568,P0YVISKOR,2029 VM: Korean,COUNT,2029,VisibleMinorityStatus,VisibleMinorityStatus.P0YVISKOR,VisibleMinorityStatus_P0YVISKOR
2569,P0YVISJAPA,2029 VM: Japanese,COUNT,2029,VisibleMinorityStatus,VisibleMinorityStatus.P0YVISJAPA,VisibleMinorityStatus_P0YVISJAPA
2570,P0YVISOVM,2029 VM: All Oth VM,COUNT,2029,VisibleMinorityStatus,VisibleMinorityStatus.P0YVISOVM,VisibleMinorityStatus_P0YVISOVM
2571,P0YVISMVM,2029 VM: Multiple VM,COUNT,2029,VisibleMinorityStatus,VisibleMinorityStatus.P0YVISMVM,VisibleMinorityStatus_P0YVISMVM


Frequently, I like to start my analysis and demonstrations with the current-year variables in the Key Facts data collection. Using Pandas selection methods, we can identify these variables.

In [9]:
key_vars = e_vars[(e_vars.data_collection.str.startswith('Key')) & (e_vars.vintage == '2022')]

key_vars

Unnamed: 0,name,alias,type,vintage,data_collection,enrich_str,enrich_field_name
1413,P3YPTAPOP,2022 Total Population,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YPTAPOP,KeyCanFacts_P3YPTAPOP
1414,P3YHFSCF,2022 Tot Census Fam HHs,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YHFSCF,KeyCanFacts_P3YHFSCF
1415,P3YHSZAVG,2022 Private HHs Avg Num Persons,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YHSZAVG,KeyCanFacts_P3YHSZAVG
1416,P3YHNIAVG,2022 HH Inc Average Curr$,CURRENCY,2022,KeyCanFacts,KeyCanFacts.P3YHNIAVG,KeyCanFacts_P3YHNIAVG
1417,P3YHNIMED,2022 HH Inc Median Curr$,CURRENCY,2022,KeyCanFacts,KeyCanFacts.P3YHNIMED,KeyCanFacts_P3YHNIMED
1418,P3YTENHHD,2022 Tenure - Total HHs,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YTENHHD,KeyCanFacts_P3YTENHHD
1419,P3YTENOWN,2022 Tenure - Owned,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YTENOWN,KeyCanFacts_P3YTENOWN
1420,P3YTENRENT,2022 Tenure - Rented,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YTENRENT,KeyCanFacts_P3YTENRENT
1421,P3YTENBAND,2022 Tenure - Band Housing,COUNT,2022,KeyCanFacts,KeyCanFacts.P3YTENBAND,KeyCanFacts_P3YTENBAND
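
The filter above compares `vintage` against the string `'2022'`, so it relies on that column holding strings. If a variables table stored the vintage as integers instead (a hypothetical case, not something observed in this data), casting to string first would keep the same filter working. A minimal sketch on a toy table, with names copied from the outputs above:

```python
import pandas as pd

# Toy subset of the variables table; the integer vintage column is hypothetical.
e_vars = pd.DataFrame({
    'name': ['A16AITIDT', 'P3YPTAPOP', 'P0YVISKOR'],
    'vintage': [2016, 2022, 2029],
    'data_collection': ['AboriginalIdentity', 'KeyCanFacts', 'VisibleMinorityStatus'],
    'enrich_str': [
        'AboriginalIdentity.A16AITIDT',
        'KeyCanFacts.P3YPTAPOP',
        'VisibleMinorityStatus.P0YVISKOR',
    ],
})

# Normalize the vintage to strings so the comparison does not depend on the dtype.
vintage = e_vars['vintage'].astype(str)

# Same selection logic as above: Key Facts collection, 2022 vintage.
mask = e_vars['data_collection'].str.startswith('Key') & (vintage == '2022')
key_vars = e_vars.loc[mask, 'enrich_str']

print(key_vars.tolist())
```

Checking `e_vars.dtypes` before filtering is a quick way to see which case applies.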


The `enrich` method needs an iterable of variables from the `enrich_str` column as input, so we can extract these from the above filtered Pandas DataFrame as a Pandas Series.

In [10]:
enrich_vars = key_vars.enrich_str

enrich_vars

1413     KeyCanFacts.P3YPTAPOP
1414      KeyCanFacts.P3YHFSCF
1415     KeyCanFacts.P3YHSZAVG
1416     KeyCanFacts.P3YHNIAVG
1417     KeyCanFacts.P3YHNIMED
1418     KeyCanFacts.P3YTENHHD
1419     KeyCanFacts.P3YTENOWN
1420    KeyCanFacts.P3YTENRENT
1421    KeyCanFacts.P3YTENBAND
Name: enrich_str, dtype: object

These variables can now be used to enrich the previously retrieved Dissemination Areas in Victoria. Once the enriched data is returned, we can also view the metadata and preview the tabular data.

In [11]:
lvl0_enrich_df = lvl0_df.spatial.enrich(enrich_vars)

lvl0_enrich_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   ID                      561 non-null    object  
 1   NAME                    561 non-null    object  
 2   KeyCanFacts_P3YPTAPOP   561 non-null    float64 
 3   KeyCanFacts_P3YHFSCF    561 non-null    float64 
 4   KeyCanFacts_P3YHSZAVG   561 non-null    float64 
 5   KeyCanFacts_P3YHNIAVG   561 non-null    float64 
 6   KeyCanFacts_P3YHNIMED   561 non-null    float64 
 7   KeyCanFacts_P3YTENHHD   561 non-null    float64 
 8   KeyCanFacts_P3YTENOWN   561 non-null    float64 
 9   KeyCanFacts_P3YTENRENT  561 non-null    float64 
 10  KeyCanFacts_P3YTENBAND  561 non-null    float64 
 11  SHAPE                   561 non-null    geometry
dtypes: float64(9), geometry(1), object(2)
memory usage: 52.7+ KB


In [12]:
lvl0_enrich_df.head()

Unnamed: 0,ID,NAME,KeyCanFacts_P3YPTAPOP,KeyCanFacts_P3YHFSCF,KeyCanFacts_P3YHSZAVG,KeyCanFacts_P3YHNIAVG,KeyCanFacts_P3YHNIMED,KeyCanFacts_P3YTENHHD,KeyCanFacts_P3YTENOWN,KeyCanFacts_P3YTENRENT,KeyCanFacts_P3YTENBAND,SHAPE
0,59170115,59170115,596.0,135.0,2.1,104074.17,83793.1,279.0,232.0,47.0,0.0,"{""rings"": [[[-123.36419999982431, 48.469490000..."
1,59170116,59170116,437.0,118.0,2.3,106194.13,93600.0,192.0,139.0,53.0,0.0,"{""rings"": [[[-123.36292999940484, 48.467920000..."
2,59170117,59170117,444.0,113.0,2.1,92093.08,66956.52,216.0,180.0,36.0,0.0,"{""rings"": [[[-123.36586000023107, 48.462860000..."
3,59170118,59170118,546.0,139.0,2.5,116164.33,97142.86,216.0,137.0,79.0,0.0,"{""rings"": [[[-123.36382999992844, 48.460949999..."
4,59170119,59170119,506.0,128.0,2.5,102794.3,90740.74,199.0,133.0,66.0,0.0,"{""rings"": [[[-123.36376000050655, 48.458879999..."
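
Because the enriched columns are ordinary numeric Pandas columns, derived measures are a single expression away. As an illustration only, the sketch below computes the share of owner-occupied households from the tenure counts. The `pct_owned` column name is made up for this example, and the values are the first two rows of the preview above.

```python
import numpy as np
import pandas as pd

# First two rows of the enriched preview above, limited to the tenure columns.
df = pd.DataFrame({
    'ID': ['59170115', '59170116'],
    'KeyCanFacts_P3YTENHHD': [279.0, 192.0],  # total households
    'KeyCanFacts_P3YTENOWN': [232.0, 139.0],  # owned households
})

# Replace zero household counts with NaN so empty areas do not divide by zero.
total_hh = df['KeyCanFacts_P3YTENHHD'].replace(0, np.nan)

# Hypothetical derived column: percent of households that are owned.
df['pct_owned'] = df['KeyCanFacts_P3YTENOWN'] / total_hh * 100

print(df[['ID', 'pct_owned']].round(1))
```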


## Consolidating and Chaining

Just like in the previous notebook, once we know the steps, all of the above can be significantly streamlined into a concise workflow that, on my computer, takes just over 10 seconds to execute.

In [13]:
%%time

can = Country('CAN')

e_vars = can.enrich_variables
key_vars = e_vars[e_vars.data_collection.str.startswith('Key') & (e_vars.vintage == '2022')].enrich_str

lvl0_enrich_df = can.cmacas.get(query_string="NAME = 'Victoria'").level(0).get().spatial.enrich(key_vars)

Wall time: 12.7 s


## Save Results

Finally, this data can easily be exported to a variety of formats for subsequent analysis. First, it can be exported to common tabular formats, which makes for easier analysis using machine learning toolsets on this machine.

In [14]:
lvl0_enrich_df.spatial.to_parquet('../data/interim/vancouver_lvl0_enrich.parquet')

It can also be exported to an Esri feature class, Esri's spatial tabular format. Since feature classes support field name aliases, we can use a utility function to add these much more human-readable names to the output data. This way, instead of seeing `KeyCanFacts_P3YPTAPOP` when viewing the data in ArcGIS, we will see `2022 Total Population`, which is much easier to understand.

In [15]:
lvl0_fc = '../data/interim/interim.gdb/vancouver_lvl0_enrich'

lvl0_enrich_df.spatial.to_featureclass(lvl0_fc)

util.add_enrich_aliases(lvl0_fc, can)
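
The same alias lookup can be handy inside Pandas, for example before printing a summary table. The sketch below assumes a variables table with the `enrich_field_name` and `alias` columns shown earlier. It does not use or reproduce `util.add_enrich_aliases`, and the toy rows are copied from the outputs above.

```python
import pandas as pd

# Toy subset of the variables table: field names and their readable aliases.
key_vars = pd.DataFrame({
    'alias': ['2022 Total Population', '2022 HH Inc Median Curr$'],
    'enrich_field_name': ['KeyCanFacts_P3YPTAPOP', 'KeyCanFacts_P3YHNIMED'],
})

# Toy subset of the enriched output, without the geometry column.
enriched = pd.DataFrame({
    'ID': ['59170115', '59170116'],
    'KeyCanFacts_P3YPTAPOP': [596.0, 437.0],
    'KeyCanFacts_P3YHNIMED': [83793.1, 93600.0],
})

# Map each enriched field name to its alias; columns without a match are left alone.
alias_lookup = dict(zip(key_vars['enrich_field_name'], key_vars['alias']))
readable = enriched.rename(columns=alias_lookup)

print(readable.columns.tolist())
```

Renaming a copy for display keeps the original field names intact for any later export.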