This notebook demonstrates internal functions the geosnap team uses to build its curated datasets from the U.S. Census Bureau. Most users will not need these, but we include the code here for transparency and reproducibility.

In [12]:
import fiona
from geosnap.io import get_census_gdb, convert_census_gdb
from geosnap.util import process_acs

The `get_census_gdb` function fetches geodatabases containing ACS demographic profile data (which contains most of the useful variables) and stores them locally for processing.

In [22]:
get_census_gdb?

Signature: get_census_gdb(years=None, geom_level='blockgroup', output_dir=None)
Docstring:
Fetch file geodatabases of ACS demographic profile data from the Census bureau server.

Parameters
----------
years : list, optional
    set of years to download (2010 onward), defaults to 2010-2018
geom_level : str, optional
    geographic unit to download (tract or blockgroup), by default "blockgroup"
output_dir : str, optional
    output directory to write files, by default None
File:      ~/Dropbox/projects/geosnap/geosnap/io/util.py
Type:      function


In [3]:
get_census_gdb(years=['2014'], output_dir=".", geom_level='blockgroup')

Using fiona, we can quickly list all the layers present in the geodatabase.

In [6]:
fiona.listlayers("ACS_2014_5YR_BG.gdb.zip")

['ACS_2014_5YR_BG',
 'BG_METADATA_2014',
 'X00_COUNTS',
 'X01_AGE_AND_SEX',
 'X02_RACE',
 'X03_HISPANIC_OR_LATINO_ORIGIN',
 'X07_MIGRATION',
 'X08_COMMUTING',
 'X09_CHILDREN_HOUSEHOLD_RELATIONSHIP',
 'X11_HOUSEHOLD_FAMILY_SUBFAMILIES',
 'X12_MARITAL_STATUS_AND_HISTORY',
 'X14_SCHOOL_ENROLLMENT',
 'X15_EDUCATIONAL_ATTAINMENT',
 'X16_LANGUAGE_SPOKEN_AT_HOME',
 'X17_POVERTY',
 'X19_INCOME',
 'X20_EARNINGS',
 'X21_VETERAN_STATUS',
 'X22_FOOD_STAMPS',
 'X23_EMPLOYMENT_STATUS',
 'X24_INDUSTRY_OCCUPATION',
 'X27_HEALTH_INSURANCE',
 'X99_IMPUTATION',
 'X25_HOUSING_CHARACTERISTICS']
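
The listing above mixes what appear to be three kinds of layers: a layer named after the geodatabase itself (`ACS_2014_5YR_BG`), a metadata table (`BG_METADATA_2014`), and the attribute tables, whose names start with `X` and a two-digit table group. As a minimal sketch, the attribute tables can be picked out by that prefix. The helper name `attribute_layers` is hypothetical (it is not part of geosnap), and the naming rule is an assumption inferred only from the listing above.

```python
import re

# Layer names copied from the listing above (abbreviated for the example).
layers = [
    "ACS_2014_5YR_BG",
    "BG_METADATA_2014",
    "X00_COUNTS",
    "X01_AGE_AND_SEX",
    "X99_IMPUTATION",
    "X25_HOUSING_CHARACTERISTICS",
]


def attribute_layers(names):
    """Hypothetical helper: keep layers named 'X<two digits>_...', sorted by table group."""
    pattern = re.compile(r"^X(\d{2})_")
    matched = [name for name in names if pattern.match(name)]
    # Sort numerically on the two-digit group so X25 comes before X99.
    return sorted(matched, key=lambda name: int(pattern.match(name).group(1)))


tables = attribute_layers(layers)
print(tables)
```

The resulting list has the same shape as the `layers` argument used later in this notebook, so it could be passed there in place of a hand-written list.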

File geodatabases can be convenient, but they are also painfully slow to process in Python, so the `convert_census_gdb` function converts the layers in a gdb to parquet files instead.

In [7]:
convert_census_gdb?

Signature:
convert_census_gdb(
    file,
    layers,
    year=None,
    level='bg',
    save_intermediate=True,
    output_dir='.',
)
Docstring:
Convert file geodatabases from Census into (set of) parquet files.

Parameters
----------
file : str
    path to file geodatabase
layers : list
    set of layers to extract from gdb
year : str, optional
    [description], by default None
level : str, optional
    geographic level of data ('bg' for blockgroups or 'tr' for tract), by default "bg"
save_intermediate : bool, optional
    if true, each layer will be stored 

In [8]:
convert_census_gdb(file='ACS_2014_5YR_BG.gdb.zip', save_intermediate=True, layers=['X00_COUNTS'], year='2014', level='tract')

X00_COUNTS


In [9]:
import pandas as pd

In [10]:
pd.read_parquet('acs_2014_X00_COUNTS_tract.parquet')

GEOID,B00001_001E,B00002_001E
15000US020130001001,283.0,89.0
15000US020130001002,410.0,138.0
15000US020130001003,473.0,183.0
15000US020160001001,418.0,486.0
15000US020160002001,561.0,211.0
...,...,...
15000US560459511001,223.0,124.0
15000US560459511002,234.0,63.0
15000US560459513001,95.0,46.0
15000US560459513002,92.0,46.0
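
The `GEOID` values in this table follow the Census convention of a summary-level prefix, the literal `US`, and a FIPS-based identifier. The `150` summary level is the Census code for block groups, so these rows are block groups (as expected from a `_BG` geodatabase), even though the file name carries the `tract` label passed as `level`. The sketch below splits one identifier into its parts; `parse_blockgroup_geoid` and `BlockGroupGeoid` are hypothetical names used for illustration, not geosnap functions.

```python
from typing import NamedTuple


class BlockGroupGeoid(NamedTuple):
    summary_level: str  # "150" denotes block groups
    state: str          # 2-digit state FIPS code
    county: str         # 3-digit county FIPS code
    tract: str          # 6-digit tract code
    block_group: str    # 1-digit block group code


def parse_blockgroup_geoid(geoid: str) -> BlockGroupGeoid:
    """Hypothetical helper: split a block group GEOID such as '15000US020130001001'."""
    prefix, sep, fips = geoid.partition("US")
    # A block group identifier has exactly 12 digits after 'US'.
    if not sep or len(fips) != 12 or not fips.isdigit():
        raise ValueError(f"not a block group GEOID: {geoid!r}")
    return BlockGroupGeoid(
        summary_level=prefix[:3],
        state=fips[:2],
        county=fips[2:5],
        tract=fips[5:11],
        block_group=fips[11],
    )


parsed = parse_blockgroup_geoid("15000US020130001001")
print(parsed)
```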


If we wanted to convert all available data, we could pass `fiona.listlayers("ACS_2014_5YR_BG.gdb.zip")` to the `layers` argument of `convert_census_gdb`.
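
A minimal sketch of that call is shown here, wrapped in a hypothetical helper named `convert_all_layers` (not part of geosnap). It uses only the `convert_census_gdb` parameters shown in the signature above and assumes the geodatabase has already been downloaded to `gdb_path`; the `level="bg"` default follows the docstring rather than the earlier cell.

```python
def convert_all_layers(gdb_path, year, level="bg", output_dir="."):
    """Hypothetical wrapper: convert every layer in a Census geodatabase.

    Assumes fiona and geosnap are installed and that `gdb_path` points to a
    geodatabase previously fetched with `get_census_gdb`.
    """
    # Imported inside the function so that defining this sketch does not
    # require the optional dependencies to be installed.
    import fiona
    from geosnap.io import convert_census_gdb

    # Ask fiona for every layer name instead of listing them by hand.
    layers = fiona.listlayers(gdb_path)
    convert_census_gdb(
        file=gdb_path,
        layers=layers,
        year=year,
        level=level,
        save_intermediate=True,
        output_dir=output_dir,
    )
    return layers


# Example call (commented out because it needs the downloaded geodatabase):
# convert_all_layers("ACS_2014_5YR_BG.gdb.zip", year="2014")
```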

The resulting (combined) parquet files can be processed with `process_acs` to generate the datasets described in the geosnap codebook. Note that unless all layers from the geodatabase are processed, several variables will be unavailable.

In [15]:
df = pd.read_parquet('acs_2014_bg.parquet')

In [20]:
df.index.name='GEOID'
df.reset_index(inplace=True)

In [21]:
process_acs(df)

n_mexican_pop=B03001_004E name 'B03001_004E' is not defined
n_cuban_pop=B03001_006E name 'B03001_006E' is not defined
n_puerto_rican_pop=B03001_005E name 'B03001_005E' is not defined
n_russian_pop=B04004_064E name 'B04004_064E' is not defined
n_italian_pop=B04004_051E name 'B04004_051E' is not defined
n_german_pop=B04004_042E name 'B04004_042E' is not defined
n_irish_pop=B04004_049E name 'B04004_049E' is not defined
n_scandaniavian_pop=B04004_065E name 'B04004_065E' is not defined
n_foreign_born_pop=B05002_013E name 'B05002_013E' is not defined
n_recent_immigrant_pop=B05005_007E name 'B05005_007E' is not defined
n_naturalized_pop=B05002_014E name 'B05002_014E' is not defined
n_age_5_older=B16001_001E name 'B16001_001E' is not defined
n_other_language=B16001_001E - B16001_002E name 'B16001_001E' is not defined
n_limited_english=DP02_0113E name 'DP02_0113E' is not defined
n_russian_born_pop=B05006_040E name 'B05006_040E' is not defined
n_italian_born_pop=B05006_023E name 'B05006_023E' is

15000US020130001001
15000US020130001002
15000US020130001003
15000US020160001001
15000US020160002001
...
15000US560459511001
15000US560459511002
15000US560459513001
15000US560459513002
15000US560459513003
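
Each message printed by `process_acs` above pairs a codebook variable with the formula that could not be evaluated because an input column was absent. As a rough sketch of how one might check this before calling `process_acs`, the code below extracts ACS column codes from formula strings and reports which are missing. The helper `missing_inputs`, the regular expression, and the sample column list are assumptions for illustration; this is not how geosnap itself evaluates formulas.

```python
import re

# Matches ACS column codes such as B03001_004E or DP02_0113E.
ACS_CODE = re.compile(r"\b[A-Z]+\d+[A-Z]*_\d+[A-Z]+\b")


def missing_inputs(formulas, columns):
    """Hypothetical helper: map each variable to the input columns it lacks."""
    available = set(columns)
    report = {}
    for name, formula in formulas.items():
        needed = ACS_CODE.findall(formula)
        absent = sorted(set(needed) - available)
        if absent:
            report[name] = absent
    return report


# Formulas copied from the messages above.
formulas = {
    "n_mexican_pop": "B03001_004E",
    "n_age_5_older": "B16001_001E",
    "n_other_language": "B16001_001E - B16001_002E",
}

# Illustrative column list; a real one would come from df.columns.
columns = ["B00001_001E", "B00002_001E", "B16001_001E"]

report = missing_inputs(formulas, columns)
print(report)
```

Variables whose inputs are all present (here `n_age_5_older`) are left out of the report, so an empty result would suggest that every listed formula can be computed.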
