Discovering Census Variables
============================

With more than 20,000 fields, the Census can be hard to manage and difficult to
investigate with the API. The Census maintains a number of tools
that can help you explore the data, tables, fields, and geographies.
**<https://data.census.gov/>** is a good place to start.

The Census also maintains a list of fields that you can retrieve through the API,
which contains the exact variable names used for queries.

This lab demonstrates how to use a light wrapper that makes that data easier
to search. The package is called `maptools`, and the functions we will be
using are in `census_vars`. You can see
the [code on GitHub](https://github.com/mcuringa/cartopy/raw/refs/heads/main/dist/maptools-latest.tar.gz).

**[Watch the video walkthrough on YouTube](https://youtu.be/hBlp2KaEgEw)**


In [1]:
# install this very alpha library directly from github
# !pip install https://github.com/mcuringa/cartopy/raw/refs/heads/main/dist/maptools-latest.tar.gz -q
!pip install census us mapclassify -q

from census import Census
import us
import pandas as pd
import geopandas as gpd
from maptools import census_vars

# from google.colab import userdata
# api_key = userdata.get('CENSUS_API_KEY')
import os
api_key = os.getenv('CENSUS_API_KEY')
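
`os.getenv` returns `None` when the variable is not set, and the problem then only shows up later when a request fails. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of `maptools` or the `census` package.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; it is not part of any library
    used in this lab.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell (or add it to "
            "your notebook secrets) before running the cells below."
        )
    return value
```

With this in place, `api_key = require_env('CENSUS_API_KEY')` stops the notebook at the first cell if the key is missing.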


Searching Census Tables
=======================
This search isn't great, but it can help find census tables based on
keywords and phrases. The query doesn't need to be an exact match to get a result.

`results` is an optional parameter that tells the search how many
rows to return. The default is 20.
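
Approximate matching of this kind can be sketched with the standard library. The code below is not how `maptools` implements its search. It is an illustration that scores each query word against the words of a concept with `difflib.SequenceMatcher` and averages the result. The `CONCEPTS` list and the `fuzzy_search` and `word_score` names are assumptions made for the example; only the `results` parameter name mirrors the description above.

```python
from difflib import SequenceMatcher

# Hypothetical sample of (group, concept) pairs, not the real variable list.
CONCEPTS = [
    ("C16001", "Language Spoken At Home For The Population 5 Years And Over"),
    ("B08303", "Travel Time To Work"),
    ("B27001", "Health Insurance Coverage Status By Sex By Age"),
]


def word_score(word, concept_words):
    """Best similarity (0 to 1) between one query word and any concept word."""
    return max(SequenceMatcher(None, word, w).ratio() for w in concept_words)


def fuzzy_search(query, concepts=CONCEPTS, results=20):
    """Return up to `results` (group, concept, score) rows, best match first."""
    query_words = query.lower().split()
    if not query_words:
        return []
    rows = []
    for group, concept in concepts:
        concept_words = concept.lower().split()
        # Average the per-word scores so word order and typos matter less.
        score = sum(word_score(w, concept_words) for w in query_words) / len(query_words)
        rows.append((group, concept, score))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:results]


top = fuzzy_search("home language", results=2)
for group, concept, score in top:
    print(f"{group}  {score:.0%}  {concept}")
```

Because the score is averaged per word, a misspelled query such as `"langauge home"` still ranks the language table first in this sample.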


In [7]:
census_vars.search("home language")


Unnamed: 0,group,concept,match
25514,C16001,Language Spoken At Home For The Population 5 Years And Over,60.44%
25950,B16007,Age By Language Spoken At Home For The Population 5 Years And Over,58.11%
17374,B99162,Allocation Of Language Spoken At Home For The Population 5 Years And Over,57.28%
6449,B16009,Poverty Status In The Past 12 Months By Age By Language Spoken At Home For The Population 5 Years And Over,48.41%
15105,B16001,Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,46.01%
5416,B16004,Age By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,44.51%
24629,B16010,Educational Attainment And Employment Status By Language Spoken At Home For The Population 25 Years And Over,43.42%
2238,B26113,Group Quarters Type (3 Types) By Language Spoken At Home By Ability To Speak English,43.21%
13458,B26213,Group Quarters Type (5 Types) By Language Spoken At Home By Ability To Speak English,43.21%
13384,B16005,Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,42.92%


Finding Field Names
===================
`census_vars` has a function called `get_table()` which
looks up all of the variables in a table based on the
group variable name.

In the example above, we see that the fields for the top match,
**Language Spoken At Home For The Population 5 Years And Over**,
are all in group `C16001`.

By default, `get_table()` returns a `dict` of field names,
where the keys are the census variables and the values
are suggested column names.

If you pass the `as_dict=False` argument, you will see
a `DataFrame` with data for each field in the table.
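
To make the idea of a suggested column name concrete, here is a rough sketch of how a Census variable label could be turned into one. It assumes labels follow the `!!`-separated style used in the API's variable listings (for example `Estimate!!Total:!!Male:`). `label_to_column` is a hypothetical helper, and `maptools` may derive its names differently.

```python
import re


def label_to_column(label):
    """Convert a Census variable label into a snake_case column name.

    Hypothetical helper for illustration; assumes "!!"-separated labels.
    """
    parts = [p.strip().rstrip(":") for p in label.split("!!")]
    # Drop the leading "Estimate" marker.
    if parts and parts[0].lower() == "estimate":
        parts = parts[1:]
    # Drop "Total" unless it is the only part left.
    if len(parts) > 1 and parts[0].lower() == "total":
        parts = parts[1:]
    text = " ".join(parts).lower()
    # Remove punctuation but keep hyphens, then join words with underscores.
    text = re.sub(r"[^a-z0-9\s-]", "", text)
    return re.sub(r"\s+", "_", text.strip())


# Assumed example labels keyed by variable name.
labels = {
    "B27001_001E": "Estimate!!Total:",
    "B27001_004E": "Estimate!!Total:!!Male:!!Under 6 years:!!With health insurance coverage",
}
columns = {var: label_to_column(lbl) for var, lbl in labels.items()}
print(columns)
```

A dict shaped like `columns` can be passed straight to `DataFrame.rename(columns=...)`.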

In [6]:
# get as a table
# census_vars.get_table("C16001", as_dict=False)

# get it as a fields dict

field_names = census_vars.get_table("B27001")
field_names

{'B27001_001E': 'total',
 'B27001_002E': 'male',
 'B27001_003E': 'male_under_6_years',
 'B27001_004E': 'male_under_6_years_with_health_insurance_coverage',
 'B27001_005E': 'male_under_6_years_no_health_insurance_coverage',
 'B27001_006E': 'male_6_to_18_years',
 'B27001_007E': 'male_6_to_18_years_with_health_insurance_coverage',
 'B27001_008E': 'male_6_to_18_years_no_health_insurance_coverage',
 'B27001_009E': 'male_19_to_25_years',
 'B27001_010E': 'male_19_to_25_years_with_health_insurance_coverage',
 'B27001_011E': 'male_19_to_25_years_no_health_insurance_coverage',
 'B27001_012E': 'male_26_to_34_years',
 'B27001_013E': 'male_26_to_34_years_with_health_insurance_coverage',
 'B27001_014E': 'male_26_to_34_years_no_health_insurance_coverage',
 'B27001_015E': 'male_35_to_44_years',
 'B27001_016E': 'male_35_to_44_years_with_health_insurance_coverage',
 'B27001_017E': 'male_35_to_44_years_no_health_insurance_coverage',
 'B27001_018E': 'male_45_to_54_years',
 'B27001_019E': 'male_45_to_54_ye

In [4]:
# let's look up these fields at the State level using the census API
c = Census(api_key)

# field_names = {
#   'C16001_001E': 'total',
#   'C16001_002E': 'speak_only_english',
#   'C16001_003E': 'spanish',
#   'C16001_006E': 'french_haitian_or_cajun',
#   'C16001_009E': 'german_or_other_west_germanic_languages',
#   'C16001_012E': 'russian_polish_other_slavic_languages',
#   'C16001_015E': 'other_indo-european_languages',
#   'C16001_018E': 'korean',
#   'C16001_021E': 'chinese',
#   'C16001_024E': 'vietnamese',
#   'C16001_027E': 'tagalog',
#   'C16001_030E': 'other_asian_and_pacific_island_languages',
#   'C16001_033E': 'arabic',
#   'C16001_036E': 'other_languages'
# }

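# the API expects a list of variable names
fields = list(field_names.keys())

# request every field for all states from the 2022 ACS 5-year estimates
data = c.acs5.get(fields=fields, geo={'for': 'state:*'}, year=2022)

df = pd.DataFrame(data)
# swap the census variable names for readable column names
df.rename(columns=field_names, inplace=True)
# keep the FIPS code in statefp, then replace state with the lookup_state result
df["statefp"] = df["state"]
df["state"] = df.statefp.apply(census_vars.lookup_state)

df.head(10)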

Unnamed: 0,total,with_an_internet_subscription,with_an_internet_subscription_dial_up_alone,with_an_internet_subscription_broadband_such_as_cable_fiber_optic_or_dsl,with_an_internet_subscription_satellite_internet_service,with_an_internet_subscription_other_service,internet_access_without_a_subscription,no_internet_access,state,statefp
0,1933150.0,1625807.0,6090.0,1221985.0,170518.0,18608.0,51456.0,255887.0,AL,1
1,264376.0,236875.0,610.0,182105.0,15342.0,2792.0,5635.0,21866.0,AK,2
2,2739136.0,2448838.0,5550.0,2027802.0,226635.0,27356.0,62512.0,227786.0,AZ,4
3,1171694.0,968272.0,2584.0,706838.0,98526.0,10378.0,36213.0,167209.0,AR,5
4,13315822.0,12195945.0,18361.0,10310555.0,1084864.0,106354.0,263836.0,856041.0,CA,6
5,2278044.0,2095757.0,4374.0,1793323.0,174866.0,21740.0,49493.0,132794.0,CO,8
6,1409807.0,1272555.0,2700.0,1112601.0,48639.0,9132.0,29921.0,107331.0,CT,9
7,389000.0,352105.0,673.0,302258.0,17403.0,3316.0,7910.0,28985.0,DE,10
8,315785.0,281424.0,292.0,248241.0,11014.0,2980.0,8343.0,26018.0,DC,11
9,8353441.0,7429632.0,12312.0,6270480.0,500248.0,92556.0,241949.0,681860.0,FL,12
