Discovering Census Variables
============================

With more than 20,000 fields, the Census can be hard to manage and difficult to
investigate with the API. The Census maintains a number of tools
that can help you explore the data, tables, fields, and geographies.
**<https://data.census.gov/>** is a good place to start.

They also maintain a list of the fields that you can retrieve through the API,
which contains the exact variable names used for queries.
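
As a rough illustration of what that list looks like, the sketch below flattens a `variables.json`-style payload into simple rows. The endpoint URL assumes the 2022 ACS 5-year dataset (the one queried later in this lab), the helper names `fetch_variables` and `flatten_variables` are hypothetical, and the sample payload is a hand-made excerpt rather than real API output.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: the variable list for the 2022 ACS 5-year dataset.
VARIABLES_URL = "https://api.census.gov/data/2022/acs/acs5/variables.json"


def fetch_variables(url=VARIABLES_URL):
    """Download the variable list (needs network access; not called below)."""
    with urlopen(url) as response:
        return json.load(response)


def flatten_variables(payload):
    """Turn {"variables": {name: {...}}} into a list of flat rows sorted by name."""
    rows = []
    for name, info in payload["variables"].items():
        rows.append({
            "name": name,
            "group": info.get("group"),
            "label": info.get("label"),
            "concept": info.get("concept"),
        })
    return sorted(rows, key=lambda row: row["name"])


# Hand-made excerpt in the same shape as the real file (illustrative labels).
sample_payload = {
    "variables": {
        "C16001_003E": {
            "label": "Estimate!!Total:!!Spanish:",
            "concept": "Language Spoken at Home for the Population 5 Years and Over",
            "group": "C16001",
        },
        "C16001_001E": {
            "label": "Estimate!!Total:",
            "concept": "Language Spoken at Home for the Population 5 Years and Over",
            "group": "C16001",
        },
    }
}

variable_rows = flatten_variables(sample_payload)
for row in variable_rows:
    print(row["name"], row["group"], row["label"])
```

Calling `flatten_variables(fetch_variables())` would apply the same flattening to the full list, which is large enough that a searchable wrapper becomes useful.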

This lab demonstrates how to use a light wrapper that makes that data easier to
search. The package is called `maptools`, and the functions we will be using
are in `census_vars`. You can download the
[package source from GitHub](https://github.com/mcuringa/cartopy/raw/refs/heads/main/dist/maptools-latest.tar.gz).

**[Watch the video walkthrough on YouTube](https://youtu.be/hBlp2KaEgEw)**


In [3]:
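# Install maptools, a very alpha library, directly from GitHub.
# Uncomment the next line the first time you run this notebook,
# because the `from maptools import census_vars` import below depends on it.
# !pip install https://github.com/mcuringa/cartopy/raw/refs/heads/main/dist/maptools-latest.tar.gz -q
!pip install census us mapclassify -q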

from census import Census
import us
import pandas as pd
import geopandas as gpd
from maptools import census_vars

# from google.colab import userdata
# api_key = userdata.get('CENSUS_API_KEY')
import os
api_key = os.getenv('CENSUS_API_KEY')
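
If the environment variable is missing, `os.getenv` quietly returns `None`, and the problem only surfaces later when a query fails. A minimal sketch of a guard follows; the helper name `require_api_key` is hypothetical and not part of `maptools` or `census`.

```python
import os


def require_api_key(name="CENSUS_API_KEY"):
    """Return the key stored in the environment, or raise a clear error."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before querying the Census API."
        )
    return key
```

With this in place, `api_key = require_api_key()` could stand in for the plain `os.getenv` call.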


Searching Census Tables
=======================
The search is rough, but it can help you find census tables by keyword or
phrase. A query doesn't need to match exactly to return a result.

`results` is an optional parameter that tells the search how many
rows to return; the default is 20.
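
To make the idea of inexact matching concrete, here is a small sketch that ranks table concepts by the share of query words they contain. This is not how `census_vars.search` is implemented (the source does not say); the function `keyword_search` and its scoring rule are assumptions chosen for illustration, with a `results` parameter that mirrors the one described above.

```python
def keyword_search(query, concepts, results=20):
    """Rank table concepts by the share of query words they contain."""
    query_words = set(query.lower().split())
    if not query_words:
        return []
    scored = []
    for group, concept in concepts.items():
        concept_words = set(concept.lower().split())
        # Score is the fraction of query words present, regardless of order.
        score = len(query_words & concept_words) / len(query_words)
        scored.append((group, concept, score))
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:results]


# A tiny stand-in for the full list of table concepts.
sample_concepts = {
    "C16001": "Language Spoken At Home For The Population 5 Years And Over",
    "B16007": "Age By Language Spoken At Home For The Population 5 Years And Over",
    "B08303": "Travel Time To Work",
}

top_matches = keyword_search("home language", sample_concepts, results=2)
for group, concept, score in top_matches:
    print(group, f"{score:.0%}", concept)
```

Because the score ignores word order, "home language" still finds "Language Spoken At Home".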


In [4]:
census_vars.search("home language")


Unnamed: 0,group,concept,match
25514,C16001,Language Spoken At Home For The Population 5 Years And Over,60.44%
25950,B16007,Age By Language Spoken At Home For The Population 5 Years And Over,58.11%
17374,B99162,Allocation Of Language Spoken At Home For The Population 5 Years And Over,57.28%
6449,B16009,Poverty Status In The Past 12 Months By Age By Language Spoken At Home For The Population 5 Years And Over,48.41%
15105,B16001,Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,46.01%
5416,B16004,Age By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,44.51%
24629,B16010,Educational Attainment And Employment Status By Language Spoken At Home For The Population 25 Years And Over,43.42%
2238,B26113,Group Quarters Type (3 Types) By Language Spoken At Home By Ability To Speak English,43.21%
13458,B26213,Group Quarters Type (5 Types) By Language Spoken At Home By Ability To Speak English,43.21%
13384,B16005,Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over,42.92%


Finding Field Names
===================
`census_vars` has a function called `get_table()` which
looks up all of the variables in a table based on the
group variable name.

In the example above, we see that the **Language Spoken At Home**
fields are in group `C16001`, the top match for our search.

By default, `get_table()` returns a `dict` of field names,
where the keys are the census variables and the values
are suggested column names.

If you pass the `as_dict=False` argument, you will see
a `DataFrame` with data for each field in the table.
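
The keys in such a dict follow the Census naming pattern for detailed tables: the group ID, an underscore, a three-digit line number, and a suffix such as `E` for an estimate or `M` for a margin of error. The sketch below splits a variable name into those parts and uses the line number to pick a subset of fields. The helper `split_variable` is hypothetical, and the pattern assumes three-digit line numbers.

```python
import re

# Group ID, underscore, three-digit line number, then a letter suffix.
VARIABLE_PATTERN = re.compile(
    r"^(?P<group>[A-Z0-9]+)_(?P<line>\d{3})(?P<suffix>[A-Z]+)$"
)


def split_variable(name):
    """Split a name like 'C16001_003E' into (group, line number, suffix)."""
    match = VARIABLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a recognized variable name: {name!r}")
    return match["group"], int(match["line"]), match["suffix"]


# A few entries in the shape that get_table() returns.
field_names = {
    "C16001_001E": "total",
    "C16001_002E": "speak_only_english",
    "C16001_003E": "spanish",
    "C16001_004E": "spanish_speak_english_very_well",
}

# Keep only the first three lines of the table.
first_three = {
    name: label
    for name, label in field_names.items()
    if split_variable(name)[1] <= 3
}
print(first_three)
```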

In [5]:
# get as a table
# census_vars.get_table("C16001", as_dict=False)

# get it as a fields dict

field_names = census_vars.get_table("C16001")
field_names


{'C16001_001E': 'total',
 'C16001_002E': 'speak_only_english',
 'C16001_003E': 'spanish',
 'C16001_004E': 'spanish_speak_english_very_well',
 'C16001_005E': 'spanish_speak_english_less_than_very_well',
 'C16001_006E': 'french_haitian_or_cajun',
 'C16001_007E': 'french_haitian_or_cajun_speak_english_very_well',
 'C16001_008E': 'french_haitian_or_cajun_speak_english_less_than_very_well',
 'C16001_009E': 'german_or_other_west_germanic_languages',
 'C16001_010E': 'german_or_other_west_germanic_languages_speak_english_very_well',
 'C16001_011E': 'german_or_other_west_germanic_languages_speak_english_less_than_very_well',
 'C16001_012E': 'russian_polish_or_other_slavic_languages',
 'C16001_013E': 'russian_polish_or_other_slavic_languages_speak_english_very_well',
 'C16001_014E': 'russian_polish_or_other_slavic_languages_speak_english_less_than_very_well',
 'C16001_015E': 'other_indo_european_languages',
 'C16001_016E': 'other_indo_european_languages_speak_english_very_well',
 'C16001_017E': 

In [6]:
# let's look up these fields at the State level using the census API
c = Census(api_key)

field_names = {
  'C16001_001E': 'total',
  'C16001_002E': 'speak_only_english',
  'C16001_003E': 'spanish',
  'C16001_006E': 'french_haitian_or_cajun',
  'C16001_009E': 'german_or_other_west_germanic_languages',
  'C16001_012E': 'russian_polish_other_slavic_languages',
  'C16001_015E': 'other_indo-european_languages',
  'C16001_018E': 'korean',
  'C16001_021E': 'chinese',
  'C16001_024E': 'vietnamese',
  'C16001_027E': 'tagalog',
  'C16001_030E': 'other_asian_and_pacific_island_languages',
  'C16001_033E': 'arabic',
  'C16001_036E': 'other_languages'
}

fields = list(field_names.keys())

data = c.acs5.get(fields=fields, geo={ 'for': 'state:*'}, year=2022)

df = pd.DataFrame(data)
df.rename(columns=field_names, inplace=True)
df["statefp"] = df["state"]
df["state"] = df.statefp.apply(census_vars.lookup_state)

df.head(10)

Unnamed: 0,total,speak_only_english,spanish,french_haitian_or_cajun,german_or_other_west_germanic_languages,russian_polish_other_slavic_languages,other_indo-european_languages,korean,chinese,vietnamese,tagalog,other_asian_and_pacific_island_languages,arabic,other_languages,state,statefp
0,4736236.0,4480958.0,160709.0,7036.0,9704.0,4128.0,19899.0,9993.0,10373.0,7771.0,3854.0,8604.0,6181.0,7026.0,AL,1
1,685830.0,578448.0,24158.0,2449.0,2892.0,5506.0,3194.0,3494.0,1884.0,788.0,18520.0,14848.0,1520.0,28129.0,AK,2
2,6769646.0,4980662.0,1355303.0,18724.0,19968.0,25467.0,72057.0,10112.0,33133.0,22276.0,25913.0,45135.0,23371.0,137525.0,AZ,4
3,2837345.0,2617018.0,159013.0,3663.0,5387.0,1813.0,10491.0,2057.0,4856.0,4533.0,3154.0,20530.0,1958.0,2872.0,AR,5
4,37097796.0,20809671.0,10478088.0,134920.0,118844.0,256925.0,1210840.0,359747.0,1269524.0,559059.0,772833.0,713292.0,204651.0,209402.0,CA,6
5,5453601.0,4572556.0,602110.0,21029.0,25075.0,28989.0,49072.0,14043.0,24721.0,19808.0,8724.0,35187.0,14008.0,38279.0,CO,8
6,3428549.0,2654207.0,418652.0,37755.0,10589.0,50347.0,137325.0,6575.0,31651.0,8089.0,8353.0,29463.0,12223.0,23320.0,CT,9
7,939645.0,809280.0,67918.0,8457.0,5227.0,2734.0,14389.0,1729.0,8252.0,1683.0,2837.0,6817.0,2765.0,7557.0,DE,10
8,629065.0,518687.0,57647.0,9061.0,2373.0,3522.0,10670.0,1926.0,5091.0,1209.0,1250.0,3741.0,2202.0,11686.0,DC,11
9,20529964.0,14384847.0,4549382.0,538169.0,80397.0,128712.0,367930.0,17768.0,76455.0,69570.0,70430.0,94225.0,70511.0,81568.0,FL,12
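
Raw counts are hard to compare across states of very different sizes, so a natural next step is to divide each language count by the state total. The sketch below works on plain rows, assuming one dict per state keyed by variable code, which is the shape `pd.DataFrame(data)` accepts above. The helper `language_shares` is hypothetical, and the two sample rows reuse the Alabama and California figures from the table.

```python
def language_shares(row, total_field="C16001_001E"):
    """Divide every other C16001 estimate in a row by the row's total."""
    total = float(row[total_field])
    return {
        field: float(value) / total
        for field, value in row.items()
        if field.startswith("C16001_") and field != total_field
    }


# Two rows keyed by raw variable code, using values from the output above.
sample_rows = [
    {"C16001_001E": 4736236.0, "C16001_002E": 4480958.0,
     "C16001_003E": 160709.0, "state": "01"},
    {"C16001_001E": 37097796.0, "C16001_002E": 20809671.0,
     "C16001_003E": 10478088.0, "state": "06"},
]

shares_by_state = {row["state"]: language_shares(row) for row in sample_rows}
for state, shares in shares_by_state.items():
    print(state, {field: f"{share:.1%}" for field, share in shares.items()})
```

The same division could be done on the renamed `DataFrame` columns; the plain-dict version is used here only to keep the example free of extra dependencies.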
