Census API & Geographies
========================
In this notebook we will look up data directly from the US Census
API (application programming interface) and try to understand the
geographic levels that the Census uses.

**Goals:**

- set up and test out your own Census API key
- understand some key geographic levels in the Census API
- work with Census fields and geographies to query data

**To follow this lab, you will have to [sign up for a Census API key](https://api.census.gov/data/key_signup.html).** You will need your key to run the code in this notebook. Students in CSC 602 can put **Adelphi University** as the organization.

[Watch the video walkthrough](https://youtu.be/AwyhtIcpeLw) [35:11]

In [None]:
# install libraries that are not part of Colab by default
!pip install census us mapclassify -q

# load the libraries we need
from census import Census
import us
import pandas as pd

# replace this with your own Census API key (this is a fake key)
api_key = "796f9e16b6e3f73329d0d36de60d226d53215cc5"

In [None]:
# if you want to share this notebook and keep your key hidden,
# add the key to your Colab Secrets (the key icon in the left sidebar)
# under the name CENSUS_API_KEY; otherwise, DELETE THIS BLOCK


from google.colab import userdata
api_key = userdata.get('CENSUS_API_KEY')
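
Outside of Colab, `google.colab.userdata` is not available. Here is a minimal sketch of one alternative, assuming you have stored the key in an environment variable. The variable name `CENSUS_API_KEY` and the helper `load_api_key` are assumptions that mirror the secret name above; they are not part of the `census` package.

```python
import os


def load_api_key(var_name="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # fail early with a readable message instead of sending an empty key
        raise RuntimeError(f"Set the {var_name} environment variable to your Census API key")
    return key
```

With the variable set, `api_key = load_api_key()` would take the place of the `userdata.get()` call.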


Your first API call
===================
We will use the python `census` package to connect to the
five-year American Community Survey (ACS) data. This package
is a "thin wrapper" around the Census **REST API**.

A REST API uses the internet and HTTP (URLs like the ones we use for web pages)
to receive queries from us and return data. The python package formats
the requests for us and sends them to the correct URLs. It also returns
python `dicts` that we can easily convert to `pandas` `DataFrames`.
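
To make the idea of "formatting a request" concrete, here is a rough sketch of the kind of URL a state-level query turns into. The helper `build_acs5_url` is made up for illustration, and the `census` package may assemble its requests differently internally; the sketch only shows the general shape of an ACS 5-year request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def build_acs5_url(year, fields, geo_for, key):
    """Build a request URL for the ACS 5-year endpoint (illustrative helper)."""
    params = {
        "get": ",".join(fields),  # variables are sent as one comma-separated string
        "for": geo_for,           # the geography, e.g. every state
        "key": key,               # your API key
    }
    # keep commas, colons, and asterisks readable instead of percent-encoding them
    query = urlencode(params, safe=",:*")
    return f"{BASE_URL}/{year}/acs/acs5?{query}"


url = build_acs5_url(2022, ["NAME", "B01003_001E"], "state:*", "YOUR_KEY")
print(url)
# https://api.census.gov/data/2022/acs/acs5?get=NAME,B01003_001E&for=state:*&key=YOUR_KEY
```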

Basic steps to making a census API call:
----------------------------------------
1. Set up your API key and save it as a variable (we did this already).
2. Create a `Census` object with your key (the year is passed later, with the request).
3. Identify the variables you want to query.
4. Identify the geographic places you want to query.
5. Send the request (and confirm the results).
6. Convert the results to a `DataFrame`.
7. Give the columns meaningful names.
8. "Clean" the data (if needed).


In this next code block, we will do steps 2-5. We won't yet convert it to a `DataFrame`
so that we have a sense of what the "raw" data looks like. We are going to query for
total population at the US "state" level.

We are going to ask for the "NAME" field (which _in this case_ is the name of the state),
and variable `B01003_001E` which is the total population. We only know it's
the total population because we looked it up in the [ACS documentation](https://api.census.gov/data/2022/acs/acs5/variables.html).

The call to `acs5.get()` returns a python **list** of **dict** objects. Each dict
represents a row of data in key:value pairs. Here's a sample from the results
of our query below:

~~~~~
[{'NAME': 'Alabama', 'B01003_001E': 5028092.0, 'state': '01'},
 {'NAME': 'Alaska', 'B01003_001E': 734821.0, 'state': '02'},
 {'NAME': 'Arizona', 'B01003_001E': 7172282.0, 'state': '04'},
 {'NAME': 'Arkansas', 'B01003_001E': 3018669.0, 'state': '05'},
 {'NAME': 'California', 'B01003_001E': 39356104.0, 'state': '06'}]
~~~~~

We have the two variables we asked for (`NAME` and `B01003_001E`) and a `state` code,
which is returned because we queried at the state level.
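
For a sense of what the wrapper does for us, here is a small sketch that turns a raw response into the same list-of-dicts shape. It assumes the raw API response is a JSON array of rows in which the first row holds the column names and every value is a string. The wrapper also converts the estimates to floats, which this sketch does not do.

```python
import json

# a shortened raw response, using two of the rows shown above
raw_response = """
[["NAME", "B01003_001E", "state"],
 ["Alabama", "5028092", "01"],
 ["Alaska", "734821", "02"]]
"""

rows = json.loads(raw_response)
header, *records = rows  # the first row is the header, the rest are data

# pair each value with its column name to get one dict per row
data = [dict(zip(header, record)) for record in records]
print(data[0])
```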


In [None]:
# build the Census object
c = Census(api_key)
# get B01003_001E at the state level
data = c.acs5.get(fields=["NAME", "B01003_001E"], geo={"for": "state:*"}, year=2022)
# show the first 5 records
data[:5]

[{'NAME': 'Alabama', 'B01003_001E': 5028092.0, 'state': '01'},
 {'NAME': 'Alaska', 'B01003_001E': 734821.0, 'state': '02'},
 {'NAME': 'Arizona', 'B01003_001E': 7172282.0, 'state': '04'},
 {'NAME': 'Arkansas', 'B01003_001E': 3018669.0, 'state': '05'},
 {'NAME': 'California', 'B01003_001E': 39356104.0, 'state': '06'}]

Convert the results to a DataFrame
-----------------------------------
In the next code block, we're going to work with the
results for our query, in the `data` variable,
and convert them to a `DataFrame` called `df`.

First, `df` looks like this:
~~~~~~~~~~~~
         NAME  B01003_001E state
0     Alabama    5028092.0    01
1      Alaska     734821.0    02
2     Arizona    7172282.0    04
3    Arkansas    3018669.0    05
4  California   39356104.0    06
~~~~~~~~~~~~
We rename the columns _and_ clean the data by converting the population
data from a float to an integer. You can see the result in the output
of the code block.

**Note:** _Census data uses numeric codes for things like states, counties,
tracts, etc. These are called FIPS codes. Even though they are numbers,
we need to treat them as strings to keep the leading zeros. If needed, we
can look up a state based on its FIPS code._
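
A quick sketch of why the leading zeros matter, using plain python strings. The codes are the ones that appear in this notebook (Alabama is `01`; Nassau County, New York is state `36`, county `059`).

```python
state_fips = "01"                    # Alabama, as returned by the API

as_number = int(state_fips)          # converting to a number drops the zero
restored = str(as_number).zfill(2)   # zfill() pads it back to two digits

# state (2 digits) + county (3 digits) identifies a county nationwide
nassau_geoid = "36" + "059"

print(as_number, restored, nassau_geoid)
```

If the codes were stored as numbers, the combined county code would come out as `3659` instead of `36059`, and lookups against other Census files would fail.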

In [None]:
df = pd.DataFrame(data)
df.rename(columns={"NAME":"state_name", "state":"statefp", "B01003_001E": "total_pop"}, inplace=True)
# convert total_pop from a float to an int
df.total_pop = df.total_pop.astype(int)

df.head(10)


Unnamed: 0,state_name,total_pop,statefp
0,Alabama,5028092,1
1,Alaska,734821,2
2,Arizona,7172282,4
3,Arkansas,3018669,5
4,California,39356104,6
5,Colorado,5770790,8
6,Connecticut,3611317,9
7,Delaware,993635,10
8,District of Columbia,670587,11
9,Florida,21634529,12


Querying at different geographic levels
=======================================


County level
------------
We _could_ query the population for every county in the US and its territories,
but that would be a lot of data and too much to map. In this example,
we will get the counties for two states: New York and New Jersey.
The same approach works for a single state or for a longer list of states.

In the code block, note:

- we use a **dict** called `field_names` to specify the variables:
  the **keys** match Census variables and the **values** are the nice names
- we get just the **fields** as a list by calling `keys()` on the `field_names` dict
- we added the `STATE` field to the `fields` list
- we can use the `field_names` dict to rename the columns in the `DataFrame`
- we use an f-string to format the state FIPS codes in the **geo** argument
- we are using the string `.join()` method for the first time: it
  combines a list into a single string with a separator
  (we are using `,` to join our state FIPS codes)
- we get the `county` field as a FIPS code because we specified "county"
  in the `geo` argument
- we sort the counties by population size
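
The `field_names` pattern can be tried out without calling the API. This sketch applies the same mapping to a single hand-written record (the values are taken from the Kings County row in the output further down). The dict comprehension is a plain-python stand-in for what `DataFrame.rename()` does with the column names.

```python
field_names = {
    "NAME": "county_name",
    "STATE": "statefp",
    "B01003_001E": "total_pop",
}

# the keys are the Census variables we send in the request
fields = list(field_names.keys())

# one record in the shape the API wrapper returns
record = {"NAME": "Kings County, New York", "STATE": "36",
          "B01003_001E": 2679620.0, "state": "36", "county": "047"}

# rename the keys we have nice names for and keep the others unchanged
renamed = {field_names.get(key, key): value for key, value in record.items()}
print(fields)
print(renamed)
```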

The Census API identifies each state by its FIPS code,
not its postal abbreviation. We will use the `us` package to get the FIPS codes,
put them in a **list** and then use that list in our query.

In python, a list is an ordered collection of items (they can be anything --
strings, ints, floats, DataFrames, etc.). We can use a list to hold the FIPS,
which are just strings.
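
Here is a small sketch of how the list, `.join()`, and the f-string fit together to build the `geo` argument. The FIPS codes are typed in by hand here (36 is New York, 34 is New Jersey) so that the example runs without the `us` package.

```python
# FIPS codes for New York and New Jersey, kept as strings
state_fips = ["36", "34"]

# join() glues the list items together with a comma between them
joined = ",".join(state_fips)

# the f-string drops the joined codes into the text the API expects
geo = {"for": "county:*", "in": f"state:{joined}"}

print(joined)
print(geo)
```

Note that `.join()` only works on strings: `",".join([36, 34])` raises a `TypeError`, which is one more reason to keep FIPS codes as text.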

In [None]:
c = Census(api_key)
# use a dict for the fields and their nice names
field_names = {
    "NAME": "county_name",
    "STATE": "statefp",
    "B01003_001E": "total_pop"
}
fields = list(field_names.keys())

# a list of strings generated using the us library
state_fips = [us.states.NY.fips, us.states.NJ.fips]

display("fips: ")
display(",".join(state_fips))

data = c.acs5.get(fields=fields, geo={'for': 'county:*', 'in': f'state:{",".join(state_fips)}'}, year=2022)

counties = pd.DataFrame(data)
# add any fields that we want to rename to our dict
field_names["county"] = "countyfp"
counties.rename(columns=field_names, inplace=True)
counties.total_pop = counties.total_pop.astype(int)
counties.sort_values("total_pop", ascending=False, inplace=True)

counties.head(10)

'fips: '

'36,34'

Unnamed: 0,county_name,statefp,total_pop,state,countyfp
44,"Kings County, New York",36,2679620,36,47
61,"Queens County, New York",36,2360826,36,81
51,"New York County, New York",36,1645867,36,61
72,"Suffolk County, New York",36,1524486,36,103
23,"Bronx County, New York",36,1443229,36,5
50,"Nassau County, New York",36,1389160,36,59
80,"Westchester County, New York",36,997904,36,119
1,"Bergen County, New Jersey",34,953243,34,3
35,"Erie County, New York",36,951232,36,29
11,"Middlesex County, New Jersey",34,860147,34,23


### Cleaning counties
That data looks pretty good, but we don't have the state name on its own,
the county name on its own, or the state postal code.
We're going to write some functions that we can use with `apply()` to
add these columns to our DataFrame.


In [None]:
def parse_county(county_name):
    parts = county_name.split(", ")
    county = parts[0]
    county = county.replace(" County", "")
    return county

def parse_state(county_name):
    parts = county_name.split(", ")
    return parts[1]

def lookup_state(statefp):
    # us.states.lookup() returns None when it does not recognize the code
    state = us.states.lookup(statefp)
    if state is not None:
        return state.abbr
    # no match: fall back to an empty string
    return ""


counties["county"] = counties.county_name.apply(parse_county)
counties["state_name"] = counties.county_name.apply(parse_state)
counties["state"] = counties.statefp.apply(lookup_state)
# re-order the columns
counties = counties[["state", "county", "total_pop", "county_name", "state_name", "countyfp", "statefp"]]
counties.head(10)

Unnamed: 0,state,county,total_pop,statefp,county_name,state_name,countyfp
44,NY,Kings,2679620,36,"Kings County, New York",New York,47
61,NY,Queens,2360826,36,"Queens County, New York",New York,81
51,NY,New York,1645867,36,"New York County, New York",New York,61
72,NY,Suffolk,1524486,36,"Suffolk County, New York",New York,103
23,NY,Bronx,1443229,36,"Bronx County, New York",New York,5
50,NY,Nassau,1389160,36,"Nassau County, New York",New York,59
80,NY,Westchester,997904,36,"Westchester County, New York",New York,119
1,NJ,Bergen,953243,34,"Bergen County, New Jersey",New Jersey,3
35,NY,Erie,951232,36,"Erie County, New York",New York,29
11,NJ,Middlesex,860147,34,"Middlesex County, New Jersey",New Jersey,23
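
The parsing functions above assume every name ends in " County". That holds for New York and New Jersey, but other states use other words. Here is a hedged sketch of a more general splitter; the helper `split_county_name` and its suffix list are assumptions for illustration, and the list is not exhaustive.

```python
# suffixes to strip; this list is an assumption and is not exhaustive
SUFFIXES = (" County", " Parish", " Borough", " Census Area", " Municipio")


def split_county_name(full_name):
    """Split 'Kings County, New York' into ('Kings', 'New York')."""
    # split on the LAST comma so the state name is always the final part
    county, _, state_name = full_name.rpartition(", ")
    for suffix in SUFFIXES:
        if county.endswith(suffix):
            county = county[: -len(suffix)]
            break
    return county, state_name


print(split_county_name("Kings County, New York"))
print(split_county_name("Orleans Parish, Louisiana"))
```

Names with a suffix that is not in the list are returned with the suffix still attached, so it is worth checking the results for any state you have not worked with before.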


Census tracts
-------------
Census tracts are geographic regions that stay relatively stable
across different years of census surveys. Tracts are drawn to contain
about 4,000 people each, though actual populations vary. In dense areas
tracts are small; in rural parts of the country they cover larger areas.

For the query below, we will use the county data we loaded to look
up the census tracts in Nassau and Suffolk counties in New York.

In [None]:
# get just the county fips for LI and make them a comma-separated string
li_counties = counties[(counties.state == "NY") & (counties.county.isin(["Nassau", "Suffolk"]))]
li_counties = list(li_counties.countyfp)
li_counties = ",".join(li_counties)

# use a dict for the fields and their nice names
field_names = {
    "NAME": "tract_name",
    "STATE": "statefp",
    "COUNTY": "countyfp",
    "B01003_001E": "total_pop"
}
fields = list(field_names.keys())

data = c.acs5.get(fields=fields, geo={'for': 'tract:*', 'in': f'state:{us.states.NY.fips} county:{li_counties}'}, year=2022)

tracts = pd.DataFrame(data)
tracts.rename(columns=field_names, inplace=True)
tracts.total_pop = tracts.total_pop.astype(int)
tracts.sort_values("total_pop", ascending=False, inplace=True)
# just the cols we want
tracts = tracts[["tract", "tract_name", "total_pop", "countyfp", "statefp"]]

tracts.head()

Unnamed: 0,tract,tract_name,total_pop,countyfp,statefp
84,407000,Census Tract 4070; Nassau County; New York,9657,59,36
60,405100,Census Tract 4051; Nassau County; New York,9554,59,36
165,413900,Census Tract 4139; Nassau County; New York,8788,59,36
64,405400,Census Tract 4054; Nassau County; New York,8578,59,36
442,146005,Census Tract 1460.05; Suffolk County; New York,8373,103,36
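
The output above pairs each 6-digit `tract` code with a name such as "Census Tract 1460.05". Here is a sketch of how the two relate, and of how state, county, and tract codes combine into one 11-digit identifier. The helper names `tract_label` and `tract_geoid` are made up for this example.

```python
def tract_label(tract_code):
    """Convert a 6-digit tract code such as '146005' to '1460.05'."""
    base, suffix = tract_code[:4], tract_code[4:]
    base = base.lstrip("0") or "0"  # drop zero padding from the base number
    # a suffix of '00' means the tract has no decimal part
    return base if suffix == "00" else f"{base}.{suffix}"


def tract_geoid(statefp, countyfp, tract_code):
    """Combine state (2), county (3), and tract (6) codes into 11 digits."""
    return statefp.zfill(2) + countyfp.zfill(3) + tract_code.zfill(6)


print(tract_label("407000"))
print(tract_label("146005"))
print(tract_geoid("36", "059", "407000"))
```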


Census Places
-------------
Places are incorporated cities, towns, and villages, as well as unincorporated
**Census Designated Places** (CDPs). There's no consistent
relationship between places and the other designations (tracts, counties, etc.).
A place may span several counties or tracts, or (often) be contained within one.

This means that we cannot query them based on county. In the example below,
we find all of the places in New York State. In a future lab, we will learn
how to do a **spatial join** to find the places that touch or are contained
within a specific geographic area.

In [None]:
field_names = {
    "NAME": "place_name",
    "B01003_001E": "total_pop"
}
fields = list(field_names.keys())

data = c.acs5.get(fields=fields, geo={ 'for': 'place:*', 'in': f'state:{us.states.NY.fips}'}, year=2022)
places = pd.DataFrame(data)
field_names["state"] = "statefp"
places.rename(columns=field_names, inplace=True)
places.total_pop = places.total_pop.astype(int)
places.sort_values("total_pop", ascending=False, inplace=True)

places.head(10)

Unnamed: 0,place_name,total_pop,statefp,place
766,"New York city, New York",8622467,36,51000
128,"Buffalo city, New York",276688,36,11000
957,"Rochester city, New York",210992,36,63000
1286,"Yonkers city, New York",209780,36,84000
1117,"Syracuse city, New York",146134,36,73000
8,"Albany city, New York",99692,36,1000
762,"New Rochelle city, New York",80828,36,50617
185,"Cheektowaga CDP, New York",76483,36,15000
726,"Mount Vernon city, New York",72817,36,49121
1014,"Schenectady city, New York",68476,36,65508
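
The place names in the output above bundle three things together: the name, the type of place, and the state. Here is a rough sketch for pulling them apart, assuming the last word before the comma is always the place type (as it is in these ten rows). The helper `split_place_name` is made up for illustration, and multi-word types would not parse correctly.

```python
def split_place_name(place_name):
    """Split 'New York city, New York' into ('New York', 'city', 'New York')."""
    # everything after the last comma is the state
    name_and_type, _, state_name = place_name.rpartition(", ")
    # the last word of what is left is the place type
    name, _, place_type = name_and_type.rpartition(" ")
    return name, place_type, state_name


print(split_place_name("New York city, New York"))
print(split_place_name("Cheektowaga CDP, New York"))
```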


Zip codes
---------
Zip codes are a little _weird_. They are a nice geographic region
because they aren't that big, and people know where they are
(in their region). The problem is, they are not standard Census geographies:
they can cut across counties, cities, and states.

In this example, we will get the population for zip codes in New York State.

Note:

- zip codes are called **zip code tabulation areas** (ZCTAs) in the Census
- we can't filter by state FIPS code like we did for counties
- we will load all of the zip-code-county-state relationships as a DataFrame
- we will get all of the zip codes in NYC, Westchester, and Long Island
- remember, these zips might not be wholly contained in the counties we are looking at
- we want our zipcodes to be strings, so we use an optional argument in read_csv()
  to tell pandas to parse that column as type `str`
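
Here is a plain-python sketch of why the zip code column needs to stay text. It uses the standard `csv` module, which always reads values as strings, and two sample rows written by hand in the same layout as the relationship file. Without the `dtype` argument, pandas would infer a number for this column, which has the same effect as the `int()` conversion below.

```python
import csv
import io

# hand-written sample rows in the same layout as the relationship file
sample = (
    "zipcode,countyfp,COUNTYLINE,state\n"
    "06390,36103,SUFFOLK,NY\n"
    "10001,36061,NEW YORK,NY\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))

as_text = [row["zipcode"] for row in rows]   # strings keep the leading zero
as_numbers = [int(z) for z in as_text]       # numbers silently drop it

print(as_text)
print(as_numbers)
```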

In [None]:
field_names = {
    # not "zipcode": that name is used for the ZCTA code column in the next cell
    "NAME": "zcta_name",
    "B01003_001E": "total_pop"
}
fields = list(field_names.keys())
url = "https://raw.githubusercontent.com/mcuringa/cartopy/refs/heads/main/notebooks/data/zipcodes-counties.csv"
zips = pd.read_csv(url, dtype={"zipcode": str})

county_names = ['NEW YORK', 'KINGS', 'QUEENS', 'BRONX',
                'RICHMOND', 'WESTCHESTER', 'NASSAU', 'SUFFOLK']
zips = zips[(zips.COUNTYLINE.isin(county_names)) & (zips.state == "NY")]
zips.head()

Unnamed: 0,zipcode,countyfp,COUNTYLINE,state
3678,10001,36061,NEW YORK,NY
3679,10002,36061,NEW YORK,NY
3680,10003,36061,NEW YORK,NY
3681,10004,36061,NEW YORK,NY
3682,10005,36061,NEW YORK,NY


In [None]:
# now get just the unique zipcode column as a list
zips = list(zips.zipcode.unique())
zips = ",".join(zips)
data = c.acs5.get(fields=fields, geo={'for': f'zip code tabulation area:{zips}'}, year=2022)

zipcodes = pd.DataFrame(data)
# add "zip code tabulation area" to the field names
field_names["zip code tabulation area"] = "zipcode"
zipcodes.rename(columns=field_names, inplace=True)
zipcodes.total_pop = zipcodes.total_pop.astype(int)
zipcodes.sort_values("total_pop", ascending=False, inplace=True)

# zipcodes.head(10)
zipcodes

Unnamed: 0,zipcode,total_pop,zipcode.1
85,ZCTA5 10467,98713,10467
22,ZCTA5 10025,96988,10025
74,ZCTA5 10456,88575,10456
76,ZCTA5 10458,82678,10458
86,ZCTA5 10468,81397,10468
...,...,...,...
56,ZCTA5 10169,0,10169
57,ZCTA5 10170,0,10170
58,ZCTA5 10171,0,10171
59,ZCTA5 10172,0,10172
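
The output above shows two things worth cleaning up: the `NAME` value repeats the code with a "ZCTA5" prefix, and some ZCTAs report a population of 0. Here is a small sketch of both steps on two hand-copied records. The helper `zcta_code` is made up for this example.

```python
# two records copied from the output above, in the shape the wrapper returns
records = [
    {"NAME": "ZCTA5 10467", "B01003_001E": 98713.0, "zip code tabulation area": "10467"},
    {"NAME": "ZCTA5 10169", "B01003_001E": 0.0, "zip code tabulation area": "10169"},
]


def zcta_code(name):
    """Return the 5-digit code from a name like 'ZCTA5 10467'."""
    return name.split()[-1]


# keep only the ZCTAs where somebody actually lives
populated = [r for r in records if r["B01003_001E"] > 0]

print([zcta_code(r["NAME"]) for r in records])
print(len(populated))
```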
