# Explore the Census Wrapper and API

### Requirements
* Install the `census` module before getting started by running the following command from the command line:
    * **`pip install census`**

### Documentation
* [Python wrapper for census API](https://github.com/datamade/census)
* [List of available fields and labels](https://gist.github.com/afhaque/60558290d6efd892351c4b64e5c01e9b)
* [Census API Docs](https://www.census.gov/data/developers/data-sets.html)


### Import Dependencies

In [2]:
import pandas as pd
from census import Census #<-- Python wrapper for census API
import requests

# Census API key (expects a local config.py that defines api_key)
from config import api_key

# Provide the API key and the year to establish a session
c = Census(api_key, year=2013)

# Allow up to 300 characters to print in each column
pd.set_option('display.max_colwidth', 300)

ModuleNotFoundError: No module named 'config'
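
The `from config import api_key` line assumes a `config.py` file sitting next to the notebook; when that file is missing, the import raises `ModuleNotFoundError`. As a minimal sketch of one alternative, the key could be read from an environment variable instead. The variable name `CENSUS_API_KEY` and the helper `load_api_key` are assumptions for illustration, not part of the `census` package.

```python
import os


def load_api_key(env_var="CENSUS_API_KEY"):
    """Return the Census API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Census API key"
        )
    return key
```

With this in place, `api_key = load_api_key()` could stand in for the `config` import, which also keeps the key out of version control.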

### Gather all of the available tables for the 2013 ACS5 data

The wrapper's `tables()` method returns every table available for the dataset and year we configured. Each entry carries the table's `name` (its ID), a `description`, and a `variables` URL that lists the fields belonging to that table.

**NOTE:** We're using the `acs5` function set to pull our data from the 5-year American Community Survey.

In [3]:
# query for all tables
tables = c.acs5.tables()

# The tables variable contains a list of dicts, so we can convert directly to a dataframe
table_df = pd.DataFrame(tables)
print(f"Number of available tables: {len(table_df)}")
table_df.head()

NameError: name 'c' is not defined
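
Each element returned by `tables()` is a plain dictionary, so it can also be inspected before building a DataFrame. The sketch below uses two records copied from this notebook's output (rather than a live API call) to show that shape and to build a lookup from table ID to description. `sample_tables` and `descriptions` are illustrative names.

```python
# Two records copied from the notebook's output; a live call to
# c.acs5.tables() would return many more entries of the same shape.
sample_tables = [
    {
        "name": "B17015",
        "description": (
            "POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE "
            "BY SOCIAL SECURITY INCOME BY SUPPLEMENTAL SECURITY INCOME (SSI) "
            "AND CASH PUBLIC ASSISTANCE INCOME"
        ),
        "variables": "https://api.census.gov/data/2013/acs/acs5/groups/B17015.json",
    },
    {
        "name": "B17017",
        "description": (
            "POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEHOLD TYPE "
            "BY AGE OF HOUSEHOLDER"
        ),
        "variables": "https://api.census.gov/data/2013/acs/acs5/groups/B17017.json",
    },
]

# Map each table ID to its description for quick lookups
descriptions = {t["name"]: t["description"] for t in sample_tables}
print(descriptions["B17017"])
```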

### Execute a string search against the *description* column to filter to an area of interest

In [13]:
table_df[table_df['description'].str.contains("POVERTY")].head(15)
# len(table_df[table_df['description'].str.contains("POVERTY")])

| | description | name | variables |
| --- | --- | --- | --- |
| 0 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY SOCIAL SECURITY INCOME BY SUPPLEMENTAL SECURITY INCOME (SSI) AND CASH PUBLIC ASSISTANCE INCOME | B17015 | https://api.census.gov/data/2013/acs/acs5/groups/B17015.json |
| 2 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY WORK EXPERIENCE OF HOUSEHOLDER AND SPOUSE | B17016 | https://api.census.gov/data/2013/acs/acs5/groups/B17016.json |
| 4 | POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEHOLD TYPE BY AGE OF HOUSEHOLDER | B17017 | https://api.census.gov/data/2013/acs/acs5/groups/B17017.json |
| 6 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER | B17018 | https://api.census.gov/data/2013/acs/acs5/groups/B17018.json |
| 11 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY NUMBER OF RELATED CHILDREN UNDER 18 YEARS | B17012 | https://api.census.gov/data/2013/acs/acs5/groups/B17012.json |
| 13 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY NUMBER OF PERSONS IN FAMILY | B17013 | https://api.census.gov/data/2013/acs/acs5/groups/B17013.json |
| 15 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY NUMBER OF WORKERS IN FAMILY | B17014 | https://api.census.gov/data/2013/acs/acs5/groups/B17014.json |
| 21 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY PRESENCE OF RELATED CHILDREN UNDER 18 YEARS BY AGE OF RELATED CHILDREN | B17010 | https://api.census.gov/data/2013/acs/acs5/groups/B17010.json |
| 43 | POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY TENURE | B17019 | https://api.census.gov/data/2013/acs/acs5/groups/B17019.json |
| 44 | POVERTY STATUS IN THE PAST 12 MONTHS OF INDIVIDUALS BY SEX BY WORK EXPERIENCE | B17004 | https://api.census.gov/data/2013/acs/acs5/groups/B17004.json |
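
By default `str.contains` is case-sensitive and treats the pattern as a regular expression, and it yields a missing value for rows whose description is missing, which cannot be used in a boolean mask. A hedged sketch of a more forgiving search, run on a small stand-in DataFrame rather than the live `table_df`; the third row is made up to show the missing-value case.

```python
import pandas as pd

# Stand-in for table_df: two descriptions from the notebook output plus
# one made-up row with a missing description (an assumption for illustration)
sample_df = pd.DataFrame({
    "name": ["B17016", "B17004", "X00000"],
    "description": [
        "POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE "
        "BY WORK EXPERIENCE OF HOUSEHOLDER AND SPOUSE",
        "POVERTY STATUS IN THE PAST 12 MONTHS OF INDIVIDUALS BY SEX BY WORK EXPERIENCE",
        None,
    ],
})

# case=False ignores capitalization, regex=False matches the text literally,
# and na=False treats missing descriptions as non-matches
mask = sample_df["description"].str.contains(
    "poverty", case=False, regex=False, na=False
)
matches = sample_df[mask]
print(matches["name"].tolist())
```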


### Use the provided URL for your table of interest to retrieve all available variables

Note: I couldn't find a wrapper function for this, so we're using `requests` to make the API call.

In [1]:
# Determine which table you're interested in
table_id = 'B17001'

# Capture the variables URL from the table_df
url = table_df.loc[table_df['name']==table_id, 'variables'].values[0]

# Make the API call
response = requests.get(url).json()
print(response)

# convert the response to a DataFrame
variables = pd.DataFrame(response['variables']).transpose()

print(f"Number of available variables: {len(variables)}")
variables.head(5)

NameError: name 'table_df' is not defined
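
The `variables` URLs in the table listing follow a visible pattern: the year, the dataset path, then `groups/<table>.json`. As a sketch, assuming that pattern holds for the table you need, the URL can be built directly when `table_df` is not at hand. `group_variables_url` is a hypothetical helper, not a function of the wrapper.

```python
def group_variables_url(table_id, year=2013, dataset="acs/acs5"):
    """Build the variables URL for one table, mirroring the URLs in the listing."""
    return f"https://api.census.gov/data/{year}/{dataset}/groups/{table_id}.json"


url = group_variables_url("B17001")
print(url)
```

The result can be passed to `requests.get(url).json()` exactly as in the cell above.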

### Filter to only fields that will contain an integer

Many of the available variables for a table are annotation (notes) fields that are typically null. Luckily the API lets us know what data type each variable is. We can use this to filter to only the ones that will contain an integer.

In [20]:
variables[variables['predicateType']=='int'].head()
# len(variables[variables['predicateType']=='int'])

| | concept | group | label | limit | predicateOnly | predicateType |
| --- | --- | --- | --- | --- | --- | --- |
| B17001_050M | POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE | B17001 | Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!12 to 14 years | 0 | True | int |
| B17001_050E | POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE | B17001 | Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!12 to 14 years | 0 | True | int |
| B17001_051E | POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE | B17001 | Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!15 years | 0 | True | int |
| B17001_051M | POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE | B17001 | Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!15 years | 0 | True | int |
| B17001_052E | POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE | B17001 | Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!16 and 17 years | 0 | True | int |
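
In this output, variable IDs ending in `E` carry an estimate and those ending in `M` carry its margin of error, while the label packs the category hierarchy into one string separated by `!!`. A small sketch of unpacking both, using one label from the output; `parse_variable` is an illustrative helper.

```python
def parse_variable(var_id, label):
    """Split a variable ID and its '!!'-delimited label into readable parts."""
    kind = {"E": "estimate", "M": "margin of error"}.get(var_id[-1], "other")
    # The first label segment repeats the kind, so keep only the categories after it
    categories = label.split("!!")[1:]
    return {"id": var_id, "kind": kind, "categories": categories}


parsed = parse_variable(
    "B17001_050E",
    "Estimate!!Total!!Income in the past 12 months at or above poverty level"
    "!!Female!!12 to 14 years",
)
print(parsed["kind"], parsed["categories"][-1])
```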


### Use the wrapper to query for your selected fields

Once we've identified which fields we want, we can begin to query for the actual content.

There are a number of convenient methods that the wrapper provides, but the standard `get()` function requires a tuple of field IDs, and a geographic reference stored in a dictionary as seen below. 

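In this code, we're requesting these 6 fields for ALL zip codes, or more precisely for every ZIP Code Tabulation Area (ZCTA), the Census Bureau's geographic approximation of a zip code.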


In [16]:
census_data = c.acs5.get(("NAME", "B19013_001E", "B01003_001E", "B01002_001E", "B19301_001E", "B17001_002E"), 
                         {'for': 'zip code tabulation area:*'})

census_data[:5]

[{'NAME': 'ZCTA5 08518',
  'B19013_001E': 74286.0,
  'B01003_001E': 5217.0,
  'B01002_001E': 41.5,
  'B19301_001E': 33963.0,
  'B17001_002E': 170.0,
  'zip code tabulation area': '08518'},
 {'NAME': 'ZCTA5 08520',
  'B19013_001E': 90293.0,
  'B01003_001E': 27468.0,
  'B01002_001E': 37.4,
  'B19301_001E': 37175.0,
  'B17001_002E': 1834.0,
  'zip code tabulation area': '08520'},
 {'NAME': 'ZCTA5 08525',
  'B19013_001E': 118656.0,
  'B01003_001E': 4782.0,
  'B01002_001E': 47.1,
  'B19301_001E': 59848.0,
  'B17001_002E': 43.0,
  'zip code tabulation area': '08525'},
 {'NAME': 'ZCTA5 08527',
  'B19013_001E': 88588.0,
  'B01003_001E': 54867.0,
  'B01002_001E': 42.2,
  'B19301_001E': 37021.0,
  'B17001_002E': 2191.0,
  'zip code tabulation area': '08527'},
 {'NAME': 'ZCTA5 08528',
  'B19013_001E': 58676.0,
  'B01003_001E': 245.0,
  'B01002_001E': 48.5,
  'B19301_001E': 49117.0,
  'B17001_002E': 0.0,
  'zip code tabulation area': '08528'}]
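
For orientation, the `get()` call corresponds roughly to a single GET request against the Census Data API, with the field tuple joined into a `get` parameter and the geography dictionary passed through as `for`. The sketch below only builds such a URL and makes no request; the exact request the wrapper issues may differ, and `build_query_url` and the placeholder key are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_query_url(fields, geo, api_key, year=2013, dataset="acs/acs5"):
    """Assemble a Census Data API query URL (hypothetical helper)."""
    # Fields become one comma-separated "get" value; geography keys pass through
    params = {"get": ",".join(fields), **geo, "key": api_key}
    return f"https://api.census.gov/data/{year}/{dataset}?{urlencode(params)}"


url = build_query_url(
    ("NAME", "B19013_001E"),
    {"for": "zip code tabulation area:*"},
    api_key="YOUR_KEY",  # placeholder, not a real key
)
print(url)
```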

### Format the response

In [21]:
# Convert to DataFrame
census_pd = pd.DataFrame(census_data)

# Renaming columns to be more user-friendly
census_pd = census_pd.rename(columns={"B01003_001E": "Population",
                                      "B01002_001E": "Median Age",
                                      "B19013_001E": "Household Income",
                                      "B19301_001E": "Per Capita Income",
                                      "B17001_002E": "Poverty Count",
                                      "NAME": "Name", 
                                      "zip code tabulation area": "Zipcode"})

# Since Census doesn't provide the poverty rate, we can divide Poverty Count by Population to calculate it ourselves
census_pd["Poverty Rate"] = 100 * census_pd["Poverty Count"].astype(int) / census_pd["Population"].astype(int)

# Reorder columns and only include ones we're interested in for the final DataFrame
census_pd = census_pd[["Zipcode", "Population", "Median Age", "Household Income",
                       "Per Capita Income", "Poverty Rate"]]

# Visualize
print("Total number of zip codes in response: " + str(len(census_pd)))
census_pd.head()

Total number of zip codes in response: 33120


| | Zipcode | Population | Median Age | Household Income | Per Capita Income | Poverty Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 08518 | 5217.0 | 41.5 | 74286.0 | 33963.0 | 3.258578 |
| 1 | 08520 | 27468.0 | 37.4 | 90293.0 | 37175.0 | 6.676860 |
| 2 | 08525 | 4782.0 | 47.1 | 118656.0 | 59848.0 | 0.899205 |
| 3 | 08527 | 54867.0 | 42.2 | 88588.0 | 37021.0 | 3.993293 |
| 4 | 08528 | 245.0 | 48.5 | 58676.0 | 49117.0 | 0.000000 |
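
The poverty-rate line divides by `Population`, which could be zero or missing for sparsely populated areas. A minimal sketch of the same arithmetic with those cases guarded; `poverty_rate` is a hypothetical helper, and returning `None` for an undefined rate is just one possible choice.

```python
def poverty_rate(poverty_count, population):
    """Return the poverty rate as a percentage, or None when it is undefined."""
    if poverty_count is None or population is None or population <= 0:
        return None
    return 100 * poverty_count / population


# Values taken from the first row of the notebook's output (ZCTA 08518)
print(round(poverty_rate(170.0, 5217.0), 6))
```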


### Save to a CSV

In [None]:
census_pd.to_csv("census_data.csv", encoding="utf-8", index=False)
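
One thing to watch when reading this file back: zip codes such as `08518` are strings, and pandas parses them as integers by default, which drops the leading zero. A short sketch, using an in-memory CSV in place of `census_data.csv`, that keeps the column as text through the `dtype` argument:

```python
import io

import pandas as pd

# In-memory stand-in for census_data.csv with two rows from the notebook
csv_text = "Zipcode,Population\n08518,5217.0\n08520,27468.0\n"

# Default parsing infers integers and loses the leading zero
default_read = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str preserves the zip code exactly as written
text_read = pd.read_csv(io.StringIO(csv_text), dtype={"Zipcode": str})

print(default_read["Zipcode"].tolist())
print(text_read["Zipcode"].tolist())
```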