# Python Tutorial Program: Gathering and Exporting Census Data

By Kenneth Burchfiel

This code is released under the MIT license; the datasets produced by the code are in the public domain.

You can find my blog post on this code at https://kburchfiel3.wordpress.com/2021/08/12/python-tutorial-program-retrieving-u-s-census-data/ .

This program demonstrates how Python can be used to retrieve and export US Census data at the zip code, county, and state level. It also shows how the same variable can be accessed across years.

The census-specific functions called by this program can be found in census_query.py. Please read the documentation there for more information on applying these functions.

Although this tutorial program will focus on gathering education, family type, and income/poverty statistics from the American Community Survey (5-year estimates), the functions on which it is based can also be used to gather data from certain other sources, such as the decennial census.

Before being able to run the code below on your computer, you'll need to obtain a free US Census API key from https://api.census.gov/data/key_signup.html .

The following US Census links proved helpful in creating this program:

API list: https://www.census.gov/data/developers/data-sets.html

2020 Census redistricting data:

Variables: https://api.census.gov/data/2020/dec/pl/variables.html

Examples: https://api.census.gov/data/2020/dec/pl/examples.html

2010 Census:

Variables: https://api.census.gov/data/2010/dec/sf1/variables.html

Examples: https://api.census.gov/data/2010/dec/sf1/examples.html

ACS (5 year estimates):

Variables: https://api.census.gov/data/2019/acs/acs5/variables.html

Examples: https://api.census.gov/data/2019/acs/acs5/examples.html

ACS (1 year estimates):

Variables: https://api.census.gov/data/2019/acs/acs1/variables.html

Examples: https://api.census.gov/data/2019/acs/acs1/examples.html

Variable code examples:

ACS5: population = B01001_001E

ACS1: population = B01001_001E

Census (redistricting data): population = P1_001N

Census (SF1 data): population = P001001
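
To show how these variable codes fit into an API request, here is a minimal sketch that only builds query URLs (it sends nothing over the network). The helper name build_census_url and its parameters are my own illustration and are not part of census_query.py; the URL pattern follows the dataset paths in the links above.

```python
from urllib.parse import urlencode

def build_census_url(year, dataset, variables, geography, api_key=None):
    """Return a Census Data API query URL (hypothetical helper; no request is sent)."""
    base = f"https://api.census.gov/data/{year}/{dataset}"
    params = {
        "get": ",".join(["NAME"] + list(variables)),
        "for": f"{geography}:*",  # '*' requests every region of that type
    }
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them
    return base + "?" + urlencode(params, safe=",:*")

acs5_url = build_census_url(2019, "acs/acs5", ["B01001_001E"], "state")
pl_url = build_census_url(2020, "dec/pl", ["P1_001N"], "state")
print(acs5_url)
print(pl_url)
```

Passing api_key appends a key parameter to the URL, so avoid printing or sharing URLs built that way.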

First, I imported a number of libraries:

In [1]:
import time
start_time = time.time() # Allows the program's runtime to be measured
# from census import Census I didn't end up using this library in this version
# of census_query_tutorial, but you may find it useful for your own
# analyses. See https://github.com/datamade/census for more information.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from census_query import retrieve_census_data, \
retrieve_single_census_variable, test_variables, \
compare_variable_across_years, generate_variable_and_group_lists

Instead of hard coding the year into my Census queries, I chose to set it as a variable so that the queries could be modified more easily. I picked 2019 because it was the most recent year (at the time of first creating this project) that American Community Survey census data was available.

In [2]:
year = 2019

Next, I imported my Census API key into the code. I stored the path to the key and the key itself in separate file locations. 

In [3]:
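with open('..\\key_paths\\path_to_keys_folder.txt') as fin:
    # .strip() removes any trailing newline that would corrupt the path
    api_folder_path = fin.readline().strip()
with open(api_folder_path+'\\census_api_key.txt') as fin:
    # .strip() keeps a trailing newline out of the API key
    api_key = fin.readline().strip()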
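
The cell above uses Windows-style backslashes. As a rough sketch of a more portable variant, the same two-step lookup can be written with pathlib. The helper read_first_line and the temporary files below are illustrative assumptions so that the example runs anywhere; the key is a placeholder, not a real one.

```python
from pathlib import Path
import tempfile

def read_first_line(path):
    """Return the first line of a text file without surrounding whitespace."""
    with open(path) as fin:
        return fin.readline().strip()

# Demonstration with throwaway files that mimic the two-file layout
with tempfile.TemporaryDirectory() as tmp:
    keys_folder = Path(tmp) / "keys"
    keys_folder.mkdir()
    (keys_folder / "census_api_key.txt").write_text("PLACEHOLDER_KEY\n")
    pointer_file = Path(tmp) / "path_to_keys_folder.txt"
    pointer_file.write_text(str(keys_folder) + "\n")

    # Step 1: read the folder path; step 2: read the key stored in that folder
    demo_folder_path = Path(read_first_line(pointer_file))
    demo_api_key = read_first_line(demo_folder_path / "census_api_key.txt")

print(demo_api_key)
```

Joining paths with the / operator lets pathlib pick the right separator for the operating system.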

## Creating a list of American Community Survey variables

In order to determine which variables I would import into my spreadsheet, I used the generate_variable_and_group_lists function in census_query.py to read in all 27,000+ variables in the 2019 American Community Survey (5-year estimates) from https://api.census.gov/data/2019/acs/acs5/variables.html . Next, I created a list of groups from this survey.

'Groups' are general categories of data, whereas 'variables' are specific data points within a given group. For instance, group B16010 contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER,' and variable B16010_015E has the data label 'Estimate!!Total:!!High school graduate (includes equivalency)'. 
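
In the codes shown in this tutorial, the group code is the part of the variable code that precedes the underscore. The sketch below relies on that observation to sort a few codes by group; split_variable_code is a hypothetical helper of mine, not a function from census_query.py.

```python
from collections import defaultdict

def split_variable_code(variable_code):
    """Split a code such as 'B16010_015E' into its group and suffix."""
    group, _, suffix = variable_code.partition("_")
    return group, suffix

codes = ["B16010_001E", "B16010_015E", "B19013_001E"]

# Collect the variable codes that belong to each group
codes_by_group = defaultdict(list)
for code in codes:
    group, _ = split_variable_code(code)
    codes_by_group[group].append(code)

print(dict(codes_by_group))
```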

In [4]:
# Set this flag to False to reuse the .csv files saved by an earlier run.
create_new_variable_and_group_lists = True
if create_new_variable_and_group_lists:
    df_variables_and_groups = generate_variable_and_group_lists(
        year = 2019, source = 'acs5', variable_filter = 'Estimate')
    # Item 0 holds the variables and item 1 holds the groups
    df_variables_and_groups[0].to_csv(
        'variables_from_html_acs5_2019.csv', index = False)
    df_variables_and_groups[1].to_csv(
        'groups_from_html_acs5_2019.csv', index = False)
    df_variables_and_groups[1]

retrieving data from: https://api.census.gov/data/2019/acs/acs5/variables.html


Here are all the variables that I can incorporate into my project, along with their descriptions:

In [5]:
df_variables = pd.read_csv('variables_from_html_acs5_2019.csv')
df_groups = pd.read_csv('groups_from_html_acs5_2019.csv')
df_variables

Unnamed: 0,Variable,Label,Concept,Group,Description
0,B01001A_001E,Estimate!!Total:,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:
1,B01001A_002E,Estimate!!Total:!!Male:,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Male:
2,B01001A_003E,Estimate!!Total:!!Male:!!Under 5 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
3,B01001A_004E,Estimate!!Total:!!Male:!!5 to 9 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
4,B01001A_005E,Estimate!!Total:!!Male:!!10 to 14 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
...,...,...,...,...,...
27034,C27021_011E,Estimate!!Total:!!In family households:!!In ot...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27035,C27021_012E,Estimate!!Total:!!In family households:!!In ot...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27036,C27021_013E,Estimate!!Total:!!In non-family households and...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27037,C27021_014E,Estimate!!Total:!!In non-family households and...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...


And here are all the groups (categories of variables) that I can use:

In [6]:
df_groups

Unnamed: 0,Concept,Group
0,SEX BY AGE (WHITE ALONE),B01001A
1,SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE),B01001B
2,SEX BY AGE (AMERICAN INDIAN AND ALASKA NATIVE ...,B01001C
3,SEX BY AGE (ASIAN ALONE),B01001D
4,SEX BY AGE (NATIVE HAWAIIAN AND OTHER PACIFIC ...,B01001E
...,...,...
1131,PUBLIC HEALTH INSURANCE BY WORK EXPERIENCE,C27014
1132,HEALTH INSURANCE COVERAGE STATUS BY RATIO OF I...,C27016
1133,PRIVATE HEALTH INSURANCE BY RATIO OF INCOME TO...,C27017
1134,PUBLIC HEALTH INSURANCE BY RATIO OF INCOME TO ...,C27018


I could then look through these two CSV files ('groups_from_html_acs5_2019.csv' and 'variables_from_html_acs5_2019.csv') to determine which variables to add to my project.

First, I could look through the groups file to find categories that interested me. For instance, since I was interested in comparing educational attainment across regions, I wanted to look further into group B16010, which contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER'. One of this group's variable entries (which I could find within the variables file) is B16010_015E, whose data label is 'Estimate!!Total:!!High school graduate (includes equivalency):'. I added this variable code to my variable list, along with other variables that covered education, poverty, and demographic data.

When creating your own variable list, you might find it helpful to create a copy of the .csv file containing variables, then add an 'include' column to the file. You can then use this column to keep track of which variables to use. Once you've selected all the desired variables, you can sort the .csv file by the 'include' column; copy and paste all the variable codes that you wish to include into your Python notebook; and then convert these codes into a list. This strategy makes it easier to keep track of which codes you've already included and what they refer to.
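
As a rough sketch of this 'include' column workflow, pandas can do the filtering and list conversion directly. The small DataFrame below stands in for an annotated copy of the variables .csv file (which you would normally load with pd.read_csv), and marking selected rows with 1 is just one possible convention.

```python
import pandas as pd

# Stand-in for an annotated copy of the variables file;
# 1 marks a variable to keep and 0 marks one to skip
df_candidates = pd.DataFrame({
    "Variable": ["B01001_001E", "B16010_015E", "B16010_028E", "B19013_001E"],
    "include": [1, 1, 0, 1],
})

# Keep only the flagged rows and convert their codes into a Python list
selected_codes = df_candidates.loc[
    df_candidates["include"] == 1, "Variable"].tolist()
print(selected_codes)
```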

My list focused on variables from the following categories:

1. Household types (mostly married households vs. ones led by a female householder with no spouse present, which, for brevity's sake, I'll abbreviate as 'female-householder' homes)
2. The presence of children within these households
3. Median household income
4. Poverty status by family type
5. Poverty status by family type and the highest level of education completed

In [7]:
variable_list = ['B01001_001E', 'B11005_001E', 'B11005_013E', 'B11005_002E', 
'B11005_004E', 'B19013_001E', 'B17006_002E', 'B17006_016E', 
'B17006_003E', 'B17006_017E', 'B17006_012E', 'B17006_026E',
'B17006_008E', 'B17006_022E', 'B17018_004E', 'B17018_021E', 
'B17018_005E', 'B17018_022E', 'B17018_006E', 'B17018_023E', 
'B17018_007E', 'B17018_024E', 'B17018_015E', 'B17018_032E', 
'B17018_016E', 'B17018_033E', 'B17018_017E', 'B17018_034E', 
'B17018_018E', 'B17018_035E', 'B16010_001E', 'B16010_002E',
'B16010_015E', 'B16010_028E', 'B16010_041E']

# You can use the following template for your own list:
# variable_list = ['', '', '', '',
# '', '', '', '',
# '', '', '', '']


To see how the function performs with longer variable lists, you can try using the following version of variable_list (which I created for a separate project):

In [8]:
# variable_list = ['B01001_001E', 'B01002_001E', 'B06008_001E', 'B06008_003E', 
# 'B08124_001E', 'B08124_002E', 'B08124_003E', 'B08124_004E', 'B08124_005E', 
# 'B08124_006E', 'B08124_007E', 'B09001_001E', 'B13002_001E', 'B13002_002E',
# 'B14001_001E', 'B14001_002E', 'B14001_008E', 'B14001_009E', 'B16010_001E',
# 'B16010_002E', 'B16010_041E', 'B17001_001E', 'B17001_002E', 'B19001_001E',
# 'B19001_002E', 'B19013_001E', 'B19083_001E', 'B19325_001E', 'B19325_002E',
# 'B19325_003E', 'B19325_049E', 'B19325_050E', 'B23025_001E', 'B23025_002E',
# 'B23025_004E', 'B23025_005E', 'B23027_012E', 'B23027_013E', 'B24011_001E',
# 'B24011_002E', 'B24011_018E', 'B24011_026E', 'B24011_029E', 'B24011_033E',
# 'B25002_001E', 'B25002_002E', 'B25010_001E', 'B25064_001E', 'B25077_001E',
# 'B25105_001E', 'C27012_001E', 'C27012_002E']

Next, I created a filtered version of df_variables that includes only the variables stored in variable_list.

In [9]:
df_variables = df_variables.query("Variable in @variable_list").copy()
# Adding .copy() here prevents a SettingWithCopyWarning later on.
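
The filter above silently drops any code in variable_list that is missing from the variables table, so a mistyped code would go unnoticed. Here is a small sketch, on toy data, of the same filter written two ways plus a check for unmatched codes. The code 'B99999_001E' is a deliberately fake entry used only to trigger the check.

```python
import pandas as pd

df_all = pd.DataFrame({
    "Variable": ["B01001_001E", "B01001_002E", "B19013_001E"],
    "Group": ["B01001", "B01001", "B19013"],
})
wanted = ["B01001_001E", "B19013_001E", "B99999_001E"]  # last code is fake

# Two equivalent ways to keep only the rows whose code is in the list
via_query = df_all.query("Variable in @wanted").copy()
via_isin = df_all[df_all["Variable"].isin(wanted)].copy()

# Codes that were requested but never found in the table
missing_codes = sorted(set(wanted) - set(df_all["Variable"]))
print(missing_codes)
```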

In [10]:
df_variables

Unnamed: 0,Variable,Label,Concept,Group,Description
279,B01001_001E,Estimate!!Total:,SEX BY AGE,B01001,SEX BY AGE Estimate!!Total:
6835,B11005_001E,Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6836,B11005_002E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6838,B11005_004E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6847,B11005_013E,Estimate!!Total:!!Households with no people un...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
8692,B16010_001E,Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8693,B16010_002E,Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8706,B16010_015E,Estimate!!Total:!!High school graduate (includ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8719,B16010_028E,Estimate!!Total:!!Some college or associate's ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8732,B16010_041E,Estimate!!Total:!!Bachelor's degree or higher:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...


It's now time to query census data on each of the variables in variable_list for the year specified. retrieve_census_data (contained in census_query.py) allows for this to be accomplished in blocks of variables, thus saving time.
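
I have not reproduced retrieve_census_data here, but the general idea of querying in blocks can be sketched as follows. The function name split_into_blocks and the default block size of 50 are assumptions on my part (the size reflects the per-request variable limit as I understand it; please check the Census API documentation), and the codes are placeholders rather than real variables.

```python
def split_into_blocks(variables, block_size=50):
    """Return consecutive slices of `variables`, each at most block_size long."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [variables[i:i + block_size]
            for i in range(0, len(variables), block_size)]

# Placeholder codes standing in for a long variable list
placeholder_codes = [f"VAR_{i:03d}" for i in range(120)]
blocks = split_into_blocks(placeholder_codes)

# Each block would become one API request instead of one request per variable
for block in blocks:
    print(f"{block[0]} to {block[-1]}: {len(block)} variables")
```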

The following three code blocks retrieve 2019 American Community Survey (5-year estimates) data for zip codes, counties, and states. Each block **also** merges in population data from five years earlier (the 2014 5-year estimates, stored as population_2014) so that each region's population growth can be measured.

In [11]:
zip_data = retrieve_census_data(df_variable_list = df_variables, year = year, region = 'zip', source = 'acs5', api_key = api_key)
zip_data = zip_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'zip', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
zip_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER 
Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,25245,2019,54,600,294,20,20,152,499,65,...,16,48,89,0,0,0,0,0,57895,686
1,25268,2019,54,964,354,80,54,197,745,157,...,50,61,71,51,0,0,0,0,27200,731
2,25286,2019,54,1700,613,211,152,152,1177,322,...,35,216,28,10,0,9,17,0,38313,1479
3,25303,2019,54,6764,2970,876,534,731,4838,201,...,0,174,469,583,0,48,54,116,58820,7181
4,25311,2019,54,10964,5088,1229,527,999,7866,754,...,101,418,432,533,0,99,299,126,40920,10059
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33115,38704,2019,28,2,2,0,0,0,2,0,...,0,0,0,0,0,0,0,0,-666666666,3
33116,38731,2019,28,246,54,46,26,8,145,59,...,26,0,8,0,0,20,0,0,53173,184
33117,38749,2019,28,71,34,5,0,5,62,29,...,0,5,0,0,0,0,0,0,18750,20
33118,38781,2019,28,198,107,12,7,7,172,37,...,0,6,4,4,4,0,0,0,10772,247
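
Note the value -666666666 in the median household income column for zip code 38704. The Census Bureau uses large negative numbers like this as placeholders when an estimate could not be computed, so they should not be treated as real dollar amounts. Below is a minimal sketch of one way to handle them, using three rows from the output above. The short column name and the list of placeholder values are my own assumptions (other placeholder codes may exist, so check the ACS documentation).

```python
import numpy as np
import pandas as pd

# Placeholder codes to treat as missing; only the one seen above is listed
CENSUS_PLACEHOLDER_VALUES = [-666666666]

zip_sample = pd.DataFrame({
    "NAME": ["38704", "38731", "38749"],
    "median_household_income": [-666666666, 53173, 18750],
})

# Replace placeholders with NaN so they are ignored by means, sums, etc.
cleaned = zip_sample.replace(CENSUS_PLACEHOLDER_VALUES, np.nan)
print(cleaned["median_household_income"].mean())
```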


In [12]:
county_data = retrieve_census_data(df_variable_list = df_variables, year = year, source = 'acs5', region = 'county', api_key = api_key)
county_data = county_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'county', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
county_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,county,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school 
graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,"Fayette County, Illinois",2019.0,17.0,51.0,21565.0,7737.0,2193.0,1433.0,2778.0,15303.0,...,411.0,1489.0,1414.0,572.0,27.0,168.0,162.0,39.0,46650.0,22041.0
1,"Logan County, Illinois",2019.0,17.0,107.0,29003.0,10797.0,2831.0,2023.0,3538.0,20373.0,...,171.0,1732.0,1793.0,1705.0,35.0,208.0,264.0,153.0,57308.0,30047.0
2,"Saline County, Illinois",2019.0,17.0,165.0,23994.0,9972.0,3122.0,1823.0,3046.0,17113.0,...,257.0,781.0,2178.0,1183.0,34.0,177.0,279.0,162.0,44090.0,24876.0
3,"Lake County, Illinois",2019.0,17.0,97.0,701473.0,246122.0,90926.0,68192.0,76607.0,457676.0,...,9449.0,19759.0,31821.0,79600.0,1827.0,4757.0,6471.0,7024.0,89427.0,703170.0
4,"Massac County, Illinois",2019.0,17.0,127.0,14219.0,5822.0,1886.0,1293.0,1807.0,10021.0,...,210.0,729.0,1369.0,546.0,36.0,66.0,175.0,50.0,47481.0,15148.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,"Knox County, Tennessee",2019.0,47.0,93.0,461104.0,187319.0,53211.0,35388.0,52752.0,308366.0,...,3744.0,15768.0,23352.0,41482.0,1258.0,3874.0,5607.0,4066.0,57470.0,440732.0
3218,"Benton County, Washington",2019.0,53.0,5.0,197518.0,72121.0,24931.0,16610.0,21172.0,127960.0,...,1975.0,6346.0,12719.0,14841.0,564.0,1219.0,2673.0,1490.0,69023.0,182053.0
3219,"Clark County, Washington",2019.0,53.0,11.0,473252.0,174661.0,59067.0,42004.0,53759.0,319955.0,...,4124.0,16089.0,35353.0,36977.0,1011.0,3178.0,7050.0,3032.0,75253.0,438272.0
3220,"Shannon County, South Dakota",,,,,,,,,,...,,,,,,,,,,14005.0
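
The last row above shows a side effect of the outer merge: Shannon County, South Dakota has a 2014 population but no 2019 data. The county was renamed Oglala Lakota County in 2015, which likely explains why its name appears in only one of the two years. As a sketch of how to detect such rows, pandas can label each row's origin via the indicator argument of merge. The DataFrames below reuse a few figures from the output above, and the column name population_2019 is only for illustration.

```python
import pandas as pd

current = pd.DataFrame({
    "NAME": ["Knox County, Tennessee", "Clark County, Washington"],
    "population_2019": [461104, 473252],
})
earlier = pd.DataFrame({
    "NAME": ["Knox County, Tennessee", "Clark County, Washington",
             "Shannon County, South Dakota"],
    "population_2014": [440732, 438272, 14005],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = current.merge(earlier, on="NAME", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["NAME", "_merge"]])
```

Merging on a geographic identifier instead of NAME may be more robust to renames, though identifiers can also change between years.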


In [13]:
state_data = retrieve_census_data(df_variable_list = df_variables, year = year, source = 'acs5', region = 'state', api_key = api_key)
state_data = state_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'state', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
state_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER 
Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,Alabama,2019,1,4876250,1867893,560887,346010,537511,3320877,458922,...,62087,210389,264847,298240,20161,48224,67244,40003,50536,4817678
1,Alaska,2019,2,737068,253346,87149,57885,66718,480586,34376,...,4590,23254,45878,46756,1451,5736,8862,4896,77640,728300
2,Arizona,2019,4,7050299,2571268,789782,496083,728297,4732532,608637,...,90976,201862,399596,455654,27334,52220,95193,55475,58945,6561516
3,Arkansas,2019,5,2999370,1158071,364192,227539,332292,2011639,270168,...,44422,151631,163452,165962,11825,30258,38982,20971,47597,2947036
4,California,2019,6,39283497,13044266,4482879,3050730,3440506,26471543,4418675,...,738802,930831,1770325,2678932,202212,268597,478465,350097,75235,38066920
5,Colorado,2019,8,5610349,2148994,658465,466708,602135,3825579,315751,...,54539,161380,297242,518231,13597,34086,58818,50725,72331,5197580
6,Delaware,2019,10,957248,363322,102861,64145,112493,669320,66816,...,9956,39777,48062,72373,2578,11959,11590,9751,68287,917060
7,District of Columbia,2019,11,692683,284386,59173,29257,44627,494116,44850,...,3222,5447,6985,56225,3223,8405,8166,8300,86420,633736
8,Connecticut,2019,9,3575074,1370746,403685,266633,392880,2483095,232663,...,31391,127014,151799,331188,12098,39620,45503,37762,78444,3592053
9,Florida,2019,12,20901636,7736311,2087688,1281766,2340583,14965745,1767583,...,227887,746082,1061698,1381751,78473,201731,274802,198438,55660,19361792


I admit that many of the column names are obscenely long and unwieldy. This is less of an issue when viewing the table as a CSV export (which I'll perform later), since spreadsheet software can make the columns a uniform width while allowing the full name to be displayed in a separate box. An alternative to these long names, though, would be to keep the variable codes as the column name, then include a key mapping each variable code to its description.
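
The alternative mentioned above, keeping variable codes as column names alongside a separate key, could look something like this sketch. The sample DataFrames are stand-ins built from values shown earlier in this tutorial; the names variable_key, df_variables_sample, and state_sample are my own.

```python
import pandas as pd

df_variables_sample = pd.DataFrame({
    "Variable": ["B01001_001E", "B19013_001E"],
    "Description": [
        "SEX BY AGE Estimate!!Total:",
        "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 "
        "INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income "
        "in the past 12 months (in 2019 inflation-adjusted dollars)",
    ],
})
state_sample = pd.DataFrame({
    "NAME": ["Alabama", "Alaska"],
    "B01001_001E": [4876250, 737068],
    "B19013_001E": [50536, 77640],
})

# The key maps each short code to its full description
variable_key = dict(zip(df_variables_sample["Variable"],
                        df_variables_sample["Description"]))
print(variable_key["B01001_001E"])

# The long names can still be restored when needed, e.g. right before export
state_sample_long_names = state_sample.rename(columns=variable_key)
```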

So far, the values shown in the DataFrame are raw counts rather than proportions. For example, the table reports the number of married-couple households with one or more children, but doesn't say what *proportion* have at least one child--which is much more useful when comparing different zip codes.

Therefore, in the following code block, I added columns to the DataFrame that store various proportions, using a function called calc_proportions_and_rename. Some of these were generated using pre-existing totals as the denominator, whereas others used the sum of two different statistics as the denominator. (For example, to calculate the proportion of children below the poverty level for a given zip code, I divided the number of children below the poverty level by the sum of (1) children below the poverty level and (2) children at or above the poverty level. This was a useful strategy when a given Census table didn't have a 'totals' row.)

(When creating proportions, be careful about using a total in one table as the denominator for a proportion calculation that involves a separate table. For example, if Table A says that there are 10,000 kids in a zip code, and Table B says that there are 2,000 kids below the poverty line, you may be tempted to conclude that the proportion of children below the poverty line equals 2,000/10,000 = 20%. However, suppose not all the kids identified in Table A show up in Table B, and that Table B doesn't have a totals row. In that case, you'd want to divide the number of kids in Table B below the poverty level (2,000) by the sum of that number and the number of kids in Table B at or above the poverty level (let's say it's 6,000) to arrive at a more accurate proportion--in this case, 2,000/(2,000+6,000) = 2,000/8,000 = 25%.)
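
The within-table calculation can be sketched as a small helper. The name proportion_of_sum and the short column names are hypothetical and are not part of the function defined below. The first row uses the 2,000 and 6,000 figures from the example, and the second row shows what happens when a region has no children in either category.

```python
import numpy as np
import pandas as pd

def proportion_of_sum(df, part_col, complement_col):
    """Return part / (part + complement) for each row of df."""
    denominator = df[part_col] + df[complement_col]
    # 0/0 already yields NaN in pandas; the replace just makes that explicit
    return df[part_col] / denominator.replace(0, np.nan)

df_example = pd.DataFrame({
    "below_poverty": [2000, 0],
    "at_or_above_poverty": [6000, 0],
})
df_example["proportion_below"] = proportion_of_sum(
    df_example, "below_poverty", "at_or_above_poverty")
print(df_example)
```

A helper like this could also shorten repeated numerator/(numerator + complement) expressions, at the cost of one more function to maintain.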

The calc_proportions_and_rename function also renames certain columns to make them more intuitive.

In [14]:
def calc_proportions_and_rename(df_results):

    df_results['5_year_population_growth'] = (df_results['SEX BY AGE Estimate!!Total:']/df_results['population_2014']-1)

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:']

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households_with_one_or_more_children'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:']

    df_results['Proportion_of_children_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:'])

    df_results['Proportion_of_children_in_married_couple_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In married-couple family:'])

    df_results['Proportion_of_children_in_female_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Female householder, no spouse present:'])

    df_results['Proportion_of_children_in_male_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Male householder, no spouse present:'])

    # Calculating proportions of families living below the poverty level by household type and householder's educational attainment

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate'])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher"])


    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate"]/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"])

    # Calculating proportions of individuals aged 25 and over by highest educational attainment

    df_results['Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!High school graduate (includes equivalency):']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate\'s_degree'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Some college or associate's degree:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Bachelor's degree or higher:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    # df_results[''] = df_results['']/(df_results['']+df_results[''])

    df_results.rename(columns = {

        "SEX BY AGE Estimate!!Total:":"Total_population",

        "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars)":"Median_household_income",
        
        "HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:":"Households",
        
        },inplace=True)



    return df_results
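
Most of the assignments in calc_proportions_and_rename follow one pattern: a "below poverty level" count divided by the sum of the "below" and "at or above" counts. The sketch below shows how that pattern could be factored into a small helper. The function name add_poverty_proportion and the toy column labels are placeholders of mine, not part of census_query.py, so treat this as an illustration rather than the program's actual code.

```python
import pandas as pd

def add_poverty_proportion(df, new_col, below_col, at_or_above_col):
    """Hypothetical helper: store below / (below + at_or_above) in new_col."""
    df[new_col] = df[below_col] / (df[below_col] + df[at_or_above_col])
    return df

# Toy counts standing in for the long Census column names.
toy = pd.DataFrame({'below': [20, 0, 5], 'at_or_above': [80, 0, 15]})
toy = add_poverty_proportion(toy, 'Proportion_below', 'below', 'at_or_above')
print(toy)
```

Note that a region with zero families in both categories produces NaN rather than an error, which is one likely reason NaN values can appear in the proportion columns for small regions.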
    

In [15]:
county_data = calc_proportions_and_rename(county_data)
county_data

Unnamed: 0,NAME,Year,state,county,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher
0,"Fayette County, Illinois",2019.0,17.0,51.0,21565.0,7737.0,2193.0,1433.0,2778.0,15303.0,...,0.056075,0.020548,0.666667,0.391304,0.384030,0.113636,0.161994,0.401621,0.323531,0.112854
1,"Logan County, Illinois",2019.0,17.0,107.0,29003.0,10797.0,2831.0,2023.0,3538.0,20373.0,...,0.029237,0.013881,0.708333,0.200000,0.180124,0.012903,0.112845,0.353213,0.334708,0.199234
2,"Saline County, Illinois",2019.0,17.0,165.0,23994.0,9972.0,3122.0,1823.0,3046.0,17113.0,...,0.069628,0.065561,0.690909,0.570388,0.487132,0.124324,0.133115,0.274762,0.399988,0.192135
3,"Lake County, Illinois",2019.0,17.0,97.0,701473.0,246122.0,90926.0,68192.0,76607.0,457676.0,...,0.024105,0.011499,0.434190,0.264419,0.198241,0.078820,0.093649,0.207586,0.245519,0.453246
4,"Massac County, Illinois",2019.0,17.0,127.0,14219.0,5822.0,1886.0,1293.0,1807.0,10021.0,...,0.041317,0.053726,0.550000,0.600000,0.297189,0.000000,0.132322,0.335795,0.392576,0.139307
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,"Knox County, Tennessee",2019.0,47.0,93.0,461104.0,187319.0,53211.0,35388.0,52752.0,308366.0,...,0.035639,0.014890,0.506086,0.374152,0.265907,0.090380,0.082850,0.252748,0.287947,0.376455
3218,"Benton County, Washington",2019.0,53.0,5.0,197518.0,72121.0,24931.0,16610.0,21172.0,127960.0,...,0.050608,0.012903,0.466919,0.347081,0.224768,0.064658,0.098312,0.244616,0.347929,0.309143
3219,"Clark County, Washington",2019.0,53.0,11.0,473252.0,174661.0,59067.0,42004.0,53759.0,319955.0,...,0.033833,0.016700,0.307060,0.209453,0.195756,0.075328,0.073348,0.242437,0.378072,0.306143
3220,"Shannon County, South Dakota",,,,,,,,,,...,,,,,,,,,,


In [16]:
state_data = calc_proportions_and_rename(state_data)
state_data

Unnamed: 0,NAME,Year,state,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher
0,Alabama,2019,1,4876250,1867893,560887,346010,537511,3320877,458922,...,0.046679,0.015528,0.520581,0.398838,0.311018,0.106018,0.138193,0.308003,0.299121,0.254683
1,Alaska,2019,2,737068,253346,87149,57885,66718,480586,34376,...,0.02953,0.009407,0.364989,0.311487,0.194876,0.063862,0.071529,0.280037,0.352921,0.295512
2,Arizona,2019,4,7050299,2571268,789782,496083,728297,4732532,608637,...,0.049651,0.023411,0.485497,0.311236,0.225336,0.099841,0.128607,0.238589,0.338136,0.294668
3,Arkansas,2019,5,2999370,1158071,364192,227539,332292,2011639,270168,...,0.054923,0.021075,0.491398,0.366842,0.328429,0.112339,0.134302,0.340349,0.295071,0.230278
4,California,2019,6,39283497,13044266,4482879,3050730,3440506,26471543,4418675,...,0.043251,0.021298,0.393669,0.272962,0.205808,0.091198,0.166922,0.204879,0.28894,0.33926
5,Colorado,2019,8,5610349,2148994,658465,466708,602135,3825579,315751,...,0.036452,0.014747,0.418832,0.273684,0.217846,0.088155,0.082537,0.213681,0.294659,0.409123
6,Delaware,2019,10,957248,363322,102861,64145,112493,669320,66816,...,0.035074,0.013279,0.451489,0.275168,0.224282,0.057601,0.099827,0.312928,0.267312,0.319934
7,District of Columbia,2019,11,692683,284386,59173,29257,44627,494116,44850,...,0.073976,0.006608,0.493717,0.375464,0.258108,0.07788,0.090768,0.168351,0.155474,0.585407
8,Connecticut,2019,9,3575074,1370746,403685,266633,392880,2483095,232663,...,0.025743,0.013164,0.439674,0.240327,0.205576,0.076566,0.093699,0.268547,0.244912,0.392842
9,Florida,2019,12,20901636,7736311,2087688,1281766,2340583,14965745,1767583,...,0.047571,0.029266,0.410075,0.295617,0.219937,0.109616,0.118109,0.285735,0.297361,0.298796


In [17]:
zip_data = calc_proportions_and_rename(zip_data)
zip_data

Unnamed: 0,NAME,Year,state,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher
0,25245,2019,54,600,294,20,20,152,499,65,...,0.000000,1.000000,,,,,0.130261,0.348697,0.446894,0.074148
1,25268,2019,54,964,354,80,54,197,745,157,...,0.000000,0.000000,,,,,0.210738,0.343624,0.195973,0.249664
2,25286,2019,54,1700,613,211,152,152,1177,322,...,0.000000,0.000000,1.000000,0.590909,0.000000,,0.273577,0.491929,0.189465,0.045030
3,25303,2019,54,6764,2970,876,534,731,4838,201,...,0.000000,0.031561,1.000000,0.094340,0.393258,0.000000,0.041546,0.242042,0.324721,0.391691
4,25311,2019,54,10964,5088,1229,527,999,7866,754,...,0.048458,0.000000,1.000000,0.673267,0.155367,0.207547,0.095856,0.368167,0.271803,0.264175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33115,38704,2019,28,2,2,0,0,0,2,0,...,,,,,,,0.000000,1.000000,0.000000,0.000000
33116,38731,2019,28,246,54,46,26,8,145,59,...,0.000000,,,0.000000,,,0.406897,0.489655,0.055172,0.048276
33117,38749,2019,28,71,34,5,0,5,62,29,...,,,,,,,0.467742,0.451613,0.080645,0.000000
33118,38781,2019,28,198,107,12,7,7,172,37,...,0.000000,0.000000,0.428571,1.000000,,,0.215116,0.552326,0.156977,0.075581


A look at the first few rows in the zip code table reveals that some median household income values are clearly inaccurate! $-666,666,666 is *not* the actual median household income in any zip code, yet that's the value listed for 2,299 entries in zip_data, as shown below:

In [18]:
len(zip_data.query("Median_household_income == -666666666"))

2299

This means that you must be careful when averaging values across the entire dataset. Otherwise, you'll end up with results like the one below:

In [19]:
np.mean(zip_data['Median_household_income'])

-46219120.15483092

This result is, of course, skewed by the thousands of -666,666,666 values. The U.S. would be in dire shape if the average median household income among zip codes were truly $-46,219,120!

As shown below, performing some basic data cleaning (e.g. removing any results with a negative median household income) can produce a more accurate number.

In [20]:
np.mean(zip_data.query('Median_household_income > 0')['Median_household_income'])

61302.54067032218
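
An alternative to filtering with query is to convert the placeholder into a missing value, since pandas skips NaN entries by default when computing a mean. The sketch below uses a toy DataFrame rather than zip_data, and the constant name MISSING_INCOME_CODE is my own. Only -666,666,666 was observed above; other placeholder codes may exist in Census data, so this is a starting point rather than a complete cleaning step.

```python
import numpy as np
import pandas as pd

MISSING_INCOME_CODE = -666666666  # placeholder value observed in zip_data above

toy = pd.DataFrame({'Median_household_income': [52000, MISSING_INCOME_CODE, 68000]})

# Swap the placeholder for NaN so it is excluded from summary statistics.
cleaned = toy['Median_household_income'].replace(MISSING_INCOME_CODE, np.nan)

print(toy['Median_household_income'].mean())  # skewed by the placeholder
print(cleaned.mean())                         # NaN is skipped by default
```

Keeping the rows (with NaN in the income column) rather than dropping them means the other columns for those zip codes remain available for analysis.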

I then exported the zip code, county, and state data to CSV files. I also created copies of the county and zip code DataFrames that only include regions with more than 1,000 households, since small sample sizes in less populated regions can skew the proportions shown.

In [21]:
zip_data_1k_plus_households = zip_data.query("Households > 1000").reset_index(drop=True)
county_data_1k_plus_households = county_data.query("Households > 1000").reset_index(drop=True)

zip_data.to_csv('acs5_'+str(year)+'_zip_results.csv')
zip_data_1k_plus_households.to_csv('acs5_'+str(year)+'_zip_results_1k_plus_households.csv')

county_data.to_csv('acs5_'+str(year)+'_county_results.csv')
county_data_1k_plus_households.to_csv('acs5_'+str(year)+'_county_results_1k_plus_households.csv')

state_data.to_csv('acs5_'+str(year)+'_state_results.csv')
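
The filtering and file naming steps above can be sketched on a toy DataFrame. The value of year below is an assumption (the real variable is defined earlier in the tutorial), and the toy filename is made up for illustration. The comparison is strict, so a region with exactly 1,000 households is excluded.

```python
import pandas as pd

year = 2019            # assumed value; `year` is defined earlier in the tutorial
min_households = 1000  # threshold used in the cell above

toy = pd.DataFrame({'NAME': ['A', 'B', 'C'], 'Households': [250, 1000, 4200]})

# Keep rows with strictly more than min_households households.
filtered = toy[toy['Households'] > min_households].reset_index(drop=True)

# Build the output filename with an f-string instead of string concatenation.
filename = f'acs5_{year}_toy_results_1k_plus_households.csv'
print(filtered)
print(filename)
```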

Next, I'll use the compare_variable_across_years function within census_query.py to evaluate how counties, states, and zip codes have grown in population over time. Whereas the previous functions all used the American Community Survey (5-year estimates) as a data source, some of the code blocks below will also use data from the decennial census and the American Community Survey (1-year estimates).

The variable codes for a region's total population can be found on the Census API website (see links near the beginning of this tutorial). They are as follows:

ACS5: population = B01001_001E

ACS1: population = B01001_001E

Census (redistricting data): population = P1_001N

Census (SF1 data): population = P001001

In [22]:
acs5_state_pop_2010_to_2020 = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [2010, 2015, 2020], region = 'state', api_key = api_key)
acs5_state_pop_2010_to_2020.to_csv('acs5_state_pop_2010_to_2020.csv')
acs5_state_pop_2010_to_2020

Retrieving data for: 2010
Retrieving data for: 2015
Retrieving data for: 2020


Unnamed: 0,NAME,state,population_2010,population_2015,population_2020,2010_to_2015_chg,2015_to_2020_chg,2010_to_2020_chg
0,Alabama,1,4712651,4830620,4893186,0.025032,0.012952,0.038309
1,Alaska,2,691189,733375,736990,0.061034,0.004929,0.066264
2,Arizona,4,6246816,6641928,7174064,0.06325,0.080118,0.148435
3,Arkansas,5,2872684,2958208,3011873,0.029771,0.018141,0.048453
4,California,6,36637290,38421464,39346023,0.048698,0.024064,0.073934
5,Colorado,8,4887061,5278906,5684926,0.08018,0.076914,0.163261
6,Connecticut,9,3545837,3593222,3570549,0.013364,-0.00631,0.006969
7,Delaware,10,881278,926454,967679,0.051262,0.044498,0.098041
8,District of Columbia,11,584400,647484,701974,0.107947,0.084157,0.201188
9,Florida,12,18511620,19645772,21216924,0.061267,0.079974,0.146141
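
The _chg columns appear to be fractional changes, i.e. (later value / earlier value) - 1, which is the same formula applied to the decennial census columns later in this tutorial. The sketch below recomputes Alabama's figures from the table above. It does not call compare_variable_across_years, whose internals are not shown here, so the formula is an inference from the output.

```python
import pandas as pd

# Alabama's ACS5 population estimates, copied from the table above.
alabama = pd.DataFrame({
    'population_2010': [4712651],
    'population_2015': [4830620],
    'population_2020': [4893186],
})

# Fractional change between two years: (later / earlier) - 1.
alabama['2010_to_2015_chg'] = alabama['population_2015'] / alabama['population_2010'] - 1
alabama['2015_to_2020_chg'] = alabama['population_2020'] / alabama['population_2015'] - 1
alabama['2010_to_2020_chg'] = alabama['population_2020'] / alabama['population_2010'] - 1

print(alabama.round(6))
```

Multiplying these values by 100 gives percentage changes; for example, 0.025032 corresponds to growth of roughly 2.5%.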


In [23]:
acs5_county_pop_2010_to_2020 = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [2010, 2015, 2020], region = 'county', api_key = api_key)
acs5_county_pop_2010_to_2020.dropna(inplace=True)
acs5_county_pop_2010_to_2020.to_csv('acs5_county_pop_2010_to_2020.csv')
acs5_county_pop_2010_to_2020

Retrieving data for: 2010
Retrieving data for: 2015
Retrieving data for: 2020


Unnamed: 0,NAME,state,county,population_2010,population_2015,population_2020,2010_to_2015_chg,2015_to_2020_chg,2010_to_2020_chg
0,"Las Marías Municipio, Puerto Rico",72,083,10156.0,9306.0,8131.0,-0.083694,-0.126263,-0.199390
1,"San Germán Municipio, Puerto Rico",72,125,35997.0,34125.0,30811.0,-0.052004,-0.097114,-0.144068
2,"Comerío Municipio, Puerto Rico",72,045,20773.0,20339.0,18942.0,-0.020893,-0.068686,-0.088143
3,"Canóvanas Municipio, Puerto Rico",72,029,47151.0,47432.0,45120.0,0.005960,-0.048743,-0.043074
4,"Rincón Municipio, Puerto Rico",72,117,15203.0,14841.0,13849.0,-0.023811,-0.066842,-0.089061
...,...,...,...,...,...,...,...,...,...
3216,"Grimes County, Texas",48,185,26208.0,26961.0,28447.0,0.028732,0.055117,0.085432
3217,"Guadalupe County, Texas",48,187,122728.0,143460.0,163030.0,0.168926,0.136414,0.328385
3218,"Hale County, Texas",48,189,36041.0,35504.0,33463.0,-0.014900,-0.057486,-0.071530
3219,"Hall County, Texas",48,191,3424.0,3203.0,3025.0,-0.064544,-0.055573,-0.116530


In [24]:
acs5_zip_pop_2015_to_2020 = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [2015, 2020], region = 'zip', api_key = api_key)
# I received an error when trying to retrieve 2010 population data using this 
# function, which indicated that that data was either unavailable or existed
# under a different name. Therefore, I chose to run this function for only
# 2015 and 2020.
acs5_zip_pop_2015_to_2020.dropna(inplace=True)
acs5_zip_pop_2015_to_2020.to_csv('acs5_zip_pop_2015_to_2020.csv')
acs5_zip_pop_2015_to_2020

Retrieving data for: 2015
Retrieving data for: 2020


Unnamed: 0,NAME,state,population_2015,population_2020,2015_to_2020_chg
0,20152,51,28574,36465,0.276160
1,20155,51,32716,37151,0.135561
2,20615,24,460,513,0.115217
3,20646,24,18615,22583,0.213161
4,20657,24,19525,19711,0.009526
...,...,...,...,...,...
33115,33847,12,184,154,-0.163043
33116,33860,12,22607,28372,0.255010
33117,33873,12,14244,13394,-0.059674
33118,33884,12,29845,33008,0.105981


2020 data is not available for the 1-year estimates version of the American Community Survey, so the following code block retrieves data from 2010 to 2019 instead.

In [25]:
acs1_state_pop_2010_to_2019 = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs1', year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], region = 'state', api_key = api_key)
acs1_state_pop_2010_to_2019.to_csv('acs1_state_pop_2010_to_2019.csv')
acs1_state_pop_2010_to_2019

Retrieving data for: 2010
Retrieving data for: 2011
Retrieving data for: 2012
Retrieving data for: 2013
Retrieving data for: 2014
Retrieving data for: 2015
Retrieving data for: 2016
Retrieving data for: 2017
Retrieving data for: 2018
Retrieving data for: 2019


Unnamed: 0,NAME,state,population_2010,population_2011,population_2012,population_2013,population_2014,population_2015,population_2016,population_2017,...,2010_to_2011_chg,2011_to_2012_chg,2012_to_2013_chg,2013_to_2014_chg,2014_to_2015_chg,2015_to_2016_chg,2016_to_2017_chg,2017_to_2018_chg,2018_to_2019_chg,2010_to_2019_chg
0,Alabama,1,4785298,4802740,4822023,4833722,4849377,4858979,4863300,4874747,...,0.003645,0.004015,0.002426,0.003239,0.00198,0.000889,0.002354,0.002692,0.003133,0.024635
1,Alaska,2,713985,722718,731449,735132,736732,738432,741894,739795,...,0.012231,0.012081,0.005035,0.002176,0.002307,0.004688,-0.002829,-0.003186,-0.007991,0.024594
2,Arizona,4,6413737,6482505,6553255,6626624,6731484,6828065,6931071,7016270,...,0.010722,0.010914,0.011196,0.015824,0.014348,0.015086,0.012292,0.022145,0.01493,0.134864
3,Arkansas,5,2921606,2937979,2949131,2959373,2966369,2978204,2988248,3004279,...,0.005604,0.003796,0.003473,0.002364,0.00399,0.003373,0.005365,0.003177,0.00132,0.032926
4,California,6,37349363,37691912,38041430,38332521,38802500,39144818,39250017,39536653,...,0.009171,0.009273,0.007652,0.012261,0.008822,0.002687,0.007303,0.000516,-0.001133,0.057909
5,Colorado,8,5049071,5116796,5187582,5268367,5355866,5456574,5540545,5607154,...,0.013413,0.013834,0.015573,0.016608,0.018803,0.015389,0.012022,0.015767,0.011091,0.140554
6,Connecticut,9,3577073,3580709,3590347,3596080,3596677,3590886,3576452,3588184,...,0.001016,0.002692,0.001597,0.000166,-0.00161,-0.00402,0.00328,-0.004325,-0.002065,-0.003295
7,Delaware,10,899769,907135,917092,925749,935614,945934,952065,961939,...,0.008187,0.010976,0.00944,0.010656,0.01103,0.006481,0.010371,0.005439,0.006817,0.082238
8,District of Columbia,11,604453,617996,632323,646449,658893,672228,681170,693972,...,0.022405,0.023183,0.02234,0.01925,0.020238,0.013302,0.018794,0.012224,0.004689,0.167583
9,Florida,12,18843326,19057542,19317568,19552860,19893297,20271272,20612439,20984400,...,0.011368,0.013644,0.01218,0.017411,0.019,0.01683,0.018045,0.015008,0.008376,0.139806


In [26]:
acs1_county_pop_2010_to_2019 = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs1', year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], region = 'county', api_key = api_key)
acs1_county_pop_2010_to_2019.to_csv('acs1_county_pop_2010_to_2019.csv')
acs1_county_pop_2010_to_2019
# Note that fewer counties are contained in the American Community Survey (1-year estimates) dataset.

Retrieving data for: 2010
Retrieving data for: 2011
Retrieving data for: 2012
Retrieving data for: 2013
Retrieving data for: 2014
Retrieving data for: 2015
Retrieving data for: 2016
Retrieving data for: 2017
Retrieving data for: 2018
Retrieving data for: 2019


Unnamed: 0,NAME,state,county,population_2010,population_2011,population_2012,population_2013,population_2014,population_2015,population_2016,...,2010_to_2011_chg,2011_to_2012_chg,2012_to_2013_chg,2013_to_2014_chg,2014_to_2015_chg,2015_to_2016_chg,2016_to_2017_chg,2017_to_2018_chg,2018_to_2019_chg,2010_to_2019_chg
0,"Stark County, Ohio",39,151,375321.0,375087.0,374868.0,375432.0,375736.0,375165.0,373612.0,...,-0.000623,-0.000584,0.001505,0.000810,-0.001520,-0.004140,-0.002864,-0.002598,-0.002605,-0.012563
1,"Summit County, Ohio",39,153,541565.0,539832.0,540811.0,541824.0,541943.0,541968.0,540300.0,...,-0.003200,0.001814,0.001873,0.000220,0.000046,-0.003078,0.001718,0.001275,-0.001670,-0.001019
2,"Trumbull County, Ohio",39,155,209936.0,209264.0,207406.0,206442.0,205175.0,203751.0,201825.0,...,-0.003201,-0.008879,-0.004648,-0.006137,-0.006940,-0.009453,-0.007160,-0.008748,-0.003288,-0.056979
3,"Tuscarawas County, Ohio",39,157,92542.0,92508.0,92392.0,92672.0,92788.0,92916.0,92420.0,...,-0.000367,-0.001254,0.003031,0.001252,0.001379,-0.005338,-0.001331,-0.001311,-0.002050,-0.005997
4,"Warren County, Ohio",39,165,213192.0,214910.0,217241.0,219169.0,221659.0,224469.0,227063.0,...,0.008058,0.010846,0.008875,0.011361,0.012677,0.011556,0.008011,0.014379,0.010462,0.100426
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838,"Franklin County, North Carolina",,,,,,,,,,...,,,,,,,,0.021037,0.031454,
839,"Kershaw County, South Carolina",,,,,,,,,,...,,,,,,,,0.008549,0.014621,
840,"Mason County, Washington",,,,,,,,,,...,,,,,,,,,0.019250,
841,"Tehama County, California",,,,,,,,,,...,,,,,,,,,,
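
The change columns in the table above (2010_to_2011_chg and so on) are ratios of population columns, minus 1. The census_query.py implementation is not reproduced in this notebook, so the following is only a sketch of how such columns could be built. add_change_columns is a hypothetical helper, the first two sample rows reuse values from the table above, and the third row is invented to show that a missing year produces a missing change value.

```python
import pandas as pd

def add_change_columns(df, variable_name, year_list):
    """Hypothetical helper: append adjacent-year and first-to-last change columns."""
    df = df.copy()
    pairs = list(zip(year_list[:-1], year_list[1:]))
    pairs.append((year_list[0], year_list[-1]))
    for earlier, later in pairs:
        column = f'{earlier}_to_{later}_chg'
        if column not in df.columns:  # skip the duplicate when only two years are given
            df[column] = df[f'{variable_name}_{later}'] / df[f'{variable_name}_{earlier}'] - 1
    return df

sample = pd.DataFrame({
    'NAME': ['Stark County, Ohio', 'Summit County, Ohio', 'Example County (invented row)'],
    'population_2010': [375321.0, 541565.0, float('nan')],
    'population_2011': [375087.0, 539832.0, 50000.0],
    'population_2012': [374868.0, 540811.0, 50500.0],
})
sample_with_changes = add_change_columns(sample, 'population', [2010, 2011, 2012])
print(sample_with_changes.round(6))
```

Because any arithmetic with NaN yields NaN, counties that are absent from a given year's ACS1 release end up with blank change values for the affected periods.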


At the time I created this program, not all of the 2020 Census data was available via the Census API. Therefore, in the following code block, I retrieved population values from the 'Summary File 1' sections of the 2000 and 2010 decennial censuses, then merged these values with population values found in the 2020 Census's redistricting dataset. I then calculated the 2010-to-2020 and 2000-to-2020 percentage changes.

In [27]:
census_state_population = compare_variable_across_years(variable = 'P001001', variable_name = 'population', source= 'census_sf1', year_list = [2000, 2010], region = 'state', api_key = api_key)
census_state_population_2000_to_2020 = census_state_population.merge(compare_variable_across_years(variable = 'P1_001N', variable_name = 'population', source= 'census_redistricting', year_list = [2020], region = 'state', api_key = api_key).drop('state',axis=1), on = 'NAME')
census_state_population_2000_to_2020.insert(4, 'population_2020', census_state_population_2000_to_2020.pop('population_2020'))
census_state_population_2000_to_2020['2010_to_2020_chg'] = (census_state_population_2000_to_2020['population_2020'] / census_state_population_2000_to_2020['population_2010'])-1
census_state_population_2000_to_2020['2000_to_2020_chg'] = (census_state_population_2000_to_2020['population_2020'] / census_state_population_2000_to_2020['population_2000'])-1
census_state_population_2000_to_2020.to_csv('census_state_population_2000_to_2020.csv')

census_state_population_2000_to_2020

Retrieving data for: 2000
Retrieving data for: 2010
Retrieving data for: 2020


Unnamed: 0,NAME,state,population_2000,population_2010,population_2020,2000_to_2010_chg,2010_to_2020_chg,2000_to_2020_chg
0,Alabama,1,4447100,4779736,5024279,0.074798,0.051162,0.129788
1,Alaska,2,626932,710231,733391,0.132868,0.032609,0.169809
2,Arizona,4,5130632,6392017,7151502,0.245854,0.118818,0.393883
3,Arkansas,5,2673400,2915918,3011524,0.090715,0.032788,0.126477
4,California,6,33871648,37253956,39538223,0.099857,0.061316,0.167296
5,Colorado,8,4301261,5029196,5773714,0.169238,0.148039,0.342331
6,Connecticut,9,3405565,3574097,3605944,0.049487,0.008911,0.058839
7,Delaware,10,783600,897934,989948,0.145909,0.102473,0.263333
8,District of Columbia,11,572059,601723,689545,0.051855,0.145951,0.205374
9,Florida,12,15982378,18801310,21538187,0.176378,0.145568,0.347621
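
The merge-and-reorder pattern used in the cell above can be tried on a small scale. This is a minimal sketch: the two hand-built DataFrames are stand-ins for the outputs of compare_variable_across_years, with populations copied from the Alabama and Alaska rows above.

```python
import pandas as pd

# Stand-in for the Summary File 1 result (2000 and 2010)
sf1 = pd.DataFrame({
    'NAME': ['Alabama', 'Alaska'],
    'state': ['01', '02'],
    'population_2000': [4447100, 626932],
    'population_2010': [4779736, 710231],
})
sf1['2000_to_2010_chg'] = sf1['population_2010'] / sf1['population_2000'] - 1

# Stand-in for the redistricting result (2020)
redistricting = pd.DataFrame({
    'NAME': ['Alaska', 'Alabama'],
    'state': ['02', '01'],
    'population_2020': [733391, 5024279],
})

# Drop the duplicate 'state' column before merging so it does not appear twice
merged = sf1.merge(redistricting.drop('state', axis=1), on='NAME')

# pop() removes the column and returns it; insert() puts it back at position 4,
# directly after population_2010
merged.insert(4, 'population_2020', merged.pop('population_2020'))

merged['2010_to_2020_chg'] = merged['population_2020'] / merged['population_2010'] - 1
merged['2000_to_2020_chg'] = merged['population_2020'] / merged['population_2000'] - 1
print(merged.round(6))
```

Merging on NAME matches rows regardless of their order in either table, which is why the shuffled 2020 rows still line up with the correct states.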


The same process was used to obtain county-level population changes, with one extra step: rows with missing values are dropped before the table is exported.

In [28]:
census_county_population = compare_variable_across_years(variable = 'P001001', variable_name = 'population', source= 'census_sf1', year_list = [2000, 2010], region = 'county', api_key = api_key)
census_county_population_2000_to_2020 = census_county_population.merge(compare_variable_across_years(variable = 'P1_001N', variable_name = 'population', source= 'census_redistricting', year_list = [2020], region = 'county', api_key = api_key).drop(['state', 'county'],axis=1), on = 'NAME')
census_county_population_2000_to_2020.insert(5, 'population_2020', census_county_population_2000_to_2020.pop('population_2020')) # Index 5 (rather than 4) places population_2020 after population_2010, since this table also has a 'county' column
census_county_population_2000_to_2020['2010_to_2020_chg'] = (census_county_population_2000_to_2020['population_2020'] / census_county_population_2000_to_2020['population_2010'])-1
census_county_population_2000_to_2020['2000_to_2020_chg'] = (census_county_population_2000_to_2020['population_2020'] / census_county_population_2000_to_2020['population_2000'])-1

census_county_population_2000_to_2020.dropna(inplace=True)

census_county_population_2000_to_2020.to_csv('census_county_population_2000_to_2020.csv')

census_county_population_2000_to_2020

Retrieving data for: 2000
Retrieving data for: 2010
Retrieving data for: 2020


Unnamed: 0,NAME,state,county,population_2000,population_2010,population_2020,2000_to_2010_chg,2010_to_2020_chg,2000_to_2020_chg
0,"Autauga County, Alabama",01,001,43671.0,54571.0,58805,0.249594,0.077587,0.346546
1,"Baldwin County, Alabama",01,003,140415.0,182265.0,231767,0.298045,0.271594,0.650586
2,"Barbour County, Alabama",01,005,29038.0,27457.0,25223,-0.054446,-0.081364,-0.131380
3,"Bibb County, Alabama",01,007,20826.0,22915.0,22293,0.100307,-0.027144,0.070441
4,"Blount County, Alabama",01,009,51024.0,57322.0,59134,0.123432,0.031611,0.158945
...,...,...,...,...,...,...,...,...,...
3203,"Vega Baja Municipio, Puerto Rico",72,145,61929.0,59662.0,54414,-0.036606,-0.087962,-0.121349
3204,"Vieques Municipio, Puerto Rico",72,147,9106.0,9301.0,8249,0.021414,-0.113106,-0.094114
3205,"Villalba Municipio, Puerto Rico",72,149,27913.0,26073.0,22093,-0.065919,-0.152648,-0.208505
3206,"Yabucoa Municipio, Puerto Rico",72,151,39246.0,37941.0,30426,-0.033252,-0.198071,-0.224736


That concludes the main part of this tutorial program. I hope that you find these examples useful in performing your own census data analysis!

These census DataFrames can also be a great source of information for regression analyses. The following code blocks show how one of the DataFrames can be modified to serve as a data source for regressions (albeit without any data cleaning or checking). In the future, I may move these regressions over to a separate tutorial program and provide detailed explanations of the code. In the meantime, I've left the code in place and added some brief explanations. 

The first regression examined the relationship between poverty rates and whether children were in a married-couple family as opposed to a female-householder one. This involved creating a reduced version of the zip_data_1k_plus_households DataFrame:

In [29]:
df_regression_test = zip_data_1k_plus_households.copy()
df_regression_test.dropna(subset=['Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level'],inplace=True)
df_regression_test = df_regression_test[['NAME','Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level']].copy()

In [30]:
df_regression_test

Unnamed: 0,NAME,Proportion_of_children_in_female_householder_families_below_poverty_level,Proportion_of_children_in_married_couple_families_below_poverty_level
0,25303,0.289377,0.054054
1,25311,0.660470,0.000000
2,25419,0.493056,0.009572
3,25601,0.482558,0.060255
4,26726,0.386427,0.016000
...,...,...,...
16977,38237,0.608929,0.085351
16978,38948,0.260274,0.082418
16979,38016,0.064658,0.058087
16980,38571,0.449275,0.126911


I then converted the two different variable columns into two different rows for each zip code using pd.melt(), which would make it easier to create categorical or 'dummy' variables for the regression analysis:

In [31]:
df_regression_test_melt = pd.melt(df_regression_test.copy(), id_vars = ['NAME']) # https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.melt.html
df_regression_test_melt

Unnamed: 0,NAME,variable,value
0,25303,Proportion_of_children_in_female_householder_f...,0.289377
1,25311,Proportion_of_children_in_female_householder_f...,0.660470
2,25419,Proportion_of_children_in_female_householder_f...,0.493056
3,25601,Proportion_of_children_in_female_householder_f...,0.482558
4,26726,Proportion_of_children_in_female_householder_f...,0.386427
...,...,...,...
33811,38237,Proportion_of_children_in_married_couple_famil...,0.085351
33812,38948,Proportion_of_children_in_married_couple_famil...,0.082418
33813,38016,Proportion_of_children_in_married_couple_famil...,0.058087
33814,38571,Proportion_of_children_in_married_couple_famil...,0.126911
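
To see what pd.melt() does on a smaller scale, here is a minimal sketch with two zip codes. The shortened column names ('female_householder' and 'married_couple') are stand-ins for the full column names used above.

```python
import pandas as pd

# Two zip codes in 'wide' form: one column per family type
wide = pd.DataFrame({
    'NAME': ['25303', '25311'],
    'female_householder': [0.289377, 0.660470],
    'married_couple': [0.054054, 0.000000],
})

# id_vars stays fixed; every other column is stacked into 'variable'/'value' pairs
long = pd.melt(wide, id_vars=['NAME'])
print(long)
```

Each zip code now appears once per family type, so the melted table has twice as many rows as the original (33,816 rows from 16,908 zip codes in the output above).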


The following code block uses pd.get_dummies to convert the variable column into a 0/1 dummy variable, then renames the resulting columns for better legibility.

In [32]:
df_regression_test_melt = pd.get_dummies(data = df_regression_test_melt.copy(), columns=['variable'], drop_first=True)
df_regression_test_melt.rename(columns={'variable_Proportion_of_children_in_married_couple_families_below_poverty_level':'in_married_household','value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_melt

Unnamed: 0,NAME,proportion_below_poverty_level,in_married_household
0,25303,0.289377,0
1,25311,0.660470,0
2,25419,0.493056,0
3,25601,0.482558,0
4,26726,0.386427,0
...,...,...,...
33811,38237,0.085351,1
33812,38948,0.082418,1
33813,38016,0.058087,1
33814,38571,0.126911,1
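
The effect of drop_first=True is easier to see on a toy table. This sketch again uses shortened stand-in category names. It also passes dtype=int, which is an addition on my part: recent pandas versions (2.0 and later) return True/False columns by default, whereas the output above shows 0/1.

```python
import pandas as pd

long = pd.DataFrame({
    'NAME': ['25303', '25311', '25303', '25311'],
    'variable': ['female_householder', 'female_householder',
                 'married_couple', 'married_couple'],
    'value': [0.289377, 0.660470, 0.054054, 0.000000],
})

# One dummy column would normally be created per category; drop_first=True removes
# the first category (alphabetically), leaving it as the baseline
dummies = pd.get_dummies(long, columns=['variable'], drop_first=True, dtype=int)
print(dummies)
```

With only two categories, a single dummy column is enough: a 0 means the row belongs to the dropped (baseline) category.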


With this table in place, I was able to perform the regression analysis.

In [33]:
y = df_regression_test_melt['proportion_below_poverty_level'] # Dependent variable: the proportion of children below the poverty level
x_vars = df_regression_test_melt[['in_married_household']]
x_vars = sm.add_constant(x_vars) # Adds an intercept term to the model
model = sm.OLS(y,x_vars)
results = model.fit() # The results variable stores the fitted model, including the coefficients shown in the summary below
results.summary()

0,1,2,3
Dep. Variable:,proportion_below_poverty_level,R-squared:,0.409
Model:,OLS,Adj. R-squared:,0.409
Method:,Least Squares,F-statistic:,23390.0
Date:,"Thu, 31 Mar 2022",Prob (F-statistic):,0.0
Time:,16:43:32,Log-Likelihood:,11307.0
No. Observations:,33816,AIC:,-22610.0
Df Residuals:,33814,BIC:,-22590.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3802,0.001,285.425,0.000,0.378,0.383
in_married_household,-0.2881,0.002,-152.952,0.000,-0.292,-0.284

0,1,2,3
Omnibus:,1609.716,Durbin-Watson:,1.66
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2450.374
Skew:,0.429,Prob(JB):,0.0
Kurtosis:,4.002,Cond. No.,2.62
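
When the only predictor is a single 0/1 dummy, the OLS constant equals the mean of the baseline group, and the constant plus the dummy's coefficient equals the mean of the other group. The sketch below checks this with NumPy's least-squares solver instead of statsmodels, using only the first four rows of df_regression_test, so its numbers will not match the full regression above.

```python
import numpy as np

# First four zip codes from df_regression_test
female = np.array([0.289377, 0.660470, 0.493056, 0.482558])
married = np.array([0.054054, 0.000000, 0.009572, 0.060255])

y = np.concatenate([female, married])
dummy = np.concatenate([np.zeros(4), np.ones(4)])   # in_married_household
X = np.column_stack([np.ones_like(dummy), dummy])   # constant column + dummy

const, coef = np.linalg.lstsq(X, y, rcond=None)[0]
print('const:', const, 'female-householder mean:', female.mean())
print('const + coef:', const + coef, 'married-couple mean:', married.mean())
```

Read this way, the summary above suggests an average proportion of about 0.38 across zip codes for children in female-householder families and about 0.09 (0.3802 - 0.2881) for children in married-couple families. These are unweighted averages across zip codes, not national rates.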


My second regression analysis aimed to evaluate the impact of family type (married vs. female-householder-only) and education level (no high school diploma; high school diploma/equivalent; associate's/some college; and bachelor's or higher) on poverty status. This first involved retrieving poverty rates for each combination of family type and education level.

In [34]:
df_regression_test_2 = zip_data_1k_plus_households.copy()
df_regression_test_2.dropna(subset=['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'],inplace=True)
df_regression_test_2 = df_regression_test_2[['NAME','Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher']].copy()

In [35]:
df_regression_test_2

Unnamed: 0,NAME,Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher
1,25311,0.165289,0.000000,0.048458,0.000000,1.000000,0.673267,0.155367,0.207547
3,25601,0.233333,0.164557,0.048638,0.000000,0.272727,0.000000,0.321429,0.333333
4,26726,0.324675,0.113858,0.047354,0.000000,0.833333,0.135531,0.654206,0.166667
5,26753,0.000000,0.002222,0.004975,0.000000,1.000000,0.524590,0.851351,0.000000
6,26757,0.169231,0.071956,0.073944,0.000000,0.438095,0.706667,0.190083,0.000000
...,...,...,...,...,...,...,...,...,...
16977,38237,0.152941,0.044408,0.095119,0.011270,0.788732,0.607407,0.107477,0.451613
16978,38948,0.161765,0.069767,0.000000,0.000000,0.375000,0.095745,0.125000,0.000000
16979,38016,0.000000,0.041406,0.039346,0.031825,0.103627,0.058140,0.059701,0.008143
16980,38571,0.116725,0.065015,0.055504,0.000000,1.000000,0.132701,0.060606,1.000000


Next, I once again 'melted' various columns into the same column in order to facilitate the creation of categorical variables. I also created columns that would store these categorical variables.

In [36]:
df_regression_test_2_melt = pd.melt(df_regression_test_2.copy(), id_vars = ['NAME'])
df_regression_test_2_melt['Married'] = 0
df_regression_test_2_melt['highest_ed_=_high_school_grad'] = 0
df_regression_test_2_melt['highest_ed_=_some_college_or_associate\'s'] = 0
df_regression_test_2_melt['highest_ed_=_bachelor\'s_or_higher'] = 0

In [37]:
df_regression_test_2_melt

Unnamed: 0,NAME,variable,value,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,25311,Proportion_of_married-couple_families_below_th...,0.165289,0,0,0,0
1,25601,Proportion_of_married-couple_families_below_th...,0.233333,0,0,0,0
2,26726,Proportion_of_married-couple_families_below_th...,0.324675,0,0,0,0
3,26753,Proportion_of_married-couple_families_below_th...,0.000000,0,0,0,0
4,26757,Proportion_of_married-couple_families_below_th...,0.169231,0,0,0,0
...,...,...,...,...,...,...,...
110235,38237,Proportion_of_female-householder_families_belo...,0.451613,0,0,0,0
110236,38948,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,0
110237,38016,Proportion_of_female-householder_families_belo...,0.008143,0,0,0,0
110238,38571,Proportion_of_female-householder_families_belo...,1.000000,0,0,0,0


The output of the following for loop served as a reference for which column numbers corresponded to which variables.

In [38]:
for i in range(len(df_regression_test_2_melt.columns)):
    print("Column",i,":\t",df_regression_test_2_melt.columns[i])

Column 0 :	 NAME
Column 1 :	 variable
Column 2 :	 value
Column 3 :	 Married
Column 4 :	 highest_ed_=_high_school_grad
Column 5 :	 highest_ed_=_some_college_or_associate's
Column 6 :	 highest_ed_=_bachelor's_or_higher


In the next for loop, I filled in the categorical variables by checking whether certain keywords ('married', 'some_college', etc.) were present in the variable column. For instance, given the variable 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', the for loop set the 'Married' column to 1 and left the other columns at 0. Female-householder families and householders without a high school diploma therefore serve as the baseline categories.

In [39]:
for i in range(len(df_regression_test_2_melt)):
    variable = df_regression_test_2_melt.iloc[i, 1]
    if 'married' in variable:
        df_regression_test_2_melt.iloc[i, 3] = 1
    if 'high_school_graduate' in variable:
        df_regression_test_2_melt.iloc[i, 4] = 1
    if 'some_college' in variable:
        df_regression_test_2_melt.iloc[i, 5] = 1
    if 'bachelor' in variable:
        df_regression_test_2_melt.iloc[i, 6] = 1
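
Looping over 110,000+ rows with iloc works but can be slow. As an alternative sketch (not the approach used in this notebook), the same keyword checks can be run on the whole column at once with str.contains. The four-row DataFrame and the keyword_to_column dictionary are stand-ins created for this example.

```python
import pandas as pd

melted = pd.DataFrame({'variable': [
    "Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school",
    "Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree",
    "Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent",
    "Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher",
]})

# Same keywords and target columns as the for loop above
keyword_to_column = {
    'married': 'Married',
    'high_school_graduate': 'highest_ed_=_high_school_grad',
    'some_college': "highest_ed_=_some_college_or_associate's",
    'bachelor': "highest_ed_=_bachelor's_or_higher",
}

for keyword, column in keyword_to_column.items():
    # regex=False treats the keyword as plain text; astype(int) turns True/False into 1/0
    melted[column] = melted['variable'].str.contains(keyword, regex=False).astype(int)

print(melted.drop(columns='variable'))
```

Note that 'did_not_graduate_high_school' does not contain the keyword 'high_school_graduate', so those rows keep a 0 in every education column, just as in the loop.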


In [40]:
df_regression_test_2_melt.iloc[0,1]

'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'

In [41]:
df_regression_test_2_melt.rename(columns={'value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_2.to_csv('marriage_education_poverty_regression.csv')
df_regression_test_2_melt

Unnamed: 0,NAME,variable,proportion_below_poverty_level,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,25311,Proportion_of_married-couple_families_below_th...,0.165289,1,0,0,0
1,25601,Proportion_of_married-couple_families_below_th...,0.233333,1,0,0,0
2,26726,Proportion_of_married-couple_families_below_th...,0.324675,1,0,0,0
3,26753,Proportion_of_married-couple_families_below_th...,0.000000,1,0,0,0
4,26757,Proportion_of_married-couple_families_below_th...,0.169231,1,0,0,0
...,...,...,...,...,...,...,...
110235,38237,Proportion_of_female-householder_families_belo...,0.451613,0,0,0,1
110236,38948,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,1
110237,38016,Proportion_of_female-householder_families_belo...,0.008143,0,0,0,1
110238,38571,Proportion_of_female-householder_families_belo...,1.000000,0,0,0,1


With the table complete, I performed a regression that used proportion_below_poverty_level as the dependent variable and various family type/education level values as the independent variables.

In [42]:
y = df_regression_test_2_melt['proportion_below_poverty_level']
x_vars = df_regression_test_2_melt[['Married',
       'highest_ed_=_high_school_grad',
       'highest_ed_=_some_college_or_associate\'s',
       'highest_ed_=_bachelor\'s_or_higher']]
x_vars = sm.add_constant(x_vars) 
model = sm.OLS(y,x_vars)
results_2 = model.fit() 
results_2.summary()

0,1,2,3
Dep. Variable:,proportion_below_poverty_level,R-squared:,0.319
Model:,OLS,Adj. R-squared:,0.319
Method:,Least Squares,F-statistic:,12930.0
Date:,"Thu, 31 Mar 2022",Prob (F-statistic):,0.0
Time:,16:43:40,Log-Likelihood:,36769.0
No. Observations:,110240,AIC:,-73530.0
Df Residuals:,110235,BIC:,-73480.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3605,0.001,308.825,0.000,0.358,0.363
Married,-0.1882,0.001,-180.231,0.000,-0.190,-0.186
highest_ed_=_high_school_grad,-0.0840,0.001,-56.899,0.000,-0.087,-0.081
highest_ed_=_some_college_or_associate's,-0.1170,0.001,-79.211,0.000,-0.120,-0.114
highest_ed_=_bachelor's_or_higher,-0.2022,0.001,-136.929,0.000,-0.205,-0.199

0,1,2,3
Omnibus:,22013.82,Durbin-Watson:,1.755
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60772.713
Skew:,1.069,Prob(JB):,0.0
Kurtosis:,5.943,Cond. No.,5.39
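
One way to read these coefficients is to add them up for a given family type and education level. The sketch below does this with the values from the summary above; predicted_proportion is a hypothetical helper written for this example, not part of statsmodels.

```python
# Coefficients copied from the regression summary above
coefs = {
    'const': 0.3605,
    'Married': -0.1882,
    'highest_ed_=_high_school_grad': -0.0840,
    "highest_ed_=_some_college_or_associate's": -0.1170,
    "highest_ed_=_bachelor's_or_higher": -0.2022,
}

def predicted_proportion(married, education=None):
    """Hypothetical helper: sum the constant and whichever dummy coefficients apply.

    education=None represents the baseline (no high school diploma).
    """
    value = coefs['const']
    if married:
        value += coefs['Married']
    if education is not None:
        value += coefs[education]
    return value

baseline = predicted_proportion(married=False)
married_no_diploma = predicted_proportion(married=True)
married_bachelors = predicted_proportion(married=True, education="highest_ed_=_bachelor's_or_higher")

print('Female householder, no diploma:', round(baseline, 4))
print('Married couple, no diploma:', round(married_no_diploma, 4))
print("Married couple, bachelor's or higher:", round(married_bachelors, 4))
```

The last prediction comes out slightly below zero, which is impossible for a proportion. This is a known limitation of fitting a straight-line model to values bounded between 0 and 1, and one more reason to treat these regressions as rough illustrations.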


In [43]:
end_time = time.time()
run_time = end_time - start_time
run_minutes = int(run_time // 60) # Whole minutes, converted to an integer so that it prints as '1' rather than '1.0'
run_seconds = run_time % 60
print("Completed run at",time.ctime(end_time),"(local time)")
print("Total run time:",'{:.2f}'.format(run_time),"second(s) ("+str(run_minutes),"minute(s) and",'{:.2f}'.format(run_seconds),"second(s))") # Only valid when the program is run nonstop from start to finish

Completed run at Thu Mar 31 16:43:40 2022 (local time)
Total run time: 94.32 second(s) (1 minute(s) and 34.32 second(s))
