# Python Tutorial Program: Gathering and Exporting Census Data

By Kenneth Burchfiel

This code is released under the MIT license; the datasets produced by the code are in the public domain.

You can find my blog post on this code at https://kburchfiel3.wordpress.com/2021/08/12/python-tutorial-program-retrieving-u-s-census-data/ .

This program demonstrates how Python can be used to retrieve and export US Census data at the zip code, county, and state level. It also shows how the same variable can be accessed across years.

The census-specific functions called by this program can be found in census_query.py. Please read the documentation there for more information on applying these functions.

Although this tutorial program will focus on gathering education, family type, and income/poverty statistics from the American Community Survey (5-year estimates), the functions on which it is based can also be used to gather data from certain other sources, such as the decennial census.

Before being able to run the code below on your computer, you'll need to obtain a free US Census API key from https://api.census.gov/data/key_signup.html .

The following US Census links proved helpful in creating this program:

API list: https://www.census.gov/data/developers/data-sets.html

2020 Census redistricting data:

Variables: https://api.census.gov/data/2020/dec/pl/variables.html

Examples: https://api.census.gov/data/2020/dec/pl/examples.html

2010 Census:

Variables: https://api.census.gov/data/2010/dec/sf1/variables.html

Examples: https://api.census.gov/data/2010/dec/sf1/examples.html

ACS (5 year estimates):

Variables: https://api.census.gov/data/YEAR/acs/acs5/variables.html (replace 'YEAR' in this URL and following ones with the year of interest, e.g. 2021)

Examples: https://api.census.gov/data/YEAR/acs/acs5/examples.html

ACS (1 year estimates):

Variables: https://api.census.gov/data/YEAR/acs/acs1/variables.html

Examples: https://api.census.gov/data/YEAR/acs/acs1/examples.html

Variable code examples:

ACS5: population = B01001_001E

ACS1: population = B01001_001E

Census (redistricting data): population = P1_001N

Census (SF1 data): population = P001001
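To show how these variable codes fit into a query, here is a minimal sketch that assembles a Census API URL without sending a request. The helper name build_census_url and its dataset_paths dictionary are hypothetical and are not part of census_query.py; the dataset paths themselves mirror the variable-list URLs above.

```python
from urllib.parse import urlencode

def build_census_url(year, source, variables, region='state', api_key=None):
    """Assemble a Census API query URL (hypothetical helper; sends no request)."""
    # Dataset paths taken from the variable-list URLs listed above
    dataset_paths = {'acs5': 'acs/acs5', 'acs1': 'acs/acs1',
                     'pl': 'dec/pl', 'sf1': 'dec/sf1'}
    params = {'get': ','.join(['NAME'] + list(variables)),
              'for': f'{region}:*'}
    if api_key:
        params['key'] = api_key
    # safe=',:*' keeps commas, colons, and asterisks unescaped for readability
    return (f'https://api.census.gov/data/{year}/{dataset_paths[source]}?'
            + urlencode(params, safe=',:*'))

acs5_url = build_census_url(2021, 'acs5', ['B01001_001E'])
pl_url = build_census_url(2020, 'pl', ['P1_001N'])
print(acs5_url)
print(pl_url)
```

Only the dataset path and the variable code change between sources, which is why the same population query works for the ACS and the decennial census.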

First, I imported a number of libraries:

In [1]:
import time
start_time = time.time() # Allows the program's runtime to be measured
# from census import Census
# I didn't end up using this library in this version of
# census_query_tutorial, but you may find it useful for your own
# analyses. See https://github.com/datamade/census for more information.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from census_query import retrieve_census_data, \
retrieve_single_census_variable, test_variables, \
compare_variable_across_years, generate_variable_and_group_lists

Instead of hard coding the year into my Census queries, I chose to set it as a variable so that the queries could be modified more easily. 

In [2]:
year = 2021

Next, I imported my Census API key into the code. I stored the path to the key and the key itself in separate file locations. 

In [3]:
with open('../key_paths/path_to_keys_folder.txt') as fin:
    # .strip() drops the trailing newline that readline() keeps
    api_folder_path = fin.readline().strip()
with open(api_folder_path+'/census_api_key.txt') as fin:
    api_key = fin.readline().strip()
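If you would rather not keep a path file, one alternative is to check an environment variable first and fall back to a key file. This is only a sketch: the function name load_api_key and the variable name CENSUS_API_KEY are my own assumptions, not something the Census API or census_query.py requires.

```python
import os
import tempfile
from pathlib import Path

def load_api_key(key_file, env_var='CENSUS_API_KEY'):
    """Return the key from env_var if it is set, otherwise from key_file.

    Both the function and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # Use only the first line and remove surrounding whitespace
    return Path(key_file).read_text().splitlines()[0].strip()

# Demonstration with a throwaway file and a placeholder key
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'census_api_key.txt'
    demo_path.write_text('abc123\n')
    demo_key = load_api_key(demo_path, env_var='HYPOTHETICAL_UNSET_VARIABLE')
print(demo_key)
```

Either way, the goal is the same as in the cell above: the key never appears in the notebook itself.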

Creating a list of American Community Survey variables

In order to determine which variables I would import into my spreadsheet, I used the generate_variable_and_group_lists function in census_query.py to read in all 27,000+ variables in the American Community Survey from https://api.census.gov/data/2014/acs/acs5/variables.html . Next, I created a list of groups from this survey. (At the time I first created this script, the most recent ACS data was from 2019, so I wanted to find data from the ACS that took place 5 years earlier; that way, I could compare previous and current values for different variables. Thus, I chose to examine 2014 data.)

'Groups' are general categories of data, whereas 'variables' are specific data points within a given group. For instance, group B16010 contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER,' and variable B16010_015E has the data label 'Estimate!!Total:!!High school graduate (includes equivalency)'. 
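The relationship between the two is visible in the codes themselves: the part of a variable code before the underscore is its group. A short sketch using the example above:

```python
variable_code = 'B16010_015E'
label = 'Estimate!!Total:!!High school graduate (includes equivalency)'

# The group code precedes the underscore; the suffix identifies the variable
group_code, suffix = variable_code.split('_')

# Labels are hierarchical, with levels separated by '!!'
label_levels = [level.rstrip(':') for level in label.split('!!')]

print(group_code)
print(suffix)
print(label_levels)
```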

In [4]:
create_new_variable_and_group_lists = True
if create_new_variable_and_group_lists:
    df_variables_and_groups = generate_variable_and_group_lists(
        year = year, source = 'acs5', variable_filter = 'Estimate')
    df_variables_and_groups[0].to_csv(
        f'variables_from_html_acs5_{year}.csv', index = False)
    df_variables_and_groups[1].to_csv(
        f'groups_from_html_acs5_{year}.csv', index = False)
    df_variables_and_groups[1]

retrieving data from: https://api.census.gov/data/2021/acs/acs5/variables.html


Here are all the variables that I can incorporate into my project, along with their descriptions:

In [5]:
df_variables = pd.read_csv(f'variables_from_html_acs5_{year}.csv')
df_groups = pd.read_csv(f'groups_from_html_acs5_{year}.csv')
df_variables

Unnamed: 0,Variable,Label,Concept,Group,Description
0,B01001A_001E,Estimate!!Total:,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:
1,B01001A_002E,Estimate!!Total:!!Male:,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Male:
2,B01001A_003E,Estimate!!Total:!!Male:!!Under 5 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
3,B01001A_004E,Estimate!!Total:!!Male:!!5 to 9 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
4,B01001A_005E,Estimate!!Total:!!Male:!!10 to 14 years,SEX BY AGE (WHITE ALONE),B01001A,SEX BY AGE (WHITE ALONE) Estimate!!Total:!!Mal...
...,...,...,...,...,...
27881,C27021_011E,Estimate!!Total:!!In family households:!!In ot...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27882,C27021_012E,Estimate!!Total:!!In family households:!!In ot...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27883,C27021_013E,Estimate!!Total:!!In non-family households and...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...
27884,C27021_014E,Estimate!!Total:!!In non-family households and...,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...,C27021,HEALTH INSURANCE COVERAGE STATUS BY LIVING ARR...


And here are all the groups (categories of variables) that I can use:

In [6]:
df_groups

Unnamed: 0,Concept,Group
0,SEX BY AGE (WHITE ALONE),B01001A
1,SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE),B01001B
2,SEX BY AGE (AMERICAN INDIAN AND ALASKA NATIVE ...,B01001C
3,SEX BY AGE (ASIAN ALONE),B01001D
4,SEX BY AGE (NATIVE HAWAIIAN AND OTHER PACIFIC ...,B01001E
...,...,...
1140,PUBLIC HEALTH INSURANCE BY WORK EXPERIENCE,C27014
1141,HEALTH INSURANCE COVERAGE STATUS BY RATIO OF I...,C27016
1142,PRIVATE HEALTH INSURANCE BY RATIO OF INCOME TO...,C27017
1143,PUBLIC HEALTH INSURANCE BY RATIO OF INCOME TO ...,C27018


I could then look through these two CSV files ('groups_from_html_acs' and 'variables_from_html_acs') to determine which variables to add to my project.

First, I could look through the groups_from_html_acs list to find categories that interested me. For instance, since I was interested in comparing educational attainment across regions, I wanted to look further into group B16010, which contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER'. One of this group's variable entries (which I could find within variables_from_html_acs) is B16010_015E, whose data label is 'Estimate!!Total:!!High school graduate (includes equivalency):'. I added this variable code to my variable list, along with other variables that covered education, poverty, and demographic data.

When creating your own variable list, you might find it helpful to create a copy of the .csv file containing variables, then add an 'include' column to the file. You can then use this column to keep track of which variables to use. Once you've selected all the desired variables, you can sort the .csv file by the 'include' column; copy and paste all the variable codes that you wish to include into your Python notebook; and then convert these codes into a list. This strategy makes it easier to keep track of which codes you've already included and what they refer to.
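Instead of copying and pasting, the 'include' column can also be read back in with pandas. The sketch below builds a small stand-in for the edited .csv file; marking chosen rows with 1 is my own convention, and in practice you would load your copy with pd.read_csv.

```python
import pandas as pd

# Stand-in for a copy of the variables .csv with a manually added
# 'include' column (1 = use this variable, blank = skip it)
df_selection = pd.DataFrame({
    'Variable': ['B01001_001E', 'B01001_002E', 'B19013_001E'],
    'Label': ['Estimate!!Total:', 'Estimate!!Total:!!Male:',
              'Estimate!!Median household income in the past 12 months '
              '(in 2021 inflation-adjusted dollars)'],
    'include': [1, None, 1]})

# Keep only the marked rows and convert their codes into a list
selected_variables = df_selection[
    df_selection['include'] == 1]['Variable'].tolist()
print(selected_variables)
```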

My list focused on variables from the following categories:

1. Household types (mostly married households vs. ones led by a female householder with no spouse present, which, for brevity's sake, I'll abbreviate as 'female-householder' homes)
2. The presence of children within these households
3. Median household income
4. Poverty status by family type
5. Poverty status by family type and the highest level of education completed
6. Median housing prices 

In [7]:
variable_list = ['B01001_001E', 'B11005_001E', 'B11005_013E', 'B11005_002E', 
'B11005_004E', 'B19013_001E', 'B17006_002E', 'B17006_016E', 
'B17006_003E', 'B17006_017E', 'B17006_012E', 'B17006_026E',
'B17006_008E', 'B17006_022E', 'B17018_004E', 'B17018_021E', 
'B17018_005E', 'B17018_022E', 'B17018_006E', 'B17018_023E', 
'B17018_007E', 'B17018_024E', 'B17018_015E', 'B17018_032E', 
'B17018_016E', 'B17018_033E', 'B17018_017E', 'B17018_034E', 
'B17018_018E', 'B17018_035E', 'B16010_001E', 'B16010_002E',
'B16010_015E', 'B16010_028E', 'B16010_041E', 'B25077_001E']

# B25077_001E reports the median value of owner-occupied homes. See:
# https://censusreporter.org/tables/B25077/

# You can use the following template for your own list:
# variable_list = ['', '', '', '',
# '', '', '', '',
# '', '', '', '']


To see how the function performs with longer variable lists, you can try using the following version of variable_list (which I created for a separate project):

In [8]:
# variable_list = ['B01001_001E', 'B01002_001E', 'B06008_001E', 'B06008_003E', 
# 'B08124_001E', 'B08124_002E', 'B08124_003E', 'B08124_004E', 'B08124_005E', 
# 'B08124_006E', 'B08124_007E', 'B09001_001E', 'B13002_001E', 'B13002_002E',
# 'B14001_001E', 'B14001_002E', 'B14001_008E', 'B14001_009E', 'B16010_001E',
# 'B16010_002E', 'B16010_041E', 'B17001_001E', 'B17001_002E', 'B19001_001E',
# 'B19001_002E', 'B19013_001E', 'B19083_001E', 'B19325_001E', 'B19325_002E',
# 'B19325_003E', 'B19325_049E', 'B19325_050E', 'B23025_001E', 'B23025_002E',
# 'B23025_004E', 'B23025_005E', 'B23027_012E', 'B23027_013E', 'B24011_001E',
# 'B24011_002E', 'B24011_018E', 'B24011_026E', 'B24011_029E', 'B24011_033E',
# 'B25002_001E', 'B25002_002E', 'B25010_001E', 'B25064_001E', 'B25077_001E',
# 'B25105_001E', 'C27012_001E', 'C27012_002E']

Next, I created a filtered version of df_variables that includes only the variables stored in variable_list.

In [9]:
df_variables = df_variables.query("Variable in @variable_list").copy()
# Adding .copy() here prevents a SettingWithCopyWarning later on.
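For readers less familiar with query(), here is a small sketch comparing it with the equivalent isin() filter on toy data. Note that both keep the rows in the DataFrame's order rather than the list's order, which is why the filtered table is not sorted like variable_list.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Variable': ['B01001_001E', 'B01001_002E', 'B19013_001E'],
    'Group': ['B01001', 'B01001', 'B19013']})
# Deliberately listed in a different order than the DataFrame
wanted = ['B19013_001E', 'B01001_001E']

# '@' lets query() reference a Python variable by name
via_query = df_demo.query("Variable in @wanted").copy()
# Equivalent boolean-mask version
via_isin = df_demo[df_demo['Variable'].isin(wanted)].copy()

print(via_query)
print(via_query.equals(via_isin))
```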

In [10]:
df_variables

Unnamed: 0,Variable,Label,Concept,Group,Description
279,B01001_001E,Estimate!!Total:,SEX BY AGE,B01001,SEX BY AGE Estimate!!Total:
6835,B11005_001E,Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6836,B11005_002E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6838,B11005_004E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
6847,B11005_013E,Estimate!!Total:!!Households with no people un...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
8692,B16010_001E,Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8693,B16010_002E,Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8706,B16010_015E,Estimate!!Total:!!High school graduate (includ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8719,B16010_028E,Estimate!!Total:!!Some college or associate's ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8732,B16010_041E,Estimate!!Total:!!Bachelor's degree or higher:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...


It's now time to query census data on each of the variables in variable_list for the year specified. retrieve_census_data (contained in census_query.py) allows for this to be accomplished in blocks of variables, thus saving time.
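The idea of querying in blocks can be sketched as follows. This is not the code in census_query.py; the helper name and the block size of 45 are assumptions, chosen to stay under the Census API's documented limit of 50 variables per query.

```python
def split_into_blocks(variables, block_size=45):
    """Split a list of variable codes into consecutive blocks.

    Hypothetical helper: block_size=45 is an assumed value that leaves
    room for extra fields such as NAME within a 50-variable limit.
    """
    return [variables[i:i + block_size]
            for i in range(0, len(variables), block_size)]

# Placeholder codes, used only to show how a long list is divided
demo_codes = [f'VAR_{i:03d}' for i in range(100)]
blocks = split_into_blocks(demo_codes)
block_lengths = [len(block) for block in blocks]
print(block_lengths)
```

Each block would then be sent as one request, and the resulting tables merged on the region name.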

The following three code blocks retrieve American Community Survey (5-year estimates) data for zip codes, counties, and states. Each block **also** merges in population data from five years earlier (year-5) so that each region's population growth can be measured.

In [11]:
zip_data = retrieve_census_data(df_variable_list = df_variables, year = year, region = 'zip', source = 'acs5', api_key = api_key)
zip_data = zip_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'zip', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
zip_data

Retrieving data from rows 0 to 35


Unnamed: 0,NAME,Year,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!High school graduate (includes equivalency):,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other 
families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars),MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars),population_2016
0,00601,2021.0,17126.0,5397.0,1306.0,590.0,1580.0,12436.0,4318.0,3618.0,...,264.0,254.0,253.0,44.0,52.0,32.0,68.0,15292.0,78800.0,17800.0
1,00602,2021.0,37895.0,12858.0,3089.0,1545.0,4170.0,28138.0,9494.0,7243.0,...,930.0,764.0,997.0,314.0,134.0,195.0,332.0,18716.0,88500.0,39716.0
2,00603,2021.0,49136.0,19295.0,4975.0,1947.0,4978.0,35933.0,9061.0,10641.0,...,1066.0,1068.0,2117.0,385.0,190.0,284.0,632.0,16789.0,121600.0,51565.0
3,00606,2021.0,5751.0,1968.0,585.0,273.0,311.0,4208.0,1715.0,1358.0,...,121.0,94.0,28.0,89.0,40.0,80.0,28.0,18835.0,96500.0,6320.0
4,00610,2021.0,26153.0,8934.0,2247.0,1027.0,2730.0,19352.0,5725.0,5950.0,...,663.0,720.0,582.0,134.0,208.0,308.0,227.0,21239.0,89900.0,27976.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33966,48551,,,,,,,,,,...,,,,,,,,,,0.0
33967,48553,,,,,,,,,,...,,,,,,,,,,0.0
33968,48554,,,,,,,,,,...,,,,,,,,,,0.0
33969,48667,,,,,,,,,,...,,,,,,,,,,0.0
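The mostly empty rows at the bottom of this table are a side effect of the outer merge: a region that appears only in the earlier population table still gets a row, with missing values in every current-year column. (Those missing values are also why Year is displayed as 2021.0 rather than 2021.) The sketch below reproduces the effect with a few values from the output above; the indicator argument is an optional pandas feature that I added for illustration.

```python
import pandas as pd

current = pd.DataFrame({
    'NAME': ['00601', '00602'],
    'SEX BY AGE Estimate!!Total:': [17126, 37895]})
earlier = pd.DataFrame({
    'NAME': ['00601', '00602', '48551'],
    'population_2016': [17800, 39716, 0]})

# indicator=True adds a '_merge' column showing where each row came from
merged = current.merge(earlier, on='NAME', how='outer', indicator=True)
only_in_earlier = merged[merged['_merge'] == 'right_only']['NAME'].tolist()

print(merged)
print(only_in_earlier)
```

If you only want regions present in both years, how = 'inner' would drop these rows instead.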


In [12]:
county_data = retrieve_census_data(df_variable_list = df_variables, year = year, source = 'acs5', region = 'county', api_key = api_key)
county_data = county_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'county', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
county_data

Retrieving data from rows 0 to 35


Unnamed: 0,NAME,Year,state,county,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse 
present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars),MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars),population_2016
0,"Autauga County, Alabama",2021.0,1.0,1.0,58239.0,21856.0,7461.0,5145.0,6320.0,39614.0,...,2423.0,3375.0,4366.0,172.0,649.0,616.0,637.0,62660.0,164900.0,55049.0
1,"Baldwin County, Alabama",2021.0,1.0,3.0,227131.0,87190.0,23375.0,17327.0,30659.0,161977.0,...,10657.0,13916.0,20109.0,258.0,1451.0,2089.0,1641.0,64346.0,226600.0,199510.0
2,"Barbour County, Alabama",2021.0,1.0,5.0,25259.0,9088.0,2816.0,1238.0,2185.0,17995.0,...,1219.0,910.0,694.0,148.0,323.0,390.0,107.0,36422.0,89500.0,26614.0
3,"Bibb County, Alabama",2021.0,1.0,7.0,22412.0,7083.0,2297.0,1407.0,2334.0,16057.0,...,1629.0,943.0,560.0,70.0,156.0,158.0,20.0,54277.0,102900.0,22572.0
4,"Blount County, Alabama",2021.0,1.0,9.0,58884.0,21300.0,6986.0,5055.0,7006.0,40668.0,...,3470.0,4411.0,2225.0,323.0,454.0,758.0,242.0,52830.0,138100.0,57704.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,"Vieques Municipio, Puerto Rico",2021.0,72.0,147.0,8317.0,2374.0,445.0,83.0,631.0,5970.0,...,248.0,0.0,226.0,10.0,45.0,0.0,103.0,14942.0,123700.0,9046.0
3218,"Villalba Municipio, Puerto Rico",2021.0,72.0,149.0,22341.0,7823.0,2635.0,900.0,2053.0,15523.0,...,793.0,574.0,589.0,141.0,70.0,169.0,399.0,20722.0,93300.0,24186.0
3219,"Yabucoa Municipio, Puerto Rico",2021.0,72.0,151.0,31047.0,11905.0,3146.0,1234.0,3143.0,22690.0,...,780.0,1080.0,543.0,286.0,150.0,183.0,165.0,17267.0,85800.0,35670.0
3220,"Yauco Municipio, Puerto Rico",2021.0,72.0,153.0,34704.0,11836.0,2328.0,1188.0,3605.0,25661.0,...,1037.0,688.0,1071.0,121.0,204.0,85.0,326.0,16444.0,90800.0,38519.0


In [13]:
state_data = retrieve_census_data(df_variable_list = df_variables, year = year, source = 'acs5', region = 'state', api_key = api_key)
state_data = state_data.merge(retrieve_single_census_variable(variable = 'B01001_001E', column_name = 'population', region = 'state', year = year-5, source = 'acs5', api_key = api_key), on = 'NAME', how = 'outer')
state_data

Retrieving data from rows 0 to 35


Unnamed: 0,NAME,Year,state,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL 
ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars),MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars),population_2016
0,Alabama,2021,1,4997675,1902983,560352,348633,546370,3413803,430047,...,208851,268491,316069,17713,50598,69704,42695,54943,157100,4841164
1,Alaska,2021,2,735951,260561,88146,59909,69970,484382,32669,...,24771,46703,49483,1218,5495,9011,5065,80287,282800,736855
2,Arizona,2021,4,7079203,2683557,818733,512219,757947,4792007,560460,...,206512,410997,494969,28138,56097,101572,62093,65913,265600,6728577
3,Arkansas,2021,5,3006309,1158460,362428,226713,326603,2021290,248721,...,143614,162903,174009,11063,30752,39293,23621,52123,142100,2968472
4,California,2021,6,39455353,13217586,4462011,3030853,3508592,26797070,4236035,...,931028,1748928,2797512,200554,275209,492019,383829,84097,573200,38654206
5,Colorado,2021,8,5723176,2227932,666878,475845,628523,3937040,300620,...,158639,300768,556453,13748,35548,58363,55021,80184,397500,5359295
6,Connecticut,2021,9,3605330,1397324,411169,270122,396404,2515137,225021,...,123021,152512,341828,12299,39055,45682,43242,83572,286700,3588570
7,Delaware,2021,10,981892,381097,105611,64271,117210,690618,61788,...,39479,47811,78404,2400,10880,12540,11450,72724,269700,934695
8,District of Columbia,2021,11,683154,310104,62069,31755,48195,487726,37934,...,4929,7069,62189,2840,7750,8814,9687,93547,635900,659009
9,Florida,2021,12,21339762,8157420,2196679,1357347,2458358,15349290,1682505,...,756977,1108886,1511694,77593,212695,294277,227982,61777,248700,19934451


I admit that many of the column names are obscenely long and unwieldy. This is less of an issue when viewing the table as a CSV export (which I'll perform later), since spreadsheet software can make the columns a uniform width while allowing the full name to be displayed in a separate box. An alternative to these long names, though, would be to keep the variable codes as the column names, then include a key mapping each variable code to its description.
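A minimal sketch of that alternative, using two columns and the Alabama values from the state table above. The dictionary is written out by hand here; in the notebook it could presumably be built from the Variable and Description columns of df_variables.

```python
import pandas as pd

# Key mapping each variable code to its long description
code_to_description = {
    'B01001_001E': 'SEX BY AGE Estimate!!Total:',
    'B25077_001E': 'MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars)'}

df_long_names = pd.DataFrame({
    'NAME': ['Alabama'],
    'SEX BY AGE Estimate!!Total:': [4997675],
    'MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars)': [157100]})

# Invert the key so the long names can be swapped back to short codes
description_to_code = {
    description: code for code, description in code_to_description.items()}
df_short_names = df_long_names.rename(columns=description_to_code)

print(list(df_short_names.columns))
```

The key itself can be exported as a second sheet or CSV so that readers can look up what each code means.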

So far, the values shown in the DataFrame are raw counts. For example, the table reports the number of married-couple households with one or more children, but doesn't say what *proportion* have at least one child--which is much more useful when comparing different zip codes.

Therefore, in the following code block, I added columns to the DataFrame that store various proportions, using a function called calc_proportions_and_rename. Some of these were generated using pre-existing totals as a denominator, whereas others used the sum of two different statistics as the denominator. (For example, to calculate the proportion of children below the poverty level for a given zip code, I divided the number of children below the poverty level by the sum of (1) children below the poverty level and (2) children at or above the poverty level. This was a useful strategy when a given Census table didn't have a 'totals' row.)

(When creating proportions, be careful about using a total in one table as the denominator for a proportion calculation that involves a separate table. For example, if Table A says that there are 10,000 kids in a zip code, and Table B says that there are 2,000 kids below the poverty line, you may be tempted to conclude that the proportion of children below the poverty line equals 2,000/10,000 = 0.2. However, suppose not all the kids identified in Table A show up in Table B, and that Table B doesn't have a totals row. In that case, you'd want to divide the number of kids in Table B below the poverty level (2,000) by the sum of that number and the number in Table B at or above the poverty level (let's say it's 6,000) to arrive at a more accurate proportion--in this case, 2,000/(2,000+6,000) = 2,000/8,000 = 0.25.)
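The Table B calculation can be sketched in pandas as follows. The short column names are placeholders rather than real Census labels, and replacing zero denominators with NaN is my own addition to make the handling of empty regions explicit.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'NAME': ['zip A', 'zip B'],
    'below_poverty': [2000, 0],
    'at_or_above_poverty': [6000, 0]})

# Build the denominator from the same table as the numerator
denominator = df_demo['below_poverty'] + df_demo['at_or_above_poverty']

# A region with no children in either category gets NaN, not an error
df_demo['proportion_below_poverty'] = (
    df_demo['below_poverty'] / denominator.replace(0, np.nan))

print(df_demo)
```

For zip A this reproduces the 2,000/8,000 figure from the example above, while zip B is left empty because there is nothing to divide by.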

The calc_proportions_and_rename function also renames certain columns to make them more intuitive.

In [14]:
def calc_proportions_and_rename(df_results):

    df_results['5_year_population_growth'] = (df_results['SEX BY AGE Estimate!!Total:']/df_results[f'population_{year-5}']-1)

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:']

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households_with_one_or_more_children'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:']

    df_results['Proportion_of_children_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:'])

    df_results['Proportion_of_children_in_married_couple_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In married-couple family:'])

    df_results['Proportion_of_children_in_female_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Female householder, no spouse present:'])

    df_results['Proportion_of_children_in_male_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Male householder, no spouse present:'])

    # Calculating proportions of families living below the poverty level by household type and householder's educational attainment

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate'])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher"])


    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate"]/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"])

    df_results['Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!High school graduate (includes equivalency):']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate\'s_degree'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Some college or associate's degree:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Bachelor's degree or higher:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Median_income_as_proportion_of_median_home_value'] = df_results[f"MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN {year} INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in {year} inflation-adjusted dollars)"]/df_results['MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars)']

    # Template for adding another proportion column:
    # df_results[''] = df_results['']/(df_results['']+df_results[''])

    df_results.rename(columns = {

        "SEX BY AGE Estimate!!Total:":"Total_population",

        f"MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN {year} INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in {year} inflation-adjusted dollars)":"Median_household_income",
        
        "HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:":"Households",

        'MEDIAN VALUE (DOLLARS) Estimate!!Median value (dollars)':'Median House Value',
        
        },inplace=True)



    return df_results
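
The poverty-level calculations in the function above all follow one pattern: the "below poverty level" count divided by the sum of the "below" and "at or above" counts for the same category. The sketch below shows one way that pattern could be wrapped in a helper. The function name `poverty_proportion`, its parameters, and the toy table name are hypothetical and do not appear in census_query.py; the column-name layout is assumed to match the ACS labels used above.

```python
import pandas as pd


def poverty_proportion(df, table, category=""):
    """Return below / (below + at_or_above) for one category of a poverty table.

    'table' is the table title that prefixes each column name, and 'category'
    is the suffix that follows the poverty-level label (it may be empty).
    """
    below = df[f"{table} Estimate!!Total:!!Income in the past 12 months below poverty level:{category}"]
    at_or_above = df[f"{table} Estimate!!Total:!!Income in the past 12 months at or above poverty level:{category}"]
    return below / (below + at_or_above)


# Toy data with an invented table name, used only to exercise the helper
table = "DEMO POVERTY TABLE"
category = "!!In married-couple family:"
df_demo = pd.DataFrame({
    f"{table} Estimate!!Total:!!Income in the past 12 months below poverty level:{category}": [10, 5],
    f"{table} Estimate!!Total:!!Income in the past 12 months at or above poverty level:{category}": [30, 15],
})

result = poverty_proportion(df_demo, table, category)
print(result)
```

With a helper like this, each new proportion column would need only the table title and the category suffix rather than three copies of the full column name.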
    

In [15]:
county_data = calc_proportions_and_rename(county_data)
county_data

Unnamed: 0,NAME,Year,state,county,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher,Median_income_as_proportion_of_median_home_value
0,"Autauga County, Alabama",2021.0,1.0,1.0,58239.0,21856.0,7461.0,5145.0,6320.0,39614.0,...,0.011994,0.582524,0.402944,0.254237,0.057692,0.104155,0.327586,0.286944,0.281315,0.379988
1,"Baldwin County, Alabama",2021.0,1.0,3.0,227131.0,87190.0,23375.0,17327.0,30659.0,161977.0,...,0.012571,0.395785,0.270854,0.261837,0.169954,0.089858,0.273755,0.311884,0.324503,0.283963
2,"Barbour County, Alabama",2021.0,1.0,5.0,25259.0,9088.0,2816.0,1238.0,2185.0,17995.0,...,0.038781,0.734291,0.405157,0.422222,0.053097,0.243290,0.366769,0.278411,0.111531,0.406950
3,"Bibb County, Alabama",2021.0,1.0,7.0,22412.0,7083.0,2297.0,1407.0,2334.0,16057.0,...,0.000000,0.461538,0.686117,0.000000,0.000000,0.194619,0.439185,0.247057,0.119138,0.527473
4,"Blount County, Alabama",2021.0,1.0,9.0,58884.0,21300.0,6986.0,5055.0,7006.0,40668.0,...,0.013304,0.398510,0.299383,0.109283,0.043478,0.163519,0.351234,0.336210,0.149036,0.382549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,"Vieques Municipio, Puerto Rico",2021.0,72.0,147.0,8317.0,2374.0,445.0,83.0,631.0,5970.0,...,0.134100,0.935484,0.785714,1.000000,0.112069,0.270184,0.481072,0.107203,0.141541,0.120792
3218,"Villalba Municipio, Puerto Rico",2021.0,72.0,149.0,22341.0,7823.0,2635.0,900.0,2053.0,15523.0,...,0.073899,0.602817,0.854470,0.673114,0.297535,0.214456,0.352316,0.218514,0.214714,0.222101
3219,"Yabucoa Municipio, Puerto Rico",2021.0,72.0,151.0,31047.0,11905.0,3146.0,1234.0,3143.0,22690.0,...,0.096506,0.658711,0.720149,0.756325,0.584383,0.268709,0.264918,0.292243,0.174130,0.201247
3220,"Yauco Municipio, Puerto Rico",2021.0,72.0,153.0,34704.0,11836.0,2328.0,1188.0,3605.0,25661.0,...,0.063811,0.647230,0.651282,0.817987,0.291304,0.221503,0.346596,0.172713,0.259187,0.181101


In [16]:
state_data = calc_proportions_and_rename(state_data)
state_data

Unnamed: 0,NAME,Year,state,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher,Median_income_as_proportion_of_median_home_value
0,Alabama,2021,1,4997675,1902983,560352,348633,546370,3413803,430047,...,0.015981,0.51893,0.374731,0.291316,0.116265,0.125973,0.305,0.302338,0.266689,0.349733
1,Alaska,2021,2,735951,260561,88146,59909,69970,484382,32669,...,0.010815,0.408163,0.314838,0.176551,0.066359,0.067445,0.284637,0.34198,0.305938,0.2839
2,Arizona,2021,4,7079203,2683557,818733,512219,757947,4792007,560460,...,0.021783,0.434401,0.280503,0.214228,0.094485,0.116957,0.23487,0.336108,0.312065,0.248166
3,Arkansas,2021,5,3006309,1158460,362428,226713,326603,2021290,248721,...,0.020495,0.443231,0.363313,0.309195,0.104315,0.123051,0.340739,0.293163,0.243047,0.366805
4,California,2021,6,39455353,13217586,4462011,3030853,3508592,26797070,4236035,...,0.021342,0.357584,0.251755,0.189619,0.091823,0.158078,0.204394,0.284824,0.352704,0.146715
5,Colorado,2021,8,5723176,2227932,666878,475845,628523,3937040,300620,...,0.014519,0.381946,0.253303,0.200518,0.084783,0.076357,0.205986,0.289373,0.428283,0.201721
6,Connecticut,2021,9,3605330,1397324,411169,270122,396404,2515137,225021,...,0.014189,0.416501,0.244657,0.201392,0.073631,0.089467,0.261198,0.243804,0.405531,0.291496
7,Delaware,2021,10,981892,381097,105611,64271,117210,690618,61788,...,0.015235,0.409594,0.279756,0.213152,0.063318,0.089468,0.303971,0.270089,0.336473,0.269648
8,District of Columbia,2021,11,683154,310104,62069,31755,48195,487726,37934,...,0.007738,0.508565,0.376809,0.24557,0.067841,0.077777,0.154927,0.153285,0.614011,0.14711
9,Florida,2021,12,21339762,8157420,2196679,1357347,2458358,15349290,1682505,...,0.029216,0.394004,0.278574,0.200751,0.101382,0.109615,0.279031,0.296033,0.315321,0.2484


In [17]:
zip_data = calc_proportions_and_rename(zip_data)
zip_data

Unnamed: 0,NAME,Year,Total_population,Households,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!High school graduate (includes equivalency):,...,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school,Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate's_degree,Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor's_degree_or_higher,Median_income_as_proportion_of_median_home_value
0,00601,2021.0,17126.0,5397.0,1306.0,590.0,1580.0,12436.0,4318.0,3618.0,...,0.469602,0.843416,0.820690,0.885714,0.704348,0.347218,0.290930,0.216227,0.145626,0.194061
1,00602,2021.0,37895.0,12858.0,3089.0,1545.0,4170.0,28138.0,9494.0,7243.0,...,0.062088,0.577389,0.711207,0.643510,0.446667,0.337408,0.257410,0.187007,0.218175,0.211480
2,00603,2021.0,49136.0,19295.0,4975.0,1947.0,4978.0,35933.0,9061.0,10641.0,...,0.110504,0.663755,0.849206,0.746202,0.362903,0.252164,0.296134,0.209028,0.242674,0.138067
3,00606,2021.0,5751.0,1968.0,585.0,273.0,311.0,4208.0,1715.0,1358.0,...,0.317073,0.639676,0.583333,0.252336,0.404255,0.407557,0.322719,0.166112,0.103612,0.195181
4,00610,2021.0,26153.0,8934.0,2247.0,1027.0,2730.0,19352.0,5725.0,5950.0,...,0.114155,0.683215,0.552688,0.385230,0.307927,0.295835,0.307462,0.208299,0.188404,0.236251
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33966,48551,,,,,,,,,,...,,,,,,,,,,
33967,48553,,,,,,,,,,...,,,,,,,,,,
33968,48554,,,,,,,,,,...,,,,,,,,,,
33969,48667,,,,,,,,,,...,,,,,,,,,,


A look at the first few rows in the zip code table reveals that some median household income values are clearly inaccurate! -$666,666,666 is *not* the actual median household income in any zip code, yet that's the value listed for 3,113 entries in zip_data, as shown below:

In [18]:
len(zip_data.query("Median_household_income == -666666666"))

3113

This means that you must be extremely careful when calculating averages across the entire dataset. Otherwise, you'll end up with results like the one below:

In [19]:
np.mean(zip_data['Median_household_income'])

-61386582.28983834

This result is, of course, skewed by the thousands of -666,666,666 values. The U.S. would be in dire shape if the average median household income among zip codes were truly about -$61,386,582!

As shown below, performing some basic data cleaning (e.g. removing any results with a negative median household income) can produce a more accurate number.

In [20]:
np.mean(zip_data.query('Median_household_income > 0')['Median_household_income'])

67280.94325038322
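
An alternative to filtering rows is to convert the placeholder values to missing values, so that later calculations skip them automatically. The following is a minimal sketch on invented toy data, assuming that -666,666,666 marks an estimate that could not be computed and that any negative income should be treated as missing. The `zip_demo` DataFrame is hypothetical and stands in for zip_data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for zip_data; the values are invented for illustration.
zip_demo = pd.DataFrame({
    "NAME": ["00001", "00002", "00003", "00004"],
    "Median_household_income": [52000.0, -666666666.0, 61000.0, np.nan],
})

# Series.mask() replaces values with NaN wherever the condition is True,
# so every negative income becomes a missing value.
cleaned_income = zip_demo["Median_household_income"].mask(
    zip_demo["Median_household_income"] < 0
)

# Series.mean() skips NaN values by default.
mean_income = cleaned_income.mean()
print(mean_income)
```

Storing the cleaned column back in the DataFrame would keep the rows themselves (and their other, valid columns) available for later analyses.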

I then exported the zip code, county, and state data to CSV files. I also created copies of the county and zip code DataFrames that only include regions with more than 1,000 households, since small sample sizes in less populated regions can skew the statistics shown.

In [21]:
zip_data_1k_plus_households = zip_data.query("Households > 1000").reset_index(drop=True)
county_data_1k_plus_households = county_data.query("Households > 1000").reset_index(drop=True)

zip_data.to_csv(f'acs5_{year}_zip_results.csv', index = False)
zip_data_1k_plus_households.to_csv(f'acs5_{year}_zip_results_1k_plus_households.csv', index = False)

county_data.to_csv(f'acs5_{year}_county_results.csv', index = False)
county_data_1k_plus_households.to_csv(f'acs5_{year}_county_results_1k_plus_households.csv', index = False)

state_data.to_csv(f'acs5_{year}_state_results.csv', index = False)
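
The export steps above repeat the same pattern for each region, so they could also be written as a loop. The sketch below is one possible arrangement on invented toy data; the `tables` dictionary, the `written_files` list, and the temporary output folder are assumptions made for illustration and are not part of the original program.

```python
import tempfile
from pathlib import Path

import pandas as pd

year = 2021  # assumed to match the tutorial's 'year' variable

# Toy stand-ins for the tutorial's DataFrames; the values are invented.
tables = {
    "county": pd.DataFrame({"NAME": ["County A", "County B"], "Households": [500, 2500]}),
    "state": pd.DataFrame({"NAME": ["State A"], "Households": [3000]}),
}

# A temporary folder keeps this sketch from cluttering the working directory.
output_dir = Path(tempfile.mkdtemp())

written_files = []
for region, df in tables.items():
    path = output_dir / f"acs5_{year}_{region}_results.csv"
    df.to_csv(path, index=False)
    written_files.append(path.name)

    # Only sub-state regions get a filtered copy, mirroring the cell above.
    if region != "state":
        filtered = df.query("Households > 1000").reset_index(drop=True)
        filtered_path = output_dir / f"acs5_{year}_{region}_results_1k_plus_households.csv"
        filtered.to_csv(filtered_path, index=False)
        written_files.append(filtered_path.name)

print(written_files)
```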

Next, I'll use the compare_variable_across_years function within census_query.py to evaluate how the populations of counties, states, and zip codes have changed over time. Whereas the previous functions all used the American Community Survey (5-year estimates) as a data source, some of the code blocks below will also use data from the decennial census and the American Community Survey (1-year estimates).

The variable codes for a region's total population can be found on the Census API website (see links near the beginning of this tutorial). They are as follows:

ACS5: population = B01001_001E

ACS1: population = B01001_001E

Census (redistricting data): population = P1_001N

Census (SF1 data): population = P001001
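
To illustrate how these variable codes fit into a Census API request, here is a rough sketch that assembles a query URL without sending it. The dataset paths come from the links near the beginning of this tutorial. The function name `build_census_url`, the dictionary keys 'pl' and 'sf1', and the placeholder API key are hypothetical, and census_query.py may build its requests differently.

```python
from urllib.parse import urlencode

# Dataset paths taken from the Census API links listed earlier in this tutorial.
# The keys 'pl' and 'sf1' are labels invented for this sketch.
DATASET_PATHS = {
    "acs5": "acs/acs5",
    "acs1": "acs/acs1",
    "pl": "dec/pl",
    "sf1": "dec/sf1",
}


def build_census_url(year, source, variable, region="state", api_key="YOUR_KEY"):
    """Assemble (but do not send) a Census API query URL for every region of one type."""
    params = {
        "get": f"NAME,{variable}",  # region name plus the requested variable
        "for": f"{region}:*",       # '*' requests all regions of this type
        "key": api_key,
    }
    # Commas, colons, and asterisks are left unescaped to keep the URL readable.
    query = urlencode(params, safe=",:*")
    return f"https://api.census.gov/data/{year}/{DATASET_PATHS[source]}?{query}"


url = build_census_url(2021, "acs5", "B01001_001E")
print(url)
```

Only the 'state' and 'county' region names are exercised here; other geographies use longer names that should be checked against the API's examples pages.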

In [22]:
acs5_state_pop_5_year_periods = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [year-10, year-5, year], region = 'state', api_key = api_key)
acs5_state_pop_5_year_periods.to_csv(f'acs5_state_pop_{year-10}_{year-5}_{year}.csv', index = False)
acs5_state_pop_5_year_periods

Retrieving data for: 2011
Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,population_2011,population_2016,population_2021,2011_to_2016_chg,2016_to_2021_chg,2011_to_2021_chg
0,Mississippi,28,2956700,2989192,2967023,0.010989,-0.007416,0.003491
1,Missouri,29,5955802,6059651,6141534,0.017437,0.013513,0.031185
2,Montana,30,982854,1023391,1077978,0.041244,0.053339,0.096783
3,Nebraska,31,1813061,1881259,1951480,0.037615,0.037327,0.076345
4,Nevada,32,2673396,2839172,3059238,0.06201,0.077511,0.144327
5,New Hampshire,33,1315911,1327503,1372175,0.008809,0.033651,0.042757
6,New Jersey,34,8753064,8915456,9234024,0.018553,0.035732,0.054948
7,New Mexico,35,2037136,2082669,2109366,0.022351,0.012819,0.035457
8,New York,36,19302448,19697457,20114745,0.020464,0.021185,0.042083
9,North Carolina,37,9418736,9940828,10367022,0.055431,0.042873,0.100681
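
The values in the '_chg' columns are consistent with proportional changes, i.e. the later population divided by the earlier population, minus 1. The sketch below reproduces that calculation for two rows copied from the table above. It is an inference from the output, not a copy of the logic inside compare_variable_across_years.

```python
import pandas as pd

# Two rows copied from the state table above
pop = pd.DataFrame({
    "NAME": ["Mississippi", "Nevada"],
    "population_2011": [2956700, 2673396],
    "population_2016": [2989192, 2839172],
    "population_2021": [2967023, 3059238],
})

years = [2011, 2016, 2021]

# Change between each pair of consecutive years in the list
for start, end in zip(years, years[1:]):
    pop[f"{start}_to_{end}_chg"] = pop[f"population_{end}"] / pop[f"population_{start}"] - 1

# Change across the full period
first, last = years[0], years[-1]
pop[f"{first}_to_{last}_chg"] = pop[f"population_{last}"] / pop[f"population_{first}"] - 1

print(pop.round(6))
```

For example, Mississippi's 2011_to_2016_chg value of 0.010989 corresponds to a population increase of about 1.1%.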


In [23]:
acs5_county_pop_5_year_periods = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [year-10, year-5, year], region = 'county', api_key = api_key)
acs5_county_pop_5_year_periods.dropna(inplace=True)
acs5_county_pop_5_year_periods.to_csv(f'acs5_county_pop_{year-10}_{year-5}_{year}.csv', index = False)
acs5_county_pop_5_year_periods

Retrieving data for: 2011
Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,county,population_2011,population_2016,population_2021,2011_to_2016_chg,2016_to_2021_chg,2011_to_2021_chg
0,"Sweetwater County, Wyoming",56,037,43152.0,44812.0,42459.0,0.038469,-0.052508,-0.016060
1,"Platte County, Wyoming",56,031,8681.0,8740.0,8607.0,0.006796,-0.015217,-0.008524
2,"Sheridan County, Wyoming",56,033,28743.0,29924.0,30812.0,0.041088,0.029675,0.071983
3,"Big Horn County, Wyoming",56,003,11553.0,11931.0,11671.0,0.032719,-0.021792,0.010214
4,"Crook County, Wyoming",56,011,6926.0,7284.0,7185.0,0.051689,-0.013591,0.037395
...,...,...,...,...,...,...,...,...,...
3216,"Oneida County, Idaho",16,071,4225.0,4269.0,4514.0,0.010414,0.057390,0.068402
3217,"Gem County, Idaho",16,045,16783.0,16853.0,18692.0,0.004171,0.109120,0.113746
3218,"Valley County, Idaho",16,085,9877.0,9897.0,11476.0,0.002025,0.159543,0.161891
3219,"Adams County, Idaho",16,003,3980.0,3865.0,4321.0,-0.028894,0.117982,0.085678


In [24]:
acs5_zip_pop_5_year_periods = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs5', year_list = [year-5, year], region = 'zip', api_key = api_key)
# I received an error when trying to retrieve 2010 population data using this 
# function, which indicated that the data was either unavailable or existed
# under a different name. Therefore, I chose to run this function for only
# two years: year-5 and year (2016 and 2021 in the output below).
acs5_zip_pop_5_year_periods.dropna(inplace=True)
acs5_zip_pop_5_year_periods.to_csv(f'acs5_zip_pop_{year-5}_{year}.csv', index = False)
acs5_zip_pop_5_year_periods

Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,population_2016,population_2021,2016_to_2021_chg
0,99119,53,1209.0,1491.0,0.233251
1,99128,53,209.0,291.0,0.392344
2,99141,53,5140.0,5902.0,0.148249
3,99149,53,180.0,197.0,0.094444
4,99153,53,390.0,424.0,0.087179
...,...,...,...,...,...
33115,60619,17,62822.0,63481.0,0.010490
33116,60621,17,31383.0,26538.0,-0.154383
33117,60636,17,35779.0,30412.0,-0.150004
33118,60639,17,90211.0,89037.0,-0.013014


The following code block creates a list of years from 2010 through the value stored in 'year' for multi-year analyses. It removes 2020 because data for this year wasn't available (at least when I checked).

In [25]:
year_list = list(range(2010, year+1))
year_list.remove(2020) 
year_list

[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021]
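
Note that list.remove() raises a ValueError when the item is missing, which would happen here if 'year' were set to 2019 or earlier. A list comprehension avoids that problem, as in the sketch below. The function name `build_year_list` and its parameters are hypothetical.

```python
def build_year_list(first_year, last_year, unavailable_years=(2020,)):
    """Return every year from first_year through last_year, skipping unavailable ones."""
    return [y for y in range(first_year, last_year + 1) if y not in unavailable_years]


year_list_2021 = build_year_list(2010, 2021)
year_list_2019 = build_year_list(2010, 2019)  # no error even though 2020 is out of range
print(year_list_2021)
print(year_list_2019)
```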

In [26]:
acs1_state_pop_2010_to_year = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs1', year_list = year_list, region = 'state', api_key = api_key)
acs1_state_pop_2010_to_year.to_csv(f'acs1_state_pop_2010_to_{year}.csv', index = False)
acs1_state_pop_2010_to_year

Retrieving data for: 2010
Retrieving data for: 2011
Retrieving data for: 2012
Retrieving data for: 2013
Retrieving data for: 2014
Retrieving data for: 2015
Retrieving data for: 2016
Retrieving data for: 2017
Retrieving data for: 2018
Retrieving data for: 2019
Retrieving data for: 2021


Unnamed: 0,NAME,state,population_2010,population_2011,population_2012,population_2013,population_2014,population_2015,population_2016,population_2017,...,2011_to_2012_chg,2012_to_2013_chg,2013_to_2014_chg,2014_to_2015_chg,2015_to_2016_chg,2016_to_2017_chg,2017_to_2018_chg,2018_to_2019_chg,2019_to_2021_chg,2010_to_2021_chg
0,California,6,37349363,37691912,38041430,38332521,38802500,39144818,39250017,39536653,...,0.009273,0.007652,0.012261,0.008822,0.002687,0.007303,0.000516,-0.001133,-0.006944,0.050562
1,Alabama,1,4785298,4802740,4822023,4833722,4849377,4858979,4863300,4874747,...,0.004015,0.002426,0.003239,0.00198,0.000889,0.002354,0.002692,0.003133,0.027878,0.0532
2,Alaska,2,713985,722718,731449,735132,736732,738432,741894,739795,...,0.012081,0.005035,0.002176,0.002307,0.004688,-0.002829,-0.003186,-0.007991,0.001542,0.026174
3,Arizona,4,6413737,6482505,6553255,6626624,6731484,6828065,6931071,7016270,...,0.010914,0.011196,0.015824,0.014348,0.015086,0.012292,0.022145,0.01493,-0.00033,0.134489
4,Arkansas,5,2921606,2937979,2949131,2959373,2966369,2978204,2988248,3004279,...,0.003796,0.003473,0.002364,0.00399,0.003373,0.005365,0.003177,0.00132,0.00268,0.035694
5,Colorado,8,5049071,5116796,5187582,5268367,5355866,5456574,5540545,5607154,...,0.013834,0.015573,0.016608,0.018803,0.015389,0.012022,0.015767,0.011091,0.009261,0.151117
6,Connecticut,9,3577073,3580709,3590347,3596080,3596677,3590886,3576452,3588184,...,0.002692,0.001597,0.000166,-0.00161,-0.00402,0.00328,-0.004325,-0.002065,0.011306,0.007974
7,Delaware,10,899769,907135,917092,925749,935614,945934,952065,961939,...,0.010976,0.00944,0.010656,0.01103,0.006481,0.010371,0.005439,0.006817,0.030418,0.115157
8,District of Columbia,11,604453,617996,632323,646449,658893,672228,681170,693972,...,0.023183,0.02234,0.01925,0.020238,0.013302,0.018794,0.012224,0.004689,-0.050583,0.108523
9,Florida,12,18843326,19057542,19317568,19552860,19893297,20271272,20612439,20984400,...,0.013644,0.01218,0.017411,0.019,0.01683,0.018045,0.015008,0.008376,0.014126,0.155907


In [27]:
acs1_county_pop_2010_to_year = compare_variable_across_years(variable = 'B01001_001E', variable_name = 'population', source= 'acs1', year_list = year_list, region = 'county', api_key = api_key)
acs1_county_pop_2010_to_year.to_csv(f'acs1_county_pop_2010_to_{year}.csv', index = False)
acs1_county_pop_2010_to_year
# Note that fewer counties are contained in the American Community Survey (1-year estimates) dataset.

Retrieving data for: 2010
Retrieving data for: 2011
Retrieving data for: 2012
Retrieving data for: 2013
Retrieving data for: 2014
Retrieving data for: 2015
Retrieving data for: 2016
Retrieving data for: 2017
Retrieving data for: 2018
Retrieving data for: 2019
Retrieving data for: 2021


Unnamed: 0,NAME,state,county,population_2010,population_2011,population_2012,population_2013,population_2014,population_2015,population_2016,...,2011_to_2012_chg,2012_to_2013_chg,2013_to_2014_chg,2014_to_2015_chg,2015_to_2016_chg,2016_to_2017_chg,2017_to_2018_chg,2018_to_2019_chg,2019_to_2021_chg,2010_to_2021_chg
0,"Stark County, Ohio",39,151,375321.0,375087.0,374868.0,375432.0,375736.0,375165.0,373612.0,...,-0.000584,0.001505,0.000810,-0.001520,-0.004140,-0.002864,-0.002598,-0.002605,0.008710,-0.003962
1,"Summit County, Ohio",39,153,541565.0,539832.0,540811.0,541824.0,541943.0,541968.0,540300.0,...,0.001814,0.001873,0.000220,0.000046,-0.003078,0.001718,0.001275,-0.001670,-0.006248,-0.007260
2,"Trumbull County, Ohio",39,155,209936.0,209264.0,207406.0,206442.0,205175.0,203751.0,201825.0,...,-0.008879,-0.004648,-0.006137,-0.006940,-0.009453,-0.007160,-0.008748,-0.003288,0.016977,-0.040970
3,"Tuscarawas County, Ohio",39,157,92542.0,92508.0,92392.0,92672.0,92788.0,92916.0,92420.0,...,-0.001254,0.003031,0.001252,0.001379,-0.005338,-0.001331,-0.001311,-0.002050,0.005577,-0.000454
4,"Warren County, Ohio",39,165,213192.0,214910.0,217241.0,219169.0,221659.0,224469.0,227063.0,...,0.010846,0.008875,0.011361,0.012677,0.011556,0.008011,0.014379,0.010462,0.050942,0.156483
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
849,"Montcalm County, Michigan",,,,,,,,,,...,,,,,,,,,,
850,"San Benito County, California",,,,,,,,,,...,,,,,,,,,,
851,"Effingham County, Georgia",,,,,,,,,,...,,,,,,,,,,
852,"Clay County, Minnesota",,,,,,,,,,...,,,,,,,,,,


At the time I created this program, not all 2020 Census data was available via the Census API. Therefore, in the following code block, I retrieved population values from the 'Summary File 1' sections of the 2000 and 2010 decennial censuses, then merged these values with population values from the 2020 Census's redistricting dataset. I then calculated the 2010-to-2020 and 2000-to-2020 percentage changes.

In [28]:
census_state_population = compare_variable_across_years(variable = 'P001001', variable_name = 'population', source= 'census_sf1', year_list = [2000, 2010], region = 'state', api_key = api_key)
census_state_population_2000_to_2020 = census_state_population.merge(compare_variable_across_years(variable = 'P1_001N', variable_name = 'population', source= 'census_redistricting', year_list = [2020], region = 'state', api_key = api_key).drop('state',axis=1), on = 'NAME')
census_state_population_2000_to_2020.insert(4, 'population_2020', census_state_population_2000_to_2020.pop('population_2020'))
census_state_population_2000_to_2020['2010_to_2020_chg'] = (census_state_population_2000_to_2020['population_2020'] / census_state_population_2000_to_2020['population_2010'])-1
census_state_population_2000_to_2020['2000_to_2020_chg'] = (census_state_population_2000_to_2020['population_2020'] / census_state_population_2000_to_2020['population_2000'])-1
census_state_population_2000_to_2020.to_csv('census_state_population_2000_to_2020.csv', index = False)

census_state_population_2000_to_2020

Retrieving data for: 2000
Retrieving data for: 2010
Retrieving data for: 2020


Unnamed: 0,NAME,state,population_2000,population_2010,population_2020,2000_to_2010_chg,2010_to_2020_chg,2000_to_2020_chg
0,Alabama,1,4447100,4779736,5024279,0.074798,0.051162,0.129788
1,Alaska,2,626932,710231,733391,0.132868,0.032609,0.169809
2,Arizona,4,5130632,6392017,7151502,0.245854,0.118818,0.393883
3,Arkansas,5,2673400,2915918,3011524,0.090715,0.032788,0.126477
4,California,6,33871648,37253956,39538223,0.099857,0.061316,0.167296
5,Colorado,8,4301261,5029196,5773714,0.169238,0.148039,0.342331
6,Connecticut,9,3405565,3574097,3605944,0.049487,0.008911,0.058839
7,Delaware,10,783600,897934,989948,0.145909,0.102473,0.263333
8,District of Columbia,11,572059,601723,689545,0.051855,0.145951,0.205374
9,Florida,12,15982378,18801310,21538187,0.176378,0.145568,0.347621
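
The merge-then-divide pattern used in this block can be checked on a small scale. The following is a minimal sketch rather than the program's own code: it enters the Alabama and Alaska rows by hand from the table above instead of calling the Census API, and it adds validate='one_to_one', an optional pandas merge argument that raises an error if a NAME value appears more than once on either side.

```python
import pandas as pd

# Hand-entered values copied from the state table above (no API call is made).
sf1 = pd.DataFrame({
    'NAME': ['Alabama', 'Alaska'],
    'population_2000': [4447100, 626932],
    'population_2010': [4779736, 710231],
})
redistricting = pd.DataFrame({
    'NAME': ['Alabama', 'Alaska'],
    'population_2020': [5024279, 733391],
})

# validate='one_to_one' makes pandas raise a MergeError if NAME is duplicated.
merged = sf1.merge(redistricting, on='NAME', validate='one_to_one')

# Change expressed as a proportion: (new / old) - 1
merged['2010_to_2020_chg'] = merged['population_2020'] / merged['population_2010'] - 1
merged['2000_to_2020_chg'] = merged['population_2020'] / merged['population_2000'] - 1

print(merged.round(6))
```

The rounded Alabama values should agree with the 0.051162 and 0.129788 shown in the table above.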


The same process was used to obtain county-level population changes.

In [29]:
census_county_population = compare_variable_across_years(variable = 'P001001', variable_name = 'population', source= 'census_sf1', year_list = [2000, 2010], region = 'county', api_key = api_key)
census_county_population_2000_to_2020 = census_county_population.merge(compare_variable_across_years(variable = 'P1_001N', variable_name = 'population', source= 'census_redistricting', year_list = [2020], region = 'county', api_key = api_key).drop(['state', 'county'],axis=1), on = 'NAME')
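# Position 4 is carried over from the state-level block. Because this DataFrame also has a 'county' column, it places population_2020 before population_2010 (see the output below); use 5 to keep the columns in chronological order.
census_county_population_2000_to_2020.insert(4, 'population_2020', census_county_population_2000_to_2020.pop('population_2020'))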
census_county_population_2000_to_2020['2010_to_2020_chg'] = (census_county_population_2000_to_2020['population_2020'] / census_county_population_2000_to_2020['population_2010'])-1
census_county_population_2000_to_2020['2000_to_2020_chg'] = (census_county_population_2000_to_2020['population_2020'] / census_county_population_2000_to_2020['population_2000'])-1

census_county_population_2000_to_2020.dropna(inplace=True)

census_county_population_2000_to_2020.to_csv('census_county_population_2000_to_2020.csv', index = False)

census_county_population_2000_to_2020

Retrieving data for: 2000
Retrieving data for: 2010
Retrieving data for: 2020


Unnamed: 0,NAME,state,county,population_2000,population_2020,population_2010,2000_to_2010_chg,2010_to_2020_chg,2000_to_2020_chg
0,"Autauga County, Alabama",01,001,43671.0,58805,54571.0,0.249594,0.077587,0.346546
1,"Baldwin County, Alabama",01,003,140415.0,231767,182265.0,0.298045,0.271594,0.650586
2,"Barbour County, Alabama",01,005,29038.0,25223,27457.0,-0.054446,-0.081364,-0.131380
3,"Bibb County, Alabama",01,007,20826.0,22293,22915.0,0.100307,-0.027144,0.070441
4,"Blount County, Alabama",01,009,51024.0,59134,57322.0,0.123432,0.031611,0.158945
...,...,...,...,...,...,...,...,...,...
3203,"Vega Baja Municipio, Puerto Rico",72,145,61929.0,54414,59662.0,-0.036606,-0.087962,-0.121349
3204,"Vieques Municipio, Puerto Rico",72,147,9106.0,8249,9301.0,0.021414,-0.113106,-0.094114
3205,"Villalba Municipio, Puerto Rico",72,149,27913.0,22093,26073.0,-0.065919,-0.152648,-0.208505
3206,"Yabucoa Municipio, Puerto Rico",72,151,39246.0,30426,37941.0,-0.033252,-0.198071,-0.224736
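
Because the county-level DataFrame has an extra 'county' column, a fixed insert position of 4 places population_2020 before population_2010, as the header above shows. One possible way to avoid hard-coding the position is to look it up from an existing column. The sketch below uses a hand-built, one-row DataFrame (values copied from the Autauga County row); it is an illustration, not the program's own code.

```python
import pandas as pd

# One-row stand-in for the county table (values copied from the Autauga County row).
df = pd.DataFrame({
    'NAME': ['Autauga County, Alabama'],
    'state': ['01'],
    'county': ['001'],
    'population_2000': [43671.0],
    'population_2010': [54571.0],
    '2000_to_2010_chg': [0.249594],
    'population_2020': [58805],
})

# Remove the column first, then look up where population_2010 sits
# and insert population_2020 immediately after it.
values_2020 = df.pop('population_2020')
position = df.columns.get_loc('population_2010') + 1
df.insert(position, 'population_2020', values_2020)

print(list(df.columns))
```

The same three lines would work unchanged for the state-level table, since the position is derived from the column name rather than from a count.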


I'll also obtain changes in median home values over the last 10 years using similar code blocks. (These will only use American Community Survey 5-year data.)

In [30]:
acs5_state_home_val_5_year_periods = compare_variable_across_years(variable = 'B25077_001E', variable_name = 'median_home_val', source= 'acs5', year_list = [year-10, year-5, year], region = 'state', api_key = api_key)
acs5_state_home_val_5_year_periods.to_csv(f'acs5_state_home_val_{year-10}_{year-5}_{year}.csv', index = False)
acs5_state_home_val_5_year_periods

Retrieving data for: 2011
Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,median_home_val_2011,median_home_val_2016,median_home_val_2021,2011_to_2016_chg,2016_to_2021_chg,2011_to_2021_chg
0,Mississippi,28,99200,105700,133000,0.065524,0.258278,0.340726
1,Missouri,29,138900,141200,171800,0.016559,0.216714,0.236861
2,Montana,30,179900,199700,263700,0.110061,0.320481,0.465814
3,Nebraska,31,125400,137300,174100,0.094896,0.268026,0.388357
4,Nevada,32,225400,191600,315900,-0.149956,0.648747,0.401508
5,New Hampshire,33,250000,239700,288700,-0.0412,0.204422,0.1548
6,New Jersey,34,349100,316400,355700,-0.093669,0.12421,0.018906
7,New Mexico,35,161800,161600,184800,-0.001236,0.143564,0.142151
8,New York,36,301000,286300,340600,-0.048837,0.189661,0.131561
9,North Carolina,37,152700,157100,197500,0.028815,0.257161,0.293386


In [31]:
acs5_county_home_val_5_year_periods = compare_variable_across_years(variable = 'B25077_001E', variable_name = 'median_home_val', source= 'acs5', year_list = [year-10, year-5, year], region = 'county', api_key = api_key)
acs5_county_home_val_5_year_periods.to_csv(f'acs5_county_home_val_{year-10}_{year-5}_{year}.csv', index = False)
acs5_county_home_val_5_year_periods

Retrieving data for: 2011
Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,county,median_home_val_2011,median_home_val_2016,median_home_val_2021,2011_to_2016_chg,2016_to_2021_chg,2011_to_2021_chg
0,"Sweetwater County, Wyoming",56,037,180300.0,190700.0,217300.0,0.057682,0.139486,0.205214
1,"Platte County, Wyoming",56,031,136100.0,164000.0,210400.0,0.204996,0.282927,0.545922
2,"Sheridan County, Wyoming",56,033,224900.0,237700.0,289800.0,0.056914,0.219184,0.288573
3,"Big Horn County, Wyoming",56,003,120400.0,148200.0,158800.0,0.230897,0.071525,0.318937
4,"Crook County, Wyoming",56,011,156300.0,217500.0,256000.0,0.391555,0.177011,0.637876
...,...,...,...,...,...,...,...,...,...
3222,"Petersburg Borough, Alaska",,,,213400.0,246900.0,,0.156982,
3223,"Kusilvak Census Area, Alaska",,,,101300.0,66200.0,,-0.346496,
3224,"LaSalle Parish, Louisiana",,,,69100.0,104900.0,,0.518090,
3225,"Chugach Census Area, Alaska",,,,,255100.0,,,


In [32]:
acs5_zip_home_val_5_year_periods = compare_variable_across_years(variable = 'B25077_001E', variable_name = 'median_home_val', source= 'acs5', year_list = [year-10, year-5, year], region = 'zip', api_key = api_key)
acs5_zip_home_val_5_year_periods.to_csv(f'acs5_zip_home_val_{year-10}_{year-5}_{year}.csv', index = False)
acs5_zip_home_val_5_year_periods

Retrieving data for: 2011
Retrieving data for: 2016
Retrieving data for: 2021


Unnamed: 0,NAME,state,median_home_val_2011,median_home_val_2016,median_home_val_2021,2011_to_2016_chg,2016_to_2021_chg,2011_to_2021_chg
0,00601,72,103200.0,92000.0,78800.0,-0.108527,-0.143478,-0.236434
1,00602,72,89300.0,90000.0,88500.0,0.007839,-0.016667,-0.008959
2,00603,72,116700.0,126000.0,121600.0,0.079692,-0.034921,0.041988
3,00606,72,101000.0,93200.0,96500.0,-0.077228,0.035408,-0.044554
4,00610,72,109400.0,99300.0,89900.0,-0.092322,-0.094663,-0.178245
...,...,...,...,...,...,...,...,...
33966,99635,,,,-666666666.0,,,
33967,99675,,,,-666666666.0,,,
33968,99707,,,,-666666666.0,,,
33969,99725,,,,209800.0,,,
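
The -666666666.0 entries in the last rows above are placeholder codes rather than real home values. The Census Bureau documents -666666666 as an annotation for an estimate that could not be computed (for example, because too few sample observations were available), so it may be worth converting such values to missing data before calculating changes. The following is a minimal sketch using made-up rows: the zip-code labels are placeholders, and the assumption that -666666666 is the only code present should be checked against the Census documentation.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the zip-code table above; 'zip_a' etc. are placeholder labels.
df = pd.DataFrame({
    'NAME': ['zip_a', 'zip_b', 'zip_c'],
    'median_home_val_2016': [92000.0, -666666666.0, 209800.0],
    'median_home_val_2021': [78800.0, 150000.0, -666666666.0],
})

# Assumption: -666666666 is the only annotation code present in these columns.
SENTINEL = -666666666
value_cols = ['median_home_val_2016', 'median_home_val_2021']
df[value_cols] = df[value_cols].replace(SENTINEL, np.nan)

# A change computed from a missing value now comes out as NaN
# instead of a misleading number.
df['2016_to_2021_chg'] = df['median_home_val_2021'] / df['median_home_val_2016'] - 1
print(df)
```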


That concludes the main part of this tutorial program. I hope that you find these examples useful in performing your own census data analysis!

These census DataFrames can also be a great source of information for regression analyses. The following code blocks show how one of the DataFrames can be modified to serve as a data source for regressions (albeit without any data cleaning or checking). In the future, I may move these regressions over to a separate tutorial program and provide detailed explanations of the code. In the meantime, I've left the code in place and added some brief explanations. 

The first regression examined the relationship between poverty rates and whether children were in a married-couple family as opposed to a female-householder one. This involved creating a reduced version of the zip_data_1k_plus_households DataFrame:

In [33]:
df_regression_test = zip_data_1k_plus_households.copy()
df_regression_test.dropna(subset=['Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level'],inplace=True)
df_regression_test = df_regression_test[['NAME','Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level']].copy()

In [34]:
df_regression_test

Unnamed: 0,NAME,Proportion_of_children_in_female_householder_families_below_poverty_level,Proportion_of_children_in_married_couple_families_below_poverty_level
0,00601,0.931319,0.527845
1,00602,0.666420,0.301555
2,00603,0.788359,0.285192
3,00606,0.723039,0.556122
4,00610,0.603466,0.369383
...,...,...,...
16891,99801,0.208713,0.032994
16892,99824,0.494949,0.200000
16893,99833,0.166667,0.012755
16894,99835,0.335689,0.024349


I then converted the two different variable columns into two different rows for each zip code using pd.melt(), which would make it easier to create categorical or 'dummy' variables for the regression analysis:

In [35]:
df_regression_test_melt = pd.melt(df_regression_test.copy(), id_vars = ['NAME']) # https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.melt.html
df_regression_test_melt

Unnamed: 0,NAME,variable,value
0,00601,Proportion_of_children_in_female_householder_f...,0.931319
1,00602,Proportion_of_children_in_female_householder_f...,0.666420
2,00603,Proportion_of_children_in_female_householder_f...,0.788359
3,00606,Proportion_of_children_in_female_householder_f...,0.723039
4,00610,Proportion_of_children_in_female_householder_f...,0.603466
...,...,...,...
33563,99801,Proportion_of_children_in_married_couple_famil...,0.032994
33564,99824,Proportion_of_children_in_married_couple_famil...,0.200000
33565,99833,Proportion_of_children_in_married_couple_famil...,0.012755
33566,99835,Proportion_of_children_in_married_couple_famil...,0.024349


The following code block uses pd.get_dummies to convert the variable column into a categorical ('dummy') column, then renames the resulting columns for better legibility.

In [36]:
df_regression_test_melt = pd.get_dummies(data = df_regression_test_melt.copy(), columns=['variable'], drop_first=True)
df_regression_test_melt.rename(columns={'variable_Proportion_of_children_in_married_couple_families_below_poverty_level':'in_married_household','value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_melt

Unnamed: 0,NAME,proportion_below_poverty_level,in_married_household
0,00601,0.931319,0
1,00602,0.666420,0
2,00603,0.788359,0
3,00606,0.723039,0
4,00610,0.603466,0
...,...,...,...
33563,99801,0.032994,1
33564,99824,0.200000,1
33565,99833,0.012755,1
33566,99835,0.024349,1
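
To see what drop_first=True does, consider this small sketch with two made-up zip codes and shortened column names (both are placeholders, not values from the dataset). pd.get_dummies would normally create one indicator column per category; dropping the first one (here 'female_householder', which sorts first alphabetically) leaves a single column that equals 1 for married-couple rows and 0 otherwise.

```python
import pandas as pd

# Made-up wide table: one row per zip code, one column per family type.
wide = pd.DataFrame({
    'NAME': ['zip_a', 'zip_b'],
    'female_householder': [0.60, 0.40],
    'married_couple': [0.20, 0.10],
})

# Wide -> long: each zip code now has one row per family type.
long = pd.melt(wide, id_vars=['NAME'])

# One indicator column per category, minus the first one ('female_householder').
# dtype=int requests 0/1 values rather than True/False.
dummies = pd.get_dummies(long, columns=['variable'], drop_first=True, dtype=int)
dummies = dummies.rename(columns={
    'variable_married_couple': 'in_married_household',
    'value': 'proportion_below_poverty_level',
})
print(dummies)
```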


With this table in place, I was able to perform the regression analysis.

In [37]:
y = df_regression_test_melt['proportion_below_poverty_level'] # Dependent variable: the proportion of children below the poverty level
x_vars = df_regression_test_melt[['in_married_household']] # Independent variable: 1 for married-couple rows, 0 for female-householder rows
x_vars = sm.add_constant(x_vars) 
model = sm.OLS(y,x_vars)
results = model.fit() # The results variable stores the fitted model, including the coefficients shown in the summary below.
results.summary()

0,1,2,3
Dep. Variable:,proportion_below_poverty_level,R-squared:,0.382
Model:,OLS,Adj. R-squared:,0.382
Method:,Least Squares,F-statistic:,20780.0
Date:,"Tue, 08 Aug 2023",Prob (F-statistic):,0.0
Time:,19:26:45,Log-Likelihood:,10910.0
No. Observations:,33568,AIC:,-21820.0
Df Residuals:,33566,BIC:,-21800.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3582,0.001,265.454,0.000,0.356,0.361
in_married_household,-0.2751,0.002,-144.156,0.000,-0.279,-0.271

0,1,2,3
Omnibus:,2296.175,Durbin-Watson:,1.577
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3599.536
Skew:,0.553,Prob(JB):,0.0
Kurtosis:,4.163,Cond. No.,2.62
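
With a single 0/1 predictor, the OLS coefficients have a direct reading: the constant is the mean of the outcome for the 0 group, and the dummy's coefficient is the difference between the two group means. Applied to the table above, this gives an average proportion of about 0.3582 for female-householder rows and about 0.3582 - 0.2751 = 0.0831 for married-couple rows (an unweighted average across zip codes). The sketch below checks this property on made-up numbers, using NumPy's least-squares solver in place of statsmodels so that it stays self-contained.

```python
import numpy as np

# Made-up outcome values: three rows with dummy = 0, three with dummy = 1.
y = np.array([0.40, 0.30, 0.50, 0.10, 0.05, 0.15])
dummy = np.array([0, 0, 0, 1, 1, 1])

# Design matrix with a constant column, similar to what sm.add_constant provides.
X = np.column_stack([np.ones_like(y), dummy])
(const, coef), *_ = np.linalg.lstsq(X, y, rcond=None)

mean_0 = y[dummy == 0].mean()
mean_1 = y[dummy == 1].mean()
print(f"const = {const:.4f}, mean of group 0 = {mean_0:.4f}")
print(f"coef  = {coef:.4f}, difference in means = {mean_1 - mean_0:.4f}")
```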


My second regression analysis aimed to evaluate the impact of family type (married vs. female-householder-only) and education level (no high school diploma; high school diploma/equivalent; associate's/some college; and bachelor's or higher) on poverty status. This first involved selecting the poverty-rate columns for each combination of family type and education level.

In [38]:
df_regression_test_2 = zip_data_1k_plus_households.copy()
df_regression_test_2.dropna(subset=['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'],inplace=True)
df_regression_test_2 = df_regression_test_2[['NAME','Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher']].copy()

In [39]:
df_regression_test_2

Unnamed: 0,NAME,Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher
0,00601,0.683280,0.596330,0.390887,0.469602,0.843416,0.820690,0.885714,0.704348
1,00602,0.515905,0.415829,0.270992,0.062088,0.577389,0.711207,0.643510,0.446667
2,00603,0.572423,0.417804,0.164319,0.110504,0.663755,0.849206,0.746202,0.362903
3,00606,0.584699,0.502058,0.196581,0.317073,0.639676,0.583333,0.252336,0.404255
4,00610,0.568983,0.438136,0.223301,0.114155,0.683215,0.552688,0.385230,0.307927
...,...,...,...,...,...,...,...,...,...
16890,99762,0.000000,0.070175,0.023256,0.000000,0.391304,0.137931,0.021053,0.000000
16891,99801,0.000000,0.038694,0.010274,0.004921,0.740741,0.333333,0.066313,0.193955
16893,99833,0.000000,0.036585,0.007168,0.017094,1.000000,0.258065,0.250000,0.000000
16894,99835,0.000000,0.054381,0.014474,0.003115,0.400000,0.323529,0.205674,0.180556


Next, I once again 'melted' various columns into the same column in order to facilitate the creation of categorical variables. I also created columns that would store these categorical variables.

In [40]:
df_regression_test_2_melt = pd.melt(df_regression_test_2.copy(), id_vars = ['NAME'])
df_regression_test_2_melt['Married'] = 0
df_regression_test_2_melt['highest_ed_=_high_school_grad'] = 0
df_regression_test_2_melt['highest_ed_=_some_college_or_associate\'s'] = 0
df_regression_test_2_melt['highest_ed_=_bachelor\'s_or_higher'] = 0

In [41]:
df_regression_test_2_melt

Unnamed: 0,NAME,variable,value,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,00601,Proportion_of_married-couple_families_below_th...,0.683280,0,0,0,0
1,00602,Proportion_of_married-couple_families_below_th...,0.515905,0,0,0,0
2,00603,Proportion_of_married-couple_families_below_th...,0.572423,0,0,0,0
3,00606,Proportion_of_married-couple_families_below_th...,0.584699,0,0,0,0
4,00610,Proportion_of_married-couple_families_below_th...,0.568983,0,0,0,0
...,...,...,...,...,...,...,...
105275,99762,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,0
105276,99801,Proportion_of_female-householder_families_belo...,0.193955,0,0,0,0
105277,99833,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,0
105278,99835,Proportion_of_female-householder_families_belo...,0.180556,0,0,0,0


The output of the following for loop served as a reference for which column numbers corresponded to which variables.

In [42]:
for i, column in enumerate(df_regression_test_2_melt.columns):
    print("Column",i,":\t",column)

Column 0 :	 NAME
Column 1 :	 variable
Column 2 :	 value
Column 3 :	 Married
Column 4 :	 highest_ed_=_high_school_grad
Column 5 :	 highest_ed_=_some_college_or_associate's
Column 6 :	 highest_ed_=_bachelor's_or_higher


In the next for loop, I filled in the categorical variables by checking whether certain keywords ('married', 'some_college', etc.) were present in the variable column. For instance, given the variable 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', the loop sets the 'Married' column to 1 and leaves the other columns at 0.

In [43]:
for i in range(len(df_regression_test_2_melt)):
    variable = df_regression_test_2_melt.iloc[i, 1]
    if 'married' in variable:
        df_regression_test_2_melt.iloc[i, 3] = 1
    if 'high_school_graduate' in variable:
        df_regression_test_2_melt.iloc[i, 4] = 1
    if 'some_college' in variable:
        df_regression_test_2_melt.iloc[i, 5] = 1
    if 'bachelor' in variable:
        df_regression_test_2_melt.iloc[i, 6] = 1
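
Row-by-row .iloc assignment works, but it can be slow on a table with over 100,000 rows. A possible alternative, sketched here on three variable names of the kind found in the dataset, is to build each indicator column with the vectorized Series.str.contains method. The keyword-to-column mapping mirrors the loop above; regex=False treats each keyword as plain text. This is an illustration rather than the program's own code.

```python
import pandas as pd

# Three variable names of the kind found in the 'variable' column.
df = pd.DataFrame({'variable': [
    'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school',
    "Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent",
    "Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher",
]})

# Same keyword -> column mapping as the loop above.
keyword_to_column = {
    'married': 'Married',
    'high_school_graduate': 'highest_ed_=_high_school_grad',
    'some_college': "highest_ed_=_some_college_or_associate's",
    'bachelor': "highest_ed_=_bachelor's_or_higher",
}

for keyword, column in keyword_to_column.items():
    # str.contains returns True/False per row; astype(int) converts to 1/0.
    df[column] = df['variable'].str.contains(keyword, regex=False).astype(int)

print(df.drop(columns='variable'))
```

Note that 'high_school_graduate' does not match the 'did_not_graduate_high_school' name, so the baseline category keeps 0 in every education column, just as in the loop.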


In [44]:
df_regression_test_2_melt.iloc[0,1]

'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'

In [45]:
df_regression_test_2_melt.rename(columns={'value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_2_melt.to_csv('marriage_education_poverty_regression.csv', index = False) # Saves the melted table (with the categorical columns) that the regression below uses
df_regression_test_2_melt

Unnamed: 0,NAME,variable,proportion_below_poverty_level,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,00601,Proportion_of_married-couple_families_below_th...,0.683280,1,0,0,0
1,00602,Proportion_of_married-couple_families_below_th...,0.515905,1,0,0,0
2,00603,Proportion_of_married-couple_families_below_th...,0.572423,1,0,0,0
3,00606,Proportion_of_married-couple_families_below_th...,0.584699,1,0,0,0
4,00610,Proportion_of_married-couple_families_below_th...,0.568983,1,0,0,0
...,...,...,...,...,...,...,...
105275,99762,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,1
105276,99801,Proportion_of_female-householder_families_belo...,0.193955,0,0,0,1
105277,99833,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,1
105278,99835,Proportion_of_female-householder_families_belo...,0.180556,0,0,0,1


With the table complete, I performed a regression that used proportion_below_poverty_level as the dependent variable and various family type/education level values as the independent variables.

In [46]:
y = df_regression_test_2_melt['proportion_below_poverty_level']
x_vars = df_regression_test_2_melt[['Married',
       'highest_ed_=_high_school_grad',
       'highest_ed_=_some_college_or_associate\'s',
       'highest_ed_=_bachelor\'s_or_higher']]
x_vars = sm.add_constant(x_vars) 
model = sm.OLS(y,x_vars)
results_2 = model.fit() 
results_2.summary()

0,1,2,3
Dep. Variable:,proportion_below_poverty_level,R-squared:,0.29
Model:,OLS,Adj. R-squared:,0.29
Method:,Least Squares,F-statistic:,10770.0
Date:,"Tue, 08 Aug 2023",Prob (F-statistic):,0.0
Time:,19:26:52,Log-Likelihood:,31227.0
No. Observations:,105280,AIC:,-62440.0
Df Residuals:,105275,BIC:,-62400.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3501,0.001,282.406,0.000,0.348,0.352
Married,-0.1817,0.001,-163.916,0.000,-0.184,-0.180
highest_ed_=_high_school_grad,-0.0812,0.002,-51.812,0.000,-0.084,-0.078
highest_ed_=_some_college_or_associate's,-0.1170,0.002,-74.645,0.000,-0.120,-0.114
highest_ed_=_bachelor's_or_higher,-0.1964,0.002,-125.249,0.000,-0.199,-0.193

0,1,2,3
Omnibus:,23525.985,Durbin-Watson:,1.741
Prob(Omnibus):,0.0,Jarque-Bera (JB):,64661.008
Skew:,1.193,Prob(JB):,0.0
Kurtosis:,6.008,Cond. No.,5.39
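
Since every predictor here is a 0/1 indicator, the fitted value for any family type/education combination is the constant plus the coefficients of the indicators that apply. The short sketch below simply adds up the coefficients reported in the table above; the helper name predicted_proportion and the shortened dictionary keys are illustrative. Note that the married, bachelor's-or-higher combination comes out slightly below zero, a reminder that this additive linear model (no interaction terms, zip codes unweighted) is only a rough description of the data.

```python
# Coefficients copied from the regression summary above.
coefs = {
    'const': 0.3501,
    'Married': -0.1817,
    'high_school_grad': -0.0812,
    "some_college_or_associate's": -0.1170,
    "bachelor's_or_higher": -0.1964,
}

def predicted_proportion(married, education=None):
    """Sum the constant and the coefficients of the indicators that equal 1.

    education=None stands for the baseline category
    (householder did not graduate high school).
    """
    total = coefs['const']
    if married:
        total += coefs['Married']
    if education is not None:
        total += coefs[education]
    return round(total, 4)

print(predicted_proportion(married=False))  # baseline: female householder, no diploma
print(predicted_proportion(married=True))
print(predicted_proportion(married=True, education="bachelor's_or_higher"))
```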


In [47]:
end_time = time.time()
run_time = end_time - start_time
run_minutes = run_time // 60
run_seconds = run_time % 60
print("Completed run at",time.ctime(end_time),"(local time)")
print("Total run time:",'{:.2f}'.format(run_time),"second(s) ("+str(run_minutes),"minute(s) and",'{:.2f}'.format(run_seconds),"second(s))") # Only valid when the program is run nonstop from start to finish

Completed run at Tue Aug  8 19:26:52 2023 (local time)
Total run time: 88.84 second(s) (1.0 minute(s) and 28.84 second(s))
