# Python Tutorial Program: Gathering and Exporting Census Data

By Kenneth Burchfiel

This code is released under the MIT license; the datasets produced by the code are in the public domain.

You can find my blog post on this code at https://kburchfiel3.wordpress.com/2021/08/12/python-tutorial-program-retrieving-u-s-census-data/ .

This program demonstrates how Python (along with the census library, available at https://github.com/datamade/census) can be used to retrieve and export US Census data at the zip code, county, and state level. Although this tutorial program will focus on gathering education, family type, and income/poverty statistics from the American Community Survey, it should be a useful reference for those wishing to gather other types of census data instead.

Before being able to run the code below on your computer, you'll need to install the census library and obtain a free Census API key. See the above link for instructions.

First, I imported a number of libraries:

In [1]:
import time
start_time = time.time() # Allows the program's runtime to be measured
from census import Census
# import us  # I didn't end up using this library, but you may find it useful for your own Census query program. See https://github.com/datamade/census for more information.
import pandas as pd
import numpy as np
import statsmodels.api as sm

Instead of hard-coding the year into my Census queries, I chose to set it as a variable so that the queries could be modified more easily. I picked 2019 because it was the most recent year for which American Community Survey data was available.

In [2]:
year = 2019

Next, I imported my Census API key into the code. I stored the path to the key and the key itself in separate file locations. 

In [3]:
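with open('..\\key_paths\\path_to_keys_folder.txt') as fin:
    api_folder_path = fin.readline().strip() # .strip() removes any trailing
    # newline, which would otherwise end up inside the path below
with open(api_folder_path+'\\census_api_key.txt') as fin:
    api_key = fin.readline().strip()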
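
The backslash-separated paths above only work on Windows. As a rough sketch of a more portable approach, the same idea can be written with pathlib. The helper name read_first_line and the placeholder key 'abc123' are illustrative assumptions, not part of the original program, and a temporary folder stands in for the real key folder:

```python
from pathlib import Path
import tempfile

def read_first_line(path):
    """Return the first line of a text file without surrounding whitespace."""
    with open(path) as fin:
        return fin.readline().strip()

# A throwaway folder stands in for the real key folder in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / 'census_api_key.txt'  # Path joins work on any OS
    key_file.write_text('abc123\n')  # 'abc123' is a placeholder, not a real key
    demo_key = read_first_line(key_file)

print(repr(demo_key))
```

Because the helper strips whitespace, a trailing newline in either text file would not leak into the folder path or the key.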

In [4]:
c = Census(api_key) # See https://github.com/datamade/census

The next step was to locate the source of the data that I was interested in. For this program, I chose to retrieve zip code statistics for the following variables:

1. Household types (mostly married households vs. ones led by a female householder with no spouse present, which, for brevity's sake, I'll abbreviate as 'female-householder' homes)
2. The presence of children within these households
3. Median household income
4. Poverty status by family type
5. Poverty status by family type and the highest level of education completed

To search for this data, I used the Census's API site (https://api.census.gov/data.html). This is a very helpful site that provides links to different data sources, along with lists of groups and variables within those data sources.

For example, to access data from the 2019 American Community Survey, I searched in the above page in my web browser for 'acs5', then found the most recent year--which, in this case, happened to be 2019. To confirm that I could access data at the zip code level within this table, I could click on the 'geography' hyperlink (https://api.census.gov/data/2019/acs/acs5/geography.html). To figure out what types of data this survey provides, I clicked on its 'groups' hyperlink (https://api.census.gov/data/2019/acs/acs5/groups.html).

This groups page had 1,136 (!) different types of data that I could choose from. Fortunately, there were lots of options available for my variables of interest (marriage, income, education, household type, etc.)

The Census data site also provided an 'examples' page for accessing American Community Survey data (https://api.census.gov/data/2019/acs/acs5/examples.html), although the query format I used differed somewhat from the examples shown there.

I chose to query Census data in this program by:
1. Organizing different queries in dictionaries
2. Adding these dictionaries to a list (which I named 'metric_list')
3. Looping through this list
4. Storing the output of the queries in a DataFrame

The first two steps are shown below. I ended up adding many different queries to my dictionary, but you may choose to retrieve data for only a couple variables.

Each dictionary is based on information available on the Census Data page for a particular 'group.' For instance, to find data on the presence of children in households by household type, I chose to look into table B11005, 'HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE' (which can be found on https://api.census.gov/data/2019/acs/acs5/groups.html). Clicking the 'selected variables' link for that group took me to https://api.census.gov/data/2019/acs/acs5/groups/B11005.html. This page shows all the different statistics available for the 'HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE' group.

I stored the following information from these pages within the dictionaries below:

1. 'Name': the code on the Census website for that particular variable (e.g. B11005_001E). 

2. 'Label': the Census's text description of that variable (e.g. 'Estimate!!Total:')

3. 'Concept': the Census's text description of the group to which the variable belongs (e.g. 'HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE').

I also added an 'Alias' key to store my own description of these metrics. These aliases then served as column names in the Pandas DataFrame that stored the results of these queries. That DataFrame will appear later in this program.

I could have made the dictionaries simpler by including only the 'Name' and 'Alias' components, as the 'Label' and 'Concept' keys are neither used in the census queries nor displayed in the table. However, they can serve as a helpful reference for distinguishing between subtly different variable types.
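
To make the dictionary layout described above concrete, here is a minimal sketch of what two entries in metric_list might look like. The 'Name', 'Label', and 'Concept' values are taken from the Census variable pages, while the 'Alias' values and the names rename_map and query_string are illustrative assumptions:

```python
# Two example entries; the aliases are placeholders chosen for illustration.
metric_list = [
    {'Name': 'B11005_001E',
     'Label': 'Estimate!!Total:',
     'Concept': 'HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS '
                'BY HOUSEHOLD TYPE',
     'Alias': 'Total households'},
    {'Name': 'B19013_001E',
     'Label': 'Estimate!!Median household income in the past 12 months '
              '(in 2019 inflation-adjusted dollars)',
     'Concept': 'MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS '
                '(IN 2019 INFLATION-ADJUSTED DOLLARS)',
     'Alias': 'Median household income'},
]

# Only 'Name' and 'Alias' are needed to build a query and label its results.
query_string = ','.join(metric['Name'] for metric in metric_list)
rename_map = {metric['Name']: metric['Alias'] for metric in metric_list}

print(query_string)
print(rename_map)
```

A mapping like rename_map could then be passed to a DataFrame's rename method so that the aliases become the column names.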

## Creating a list of American Community Survey variables

In order to determine which variables I would import into my spreadsheet, I used pd.read_html to read in all 27,000+ variables in the 2019 American Community Survey from https://api.census.gov/data/2019/acs/acs5/variables.html . Next, I created a list of groups from this survey.

'Groups' are general categories of data, whereas 'variables' are specific data points within a given group. For instance, group B16010 contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER,' and variable B16010_015E has the data label 'Estimate!!Total:!!High school graduate (includes equivalency)'. 
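
Since each variable code begins with its group code, the group can be recovered from the variable name itself. A small sketch, using a hypothetical helper named split_variable_code (not part of the original program):

```python
def split_variable_code(variable):
    """Split a code like 'B16010_015E' into its group and its suffix."""
    group, suffix = variable.split('_')
    return group, suffix

group, suffix = split_variable_code('B16010_015E')
print(group, suffix)
```

This assumes that every code contains exactly one underscore, which holds for the variables used in this program.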

In [5]:
generate_variable_list = False # If set to False, this code block will read in
# a pre-existing list of variables, which is much faster.

if generate_variable_list:
    df_variables = pd.read_html(
        'https://api.census.gov/data/'+str(year)+'/acs/acs5/variables.html')[0]
    # read_html returns a list of DataFrames, but I only want the first one 
    # (hence the [0] at the end)
    df_variables = df_variables.loc[df_variables['Label'].str.contains(
        'Estimate')].copy() # Filters the DataFrame to exclude rows that do not
        # contain variable values
    df_variables = df_variables[['Name', 'Label', 'Concept', 'Group']] # Removes
    # extraneous columns
    df_variables.rename(columns={'Name':'Variable'},inplace=True) # Changing the 
    # name will make it easier to merge this table with ones
    # that also have a 'Name' column
    df_variables.reset_index(drop=True,inplace=True)
    df_variables.to_csv(
        'variables_from_html_acs_'+str(year)+'.csv', index = False)
    # Creating a shorter DataFrame with only group (table) information to make 
    # locating groups easier:
    df_groups = df_variables[['Concept', 'Group']].drop_duplicates(
    ).reset_index(drop=True)
    df_groups.to_csv('groups_from_html_acs_'+str(year)+'.csv', index = False)

else:
    
    df_variables = pd.read_csv('variables_from_html_acs_'+str(year)+'.csv')


I could then look through these two CSV files ('groups_from_html_acs' and 'variables_from_html_acs') to determine which variables to add to my project. First, I could look through the groups_from_html_acs list to find categories that interested me. For instance, since I was interested in comparing educational attainment across regions, I wanted to look further into group B16010, which contains data on 'EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER.' One of this group's variable entries (which I could find within variables_from_html_acs) is B16010_015E, whose data label is 'Estimate!!Total:!!High school graduate (includes equivalency):.' I added this variable code to my variable list, along with other variables that covered education, poverty, and demographic data.
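
Instead of scanning the CSV files by eye, the same search could be done in pandas. The sketch below builds a three-row stand-in for df_variables (the real table has thousands of rows) and filters it by a keyword in the 'Concept' column. The names df_demo and matches are illustrative:

```python
import pandas as pd

education_concept = ('EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE '
                     'SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER')

# A tiny stand-in for the full variables table
df_demo = pd.DataFrame({
    'Variable': ['B01001_001E', 'B16010_001E', 'B16010_015E'],
    'Label': ['Estimate!!Total:', 'Estimate!!Total:',
              'Estimate!!Total:!!High school graduate (includes equivalency):'],
    'Concept': ['SEX BY AGE', education_concept, education_concept],
    'Group': ['B01001', 'B16010', 'B16010']})

# case=False makes the search case-insensitive; regex=False treats the
# keyword as plain text rather than as a regular expression.
matches = df_demo[df_demo['Concept'].str.contains(
    'educational attainment', case=False, regex=False)]

print(matches[['Variable', 'Label']])
```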

In [6]:
variable_list = ['B01001_001E', 'B11005_001E', 'B11005_013E', 'B11005_002E', 
'B11005_004E', 'B19013_001E', 'B17006_002E', 'B17006_016E', 
'B17006_003E', 'B17006_017E', 'B17006_012E', 'B17006_026E',
'B17006_008E', 'B17006_022E', 'B17018_004E', 'B17018_021E', 
'B17018_005E', 'B17018_022E', 'B17018_006E', 'B17018_023E', 
'B17018_007E', 'B17018_024E', 'B17018_015E', 'B17018_032E', 
'B17018_016E', 'B17018_033E', 'B17018_017E', 'B17018_034E', 
'B17018_018E', 'B17018_035E', 'B16010_001E', 'B16010_002E',
'B16010_015E', 'B16010_028E', 'B16010_041E']

# You can use the following template for your own list:
# variable_list = ['', '', '', '',
# '', '', '', '',
# '', '', '', '']


To see how the retrieve_census_data function (defined later in this program) performs with longer variable lists, you can try using the following version of variable_list:

In [7]:
# variable_list = ['B01001_001E', 'B01002_001E', 'B06008_001E', 'B06008_003E', 
# 'B08124_001E', 'B08124_002E', 'B08124_003E', 'B08124_004E', 'B08124_005E', 
# 'B08124_006E', 'B08124_007E', 'B09001_001E', 'B13002_001E', 'B13002_002E',
# 'B14001_001E', 'B14001_002E', 'B14001_008E', 'B14001_009E', 'B16010_001E',
# 'B16010_002E', 'B16010_041E', 'B17001_001E', 'B17001_002E', 'B19001_001E',
# 'B19001_002E', 'B19013_001E', 'B19083_001E', 'B19325_001E', 'B19325_002E',
# 'B19325_003E', 'B19325_049E', 'B19325_050E', 'B23025_001E', 'B23025_002E',
# 'B23025_004E', 'B23025_005E', 'B23027_012E', 'B23027_013E', 'B24011_001E',
# 'B24011_002E', 'B24011_018E', 'B24011_026E', 'B24011_029E', 'B24011_033E',
# 'B25002_001E', 'B25002_002E', 'B25010_001E', 'B25064_001E', 'B25077_001E',
# 'B25105_001E', 'C27012_001E', 'C27012_002E']

Next, I will create a filtered version of df_variables that includes only the variables stored in variable_list.

In [8]:
df_variables = df_variables.query("Variable in @variable_list").copy()
# Adding .copy() here prevents a SettingWithCopyWarning later on.
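
A filter like this silently drops any requested code that is not present in the variables table, so it can be worth checking for missing codes. A minimal sketch with toy data; the code 'B99999_001E' is made up for the demonstration, and isin is used here as an equivalent to the query call above:

```python
import pandas as pd

requested = ['B01001_001E', 'B11005_001E', 'B99999_001E']  # last code is made up
df_demo = pd.DataFrame(
    {'Variable': ['B01001_001E', 'B11005_001E', 'B16010_001E']})

# Keep only the requested rows, then list the requested codes with no match.
found = df_demo[df_demo['Variable'].isin(requested)].copy()
missing = sorted(set(requested) - set(found['Variable']))

print("Variables not found in the table:", missing)
```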

In [9]:
df_variables

Unnamed: 0,Variable,Label,Concept,Group
0,B01001_001E,Estimate!!Total:,SEX BY AGE,B01001
6835,B11005_001E,Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005
6836,B11005_002E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005
6838,B11005_004E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005
6847,B11005_013E,Estimate!!Total:!!Households with no people un...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005
8692,B16010_001E,Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010
8693,B16010_002E,Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010
8706,B16010_015E,Estimate!!Total:!!High school graduate (includ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010
8719,B16010_028E,Estimate!!Total:!!Some college or associate's ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010
8732,B16010_041E,Estimate!!Total:!!Bachelor's degree or higher:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010


Next, I'll add a 'Description' column that combines the 'Concept' and 'Label' data to describe what each variable code refers to.

In [10]:
df_variables['Description'] = df_variables['Concept'] + ' ' + df_variables['Label']
df_variables.sort_values('Variable',inplace=True)
df_variables.reset_index(drop=True,inplace=True)
df_variables

Unnamed: 0,Variable,Label,Concept,Group,Description
0,B01001_001E,Estimate!!Total:,SEX BY AGE,B01001,SEX BY AGE Estimate!!Total:
1,B11005_001E,Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
2,B11005_002E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
3,B11005_004E,Estimate!!Total:!!Households with one or more ...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
4,B11005_013E,Estimate!!Total:!!Households with no people un...,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...,B11005,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEAR...
5,B16010_001E,Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
6,B16010_002E,Estimate!!Total:!!Less than high school graduate:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
7,B16010_015E,Estimate!!Total:!!High school graduate (includ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
8,B16010_028E,Estimate!!Total:!!Some college or associate's ...,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...
9,B16010_041E,Estimate!!Total:!!Bachelor's degree or higher:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...,B16010,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...


It's now time to query census data on each of the variables in variable_list for the year specified. The retrieve_census_data function below does so in batches of 45 variables because the Census API accepts at most 50 variables per call, and additional values (such as NAME) will also be requested and returned in the data.
After retrieving the data, the function sets the 'Description' values in df_variables as the column names for easier identification and then converts the data columns to numerical format. A second function, retrieve_data_from_other_year, is then used to add population data from 5 years earlier to the table.
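
The batching step on its own can be sketched as follows. The helper name batch_variables and the made-up codes in demo_codes are assumptions for illustration; the function defined below builds its comma-separated strings with an explicit loop instead:

```python
def batch_variables(variables, batch_size=45):
    """Join variable codes into comma-separated strings of limited size."""
    return [','.join(variables[start:start + batch_size])
            for start in range(0, len(variables), batch_size)]

# 100 made-up codes, used only to show how the batches are sized
demo_codes = ['V' + str(n).zfill(3) for n in range(100)]
batches = batch_variables(demo_codes)
batch_sizes = [len(batch.split(',')) for batch in batches]

print(batch_sizes)
```

With 100 codes and a batch size of 45, this yields two full batches and a final batch holding the remaining 10 codes.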

In [11]:
def retrieve_census_data(df_variable_list, year, region):

    extra_columns = {}

    for start_point in range(0, len(df_variable_list), 45):
        variable_string = '' # This string will contain all the variables to be 
        # retrieved from the Census API.
        end_point = min(start_point + 45, len(df_variable_list)) # variables will stop
        # being added to the list once (A) the end of the DataFrame is reached or
        # (B) variable_string contains 45 variables.
        for i in range(start_point, end_point):
            if i != end_point-1:
                variable_string += df_variable_list.iloc[i]['Variable'] + ','
            else:
                variable_string += df_variable_list.iloc[i]['Variable']
        print("Retrieving data from rows",start_point,"to",end_point-1) # The 
        # range above stops before end_point, so the last row retrieved
        # is end_point-1.

# The following if statements determine what strings to use for the Census
# API request. The following page was very helpful in drafting these 
# strings: https://api.census.gov/data/2019/acs/acs5/examples.html

        if region == 'zip':
            request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable_string+'&for=zip%20code%20tabulation%20area:*&key='+api_key

        if region == 'state':
            request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable_string+'&for=state:*&key='+api_key

        if region == 'county':
            request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable_string+'&for=county:*&in=state:*&key='+api_key
            
        batch_request = pd.read_json(request_string)

        # For an example of a URL-based API call, see: https://api.census.gov/data/2014/acs/acs1?get=NAME,B02015_009E,B02015_009M&for=state:*
        # The following two lines convert the first row into the header, then
        # drop the row from the DataFrame.
        batch_request.columns = batch_request.iloc[0] 
        batch_request = batch_request.iloc[1:]

        # The geography code columns are set aside here (from the first batch
        # only) and merged back in once all batches have been retrieved.

        if 'state' in batch_request.columns:
            if start_point == 0:
                extra_columns['state'] = pd.DataFrame(batch_request[['NAME', 'state']])
            batch_request.drop('state', axis = 1, inplace = True)
        if 'zip code tabulation area' in batch_request.columns:
            batch_request.drop('zip code tabulation area', axis = 1, inplace = True)
        if 'county' in batch_request.columns:
            if start_point == 0:
                extra_columns['county'] = pd.DataFrame(batch_request[['NAME', 'county']])
            batch_request.drop('county', axis = 1, inplace = True)

        # The following code block either initializes df_region_data as
        # batch_request (if df_region_data does not yet exist) or merges the 
        # new copy of batch_request into df_region_data.

        if start_point == 0:
            df_region_data = batch_request 
        else:
            df_region_data = df_region_data.merge(batch_request, on = 'NAME', how = 'outer')


    # Now that all variable data has been obtained from the census, changes can 
    # be applied to the resulting dataframe as a whole.

    # Currently, the column names are mostly ID-based (e.g. 'B01001_001E',
    # 'B01002_001E', 'B06008_001E', 'B06008_003E'). To make them more intuitive
    # (at the expense of dramatically increasing their length), they will be
    #  replaced with values in the 'Description' column within df_variable_list.

    # Since the only columns in the DataFrame are the 'NAME' column
    # and these variable ID columns, this is a good opportunity
    # to rename the variable ID columns (as there are no other columns
    # to deal with).
    descriptions = ['NAME'] # This will remain as the first column
    descriptions.extend(df_variable_list['Description'])
    df_region_data.columns = descriptions


    df_region_data.insert(1, 'Year', year) # Stores the year of the ACS data 
    # as the second column in the DataFrame (right after 'NAME')

    # The geography code columns that were set aside earlier are merged back
    # in and placed right after the 'Year' column.
    if 'county' in extra_columns.keys():
        df_region_data = df_region_data.merge(extra_columns['county'], on = 'NAME', how = 'outer')
        df_region_data.insert(2, 'county', df_region_data.pop('county'))

    if 'state' in extra_columns.keys():
        df_region_data = df_region_data.merge(extra_columns['state'], on = 'NAME', how = 'outer')
        df_region_data.insert(2, 'state', df_region_data.pop('state'))

    if region == 'zip':
        # Removes the 'ZCTA5 ' prefix so that only the zip code itself remains
        df_region_data['NAME'] = df_region_data['NAME'].str.replace('ZCTA5 ','')

    # The following for loop converts all results columns (i.e. all but the
    # first column, which stores region name information) into numerical
    # values. Assigning by column name replaces each column outright, so the
    # new numeric dtype is kept.
    for column in df_region_data.columns[1:]:
        df_region_data[column] = pd.to_numeric(df_region_data[column])
    
    return df_region_data
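
To see the header-promotion and numeric-conversion steps without calling the API, the sketch below starts from a small hand-written response. It assumes the API returns a list of rows whose first row holds the column names and whose values arrive as strings; the population figures are copied from the state output later in this program, and the names sample_response and df_sample are illustrative:

```python
import pandas as pd

# A hand-written stand-in for the JSON that the API returns
sample_response = [
    ['NAME', 'B01001_001E', 'state'],
    ['Alabama', '4876250', '01'],
    ['Alaska', '737068', '02'],
]

# The first row becomes the header and the remaining rows become the data.
df_sample = pd.DataFrame(sample_response[1:], columns=sample_response[0])

# Values arrive as text, so the data column is converted to numbers.
df_sample['B01001_001E'] = pd.to_numeric(df_sample['B01001_001E'])

print(df_sample.dtypes)
```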


In [12]:
def retrieve_data_from_other_year(region, year, column_name, variable):
    if region == 'zip':
        request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable+'&for=zip%20code%20tabulation%20area:*&key='+api_key

    if region == 'state':
        request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable+'&for=state:*&key='+api_key

    if region == 'county':
        request_string = 'https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable+'&for=county:*&in=state:*&key='+api_key
    
    df_result = pd.read_json(request_string)

    df_result.columns = df_result.iloc[0] 
    df_result = df_result.iloc[1:]
    result_col = str(column_name)+'_'+str(year)
    df_result.rename(columns={variable:result_col},inplace=True)
    df_result[result_col] = pd.to_numeric(df_result[result_col])

    if region == 'zip':
        df_result['NAME'] = df_result['NAME'].str.replace('ZCTA5 ','')

    return(df_result[['NAME', result_col]])
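
Both functions above build their request strings with three separate if statements, which leaves request_string undefined if an unexpected region is passed in. One possible alternative is a dictionary lookup that raises a clear error instead. The names GEOGRAPHY_CLAUSES and build_request_string are hypothetical, and 'KEY' is a placeholder rather than a real API key:

```python
# Maps each supported region type to the geography part of the request
GEOGRAPHY_CLAUSES = {
    'zip': '&for=zip%20code%20tabulation%20area:*',
    'state': '&for=state:*',
    'county': '&for=county:*&in=state:*',
}

def build_request_string(region, year, variable_string, api_key):
    """Assemble an ACS 5-year request URL for the given region type."""
    if region not in GEOGRAPHY_CLAUSES:
        raise ValueError("Unsupported region: " + str(region))
    return ('https://api.census.gov/data/' + str(year)
            + '/acs/acs5?get=NAME,' + variable_string
            + GEOGRAPHY_CLAUSES[region] + '&key=' + api_key)

demo_url = build_request_string('state', 2019, 'B01001_001E', 'KEY')
print(demo_url)
```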

In [13]:
def test_variables(variable_list, year, region_type):
    for i in range(len(variable_list)):
        try:
            if region_type == 'county':
                pd.read_json('https://api.census.gov/data/'+str(year)+'/acs/acs5?get=NAME,'+variable_list[i]+'&for=county:*&in=state:*&key='+api_key)
            # print("Data for:",variable_list[i],"retrieved successfully.")
        except Exception: # A bare 'except:' would also catch KeyboardInterrupt
            print("Failed to retrieve data for:",variable_list[i]+". Confirm that the variable code was entered correctly and that this data is available for the specified region.")

In [14]:
zip_data = retrieve_census_data(df_variable_list = df_variables, year = year, region = 'zip')
zip_data = zip_data.merge(retrieve_data_from_other_year(variable = 'B01001_001E', column_name = 'population', region = 'zip', year = year-5), on = 'NAME', how = 'outer')
zip_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER 
Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,25245,2019,54,600,294,20,20,152,499,65,...,16,48,89,0,0,0,0,0,57895,686
1,25268,2019,54,964,354,80,54,197,745,157,...,50,61,71,51,0,0,0,0,27200,731
2,25286,2019,54,1700,613,211,152,152,1177,322,...,35,216,28,10,0,9,17,0,38313,1479
3,25303,2019,54,6764,2970,876,534,731,4838,201,...,0,174,469,583,0,48,54,116,58820,7181
4,25311,2019,54,10964,5088,1229,527,999,7866,754,...,101,418,432,533,0,99,299,126,40920,10059
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33115,38704,2019,28,2,2,0,0,0,2,0,...,0,0,0,0,0,0,0,0,-666666666,3
33116,38731,2019,28,246,54,46,26,8,145,59,...,26,0,8,0,0,20,0,0,53173,184
33117,38749,2019,28,71,34,5,0,5,62,29,...,0,5,0,0,0,0,0,0,18750,20
33118,38781,2019,28,198,107,12,7,7,172,37,...,0,6,4,4,4,0,0,0,10772,247
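
Note the median household income of -666666666 for zip code 38704 in the output above. This is a placeholder rather than a real income; my understanding is that the Census uses it when an estimate could not be computed. Leaving it in place would distort any averages or regressions, so one option is to replace it with NaN. A minimal sketch, using two rows copied from the output and an illustrative column name:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'NAME': ['38704', '38731'],
    'median_household_income': [-666666666, 53173]})

# Replace the placeholder with NaN so that it is skipped in calculations.
df_demo['median_household_income'] = df_demo[
    'median_household_income'].replace(-666666666, np.nan)

print(df_demo['median_household_income'].mean())
```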


In [15]:
county_data = retrieve_census_data(df_variable_list = df_variables, year = year, region = 'county')
county_data = county_data.merge(retrieve_data_from_other_year(variable = 'B01001_001E', column_name = 'population', region = 'county', year = year-5), on = 'NAME', how = 'outer')
county_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,county,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school 
graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,"Fayette County, Illinois",2019.0,17.0,51.0,21565.0,7737.0,2193.0,1433.0,2778.0,15303.0,...,411.0,1489.0,1414.0,572.0,27.0,168.0,162.0,39.0,46650.0,22041.0
1,"Logan County, Illinois",2019.0,17.0,107.0,29003.0,10797.0,2831.0,2023.0,3538.0,20373.0,...,171.0,1732.0,1793.0,1705.0,35.0,208.0,264.0,153.0,57308.0,30047.0
2,"Saline County, Illinois",2019.0,17.0,165.0,23994.0,9972.0,3122.0,1823.0,3046.0,17113.0,...,257.0,781.0,2178.0,1183.0,34.0,177.0,279.0,162.0,44090.0,24876.0
3,"Lake County, Illinois",2019.0,17.0,97.0,701473.0,246122.0,90926.0,68192.0,76607.0,457676.0,...,9449.0,19759.0,31821.0,79600.0,1827.0,4757.0,6471.0,7024.0,89427.0,703170.0
4,"Massac County, Illinois",2019.0,17.0,127.0,14219.0,5822.0,1886.0,1293.0,1807.0,10021.0,...,210.0,729.0,1369.0,546.0,36.0,66.0,175.0,50.0,47481.0,15148.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3217,"Knox County, Tennessee",2019.0,47.0,93.0,461104.0,187319.0,53211.0,35388.0,52752.0,308366.0,...,3744.0,15768.0,23352.0,41482.0,1258.0,3874.0,5607.0,4066.0,57470.0,440732.0
3218,"Benton County, Washington",2019.0,53.0,5.0,197518.0,72121.0,24931.0,16610.0,21172.0,127960.0,...,1975.0,6346.0,12719.0,14841.0,564.0,1219.0,2673.0,1490.0,69023.0,182053.0
3219,"Clark County, Washington",2019.0,53.0,11.0,473252.0,174661.0,59067.0,42004.0,53759.0,319955.0,...,4124.0,16089.0,35353.0,36977.0,1011.0,3178.0,7050.0,3032.0,75253.0,438272.0
3220,"Shannon County, South Dakota",,,,,,,,,,...,,,,,,,,,,14005.0
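
The last row above (Shannon County, South Dakota) has a population_2014 value but no 2019 data. That pattern is what an outer merge produces when a NAME appears in only one of the two tables, which can happen if a county was renamed between surveys (Shannon County became Oglala Lakota County in 2015, as far as I know). The NaN values in that row are also why the other columns display as floats. Below is a small sketch of how such rows could be flagged with the indicator argument of merge. The toy tables and the column name population_2019 are illustrative, with values copied from the output above:

```python
import pandas as pd

df_current = pd.DataFrame({
    'NAME': ['Knox County, Tennessee', 'Benton County, Washington'],
    'population_2019': [461104, 197518]})
df_earlier = pd.DataFrame({
    'NAME': ['Knox County, Tennessee', 'Benton County, Washington',
             'Shannon County, South Dakota'],
    'population_2014': [440732, 182053, 14005]})

# indicator=True adds a '_merge' column showing which table each row came from.
merged = df_current.merge(df_earlier, on='NAME', how='outer', indicator=True)
unmatched = merged.loc[merged['_merge'] != 'both', ['NAME', '_merge']]

print(unmatched)
```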


In [16]:
state_data = retrieve_census_data(df_variable_list = df_variables, year = year, region = 'state')
state_data = state_data.merge(retrieve_data_from_other_year(variable = 'B01001_001E', column_name = 'population', region = 'state', year = year-5), on = 'NAME', how = 'outer')
state_data

Retrieving data from rows 0 to 34


Unnamed: 0,NAME,Year,state,SEX BY AGE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family,HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with no people under 18 years:!!Family households:!!Married-couple family,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:,EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:,...,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency),"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree",POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher,"POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER 
Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree","POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher",MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars),population_2014
0,Alabama,2019,1,4876250,1867893,560887,346010,537511,3320877,458922,...,62087,210389,264847,298240,20161,48224,67244,40003,50536,4817678
1,Alaska,2019,2,737068,253346,87149,57885,66718,480586,34376,...,4590,23254,45878,46756,1451,5736,8862,4896,77640,728300
2,Arizona,2019,4,7050299,2571268,789782,496083,728297,4732532,608637,...,90976,201862,399596,455654,27334,52220,95193,55475,58945,6561516
3,Arkansas,2019,5,2999370,1158071,364192,227539,332292,2011639,270168,...,44422,151631,163452,165962,11825,30258,38982,20971,47597,2947036
4,California,2019,6,39283497,13044266,4482879,3050730,3440506,26471543,4418675,...,738802,930831,1770325,2678932,202212,268597,478465,350097,75235,38066920
5,Colorado,2019,8,5610349,2148994,658465,466708,602135,3825579,315751,...,54539,161380,297242,518231,13597,34086,58818,50725,72331,5197580
6,Delaware,2019,10,957248,363322,102861,64145,112493,669320,66816,...,9956,39777,48062,72373,2578,11959,11590,9751,68287,917060
7,District of Columbia,2019,11,692683,284386,59173,29257,44627,494116,44850,...,3222,5447,6985,56225,3223,8405,8166,8300,86420,633736
8,Connecticut,2019,9,3575074,1370746,403685,266633,392880,2483095,232663,...,31391,127014,151799,331188,12098,39620,45503,37762,78444,3592053
9,Florida,2019,12,20901636,7736311,2087688,1281766,2340583,14965745,1767583,...,227887,746082,1061698,1381751,78473,201731,274802,198438,55660,19361792


I admit that many of the column names are obscenely long and unwieldy. This is less of an issue when viewing the table as a CSV export (which I'll perform later), since spreadsheet software can make the columns a uniform width while allowing the full name to be displayed in a separate box. An alternative to these long names, though, would be to keep the variable codes as the column names, then include a key mapping each variable code to its description.
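
A minimal sketch of that alternative is shown here. It is not part of the original program: the three variable codes are, to my knowledge, the ACS codes for the tables named in their descriptions, but the key itself and the names `variable_key`, `df_codes`, `df_key`, and `df_described` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical key mapping ACS variable codes to their long descriptions.
variable_key = {
    'B01001_001E': 'SEX BY AGE Estimate!!Total:',
    'B11005_001E': 'HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:',
    'B19013_001E': 'MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars)',
}

# A small table that keeps the short codes as column names.
# The values are copied from the Alabama and Alaska rows shown above.
df_codes = pd.DataFrame({
    'NAME': ['Alabama', 'Alaska'],
    'B01001_001E': [4876250, 737068],
    'B11005_001E': [1867893, 253346],
    'B19013_001E': [50536, 77640],
})

# The key can be stored as its own table and exported alongside the data.
df_key = pd.DataFrame(list(variable_key.items()), columns=['code', 'description'])

# If the long names are needed, the key doubles as a rename mapping.
df_described = df_codes.rename(columns=variable_key)
print(df_key)
```

Keeping the codes in the data and the descriptions in a separate key keeps the main table narrow without losing the full variable names.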

So far, the values shown in the DataFrame are raw counts. For example, the table reports the number of married-couple households with one or more children, but doesn't say what *proportion* of households have at least one child--which is much more useful when comparing different zip codes.

Therefore, in the following code block, I added additional columns to the DataFrame that store various proportions. Some of these were generated using pre-existing totals as a denominator, whereas others used the sum of two different statistics as the denominator. (For example, to calculate the proportion of children below the poverty level for a given zip code, I divided the number of children below the poverty level by the sum of (1) children below the poverty level and (2) children at or above the poverty level.) This was a useful strategy when a given Census table didn't have a 'totals' row.

(When creating proportions, be careful about using a total in one table as the denominator for a proportion calculation that involves a separate table. For example, if Table A says that there are 10,000 kids in a zip code, and Table B says that there are 2,000 kids below the poverty line, you may be tempted to conclude that the proportion of children below the poverty line equals 2,000/10,000 = 0.2. However, suppose not all the kids identified in Table A show up in Table B, and that Table B doesn't have a totals row. In that case, you'd want to divide the number of kids in Table B below the poverty level (2,000) by the sum of that number and the number in Table B at or above the poverty level (let's say it's 6,000) to arrive at a more accurate proportion--in this case, 2,000/(2,000+6,000) = 2,000/8,000 = 0.25.)
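
The arithmetic in that example can be checked with a few lines of Python. The figures are the hypothetical Table A and Table B numbers from the paragraph above, and the variable names are made up for this sketch.

```python
# Hypothetical figures from the Table A / Table B example.
table_a_total_children = 10_000   # total from a different table
table_b_below_poverty = 2_000     # children below the poverty level (Table B)
table_b_at_or_above = 6_000       # children at or above the poverty level (Table B)

# Mixing tables: the denominator counts children that Table B never classified.
mixed_proportion = table_b_below_poverty / table_a_total_children

# Staying within Table B: the denominator is the sum of its own categories.
consistent_proportion = table_b_below_poverty / (table_b_below_poverty + table_b_at_or_above)

print(mixed_proportion, consistent_proportion)
```

The second figure is the one that matches the approach used for the poverty columns in this program.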

In [17]:
def calc_proportions_and_rename(df_results):

    df_results['5_year_population_growth'] = (df_results['SEX BY AGE Estimate!!Total:']/df_results['population_2014']-1)

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:']

    df_results['Married_couple_households_with_one_or_more_children_as_proportion_of_all_households_with_one_or_more_children'] = df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:!!Family households:!!Married-couple family']/df_results['HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:!!Households with one or more people under 18 years:']

    df_results['Proportion_of_children_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:'])

    df_results['Proportion_of_children_in_married_couple_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In married-couple family:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In married-couple family:'])

    df_results['Proportion_of_children_in_female_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Female householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Female householder, no spouse present:'])

    df_results['Proportion_of_children_in_male_householder_families_below_poverty_level'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months below poverty level:!!In other family:!!Male householder, no spouse present:'] + df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF RELATED CHILDREN UNDER 18 YEARS BY FAMILY TYPE BY AGE OF RELATED CHILDREN UNDER 18 YEARS Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!In other family:!!Male householder, no spouse present:'])

    # Calculating proportions of residents living below the poverty level by education and household type

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Less than high school graduate'])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!High school graduate (includes equivalency)"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Some college, associate's degree"])

    df_results['Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Married-couple family:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Married-couple family:!!Bachelor's degree or higher"])


    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate"]/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Less than high school graduate'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent'] = df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']/(df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)']+df_results['POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!High school graduate (includes equivalency)'])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Some college, associate's degree"])

    df_results['Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]/(df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months below poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"]+df_results["POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Other families:!!Female householder, no spouse present:!!Bachelor's degree or higher"])

    df_results['Proportion_of_individuals_25+y/o_who_did_not_graduate_high_school'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Less than high school graduate:']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+y/o_whose_highest_education_level_=_high_school_graduate/equivalent'] = df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!High school graduate (includes equivalency):']/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_some_college/associate\'s_degree'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Some college or associate's degree:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    df_results['Proportion_of_individuals_25+_y/o_whose_highest_education_level_=_bachelor\'s_degree_or_higher'] = df_results["EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:!!Bachelor's degree or higher:"]/(df_results['EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 25 YEARS AND OVER Estimate!!Total:'])

    # df_results[''] = df_results['']/(df_results['']+df_results[''])

    df_results.rename(columns = {

        "SEX BY AGE Estimate!!Total:":"Total_population",

        "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Estimate!!Median household income in the past 12 months (in 2019 inflation-adjusted dollars)":"Median_household_income",
        
        "HOUSEHOLDS BY PRESENCE OF PEOPLE UNDER 18 YEARS BY HOUSEHOLD TYPE Estimate!!Total:":"Households",
        
        },inplace=True)



    return df_results
    

In [18]:
county_data = calc_proportions_and_rename(county_data)

In [19]:
state_data = calc_proportions_and_rename(state_data)

In [20]:
zip_data = calc_proportions_and_rename(zip_data)
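
Most of the lines in calc_proportions_and_rename follow the same pattern: below / (below + at or above). As a possible refactoring (not something the original program does), that pattern could be wrapped in a small helper. The function name `share_below` and the toy column names are assumptions for this sketch.

```python
import pandas as pd

def share_below(df, below_col, at_or_above_col):
    """Return below / (below + at_or_above) for each row as a Series."""
    return df[below_col] / (df[below_col] + df[at_or_above_col])

# Toy data: the second row has no families in either category.
toy = pd.DataFrame({'below': [2000, 0], 'at_or_above': [6000, 0]})
toy['share'] = share_below(toy, 'below', 'at_or_above')
print(toy)
```

Note that a row with zero in both categories produces NaN rather than an error, which is one reason rows with missing proportions need to be dropped before running a regression.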

A look at the first few rows in this table reveals that some median household income values are clearly inaccurate! $-666,666,666 is *not* the actual median household income in any zip code, yet that's the value listed for 2,299 entries in zip_data, as shown below:

In [21]:
len(zip_data.query("Median_household_income == -666666666"))

2299

This means that, when performing average calculations across the entire dataset, you must be extremely careful--otherwise, you'll end up with results like the one below:

In [22]:
np.mean(zip_data['Median_household_income'])

-46219120.15483092

These results are, of course, skewed by the thousands of -666,666,666 values. The U.S. would be in dire shape if the average median household income among zip codes were truly $-46,219,120! 
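
One possible cleaning step, sketched here on a few toy values rather than on zip_data itself, is to treat -666666666 as a missing value before averaging. This assumes that the figure is a placeholder for an estimate that could not be computed, which is my understanding of the Census Bureau's usage; the ACS documentation is the place to confirm that before applying this to real data.

```python
import numpy as np
import pandas as pd

# Toy income values, including one placeholder entry.
incomes = pd.Series([50536, -666666666, 77640, 58945], name='Median_household_income')

# Assumption: -666666666 marks a missing estimate, so it is replaced with NaN.
cleaned = incomes.replace(-666666666, np.nan)

naive_mean = incomes.mean()   # dragged far below zero by the placeholder
clean_mean = cleaned.mean()   # pandas skips NaN values by default

print(naive_mean, clean_mean)
```

Because pandas ignores NaN when computing a mean, the cleaned average only reflects rows with an actual estimate.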

I then exported these DataFrames to CSV files. For the zip code and county data, I exported two versions. The first version (e.g. zip_data_1k_plus_households) only includes areas with more than 1,000 households, since small sample sizes in less populous areas can skew the statistics shown. The second version contains every zip code or county present in the dataset. The state data was exported in full.

In [23]:
zip_data_1k_plus_households = zip_data.query("Households > 1000")
county_data_1k_plus_households = county_data.query("Households > 1000")

zip_data.to_csv('acs5_'+str(year)+'_zip_results.csv')
zip_data_1k_plus_households.to_csv('acs5_'+str(year)+'_zip_results_1k_plus_households.csv')

county_data.to_csv('acs5_'+str(year)+'_county_results.csv')
county_data_1k_plus_households.to_csv('acs5_'+str(year)+'_county_results_1k_plus_households.csv')

state_data.to_csv('acs5_'+str(year)+'_state_results.csv')

As shown below, running the same average median household income calculation on the reduced dataset produces a more plausible-looking number, although it may still be pulled down by any placeholder values that remain. It would therefore still be better to look through the DataFrame beforehand and perform any necessary data cleaning.

In [24]:
np.mean(zip_data_1k_plus_households['Median_household_income'])

26550.97108703333

That concludes the main part of this tutorial program. I hope that you find these examples useful in performing your own census data analysis!

These census DataFrames can also be a great source of information for regression analyses. The following code blocks show how one of the DataFrames can be modified to serve as a data source for regressions (albeit without any data cleaning or checking). In the future, I may move these regressions over to a separate tutorial program and provide detailed explanations of the code. In the meantime, I've left the code in place and added some brief explanations. 

The first regression examined the relationship between poverty rates and whether children were in a married-couple family as opposed to a female-householder one. This involved creating a reduced version of the zip_data_1k_plus_households DataFrame:

In [25]:
df_regression_test = zip_data_1k_plus_households.copy()
df_regression_test.dropna(subset=['Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level'],inplace=True)
df_regression_test = df_regression_test[['NAME','Proportion_of_children_in_female_householder_families_below_poverty_level','Proportion_of_children_in_married_couple_families_below_poverty_level']].copy()

In [26]:
df_regression_test

Unnamed: 0,NAME,Proportion_of_children_in_female_householder_families_below_poverty_level,Proportion_of_children_in_married_couple_families_below_poverty_level
3,25303,0.289377,0.054054
4,25311,0.660470,0.000000
5,25419,0.493056,0.009572
15,25601,0.482558,0.060255
37,26726,0.386427,0.016000
...,...,...,...
33095,38237,0.608929,0.085351
33097,38948,0.260274,0.082418
33101,38016,0.064658,0.058087
33111,38571,0.449275,0.126911


I then converted the two different variable columns into two different rows for each zip code using pd.melt(), which would make it easier to create categorical or 'dummy' variables for the regression analysis:

In [27]:
df_regression_test_melt = pd.melt(df_regression_test.copy(), id_vars = ['NAME']) # https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.melt.html
df_regression_test_melt

Unnamed: 0,NAME,variable,value
0,25303,Proportion_of_children_in_female_householder_f...,0.289377
1,25311,Proportion_of_children_in_female_householder_f...,0.660470
2,25419,Proportion_of_children_in_female_householder_f...,0.493056
3,25601,Proportion_of_children_in_female_householder_f...,0.482558
4,26726,Proportion_of_children_in_female_householder_f...,0.386427
...,...,...,...
33811,38237,Proportion_of_children_in_married_couple_famil...,0.085351
33812,38948,Proportion_of_children_in_married_couple_famil...,0.082418
33813,38016,Proportion_of_children_in_married_couple_famil...,0.058087
33814,38571,Proportion_of_children_in_married_couple_famil...,0.126911


The following code block uses pd.get_dummies() to generate a categorical (0/1) variable, then renames the resulting columns for better legibility.

In [28]:
df_regression_test_melt = pd.get_dummies(data = df_regression_test_melt.copy(), columns=['variable'], drop_first=True)
df_regression_test_melt.rename(columns={'variable_Proportion_of_children_in_married_couple_families_below_poverty_level':'in_married_household','value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_melt

Unnamed: 0,NAME,proportion_below_poverty_level,in_married_household
0,25303,0.289377,0
1,25311,0.660470,0
2,25419,0.493056,0
3,25601,0.482558,0
4,26726,0.386427,0
...,...,...,...
33811,38237,0.085351,1
33812,38948,0.082418,1
33813,38016,0.058087,1
33814,38571,0.126911,1
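
To make the melt and get_dummies steps easier to follow, here is the same sequence on a two-row toy table. The zip codes, column names, and values are invented for this sketch, and `dtype=int` is passed so that the dummy column holds 0/1 values regardless of the pandas version in use.

```python
import pandas as pd

# Toy wide table: one row per (hypothetical) zip code, one column per family type.
toy = pd.DataFrame({
    'NAME': ['00001', '00002'],
    'rate_female_householder': [0.40, 0.30],
    'rate_married_couple': [0.05, 0.10],
})

# melt() stacks the two rate columns into 'variable'/'value' pairs,
# producing two rows per zip code.
toy_melt = pd.melt(toy, id_vars=['NAME'])

# get_dummies() turns 'variable' into 0/1 indicator columns; drop_first=True
# drops the first category (female householder), which becomes the baseline.
toy_dummies = pd.get_dummies(toy_melt, columns=['variable'], drop_first=True, dtype=int)
print(toy_dummies)
```

The remaining indicator column equals 1 for the married-couple rows and 0 for the female-householder rows, which mirrors the in_married_household column above.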


With this table in place, I was able to perform the regression analysis.

In [29]:
y = df_regression_test_melt['proportion_below_poverty_level'] # Dependent variable: proportion of children below the poverty level
x_vars = df_regression_test_melt[['in_married_household']] # Independent variable: 1 for married-couple families, 0 for female-householder families
x_vars = sm.add_constant(x_vars) # Adds an intercept term to the model
model = sm.OLS(y,x_vars)
results = model.fit() # The results variable stores the fitted model, including its coefficients and summary statistics
results.summary()

0,1,2,3
Dep. Variable:,proportion_below_poverty_level,R-squared:,0.409
Model:,OLS,Adj. R-squared:,0.409
Method:,Least Squares,F-statistic:,23390.0
Date:,"Thu, 31 Mar 2022",Prob (F-statistic):,0.0
Time:,00:20:18,Log-Likelihood:,11307.0
No. Observations:,33816,AIC:,-22610.0
Df Residuals:,33814,BIC:,-22590.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3802,0.001,285.425,0.000,0.378,0.383
in_married_household,-0.2881,0.002,-152.952,0.000,-0.292,-0.284

0,1,2,3
Omnibus:,1609.716,Durbin-Watson:,1.66
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2450.374
Skew:,0.429,Prob(JB):,0.0
Kurtosis:,4.002,Cond. No.,2.62
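
A note on reading the coefficient table: when the only predictor is a 0/1 variable, the OLS intercept equals the mean outcome for the group coded 0, and the slope equals the difference between the two group means. Here, that means the average proportion below the poverty level is about 0.3802 for the female-householder rows and about 0.3802 - 0.2881 = 0.0921 for the married-couple rows. The sketch below illustrates this property on made-up numbers. It uses NumPy's least-squares solver rather than statsmodels so that it has fewer dependencies; the fitted coefficients are the same as what an OLS fit with a constant would give.

```python
import numpy as np

# Made-up outcome values for two groups of three observations each.
y = np.array([0.40, 0.30, 0.50, 0.05, 0.10, 0.15])
dummy = np.array([0, 0, 0, 1, 1, 1])   # 0 = baseline group, 1 = other group

# Design matrix with a constant column, analogous to sm.add_constant().
X = np.column_stack([np.ones(len(dummy)), dummy])

# Ordinary least squares fit.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coefs

# Compare with the group means computed directly.
mean_group_0 = y[dummy == 0].mean()
mean_group_1 = y[dummy == 1].mean()
print(intercept, slope, mean_group_0, mean_group_1)
```

Keep in mind that each zip code counts equally in this kind of regression, so the coefficients describe averages across zip codes rather than across children.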


My second regression analysis aimed to evaluate the impact of family type (married vs. female-householder-only) and education level (no high school diploma; high school diploma/equivalent; associate's/some college; and bachelor's or higher) on poverty status. This first involved selecting the poverty proportion columns for each combination of family type and education level.

In [30]:
df_regression_test_2 = zip_data_1k_plus_households.copy()
df_regression_test_2.dropna(subset=['Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher'],inplace=True)
df_regression_test_2 = df_regression_test_2[['NAME','Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school','Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_high_school_graduate/equivalent', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_=_some_college_or_associate\'s_degree', 'Proportion_of_female-householder_families_below_the_poverty_level_where_householder\'s_highest_education_level_=_bachelor\'s_degree_or_higher']].copy()

In [31]:
df_regression_test_2

Unnamed: 0,NAME,Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_married-couple_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher,Proportion_of_female-householder_families_below_the_poverty_level_where_householder_did_not_graduate_high_school,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_high_school_graduate/equivalent,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_=_some_college_or_associate's_degree,Proportion_of_female-householder_families_below_the_poverty_level_where_householder's_highest_education_level_=_bachelor's_degree_or_higher
4,25311,0.165289,0.000000,0.048458,0.000000,1.000000,0.673267,0.155367,0.207547
15,25601,0.233333,0.164557,0.048638,0.000000,0.272727,0.000000,0.321429,0.333333
37,26726,0.324675,0.113858,0.047354,0.000000,0.833333,0.135531,0.654206,0.166667
39,26753,0.000000,0.002222,0.004975,0.000000,1.000000,0.524590,0.851351,0.000000
40,26757,0.169231,0.071956,0.073944,0.000000,0.438095,0.706667,0.190083,0.000000
...,...,...,...,...,...,...,...,...,...
33095,38237,0.152941,0.044408,0.095119,0.011270,0.788732,0.607407,0.107477,0.451613
33097,38948,0.161765,0.069767,0.000000,0.000000,0.375000,0.095745,0.125000,0.000000
33101,38016,0.000000,0.041406,0.039346,0.031825,0.103627,0.058140,0.059701,0.008143
33111,38571,0.116725,0.065015,0.055504,0.000000,1.000000,0.132701,0.060606,1.000000


Next, I once again 'melted' various columns into the same column in order to facilitate the creation of categorical variables. I also created columns that would store these categorical variables.

In [32]:
df_regression_test_2_melt = pd.melt(df_regression_test_2.copy(), id_vars = ['NAME'])
df_regression_test_2_melt['Married'] = 0
df_regression_test_2_melt['highest_ed_=_high_school_grad'] = 0
df_regression_test_2_melt['highest_ed_=_some_college_or_associate\'s'] = 0
df_regression_test_2_melt['highest_ed_=_bachelor\'s_or_higher'] = 0

In [33]:
df_regression_test_2_melt

Unnamed: 0,NAME,variable,value,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,25311,Proportion_of_married-couple_families_below_th...,0.165289,0,0,0,0
1,25601,Proportion_of_married-couple_families_below_th...,0.233333,0,0,0,0
2,26726,Proportion_of_married-couple_families_below_th...,0.324675,0,0,0,0
3,26753,Proportion_of_married-couple_families_below_th...,0.000000,0,0,0,0
4,26757,Proportion_of_married-couple_families_below_th...,0.169231,0,0,0,0
...,...,...,...,...,...,...,...
110235,38237,Proportion_of_female-householder_families_belo...,0.451613,0,0,0,0
110236,38948,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,0
110237,38016,Proportion_of_female-householder_families_belo...,0.008143,0,0,0,0
110238,38571,Proportion_of_female-householder_families_belo...,1.000000,0,0,0,0
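
To make the effect of pd.melt easier to see, here is a minimal sketch on a tiny table. The DataFrame `wide`, its NAME values, and its shortened column names are hypothetical stand-ins rather than part of the Census data above; only the pd.melt call mirrors the one in the tutorial.

```python
import pandas as pd

# Hypothetical two-row table; the column names are shortened stand-ins
wide = pd.DataFrame({
    'NAME': ['00001', '00002'],
    'married_no_hs': [0.10, 0.20],
    'married_bachelor': [0.01, 0.02],
})

# Every column other than NAME is stacked into a 'variable'/'value' pair
long = pd.melt(wide, id_vars=['NAME'])
print(long)
```

Each original row appears once per melted column, so a table with 2 rows and 2 value columns becomes a table with 4 rows.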


The output of the following for loop served as a reference for which column numbers (used with .iloc in the loop after it) corresponded to which columns.

In [34]:
for i, column in enumerate(df_regression_test_2_melt.columns):
    print("Column",i,":\t",column)

Column 0 :	 NAME
Column 1 :	 variable
Column 2 :	 value
Column 3 :	 Married
Column 4 :	 highest_ed_=_high_school_grad
Column 5 :	 highest_ed_=_some_college_or_associate's
Column 6 :	 highest_ed_=_bachelor's_or_higher


In the next for loop, I filled in the categorical variables by checking whether certain keywords ('married', 'high_school_graduate', 'some_college', and 'bachelor') were present in the variable column. For instance, given the variable 'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school', the for loop set the 'Married' column to 1 and left the other columns at 0.

In [35]:
# Column numbers: 1 = variable, 3 = Married, 4 through 6 = education levels
for i in range(len(df_regression_test_2_melt)):
    variable = df_regression_test_2_melt.iloc[i, 1]
    if 'married' in variable:
        df_regression_test_2_melt.iloc[i, 3] = 1
    if 'high_school_graduate' in variable:
        df_regression_test_2_melt.iloc[i, 4] = 1
    if 'some_college' in variable:
        df_regression_test_2_melt.iloc[i, 5] = 1
    if 'bachelor' in variable:
        df_regression_test_2_melt.iloc[i, 6] = 1
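
Setting cells one at a time with .iloc can be slow on a table with over 100,000 rows. As a possible alternative (not the approach used above), the same keyword checks can be sketched with pandas' vectorized string methods. The DataFrame `melted` and its shortened variable names below are hypothetical; the keywords and output column names are taken from the loop above.

```python
import pandas as pd

# Toy stand-in for the melted table; these variable names are shortened and hypothetical
melted = pd.DataFrame({
    'variable': [
        'married-couple_did_not_graduate_high_school',
        'married-couple_high_school_graduate',
        'female-householder_some_college',
        'female-householder_bachelor',
    ]
})

# Keyword to search for -> categorical column to fill in
keyword_to_column = {
    'married': 'Married',
    'high_school_graduate': 'highest_ed_=_high_school_grad',
    'some_college': "highest_ed_=_some_college_or_associate's",
    'bachelor': "highest_ed_=_bachelor's_or_higher",
}

for keyword, column in keyword_to_column.items():
    # str.contains returns True/False for each row; astype(int) converts that to 1/0
    melted[column] = melted['variable'].str.contains(keyword, regex=False).astype(int)

print(melted)
```

Because each column is computed for all rows in one pass, this version loops only over the four keywords rather than over every row.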


In [36]:
df_regression_test_2_melt.iloc[0,1]

'Proportion_of_married-couple_families_below_the_poverty_level_where_householder_did_not_graduate_high_school'

In [37]:
df_regression_test_2_melt.rename(columns={'value':'proportion_below_poverty_level'},inplace=True)
df_regression_test_2_melt.to_csv('marriage_education_poverty_regression.csv') # Exports the melted table (with the categorical columns) that the regression below uses
df_regression_test_2_melt

Unnamed: 0,NAME,variable,proportion_below_poverty_level,Married,highest_ed_=_high_school_grad,highest_ed_=_some_college_or_associate's,highest_ed_=_bachelor's_or_higher
0,25311,Proportion_of_married-couple_families_below_th...,0.165289,1,0,0,0
1,25601,Proportion_of_married-couple_families_below_th...,0.233333,1,0,0,0
2,26726,Proportion_of_married-couple_families_below_th...,0.324675,1,0,0,0
3,26753,Proportion_of_married-couple_families_below_th...,0.000000,1,0,0,0
4,26757,Proportion_of_married-couple_families_below_th...,0.169231,1,0,0,0
...,...,...,...,...,...,...,...
110235,38237,Proportion_of_female-householder_families_belo...,0.451613,0,0,0,1
110236,38948,Proportion_of_female-householder_families_belo...,0.000000,0,0,0,1
110237,38016,Proportion_of_female-householder_families_belo...,0.008143,0,0,0,1
110238,38571,Proportion_of_female-householder_families_belo...,1.000000,0,0,0,1


With the table complete, I performed an ordinary least squares (OLS) regression that used proportion_below_poverty_level as the dependent variable and the family type and education level categorical variables as the independent variables.

In [38]:
y = df_regression_test_2_melt['proportion_below_poverty_level']
x_vars = df_regression_test_2_melt[['Married',
       'highest_ed_=_high_school_grad',
       'highest_ed_=_some_college_or_associate\'s',
       'highest_ed_=_bachelor\'s_or_higher']]
x_vars = sm.add_constant(x_vars) 
model = sm.OLS(y,x_vars)
results_2 = model.fit() 
results_2.summary()

Dep. Variable:,proportion_below_poverty_level,R-squared:,0.319
Model:,OLS,Adj. R-squared:,0.319
Method:,Least Squares,F-statistic:,12930.0
Date:,"Thu, 31 Mar 2022",Prob (F-statistic):,0.0
Time:,00:20:27,Log-Likelihood:,36769.0
No. Observations:,110240,AIC:,-73530.0
Df Residuals:,110235,BIC:,-73480.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3605,0.001,308.825,0.000,0.358,0.363
Married,-0.1882,0.001,-180.231,0.000,-0.190,-0.186
highest_ed_=_high_school_grad,-0.0840,0.001,-56.899,0.000,-0.087,-0.081
highest_ed_=_some_college_or_associate's,-0.1170,0.001,-79.211,0.000,-0.120,-0.114
highest_ed_=_bachelor's_or_higher,-0.2022,0.001,-136.929,0.000,-0.205,-0.199

Omnibus:,22013.82,Durbin-Watson:,1.755
Prob(Omnibus):,0.0,Jarque-Bera (JB):,60772.713
Skew:,1.069,Prob(JB):,0.0
Kurtosis:,5.943,Cond. No.,5.39
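
One way to read the coefficient table is to add the constant to whichever coefficients apply to a given group. The sketch below does this arithmetic with the values from the summary above. It assumes that rows with 0 in every categorical column (female-householder families whose householder did not graduate high school) form the baseline represented by the constant; the helper `predicted_proportion` and the shortened dictionary keys are hypothetical names introduced for illustration.

```python
# Coefficients copied from the regression summary (keys shortened for readability)
coefs = {
    'const': 0.3605,
    'Married': -0.1882,
    'high_school_grad': -0.0840,
    'some_college_or_associates': -0.1170,
    'bachelors_or_higher': -0.2022,
}

def predicted_proportion(married, education=None):
    """Sum the constant and whichever categorical coefficients apply."""
    total = coefs['const']
    if married:
        total += coefs['Married']
    if education is not None:
        total += coefs[education]
    return round(total, 4)

baseline = predicted_proportion(married=False)
married_hs = predicted_proportion(married=True, education='high_school_grad')
married_bachelors = predicted_proportion(married=True, education='bachelors_or_higher')

print("Baseline group:", baseline)
print("Married, high school graduate:", married_hs)
print("Married, bachelor's or higher:", married_bachelors)
```

Note that the last sum comes out slightly below zero. A linear model does not constrain its predictions to the 0 to 1 range, so these sums are best treated as rough comparisons between groups rather than as literal poverty rates.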


In [39]:
end_time = time.time()
run_time = end_time - start_time
run_minutes = run_time // 60
run_seconds = run_time % 60
print("Completed run at",time.ctime(end_time),"(local time)")
print("Total run time:",'{:.2f}'.format(run_time),"second(s) ("+str(run_minutes),"minute(s) and",'{:.2f}'.format(run_seconds),"second(s))") # Only valid when the program is run nonstop from start to finish

Completed run at Thu Mar 31 00:20:27 2022 (local time)
Total run time: 29.53 second(s) (0.0 minute(s) and 29.53 second(s))
