# Authored by Noah Tamminga (ntamm@umich.edu).

# County Permits Extraction


This file extracts both the imputed and reported permits for all counties listed on the US Census site for the last 25 years. This is specifically constructed for the monthly files, but could easily be adapted to extract the YTD files as well for validation of any adjustments.

However, YTD files would need extra extraction steps. Specifically, the prior month's YTD values would need to be subtracted from the newest YTD file to recover the monthly value.

One possible approach for generating the YTD files on a monthly basis would be to add a join step where the prior file is stored in a dataframe and joined to the next extracted dataframe - if within the same year - and then the monthly difference gets calculated between the two files.
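The join-and-difference idea above could be sketched roughly as follows. This is a minimal sketch, not part of the notebook: it assumes each YTD extract is already loaded into a dataframe with numeric value columns and that fips_state plus fips_county identify a county. The function name ytd_to_monthly and the toy numbers are hypothetical.

```python
import pandas as pd

KEYS = ["fips_state", "fips_county"]  # assumed join keys for a county

def ytd_to_monthly(prior_ytd, current_ytd, value_cols):
    """Return monthly values as current YTD minus prior YTD (same year only)."""
    merged = current_ytd.merge(
        prior_ytd[KEYS + value_cols],
        on=KEYS,
        how="left",
        suffixes=("", "_prior"),
    )
    monthly = merged[KEYS].copy()
    for col in value_cols:
        # Counties missing from the prior file are treated as starting from zero
        monthly[col] = merged[col] - merged[col + "_prior"].fillna(0)
    return monthly

# Toy YTD frames for illustration only (not real permit counts)
jan = pd.DataFrame({"fips_state": [1, 1], "fips_county": [1, 3], "units": [8, 250]})
feb = pd.DataFrame({"fips_state": [1, 1], "fips_county": [1, 3], "units": [15, 470]})
feb_monthly = ytd_to_monthly(jan, feb, ["units"])
print(feb_monthly)
```

For January no subtraction is needed, since the YTD value is already the monthly value. The differenced result could then be compared against the monthly file as a validation check.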

### NOTES 

A 2025 subset is used in this notebook. Write statements are turned off in case there are issues with requests.

This file has issues running in Google Colab, apparently related to SSL verification. However, downloading the file and running it locally works fine.

In [1]:
import requests
import pandas as pd
from datetime import datetime
import time

In [2]:
# !pip install requests

### Extraction Setup

In the following sections we set up the necessary functions and the list of files to extract. We start by generating the range of files that we need to pull. With that list, we use the extract_transform_housing function to pull and parse the data; it calls the helper function clean_columns to tidy the column names read from the hyperlinked txt file.

In [3]:
#Ranges
start_year = 2025
start_month = 1
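#Latest file is assumed to lag the current month by two months; counting in total months handles the January/February rollover
now = datetime.now()
end_year, end_month = divmod(now.year * 12 + (now.month - 1) - 2, 12)
end_month += 1 #Check the census site to make sure the month aligns with the most recent file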

#Function to format year and month into 'YYMM' style for the census txt file hyperlink
def generate_yyyymm_range(start_year, start_month, end_year, end_month):
    yyyymm_list = []
    year = start_year
    month = start_month
    while (year < end_year) or (year == end_year and month <= end_month):
        yyyymm = f"co{str(year)[-2:]}{month:02}c.txt"
        yyyymm_list.append(yyyymm)
        #Increment month
        month += 1
        if month > 12:
            month = 1
            year += 1
    return yyyymm_list

#Generate all yymm codes / txt filenames
yyyymm_list = generate_yyyymm_range(start_year, start_month, end_year, end_month)
print(yyyymm_list[:5])
print(yyyymm_list[-5:])

['co2501c.txt', 'co2502c.txt', 'co2503c.txt', 'co2504c.txt']
['co2501c.txt', 'co2502c.txt', 'co2503c.txt', 'co2504c.txt']
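As an alternative to the manual month counter, the same filenames could be generated from a pandas period range. This is only a sketch: the function name filenames_from_periods is hypothetical, and it assumes the 'coYYMMc.txt' naming pattern shown in the output above.

```python
import pandas as pd

def filenames_from_periods(start, end):
    """Build 'coYYMMc.txt' names for every month from start to end inclusive."""
    months = pd.period_range(start=start, end=end, freq="M")
    return [f"co{p.strftime('%y%m')}c.txt" for p in months]

names = filenames_from_periods("2025-01", "2025-04")
print(names)
```

The period range handles the year rollover itself, so no explicit month reset is needed.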


In [4]:

def clean_columns(new_columns):

    adjusted_cols = []
    unit_var = ''
    count = 0
    for col in new_columns:
        if col.startswith('_B') and unit_var == '':
            unit_var = '1'
        elif col.startswith('_B') and unit_var == '1':
            unit_var = '2'
        elif col.startswith('_B') and unit_var == '2':
            unit_var = '3_4'
        elif col.startswith('_B') and unit_var == '3_4':
            unit_var = '5_plus'
        elif col.startswith('_B') and unit_var == '5_plus':
            unit_var = '1'

        if count < 18:
            if col[0] == '_':
                new_col = unit_var + '_unit_' + col[1:] + '_i'
                adjusted_cols.append(new_col.lower())
            elif col[0] in ['1', '2', '3', '5']:
                new_col = unit_var + '_unit_' + col[-5:] + '_i'
                adjusted_cols.append(new_col.lower())
            elif col == 'nan':
                new_col = unit_var + '_unit_' + 'value' + '_i'
                adjusted_cols.append(new_col.lower())
            else:
                new_col = col
                adjusted_cols.append(new_col.lower())

        else:
            if col[0] == '_':
                new_col = unit_var + '_unit_' + col[1:] + '_r'
                adjusted_cols.append(new_col.lower())
            elif col[0] in ['1', '2', '3', '5']:
                new_col = unit_var + '_unit_' + col[-5:] + '_r'
                adjusted_cols.append(new_col.lower())
            elif col == 'nan':
                new_col = unit_var + '_unit_' + 'value' + '_r'
                adjusted_cols.append(new_col.lower())
            else:
                new_col = col
                adjusted_cols.append(new_col.lower())

        count += 1

    return adjusted_cols


def extract_transform_housing(filename):

    #URL based on filename
    url = f"https://www2.census.gov/econ/bps/County/{filename}"

    # print(url)

    #Get txt file data, retrying a limited number of times
    max_retries = 5
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            content = response.text
            break
        except requests.RequestException:
            #Census site occasionally has connection errors to the url used by requests
            if attempt == max_retries - 1:
                raise #Give up so a missing file (e.g. a 404) does not loop forever
            time.sleep(3) #Waiting if error

    #Make dataframe based on txt split
    df = pd.DataFrame([line.split(',') for line in content.strip().split('\n')])

    #Fix Beginning Columns
    merge_columns = df.iloc[0].str.strip() + '_' + df.loc[1].str.strip()

    #List of columns, clean them, and update them
    new_columns = [str(col) for col in merge_columns]
    df_parsed = df.drop(index=[0,1,2]).reset_index(drop=True)
    df_parsed.columns = new_columns
    adjusted_cols = clean_columns(new_columns)
    df_parsed.columns = adjusted_cols

    #Define imputed subset and assign it to df_imputed
    imputed = df_parsed.columns[:18]
    df_imputed = df_parsed[imputed]

    #Get primary fields plus latter section for df_reported
    df_reported = df_parsed.iloc[:, list(range(0, 6)) + list(range(18, 30))]


    return df_imputed, df_reported





In [5]:
imputed, reported = extract_transform_housing(yyyymm_list[0])

pd.set_option('display.max_columns', None)
imputed.head()

Unnamed: 0,survey_date,fips_state,fips_county,region_code,division_code,county_name,1_unit_bldgs_i,1_unit_units_i,1_unit_value_i,2_unit_bldgs_i,2_unit_units_i,2_unit_value_i,3_4_unit_bldgs_i,3_4_unit_units_i,3_4_unit_value_i,5_plus_unit_bldgs_i,5_plus_unit_units_i,5_plus_unit_value_i
0,202501,1,1,3,6,Autauga County ...,8,8,3130806,0,0,0,0,0,0,0,0,0
1,202501,1,3,3,6,Baldwin County ...,250,250,82062741,0,0,0,0,0,0,0,0,0
2,202501,1,5,3,6,Barbour County ...,0,0,0,0,0,0,0,0,0,0,0,0
3,202501,1,7,3,6,Bibb County ...,1,1,243000,0,0,0,0,0,0,0,0,0
4,202501,1,9,3,6,Blount County ...,1,1,151710,0,0,0,0,0,0,0,0,0


Now that we have our functions set up and our list to supply our function, we can loop through the list to extract the data.

In [6]:
imputed_df = pd.DataFrame()
reported_df = pd.DataFrame()

for file in yyyymm_list:
    # print(file)
    try:
        imputed, reported = extract_transform_housing(filename=file)
        imputed_df = pd.concat([imputed_df, imputed], ignore_index=True)
        reported_df = pd.concat([reported_df, reported], ignore_index=True)
    except Exception as e:
        print(f"Skipping {file}: {e}") #Report the failure instead of dropping the file silently
        continue
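When the full 25-year range is pulled, concatenating inside the loop copies the growing dataframe on every pass. One possible variation, sketched below, collects the pieces in lists, concatenates once at the end, and keeps a record of any files that failed. The names collect_frames and fake_extract are hypothetical; fake_extract only stands in for extract_transform_housing so the sketch runs without network access.

```python
import pandas as pd

def collect_frames(files, extract_fn):
    """Run extract_fn on each file and concatenate the results once at the end."""
    imputed_parts, reported_parts, failed = [], [], []
    for file in files:
        try:
            imputed, reported = extract_fn(file)
        except Exception as exc:
            failed.append((file, str(exc)))  # keep a record of what was skipped
            continue
        imputed_parts.append(imputed)
        reported_parts.append(reported)
    if imputed_parts:
        imputed_all = pd.concat(imputed_parts, ignore_index=True)
        reported_all = pd.concat(reported_parts, ignore_index=True)
    else:
        # pd.concat raises on an empty list, so fall back to empty frames
        imputed_all, reported_all = pd.DataFrame(), pd.DataFrame()
    return imputed_all, reported_all, failed

def fake_extract(filename):
    """Stand-in extractor for illustration; fails on one made-up filename."""
    if filename == "missing.txt":
        raise ValueError("file not found")
    frame = pd.DataFrame({"source": [filename]})
    return frame, frame

imputed_demo, reported_demo, failed_demo = collect_frames(
    ["co2501c.txt", "missing.txt", "co2502c.txt"], fake_extract
)
print(imputed_demo)
print(failed_demo)
```

In the notebook itself, the call would presumably be collect_frames(yyyymm_list, extract_transform_housing).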



In [7]:
imputed_df.tail()

Unnamed: 0,survey_date,fips_state,fips_county,region_code,division_code,county_name,1_unit_bldgs_i,1_unit_units_i,1_unit_value_i,2_unit_bldgs_i,2_unit_units_i,2_unit_value_i,3_4_unit_bldgs_i,3_4_unit_units_i,3_4_unit_value_i,5_plus_unit_bldgs_i,5_plus_unit_units_i,5_plus_unit_value_i
12075,202504,56,37,4,8,Sweetwater County ...,4,4,1363037,0,0,0,0,0,0,0,0,0
12076,202504,56,39,4,8,Teton County ...,9,9,24375544,0,0,0,1,3,200000,0,0,0
12077,202504,56,41,4,8,Uinta County ...,3,3,673000,0,0,0,0,0,0,0,0,0
12078,202504,56,43,4,8,Washakie County ...,0,0,0,0,0,0,0,0,0,0,0,0
12079,202504,56,45,4,8,Weston County ...,0,0,0,0,0,0,0,0,0,0,0,0


In [8]:
# imputed_df.to_parquet('data/imputed_permits.parquet.gzip',
#               compression='gzip')

# reported_df.to_parquet('data/reported_permits.parquet.gzip',
#               compression='gzip')
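Because the dataframes are built by splitting raw text on commas, every column is stored as a string. Before writing to parquet or doing arithmetic on the permit counts, it may help to convert the numeric columns. The sketch below assumes county_name is the only text column; the function name coerce_numeric and the sample rows are hypothetical.

```python
import pandas as pd

def coerce_numeric(df, text_cols=("county_name",)):
    """Return a copy with every column except text_cols converted to numbers."""
    out = df.copy()
    for col in out.columns:
        if col in text_cols:
            out[col] = out[col].str.strip()  # drop padding around names
        else:
            # Values that cannot be parsed become NaN rather than raising
            out[col] = pd.to_numeric(out[col].str.strip(), errors="coerce")
    return out

# Small string-typed sample shaped like the extracted data (illustrative only)
sample = pd.DataFrame({
    "survey_date": ["202501", "202501"],
    "county_name": ["Autauga County   ", "Baldwin County   "],
    "1_unit_units_i": [" 8", "250"],
})
typed = coerce_numeric(sample)
print(typed.dtypes)
```

If identifiers such as fips_state or fips_county should stay as text, they can be added to text_cols.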