# Authored by Noah Tamminga (ntamm@umich.edu).

# Permit Cleaning

This notebook takes the CSV files extracted from the US Census Bureau and reshapes the data into long format for easier analysis and visualization.

In [1]:
import pandas as pd
import numpy as np


In [2]:
imputed = pd.read_parquet('data/imputed_permits.parquet.gzip')

In [3]:
imputed.head()

Unnamed: 0.1,Unnamed: 0,survey_date,fips_state,fips_county,region_code,division_code,county_name,1_unit_bldgs_i,1_unit_units_i,1_unit_value_i,2_unit_bldgs_i,2_unit_units_i,2_unit_value_i,3_4_unit_bldgs_i,3_4_unit_units_i,3_4_unit_value_i,5_plus_unit_bldgs_i,5_plus_unit_units_i,5_plus_unit_value_i
0,0,200001,1,1,3,6,Autauga County,13,13,690525,0,0,0,0,0,0,0,0,0
1,1,200001,1,81,3,6,Lee County,28,28,3392260,1,2,73788,2,6,658488,1,27,1434000
2,2,200001,1,113,3,6,Russell County,2,2,253000,1,2,90000,0,0,0,0,0,0
3,3,200001,1,125,3,6,Tuscaloosa County,52,52,4889075,4,8,464774,0,0,0,0,0,0
4,4,200001,2,13,4,9,Aleutians East Borough,0,0,0,0,0,0,0,0,0,0,0,0


To aid in any potential visualizations and analyses, we need to reorganize our dataframe from its current wide format into a long format. Specifically, we want three measures: buildings, units, and value. The extra breakout by number of units is better suited to a separate categorical column. To accomplish this, we melt the dataframe, extract the categorical column from the melted column names, and pivot the data back so that each of the three measures becomes its own column.
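The "extract the categorical column" step relies on the naming pattern of the wide columns. As a small sketch, the snippet below applies the same right-split used in the next cell to a few column names taken from the table above. Reading the trailing `i` as an "imputed" marker is an assumption, and the column labels `num_units`, `measure`, and `suffix` are chosen here for illustration.

```python
import pandas as pd

# A few column names following the pattern of the wide permit table.
permit_cols = pd.Series([
    "1_unit_bldgs_i",
    "2_unit_units_i",
    "3_4_unit_value_i",
    "5_plus_unit_bldgs_i",
])

# Split from the right at most twice: the last piece is the trailing "i"
# (assumed to mark imputed data), the middle piece is the measure, and
# everything before it is the unit-count category.
parts = permit_cols.str.rsplit("_", n=2, expand=True)
parts.columns = ["num_units", "measure", "suffix"]

print(parts)
```

Splitting from the right keeps categories such as `3_4_unit` and `5_plus_unit` intact even though they contain underscores themselves.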

In [4]:
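cols = list(imputed.columns)
# The first six columns are treated as identifiers; everything after them is melted.
# This positional slice also puts county_name among the melted columns.
ids = cols[:6]
vals = cols[6:]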

imputed_melt = pd.melt(
    imputed,
    id_vars = ids,
    value_vars = vals,
    var_name = 'permit_metric',
    value_name = 'value'
)

# Split each melted column name from the right, e.g.
# '3_4_unit_bldgs_i' -> ['3_4_unit', 'bldgs', 'i'].
split_metric = imputed_melt['permit_metric'].str.rsplit('_', n=2, expand=True)

# The first piece is the unit-count category and the second is the measure.
imputed_melt['num_units'] = split_metric[0]
imputed_melt['measure'] = split_metric[1]
imputed_melt.drop(columns=['permit_metric'], inplace=True)

# Pivot the measures back into columns, keyed by the identifiers plus num_units.
ids.append('num_units')

imputed_final = imputed_melt.pivot(index=ids, columns='measure', values='value').reset_index()

# To join to other datasets, left-pad fips_state with zeros to 2 digits and
# fips_county with zeros to 3 digits, then concatenate them into a 5-digit fips code.
imputed_final['fips_state'] = imputed_final['fips_state'].astype(str).str.zfill(2)
imputed_final['fips_county'] = imputed_final['fips_county'].astype(str).str.zfill(3)
imputed_final['fips'] = imputed_final['fips_state'] + imputed_final['fips_county']

imputed_final.drop(columns=['fips_state', 'fips_county'], inplace=True)

imputed_final.head()

measure,Unnamed: 0,survey_date,region_code,division_code,num_units,bldgs,name,units,value,fips
0,0,200001,3,6,1_unit,13.0,,13.0,690525.0,1001
1,0,200001,3,6,2_unit,0.0,,0.0,0.0,1001
2,0,200001,3,6,3_4_unit,0.0,,0.0,0.0,1001
3,0,200001,3,6,5_plus_unit,0.0,,0.0,0.0,1001
4,0,200001,3,6,county,,Autauga County,,,1001
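The `name` column and the `county` rows in the output above appear because `county_name` is melted together with the permit columns and then split into `county` and `name`. As a minimal sketch of one alternative, the example below selects the identifier columns by name on a two-row toy frame copied from the head of the wide table, so `county_name` stays an identifier and the measures stay numeric. The variable names (`wide`, `id_cols`, `long_df`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy wide frame: two rows and a subset of columns from the head() shown earlier.
wide = pd.DataFrame({
    "survey_date": [200001, 200001],
    "fips_state": [1, 1],
    "fips_county": [1, 81],
    "county_name": ["Autauga County", "Lee County"],
    "1_unit_bldgs_i": [13, 28],
    "1_unit_units_i": [13, 28],
    "1_unit_value_i": [690525, 3392260],
    "2_unit_bldgs_i": [0, 1],
    "2_unit_units_i": [0, 2],
    "2_unit_value_i": [0, 73788],
})

# Choose identifiers by name rather than by position, so county_name is
# kept as an id column and only the permit columns are melted.
id_cols = ["survey_date", "fips_state", "fips_county", "county_name"]
value_cols = [c for c in wide.columns if c not in id_cols]

melted = wide.melt(
    id_vars=id_cols,
    value_vars=value_cols,
    var_name="permit_metric",
    value_name="value",
)

# Same right-split as in the notebook: category, measure, trailing suffix.
split = melted["permit_metric"].str.rsplit("_", n=2, expand=True)
melted["num_units"] = split[0]
melted["measure"] = split[1]

long_df = (
    melted.pivot(index=id_cols + ["num_units"], columns="measure", values="value")
    .reset_index()
)
long_df.columns.name = None  # drop the leftover "measure" label on the columns

print(long_df)
print(long_df.dtypes)
```

Because only numeric columns are melted in this sketch, `bldgs`, `units`, and `value` keep a numeric dtype instead of `object`, and each county contributes one row per unit category.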


Imputed data will be most useful for our analysis. The US Census Bureau applies a well-thought-out imputation methodology to fill in missing values. Overall, the methodology imputes a value based on geographic area, history, and type of housing. Specifically, it uses prior annual information for a given geographic area where reporting did occur, together with current information where reporting did occur, to calculate an imputation factor. That factor is then applied to non-reporting areas to impute the missing values.

Detailed modeling information can be found here: https://www.census.gov/construction/bps/methodology.html
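As a rough numeric illustration of the idea described above, and not the Census Bureau's actual procedure, the sketch below assumes the imputation factor is the ratio of current to prior-year totals among reporting areas, and that a non-reporting area's value is its prior-year value multiplied by that factor. The data, the column names, and the ratio form itself are all assumptions made for illustration; the methodology page linked above is the authoritative description.

```python
import pandas as pd

# Hypothetical data: four areas in one group; area D did not report this period.
areas = pd.DataFrame({
    "area": ["A", "B", "C", "D"],
    "prior_units": [100, 50, 80, 40],
    "current_units": [110.0, 60.0, 70.0, None],
})

reported = areas["current_units"].notna()

# Assumed factor: current total divided by prior total, over reporting areas only.
factor = (
    areas.loc[reported, "current_units"].sum()
    / areas.loc[reported, "prior_units"].sum()
)

# Keep reported values as they are; for non-reporting areas, scale the
# prior value by the factor.
areas["units_imputed"] = areas["current_units"].where(
    reported, areas["prior_units"] * factor
)

print(factor)
print(areas)
```

In this toy case the reporting areas went from 230 to 240 units, so the non-reporting area's prior value of 40 is scaled by 240/230.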


In [5]:
imputed_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1648070 entries, 0 to 1648069
Data columns (total 10 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Unnamed: 0     1648070 non-null  int64 
 1   survey_date    1648070 non-null  int64 
 2   region_code    1648070 non-null  int64 
 3   division_code  1648070 non-null  int64 
 4   num_units      1648070 non-null  object
 5   bldgs          1318456 non-null  object
 6   name           329614 non-null   object
 7   units          1318456 non-null  object
 8   value          1318456 non-null  object
 9   fips           1648070 non-null  object
dtypes: int64(4), object(6)
memory usage: 125.7+ MB
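Since `fips` is stored as a zero-padded string (the `object` dtype above), a join to another dataset only matches when the key on the other side is built the same way. The following is a minimal sketch with a hypothetical `population` table whose county code is an integer; the table, its column names, and its values are placeholders invented for illustration.

```python
import pandas as pd

# Small stand-in for the cleaned permit data, with a 5-character string key.
permits = pd.DataFrame({
    "fips": ["01001", "01081", "02013"],
    "units": [13, 28, 0],
})

# Hypothetical lookup table where the code is an integer, so leading zeros are lost.
population = pd.DataFrame({
    "fips_int": [1001, 1081, 2013],
    "population": [100, 200, 300],  # placeholder values
})

# Rebuild the 5-character key before joining.
population["fips"] = population["fips_int"].astype(str).str.zfill(5)

merged = permits.merge(
    population[["fips", "population"]],
    on="fips",
    how="left",
    validate="many_to_one",  # each fips should appear once in the lookup table
)

print(merged)
```

Joining the string key directly against the integer column would not line up the codes, which is why the key is normalized first.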


In [6]:
# imputed_final.to_parquet('data/imputed_permits_final.parquet.gzip',
#               compression='gzip')
