# Building Permit Data

## Documentation

[United States Census Bureau Building Permits Survey](https://www.census.gov/construction/bps/)

[ASCII files by State, Metropolitan Statistical Area (MSA), County or Place](https://www2.census.gov/econ/bps/)

[MSA Folder](https://www2.census.gov/econ/bps/Metro/)

[ASCII MSA Documentation](https://www2.census.gov/econ/bps/Documentation/msaasc.pdf)

In [1]:
import numpy as np
import pandas as pd

import re

import os.path
from os import path

from datetime import datetime

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer
from sklearn.cluster import KMeans

import wrangle as wr
import preprocessing as pr
import explore as ex
import model as mo

import warnings
warnings.filterwarnings("ignore")




In [2]:
pd.set_option("display.max_columns", None)

## Acquire

In [3]:
df = wr.acquire_building_permits()
print(f"Our DataFrame contains {df.shape[0]:,} observations and {df.shape[1]} features.")
df

Our DataFrame contains 14,149 observations and 29 features.


Unnamed: 0,survey_date,csa_code,cbsa_code,moncov,cbsa_name,one_unit_bldgs_est,one_unit_units_est,one_unit_value_est,two_units_bldgs_est,two_units_units_est,two_units_value_est,three_to_four_units_bldgs_est,three_to_four_units_units_est,three_to_four_units_value_est,five_or_more_units_bldgs_est,five_or_more_units_units_est,five_or_more_units_value_est,one_unit_bldgs_rep,one_unit_units_rep,one_unit_value_rep,two_units_bldgs_rep,two_units_units_rep,two_units_value_rep,three_to_four_units_bldgs_rep,three_to_four_units_units_rep,three_to_four_units_value_rep,five_or_more_units_bldgs_rep,five_or_more_units_units_rep,five_or_more_units_value_rep
0,2019,104.0,10580.0,True,Albany-Schenectady-Troy NY,1120.0,1120.0,309397.0,20.0,40.0,7644.0,12.0,45.0,6074.0,48.0,665.0,60456.0,984.0,984.0,268946.0,18.0,36.0,6544.0,12.0,45.0,6074.0,34.0,580.0,56469.0
1,2019,430.0,48260.0,False,Weirton-Steubenville WV-OH,25.0,25.0,5782.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,25.0,5782.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2019,999.0,10180.0,False,Abilene TX,354.0,354.0,72824.0,8.0,16.0,2093.0,0.0,0.0,0.0,0.0,0.0,0.0,353.0,353.0,72596.0,8.0,16.0,2093.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2019,566.0,49660.0,False,Youngstown-Warren-Boardman OH-PA,323.0,323.0,73182.0,2.0,4.0,407.0,1.0,3.0,467.0,0.0,0.0,0.0,234.0,234.0,50054.0,2.0,4.0,407.0,1.0,3.0,467.0,0.0,0.0,0.0
4,2019,558.0,48700.0,False,Williamsport PA,66.0,66.0,16215.0,6.0,12.0,1610.0,0.0,0.0,0.0,0.0,0.0,0.0,49.0,49.0,12095.0,6.0,12.0,1610.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14144,1980,5745.0,9999.0,False,NORTHEAST PENNSYLVANIA SMSA,1146.0,1146.0,42642.0,3.0,6.0,91.0,6.0,23.0,440.0,5.0,627.0,15798.0,1055.0,1055.0,39843.0,3.0,6.0,91.0,6.0,23.0,440.0,5.0,627.0,15798.0
14145,1980,5720.0,9999.0,False,NORFOLK-VIRGINIA BEACH-,2806.0,2806.0,146250.0,110.0,220.0,6606.0,61.0,231.0,7896.0,201.0,1521.0,44621.0,2806.0,2806.0,146250.0,110.0,220.0,6606.0,61.0,231.0,7896.0,201.0,1521.0,44621.0
14146,1980,5680.0,9999.0,False,NEWPORT NEWS-HAMPTON SMSA,1435.0,1435.0,65952.0,2.0,4.0,30.0,0.0,0.0,0.0,25.0,192.0,3146.0,1435.0,1435.0,65952.0,2.0,4.0,30.0,0.0,0.0,0.0,25.0,192.0,3146.0
14147,1980,5640.0,9999.0,False,NEWARK SMSA,2156.0,2156.0,137423.0,59.0,118.0,3407.0,13.0,47.0,1927.0,34.0,1353.0,40279.0,2154.0,2154.0,137349.0,59.0,118.0,3407.0,13.0,47.0,1927.0,32.0,1343.0,40208.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14149 entries, 0 to 14148
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   survey_date                    14149 non-null  int64  
 1   csa_code                       14149 non-null  float64
 2   cbsa_code                      14149 non-null  float64
 3   moncov                         14149 non-null  bool   
 4   cbsa_name                      14149 non-null  object 
 5   one_unit_bldgs_est             14149 non-null  float64
 6   one_unit_units_est             14149 non-null  float64
 7   one_unit_value_est             14149 non-null  float64
 8   two_units_bldgs_est            14149 non-null  float64
 9   two_units_units_est            14149 non-null  float64
 10  two_units_value_est            14149 non-null  float64
 11  three_to_four_units_bldgs_est  14149 non-null  float64
 12  three_to_four_units_units_est  14149 non-null 
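
The `df.info()` listing shows `csa_code` and `cbsa_code` stored as `float64`, although the rows printed earlier only contain whole-number codes (for example `104.0` and `10580.0`). A minimal sketch of casting them to integers follows, assuming every value in those columns really is a whole number. The `sample` frame is a hypothetical stand-in built from three of the displayed rows, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; values copied from the first three displayed rows.
sample = pd.DataFrame({
    "csa_code": [104.0, 430.0, 999.0],
    "cbsa_code": [10580.0, 48260.0, 10180.0],
})

code_columns = ["csa_code", "cbsa_code"]

# Guard: only cast when every code is a whole number, so nothing is silently truncated.
all_whole = bool((sample[code_columns] % 1 == 0).all().all())
if all_whole:
    sample = sample.astype({column: "int64" for column in code_columns})

print(sample.dtypes)
```

If the check fails on the real data, the columns are better left as floats until the fractional or missing values are understood.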

In [5]:
print(f"There are {df.cbsa_name.nunique():,} unique metropolitan areas in the DataFrame.")

There are 2,639 unique metropolitan areas in the DataFrame.


In [6]:
print(f"This DataFrame contains survey data from {df.survey_date.min()} through {df.survey_date.max()}.")

This DataFrame contains survey data from 1980 through 2019.


In [7]:
df.cbsa_name.head()

0          Albany-Schenectady-Troy  NY 
1          Weirton-Steubenville  WV-OH 
2                          Abilene  TX 
3    Youngstown-Warren-Boardman  OH-PA 
4                     Williamsport  PA 
Name: cbsa_name, dtype: object

In [8]:
df.cbsa_name.tail()

14144    NORTHEAST PENNSYLVANIA SMSA
14145        NORFOLK-VIRGINIA BEACH-
14146      NEWPORT NEWS-HAMPTON SMSA
14147                    NEWARK SMSA
14148                  ROCKFORD SMSA
Name: cbsa_name, dtype: object
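
The `head()` and `tail()` outputs show that `cbsa_name` is formatted inconsistently across survey years: recent rows are mixed case with doubled and trailing spaces, while the 1980 rows are upper case, often end in `SMSA` (Standard Metropolitan Statistical Area), and in one case end in a dangling hyphen. Because the unique-area count is taken over raw strings, differences like these can inflate it. The following is a minimal sketch of one possible normalization. The helper name `normalize_cbsa_name` and the specific rules are assumptions for illustration, not the cleaning performed by the `wrangle` module.

```python
import re

def normalize_cbsa_name(name: str) -> str:
    """Hypothetical helper: tidy one raw cbsa_name string."""
    # Collapse runs of whitespace and trim both ends.
    cleaned = re.sub(r"\s+", " ", name).strip()
    # Drop a trailing "SMSA" tag (assumed to carry no place information).
    cleaned = re.sub(r"\s*SMSA$", "", cleaned, flags=re.IGNORECASE)
    # Remove a dangling hyphen; this does not recover the truncated part of the name.
    cleaned = cleaned.rstrip("- ")
    # Use one case so older and newer rows compare equal.
    return cleaned.upper()

# Raw values copied from the head() and tail() outputs.
raw_names = [
    "Albany-Schenectady-Troy  NY ",
    "NORFOLK-VIRGINIA BEACH-",
    "NEWARK SMSA",
]
normalized = [normalize_cbsa_name(name) for name in raw_names]
print(normalized)
```

On the full DataFrame the same function could be applied with `df["cbsa_name"].map(normalize_cbsa_name)`. Note that the older names carry no state abbreviation, so normalization alone will not match them to their modern equivalents.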