In [9]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Explore public HUD REAC data

Author: Jack Vandeleuv

Data files:
* [Public Housing Physical Inspection Scores (2016-2021)](https://www.huduser.gov/portal/datasets/pis.html#2021_data-collapse)
* [Multifamily Physical Inspection Scores (2016-2021)](https://www.huduser.gov/portal/datasets/pis.html#2021_data-collapse)

This notebook explores the publicly released scores from the U.S. Department of Housing and Urban Development's Real Estate Assessment Center (HUD REAC). These scores are derived from physical inspections of HUD properties.

The goal of this notebook is to aggregate REAC data on a city-by-city basis, so we'll focus on data relevant to that analysis.

# Loading and cleaning

In [10]:
seed = 538

Place the data files in the data directory and load into pandas.

In [11]:
WORKING_DIR = 'D:/Fire Project/data/'
MULTIFAMILY_FILES = [
    'multifamily_physical_inspection_scores_0321.xlsx',
    'multifamily_physical_inspection_scores_0620.xlsx',
    'multifamily-physical-inspection-scores-2016.xlsx',
    'multifamily-physical-inspection-scores-2018.xlsx',
    'multifamily-physical-inspection-scores-2019.xlsx'
]

PUBLIC_HOUSING_FILES = [
    'public_housing_physical_inspection_scores_0321.xlsx',
    'public_housing_physical_inspection_scores_0620.xlsx',
    'public-housing-physical-inspection-scores-2016.xlsx',
    'public-housing-physical-inspection-scores-2018.xlsx',
    'public-housing-physical-inspection-scores-2019.xlsx'
]

The zipcode column is named ZIP in some files and ZIPCODE in others. We'll rename every ZIPCODE column to ZIP for consistency.

In [12]:
multi_dfs = []
public_dfs = []

for i, public_file in enumerate(PUBLIC_HOUSING_FILES):
    public_df = pd.read_excel(WORKING_DIR + public_file)
    multi_df = pd.read_excel(WORKING_DIR + MULTIFAMILY_FILES[i])

    public_df = public_df.rename(columns={'ZIPCODE': 'ZIP'})
    multi_df = multi_df.rename(columns={'ZIPCODE': 'ZIP'})

    public_dfs.append(public_df)
    multi_dfs.append(multi_df)

multi = pd.concat(multi_dfs, axis=0)
public = pd.concat(public_dfs, axis=0)

FileNotFoundError: [Errno 2] No such file or directory: 'data/public_housing_physical_inspection_scores_0321.xlsx'
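The renaming step can be checked on toy data. This is a minimal sketch, assuming two small hypothetical frames that stand in for yearly files (the values are made up for illustration). It relies on the fact that `rename` ignores mapping keys that are not present, so applying it to every file is safe.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files; values are illustrative only.
older = pd.DataFrame({'INSPECTION_ID': [1, 2], 'ZIPCODE': [64106, 64108]})
newer = pd.DataFrame({'INSPECTION_ID': [3], 'ZIP': [64110]})

# rename() skips keys that are absent, so it is a no-op on `newer`.
frames = [df.rename(columns={'ZIPCODE': 'ZIP'}) for df in (older, newer)]

# After renaming, both frames share one ZIP column and concatenate cleanly.
combined = pd.concat(frames, axis=0)
print(combined.columns.tolist())
```

Without the rename, `pd.concat` would produce separate ZIP and ZIPCODE columns, each partly filled with nulls.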

Because the REAC dataset shows the latest inspection for each property, and we're merging data from multiple years, there will be a significant number of duplicates. We'll drop these. We'll define a duplicate as two inspections with the same INSPECTION_ID.

In [None]:
multi = multi[~multi.INSPECTION_ID.duplicated()]
public = public[~public.INSPECTION_ID.duplicated()]
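To make the deduplication rule concrete, here is a small sketch on hypothetical data (the `SOURCE` column and all values are invented for illustration). `duplicated()` flags every occurrence after the first, so the mask keeps whichever copy came first in the concatenated frame.

```python
import pandas as pd

# Hypothetical data: inspection 101 appears in two yearly files.
file_a = pd.DataFrame({'INSPECTION_ID': [101, 102], 'SOURCE': ['a', 'a']})
file_b = pd.DataFrame({'INSPECTION_ID': [101, 103], 'SOURCE': ['b', 'b']})
merged = pd.concat([file_a, file_b], axis=0)

# Keep only the first occurrence of each INSPECTION_ID.
deduped = merged[~merged.INSPECTION_ID.duplicated()]

# Equivalent spelling using drop_duplicates.
alt = merged.drop_duplicates(subset='INSPECTION_ID', keep='first')

print(deduped)
```

Because the first copy wins, the order of the file lists determines which file a duplicated inspection is taken from.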

In [None]:
multi.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63807 entries, 0 to 27876
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   INSPECTION_ID     63807 non-null  int64  
 1   PROPERTY_ID       63807 non-null  int64  
 2   PROPERTY_NAME     63807 non-null  object 
 3   ADDRESS           62624 non-null  object 
 4   CITY              63807 non-null  object 
 5   CBSA_NAME         58603 non-null  object 
 6   CBSA_CODE         63777 non-null  float64
 7   COUNTY_NAME       63780 non-null  object 
 8   COUNTY_CODE       63781 non-null  float64
 9   STATE_NAME        52362 non-null  object 
 10  STATE_CODE        63807 non-null  object 
 11  ZIP               63787 non-null  float64
 12  LATITUDE          63781 non-null  float64
 13  LONGITUDE         63781 non-null  float64
 14  LOCATION_QUALITY  63781 non-null  object 
 15  INSPECTION_SCORE  63807 non-null  int64  
 16  INSPECTION_DATE   63807 non-null  object

In [None]:
public.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16962 entries, 0 to 6778
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   INSPECTION_ID     16962 non-null  int64  
 1   DEVELOPMENT_ID    16962 non-null  object 
 2   DEVELOPMENT_NAME  16962 non-null  object 
 3   ADDRESS           6639 non-null   object 
 4   CITY              16962 non-null  object 
 5   CBSA_NAME         14634 non-null  object 
 6   CBSA_CODE         16956 non-null  float64
 7   COUNTY_NAME       16956 non-null  object 
 8   COUNTY_CODE       16956 non-null  float64
 9   STATE_NAME        13420 non-null  object 
 10  STATE_CODE        16958 non-null  object 
 11  ZIP               16960 non-null  float64
 12  LATITUDE          16956 non-null  float64
 13  LONGITUDE         16956 non-null  float64
 14  LOCATION_QUALITY  16960 non-null  object 
 15  PHA_CODE          16962 non-null  object 
 16  PHA_NAME          16962 non-null  object 

Let's also look at STATE_NAME, which has a lot of null values, alongside STATE_CODE.

In [None]:
multi.sample(n=5, random_state=seed).loc[:, ['STATE_NAME', 'STATE_CODE']]

      STATE_NAME STATE_CODE
7548          WV         54
16258         GA         13
3573          VA         51
22460        NaN         MA
4033          AL          1


In [None]:
public.sample(n=5, random_state=seed).loc[:, ['STATE_NAME', 'STATE_CODE']]

     STATE_NAME STATE_CODE
4103         MD         24
3385        NaN         IN
615          AL          1
3059         VA       51.0
6652        NaN         VA


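The two columns mix formats. When STATE_NAME is populated, it holds the two-letter abbreviation and STATE_CODE holds a numeric code. When STATE_NAME is null, the two-letter abbreviation appears in STATE_CODE instead. We'll combine these into a single STATE column containing only the two-letter abbreviations.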

In [None]:
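def combine_columns(row):
    """Return the two-letter state abbreviation for a row."""
    # STATE_NAME holds the abbreviation when present; otherwise it is in STATE_CODE.
    if pd.isnull(row['STATE_NAME']):
        return row['STATE_CODE']
    else:
        return row['STATE_NAME']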

public['STATE'] = public.apply(combine_columns, axis=1)
multi['STATE'] = multi.apply(combine_columns, axis=1)

public = public.drop(columns=['STATE_NAME', 'STATE_CODE'])
multi = multi.drop(columns=['STATE_NAME', 'STATE_CODE'])
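The row-wise `apply` works, but the same rule can be written as a single vectorized call. This is a minimal sketch on hypothetical rows that mirror the two layouts seen in the samples, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the two layouts: abbreviation in STATE_NAME
# with a numeric STATE_CODE, or null STATE_NAME with the abbreviation in STATE_CODE.
df = pd.DataFrame({
    'STATE_NAME': ['WV', np.nan, 'AL'],
    'STATE_CODE': [54, 'MA', 1],
})

# Take STATE_NAME where present; otherwise fall back to STATE_CODE.
df['STATE'] = df['STATE_NAME'].fillna(df['STATE_CODE'])

print(df['STATE'].tolist())
```

`fillna` aligns on the index, so this gives the same result as `combine_columns` and avoids a Python-level loop over tens of thousands of rows.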

Let's look again to see what nulls are left.

In [None]:
public.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16962 entries, 0 to 6778
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   INSPECTION_ID     16962 non-null  int64  
 1   DEVELOPMENT_ID    16962 non-null  object 
 2   DEVELOPMENT_NAME  16962 non-null  object 
 3   ADDRESS           6639 non-null   object 
 4   CITY              16962 non-null  object 
 5   CBSA_NAME         14634 non-null  object 
 6   CBSA_CODE         16956 non-null  float64
 7   COUNTY_NAME       16956 non-null  object 
 8   COUNTY_CODE       16956 non-null  float64
 9   ZIP               16960 non-null  float64
 10  LATITUDE          16956 non-null  float64
 11  LONGITUDE         16956 non-null  float64
 12  LOCATION_QUALITY  16960 non-null  object 
 13  PHA_CODE          16962 non-null  object 
 14  PHA_NAME          16962 non-null  object 
 15  INSPECTION_SCORE  16962 non-null  int64  
 16  INSPECTION_DATE   16962 non-null  object 

In [None]:
multi.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63807 entries, 0 to 27876
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   INSPECTION_ID     63807 non-null  int64  
 1   PROPERTY_ID       63807 non-null  int64  
 2   PROPERTY_NAME     63807 non-null  object 
 3   ADDRESS           62624 non-null  object 
 4   CITY              63807 non-null  object 
 5   CBSA_NAME         58603 non-null  object 
 6   CBSA_CODE         63777 non-null  float64
 7   COUNTY_NAME       63780 non-null  object 
 8   COUNTY_CODE       63781 non-null  float64
 9   ZIP               63787 non-null  float64
 10  LATITUDE          63781 non-null  float64
 11  LONGITUDE         63781 non-null  float64
 12  LOCATION_QUALITY  63781 non-null  object 
 13  INSPECTION_SCORE  63807 non-null  int64  
 14  INSPECTION_DATE   63807 non-null  object 
 15  FIPS_STATE_CODE   11445 non-null  float64
 16  STATE             63807 non-null  object

We have no nulls in CITY and STATE. Let's look at the remaining nulls in ZIP.

In [None]:
print(len(public[public.ZIP.isna()]))
print('Unique states represented among null zipcodes:', public[public.ZIP.isna()].STATE.unique())

2
Unique states represented among null zipcodes: ['PR']


In [None]:
print(len(multi[multi.ZIP.isna()]))
print('Unique states represented among null zipcodes:', multi[multi.ZIP.isna()].STATE.unique())

20
Unique states represented among null zipcodes: ['MA' 'PR' 'NJ' 'VA' 'GA' 'WY' 'TN' 'OH' 'CA' 'DC' 'UT']


Standardize the city names to be upper-case.

In [None]:
multi.CITY = multi.CITY.str.upper()
public.CITY = public.CITY.str.upper()

We have a small number of inspections without zipcode data. We'll leave these for now and see whether we can perform the spatial join without it.

We'll convert the datetimes to a standard format.

In [None]:
# Parse the date column into datetime64 values. The infer_datetime_format
# argument is deprecated in pandas 2.0 and is not needed.
public['INSPECTION_DATE'] = pd.to_datetime(public['INSPECTION_DATE'])
multi['INSPECTION_DATE'] = pd.to_datetime(multi['INSPECTION_DATE'])

Let's check whether any cities in the same state share a name. We'll check this using latitude and longitude, starting from the list of unique (CITY, STATE) pairs.

In [None]:
multi_citystate = multi.loc[:, ['CITY', 'STATE']]
multi_unique_locs = multi_citystate[~multi_citystate.duplicated()].values.tolist()

Let's look at the distribution of inspections by year.

In [None]:
multi.groupby(by=multi.INSPECTION_DATE.dt.year).size()

INSPECTION_DATE
2013     2949
2014     8542
2015    10762
2016     9104
2017     7471
2018    11856
2019    11286
2020     1835
2021        2
dtype: int64

In [None]:
public.groupby(by=public.INSPECTION_DATE.dt.year).size()

NameError: name 'public' is not defined

The bulk of inspections in the dataset occurred between 2013 and 2019, with a sharp drop-off in 2020, when the COVID-19 pandemic began.

Now we'll export our cleaned data from before 2019. (We'll reserve the remainder as a validation set.)

In [7]:
multi

NameError: name 'multi' is not defined

In [49]:
multi.to_csv('data/clean_agg_multi.csv', 
             sep=',', 
             index=False)
public.to_csv('data/clean_agg_public.csv', 
              sep=',',
              index=False)
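The text mentions keeping inspections from before 2019 and reserving the rest for validation. A minimal sketch of one way to make that split, assuming a cutoff of January 1, 2019 (the exact cutoff and the names `CUTOFF`, `train`, and `holdout` are assumptions, not taken from the notebook):

```python
import pandas as pd

# Hypothetical frame standing in for `multi`; values are illustrative only.
toy = pd.DataFrame({
    'INSPECTION_ID': [1, 2, 3],
    'INSPECTION_DATE': pd.to_datetime(['2017-05-01', '2018-12-31', '2019-03-15']),
})

CUTOFF = pd.Timestamp('2019-01-01')  # assumed boundary between the two sets

# Rows strictly before the cutoff go to the exported set; the rest are held out.
is_train = toy['INSPECTION_DATE'] < CUTOFF
train = toy[is_train]
holdout = toy[~is_train]

print(len(train), len(holdout))
```

Each subset could then be written with its own `to_csv` call, in the same way as the export cell.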

### Note: a few records have coordinates that do not match the listed address, or that are outliers relative to the minimum and maximum latitude/longitude of their city. We can remove these, although there are likely not many of them.

In general, a spread-out city like Kansas City, MO (KCMO) might span about 1 degree of latitude from north to south.

In [50]:
# max_lats = []
# max_longs = []

# for city, state in multi_unique_locs:
#     city_mask = multi.CITY == city
#     state_mask = multi.STATE == state
#     multi_sub = multi[city_mask & state_mask]
#     max_lats.append((city, state, multi_sub.LATITUDE.max() - multi_sub.LATITUDE.min()))
#     max_longs.append((city, state, multi_sub.LONGITUDE.max() - multi_sub.LONGITUDE.min()))
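The per-city loop can also be expressed as a single `groupby`. The sketch below uses a hypothetical toy frame (the city names, the state 'XX', and the coordinates are invented) and an assumed threshold of 1.0, based on the rule of thumb that a spread-out city covers roughly one unit of latitude.

```python
import pandas as pd

# Hypothetical records; the third SPRINGFIELD row has a suspicious latitude.
toy = pd.DataFrame({
    'CITY': ['SPRINGFIELD', 'SPRINGFIELD', 'SPRINGFIELD', 'SALEM'],
    'STATE': ['XX', 'XX', 'XX', 'XX'],
    'LATITUDE': [39.0, 39.1, 42.5, 44.9],
    'LONGITUDE': [-94.5, -94.6, -94.5, -123.0],
})

# Max-minus-min span of each coordinate within every (CITY, STATE) pair.
spans = (
    toy.groupby(['CITY', 'STATE'])[['LATITUDE', 'LONGITUDE']]
       .agg(lambda s: s.max() - s.min())
)
spans.columns = ['LAT_SPAN', 'LON_SPAN']
spans = spans.sort_values('LAT_SPAN', ascending=False)

THRESHOLD = 1.0  # assumed cutoff for flagging a city for manual review
suspects = spans[spans['LAT_SPAN'] > THRESHOLD]

print(suspects)
```

A large span can mean a miscoded coordinate, a miscoded city name, or two distinct places sharing one name, so flagged cities still need to be inspected by hand.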

In [51]:
# sorted(max_lats, key=lambda x: x[2], reverse=True)[:10]

The top value has an incorrect latitude and longitude, but the address is valid.

In [52]:
# city_mask = multi.CITY == 'WINCHESTER'
# state_mask = multi.STATE == 'VA'

# multi[city_mask & state_mask].sort_values(by='LATITUDE')

The top value is miscoded as being in Hamilton Township when it is not, which results in a large latitude spread for that city.

In [53]:
# city_mask = multi.CITY == 'HAMILTON TOWNSHIP'
# state_mask = multi.STATE == 'NJ'

# multi[city_mask & state_mask].sort_values(by='LATITUDE')

In [54]:
# sorted(max_longs, key=lambda x: x[2], reverse=True)[:5]