# Data Appendix 

The Census conducts the American Housing Survey (AHS), which is sponsored by the Department of Housing and Urban Development (HUD). The data files can be downloaded from their website at http://www.census.gov/programs-surveys/. The survey has been conducted in every odd-numbered year since 1973. The codebook for 1997-2013 can be found at http://www.census.gov/programs-surveys/ahs/data/2013. We use all of the odd-numbered years since 1985. Prior to 1985, the value of the home is a categorical variable, so we do not use those data. Download the flat files and place them into folders to which you then need to point the code supplementing this data appendix, found in the file named “[TBD]”.

We remove datapoints without an MSA identifier, as we will be building a panel by city. We remove units in housing projects, those with bars on the windows, and those that are rent stabilized. We remove datapoints missing data (such as tenure status, i.e., owner occupied or renter occupied). We further clean the observations before the hedonic regression as follows:
- Delete if the ratio of household income to house value is greater than 2. (This identifies data errors in the house value field.)
- Delete if the ratio of household income to annual rent is greater than 100. (This identifies data errors in the annual rent field.)
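
The two ratio filters can be sketched as follows. This is a minimal illustration rather than the code used for the paper: it assumes that `ZINC2` holds household income, `VALUE` the house value, and `RENT` the monthly rent (annualized here), and the function name `drop_ratio_outliers` is hypothetical.

```python
import pandas as pd

def drop_ratio_outliers(df, max_income_to_value=2.0, max_income_to_rent=100.0):
    """Drop rows whose income-to-value or income-to-annual-rent ratio is implausible.

    Assumed columns (illustrative): ZINC2 = household income, VALUE = house value,
    RENT = monthly rent. A non-positive VALUE or RENT is treated as "not applicable".
    """
    has_value = df["VALUE"] > 0
    has_rent = df["RENT"] > 0
    annual_rent = df["RENT"] * 12.0

    # where(...) replaces non-applicable entries with NaN, so the comparison is False there
    bad_value = (df["ZINC2"] / df["VALUE"].where(has_value)) > max_income_to_value
    bad_rent = (df["ZINC2"] / annual_rent.where(has_rent)) > max_income_to_rent
    return df.loc[~(bad_value | bad_rent)]

example = pd.DataFrame({
    "ZINC2": [42000, 900000, 15000, 4000000],
    "VALUE": [75000, 100000, -9, -9],
    "RENT":  [-9, -9, 425, 500],
})
cleaned = drop_ratio_outliers(example)
print(cleaned)
```

In this toy input the second row fails the income-to-value test and the fourth row fails the income-to-rent test, so only the first and third rows are kept.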

Throughout the rest of the analysis, we restrict the sample to the top 30 cities by number of datapoints (after cleaning) in 1985, the first year in the sample.
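
A minimal sketch of this restriction is shown below. It assumes the city identifier is the `SMSA` column and the survey year is the `year` column, as in the data printed later in the notebook; the helper names `top_cities` and `restrict_to_top_cities` are hypothetical.

```python
import pandas as pd

def top_cities(df, base_year=1985, n=30):
    """Return the n SMSA codes with the most observations in base_year."""
    counts = df.loc[df["year"] == base_year, "SMSA"].value_counts()
    return counts.nlargest(n).index.tolist()

def restrict_to_top_cities(df, base_year=1985, n=30):
    """Keep rows from every year whose SMSA is among the top n in base_year."""
    keep = top_cities(df, base_year, n)
    return df.loc[df["SMSA"].isin(keep)]

example = pd.DataFrame({
    "year": [1985, 1985, 1985, 1985, 1987, 1987],
    "SMSA": [1680.0, 1680.0, 520.0, 1120.0, 520.0, 1680.0],
})
panel = restrict_to_top_cities(example, n=1)
print(panel)
```

The ranking is computed once, on the base year only, and then applied to all years so that the set of cities in the panel stays fixed.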

# Python Code Implementation

Load the Necessary Modules for Analysis

In [27]:
import pandas as pd
import os
import numpy as np
from statsmodels import regression
#import matplotlib.pyplot as plt
import warnings
import net_yields_calculation as net_yield_calc
import non_parametric_median as npm
from openpyxl import load_workbook
import xlsxwriter
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

computer_name = os.getenv('COMPUTERNAME')
print(computer_name)

FINC-360-JJA


Create a class for cleaning the data

In [1]:
class Cleaning():
    
    def __init__(self):
        pass

    def check_col_float2(self,header_list, df):
        for col in header_list:
            try:
                df.loc[:,col] = df.loc[:,col].str.replace("\'","").str.strip().astype(float)
                
            except AttributeError:
                pass
            except KeyError:
                try:
                    df[col.upper()] = df[col.lower()].str.replace("\'","").str.strip().astype(float)
                except AttributeError:
                    df[col.upper()] = df[col.lower()].copy()
        return df
    
    
    
    def change_airsys_data(self,data):
        
        data.loc[(data['AIRSYS'] == 2) & (data['year'] < 2015), 'AIRSYS'] = 0.0
        data.loc[(data['AIRSYS'] == 12) & (data['year'] >= 2015), 'AIRSYS'] = 0.0
        data.loc[(data['AIRSYS'] != 12) & (data['year'] >= 2015), 'AIRSYS'] = 1.0
        
        return data
    
    def change_bath_data(self,data):
        for i in data.year.unique():
            if i >=2015 and i<=2017:
                data.loc[data.year==i,'BATHS'] = data.loc[data.year==i,'BATHS'].replace({7: \
                            0, 8:0, 9:0, 10:0, 11:0, 12:0, 13:0, 3:2,4:3,5:3,6:4 })
            else:
                data.loc[data.year==i,'BATHS'] = data.loc[data.year==i,'BATHS'].replace({5:4, \
                            6:4, 7: 4, 8:4, 9:4, 10:4, 11:4, 12:4, 13:4})
        return data
    
    def change_unit_type_data(self,data):
        data['Unit Type'] = 'None' 
        data.loc[data['CONDO'] == 1, 'Unit Type'] = 'Condo'
    
        data.loc[(data['year'] < 2015) & (data['NUNIT2'] == 1 ),'Unit Type'] = 'Detached'
        data.loc[(data['year'] < 2015) & (data['NUNIT2'] == 2 ),'Unit Type'] = 'Attached'
        
        
        data.loc[(data['year'] >= 2015) & (data['TYPE'] == 2 ),'Unit Type'] = 'Detached'
        data.loc[(data['year'] >= 2015) & (data['TYPE'] == 3 ),'Unit Type'] = 'Attached'
        
        
        return data
    
    def get_city_names(self,data, smsa_dict):
        data['CITY'] = ''
        for i in smsa_dict.keys():
            data.loc[data['SMSA'] == float(i), 'CITY'] = smsa_dict[str(i)]
        return data
        
        
    
    def change_build_data(self,data):
        
        for i in data.year.unique():
            data.loc[data.year==i,'BUILT'] = data.loc[data.year==i,'BUILT'].replace({80: 1980,
                    81:1981,82:1982,83:1983,84:1984,85:1985,86:1986,87:1987,88:1988,89:1989,
                    90:1990,91:1991,92:1992,93:1993,94:1994,95:1995})
                 
        for i in data.year.unique():
            if i >=1985 and i<=1995:
                data.loc[data.year==i,'BUILT'] = data.loc[data.year==i,'BUILT'].replace({9: 1919,
                        8:1920,7:1930,6:1940,5:1950,4:1960,3:1970,2:1970,1:1970})
        
        for i in data.year.unique():
            data.loc[data.year==i,'BUILT']=data.loc[data.year==i,'BUILT'].replace({1919:1910,
                 1981:1980,1982:1980, 1983:1980,1984:1980,1986:1985,1987:1985,1988:1985,1989:1985,
                 1991:1990,1992:1990, 1993:1990,1994:1990,1996:1995,1997:1995,1998:1995,1999:1995,
                 2001:2000,2002:2000, 2003:2000,2004:2000,2006:2005,2007:2005,2008:2005,2009:2005,
                 2011:2010,2012:2010, 2013:2010,2014:2010})
        """
        for i in data.year.unique():
            data.loc[data.year==i,'BUILT']=(np.floor(data.loc[data.year==i,'BUILT']/10.0)*\
            10.0).astype(int)
        """
                    
            
        for i in data.year.unique():
            if i >=2015 and i<=2017:
                data.loc[data.year==i,'AIRSYS'] = data.loc[data.year==i,'AIRSYS'].replace({2: \
                                    1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1,12:2})
                 
        for i in data.year.unique():
            if i >=2015 and i<=2017:
                data.loc[data.year==i,'BATHS'] = data.loc[data.year==i,'BATHS'].replace({7: 0, \
                    8:0, 9:0, 10:0, 11:0, 12:0, 13:0, 3:2,4:3,5:3,6:4 })
            else:
                data.loc[data.year==i,'BATHS'] = data.loc[data.year==i,'BATHS'].replace({5:4, \
                                    6:4, 7: 4, 8:4, 9:4, 10:4, 11:4, 12:4, 13:4})
        ##modify Ebar data
        for i in data.year.unique():
            if i<=1995:
                data.loc[data.year==i,'EBAR']=data.loc[data.year==i,'EBAR'].replace({0:2})
                
        return data
    
    def append_df_to_excel(self,filename, df, sheet_name='Sheet1', startrow=None,\
                           truncate_sheet=False,**to_excel_kwargs):
        
            # ignore [engine] parameter if it was passed
            if 'engine' in to_excel_kwargs:
                to_excel_kwargs.pop('engine')
        
            writer = pd.ExcelWriter(filename, engine='openpyxl')
        
        
        
            try:
                # try to open an existing workbook
                writer.book = load_workbook(filename)
        
                # get the last row in the existing Excel sheet
                # if it was not specified explicitly
                if startrow is None and sheet_name in writer.book.sheetnames:
                    startrow = writer.book[sheet_name].max_row
        
                # truncate sheet
                if truncate_sheet and sheet_name in writer.book.sheetnames:
                    # index of [sheet_name] sheet
                    idx = writer.book.sheetnames.index(sheet_name)
                    # remove [sheet_name]
                    writer.book.remove(writer.book.worksheets[idx])
                    # create an empty sheet [sheet_name] using old index
                    writer.book.create_sheet(sheet_name, idx)
        
                # copy existing sheets
                writer.sheets = {ws.title:ws for ws in writer.book.worksheets}
            except FileNotFoundError:
                # file does not exist yet, we will create it
                pass
        
            if startrow is None:
                startrow = 0
        
            # write out the new sheet
            df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)
        
            # save the workbook
            writer.save()
            
            
    def get_hist(self, data, filter_code, bincount, name):
            fig, ax = plt.subplots(len(data.year.value_counts()),1, figsize=(20, 60))
            count = 0
            for i in data.year.unique():
                data.loc[data.year == i, filter_code].hist(bins = bincount,ax=ax[count])
                ax[count].set_title(filter_code + ' ' + str(i))
                count = count + 1
            fig.savefig(name)        
        
        
         
    
    def get_distribution(self,data, filter_code, sheetname, rowloc, title, bins=None):
        dct = {}
        
        for i in data.year.unique():
                dct[i] = data.loc[data.year == i, filter_code].value_counts()
                
        dct = pd.DataFrame(dct)
        title_df = pd.DataFrame([title])
        try:
            self.append_df_to_excel(r'T:\PhD Students\breakdown2.xlsx',title_df,sheet_name=sheetname, startrow=rowloc,\
                                    truncate_sheet=False, header = False)

            self.append_df_to_excel(r'T:\PhD Students\breakdown2.xlsx',dct,sheet_name=sheetname, startrow=rowloc+1,\
                                    truncate_sheet=False)
        except PermissionError:
            pass
        
        return dct


    def change_rent_data(self, hedonic_model_data):
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['RENT'] == 1))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['RENT'] == 999))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['RENT'] == 9999))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']> 1995) & \
                                                      (hedonic_model_data['RENT'] == 9999))]
        hedonic_model_data.loc[:, 'RENT'] = hedonic_model_data.loc[:, 'RENT']*12.0
        return hedonic_model_data
    
    def clean_rooms(self, hedonic_model_data):
        #gets rid of the data with -6 and -9 response codes
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['ROOMS'] >= 1]
        #gets rid of the top coded values
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 2015) & \
                    (hedonic_model_data['year']<= 2017) &(hedonic_model_data['ROOMS'] >= 21))]
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']> 1995) & \
                    (hedonic_model_data['year']< 2015) &(hedonic_model_data['ROOMS'] >= 21))]
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 1985) & \
                    (hedonic_model_data['year']<= 1995) &(hedonic_model_data['ROOMS'] == 99))] 
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 1985) & \
                    (hedonic_model_data['year']<= 1993) &(hedonic_model_data['ROOMS'] >= 21))] 
        
        return hedonic_model_data
    
    def clean_built(self,hedonic_model_data):
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['BUILT'] >= 1]

        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['BUILT'] == 99))]
        # 'B' is a letter response code; it is quoted on the assumption that it is stored
        # as a string (the comparison never matches on a purely numeric column)
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['BUILT'] == 'B'))]
        return hedonic_model_data

        
    def clean_value(self, hedonic_model_data):
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['VALUE'] >= 1]
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['VALUE'] == 999999))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']<=1995) & \
                                                      (hedonic_model_data['VALUE'] == 250001))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['VALUE'] > 250000))]
        # letter and dot response codes, quoted on the assumption that they are stored as strings
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['VALUE'] == 'R'))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['VALUE'] == 'D'))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['VALUE'] == 'B'))]
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>1995) & \
                                                      (hedonic_model_data['VALUE'] == '.'))]
        return hedonic_model_data

    def clean_EBAR(self,hedonic_model_data):
        # filters the EBAR column (the original filtered VALUE, apparently a copy-paste slip)
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['EBAR'] >= 1]
        # "!=" replaces "~series == 9", which negated the column before comparing
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['EBAR'] != 9]
        # letter response codes, quoted on the assumption that they are stored as strings
        hedonic_model_data = hedonic_model_data.loc[~hedonic_model_data['EBAR'].isin(['R', 'D', 'B', 'M', 'N'])]
        return hedonic_model_data
    
    
    def clean_bdrms(self, hedonic_model_data):

        #gets rid of the data with -6 and -9 response codes
        hedonic_model_data = hedonic_model_data.loc[hedonic_model_data['BEDRMS'] >= 0]
        #gets rid of the top coded values
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 2015) & \
                    (hedonic_model_data['year']<= 2017) &(hedonic_model_data['BEDRMS'] == 10))]
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']> 1995) & \
                    (hedonic_model_data['year']< 2015) &(hedonic_model_data['BEDRMS'] == 10))]
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 1985) & \
                    (hedonic_model_data['year']<= 1995) &(hedonic_model_data['BEDRMS'] == 10))] 
        
        hedonic_model_data = hedonic_model_data.loc[~((hedonic_model_data['year']>= 1985) & \
                    (hedonic_model_data['year']<= 1995) &(hedonic_model_data['BEDRMS'] == 99))] 
        
        return hedonic_model_data


Create a class for calculations

In [29]:
class Calculations():
    
    def get_mean_var(self, data, varname):
        mean_yearly=pd.DataFrame(np.nan,index=data['year'].unique(), columns=range(1))
        for i in data['year'].unique():
            sample_year=data.loc[data['year']==i,:]
            mean_year=np.mean(sample_year[varname])
            mean_yearly.loc[i,:]=mean_year
        
        print(mean_yearly)        
        mean_yearly.plot(kind='bar')
        return mean_yearly
    

Instantiate both classes so that we can run the analysis

In [30]:
    c = Cleaning()
    calc = Calculations()

# Reading the Data

Read the cleaned files from a local directory and concatenate them into one data frame called hedonic_model_data for analysis. This data frame contains the relevant variables of interest.

In [31]:
    filepath = r'T:\PhD Students\ftp_data_census2'
    hedonic_model_data = pd.DataFrame()
    sample = {}
    for item in os.listdir(filepath):
        if item[0] == 'o':
            year = int(item.split(".")[0][1:])
            data = pd.read_hdf(os.path.join(filepath, item))
            
            if len(hedonic_model_data.index) == 0:
                hedonic_model_data = data.copy()
            else:
                hedonic_model_data = pd.concat([hedonic_model_data, data.copy()], axis = 0)
                

Print out the data to make sure that it was read correctly

In [32]:
print(hedonic_model_data.head(10))
all_data = hedonic_model_data.copy()

    year  VACANCY  RENT  ZINC2  EBAR  RCNTRL   VALUE NUNIT2  CONDO  TENURE  \
7   1985     -9.0    -9  42000   0.0    -9.0   75000      1    3.0     1.0   
14  1985     -9.0    -9  15000   0.0     2.0      -9      3    3.0     3.0   
17  1985      7.0    -9     -9   0.0    -9.0      -9      1    3.0    -9.0   
26  1985     -9.0   425   4100   0.0     2.0      -9      3    3.0     2.0   
27  1985     -9.0   565  33000   0.0     2.0      -9      3    3.0     2.0   
35  1985     -9.0    -9  35500   0.0    -9.0  105000      1    1.0     1.0   
36  1985     -9.0    -9     -9  -9.0    -9.0      -9     -9   -9.0    -9.0   
37  1985     -9.0   325  15000   0.0     2.0      -9      2    3.0     2.0   
42  1985      2.0   750     -9   0.0     2.0      -9      2    1.0    -9.0   
43  1985     -9.0    -9  73000   0.0    -9.0  120000      1    3.0     1.0   

      SMSA  ROOMS  BATHS  AIRSYS  TYPE  BUILT  BEDRMS  
7   1680.0      5    1.0     2.0   1.0     84       3  
14   520.0      5    2.0     

## Data Selection

We will also write our results to a local Excel file with the following code.

In [33]:
    if computer_name == 'FINC-360-JJA':

        if os.path.exists(r'T:\PhD Students\breakdown2.xlsx'):
            os.remove(r'T:\PhD Students\breakdown2.xlsx')

        #we will write our results to a local excel file called breakdown2
        workbook = xlsxwriter.Workbook(r'T:\PhD Students\breakdown2.xlsx')
        workbook.close()
    else:
        if os.path.exists(r'T:\PhD Students\breakdown.xlsx'):
            os.remove(r'T:\PhD Students\breakdown.xlsx')

        #we will write our results to a local excel file called breakdown
        workbook = xlsxwriter.Workbook(r'T:\PhD Students\breakdown.xlsx')
        workbook.close()    

<b>Clean the data for rooms</b>

In [34]:
    rooms = c.get_distribution(all_data, 'ROOMS','ROOMS',1, title = 'All Data' + 'ROOMS')
    print(rooms)

        1985     1987     1989     1991     1993     1995    1997    1999  \
-9    2261.0   1787.0   2689.0   2468.0   2908.0   4459.0     NaN     NaN   
-6       NaN      NaN      NaN      NaN      NaN      NaN  5383.0  6977.0   
 1     947.0    614.0    716.0    637.0    729.0    490.0   116.0   206.0   
 2    1612.0   1033.0   1148.0   1052.0   1034.0    833.0   347.0   500.0   
 3   10482.0   7617.0   7991.0   7780.0   7336.0   6220.0  3018.0  3493.0   
 4   14917.0  12123.0  12367.0  12568.0  11415.0  10350.0  4527.0  5446.0   
 5   14723.0  11482.0  12472.0  12819.0  11645.0  11232.0  4012.0  5467.0   
 6   12847.0  10021.0  11185.0  11208.0  10491.0  10721.0  3763.0  5190.0   
 7    8189.0   6348.0   7013.0   7289.0   6957.0   7132.0  2503.0  3043.0   
 8    4847.0   4066.0   4243.0   4767.0   4562.0   4460.0  1545.0  1857.0   
 9    2481.0   1917.0   2053.0   2151.0   2348.0   2134.0   789.0   885.0   
 10    983.0    756.0    793.0    877.0    979.0    923.0   316.0   341.0   

For the years 1985-1995, a response code of -9 means not applicable. For the years 1997-2013, the not-applicable code is -6. The years 2015 and 2017 do not have a response code for not applicable. The variable is top coded at 21 rooms or more for 1985-1993 and for 1997-2013, and at 27 rooms or more for 2015 and 2017.

 hedonic_model_data = c.clean_rooms(hedonic_model_data)
 rooms = c.get_distribution(hedonic_model_data, 'ROOMS','ROOMS',1, title = 'All Data' + 'ROOMS')
 print(rooms)

Cleaned data output

In [35]:
rooms = c.clean_rooms(all_data)
rooms = c.get_distribution(rooms, 'ROOMS','ROOMS',1, title = 'All Data' + 'ROOMS')
print(rooms)


       1985     1987     1989     1991     1993     1995    1997    1999  \
1     947.0    614.0    716.0    637.0    729.0    490.0   116.0   206.0   
2    1612.0   1033.0   1148.0   1052.0   1034.0    833.0   347.0   500.0   
3   10482.0   7617.0   7991.0   7780.0   7336.0   6220.0  3018.0  3493.0   
4   14917.0  12123.0  12367.0  12568.0  11415.0  10350.0  4527.0  5446.0   
5   14723.0  11482.0  12472.0  12819.0  11645.0  11232.0  4012.0  5467.0   
6   12847.0  10021.0  11185.0  11208.0  10491.0  10721.0  3763.0  5190.0   
7    8189.0   6348.0   7013.0   7289.0   6957.0   7132.0  2503.0  3043.0   
8    4847.0   4066.0   4243.0   4767.0   4562.0   4460.0  1545.0  1857.0   
9    2481.0   1917.0   2053.0   2151.0   2348.0   2134.0   789.0   885.0   
10    983.0    756.0    793.0    877.0    979.0    923.0   316.0   341.0   
11    412.0    313.0    308.0    328.0    418.0    337.0   121.0   133.0   
12    150.0    132.0    111.0    171.0    161.0    159.0    59.0    69.0   
13     80.0 

<b>Clean the data for bedrms</b>

In [36]:
bedrms = c.get_distribution(all_data, 'BEDRMS','BEDRMS',1, title = 'All Data' + 'BEDRMS')
print(bedrms)

        1985     1987     1989     1991     1993     1995    1997    1999  \
-9    2261.0   1789.0   2689.0   2468.0   2908.0   4459.0     NaN     NaN   
-6       NaN      NaN      NaN      NaN      NaN      NaN  5383.0  6977.0   
 0    2035.0   1214.0   1501.0   1268.0   1402.0    936.0   159.0   469.0   
 1   13997.0  10418.0  11341.0  11186.0  10227.0   8893.0  3937.0  4674.0   
 2   23276.0  18335.0  19421.0  19681.0  18403.0  16595.0  6574.0  7985.0   
 3   23677.0  18792.0  20106.0  20718.0  19133.0  20023.0  7187.0  9368.0   
 4    7983.0   6382.0   6771.0   7335.0   7445.0   7164.0  2679.0  3374.0   
 5    1388.0   1101.0   1109.0   1284.0   1299.0   1207.0   484.0   680.0   
 6     278.0    203.0    200.0    211.0    218.0    197.0   144.0   113.0   
 7      57.0     44.0     42.0     45.0     49.0     40.0    16.0    24.0   
 8      12.0     30.0     12.0      9.0     11.0     14.0    23.0    16.0   
 9       2.0      2.0      1.0      4.0      5.0      7.0     3.0     7.0   

For the years 1985-1995, a response code of -9 means not applicable. For the years 1997-2013, the not-applicable code is -6. The values are top coded at 10. We also remove the response code 99 for 1985-1995.
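
The year-dependent rules above can also be written as a small lookup table. The sketch below is an alternative illustration, not the `clean_bdrms` method defined earlier: the table `BEDRMS_RULES` and the function `drop_coded_rows` are hypothetical names, and the year ranges are those stated in the preceding paragraph.

```python
import pandas as pd

# (first_year, last_year, codes to drop), taken from the description above
BEDRMS_RULES = [
    (1985, 1995, [-9, 99, 10]),   # not applicable, code 99, top code
    (1997, 2013, [-6, 10]),       # not applicable, top code
    (2015, 2017, [10]),           # top code only
]

def drop_coded_rows(df, column, rules):
    """Drop rows whose value in `column` is a listed code for that row's survey year."""
    drop = pd.Series(False, index=df.index)
    for first, last, codes in rules:
        in_range = df["year"].between(first, last)   # inclusive on both ends
        drop |= in_range & df[column].isin(codes)
    return df.loc[~drop]

example = pd.DataFrame({
    "year":   [1985, 1985, 1997, 1997, 2015, 1985],
    "BEDRMS": [-9,   3,    -6,   10,   2,    99],
})
cleaned = drop_coded_rows(example, "BEDRMS", BEDRMS_RULES)
print(cleaned)
```

Keeping the codes in one table makes it easier to compare the rules against the codebook for each survey year.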

Cleaned data output

In [37]:
bedrms = c.clean_bdrms(all_data)
bedrms = c.get_distribution(bedrms, 'BEDRMS','BEDRMS',1, title = 'All Data' + 'BEDRMS')
print(bedrms)


    1985   1987   1989   1991   1993   1995  1997  1999  2001  2003  2005  \
0   2035   1214   1501   1268   1402    936   159   469   334   444   367   
1  13997  10418  11341  11186  10227   8893  3937  4674  3691  4716  3733   
2  23276  18335  19421  19681  18403  16595  6574  7985  6297  8011  6280   
3  23677  18792  20106  20718  19133  20023  7187  9368  7554  9559  7520   
4   7983   6382   6771   7335   7445   7164  2679  3374  2828  3742  3000   
5   1388   1101   1109   1284   1299   1207   484   680   577   723   644   
6    278    203    200    211    218    197   144   113    96   144   102   
7     57     44     42     45     49     40    16    24    25    26    24   
8     12     30     12      9     11     14    23    16     8    12    12   
9      2      2      1      4      5      7     3     7     4     2     1   

    2007   2009     2011     2013     2015     2017  
0    475    453   1394.0   1147.0    633.0    641.0  
1   5581   4986  15932.0  15360.0   8724.0  

Apply the same cleaning steps to the data used in the hedonic model

In [38]:
hedonic_model_data = c.clean_bdrms(hedonic_model_data)
hedonic_model_data = c.clean_rooms(hedonic_model_data)
print(hedonic_model_data)


       year  VACANCY  RENT   ZINC2  EBAR  RCNTRL   VALUE NUNIT2  CONDO  \
7      1985     -9.0    -9   42000   0.0    -9.0   75000      1    3.0   
14     1985     -9.0    -9   15000   0.0     2.0      -9      3    3.0   
17     1985      7.0    -9      -9   0.0    -9.0      -9      1    3.0   
26     1985     -9.0   425    4100   0.0     2.0      -9      3    3.0   
27     1985     -9.0   565   33000   0.0     2.0      -9      3    3.0   
35     1985     -9.0    -9   35500   0.0    -9.0  105000      1    1.0   
37     1985     -9.0   325   15000   0.0     2.0      -9      2    3.0   
42     1985      2.0   750      -9   0.0     2.0      -9      2    1.0   
43     1985     -9.0    -9   73000   0.0    -9.0  120000      1    3.0   
44     1985     -9.0   650   25800   0.0    -9.0      -9      3    3.0   
47     1985     -9.0    -9   42200   0.0    -9.0   85000      2    1.0   
48     1985     -9.0    -9   65000   0.0    -9.0  160000      1    3.0   
52     1985     -9.0    -9   63000   0

##  Imputing rents with a hedonic model

## Aggregating rent-to-price ratios with nonparametric weights

We wish to find the median rent-to-price ratio of rental homes, but our dataset has rent-to-price ratios computed on owned homes. To account for sample selection in our dataset of rent-to-price ratios, we weight the owned homes in a city to the distribution of rental homes. (We do not use the alternate methodology, which is to estimate the hedonic coefficients on owned homes and then compute rent-to-price ratios on rental homes directly, because there are not enough rental homes for a meaningful sample in some city-year bins. Indeed, the very same cities that are less populated are the ones with a low ratio of rental homes.) This procedure is similar to the nonparametric approach used in Barsky et al. (2002).

Sample selection is an issue because rent-to-price ratios are decreasing in house prices (and in house rents), as discussed in the section on house level data in the main text.

For each city in each year, we re-weight the owner-occupied houses as follows. First, line up all the houses by predicted rent. Then bin by percentile of predicted rent. Next, determine the density of renter-occupied homes in the predicted rent space. Finally, compute the median rent-to-price ratio among owner-occupied homes, using the density of renter-occupied homes to take a weighted median.
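
The four steps above can be sketched for a single city-year as follows. This is one plausible reading of the procedure, not the `non_parametric_median` module imported earlier: the function name, the number of bins, and the choice to split each bin's renter share equally among the owners in that bin are all assumptions.

```python
import numpy as np
import pandas as pd

def renter_weighted_median(pred_rent, rent_to_price, is_renter, n_bins=10):
    """Median rent-to-price among owners, re-weighted to the renter distribution.

    pred_rent     : predicted rent for every home in one city-year
    rent_to_price : rent-to-price ratio (only the owner-occupied values are used)
    is_renter     : True for renter-occupied homes, False for owner-occupied homes
    """
    df = pd.DataFrame({"pred_rent": pred_rent, "rtp": rent_to_price,
                       "renter": np.asarray(is_renter, dtype=bool)})
    # Steps 1-2: line up all homes by predicted rent and assign percentile bins.
    rank = df["pred_rent"].rank(method="first")            # 1..n
    df["bin"] = ((rank - 1) * n_bins // len(df)).astype(int)
    # Step 3: density of renter-occupied homes across the bins.
    renter_share = df.loc[df["renter"], "bin"].value_counts(normalize=True)
    # Step 4: each owner's weight is its bin's renter share divided by the
    # number of owners in that bin (assumed equal split within a bin).
    owners = df.loc[~df["renter"]].copy()
    owners_in_bin = owners["bin"].map(owners["bin"].value_counts())
    owners["w"] = owners["bin"].map(renter_share).fillna(0.0) / owners_in_bin
    owners = owners.sort_values("rtp")
    cum = owners["w"].cumsum()
    if len(cum) == 0 or cum.iloc[-1] == 0:
        return np.nan                                      # no overlap between owners and renters
    return owners.loc[cum >= 0.5 * cum.iloc[-1], "rtp"].iloc[0]

pred_rent = [100, 200, 300, 400, 500, 600, 700, 800]
is_renter = [True, True, True, False, False, True, False, False]
rtp = [np.nan, np.nan, np.nan, 0.10, 0.08, np.nan, 0.06, 0.05]
weighted = renter_weighted_median(pred_rent, rtp, is_renter, n_bins=2)
print(weighted)
```

In this toy example three of the four renters sit in the low predicted-rent bin, so the single owner in that bin carries most of the weight and the weighted median is that owner's ratio, above the unweighted owner median.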

Note that relative to an unweighted median, this nonparametric procedure reduces the weight on expensive homes, which are the same homes for which the hedonic model has the largest errors (because it is estimated upon rental homes, which are likely to be smaller homes).

## Vacancy data

To compute net yields from gross yields, we need to know the percentage of rental homes that sit vacant. We can get this information from the AHS as well. We use the same dataset (including removing units in housing projects, those with bars on the windows, those that are rent stabilized, and those missing data). We label a home as a vacant rental if the survey identifies it as for rent only, for rent or for sale, or rented but not yet occupied. The vacancy rate is the ratio of this number to this number plus the number of renter-occupied homes. For those city-year cells without enough datapoints, we use a projection from the rest of the dataset.
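
The vacancy-rate definition above can be sketched as follows. The numeric codes are assumptions made for illustration and should be checked against the AHS codebook for each survey year; the constant and function names are hypothetical, and the projection for sparse city-year cells is not shown.

```python
import pandas as pd

# Assumed codes (hypothetical; verify against the codebook for each survey year).
VACANT_RENTAL_CODES = [1, 2, 4]   # for rent only; for rent or for sale; rented, not yet occupied
RENTER_OCCUPIED_CODE = 2          # value of TENURE for renter-occupied units

def vacancy_rate(df):
    """Vacant rentals / (vacant rentals + renter-occupied homes), per city-year."""
    flags = pd.DataFrame({
        "SMSA": df["SMSA"],
        "year": df["year"],
        "vacant": df["VACANCY"].isin(VACANT_RENTAL_CODES),
        "rented": df["TENURE"] == RENTER_OCCUPIED_CODE,
    })
    counts = flags.groupby(["SMSA", "year"])[["vacant", "rented"]].sum()
    # Cells with no rentals at all give 0/0, which pandas reports as NaN.
    return counts["vacant"] / (counts["vacant"] + counts["rented"])

example = pd.DataFrame({
    "SMSA":    [520.0, 520.0, 520.0, 520.0, 1680.0],
    "year":    [1985, 1985, 1985, 1985, 1985],
    "VACANCY": [2.0, -9.0, -9.0, -9.0, -9.0],
    "TENURE":  [-9.0, 2.0, 2.0, 2.0, 1.0],
})
rates = vacancy_rate(example)
print(rates)
```

City-year cells that come back as NaN, or that rest on very few observations, are the ones that would need the projection described above.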

## Tax Rates

We also need a panel of tax rates to compute net yields. Our sources are Emrath (2002) for 1990 and 2000 tax rates from Census data, and the National Association of Home Builders (NAHB) for 2005 to 2012 tax rates from ACS data. The tax rate data are available by state.

##  Interpolating missing years

As the survey is biennial and the tax rates are from the Census, we linearly interpolate the rent-to-price ratios, vacancy rates, and tax rates to even-numbered years and other missing years (in the case of the tax data).
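
A minimal sketch of the interpolation step, assuming each series is indexed by integer year; the function name `interpolate_years` and the numbers in the example are made up for illustration.

```python
import pandas as pd

def interpolate_years(series, first_year=None, last_year=None):
    """Linearly interpolate a year-indexed series onto every calendar year."""
    series = series.sort_index()
    first = series.index.min() if first_year is None else first_year
    last = series.index.max() if last_year is None else last_year
    full = series.reindex(range(first, last + 1))
    # method="index" uses the year values as the x-axis, so gaps of unequal
    # length (as in the tax data) are handled; limit_area="inside" avoids
    # extrapolating beyond the first and last observed years.
    return full.interpolate(method="index", limit_area="inside")

rent_to_price = pd.Series({1985: 0.080, 1987: 0.090, 1989: 0.070})
tax_rate = pd.Series({1990: 0.010, 2000: 0.020})

rtp_filled = interpolate_years(rent_to_price)
tax_filled = interpolate_years(tax_rate)
print(rtp_filled)
```

The same helper covers both cases: two-year gaps for the survey-based series and the longer gaps between tax-rate observations.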

## Net Yields

Starting from gross yields, we compute net yields using the following costs, some of which are expressed as a percentage of rent and some of which are a percentage of home value. We use expense ratios from Morgan Stanley, “The New Age of Buy-To-Rent,” July 31, 2013. Similar, but less comprehensive, assumptions appear in Bernanke (2012) “The US Housing Market: Current conditions and policy considerations.” The assumptions underlying CoreLogic’s Rental Trends, discussed below, are also broadly consistent with ours; however, some of their cost estimates rely on direct proprietary data rather than ratios of rent or house price.

- Insurance:  0.375% of price
- Repairs:  0.6% of price
- Capex:  1.15% of price
- Property manager:  5.9% of rent
- Credit loss:  0.73% of rent
- Tax: on price
- Vacancy:  on rent
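
The cost list above can be combined into a net yield as sketched below. The additive form is an assumption made for illustration (rent-based costs scale the gross yield, price-based costs are subtracted directly); it is not taken from the `net_yields_calculation` module imported earlier, and the function and constant names are hypothetical.

```python
# Expense ratios from the list above.
PRICE_COSTS = {"insurance": 0.00375, "repairs": 0.006, "capex": 0.0115}   # share of price
RENT_COSTS = {"property_manager": 0.059, "credit_loss": 0.0073}           # share of rent

def net_yield(gross_yield, tax_rate, vacancy_rate):
    """Gross yield less rent-based costs, less price-based costs.

    gross_yield  : annual rent divided by price
    tax_rate     : property tax as a share of price
    vacancy_rate : share of rent lost to vacancy
    """
    rent_share_lost = sum(RENT_COSTS.values()) + vacancy_rate
    price_share_lost = sum(PRICE_COSTS.values()) + tax_rate
    return gross_yield * (1.0 - rent_share_lost) - price_share_lost

# Made-up inputs: 10% gross yield, 1% tax rate, 5% vacancy rate.
example = net_yield(gross_yield=0.10, tax_rate=0.01, vacancy_rate=0.05)
print(round(example, 5))
```

With these made-up inputs, 11.63% of rent and 3.125% of price are lost to costs, so the 10% gross yield becomes a net yield of about 5.7%.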