# Table of Contents

* [Background](#Data-sources-background-information)
* [Data Import](#Data-import)


## TODO list:

* organize descriptions of files and links
* work on API / functionalized approach
* decide which month / quarter / year to focus on


## Setup notes

* set up conda env with Python 3.7: realty-market-app
* set up workspace in VS Code
* created a new Docker container, pg-realty, for Postgres from the official postgres image


## Data sources background information

| Data | Source | Purpose | Details | Key | Data name |
| :--- | :--- | :--- | :--- | :--- | :--- |
| House prices and rent prices | [Zillow](https://www.zillow.com/research/data/) | combined with rent prices, gives price-to-rent ratios | TBA | Zip Code | rent_prices |
| New construction | [Census data](https://www.census.gov/construction/nrc/index.html) | prices are a lagging indicator; this should be a leading indicator | only high-level data :( maybe this is better: [Census surveys](https://www.census.gov/construction/bps/msaannual.html) | CBSA - crosswalk to Zip code | construction |
| Rental vacancy rates (Table 4) | [more Census](https://www.census.gov/housing/hvs/data/rates.html) | could be a proxy for rent prices, which are hard to find | covers the 75 largest metro areas | names of metro areas?? | vacancy |
| Sub-county population estimates | [Census pop data](https://www.census.gov/data/tables/time-series/demo/popest/2010s-total-cities-and-towns.html#ds) | population trends will indicate future rent demand | TBA | | population |


## Rental vacancy 

from here https://www.census.gov/housing/hvs/methodology/index.html

The CPS/HVS is administered by the Census Bureau using a probability selected sample of about 72,000 housing units, both occupied and vacant. The fieldwork is conducted during the calendar week that includes the 19th of the month. The questions refer to activities during the prior week; that is, the week that includes the 12th of the month. Households from all 50 states and the District of Columbia are in the survey for 4 consecutive months, out for 8, and then return for another 4 months before leaving the sample permanently. This design ensures a high degree of continuity from one month to the next (as well as over the year). The 4-8-4 sampling scheme has the added benefit of allowing the constant replenishment of the sample without excessive burden to respondents.
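The 4-8-4 rotation described above can be sketched in a few lines. This is only an illustration of the schedule; the function name and the convention of counting months from 0 at first entry are assumptions of this sketch, not anything from the Census documentation.

```python
def in_sample(months_since_entry: int) -> bool:
    """True if a housing unit is interviewed in this month of its rotation.

    Month 0 is the month the unit first enters the sample (assumed convention).
    In for 4 months (0-3), out for 8 (4-11), back in for 4 (12-15), then gone.
    """
    return 0 <= months_since_entry <= 3 or 12 <= months_since_entry <= 15


# Months in which a single unit is interviewed over its whole time in the sample.
interview_months = [m for m in range(16) if in_sample(m)]

# Every month in the first block has a counterpart exactly 12 months later,
# which is what gives the year-over-year continuity mentioned above.
year_over_year_pairs = [(m, m + 12) for m in range(4) if in_sample(m + 12)]
```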

The CPS/HVS questionnaire is a completely computerized document that is administered by Census Bureau field representatives across the country through both personal and telephone interviews.

Related documents listed on that page:

* CPS/HVS Interviewing Manual (PDF)
* 2019 CPS/HVS Design and Methodology Technical Paper (PDF)
* 2006 CPS/HVS Design and Methodology Technical Paper (PDF)

## Zillow rent prices

https://www.zillow.com/research/methodology-zori-repeat-rent-27092/

Zillow Observed Rent Index (ZORI): A smoothed measure of the typical observed market rate rent across a given region. ZORI is a repeat-rent index that is weighted to the rental housing stock to ensure representativeness across the entire market, not just those homes currently listed for-rent. The index is dollar-denominated by computing the mean of listed rents that fall into the 40th to 60th percentile range for all homes and apartments in a given region, which is once again weighted to reflect the rental housing stock. Details available in ZORI methodology.

What’s available to rent at any given time can change rapidly, and measures of median or average prices across time may not reflect actual market-based movements in rent prices, but instead simply reflect the fact that certain unit types are available at different times. ZORI solves this challenge by calculating price differences for the same rental unit over time, then aggregating those differences across all properties repeatedly listed for rent on Zillow.

Once the index is computed, it is smoothed using a three-month exponentially weighted moving average. Prior to publication, both the raw and smoothed indices are checked against a set of heuristics based on statistics of the time series to flag potential data quality issues so they can be investigated and fixed, or a determination can be made not to publish the series.

To make the index more interpretable, we attach a dollar value to the latest data point in the series and use the index’s month-to-month changes to chain the dollar value back in time. The dollar amount is calculated by taking the mean of the middle 20% (the 40-to-60 percentile) of the asking rent for observations from the most recent month. Using the mean of the middle quintile instead of a straight median better captures small changes in the market, while also reducing noise. To correct for bias in list rents, we use the same weights described above to make the dollar-denominated amount representative of the market of available homes. 
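To make the dollar-denomination step concrete, here is a rough sketch of the two pieces described above: the mean of the middle quintile of asking rents, and chaining that dollar value back through the index. It is not Zillow's implementation. It ignores the housing-stock weights entirely, and the function names and toy numbers are made up for illustration.

```python
import numpy as np


def middle_quintile_mean(rents):
    """Unweighted mean of rents between the 40th and 60th percentiles."""
    rents = np.asarray(rents, dtype=float)
    lo, hi = np.percentile(rents, [40, 60])
    return rents[(rents >= lo) & (rents <= hi)].mean()


def chain_dollar_values(index_levels, latest_dollar_value):
    """Scale an index so its last point equals the latest dollar value.

    Earlier points keep the same month-to-month ratios as the index itself.
    """
    index_levels = np.asarray(index_levels, dtype=float)
    return latest_dollar_value * index_levels / index_levels[-1]


# Toy data (hypothetical): one outlier listing does not move the estimate much.
toy_rents = [1000, 1100, 1200, 1300, 5000]
typical_rent = middle_quintile_mean(toy_rents)

# Toy index levels for three months, anchored to a latest value of $1,560.
dollar_series = chain_dollar_values([100.0, 102.0, 104.0], 1560.0)
```

The outlier at 5000 would pull a plain mean up to 1920, while the middle-quintile mean stays at the central listing.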

Zillow neighborhoods: https://data.opendatasoft.com/explore/dataset/zillow-neighborhoods@public/information/

## Unemployment data 
Starting from the LAUS home page, scroll down to "Get Detailed Statistics," and then click on "Flat files FTP Site," or go to http://download.bls.gov/pub/time.series/la/. To understand the data provided in these FTP files, scroll down to the "la.txt" document, which provides information on the following: Time series, series file, data file, and mapping file definitions and relationships; Series file format and field definitions; Data file format and field definitions; Mapping file formats and field definitions; and a Data Element Dictionary. Other explanatory documents include "la.period" and "la.area.type," which define periods and area types, respectively.
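A possible way to read one of the LAUS data files once downloaded is sketched below. It assumes the tab-delimited layout that "la.txt" describes (columns `series_id`, `year`, `period`, `value`, `footnote_codes`, padded with spaces) and that period `M13` marks the annual average. Check both against "la.txt" and "la.period" before relying on this. The sample rows and their values are invented for the demo.

```python
import io

import pandas as pd


def read_laus_data(buffer):
    """Parse a LAUS data file into tidy monthly rows (assumed column layout)."""
    df = pd.read_csv(buffer, sep="\t", dtype=str)
    df.columns = df.columns.str.strip()
    for col in ["series_id", "year", "period", "value"]:
        df[col] = df[col].str.strip()
    # Drop annual averages (assumed to be coded as M13), keep months M01-M12.
    df = df[df["period"] != "M13"].copy()
    df["date"] = pd.to_datetime(df["year"] + "-" + df["period"].str[1:] + "-01")
    # Non-numeric placeholders become NaN instead of raising.
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    return df[["series_id", "date", "value"]].reset_index(drop=True)


# Invented sample in the assumed format; values are not real data.
sample = (
    "series_id                     \tyear\tperiod\t       value\tfootnote_codes\n"
    "LAUCN010010000000003          \t2020\tM01\t         2.7\t\n"
    "LAUCN010010000000003          \t2020\tM02\t         2.9\t\n"
    "LAUCN010010000000003          \t2020\tM13\t         2.8\t\n"
)
laus = read_laus_data(io.StringIO(sample))
```

For the real files, the same function should accept a path to a downloaded data file in place of the `StringIO` buffer.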

## Plan for today (Monday):

1. decide which Zillow data to use and whether to work at the zip code level
2. what will my primary key for linking everything be?

    * zip code? FIPS code? what is CBSA? Core-Based Statistical Area?

3. what hypothesis tests?

    * bin by regions and determine differences? doesn't seem very useful?
    * could do some testing of impact of COVID if I use 2020 data?
    * bayesian?
    * correlation between home prices and rent prices?
    
4. data into SQL
5. start analyzing
6. (Later) write code to use APIs and pull directly from websites instead
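For the correlation question under step 3, a first pass could look like the sketch below. It uses `scipy.stats.pearsonr` on a handful of made-up zip-level values. The numbers are placeholders, not data from the Zillow files, and a real test would need far more observations and some thought about spatial dependence.

```python
import numpy as np
from scipy import stats

# Hypothetical zip-level values: home price (dollars) and monthly rent (dollars).
toy_prices = np.array([200_000, 300_000, 400_000, 500_000, 800_000], dtype=float)
toy_rents = np.array([1200, 1500, 1900, 2100, 3200], dtype=float)

# Pearson correlation coefficient and two-sided p-value for H0: no correlation.
r, p_value = stats.pearsonr(toy_prices, toy_rents)
```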

## FIPS, CBSA, Zip Codes, geocodes etc

CBSA = core based statistical areas

New metropolitan and micropolitan statistical area definitions were announced by OMB on June 6, 2003, based on application of the 2000 standards with Census 2000 data. Metropolitan and Micropolitan Statistical Areas are collectively referred to as Core-Based Statistical Areas.

* Metropolitan statistical areas have at least one urbanized area of 50,000 or more population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties.
* Micropolitan statistical areas are a new set of statistical areas that have at least one urban cluster of at least 10,000 but less than 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties.
* Metropolitan and micropolitan statistical areas are defined in terms of whole counties or county equivalents, including the six New England states. As of June 6, 2003, there are 362 metropolitan statistical areas and 560 micropolitan statistical areas in the United States.
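The population thresholds in the definitions above can be written down as a small helper. This is a simplification for my own notes: the real delineation also depends on commuting ties and on what counts as an urbanized area or urban cluster, none of which is modeled here. The function name is made up.

```python
from typing import Optional


def classify_core(core_population: int) -> Optional[str]:
    """Classify a CBSA by the population of its urban core (thresholds only)."""
    if core_population >= 50_000:
        return "metropolitan"
    if core_population >= 10_000:
        return "micropolitan"
    # Below 10,000 the area does not qualify as a CBSA core.
    return None
```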

HUD USPS ZIP code crosswalk files: https://www.huduser.gov/portal/datasets/usps_crosswalk.html

ZIP Code Tabulation Areas (ZCTAs)? https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
pd.options.display.float_format = '{:,.2f}'.format
plt.style.use('seaborn-white')

## Data import

### Zillow Home Price data

In [7]:
home_prices = pd.read_csv('../data/Zip_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_mon.csv')

In [9]:
home_prices.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1996-01-31,...,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30
0,61639,0,10025,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,,...,1397980.0,1389522.0,1383244.0,1380903.0,1385338.0,1386299.0,1385537.0,1377219.0,1366529.0,1351955.0
1,84654,1,60657,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,362440.0,...,969701.0,968746.0,967546.0,966119.0,965833.0,966867.0,968066.0,967947.0,966726.0,964844.0
2,61637,2,10023,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,,...,1597185.0,1590668.0,1584963.0,1581334.0,1584746.0,1586066.0,1582635.0,1575709.0,1569296.0,1569607.0
3,91982,3,77494,Zip,TX,TX,Katy,Houston-The Woodlands-Sugar Land,Harris County,200594.0,...,336361.0,336399.0,336382.0,336475.0,336202.0,336398.0,336859.0,337928.0,338853.0,339429.0
4,84616,4,60614,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,544060.0,...,1200264.0,1198154.0,1195523.0,1193421.0,1193009.0,1194815.0,1196093.0,1196435.0,1195008.0,1194721.0


### Zillow Rent Price data

In [8]:
rent_prices = pd.read_csv('../data/Zip_ZORI_AllHomesPlusMultifamily_Smoothed.csv')

In [10]:
rent_prices.head()

Unnamed: 0,RegionID,RegionName,SizeRank,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,...,2019-09,2019-10,2019-11,2019-12,2020-01,2020-02,2020-03,2020-04,2020-05,2020-06
0,61639,10025,1,3134.0,3065.0,3082.0,3159.0,3119.0,3099.0,3134.0,...,3371.0,3348.0,3331.0,3327.0,3332.0,3395.0,3370.0,3306.0,3275.0,3213.0
1,84654,60657,2,1673.0,1663.0,1674.0,1711.0,1725.0,1770.0,1768.0,...,1947.0,1924.0,1912.0,1904.0,1936.0,1974.0,1974.0,1974.0,1970.0,1953.0
2,61637,10023,3,3087.0,3025.0,3085.0,3138.0,3157.0,3225.0,3202.0,...,3312.0,3314.0,3328.0,3263.0,3322.0,3315.0,3333.0,3305.0,3269.0,3246.0
3,91982,77494,4,1813.0,1877.0,1854.0,1885.0,1898.0,1925.0,1923.0,...,1864.0,1862.0,1855.0,1862.0,1865.0,1866.0,1869.0,1861.0,1855.0,1873.0
4,84616,60614,5,1870.0,1782.0,1840.0,1869.0,1901.0,1915.0,1938.0,...,2166.0,2147.0,2137.0,2129.0,2128.0,2146.0,2160.0,2182.0,2185.0,2174.0
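Both Zillow files are in wide format with one column per month, but the date labels differ (`2020-06-30` for home prices, `2020-06` for rents). One way to line them up and get the price-to-rent ratio mentioned in the background table is sketched below. The helper names are made up, the ratio is defined here as price divided by annual rent (an assumption), and the small frames reuse a few values from the `head()` outputs above.

```python
import pandas as pd


def to_long(wide, value_name):
    """Melt a wide Zillow frame to one row per (zip code, month)."""
    date_cols = [c for c in wide.columns if str(c)[:4].isdigit()]
    long = wide.melt(id_vars=["RegionName"], value_vars=date_cols,
                     var_name="date", value_name=value_name)
    # Normalize both date label styles to a monthly period.
    long["month"] = pd.to_datetime(long["date"]).dt.to_period("M")
    return long.drop(columns="date")


def price_to_rent(home_prices, rent_prices):
    """Join prices and rents on zip code and month, then compute the ratio."""
    merged = to_long(home_prices, "price").merge(
        to_long(rent_prices, "rent"), on=["RegionName", "month"], how="inner")
    merged["price_to_rent"] = merged["price"] / (12 * merged["rent"])
    return merged


# Small frames built from values shown in the outputs above.
home_demo = pd.DataFrame({"RegionName": [10025, 60657],
                          "2020-05-31": [1366529.0, 966726.0],
                          "2020-06-30": [1351955.0, 964844.0]})
rent_demo = pd.DataFrame({"RegionName": [10025, 60657],
                          "2020-05": [3275.0, 1970.0],
                          "2020-06": [3213.0, 1953.0]})
ratios = price_to_rent(home_demo, rent_demo)
```

The inner join keeps only months present in both files, so in the real data the result would start in 2014, when the rent series begins.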


### Construction data

In [25]:
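# row 5 (0-indexed) holds the column names; header=5 discards the rows above it
# and skiprows=[6] drops the row directly below the header
construction = pd.read_excel('../data/msaannual_201999_building_construction.xls', header=5, skiprows=[6])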
construction.head()
#construction.shape

Unnamed: 0,CSA,CBSA,Name,Total,1 Unit,2 Units,3 and 4 Units,5 Units or More,Num of Structures With 5 Units or More
0,999,10180,"Abilene, TX ...",370,354,16,0,0,0
1,184,10420,"Akron, OH ...",945,856,2,32,55,1
2,999,10500,"Albany, GA ...",402,182,0,0,220,10
3,440,10540,"Albany-Lebanon, OR ...",716,466,0,0,250,26
4,104,10580,"Albany-Schenectady-Troy, NY ...",1870,1120,40,45,665,48


### Rental vacancy

In [49]:
a = list(range(509,557))
a

[509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520,
 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,
 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544,
 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556]

In [74]:
# this file has merged cells and other stuff
vacancy = pd.read_excel('../data/tab4_msa_15_20_rvr.xlsx', 
                        header=3, 
                        usecols='B:J', 
                        #skiprows=2, 
                        skipfooter=509)

vacancy = vacancy.iloc[4:,:]
vacancy.head()

Unnamed: 0,Metropolitan Statistical Area,First Quarter 2020,Margin of Error1,Second Quarter 2020,Margin of Error1.1,Third Quarter 2020,Margin of Error1.2,Fourth Quarter 2020,Margin of Error1.3
4,"Akron, OH .......................................",10.2,9.2,,,,,,
5,"Albany-Schenectady-Troy, NY .....................",7.9,6.7,,,,,,
6,"Albuquerque, NM..................................",4.3,2.8,,,,,,
7,"Allentown-Bethlehem-Easton, PA-NJ................",4.5,5.9,,,,,,
8,"Atlanta-Sandy Springs-Roswell, GA1...............",6.9,2.5,,,,,,


In [76]:
# drop rows without MSA name
vacancy = vacancy.dropna(subset=['Metropolitan Statistical Area'], axis=0)

In [None]:
# TODO: split the city and state into new col? or do it in SQL?
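One option for the TODO above is to do the split in pandas before loading into SQL. The sketch below strips the dot leaders and the trailing footnote digit visible in the output above (for example `GA1...`), then splits on the last comma. The function name and the `city` / `state` column names are placeholders.

```python
import pandas as pd


def split_msa_name(names: pd.Series) -> pd.DataFrame:
    """Split 'City, ST ....' style MSA names into city and state columns."""
    cleaned = (
        names.str.replace(r"\s*\.{2,}\s*$", "", regex=True)  # dot leaders
             .str.strip()
             .str.replace(r"\d+$", "", regex=True)           # footnote digits
             .str.strip()
    )
    # Split on the last ", " so hyphenated city names stay intact.
    parts = cleaned.str.rsplit(", ", n=1, expand=True)
    parts.columns = ["city", "state"]
    return parts


# Names copied from the vacancy output above.
msa_demo = pd.Series([
    "Akron, OH .......................................",
    "Albuquerque, NM..................................",
    "Atlanta-Sandy Springs-Roswell, GA1...............",
    "Allentown-Bethlehem-Easton, PA-NJ................",
])
msa_parts = split_msa_name(msa_demo)
```

Multi-state areas such as `PA-NJ` are kept as a single state string here; splitting those further is left open.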

### Population

In [33]:
# default utf-8 didn't work
population = pd.read_csv('../data/sub-est2019_all.csv', encoding = "ISO-8859-1")
population.head()
population.shape

(81434, 22)

### ZIP and CBSA crosswalk files

In [None]:
# 4th quarter ZIP and CBSA crosswalk files:
# some ZIPs are in multiple CBSAs?
# how many unique CBSAs do we have?
# Wikipedia says there are around 900:
# https://en.wikipedia.org/wiki/List_of_core-based_statistical_areas

# TODO: figure this out
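A starting point for the questions above, once the crosswalk file is loaded, might look like this. It assumes the crosswalk has `ZIP`, `CBSA`, and `TOT_RATIO` columns (the share of a ZIP's addresses in each CBSA); verify the names against the downloaded file. The rows below are hypothetical and only illustrate a ZIP that spans two CBSAs.

```python
import pandas as pd


def summarize_crosswalk(crosswalk):
    """Return the number of unique CBSAs and the ZIPs mapped to more than one."""
    n_cbsa = crosswalk["CBSA"].nunique()
    cbsa_per_zip = crosswalk.groupby("ZIP")["CBSA"].nunique()
    multi_zips = cbsa_per_zip[cbsa_per_zip > 1].index.tolist()
    return n_cbsa, multi_zips


def dominant_cbsa(crosswalk):
    """Keep, for each ZIP, the CBSA holding the largest share of its addresses."""
    idx = crosswalk.groupby("ZIP")["TOT_RATIO"].idxmax()
    return crosswalk.loc[idx, ["ZIP", "CBSA"]].reset_index(drop=True)


# Hypothetical rows, not taken from the real crosswalk file.
crosswalk_demo = pd.DataFrame({
    "ZIP": ["10025", "12345", "12345"],
    "CBSA": ["35620", "10580", "24020"],
    "TOT_RATIO": [1.0, 0.7, 0.3],
})
n_cbsa, multi_zips = summarize_crosswalk(crosswalk_demo)
zip_to_cbsa = dominant_cbsa(crosswalk_demo)
```

Assigning each ZIP to its dominant CBSA gives a one-to-one key for joining the zip-level Zillow data to the CBSA-level construction data, at the cost of ignoring the minority share.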

### Import into SQL

In [None]:
# open connection
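The load step could be sketched as below, using `DataFrame.to_sql`. To keep the example self-contained it writes to an in-memory SQLite database. For the pg-realty Postgres container, the same helper should work with a SQLAlchemy engine in place of the SQLite connection, but the connection details are not set up yet, so that part is untested. The helper name and demo frame are placeholders.

```python
import sqlite3

import pandas as pd


def load_tables(tables, conn):
    """Write each DataFrame to the database under its dictionary key."""
    for name, df in tables.items():
        # Replace any existing table so the notebook can be re-run safely.
        df.to_sql(name, conn, if_exists="replace", index=False)


# In-memory SQLite stands in for Postgres in this sketch.
conn = sqlite3.connect(":memory:")
rent_demo_sql = pd.DataFrame({"zip_code": ["10025", "60657"],
                              "rent": [3213.0, 1953.0]})
load_tables({"rent_prices": rent_demo_sql}, conn)

# Read back a row count to confirm the table exists.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM rent_prices", conn)["n"].iloc[0]
```

The table names would follow the "Data name" column of the background table (`rent_prices`, `construction`, `vacancy`, `population`).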

