[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lokdoesdata/zillow-forecast/blob/main/lokdoesdata-zillow-forecast.ipynb)

## Introduction

“Of all the ways the ultra-rich made their fortunes, real estate outpaced every other method 3 to 1”, wrote [Liz Brumer-Smith](https://www.fool.com/millionacres/real-estate-basics/when-to-invest-in-real-estate/) of Millionacres. As with any other investment, market knowledge is important when investing in real estate. Fortunately, data is now widely available, and much of it can help inform real estate investment decisions. In this exercise, future prices of single-family homes were predicted using data from [Zillow](https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv).

This exercise was originally done as part of a course assignment for Big Data Analytics (IST 718) at Syracuse University.

## Set Up

This notebook uses [`geopandas`](https://geopandas.org/) and is designed to run on Google Colab.

This notebook also uses two custom helpers: one to handle the data pull, and another to handle the time series.

### Install GeoPandas

This uses pip to install `geopandas` and its supporting packages (`pyshp`, `shapely`, and `descartes`) on Google Colab.

In [None]:
%pip install --upgrade geopandas
%pip install --upgrade pyshp
%pip install --upgrade shapely
%pip install --upgrade descartes

### Import Packages

In [7]:
from helper import data
import pandas as pd
import geopandas as gpd

## Data

The main dataset used in this exercise is the [Zillow Home Value Index (ZHVI) by ZIP Code](https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv). This dataset contains monthly home values by ZIP Code, with one column per month. At the time of the analysis, the data covered January 1996 to March 2020.
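
Because the file stores one column per month, time series work is often easier after reshaping it into one row per ZIP Code and month. The snippet below is a minimal sketch of that reshape, assuming the month columns are named as `YYYY-MM-DD` strings. The column names `Month` and `Home Value` are illustrative choices, and this is not necessarily how the time series helper does it.

```python
import pandas as pd

# Small wide-format sample: one row per ZIP Code, one column per month.
wide = pd.DataFrame({
    'ZIP Code': ['35004', '35005'],
    '2020-02-29': [178792.0, 101420.0],
    '2020-03-31': [180209.0, 103567.0],
})

# Reshape to long format: one row per (ZIP Code, month) pair.
# 'Month' and 'Home Value' are illustrative column names.
long = wide.melt(id_vars='ZIP Code', var_name='Month', value_name='Home Value')

# Parse the month labels into real dates and order each series chronologically.
long['Month'] = pd.to_datetime(long['Month'])
long = long.sort_values(['ZIP Code', 'Month']).reset_index(drop=True)

print(long)
```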

An initial data inspection shows that the Zillow dataset assigns many ZIP Codes to the wrong state. For example, ZIP Code 00601 is assigned to Mississippi when it belongs to Puerto Rico. This mainly affects ZIP Codes outside of the 50 states. However, external datasets can be used to correct them.
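
One way to apply such a correction is to join an external ZIP-to-state lookup and prefer its value wherever one exists. The snippet below is a minimal sketch on toy data, where the `reference` table is a hypothetical stand-in for the external source. The notebook itself takes a simpler route: it drops Zillow's `State` column and uses the state from the geographic data.

```python
import pandas as pd

# Toy stand-in for the Zillow table, with 00601 mislabeled as Mississippi.
zillow = pd.DataFrame({
    'ZIP Code': ['00601', '35004'],
    'State': ['MS', 'AL'],
})

# Hypothetical external ZIP-to-state lookup used as the source of truth.
reference = pd.DataFrame({
    'ZIP Code': ['00601', '35004'],
    'State': ['PR', 'AL'],
})

# Left-join on ZIP Code so every Zillow row is kept.
fixed = zillow.merge(
    reference, on='ZIP Code', how='left', suffixes=('_zillow', '_ref')
)

# Prefer the reference state; fall back to Zillow's value when no match exists.
fixed['State'] = fixed['State_ref'].fillna(fixed['State_zillow'])
fixed = fixed[['ZIP Code', 'State']]

print(fixed)
```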

Two geodatabases from Esri were used:
- [USA ZIP Code Areas](https://www.arcgis.com/home/item.html?id=8d2012a2016e484dafaac0451f9aea24), which contains ZIP Code boundaries from TomTom (December 2019) and 2018 total population estimates from the Esri demographics team.
- [USA ZIP Code Points](https://www.arcgis.com/home/item.html?id=1eeaf4bb41314febb990e2e96f7178df), which contains ZIP Code points from TomTom (December 2019) and 2018 total population estimates from the Esri demographics team. This file is used for single-site ZIP Codes.

One shapefile from the United States Census Bureau was also used:
- [Metropolitan and Micropolitan Statistical Area (MSA)](https://www2.census.gov/geo/tiger/GENZ2010/gz_2010_us_310_m1_500k.zip) from the 2010 Census (latest version at the time of analysis).



### Zillow Data

The Zillow dataset was downloaded directly from Zillow.

In [None]:
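# Download the single-family home value file directly from Zillow.
gdf_zillow = pd.read_csv(r'https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv')

# Drop Zillow's descriptive columns; the geographic attributes come from the Esri and Census data instead.
gdf_zillow.drop([
    'RegionID',
    'SizeRank',
    'RegionType',
    'StateName',
    'State',
    'City',
    'Metro',
    'CountyName'
], axis=1, inplace=True)

gdf_zillow.rename({'RegionName': 'ZIP Code'}, axis=1, inplace=True)

# Restore the leading zeros that are lost when ZIP Codes are read as integers.
gdf_zillow['ZIP Code'] = gdf_zillow['ZIP Code'].astype(str).str.zfill(5)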
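
The zero-padding step matters because `pd.read_csv` infers an all-digit column as integers, which turns a ZIP Code such as 00601 into 601. The snippet below is a minimal sketch of that behavior on an in-memory CSV (the value for 00601 is made up for illustration). It also shows an alternative, reading the column as a string with `dtype`, which avoids the need to pad afterwards.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the Zillow file (00601's value is made up).
csv_text = "RegionName,2020-03-31\n00601,100000.0\n35004,180209.0\n"

# Default parsing infers integers, so the leading zeros are lost.
as_int = pd.read_csv(io.StringIO(csv_text))
print(as_int['RegionName'].tolist())

# Option 1: pad back to five characters after the fact.
padded = as_int['RegionName'].astype(str).str.zfill(5)
print(padded.tolist())

# Option 2: read the column as a string so nothing is lost in the first place.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'RegionName': str})
print(as_str['RegionName'].tolist())
```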

### Geographical Data

The geographical information for each ZIP code is processed using a [helper module](https://colab.research.google.com/github/lokdoesdata/zillow-forecast/blob/main/helper/geom_data.ipynb).

In [2]:
# Build the ZIP Code GeoDataFrame with the helper module.
gdf_zip_code = data.get_zip_code_gdf()

# Give the columns readable names.
gdf_zip_code.rename({
    'ZIP_CODE': 'ZIP Code',
    'PO_NAME': 'PO Name',
    'STATE': 'State',
    'POPULATION': 'Pop',
    'SQMI': 'Sq Mi',
    'NAME': 'MSA'
}, axis=1, inplace=True)

# Keep only the columns used in the analysis.
gdf_zip_code = gdf_zip_code[['ZIP Code', 'PO Name', 'State', 'Pop', 'Sq Mi', 'MSA', 'geometry']]

Wall time: 2min 33s


In [25]:
gdf_zip_code = gdf_zip_code.merge(gdf_zillow, on='ZIP Code')
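
Since `merge` defaults to an inner join, ZIP Codes that appear in only one of the two tables are dropped silently. The snippet below is a minimal sketch of how to check for them before merging, using toy tables in which ZIP Codes 99999 and 00000 and their values are made up. It relies on the `indicator` and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy geographic table; 99999 is a made-up ZIP Code with no price data.
geo = pd.DataFrame({
    'ZIP Code': ['35004', '35005', '99999'],
    'State': ['AL', 'AL', 'XX'],
})

# Toy price table; 00000 is a made-up ZIP Code with no geography.
prices = pd.DataFrame({
    'ZIP Code': ['35004', '35005', '00000'],
    '2020-03-31': [180209.0, 103567.0, 1.0],
})

# An outer join with indicator=True labels where each row came from.
# validate='one_to_one' raises an error if a ZIP Code repeats in either table.
check = geo.merge(
    prices, on='ZIP Code', how='outer', indicator=True, validate='one_to_one'
)

# Count matched and unmatched ZIP Codes.
counts = check['_merge'].value_counts()
print(counts)
```

Rows labeled `left_only` or `right_only` are the ones an inner join would discard.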

In [26]:
gdf_zip_code

| | ZIP Code | PO Name | State | Pop | Sq Mi | MSA | geometry | 1996-01-31 | 1996-02-29 | 1996-03-31 | ... | 2019-06-30 | 2019-07-31 | 2019-08-31 | 2019-09-30 | 2019-10-31 | 2019-11-30 | 2019-12-31 | 2020-01-31 | 2020-02-29 | 2020-03-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35004 | Moody | AL | 10166.0 | 16.67 | Birmingham-Hoover | POLYGON ((-86.53440 33.57611, -86.53439 33.579... | NaN | NaN | NaN | ... | 171341.0 | 171985.0 | 172798.0 | 173525.0 | 174447.0 | 175412.0 | 176493.0 | 177517.0 | 178792.0 | 180209.0 |
| 1 | 35005 | Adamsville | AL | 7704.0 | 31.95 | Birmingham-Hoover | POLYGON ((-87.08134 33.61427, -87.08134 33.614... | NaN | NaN | NaN | ... | 93842.0 | 94998.0 | 95880.0 | 96309.0 | 96827.0 | 97715.0 | 98858.0 | 100218.0 | 101420.0 | 103567.0 |
| 2 | 35006 | Adger | AL | 3226.0 | 110.74 | Birmingham-Hoover | POLYGON ((-87.37462 33.43646, -87.37457 33.436... | NaN | NaN | 65352.0 | ... | 89677.0 | 90329.0 | 90541.0 | 90755.0 | 91278.0 | 92405.0 | 93368.0 | 94468.0 | 95646.0 | 97727.0 |
| 3 | 35007 | Alabaster | AL | 28701.0 | 37.41 | Birmingham-Hoover | POLYGON ((-86.86542 33.19305, -86.86539 33.193... | NaN | NaN | NaN | ... | 176897.0 | 177814.0 | 178767.0 | 179826.0 | 181315.0 | 182661.0 | 183993.0 | 185435.0 | 186884.0 | 188331.0 |
| 4 | 35010 | Alexander City | AL | 20230.0 | 247.58 | Alexander City | POLYGON ((-86.10826 32.83174, -86.10698 32.833... | NaN | NaN | NaN | ... | 105853.0 | 106422.0 | 106928.0 | 107407.0 | 107833.0 | 108275.0 | 108636.0 | 109033.0 | 109330.0 | 109625.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30459 | 82934 | Granger | WY | NaN | NaN | Rock Springs | POINT (-109.96484 41.58896) | NaN | NaN | NaN | ... | 68665.0 | 68642.0 | 68769.0 | 69167.0 | 69679.0 | 70262.0 | 70312.0 | 70867.0 | 71104.0 | 72128.0 |
| 30460 | 82939 | Mountain View | WY | NaN | NaN | Evanston | POINT (-110.33740 41.27024) | NaN | NaN | NaN | ... | 218850.0 | 220318.0 | 221834.0 | 223074.0 | 223976.0 | 224595.0 | 225243.0 | 225789.0 | 226010.0 | 226011.0 |
| 30461 | 82944 | Robertson | WY | NaN | NaN | Evanston | POINT (-110.40662 41.18529) | NaN | NaN | NaN | ... | 255875.0 | 258222.0 | 261013.0 | 262419.0 | 264220.0 | 264957.0 | 266349.0 | 268271.0 | 269414.0 | 269747.0 |
| 30462 | 82945 | Superior | WY | NaN | NaN | Rock Springs | POINT (-108.96489 41.76420) | NaN | NaN | NaN | ... | 56587.0 | 57422.0 | 57578.0 | 57380.0 | 56872.0 | 56611.0 | 56404.0 | 56016.0 | 55944.0 | 55830.0 |
