[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lokdoesdata/zillow-forecast/blob/main/lokdoesdata-zillow-forecast.ipynb)

## Introduction

“Of all the ways the ultra-rich made their fortunes, real estate outpaced every other method 3 to 1”, wrote [Liz Brumer-Smith](https://www.fool.com/millionacres/real-estate-basics/when-to-invest-in-real-estate/) of Millionacres. As with any other investment, market knowledge is important when investing in real estate. Fortunately, data is now widely available, and much of it can help inform real estate investment decisions. In this exercise, future prices of single-family homes were predicted using data from [Zillow](https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv).

This exercise was originally done as part of a course assignment for Big Data Analytics (IST 718) at Syracuse University.

## Set Up

This notebook uses [`geopandas`](https://geopandas.org/) and is designed to run on Google Colab.

This notebook also uses two custom helpers: one to handle the data pull and another to handle the time series.

### Install GeoPandas

This uses pip to install `geopandas`, along with `pyshp`, `shapely`, and `descartes`, on Google Colab.

In [1]:
%pip install --upgrade geopandas
%pip install --upgrade pyshp
%pip install --upgrade shapely
%pip install --upgrade descartes

### Install pmdarima

This uses pip to install `pmdarima` on Google Colab.

In [None]:
%pip install --upgrade pmdarima

### Cloning from GitHub

In [2]:
!git clone https://github.com/lokdoesdata/zillow-forecast.git
import sys, os
sys.path.append(r'/content/zillow-forecast')

### Import Packages

In [3]:
from helper import geom_data
from helper.time_series import TimeSeries
import pandas as pd
import geopandas as gpd

## Data

The main dataset used in this exercise is [Zillow's Home Value Index (ZHVI) by ZIP Code](https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv). This data set contains monthly home values for single-family residences by ZIP Code. At the time of the analysis, the data covered January 1996 to March 2020.

An initial data inspection shows that the Zillow data set assigns many ZIP codes to the wrong state. For example, ZIP code 00601 is assigned to Mississippi when it belongs to Puerto Rico. This mainly affects ZIP codes outside of the 50 states. However, external datasets can be used to correct them.
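As a rough illustration of that correction idea, the sketch below compares the state in a Zillow-like table against an external reference table and keeps the reference value. Both tables are toy data invented for this example, and the column names are assumptions; the notebook's actual correction happens inside the helper module.

```python
import pandas as pd

# Toy rows standing in for the Zillow file (illustrative values only).
zillow = pd.DataFrame({
    'ZIP Code': ['00601', '13244', '90210'],
    'State': ['MS', 'NY', 'CA'],
})

# Toy rows standing in for an external ZIP code reference table.
reference = pd.DataFrame({
    'ZIP Code': ['00601', '13244', '90210'],
    'State': ['PR', 'NY', 'CA'],
})

# Join on ZIP code and keep both state columns side by side.
compared = zillow.merge(
    reference, on='ZIP Code', suffixes=('_zillow', '_reference')
)

# Rows where the two sources disagree are candidates for correction.
mismatched = compared[compared['State_zillow'] != compared['State_reference']]
print(mismatched)

# Trust the reference table for the final state assignment.
corrected = compared[['ZIP Code', 'State_reference']].rename(
    columns={'State_reference': 'State'}
)
```

Dropping Zillow's own location columns and taking them from the reference data, as this notebook does later, has the same effect without needing a row-by-row comparison.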

Two geodatabases from Esri were used:
- [USA ZIP Code Areas](https://www.arcgis.com/home/item.html?id=8d2012a2016e484dafaac0451f9aea24), which contains ZIP Code boundaries from TomTom (December 2019) and 2018 total population estimates from the Esri demographics team.
- [USA ZIP Code Points](https://www.arcgis.com/home/item.html?id=1eeaf4bb41314febb990e2e96f7178df), which contains ZIP Code points from TomTom (December 2019) and 2018 total population estimates from the Esri demographics team. This file is used for single-site ZIP Codes.

One shapefile from the United States Census Bureau was also used:
- [Metropolitan and Micropolitan Statistical Area (MSA)](https://www2.census.gov/geo/tiger/GENZ2010/gz_2010_us_310_m1_500k.zip) from the 2010 Census (latest version at the time of analysis).



### Zillow Data

The Zillow dataset was downloaded directly from Zillow.

In [4]:
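# Read the ZHVI file straight from Zillow (a plain DataFrame despite the gdf_ prefix).
gdf_zillow = pd.read_csv(r'https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv')

# Drop Zillow's identifier and location columns; location comes from the geographic data instead.
gdf_zillow.drop([
    'RegionID', 
    'SizeRank', 
    'RegionType', 
    'StateName', 
    'State', 
    'City', 
    'Metro', 
    'CountyName'
], axis=1, inplace=True)

# RegionName holds the ZIP code.
gdf_zillow.rename({'RegionName': 'ZIP Code'}, axis=1, inplace=True)

# Convert ZIP codes to strings and left-pad them with zeros to five characters.
gdf_zillow['ZIP Code'] = [str(z).zfill(5) for z in gdf_zillow['ZIP Code']]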
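The zero-padding step matters because `pd.read_csv` infers an integer type for an all-numeric column, which drops leading zeros. The sketch below reproduces this on a small in-memory CSV (made-up values, not Zillow data) and shows two ways to get five-character ZIP codes back. Whether the real file stores leading zeros is not checked here, so padding after the read is the safer of the two.

```python
import io

import pandas as pd

# Tiny in-memory CSV with made-up values; the first ZIP code has leading zeros.
csv_text = "RegionName,2020-03-31\n00601,100000\n13244,150000\n"

# Default parsing infers an integer column, so '00601' becomes 601.
df = pd.read_csv(io.StringIO(csv_text))
print(df['RegionName'].tolist())

# Option 1: cast to string and left-pad to five characters (vectorized).
df['ZIP Code'] = df['RegionName'].astype(str).str.zfill(5)

# Option 2: read the column as text so the digits are kept as written.
df_str = pd.read_csv(io.StringIO(csv_text), dtype={'RegionName': str})
```

Option 2 only preserves zeros that are present in the file, while option 1 restores them even if the source already stored the ZIP codes as numbers.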

### Geographical Data

The geographical information for each ZIP code is processed using a [helper module](https://github.com/lokdoesdata/zillow-forecast/blob/main/helper/geom_data.py).

This could take a while due to the volume of data.

In [5]:
gdf_zip_code = geom_data.get_zip_code_gdf()

gdf_zip_code.rename({
    'ZIP_CODE': 'ZIP Code',
    'PO_NAME': 'PO Name',
    'STATE': 'State',
    'POPULATION': 'Pop',
    'SQMI': 'Sq Mi',
    'NAME': 'MSA'
}, axis=1, inplace=True)

gdf_zip_code = gdf_zip_code[['ZIP Code', 'PO Name', 'State', 'Pop', 'Sq Mi', 'MSA', 'x', 'y', 'geometry']]

In [6]:
gdf_zip_code = gdf_zip_code.merge(gdf_zillow, on='ZIP Code')
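The merge above uses the pandas default (`how='inner'`), so ZIP codes that appear in only one of the two tables are dropped without warning. A quick way to see how many rows fall out is an outer merge with `indicator=True`. The sketch below uses toy frames with made-up values rather than the notebook's data.

```python
import pandas as pd

# Toy geographic table and toy price table (illustrative values only).
geo = pd.DataFrame({
    'ZIP Code': ['00601', '13244', '99501'],
    'State': ['PR', 'NY', 'AK'],
})
prices = pd.DataFrame({
    'ZIP Code': ['13244', '90210'],
    '2020-03-31': [150000.0, 3000000.0],
})

# Inner merge: only ZIP codes present in both tables survive.
inner = geo.merge(prices, on='ZIP Code')

# Outer merge with an indicator column to audit what the inner merge drops.
audit = geo.merge(prices, on='ZIP Code', how='outer', indicator=True)
counts = audit['_merge'].value_counts()
print(counts)
```

Rows marked `left_only` have geometry but no Zillow prices, and rows marked `right_only` have prices but no geometry.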

In [7]:
# gdf_zip_code = gpd.read_file(geom_data.DATA_PATH.joinpath('test_input', 'zip_code.gpkg'))

In [8]:
time_series = TimeSeries(gdf_zip_code, forecast_start='01/31/2019')
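The internals of `TimeSeries` are not shown in this notebook. As a hedged sketch of what a `forecast_start` cutoff could mean, the example below splits one ZIP code's monthly values into a training part before the cutoff and a holdout part from the cutoff onward. The month-end column labels, the toy values, and the choice to include the cutoff month in the holdout are all assumptions made for illustration, not the helper's documented behavior.

```python
import pandas as pd

# Toy wide-format row: one ZIP code, one column per month (made-up values).
wide = pd.DataFrame({
    'ZIP Code': ['13244'],
    '2018-11-30': [100.0],
    '2018-12-31': [101.0],
    '2019-01-31': [102.0],
    '2019-02-28': [103.0],
})

# Pull a single ZIP code out as a Series indexed by date.
series = wide.set_index('ZIP Code').loc['13244']
series.index = pd.to_datetime(series.index)

# Same date string as in the cell above; pandas parses it as month/day/year.
forecast_start = pd.Timestamp('01/31/2019')

# Assumption: months before the cutoff are for fitting, the rest for evaluation.
train = series[series.index < forecast_start]
holdout = series[series.index >= forecast_start]
print(len(train), len(holdout))
```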