# Analysis and Modelling of King County, WA Home Sale Prices

An analysis and model by Luluva Lakdawala, Amanda Potter and Leana Critchell

## Setting the Scene:

This project investigates the factors that determine housing prices in King County, Washington.  Our model and investigation are directed towards first-time home buyers who are looking to live in a single family home.  Many factors contribute to real estate sale prices, and we aim to discover what some of these driving factors are.

### Aims:

This project aims to:
- Investigate some of the features that appear to have a relationship with King County housing sale prices
- Develop a model that predicts housing prices in King County using the features that we identify
- Validate the following claims made by real estate professionals:
    - Higher square footage increases home sale price
    - The presence of a nuisance (power lines, traffic noise, airport noise) decreases home sale price
    - Having a porch increases home sale price

### Definitions:

- Single family home:
    - A single family home is defined in this project as a single residence on one lot.  Condos are not included in our analysis/model but townhomes are.
- First-time home buyers:
    - Home buyers who are making their first ever home purchase.  We have assumed in our analysis that first-time home buyers will not typically buy homes priced above $2.5 million.
- Model:
    - The term model, as used throughout this project, refers to the linear regression model that we build to explain the variance in home sale prices
- Features:
    - Features refer to the independent variables we choose for our model to help predict sale prices
- Target:
    - Sale Price is our target variable which our model aims to predict

### Data:

The data used in this project is from the King County Department of Assessments website and can be found [here](https://info.kingcounty.gov/assessor/DataDownload/default.aspx).  From this link, you can find the files/tables that were used in this project:
- Real Property Sales
- Parcel
- Residential Building
- Lookup

Our analysis looks only at the most recent data, from 2019, so the data was filtered accordingly.

Additional information about the table identifiers can be found [here](https://www5.kingcounty.gov/sdc/Metadata.aspx?Layer=parcel#AttributeInfo).

### Analysis Takeaways, Future Investigations and Recommendations:

- Analysis takeaways:
    - Our analysis finds square footage of total living area, porch and deck, bathroom count, building grade and township to be some of the more significant driving factors of home sale prices
    - We find that a higher square footage does increase home sale prices in King County
    - We find that the presence of nuisances has little effect on home sale prices in King County
    - We find that having a porch does increase the home sale price in King County
- Future investigations:
    - Is there a large difference in pricing when looking at condominiums?
    - What are the drivers of price in the upper bounds of the market?
- Recommendations:
    - Consider buying a smaller home and adding an extension later
    - Consider homes without decks/porches and add them later to increase value
    - Bathrooms are expensive, so look for homes that already have the number of baths you desire, even at the expense of square footage



# Data Cleaning and Exploratory Data Analysis:

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

In [10]:
# imports needed throughout analysis and modelling
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('dark')

from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy.stats as stats

import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))

if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.modeling import modeling_functions as mf
from src.data_cleaning import cleaning_functions as cf

## Get data:

First, to download the data, run the cells in [this notebook](exploratory/get_data_script.ipynb).  This will create a `data` folder in the root directory containing all the csv files needed for this project, as well as a `raw` folder within `data` containing the corresponding zip files.

Next, create pandas dataframes of the tables we wish to join:

In [6]:
# real property sales df create:
rps = pd.read_csv("data/EXTR_RPSale.csv")

# residential building df create:
res_build = pd.read_csv("data/EXTR_ResBldg.csv")

# parcel df create:
parcel = pd.read_csv("data/EXTR_Parcel.csv", encoding='latin-1')

# lookup codes df create:
codes = pd.read_csv("data/EXTR_LookUp.csv")

## Inspect the data:

In [7]:
rps.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,2687551,138860,110,08/21/2014,245000,20140828001436,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,664885,40,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,...,3,0,26,N,N,N,N,18,3,11
2,2704079,423943,50,10/11/2014,0,20141205000558,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,403700,715,01/04/2013,0,20130110000910,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,3027422,213043,120,12/20/2019,560000,20191226000848,,,,,...,11,6,3,N,N,N,N,1,8,


In [8]:
res_build.head()

Unnamed: 0,Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,46100,935,1,1,7349 10TH AVE NW 98117,7349,,,10TH,AVE,...,0,0,0,1927,0,0,0,0,3,0
1,46100,1165,1,1,7342 11TH AVE NW 98117,7342,,,11TH,AVE,...,0,0,0,2019,0,53,0,0,3,0
2,46100,1720,1,1,7330 13TH AVE NW 98117,7330,,,13TH,AVE,...,0,0,0,2019,0,0,0,0,3,0
3,46100,1745,1,1,7348 13TH AVE NW 98117,7348,,,13TH,AVE,...,0,0,0,1910,2005,0,0,0,3,0
4,46100,1930,1,1,7306 14TH AVE NW 98117,7306,,,14TH,AVE,...,1,0,1,1955,0,0,0,0,3,0


In [9]:
parcel.head()

Unnamed: 0,Major,Minor,PropName,PlatName,PlatLot,PlatBlock,Range,Township,Section,QuarterSection,...,SeismicHazard,LandslideHazard,SteepSlopeHazard,Stream,Wetland,SpeciesOfConcern,SensitiveAreaTract,WaterProblems,TranspConcurrency,OtherProblems
0,916110,346,,WARDALL PARK ADD,20-21-22,3.0,3,24,14,SW,...,N,N,N,N,N,N,N,N,N,N
1,132606,9228,,,,,6,26,13,SE,...,N,N,N,N,N,N,N,N,N,N
2,329870,12,,HIGHLAND PARK,2,1.0,4,24,31,SW,...,N,N,N,N,N,N,N,N,N,N
3,884530,50,,UPPERS H S LIBERTY HEIGHTS ADD,9,1.0,3,24,26,SW,...,N,N,N,N,N,N,N,N,N,N
4,261730,220,,FOUR LAKES,2,3.0,6,23,27,NE,...,N,N,N,N,N,N,N,N,N,N


We can see that all of these tables have `Major` and `Minor` columns.  We know from the reference link [here](https://www5.kingcounty.gov/sdc/Metadata.aspx?Layer=parcel#AttributeInfo) that we need to combine the major and minor columns and join the tables on this combined identifier.  Also note that this identifier needs to be a 10-digit number, so we will left-pad each major and minor number with zeros to get the identifier we need.

First though, we will make our tables easier to work with by converting all column names to lower case:

In [14]:
# list of tables to manipulate
tables = [rps, res_build, parcel]
for table in tables:
    # lower-case every column name in place
    table.columns = table.columns.str.lower()

Now we create the `major_minor` column to join the tables on and drop the existing `major` and `minor` columns:

In [15]:
# custom function does all of this
for table in tables:
    cf.maj_min_col(table)
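
The `cf.maj_min_col` helper lives in `src/data_cleaning` and is not shown in this notebook. As a rough sketch of the kind of transformation it performs, the hypothetical `build_major_minor` function below assumes a 6-digit `major` and a 4-digit `minor`, which together give the 10-digit identifier described above. Unlike the call above, this sketch returns a new dataframe rather than modifying the table in place.

```python
import pandas as pd


def build_major_minor(df):
    """Hypothetical stand-in for cf.maj_min_col (the real helper is not shown here)."""
    # Assumption: major is zero-padded to 6 digits and minor to 4 digits.
    major = df["major"].astype(str).str.zfill(6)
    minor = df["minor"].astype(str).str.zfill(4)
    # Drop the original columns and keep a single 10-character string identifier.
    return df.drop(columns=["major", "minor"]).assign(major_minor=major + minor)


# Toy rows using major/minor values from the previews above
demo = pd.DataFrame({"major": [138860, 46100], "minor": [110, 935]})
result = build_major_minor(demo)
print(result["major_minor"].tolist())
```

Keeping the identifier as a string preserves leading zeros, which an integer column would drop.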

Preview new tables:

In [16]:
rps.head()

Unnamed: 0,excisetaxnbr,major_minor,documentdate,saleprice,recordingnbr,volume,page,platnbr,plattype,platlot,...,propertytype,principaluse,saleinstrument,afforestland,afcurrentuseland,afnonprofituse,afhistoricproperty,salereason,propertyclass,salewarning
0,2687551,1388600110,08/21/2014,245000,20140828001436,,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,6648850040,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,B102,...,3,0,26,N,N,N,N,18,3,11
2,2704079,4239430050,10/11/2014,0,20141205000558,,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,4037000715,01/04/2013,0,20130110000910,,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,3027422,2130430120,12/20/2019,560000,20191226000848,,,,,,...,11,6,3,N,N,N,N,1,8,


In [17]:
res_build.head()

Unnamed: 0,bldgnbr,major_minor,nbrlivingunits,address,buildingnumber,fraction,directionprefix,streetname,streettype,directionsuffix,...,fpmultistory,fpfreestanding,fpadditional,yrbuilt,yrrenovated,pcntcomplete,obsolescence,pcntnetcondition,condition,addnlcost
0,1,461000935,1,7349 10TH AVE NW 98117,7349,,,10TH,AVE,NW,...,0,0,0,1927,0,0,0,0,3,0
1,1,461001165,1,7342 11TH AVE NW 98117,7342,,,11TH,AVE,NW,...,0,0,0,2019,0,53,0,0,3,0
2,1,461001720,1,7330 13TH AVE NW 98117,7330,,,13TH,AVE,NW,...,0,0,0,2019,0,0,0,0,3,0
3,1,461001745,1,7348 13TH AVE NW 98117,7348,,,13TH,AVE,NW,...,0,0,0,1910,2005,0,0,0,3,0
4,1,461001930,1,7306 14TH AVE NW 98117,7306,,,14TH,AVE,NW,...,1,0,1,1955,0,0,0,0,3,0


In [18]:
parcel.head()

Unnamed: 0,propname,major_minor,platname,platlot,platblock,range,township,section,quartersection,proptype,...,seismichazard,landslidehazard,steepslopehazard,stream,wetland,speciesofconcern,sensitiveareatract,waterproblems,transpconcurrency,otherproblems
0,,9161100346,WARDALL PARK ADD,20-21-22,3.0,3,24,14,SW,R,...,N,N,N,N,N,N,N,N,N,N
1,,1326069228,,,,6,26,13,SE,R,...,N,N,N,N,N,N,N,N,N,N
2,,3298700012,HIGHLAND PARK,2,1.0,4,24,31,SW,R,...,N,N,N,N,N,N,N,N,N,N
3,,8845300050,UPPERS H S LIBERTY HEIGHTS ADD,9,1.0,3,24,26,SW,R,...,N,N,N,N,N,N,N,N,N,N
4,,2617300220,FOUR LAKES,2,3.0,6,23,27,NE,R,...,N,N,N,N,N,N,N,N,N,N
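
With a shared `major_minor` key on every table, the join described earlier can be sketched with `DataFrame.merge`. The rows below are toy data made up for illustration; only the column names come from the previews. The inner join and the `validate` check are assumptions about how one might combine the tables, not necessarily the approach taken in this project.

```python
import pandas as pd

# Toy rows for illustration only (not real King County records)
sales = pd.DataFrame({"major_minor": ["0000000001", "0000000002"],
                      "saleprice": [400000, 650000]})
buildings = pd.DataFrame({"major_minor": ["0000000002", "0000000003"],
                          "yrbuilt": [1990, 2005]})
parcels = pd.DataFrame({"major_minor": ["0000000002", "0000000003"],
                        "township": [24, 26]})

# Inner joins keep only the parcels that are present in every table
joined = (
    sales
    .merge(buildings, on="major_minor", how="inner")
    # validate raises if a parcel identifier appears more than once in parcels
    .merge(parcels, on="major_minor", how="inner", validate="many_to_one")
)
print(joined)
```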


### Inspect info of these tables:

#### RPS table:

In [19]:
rps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2089099 entries, 0 to 2089098
Data columns (total 23 columns):
excisetaxnbr          int64
major_minor           object
documentdate          object
saleprice             int64
recordingnbr          object
volume                object
page                  object
platnbr               object
plattype              object
platlot               object
platblock             object
sellername            object
buyername             object
propertytype          int64
principaluse          int64
saleinstrument        int64
afforestland          object
afcurrentuseland      object
afnonprofituse        object
afhistoricproperty    object
salereason            int64
propertyclass         int64
dtypes: int64(7), object(16)
memory usage: 366.6+ MB


Currently, this table has over 2 million entries.  This is because we have not yet filtered down to our desired date range of 2019.  Notice also that `documentdate` has an object dtype rather than a datetime dtype.  We convert it now so that we can filter entries by date:

In [20]:
# change document date type
rps['documentdate'] = pd.to_datetime(rps['documentdate'])

In [22]:
# create new df isolating 2019 data (copy so later edits do not touch rps)
rps19 = rps[rps['documentdate'].dt.year == 2019].copy()
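
As a quick sanity check of the conversion and filter, the same two steps can be run on a handful of rows. The sketch below reuses `documentdate` and `saleprice` values from the `rps` preview; the explicit `format` argument is an assumption based on the `MM/DD/YYYY` strings shown there.

```python
import pandas as pd

# A few rows taken from the rps preview
sales = pd.DataFrame({
    "documentdate": ["08/21/2014", "12/20/2019", "01/04/2013"],
    "saleprice": [245000, 560000, 0],
})

# Parse the date strings, then keep only sales documented in 2019
sales["documentdate"] = pd.to_datetime(sales["documentdate"], format="%m/%d/%Y")
sales19 = sales[sales["documentdate"].dt.year == 2019].copy()
print(sales19)
```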

#### res_build table:

In [23]:
res_build.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515263 entries, 0 to 515262
Data columns (total 49 columns):
bldgnbr               515263 non-null int64
major_minor           515263 non-null object
nbrlivingunits        515263 non-null int64
address               515263 non-null object
buildingnumber        515263 non-null object
fraction              515263 non-null object
directionprefix       514721 non-null object
streetname            515263 non-null object
streettype            515263 non-null object
directionsuffix       514721 non-null object
zipcode               469778 non-null object
stories               515263 non-null float64
bldggrade             515263 non-null int64
bldggradevar          515263 non-null int64
sqft1stfloor          515263 non-null int64
sqfthalffloor         515263 non-null int64
sqft2ndfloor          515263 non-null int64
sqftupperfloor        515263 non-null int64
sqftunfinfull         515263 non-null int64
sqftunfinhalf         515263 non-null int6

In [4]:
# include viewing of correlations to choose feature for fsm

In [5]:
# start with fsm

In [6]:
# explain how future features were chosen

In [7]:
# work through the iterations

Example reference link to other notebooks:

see [here](exploratory/lmc_notebooks/model6_lc.ipynb)

# First Simple Model

In [8]:
# fsm based on square footage
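
A first simple model with one predictor can be fit with the `ols` formula interface imported earlier. The sketch below runs on synthetic data generated purely for illustration, and the column name `sqfttotliving` is an assumed name for total living area. The numbers it produces say nothing about King County.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only: price = 100000 + 300 * sqft + noise
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 100_000 + 300 * sqft + rng.normal(0, 50_000, size=200)
toy = pd.DataFrame({"sqfttotliving": sqft, "saleprice": price})

# Single-feature linear regression: sale price explained by living area
fsm = ols("saleprice ~ sqfttotliving", data=toy).fit()
print(fsm.params)    # intercept and slope estimates
print(fsm.rsquared)  # share of variance in saleprice explained by the model
```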

# Model iterations

In [9]:
# iterate through the process

In [10]:
# include further feature investigations here

# Model interpretation

In [11]:
# go over final model details - r2, assumptions, coefficients, pvals etc

# Claim Validations

In [12]:
# go through each claim separately and refer to other things throughout.

# Conclusion:

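This report set out to identify some of the factors that drive home sale prices in King County for first-time buyers of single family homes, and to test three claims made by real estate professionals. Our analysis finds square footage of total living area, porch and deck, bathroom count, building grade and township to be some of the more significant drivers of sale price. Higher square footage and having a porch both increase sale price, while the presence of nuisances has little effect. Future investigations could look at whether pricing differs for condominiums and at what drives prices in the upper bounds of the market.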