# King County Project

## Business Problem

A client in King County, WA wants to advise homeowners on **home improvement projects** that will **add to the sale value of their homes**.

**This advice should be based on data from the most recent full calendar year, 2019**.

### Core questions:
Enclosing a porch will increase the sale price of a home.

Converting a garage to a bedroom is a good way to increase the sale price of a home.

Upgrading to a forced-air heating system will increase the sale price of a home.

### Core Goals:
Create model

Interpret results

Make recommendations

## Schedule

### Friday 2/19: 
#### Business Understanding & Preliminary EDA
* Repo Creation
* Data Importation
* Database Creation
* Created initial data frame.

### Saturday 2/20: 
#### Data Understanding & EDA
* Added ```.gitignore``` file in exploratory directory to exclude ```KingDB.db``` file.
* Created ```lookup()``` function.
* Created 2019 data frame.
* Created ```nz19``` data frame of 2019 documents with non-zero sale prices.

### Sunday 2/21: 
#### Data Prep

* Added a ```functions.py``` module to contain the functions written while working through the project.
* Created a ```col_stripper()``` function, and added it, ```fetch()```, and ```lookup()``` to the ```functions``` module.
* Did some data cleaning on the lookup dataframe ```df_look```.
* Made a ```heat_df``` data frame with the ```'SalePrice'``` target and ```'HeatSystem'``` predictor.
* Converted ```'HeatSystem'``` to a column called ```'HeatNames'``` with more descriptive values.
* Performed one-hot encoding on ```'HeatNames'``` and created a ```model_df```.
* Created a correlation matrix and heatmap for ```model_df```.


# Initial EDA Work

#### Importing Libraries and Adjusting Settings

In [1]:
# import modules for eda and plotting
import pandas as pd
import numpy as np
import scipy.stats as stats

import sqlite3

from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt
import seaborn as sns

import functions as fn

# setting plots to inline
%matplotlib inline

# setting the max number of rows displayed
pd.options.display.max_rows = 200

## SQL Dataframe

### SQL Prelim Work

#### Created Database
Earlier, wrote up a [DB Creator](DB_Creator.ipynb) notebook and ran it to create an SQL database from the raw ```.csv``` files.

#### Creating DataFrame From the Database

Connecting to the database, and creating a cursor object. Joining the database tables into a second main data frame. Lastly, checking basic information about the data frame.

In [136]:
# creating database, connection, and cursor
conn = sqlite3.connect('KingDB.db')  
cur = conn.cursor()

In [137]:
# checking the table names
q = """SELECT name FROM sqlite_master 
WHERE type IN ('table','view') 
AND name NOT LIKE 'sqlite_%'
ORDER BY 1"""
fn.fetch(cur, q)

[('PARC',), ('RESB',), ('SALES',)]
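The ```fn.fetch()``` helper lives in the ```functions``` module, which is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply executes the query on the cursor and returns all rows, is below. The in-memory database and its two empty tables are stand-ins so the sketch runs without ```KingDB.db```.

```python
import sqlite3


def fetch(cur, query):
    """Run a query on the cursor and return every row as a list of tuples."""
    return cur.execute(query).fetchall()


# throwaway in-memory database (an assumption, used instead of KingDB.db)
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE SALES (Major TEXT, Minor TEXT)")
demo_cur.execute("CREATE TABLE PARC (Major TEXT, Minor TEXT)")

# same table-name query as in the cell above
table_query = """SELECT name FROM sqlite_master
WHERE type IN ('table','view')
AND name NOT LIKE 'sqlite_%'
ORDER BY 1"""
tables = fetch(demo_cur, table_query)
print(tables)
demo_conn.close()
```

Each row comes back as a tuple, which is why the table names above appear as one-element tuples.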

#### Joining The Tables to Create a Data Frame

In [153]:
# joining tables to create dataframe and appending column names
q = """SELECT * FROM SALES AS SA
       JOIN PARC AS PA
       ON SA.Major = PA.Major
       AND SA.Minor = PA.Minor
       JOIN RESB AS RE
       ON PA.Major = RE.Major
       AND PA.Minor = RE.Minor
       """
df = pd.DataFrame(fn.fetch(cur, q))
df.columns = [i[0] for i in cur.description]

In [154]:
# checking info, shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251300 entries, 0 to 251299
Columns: 156 entries, ExciseTaxNbr to AddnlCost
dtypes: object(156)
memory usage: 299.1+ MB


### Checking & Dealing with Columns

In [155]:
# checking columns
list(df.columns)

['ExciseTaxNbr',
 'Major',
 'Minor',
 'DocumentDate',
 'SalePrice',
 'RecordingNbr',
 'Volume',
 'Page',
 'PlatNbr',
 'PlatType',
 'PlatLot',
 'PlatBlock',
 'SellerName',
 'BuyerName',
 'PropertyType',
 'PrincipalUse',
 'SaleInstrument',
 'AFForestLand',
 'AFCurrentUseLand',
 'AFNonProfitUse',
 'AFHistoricProperty',
 'SaleReason',
 'PropertyClass',
 'Unnamed: 0',
 'Major',
 'Minor',
 'PropName',
 'PlatName',
 'PlatLot',
 'PlatBlock',
 'Range',
 'Township',
 'Section',
 'QuarterSection',
 'PropType',
 'Area',
 'SubArea',
 'SpecArea',
 'SpecSubArea',
 'DistrictName',
 'LevyCode',
 'CurrentZoning',
 'HBUAsIfVacant',
 'HBUAsImproved',
 'PresentUse',
 'SqFtLot',
 'WaterSystem',
 'SewerSystem',
 'Access',
 'Topography',
 'StreetSurface',
 'RestrictiveSzShape',
 'InadequateParking',
 'PcntUnusable',
 'Unbuildable',
 'MtRainier',
 'Olympics',
 'Cascades',
 'Territorial',
 'SeattleSkyline',
 'PugetSound',
 'LakeWashington',
 'LakeSammamish',
 'SmallLakeRiverCreek',
 'OtherView',
 'WfntLocation'

In [156]:
# dropping unnamed column found above
df.drop('Unnamed: 0', axis=1, inplace=True)

In [157]:
df.head(1)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0


In [158]:
# checking nulls and shape
print(df.isna().sum())
df.shape

ExciseTaxNbr                   0
Major                          0
Minor                          0
DocumentDate                   0
SalePrice                      0
RecordingNbr                   0
Volume                         0
Page                           0
PlatNbr                        0
PlatType                       0
PlatLot                        0
PlatBlock                      0
SellerName                     0
BuyerName                      0
PropertyType                   0
PrincipalUse                   0
SaleInstrument                 0
AFForestLand                   0
AFCurrentUseLand               0
AFNonProfitUse                 0
AFHistoricProperty             0
SaleReason                     0
PropertyClass                  0
Major                          0
Minor                          0
PropName                   11473
PlatName                   29223
PlatLot                        0
PlatBlock                      0
Range                          0
Township  

(251300, 155)

Checking ```'SpecArea', 'SpecSubArea'``` columns.

In [159]:
df[['SpecArea', 'SpecSubArea']].head(2)

Unnamed: 0,SpecArea,SpecSubArea
0,,
1,,


It looks like ```'SpecArea'``` and ```'SpecSubArea'``` are extraneous columns for modeling purposes, so they will be dropped.

In [160]:
# dropping columns and checking first row
df.drop(['SpecArea', 'SpecSubArea'], axis=1, inplace=True)
df.head(1)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0


In [133]:
# checking problematic columns
df[['PropName', 'PlatName', 'DirectionPrefix', 'DirectionSuffix', 'ZipCode']].head()

Unnamed: 0,PropName,PlatName,DirectionPrefix,DirectionSuffix,ZipCode
0,,OLYMPIC TERRACE ADD,,S,98188.0
1,,VINTNER'S PLACE,NE,,
2,,LAKE UNION ADD,,N,98103.0
3,,MAPLE LEAF TO GREEN LAKE CIRCLE POR OF,NE,,98115.0
4,,BURROWS ADD,,SW,98106.0


In [161]:
# checking for duplicate columns
df.loc[:, df.columns.duplicated()].head(3)

Unnamed: 0,Major,Minor,PlatLot,PlatBlock,Major.1,Minor.1
0,638580,110,11,,638580,110
1,894677,240,24,,894677,240
2,408330,4150,9,42.0,408330,4150


In [162]:
# dropping duplicate columns and checking shape
df = df.loc[:,~df.columns.duplicated()]
df.shape

(251300, 147)
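A ```SELECT *``` join returns the join keys once per table, which is where the repeated column names come from. The toy frame below is a small illustration with made-up column contents, showing how ```columns.duplicated()``` flags the second and later occurrences of a name so that only the first copy is kept.

```python
import pandas as pd

# toy frame that mimics a SELECT * join: 'Major' and 'Minor' appear twice
toy = pd.DataFrame(
    [["638580", "110", "190000", "638580", "110"]],
    columns=["Major", "Minor", "SalePrice", "Major", "Minor"],
)

# duplicated() marks every repeat of a column name after its first occurrence
dup_mask = toy.columns.duplicated()
print(dup_mask)

# keeping only the columns that are not flagged as repeats
deduped = toy.loc[:, ~dup_mask]
print(list(deduped.columns))
```

This keeps the first copy of each name, which is safe here only because the repeated join keys hold identical values in both tables.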

In [163]:
list(df.columns)

['ExciseTaxNbr',
 'Major',
 'Minor',
 'DocumentDate',
 'SalePrice',
 'RecordingNbr',
 'Volume',
 'Page',
 'PlatNbr',
 'PlatType',
 'PlatLot',
 'PlatBlock',
 'SellerName',
 'BuyerName',
 'PropertyType',
 'PrincipalUse',
 'SaleInstrument',
 'AFForestLand',
 'AFCurrentUseLand',
 'AFNonProfitUse',
 'AFHistoricProperty',
 'SaleReason',
 'PropertyClass',
 'PropName',
 'PlatName',
 'Range',
 'Township',
 'Section',
 'QuarterSection',
 'PropType',
 'Area',
 'SubArea',
 'DistrictName',
 'LevyCode',
 'CurrentZoning',
 'HBUAsIfVacant',
 'HBUAsImproved',
 'PresentUse',
 'SqFtLot',
 'WaterSystem',
 'SewerSystem',
 'Access',
 'Topography',
 'StreetSurface',
 'RestrictiveSzShape',
 'InadequateParking',
 'PcntUnusable',
 'Unbuildable',
 'MtRainier',
 'Olympics',
 'Cascades',
 'Territorial',
 'SeattleSkyline',
 'PugetSound',
 'LakeWashington',
 'LakeSammamish',
 'SmallLakeRiverCreek',
 'OtherView',
 'WfntLocation',
 'WfntFootage',
 'WfntBank',
 'WfntPoorQuality',
 'WfntRestrictedAccess',
 'WfntAccessRi

In [164]:
# moving target variable to the front column of the df
columns = list(df.columns)
columns = [columns[4]] + columns[:4] + columns[5:]
df = df[columns]
df.head(3)
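The reordering above relies on ```'SalePrice'``` sitting at position 4. A sketch of an alternative that selects the target by name, so it keeps working if the column order changes, is shown on a toy frame (the values are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame(
    {"ExciseTaxNbr": ["2743355"], "Major": ["638580"], "SalePrice": ["190000"]}
)

# put the target first, then every other column in its existing order
target = "SalePrice"
ordered = [target] + [col for col in toy.columns if col != target]
toy = toy[ordered]
print(list(toy.columns))
```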

### Creating the Lookup Data Frame

In [15]:
# creating path to the file
files = ['EXTR_LookUp.csv']
paths = [f'../../data/raw/{file}' for file in files]

# creating list of data frames, importing data as strings
dfs = [pd.read_csv(path, dtype=str) for path in paths]

# isolating individual data frames
look = dfs[0]

#### Checking the Lookup Data Frame

Getting basic info, checking the first row, description strings, and cleaning the columns.

In [16]:
# getting info for lookup data frame
look.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1208 entries, 0 to 1207
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   LUType         1208 non-null   object
 1   LUItem         1208 non-null   object
 2   LUDescription  1208 non-null   object
dtypes: object(3)
memory usage: 28.4+ KB


In [17]:
# checking first row
look.head(1)

Unnamed: 0,LUType,LUItem,LUDescription
0,1,1,LAND ONLY ...


In [18]:
# checking strings
look.LUType.values[:2], look.LUItem.values[:2], look.LUDescription.values[:2]

(array(['1  ', '1  '], dtype=object),
 array(['1  ', '10 '], dtype=object),
 array(['LAND ONLY                                         ',
        'Land with new building                            '], dtype=object))

Cleaning strings and checking results.

In [19]:
# cleaning strings
look['LUType'] = fn.col_stripper(look, 'LUType')
look['LUItem'] = fn.col_stripper(look, 'LUItem')
look['LUDescription'] = fn.col_stripper(look, 'LUDescription')

In [20]:
# checking results
print(look.LUType.values)
print(look.LUItem.values)
look.LUDescription.values

['1' '1' '1' ... '99' '99' '99']
['1' '10' '11' ... '3' '4' '5']


array(['LAND ONLY', 'Land with new building',
       'Household, single family units', ..., 'AVERAGE', 'ABOVE AVERAGE',
       'EXCELLENT'], dtype=object)

## 2019 Data

Since we want to train our model on 2019 data, we are going to isolate the 2019 information into a new data frame and use it to define our target and predictors.

#### Light Data Cleaning

Changing date strings to datetime objects and sale price strings to floats in the main pandas data frame. Adding a 'DocumentYear' column to the main data frame.

In [21]:
# changing date strings to datetime objects
df.DocumentDate = pd.to_datetime(df.DocumentDate)

# adding a document year column
df['DocumentYear'] = df['DocumentDate'].dt.year

# converting SalePrice string to float
df['SalePrice'] = df['SalePrice'].astype('float')
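The sample row shown earlier has its date as ```07/14/2015```. Assuming every row follows that month/day/year layout, passing an explicit ```format``` to ```pd.to_datetime``` removes any ambiguity and is typically faster than letting pandas infer it. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"DocumentDate": ["07/14/2015", "12/20/2019"], "SalePrice": ["190000", "560000"]}
)

# explicit format: month/day/four-digit year (assumed from the sample row)
toy["DocumentDate"] = pd.to_datetime(toy["DocumentDate"], format="%m/%d/%Y")

# the .dt accessor extracts the year for the whole column at once
toy["DocumentYear"] = toy["DocumentDate"].dt.year

# sale prices arrive as strings and are converted to floats
toy["SalePrice"] = toy["SalePrice"].astype(float)

print(toy.dtypes)
```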

#### Creating Initial 2019 DataFrame

Creating data frame and checking basic information

In [22]:
# isolating 2019 data
df19 = df[df['DocumentYear']==2019]

In [23]:
# checking data frame shape and looking for NaNs
print(df19.shape)
df19.isna().sum()

(43838, 150)


ExciseTaxNbr                  0
Major                         0
Minor                         0
DocumentDate                  0
SalePrice                     0
RecordingNbr                  0
Volume                        0
Page                          0
PlatNbr                       0
PlatType                      0
PlatLot                       0
PlatBlock                     0
SellerName                    0
BuyerName                     0
PropertyType                  0
PrincipalUse                  0
SaleInstrument                0
AFForestLand                  0
AFCurrentUseLand              0
AFNonProfitUse                0
AFHistoricProperty            0
SaleReason                    0
PropertyClass                 0
PropName                   2434
PlatName                   5151
Range                         0
Township                      0
Section                       0
QuarterSection                0
PropType                      0
Area                          0
SubArea 

In [24]:
# checking first few rows
df19.head(3)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost,DocumentYear
5,3027422,213043,120,2019-12-20,560000.0,20191226000848,,,,,...,0,0,1989,0,0,0,0,3,0,2019
6,3002257,940652,630,2019-07-22,435000.0,20190730001339,,,,,...,0,0,1994,0,0,0,0,3,2500,2019
12,2993601,140281,20,2019-06-04,450000.0,20190614000489,,,,,...,0,0,1986,0,0,0,0,3,0,2019


#### Creating a 2019 Dataframe with Non-Zero Sale Prices 

Creating data frame and checking basic information

In [25]:
# creating a dataframe of 2019 data with non-zero sale prices and resetting index
nz19 = df19[df19['SalePrice'] != 0].reset_index()

In [26]:
# checking shape and first few rows
print(nz19.shape)
nz19.head(3)

(29944, 151)


Unnamed: 0,index,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,...,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost,DocumentYear
0,5,3027422,213043,120,2019-12-20,560000.0,20191226000848,,,,...,0,0,1989,0,0,0,0,3,0,2019
1,6,3002257,940652,630,2019-07-22,435000.0,20190730001339,,,,...,0,0,1994,0,0,0,0,3,2500,2019
2,12,2993601,140281,20,2019-06-04,450000.0,20190614000489,,,,...,0,0,1986,0,0,0,0,3,0,2019
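Resetting the index keeps the old row labels as an extra ```index``` column, which is why the shape grows from 150 to 151 columns. If those labels are not needed later, ```reset_index(drop=True)``` discards them. A sketch on a toy frame with illustrative values:

```python
import pandas as pd

# toy 2019 frame with non-contiguous row labels, as left behind by filtering
toy19 = pd.DataFrame({"SalePrice": [560000.0, 0.0, 435000.0]}, index=[5, 6, 12])

# drop=True renumbers the rows without adding an 'index' column
nonzero = toy19[toy19["SalePrice"] != 0].reset_index(drop=True)
print(nonzero)
```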


In [27]:
# checking for NaNs (this call inspects the full df; nz19.isna().sum() would check the 2019 non-zero subset)
df.isna().sum()

ExciseTaxNbr                   0
Major                          0
Minor                          0
DocumentDate                   0
SalePrice                      0
RecordingNbr                   0
Volume                         0
Page                           0
PlatNbr                        0
PlatType                       0
PlatLot                        0
PlatBlock                      0
SellerName                     0
BuyerName                      0
PropertyType                   0
PrincipalUse                   0
SaleInstrument                 0
AFForestLand                   0
AFCurrentUseLand               0
AFNonProfitUse                 0
AFHistoricProperty             0
SaleReason                     0
PropertyClass                  0
PropName                   11473
PlatName                   29223
Range                          0
Township                       0
Section                        0
QuarterSection                 0
PropType                       0
Area      

## Isolating Target and Initial Predictors

Since two of the questions we want to answer relate to the HeatSystem and SqFtEnclosedPorch features, we are going to isolate them, and our SalePrice target variable, into separate dataframes. 

### HeatSystem Dataframe

Isolating the ```'SalePrice'``` target and ```'HeatSystem'``` predictor in a new data frame. Checking the first few rows, shape, nulls, and unique values.

In [28]:
# Isolating SalePrice target and HeatSystem predictor into a data frame. 
heat_df = nz19[['SalePrice','HeatSystem']]
heat_df.head(3)

Unnamed: 0,SalePrice,HeatSystem
0,560000.0,5
1,435000.0,5
2,450000.0,1


In [29]:
# checking shape
heat_df.shape

(29944, 2)

In [30]:
# checking NaNs
heat_df.isna().sum()

SalePrice     0
HeatSystem    0
dtype: int64

In [31]:
# checking unique values
print(heat_df.HeatSystem.unique())
len(heat_df.HeatSystem.unique())

['5' '1' '4' '7' '6' '3' '0' '2' '8']


9

In [32]:
# looking up HeatSystem codes
lu_df = fn.lookup(look, 108)
lu_df

NameError: name 'df_look' is not defined
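The traceback above points at a name, ```df_look```, that the helper expects but cannot find. One way to avoid that dependency is for the helper to use only the frame it is given. The sketch below is an assumption about what ```lookup()``` is meant to do, since the ```functions``` module is not shown here: it filters the lookup frame by ```LUType```. The descriptions in the toy frame are placeholders, not real King County codes.

```python
import pandas as pd


def lookup(look_frame, lu_type):
    """Return the rows of the lookup frame whose LUType matches lu_type."""
    # LUType is stored as a string, so the requested code is converted first
    matches = look_frame[look_frame["LUType"] == str(lu_type)]
    return matches.reset_index(drop=True)


# toy lookup frame; the descriptions are hypothetical placeholders
toy_look = pd.DataFrame(
    {
        "LUType": ["1", "108", "108"],
        "LUItem": ["1", "1", "2"],
        "LUDescription": ["LAND ONLY", "placeholder A", "placeholder B"],
    }
)

toy_lu = lookup(toy_look, 108)
print(toy_lu)
```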

#### Checking ```'0'``` Values

Checking ```'0'```, since there is not a ```'0'``` lookup code. For now, assuming that no heating system information is available for ```'0'``` entries. This assumption may be revised to 'no heating system in property' after further research. 

In [None]:
# checking '0' values
zero_df = heat_df[heat_df['HeatSystem'] == '0']
print(zero_df.shape)
zero_df.head()

#### Prepping for Model Data Frame

Creating an array of the lookup code descriptions, and using it to make a list of values for use in a new column.  Will use this column in a one-hot-encoding procedure for more descriptive column names in the model dataframe.

In [None]:
column_names = lu_df.LUDescription.values
column_names

In [None]:
# putting descriptions into a list in preparation for adding a new column to the dataframe.
# assumes column_names is ordered by lookup code, so code '1' pairs with column_names[0];
# '0' has no lookup entry and is labeled 'NA'
code_to_name = {'0': 'NA'}
code_to_name.update({str(code): column_names[code - 1] for code in range(1, 9)})

# an unexpected code raises a KeyError instead of being skipped silently,
# which keeps heat_names the same length as heat_df
heat_names = [code_to_name[code] for code in heat_df['HeatSystem'].values]

In [None]:
# checking first few entries in list
heat_names[:5]

### Creating New Data Frame

Creating new data frame, dropping old ```'HeatSystem'``` column, and appending a new ```'HeatNames'``` column with more descriptive system names.

In [None]:
# creating new data frame and dropping old 'HeatSystem' column
heat_df2 = heat_df.copy().drop('HeatSystem', axis=1)

In [None]:
# appending new, more descriptive heat systems column, 'HeatNames'
heat_df2['HeatNames'] = heat_names

In [None]:
# checking first few rows
heat_df2.head(3)

#### One-Hot Encoding the HeatSystem Predictor

Instantiating the encoder, fitting the encoder to ```heat_df2[['HeatNames']]```, and transforming the data. Creating a new ```heat_ohe``` data frame, dropping the ```'HeatNames'``` column from ```heat_df2```, and concatenating it with ```heat_ohe``` to form a new ```model_df``` data frame.

In [None]:
# instantiating, fitting and transforming
ohcoder = OneHotEncoder(drop='first')

ohcoder.fit(heat_df2[['HeatNames']])

transformed = ohcoder.transform(heat_df2[['HeatNames']])

In [None]:
# creating heat_ohe data frame and checking the first few rows
# note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
heat_ohe = pd.DataFrame(transformed.todense(),
                        columns=ohcoder.get_feature_names())
heat_ohe.head(3)
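For comparison, pandas can produce the same kind of dummy columns without a separate encoder object. The sketch below swaps in ```pd.get_dummies``` for ```OneHotEncoder```; it is an alternative, not the approach used in this notebook. The heat system names and prices are made up for illustration.

```python
import pandas as pd

# toy frame; 'Type A' and 'Type B' are placeholder heat system names
toy = pd.DataFrame(
    {
        "SalePrice": [560000.0, 435000.0, 450000.0],
        "HeatNames": ["Type B", "Type B", "Type A"],
    }
)

# drop_first=True drops the first category (alphabetically) as the baseline,
# mirroring OneHotEncoder(drop='first')
dummies = pd.get_dummies(toy["HeatNames"], prefix="Heat", drop_first=True, dtype=int)

# joining the target back onto the encoded predictor columns
model_toy = pd.concat([toy[["SalePrice"]], dummies], axis=1)
print(model_toy)
```

The encoder object remains the better fit when the same encoding has to be reapplied to new data later, since it remembers the categories it was fitted on.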

In [None]:
# dropping the 'HeatNames' column from heat_df2
heat2_dropped = heat_df2.drop('HeatNames', axis=1)

#### Creating ```model_df``` Data Frame

In [None]:
# creating the model dataframe and checking first few rows
model_df = pd.concat([heat2_dropped, heat_ohe], axis=1)
model_df.head(3)

#### ```model_df``` Correlation Matrix and Heatmap

In [None]:
model_df.corr()

In [None]:
sns.heatmap(model_df.corr());
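The heatmap shows every pairwise correlation, but only the ```'SalePrice'``` row matters for the target. A sketch that pulls out that row and sorts it is below, using a toy frame with made-up values and hypothetical dummy column names.

```python
import pandas as pd

# toy model frame; x0_A and x0_B stand in for one-hot encoded columns
toy_model = pd.DataFrame(
    {
        "SalePrice": [100.0, 200.0, 300.0, 400.0],
        "x0_A": [0, 0, 1, 1],
        "x0_B": [1, 0, 1, 0],
    }
)

# correlation of each predictor with the target, strongest positive first
price_corr = (
    toy_model.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(price_corr)
```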

### EnclosedPorch Dataframe

In [None]:
# isolating SalePrice target and SqFtEnclosedPorch predictor
porch_df = nz19[['SalePrice', 'SqFtEnclosedPorch']]
porch_df.head()
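One possible next step for the porch question, sketched here as an assumption rather than as part of the analysis so far, is to turn the square footage into a yes/no flag and compare sale prices between the two groups. The values below are made up, and ```'HasEnclosedPorch'``` is a hypothetical column name.

```python
import pandas as pd

# toy frame; square footage arrives as strings, like the other database columns
toy = pd.DataFrame(
    {
        "SalePrice": [560000.0, 435000.0, 450000.0, 700000.0],
        "SqFtEnclosedPorch": ["0", "120", "0", "80"],
    }
)

# converting the square footage to a number before comparing it
toy["SqFtEnclosedPorch"] = toy["SqFtEnclosedPorch"].astype(float)

# 1 when the home has any enclosed porch area, 0 otherwise
toy["HasEnclosedPorch"] = (toy["SqFtEnclosedPorch"] > 0).astype(int)

# median sale price for homes without (0) and with (1) an enclosed porch
medians = toy.groupby("HasEnclosedPorch")["SalePrice"].median()
print(medians)
```

A difference in medians on its own would not show that enclosing a porch raises the price, since homes with porches may differ in other ways.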