# King County Project

## Business Problem

A client in King County, WA wants to advise homeowners on **home improvement projects** that will **add to the sale value of their homes**.

**This advice should be based on data from the most recent full calendar year, 2019**.

### Core Questions:
Enclosing a porch will increase the sale price of a home.

Converting a garage to a bedroom is a good way to increase the sale price of a home.

Upgrading to a forced-air heating system will increase the sale price of a home.

### Core Goals:
Create model

Interpret results

Make recommendations

## Schedule

### Friday 2/19: 
#### Business Understanding & Preliminary EDA
* Repo Creation
* Data Importation
* Database Creation
* Created initial data frame.

### Saturday 2/20: 
#### Data Understanding & EDA
* Added ```.gitignore``` file in exploratory directory to exclude ```KingDB.db``` file.
* Created ```lookup()``` function.
* Created 2019 data frame.
* Isolated target and each of two initial predictors into dataframes.

### Sunday 2/21: 
#### Data Prep & Base Model

# Initial EDA Work

#### Importing Libraries and Adjusting Settings

In [1]:
# import modules for eda and plotting
import pandas as pd
import numpy as np
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns

# setting plots to inline
%matplotlib inline

# setting the max number of rows displayed
pd.options.display.max_rows = 200

## Pandas Dataframe

#### Importing the Data & Creating Initial Data Frames

Importing data and creating various data frames. Checking ```.shape```, nulls, and looking for problematic columns I may need to deal with later.

In [2]:
# creating paths to the files
files = ['EXTR_RPSale.csv', 'EXTR_ResBldg.csv', 'EXTR_Parcel.csv', 'EXTR_LookUp.csv']
paths = [f'../../data/raw/{file}' for file in files]

# creating list of data frames, importing data as strings
dfs = [pd.read_csv(path, dtype=str) for path in paths]

# isolating individual data frames
df_sale = dfs[0]
df_resb = dfs[1]
df_parc = dfs[2]
df_look = dfs[3]

In [3]:
# checking shape of the dataframes
df_sale.shape, df_resb.shape, df_parc.shape, df_look.shape

((351067, 24), (181510, 50), (205199, 82), (1208, 3))
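A minimal sketch of what ```dtype=str``` changes at import time. The two-row sample below is made up, and whether the raw King County files actually pad their ```Major``` and ```Minor``` codes with zeros is an assumption, not something shown above.

```python
import io
import pandas as pd

# Hypothetical two-row extract (values are made up for illustration)
csv_text = "Major,Minor,SalePrice\n000123,0045,190000\n638580,0110,0\n"

# Reading everything as strings keeps the codes exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Letting pandas infer types turns the same codes into integers
inferred = pd.read_csv(io.StringIO(csv_text))

print(as_str["Major"].tolist())
print(inferred["Major"].tolist())
```

Keeping the key columns as text means they compare the same way in every table, at the cost of converting numeric columns such as ```SalePrice``` by hand later.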

#### Checking NaNs

In [4]:
# checking columns and nulls
df_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351067 entries, 0 to 351066
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   ExciseTaxNbr        351067 non-null  object
 1   Major               351067 non-null  object
 2   Minor               351067 non-null  object
 3   DocumentDate        351067 non-null  object
 4   SalePrice           351067 non-null  object
 5   RecordingNbr        351067 non-null  object
 6   Volume              351067 non-null  object
 7   Page                351067 non-null  object
 8   PlatNbr             351067 non-null  object
 9   PlatType            351067 non-null  object
 10  PlatLot             351067 non-null  object
 11  PlatBlock           351067 non-null  object
 12  SellerName          351067 non-null  object
 13  BuyerName           351067 non-null  object
 14  PropertyType        351067 non-null  object
 15  PrincipalUse        351067 non-null  object
 16  Sa

In [5]:
# checking columns and nulls
df_resb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181510 entries, 0 to 181509
Data columns (total 50 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Major               181510 non-null  object
 1   Minor               181510 non-null  object
 2   BldgNbr             181510 non-null  object
 3   NbrLivingUnits      181510 non-null  object
 4   Address             181510 non-null  object
 5   BuildingNumber      181510 non-null  object
 6   Fraction            181510 non-null  object
 7   DirectionPrefix     181146 non-null  object
 8   StreetName          181510 non-null  object
 9   StreetType          181510 non-null  object
 10  DirectionSuffix     181146 non-null  object
 11  ZipCode             154594 non-null  object
 12  Stories             181510 non-null  object
 13  BldgGrade           181510 non-null  object
 14  BldgGradeVar        181510 non-null  object
 15  SqFt1stFloor        181510 non-null  object
 16  Sq

In [6]:
# checking columns and nulls
df_parc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205199 entries, 0 to 205198
Data columns (total 82 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Unnamed: 0              205199 non-null  object
 1   Major                   205199 non-null  object
 2   Minor                   205199 non-null  object
 3   PropName                196088 non-null  object
 4   PlatName                176654 non-null  object
 5   PlatLot                 205199 non-null  object
 6   PlatBlock               205199 non-null  object
 7   Range                   205199 non-null  object
 8   Township                205199 non-null  object
 9   Section                 205199 non-null  object
 10  QuarterSection          205199 non-null  object
 11  PropType                205199 non-null  object
 12  Area                    205193 non-null  object
 13  SubArea                 205193 non-null  object
 14  SpecArea                4864 non-nul

#### Checking Problematic Columns

Checking columns and dropping an extraneous column from the data frame.

In [7]:
# checking problematic columns
df_prob = df_parc[['Unnamed: 0', 'PropName', 'PlatName', 'Area', 'SubArea', 'SpecArea', 'SpecSubArea' ]]
df_prob.head()

Unnamed: 0.1,Unnamed: 0,PropName,PlatName,Area,SubArea,SpecArea,SpecSubArea
0,0,,SUMMER RIDGE DIV NO. 02,35,2,,
1,2,,SANDER'S TO GILMAN PK & SALMON BAY,19,1,,
2,3,,VASHON GARDENS ADD,100,3,,
3,6,,,1,1,,
4,7,,ELDORADO NORTH,37,2,,


In [8]:
# dropping extraneous column
df_parc = df_parc.drop('Unnamed: 0', axis=1)

In [9]:
# checking dropped column
df_parc.head(1)

Unnamed: 0,Major,Minor,PropName,PlatName,PlatLot,PlatBlock,Range,Township,Section,QuarterSection,...,SeismicHazard,LandslideHazard,SteepSlopeHazard,Stream,Wetland,SpeciesOfConcern,SensitiveAreaTract,WaterProblems,TranspConcurrency,OtherProblems
0,807841,410,,SUMMER RIDGE DIV NO. 02,41,,6,25,22,SW,...,N,N,N,N,N,N,N,N,N,N


#### Checking the Lookup Data Frame

In [10]:
# getting info for lookup data frame
df_look.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1208 entries, 0 to 1207
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   LUType         1208 non-null   object
 1   LUItem         1208 non-null   object
 2   LUDescription  1208 non-null   object
dtypes: object(3)
memory usage: 28.4+ KB


In [11]:
# checking first row
df_look.head(1)

Unnamed: 0,LUType,LUItem,LUDescription
0,1,1,LAND ONLY ...


#### Merging the Data Frames into a Main Data Frame.

Merged three of the initial data frames into a main data frame and checked ```.info()``` and ```.head()```.

In [12]:
# doing a chained merge of the three data frames on the 'Major' and 'Minor' columns
df = pd.merge(pd.merge(df_sale, df_parc, on=['Major', 'Minor']), df_resb, on=['Major', 'Minor'])

In [13]:
# checking info and first two rows
print(df.info())
df.head(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 251300 entries, 0 to 251299
Columns: 151 entries, ExciseTaxNbr to AddnlCost
dtypes: object(151)
memory usage: 291.4+ MB
None


Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0
1,2743356,638580,110,07/14/2015,0,20150715002687,,,,,...,1,0,1,1963,0,0,0,0,3,0
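A small sketch of how the chained merge behaves, using toy stand-ins for the three data frames (all values are made up). It assumes the default inner join, so parcels without a sale are dropped, repeat sales of the same parcel are all kept, and any non-key column that appears in both tables picks up the ```_x``` and ```_y``` suffixes.

```python
import pandas as pd

# Toy stand-ins for df_sale, df_parc and df_resb (values are made up)
sale = pd.DataFrame({"Major": ["1", "1", "2"], "Minor": ["10", "10", "20"],
                     "PlatLot": ["a", "a", "b"],
                     "SalePrice": ["190000", "0", "450000"]})
parc = pd.DataFrame({"Major": ["1", "2", "3"], "Minor": ["10", "20", "30"],
                     "PlatLot": ["A", "B", "C"]})
resb = pd.DataFrame({"Major": ["1", "2"], "Minor": ["10", "20"],
                     "YrBuilt": ["1963", "1986"]})

keys = ["Major", "Minor"]

# Same chained merge as above, written with the DataFrame.merge method
merged = sale.merge(parc, on=keys).merge(resb, on=keys)

print(merged.columns.tolist())
print(merged.shape)
```

Parcel ```3```/```30``` has no sale and disappears, while the two sales of parcel ```1```/```10``` both survive, much like the two rows for the same parcel in the output above.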


## SQL Dataframe

### SQL Prelim Work

#### Created Database
Earlier, wrote up a [DB Creator](DB_Creator.ipynb) notebook and ran it to create a SQL database from the raw ```.csv``` files for use later on in the project.

#### Creating DataFrame From the Database

Importing ```sqlite3```, connecting to the database, and creating a cursor object. Also creating a ```fetch()``` function and joining the database tables into a secondary main data frame. Lastly, checking basic information about the data frame.

In [14]:
# importing sqlite
import sqlite3

# creating database, connection, and cursor
conn = sqlite3.connect('KingDB.db')  
cur = conn.cursor()

# creating query fetch function
def fetch(q):
    """Executes an SQL query and returns all result rows."""
    return cur.execute(q).fetchall()

In [16]:
# checking the table names
q = """SELECT name FROM sqlite_master 
WHERE type IN ('table','view') 
AND name NOT LIKE 'sqlite_%'
ORDER BY 1"""
fetch(q)

[('PARC',), ('RESB',), ('SALES',)]
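A hedged variant of the fetch helper, run against an in-memory database so nothing on disk is touched. The name ```fetch_rows``` and the toy tables are hypothetical. The idea is to have each call use its own cursor and hand back the column names together with the rows, rather than relying on a shared cursor's state.

```python
import sqlite3

# In-memory database with hypothetical toy tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (Major TEXT, Minor TEXT, SalePrice TEXT)")
conn.execute("CREATE TABLE PARC (Major TEXT, Minor TEXT, PropName TEXT)")

def fetch_rows(connection, query, params=()):
    """Run a query on its own cursor and return (column_names, rows)."""
    cursor = connection.execute(query, params)
    names = [col[0] for col in cursor.description]
    return names, cursor.fetchall()

# Listing the tables, passing the type as a bound parameter
names, rows = fetch_rows(
    conn,
    "SELECT name FROM sqlite_master WHERE type = ? ORDER BY 1",
    ("table",),
)
print(names, rows)
```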

#### Joining The Tables to Create a Data Frame

In [17]:
# joining tables to create dataframe and appending column names
q = """SELECT * FROM SALES AS SA
       JOIN PARC AS PA
       ON SA.Major = PA.Major
       AND SA.Minor = PA.Minor
       JOIN RESB AS RE
       ON PA.Major = RE.Major
       AND PA.Minor = RE.Minor
       """
df2 = pd.DataFrame(fetch(q))
df2.columns = [i[0] for i in cur.description]

In [18]:
# checking info, shape and first row
print(df2.info())
df2.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251300 entries, 0 to 251299
Columns: 156 entries, ExciseTaxNbr to AddnlCost
dtypes: object(156)
memory usage: 299.1+ MB
None


Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0
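The SQL data frame has 156 columns against 151 for the pandas merge. One of the extras is the ```Unnamed: 0``` column that was dropped on the pandas side; the other four are the ```Major``` and ```Minor``` keys repeated once per join. A possible way to avoid the repeats, sketched here on toy tables with made-up rows, is SQLite's ```USING``` clause, which keeps one copy of each join column when ```SELECT *``` is expanded.

```python
import sqlite3

# Toy tables standing in for SALES and PARC (rows are made up)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SALES (Major TEXT, Minor TEXT, SalePrice TEXT);
    CREATE TABLE PARC  (Major TEXT, Minor TEXT, PropName TEXT);
    INSERT INTO SALES VALUES ('638580', '110', '190000');
    INSERT INTO PARC  VALUES ('638580', '110', NULL);
""")

# ON keeps the key columns from both tables
on_cursor = conn.execute(
    "SELECT * FROM SALES AS SA JOIN PARC AS PA "
    "ON SA.Major = PA.Major AND SA.Minor = PA.Minor"
)
on_cols = [col[0] for col in on_cursor.description]

# USING keeps a single copy of each key column
using_cursor = conn.execute("SELECT * FROM SALES JOIN PARC USING (Major, Minor)")
using_cols = [col[0] for col in using_cursor.description]

print(on_cols)
print(using_cols)
```

Duplicate column names are easy to miss in pandas, because selecting ```df2['Major']``` then returns a two-column data frame instead of a Series.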


In [19]:
# dropping unnamed column found in initial pandas df above
df2 = df2.drop('Unnamed: 0', axis=1)

In [20]:
# checking shape and nulls
print(df2.shape)
df2.isna().sum()

(251300, 155)


ExciseTaxNbr                   0
Major                          0
Minor                          0
DocumentDate                   0
SalePrice                      0
RecordingNbr                   0
Volume                         0
Page                           0
PlatNbr                        0
PlatType                       0
PlatLot                        0
PlatBlock                      0
SellerName                     0
BuyerName                      0
PropertyType                   0
PrincipalUse                   0
SaleInstrument                 0
AFForestLand                   0
AFCurrentUseLand               0
AFNonProfitUse                 0
AFHistoricProperty             0
SaleReason                     0
PropertyClass                  0
Major                          0
Minor                          0
PropName                   11473
PlatName                   29223
PlatLot                        0
PlatBlock                      0
Range                          0
Township  

## 2019 Data

Since we want to train our model on 2019 data, we are going to isolate the 2019 information into a new data frame and use it to define our target and predictors.

#### Light Data Cleaning

Creating a ```lookup()``` function to make exploring features a little easier. In the main pandas data frame, converting date strings to datetime objects and sale-price strings to floats, then adding a 'DocumentYear' column. Also stripping blank spaces from the lookup types in the lookup data frame.

In [21]:
# creating a lookup function
def lookup(code):
    """Returns a data frame with rows containing the specified lookup code."""
    return df_look[df_look['LUType'] == str(code)]

In [22]:
# changing date strings to datetime objects
df['DocumentDate'] = pd.to_datetime(df['DocumentDate'])

# adding a document year column
df['DocumentYear'] = df['DocumentDate'].dt.year

# converting SalePrice string to float
df['SalePrice'] = df['SalePrice'].astype('float')

# stripping blank spaces from lookup strings
df_look['LUType'] = df_look['LUType'].str.strip()
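The cleaning steps can be tried end to end on a tiny frame. This is only a sketch on made-up rows: it assumes every date follows the ```MM/DD/YYYY``` pattern seen in the head output, which may not hold for the full file, and it folds in the 2019 and non-zero price filters that are applied further down.

```python
import pandas as pd

# Made-up rows shaped like the DocumentDate and SalePrice columns
toy = pd.DataFrame({
    "DocumentDate": ["07/14/2015", "12/05/2019", "06/04/2019"],
    "SalePrice": ["190000", "0", "450000"],
})

# An explicit format fails loudly on any date that does not match
toy["DocumentDate"] = pd.to_datetime(toy["DocumentDate"], format="%m/%d/%Y")

# The .dt accessor extracts the year without a Python-level lambda
toy["DocumentYear"] = toy["DocumentDate"].dt.year

# Price strings to floats
toy["SalePrice"] = toy["SalePrice"].astype(float)

# 2019 sales with a non-zero price
toy19 = toy[(toy["DocumentYear"] == 2019) & (toy["SalePrice"] != 0)]
print(toy19)
```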

#### Creating Initial 2019 DataFrame

Creating the data frame and checking its shape, null counts, and first few rows.

In [23]:
# isolating 2019 data
df19 = df[df['DocumentYear']==2019]

In [24]:
# checking data frame shape and looking for NaNs
print(df19.shape)
df19.isna().sum()

(43838, 152)


ExciseTaxNbr                  0
Major                         0
Minor                         0
DocumentDate                  0
SalePrice                     0
RecordingNbr                  0
Volume                        0
Page                          0
PlatNbr                       0
PlatType                      0
PlatLot_x                     0
PlatBlock_x                   0
SellerName                    0
BuyerName                     0
PropertyType                  0
PrincipalUse                  0
SaleInstrument                0
AFForestLand                  0
AFCurrentUseLand              0
AFNonProfitUse                0
AFHistoricProperty            0
SaleReason                    0
PropertyClass                 0
PropName                   2434
PlatName                   5151
PlatLot_y                     0
PlatBlock_y                   0
Range                         0
Township                      0
Section                       0
QuarterSection                0
PropType
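With around 150 columns, the full ```isna().sum()``` listing is mostly zeros. A small helper (hypothetical name ```null_report```) can show only the columns that have nulls. The sketch below also illustrates one caveat: because the files were read as strings, a whitespace-only cell is not counted as null. Whether the blank-looking ```Volume``` and ```Page``` values in this data are such strings is a guess; the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Volume" holds whitespace-only strings, which isna() does not count
toy = pd.DataFrame({
    "Major": ["1", "2", "3"],
    "PropName": [np.nan, "X", np.nan],
    "PlatName": ["A", np.nan, "C"],
    "Volume": ["   ", "   ", "12"],
})

def null_report(frame):
    """Return null counts for only the columns that contain at least one NaN."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Report on the frame as read
raw_report = null_report(toy)

# Report after treating empty or whitespace-only strings as missing
blank_as_nan = toy.replace(r"^\s*$", np.nan, regex=True)
clean_report = null_report(blank_as_nan)

print(raw_report)
print(clean_report)
```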

In [25]:
# checking first few rows
df19.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost,DocumentYear
4,3024468,894677,240,2019-12-05,0.0,20191209000162,,,,,...,0,0,2016,0,0,0,0,3,0,2019
5,3024469,894677,240,2019-12-05,0.0,20191209000163,,,,,...,0,0,2016,0,0,0,0,3,0,2019
11,3027422,213043,120,2019-12-20,560000.0,20191226000848,,,,,...,0,0,1989,0,0,0,0,3,0,2019
12,3002257,940652,630,2019-07-22,435000.0,20190730001339,,,,,...,0,0,1994,0,0,0,0,3,2500,2019
22,2993601,140281,20,2019-06-04,450000.0,20190614000489,,,,,...,0,0,1986,0,0,0,0,3,0,2019


#### Creating a 2019 Dataframe with Non-Zero Sale Prices 

Creating the data frame and checking its shape and first few rows.

In [26]:
# creating a dataframe of 2019 data with non-zero sale prices
nz19 = df19[df19['SalePrice'] != 0]

In [27]:
# checking shape and first few rows
print(nz19.shape)
nz19.head()

(29944, 152)


Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost,DocumentYear
11,3027422,213043,120,2019-12-20,560000.0,20191226000848,,,,,...,0,0,1989,0,0,0,0,3,0,2019
12,3002257,940652,630,2019-07-22,435000.0,20190730001339,,,,,...,0,0,1994,0,0,0,0,3,2500,2019
22,2993601,140281,20,2019-06-04,450000.0,20190614000489,,,,,...,0,0,1986,0,0,0,0,3,0,2019
29,3002772,175070,50,2019-07-26,812000.0,20190801000556,,,,,...,0,0,1947,0,0,0,0,5,0,2019
39,3014535,287860,1145,2019-09-23,749950.0,20191009001321,,,,,...,0,0,1906,0,0,0,0,4,0,2019


In [28]:
# checking for NaNs in the main data frame
df.isna().sum()

ExciseTaxNbr                   0
Major                          0
Minor                          0
DocumentDate                   0
SalePrice                      0
RecordingNbr                   0
Volume                         0
Page                           0
PlatNbr                        0
PlatType                       0
PlatLot_x                      0
PlatBlock_x                    0
SellerName                     0
BuyerName                      0
PropertyType                   0
PrincipalUse                   0
SaleInstrument                 0
AFForestLand                   0
AFCurrentUseLand               0
AFNonProfitUse                 0
AFHistoricProperty             0
SaleReason                     0
PropertyClass                  0
PropName                   11473
PlatName                   29223
PlatLot_y                      0
PlatBlock_y                    0
Range                          0
Township                       0
Section                        0
QuarterSec

### Isolating Target and Initial Predictors

Since two of the questions we want to answer relate to the HeatSystem and SqFtEnclosedPorch features, we are going to isolate each of them into a separate dataframe along with our SalePrice target variable.

#### HeatSystem Dataframe

In [29]:
# isolating SalePrice target and HeatSystem predictor
p_dfh = nz19[['SalePrice', 'HeatSystem']].copy()
p_dfh.head()

Unnamed: 0,SalePrice,HeatSystem
11,560000.0,5
12,435000.0,5
22,450000.0,1
29,812000.0,5
39,749950.0,5


In [30]:
# looking up HeatSystem codes
lookup(108)

Unnamed: 0,LUType,LUItem,LUDescription
243,108,1,Floor-Wall ...
244,108,2,Gravity ...
245,108,3,Radiant ...
246,108,4,Elec BB ...
247,108,5,Forced Air ...
248,108,6,Hot Water ...
249,108,7,Heat Pump ...
250,108,8,Other ...
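One way to make the HeatSystem codes readable is to map them to their descriptions. This is a sketch on a toy lookup frame: the codes and labels are copied from the ```lookup(108)``` output, while the trailing padding on the descriptions and the new ```HeatSystemDesc``` column name are assumptions.

```python
import pandas as pd

# Three of the 108 rows from the lookup output; padding is simulated
look = pd.DataFrame({
    "LUType": ["108"] * 3,
    "LUItem": ["1", "5", "7"],
    "LUDescription": ["Floor-Wall   ", "Forced Air   ", "Heat Pump   "],
})

# Series indexed by code, with the padding stripped from each label
heat_labels = (
    look[look["LUType"] == "108"]
    .set_index("LUItem")["LUDescription"]
    .str.strip()
)

# Toy version of p_dfh; codes are strings because the data was read with dtype=str
toy = pd.DataFrame({"SalePrice": [560000.0, 450000.0], "HeatSystem": ["5", "1"]})
toy["HeatSystemDesc"] = toy["HeatSystem"].map(heat_labels)
print(toy)
```

Any code missing from the lookup would map to NaN, which makes unexpected values easy to spot.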


#### EnclosedPorch Dataframe

In [31]:
# isolating SalePrice target and SqFtEnclosedPorch predictor
p_dfp = nz19[['SalePrice', 'SqFtEnclosedPorch']].copy()
p_dfp.head()

Unnamed: 0,SalePrice,SqFtEnclosedPorch
11,560000.0,0
12,435000.0,0
22,450000.0,0
29,812000.0,0
39,749950.0,0
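Because the data was imported as strings, SqFtEnclosedPorch needs converting before it can be compared or modeled. A possible next step, sketched on toy rows, is a has-porch flag and a quick median comparison. The porch sizes below are made up and say nothing about the real relationship; ```HasEnclosedPorch``` is a hypothetical column name.

```python
import pandas as pd

# Sale prices from the head output above; porch sizes are made up
toy = pd.DataFrame({
    "SalePrice": [560000.0, 435000.0, 812000.0, 749950.0],
    "SqFtEnclosedPorch": ["0", "0", "120", "80"],
})

# Square footage from string to number
toy["SqFtEnclosedPorch"] = pd.to_numeric(toy["SqFtEnclosedPorch"])

# True when the home has any enclosed porch area
toy["HasEnclosedPorch"] = toy["SqFtEnclosedPorch"] > 0

# Median sale price with and without an enclosed porch
medians = toy.groupby("HasEnclosedPorch")["SalePrice"].median()
print(medians)
```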
