### Core Questions:
Will enclosing a porch increase the sale price of a home?

Is converting a garage to a bedroom a good way to increase the sale price of a home?

Will upgrading to a forced-air heating system increase the sale price of a home?

### Core Goals:
Create model

Interpret results

Make recommendations

### Schedule:
Friday: Business Understanding and Data Importation

Saturday: Data Understanding. Add .gitignore file in exploratory directory.

Sunday: Data Prep

### Importing Libraries

In [1]:
# import modules for eda and plotting
import pandas as pd
import numpy as np
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### SQL Prelim Work

In [36]:
# importing sqlite
import sqlite3

# creating database, connection, and cursor
conn = sqlite3.connect('KingDB.db')  
cur = conn.cursor()

# creating query fetch function
def fetcha(q):
    return cur.execute(q).fetchall()
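
As an alternative to fetching raw tuples and attaching column names afterwards, `pd.read_sql_query` returns a named DataFrame in one step. The sketch below is illustrative only: it uses an in-memory SQLite database as a stand-in for `KingDB.db`, and the helper name `fetch_df`, the toy `SALES` table, and its values are assumptions, not part of the notebook.

```python
import sqlite3
import pandas as pd

# in-memory stand-in for KingDB.db (toy table and values, for illustration only)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE SALES (Major TEXT, Minor TEXT, SalePrice TEXT)")
demo_conn.executemany(
    "INSERT INTO SALES VALUES (?, ?, ?)",
    [('638580', '110', '190000'), ('894677', '240', '818161')],
)
demo_conn.commit()

# hypothetical helper: run a query and return a DataFrame with column names
def fetch_df(q, connection):
    return pd.read_sql_query(q, connection)

demo_df = fetch_df("SELECT * FROM SALES", demo_conn)
print(demo_df.columns.tolist())
```

With this approach the column names come from the query result itself, so there is no separate step that reads `cur.description`.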

### SQL Dataframe

In [37]:
# getting table names
q = """SELECT name FROM sqlite_master 
WHERE type IN ('table','view') 
AND name NOT LIKE 'sqlite_%'
ORDER BY 1"""
fetcha(q)

[('PARC',), ('RESB',), ('SALES',)]

In [38]:
q = """SELECT * FROM SALES AS SA
       JOIN PARC AS PA
       ON SA.Major = PA.Major
       AND SA.Minor = PA.Minor
       JOIN RESB AS RE
       ON PA.Major = RE.Major
       AND PA.Minor = RE.Minor
       """
df = pd.DataFrame(fetcha(q))
df.columns = [i[0] for i in cur.description]
df.columns

Index(['ExciseTaxNbr', 'Major', 'Minor', 'DocumentDate', 'SalePrice',
       'RecordingNbr', 'Volume', 'Page', 'PlatNbr', 'PlatType',
       ...
       'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'YrBuilt',
       'YrRenovated', 'PcntComplete', 'Obsolescence', 'PcntNetCondition',
       'Condition', 'AddnlCost'],
      dtype='object', length=156)

In [39]:
df.head(2)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0
1,2841697,894677,240,12/21/2016,818161,20161228000896,,,,,...,0,0,0,2016,0,0,0,0,3,0


In [40]:
df.shape

(251300, 156)

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251300 entries, 0 to 251299
Columns: 156 entries, ExciseTaxNbr to AddnlCost
dtypes: object(156)
memory usage: 299.1+ MB
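
Every column comes back as `object`, so numeric and date fields will need converting before any modeling. A minimal sketch of one way to do that, using a toy frame whose values mirror the `head` output above; the choice of which columns to convert and the date format are assumptions based on that output.

```python
import pandas as pd

# toy frame with string-typed columns, mirroring the rows shown by df.head(2)
toy = pd.DataFrame({
    'DocumentDate': ['07/14/2015', '12/21/2016'],
    'SalePrice': ['190000', '818161'],
    'YrBuilt': ['1963', '2016'],
})

# assumed list of columns that should be numeric
numeric_cols = ['SalePrice', 'YrBuilt']

# errors='coerce' turns unparseable strings into NaN instead of raising
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors='coerce')

# assumed date format: month/day/year, as in the sample rows
toy['DocumentDate'] = pd.to_datetime(toy['DocumentDate'], format='%m/%d/%Y')

print(toy.dtypes)
```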


In [42]:
# dropping unnamed column
df = df.drop('Unnamed: 0', axis=1)

In [43]:
df.shape

(251300, 155)

In [44]:
np.array(df.isnull().sum())

array([     0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,  11473,  29223,      0,      0,      0,      0,
            0,      0,      0,      0,      0, 251300, 251300,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
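
A positional array of null counts is hard to map back to column names. One way to make it readable, sketched on a toy frame (the values are made up for illustration), is to keep the counts as a labeled Series and show only the columns that actually contain nulls.

```python
import numpy as np
import pandas as pd

# toy frame: one complete column, one partly null, one entirely null
toy = pd.DataFrame({
    'Major': ['638580', '894677', '198920'],
    'ZipCode': ['98001', np.nan, np.nan],
    'SpecArea': [np.nan, np.nan, np.nan],
})

# null count per column, keeping the column labels
null_counts = toy.isnull().sum()

# keep only columns with at least one null, largest first
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# columns that are null in every row are candidates for dropping
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()

print(null_counts)
print(all_null_cols)
```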
      

### Pandas Dataframe

### Importing the Data

In [11]:
# create paths to the files
files = ['EXTR_RPSale.csv', 'EXTR_ResBldg.csv', 'EXTR_Parcel.csv']
paths = [f'../../data/raw/{file}' for file in files]

# create list of data frames importing data as strings
dfs = [pd.read_csv(path, dtype=str) for path in paths]

# isolate individual data frames
df_sale = dfs[0]
df_resb = dfs[1]
df_parc = dfs[2]
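
Reading everything with `dtype=str` keeps identifier-like columns exactly as they appear in the file. The sketch below shows the difference on a small in-memory CSV; the sample values, including the leading zeros, are hypothetical and are not taken from the King County files.

```python
import io
import pandas as pd

# hypothetical CSV text with zero-padded identifiers
csv_text = "Major,Minor\n012345,0015\n"

# default parsing infers integers and drops the leading zeros
as_inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps the text exactly as written
as_strings = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(as_inferred.loc[0, 'Minor'], as_strings.loc[0, 'Minor'])
```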

In [12]:
# checking shape of the dataframes
df_sale.shape, df_resb.shape, df_parc.shape 

((351067, 24), (181510, 50), (205199, 82))

In [13]:
# checking columns and nulls
df_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351067 entries, 0 to 351066
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   ExciseTaxNbr        351067 non-null  object
 1   Major               351067 non-null  object
 2   Minor               351067 non-null  object
 3   DocumentDate        351067 non-null  object
 4   SalePrice           351067 non-null  object
 5   RecordingNbr        351067 non-null  object
 6   Volume              351067 non-null  object
 7   Page                351067 non-null  object
 8   PlatNbr             351067 non-null  object
 9   PlatType            351067 non-null  object
 10  PlatLot             351067 non-null  object
 11  PlatBlock           351067 non-null  object
 12  SellerName          351067 non-null  object
 13  BuyerName           351067 non-null  object
 14  PropertyType        351067 non-null  object
 15  PrincipalUse        351067 non-null  object
 16  Sa

In [14]:
# checking columns and nulls
df_resb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181510 entries, 0 to 181509
Data columns (total 50 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Major               181510 non-null  object
 1   Minor               181510 non-null  object
 2   BldgNbr             181510 non-null  object
 3   NbrLivingUnits      181510 non-null  object
 4   Address             181510 non-null  object
 5   BuildingNumber      181510 non-null  object
 6   Fraction            181510 non-null  object
 7   DirectionPrefix     181146 non-null  object
 8   StreetName          181510 non-null  object
 9   StreetType          181510 non-null  object
 10  DirectionSuffix     181146 non-null  object
 11  ZipCode             154594 non-null  object
 12  Stories             181510 non-null  object
 13  BldgGrade           181510 non-null  object
 14  BldgGradeVar        181510 non-null  object
 15  SqFt1stFloor        181510 non-null  object
 16  Sq

In [15]:
# checking columns and nulls
df_parc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205199 entries, 0 to 205198
Data columns (total 82 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Unnamed: 0              205199 non-null  object
 1   Major                   205199 non-null  object
 2   Minor                   205199 non-null  object
 3   PropName                196088 non-null  object
 4   PlatName                176654 non-null  object
 5   PlatLot                 205199 non-null  object
 6   PlatBlock               205199 non-null  object
 7   Range                   205199 non-null  object
 8   Township                205199 non-null  object
 9   Section                 205199 non-null  object
 10  QuarterSection          205199 non-null  object
 11  PropType                205199 non-null  object
 12  Area                    205193 non-null  object
 13  SubArea                 205193 non-null  object
 14  SpecArea                4864 non-nul

In [16]:
# checking first couple of rows
df_parc.head(2)

Unnamed: 0.1,Unnamed: 0,Major,Minor,PropName,PlatName,PlatLot,PlatBlock,Range,Township,Section,...,SeismicHazard,LandslideHazard,SteepSlopeHazard,Stream,Wetland,SpeciesOfConcern,SensitiveAreaTract,WaterProblems,TranspConcurrency,OtherProblems
0,0,807841,410,,SUMMER RIDGE DIV NO. 02,41,,6,25,22,...,N,N,N,N,N,N,N,N,N,N
1,2,755080,15,,SANDER'S TO GILMAN PK & SALMON BAY,3,1.0,3,25,11,...,N,N,N,N,N,N,N,N,N,N


In [17]:
# dropping unnamed column
df_parc = df_parc.drop('Unnamed: 0', axis=1)

In [18]:
# checking dropped column
df_parc.head(2)

Unnamed: 0,Major,Minor,PropName,PlatName,PlatLot,PlatBlock,Range,Township,Section,QuarterSection,...,SeismicHazard,LandslideHazard,SteepSlopeHazard,Stream,Wetland,SpeciesOfConcern,SensitiveAreaTract,WaterProblems,TranspConcurrency,OtherProblems
0,807841,410,,SUMMER RIDGE DIV NO. 02,41,,6,25,22,SW,...,N,N,N,N,N,N,N,N,N,N
1,755080,15,,SANDER'S TO GILMAN PK & SALMON BAY,3,1.0,3,25,11,NW,...,N,N,N,N,N,N,N,N,N,N


In [19]:
# merging sales with parcels on the parcel keys (pd.merge defaults to an inner join)
_1st_merge = pd.merge(df_sale, df_parc, on=['Major', 'Minor'])

In [20]:
_1st_merge.head(2)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,SeismicHazard,LandslideHazard,SteepSlopeHazard,Stream,Wetland,SpeciesOfConcern,SensitiveAreaTract,WaterProblems,TranspConcurrency,OtherProblems
0,2857854,198920,1430,03/28/2017,0,20170410000541,,,,,...,N,N,N,N,N,N,N,N,N,N
1,2843191,198920,1430,12/27/2016,0,20170105001317,,,,,...,N,N,N,N,N,N,N,N,N,N


In [21]:
# merging the result with residential buildings on the same keys
_2nd_merge = pd.merge(_1st_merge, df_resb, on=['Major', 'Minor'])

In [22]:
_2nd_merge.head(2)

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,2743355,638580,110,07/14/2015,190000,20150715002686,,,,,...,1,0,1,1963,0,0,0,0,3,0
1,2743356,638580,110,07/14/2015,0,20150715002687,,,,,...,1,0,1,1963,0,0,0,0,3,0


In [23]:
_2nd_merge.shape

(251300, 151)
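
Because `pd.merge` performs an inner join by default, rows without a matching `Major`/`Minor` pair are dropped without warning, and duplicate keys on the right-hand side multiply rows. A hedged sketch of how the `indicator` and `validate` parameters can surface both effects, on toy frames (the values and the `PropType` codes are illustrative); whether the real parcel keys are unique is not shown in this notebook.

```python
import pandas as pd

keys = ['Major', 'Minor']

# toy sales: two sales of one parcel, plus one sale with no parcel record
sales = pd.DataFrame({
    'Major': ['638580', '638580', '198920'],
    'Minor': ['110', '110', '1430'],
    'SalePrice': ['190000', '0', '0'],
})

# toy parcels: one matching parcel, one parcel that was never sold
parcels = pd.DataFrame({
    'Major': ['638580', '755080'],
    'Minor': ['110', '15'],
    'PropType': ['R', 'R'],
})

# count repeated keys on the right-hand side before merging
dup_parcels = parcels.duplicated(subset=keys).sum()

# a left join with indicator=True labels each row 'both' or 'left_only';
# validate='many_to_one' raises MergeError if parcel keys are not unique
checked = pd.merge(sales, parcels, on=keys, how='left',
                   indicator=True, validate='many_to_one')
match_counts = checked['_merge'].value_counts()

print(dup_parcels)
print(match_counts)
```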

In [24]:
_2nd_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 251300 entries, 0 to 251299
Columns: 151 entries, ExciseTaxNbr to AddnlCost
dtypes: object(151)
memory usage: 291.4+ MB


In [25]:
# giving the merged frame a shorter name (this is a reference, not a copy)
df2 = _2nd_merge

In [26]:
df2.shape

(251300, 151)
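
The SQL frame has 155 columns and the pandas frame has 151, with the same row count. The gap of four is consistent with the join keys: `SELECT *` across three tables returns `Major` and `Minor` three times each, while `pd.merge(..., on=[...])` keeps one copy of each key (24 + 81 + 50 - 4 = 151). This assumes the SQL tables have the same columns as the CSV files, which the notebook does not confirm. A minimal sketch for finding and removing repeated column names, on a toy one-row frame:

```python
import pandas as pd

# toy frame imitating a SELECT * join result with repeated key columns
sql_like = pd.DataFrame(
    [['638580', '110', '190000', '638580', '110', '1963']],
    columns=['Major', 'Minor', 'SalePrice', 'Major', 'Minor', 'YrBuilt'],
)

# names that appear more than once (second and later occurrences)
dup_names = sql_like.columns[sql_like.columns.duplicated()].tolist()

# keep only the first occurrence of each column name
deduped = sql_like.loc[:, ~sql_like.columns.duplicated()]

print(dup_names)
print(deduped.shape)
```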