# Project 2: predicting sales prices with the Ames Iowa Housing dataset

## Problem Statement

ABC Real Estate Agents needs a way to predict property sale prices so that it can give prospective sellers a reliable estimate of their property's value before they decide to engage the company to find a buyer.

Currently, 50% of customers express dissatisfaction when the price offered by the buyers the company finds turns out to be much lower than the estimates given by the company's agents. This sometimes wastes the agents' time, because sellers change their minds about selling their properties. While such disappointments can be avoided by giving lower price estimates, there is industry pressure to give high valuations, as these give sellers the impression that the agent is skilled.

By using data science to produce a more reliable property price estimate, we can potentially increase customer satisfaction while still quoting prices that are competitive in the industry.

In [1]:
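# Wildcard import from the project's helper module. pd, np and the helper
# functions used below (ordinal_scale, vif_feature_select, subplot_dist)
# are expected to come from here.
from python_imports import *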

In [43]:
# Load data.
import_path = r'..\datasets\clean_train.csv'
# Dataset contains NA strings that should not be considered null values.
data = pd.read_csv(import_path, keep_default_na=False, na_values=[''])

import_path = r'..\datasets\clean_test.csv'
# Dataset contains NA strings that should not be considered null values.
data2 = pd.read_csv(import_path, keep_default_na=False, na_values=[''])
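
The effect of `keep_default_na=False, na_values=['']` can be seen on a small in-memory CSV. This is a minimal sketch with made-up rows (`csv_text` is illustrative, not taken from the dataset): by default pandas parses the string `NA` as a null, whereas with these arguments only empty fields become null and `NA` survives as an ordinary category label.

```python
import io

import pandas as pd

# Hypothetical three-row sample: one literal 'NA', one empty field, one normal value.
csv_text = "alley,lot_area\nNA,8450\n,9600\nGrvl,11250\n"

# Default parsing: both 'NA' and the empty field are read as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# Parsing as in the cell above: only the empty field is read as NaN.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])

default_nulls = int(default['alley'].isnull().sum())
kept_nulls = int(kept['alley'].isnull().sum())
print(default_nulls, kept_nulls)
print(kept['alley'].tolist())
```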

In [44]:
# Separate the target variable from the dataset.
df_target = data['saleprice']

In [45]:
features = [x for x in data.columns if x != 'saleprice']
# Join train and test datasets in preparation for preprocessing.
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
df_features = pd.concat([data[features], data2])

In [46]:
# Confirm the join succeeded. With an outer join, any column present in only one
# of the DataFrames would be filled with nulls, so a total of 0 means the columns lined up.
df_features.isnull().sum().sum()

0
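
Why a null count of zero is a useful check can be illustrated on toy frames. This is a sketch under assumed data (`train`, `test_ok` and `test_bad` are hypothetical): when the column names match, an outer concatenation introduces no nulls, but a single mismatched name creates an extra column and fills the gaps with nulls.

```python
import pandas as pd

# Hypothetical frames; 'test_bad' has a differently spelled column name.
train = pd.DataFrame({'lot_area': [8450, 9600], 'street': ['Pave', 'Grvl']})
test_ok = pd.DataFrame({'lot_area': [11250], 'street': ['Pave']})
test_bad = pd.DataFrame({'lot_area': [11250], 'Street': ['Pave']})

# pd.concat uses join='outer' by default, so it keeps the union of all columns.
good = pd.concat([train, test_ok])
bad = pd.concat([train, test_bad])

good_nulls = int(good.isnull().sum().sum())
bad_nulls = int(bad.isnull().sum().sum())
print(good.shape, good_nulls)
print(bad.shape, bad_nulls)
```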

In [47]:
df_features.shape

(2901, 77)

In [48]:
df_features.columns

Index(['ms_subclass', 'ms_zoning', 'lot_frontage', 'lot_area', 'street',
       'alley', 'lot_shape', 'land_contour', 'utilities', 'lot_config',
       'land_slope', 'neighborhood', 'condition_1', 'condition_2', 'bldg_type',
       'house_style', 'overall_qual', 'overall_cond', 'year_built',
       'year_remod/add', 'roof_style', 'roof_matl', 'exterior_1st',
       'exterior_2nd', 'mas_vnr_type', 'mas_vnr_area', 'exter_qual',
       'exter_cond', 'foundation', 'bsmt_qual', 'bsmt_cond', 'bsmt_exposure',
       'bsmtfin_type_1', 'bsmtfin_sf_1', 'bsmtfin_type_2', 'bsmtfin_sf_2',
       'bsmt_unf_sf', 'total_bsmt_sf', 'heating', 'heating_qc', 'central_air',
       'electrical', '1st_flr_sf', '2nd_flr_sf', 'low_qual_fin_sf',
       'gr_liv_area', 'bsmt_full_bath', 'bsmt_half_bath', 'full_bath',
       'half_bath', 'bedroom_abvgr', 'kitchen_abvgr', 'kitchen_qual',
       'totrms_abvgrd', 'functional', 'fireplaces', 'fireplace_qu',
       'garage_type', 'garage_finish', 'garage_cars', 'garage

In [49]:
# Convert ordinal values to numbers.
ordinal_vars = ['Lot Shape', 'Utilities', 'Land Slope', 'Exter Qual', 'Exter Cond', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Electrical', 'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence']
ordinal_vars = [x.lower().replace(' ', '_') for x in ordinal_vars]

In [50]:
ordinal_values = [
    ['IR3', 'IR2', 'IR1', 'Reg'],
    ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],
    ['Sev', 'Mod', 'Gtl'],
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'No', 'Mn', 'Av', 'Gd'],
    ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
    ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'Unf', 'RFn', 'Fin'],
    ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ['N', 'P', 'Y'],
    ['NA', 'Fa', 'TA', 'Gd', 'Ex'],
    ['NA', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']
]

In [51]:
ordinals = zip(ordinal_vars, ordinal_values)

In [52]:
ordinal_map = dict(ordinals)
ordinal_map

{'lot_shape': ['IR3', 'IR2', 'IR1', 'Reg'],
 'utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],
 'land_slope': ['Sev', 'Mod', 'Gtl'],
 'exter_qual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'exter_cond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'bsmt_qual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'bsmt_cond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'bsmt_exposure': ['NA', 'No', 'Mn', 'Av', 'Gd'],
 'bsmtfin_type_1': ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
 'bsmtfin_type_2': ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
 'heating_qc': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
 'kitchen_qual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
 'fireplace_qu': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'garage_finish': ['NA', 'Unf', 'RFn', 'Fin'],
 'garage_qual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'garage_cond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'paved_drive': ['N', 'P', 'Y'],
 'p

In [53]:
df_ord = ordinal_scale(df_features, ordinal_map)
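
`ordinal_scale` is a project helper whose implementation is not shown here. As a rough sketch of what such a step might do, the hypothetical function below (`ordinal_scale_sketch`, not the project's code) replaces each category with its position in the ordered list from the map, so that a worse level gets a smaller number.

```python
import pandas as pd


def ordinal_scale_sketch(df, ordinal_map):
    """Hypothetical stand-in: encode each ordinal column as the rank of its level."""
    out = df.copy()
    for col, levels in ordinal_map.items():
        if col not in out.columns:
            continue  # skip map entries that have no matching column
        codes = {level: rank for rank, level in enumerate(levels)}
        # Values that are not listed in 'levels' become NaN.
        out[col] = out[col].map(codes)
    return out


# Toy data using two of the orderings defined above.
toy = pd.DataFrame({'exter_qual': ['TA', 'Gd', 'Ex'],
                    'paved_drive': ['Y', 'N', 'P']})
toy_map = {'exter_qual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
           'paved_drive': ['N', 'P', 'Y']}

scaled = ordinal_scale_sketch(toy, toy_map)
print(scaled)
```

Mapping unknown values to NaN rather than raising is a design choice of this sketch; it makes unexpected categories show up in a later null check.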

In [54]:
nominal = ['MS SubClass', 'MS Zoning', 'Street', 'alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1', 'Exterior 2', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type', 'Misc Feature', 'Sale Type', 'Sale Condition']
nominal = [x.lower().replace(' ', '_') for x in nominal]
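
The normalisation above only changes spelling; it does not check that every name is an actual column. The column listing earlier shows `exterior_1st` and `exterior_2nd`, so normalised names such as `exterior_1` would not match. A name that is missing from the DataFrame is silently ignored by an `x not in nominal` filter but raises a `KeyError` when passed to `pd.get_dummies(columns=...)`. A minimal sketch of a sanity check, using a hypothetical empty frame (`toy`) and a shortened name list (`raw_names`):

```python
import pandas as pd

# Hypothetical frame that only carries column names, as in the listing above.
toy = pd.DataFrame(columns=['ms_zoning', 'exterior_1st', 'exterior_2nd', 'lot_area'])

# Shortened, illustrative list of nominal variable names.
raw_names = ['MS Zoning', 'Exterior 1', 'Exterior 2']
normalised = [name.lower().replace(' ', '_') for name in raw_names]

# Names that do not correspond to any column in the frame.
missing = [name for name in normalised if name not in toy.columns]
print(missing)
```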

In [55]:
# Keep every column that is not in the nominal list.
non_nominal = [x for x in df_features.columns if x not in nominal]
df_non_nominal = df_features[non_nominal]

In [56]:
df_non_nominal.shape

(2901, 57)

In [59]:
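# Identify features with a Variance Inflation Factor above 5.0, which indicates strong collinearity with the other features.
# With drop_list=True the helper returns the list of dropped features rather than a filtered DataFrame.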
dropped_features = vif_feature_select(df_non_nominal, drop_list=True)

returning list of dropped features.
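
`vif_feature_select` is a project helper and its code is not shown. The sketch below is one possible way to do VIF-based selection, not the helper itself: the hypothetical `vif_select_sketch` repeatedly drops the feature with the highest VIF until none exceeds the threshold. It relies on the identity that the VIFs are the diagonal of the inverse correlation matrix, and it assumes numeric, non-constant and not perfectly collinear columns (otherwise the matrix cannot be inverted).

```python
import numpy as np
import pandas as pd


def vif_scores(df):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


def vif_select_sketch(df, threshold=5.0):
    """Hypothetical stand-in: drop the worst feature until all VIFs <= threshold."""
    kept = df.copy()
    dropped = []
    while kept.shape[1] > 1:
        scores = vif_scores(kept)
        worst = scores.idxmax()
        if scores[worst] <= threshold:
            break
        dropped.append(worst)
        kept = kept.drop(columns=worst)
    return kept, dropped


# Toy data: 'a_copy' is almost identical to 'a', while 'b' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({'a': a, 'b': b,
                    'a_copy': a + rng.normal(scale=0.01, size=200)})

kept, dropped = vif_select_sketch(toy)
print(dropped, list(kept.columns))
```

Dropping one feature at a time and recomputing matters because removing a single collinear feature usually lowers the VIF of its partners.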


In [60]:
len(dropped_features)

91

In [18]:
df_vif.columns

NameError: name 'df_vif' is not defined

In [None]:
df_features.shape

In [None]:
df_features.columns

In [None]:
df_dum = pd.get_dummies(data=df_features, columns=nominal, drop_first=True)
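
What `drop_first=True` does can be seen on a two-level toy column (the frame `toy` is hypothetical): one dummy column is created per category except the first in sorted order, which becomes the implicit baseline. This avoids a set of dummy columns that always sum to one.

```python
import pandas as pd

# Hypothetical sample with one nominal and one numeric column.
toy = pd.DataFrame({'street': ['Pave', 'Grvl', 'Pave'],
                    'lot_area': [8450, 9600, 11250]})

# 'Grvl' sorts first, so it is dropped and only 'street_Pave' remains.
dummies = pd.get_dummies(data=toy, columns=['street'], drop_first=True)
print(list(dummies.columns))
print(dummies['street_Pave'].astype(int).tolist())
```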

In [None]:
df_dum.shape

In [None]:
# Eliminate features with a Variance Inflation Factor of more than 5.0, as these are highly correlated with each other.
df_dum = vif_feature_select(df_dum)

In [None]:
df_dum.columns

In [None]:
# The distribution plots give a rough idea of how each feature is distributed.
# subplot_dist(df_ord)