# Predicting House Prices in Ames, Iowa: A Data Science Project
### Gurpal Singh
### October 2019

## Introduction

This notebook explores the Ames housing data set, which was collected in Ames, Iowa. At first glance, the training set contains 1460 instances and 81 columns, counting the `Id` column and the `SalePrice` target. The notebook walks through exploratory data analysis, data cleaning, feature engineering, and model implementation.

### The Problem at Hand

To fully grasp the problem, we need to put ourselves in the shoes of the client. In this case, that is anyone evaluating the price of a home in Ames, Iowa, whether a buyer or a seller. When we look at the features, we will need to determine which of them are relevant to that price. The goal is to predict the `SalePrice` of a home given the features in the Ames data set.

## Loading Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import warnings  # used below to silence library warnings
#import xgboost as xgb
#import lightgbm as lgb
from scipy.stats import skew
from scipy import stats
from scipy.stats import pearsonr  # the scipy.stats.stats namespace is deprecated
from scipy.stats import norm
from collections import Counter
from sklearn.linear_model import LinearRegression, LassoCV, Ridge, LassoLarsCV, ElasticNetCV
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
warnings.filterwarnings('ignore')
%matplotlib inline

## Loading Data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
# Checking the number of instances and features 
shapetrain = train.shape
print('The size of training data is: ', shapetrain)
shapetest = test.shape
print('The size of testing data is: ', shapetest)

The size of training data is:  (1460, 81)
The size of testing data is:  (1459, 80)
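
The two shapes differ by one column. A quick way to see which column the test set lacks is a set difference on the column names. The sketch below uses two tiny stand-in frames (`toy_train` and `toy_test`, both hypothetical) rather than the real CSV files; on the actual data the difference is presumably the target, `SalePrice`.

```python
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the full Ames data)
toy_train = pd.DataFrame({'Id': [1, 2],
                          'LotArea': [8450, 9600],
                          'SalePrice': [208500, 181500]})
toy_test = pd.DataFrame({'Id': [3, 4],
                         'LotArea': [11250, 9550]})

# Columns present in one frame but not in the other
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
only_in_test = toy_test.columns.difference(toy_train.columns).tolist()

print('Only in train:', only_in_train)
print('Only in test: ', only_in_test)
```

Running the same two `difference` calls on `train.columns` and `test.columns` would confirm the guess for the real data.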


In [4]:
train.head()

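| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |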


## Checking for Duplicated Data

In [5]:
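# Function to check for duplicate values in a column of a DataFrame
def check_duplicates(column):
    dups = 0
    # duplicated() flags every repeat of a value after its first occurrence
    for position, is_dup in enumerate(column.duplicated()):
        if is_dup:  # truthiness test; avoids a fragile `is True` identity check
            print('Duplicate Detected at index: ', position)
            dups += 1
    print('Duplicates detected: ', dups)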

# Checking for duplicates
check_duplicates(train['Id'])
check_duplicates(test['Id'])

Duplicates detected:  0
Duplicates detected:  0
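
The same check can be written without an explicit loop, since `Series.duplicated()` already returns a boolean mask. Below is a minimal sketch on a toy Series; the `ids` values are made up for illustration and are not taken from the Ames data.

```python
import pandas as pd

# Toy Id column with one repeated value (hypothetical data)
ids = pd.Series([1, 2, 3, 2, 5], name='Id')

mask = ids.duplicated()                    # True for each repeat after the first occurrence
n_dups = int(mask.sum())                   # number of duplicate entries
dup_positions = mask[mask].index.tolist()  # index labels of the duplicates

print('Duplicates detected: ', n_dups)
print('At index: ', dup_positions)
print('All unique? ', ids.is_unique)
```

For a pure yes/no answer, `train['Id'].is_unique` alone would be enough.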


The Id column is only a row identifier and is not needed for prediction, so we save it and then drop it from both data sets.

In [6]:
# Saving Id
train_ID = train['Id']
test_ID = test['Id']

# Dropping the 'Id' Column
train.drop('Id', axis = 1, inplace = True)
test.drop('Id', axis = 1, inplace = True)

# Checking the number of instances and features after the drop
shapetrain = train.shape
print('The size of training data after dropping the ID column: ', shapetrain)
shapetest = test.shape
print('The size of testing data after dropping the ID column: ', shapetest)

The size of training data after dropping the ID column:  (1460, 80)
The size of testing data after dropping the ID column:  (1459, 79)
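
Saving a column and then dropping it can also be done in one step with `DataFrame.pop`, which removes the column in place and returns it. This is an alternative to the cell above, not what the notebook itself does; `toy` is a hypothetical frame used only for illustration.

```python
import pandas as pd

# Toy frame standing in for train (hypothetical values)
toy = pd.DataFrame({'Id': [1, 2, 3], 'LotArea': [8450, 9600, 11250]})

# pop() removes 'Id' from the frame and returns it as a Series
toy_ID = toy.pop('Id')

print('Shape after pop: ', toy.shape)
print('Saved ids: ', toy_ID.tolist())
```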


In [7]:
# Taking a look at the training data
train.head()

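| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |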


## Target Variable Analysis

In this case, the target variable is 'SalePrice'.

In [8]:
# Description of the data
train['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
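
Two things can be read off this summary with simple arithmetic: the mean sits above the median, and the maximum lies far beyond the upper quartile. Both hint at a right-skewed distribution. The sketch below uses only the numbers printed above; the 1.5 × IQR fence is a common rule of thumb chosen here for illustration, not something the notebook itself applies.

```python
# Summary values copied from the describe() output above
mean, median = 180921.195890, 163000.0
q1, q3, max_price = 129975.0, 214000.0, 755000.0

iqr = q3 - q1                 # interquartile range
upper_fence = q3 + 1.5 * iqr  # rule-of-thumb threshold for unusually high prices

print('Mean minus median: ', round(mean - median, 2))
print('IQR: ', iqr)
print('Upper fence: ', upper_fence)
print('Max exceeds fence: ', max_price > upper_fence)
```

A gap of this kind between the maximum and the fence is a reason to look at the distribution of `SalePrice` directly before fitting any model.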

## Exploratory Data Analysis