## House Price Prediction
#### Based on Historic Values of Houses

Building a decision tree regressor to predict house prices.

Later we will try prediction using a Random Forest regressor.

In [112]:
# Importing Libraries
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)


## List of All Columns in the data

In [113]:
print(home_data.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

## Creating target object y

In [114]:
y = home_data.SalePrice
print (y.head())

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


## Selecting Data for Modeling
The dataset has too many variables to use all at once, so I'll start by picking a few using my intuition.

In [115]:

# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

X = home_data[features]
print (X.head())

   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  


## The steps to building and using a model are:

#### Define: I will be using a decision tree.
#### Fit: Capture patterns from the provided data by building the model on the training data.
#### Predict: Predict the house price on unseen data, i.e. the validation data.
#### Evaluate: Determine the accuracy of the model.

In [116]:

# Splitting the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fitting the model
iowa_model.fit(train_X, train_y)

# Making predictions and calculating the mean absolute error
# (scikit-learn's argument order is mean_absolute_error(y_true, y_pred))
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: ", val_mae)


Validation MAE:  29652.931506849316
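
For reference, the mean absolute error is the average of |actual - predicted| over all rows. The sketch below computes it by hand for three houses. The actual prices are the first three `SalePrice` values shown earlier; the predicted prices are made-up numbers chosen only for illustration, not output from the model.

```python
# First three SalePrice values from the data
actual = [208500, 181500, 223500]
# Hypothetical predictions, invented for this illustration
predicted = [200500, 185500, 229500]

# Absolute error per house, then the mean of those errors
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [8000, 4000, 6000]
print(mae)         # 6000.0
```

An MAE of 6000 here would mean the predictions are off by $6,000 on average, in either direction.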


The Mean Absolute Error comes to $29,653, which is very high for a house price prediction.
Now I will try to reduce the error by checking the MAE at various sizes of the decision tree.

### Method to calculate the MAE for a given maximum leaf count (max_leaf_nodes)

In [117]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

## Step 1: Compare Different Tree Sizes
Writing a loop that tries each of the values [5, 25, 50, 100, 250, 500] for *max_leaf_nodes* and keeps the one with the lowest validation MAE.


In [118]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Looping over candidate_max_leaf_nodes to find the tree size with the lowest validation MAE
best_mae = float("inf")
best_tree_size = None
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, mae))
    if mae < best_mae:
        best_mae = mae
        best_tree_size = max_leaf_nodes

# best_tree_size is now one of 5, 25, 50, 100, 250 or 500
print ( " The best tree size Comes out to be %d "%(best_tree_size))

Max leaf nodes: 5  		 Mean Absolute Error:  35044
Max leaf nodes: 25  		 Mean Absolute Error:  29016
Max leaf nodes: 50  		 Mean Absolute Error:  27405
Max leaf nodes: 100  		 Mean Absolute Error:  27282
Max leaf nodes: 250  		 Mean Absolute Error:  27893
Max leaf nodes: 500  		 Mean Absolute Error:  29454
 The best tree size Comes out to be 100 
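
The same selection can be written more compactly by storing the scores in a dictionary and taking the key with the smallest value. This is a minimal sketch that hard-codes the rounded MAE values printed above so it runs on its own; the dictionary name `scores` is introduced here for illustration only.

```python
# Rounded validation MAE for each candidate, copied from the output above.
# In the notebook this could instead be built as:
#   scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidate_max_leaf_nodes}
scores = {5: 35044, 25: 29016, 50: 27405, 100: 27282, 250: 27893, 500: 29454}

# min() over the keys, ranked by their MAE, returns the best tree size
best_tree_size = min(scores, key=scores.get)

print(best_tree_size)          # 100
print(scores[best_tree_size])  # 27282
```

The error falls as the tree grows from 5 to 100 leaves (less underfitting) and rises again at 250 and 500 leaves (more overfitting), which is why the minimum sits in the middle of the list.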


## Step 2: Fit Model Using All Data
Now that we know the best tree size, we can deploy this model in practice.
Fitting it on all of the data while keeping that tree size should generally make it more accurate.
That is, we no longer need to hold out the validation data.

In [119]:
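# Using the best tree size found above
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fitting the final model on all of the data
final_model.fit(X, y)

# final_model has now seen the validation rows, so it cannot be scored on them fairly.
# Reporting instead the validation MAE of a tree of the same size fit on the training split.
mae = get_mae(best_tree_size, train_X, val_X, train_y, val_y)

print(" Validation MAE at the best tree size is $%d " % (mae))

 Validation MAE at the best tree size is $27282 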


# Random Forest Regressor

Now I will apply another model, a random forest regressor.


In [120]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print("The Mean Absolute Error with Random Forest is %d" %(mean_absolute_error(val_y, forest_preds)))

The Mean Absolute Error with Random Forest is 22762
