# Question 2

### Consider the House Price Prediction dataset (Dataset Link). Suppose you need to predict the Sale Price of a house and for the task you want to use a neural network with 3 hidden layers. Write a report on how you would modify your above neural network for such task with proper reasoning.

<b>For the given data set we need to predict the 'SalePrice' of a house, which is a continuous, real-valued attribute, so this is a regression task. To adapt the previous code we need to make the following changes:</b>
<br>

* Data preprocessing: The data contains both numerical and categorical attributes, so we need to distinguish between the two and preprocess each accordingly. Missing (NaN) values in numerical attributes are replaced with a chosen value (the column median in the code that follows), and categorical attributes are encoded with a one-hot encoding scheme, which lets us treat them as numerical attributes. One-hot encoding separates the effect of the different values of a categorical attribute: only the value that applies to a particular data point contributes to the weight matrix, and all other values have zero effect.
<br>

* There is no need to encode the class label, as this is a regression problem.
<br>

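* Change in activation function: In the classification problem we used the softmax function in the last layer, together with encoded class labels. Softmax produces class probabilities, which is not suitable for a real-valued target, so the output layer now uses the same activation function as the internal layers. Because the normalized 'SalePrice' takes negative values, this activation must be able to produce negative outputs; a linear (identity) output is the usual choice for regression.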
<br>

* As specified in the problem, we need 3 hidden layers, which can be configured easily in the previous code.
<br>

* Using Mean Squared Error as the score: Predictions are real-valued and will be close to, but not exactly equal to, the actual values, so an exact-match accuracy is not meaningful. We therefore use the Mean Squared Error to evaluate the model.
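
The changes listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the network from the previous question: the function names (`init_params`, `forward`, `mse`), the layer sizes, the ReLU hidden activation and the linear output layer are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    # Hidden-layer activation (assumed here; the previous network may use another).
    return np.maximum(z, 0.0)

def init_params(layer_sizes, seed=0):
    # One (weights, bias) pair per pair of consecutive layers.
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params):
    # Apply the hidden layers, then a linear (identity) output layer
    # so that the prediction can be any real number.
    a = X
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W_out, b_out = params[-1]
    return a @ W_out + b_out

def mse(y_true, y_pred):
    # Mean Squared Error between targets and predictions.
    return float(np.mean((y_true - y_pred) ** 2))

# Toy data: 5 samples with 4 features, one real-valued target each.
X = np.random.default_rng(1).normal(size=(5, 4))
y = np.zeros((5, 1))

# 4 inputs -> 3 hidden layers (16, 8, 4 units) -> 1 output unit.
params = init_params([4, 16, 8, 4, 1])
y_hat = forward(X, params)
print(y_hat.shape, mse(y, y_hat))
```

The single output unit replaces the softmax layer, and `mse` plays the role that accuracy played in the classification setting.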


# PREPROCESSING DATA

## Importing libraries

In [85]:
import numpy as np
import pandas as pd
import math as mt
import sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import itertools


## Utility Functions


In [86]:
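def make_one_hot(df, name):
    # Replace categorical column `name` with one 0/1 indicator column per category.
    # dtype=int keeps the indicators numeric (newer pandas returns booleans by default).
    dummies = pd.get_dummies(df[name], dtype=int)
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
    
def make_na_median(df, name):
    # Fill missing values of a numerical column with that column's median.
    df[name] = df[name].fillna(df[name].median())
    
def normalize(df, name):
    # Standardize a numerical column to zero mean and unit standard deviation.
    df[name] = (df[name] - df[name].mean()) / df[name].std()
    
def preprocessing(df):
    # Modifies df in place: drops Id, one-hot encodes the categorical columns,
    # then imputes and standardizes the numerical columns (SalePrice included, if present).
    df.drop("Id", axis = 1, inplace = True)
    catColumns = df.select_dtypes(include = ['object'])
    numColumns = df.select_dtypes(exclude = ['object'])
    for col in catColumns:
        make_one_hot(df, col)
    for col in numColumns:
        make_na_median(df, col)
        normalize(df, col)
    # Move the target to the last column so the features can be sliced as columns[:-1].
    if "SalePrice" in numColumns:
        classColData = df["SalePrice"]
        df.drop("SalePrice",axis = 1, inplace = True)
        df["SalePrice"] = classColData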
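
To make the one-hot step concrete, the following sketch applies the same naming scheme (`column-value`) to a small hand-made frame. The toy values are made up for illustration and are not rows of the real data set.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (illustrative values).
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# One indicator column per category, named "<column>-<value>".
dummies = pd.get_dummies(toy["Street"], dtype=int)
for value in dummies.columns:
    toy["Street-{}".format(value)] = dummies[value]
toy = toy.drop("Street", axis=1)

print(toy)
```

Each row has a 1 only in the indicator of its own category, so the weights attached to the other indicators do not affect that row's prediction.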


### Paths to the data sets

In [87]:
trainFilePath = "./../input_data/house-prices-advanced-regression-techniques/train.csv"
testFilePath = "./../input_data/house-prices-advanced-regression-techniques/test.csv"
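# Note: sample_submission.csv holds Kaggle's placeholder predictions, not the true sale prices of the test set.
y_testPath = "./../input_data/house-prices-advanced-regression-techniques/sample_submission.csv"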

### Reading csv to pandas data frame

In [88]:
trainDataSet = pd.read_csv(trainFilePath)
testDataSet = pd.read_csv(testFilePath)
y_test = pd.read_csv(y_testPath)

In [89]:
trainDataSet.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [90]:
trainDataSet

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000


In [91]:
testDataSet

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
5,1466,60,RL,75.0,10000,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,4,2010,WD,Normal
6,1467,20,RL,,7980,Pave,,IR1,Lvl,AllPub,...,0,0,,GdPrv,Shed,500,3,2010,WD,Normal
7,1468,60,RL,63.0,8402,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,5,2010,WD,Normal
8,1469,20,RL,85.0,10176,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2010,WD,Normal
9,1470,20,RL,70.0,8400,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,4,2010,WD,Normal


In [92]:
preprocessing(trainDataSet)
preprocessing(testDataSet)
preprocessing(y_test)
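
Because the training and test frames are one-hot encoded independently, a category that occurs in only one of them produces an indicator column that the other frame lacks, and the two feature matrices can end up with different widths. One possible remedy, shown here on toy frames with hypothetical names (`train_features`, `test_features`), is to reindex the test columns against the training columns:

```python
import pandas as pd

# Toy encoded frames: the test frame never saw the "Grvl" category.
train_features = pd.DataFrame({
    "LotArea": [0.1, -0.2],
    "Street-Grvl": [0, 1],
    "Street-Pave": [1, 0],
})
test_features = pd.DataFrame({
    "LotArea": [0.3, 0.0],
    "Street-Pave": [1, 1],
})

# Use the training columns as the reference: missing indicators are
# filled with 0, and columns unknown to the training set are dropped.
aligned_test = test_features.reindex(columns=train_features.columns, fill_value=0)

print(aligned_test)
```

After this step both frames have the same columns in the same order, which is what a fixed-size input layer requires.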

In [93]:
trainDataSet

,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType-New,SaleType-Oth,SaleType-WD,SaleCondition-Abnorml,SaleCondition-AdjLand,SaleCondition-Alloca,SaleCondition-Family,SaleCondition-Normal,SaleCondition-Partial,SalePrice
0,0.073350,-0.220799,-0.207071,0.651256,-0.517023,1.050634,0.878367,0.513928,0.575228,-0.288554,...,0,0,1,0,0,0,0,1,0,0.347154
1,-0.872264,0.460162,-0.091855,-0.071812,2.178881,0.156680,-0.429430,-0.570555,1.171591,-0.288554,...,0,0,1,0,0,0,0,1,0,0.007286
2,0.073350,-0.084607,0.073455,0.651256,-0.517023,0.984415,0.829930,0.325803,0.092875,-0.288554,...,0,0,1,0,0,0,0,1,0,0.535970
3,0.309753,-0.447787,-0.096864,0.651256,-0.517023,-1.862993,-0.720051,-0.570555,-0.499103,-0.288554,...,0,0,1,1,0,0,0,0,0,-0.515105
4,0.073350,0.641752,0.375020,1.374324,-0.517023,0.951306,0.733056,1.366021,0.463410,-0.288554,...,0,0,1,0,0,0,0,1,0,0.869545
5,-0.163054,0.687149,0.360493,-0.794879,-0.517023,0.719540,0.490872,-0.570555,0.632233,-0.288554,...,0,0,1,0,0,0,0,1,0,-0.477341
6,-0.872264,0.233175,-0.043364,1.374324,-0.517023,1.083743,0.975241,0.458597,2.028862,-0.288554,...,0,0,1,0,0,0,0,1,0,1.587045
7,0.073350,-0.039210,-0.013508,0.651256,0.381612,0.057352,-0.574741,0.757383,0.910682,-0.090190,...,0,0,1,0,0,0,0,1,0,0.240159
8,-0.163054,-0.856363,-0.440508,0.651256,-0.517023,-1.333243,-1.688790,-0.570555,-0.972685,-0.288554,...,0,0,1,1,0,0,0,0,0,-0.642241
9,3.146594,-0.901761,-0.310264,-0.794879,0.381612,-1.068368,-1.688790,-0.570555,0.893142,-0.288554,...,0,0,1,0,0,0,0,1,0,-0.792034


In [94]:
Class = "SalePrice"
columns = list(trainDataSet.columns)
X_train = trainDataSet[columns[:-1]].values
y_train = trainDataSet[[Class]].values
X_test = testDataSet.values
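
Since `preprocessing` also standardizes `SalePrice`, the network is trained on, and predicts, standardized values rather than prices. A small sketch of the inverse transformation follows; the constants are the mean and standard deviation reported by `trainDataSet.describe()`, and the helper name `to_price` is hypothetical. In practice these two statistics would have to be stored before calling `preprocessing`, because that function overwrites the column in place.

```python
import numpy as np

# Mean and standard deviation of SalePrice, taken from trainDataSet.describe().
SALE_PRICE_MEAN = 180921.19589
SALE_PRICE_STD = 79442.502883

def to_price(y_normalized):
    # Invert the standardization (x - mean) / std applied by normalize().
    return np.asarray(y_normalized) * SALE_PRICE_STD + SALE_PRICE_MEAN

# Standardized SalePrice of the first two training rows shown in the output.
prices = to_price([0.347154, 0.007286])
print(prices)  # approximately 208500 and 181500, the original prices of those rows
```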