# <center>Surprise Housing - Property Price Prediction</center>

## Problem Statement
A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them at a higher price.<BR>
The company is looking at prospective properties to buy to enter the market.<BR><BR>

## Goal
<UL>
    <LI>Build a regression model with regularisation to predict property prices</LI>
    <LI>Identify the variables that are significant in predicting price</LI>
    <LI>Measure how accurately the price can be predicted from the identified independent variables</LI>
</UL>
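
As a rough illustration of what regularisation does (this is not the model built in this notebook), the sketch below fits ridge regression in closed form on synthetic data. The function name `ridge_fit`, the penalty values for `alpha`, and the data itself are assumptions made purely for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha*I)^-1 X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (illustrative only): 3 features with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, 0.0, -2.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

coef_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 reduces to ordinary least squares
coef_ridge = ridge_fit(X, y, alpha=100.0)  # a larger alpha shrinks the coefficients

print("OLS  :", np.round(coef_ols, 2))
print("Ridge:", np.round(coef_ridge, 2))
```

The penalty trades a little bias for lower variance, which is why the ridge coefficients come out smaller in magnitude than the least-squares ones.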

### Load Libraries & Data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.set_option("display.max_columns", 100)

In [None]:
housingDF = pd.read_csv("train.csv")
housingDF.head()

In [None]:
#housingDF.set_index('Id', inplace=True)
housingDF.shape

## Data Cleaning

In [None]:
housingDF.describe(percentiles=[0.05,0.25,0.5,0.75,0.95, 0.99])

There are a few outliers in the data set.

In [None]:
housingDF.drop(columns=housingDF.describe().columns).describe()

The data set has **1460** observations in total.<BR>
The **Utilities** column has 2 categorical values, and **1459** of the **1460** observations share the value **AllPub**, so this column will not help prediction and will be dropped.<BR>
Similarly, a few other columns have very low variance (95 to 100% of rows contain the same categorical value), such as **Street, Condition2, RoofMatl, Heating**, etc. These columns will be analysed separately after data cleanup.
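
One way to spot such near-constant columns is to compute the share of rows taken by the most frequent value in each column. The sketch below is a minimal version on toy data; the helper name `dominant_share` and the 95% cut-off in the last step are assumptions for illustration, not code from this notebook.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the percentage of rows taken by the most frequent value."""
    shares = {col: df[col].value_counts(normalize=True).iloc[0] * 100 for col in df.columns}
    return pd.Series(shares).sort_values(ascending=False)

# Toy data standing in for the categorical columns (values are made up).
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 39 + ["NoSeWa"],
    "LotShape": ["Reg"] * 24 + ["IR1"] * 16,
})

shares = dominant_share(toy)
print(shares)

# Columns where a single value covers at least 95% of rows.
low_variance_cols = shares[shares >= 95].index.tolist()
print(low_variance_cols)
```

On the real data the same call could be applied to the object-typed columns of `housingDF`.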

In [None]:
housingDF.drop(labels=['Utilities','Id'], inplace=True, axis=1)

In [None]:
housingDF.info()

In [None]:
missingData = housingDF.isnull().sum() / housingDF.index.size * 100
missingData[missingData > 0]

In [None]:
#Drop columns with more than 40% missing data.
housingDF.drop(labels=['Alley','FireplaceQu','PoolQC','Fence','MiscFeature'], inplace=True, axis=1)

Apart from **LotFrontage**, the remaining columns have 0.5 to 6% missing data. Rows with missing values in these columns will be dropped rather than imputed, since imputation could introduce biased values.

In [None]:
housingDF.dropna(subset=missingData[(missingData > 0) & (missingData < 6)].index, inplace=True)

In [None]:
housingDF.head()

In [None]:
#Re-check missing data after dropping rows
missingData = housingDF.isnull().sum() / housingDF.index.size * 100
print(missingData[missingData > 0])
housingDF[missingData[missingData > 0].index].describe(percentiles=[0.5,0.9,0.99])

In [None]:
plt.scatter(housingDF['SalePrice'], housingDF['LotFrontage'])

We cannot drop the **LotFrontage** column, as it can be useful for prediction.<BR>
Imputing **18%** of the data is also not a good idea, so rows with a missing value in this column will be dropped as well.

In [None]:
housingDF.dropna(subset=['LotFrontage'], inplace=True)

In [None]:
missingData = housingDF.isnull().sum() / housingDF.index.size * 100
missingData[missingData > 0]

**No Missing Data in data set**

In [None]:
#Check for duplicate rows
housingDF[housingDF.duplicated()]

In [None]:
housingDF.shape

Of the 1460 original observations, 1094 remain, so about 25% of the data has been removed.
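
The share of data removed follows directly from the two row counts quoted above. A quick check of the arithmetic, using only those two numbers:

```python
rows_before = 1460  # observations in train.csv, as stated above
rows_after = 1094   # observations left after dropping rows with missing values

rows_removed = rows_before - rows_after
pct_removed = rows_removed / rows_before * 100

print(rows_removed, round(pct_removed, 1))
```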

## Data Preparation

In [None]:
#Print a summary table for one categorical column.
#Each row is a category of that column.
#Columns: row count, percentage of rows, and mean SalePrice for that category.
def GetValueCount(colName):
    tempDF = housingDF[[colName, 'SalePrice']].copy()
    print("****** " + colName + " *****")

    # Rows per category; value_counts() skips NaN by default.
    # Name the count column explicitly, since its default name differs across pandas versions.
    df = tempDF[colName].value_counts().rename('Count').to_frame()

    df['Count %'] = round(df['Count'] / df['Count'].sum() * 100, 2)
    # Mean SalePrice per category, joined on the category index.
    df = df.merge(tempDF.groupby(by=colName, observed=False).mean()[['SalePrice']], left_index=True, right_index=True)

    print(df.sort_values(by='Count', ascending=False))
    print()

### Categorical Data

In [None]:
for col in housingDF.columns[housingDF.dtypes == 'object']:
    GetValueCount(col)

Categories will be reduced using the following steps/rules:<BR>
- Combine low-frequency categories into a single category.
- Drop a categorical column if a single value accounts for more than 95% of the data.
- Convert ordered categorical columns to an ordinal scale.

Dummy encoding or binary encoding will be applied to the remaining categorical columns after data analysis.
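
For reference, dummy encoding with pandas might look like the sketch below. The toy frame and its values are made up, and `drop_first=True` is an assumption about how the encoding would be configured, not a choice confirmed by this notebook.

```python
import pandas as pd

# Toy frame (made-up values) with one categorical and one numeric column.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "FV", "RL"],
    "SalePrice": [200000, 120000, 150000, 210000],
})

# drop_first=True drops one level per encoded column, which avoids a redundant dummy.
encoded = pd.get_dummies(toy, columns=["MSZoning"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level keeps the dummies from summing to a constant, which would otherwise duplicate the intercept in a linear model.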

Almost 93% of the data falls in the top 2 MSZoning categories, **RL** and **RM**.<BR>
A new category will be created and assigned to the remaining 7% of rows.
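
The same "keep the main categories, lump the rest" step recurs for many columns, so it can be expressed once as a small helper. This is a sketch on toy values; `lump_rare` is a hypothetical name and is not used elsewhere in this notebook.

```python
import pandas as pd

def lump_rare(series, keep, other_label):
    """Keep the listed categories and replace every other value with other_label."""
    return series.where(series.isin(keep), other_label)

# Toy values only; the real column comes from housingDF.
zoning = pd.Series(["RL", "RM", "FV", "RH", "RL", "C (all)"])
lumped = lump_rare(zoning, keep=["RL", "RM"], other_label="FV_RH_C")
print(lumped.tolist())
```

`Series.where` keeps a value where the condition is true and substitutes `other_label` elsewhere, which matches the per-value `lambda` approach without a Python-level loop.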

In [None]:
housingDF['MSZoning'] = housingDF['MSZoning'].apply (lambda v : v if v in (['RL', 'RM']) else 'FV_RH_C')
GetValueCount('MSZoning')

In [None]:
#99.63% of properties have 'Pave' road access to the street, so this feature carries very little variance and will be dropped
#Same issue with 'Condition2' and 'RoofMatl'
housingDF.drop(labels=['Street','Condition2','RoofMatl'], inplace=True, axis=1)

In [None]:
housingDF['LotShape'] = housingDF['LotShape'].apply (lambda v : v if v in (['Reg', 'IR1']) else 'IR2_IR3')
GetValueCount('LotShape')

In [None]:
housingDF['LandContour'] = housingDF['LandContour'].apply (lambda v : v if v in (['Lvl']) else 'Bnk_HLS_Low')
GetValueCount('LandContour')

In [None]:
housingDF['LotConfig'] = housingDF['LotConfig'].apply (lambda v : v if v in (['Inside', 'Corner']) else 'CulDSac_FR2_FR3')
GetValueCount('LotConfig')

In [None]:
housingDF['Condition1'] = housingDF['Condition1'].apply (lambda v : v if v in (['Norm']) else 'Other_Condition1')
GetValueCount('Condition1')

In [None]:
housingDF['BldgType'] = housingDF['BldgType'].apply (lambda v : v if v in (['1Fam']) else 'Other_BldgType')
GetValueCount('BldgType')

In [None]:
housingDF['HouseStyle'] = housingDF['HouseStyle'].apply (lambda v : v if v in (['1Story','2Story','1.5Fin']) else 'Other_HouseStyle')
GetValueCount('HouseStyle')

In [None]:
housingDF['RoofStyle'] = housingDF['RoofStyle'].apply (lambda v : v if v in (['Gable','Hip']) else 'Other_RoofStyle')
GetValueCount('RoofStyle')

In [None]:
housingDF['Foundation'] = housingDF['Foundation'].apply (lambda v : v if v in (['PConc','CBlock','BrkTil']) else 'Stone_Wood')
GetValueCount('Foundation')

In [None]:
housingDF['Heating'] = housingDF['Heating'].apply (lambda v : v if v in (['GasA']) else 'Other_Heating')
GetValueCount('Heating')

In [None]:
housingDF['Electrical'] = housingDF['Electrical'].apply (lambda v : v if v in (['SBrkr']) else 'Other_Electrical')
GetValueCount('Electrical')

In [None]:
housingDF['GarageType'] = housingDF['GarageType'].apply (lambda v : v if v in (['Attchd', 'Detchd']) else 'Other_GarageType')
GetValueCount('GarageType')

In [None]:
housingDF['SaleType'] = housingDF['SaleType'].apply (lambda v : v if v in (['WD', 'New']) else 'Other_SaleType')
GetValueCount('SaleType')

In [None]:
housingDF['SaleCondition'] = housingDF['SaleCondition'].apply (lambda v : v if v in (['Normal', 'Partial']) else 'Other_SaleCondition')
GetValueCount('SaleCondition')

In [None]:
LandSlope = {'Gtl' : 1, 'Mod' : 2, 'Sev' : 3}
Qual = {'NA' : 0, 'Po' : 1, 'Fa' : 2, 'TA' : 3, 'Gd' : 4, 'Ex' : 5}
BsmtExposure = {'NA' : 0, 'No' : 1, 'Mn' : 2, 'Av' : 3, 'Gd' : 4}
BsmtFinType1 = {'NA' : 0, 'Unf' : 1, 'LwQ' : 2, 'Rec' : 3, 'BLQ' : 4, 'ALQ' : 5, 'GLQ' : 6}
Functional= {'Sal' : 1, 'Sev' : 2, 'Maj2' : 3, 'Maj1' : 4, 'Mod' : 5, 'Min2' : 6, 'Min1' : 7, 'Typ' : 8}
GarageFinish = {'NA' : 0, 'Unf' : 1, 'RFn' : 2, 'Fin' : 3}
PavedDrive = {'N' : 0, 'P' : 1, 'Y' : 2}

housingDF.replace({'LandSlope' : LandSlope, 
                   'ExterQual' : Qual,
                   'ExterCond' : Qual,
                   'BsmtQual' : Qual,
                   'BsmtCond' : Qual,
                   'BsmtExposure' : BsmtExposure,
                   'BsmtFinType1' : BsmtFinType1,
                   'BsmtFinType2' : BsmtFinType1,
                   'HeatingQC' : Qual,
                   'KitchenQual' : Qual,
                   'Functional' : Functional,
                   'GarageFinish' : GarageFinish,
                   'GarageQual' : Qual,
                   'GarageCond' : Qual,
                   'PavedDrive' : PavedDrive
                  }, inplace=True)

housingDF = housingDF.astype({'LandSlope' : int, 
                   'ExterQual' : int,
                   'ExterCond' : int,
                   'BsmtQual' : int,
                   'BsmtCond' : int,
                   'BsmtExposure' : int,
                   'BsmtFinType1' : int,
                   'BsmtFinType2' : int,
                   'HeatingQC' : int,
                   'KitchenQual' : int,
                   'Functional' : int,
                   'GarageFinish' : int,
                   'GarageQual' : int,
                   'GarageCond' : int,
                   'PavedDrive' : int
                  } )
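
`replace` leaves any label that is not a key of the mapping unchanged, so a misspelt or unexpected label only surfaces later, when the cast to `int` fails. A stricter variant, sketched below on toy data, reports the offending labels directly. The helper `to_ordinal` is hypothetical, and the `Qual` dictionary here is a shortened copy of the one above.

```python
import pandas as pd

Qual = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

def to_ordinal(series, mapping):
    """Map labels to integer ranks, failing loudly on labels missing from the mapping."""
    mapped = series.map(mapping)
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in {series.name}: {list(unmapped)}")
    return mapped.astype(int)

# Toy column (made-up values) using the quality scale.
toy = pd.Series(['TA', 'Gd', 'Ex', 'Fa'], name='ExterQual')
ranks = to_ordinal(toy, Qual)
print(ranks.tolist())
```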

In [None]:
housingDF.info()

### Numerical Data

In [None]:
housingDF[housingDF.columns[housingDF.dtypes != 'object']].head(10)

In [None]:
import datetime as dt

In [None]:
#Derived variables: ages in years relative to the current date
curDate = dt.date.today()
housingDF['PropertyAge'] = curDate.year - housingDF['YearBuilt']
housingDF['PropertyRemodelAge'] = curDate.year - housingDF['YearRemodAdd']
housingDF['GarageAge'] = (curDate.year - housingDF['GarageYrBlt']).astype(int)
housingDF['PropertySoldSince'] = (curDate.month - housingDF['MoSold']) / 12 + (curDate.year - housingDF['YrSold'])

In [None]:
housingDF[housingDF['PropertyAge'] < housingDF['PropertyRemodelAge']]

In [None]:
housingDF[housingDF['PropertyAge'] < housingDF['GarageAge']]

In [None]:
#Garage age cannot be greater than property age.
#Assign the property age to the garage age for such rows.
housingDF['GarageAge'] = housingDF[['GarageAge', 'PropertyAge']].min(axis=1)

In [None]:
housingDF[housingDF['BsmtFinSF1'] + housingDF['BsmtFinSF2'] + housingDF['BsmtUnfSF'] != housingDF['TotalBsmtSF']]

In [None]:
#TotalBsmtSF can be removed, as the other 3 basement columns fully explain it
#Also drop the year columns that are no longer needed

housingDF.drop(labels=['TotalBsmtSF','YearBuilt','YearRemodAdd', 'GarageYrBlt','MoSold', 'YrSold'], axis=1, inplace=True)
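
The reason `TotalBsmtSF` is redundant is that, when the mismatch check returns no rows, it is an exact sum of the three other basement columns, so keeping it would add a perfectly collinear feature. The sketch below repeats that check on toy rows; the numbers are made up and only the column names come from the data set.

```python
import pandas as pd

# Toy rows (made-up numbers) with the same column names as the data set.
toy = pd.DataFrame({
    "BsmtFinSF1": [700, 0, 300],
    "BsmtFinSF2": [0, 100, 50],
    "BsmtUnfSF": [150, 400, 250],
    "TotalBsmtSF": [850, 500, 600],
})

parts = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
# Rows where the three parts do not add up to the total.
mismatch = toy[toy[parts].sum(axis=1) != toy["TotalBsmtSF"]]
print(len(mismatch))
```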

## Visualising Data

In [None]:
qutVar = housingDF.columns[housingDF.dtypes != 'object']
catVar = housingDF.columns[housingDF.dtypes == 'object']

### Quantitative Variable Univariate & Bivariate Analysis

In [None]:
import math
# 99th percentile of each numeric column, used to trim the extreme upper tail
a = housingDF.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99]).loc['99%']

# Numeric columns only; values at or above the 99th percentile become NaN
temp = housingDF[qutVar]
temp = temp[temp < a]
count = 1
plt.figure(figsize=(20,50))
for n in qutVar:
    plt.subplot(math.ceil(qutVar.size / 4), 4, count)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(temp[n].dropna(), kde=True)
    count += 1

plt.show()

In [None]:
a = housingDF.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99]).loc['99%']
a

In [None]:
housingDF[qutVar][housingDF[qutVar] < a].describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99])

In [None]:
housingDF[qutVar][housingDF[qutVar] < a]

In [None]:
count = 1
plt.figure(figsize=(20,50))
for n in qutVar:
    plt.subplot(math.ceil(qutVar.size / 4), 4, count)
    sns.boxplot(y=housingDF[housingDF[n] < np.percentile(housingDF[n], 99)] [n])
    count += 1

plt.show()

In [None]:
#Check correlation with heatmap (numeric columns only)
corr = housingDF[qutVar].corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(20,15))
sns.heatmap(corr, cmap='RdBu',  mask=mask, center=0, linewidths= 0.1)
plt.show()

In [None]:
#Check correlation with heatmap
#Display only strong correlations (|r| >= 0.5); weaker ones are set to 0
corr = housingDF[qutVar].corr()
corr = corr.where(corr.abs() >= 0.5, 0).round(1)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(20,15))
sns.heatmap(corr, cmap='RdBu', annot=True, mask=mask, center=0, linewidths= 0.1, )
plt.show()
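
With this many variables the heatmap can be hard to read, so the strongly correlated pairs can also be listed as a table. The sketch below does this on synthetic data; `high_corr_pairs` is a hypothetical helper, and the 0.5 threshold simply mirrors the cut-off used for the heatmap.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.5):
    """List column pairs whose absolute Pearson correlation is at least threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=abs, ascending=False)

# Synthetic data (illustrative only): b tracks base closely, c is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to `housingDF`, the result would give the same information as the annotated heatmap in a sortable form.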

### Qualitative Variable Univariate & Bivariate Analysis

In [None]:
corr = housingDF[qutVar].corr()