# Assignment - Advanced Regression

## Problem Statement

A US-based housing company named **Surprise Housing** has decided to enter the Australian market. The company uses data analytics to purchase houses below their actual value and resell them at a higher price. It is now evaluating prospective properties to buy as its entry into the market.

## Objective

The company wants to know:
 - Which variables are significant in predicting the price of a house, and
 - How well those variables describe the price of a house.
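
One possible way to approach both questions is sketched below, assuming a Lasso fit in which features with non-zero coefficients are treated as the significant ones and R² on held-out data measures how well they describe the target. The data is synthetic, and the feature names and the `alpha` value are placeholders rather than anything taken from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): only the first two features drive the target
rng = np.random.default_rng(42)
names = [f"feature_{i}" for i in range(6)]  # hypothetical feature names
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scale using training statistics only, so the penalty treats all features alike
scaler = StandardScaler().fit(X_train)
model = Lasso(alpha=0.1).fit(scaler.transform(X_train), y_train)

# Features whose coefficient was not shrunk to exactly zero
selected = [name for name, coef in zip(names, model.coef_) if coef != 0]

# How well the fitted model describes unseen data
test_r2 = r2_score(y_test, model.predict(scaler.transform(X_test)))

print("selected features:", selected)
print("test R^2:", round(test_r2, 3))
```

The surviving features depend on the chosen `alpha`, so this kind of selection is only meaningful once the penalty has been tuned.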

## Business Goal

You are required to model the price of houses using the available independent variables. Management will use this model to understand how prices vary with those variables, so that they can adjust the firm's strategy and concentrate on areas that yield high returns. The model will also help management understand the pricing dynamics of a new market.

## Data Sourcing

The company has collected a data set from the sale of houses in Australia. For this case study, we focus only on the dataset provided by the company. A data definition is available at the link below to help interpret the data.

Link to the [Data definition](./Data_definition.pdf)

## Instructions

1. Build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.
2. Determine the optimal value of lambda for ridge and lasso regression.
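
A minimal sketch of the second step, assuming a cross-validated grid search is used to pick lambda. scikit-learn names this hyperparameter `alpha`. The data, the candidate grid, and the helper `best_alpha` are illustrative assumptions, not the values or code used for the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

# Candidate lambda values (illustrative grid)
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}


def best_alpha(estimator):
    """Return the alpha with the best mean 5-fold cross-validated score."""
    search = GridSearchCV(
        estimator, param_grid, scoring="neg_mean_squared_error", cv=5
    )
    search.fit(X, y)
    return search.best_params_["alpha"]


ridge_alpha = best_alpha(Ridge())
lasso_alpha = best_alpha(Lasso(max_iter=10000))
print("ridge alpha:", ridge_alpha)
print("lasso alpha:", lasso_alpha)
```

If the best value falls on the edge of the grid, widening the grid in that direction is usually worth a second run.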

## Data Dictionary

Here's a brief version of what you'll find in the data description file. I have grouped the fields under meaningful names to make them easier to understand.

#### Dwelling
 - **MSSubClass**: Identifies the type of dwelling involved in the sale.
 - **BldgType**: Type of dwelling
 - **HouseStyle**: Style of dwelling
 
#### Zone
 - **MSZoning**: Identifies the general zoning classification of the sale.
 
#### Size
 - **LotFrontage**: Linear feet of street connected to property
 - **LotArea**: Lot size in square feet
 - **1stFlrSF**: First Floor square feet
 - **2ndFlrSF**: Second floor square feet
 - **LowQualFinSF**: Low quality finished square feet (all floors)
 - **GrLivArea**: Above grade (ground) living area square feet
 - **WoodDeckSF**: Wood deck area in square feet

#### Street
 - **Street**: Type of road access to property
 - **Alley**: Type of alley access to property
 - **Utilities**: Type of utilities available
 
#### Shape
 - **LotShape**: General shape of property
 - **LandContour**: Flatness of the property
 - **LotConfig**: Lot configuration
 - **LandSlope**: Slope of property
 
#### Neighborhood 
 - **Neighborhood**: Physical locations within Ames city limits
 
#### Proximity
 - **Condition1**: Proximity to various conditions
 - **Condition2**: Proximity to various conditions (if more than one is present)

#### House Quality
 - **OverallQual**: Rates the overall material and finish of the house (1-10)
 - **OverallCond**: Rates the overall condition of the house
 
#### Construction Time
 - **YearBuilt**: Original construction date
 - **YearRemodAdd**: Remodel date (same as construction date if no remodeling or additions)

#### Roof
 - **RoofStyle**: Type of roof
 - **RoofMatl**: Roof material
 
#### Masonry
 - **MasVnrType**: Masonry veneer type
 - **MasVnrArea**: Masonry veneer area in square feet
 - **Foundation**: Type of foundation
 
#### Exterior
 - **Exterior1st**: Exterior covering on house
 - **Exterior2nd**: Exterior covering on house (if more than one material)
 - **ExterQual**: Evaluates the quality of the material on the exterior
 - **ExterCond**: Evaluates the present condition of the material on the exterior

#### Basement
 - **BsmtQual**: Evaluates the height of the basement
 - **BsmtCond**: Evaluates the general condition of the basement
 - **BsmtExposure**: Refers to walkout or garden level walls
 - **BsmtFinType1**: Rating of basement finished area
 - **BsmtFinSF1**: Type 1 finished square feet
 - **BsmtFinType2**: Rating of basement finished area (if multiple types)
 - **BsmtFinSF2**: Type 2 finished square feet
 - **BsmtUnfSF**: Unfinished square feet of basement area
 - **TotalBsmtSF**: Total square feet of basement area
 
#### Heat Management 
 - **Heating**: Type of heating
 - **HeatingQC**: Heating quality and condition
 - **CentralAir**: Central air conditioning
 
#### Electricity 
 - **Electrical**: Electrical system
 

#### Bathrooms
 - **BsmtFullBath**: Basement full bathrooms
 - **BsmtHalfBath**: Basement half bathrooms
 - **FullBath**: Full bathrooms above grade
 - **HalfBath**: Half baths above grade
 - **BedroomAbvGr**: Bedrooms above grade (does NOT include basement bedrooms)
 
#### Kitchen
 - **KitchenAbvGr**: Kitchens above grade
 - **KitchenQual**: Kitchen quality
 
#### Total Rooms 
 - **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
 
#### Fire Management
 - **Fireplaces**: Number of fireplaces
 - **FireplaceQu**: Fireplace quality

#### Garage
 - **GarageType**: Garage location
 - **GarageYrBlt**: Year garage was built
 - **GarageFinish**: Interior finish of the garage
 - **GarageCars**: Size of garage in car capacity
 - **GarageArea**: Size of garage in square feet
 - **GarageQual**: Garage quality
 - **GarageCond**: Garage condition
 - **PavedDrive**: Paved driveway

#### Porch
 - **OpenPorchSF**: Open porch area in square feet
 - **EnclosedPorch**: Enclosed porch area in square feet
 - **3SsnPorch**: Three season porch area in square feet
 - **ScreenPorch**: Screen porch area in square feet

#### Pool
 - **PoolArea**: Pool area in square feet
 - **PoolQC**: Pool quality
 
#### Fence
 - **Fence**: Fence quality
 
#### Misc
 - **MiscFeature**: Miscellaneous feature not covered in other categories
 - **MiscVal**: Value of miscellaneous feature
 - **Functional**: Home functionality (Assume typical unless deductions are warranted)
 
#### Sold Time
 - **MoSold**: Month Sold (MM)
 - **YrSold**: Year Sold (YYYY)
 
#### Sale
 - **SaleType**: Type of sale
 - **SaleCondition**: Condition of sale
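
The grouping above could also be kept in code, for example to select all columns of one group at once. The sketch below is only an illustration: `FIELD_GROUPS` and `group_of` are hypothetical names, and only a few of the groups are filled in.

```python
# Hypothetical mapping from a few of the group names above to their columns
FIELD_GROUPS = {
    "Dwelling": ["MSSubClass", "BldgType", "HouseStyle"],
    "Zone": ["MSZoning"],
    "Roof": ["RoofStyle", "RoofMatl"],
    "Pool": ["PoolArea", "PoolQC"],
}


def group_of(column):
    """Return the group a column belongs to, or None if it is not mapped."""
    for group, columns in FIELD_GROUPS.items():
        if column in columns:
            return group
    return None


print(group_of("RoofMatl"))
print(FIELD_GROUPS["Dwelling"])
```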

## Import Libraries

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# setting to display all rows & columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Read and Store Data

In [20]:
# hd: house data
hd = pd.read_csv('./train.csv')

In [21]:
# Shape of house data
print(hd.shape)

(1460, 81)


In [22]:
hd.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
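
The preview above shows empty entries in columns such as `Alley`, `PoolQC`, `Fence` and `MiscFeature`. One way to quantify this is to rank columns by their percentage of missing values. The sketch below uses a small hypothetical frame (`toy`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the house data
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Percentage of missing values per column, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

Applied to the real frame, the same expression would be `hd.isnull().mean().mul(100)`.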


In [23]:
# info() prints its summary directly and returns None, so no print() is needed
hd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC