### 1. Data Understanding and Exploration

Let's first look at the dataset to understand its size, attribute names, and data types.

In [1]:
import sys
import os
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score

warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
prices = pd.read_csv('train.csv')

In [3]:
prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

Some columns seem to contain null values, so let's check the fraction of nulls in each column.

In [4]:
print(round(prices.isnull().sum()/len(prices.index),2).sort_values(ascending=False).to_markdown())

|               |    0 |
|:--------------|-----:|
| PoolQC        | 1    |
| MiscFeature   | 0.96 |
| Alley         | 0.94 |
| Fence         | 0.81 |
| FireplaceQu   | 0.47 |
| LotFrontage   | 0.18 |
| GarageYrBlt   | 0.06 |
| GarageFinish  | 0.06 |
| GarageType    | 0.06 |
| GarageQual    | 0.06 |
| GarageCond    | 0.06 |
| BsmtExposure  | 0.03 |
| BsmtQual      | 0.03 |
| BsmtCond      | 0.03 |
| BsmtFinType2  | 0.03 |
| BsmtFinType1  | 0.03 |
| MasVnrType    | 0.01 |
| MasVnrArea    | 0.01 |
| Id            | 0    |
| Functional    | 0    |
| Fireplaces    | 0    |
| KitchenQual   | 0    |
| KitchenAbvGr  | 0    |
| BedroomAbvGr  | 0    |
| HalfBath      | 0    |
| FullBath      | 0    |
| BsmtHalfBath  | 0    |
| BsmtFullBath  | 0    |
| TotRmsAbvGrd  | 0    |
| GarageCars    | 0    |
| LowQualFinSF  | 0    |
| GarageArea    | 0    |
| PavedDrive    | 0    |
| WoodDeckSF    | 0    |
| OpenPorchSF   | 0    |
| EnclosedPorch | 0    |
| 3SsnPorch     | 0    |
| ScreenPorch   | 0    |


Removing columns with more than 10% null values

In [5]:
limitPer = len(prices) * .90
prices = prices.dropna(thresh=limitPer, axis=1)
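
The `thresh` argument counts non-null values, so the cell above keeps a column only when at least 90% of its rows are filled. The sketch below illustrates this boundary on a small made-up frame (`toy` and its column names are hypothetical, not part of the dataset): a column with exactly 10% missing values survives, one with 20% missing is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 10 rows, columns with 0, 1 and 2 missing values.
toy = pd.DataFrame({
    "full": range(10),
    "one_missing": [np.nan] + [1.0] * 9,
    "two_missing": [np.nan, np.nan] + [1.0] * 8,
})

# Same rule as the cell above: keep a column only if it has
# at least 90% non-null values (here: at least 9 of 10).
min_non_null = int(len(toy) * 0.90)
kept = toy.dropna(thresh=min_non_null, axis=1)

print(kept.columns.tolist())
```

Here `two_missing` is removed, while `one_missing` stays because it still has 9 non-null values.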

In [6]:
print(round(prices.isnull().sum()/len(prices.index),2).sort_values(ascending=False).to_markdown())

|               |    0 |
|:--------------|-----:|
| GarageType    | 0.06 |
| GarageYrBlt   | 0.06 |
| GarageFinish  | 0.06 |
| GarageQual    | 0.06 |
| GarageCond    | 0.06 |
| BsmtFinType1  | 0.03 |
| BsmtQual      | 0.03 |
| BsmtCond      | 0.03 |
| BsmtExposure  | 0.03 |
| BsmtFinType2  | 0.03 |
| MasVnrType    | 0.01 |
| MasVnrArea    | 0.01 |
| BedroomAbvGr  | 0    |
| HalfBath      | 0    |
| FullBath      | 0    |
| BsmtHalfBath  | 0    |
| BsmtFullBath  | 0    |
| KitchenAbvGr  | 0    |
| KitchenQual   | 0    |
| GrLivArea     | 0    |
| LowQualFinSF  | 0    |
| TotRmsAbvGrd  | 0    |
| Id            | 0    |
| Functional    | 0    |
| Fireplaces    | 0    |
| SaleCondition | 0    |
| SaleType      | 0    |
| YrSold        | 0    |
| MoSold        | 0    |
| MiscVal       | 0    |
| PoolArea      | 0    |
| ScreenPorch   | 0    |
| 3SsnPorch     | 0    |
| EnclosedPorch | 0    |
| OpenPorchSF   | 0    |
| WoodDeckSF    | 0    |
| PavedDrive    | 0    |
| GarageArea    | 0    |


Some columns still have null values, so we have to explore them individually.

In [7]:
haveNullValues = ['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','BsmtFinType1','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType2','MasVnrType','MasVnrArea']
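
A hand-typed list taken from the rounded table can miss columns whose null share rounds down to 0. As a minimal sketch, assuming the same `isnull()` approach used above, the list could instead be derived from raw null counts. The frame `toy` and the column `SingleGap` are hypothetical and only illustrate the rounding effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 1,000 rows, one column with a single missing value.
toy = pd.DataFrame({
    "GarageType": ["Attchd"] * 940 + [np.nan] * 60,
    "SingleGap": [np.nan] + [1.0] * 999,
    "LotArea": range(1000),
})

null_counts = toy.isnull().sum()

# Rounded share, as in the table above: the single gap rounds to 0.0.
rounded_share = (null_counts / len(toy)).round(2)

# Selecting on raw counts keeps every column with at least one NaN.
have_null_values = null_counts[null_counts > 0].index.tolist()

print(rounded_share.to_dict())
print(have_null_values)
```

Applied to `prices`, the same two lines (`isnull().sum()` and the `> 0` filter) would list every column that still needs attention, whatever its rounded share.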

In [8]:
len(haveNullValues)

12

In [10]:
prices[haveNullValues].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   GarageType    1379 non-null   object 
 1   GarageYrBlt   1379 non-null   float64
 2   GarageFinish  1379 non-null   object 
 3   GarageQual    1379 non-null   object 
 4   GarageCond    1379 non-null   object 
 5   BsmtFinType1  1423 non-null   object 
 6   BsmtQual      1423 non-null   object 
 7   BsmtCond      1423 non-null   object 
 8   BsmtExposure  1422 non-null   object 
 9   BsmtFinType2  1422 non-null   object 
 10  MasVnrType    1452 non-null   object 
 11  MasVnrArea    1452 non-null   float64
dtypes: float64(2), object(10)
memory usage: 137.0+ KB


In [23]:
# Use `col` rather than `type` so the built-in `type` is not shadowed
for col in haveNullValues:
    print(prices[col].value_counts(), end='\n\n')

Attchd     870
Detchd     387
BuiltIn     88
Basment     19
CarPort      9
2Types       6
Name: GarageType, dtype: int64

2005.0    65
2006.0    59
2004.0    53
2003.0    50
2007.0    49
          ..
1927.0     1
1900.0     1
1906.0     1
1908.0     1
1933.0     1
Name: GarageYrBlt, Length: 97, dtype: int64

Unf    605
RFn    422
Fin    352
Name: GarageFinish, dtype: int64

TA    1311
Fa      48
Gd      14
Ex       3
Po       3
Name: GarageQual, dtype: int64

TA    1326
Fa      35
Gd       9
Po       7
Ex       2
Name: GarageCond, dtype: int64

Unf    430
GLQ    418
ALQ    220
BLQ    148
Rec    133
LwQ     74
Name: BsmtFinType1, dtype: int64

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64

TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64

No    953
Av    221
Gd    134
Mn    114
Name: BsmtExposure, dtype: int64

Unf    1256
Rec      54
LwQ      46
BLQ      33
ALQ      19
GLQ      14
Name: BsmtFinType2, dtype: int64

None       864
BrkFace   

According to the data dictionary, NA has the following meaning in each column:
- GarageType: NA means the home has no garage, so we can replace NA with NoGarage
- GarageYrBlt: the year the garage was built. GarageType and GarageYrBlt have the same number of NAs, so GarageYrBlt is presumably null where there is no garage, and we replace NA with 0
- GarageFinish: interior finish of the garage. It is NA in homes with no garage, so we can replace NA with NoGarage
- GarageQual: garage quality. It is NA where there is no garage, so we can replace NA with NoGarage
- GarageCond: garage condition. It is NA where there is no garage, so we can replace NA with NoGarage
- BsmtFinType1: rating of basement finished area. It is NA where there is no basement, so we can replace NA with NoBasement
- BsmtFinType2: rating of basement finished area. It is NA where there is no basement, so we can replace NA with NoBasement
- BsmtQual: NA where there is no basement, so we can replace NA with NoBasement
- BsmtCond: NA where there is no basement, so we can replace NA with NoBasement
- BsmtExposure: NA where there is no basement, so we can replace NA with NoBasement
- MasVnrType: NA where there is no masonry veneer, so we can replace NA with NoMasVnr
- MasVnrArea: NA where there is no masonry veneer, so we can replace NA with 0
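
The GarageYrBlt reasoning above rests on equal NA counts, which by itself does not guarantee that the same rows are affected. A minimal sketch of how that could be checked is shown below, on made-up rows (`toy` is hypothetical; in the notebook the same expressions would run on `prices`). It also compares against GarageArea, assuming that a home without a garage reports an area of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; in the notebook this would run on `prices`.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan],
    "GarageArea": [548, 0, 480, 0],
})

garage_cols = ["GarageType", "GarageYrBlt"]
missing = toy[garage_cols].isnull()

# True only if each row is missing in all listed columns or in none of them.
same_rows = bool((missing.all(axis=1) == missing.any(axis=1)).all())

# Assumption: rows with no garage type should also report zero garage area.
no_garage_area = toy.loc[missing["GarageType"], "GarageArea"]
area_is_zero = bool(no_garage_area.eq(0).all())

print(same_rows, area_is_zero)
```

If `same_rows` came out `False` on the real data, some garage columns would be missing independently and would need separate handling.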

In [24]:
# Fill per column and assign the result back. Calling fillna(inplace=True) on
# an attribute-accessed column may act on a temporary copy in newer pandas.
prices = prices.fillna({
    'GarageType': 'NoGarage',
    'GarageYrBlt': 0,
    'GarageFinish': 'NoGarage',
    'GarageQual': 'NoGarage',
    'GarageCond': 'NoGarage',
    'BsmtFinType1': 'NoBasement',
    'BsmtFinType2': 'NoBasement',
    'BsmtQual': 'NoBasement',
    'BsmtCond': 'NoBasement',
    'BsmtExposure': 'NoBasement',
    'MasVnrType': 'NoMasVnr',
    'MasVnrArea': 0,
})

In [25]:
prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   LotShape       1460 non-null   object 
 6   LandContour    1460 non-null   object 
 7   Utilities      1460 non-null   object 
 8   LotConfig      1460 non-null   object 
 9   LandSlope      1460 non-null   object 
 10  Neighborhood   1460 non-null   object 
 11  Condition1     1460 non-null   object 
 12  Condition2     1460 non-null   object 
 13  BldgType       1460 non-null   object 
 14  HouseStyle     1460 non-null   object 
 15  OverallQual    1460 non-null   int64  
 16  OverallCond    1460 non-null   int64  
 17  YearBuilt      1460 non-null   int64  
 18  YearRemo