# House Prices Prediction: Kaggle Dataset Analysis

This notebook performs exploratory data analysis (EDA) and model training on the Kaggle House Prices dataset. The goal is to build robust linear regression models that predict the `SalePrice` of a house from its features.

## Objectives
- Load and preprocess the dataset.
- Train and evaluate linear regression models with regularization (Ridge, Lasso, and ElasticNet).
- Generate predictions for the test dataset.

## Dataset
The dataset consists of two CSV files:
- `train.csv`: Training data with 1460 rows and 81 columns, including the target variable `SalePrice`.
- `test.csv`: Test data with 1459 rows and 80 columns (no `SalePrice`).

In [346]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

pd.options.display.max_columns = None

from sklearn.model_selection import train_test_split

import statsmodels
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.gofplots import qqplot

from scipy import stats
from scipy.stats import shapiro

from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

## 1. Data Loading and Initial Exploration

In [364]:
train= pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.shape,test.shape

((1460, 81), (1459, 80))

**Concatenating** the train and test data so that null value treatment and categorical encoding are applied consistently to both.

In [390]:
df=pd.concat([train,test])
df.reset_index(drop=True,inplace=True)
df.drop('SalePrice',axis=1,inplace=True) # dropping target variable

In [392]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal


In [394]:
df.shape

(2919, 80)

In [402]:
df.drop('Id',axis=1,inplace=True) # Id is a unique row identifier, so it carries no predictive signal

**Remove variables that have more than 40% null values**

In [404]:
null=df.isnull().sum()/len(df)*100
null_cols = null[null>40].index
null_cols

Index(['Alley', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], dtype='object')

In [406]:
df.drop(null_cols,axis=1,inplace=True)
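
The threshold filter above can be illustrated on a small frame. This is only a sketch: the values in `toy` are made up, and `isnull().mean()` is used as an equivalent shorthand for `isnull().sum()/len(df)`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different amounts of missing data
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],  # 4 of 5 missing
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],     # 1 of 5 missing
    "LotArea": [8450, 9600, 11250, 9550, 14260],         # none missing
})

# Percentage of missing values per column
null_pct = toy.isnull().mean() * 100

# Columns above the 40% threshold are dropped, the rest are kept for imputation
drop_cols = null_pct[null_pct > 40].index.tolist()
toy_reduced = toy.drop(columns=drop_cols)
print(null_pct.round(1).to_dict())
print(drop_cols)
```

Only `Alley` crosses the threshold here; `LotFrontage` stays and would be handled by imputation later.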

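`MSSubClass`, `OverallQual`, and `OverallCond` are stored as integers but hold category codes rather than quantities, so they are cast to `object` and treated as categorical variables.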

In [416]:
cols =['MSSubClass','OverallQual','OverallCond']

In [418]:
for i in cols:
    df[i] =df[i].astype('object')

In [410]:
df.shape
# removed ID and null columns

(2919, 73)

**Observation:** The dataset now has 2919 rows and 73 columns and is ready for preprocessing. Before treating null values, we need to separate the numerical and categorical variables.

## 2. Null value treatment

In [420]:
num = df.select_dtypes(include=np.number).columns.to_list()
cat = df.select_dtypes(exclude=np.number).columns.to_list()
len(num),len(cat)

(33, 40)

Use `KNNImputer` to impute the null values in the numerical variables.

In [423]:
imputer = KNNImputer(n_neighbors=5)
df_num = pd.DataFrame(imputer.fit_transform(df[num]),columns=num)
df_num.head()

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,65.0,8450.0,2003.0,2003.0,196.0,706.0,0.0,150.0,856.0,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,8.0,0.0,2003.0,2.0,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0
1,80.0,9600.0,1976.0,1976.0,0.0,978.0,0.0,284.0,1262.0,1262.0,0.0,0.0,1262.0,0.0,1.0,2.0,0.0,3.0,1.0,6.0,1.0,1976.0,2.0,460.0,298.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0
2,68.0,11250.0,2001.0,2002.0,162.0,486.0,0.0,434.0,920.0,920.0,866.0,0.0,1786.0,1.0,0.0,2.0,1.0,3.0,1.0,6.0,1.0,2001.0,2.0,608.0,0.0,42.0,0.0,0.0,0.0,0.0,0.0,9.0,2008.0
3,60.0,9550.0,1915.0,1970.0,0.0,216.0,0.0,540.0,756.0,961.0,756.0,0.0,1717.0,1.0,0.0,1.0,0.0,3.0,1.0,7.0,1.0,1998.0,3.0,642.0,0.0,35.0,272.0,0.0,0.0,0.0,0.0,2.0,2006.0
4,84.0,14260.0,2000.0,2000.0,350.0,655.0,0.0,490.0,1145.0,1145.0,1053.0,0.0,2198.0,1.0,0.0,2.0,1.0,4.0,1.0,9.0,1.0,2000.0,3.0,836.0,192.0,84.0,0.0,0.0,0.0,0.0,0.0,12.0,2008.0
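
To make the behaviour of `KNNImputer` concrete, here is a minimal sketch on a made-up array `X` (not the housing data). With the default settings, a missing entry is replaced by the mean of that feature over the `n_neighbors` closest rows, where distance is measured only on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is missing in row 2
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 1 and 0 are the closest to row 2 on the first feature,
# so the gap is filled with the mean of their second feature: (20 + 10) / 2
print(X_filled[2])
```

Because distances depend on feature scale, large-valued columns such as `LotArea` can dominate the neighbour search; scaling before imputation is one option worth testing, though the notebook does not do so.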


Impute the categorical variables with the mode (the most frequent value in each column).

In [426]:
cat_impute = SimpleImputer(missing_values=np.nan,strategy='most_frequent')
df_cat = pd.DataFrame(cat_impute.fit_transform(df[cat]),columns=cat)
df_cat

Unnamed: 0,MSSubClass,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,60,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
1,20,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,Gable,CompShg,MetalSd,MetalSd,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
2,60,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
3,70,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,Wd Sdng,Wd Shng,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Detchd,Unf,TA,TA,Y,WD,Abnorml
4,60,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
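
As a small illustration of mode imputation, the sketch below uses two made-up frames, `train_cat` and `test_cat`. It fits the imputer on the training rows only and reuses it for the test rows. This is a variation, not what the notebook does (the notebook fits on the concatenated data), and is shown as one way to keep test rows from influencing the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns with gaps
train_cat = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", np.nan]})
test_cat = pd.DataFrame({"MSZoning": [np.nan, "RM"]})

mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mode_imputer.fit(train_cat)  # learns the mode from the training rows only

train_filled = pd.DataFrame(mode_imputer.transform(train_cat), columns=train_cat.columns)
test_filled = pd.DataFrame(mode_imputer.transform(test_cat), columns=test_cat.columns)

# statistics_ holds the learned fill value for each column
print(mode_imputer.statistics_)
print(test_filled["MSZoning"].tolist())
```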


In [436]:
df_fin = pd.concat([df_num,df_cat],axis=1)
df_fin.shape

(2919, 73)

**Inference:** All null values have been imputed, so none remain in the dataset.

## 3. Encoding

In [440]:
for i in cat:
    l=LabelEncoder()
    df_fin[i]=l.fit_transform(df_fin[i])
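
A short sketch of what `LabelEncoder` does to a single column, using a made-up `quality` series. The encoder assigns integer codes in sorted order of the class labels, so the codes do not necessarily follow the real-world ranking of the categories. One-hot encoding with `pd.get_dummies` is shown as an alternative that avoids implying an order; it is not used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality ratings
quality = pd.Series(["Gd", "TA", "Ex", "TA", "Fa"], name="KitchenQual")

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# Codes follow alphabetical order of the labels: Ex, Fa, Gd, TA
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)
print(codes.tolist())

# Alternative: one indicator column per category, with no implied order
one_hot = pd.get_dummies(quality, prefix="KitchenQual")
print(one_hot.columns.tolist())
```

A linear model treats the integer codes as magnitudes, so an alphabetical code such as `TA` > `Gd` > `Ex` can work against the fit for columns that have a natural quality ranking.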

In [448]:
df_train = df_fin.iloc[:train.shape[0],:]

In [450]:
df_test = df_fin.iloc[train.shape[0]:,:]

In [526]:
df_test.reset_index(drop=True,inplace=True)
df_test.head()

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSSubClass,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,80.0,11622.0,1961.0,1961.0,0.0,468.0,144.0,270.0,882.0,896.0,0.0,0.0,896.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,730.0,140.0,0.0,0.0,0.0,120.0,0.0,0.0,6.0,2010.0,0,2,1,3,3,0,4,0,12,1,2,0,2,4,5,1,1,12,13,3,4,1,3,3,3,4,3,1,4,1,4,3,6,1,2,4,4,2,8,4
1,81.0,14267.0,1958.0,1958.0,108.0,923.0,0.0,406.0,1329.0,1329.0,0.0,0.0,1329.0,0.0,0.0,1.0,1.0,3.0,1.0,6.0,0.0,1958.0,1.0,312.0,393.0,36.0,0.0,0.0,0.0,0.0,12500.0,6.0,2010.0,0,3,1,0,3,0,0,0,12,2,2,0,2,5,5,3,1,13,14,3,4,1,3,3,3,0,5,1,4,1,4,2,6,1,2,4,4,2,8,4
2,74.0,13830.0,1997.0,1998.0,0.0,791.0,0.0,137.0,928.0,928.0,701.0,0.0,1629.0,0.0,0.0,2.0,1.0,3.0,1.0,6.0,1.0,1997.0,2.0,482.0,212.0,34.0,0.0,0.0,0.0,0.0,0.0,3.0,2010.0,5,3,1,0,3,0,4,0,8,2,2,0,5,4,4,1,1,12,13,3,4,2,2,3,3,2,5,1,2,1,4,3,6,1,0,4,4,2,8,4
3,78.0,9978.0,1998.0,1998.0,20.0,602.0,0.0,324.0,926.0,926.0,678.0,0.0,1604.0,0.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,1998.0,2.0,470.0,360.0,36.0,0.0,0.0,0.0,0.0,0.0,6.0,2010.0,5,3,1,0,3,0,4,0,8,2,2,0,5,5,5,1,1,12,13,3,4,2,3,3,3,2,5,1,0,1,4,2,6,1,0,4,4,2,8,4
4,43.0,5005.0,1992.0,1992.0,0.0,263.0,0.0,1017.0,1280.0,1280.0,0.0,0.0,1280.0,0.0,0.0,2.0,0.0,2.0,1.0,5.0,0.0,1992.0,2.0,506.0,0.0,82.0,0.0,0.0,144.0,0.0,0.0,1.0,2010.0,11,3,1,0,1,0,4,0,22,2,2,4,2,7,4,1,1,6,6,2,4,2,2,3,3,0,5,1,0,1,4,2,6,1,1,4,4,2,8,4


**Inference:** Encoding is complete, and the combined data has been split back into `df_train` and `df_test` for model training.

## 4. Model training

We train a basic linear regression model. The target variable `SalePrice` is log-transformed to reduce skewness, and 30% of the training data is held out to evaluate the model.

In [452]:
x=df_train
y=train['SalePrice']

In [458]:
y=np.log(y)
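
The effect of the log transform on the error metric can be sketched with made-up prices. On the log scale, RMSE measures relative error, so a 10% miss counts the same for a cheap house as for an expensive one. `np.exp` inverts `np.log`, which is why predictions are exponentiated before submission.

```python
import numpy as np

# Hypothetical prices and predictions that are all 10% too high
prices = np.array([100_000.0, 200_000.0, 400_000.0])
preds = prices * 1.10

# RMSE on the log scale: identical relative errors give log(1.10) for every house
log_rmse = np.sqrt(np.mean((np.log(preds) - np.log(prices)) ** 2))

# RMSE on the raw scale is dominated by the most expensive house
raw_rmse = np.sqrt(np.mean((preds - prices) ** 2))

# np.exp undoes np.log, recovering prices in the original units
recovered = np.exp(np.log(prices))

print(log_rmse, raw_rmse)
```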

In [460]:
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.30,random_state=42)

In [466]:
lr=LinearRegression()
model=lr.fit(xtrain,ytrain)
pred_train = model.predict(xtrain)
pred_test = model.predict(xtest)

In [468]:
r2_train = r2_score(ytrain,pred_train)
r2_test = r2_score(ytest,pred_test)

print('R2_train:',r2_train)
print('R2_test:',r2_test)

R2_train: 0.8850791701057361
R2_test: 0.8695462417398837


In [470]:
print('RMSE Train: ',np.sqrt(mean_squared_error(ytrain,pred_train)))
print('RMSE Test: ',np.sqrt(mean_squared_error(ytest,pred_test)))

RMSE Train:  0.13346676596548612
RMSE Test:  0.14876484731904138


**Observation**: The basic model achieves an RMSE (on log-transformed `SalePrice`) of approximately 0.133 on the training split and 0.149 on the held-out split, indicating reasonable performance. We now use this model to predict on the provided test dataset and save the result in the required submission format.

In [476]:
ypred = model.predict(df_test)

In [480]:
ypred = np.exp(ypred)

In [482]:
ypred

array([115149.22683492, 132781.3511098 , 165545.22166169, ...,
       152455.4072098 , 123772.16840872, 256188.79523305])

In [484]:
model_prediction = pd.DataFrame({'Id':test.Id,'SalePrice':ypred})
model_prediction.to_csv('Linear.csv',index=False)

## 5. Regularization Techniques

In [487]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV

**Ridge Linear Regression** adds an L2 penalty (squared magnitude of coefficients) to the loss function, which helps prevent overfitting by shrinking the model weights. We use `GridSearchCV` to find the best penalty strength (`alpha`).
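
The shrinkage effect of the L2 penalty can be sketched with the closed-form ridge solution on synthetic data. This is an illustration only: `ridge_weights` is a hypothetical helper, it omits the intercept that scikit-learn's `Ridge` fits by default, and the data is randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without intercept: solve (X'X + alpha*I) w = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_small = ridge_weights(X, y, alpha=0.01)   # close to ordinary least squares
w_large = ridge_weights(X, y, alpha=100.0)  # heavily shrunk toward zero

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger `alpha` always gives a weight vector with a smaller norm, which is the trade-off the grid search below is tuning.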

In [538]:
ridge = Ridge()
params=({'alpha':[0.1,0.5,0.6,0.9,0.01,0.02,0.001,0.002,0.003,0.3,1,2,3,4,5,10,15,20]})

grid_ridge = GridSearchCV(estimator=ridge,
                         param_grid=params,
                         scoring='r2',
                         cv=5)
grid_ridge.fit(xtrain,ytrain)

In [540]:
pred_train = grid_ridge.predict(xtrain)
pred_test = grid_ridge.predict(xtest)
rmse_train = np.sqrt(mean_squared_error(ytrain,pred_train))
rmse_test = np.sqrt(mean_squared_error(ytest,pred_test))
print('RMSE train: ',rmse_train)
print('RMSE test: ',rmse_test)

RMSE train:  0.1337057559190548
RMSE test:  0.1480623910946379


**Observation**: The Ridge model achieves an RMSE of approximately 0.134 on the training split and 0.148 on the held-out split, which is similar to the base model.
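
The notebook does not show which `alpha` the search selected. A fitted `GridSearchCV` exposes this through `best_params_` and `best_score_`. The sketch below runs on synthetic data from `make_regression`, and it scores with `neg_root_mean_squared_error` instead of the `r2` used above, so that the search optimizes the same metric that is reported; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negated RMSE
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
cv_rmse = -search.best_score_  # flip the sign back to a regular RMSE

print(best_alpha, cv_rmse)
```

The same attributes are available on `grid_ridge`, `grid_lasso`, and `grid_enet` after fitting.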

In [495]:
ypred_ridge=grid_ridge.predict(df_test)
ypred_ridge=np.exp(ypred_ridge)

In [497]:
model_ridge= pd.DataFrame({'Id':test.Id,'SalePrice':ypred_ridge})
model_ridge

Unnamed: 0,Id,SalePrice
0,1461,116606.452704
1,1462,126236.542254
2,1463,166664.017637
3,1464,188894.740895
4,1465,182310.167556
...,...,...
1454,2915,93767.753663
1455,2916,87474.384798
1456,2917,153166.628515
1457,2918,122909.122309


**Lasso Regression** adds an L1 penalty (absolute value of coefficients) to the loss function, which not only prevents overfitting but also performs feature selection by shrinking some coefficients exactly to zero.
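
The feature-selection behaviour can be seen on synthetic data in which only two of six features matter. This is a sketch with made-up data and an arbitrarily chosen `alpha=0.5`, not a result from the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))

# Only the first two features influence the target; the other four are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that the L1 penalty has set exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0.0))

print(lasso_demo.coef_.round(3))
print(n_zero)
```

The informative coefficients survive, shrunk toward zero, while the noise features are dropped from the model entirely.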

In [544]:
lasso=Lasso()
params=({'alpha':[0.01, 0.02, 0.05, 0.001,0.005,0.009,0.002,0.003,0.004,0.005,0.007,0.008,1,2]})
grid_lasso = GridSearchCV(estimator=lasso,
                         param_grid=params,
                         scoring='r2',
                         cv=5)
grid_lasso.fit(xtrain,ytrain)

In [546]:
pred_train = grid_lasso.predict(xtrain)
pred_test = grid_lasso.predict(xtest)
rmse_train = np.sqrt(mean_squared_error(ytrain,pred_train))
rmse_test = np.sqrt(mean_squared_error(ytest,pred_test))
print('RMSE train: ',rmse_train)
print('RMSE test: ',rmse_test)

RMSE train:  0.1345011423920012
RMSE test:  0.1467781699805046


**Observation**: The Lasso model achieves an RMSE of approximately 0.135 on the training split and 0.147 on the held-out split, a slight improvement over the base model on the held-out data.

**ElasticNet Regression** combines both L1 (Lasso) and L2 (Ridge) penalties, balancing between feature selection and coefficient shrinkage.
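
For reference, the objective that scikit-learn's `ElasticNet` minimizes can be written as follows, where $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter (the symbols $\rho$ and $w$ are notation chosen here):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term (Lasso), and $\rho = 0$ leaves only the L2 term, so the grid over `l1_ratio` below searches between the two behaviours.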

In [552]:
enet = ElasticNet()
params=({'l1_ratio':[0.1,0.01,0.001,0.2,0.25,0.3,0.5],
         'alpha'   :[0.1,0.2,0.9,1,2,3,4,5]
        })
grid_enet = GridSearchCV(estimator=enet,
                         param_grid=params,
                         scoring='r2',
                         cv=5)
grid_enet.fit(xtrain,ytrain)

In [554]:
pred_train = grid_enet.predict(xtrain)
pred_test = grid_enet.predict(xtest)
rmse_train = np.sqrt(mean_squared_error(ytrain,pred_train))
rmse_test = np.sqrt(mean_squared_error(ytest,pred_test))
print('RMSE train: ',rmse_train)
print('RMSE test: ',rmse_test)

RMSE train:  0.1365871889014257
RMSE test:  0.14509574910991233


**Observation**: The ElasticNet model achieves an RMSE of approximately 0.137 on the training split and 0.145 on the held-out split. This is the lowest held-out RMSE of the four models, although the differences between them are small.

## 6. Conclusion
This notebook demonstrates the process of loading, preprocessing, and modeling the Kaggle House Prices dataset using basic and regularized regression models. Key steps include:
- Combining train and test datasets for consistent preprocessing.
- Training regularized linear models with hyperparameter tuning.
- Evaluating model performance using RMSE.
- Generating and saving predictions for submission.