# About Dataset

SalePrice: The property's sale price in dollars. This is the target variable to predict.

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access

Alley: Type of alley access

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

OverallQual: Overall material and finish quality

OverallCond: Overall condition rating

YearBuilt: Original construction date

YearRemodAdd: Remodel date

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

MasVnrArea: Masonry veneer area in square feet

ExterQual: Exterior material quality

ExterCond: Present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Height of the basement

BsmtCond: General condition of the basement

BsmtExposure: Walkout or garden level basement walls

BsmtFinType1: Quality of basement finished area

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Quality of second finished area (if present)

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system

1stFlrSF: First floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

KitchenAbvGr: Kitchens above grade

KitchenQual: Kitchen quality

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality rating

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories

MiscVal: Dollar value of miscellaneous feature

MoSold: Month sold

YrSold: Year sold

SaleType: Type of sale

SaleCondition: Condition of sale

# Import Libraries

In [18]:
# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import graphviz

# Gradient boosting libraries
import xgboost
import lightgbm as lgb
from xgboost import XGBRegressor, plot_importance
from lightgbm import LGBMRegressor

# scikit-learn: preprocessing, model selection, models, metrics
import sklearn
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# EDA + Data preprocessing

In [2]:
df = pd.read_csv("./train.csv")

In [3]:
df.shape

(1460, 81)

In [4]:
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [6]:
# Drop sparsely populated columns (e.g. Alley has only 91 non-null values out of 1460)
df = df.drop(columns=["MiscFeature", "PoolQC", "Fence", "FireplaceQu", "Alley"])
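
The columns dropped here are sparsely populated; in the `df.info()` output above, `Alley` has only 91 non-null values out of 1460. A minimal sketch of how such columns could be found programmatically is shown below. The toy frame and the 50% cutoff are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: columns with more than half of their values missing.
cutoff = 0.5
sparse_cols = missing_share[missing_share > cutoff].index.tolist()

toy_reduced = toy.drop(columns=sparse_cols)
print(missing_share)
print(sparse_cols)
```

On the real data, `df.isna().mean()` would give the same per-column share, and the cutoff would need to be chosen by inspection.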


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [8]:
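# Numeric columns: replace missing values with the column mean
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())

# Remaining (categorical) columns: mark missing values with an explicit label
df = df.fillna("Unknown")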
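
To make the effect of this cell concrete, here is a small sketch on a made-up two-column frame (the values are illustrative only). The numeric gap is filled with the column mean and the categorical gap with the label `"Unknown"`.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
})

# Same two-step logic as the notebook cell: mean for numeric, label for the rest.
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mean())
toy = toy.fillna("Unknown")

print(toy)
```

Note that in the notebook the means are computed on the full data set before the train/test split, so the test rows contribute to the imputed values.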

In [9]:
def encode_categorical_features(train_df, categorical_features):
    """Return a copy of train_df with each listed column label-encoded as int64."""
    train_df_encoded = train_df.copy()

    for feature in categorical_features:
        # Fit one encoder per column; codes follow the sorted order of the labels
        le = LabelEncoder()
        le.fit(train_df[feature])
        train_df_encoded[feature] = le.transform(train_df_encoded[feature]).astype('int64')

    return train_df_encoded

# Names of the object-dtype (categorical) columns
cat_features = df.select_dtypes(include=["object"]).columns
df = encode_categorical_features(df, cat_features)
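
As a quick illustration of what `LabelEncoder` does to a single column, the sketch below encodes a made-up list of `Street`-like values. The integer codes follow the sorted order of the labels, so they carry no meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values resembling the Street column after the "Unknown" fill.
street = ["Pave", "Grvl", "Pave", "Unknown"]

le = LabelEncoder()
codes = le.fit_transform(street)

print(list(le.classes_))  # labels in sorted order
print(codes.tolist())     # integer code of each input value
```

Because the codes are arbitrary integers, linear models such as `LinearRegression` or `Ridge` will treat them as ordered quantities; tree-based models are usually less sensitive to this.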

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   int64  
 3   LotFrontage    1460 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   int64  
 6   LotShape       1460 non-null   int64  
 7   LandContour    1460 non-null   int64  
 8   Utilities      1460 non-null   int64  
 9   LotConfig      1460 non-null   int64  
 10  LandSlope      1460 non-null   int64  
 11  Neighborhood   1460 non-null   int64  
 12  Condition1     1460 non-null   int64  
 13  Condition2     1460 non-null   int64  
 14  BldgType       1460 non-null   int64  
 15  HouseStyle     1460 non-null   int64  
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [11]:
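# Separate the features from the target (X still contains the Id column)
X = df.drop("SalePrice", axis=1)
y = df["SalePrice"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)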
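
With 1460 rows and `test_size=0.2`, the split should leave 1168 rows for training and 292 for testing. A small sketch with placeholder arrays of the same length (not the real data) confirms the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the data set.
X_toy = np.arange(1460).reshape(-1, 1)
y_toy = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
```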

# Model training and evaluation

In [15]:
def train_evaluate(X_train, X_test):
    """Fit several regressors and print their test MSE and MAE.

    Note: y_train and y_test are read from the notebook's global scope.
    """
    models = {
        'LinearRegression': LinearRegression(),
        'Lasso': Lasso(random_state=42),
        'Ridge': Ridge(random_state=42),
        'ElasticNet': ElasticNet(random_state=42),
        'XGBRegressor': XGBRegressor(random_state=42),
        'RandomForestRegressor': RandomForestRegressor(random_state=42),
        'LGBMRegressor': LGBMRegressor(random_state=42),
        'DecisionTreeRegressor': DecisionTreeRegressor(random_state=42),
    }

    # Collect one row of metrics per model
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rows.append({
            'Model': name,
            'MSE': mean_squared_error(y_test, predictions),
            'MAE': mean_absolute_error(y_test, predictions),
        })

    results = pd.DataFrame(rows).set_index('Model').round(2)

    print("RESULTS BY MSE")
    print(results.sort_values(by='MSE'))

    print("RESULTS BY MAE")
    print(results.sort_values(by='MAE'))

    print("RESULTS BY MSE AND MAE")
    print(results.sort_values(by=['MSE', 'MAE']))
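
The function reports MSE and MAE. MSE is in squared dollars, which makes it hard to read next to MAE; taking its square root (RMSE) brings it back to dollars. The sketch below uses three made-up price pairs, not results from this notebook, to show how the three metrics relate.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted prices, in dollars.
y_true = np.array([100_000, 200_000, 300_000])
y_pred = np.array([110_000, 190_000, 330_000])

mse = mean_squared_error(y_true, y_pred)   # squared dollars
mae = mean_absolute_error(y_true, y_pred)  # dollars
rmse = np.sqrt(mse)                        # dollars, penalizes large errors more than MAE

print(mse, mae, rmse)
```

RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off.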
    
    
    

# Dimensionality reduction with PCA

In [16]:
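# Standardize features; the scaler is fitted on the training set only
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep as many principal components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

train_evaluate(X_train_pca, X_test_pca)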

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000554 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 14535
[LightGBM] [Info] Number of data points in the train set: 1168, number of used features: 57
[LightGBM] [Info] Start training from score 181441.541952
RESULTS BY MSE
                                MSE       MAE
Model                                        
XGBRegressor           8.149835e+08  18434.42
RandomForestRegressor  1.064697e+09  19172.69
LGBMRegressor          1.065716e+09  18837.64
Ridge                  1.163595e+09  21461.49
Lasso                  1.163665e+09  21465.31
LinearRegression       1.163679e+09  21466.77
ElasticNet             1.236966e+09  20657.62
DecisionTreeRegressor  2.478101e+09  27937.77
RESULTS BY MAE
                                MSE       MAE
Model                                        
XGBRegressor           8.149835e+08  18434.42
LGBMRegressor          1.
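
Passing a float to `PCA(n_components=...)` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The LightGBM log above reports 57 used features, which is presumably the number of components retained here. The sketch below shows how the retained count can be inspected, using synthetic data that is an assumption for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X_toy)
pca = PCA(n_components=0.95).fit(X_scaled)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_)
print(cumulative)
```

In the notebook, `pca.n_components_` and `pca.explained_variance_ratio_` can be read from the fitted `pca` object in the same way.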

# Feature selection by XGBoost feature importance

In [17]:
# Fit a small XGBoost model only to rank the features by importance
model = XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                     max_depth=5, alpha=10, n_estimators=10)
model.fit(X_train, y_train)

feature_importances = model.feature_importances_

# Keep the positions of features whose importance exceeds the mean importance
threshold = feature_importances.mean()
selected_features = [idx for idx, importance in enumerate(feature_importances) if importance > threshold]

X_train_selected = X_train.iloc[:, selected_features]
X_test_selected = X_test.iloc[:, selected_features]

train_evaluate(X_train_selected, X_test_selected)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000134 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1211
[LightGBM] [Info] Number of data points in the train set: 1168, number of used features: 13
[LightGBM] [Info] Start training from score 181441.541952
RESULTS BY MSE
                                MSE       MAE
Model                                        
RandomForestRegressor  8.366395e+08  18091.69
XGBRegressor           8.948992e+08  19378.47
LGBMRegressor          9.442541e+08  18727.48
Ridge                  1.342761e+09  23543.17
Lasso                  1.342885e+09  23549.85
LinearRegression       1.342908e+09  23550.76
ElasticNet             1.391284e+09  22757.29
DecisionTreeRegressor  2.226789e+09  27896.96
RESULTS BY MAE
                                MSE       MAE
Model                                        
RandomForestRegressor  8.366395e+08  18091.69
LGBMRegressor          9.4
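
The manual thresholding above is close to what scikit-learn's `SelectFromModel` does with `threshold="mean"`. The sketch below illustrates this on synthetic data. To keep it dependent on scikit-learn only, a `RandomForestRegressor` stands in for `XGBRegressor`; that substitution and the data are assumptions. One difference worth noting: `SelectFromModel` keeps features with importance greater than or equal to the threshold, while the notebook uses a strict `>`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression problem: 10 features, 3 of them informative.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=42)

# Fit the estimator and keep features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=42),
    threshold="mean",
)
selector.fit(X_toy, y_toy)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()          # boolean mask of kept features
X_selected = selector.transform(X_toy)

print(np.flatnonzero(mask))
print(X_selected.shape)
```

In the notebook, the names of the kept columns can be listed with `X_train.columns[selected_features]`.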

# Random Forest Regressor tuning

In [20]:
rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                           cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')

grid_search.fit(X_train_selected, y_train)

print(f"Best parameters: {grid_search.best_params_}")

best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test_selected)
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean Squared Error of the best model: {mse}")
print(f"Mean Absolute Error of the best model: {mae}")

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Best parameters: {'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Mean Squared Error of the best model: 849790848.9406691
Mean Absolute Error of the best model: 18211.858082711366
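
The "216 candidates, totalling 648 fits" line in the log follows directly from the grid: 3 × 4 × 3 × 3 × 2 parameter combinations, each fitted on 3 folds. A short sketch with `ParameterGrid` reproduces the count without training anything:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # cv=3 folds per candidate

print(n_candidates, n_fits)
```

Note that the tuned model's test MSE (about 8.50e+08) is slightly higher than the 8.366395e+08 reported for the default `RandomForestRegressor` on the same selected features, so the grid search did not improve the held-out error in this run.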
