# Modeling of Housing Prices

This notebook builds, trains, and evaluates regression models that predict the log-transformed target variable `SalePrice`. The steps are: set up PyCaret and compare candidate models, train four gradient-boosting regressors, evaluate them with 10-fold cross-validation, and generate predictions for submission.


In [1]:
!pip install pycaret
!pip install catboost

Collecting catboost
  Using cached catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
train_x = pd.read_csv('train_x.csv')
test_x = pd.read_csv('test_x.csv')
log_train_y = pd.read_csv('log_train_y.csv')
submission_data = pd.read_csv('sample_submission.csv')

## Modeling with PyCaret

- Combined the training features (`train_x`) and log-transformed target (`log_train_y`) into a single dataset (`train_data_with_target`).
- Set up PyCaret for regression using the `setup()` function.
- Used `compare_models()` to evaluate and compare various regression models on the dataset.


In [4]:
from pycaret.regression import *

In [5]:
combined_data = pd.concat([train_x, log_train_y], axis=1)

In [6]:
train_data_with_target = combined_data
# Setup PyCaret for regression
regression_setup = setup(
    data=train_data_with_target,
    target='SalePrice',  # Log-transformed target column from log_train_y
    normalize=True,      # Scales numeric features (z-score by default)
    session_id=42,       # Ensures reproducibility
)

| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | SalePrice |
| 2 | Target type | Regression |
| 3 | Original data shape | (1460, 69) |
| 4 | Transformed data shape | (1460, 69) |
| 5 | Transformed train set shape | (1021, 69) |
| 6 | Transformed test set shape | (439, 69) |
| 7 | Numeric features | 68 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
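The effect of `normalize=True` can be illustrated with a small NumPy sketch. It assumes PyCaret's default z-score method and a roughly 70/30 split, which matches the 1021/439 row counts reported above. The data is synthetic and the variable names are hypothetical; this is a conceptual illustration, not PyCaret's internal code.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic stand-in data

# 70/30 split, mirroring the train/test proportions reported by setup()
n_train = (7 * len(X)) // 10
X_train, X_holdout = X[:n_train], X[n_train:]

# Estimate the scaling statistics on the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_holdout_scaled = (X_holdout - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Estimating the mean and standard deviation from the training rows alone keeps information from the hold-out rows out of the preprocessing step.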


In [7]:
compare_models()

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Regressor | 0.0814 | 0.0156 | 0.1229 | 0.8993 | 0.0095 | 0.0068 | 5.784 |
| gbr | Gradient Boosting Regressor | 0.0865 | 0.0166 | 0.127 | 0.8919 | 0.0098 | 0.0072 | 0.456 |
| lightgbm | Light Gradient Boosting Machine | 0.0909 | 0.0179 | 0.132 | 0.8835 | 0.0102 | 0.0076 | 0.736 |
| et | Extra Trees Regressor | 0.0934 | 0.0196 | 0.1382 | 0.873 | 0.0107 | 0.0078 | 1.12 |
| br | Bayesian Ridge | 0.0935 | 0.0199 | 0.1388 | 0.8696 | 0.0107 | 0.0078 | 0.07 |
| huber | Huber Regressor | 0.0904 | 0.0198 | 0.1379 | 0.8694 | 0.0106 | 0.0076 | 0.127 |
| xgboost | Extreme Gradient Boosting | 0.0985 | 0.0202 | 0.1405 | 0.869 | 0.0108 | 0.0082 | 0.43 |
| ridge | Ridge Regression | 0.0942 | 0.0203 | 0.1404 | 0.8663 | 0.0108 | 0.0079 | 0.039 |
| rf | Random Forest Regressor | 0.0975 | 0.0211 | 0.1432 | 0.8636 | 0.0111 | 0.0082 | 1.61 |
| lar | Least Angle Regression | 0.1026 | 0.0233 | 0.1499 | 0.846 | 0.0115 | 0.0086 | 0.065 |



<catboost.core.CatBoostRegressor at 0x7ab0b5c1f370>

## Model Training and Cross-validation

- Defined hyperparameters for multiple regression models: `GradientBoostingRegressor`, `LGBMRegressor`, `XGBRegressor`, and `CatBoostRegressor`.
- Trained each model on the training data (`train_x` and `log_train_y`).
- Applied 10-fold cross-validation to evaluate model performance using RMSE (Root Mean Squared Error).
- Printed the mean and standard deviation of RMSE for each model to assess their performance.
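The mechanics behind the cross-validation step can be sketched by hand. The example below is a minimal illustration rather than the notebook's method: it swaps in ordinary least squares as a stand-in model so that it needs only NumPy, and it runs on synthetic data. The function name `kfold_rmse` and all data values are hypothetical.

```python
import numpy as np

def kfold_rmse(X, y, n_splits=10, seed=42):
    """Return per-fold RMSE for an ordinary least-squares stand-in model."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # shuffle once, like shuffle=True
    folds = np.array_split(indices, n_splits)  # near-equal validation folds
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit least squares with an intercept column on the training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        residuals = y[val_idx] - A_val @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(scores)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 12.0 + rng.normal(scale=0.1, size=200)

fold_rmse = kfold_rmse(X, y)
print(f"Mean RMSE: {fold_rmse.mean():.4f}")
print(f"Std Dev:   {fold_rmse.std():.4f}")
```

Each observation is used for validation exactly once, so the mean of the per-fold RMSE values estimates out-of-sample error and the standard deviation shows how much that estimate varies between folds.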


In [8]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

In [9]:
gbr_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 3,
    "random_state": 42
}

lgbm_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": -1,
    "random_state": 42
}
catboost_params = {
    "iterations": 5000,
    "learning_rate": 0.02,
    "depth": 4,
    "eval_metric": "RMSE",
    # Early stopping only takes effect when an eval_set is passed to fit()
    "early_stopping_rounds": 20
}

xgboost_params = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1
}

In [10]:
models = {
    "catboost": CatBoostRegressor(**catboost_params, verbose=0),
    # XGBoost's logging parameter is `verbosity`, not `verbose`
    "xgb": XGBRegressor(**xgboost_params, verbosity=0),
    "gbr": GradientBoostingRegressor(**gbr_params),
    "lgbm": LGBMRegressor(**lgbm_params)
}

# log_train_y is read as a single-column DataFrame; flatten it to 1-D
y_train = log_train_y.values.ravel()

# Train each model on the full training set
for name, model in models.items():
    model.fit(train_x, y_train)
    print(f"{name} trained.")

catboost trained.
xgb trained.
gbr trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000734 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2951
[LightGBM] [Info] Number of data points in the train set: 1460, number of used features: 64
[LightGBM] [Info] Start training from score 12.024057
lgbm trained.


In [16]:
# Cross-validation to evaluate models
results = {}
kf = KFold(n_splits=10, shuffle=True, random_state=42)

for name, model in models.items():
    result = np.sqrt(-cross_val_score(model, train_x, log_train_y, scoring='neg_mean_squared_error', cv=kf))
    results[name] = result

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000633 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2899
[LightGBM] [Info] Number of data points in the train set: 1314, number of used features: 64
[LightGBM] [Info] Start training from score 12.025324

In [17]:
for name, result in results.items():
    print(f"----------\n{name}")
    print(f"Mean RMSE: {np.mean(result)}")
    print(f"Std Dev: {np.std(result)}")

----------
catboost
Mean RMSE: 0.11838747317996068
Std Dev: 0.019828071229297844
----------
xgb
Mean RMSE: 0.1282795459586145
Std Dev: 0.017620735851197975
----------
gbr
Mean RMSE: 0.12102072633305387
Std Dev: 0.016863005722192716
----------
lgbm
Mean RMSE: 0.12916089828222518
Std Dev: 0.019744410382742664
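Because the models are fit to log prices, these RMSE values are on the log scale. Assuming the target is a natural log of `SalePrice` (consistent with the later use of `np.exp`), `exp(RMSE) - 1` gives a rough sense of the typical relative error on the price scale. The sketch below applies that conversion to the mean RMSE values printed above.

```python
import numpy as np

# Mean cross-validated RMSE values reported above (log-price scale)
mean_log_rmse = {
    "catboost": 0.11838747317996068,
    "xgb": 0.1282795459586145,
    "gbr": 0.12102072633305387,
    "lgbm": 0.12916089828222518,
}

# exp(rmse) is the typical multiplicative error factor on the price scale
relative_error = {name: np.exp(v) - 1.0 for name, v in mean_log_rmse.items()}

# Print models from lowest to highest relative error
for name, err in sorted(relative_error.items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: ~{err:.1%} typical relative error")
```

This is an approximation for interpretation only; the ranking of the models is the same on either scale.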


## Model Prediction and Submission

- Exponentiated each model's log-scale predictions to revert the log transformation.
- Combined the price-scale predictions from `CatBoostRegressor` and `GradientBoostingRegressor` using a weighted average (90% from CatBoost and 10% from GBR).
- Created a submission DataFrame with `Id` from the `submission_data` and the predicted `SalePrice`.
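The order of exponentiation and averaging matters slightly. The sketch below compares the two options on hypothetical log-scale predictions (the arrays and variable names are made up for illustration): blending on the price scale, as this notebook does, versus blending on the log scale before exponentiating.

```python
import numpy as np

# Hypothetical log-scale predictions from two models for three houses
log_pred_a = np.array([11.75, 12.02, 12.14])
log_pred_b = np.array([11.80, 12.00, 12.20])
w_a, w_b = 0.9, 0.1

# Price-scale blend: exponentiate first, then average (weighted arithmetic mean)
price_blend = w_a * np.exp(log_pred_a) + w_b * np.exp(log_pred_b)

# Log-scale blend: average first, then exponentiate (weighted geometric mean)
log_blend = np.exp(w_a * log_pred_a + w_b * log_pred_b)

print(price_blend.round(2))
print(log_blend.round(2))
```

The price-scale blend is never smaller than the log-scale blend, since a weighted arithmetic mean is at least the corresponding weighted geometric mean. When the two models agree closely, the gap is small.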

In [13]:
y_prd = (
    0.9 * np.exp(models['catboost'].predict(test_x)) +
    0.1 * np.exp(models['gbr'].predict(test_x))
)

In [14]:
sub = pd.DataFrame({
    'Id': submission_data['Id'],
    'SalePrice': y_prd
})

In [19]:
sub.head(10)

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 127800.976651 |
| 1 | 1462 | 165956.047121 |
| 2 | 1463 | 187773.484223 |
| 3 | 1464 | 201194.471624 |
| 4 | 1465 | 176284.410953 |
| 5 | 1466 | 170740.782255 |
| 6 | 1467 | 178397.115893 |
| 7 | 1468 | 165502.832131 |
| 8 | 1469 | 180903.774986 |
| 9 | 1470 | 127624.718667 |


In [15]:
#sub.to_csv('submission.csv', index=False)
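Before writing the file, a few sanity checks on the submission frame can catch common mistakes. The helper below is a hypothetical sketch, not part of the original notebook; `validate_submission` and `demo_sub` are illustrative names, and the demo rows reuse rounded values from the preview above.

```python
import pandas as pd

def validate_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(sub.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert len(sub) == expected_rows, "row count mismatch"
    assert sub["Id"].is_unique, "duplicate Id values"
    assert sub["SalePrice"].notna().all(), "missing predictions"
    assert (sub["SalePrice"] > 0).all(), "non-positive prices"

# Small hypothetical frame standing in for `sub`
demo_sub = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [127800.98, 165956.05, 187773.48],
})
validate_submission(demo_sub, expected_rows=3)

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = demo_sub.to_csv(index=False)
print(csv_text)
```

In the notebook itself, the equivalent call would be `validate_submission(sub, expected_rows=len(submission_data))` followed by the `to_csv` line above with a file path.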