**Comparing XGBoost and Random Forest Models for House Price Prediction Using Python**

In this project, we will walk through the steps to build house price prediction models using two popular algorithms: XGBoost and Random Forest. We will use the publicly available Boston Housing dataset.

*Step 1: Import Necessary Libraries*

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression

*Step 2: Load the Dataset*

The Boston Housing dataset can be downloaded from the following URL:

In [2]:
# Load the Boston Housing dataset from the CMU StatLib archive.
# Each record spans two physical lines in the file, so even and odd rows are recombined below.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 feature columns
target = raw_df.values[1::2, 2]  # median home value

In [3]:
# Create a Pandas DataFrame
df = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
df['PRICE'] = target
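The slicing in the loading cell is easier to follow on a toy array. The sketch below is an illustration only: `toy_raw` is a made-up stand-in for `raw_df.values`, assuming each record is split across an 11-value row followed by a 3-value row padded with NaN.

```python
import numpy as np

nan = np.nan
# Hypothetical stand-in for raw_df.values: two records, two rows each.
toy_raw = np.array([
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [11, 12, 99, nan, nan, nan, nan, nan, nan, nan, nan],
    [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
    [31, 32, 77, nan, nan, nan, nan, nan, nan, nan, nan],
], dtype=float)

# Even rows hold the first 11 features; the first 2 values of each odd row
# complete the 13 features, and the third value of each odd row is the target.
toy_features = np.hstack([toy_raw[::2, :], toy_raw[1::2, :2]])
toy_target = toy_raw[1::2, 2]

print(toy_features.shape)  # (2, 13)
print(toy_target)          # [99. 77.]
```

The same pattern applied to the real file yields 13 feature columns and one target value per record.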

*Step 3: Split Data into Training and Testing Sets*

In [4]:
# Split data into training and testing sets
X = df.drop('PRICE', axis=1)
y = df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
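As a quick sanity check on the split, the sketch below uses placeholder arrays with the same shape as the Boston data (506 rows, 13 features) rather than the real values. It shows the resulting set sizes and that a fixed `random_state` makes the split reproducible. The names `X_demo`, `y_demo`, and the `a`/`b` suffixes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Boston features and target.
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

# Two splits with the same seed should select identical rows.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(Xa_train.shape, Xa_test.shape)     # (404, 13) (102, 13)
print(np.array_equal(ya_test, yb_test))  # True
```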

*Step 4: Create Pipelines for XGBoost and Random Forest Models*

In [5]:
# Create pipelines for XGBoost and Random Forest
xgb_model = xgb.XGBRegressor(random_state=42)
rf_model = RandomForestRegressor(random_state=42)

xgb_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb', xgb_model)
])

rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', rf_model)
])
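One reason to wrap the scaler and the model in a pipeline is that the scaler's statistics are learned only from the data passed to `fit`. The following is a minimal sketch on synthetic data (the `demo_` names and the 80/20 slicing are assumptions for illustration) that inspects the fitted scaler inside a pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 3 features, target driven by the first feature.
rng = np.random.default_rng(0)
demo_X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
demo_y = demo_X[:, 0] * 2.0 + rng.normal(size=100)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
])

# Fit on the first 80 rows only; the last 20 act as held-out data.
demo_pipeline.fit(demo_X[:80], demo_y[:80])

# The scaler's mean matches the training rows, not the full data set.
fitted_scaler = demo_pipeline.named_steps['scaler']
print(np.allclose(fitted_scaler.mean_, demo_X[:80].mean(axis=0)))  # True

# predict() applies the already-fitted scaler before the model.
demo_pred = demo_pipeline.predict(demo_X[80:])
print(demo_pred.shape)  # (20,)
```

This behavior is what keeps test-set information out of preprocessing when the pipeline is later passed to cross-validation.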

*Step 5: Train the Models*

In [6]:
# Train models
xgb_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train, y_train)

*Step 6: Make Predictions and Evaluate Models*

In [7]:
# Make predictions
y_pred_xgb = xgb_pipeline.predict(X_test)
y_pred_rf = rf_pipeline.predict(X_test)

In [8]:
# Evaluate models
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f'XGBoost MSE: {mse_xgb:.2f}')
print(f'Random Forest MSE: {mse_rf:.2f}')

XGBoost MSE: 6.91
Random Forest MSE: 7.91
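MSE is expressed in squared target units, so it can help to report RMSE (the square root of MSE) and R² alongside it. For instance, an MSE of 6.91 corresponds to an RMSE of about 2.63 in the target's own units. The sketch below uses small made-up arrays, not the model predictions above, to show the calculations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 9.0])

toy_mse = mean_squared_error(toy_true, toy_pred)  # mean of squared errors
toy_rmse = np.sqrt(toy_mse)                       # same units as the target
toy_r2 = r2_score(toy_true, toy_pred)             # share of variance explained

print(f'MSE: {toy_mse:.2f}')    # MSE: 0.50
print(f'RMSE: {toy_rmse:.2f}')  # RMSE: 0.71
print(f'R2: {toy_r2:.2f}')      # R2: 0.90
```

The same three calls can be applied to `y_test` with `y_pred_xgb` or `y_pred_rf`.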


*Step 7: Hyperparameter Tuning with GridSearchCV*

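To look for better-performing settings, we can tune hyperparameters with GridSearchCV. Each grid below contains 3 × 3 × 3 = 27 parameter combinations, and with 5-fold cross-validation that means 135 fits per model. Because the scoring is `neg_mean_squared_error`, higher (less negative) scores are better.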

In [9]:
# Hyperparameter tuning using GridSearchCV for both models
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'xgb__max_depth': [3, 5, 7],
    'xgb__learning_rate': [0.1, 0.5, 1],
    'xgb__n_estimators': [50, 100, 200]
}

param_grid_rf = {
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [3, 5, 7],
    'rf__min_samples_split': [2, 5, 10]
}

grid_search_xgb = GridSearchCV(xgb_pipeline, param_grid_xgb, cv=5, scoring='neg_mean_squared_error')
grid_search_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring='neg_mean_squared_error')

grid_search_xgb.fit(X_train, y_train)
grid_search_rf.fit(X_train, y_train)

print(f'Best XGBoost parameters: {grid_search_xgb.best_params_}')
print(f'Best Random Forest parameters: {grid_search_rf.best_params_}')

Best XGBoost parameters: {'xgb__learning_rate': 0.1, 'xgb__max_depth': 3, 'xgb__n_estimators': 200}
Best Random Forest parameters: {'rf__max_depth': 7, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}
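The best parameters alone do not show whether tuning helped on unseen data. A natural follow-up is to score the tuned model on the held-out test set. The sketch below illustrates this on synthetic data with a deliberately small grid. The names `X_syn`, `small_grid`, and `search` are assumptions for illustration, and the numbers it prints say nothing about the Boston results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Small grid to keep the example fast.
small_grid = {'rf__max_depth': [3, 5]}
search = GridSearchCV(pipe, small_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(Xs_train, ys_train)

# best_score_ is a negated MSE, so flip the sign to read it as an MSE.
cv_mse = -search.best_score_

# With the default refit=True, predict() uses the best pipeline
# refitted on the whole training set.
test_mse = mean_squared_error(ys_test, search.predict(Xs_test))

print(f'Best params: {search.best_params_}')
print(f'Cross-validated MSE: {cv_mse:.2f}')
print(f'Test MSE: {test_mse:.2f}')
```

Applied to this project, the same pattern would be `mean_squared_error(y_test, grid_search_xgb.predict(X_test))` and the equivalent call for `grid_search_rf`.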


*Conclusion*

In this project, we built and evaluated house price prediction models using XGBoost and Random Forest on the Boston Housing dataset. With default settings, XGBoost reached a test MSE of 6.91 compared with 7.91 for Random Forest. We then used GridSearchCV to search for better hyperparameters for both models. Wrapping preprocessing and the model in a single pipeline keeps the two steps together, so the same transformations are applied consistently during training, cross-validation, and prediction.