# House Appraiser (XGBoost, Regression)


`#extreme-gradient-boosting` `#decision-trees` `#ensemble-learning` `#regression` `#cross-validation` `#hyperparameter-tuning`


> Objectives
>
> - Implement a new version of the House Appraiser model with support for multiple features using XGBoost.
> - Use SciKit-Learn's Pipeline functionality for data preprocessing and model evaluation.
> - Perform Cross-Validation, a technique for gauging model generalizability and comparing models.
> - Use Hyperparameter Tuning to optimize training.


## Standard Deep Atlas Exercise Set Up


- [ ] Ensure you are using the coursework Pipenv environment and kernel ([instructions](../SETUP.md))
- [ ] Apply the standard Deep Atlas environment setup process by running this cell:


In [None]:
import sys, os
sys.path.insert(0, os.path.join('..', 'includes'))

import deep_atlas
from deep_atlas import FILL_THIS_IN
deep_atlas.initialize_environment()
if deep_atlas.environment == 'COLAB':
    %pip install -q python-dotenv==1.0.0

### 🚦 Checkpoint: Start

- [ ] Run this cell to record your start time:


In [None]:
deep_atlas.log_start_time()

---


## Context


XGBoost (XGB) stands for "eXtreme Gradient Boosting".

It is a standalone library whose Python API follows SciKit-Learn's estimator conventions, so its models slot into SciKit-Learn workflows. We will still need to use SciKit-Learn's APIs _alongside_ XGB for tasks outside of model training itself (data processing, evaluation, etc.).


### What does XGBoost do?


XGBoost (XGB) builds ensembles of decision trees, a supervised learning model:

A decision tree has nodes (tests on a feature), branches (decisions), and leaves (predictions). During training, the data is recursively split on feature thresholds chosen to reduce a loss function. Predictions are made by traversing the tree from the root to a leaf.
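
To make the traversal concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries. The feature names come from this exercise's dataset, but the thresholds and leaf values are invented for illustration, and `predict_one` is a hypothetical helper rather than part of XGBoost.

```python
# A tiny hand-built regression tree. Thresholds and leaf values are
# made up for illustration; a real tree learns them from training data.
toy_tree = {
    "feature": "median_income",
    "threshold": 5.0,
    "left": {  # taken when median_income < 5.0
        "feature": "housing_median_age",
        "threshold": 30.0,
        "left": {"leaf": 150_000.0},
        "right": {"leaf": 180_000.0},
    },
    "right": {"leaf": 350_000.0},  # taken when median_income >= 5.0
}


def predict_one(node, row):
    """Walk from the root to a leaf, following one branch per node."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


house = {"median_income": 3.2, "housing_median_age": 41.0}
toy_prediction = predict_one(toy_tree, house)
print(toy_prediction)  # 180000.0
```

The example house fails the income test's "high" branch, goes left, then goes right on the age test, landing on the 180,000 leaf.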

> Contrast with _random forests_: ensembles of decision trees that are trained independently (and so can run in parallel) on different subsets of data and features. They aggregate the outputs of individual trees (averaging for regression, voting for classification) to reduce overfitting. Gradient boosting is a distinct _ensemble_ strategy.

Gradient Boosting:

- Predicts by summing the outputs of many models, added sequentially so that each new model corrects the errors of the ones before it.
- Models are added until a set limit is reached or no further improvement is seen.
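
The sequential error-correcting idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not XGBoost's actual algorithm: it uses squared error, one-split "stumps" as the weak models, and toy data. The helper names (`fit_stump`, `boost`, and so on) are hypothetical.

```python
# Gradient boosting for squared error, using one-split "stumps" as the
# weak models. XGBoost adds regularization, second-order gradients, and
# many optimizations on top of this basic idea.

def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate):
    """Return (base prediction, list of stumps) after n_rounds."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return base, stumps


def boosted_predict(base, stumps, learning_rate, x):
    """Sum the base value and every stump's scaled correction."""
    return base + learning_rate * sum(stump(x) for stump in stumps)


def mean_abs_error(base, stumps, learning_rate, xs, ys):
    errors = [abs(y - boosted_predict(base, stumps, learning_rate, x))
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_y = [10.0, 12.0, 15.0, 30.0, 32.0, 35.0]

errors_by_rounds = {}
for rounds in (1, 10, 100):
    base, stumps = boost(toy_x, toy_y, n_rounds=rounds, learning_rate=0.1)
    errors_by_rounds[rounds] = mean_abs_error(base, stumps, 0.1, toy_x, toy_y)
    print(rounds, round(errors_by_rounds[rounds], 3))
```

The training error shrinks as rounds are added, which is why a small learning rate is usually paired with a large number of trees.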

Pros of XGBoost:

- Efficient API for gradient boosting.
- Supports regression, classification, ranking.
- Handles missing values, categorical data, and regularization.
- Includes cross-validation and feature-importance tools, and integrates with scikit-learn utilities like GridSearchCV.

Cons:

- Not suited for deep learning tasks (e.g., transformers, GANs, reinforcement learning). Use PyTorch for these.


## Exercise Goals


- In this exercise we will be using XGBoost's Regression model to predict a house's price, given other features about the house.
  - This model will be able to perform multiple regression — using multiple features to perform prediction — like you did with SciKit-Learn in previous exercises.


## Dependencies


In [None]:
if deep_atlas.environment == 'VIRTUAL':
    !pipenv install xgboost==2.0.3 ipykernel==6.28.0 pandas==2.1.4 scikit-learn==1.3.2 matplotlib==3.8.2
if deep_atlas.environment == 'COLAB':
    %pip install xgboost==2.0.3 ipykernel==6.28.0 pandas==2.1.4 scikit-learn==1.3.2 matplotlib==3.8.2


## Imports


In [None]:
import pandas as pd  # interface for data loading
import matplotlib.pyplot as plt  # visualization
from sklearn.model_selection import train_test_split  # splitting data
from sklearn.metrics import mean_absolute_error  # evaluation metric
from sklearn.compose import ColumnTransformer  # processing columns
from sklearn.preprocessing import OneHotEncoder  # processing string data
from sklearn.impute import SimpleImputer  # processing missing data
from sklearn.pipeline import Pipeline  # pipeline constructor
from sklearn.model_selection import cross_val_score  # cross-validation
from sklearn.model_selection import GridSearchCV  # hyperparameter tuning
from sklearn.ensemble import RandomForestRegressor  # random forest
from xgboost import XGBRegressor  # gradient boosted trees
import xgboost as xgb  # model interpretation

## Loading data


Let's begin by loading the data.

- [ ] Explore the data in `housing.csv` and note the available features.
- [ ] Specify the columns to use while training
  - [ ] We will not use `ocean_proximity` yet; some preprocessing is required before XGBoost can consume the values in that field without breaking.
- [ ] Split the dataset into subsets using SciKit-Learn's `train_test_split`
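
As a rough illustration of the preprocessing that a string column like `ocean_proximity` needs, the sketch below one-hot encodes a few values by hand. The category strings are assumed example values, not read from `housing.csv`, and `one_hot` is a hypothetical helper. The `OneHotEncoder` imported above performs the same kind of transformation.

```python
# One-hot encoding by hand. The category strings below are assumed
# example values, not read from housing.csv.
proximity_column = ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]

categories = sorted(set(proximity_column))  # fixed column order


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value`, 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


encoded_rows = [one_hot(value, categories) for value in proximity_column]

print(categories)
for value, row in zip(proximity_column, encoded_rows):
    print(value, row)
```

Each string becomes a row of numeric 0/1 columns, one column per category.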


In [None]:
data = pd.read_csv("./assets/housing.csv")

In [None]:
cols_to_use = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
]  # without the feature "ocean_proximity"

X = data[cols_to_use]

y = data["median_house_value"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
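
`train_test_split` shuffles the rows and holds a fraction of them out for testing (25% by default). Because no `random_state` is passed in the call above, each run produces a different split, so the error reported later will vary slightly between runs. The sketch below shows the idea in plain Python. `holdout_split` is a hypothetical helper, not part of scikit-learn, and its rounding may differ from the library's.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test slice."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


toy_rows = list(range(20))
train_rows, test_rows = holdout_split(toy_rows)
print(len(train_rows), len(test_rows))  # 15 5
```

Fixing the seed makes the split repeatable. With scikit-learn, passing `random_state` to `train_test_split` has the same effect.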

## Train the XGBRegressor model


The building block of the XGBRegressor model is a decision tree: a function that predicts an output by testing feature values one split at a time.

XGBRegressor is a class provided by XGBoost that mimics the interface of SciKit-Learn's RandomForestRegressor but uses gradient boosting instead (trees are added in sequence, up to `n_estimators`, until the evaluation score stops improving).

- [ ] Train the model:


In [None]:
model = XGBRegressor(
    n_estimators=500,  # Maximum number of trees (boosting rounds) to build.
    early_stopping_rounds=5,  # Rounds of no improvement before stopping.
    learning_rate=0.01,  # How much each tree should adjust the answer.
    n_jobs=4,  # Parallel processing if available to the computer.
)

model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)
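
The `early_stopping_rounds=5` setting can be pictured with the following sketch, which scans a list of per-round validation errors and reports where training would stop. The error values are made up, `early_stopping_round` is a hypothetical helper, and XGBoost's internal bookkeeping may differ in detail.

```python
def early_stopping_round(validation_errors, patience):
    """Return (stop_round, best_round) for a sequence of per-round errors.

    Training stops once `patience` consecutive rounds fail to beat the
    best error seen so far.
    """
    best_round, best_error = 0, validation_errors[0]
    for current_round, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = current_round, error
        elif current_round - best_round >= patience:
            return current_round, best_round
    return len(validation_errors) - 1, best_round


# Made-up validation errors: improving at first, then getting worse.
toy_errors = [90.0, 70.0, 60.0, 55.0, 56.0, 55.5, 57.0, 58.0, 59.0, 61.0]
stop_round, best_round = early_stopping_round(toy_errors, patience=5)
print(stop_round, best_round)  # 8 3
```

One caveat worth keeping in mind: the cell above passes the test set as `eval_set`, so the test data influences when training stops, and the test error may come out slightly optimistic. Holding out a separate validation split for early stopping would avoid this.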

- [ ] Use the model to predict values in the test data and note the mean error in terms of dollars-away-from-actual


In [None]:
predictions = model.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order.
print(
    "Mean Absolute Error: " + str(mean_absolute_error(y_test, predictions))
)
plt.scatter(predictions, y_test, alpha=0.1)
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
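
Mean absolute error is the average of the absolute differences between actual and predicted values, so here it reads directly as dollars away from the actual price. A minimal hand computation, using made-up house values and a hypothetical helper name:

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all examples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up house values in dollars.
toy_actual = [200_000, 350_000, 120_000]
toy_predicted = [210_000, 320_000, 125_000]
toy_mae = mean_absolute_error_by_hand(toy_actual, toy_predicted)
print(toy_mae)  # 15000.0
```

The three errors are 10,000, 30,000, and 5,000 dollars, which average to 15,000.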

NOTE: A useful feature of XGBoost's model is interpretability: it can report the relative importance of each feature. By default, `plot_importance` counts the number of times a feature was used to split the data, summed across all trees in the ensemble. The plot labels this count the "F score" of a feature; it is unrelated to the F1 score used in classification.

- [ ] Plot the importance of the features:


In [None]:
xgb.plot_importance(model)
plt.show()
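
The split-count idea behind this plot can be sketched by tallying which feature each split node uses. The trees below are invented for illustration and are not taken from the trained model.

```python
from collections import Counter

# Each inner list names the features used at the split nodes of one tree.
# These trees are invented for illustration.
toy_trees = [
    ["median_income", "latitude", "longitude"],
    ["median_income", "housing_median_age"],
    ["median_income", "latitude"],
]

# Count how often each feature appears as a split across all trees.
split_counts = Counter(
    feature for tree in toy_trees for feature in tree
)

for feature, count in split_counts.most_common():
    print(feature, count)
```

Split counts favor features that are used often, even when each split helps only a little. `plot_importance` also accepts an `importance_type` argument (for example `"gain"`) to rank features by how much their splits improve the model instead.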

### 🚦 Checkpoint: Stop

- [ ] Uncomment this code
- [ ] Complete the feedback form
- [ ] Run the cell to log your responses and record your stop time:


In [None]:
# deep_atlas.log_feedback(
#     {
#         # How long were you actively focused on this section? (HH:MM)
#         "active_time": FILL_THIS_IN,
#         # Did you feel finished with this section (Yes/No):
#         "finished": FILL_THIS_IN,
#         # How much did you enjoy this section? (1–5)
#         "enjoyment": FILL_THIS_IN,
#         # How useful was this section? (1–5)
#         "usefulness": FILL_THIS_IN,
#         # Did you skip any steps?
#         "skipped_steps": [FILL_THIS_IN],
#         # Any obvious opportunities for improvement?
#         "suggestions": [FILL_THIS_IN],
#     }
# )
# deep_atlas.log_stop_time()

## You did it!
