# Predicting California Housing Prices

## Using XGBoost in SageMaker (Batch Transform)

_DSCI 502 | Deployment_

---

As an introduction to SageMaker's high-level Python API, we will explore a straightforward regression problem: predicting the **median value of homes in California districts** using the **California Housing Dataset**. Each row describes one district with eight features: median income, median house age, average rooms and bedrooms per household, population, average occupancy, and latitude and longitude.

We’ll use this problem to get hands-on experience with SageMaker’s tools for managing machine learning workflows, specifically using **XGBoost** and **Batch Transform** for model training and prediction.

> _Note:_ The California Housing dataset is a modern and ethically appropriate alternative to older datasets and is commonly used for teaching regression and evaluation techniques.

The documentation for SageMaker's high-level API can be found on the [SageMaker ReadTheDocs page](http://sagemaker.readthedocs.io/en/latest/).

## General Outline

In a typical SageMaker workflow, you’ll move through the following steps:

1. Download or retrieve the data.
2. Process and prepare the data.
3. Upload the data to S3.
4. Train the model.
5. Test the trained model using a batch transform job.
6. Deploy the model.
7. Use the deployed model for predictions.

In this notebook, we will cover steps **1 through 5** to become familiar with the workflow. We will explore model deployment in a later lesson.


In [1]:
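# Make sure that we use version 2.x of the SageMaker SDK
# (the requirement is quoted so the shell does not treat '>' as an output redirect)
!pip install "sagemaker>=2"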

## Step 0: Setting up the notebook

We begin by setting up everything required to run the notebook, starting with the Python modules we will need.

In [2]:
# Please make sure to read the note I posted on Blackboard explaining why we no longer use the Boston Housing Dataset.
# It's an important reflection on ethics in data science and will help you understand the broader responsibility we carry when working with data.

#-----------------------------------------------------------
#-----------------------------------------------------------
#-----------------------------------------------------------

# Just something to reflect on — a reminder of how far we've come in data science, and how much progress is still needed.
# The old codes:

# %matplotlib inline

# import os

# import numpy as np
# import pandas as pd

# import matplotlib.pyplot as plt

# from sklearn.datasets import load_boston
# import sklearn.model_selection

#-----------------------------------------------------------
#-----------------------------------------------------------
#-----------------------------------------------------------


%matplotlib inline

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
import sklearn.model_selection

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

# Display the first few rows
print(df.head())




   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  


In addition to the modules above, we need to import the various bits of SageMaker that we will be using. 

In [3]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker import image_uris
# from sagemaker.predictor import csv_serializer  # optional, used if you deploy and predict
from sagemaker.serializers import CSVSerializer


# Create a SageMaker session object (includes region info, default bucket, etc.)
session = sagemaker.Session()

# Get the IAM role used for running the training job
role = get_execution_role()






## Step 1: Downloading the data

Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward.

In [4]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()

## Step 2: Preparing and splitting the data

Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows of the dataset into training, validation, and test sets.

In [5]:
# First we package up the input data and the target variable (the median house value) as pandas DataFrames.
# This will make saving the data to a file a little easier later on.

X_cal_pd = pd.DataFrame(california.data, columns=california.feature_names)
Y_cal_pd = pd.DataFrame(california.target, columns=['MedHouseVal'])

# We split the dataset into 2/3 training and 1/3 testing sets.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_cal_pd, Y_cal_pd, test_size=0.33)

# Then we split the training set further into 2/3 training and 1/3 validation sets.
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)
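
Because the second split is applied to the already reduced training set, the final proportions are not thirds of the whole dataset. The following sketch works out the resulting sizes for the 20,640 rows of the California Housing dataset. It assumes that the test portion is rounded up and the remainder goes to training, which is our understanding of how `train_test_split` handles fractional sizes; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test portion is rounded up and the remainder
    goes to training.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

n_total = 20640  # rows in the California Housing dataset

# First split: hold out the test set.
n_train_full, n_test = split_sizes(n_total, 0.33)

# Second split: carve the validation set out of the remaining rows.
n_train, n_val = split_sizes(n_train_full, 0.33)

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:>10}: {n:5d} rows ({n / n_total:.1%})")
```

Under these assumptions roughly 45% of the rows are used for training, 22% for validation, and 33% for testing. Passing `random_state` to `train_test_split` would make the split reproducible across runs.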


## Step 3: Uploading the data files to S3

When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.

### Save the data locally

First we need to create the test, train and validation csv files which we will then upload to S3.

In [7]:
# This is our local data directory. We need to make sure that it exists.
data_dir = '../data/california'
os.makedirs(data_dir, exist_ok=True)


In [8]:
# We use pandas to save our test, train, and validation data to CSV files.
# SageMaker's built-in XGBoost algorithm expects CSV input without a header row or an index column.
# For training and validation data, the target variable must be the first column.

X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)


In [9]:
Y_test.to_csv(os.path.join(data_dir, 'testLabels.csv'), header=False, index=False)
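
Since the training container reads these files without any column names, a mistake in the layout (a stray header row, or a missing column) is easy to overlook. A minimal sketch of a sanity check using only the standard library follows. `check_headerless_csv` is a hypothetical helper, and the sample rows are rounded versions of the first two rows displayed earlier, rearranged with the target first.

```python
import csv
import io

def check_headerless_csv(text, expected_columns):
    """Verify every row is numeric and has the expected number of columns.

    Returns the number of rows checked; raises ValueError on the first problem.
    """
    n_rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != expected_columns:
            raise ValueError(
                f"line {line_no}: expected {expected_columns} columns, got {len(row)}"
            )
        try:
            [float(value) for value in row]
        except ValueError:
            raise ValueError(f"line {line_no}: non-numeric value (header row?)") from None
        n_rows += 1
    return n_rows

# Training layout: target first, then the eight features.
sample = (
    "4.526,8.3252,41.0,6.98,1.02,322.0,2.56,37.88,-122.23\n"
    "3.585,8.3014,21.0,6.24,0.97,2401.0,2.11,37.86,-122.22\n"
)
print(check_headerless_csv(sample, expected_columns=9))
```

In the notebook, the same check could be applied to the contents of `train.csv` and `validation.csv` with 9 expected columns (target plus eight features) and to `test.csv` with 8.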

### Upload to S3

Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project.

In [8]:
prefix = 'california-xgboost-HL'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
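
`upload_data` returns the S3 URI of the uploaded object, which combines the bucket, the key prefix, and the local file name. The sketch below reconstructs that layout without contacting AWS. `expected_s3_uri` is a hypothetical helper and the bucket name is a placeholder, so treat it as an illustration of the naming scheme rather than a substitute for the returned value.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI s3://<bucket>/<key_prefix>/<file name> for a local file."""
    # Normalise Windows-style separators so basename works on any platform.
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"

# 'example-bucket' is a placeholder; the real name comes from session.default_bucket().
uri = expected_s3_uri("example-bucket", "california-xgboost-HL", "../data/california/train.csv")
print(uri)
```

Printing `train_location` in the notebook is a quick way to confirm that the upload landed under the intended prefix.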


## Step 4: Train the XGBoost model

Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it. We will use the high-level SageMaker API to do this, which makes the resulting code a little easier to read at the cost of some flexibility.

To construct an estimator, the object which we wish to train, we need to provide the location of a container which contains the training code. Since we are using a built-in algorithm, this container is provided by Amazon. However, the full name of the container is a bit lengthy and depends on the region that we are operating in. Fortunately, SageMaker provides a utility function, `image_uris.retrieve`, that constructs the image name for us (in version 1.x of the SDK this role was played by `get_image_uri`).

To use `image_uris.retrieve` we need to provide it with our current region, which can be obtained from the session object, the name of the algorithm we wish to use, and the algorithm version. In this notebook we will be using XGBoost; however, you could try another algorithm if you wish. The built-in algorithms are listed on the [Common Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) page.

In [9]:
from sagemaker import image_uris

# Get the image URI for the XGBoost algorithm
# container = image_uris.retrieve('xgboost', region=session.boto_region_name)

# Get the image URI for a supported XGBoost version
container = image_uris.retrieve(
    framework='xgboost',
    region=session.boto_region_name,
    version='1.5-1'  # Pin an explicit version; '1.7-1' is another option
)

# Construct the estimator object
xgb = sagemaker.estimator.Estimator(
    image_uri=container,                    # The image name of the training container
    role=role,                              # The IAM role to use
    instance_count=1,                       # Number of training instances
    instance_type='ml.m4.xlarge',          # Type of training instance
    output_path=f's3://{session.default_bucket()}/{prefix}/output',  # S3 path for model artifacts
    sagemaker_session=session               # The current SageMaker session
)


Before asking SageMaker to begin the training job, we should set any model-specific hyperparameters. There are quite a few that can be set when using the XGBoost algorithm; below are just a few of them. If you would like to change the hyperparameters below or modify additional ones, you can find more information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

In [10]:
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective='reg:squarederror',  # updated from 'reg:linear'
    early_stopping_rounds=10,
    num_round=200
)
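
`num_round` sets an upper limit on the number of boosting rounds, while `early_stopping_rounds` lets training stop sooner once the validation metric stops improving. The following is a simplified illustration of that stopping rule, not XGBoost's actual implementation; the function name and the validation scores are invented for the example.

```python
def early_stopping_round(val_scores, patience):
    """Return (stop_round, best_round) for a lower-is-better metric.

    Training stops at the first round where the score has not improved
    for `patience` consecutive rounds; otherwise it runs to the end.
    """
    best_score = float("inf")
    best_round = -1
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(val_scores) - 1, best_round

# Invented validation errors: improvement stalls after round 2.
scores = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
stop_round, best_round = early_stopping_round(scores, patience=3)
print(f"stopped at round {stop_round}, best round was {best_round}")
```

With the settings above, training would end once the validation error has failed to improve for 10 consecutive rounds, or after 200 rounds at the latest.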


Now that we have our estimator object completely set up, it is time to train it. To do this we tell SageMaker that our input data is in CSV format and then call the `fit` method.

In [11]:
from sagemaker.inputs import TrainingInput

# Specify the S3 locations of our training and validation data, and let SageMaker know it's CSV format
s3_input_train = TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = TrainingInput(s3_data=val_location, content_type='csv')

# Fit the model
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})


2025-03-27 23:54:53 Starting - Starting the training job...
2025-03-27 23:55:06 Starting - Preparing the instances for training.
2025-03-27 23:55:30 Downloading - Downloading input data.
2025-03-27 23:56:01 Downloading - Downloading the training image.
2025-03-27 23:56:56 Training - Training image download completed. Training in progress..
  from pandas import MultiIndex, Int64Index
[2025-03-27 23:57:14.984 ip-10-0-221-148.us-east-2.compute.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-03-27 23:57:15.007 ip-10-0-221-148.us-east-2.compute.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-03-27:23:57:15:INFO] Imported framework sagemaker_xgboost_container.training
[2025-03-27:23:57:15:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2025-03-27:23:57:15:INFO] No GPUs detected (normal if no gpus installed)
[2025-03-

## Step 5: Test the model

Now that we have fit our model to the training data, using the validation data to avoid overfitting, we can test it. To do this we will use SageMaker's Batch Transform functionality. To start, we need to build a transformer object from our fitted model.

In [13]:
print(xgb.model_data)

s3://sagemaker-us-east-2-584711701556/california-xgboost-HL/output/sagemaker-xgboost-2025-03-27-23-54-51-294/output/model.tar.gz


In [27]:
xgb_transformer = xgb.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge'
)


Next we ask SageMaker to begin a batch transform job using our trained model and applying it to the test data we previously stored in S3. We need to make sure to provide SageMaker with the type of data that we are providing to our model, in our case `text/csv`, so that it knows how to serialize our data. In addition, we need to make sure to let SageMaker know how to split our data up into chunks if the entire data set happens to be too large to send to our model all at once.

Note that the batch transform job runs on its own instances rather than inside the notebook, and we need its results before we can continue. Depending on the SDK version, `transform` may return immediately or block until the job completes; calling `wait()` afterwards is safe either way. An added benefit is that we get some output from the batch transform job, which lets us know if anything went wrong.

In [28]:
xgb_transformer.transform(
    data=test_location,
    content_type='text/csv',
    split_type='Line'
)

In [None]:
xgb_transformer.wait()
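
To make the role of `split_type='Line'` more concrete, the sketch below shows one way a CSV payload could be cut at line boundaries into chunks that each stay under a size limit, so that no record is ever split in two. This is a conceptual illustration only; it does not reproduce SageMaker's own batching logic, and `chunk_by_line` and the byte limit are hypothetical.

```python
def chunk_by_line(text, max_bytes):
    """Group whole lines into chunks of at most max_bytes each.

    A single line longer than max_bytes still becomes its own chunk,
    because records are never cut in the middle.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks

payload = "1,2\n3,4\n5,6\n"  # three records of 4 bytes each
chunks = chunk_by_line(payload, max_bytes=8)
print(chunks)
```

Each chunk contains only complete rows, and concatenating the chunks gives back the original payload.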

Now that the batch transform job has finished, the resulting output is stored on S3. Since we wish to analyze the output inside of our notebook we can use a bit of notebook magic to copy the output file from its S3 location and save it locally.

In [None]:
# Download the batch transform output from S3 to the local data directory
!aws s3 cp --recursive {xgb_transformer.output_path} {data_dir}


To see how well our model works, we can create a simple scatter plot of the predicted values against the actual values. If the model were completely accurate, the points would lie on the line $y=x$. As we can see, our model seems to have done okay but there is room for improvement.

In [None]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)

#A little cleaner maybe :D
#predictions_path = os.path.join(data_dir, 'test.csv.out')
#Y_pred = pd.read_csv(predictions_path, header=None)


In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(Y_test, Y_pred, alpha=0.6)
plt.xlabel("Actual Median House Value (in 100,000 USD)")
plt.ylabel("Predicted Median House Value (in 100,000 USD)")
plt.title("Actual vs Predicted Median House Prices (California Housing)")
plt.grid(True)
plt.show()
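
A scatter plot gives a qualitative impression; a single error figure makes it easier to compare runs. As a minimal sketch, the root mean squared error (RMSE) can be computed with the standard library alone. The function name `rmse` and the sample numbers are illustrative and are not results from the model above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual, predicted = list(actual), list(predicted)
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only.
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(example_error, 4))
```

In the notebook, something like `rmse(Y_test['MedHouseVal'], Y_pred[0])` would report the error in the target's own units, which are hundreds of thousands of dollars.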

## Optional: Clean up

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook.

In [None]:
# First we will remove all of the files contained in the data_dir directory
!rm -f {data_dir}/*

# And then we delete the directory itself
!rmdir {data_dir}


# -f ensures rm doesn't complain if the files are already gone.
# {data_dir} substitutes the value of the Python variable in a notebook context.
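
The shell commands above do not remove subdirectories, so `rmdir` fails if `data_dir` contains any. A pure-Python alternative, sketched here with a temporary directory standing in for `data_dir`, removes the whole tree in one call.

```python
import os
import shutil
import tempfile

# A temporary directory stands in for data_dir in this sketch.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "test.csv"), "w") as f:
    f.write("1,2,3\n")

# Remove the directory and everything inside it; ignore_errors avoids
# an exception if the directory has already been deleted.
shutil.rmtree(demo_dir, ignore_errors=True)
print(os.path.exists(demo_dir))
```

In the notebook the equivalent call would be `shutil.rmtree(data_dir, ignore_errors=True)`.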