# SageMaker Models for Regression

## Loading the data

In this module, we practice with the obesity data again. Each row in the data represents a subject. The data has 16 attributes on different measurements of the subjects, such as gender, age, height, and other test measurements, along with the obesity level. The target is the obesity level of each subject, coded as follows:

- ObesityLevel = 1 : Insufficient_Weight
- ObesityLevel = 2 : Normal_Weight
- ObesityLevel = 3 : Overweight_Level_I
- ObesityLevel = 4 : Overweight_Level_II
- ObesityLevel = 5 : Obesity_Type_I
- ObesityLevel = 6 : Obesity_Type_II
- ObesityLevel = 7 : Obesity_Type_III
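
Because the codes are ordered, a continuous regression output can be mapped back to a level name by rounding and clipping. The sketch below is only an illustration: `LEVEL_NAMES` and `to_level` are hypothetical names that do not appear in the notebook, and the notebook itself evaluates the raw predictions rather than rounded ones.

```python
# Mapping from the integer target codes to the level names listed above.
LEVEL_NAMES = {
    1: "Insufficient_Weight",
    2: "Normal_Weight",
    3: "Overweight_Level_I",
    4: "Overweight_Level_II",
    5: "Obesity_Type_I",
    6: "Obesity_Type_II",
    7: "Obesity_Type_III",
}


def to_level(prediction: float) -> str:
    """Round a continuous prediction to the nearest code, clip to 1..7, return its name."""
    code = int(round(prediction))
    code = min(max(code, 1), 7)  # keep out-of-range predictions inside the valid codes
    return LEVEL_NAMES[code]


print(to_level(2.2))
print(to_level(7.9))
```

Note that Python's `round` sends exact halves to the nearest even integer, so a prediction of exactly 2.5 maps to code 2.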

Data preprocessing is a basic pipeline that imputes and standardizes the numeric data and encodes the categorical data. Please refer to the classification notebook first, since the descriptions in this notebook are minimal.

In [1]:
import pandas as pd
import numpy as np

data_name = 'obesity.csv'
data_location = 'your bucket'

data = pd.read_csv(data_location + data_name)
data.head(3)




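| Unnamed: 0 | Gender | Age | Height | Weight | FamilyHistory | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | ObesityLevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | 2 |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | 2 |
| 2 | Male | 23.0 | 1.8 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | 2 |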


Next, we split the data and apply the preprocessing pipeline. The transformed data is then processed further and uploaded to S3.

In [2]:
num_cols = ['Age','Height','Weight','FCVC','NCP','CH2O','FAF','TUE']
cat_cols = ['Gender','FamilyHistory','CAEC','SMOKE','SCC','CALC','MTRANS']
target = 'ObesityLevel'

from sklearn.model_selection import train_test_split

traindata, testdata = train_test_split(data, test_size=0.2)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#pipeline for numeric columns
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('standardize', StandardScaler())
])

#pipeline for class columns
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist'))
])

from sklearn.compose import ColumnTransformer

#combining
full_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

train_features = full_pipeline.fit_transform(traindata)
test_features = full_pipeline.transform(testdata)
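
To make the numeric branch of the pipeline concrete, here is a minimal NumPy sketch of median imputation followed by standardization, with the statistics learned from the training column only and then reused on the test column. The helper names `fit_numeric` and `transform_numeric` and the toy values are assumptions for illustration; the notebook itself relies on `SimpleImputer` and `StandardScaler`.

```python
import numpy as np


def fit_numeric(train_col):
    """Learn the imputation median and the scaling mean/std from training data only."""
    train_col = np.asarray(train_col, dtype=float)
    median = np.nanmedian(train_col)  # median ignoring missing values
    filled = np.where(np.isnan(train_col), median, train_col)
    # population standard deviation (ddof=0), which is what StandardScaler uses
    return median, filled.mean(), filled.std()


def transform_numeric(col, median, mean, std):
    """Apply the statistics learned on the training data to any column."""
    col = np.asarray(col, dtype=float)
    filled = np.where(np.isnan(col), median, col)
    return (filled - mean) / std


# toy values, not taken from the obesity data
train_age = [21.0, 23.0, np.nan, 27.0]
test_age = [np.nan, 30.0]

params = fit_numeric(train_age)
print(transform_numeric(train_age, *params))
print(transform_numeric(test_age, *params))
```

Reusing the training statistics on the test column mirrors calling `fit_transform` on the training data and `transform` on the test data.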

### Process Data for SageMaker

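This step is similar to the one in the classification notebook: the label goes in the first column, followed by the features, and the CSV files are written without a header or an index.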

In [3]:
#further split training data to training and validation
trainX, validX, trainY, validY = train_test_split(train_features, traindata[target], test_size=0.2)

#reorganize the data in format <label - features>
#we add .values to get the numpy data instead of the pandas dataframes
import numpy as np
traindata = np.concatenate([trainY.values.reshape(-1,1), trainX],axis=1)
validdata = np.concatenate([validY.values.reshape(-1,1), validX],axis=1)

#generate csv files to upload
pd.DataFrame(traindata).to_csv('train.csv', index=False, header=False)
pd.DataFrame(validdata).to_csv('validation.csv', index=False, header=False)
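
The label-first layout can be checked on a toy example before anything is uploaded. The sketch below uses made-up values and writes to an in-memory buffer instead of a file; it only illustrates the expected shape of each CSV row.

```python
import io

import numpy as np

# two toy rows: labels and already-transformed features (made-up values)
labels = np.array([2, 5])
features = np.array([[0.1, 1.0, 0.0],
                     [-0.3, 0.0, 1.0]])

# the label goes in the first column, the features follow
rows = np.concatenate([labels.reshape(-1, 1), features], axis=1)

# write without header or index, as the built-in algorithms expect for CSV input
buffer = io.StringIO()
np.savetxt(buffer, rows, delimiter=",", fmt="%g")
csv_text = buffer.getvalue()
print(csv_text)
```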

Then upload the files to the designated bucket on S3. You can change `prefix` to send the data to a different folder.

In [4]:
import sagemaker
import boto3
import os

sess = sagemaker.Session()                        # get our current SageMaker session
bucket = 'lle13-it7143'                           # this should be the name of the bucket we created in module 9
prefix = 'obesity'                                # the folder to store your data in the S3 instance

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
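
S3 object keys always use forward slashes, while `os.path.join` uses the separator of the local operating system. That is fine on a Linux notebook instance but would produce backslashes on Windows. A small sketch of a portable alternative is shown below; `s3_key`, `s3_uri`, and `example-bucket` are hypothetical names used only for illustration.

```python
import posixpath


def s3_key(prefix: str, *parts: str) -> str:
    """Join key components with forward slashes regardless of the local OS."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI for a bucket and key."""
    return "s3://{}/{}".format(bucket, key)


train_key = s3_key("obesity", "train", "train.csv")
print(train_key)
print(s3_uri("example-bucket", train_key))
```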


## AWS XGBoost

### Setting Hyperparameter Ranges

In [5]:
from sagemaker_models import *

from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1), 
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2), 
    'max_depth': IntegerParameter(1, 10)
}
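
Each range only defines the bounds from which the tuner draws candidate values. The sketch below draws a few candidates uniformly at random from the same bounds to show what one tuning job's configuration could look like. This is an illustration of the search space only: SageMaker's default tuning strategy is Bayesian, not random, and `SEARCH_SPACE` and `sample_candidate` are hypothetical names. Integer bounds are assumed to be inclusive.

```python
import random

# same bounds as the ranges above: (low, high, is_integer)
SEARCH_SPACE = {
    "eta": (0.0, 1.0, False),
    "min_child_weight": (1.0, 10.0, False),
    "alpha": (0.0, 2.0, False),
    "max_depth": (1, 10, True),
}


def sample_candidate(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration uniformly from the search space."""
    candidate = {}
    for name, (low, high, is_integer) in SEARCH_SPACE.items():
        if is_integer:
            candidate[name] = rng.randint(low, high)  # both ends inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for _ in range(3):
    print(sample_candidate(rng))
```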

### Build and Train XGBoost Model for Regression

You should monitor the training job through the SageMaker console to make sure everything runs properly.

In [7]:
#path to training and validation data
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket, prefix), content_type='csv')

#retrieve the model image (sagemaker>=2 replaces get_image_uri with image_uris.retrieve)
region = boto3.Session().region_name
from sagemaker import image_uris
xgb_cont = image_uris.retrieve('xgboost', region, version='1.0-1')

#create model (sagemaker>=2 renames train_instance_count/train_instance_type to instance_count/instance_type)
role = sagemaker.get_execution_role()
xgb = sagemaker.estimator.Estimator(xgb_cont, role, instance_count=1, instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix), sagemaker_session=sess)

#regression parameters
#note: rate_drop only applies to the dart booster and tweedie_variance_power only to the reg:tweedie objective
xgb.set_hyperparameters(eval_metric='rmse',
                        objective='reg:squarederror',
                        num_round=100,
                        rate_drop=0.1,
                        tweedie_variance_power=1.4)

#create model tuner
xgb_tuner = HyperparameterTuner(xgb, 
                                objective_metric_name='validation:rmse', 
                                objective_type='Minimize',
                                hyperparameter_ranges=hyperparameter_ranges, 
                                max_jobs=15, max_parallel_jobs=3)

#send model tuner to SageMaker
xgb_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})


### Deploy the Model

Deploy the best model found by the tuner to an endpoint.

In [8]:
best_xgboost = xgb_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='xgboost-reg')


2024-01-04 01:03:17 Starting - Found matching resource for reuse
2024-01-04 01:03:17 Downloading - Downloading the training image
2024-01-04 01:03:17 Training - Training image download completed. Training in progress.
2024-01-04 01:03:17 Uploading - Uploading generated training model
2024-01-04 01:03:17 Completed - Resource reused by training job: sagemaker-xgboost-240104-0057-012-169a5fb5
-----!

### Inference

As in classification, we need to serialize the test data before sending it to the endpoint for prediction. The returned result also needs some processing.

In [12]:
from sagemaker.base_serializers import CSVSerializer

best_xgboost.serializer = CSVSerializer()
predictions = best_xgboost.predict(test_features).decode('utf-8') # predict!
testY_pred_xgb = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
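
The endpoint returns the predictions as text, and the exact delimiter can depend on the container version. The sketch below is a tolerant parser that assumes the payload is a list of numbers separated by commas and/or newlines. `parse_predictions` is a hypothetical helper, and the sample payloads are made up rather than captured from the endpoint.

```python
import re


def parse_predictions(payload: str) -> list:
    """Split a text response on commas and/or whitespace and convert to floats."""
    tokens = re.split(r"[,\s]+", payload.strip())
    return [float(tok) for tok in tokens if tok]


# made-up payloads in the two layouts this sketch assumes
print(parse_predictions("2.03,1.98,5.10"))
print(parse_predictions("2.03\n1.98\n5.10\n"))
```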

In [13]:
from sklearn.metrics import mean_squared_error, r2_score

print('Testing MSE:',mean_squared_error(testdata[target], testY_pred_xgb))
print('Testing R2:',r2_score(testdata[target], testY_pred_xgb))

Testing MSE: 0.09298963123856822
Testing R2: 0.9772470609653264
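
For reference, both metrics can be computed by hand. MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below uses toy numbers, not the notebook's predictions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# toy values for illustration
y_true = [1, 2, 3, 4]
y_pred = [1.5, 2.0, 2.5, 4.0]

print("MSE:", mse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```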


### Removing Endpoint before Moving on to Linear Learner

Remember to delete the endpoint once you are done with it, because of the restrictions on free-tier accounts.

In [14]:
best_xgboost.delete_endpoint(delete_endpoint_config=True)

## AWS Linear Learner

Now let's move on to the Linear Learner model. The workflow is essentially the same as before.

### Setting Hyperparameters

In [15]:
from sagemaker_models import *

from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    "l1": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "wd": ContinuousParameter(1e-7, 1, scaling_type="Auto"),
    "learning_rate": ContinuousParameter(1e-5, 1, scaling_type="Auto"),
    "mini_batch_size": IntegerParameter(100, 500, scaling_type="Auto"),
}

### Build and Train Model

In [None]:
#path to training and validation data
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket, prefix), content_type='text/csv')

#retrieve the model image (sagemaker>=2 replaces get_image_uri with image_uris.retrieve)
#linear-learner has a single supported version, so no version is passed
from sagemaker import image_uris
region = boto3.Session().region_name
ll_cont = image_uris.retrieve('linear-learner', region)

#create model (sagemaker>=2 renames train_instance_count/train_instance_type to instance_count/instance_type)
role = sagemaker.get_execution_role()
ll = sagemaker.estimator.Estimator(ll_cont, role, instance_count=1, instance_type='ml.m4.xlarge',
                                output_path='s3://{}/{}/output'.format(bucket, prefix), sagemaker_session=sess)

#regression parameter
ll.set_hyperparameters(predictor_type='regressor')

#create model tuner
ll_tuner = HyperparameterTuner(ll, 
                        objective_metric_name='validation:objective_loss', 
                        objective_type='Minimize',
                        hyperparameter_ranges=hyperparameter_ranges, 
                        max_jobs=15, max_parallel_jobs=3)

#send model tuner to SageMaker
ll_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

### Deploy the Model

In [16]:
best_ll = ll_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', model_name='ll-reg')


2024-01-04 01:42:22 Starting - Found matching resource for reuse
2024-01-04 01:42:22 Downloading - Downloading the training image
2024-01-04 01:42:22 Training - Training image download completed. Training in progress.
2024-01-04 01:42:22 Uploading - Uploading generated training model
2024-01-04 01:42:22 Completed - Resource reused by training job: linear-learner-240104-0135-012-3b69bb75
-------!

### Inference

As with Linear Learner for classification, we set a serializer for the test data and process the returned results as JSON.

In [19]:
import json
best_ll.serializer = CSVSerializer()
predictions = best_ll.predict(test_features).decode('utf-8') # predict!
testY_pred_ll = json.loads(predictions)
testY_pred_ll = np.array([yh['score'] for yh in testY_pred_ll['predictions']])
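
The parsing step above implies a response with a `predictions` list whose items each carry a `score`. The sketch below runs the same extraction on a made-up payload of that shape, so the logic can be checked without a live endpoint. `extract_scores` and `sample_response` are hypothetical names.

```python
import json


def extract_scores(payload: str) -> list:
    """Pull the numeric score out of each item in the 'predictions' list."""
    body = json.loads(payload)
    return [item["score"] for item in body["predictions"]]


# made-up payload with the structure the notebook code expects
sample_response = '{"predictions": [{"score": 2.1}, {"score": 5.8}, {"score": 3.4}]}'
print(extract_scores(sample_response))
```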

In [20]:
from sklearn.metrics import mean_squared_error, r2_score

print('Testing MSE:',mean_squared_error(testdata[target], testY_pred_ll))
print('Testing R2:',r2_score(testdata[target], testY_pred_ll))

Testing MSE: 0.19381244170347484
Testing R2: 0.9525774797522651


### Final Cleanup

In [21]:
best_ll.delete_endpoint(delete_endpoint_config=True)

sagemaker_client = boto3.client('sagemaker')

# delete model
sagemaker_client.delete_model(ModelName='xgboost-reg')
sagemaker_client.delete_model(ModelName='ll-reg')

{'ResponseMetadata': {'RequestId': '41dd8570-08cc-4197-b360-5f68b73fbba9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '41dd8570-08cc-4197-b360-5f68b73fbba9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 04 Jan 2024 01:49:33 GMT'},
  'RetryAttempts': 2}}