## Task 2: Train a Model

In the previous task, we used the built-in SageMaker XGBoost algorithm. In built-in algorithm mode, XGBoost runs directly on the input datasets and does not use a training script of your own.

In this task, we will use XGBoost as a framework. By running XGBoost as a framework inside SageMaker, you have more flexibility and access to more advanced scenarios, such as k-fold cross-validation, because you can customize your own training scripts. 
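
To make the k-fold idea concrete, here is a minimal sketch of how the row indices of a dataset could be split into k folds, where each fold serves once as the validation set. The function name `k_fold_indices` is hypothetical and is not taken from the lab's training script; a real script would more likely rely on a library helper such as `xgboost.cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Illustrative run on 10 rows and 3 folds (hypothetical sizes).
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples=10, k=3)):
    print(f"fold {fold}: train={len(train_idx)} rows, validation={val_idx}")
```

Every row lands in exactly one validation fold, so averaging a metric across folds uses all of the data for both training and validation.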

In addition to running the training job with a training script, we will also optimize the training by running a hyperparameter tuning job. A hyperparameter is a parameter that is set before training and controls the learning process. To get the best model predictions, you can either choose a fixed hyperparameter configuration yourself or set hyperparameter ranges and let a tuning job search them. The process of finding an optimal configuration is called hyperparameter tuning.


### Task 2.1: Set up the environment

Before you start training your model, import the required libraries and set up the SageMaker session, execution role, and clients.


In [1]:
import pandas as pd
import boto3
import sagemaker
import json
import joblib
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner
)
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Setting SageMaker variables
sess = sagemaker.Session()
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Next, as before, we configure the training and validation paths that your training job uses as its input.

In [2]:
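# Define the bucket and prefix.
# The lab data bucket is found by name; alternatively, set read_bucket = "<LAB_BUCKET>" directly.
s3 = boto3.resource("s3")
read_bucket = None
for bucket in s3.buckets.all():
    if "labdatabucket" in bucket.name:
        read_bucket = bucket.name
# Fail early with a clear message instead of a NameError later on.
if read_bucket is None:
    raise RuntimeError("No bucket with 'labdatabucket' in its name was found.")
read_prefix = "scripts/data"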

# Setting S3 location for read and write operations
train_data_key = f"{read_prefix}/train/adult_data_processed_train_wheaders.csv"
test_data_key = f"{read_prefix}/test/adult_data_processed_test_wheaders.csv"
validation_data_key = f"{read_prefix}/validation/adult_data_processed_validation_wheaders.csv"

write_bucket = read_bucket
write_prefix = "script-mode/data"

model_key = f"{write_prefix}/model"
output_key = f"{write_prefix}/output"

train_data_uri = f"s3://{read_bucket}/{train_data_key}"
validation_data_uri = f"s3://{read_bucket}/{validation_data_key}"
test_data_uri = f"s3://{read_bucket}/{test_data_key}"
model_uri = f"s3://{write_bucket}/{model_key}"
output_uri = f"s3://{write_bucket}/{output_key}"
estimator_output_uri = f"s3://{write_bucket}/{write_prefix}/training_jobs"
bias_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/bias"
explainability_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/explainability"
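
As a quick illustration of how the URIs above are composed from a bucket and a key, the sketch below builds one URI and splits it back into its parts. The bucket name `example-labdatabucket` and the helpers `s3_uri` and `split_s3_uri` are hypothetical and only serve the example; your lab bucket has a different name.

```python
from urllib.parse import urlparse


def s3_uri(bucket, key):
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key}"


def split_s3_uri(uri):
    """Split an s3:// URI back into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket = "example-labdatabucket"  # hypothetical bucket name
example_key = "scripts/data/train/adult_data_processed_train_wheaders.csv"
uri = s3_uri(example_bucket, example_key)
print(uri)
print(split_s3_uri(uri))
```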

### Task 2.2: Configure an estimator object

Set the job name prefixes, the model name, and the instance configuration for the training and tuning jobs.

In [3]:
training_job_name_prefix = "xgbtrain"
tuning_job_name_prefix = "xgbtune" 
xgb_model_name = "xgb-script-mode-model"
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

Configure the static hyperparameters, the hyperparameter ranges, and the estimator object.

Unlike Task 1, this script-mode task uses a custom script for the training job. That Python script, xgboost_train.py, is the entry point of the SageMaker estimator defined below. Take a minute to open the custom xgboost_train.py file.

Inside xgboost_train.py, you will see a section that performs cross-validation. This is not possible with the built-in SageMaker XGBoost training algorithm.

Also notice that the hyperparameters tuned in the next cell are the same ones that were set to static values for the built-in XGBoost training in the first task. This time we define a range for each of them so that the tuner can try different values and find the combination with the best final objective metric.


In [4]:
# Set static hyperparameters that will not be tuned
static_hyperparams = {  
                        "eval_metric" : "auc",
                        "objective": "binary:logistic",
                        "num_round": "5"
                      }

# hyperparameter ranges that will be tuned 
hyperparameter_ranges = {
    "max_depth": IntegerParameter(6, 9),
    "eta": ContinuousParameter(0.01, 0.03),
    "gamma": ContinuousParameter(0.5, 0.9),
    "min_child_weight": ContinuousParameter(0.5, 0.9),
    "subsample": ContinuousParameter(0.2, 0.5)
}

# XGBoost Estimator 
xgb_estimator = XGBoost(
                        entry_point="xgboost_train.py",
                        output_path=estimator_output_uri,
                        code_location=estimator_output_uri,
                        hyperparameters=static_hyperparams,
                        role=sagemaker_role,
                        instance_count=train_instance_count,
                        instance_type=train_instance_type,
                        framework_version="1.7-1",
                        base_job_name=training_job_name_prefix
                    )
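
In script mode, SageMaker passes the hyperparameters to the entry point as command-line arguments and exposes the input channels and the model directory through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The contents of xgboost_train.py are not reproduced here, so the following is only a hypothetical skeleton of how such a script might read its inputs; the function name `parse_args` and the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and data locations the way a script-mode entry point might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: static ones from the estimator plus tuned ones from the tuner.
    parser.add_argument("--num_round", type=int, default=5)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--eta", type=float, default=0.02)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--eval_metric", type=str, default="auc")
    # Locations that SageMaker exposes through environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Ignore arguments this sketch does not declare (for example --gamma).
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments one tuning trial could pass to the script.
args = parse_args(["--max_depth", "8", "--eta", "0.021", "--gamma", "0.8"])
print(args.max_depth, args.eta, args.num_round)
```

Because every hyperparameter arrives as a plain argument, the same script can be run locally with hand-picked values before it is submitted as a training job.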

Now create a tuner object that uses the XGBoost estimator and the hyperparameter ranges that we defined.

In [5]:
objective_metric_name = "validation:auc"

# Setting up tuner object
tuner_config_dict = {
                     "estimator" : xgb_estimator,
                     "max_jobs" : 6,
                     "max_parallel_jobs" : 3,    
                     "objective_metric_name" : objective_metric_name,
                     "hyperparameter_ranges" : hyperparameter_ranges,
                     "base_tuning_job_name" : tuning_job_name_prefix,
                     "strategy" : "Random"
                    }
tuner = HyperparameterTuner(**tuner_config_dict)
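
The `"Random"` strategy can be pictured as drawing each hyperparameter independently from its range, running one trial per draw, and keeping the best score. The sketch below mimics that loop locally with the ranges defined earlier. It is an illustration only: `toy_objective` is a made-up stand-in for a training job's validation AUC, and none of these helper names come from the SageMaker SDK.

```python
import random

# Ranges copied from hyperparameter_ranges above: (kind, low, high).
SEARCH_SPACE = {
    "max_depth": ("int", 6, 9),
    "eta": ("float", 0.01, 0.03),
    "gamma": ("float", 0.5, 0.9),
    "min_child_weight": ("float", 0.5, 0.9),
    "subsample": ("float", 0.2, 0.5),
}


def sample_config(rng):
    """Draw one random value per hyperparameter."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # both ends inclusive in this sketch
        else:
            config[name] = rng.uniform(low, high)
    return config


def toy_objective(config):
    """Made-up score standing in for validation AUC; NOT a real model result."""
    return 1.0 - abs(config["eta"] - 0.02) - 0.01 * abs(config["max_depth"] - 8)


def random_search(max_jobs, seed=0):
    """Evaluate max_jobs random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(max_jobs)]
    scored = [(toy_objective(config), config) for config in trials]
    return max(scored, key=lambda pair: pair[0])


best_score, best_config = random_search(max_jobs=6)
print(best_score, best_config)
```

In the real tuning job each trial is a full training job, which is why `max_jobs` and `max_parallel_jobs` directly determine cost and duration.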

Set the input channels for the tuning job, and then run the tuning job.

In [6]:
s3_input_train = TrainingInput(
    s3_data="s3://{}/{}".format(read_bucket, train_data_key),
    content_type="csv",
    s3_data_type="S3Prefix",
)
s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}".format(read_bucket, validation_data_key),
    content_type="csv",
    s3_data_type="S3Prefix",
)

tuner.fit(inputs={"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)
tuner.wait()

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


......................................!
!


<i aria-hidden="true" class="fas fa-clipboard-check" style="color:#18ab4b"></i> **Expected output:** If the estimator and hyperparameter configuration are correct and the hyperparameter tuning job is completed correctly, you should see the following output:

```plain
************************
**** EXAMPLE OUTPUT ****
************************

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
...................................!
!
```

<i aria-hidden="true" class="fas fa-sticky-note" style="color:#ff6633"></i> **Note:** The training takes approximately 3–4 minutes to run.


Display a summary of the tuning results, sorted in descending order of objective value so that the best-performing training job is at the top.

In [7]:
df_tuner = sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe()
df_tuner = df_tuner[df_tuner["FinalObjectiveValue"]>-float('inf')].sort_values("FinalObjectiveValue", ascending=False)
df_tuner

| | eta | gamma | max_depth | min_child_weight | subsample | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.020812 | 0.818419 | 8.0 | 0.635337 | 0.251334 | xgbtune-241022-1225-005-a6f1a2b7 | Completed | 0.9 | 2024-10-22 12:28:10+00:00 | 2024-10-22 12:28:39+00:00 | 29.0 |
| 4 | 0.021226 | 0.737693 | 8.0 | 0.50277 | 0.393787 | xgbtune-241022-1225-002-21bd0a91 | Completed | 0.9 | 2024-10-22 12:26:11+00:00 | 2024-10-22 12:27:48+00:00 | 97.0 |
| 0 | 0.027456 | 0.859884 | 7.0 | 0.684443 | 0.442178 | xgbtune-241022-1225-006-b3d2c6b9 | Completed | 0.89 | 2024-10-22 12:28:11+00:00 | 2024-10-22 12:28:40+00:00 | 29.0 |
| 2 | 0.017665 | 0.687616 | 8.0 | 0.710612 | 0.372418 | xgbtune-241022-1225-004-a18f9b14 | Completed | 0.89 | 2024-10-22 12:28:08+00:00 | 2024-10-22 12:28:37+00:00 | 29.0 |
| 3 | 0.022269 | 0.588709 | 6.0 | 0.840711 | 0.439536 | xgbtune-241022-1225-003-3a854dbb | Completed | 0.89 | 2024-10-22 12:26:13+00:00 | 2024-10-22 12:27:47+00:00 | 94.0 |
| 5 | 0.016508 | 0.878021 | 7.0 | 0.820659 | 0.492082 | xgbtune-241022-1225-001-0aa2be98 | Completed | 0.89 | 2024-10-22 12:26:12+00:00 | 2024-10-22 12:27:47+00:00 | 95.0 |
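
To show how the best trial could be picked from results like these, the sketch below ranks a few rows copied from the output above using plain Python. The list `results` is a hand-copied subset of that table, not a live query. In a live session, the SDK's `tuner.best_training_job()` should return the name of the best job directly.

```python
# A few rows copied from the tuning results above (subset of columns).
results = [
    {"TrainingJobName": "xgbtune-241022-1225-006-b3d2c6b9", "eta": 0.027456, "max_depth": 7, "FinalObjectiveValue": 0.89},
    {"TrainingJobName": "xgbtune-241022-1225-005-a6f1a2b7", "eta": 0.020812, "max_depth": 8, "FinalObjectiveValue": 0.90},
    {"TrainingJobName": "xgbtune-241022-1225-002-21bd0a91", "eta": 0.021226, "max_depth": 8, "FinalObjectiveValue": 0.90},
    {"TrainingJobName": "xgbtune-241022-1225-003-3a854dbb", "eta": 0.022269, "max_depth": 6, "FinalObjectiveValue": 0.89},
]

# Sort by objective value, highest first. The sort is stable, so tied jobs keep their listed order.
ranked = sorted(results, key=lambda row: row["FinalObjectiveValue"], reverse=True)
best = ranked[0]

print("Best job:", best["TrainingJobName"])
print("Hyperparameters:", {"eta": best["eta"], "max_depth": best["max_depth"]})
```

Two jobs are tied at the displayed objective value of 0.9, so the choice between them here depends only on their order in the list.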


### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and conclude the lab.