## Task 2: Train a Model

The process of creating a machine learning (ML) model starts with data processing. After the data processing is complete, you choose an ML algorithm to train your model. The goal of model training is to create a model that you can use to make predictions with future data. Your processed data must contain a target, but your future data does not contain a target (it is unlabeled). The algorithm finds patterns in the training data that map the input data attributes to the target. The algorithm then outputs an ML model that captures these patterns. When you have a model, you can make predictions on new data that does not contain the target value.

For example, if you want to train an ML model to predict if an email is spam or not spam, you would provide your model with training data that contains emails where you know the target (in this case, a label that tells whether an email is spam or not). Using this data, the algorithm creates a model that predicts if an email is spam or not spam. You can use this model to predict future email labels.
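As a toy illustration of this train-then-predict flow (not the algorithm used in this lab), the following sketch counts which words appear more often in spam than in non-spam emails and then labels new, unlabeled emails. The emails, the function names, and the scoring rule are all made up for illustration.

```python
from collections import Counter

# Hypothetical labeled training data: (email text, target), where 1 = spam
training_data = [
    ("win a free prize now", 1),
    ("free money claim your prize", 1),
    ("meeting agenda for monday", 0),
    ("lunch on monday with the team", 0),
]

def train(examples):
    """Count how often each word appears in spam and in non-spam emails."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in examples:
        (spam_words if label == 1 else ham_words).update(text.split())
    return spam_words, ham_words

def predict(model, text):
    """Label an email as spam (1) when its words were seen more often in spam."""
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in text.split())
    return 1 if score > 0 else 0

model = train(training_data)
print(predict(model, "claim your free prize"))      # new email without a label
print(predict(model, "agenda for the team lunch"))  # new email without a label
```

The training data carries the target; the two emails passed to `predict` do not, which mirrors the distinction between processed data and future data described above.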

In this task, you predict whether someone earns less than 50,000 USD. During training, the model is optimized to make that prediction as accurately as possible. Model training requires some configuration, including the choice of training algorithm. In this task, you use the XGBoost (eXtreme Gradient Boosting) algorithm. When you train a model, you also need to configure hyperparameters. Hyperparameters are settings that control the training process, such as how deep each tree can grow or how many boosting rounds to run. Choosing suitable hyperparameter values is important for model performance and accuracy. After you train the model, you evaluate the model and view the model artifacts.

### Task 2.1: Set up the environment

Before you start training your model, install any necessary dependencies.

In [2]:
# Install required dependencies
%pip install matplotlib
%pip uninstall bokeh -y
%pip install bokeh==2.4.2
%pip install boto3
%pip install seaborn

# Import the required libraries
import boto3
import io
import json
import math
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import sagemaker
import sys
import time
import zipfile

from sagemaker.debugger import Rule, rule_configs
from IPython.display import FileLink, FileLinks
from sagemaker import image_uris
from IPython.display import display
from IPython.display import Image
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.xgboost.estimator import XGBoost
from time import gmtime, strftime

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Get execution role for SageMaker (IAM Role)
role = sagemaker.get_execution_role()  # Automatically retrieves the role assigned to your SageMaker environment

# Initialize boto3 session and get region information
region = boto3.Session().region_name  # This retrieves the region name where the resources are located

# Initialize the SageMaker client using boto3
sess = boto3.Session()
sm = sess.client('sagemaker')  # SageMaker client to interact with SageMaker services

# Now you can proceed with your training job or model deployment


Note: you may need to restart the kernel to use updated packages.
Found existing installation: bokeh 2.4.2
Uninstalling bokeh-2.4.2:
  Successfully uninstalled bokeh-2.4.2
Note: you may need to restart the kernel to use updated packages.
Collecting bokeh==2.4.2
  Using cached bokeh-2.4.2-py3-none-any.whl.metadata (14 kB)
Using cached bokeh-2.4.2-py3-none-any.whl (18.5 MB)
Installing collected packages: bokeh
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
panel 1.7.2 requires bokeh<3.8.0,>=3.5.0, but you have bokeh 2.4.2 which is incompatible.
Successfully installed bokeh-2.4.2
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xd

Next, import the dataset. In the previous lab, you exported the data files from Amazon SageMaker Data Wrangler to an Amazon Simple Storage Service (Amazon S3) bucket. You split the dataset into training (70 percent), validation (20 percent), and test (10 percent) datasets. The training and validation datasets are used during training. The test dataset is used in model evaluation after deployment.
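The split itself was done in the previous lab. As a rough sketch of what a 70/20/10 split does, assuming a simple shuffled split over hypothetical row IDs (the function name, seed, and row count are illustrative, and this is not how Data Wrangler implements it):

```python
import random

def split_rows(rows, train_frac=0.7, validation_frac=0.2, seed=42):
    """Shuffle rows, then cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train_rows = shuffled[:n_train]
    validation_rows = shuffled[n_train:n_train + n_validation]
    test_rows = shuffled[n_train + n_validation:]  # whatever remains
    return train_rows, validation_rows, test_rows

train_rows, validation_rows, test_rows = split_rows(range(1000))
print(len(train_rows), len(validation_rows), len(test_rows))
```

Every row lands in exactly one of the three sets, so the test rows are never seen during training.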

The built-in Amazon SageMaker XGBoost algorithm supports several data formats like text/libsvm, text/csv, application/x-parquet and application/x-recordio-protobuf. This lab uses the CSV format for training. 
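For CSV training input, the built-in XGBoost algorithm expects the target in the first column and no header row. The following sketch checks a small, made-up CSV sample against that layout; the helper name and the sample values are hypothetical, and the real files in this lab have many more columns.

```python
import csv
import io

def check_training_csv(text):
    """Return (row_count, labels) after checking a headerless, label-first CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    labels = set()
    for line_number, row in enumerate(rows, start=1):
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(f"line {line_number}: non-numeric cell (header row?)")
        labels.add(values[0])  # the first column is treated as the target
    if not labels <= {0.0, 1.0}:
        raise ValueError(f"expected binary labels in column 1, found {sorted(labels)}")
    return len(rows), labels

# Hypothetical sample: label first, then numeric features, no header
sample = "1,39,13,40\n0,50,9,13\n1,38,9,40\n"
row_count, labels = check_training_csv(sample)
print(row_count, sorted(labels))
```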

To view the dataset files that you created in the previous lab, follow these steps:


1. Choose the bucket icon from the left menu bar.

1. In the list of buckets, choose the Amazon S3 bucket that contains **labdatabucket** in its name.

Opening the .csv files opens new tabs in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_2.ipynb** tab to the side or choose the **lab_2.ipynb** tab, and then from the toolbar, select **File** and **New View for Notebook**. You can now have the directions displayed as you explore the .csv files.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the .csv files, return to the notebook by choosing the **lab_2.ipynb** tab.

1. Choose (double-click) the **scripts** folder, choose (double-click) the **data** folder, choose (double-click) the **train** folder, and then choose (double-click) the **adult_data_processed_train.csv** file to view its contents.

1. In the left pane, choose **data** from the <i aria-hidden="true" class="fas fa-folder" style="color:white"></i> **/ ... /data/train/** breadcrumbs link.

1. Choose (double-click) the **validation** folder, and then choose (double-click) the **adult_data_processed_validation.csv** file to view its contents.

You have viewed the dataset files. Now, configure the training and validation paths that your training job uses as its input.

In [3]:
# Import the datasets
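s3 = boto3.resource('s3')
bucket = None
for candidate in s3.buckets.all():
    # Select the lab bucket by a substring of its name
    if 'modeldevelopmentk21' in candidate.name:
        bucket = candidate.name
if bucket is None:
    raise RuntimeError("No S3 bucket with 'modeldevelopmentk21' in its name was found")
print("Bucket: ", bucket)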
prefix = 'scripts/data'
output_path = 's3://{}/{}/output'.format(bucket, prefix)

# Configure the training paths
train_path = f"s3://{bucket}/{prefix}/train/adult_data_processed_train.csv"
validation_path = f"s3://{bucket}/{prefix}/validation/adult_data_processed_validation.csv"

# Set up the TrainingInput objects
train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(validation_path, content_type='text/csv')

# Print the training and validation paths
print(f'Training path: {train_path}')
print(f'Validation path: {validation_path}')

# Set the container, name, and tags
create_date = strftime("%m%d%H%M")
container = image_uris.retrieve(framework='xgboost',region=boto3.Session().region_name,version='1.5-1')
run_name = 'lab-2-run-{}'.format(create_date)

Bucket:  modeldevelopmentk21
Training path: s3://modeldevelopmentk21/scripts/data/train/adult_data_processed_train.csv
Validation path: s3://modeldevelopmentk21/scripts/data/validation/adult_data_processed_validation.csv


### Task 2.2: Configure an estimator object

An estimator is a high-level interface for SageMaker training. You create an estimator object by supplying the required parameters, such as the AWS Identity and Access Management (IAM) role, the compute instance count and type, and the Amazon S3 output path. This lab uses the XGBoost built-in algorithm with the SageMaker generic estimator.

XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. XGBoost handles a wide variety of data types, relationships, and distributions well, and it exposes many hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. In this case, you use XGBoost to solve a classification problem (whether someone earns less than 50,000 USD).
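To make the idea of combining weaker models concrete, here is a minimal sketch of gradient boosting for squared-error regression on a single feature, using one-split trees (stumps) as the weak models. It is a teaching sketch under simplifying assumptions, not XGBoost's implementation, which adds regularization, second-order gradient information, and much more. All names and data values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split of a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=50, eta=0.2):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + eta * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps

def boosted_predict(model, x, eta=0.2):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps = model
    return base + sum(eta * (lv if x <= t else rv) for t, lv, rv in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 4.8, 5.1, 5.0, 4.9]
boosted = boost(xs, ys)
print(round(boosted_predict(boosted, 2), 2), round(boosted_predict(boosted, 7), 2))
```

Each stump alone is a poor model, but because every new stump is fitted to what the ensemble still gets wrong, the combined prediction improves round by round. The `eta` value here plays the same role as the learning rate hyperparameter of the same name.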

In this lab, you create an XGBoost estimator by using the *sagemaker.estimator.Estimator* class. In the following example code, the XGBoost estimator is named *xgb_model*. To construct the SageMaker estimator, specify the following parameters:

- **image_uri**: The training container image URI. In this example, the SageMaker XGBoost training container URI is specified using *image_uris.retrieve*.
- **role**: The IAM role that SageMaker uses to perform tasks on your behalf (for example, reading training data from Amazon S3 and writing model artifacts and training results to Amazon S3).
- **instance_count and instance_type**: The number and type of Amazon EC2 ML compute instances to use for model training. For this lab, you use a single ml.m5.xlarge instance, which has 4 vCPUs, 16 GiB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance.
- **output_path**: The path to the S3 bucket where SageMaker stores the model artifact and training results.
- **sagemaker_session**: The session object that manages interactions with SageMaker API operations and other AWS services that the training job uses.
- **rules**: A list of Amazon SageMaker Debugger built-in rules. In this example, the create_xgboost_report() rule creates an XGBoost report that provides insights into the training progress and results.

In [4]:
xgb_model = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    rules=[
        Rule.sagemaker(rule_configs.create_xgboost_report())
    ]
)

### Task 2.3: Configure hyperparameters

Hyperparameters directly control model structure, function, and performance. Hyperparameter tuning allows data scientists to tweak model performance for optimal results. This process is an essential part of machine learning, and choosing appropriate hyperparameter values is crucial for success.

You can set hyperparameters for the XGBoost algorithm by calling the *set_hyperparameters* method of the estimator.

Refer to [XGBoost Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) for more information about XGBoost hyperparameters.

In [5]:
xgb_model.set_hyperparameters(
    max_depth = 5,
    eta = 0.2,
    gamma = 4,
    min_child_weight = 6,
    subsample = 0.7,
    verbosity = 0,
    objective = 'binary:logistic',
    num_round = 800
)
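With `objective = 'binary:logistic'`, the model outputs a probability rather than a class label. The following sketch shows how a raw score maps to a probability through the logistic function and then to a label. The 0.5 cutoff is an assumption (a common default), not something this lab configures, and the function names are illustrative.

```python
import math

def logistic(raw_score):
    """Map a raw (log-odds) score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def to_label(probability, threshold=0.5):
    """Turn a probability into a 0/1 class label by using a cutoff."""
    return 1 if probability >= threshold else 0

for raw in (-2.0, 0.0, 2.0):
    p = logistic(raw)
    print(f"raw={raw:+.1f} probability={p:.3f} label={to_label(p)}")
```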

### Task 2.4: Run a SageMaker AI training job

Now that you have configured your estimator object and hyperparameters, you are ready to train the model. To start model training, call the estimator's fit() method with the training and validation datasets. The fit() method starts a SageMaker training job. If you set `wait=True`, the fit() method displays progress logs and waits until training is complete.

<i aria-hidden="true" class="fas fa-sticky-note" style="color:#563377"></i> **Note:** The training takes approximately 3–4 minutes to run.

In [6]:
xgb_model.fit(
    {
        "train": train_input,
        "validation": validation_input
    },
    wait=True
)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-07-17-10-43-01-758


2025-07-17 10:43:02 Starting - Starting the training job...
2025-07-17 10:43:28 Starting - Preparing the instances for trainingCreateXgboostReport: InProgress
...
2025-07-17 10:44:00 Downloading - Downloading input data...
2025-07-17 10:44:20 Downloading - Downloading the training image...
  from pandas import MultiIndex, Int64Index
[2025-07-17 10:45:05.090 ip-10-0-99-218.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-07-17 10:45:05.112 ip-10-0-99-218.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-07-17:10:45:05:INFO] Imported framework sagemaker_xgboost_container.training
[2025-07-17:10:45:05:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2025-07-17:10:45:05:INFO] No GPUs detected (normal if no gpus installed)
[2025-07-17:10:45:05:INFO] Running XGBoost Sagemaker in algorithm mode
[2025-07-17

<i aria-hidden="true" class="fas fa-sticky-note" style="color:#563377"></i> **Note:** While the preceding cell runs, follow these steps to monitor the progress of the training job:

1. Navigate to the AWS console and, in the search bar at the top left, search for **Amazon SageMaker AI**.

2. In the SageMaker AI console, in the left pane, select **Training** and then select **Training jobs**.

3. Choose the link for the training job whose name starts with **sagemaker-xgboost** to monitor its progress.

4. Wait until the job status changes from **InProgress** to **Completed**. This indicates that training is complete. Training may take up to 5 minutes.

5. If the job status shows as **Failed**, re-run the preceding code cell and wait until the job status changes from **InProgress** to **Completed**.

6. After the training job status changes to **Completed**, return to the notebook to proceed with the next tasks.
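The same status check can be sketched in code. The helper below polls the SageMaker `describe_training_job` API until the job reaches a final state. The helper name, the polling interval, and the stand-in client used for the demonstration are hypothetical; in the notebook you would pass the `sm` client and the real job name instead.

```python
import time

def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=40):
    """Poll describe_training_job until the job reaches a final status."""
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")

class StandInClient:
    """Stand-in for the SageMaker client so the sketch runs without AWS access."""
    def __init__(self):
        self.calls = 0

    def describe_training_job(self, TrainingJobName):
        self.calls += 1
        status = "InProgress" if self.calls < 3 else "Completed"
        return {"TrainingJobStatus": status}

demo_client = StandInClient()
print(wait_for_training_job(demo_client, "example-job", poll_seconds=0))

# In the notebook, the call might look like this (not run here):
# wait_for_training_job(sm, xgb_model.latest_training_job.job_name)
```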

<i aria-hidden="true" class="fas fa-exclamation-circle" style="color:#7C5AED"></i> **Caution:** Do not run the next code cell until the training job completes.

<i aria-hidden="true" class="fas fa-clipboard-check" style="color:#18ab4b"></i> **Expected output:** If the estimator and hyperparameter configuration are correct and the training job is started correctly, you should see the following output:

```plain
************************
**** EXAMPLE OUTPUT ****
************************

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-08-09-20-09-56-628
2023-08-09 20:09:56 Starting - Starting the training job...
2023-08-09 20:10:19 Starting - Preparing the instances for trainingCreateXgboostReport: InProgress
......
2023-08-09 20:11:21 Downloading - Downloading input data...
2023-08-09 20:11:55 Training - Downloading the training image...
2023-08-09 20:12:20 Training - Training image download completed. Training in progress....
2023-08-09 20:12:56 Uploading - Uploading generated training model...
2023-08-09 20:13:20 Completed - Training job completed
..Training seconds: 107
Billable seconds: 107
```

The next cell builds the S3 key of the XGBoost report notebook from the estimator's output path and the name of the latest training job.

In [7]:
bucket, project_prefix = xgb_model.output_path[5:].split('/',1)
rule_output_prefix = project_prefix + "/" + xgb_model.latest_training_job.job_name + "/rule-output/CreateXgboostReport/xgboost_report.ipynb"
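To show what this path construction produces, the following sketch applies the same steps to made-up values. The bucket name, job name, and helper function are hypothetical; only the `rule-output/CreateXgboostReport/xgboost_report.ipynb` suffix comes from the cell above.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket_name, _, key = uri[len("s3://"):].partition("/")
    return bucket_name, key

# Hypothetical values that mirror the shapes used in this lab
example_output_path = "s3://example-bucket/scripts/data/output"
example_job_name = "sagemaker-xgboost-example-job"

example_bucket, example_prefix = split_s3_uri(example_output_path)
example_report_key = (
    f"{example_prefix}/{example_job_name}"
    "/rule-output/CreateXgboostReport/xgboost_report.ipynb"
)
print(example_bucket)
print(example_report_key)
```

The example uses its own variable names so that running it does not overwrite `bucket` or `rule_output_prefix` from the notebook.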

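The XGBoost report that SageMaker Debugger generates might not be available the moment training finishes, so the next cell uses an S3 waiter to pause until the report object exists. With these settings, the waiter checks every 15 seconds for up to 60 attempts (about 15 minutes).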

In [8]:

print("Waiting for the report to become available")

waiter = boto3.client('s3').get_waiter('object_exists')

waiter.wait(
    Bucket=bucket,
    Key=rule_output_prefix,
    WaiterConfig={
        'Delay': 15,
        'MaxAttempts': 60
    }
)

print('The report is now available!')

Waiting for the report to become available
The report is now available!


### Task 2.5: Evaluate a model

After the training job has completed, you can download an XGBoost training report generated by SageMaker Debugger. The XGBoost training report offers you insights into the training progress and results, such as the loss function with respect to iteration, feature importance, confusion matrix, accuracy curves, and other statistical results of training. 

For SageMaker XGBoost training jobs, use the Debugger `CreateXgboostReport` rule to receive a comprehensive training report of the training progress and results.
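One of the curves in the report is the loss at each training iteration. As a sketch of what such a loss measures, assuming log loss (the usual metric paired with a logistic objective; check the report itself for the metric it plots), the following code compares made-up predictions from early and late in training.

```python
import math

def log_loss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1 to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

true_labels = [1, 0, 1, 0]
early = [0.6, 0.5, 0.55, 0.45]  # hypothetical predictions early in training
late = [0.9, 0.1, 0.8, 0.2]     # hypothetical predictions after more rounds
print(round(log_loss(true_labels, early), 3), round(log_loss(true_labels, late), 3))
```

Confident, correct predictions lower the loss, which is why a healthy training curve trends downward as rounds are added.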

In [9]:
%%capture
rule_output_path = xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive
! aws s3 cp {'s3://{}/{}'.format(bucket, rule_output_prefix)} ./

The link in the output of the next cell opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_2.ipynb** tab to the side or choose the **lab_2.ipynb** tab, and then from the toolbar, select **File** and **New View for Notebook**. You can now have the directions displayed as you explore the XGBoost report.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the XGBoost report, return to the notebook by choosing the **lab_2.ipynb** tab.

In [10]:
display("Click link below to view the XGBoost Training notebook", FileLink("CreateXgboostReport/xgboost_report.ipynb"))

'Click link below to view the XGBoost Training notebook'

<i aria-hidden="true" class="fas fa-sticky-note" style="color:#563377"></i> **Note:** After you run this code, you should see the following output: **'Click link below to view the XGBoost Training notebook' <span style="ssb_sm_blue">CreateXgboostReport/xgboost_report.ipynb</span>**

To open the notebook in a new tab, choose the link. 


At the top of the **xgboost_report.ipynb** tab, choose the <i aria-hidden="true" class="fas fa-forward"></i> **Restart the kernel and run all cells** button. When prompted with **Restart Kernel?**, choose **Restart**.

<i aria-hidden="true" class="fas fa-sticky-note" style="color:#563377"></i> **Note:** It takes approximately 2–3 minutes to run all of the cells.

When all cells have finished running, scroll down to the **Confusion Matrix**. The confusion matrix is a table that shows the number of correct and incorrect predictions for each class by comparing each observation's predicted class with its true class. The diagram shows the **true positive (TP)**, **true negative (TN)**, **false positive (FP)**, and **false negative (FN)** values.

- **True positive:** If the actual classification is positive and the predicted classification is positive (1,1), this is called a **true positive (TP)** result because the positive sample was correctly identified by the classifier. 
- **False negative:** If the actual classification is positive and the predicted classification is negative (1,0), this is called a **false negative (FN)** result because the positive sample is incorrectly identified by the classifier as being negative. 
- **False positive:** If the actual classification is negative and the predicted classification is positive (0,1), this is called a **false positive (FP)** result because the negative sample is incorrectly identified by the classifier as being positive. 
- **True negative**: If the actual classification is negative and the predicted classification is negative (0,0), this is called a **true negative (TN)** result because the negative sample gets correctly identified by the classifier.
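The four definitions above can be sketched as a small counting function. The labels below are made up for illustration and are not taken from the lab dataset.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, and TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 1 and p == 0:
            counts["FN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical actual and predicted labels for eight people
actual_labels    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(actual_labels, predicted_labels))
# {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```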

Next, scroll down to **Evaluation of the Confusion Matrix** and take a closer look at the **Classification report** to understand the summary of the precision, recall, and F1-score for each class.

- **Precision**: Measures the fraction of predicted positives that are actually positive, TP / (TP + FP). The range is 0 to 1, and a larger value means fewer false positives among the positive predictions. In other words, precision is the proportion of the data points that your model flagged as relevant that actually were relevant. Precision is a good measure to consider when the cost of an FP is high.
- **Recall/Sensitivity/True Positive Rate (TPR)**: Measures the fraction of actual positives that were predicted as positive, TP / (TP + FN). The range is also 0 to 1, and a larger value means fewer missed positives. This measure expresses the ability to find all the relevant instances in a dataset.
- **F1-Score**: The harmonic mean of precision and recall, 2 × (precision × recall) / (precision + recall), and the target metric in this lab. Because F1 gives equal weight to precision and recall, it takes both FP and FN into account.
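As a worked example of these three measures plus overall accuracy, the following sketch computes them from confusion-matrix counts. The counts are invented for illustration; substitute the values from your own report.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, not results from this lab
example_metrics = classification_metrics(tp=80, fp=20, fn=40, tn=60)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 80/100, recall is 80/120, and F1 falls between the two, closer to the lower value, which is the behavior that makes the harmonic mean useful when both error types matter.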

You are trying to predict if people make less than 50,000 USD so you can promote government assistance services to qualified citizens. In this case, the F1-Score is a good measure to use because it takes FP (people who make over 50,000 USD who were labeled as making less than 50,000 USD) and FN (people who make under 50,000 USD who were labeled as making more than 50,000 USD) into account. You want to make sure that your precision and recall are both high, and the F1-Score takes both measures into account. In the next lab, you optimize the model by tuning the hyperparameters to see if you can get a higher F1-Score.

What are the **Precision**, **Recall**, **F1-Score**, and **Overall Accuracy** for this model?

<i aria-hidden="true" class="far fa-comment" style="color:#008296"></i> **Consider:** Take a moment to review the other graphs that are included in the notebook. What kind of information do you see? What might be helpful to you when training your own models?

### Task 2.6: View the model artifacts

SageMaker AI stores the model artifact in your S3 bucket. To find the location of the model artifact, follow these steps:


1. Choose the bucket icon from the left menu bar.

1. In the list of buckets, open the Amazon S3 bucket that contains **labdatabucket** in its name.

1. Navigate to the **scripts/data/output/sagemaker-xgboost-.../output** subfolder.

You see the model artifact **model.tar.gz** in the subfolder. This is the model that you created with your SageMaker Estimator by calling the fit() method.
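If you download the artifact, you can list its contents with Python's `tarfile` module. The sketch below builds a tiny stand-in archive in memory so that it runs without AWS access. The member name `xgboost-model` is an assumption about what the built-in algorithm writes, and the payload is placeholder bytes, not a trained model.

```python
import io
import tarfile

def list_artifact_members(archive_bytes):
    """Return the names and sizes of the files inside a .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return [(member.name, member.size) for member in archive.getmembers()]

def build_example_archive():
    """Build a tiny stand-in archive in memory (not a real trained model)."""
    payload = b"placeholder model bytes"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="xgboost-model")  # assumed member name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

print(list_artifact_members(build_example_archive()))
```

For a downloaded file, you could read its bytes with `open("model.tar.gz", "rb").read()` and pass them to `list_artifact_members`.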

You viewed the model artifacts, including the model.tar.gz file. 

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.