# Targeting Direct Marketing with Feature Store and Amazon SageMaker XGBoost
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Background
Direct marketing, whether through mail, email, phone, or another channel, is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook presents an example problem to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls.  The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Preparation

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [52]:
# cell 01
import sagemaker
bucket=sagemaker.Session().default_bucket()
prefix = 'sagemaker/DEMO-xgboost-dm'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [53]:
# cell 02
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import zipfile                                    # For working with zip archives
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions

**Get the data file from S3**

In [71]:
#cell 03

#!aws s3 cp s3://$bucket/$prefix . --recursive
import subprocess

# List the top-level folders (common prefixes) whose names start with output_
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket, Prefix='output_', Delimiter='/')

# Download each matching folder, if any were found
if 'CommonPrefixes' in response:
    # The loop variable is not named `prefix`, so the S3 prefix from cell 01 is not overwritten
    for common_prefix in response['CommonPrefixes']:
        folder_name = common_prefix['Prefix']
        print(f"Folder to download: {folder_name}")

        # Download the folder and its contents
        subprocess.run(["aws", "s3", "cp", f"s3://{bucket}/{folder_name}", f"./{folder_name}", "--recursive"], check=True)
        print(f"Finished downloading folder {folder_name}")
else:
    print("No folders starting with output_ were found.")

Folder to download: output_e4eb4705-8fff-4500-9a32-2c2fe8c20573/
download: s3://sagemaker-us-west-2-538179828789/output_e4eb4705-8fff-4500-9a32-2c2fe8c20573/part-00000-9f04b153-a1bb-4a6c-ad48-3fb5e005316a-c000.csv to output_e4eb4705-8fff-4500-9a32-2c2fe8c20573/part-00000-9f04b153-a1bb-4a6c-ad48-3fb5e005316a-c000.csv
Finished downloading folder output_e4eb4705-8fff-4500-9a32-2c2fe8c20573/


Once we have downloaded the data from Amazon S3, we can load it into a pandas DataFrame and one-hot encode the categorical columns.

In [67]:
# cell 04
from glob import glob
model_data = pd.concat([pd.read_csv(f) for f in glob("output*/*.csv")])
model_data = pd.get_dummies(model_data, dtype=int)

In [68]:
model_data

Unnamed: 0,age,duration,campaign,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,month_nov,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,y_no,y_yes
0,56,261,1,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,57,149,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,37,226,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,40,151,1,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,56,307,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,334,1,0,0,0,0,0,1,0,...,1,0,0,1,0,0,0,0,0,1
41184,46,383,1,0,1,0,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
41185,56,189,2,0,0,0,0,0,1,0,...,1,0,0,1,0,0,0,0,1,0
41186,44,442,1,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1


In [23]:
# cell 05
# Shuffle the rows, then split out the first 70%, the next 20%, and the last 10%
shuffled = model_data.sample(frac=1, random_state=1729)
train_end, validation_end = int(0.7 * len(shuffled)), int(0.9 * len(shuffled))
train_data, validation_data, test_data = shuffled.iloc[:train_end], shuffled.iloc[train_end:validation_end], shuffled.iloc[validation_end:]


Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, although it is repetitive, it's easiest to do this after the train/validation/test split rather than before, which avoids any misalignment issues due to random reordering.

In [25]:
# cell 06
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
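The layout the container expects (target first, no header, no index) can be checked on a toy frame before uploading anything. The sketch below is illustrative only: the `to_xgboost_csv` helper and the three-row frame are assumptions made for this example, not part of the notebook's data or of the SageMaker SDK.

```python
import io

import pandas as pd

# Toy stand-in for the one-hot encoded frame; column names mirror the notebook's.
toy = pd.DataFrame({
    "age": [56, 37, 73],
    "duration": [261, 226, 334],
    "y_no": [1, 1, 0],
    "y_yes": [0, 0, 1],
})


def to_xgboost_csv(frame, target="y_yes", drop=("y_no", "y_yes")):
    """Hypothetical helper: CSV text with the target first, no header, no index."""
    features = frame.drop(list(drop), axis=1)
    ordered = pd.concat([frame[target], features], axis=1)
    buffer = io.StringIO()
    ordered.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()


csv_text = to_xgboost_csv(toy)
first_row = csv_text.splitlines()[0].split(",")
print(csv_text)
```

Each line starts with the `y_yes` label followed by the feature values, which is the same shape that `train.csv` and `validation.csv` take above.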

Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [26]:
# cell 07
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---

## End of Lab 1


---

## Training
Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
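To make the "each model addresses the weaknesses of the previous ones" idea concrete, here is a minimal sketch of boosting with one-split trees (stumps) on a single feature under squared error. It is a toy under stated assumptions, not how XGBoost is implemented: XGBoost adds regularization, second-order gradient information, and many engineering optimizations. The function names and the data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, eta=0.5):
    """Fit `rounds` stumps, each on the residuals left by the previous ones."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate eta.
        predictions = [p + eta * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: sum(eta * s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

The training error with 20 stumps is lower than with one, because every new stump is fitted to whatever the earlier stumps still get wrong. The `eta` argument plays the same shrinkage role as the `eta` hyperparameter set later in this notebook.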

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [27]:
# cell 11
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV.

In [28]:
# cell 12
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Next, we'll specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` call, which specifies:
1. The S3 locations of the input data.  In this case we have both a training and a validation set, which are passed in.

In [29]:
# cell 13
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: xgboost-2025-07-20-08-24-09-102


2025-07-20 08:24:10 Starting - Starting the training job...
2025-07-20 08:24:41 Downloading - Downloading input data...
2025-07-20 08:25:06 Downloading - Downloading the training image......
2025-07-20 08:26:08 Training - Training image download completed. Training in progress.
2025-07-20 08:26:08 Uploading - Uploading generated training model
Arguments: train
[2025-07-20:08:26:01:INFO] Running standalone xgboost training.
[2025-07-20:08:26:01:INFO] File size need to be processed in the node: 3.92mb. Available memory size in the node: 8063.5mb
[2025-07-20:08:26:01:INFO] Determined delimiter of CSV input is ','
[08:26:01] S3DistributionType set as FullyReplicated
[08:26:01] 28831x53 matrix with 1528043 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2025-07-20:08:26:01:INFO] Determined delimiter of CSV input is ','
[08:26:01] S3DistributionType set as FullyReplicated
[08:26:01

---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [30]:
# cell 14
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2025-07-20-08-26-57-229
INFO:sagemaker:Creating endpoint-config with name xgboost-2025-07-20-08-26-57-229
INFO:sagemaker:Creating endpoint with name xgboost-2025-07-20-08-26-57-229


-----!

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`), which produces a simple confusion matrix.

First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [31]:
# cell 15
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [32]:
# cell 16
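def predict(data, predictor, rows=500):
    # Split the data into mini-batches of fewer than `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a CSV string of scores; join the responses with commas
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array
    return np.fromstring(predictions[1:], sep=',')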

predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)
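The batching logic can be exercised without a live endpoint by swapping in a stand-in predictor. In the sketch below, `StubPredictor` is a hypothetical class written for this example (it is not part of the SageMaker SDK); it returns a constant score per row as CSV bytes, which is the response shape that `predict` decodes. The `predict` function is repeated so the block runs on its own.

```python
import numpy as np


class StubPredictor:
    """Hypothetical stand-in for a SageMaker predictor; not part of the SDK."""

    def __init__(self):
        self.batch_sizes = []

    def predict(self, array):
        # Record the batch size, then return one fake score per row as CSV bytes.
        self.batch_sizes.append(len(array))
        return ",".join("0.5" for _ in array).encode("utf-8")


def predict(data, predictor, rows=500):
    # Same batching logic as cell 16.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, predictor.predict(array).decode("utf-8")])
    return np.fromstring(predictions[1:], sep=",")


stub = StubPredictor()
fake_rows = np.zeros((1200, 53))  # 53 feature columns, as in this notebook
scores = predict(fake_rows, stub)
print(stub.batch_sizes, scores.shape)
```

With 1,200 rows and the default `rows=500`, the data is split into three batches of 400 rows, and one score comes back per input row.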

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [33]:
# cell 17
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

predictions   0.0  1.0
actuals
0            3523  113
1             278  205


So, of the 4,119 potential customers in the test set, we predicted 318 would subscribe and 205 of them actually did.  We also had 278 subscribers whom we did not predict would subscribe. The model can (and should) be tuned to improve this result.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
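Because the classes are unbalanced, accuracy alone can be misleading, so it helps to read precision and recall off the same matrix. The arithmetic below uses the four counts from the confusion matrix above; the variable names are chosen for this example.

```python
# Counts read off the confusion matrix above (rows: actuals, columns: predictions).
true_negatives, false_positives = 3523, 113
false_negatives, true_positives = 278, 205

total = true_negatives + false_positives + false_negatives + true_positives
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# Accuracy of a trivial model that always predicts "no subscription".
baseline = (true_negatives + false_positives) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"always-predict-0 accuracy={baseline:.3f}")
```

The overall accuracy is only slightly above the always-predict-0 baseline, while recall shows that fewer than half of the actual subscribers were identified.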

### Clean Up
Delete any resources you created in this notebook that you no longer wish to use.

In [34]:
# cell 28
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2025-07-20-08-26-57-229
INFO:sagemaker:Deleting endpoint with name: xgboost-2025-07-20-08-26-57-229


---

## Serverless Deployment (Optional)

SageMaker also supports serverless inference, with an endpoint that scales down to zero when there are no requests. To enable this feature, you need to set up a serverless configuration for your endpoint. The maximum number of concurrent invocations for a single endpoint, known as MaxConcurrency, can be any value from 1 to 200, and MemorySize can be any of the following: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB.
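A small pre-flight check can catch an invalid setting before a deployment call is made. The sketch below encodes only the limits stated in the paragraph above, which may change over time; `check_serverless_settings` is a hypothetical helper written for this example and is not part of the SageMaker SDK.

```python
# Limits as stated in the text above; treat them as assumptions that may change.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_RANGE = range(1, 201)


def check_serverless_settings(memory_size_in_mb, max_concurrency):
    """Hypothetical helper: return a list of problems, empty if the settings look valid."""
    problems = []
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        problems.append(
            f"memory_size_in_mb={memory_size_in_mb} is not one of {ALLOWED_MEMORY_MB}"
        )
    if max_concurrency not in MAX_CONCURRENCY_RANGE:
        problems.append(f"max_concurrency={max_concurrency} is outside 1..200")
    return problems


ok = check_serverless_settings(memory_size_in_mb=2048, max_concurrency=1)
bad = check_serverless_settings(memory_size_in_mb=1500, max_concurrency=0)
print(ok)
print(bad)
```

The first call mirrors the values used in this notebook and reports no problems; the second reports both settings as invalid.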

In [35]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

xgb_serverless_predictor = xgb.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=1
    )
)

INFO:sagemaker:Creating model with name: xgboost-2025-07-20-08-35-10-892
INFO:sagemaker:Creating endpoint-config with name xgboost-2025-07-20-08-35-10-892
INFO:sagemaker:Creating endpoint with name xgboost-2025-07-20-08-35-10-892


---!

### Endpoint Invocation
Invoke the endpoint by sending a request to it. The following draws five sample rows from the validation split of the Direct Marketing dataset.

In [36]:
sample = validation_data.sample(n=5).drop(["y_yes", "y_no"], axis=1).values
len(sample[0])

53

In [37]:
xgb_serverless_predictor.serializer = sagemaker.serializers.CSVSerializer()

xgb_serverless_predictor.predict(sample)

b'0.015533538535237312,0.10435222089290619,0.008285315707325935,0.6210773587226868,0.4474489390850067'

### Clean Up
Delete any resources you created in this notebook that you no longer wish to use.

In [38]:
xgb_serverless_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2025-07-20-08-35-10-892
INFO:sagemaker:Deleting endpoint with name: xgboost-2025-07-20-08-35-10-892


## Automatic Model Tuning (Optional)
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
For example, suppose that you want to solve a binary classification problem on this marketing dataset. Your goal is to maximize the area under the curve (AUC) metric by training an XGBoost model. You don't know which values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters to use to train the best model. To find the best values, you specify ranges that Amazon SageMaker hyperparameter tuning searches for the combination that performs best as measured by the objective metric you chose. Hyperparameter tuning launches training jobs that use hyperparameter values in the ranges that you specified, and returns the training job with the highest AUC.
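The search loop itself can be illustrated locally. The sketch below uses plain random search, which is a deliberately simpler technique than the managed tuning service, over the four hyperparameters named above. The ranges are chosen for illustration, and `toy_objective` is a made-up stand-in for validation AUC, since no training job runs here.

```python
import random

# Illustrative ranges for the four hyperparameters named above.
ranges = {
    "eta": (0.0, 1.0),
    "min_child_weight": (1.0, 10.0),
    "alpha": (0.0, 2.0),
    "max_depth": (1, 10),  # integer-valued
}


def sample(rng):
    """Draw one hyperparameter combination uniformly from the ranges."""
    params = {name: rng.uniform(*ranges[name])
              for name in ("eta", "min_child_weight", "alpha")}
    params["max_depth"] = rng.randint(*ranges["max_depth"])
    return params


def toy_objective(params):
    """Made-up stand-in for validation AUC; highest at eta=0.2, max_depth=5."""
    return 1.0 - (params["eta"] - 0.2) ** 2 - 0.01 * (params["max_depth"] - 5) ** 2


def random_search(max_jobs, seed=0):
    """Evaluate `max_jobs` random combinations and return (best, all trials)."""
    rng = random.Random(seed)
    trials = [sample(rng) for _ in range(max_jobs)]
    return max(trials, key=toy_objective), trials


best, trials = random_search(max_jobs=20)
print(best, toy_objective(best))
```

Each trial corresponds to one training job in the managed service, and `max_jobs` plays the same role as the tuner's job budget.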


In [39]:
# cell 18
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}


In [40]:
# cell 19
objective_metric_name = 'validation:auc'

In [41]:
# cell 20
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)


In [42]:
# cell 21
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating hyperparameter tuning job with name: xgboost-250720-0841


.....................................................................................!


In [43]:
# cell 22
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

'Completed'

In [44]:
# cell 23
# return the best training job name
tuner.best_training_job()

'xgboost-250720-0841-019-481127b9'

In [45]:
# cell 24
#  Deploy the best trained or user specified model to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')


2025-07-20 08:48:31 Starting - Found matching resource for reuse
2025-07-20 08:48:31 Downloading - Downloading the training image
2025-07-20 08:48:31 Training - Training image download completed. Training in progress.
2025-07-20 08:48:31 Uploading - Uploading generated training model
2025-07-20 08:48:31 Completed - Resource retained for reuse

INFO:sagemaker:Creating model with name: xgboost-2025-07-20-08-48-58-905





INFO:sagemaker:Creating endpoint-config with name xgboost-250720-0841-019-481127b9
INFO:sagemaker:Creating endpoint with name xgboost-250720-0841-019-481127b9


-------!

In [46]:
# cell 25
# Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [47]:
# cell 26
# Predict
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(),tuner_predictor)

In [49]:
# cell 27
# Compare actuals with rounded predictions in a confusion matrix
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

predictions   0.0  1.0
actuals
0            3521  115
1             278  205


---

## Extensions

This example analyzed a relatively small dataset, but utilized Amazon SageMaker features such as distributed, managed training and real-time model hosting, which could easily be applied to much larger problems.  To improve predictive accuracy further, we could tweak the threshold we apply to our predictions to alter the mix of false positives and false negatives, or we could explore techniques like hyperparameter tuning.  In a real-world scenario, we would also spend more time engineering features by hand and would likely look for additional datasets that contain customer information not available in our initial dataset.
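The effect of moving the threshold can be seen on a handful of scores. The confusion matrices in this notebook round the predicted probabilities, which amounts to a cut-off of about 0.5; the sketch below compares a few alternatives. The labels and scores are made up for illustration, and `confusion_counts` is a helper written for this example.

```python
def confusion_counts(actuals, scores, threshold):
    """Return (tn, fp, fn, tp) when scores >= threshold are labelled 1."""
    tn = fp = fn = tp = 0
    for actual, score in zip(actuals, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif actual == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up labels and scores, for illustration only.
actuals = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.10, 0.35, 0.45, 0.40, 0.62, 0.55, 0.90]

for threshold in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_counts(actuals, scores, threshold)
    print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp}")
```

Lowering the threshold catches more subscribers at the cost of more false positives, and raising it does the opposite. Which trade-off is right depends on the relative cost of a wasted call versus a missed customer.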

### Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [50]:
# cell 29
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-250720-0841-019-481127b9
INFO:sagemaker:Deleting endpoint with name: xgboost-250720-0841-019-481127b9
