## SageMaker Processing (prebuilt sklearn container)

For data scientists and ML engineers who are used to running preprocessing, postprocessing, and model evaluation workloads with Python scripts or custom containers, Amazon SageMaker Processing provides a Python SDK that lets you run those same workloads as managed jobs. Processing jobs accept data from Amazon S3 as input and store data into Amazon S3 as output.

With SageMaker Processing you can process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances.

Amazon SageMaker Processing allows you to run data pre- and post-processing, feature engineering, data validation, and model evaluation workloads on Amazon SageMaker.

![processing](https://sagemaker.readthedocs.io/en/stable/_images/amazon_sagemaker_processing_image1.png)

<b>To use SageMaker Processing, supply a Python data preprocessing script.</b> For this example, we're using a SageMaker prebuilt Scikit-learn container, which includes many common functions and libraries for processing data. There are few limitations on what kinds of code and operations you can run, and only a minimal contract: input and output <b>data must be placed in specified directories</b>:

* /opt/ml/processing/input/ for input data
* /opt/ml/processing/output/ for output data

When the script follows this contract, SageMaker Processing automatically downloads the input data from S3 before the script starts and uploads the transformed data back to S3 when the job is complete.
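As a minimal sketch of this contract (not the script used later in this notebook), the function below reads every CSV file from an input directory and writes a transformed copy to an output directory. The name `process_directory` and the upper-casing transformation are assumptions made for illustration. Inside a Processing job the two default paths would apply, while locally any pair of directories works.

```python
import csv
import os
import tempfile

# Container paths from the contract above; used as defaults inside a Processing job.
INPUT_DIR = "/opt/ml/processing/input/"
OUTPUT_DIR = "/opt/ml/processing/output/"


def process_directory(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Copy each CSV from input_dir to output_dir, upper-casing every cell.

    The transformation is a placeholder (an assumption for illustration);
    only the read-from-input / write-to-output pattern matters here.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(input_dir, name), newline="") as src:
            rows = [[cell.upper() for cell in row] for row in csv.reader(src)]
        out_path = os.path.join(output_dir, name)
        with open(out_path, "w", newline="") as dst:
            csv.writer(dst).writerows(rows)
        written.append(out_path)
    return written


# Local dry run with temporary directories standing in for the container paths.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "input")
    out_dir = os.path.join(tmp, "output")
    os.makedirs(in_dir)
    with open(os.path.join(in_dir, "sample.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["job", "marital"], ["admin", "married"]])
    outputs = process_directory(in_dir, out_dir)
    with open(outputs[0], newline="") as f:
        result = list(csv.reader(f))
    print(result)
```

Because the script only touches local directories, the same code can be tested on a laptop before it is submitted as a job.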


### Overview
This notebook presents an example problem: predicting whether a customer will enroll in a term deposit at a bank after one or more phone calls. The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

### Preparation

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the get_execution_role() call with the appropriate full IAM role ARN string(s).
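For reference, a full IAM role ARN has the shape `arn:aws:iam::<account-id>:role/<role-name>`. The sketch below is a hypothetical helper, not part of the SageMaker SDK, that checks a string against that shape before you assign it to `role`. The account ID and role name in the example are placeholders.

```python
import re

# Shape of a full IAM role ARN:
# arn:<partition>:iam::<12-digit account id>:role/<role name or path>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def is_role_arn(value):
    """Return True if value looks like a full IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Hypothetical ARN for illustration only; substitute your own account ID and role name.
example_role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
print(is_role_arn(example_role))
print(is_role_arn("MySageMakerExecutionRole"))
```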

In [12]:
# cell 01: Create default S3 bucket (region name + account ID) and start SageMaker session
import sagemaker                              # SageMaker python SDK
bucket=sagemaker.Session().default_bucket()   # creating default bucket
print('Default bucket:', bucket)
prefix = 'sagemaker/DEMO-xgboost-dm'          # will be used later to define the inputs source name
 
# Define IAM role
import boto3     # general AWS python SDK
import re        # RegEx
from sagemaker import get_execution_role

role = get_execution_role()    # Retrieving assigned IAM role

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Default bucket: sagemaker-eu-central-1-365644463685


Now let's bring in the Python libraries that we'll use throughout the analysis.

In [13]:
# cell 02: Loading python libraries required for the analysis
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For extracting the downloaded dataset archive

---

## Data

Direct marketing, whether through mail, email, phone, or another channel, is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.

Let's start by downloading the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [14]:
# cell 03: Fetching the data
!wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

with zipfile.ZipFile('bank-additional.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

--2023-11-16 08:35:48--  https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 3.5.86.159, 52.218.176.185, 52.92.163.122, ...
Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|3.5.86.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 432828 (423K) [application/zip]
Saving to: ‘bank-additional.zip.2’


2023-11-16 08:35:49 (437 KB/s) - ‘bank-additional.zip.2’ saved [432828/432828]



Now let's read this into a pandas DataFrame and take a look.

In [15]:
# cell 04: Reading and visualizing the data
data = pd.read_csv('./bank-additional/bank-additional-full.csv')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,334,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,383,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,189,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,442,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [5]:
# # Downsample original data
# print(data.shape)
# data = data.sample(n=3000)
# print(data.shape)

Next, we upload the raw CSV file to S3 so that SageMaker Processing can read it from there.

In [16]:
# cell 05: Start SageMaker session
from sagemaker import Session

sess = Session()   # The session object that manages interactions with SageMaker API operations and
                   # other AWS services (here, the upload to S3).
input_source = sess.upload_data('./bank-additional/bank-additional-full.csv', bucket=bucket, key_prefix=f'{prefix}/input_data')
print('Input data location on S3:\n' + input_source)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
Input data location on S3:
s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/input_data/bank-additional-full.csv


### Feature Engineering with Amazon SageMaker Processing

Here, we'll import the dataset and transform it with SageMaker Processing, which includes off-the-shelf support for scikit-learn as well as a Bring Your Own Container option.


In [17]:
%%writefile preprocessing.py
# cell 06: Create locally the processing script that will be used as source by SageMaker Processing

# Processing script:

import pandas as pd
import numpy as np
import argparse
import os
from sklearn.preprocessing import OrdinalEncoder

def _parse_args():

    parser = argparse.ArgumentParser()

    # Input directory, input file name, and output directory.
    # The defaults follow the SageMaker Processing path contract, so no arguments are needed inside a job.
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='bank-additional-full.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    parser.add_argument('--categorical_features', type=str, default='y, job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome')

    return parser.parse_known_args()

if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load data
    df = pd.read_csv(os.path.join(args.filepath, args.filename))
    # In string values, replace '.' with '_', then drop any trailing '_' (e.g. 'admin.' becomes 'admin')
    df = df.replace(regex=r'\.', value='_')
    df = df.replace(regex=r'\_$', value='')
    # Add two new indicators
    df["no_previous_contact"] = (df["pdays"] == 999).astype(int)
    df["not_working"] = df["job"].isin(["student", "retired", "unemployed"]).astype(int)
    df = df.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)
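    # One-hot encode the categorical features
    # (dtype=int keeps the output as 0/1; newer pandas versions otherwise write True/False)
    df = pd.get_dummies(df, dtype=int)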
    # Downsample data frame: (for the demo to run faster)
    # df = df.sample(n=3000)
    # Train, test, validation split
    train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=42), [int(0.7 * len(df)), int(0.9 * len(df))])   # Randomly sort the data then split out first 70%, second 20%, and last 10%
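    # Local store (create the output sub-directories first so the script also runs outside SageMaker)
    for subdir in ('train', 'validation', 'test'):
        os.makedirs(os.path.join(args.outputpath, subdir), exist_ok=True)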
    pd.concat([train_data['y_yes'], train_data.drop(['y_yes','y_no'], axis=1)], axis=1).to_csv(os.path.join(args.outputpath, 'train/train.csv'), index=False, header=False)
    pd.concat([validation_data['y_yes'], validation_data.drop(['y_yes','y_no'], axis=1)], axis=1).to_csv(os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=False)
    test_data['y_yes'].to_csv(os.path.join(args.outputpath, 'test/test_y.csv'), index=False, header=False)
    test_data.drop(['y_yes','y_no'], axis=1).to_csv(os.path.join(args.outputpath, 'test/test_x.csv'), index=False, header=False)
    print("## Processing complete. Exiting.")

Overwriting preprocessing.py
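Two steps in the script above are easy to misread: the pair of `replace` calls, which act on cell values rather than column names (the dotted column names passed to `drop` are unchanged), and the 70/20/10 split. The sketch below reproduces both with the standard library only. It is a stand-in for the pandas and NumPy calls, not the script's own code, and its shuffle order will differ from `df.sample(frac=1, random_state=42)`. The helper names and the row count of 1000 are assumptions for illustration.

```python
import random
import re


def clean_value(value):
    """Mimic the two replace() calls: '.' becomes '_', then a trailing '_' is dropped."""
    value = re.sub(r"\.", "_", value)
    return re.sub(r"_$", "", value)


def split_indices(n_rows, seed=42):
    """Shuffle row positions, then cut at 70% and 90% to get train/validation/test."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    first_cut, second_cut = int(0.7 * n_rows), int(0.9 * n_rows)
    return order[:first_cut], order[first_cut:second_cut], order[second_cut:]


# Values taken from the job and education columns shown earlier.
cleaned = [clean_value(v) for v in ["admin.", "basic.4y", "high.school"]]
print(cleaned)

# 1000 is a hypothetical row count chosen to make the proportions easy to read.
train_idx, validation_idx, test_idx = split_indices(1000)
print(len(train_idx), len(validation_idx), len(test_idx))
```

The two cut points are why the split is 70% / 20% / 10%: the second cut sits at 90% of the rows, so the middle slice covers the 20% between the cuts.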


NB: The ArgumentParser makes it possible to run the processing code as a standalone script (whether in the AWS cloud or locally on your machine). Nothing in this script is specific to SageMaker apart from the default values of --filepath and --outputpath, which are set so that the paths match what SageMaker expects by default. If you want to use different values, you can pass them through the arguments parameter when you run the Processor.
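A small sketch of that behavior, using the same argument names as the script: with no arguments the parser returns the container paths, and any flag can be overridden on the command line. The local directories and the `--extra` flag are hypothetical values chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Same pattern as the processing script: SageMaker-friendly defaults, all overridable."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--filepath", type=str, default="/opt/ml/processing/input/")
    parser.add_argument("--filename", type=str, default="bank-additional-full.csv")
    parser.add_argument("--outputpath", type=str, default="/opt/ml/processing/output/")
    # parse_known_args returns unrecognized flags separately instead of exiting with an error.
    return parser.parse_known_args(argv)


# No arguments: the defaults match the container paths.
defaults, _ = parse_args([])
print(defaults.filepath, defaults.outputpath)

# Local run with hypothetical directories, plus an unknown flag that is returned separately.
local, unknown = parse_args(["--filepath", "./data/", "--outputpath", "./out/", "--extra", "1"])
print(local.filepath, local.outputpath, unknown)
```

In a Processing job, the same flags would be supplied as a list of strings, for example `['--filename', 'other.csv']`, through the `arguments` parameter of `run()`.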

In [18]:
# Creating s3 paths for training, validation, and test.
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"
print('Train path:\n' + train_path)
print('Validation path:\n' + validation_path)
print('Test path:\n' + test_path)

Train path:
s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/train
Validation path:
s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/validation
Test path:
s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/test


Before starting the <b>SageMaker Processing job</b>, we instantiate a <b>SKLearnProcessor object</b>. This object lets you specify the instance type to use in the job, as well as how many instances. (If you want to use a PySpark cluster, you can use the PySparkProcessor, or a ScriptProcessor if you want to run a script in your own custom container.)

In [19]:
# Creating the SKLearnProcessor object 
# (It lets you specify the instance number and type for the processing job.)

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

sklearn_processor = SKLearnProcessor(  
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.c5.2xlarge",   #default: ml.m5.large
    instance_count=1, #default: 1 
    base_job_name='sm-immday-skprocessing'
)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


You can now run this processing job by <b>specifying the code to use, inputs, outputs, and the arguments</b>, if any. (Specifying s3_data_distribution_type="ShardedByS3Key" ensures that, when the job uses multiple instances, the input objects are divided among them by S3 key so that each instance processes only its own share. With a single input file and a single instance, as in this example, the setting has no practical effect.)

In [20]:
# Run processing job:

sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[
        ProcessingInput(
            source=input_source, 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation", destination=validation_path),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test", destination=test_path),
    ]
)

INFO:sagemaker:Creating processing-job with name sm-immday-skprocessing-2023-11-16-08-46-05-864


......................## Processing complete. Exiting.
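To make the ShardedByS3Key setting used in the job above more concrete, here is a toy simulation of dividing input objects among instances. The round-robin assignment and the `shard_by_key` helper are assumptions made for illustration, not SageMaker's actual implementation. The point is only that sharding happens per S3 object, not per row.

```python
def shard_by_key(keys, instance_count):
    """Assign each S3 object key to exactly one instance.

    Round-robin over the sorted keys is an assumption made for illustration;
    the point is only that sharding happens per object, not per row.
    """
    shards = [[] for _ in range(instance_count)]
    for position, key in enumerate(sorted(keys)):
        shards[position % instance_count].append(key)
    return shards


# Hypothetical object keys under one input prefix.
many_files = [f"input_data/part-{i:02d}.csv" for i in range(5)]
print(shard_by_key(many_files, 2))

# A single object, as in this notebook: one instance receives it, the other gets nothing.
single_file = ["input_data/bank-additional-full.csv"]
print(shard_by_key(single_file, 2))
```

To benefit from several instances, the input would therefore need to be split into multiple objects under the same S3 prefix before the job starts.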



This cell can take 5 to 7 minutes to run. Once it’s done, you can check that the Processing job has written its output to the train_path S3 location by running:

In [21]:
# Check the output generated by the processing job:
!aws s3 ls $train_path/

2023-11-16 08:49:49    3545009 train.csv


You can also <b>see the Processing job in the AWS console</b> under SageMaker -> Processing -> Processing jobs.

Now your data is ready! You can proceed with Lab 2 to train your XGBoost model.