# Build, train, and deploy a machine learning model
## with Amazon SageMaker

Source: https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/?trk=el_a134p000003yWILAA2&trkCampaign=DS_SageMaker_Tutorial&sc_channel=el&sc_campaign=Data_Scientist_Hands-on_Tutorial&sc_outcome=Product_Marketing&sc_geo=mult&p=gsrc&c=lp_ds

1. Import the required libraries and define the environment variables you need to prepare the data, train the ML model, and deploy the ML model.

In [1]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.serializers import CSVSerializer  # replaces the deprecated sagemaker.predictor.csv_serializer

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
my_region = boto3.session.Session().region_name # set the region of the instance

# Look up the URI of the SageMaker-managed XGBoost container image for this region.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")

print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + xgboost_container + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the eu-central-1 region. You will use the 813361260812.dkr.ecr.eu-central-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.
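
The container URI printed above follows the usual Amazon ECR layout, `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a rough sketch, the snippet below splits such a URI into its parts so you can confirm that the region matches your notebook instance. The helper name `parse_ecr_image_uri` is hypothetical and not part of the SageMaker SDK, and the pattern assumes the standard `amazonaws.com` domain.

```python
import re

# General ECR image URI layout (assumed here):
# <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into its parts (hypothetical helper)."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri!r}")
    return match.groupdict()


# URI copied from the cell output above.
parts = parse_ecr_image_uri(
    "813361260812.dkr.ecr.eu-central-1.amazonaws.com/xgboost:latest"
)
print(parts)
```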


2. Create the S3 bucket to store your data. Change the bucket name to one that is globally unique.

In [2]:
bucket_name = 'your-s3-bucket-moanesga' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

S3 bucket created successfully
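
The cell above branches on the region because `us-east-1` is the default S3 location and is created without a `LocationConstraint`, while every other region needs one. A minimal sketch of the same rule as a pure function, so it can be checked without calling AWS; `create_bucket_kwargs` is a hypothetical helper name, not a boto3 function.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for an S3 create_bucket call (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location, so no constraint is passed for it.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


default_region_kwargs = create_bucket_kwargs("my-example-bucket", "us-east-1")
other_region_kwargs = create_bucket_kwargs("my-example-bucket", "eu-central-1")
print(default_region_kwargs)
print(other_region_kwargs)
```

In the notebook, the result could be passed on as `s3.create_bucket(**create_bucket_kwargs(bucket_name, my_region))`.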


3. Download the data to your SageMaker instance and load it into a dataframe.

In [3]:
try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ', e)

try:
    model_data = pd.read_csv('./bank_clean.csv', index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)

Success: downloaded bank_clean.csv.
Success: Data loaded into dataframe.


4. Shuffle and split the data into training data and test data. 

The training data (70% of customers) is used during the model training loop. You use gradient-based optimization to iteratively refine the model parameters. Gradient-based optimization is a way to find model parameter values that minimize the model error, using the gradient of the model loss function.
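
To make the idea of gradient-based optimization concrete, here is a small, generic sketch that fits a single parameter `w` in `y ≈ w * x` by repeatedly stepping against the gradient of the mean squared error. It illustrates the general principle only; XGBoost itself uses gradient boosting over decision trees, so this is not the algorithm SageMaker runs. The data, learning rate, and step count are made up for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Fit y ~ w * x by minimizing mean squared error with gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Move against the gradient to reduce the loss.
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated with w = 2
w = gradient_descent(xs, ys)
print(round(w, 4))
```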

The test data (remaining 30% of customers) is used to evaluate the performance of the model and measure how well the trained model generalizes to unseen data.

In [4]:
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)

(28831, 61) (12357, 61)
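
The shapes printed above follow from the split point `int(0.7 * len(model_data))`. The sketch below reproduces the same shuffle-then-cut logic with the standard library only, using row indices as a stand-in for the dataframe. The total of 41,188 rows is derived from the two shapes above (28,831 + 12,357). Note that `random.Random(1729)` does not produce the same ordering as pandas' `sample(random_state=1729)`; only the sizes are comparable.

```python
import random


def shuffle_and_split(rows, train_fraction=0.7, seed=1729):
    """Shuffle rows reproducibly, then cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))  # rounds down, like the notebook cell
    return shuffled[:cut], shuffled[cut:]


rows = list(range(41188))  # stand-in for the dataframe's row indices
train_rows, test_rows = shuffle_and_split(rows)
print(len(train_rows), len(test_rows))
```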


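5. Prepare the training data for the ML model. This code moves the target column (y_yes) to the first position, drops the header row, writes the result to train.csv, uploads the file to the S3 bucket, and defines that S3 location as the training input. This step is required to use the Amazon SageMaker pre-built XGBoost algorithm.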

In [5]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
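
The layout produced by the cell above is a headerless CSV with the label in the first column. The following sketch shows that layout on two toy records, using only the standard library so it runs without pandas. The helper name `to_xgboost_csv` and the feature values are made up for illustration.

```python
import csv
import io


def to_xgboost_csv(records, label_key, drop_keys=()):
    """Write records as headerless CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Keep every column except the label and the explicitly dropped ones.
        features = [
            value for key, value in record.items()
            if key != label_key and key not in drop_keys
        ]
        writer.writerow([record[label_key]] + features)
    return buffer.getvalue()


# Toy records with made-up values; the real data has many more columns.
records = [
    {"age": 56, "campaign": 1, "y_no": 1, "y_yes": 0},
    {"age": 41, "campaign": 3, "y_no": 0, "y_yes": 1},
]
csv_text = to_xgboost_csv(records, label_key="y_yes", drop_keys=("y_no",))
print(csv_text)
```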

6. Set up the Amazon SageMaker session, create an instance of the XGBoost model (an estimator), and define the model’s hyperparameters. 

In [6]:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(xgboost_container,role, instance_count=1, instance_type='ml.m4.xlarge',output_path='s3://{}/{}/output'.format(bucket_name, prefix),sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='binary:logistic',num_round=100)

7. Start the training job. This code trains the model using gradient optimization on an ml.m4.xlarge instance. After a few minutes, you should see the training logs being generated in your Jupyter notebook.

In [7]:
xgb.fit({'train': s3_input_train})

2021-06-30 07:52:36 Starting - Starting the training job...
2021-06-30 07:52:59 Starting - Launching requested ML instancesProfilerReport-1625039555: InProgress
......
2021-06-30 07:54:00 Starting - Preparing the instances for training......
2021-06-30 07:55:00 Downloading - Downloading input data...
2021-06-30 07:55:31 Training - Training image download completed. Training in progress..
Arguments: train
[2021-06-30:07:55:32:INFO] Running standalone xgboost training.
[2021-06-30:07:55:32:INFO] Path /opt/ml/input/data/validation does not exist!
[2021-06-30:07:55:32:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8392.67mb
[2021-06-30:07:55:32:INFO] Determined delimiter of CSV input is ','
[07:55:32] S3DistributionType set as FullyReplicated
[07:55:33] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[07:55:33] src/

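The training log reports a `28831x59` matrix, which is a quick way to confirm that the upload in step 5 worked as intended: the 61 dataframe columns minus `y_no` and `y_yes` leave 59 features, and the label travels separately as column 0. A small sketch of that sanity check, with the log line copied from the output above (color codes removed):

```python
import re

# Log line copied from the training output above.
log_line = (
    "[07:55:33] 28831x59 matrix with 1701029 entries loaded from "
    "/opt/ml/input/data/train?format=csv&label_column=0&delimiter=,"
)

match = re.search(r"(\d+)x(\d+) matrix with (\d+) entries", log_line)
n_rows, n_features, n_entries = (int(group) for group in match.groups())

dataframe_columns = 61   # from the shapes printed in step 4
label_columns = 2        # y_no and y_yes are removed from the features
expected_features = dataframe_columns - label_columns

print(n_rows, n_features, n_entries)
print(n_features == expected_features, n_rows * n_features == n_entries)
```
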
8. Deploy the model. This code deploys the model on a server and creates a SageMaker endpoint that you can access. This step may take a few minutes to complete.

In [8]:
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

-------------!

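9. Predict whether customers in the test data enrolled for the bank product or not. This code drops the label columns, sends the test features to the endpoint as CSV, and parses the returned predictions into an array.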

In [9]:
from sagemaker.serializers import CSVSerializer

test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.serializer = CSVSerializer() # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

(12357,)
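
The cell above treats the endpoint response as a comma-separated string of scores, and step 10 rounds those scores to 0 or 1. A minimal sketch of that parse-then-threshold flow in plain Python follows; the payload is made up, and the helper names are hypothetical. The strict `> 0.5` comparison matches `np.round`, which rounds exactly 0.5 down to 0.

```python
def parse_predictions(payload):
    """Parse a comma-separated string of scores into a list of floats."""
    return [float(item) for item in payload.split(",") if item.strip()]


def to_labels(scores, threshold=0.5):
    """Turn scores into 0/1 labels; scores at or below the threshold map to 0."""
    return [1 if score > threshold else 0 for score in scores]


payload = "0.05,0.91,0.50,0.37"  # made-up response body for illustration
scores = parse_predictions(payload)
labels = to_labels(scores)
print(scores)
print(labels)
```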


10. Evaluate model performance. This code compares the actual and predicted values in a table called a confusion matrix.
Based on the output below, the model classified 89.5% of customers in the test data correctly, with a precision of 63% (288/455) for customers predicted to enroll in a certificate of deposit and 90% (10,769/11,902) for customers predicted not to enroll.

In [10]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
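tn = cm.iloc[0,0]  # observed no, predicted no
fn = cm.iloc[1,0]  # observed yes, predicted no
tp = cm.iloc[1,1]  # observed yes, predicted yes
fp = cm.iloc[0,1]  # observed no, predicted yes
p = (tp+tn)/(tp+tn+fp+fn)*100  # overall classification rate in percent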
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.5%

Predicted      No Purchase    Purchase
Observed
No Purchase    90% (10769)    37% (167)
Purchase        10% (1133)     63% (288) 
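
The percentages in the table above are column-wise, so they describe how often each prediction was right. As a worked check, the sketch below recomputes them from the four counts in the output and adds recall, which the table does not show: the share of customers who actually enrolled that the model found.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 10769, 167, 1133, 288

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # overall classification rate
precision = tp / (tp + fp)     # predicted "Purchase" that were correct
npv = tn / (tn + fn)           # predicted "No Purchase" that were correct
recall = tp / (tp + fn)        # observed purchasers the model identified

print(f"total={total}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} npv={npv:.3f} recall={recall:.3f}")
```

With these counts, recall comes out at roughly 20%, so the high overall rate is driven mostly by the large "No Purchase" class.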



11. Clean up. In this step, you terminate the resources you used in this lab.

Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources will result in charges to your account. Delete your endpoint:

In [11]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

12. Delete your training artifacts from the S3 bucket. This code deletes every object in the bucket; the bucket itself remains.

In [12]:
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'SMYFWNK5ADAWCNJQ',
   'HostId': 'bNQ7AaOaa1L9/g2PtxijN0Iv4+/UfyWwd/KNTapewSSemxcSa+EleB3fIckbNt8RTtqnBKGIRyk=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'bNQ7AaOaa1L9/g2PtxijN0Iv4+/UfyWwd/KNTapewSSemxcSa+EleB3fIckbNt8RTtqnBKGIRyk=',
    'x-amz-request-id': 'SMYFWNK5ADAWCNJQ',
    'date': 'Wed, 30 Jun 2021 08:13:45 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-06-30-07-52-35-954/rule-output/ProfilerReport-1625039555/profiler-output/profiler-reports/MaxInitializationTime.json'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-06-30-07-52-35-954/output/model.tar.gz'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-06-30-07-52-35-954/profiler-output/system/training_job_end.ts'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-
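
If you also want to remove the bucket, it has to be empty first. A minimal sketch of that ordering is shown below. The function `empty_and_delete_bucket` is a hypothetical helper that expects an object behaving like a boto3 S3 `Bucket` resource (with `objects.all().delete()` and `delete()`); the fake classes exist only so the example runs without AWS access. Buckets with versioning enabled would also need their object versions removed, which this sketch does not cover.

```python
def empty_and_delete_bucket(bucket):
    """Delete every object in the bucket, then the bucket itself.

    `bucket` is assumed to behave like a boto3 S3 Bucket resource.
    """
    bucket.objects.all().delete()  # the bucket must be empty before deletion
    bucket.delete()


class _FakeObjectCollection:
    """Stand-in for bucket.objects, recording calls instead of calling AWS."""

    def __init__(self, log):
        self._log = log

    def all(self):
        return self

    def delete(self):
        self._log.append("objects deleted")


class FakeBucket:
    """Stand-in for a boto3 Bucket resource, for illustration only."""

    def __init__(self):
        self.log = []
        self.objects = _FakeObjectCollection(self.log)

    def delete(self):
        self.log.append("bucket deleted")


fake = FakeBucket()
empty_and_delete_bucket(fake)
print(fake.log)
```

In the notebook, the real call would be `empty_and_delete_bucket(boto3.resource('s3').Bucket(bucket_name))`.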

13. Delete your SageMaker Notebook: Stop and delete your SageMaker Notebook.