# Build, Train and Deploy a Machine Learning Model using Amazon SageMaker

In this lab, you will learn how to build, train, and deploy a machine learning model using Amazon SageMaker. You will use the XGBoost algorithm to predict whether a customer will enroll for a term deposit based on their demographic and behavioral data. This notebook demonstrates the complete ML workflow from data preparation to model deployment and evaluation.

The dataset used is from a direct marketing campaign of a banking institution, and the goal is to predict whether a client will subscribe to a term deposit (binary classification problem).

#### Scenario
You are a Machine Learning Engineer at a financial institution, and your task is to build a predictive model that can help the marketing team identify customers who are likely to subscribe to a term deposit. By automating this prediction, the bank can optimize its marketing campaigns and improve customer targeting.

## 1. Install and Import Required Libraries

In this task, you will install the SageMaker SDK and import all necessary libraries for data manipulation, AWS services, and machine learning operations.

In [None]:
# Install the SageMaker Python SDK (specific version for compatibility)
# The '>=2.0.0,<3.0.0' constraint keeps the install on the 2.x API that this notebook uses
!pip install 'sagemaker>=2.0.0,<3.0.0'

**What we did in the above cell:**
- We installed the `sagemaker` library, which provides Python APIs to interact with Amazon SageMaker services
- The version constraint `>=2.0.0,<3.0.0` pins the install to the 2.x releases, whose API (for example `sagemaker.image_uris.retrieve` and `sagemaker.inputs.TrainingInput`) this notebook relies on

In [None]:
# Importing necessary libraries for data manipulation and AWS services
import numpy as np  # For numerical operations and array handling
import pandas as pd  # For data manipulation using DataFrames
import matplotlib.pyplot as plt  # For creating visualizations
from IPython.display import display  # For displaying outputs in Jupyter notebooks
from time import gmtime, strftime  # For working with timestamps
import sys  # For system-specific parameters and functions
import math  # For mathematical operations
import json  # For working with JSON data
import os  # For interacting with the operating system

# Importing Boto3 (AWS SDK for Python) to interact with AWS services
import boto3
# Importing urllib.request to download the dataset over HTTPS
import urllib.request

# Importing SageMaker-specific modules
import sagemaker  # Main SageMaker SDK
from sagemaker.inputs import TrainingInput  # For specifying training data location
from sagemaker.serializers import CSVSerializer  # For serializing data to CSV format for predictions

**What we did in the above cell:**
- **numpy** - we are importing this library for numerical computations and array operations
- **pandas** - we are importing this library to work with tabular data in DataFrame format
- **matplotlib** - we are importing this library for data visualization
- **boto3** - we are importing the AWS SDK to interact with AWS services like S3
- **sagemaker** - we are importing the SageMaker SDK to train and deploy ML models
- **TrainingInput** - we are importing this class to specify the S3 location of training data
- **CSVSerializer** - we are importing this serializer to format prediction requests in CSV format

## 2. Setup Environment and AWS Configuration

In this task, you will configure the AWS environment by setting up the SageMaker execution role, S3 bucket, and container image for the XGBoost algorithm.

In [None]:
# Get the IAM role for SageMaker
# This role provides permissions for SageMaker to access AWS resources like S3
role = sagemaker.get_execution_role()

# Get the AWS region where this notebook is running
region = boto3.Session().region_name

# Get the default SageMaker S3 bucket for this session
# This bucket will store training data, model artifacts, and outputs
bucket_name = sagemaker.Session().default_bucket()

# Define a prefix (folder path) in S3 to organize our files
prefix = 'xgboost-as-a-built-in-algo'

# Print the bucket name for verification
print(f"Using S3 bucket: {bucket_name}")
print(f"IAM Role: {role}")
print(f"Region: {region}")

**What we did in the above cell:**
- **get_execution_role()** - we are retrieving the IAM role that SageMaker will use to access S3 and other AWS services
- **region_name** - we are identifying the AWS region where our resources are located
- **default_bucket()** - we are getting the default S3 bucket created by SageMaker for storing data
- **prefix** - we are defining a folder structure in S3 to keep our project files organized

In [None]:
# Retrieve the container image URI for the XGBoost algorithm
# SageMaker provides pre-built containers for popular algorithms like XGBoost
container = sagemaker.image_uris.retrieve(
    region=region,  # AWS region
    framework='xgboost',  # The ML framework we want to use
    version='1.0-1'  # Specific version of XGBoost
)

# Print the container URI
print(f"XGBoost container image: {container}")

**What we did in the above cell:**
- **image_uris.retrieve()** - we are fetching the Docker container image URI for the XGBoost algorithm
- This container has the XGBoost algorithm pre-installed and optimized for SageMaker
- The version '1.0-1' specifies which XGBoost version we want to use

## 3. Load and Explore the Dataset

In this task, you will download the banking dataset as a CSV file, load it into a pandas DataFrame, and examine its structure to understand the features and target variable.

In [None]:
# Download the cleaned banking dataset and save it locally as bank_clean.csv
try:
    urllib.request.urlretrieve(
        "https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv",
        "bank_clean.csv"
    )
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    # Network or permission problems end up here
    print('Data load error:', e)

In [None]:
# Try to load the dataset from a CSV file
try:
    # Read the CSV file into a pandas DataFrame
    # index_col=0 means the first column will be used as the row index
    model_data = pd.read_csv('./bank_clean.csv', index_col=0)

    # Print success message if data loads successfully
    print('Success: Data loaded into dataframe.')

except Exception as e:
    # If there's an error (file not found, corrupted file, etc.), print the error message
    print('Data load error:', e)

**What we did in the above cell:**
- **pd.read_csv()** - we are reading the CSV file and loading it into a pandas DataFrame
- **index_col=0** - we are setting the first column as the index (row labels)
- **try-except block** - we are using error handling to catch any issues during file loading
- If the file loads successfully, we print a success message; otherwise, we print the error
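
A quick look at the loaded frame can confirm its size and class balance before splitting. The following is a minimal sketch, assuming the one-hot encoded target columns `y_yes` and `y_no` that later cells rely on. The `summarize` helper and the tiny `demo` frame are hypothetical stand-ins so the snippet runs on its own.

```python
import pandas as pd

def summarize(df: pd.DataFrame, target: str = "y_yes") -> dict:
    """Return row/column counts, the share of positive labels, and missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "positive_rate": float(df[target].mean()),
        "missing_values": int(df.isna().sum().sum()),
    }

# Hypothetical stand-in for model_data (the real file has many more columns)
demo = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "y_no": [1, 1, 0, 1],
    "y_yes": [0, 0, 1, 0],
})

summary = summarize(demo)
print(summary)
```

In the notebook, `summarize(model_data)` would report the same figures for the real dataset, and `model_data.head()` would show the first rows.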

## 4. Split Data into Training and Testing Sets

In this task, you will split the dataset into training (70%) and testing (30%) sets. The training set is used to train the model, while the testing set is used to evaluate its performance.

In [None]:
# Shuffle the data randomly to ensure random distribution
# frac=1 means we shuffle 100% of the data
# random_state=1729 ensures reproducibility (same shuffle every time)
shuffled = model_data.sample(frac=1, random_state=1729)

# Calculate the split index for 70% training data
# int() converts to integer, 0.7 means 70% of total data
train_size = int(0.7 * len(shuffled))

# Create training dataset (first 70% of shuffled data)
# iloc[:train_size] selects rows from index 0 to train_size
# copy() creates a separate copy to avoid modifying original data
train_data = shuffled.iloc[:train_size].copy()

# Create testing dataset (remaining 30% of shuffled data)
# iloc[train_size:] selects rows from train_size to the end
test_data = shuffled.iloc[train_size:].copy()

# Print the shapes (rows, columns) of both datasets to verify the split
print(train_data.shape, test_data.shape)

# Display the first 5 rows of training data to verify columns and structure
print(train_data.head())

**What we did in the above cell:**
- **sample(frac=1)** - we are shuffling the entire dataset randomly to avoid any ordering bias
- **random_state=1729** - we are setting a seed value to ensure the same shuffle occurs every time (reproducibility)
- **train_size** - we are calculating 70% of the total data size for training
- **iloc[:train_size]** - we are selecting the first 70% rows for training data
- **iloc[train_size:]** - we are selecting the remaining 30% rows for testing data
- **copy()** - we are creating independent copies to prevent unintended modifications
- The split ensures the model is trained on 70% data and tested on unseen 30% data
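
The shuffle-then-slice pattern above can be wrapped in a small function and checked for overlap. This is a sketch under assumptions: `shuffle_split` is a hypothetical helper name, and the ten-row `demo` frame stands in for `model_data`.

```python
import pandas as pd

def shuffle_split(df: pd.DataFrame, train_frac: float = 0.7, seed: int = 1729):
    """Shuffle all rows with a fixed seed, then cut at train_frac."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut].copy(), shuffled.iloc[cut:].copy()

# Hypothetical stand-in for model_data
demo = pd.DataFrame({"x": range(10), "y_yes": [0, 1] * 5})
train_demo, test_demo = shuffle_split(demo)

# No row may land in both sets, and together they must cover the whole frame
overlap = train_demo.index.intersection(test_demo.index)
print(len(train_demo), len(test_demo), len(overlap))
```

Because the seed is fixed, calling `shuffle_split(demo)` again returns the same rows in the same order, which is what makes the evaluation repeatable.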

## 5. Prepare and Upload Training Data to S3

In this task, you will prepare the training data in the format required by XGBoost (target column first, then features, no headers) and upload it to Amazon S3.

In [None]:
# Reformat the training data: target column first, then all features (XGBoost requirement)
# pd.concat() is used to combine columns
pd.concat(
    [
        train_data['y_yes'],  # Target column (1 if customer subscribed, 0 if not)
        train_data.drop(['y_no', 'y_yes'], axis=1)  # All feature columns (drop both target columns)
    ],
    axis=1  # Concatenate along columns (horizontally)
).to_csv('train.csv', index=False, header=False)  # Save as CSV without index and headers

# Upload the training CSV file to S3
# boto3.Session().resource('s3') creates an S3 resource client
# .Bucket(bucket_name) selects the specific S3 bucket
# .Object() specifies the file path in S3
# .upload_file() uploads the local file to S3
boto3.Session().resource('s3').Bucket(bucket_name).Object(
    f'{prefix}/train/train.csv'  # S3 key (always '/'-separated): xgboost-as-a-built-in-algo/train/train.csv
).upload_file('train.csv')  # Local file to upload

# Create a TrainingInput object that points to the S3 location of training data
# This object will be used by SageMaker to access the training data
s3_input_train = TrainingInput(
    s3_data=f's3://{bucket_name}/{prefix}/train',  # S3 URI where training data is stored
    content_type='text/csv'  # Specify that the data is in CSV format
)

# Print confirmation message
print("Training data uploaded and TrainingInput created successfully.")

**What we did in the above cell:**
- **pd.concat()** - we are combining the target column ('y_yes') and feature columns into a single DataFrame
- **drop(['y_no', 'y_yes'])** - we are removing both target columns from features (XGBoost only needs one target column)
- **to_csv(index=False, header=False)** - we are saving the data without row indices and column headers (XGBoost format requirement)
- **boto3 S3 upload** - we are uploading the CSV file to the specified S3 bucket and path
- **TrainingInput()** - we are creating a SageMaker object that references the S3 location of training data
- The data is now stored in S3 and ready to be used by the SageMaker training job
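
To make the expected file layout concrete, the sketch below builds the same label-first, headerless CSV in memory and reads it back. The helper name `to_label_first_csv` and the two-row `demo` frame are hypothetical and used only for illustration.

```python
import io
import pandas as pd

def to_label_first_csv(df: pd.DataFrame, target: str = "y_yes",
                       drop=("y_no", "y_yes")) -> str:
    """Return CSV text with the label in column 0, no header, no index."""
    features = df.drop(columns=list(drop))
    ordered = pd.concat([df[target], features], axis=1)
    return ordered.to_csv(index=False, header=False)

# Hypothetical stand-in for train_data
demo = pd.DataFrame({
    "age": [34, 51],
    "campaign": [1, 3],
    "y_no": [1, 0],
    "y_yes": [0, 1],
})

csv_text = to_label_first_csv(demo)
print(csv_text)

# Read it back: column 0 is the label, the remaining columns are features
round_trip = pd.read_csv(io.StringIO(csv_text), header=None)
```

Checking `round_trip[0]` against the original `y_yes` column is a cheap way to catch a misplaced label column before starting a paid training job.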

## 6. Configure and Train the XGBoost Model

In this task, you will create a SageMaker Estimator, configure the XGBoost hyperparameters, and start the training job. The training process will take approximately 5-8 minutes.

In [None]:
# Create a SageMaker session to manage interactions with SageMaker services
sess = sagemaker.Session()

# Create an Estimator object for the XGBoost algorithm
# An Estimator encapsulates the training configuration
xgb = sagemaker.estimator.Estimator(
    image_uri=container,  # Docker image URI for XGBoost (retrieved earlier)
    role=role,  # IAM role that grants SageMaker permissions to access AWS resources
    instance_count=1,  # Number of training instances (1 for single-instance training)
    instance_type='ml.m4.xlarge',  # EC2 instance type for training (4 vCPUs, 16 GB RAM)
    output_path=f's3://{bucket_name}/{prefix}/output',  # S3 path where trained model will be saved
    sagemaker_session=sess  # SageMaker session object
)

# Set hyperparameters for the XGBoost algorithm
# Hyperparameters control how the model learns from data
xgb.set_hyperparameters(
    max_depth=5,  # Maximum depth of each tree (controls model complexity)
    eta=0.2,  # Learning rate (step size shrinkage to prevent overfitting)
    gamma=4,  # Minimum loss reduction required to make a split (regularization)
    min_child_weight=6,  # Minimum sum of instance weight needed in a child (regularization)
    subsample=0.8,  # Fraction of samples used for training each tree (prevents overfitting)
    objective='binary:logistic',  # Loss function for binary classification
    num_round=100  # Number of boosting rounds (trees to build)
)

# Start the training job
# This will launch an EC2 instance, load the data from S3, train the model, and save it back to S3
# The training will take approximately 5-8 minutes
xgb.fit({'train': s3_input_train})  # Pass the training data location

print("Training job completed!")

**What we did in the above cell:**
- **sagemaker.Session()** - we are creating a session to interact with SageMaker services
- **sagemaker.estimator.Estimator()** - we are defining the training job configuration (container, role, instance type, etc.)
- **instance_count=1** - we are using a single training instance (can be increased for distributed training)
- **instance_type='ml.m4.xlarge'** - we are specifying the EC2 instance type with 4 vCPUs and 16 GB RAM
- **output_path** - we are specifying where the trained model artifacts will be saved in S3
- **set_hyperparameters()** - we are configuring the XGBoost algorithm parameters:
  - **max_depth=5** - limits tree depth to prevent overfitting
  - **eta=0.2** - controls learning rate (smaller values = slower but more robust learning)
  - **gamma=4** - minimum loss reduction to create a new split (higher = more conservative)
  - **min_child_weight=6** - prevents overly specific splits
  - **subsample=0.8** - uses 80% of data for each tree (random sampling)
  - **objective='binary:logistic'** - optimizes for binary classification
  - **num_round=100** - builds 100 decision trees
- **fit()** - we are starting the training job by passing the training data location
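
The roles of `gamma` and `min_child_weight` are easier to see with numbers. The sketch below is a simplified view based on the split-gain formula from the XGBoost documentation's introduction to boosted trees; it is not the library's implementation. The gradient and hessian sums are made-up values, and `reg_lambda=1.0` is assumed as the L2 term.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split: improvement in structure score minus gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    improvement = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return improvement - gamma

def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0

# Made-up gradient (g) and hessian (h) sums for the two children of one split
kept_without_penalty = split_allowed(-4.0, 8.0, 5.0, 7.0, gamma=0.0, min_child_weight=6)
kept_with_gamma_4 = split_allowed(-4.0, 8.0, 5.0, 7.0, gamma=4.0, min_child_weight=6)
kept_with_light_child = split_allowed(-4.0, 5.0, 5.0, 7.0, gamma=0.0, min_child_weight=6)
print(kept_without_penalty, kept_with_gamma_4, kept_with_light_child)
```

In this example the split survives with no penalty, is rejected once `gamma=4` is subtracted, and is also rejected when one child's hessian sum falls below `min_child_weight=6`.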

## 7. Deploy the Trained Model to an Endpoint

In this task, you will deploy the trained model to a SageMaker endpoint, which allows you to make real-time predictions.

In [None]:
# Deploy the trained model to a SageMaker endpoint
# This creates a real-time inference endpoint that can accept prediction requests
xgb_predictor = xgb.deploy(
    initial_instance_count=1,  # Number of instances for the endpoint (1 for single instance)
    instance_type='ml.m4.xlarge'  # EC2 instance type for hosting the model
)

# Print confirmation message
print("Model deployed successfully!")

**What we did in the above cell:**
- **deploy()** - we are deploying the trained model to a real-time endpoint
- **initial_instance_count=1** - we are using 1 instance to host the model (can be scaled for high traffic)
- **instance_type='ml.m4.xlarge'** - we are specifying the instance type for the endpoint
- **xgb_predictor** - we are creating a predictor object that we'll use to make predictions
- The endpoint is now live and ready to accept prediction requests in real-time
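
Applications outside the notebook usually call the endpoint through the `sagemaker-runtime` client in boto3 rather than through the predictor object. The sketch below shows the shape of such a call. `invoke_csv` is a hypothetical helper, and `FakeRuntime` is a stand-in so the example runs without AWS credentials.

```python
import io

def invoke_csv(runtime_client, endpoint_name, rows):
    """Send rows as CSV text to an endpoint and return the response body as text."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    return response["Body"].read().decode("utf-8")

class FakeRuntime:
    """Hypothetical stand-in that returns one fixed score per input row."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        scores = ["0.5"] * len(Body.splitlines())
        return {"Body": io.BytesIO("\n".join(scores).encode("utf-8"))}

fake = FakeRuntime()
text = invoke_csv(fake, "demo-endpoint", [[34, 1, 0], [51, 3, 1]])
print(text)
```

Against the live endpoint, the same function would be called with `boto3.client("sagemaker-runtime")` and `xgb_predictor.endpoint_name`.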

## 8. Make Predictions on Test Data

In this task, you will use the deployed model endpoint to make predictions on the test dataset.

In [None]:
# Prepare the test data for prediction
# Remove the target columns ('y_no', 'y_yes') and convert to numpy array
# The model only needs feature columns, not the target
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values

# Set the serializer for the predictor to CSV format
# This tells the endpoint to expect data in CSV format
xgb_predictor.serializer = CSVSerializer()

# Make predictions using the deployed endpoint
# The endpoint returns predictions as a byte string
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')

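# Convert the prediction scores to a numpy array
# Depending on the container version, scores may arrive comma- or newline-separated,
# so commas are normalized to newlines before parsing
# fromstring() parses the string and creates an array
# sep='\n' indicates that values are separated by newlines
predictions_array = np.fromstring(predictions.replace(',', '\n'), sep='\n')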

# Print the shape of predictions array to verify
print(predictions_array.shape)

**What we did in the above cell:**
- **drop(['y_no', 'y_yes'])** - we are removing the target columns from the test data
- **.values** - we are converting the DataFrame to a numpy array (required for prediction)
- **CSVSerializer()** - we are setting the data format for sending requests to the endpoint
- **predict()** - we are sending the test data to the endpoint and receiving predictions
- **.decode('utf-8')** - we are converting the byte response to a readable string
- **np.fromstring()** - we are parsing the newline-separated string into a numpy array of prediction scores
- Each value in predictions_array represents the probability that a customer will subscribe (0 to 1)
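
Real-time endpoints limit the size of a single request, so a large test set may need to be sent in slices. The following is a sketch under assumptions: `predict_in_chunks` and `parse_scores` are hypothetical helpers, the chunk size of 500 rows is arbitrary, and `fake_predict` stands in for `xgb_predictor.predict`.

```python
import numpy as np

def parse_scores(raw: bytes) -> np.ndarray:
    """Parse comma- or newline-separated scores into a float array."""
    text = raw.decode("utf-8").replace(",", " ")
    return np.array(text.split(), dtype=float)

def predict_in_chunks(predict_fn, features: np.ndarray,
                      rows_per_chunk: int = 500) -> np.ndarray:
    """Call predict_fn on consecutive row slices and concatenate the parsed scores."""
    parts = []
    for start in range(0, len(features), rows_per_chunk):
        batch = features[start:start + rows_per_chunk]
        parts.append(parse_scores(predict_fn(batch)))
    return np.concatenate(parts) if parts else np.array([], dtype=float)

# Hypothetical stand-in for xgb_predictor.predict: one score per input row
def fake_predict(batch):
    return "\n".join("0.25" for _ in batch).encode("utf-8")

demo_features = np.zeros((1200, 3))
scores = predict_in_chunks(fake_predict, demo_features)
print(scores.shape)
```

With the deployed model, `predict_in_chunks(xgb_predictor.predict, test_data_array)` would return one score per test row in the original order.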

## 9. Evaluate Model Performance with Confusion Matrix

In this task, you will create a confusion matrix to evaluate the model's performance by comparing actual vs predicted values.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# -------------------------------
# Confusion Matrix (SAFE VERSION)
# -------------------------------

# test_data['y_yes'] holds the true labels; predictions_array holds predicted probabilities, rounded to 0 or 1 below
cm = pd.crosstab(
    index=test_data['y_yes'],                 # Actual values (0 = No Purchase, 1 = Purchase)
    columns=np.round(predictions_array),      # Predicted values (rounded to 0 or 1)
    rownames=['Observed'],
    colnames=['Predicted']
)

# Force the matrix to be 2x2 (prevents a KeyError below if a class is never predicted)
cm = cm.reindex(index=[0, 1], columns=[0, 1], fill_value=0)

# -------------------------------
# Extract Values (Correctly)
# -------------------------------

tn = cm.loc[0, 0]   # True Negatives
fp = cm.loc[0, 1]   # False Positives
fn = cm.loc[1, 0]   # False Negatives
tp = cm.loc[1, 1]   # True Positives

# -------------------------------
# Accuracy
# -------------------------------

accuracy = (tp + tn) / (tp + tn + fp + fn) * 100

# -------------------------------
# Formatted Output
# -------------------------------

print("Confusion Matrix:")
print(cm)
print()

print(f"Overall Classification Rate: {accuracy:.2f}%\n")

# Display confusion matrix in a more readable format
# Percentages are column-wise: the share of each predicted class that was correct
print("{:<15}{:>18}{:>18}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")

print("{:<15}{:>18}{:>18}".format(
    "No Purchase",
    f"{tn} ({(tn/(tn+fn)*100 if (tn+fn)>0 else 0):.1f}%)",
    f"{fp}"
))

print("{:<15}{:>18}{:>18}".format(
    "Purchase",
    f"{fn}",
    f"{tp} ({(tp/(tp+fp)*100 if (tp+fp)>0 else 0):.1f}%)"
))

# -------------------------------
# Visual Representation
# -------------------------------

# Create a heatmap for better visualization of confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Purchase', 'Purchase'], yticklabels=['No Purchase', 'Purchase'])
plt.title('Confusion Matrix Visualization')
plt.xlabel('Predicted')
plt.ylabel('Observed')
plt.show()


**What we did in the above cell:**
- **pd.crosstab()** - we are creating a confusion matrix that compares actual vs predicted values
- **np.round(predictions_array)** - we are rounding prediction probabilities to 0 or 1 for classification
- **tn (True Negatives)** - we are counting correctly predicted 'No Purchase' cases
- **fp (False Positives)** - we are counting incorrectly predicted 'Purchase' cases
- **fn (False Negatives)** - we are counting missed 'Purchase' cases (predicted as 'No Purchase')
- **tp (True Positives)** - we are counting correctly predicted 'Purchase' cases
- **Accuracy** - we are calculating the percentage of correct predictions: (TP + TN) / Total
- The confusion matrix helps us understand where the model is making errors and its overall performance
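
Accuracy alone can look good when one class is much rarer than the other, so precision and recall for the 'Purchase' class are worth computing from the same four counts. The sketch below is a minimal illustration. `classification_metrics` is a hypothetical helper, the labels and scores are made up, and the explicit `>= threshold` rule is an assumption that differs from `np.round` only for scores of exactly 0.5.

```python
import numpy as np

def classification_metrics(y_true, scores, threshold=0.5):
    """Threshold the scores, count the four outcomes, and derive common metrics."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(scores, dtype=float) >= threshold).astype(int)

    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Made-up labels and scores for illustration
labels = [0, 0, 0, 0, 1, 1]
demo_scores = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]

at_default = classification_metrics(labels, demo_scores)
at_lower = classification_metrics(labels, demo_scores, threshold=0.35)
print(at_default)
print(at_lower)
```

In the notebook, `classification_metrics(test_data['y_yes'], predictions_array)` would give the same counts as the confusion matrix, and lowering the threshold trades precision for recall.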

## 10. Clean Up Resources

In this task, you will delete the deployed endpoint, model, and S3 objects to avoid incurring unnecessary costs. **Important:** Always clean up resources after completing your work to prevent ongoing charges.

In [None]:
import boto3
from botocore.exceptions import ClientError

sm_client = boto3.client("sagemaker")

endpoint_name = xgb_predictor.endpoint_name

print("Starting SageMaker cleanup...\n")

# -------------------------------
# FIND ENDPOINT CONFIG NAME
# -------------------------------
try:
    endpoint_desc = sm_client.describe_endpoint(
        EndpointName=endpoint_name
    )
    endpoint_config_name = endpoint_desc["EndpointConfigName"]
except ClientError as e:
    print(f"⚠️ Endpoint not found: {endpoint_name}")
    endpoint_config_name = None

# -------------------------------
# FIND MODEL NAME
# -------------------------------
model_name = None
if endpoint_config_name:
    try:
        endpoint_config = sm_client.describe_endpoint_config(
            EndpointConfigName=endpoint_config_name
        )
        model_name = endpoint_config["ProductionVariants"][0]["ModelName"]
    except ClientError:
        pass

# -------------------------------
# DELETE ENDPOINT
# -------------------------------
try:
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    print(f"✅ Endpoint deleted: {endpoint_name}")
except ClientError as e:
    print(f"⚠️ Endpoint already deleted: {endpoint_name}")

# -------------------------------
# DELETE ENDPOINT CONFIG
# -------------------------------
if endpoint_config_name:
    try:
        sm_client.delete_endpoint_config(
            EndpointConfigName=endpoint_config_name
        )
        print(f"✅ Endpoint config deleted: {endpoint_config_name}")
    except ClientError:
        print(f"⚠️ Endpoint config already deleted: {endpoint_config_name}")

# -------------------------------
# DELETE MODEL
# -------------------------------
if model_name:
    try:
        sm_client.delete_model(ModelName=model_name)
        print(f"✅ Model deleted: {model_name}")
    except ClientError:
        print(f"⚠️ Model already deleted: {model_name}")

print("\n🎯 SageMaker cleanup completed successfully.")


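**What we did in the above cell:**
- **describe_endpoint()** - we are looking up the endpoint configuration name attached to the endpoint
- **describe_endpoint_config()** - we are reading the model name from the first production variant
- **delete_endpoint()** - we are deleting the real-time endpoint, which stops the hosting instance and its charges
- **delete_endpoint_config()** - we are deleting the endpoint configuration
- **delete_model()** - we are deleting the SageMaker model resource
- Deleting the model removes its configuration from SageMaker, but the trained model artifacts (.tar.gz file) remain in S3 for future use
- Each step is wrapped in a try-except block so the cell can be re-run safely if some resources are already gone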

In [None]:
# Delete all files from the S3 bucket prefix (optional cleanup)
# This removes training data and model artifacts to free up S3 storage
# Create an S3 resource client
s3 = boto3.resource('s3')

# Get the bucket object
bucket = s3.Bucket(bucket_name)

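# Delete all objects under our prefix (xgboost-as-a-built-in-algo/)
# The trailing slash keeps the filter from matching other prefixes that merely start with the same text
# This removes train.csv, model artifacts, and output files
bucket.objects.filter(Prefix=f'{prefix}/').delete()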

# Print confirmation message
print(f"All objects with prefix '{prefix}' deleted from S3 bucket '{bucket_name}'!")

**What we did in the above cell:**
- **boto3.resource('s3')** - we are creating an S3 resource client to interact with S3 buckets
- **s3.Bucket(bucket_name)** - we are accessing the specific S3 bucket that stored our data
- **bucket.objects.filter(Prefix=prefix)** - we are selecting all objects that start with our prefix 'xgboost-as-a-built-in-algo/'
- **.delete()** - we are deleting all those objects (training data, model artifacts, output files)
- This cleanup step removes all project files from S3 to avoid storage costs
- **Note:** This is optional - you may want to keep model artifacts for future redeployment
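
Since the S3 delete cannot be undone, it can help to list what a prefix matches before removing anything. The sketch below is a dry run under assumptions: `list_keys` is a hypothetical helper, and the fake bucket with its made-up keys and sizes stands in for `boto3.resource('s3').Bucket(bucket_name)`.

```python
from types import SimpleNamespace

def list_keys(bucket, prefix):
    """Return (key, size) pairs for objects under prefix without deleting anything."""
    return [(obj.key, obj.size) for obj in bucket.objects.filter(Prefix=prefix)]

class FakeObjects:
    """Hypothetical stand-in for a bucket's object collection."""
    def __init__(self, items):
        self._items = items

    def filter(self, Prefix):
        return [item for item in self._items if item.key.startswith(Prefix)]

# Made-up keys and sizes for illustration
fake_bucket = SimpleNamespace(objects=FakeObjects([
    SimpleNamespace(key="xgboost-as-a-built-in-algo/train/train.csv", size=1024),
    SimpleNamespace(key="xgboost-as-a-built-in-algo/output/model.tar.gz", size=2048),
    SimpleNamespace(key="other-project/data.csv", size=10),
]))

to_delete = list_keys(fake_bucket, "xgboost-as-a-built-in-algo/")
for key, size in to_delete:
    print(f"{size:>8}  {key}")
```

With the real bucket object from the cell above, `list_keys(bucket, f'{prefix}/')` would show exactly which objects the delete call is about to remove.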