# AWS Demonstration 


This notebook shows how a model can be deployed on AWS for production use. Every step is documented in the notebook.

Follow these steps to create the notebook instance in AWS:
1. Create an AWS account (creating the account itself is free)
2. Log in to the AWS Management Console
3. Search for Amazon SageMaker
4. In the left-hand navigation panel, go to Notebook > Notebook Instances > Create Notebook Instance
5. Create an IAM role while creating the notebook instance (this is an essential part of the process because it controls what the notebook is allowed to access)
6. Instance creation takes some time. Once it is done, open the instance and start writing the code below

The demonstration uses an XGBoost model.

The process goes as follows:
1. Set up environment
2. Download and split dataset
3. Model
4. Deployment 
5. Predictions
6. Delete endpoint

## Environment & path set up

Some libraries must be available before we begin:
1. sagemaker -- built-in engine for modeling and deployment
2. boto3 -- the AWS SDK for Python, used to connect this notebook instance to AWS services such as S3

Specific imports:
1. get_image_uri -- because an AWS built-in model is used, its container image is fetched through this function
2. csv_serializer -- used at prediction time, because the input is supplied as CSV (serialization of the input)

In [31]:
import urllib.request
import os
import pandas as pd
import numpy as np
import sagemaker 
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri 
from sagemaker.session import s3_input, Session
from sagemaker.predictor import csv_serializer 

This step creates the S3 bucket automatically:
1. specify a bucket name that is still available (bucket names must be globally unique)
2. fetch the region name (Note: S3 bucket names are global, but each bucket is created in a specific region)

(Note: an S3 (Simple Storage Service) bucket is a scalable data storage container.
Region name: the location where the operations are performed (e.g. N. Virginia))
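
Bucket creation fails when the name is invalid or already taken, so it can help to check the name locally first. The sketch below is an illustration only: `is_valid_bucket_name` is a hypothetical helper, and it checks just the core S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). It cannot tell whether another account already uses the name.

```python
import re

# Core S3 naming rules only: 3-63 characters, lowercase letters, digits, dots
# and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')

def is_valid_bucket_name(name):
    """Return True if `name` passes the basic S3 bucket naming rules."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None

print(is_valid_bucket_name('ba-data-112233'))  # True
print(is_valid_bucket_name('BA_Data'))         # False (uppercase and underscore)
```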

In [2]:
bucket_name = 'ba-data-112233'
#get region name
my_region = boto3.session.Session().region_name
print(my_region)

us-east-1


The code below connects to S3 (using boto3) and creates the bucket for the project.

In [3]:
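# get access to S3
s3 = boto3.resource('s3')
# create the bucket
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default region and takes no location constraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        # every other region has to be passed explicitly
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)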

S3 bucket created successfully


The bucket will be used for storing everything produced by the workflow. This involves:
1. Original data file
2. Train and test data (created by the splitting step)
3. model file
4. predictions

In [4]:
# set an output path where the trained model will be saved
prefix = 'xgboost_model'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://ba-data-112233/xgboost_model/output
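
The S3 locations used in this notebook all follow the same `s3://bucket/prefix/...` pattern. As a small sketch, the hypothetical helper `s3_uri` below builds such paths with `posixpath.join`, which always uses forward slashes (unlike `os.path.join`, which follows the separator of the local operating system). The bucket, prefix, and folder names are the ones used in this notebook.

```python
import posixpath

def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return 's3://' + posixpath.join(bucket, *parts)

bucket_name = 'ba-data-112233'
prefix = 'xgboost_model'

print(s3_uri(bucket_name, prefix, 'output'))
# s3://ba-data-112233/xgboost_model/output
print(s3_uri(bucket_name, prefix, 'train', 'train.csv'))
# s3://ba-data-112233/xgboost_model/train/train.csv
```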


## Data download and splitting

The data is downloaded from GitHub using the urllib library and then loaded into a pandas DataFrame.

In [5]:
# Download the data to the local disk of the notebook instance
try:
    urllib.request.urlretrieve('https://raw.githubusercontent.com/jyotiyadav99111/AWS_Bank_Applkication-/main/bank_data.csv', 'bank_data.csv')
    print('Data Downloaded successfully')
except Exception as e:
    print('Downloading error: ',e)

Data Downloaded successfully


In [6]:
# load the dataset into a pandas DataFrame
try:
    df = pd.read_csv('./bank_data.csv')   # relative path on the notebook instance, where the file was downloaded
    print('Dataframe created successfully')
except Exception as e:
    print('Dataframe creation error: ', e)

Dataframe created successfully


In [7]:
# Train and test data split
train_data, test_data = np.split(df.sample(frac=1, random_state=1729), [int(0.8 * len(df))])
print(train_data.shape, test_data.shape)

(32950, 62) (8238, 62)
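
The split above shuffles all rows and then cuts the shuffled data at 80%. The sketch below reproduces only that logic in plain Python, without pandas, to show where the two sizes in the output come from. `shuffle_split` is a hypothetical helper, and the total of 41188 rows is simply the sum of the two row counts printed above; the shuffle order will differ from the one pandas produces.

```python
import random

def shuffle_split(rows, train_frac=0.8, seed=1729):
    """Shuffle a copy of `rows`, then cut it at int(train_frac * len(rows))."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the actual records.
train_rows, test_rows = shuffle_split(range(41188))
print(len(train_rows), len(test_rows))  # 32950 8238
```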


In [8]:
# SageMaker's built-in XGBoost expects CSV input with the target variable in the first column and no header row
# save the training data as CSV in that layout
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
# upload the file to the S3 bucket under the 'train' folder
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
# describe the uploaded training data as an input channel; it is passed to fit() later
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

In [9]:
# Repeat the same for test data

pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')
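
The two cells above rearrange the columns so that the label comes first and write the file without a header or index. Below is a minimal sketch of that layout on a toy table, using only the standard library. The feature names `age` and `balance` and all values are made up for illustration, while `y_yes` and `y_no` are the label columns from this dataset; `label_first_csv` is a hypothetical helper.

```python
import csv
import io

def label_first_csv(rows, label, drop):
    """Write dict rows as header-less CSV text with `label` as the first column."""
    features = [c for c in rows[0] if c != label and c not in drop]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    for row in rows:
        writer.writerow([row[label]] + [row[c] for c in features])
    return buf.getvalue()

# Made-up rows; only the column roles mirror the notebook.
toy_rows = [
    {'age': 41, 'balance': 1200, 'y_no': 1, 'y_yes': 0},
    {'age': 29, 'balance': 300, 'y_no': 0, 'y_yes': 1},
]
print(label_first_csv(toy_rows, label='y_yes', drop=['y_no']))
# 0,41,1200
# 1,29,300
```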

## XGBoost Model 

AWS built-in models are packaged as container images. The image has to be pulled and loaded onto the instance before it can be used.

In [25]:
# any built-in algorithm can be fetched this way (not only xgboost)
# in sagemaker>=2 the replacement for get_image_uri is sagemaker.image_uris.retrieve
xgboost_container = get_image_uri(boto3.Session().region_name,'xgboost', repo_version='latest')

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [26]:
# These hyperparameters were already tuned on a local machine, because tuning on AWS would be slower and more costly
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":50
        }

In [28]:
# This is a general method that works for any ML algorithm; the algorithm is determined by the container image passed in
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          train_instance_count=1, 
                                          train_instance_type='ml.m5.2xlarge', 
                                          train_volume_size=5, # 5 GB 
                                          output_path=output_path,
                                          train_use_spot_instances=True,
                                          train_max_run=300,
                                          train_max_wait=600)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_use_spot_instances has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_wait has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
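
The warnings above come from version 2 of the SageMaker Python SDK, which dropped the `train_` prefix from these Estimator arguments. As a sketch, the mapping below lists the new names for the six arguments flagged in the warnings, and the hypothetical helper `to_v2_kwargs` renames a keyword dictionary accordingly. Nothing here calls AWS; it only illustrates the renaming.

```python
# Old (v1) Estimator argument names and their SageMaker SDK v2 replacements.
V2_RENAMES = {
    'train_instance_count': 'instance_count',
    'train_instance_type': 'instance_type',
    'train_volume_size': 'volume_size',
    'train_use_spot_instances': 'use_spot_instances',
    'train_max_run': 'max_run',
    'train_max_wait': 'max_wait',
}

def to_v2_kwargs(kwargs):
    """Return a copy of `kwargs` with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(key, key): value for key, value in kwargs.items()}

# The resource settings used in the cell above, under their old names.
old_kwargs = {
    'train_instance_count': 1,
    'train_instance_type': 'ml.m5.2xlarge',
    'train_volume_size': 5,
    'train_use_spot_instances': True,
    'train_max_run': 300,
    'train_max_wait': 600,
    'output_path': 's3://ba-data-112233/xgboost_model/output',
}
new_kwargs = to_v2_kwargs(old_kwargs)
print(new_kwargs)
```

The renamed dictionary could then be unpacked into the constructor, for example `sagemaker.estimator.Estimator(image_uri=xgboost_container, role=..., hyperparameters=hyperparameters, **new_kwargs)`.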


In [29]:
# final training of the model with the chosen parameters
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2021-07-24 12:07:55 Starting - Starting the training job...
2021-07-24 12:08:20 Starting - Launching requested ML instances
ProfilerReport-1627128475: InProgress
......
2021-07-24 12:09:20 Starting - Preparing the instances for training......
2021-07-24 12:10:20 Downloading - Downloading input data
2021-07-24 12:10:20 Training - Downloading the training image...
2021-07-24 12:10:53 Uploading - Uploading generated training model
Arguments: train
[2021-07-24:12:10:48:INFO] Running standalone xgboost training.
[2021-07-24:12:10:48:INFO] File size need to be processed in the node: 5.05mb. Available memory size in the node: 23804.71mb
[2021-07-24:12:10:48:INFO] Determined delimiter of CSV input is ','
[12:10:48] S3DistributionType set as FullyReplicated
[12:10:48] 32950x60 matrix with 1977000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-07-24:12:10:48:INFO] Determined delimiter of CSV input

## Deploy the model as an endpoint

In [30]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

---------------!

## Predictions

In [33]:
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


(8238,)
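
The endpoint returns its predictions as text, which the cell above turns into a numeric array. The sketch below shows parsing plus the thresholding at 0.5 that is applied later with `np.round`, in plain Python. It assumes the response is a single comma-separated string of probabilities; the sample string is made up, and `parse_predictions` and `to_labels` are hypothetical helpers.

```python
def parse_predictions(payload):
    """Convert a comma-separated string of probabilities into a list of floats."""
    return [float(value) for value in payload.strip().split(',') if value.strip()]

def to_labels(probabilities, threshold=0.5):
    """Map each probability to class 1 if it exceeds `threshold`, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]

sample_response = '0.05,0.91,0.48,0.73'  # made-up endpoint response
probabilities = parse_predictions(sample_response)
print(probabilities)             # [0.05, 0.91, 0.48, 0.73]
print(to_labels(probabilities))  # [0, 1, 0, 1]
```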


The data is clearly unbalanced, but the main focus here is on the AWS demonstration. The code below is taken from the AWS documentation.

In [34]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.8%

Predicted      No Purchase    Purchase
Observed
No Purchase    91% (7132)    34% (138)
Purchase        9% (706)     66% (262) 
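
Because the classes are unbalanced, the overall classification rate hides how the model does on the minority class. As a rough illustration, the sketch below recomputes accuracy, precision, and recall for the "Purchase" class from the four counts in the table above. `classification_metrics` is a hypothetical helper and not part of the AWS example.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision and recall for the positive class."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),  # share of predicted purchases that were real
        'recall': tp / (tp + fn),     # share of real purchases that were found
    }

# Counts taken from the confusion matrix printed above.
metrics = classification_metrics(tn=7132, fp=138, fn=706, tp=262)
for name, value in metrics.items():
    print('{:<10}{:.1f}%'.format(name, value * 100))
# accuracy  89.8%
# precision 65.5%
# recall    27.1%
```

With these counts the model finds only about 27% of the observed purchases, even though the overall rate is close to 90%.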



## Delete the endpoint and all the corresponding data

Once the model has been created and everything has been accomplished, it is a good idea to delete the endpoint, the data, and the other files so that they do not keep incurring charges.

In [35]:
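# delete the endpoint (in sagemaker>=2 the attribute is named endpoint_name)
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
# delete every object in the bucket (the empty bucket itself is kept)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()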

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


[{'ResponseMetadata': {'RequestId': 'H3EKGCJE9Q5DAZXC',
   'HostId': '996z3I7Gk0t8+SoWa2hb0mjB/xSODxA4dbY9F9MJRGWgOlWq90aKtUfXtKc3+mgnLfb44KHw9xk=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': '996z3I7Gk0t8+SoWa2hb0mjB/xSODxA4dbY9F9MJRGWgOlWq90aKtUfXtKc3+mgnLfb44KHw9xk=',
    'x-amz-request-id': 'H3EKGCJE9Q5DAZXC',
    'date': 'Sat, 24 Jul 2021 12:38:48 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'xgboost_model/output/xgboost-2021-07-24-12-07-55-586/profiler-output/framework/training_job_end.ts'},
   {'Key': 'xgboost_model/output/xgboost-2021-07-24-12-07-55-586/profiler-output/system/incremental/2021072412/1627128600.algo-1.json'},
   {'Key': 'xgboost_model/test/test.csv'},
   {'Key': 'xgboost_model/train/train.csv'},
   {'Key': 'xgboost_model/output/xgboost-2021-07-24-12-07-55-586/output/model.tar.gz'},
   {'Key': 'xgboost_model/output