# Deploying a Machine Learning Model

In this lab you will train and deploy a model using the Linear Learner algorithm.  You will test that model here within this notebook.

You will then train and deploy a model using XGBoost as the algorithm (with the same data set).

Following that deployment, you will create a new endpoint configuration that adds the new XGBoost model as a production variant without sending any traffic to it. Then, once you are ready to perform your "blue/green" deployment, you will update the weights of the two models within the endpoint to "swap" from the Linear Learner model to the XGBoost model.



The first few steps of this lab have to do with a process called "feature engineering".  While we make some very basic adjustments to our core dataset in this lab, please know that there are several other steps we could and should take to make the model more accurate if this were a real-world project.  Those extra steps have been left out of this lab for the sake of simplicity.



## Step 1 - Load your data

We are using a set of housing data from King County (the Seattle area) for our training.  This data is in a CSV file and we need to read it into a Pandas dataframe which we will call "df".

Once the data is loaded, we call the `head` method to get a look at the first few rows.

We will start by loading several libraries and utilities we will use throughout the lab.

In [1]:
# Load libraries and utilities needed throughout the lab

import io
import math
import os
import re
import time
from io import StringIO

import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
import sagemaker.amazon.common as smac
import seaborn as sns
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn import metrics
from sklearn.model_selection import train_test_split




In [2]:
# Load the dataset into initial dataframe df
df = pd.read_csv("kc_house_data_2.csv")

# Review the first 5 rows of the data
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,N,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,N,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,N,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,N,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,N,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


As you can see, this dataset has 21 columns:
* `id` - Unique id number
* `date` - Date of the house sale
* `price` - Price the house sold for
* `bedrooms` - Number of bedrooms
* `bathrooms` - Number of bathrooms
* `sqft_living` - Number of square feet of the living space
* `sqft_lot` - Number of square feet of the lot
* `floors` - Number of floors in the house
* `waterfront` - Whether the home is on the waterfront
* `view` - Number of lot sides with a view
* `condition` - Condition of the house
* `grade` - Classification by construction quality 
* `sqft_above` - Number of square feet above ground
* `sqft_basement` - Number of square feet below ground
* `yr_built` - Year built
* `yr_renovated` - Year renovated
* `zipcode` - ZIP code
* `lat` - Latitude
* `long` - Longitude
* `sqft_living15` - Number of square feet of living space in 2015 (can differ from `sqft_living` in the case of recent renovations)
* `sqft_lot15` - Number of square feet of lot space in 2015 (can differ from `sqft_lot` in the case of recent renovations)

This is a rich dataset, and we could do a tremendous amount of feature engineering on it. However, this lab is focused on deploying our model, so we will take only the basic feature engineering steps that training requires.  Let's start by copying some data from our dataframe `df` into a new one called `data`.  Doing this lets us keep the original dataframe in case we would like to go back to it.


## Step 2 - Create our working dataframe

We start by adding the square feet of living space to our new dataframe.

In [4]:
# Add sqft living to our new dataframe called 'data'
data = df[['sqft_living']].copy()
data.head()

Unnamed: 0,sqft_living
0,1180
1,2570
2,770
3,1960
4,1680


Next, we add other features which do not need to be converted or "engineered".

In [5]:
data['bedrooms'] = df['bedrooms']
data['bathrooms'] = df['bathrooms']
data['sqft_lot'] = df['sqft_lot']
data['floors'] = df['floors']
data.head()

Unnamed: 0,sqft_living,bedrooms,bathrooms,sqft_lot,floors
0,1180,3,1.0,5650,1.0
1,2570,3,2.25,7242,2.0
2,770,2,1.0,10000,1.0
3,1960,4,3.0,5000,1.0
4,1680,3,2.0,8080,1.0


## Step 3 - Basic Feature Engineering

### Categorical variables

Let's start by including some categorical features, beginning with simple binary variables.

The dataset has the `waterfront` feature, which is a binary variable. We should change the encoding from `'Y'` and `'N'` to `1` and `0`. 

This can be done using the `map` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)) provided by Pandas.  It expects either a function to apply to that column or a dictionary to look up the correct transformation.

In [7]:
# use the map function to add waterfront information as a binary categorical variable.

data['waterfront'] = df['waterfront'].map({'Y':1, 'N':0})


In [31]:
# Inspect the column before encoding: it is a pandas Series with the default row index
type(df['condition'])
df['condition'].index.unique()

RangeIndex(start=0, stop=21613, step=1)

You can also encode multi-class categorical variables. Look at the column `condition`, which gives a score for the condition of the house. Looking into the [data source](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#b) shows that the condition can be thought of as an ordinal categorical variable, so it makes sense to encode it in a way that preserves the order.

> Using the same map method as above, we will encode the ordinal categorical variable `condition` into the numerical range of 1 through 5.

In [34]:
# use the map function to add condition as an ordinal categorical variable.

data['condition'] = df['condition'].map({'Poor':1, 'Fair':2, 'Average':3, 'Good':4, 'Very Good':5})
data.head()

Unnamed: 0,sqft_living,bedrooms,bathrooms,sqft_lot,floors,waterfront,condition
0,1180,3,1.0,5650,1.0,0,3
1,2570,3,2.25,7242,2.0,0,3
2,770,2,1.0,10000,1.0,0,3
3,1960,4,3.0,5000,1.0,0,5
4,1680,3,2.0,8080,1.0,0,3
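
To make the dictionary lookup concrete, here is a plain-Python sketch of what `map` does when it is given a dictionary. It is an illustration rather than the pandas implementation; the helper name `encode_with_dict` and the sample labels are assumptions made for the example. One behavior worth knowing is that pandas turns any label missing from the dictionary into `NaN`, which this sketch mimics with `None`.

```python
# Ordinal scale used for the `condition` column in this lab
CONDITION_ORDER = {'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5}


def encode_with_dict(values, mapping):
    """Look up each value in `mapping`; unknown labels become None
    (pandas would produce NaN in the same situation)."""
    return [mapping.get(value) for value in values]


# Hypothetical sample labels, including one that is not in the mapping
sample = ['Average', 'Very Good', 'Poor', 'Excellent']
encoded = encode_with_dict(sample, CONDITION_ORDER)
print(encoded)

# Checking for unmapped labels before training avoids silent missing values
unmapped = [label for label, code in zip(sample, encoded) if code is None]
print(unmapped)
```

Running a check like `unmapped` on the real column before training is a cheap way to catch a misspelled or unexpected category.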


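We will now use one-hot encoding to convert the nominal categorical feature `zipcode` into a set of binary columns. In the same cell we also apply min-max scaling to the larger features so that they fall in the range 0 to 1.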

In [40]:
# one hot encoding
data = pd.concat([data, pd.get_dummies(df['zipcode'])], axis=1)

# Scaling of larger features
sqft_min = data['sqft_living'].min()
sqft_max = data['sqft_living'].max()
data['sqft_living'] = data['sqft_living'].map(lambda x : (x-sqft_min)/(sqft_max - sqft_min))

sqft_min2 = data['sqft_lot'].min()
sqft_max2 = data['sqft_lot'].max()
data['sqft_lot'] = data['sqft_lot'].map(lambda x : (x-sqft_min2)/(sqft_max2 - sqft_min2))

cond_min = data['condition'].min()
cond_max = data['condition'].max()
data['condition'] = data['condition'].map(lambda x : (x-cond_min)/(cond_max - cond_min))

In [41]:
data.head()

Unnamed: 0,sqft_living,bedrooms,bathrooms,sqft_lot,floors,waterfront,condition,98001,98002,98003,...,98146,98148,98155,98166,98168,98177,98178,98188,98198,98199
0,0.06717,3,1.0,0.003108,1.0,0,0.5,False,False,False,...,False,False,False,False,False,False,True,False,False,False
1,0.172075,3,2.25,0.004072,2.0,0,0.5,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0.036226,2,1.0,0.005743,1.0,0,0.5,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0.126038,4,3.0,0.002714,1.0,0,1.0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0.104906,3,2.0,0.004579,1.0,0,0.5,False,False,False,...,False,False,False,False,False,False,False,False,False,False
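
The two transformations in the cell above can be sketched in plain Python. This is only an illustration of the arithmetic, not how pandas implements it; the function names `min_max_scale` and `one_hot` are assumptions made for the example.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 to avoid 0/0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (categories, rows) with one True/False column per distinct value."""
    categories = sorted(set(values))
    rows = [[v == c for c in categories] for v in values]
    return categories, rows


# Condition scores 1 through 5 are spread evenly across the range 0 to 1
scaled = min_max_scale([1, 2, 3, 4, 5])
print(scaled)

# ZIP codes taken from the first three rows of the dataset
categories, rows = one_hot([98178, 98125, 98028])
print(categories)
print(rows[0])
```

This matches the output above, where a condition of 3 ("Average") becomes 0.5 and a condition of 5 ("Very Good") becomes 1.0.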


Another method we can use to look at our data is the describe method.  This will give us some basic statistical information about our features.

In [39]:
data.describe()

Unnamed: 0,sqft_living,bedrooms,bathrooms,sqft_lot,floors,waterfront,condition
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,0.135087,3.370842,2.114757,0.008836,1.494309,0.007542,0.602357
std,0.069316,0.930062,0.770163,0.025091,0.539989,0.086517,0.162686
min,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,0.085811,3.0,1.75,0.002738,1.0,0.0,0.5
50%,0.122264,3.0,2.25,0.0043,1.5,0.0,0.5
75%,0.170566,4.0,2.5,0.006159,2.0,0.0,0.75
max,1.0,33.0,8.0,1.0,3.5,1.0,1.0


Now we will set up our training data, split it and convert it to RecordIO format for use in training our Linear Learner model.

In [42]:
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 bucket for saving code and model artifacts.
bucket = sagemaker.Session().default_bucket()
prefix = "linear-learner" #prefix is a sub-folder/key within the S3 bucket
output_location = 's3://{}/{}/output'.format(bucket, prefix)



In [44]:
# Split training, validation, and test
ys = np.array(df['price']).astype("float32")
xs = np.array(data).astype("float32")

np.random.seed(8675309)
train_features, test_features, train_labels, test_labels = train_test_split(xs, ys, test_size=0.2)
val_features, test_features, val_labels, test_labels = train_test_split(test_features, test_labels, test_size=0.5)
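
The two calls above produce roughly an 80/10/10 split. The row counts can be worked out with a short sketch; it assumes the held-out portion is rounded up, which is how scikit-learn treats a fractional `test_size`, and the function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, holdout_fraction=0.2, test_share_of_holdout=0.5):
    """Return (train, validation, test) row counts for the two-stage split.

    Assumes the held-out part is rounded up, as scikit-learn's
    train_test_split does for a fractional test_size.
    """
    holdout = math.ceil(n_rows * holdout_fraction)
    train = n_rows - holdout
    test = math.ceil(holdout * test_share_of_holdout)
    validation = holdout - test
    return train, validation, test


# 21613 is the row count of the dataset used in this lab
train_n, val_n, test_n = split_sizes(21613)
print(train_n, val_n, test_n)
```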

In [45]:
#Create a SageMaker session
sagemaker_session = sagemaker.Session()

#Need to convert dataset to RecordIO format for Linear Learner to understand

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_features, train_labels)
buf.seek(0) 

###Uploading training data
#Filename for training data we are uploading to S3 
key = 'linear-train-data'
#Upload training data to S3
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

###Uploading test data
buf = io.BytesIO() # create an in-memory byte buffer to write the test records to
smac.write_numpy_to_dense_tensor(buf, test_features, test_labels)
buf.seek(0)

#Filename for test data we are uploading to S3
key = 'linear-test-data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('uploaded test data location: {}'.format(s3_test_data))

###Model Artifacts
output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

uploaded training data location: s3://sagemaker-us-east-1-128847997663/linear-learner/train/linear-train-data
uploaded test data location: s3://sagemaker-us-east-1-128847997663/linear-learner/test/linear-test-data
Training artifacts will be uploaded to: s3://sagemaker-us-east-1-128847997663/linear-learner/output


While there are many more features we could use and feature engineering we could do, we will stop here and move forward with training our model.

## Step 4 - Train Model using the Linear Learner algorithm


In [46]:
# Set the container that holds the linear learner algorithm
container1 = image_uris.retrieve('linear-learner', boto3.Session().region_name)

# Create the estimator
linear_model = sagemaker.estimator.Estimator(container1,
                                       role, 
                                       instance_count = 1, 
                                       instance_type = 'ml.m4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session)

# And set the hyperparameters
linear_model.set_hyperparameters(feature_dim = 77,
                               predictor_type = 'regressor',
                               mini_batch_size = 20,
                               epochs = 5,
                               num_models = 10,
                               loss = 'absolute_loss')
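
The value of `feature_dim` has to match the number of columns in the training features. The sketch below shows where 77 comes from: the seven engineered columns plus the one-hot ZIP code columns. The count of 70 ZIP code columns is an assumption inferred from `feature_dim = 77`, and `check_feature_dim` is a hypothetical helper. In the notebook itself, `train_features.shape[1]` should give the same number.

```python
# The seven engineered columns built in Steps 2 and 3
base_columns = ['sqft_living', 'bedrooms', 'bathrooms', 'sqft_lot',
                'floors', 'waterfront', 'condition']

# Assumption: 70 distinct ZIP codes, inferred from feature_dim = 77
n_zip_columns = 70

feature_dim = len(base_columns) + n_zip_columns
print(feature_dim)


def check_feature_dim(row, expected_dim):
    """Raise early if a feature row does not match the declared width."""
    if len(row) != expected_dim:
        raise ValueError(
            'expected {} features, got {}'.format(expected_dim, len(row)))
    return True
```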


In [None]:
# Now we can pass in S3 training_data path variable we declared earlier and train our first model
linear_model.fit({'train': s3_train_data})

INFO:sagemaker:Creating training-job with name: linear-learner-2023-12-01-14-50-57-995


2023-12-01 14:50:58 Starting - Starting the training job...
2023-12-01 14:51:24 Starting - Preparing the instances for training............
2023-12-01 14:53:02 Downloading - Downloading input data...
2023-12-01 14:53:37 Training - Downloading the training image......
2023-12-01 14:54:43 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[12/01/2023 14:55:04 INFO 140563914995520] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', '

## Step 5 - Deploy the Linear Learner Model

We will now deploy our initial model built using Linear Learner.  

First we will set our endpoint name and model name variables, then we can deploy directly using the `.deploy()` method.  This will create a new endpoint configuration and an endpoint, both hosted by SageMaker.

We will use the same process when we initially deploy the XGBoost model later in the lab.

In [None]:
# Set variable names
endpoint_name = 'home-price-regressor-endpoint'
linear_model_name = 'linear-regressor-model'

#Deploy our initial model

home_price_regressor = linear_model.deploy(initial_instance_count = 1,
                                           instance_type = 'ml.m4.xlarge',
                                           endpoint_name= endpoint_name,
                                           model_name= linear_model_name
                                          )



In [None]:
# We need to make sure the data is in correct format for deployed model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
home_price_regressor.serializer = CSVSerializer()
home_price_regressor.deserializer = JSONDeserializer()

## Step 6 - Interact with the Linear Learner model

In a real-world setting, this is the point we would reach when we first deploy our model.  Since we have a functioning model deployed, we will now send our test data to it and chart how well it does with a scatter plot.


In [None]:
# Send the test features to the endpoint; the response is parsed JSON
result1 = home_price_regressor.predict(test_features)

# Collect the predicted scores into a NumPy array so we can compare them with the test labels
predictions = np.array([res['score'] for res in result1['predictions']])

# Visualize how close the predictions are to the actual prices
plt.scatter(test_labels, predictions)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.show()
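
A scatter plot gives a visual impression; a numeric error metric makes the comparison easier to repeat. The sketch below parses a response with the same `predictions` / `score` structure used above and computes the mean absolute error and root mean squared error in plain Python. The response and label values are made up for illustration, and the helper names are hypothetical.

```python
import math


def extract_scores(response):
    """Pull the predicted values out of a Linear Learner style JSON response."""
    return [item['score'] for item in response['predictions']]


def mean_absolute_error(actual, predicted):
    """Average size of the errors, in the same units as the label."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but large errors are penalized more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up response and labels, for illustration only
sample_response = {'predictions': [{'score': 200000.0},
                                   {'score': 500000.0},
                                   {'score': 310000.0}]}
sample_labels = [221900.0, 538000.0, 300000.0]

scores = extract_scores(sample_response)
mae = mean_absolute_error(sample_labels, scores)
rmse = root_mean_squared_error(sample_labels, scores)
print(round(mae, 2), round(rmse, 2))
```

Recording these numbers for the current model gives a baseline to compare against when a replacement model is trained later.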


As you can see, it does OK.  If this were a real-world setting, we would go back and perform further feature engineering.  However, for the purpose of this lab, let's assume that this meets our business needs and that we begin using the model.

### The scenario:  

After we have run our model for a time, we decide that a different algorithm may do a better job.  Our first step in replacing our current model is to train a new one.  The same situation arises when we retrain because of concept drift or other events where we need an updated model but do not want to interrupt our production environment.

## Step 7 - Train a new model

Now we will train on the same data using the XGBoost algorithm.  We will supply XGBoost with CSV input rather than the RecordIO format we used above; the built-in algorithm expects CSV files with the label in the first column and no header row.  Therefore, we will need to set up the train, test and validation data, then export it to an S3 bucket in CSV format.

In [None]:
# Create our train, test and validation sets (label in the first column)
XX_train = pd.concat((pd.DataFrame(train_labels),pd.DataFrame(train_features)),axis = 1)
XX_valid = pd.concat((pd.DataFrame(val_labels),pd.DataFrame(val_features)),axis = 1)
XX_test = pd.concat((pd.DataFrame(test_labels),pd.DataFrame(test_features)),axis = 1)
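
The layout of the CSV data can be sketched with the standard library alone: each row starts with the label (the price) and is followed by the features, with no header row. The helper name `rows_to_csv` is hypothetical, and the two sample rows use only the first two feature values of the first two houses.

```python
import csv
import io


def rows_to_csv(labels, features):
    """Serialize rows as CSV with the label first and no header row."""
    buffer = io.StringIO()  # a fresh buffer on every call
    writer = csv.writer(buffer, lineterminator='\n')
    for label, row in zip(labels, features):
        writer.writerow([label] + list(row))
    return buffer.getvalue()


# Price, then scaled sqft_living and bedrooms, for the first two houses
csv_text = rows_to_csv([221900.0, 538000.0],
                       [[0.06717, 3], [0.172075, 3]])
print(csv_text)
```

Because the helper creates a new buffer on every call, the text produced for one split never contains rows from another.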

In [None]:
# Write each split to CSV (label first, no header) and upload it to S3
bucket = sagemaker.Session().default_bucket()
prefix_out = 'output'
output_path = 's3://{}/{}/{}'.format(bucket, prefix_out, 'houseing-xgb')

s3_resource = boto3.resource('s3')

# Use a fresh buffer for each split; reusing a single buffer would append
# every split to the ones written before it
for name, frame in [('train', XX_train), ('valid', XX_valid), ('test', XX_test)]:
    csv_buffer = StringIO()
    frame.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Object(bucket, '{}.csv'.format(name)).put(Body=csv_buffer.getvalue())

# Point SageMaker at the uploaded files using s3:// URIs
s3_input_train = TrainingInput(s3_data='s3://{}/train.csv'.format(bucket), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/valid.csv'.format(bucket), content_type='csv')
s3_input_test = TrainingInput(s3_data='s3://{}/test.csv'.format(bucket), content_type='csv')


In [None]:
# Now set up the container, estimator and hyperparameters just as we did with the first model
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, 'latest')

# The built-in XGBoost algorithm reads the 'train' and 'validation' channels;
# the test set is held back for evaluating the deployed endpoint
data_channels = {'train': s3_input_train, 'validation': s3_input_validation}

sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role=sagemaker.get_execution_role(),
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sess)

# predictor_type is a Linear Learner hyperparameter; XGBoost selects regression
# through `objective` (newer XGBoost versions call this 'reg:squarederror')
xgb.set_hyperparameters(objective='reg:linear',
                        max_depth=200,
                        num_round=100)

# and finally train our new model
xgb.fit(data_channels)


## Step 8 - Deploy the new model

We will now deploy our new model with its own endpoint and endpoint configuration.  It starts out as a stand-alone endpoint, which allows us to test it separately if we want, although we will not do so in this lab.

In [None]:
# Set our variables, then again, use the .deploy() method to deploy our model
xgb_endpoint_name = 'xgb-regressor-endpoint'
xgb_model_name = 'xgb-regressor-model'

xgb_price_regressor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge',
                           endpoint_name = xgb_endpoint_name,
                           model_name = xgb_model_name,
                          )

xgb_price_regressor.serializer = CSVSerializer()
xgb_price_regressor.deserializer = JSONDeserializer()


At this point, we have our original production endpoint and endpoint configuration, which host our Linear Learner model.  We also have a separate endpoint and endpoint configuration for our XGBoost model.  However, we don't want to switch all our production traffic over to this separate endpoint, so let's update the first one so we can do a Blue/Green deployment.

## Step 9 - Create a new endpoint configuration
This new endpoint configuration will include both the original model and the new model as two separate production variants.  We will also set the initial weights to send traffic only to the original model, ensuring there is no change to our production traffic.

In [None]:
# Get the current endpoint configuration
sage_client = sess.sagemaker_client
endpoint = sage_client.describe_endpoint(EndpointName=home_price_regressor.endpoint_name)
endpoint_config = sage_client.describe_endpoint_config(
    EndpointConfigName=endpoint['EndpointConfigName'])

# Set the current deployment weight to 1 to ensure all traffic continues to the same model as before
current_model_config = endpoint_config['ProductionVariants'][0]
current_model_config['InitialVariantWeight'] = 1
current_model_config['VariantName'] = 'linear-learner'

# Now setup a new variant and configuration for that variant
Variant = 'xgboost'

xgb_model_config = {'ModelName': xgb_model_name,
                      'InitialInstanceCount': 1,
                      'InstanceType': 'ml.m4.xlarge',
                      'VariantName': Variant,
                      'InitialVariantWeight': 0}

# And now create the new endpoint configuration using the two production variants from above.
sage_client.create_endpoint_config(
    EndpointConfigName='AB-Config',
    ProductionVariants=[current_model_config,
                        xgb_model_config])
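
Each variant receives a share of traffic equal to its weight divided by the sum of all variant weights. The sketch below computes those shares for the two variants defined above; `traffic_shares` is a hypothetical helper that runs locally and makes no AWS calls.

```python
def traffic_shares(production_variants):
    """Return each variant's share of traffic: its weight over the total weight."""
    total = sum(v['InitialVariantWeight'] for v in production_variants)
    if total == 0:
        raise ValueError('at least one variant needs a positive weight')
    return {v['VariantName']: v['InitialVariantWeight'] / total
            for v in production_variants}


# Same variant names and weights as the endpoint configuration above
variants = [
    {'VariantName': 'linear-learner', 'InitialVariantWeight': 1},
    {'VariantName': 'xgboost', 'InitialVariantWeight': 0},
]
shares = traffic_shares(variants)
print(shares)
```

With weights of 1 and 0, the Linear Learner variant keeps all of the traffic while the XGBoost variant is provisioned but idle.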


In [None]:
# Now we update the original endpoint with the new endpoint configuration - this may take a few minutes
sage_client.update_endpoint(
    EndpointName=endpoint['EndpointName'],
    EndpointConfigName='AB-Config')

# Block until the endpoint has finished updating
result = sess.wait_for_endpoint(endpoint['EndpointName'])

At this point, everything is processing as before, with the only difference being that we have two production variants in our endpoint configuration.

When it comes time to do the cutover, we perform the final step.

## Step 10 - The Blue/Green Deployment
Simply by changing the weights of the two production variants, we switch all traffic from the Linear Learner model to the XGBoost model.

In [None]:
sage_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint['EndpointName'],
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'linear-learner',
            'DesiredWeight': 0
        },
        {
            'VariantName': 'xgboost',
            'DesiredWeight': 1
        }
    ]
)

# Block until the new weights are in effect
response = sess.wait_for_endpoint(endpoint['EndpointName'])
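
The same call can also shift traffic in stages rather than all at once. The sketch below only builds the `DesiredWeightsAndCapacities` payloads for such a rollout; it makes no AWS calls. The helper name `desired_weights` and the 10% / 50% / 100% schedule are assumptions chosen for illustration, not part of this lab.

```python
def desired_weights(new_fraction, old_variant='linear-learner', new_variant='xgboost'):
    """Build a DesiredWeightsAndCapacities list that sends `new_fraction`
    of traffic to the new variant and the rest to the old one."""
    if not 0 <= new_fraction <= 1:
        raise ValueError('new_fraction must be between 0 and 1')
    return [
        {'VariantName': old_variant, 'DesiredWeight': 1 - new_fraction},
        {'VariantName': new_variant, 'DesiredWeight': new_fraction},
    ]


# Hypothetical rollout schedule: 10%, then 50%, then the full cutover
schedule = [desired_weights(f) for f in (0.1, 0.5, 1.0)]
for step in schedule:
    print(step)
```

Each entry in `schedule` could be passed as `DesiredWeightsAndCapacities` in a call like the one above, waiting for the endpoint to finish updating between steps.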

## Conclusion

You have now trained and deployed two separate models, and then performed a Blue/Green deployment to switch from one to the other.

## Clean-up

If you're done with this notebook, please run the cell below to remove the hosted endpoint and avoid any charges from any stray instances being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_price_regressor.endpoint_name)
sagemaker.Session().delete_endpoint(home_price_regressor.endpoint_name)