# SageMaker demo

This notebook is a basic demo of Amazon SageMaker capabilities:
- The notebook itself runs on a Notebook Instance.
- It illustrates the concepts of Estimator and Training Job, for model training at scale.
- It illustrates deploying a model and running real-time inference queries.

To that end, we use the prototypical task of predicting house prices.

## Environment preparation

This section prepares the environment needed for the modeling. 

### Necessary imports

In [3]:
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

import numpy as np
import io
import pandas as pd
from sklearn.model_selection import train_test_split

### Load data

Read the data **from a CSV file in an S3 bucket** into a pandas DataFrame.

Note we had not seen this syntax before, but it is very useful: some functions accept the location of an S3 object in this format:

    s3://[bucket_name]/[full_path_to_object]

In pandas, remember that the `usecols` argument selects specific columns from the CSV. For this simple example, we use just a target attribute (`price`) and 2 input attributes (`sqft_living` and `waterfront`).
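The address format above can be taken apart with the standard library. This is a minimal sketch; `split_s3_uri` is a hypothetical helper written for this illustration and is not part of boto3, pandas or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/object' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path keeps its leading '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://test-bucket-for-jose/sagemaker-data/kc_house_data.csv"
)
print(bucket)  # test-bucket-for-jose
print(key)     # sagemaker-data/kc_house_data.csv
```

The bucket and key are the same two values that `s3.Object(bucket, filename)` takes later in this notebook.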

In [4]:
data = pd.read_csv(
    's3://test-bucket-for-jose/sagemaker-data/kc_house_data.csv',
    usecols=['price', 'sqft_living', 'waterfront']
)

In [5]:
data.head()

| | price | sqft_living | waterfront |
|---|---|---|---|
| 0 | 221900.0 | 1180 | 0 |
| 1 | 538000.0 | 2570 | 0 |
| 2 | 180000.0 | 770 | 0 |
| 3 | 604000.0 | 1960 | 0 |
| 4 | 510000.0 | 1680 | 0 |
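The effect of `usecols` can be reproduced without touching S3 by reading a small in-memory CSV. In this sketch the `id` and `bedrooms` columns and their values are made up for illustration; only the three selected column names come from this notebook.

```python
import io

import pandas as pd

# Made-up rows with extra columns, standing in for the real file
raw_csv = io.StringIO(
    "id,price,bedrooms,sqft_living,waterfront\n"
    "1,221900.0,3,1180,0\n"
    "2,538000.0,3,2570,0\n"
)

# Only the listed columns are parsed; the others are skipped
sample = pd.read_csv(raw_csv, usecols=["price", "sqft_living", "waterfront"])
print(list(sample.columns))  # ['price', 'sqft_living', 'waterfront']
```

The selected columns keep the order they have in the file, so `price` stays in the first position.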


### Train / Val / Test split

We split the data into train (80%), validation (10%) and test (10%) sets. 

In [7]:
train, testval = train_test_split(data, train_size=0.8, random_state=1200)
val, test = train_test_split(testval, train_size=0.5, random_state=1200)

In [8]:
train.shape, val.shape, test.shape

((17290, 3), (2161, 3), (2162, 3))
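The shapes above follow from simple arithmetic on the 21613 rows (the sum of the three shapes). The sketch below assumes that `train_test_split` rounds the `train_size` share down and gives the remaining rows to the other part; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_rows, train_share=0.8, val_share_of_rest=0.5):
    """Return (train, val, test) row counts for the two-step split."""
    n_train = math.floor(train_share * n_rows)      # first split: train part
    n_rest = n_rows - n_train                       # rows left for val + test
    n_val = math.floor(val_share_of_rest * n_rest)  # second split: val part
    n_test = n_rest - n_val
    return n_train, n_val, n_test


print(split_sizes(21613))  # (17290, 2161, 2162)
```

Under that assumption the result matches the printed shapes, including the one-row difference between the validation and test sets.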

To work with SageMaker, we need to upload our data to S3. The snippet in the next cell does that:

- It calls `.to_csv` as if writing to a file, but instead of a file name it passes a `StringIO` object, which acts as an in-memory text buffer.
- It writes to S3, as in the example we saw with Lambda functions, but now passing the contents of the `StringIO` as the body.

In [14]:
s3 = boto3.resource('s3')

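def upload_to_s3(df, bucket, filename):
    """Write df as a headerless CSV to s3://bucket/filename."""
    # Serialize to an in-memory buffer instead of a local file
    placeholder = io.StringIO()
    df.to_csv(placeholder, header=False, index=False)
    # Named `s3_object` to avoid shadowing the built-in `object`
    s3_object = s3.Object(bucket, filename)
    s3_object.put(Body=placeholder.getvalue())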
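To see what `upload_to_s3` hands to S3, the buffer step can be run on its own. This is a minimal sketch using two rows copied from the `data.head()` output; nothing is uploaded. The `header=False, index=False` arguments matter because SageMaker's built-in XGBoost expects CSV training data without a header row and with the target in the first column.

```python
import io

import pandas as pd

# Two rows taken from the `data.head()` output earlier in this notebook
df = pd.DataFrame(
    {
        "price": [221900.0, 538000.0],
        "sqft_living": [1180, 2570],
        "waterfront": [0, 0],
    }
)

buffer = io.StringIO()
df.to_csv(buffer, header=False, index=False)  # same arguments as upload_to_s3
body = buffer.getvalue()  # the text that would become the S3 object body

print(body.splitlines())  # ['221900.0,1180,0', '538000.0,2570,0']
```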
    

With `upload_to_s3` defined, we upload the train and validation splits.

In [15]:
upload_to_s3(train, 'test-bucket-for-jose', 'sagemaker-data/kc/train.csv')

In [16]:
upload_to_s3(val, 'test-bucket-for-jose', 'sagemaker-data/kc/val.csv')

(Go to the S3 console and verify that these files are there.)

## Setting up the model

This is the part where we set up our model.

We use the class `Estimator` from the `sagemaker.estimator` module. It creates the **environment** in which training jobs for a model run.

We specify:

- A container name (SageMaker works with containers; this code points to a pre-existing container that holds everything needed to run XGBoost).
- A role name (the training job needs a role with sufficient permissions, similar to what we saw with Lambda functions). Remember that we created this role when starting the notebook server.
- The number of instances for training (we use 1, but larger jobs could use more to scale).
- The type of instance (we select one that's included in the SageMaker Free Tier).
- The output path, where the model and other info will be written.
- The hyperparameters of the algorithm (number of training rounds and loss function).
- The current session (needed for internal purposes).

(Remember to check the [pricing info](https://aws.amazon.com/sagemaker/pricing/#:~:text=amount%20of%20usage.-,Amazon%20SageMaker%20Free%20Tier,-Amazon%20SageMaker%20is) for more details on the Sagemaker Free Tier)

In [20]:
# Look up the region inline: `region_name` is only assigned in the next cell
example = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, version='0.90-1')

In [22]:
role = sagemaker.get_execution_role()
region_name = boto3.Session().region_name
#container = get_image_uri(region_name, 'xgboost', '0.90-1')  # Old version. Works anyway but warns.  
container = sagemaker.image_uris.retrieve('xgboost', region_name, version='0.90-1')
output_location = 's3://test-bucket-for-jose/sagemaker-output/'

# For a list of possible parameters of xgboost, see
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters
hyperparams = {
    'num_round': '20',
    'objective': 'reg:squarederror'
}

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

Now we have to create what SageMaker calls "channels". We specify where the data is and in which format, and collect the channels in a dictionary:

In [24]:
# `TrainingInput` is the sagemaker>=2 name of the former `sagemaker.session.s3_input`
# See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_channel = sagemaker.inputs.TrainingInput(
    's3://test-bucket-for-jose/sagemaker-data/kc/train.csv',
    content_type='text/csv'
)
val_channel = sagemaker.inputs.TrainingInput(
    's3://test-bucket-for-jose/sagemaker-data/kc/val.csv',
    content_type='text/csv'
)

channels_for_training = {
    'train': train_channel,
    'validation': val_channel
}


We are ready to train.

When we execute the cell below, a training job is launched. The training job is a managed service, independent of this notebook, so if you go to the SageMaker console and click on "Training jobs", you will find it there.

BTW: you could also launch a training job from the console. It will ask you to enter all the information above manually (just so you know it is possible).


In [25]:
estimator.fit(inputs=channels_for_training, logs=False)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-02-21-14-42-17-809



2023-02-21 14:42:18 Starting - Starting the training job.......
2023-02-21 14:42:54 Starting - Preparing the instances for training............
2023-02-21 14:44:03 Downloading - Downloading input data....
2023-02-21 14:44:28 Training - Downloading the training image......
2023-02-21 14:45:04 Training - Training image download completed. Training in progress....
2023-02-21 14:45:24 Uploading - Uploading generated training model...
2023-02-21 14:45:40 Completed - Training job completed


We can print the job name -- this is the name that appears in the console. 

In [26]:
estimator._current_job_name

'sagemaker-xgboost-2023-02-21-14-42-17-809'

Finally, we can also get some metrics of the training job here. 

In [27]:
metrics = sagemaker.analytics.TrainingJobAnalytics(
    estimator._current_job_name,
    metric_names=['train:rmse', 'validation:rmse']
)

In [28]:
metrics.dataframe()

| | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | train:rmse | 221587.0 |
| 1 | 0.0 | validation:rmse | 230754.0 |


## Deploying the model

Now that the model is ready, we can deploy it. This creates an instance that serves the model continuously. The server accepts queries with input values in real time and returns the model's prediction.

In [29]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer())


INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-02-21-14-48-24-687
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-02-21-14-48-24-687
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-02-21-14-48-24-687


----------!

Keep the name shown in the log above (here the model, the endpoint configuration and the endpoint all share it), as we'll need it later!

While this notebook is open, you can perform inferences directly from the predictor object in this notebook, for testing purposes. 

In [30]:
predictor.predict("1000,0")

b'322433.96875'

(The format may surprise you; it will be explained in the Inference.ipynb notebook.)
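As a quick preview: the query is a CSV string and the answer comes back as raw bytes holding the number as text. The sketch below shows one way to build the payload and decode the answer. Both helpers are hypothetical, and it assumes a single input row, a single predicted value, and features in training column order (`sqft_living`, then `waterfront`).

```python
def to_csv_payload(features):
    """Join one row of feature values into the CSV text sent to the endpoint."""
    return ",".join(str(value) for value in features)


def parse_prediction(response):
    """Decode the endpoint's raw bytes into a float."""
    return float(response.decode("utf-8"))


print(to_csv_payload([1000, 0]))          # 1000,0
print(parse_prediction(b"322433.96875"))  # 322433.96875
```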

When you close this notebook, the endpoint remains reachable, as will be demonstrated in the Inference.ipynb notebook.

## Additional info

- More on [deploying models](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html)