## CatBoost Script Mode Training and Serving with the SageMaker XGBoost Container

This sample Python program trains a simple CatBoost model in script mode using the SageMaker XGBoost Docker image (CatBoost is installed from `requirements.txt` when the job starts), and then performs inference. This implementation will work on your *local computer* or in the *AWS Cloud*.

#### Prerequisites:
1. Install required Python packages:
   `pip install -r requirements.txt`
2. Have Docker Desktop installed and running on your computer. To verify:
   `docker ps`
3. Configure AWS credentials on your local machine so that the Docker image can be pulled from ECR.
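
The Docker prerequisite can also be checked from Python before any training is started. The snippet below is a minimal sketch, not part of the original sample: `docker_is_running` is a hypothetical helper that only wraps the `docker ps` command listed above.

```python
import shutil
import subprocess


def docker_is_running(timeout_seconds: float = 10.0) -> bool:
    """Return True if the docker CLI is on PATH and `docker ps` exits with 0."""
    if shutil.which("docker") is None:
        return False  # the docker CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "ps"],
            capture_output=True,
            timeout=timeout_seconds,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # the daemon did not answer in time or could not be invoked
    return result.returncode == 0


print("Docker running:", docker_is_running())
```

A `False` result usually means Docker Desktop is not started yet, or the CLI is missing from `PATH`.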

In [1]:
import os
import sagemaker
import pandas as pd
from sagemaker.xgboost import XGBoost
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

In [2]:
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

# S3 key prefix under which the training data is uploaded
prefix = "xgboost_catboost"

## Downloading Data
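Load the Boston housing dataset from scikit-learn, split it into training, validation and test sets, and write each set to a local CSV file.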

In [3]:
local_train = './data/train/boston_train.csv'
local_validation = './data/validation/boston_validation.csv'
local_test = './data/test/boston_test.csv'

In [4]:
if all(os.path.isfile(path) for path in (local_train, local_validation, local_test)):
    print('Training dataset exist. Skipping Download')
else:
    print('Downloading training dataset')

    os.makedirs("./data", exist_ok=True)
    os.makedirs("./data/train", exist_ok=True)
    os.makedirs("./data/validation", exist_ok=True)
    os.makedirs("./data/test", exist_ok=True)

    data = load_boston()

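    # Hold out 25% of the rows, then split that hold-out evenly into validation and test
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=45)
    X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=45)

    trainX = pd.DataFrame(X_train, columns=data.feature_names)
    trainX['target'] = y_train

    # The validation file must come from the validation split, not the test split
    valX = pd.DataFrame(X_val, columns=data.feature_names)
    valX['target'] = y_val

    # The test file holds features only; it is used as the inference payload
    testX = pd.DataFrame(X_test, columns=data.feature_names)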

    trainX.to_csv(local_train, header=None, index=False)
    valX.to_csv(local_validation, header=None, index=False)
    testX.to_csv(local_test, header=None, index=False)

    print('Downloading completed')

Training dataset exist. Skipping Download
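
To make the split proportions concrete, the sketch below reproduces the row counts of the two `train_test_split` calls with plain arithmetic. It assumes the 506-row Boston housing dataset and scikit-learn's behaviour of rounding the test portion up when only a float `test_size` is given; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_first, n_second) for a split with a fractional test_size.

    The second (test) part is rounded up and the first part gets the rest.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 506  # rows in the Boston housing dataset (assumption stated above)
n_train, n_holdout = split_sizes(n_total, 0.25)  # first split: train vs. hold-out
n_val, n_test = split_sizes(n_holdout, 0.5)      # second split: validation vs. test

print(n_train, n_val, n_test)  # prints: 379 63 64
```

Roughly 75% of the rows go to training and the remainder is divided almost evenly between validation and test.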


## Model Training
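Start model training. The cell below requests an `ml.m5.xlarge` instance, so the job runs as a managed SageMaker training job; to train in **local mode** instead, set the instance type to `local`. Note: when launching in local mode for the first time, the container image download might take a few minutes to complete.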
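
Switching between local mode and a managed instance mostly comes down to the instance type and where the input channels point. The following is a minimal sketch under two assumptions: that local mode is selected with `instance_type="local"`, and that local mode can read channels from `file://` URIs. `training_setup` and the S3 URIs are hypothetical placeholders, not part of this sample.

```python
import os
from typing import Dict


def training_setup(local_mode: bool,
                   s3_inputs: Dict[str, str],
                   local_inputs: Dict[str, str]) -> Dict[str, object]:
    """Pick the instance type and input channels for local or cloud training."""
    if local_mode:
        # Local mode: run the container on this machine and read local files
        inputs = {
            channel: "file://" + os.path.abspath(path)
            for channel, path in local_inputs.items()
        }
        return {"instance_type": "local", "inputs": inputs}
    # Cloud: managed instance reading the data previously uploaded to S3
    return {"instance_type": "ml.m5.xlarge", "inputs": dict(s3_inputs)}


setup = training_setup(
    local_mode=True,
    s3_inputs={"train": "s3://my-bucket/xgboost_catboost/data/train"},  # placeholder URI
    local_inputs={"train": "./data/train/boston_train.csv"},
)
print(setup["instance_type"], setup["inputs"]["train"])
```

The returned `instance_type` would go into the estimator parameters and `inputs` into `estimator.fit(...)`.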

In [44]:
training_instance_type = "ml.m5.xlarge"

# Upload the local CSV files to the session's default S3 bucket
train_location = sess.upload_data(
    local_train, key_prefix="{}/data/{}".format(prefix, "train")
)
validation_location = sess.upload_data(
    local_validation, key_prefix="{}/data/{}".format(prefix, "validation")
)

In [45]:
hyperparameters = {"num_round": 6}

estimator_parameters = {
    "entry_point": "multi_model_deploy.py",
    "source_dir": "code",
    "dependencies": ["my_custom_library"],
    "instance_type": training_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "xgboost-model",
    "framework_version": "1.0-1",
    "py_version": "py3",
}

estimator = XGBoost(**estimator_parameters)

estimator.fit({'train': train_location, 'validation': validation_location})
print('Completed model training')


2022-04-13 04:31:49 Starting - Starting the training job...
2022-04-13 04:32:06 Starting - Preparing the instances for training
ProfilerReport-1649824309: InProgress
......
2022-04-13 04:33:17 Downloading - Downloading input data...
2022-04-13 04:33:42 Training - Downloading the training image.....
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module multi_model_deploy does not provide a setup.py.
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-containers:Installing module with the following command:
/miniconda3/bin/python3 -m pip install . -r requirements.txt
Processing /opt/ml/code
Collecting catboost==0.2

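
In script mode, the container invokes the entry point with the hyperparameters as command-line arguments and exposes the data channels and model directory through `SM_*` environment variables. The contents of `multi_model_deploy.py` are not shown in this notebook, so the block below is only a hedged sketch of how such a script might read its inputs; `parse_args` and the default values are illustrative assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameter passed from the notebook as {"num_round": 6}
    parser.add_argument("--num_round", type=int, default=10)
    # Paths fall back to the conventional container locations when the
    # environment variables are not set (for example when run outside SageMaker)
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation",
                        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))
    # Ignore any additional arguments instead of failing on them
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--num_round", "6"])
print(args.num_round, args.train, args.model_dir)
```

The channel names `train` and `validation` match the keys passed to `estimator.fit(...)`.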
In [47]:
model_data = estimator.model_data
model_data

's3://sagemaker-us-east-1-631450739534/xgboost-model-2022-04-13-04-31-49-243/output/model.tar.gz'
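
If the model artifact needs to be inspected or downloaded, the S3 URI first has to be split into a bucket and an object key. This is a small sketch using only the standard library; `split_s3_uri` is a hypothetical helper, and the download call mentioned in the comment is one option rather than something this sample does.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


artifact_bucket, artifact_key = split_s3_uri(
    "s3://sagemaker-us-east-1-631450739534/"
    "xgboost-model-2022-04-13-04-31-49-243/output/model.tar.gz"
)
print(artifact_bucket)
print(artifact_key)
# The pair could then be handed to an S3 client, for example
# boto3.client("s3").download_file(artifact_bucket, artifact_key, "model.tar.gz")
```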

## Deploying trained model
We can also deploy the trained model to an endpoint and invoke it.

In [37]:
# endpoint_name = "xgboost-catboost-endpoint"
# predictor = estimator.deploy(
#         initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
#     )


In [50]:
from sagemaker.xgboost.model import XGBoostModel

inference_model = XGBoostModel(
    model_data=model_data,
    role=role,
    entry_point="multi_model_deploy.py",
    framework_version="1.0-1",
    dependencies=["my_custom_library"],
    source_dir="code",
)

In [None]:
predictor = inference_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

------

In [None]:

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Send requests as CSV and parse responses as JSON
predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()


In [None]:
with open(local_test, 'r') as f:
    payload = f.read().strip()

predictions = predictor.predict(payload)
print('predictions: {}'.format(predictions))
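
Because the test file is written without a header and without the target column, each payload line should contain only the 13 feature values. Before calling the endpoint, the payload can be sanity-checked with a sketch like the one below; `payload_shape` is a hypothetical helper and the sample rows are synthetic placeholders, not Boston housing data.

```python
import csv
import io


def payload_shape(payload: str):
    """Return (n_rows, n_columns) of a headerless CSV payload.

    Raises ValueError if the rows do not all have the same number of columns.
    """
    rows = [row for row in csv.reader(io.StringIO(payload)) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("rows have inconsistent column counts: {}".format(sorted(widths)))
    return len(rows), (widths.pop() if widths else 0)


# Two synthetic rows with 13 columns each (placeholder values)
sample_row = ",".join(str(float(i)) for i in range(13))
sample_payload = sample_row + "\n" + sample_row

print(payload_shape(sample_payload))  # prints: (2, 13)
```

Running the same check on `payload` from the cell above would show how many rows are sent in a single request.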

## Clean up resources
Delete the deployed endpoint.

In [32]:
predictor.delete_endpoint()


Gracefully stopping... (press Ctrl+C again to force)
