# Model Training and Deployment in SageMaker Studio

This notebook walks through an example of training a model and deploying it for inference on a serverless endpoint.
SageMaker only allows this type of inference through this approach, that is, configured in code as done below.
The recently launched Canvas, for example, only allows models to be deployed to dedicated inference instances.

## Installing the required libraries

We will use the AWS SDK and the SageMaker tooling.

In [1]:
! pip install sagemaker botocore boto3 awscli --upgrade

Collecting sagemaker
  Using cached sagemaker-2.221.1-py3-none-any.whl.metadata (14 kB)
Collecting botocore
  Using cached botocore-1.34.117-py3-none-any.whl.metadata (5.7 kB)
Collecting boto3
  Using cached boto3-1.34.117-py3-none-any.whl.metadata (6.6 kB)
Collecting awscli
  Using cached awscli-1.32.117-py3-none-any.whl.metadata (11 kB)
Collecting docutils<0.17,>=0.10 (from awscli)
  Using cached docutils-0.16-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting rsa<4.8,>=3.1.2 (from awscli)
  Using cached rsa-4.7.2-py3-none-any.whl.metadata (3.6 kB)
Using cached sagemaker-2.221.1-py3-none-any.whl (1.5 MB)
Using cached botocore-1.34.117-py3-none-any.whl (12.3 MB)
Using cached boto3-1.34.117-py3-none-any.whl (139 kB)
Using cached awscli-1.32.117-py3-none-any.whl (4.5 MB)
Using cached docutils-0.16-py2.py3-none-any.whl (548 kB)
Using cached rsa-4.7.2-py3-none-any.whl (34 kB)
Installing collected packages: rsa, docutils, botocore, boto3, awscli, sagemaker
  Attempting uninstall: rsa
    Fo

Import the libraries and create two SageMaker clients: one for the control-plane calls (training jobs, models, endpoint configurations, endpoints) and a runtime client for inference.

In [None]:
import boto3
import sagemaker
from sagemaker.estimator import Estimator
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sagemaker.inputs import TrainingInput
from time import gmtime, strftime

In [2]:
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")

Configure the prefixes and define the training instance type.
Get the default region, which is where the SageMaker domain was created.

In [6]:
boto_session = boto3.session.Session()
region = boto_session.region_name
print(region)

sagemaker_session = sagemaker.Session()
base_job_prefix = "xgboost-no-show"
role = sagemaker.get_execution_role()
print(role)

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = "ml.m5.large"

us-east-1
arn:aws:iam::989944764342:role/service-role/AmazonSageMaker-ExecutionRole-20240601T211333


## Getting the dataset

This dataset is a list of medical appointments, each labeled with whether or not the patient showed up. Details such as the neighbourhood, whether an SMS reminder was sent, the scheduling date, and the appointment date give hints we can use to predict attendance.

In [None]:
!git clone https://github.com/michelpf/fiap-ds-cloud-cognitive-environments

In [7]:
df = pd.read_csv("fiap-ds-cloud-cognitive-environments/aula-1-introducao/dataset/brazil_noshowappointments_cleaned_may2016.csv")

## Feature engineering

Transform and prepare the data to increase its predictive power.
The strategy is to make the features more general so the model can converge better.

* Convert timestamps into month, day, hour, day of week, quarter, and week of year.
* Remove identifier attributes.
* Bin age into ranges.

In [8]:
df.drop(columns=['patientid'], inplace=True)

df['scheduledday'] = pd.to_datetime(df['scheduledday'])
df['scheduled_month'] = df['scheduledday'].dt.month
df['scheduled_day'] = df['scheduledday'].dt.day
df['scheduled_hour'] = df['scheduledday'].dt.hour
df['scheduled_day_of_week'] = df['scheduledday'].dt.weekday  # 0=Monday, 6=Sunday
df['scheduled_quarter'] = df['scheduledday'].dt.quarter
df['scheduled_week_of_year'] = df['scheduledday'].dt.isocalendar().week

df['appointmentday'] = pd.to_datetime(df['appointmentday'])
df['appointment_month'] = df['appointmentday'].dt.month
df['appointment_day'] = df['appointmentday'].dt.day
df['appointment_hour'] = df['appointmentday'].dt.hour
df['appointment_day_of_week'] = df['appointmentday'].dt.weekday  # 0=Monday, 6=Sunday
df['appointment_quarter'] = df['appointmentday'].dt.quarter
df['appointment_week_of_year'] = df['appointmentday'].dt.isocalendar().week

df.drop(columns=['appointmentday'], inplace=True)
df.drop(columns=['scheduledday'], inplace=True)

In [9]:
# Define the age intervals
bins = [0, 12, 17, 35, 50, 65, float('inf')]

# Define the label for each age range (kept in Portuguese, as in the original run)
labels = ['Criança', 'Adolescente', 'Jovem Adulto', 'Adulto', 'Meia-idade', 'Idoso']

# Use pd.cut to create a new column with the age ranges;
# right=False makes each interval closed on the left, e.g. [0, 12)
df['age_type'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

df.drop(columns=['age'], inplace=True)

String values are converted to numbers with a label encoder.

In [10]:
label_encoder = LabelEncoder()

df['gender'] = label_encoder.fit_transform(df['gender'])
df['neighbourhood'] = label_encoder.fit_transform(df['neighbourhood'])
df['scholarship'] = label_encoder.fit_transform(df['scholarship'])
df['sms_received'] = label_encoder.fit_transform(df['sms_received'])
df['appointment'] = label_encoder.fit_transform(df['appointment'])
df['ailment'] = label_encoder.fit_transform(df['ailment'])
df['age_type'] = label_encoder.fit_transform(df['age_type'])
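`LabelEncoder` assigns integer codes following the sorted order of the unique values, so the codes for `age_type` follow alphabetical order rather than age order. The pure-Python sketch below mimics that behaviour to show the resulting mapping, assuming all six labels occur in the data. It is an illustration only, not scikit-learn's implementation, and `encode_like_label_encoder` is a hypothetical helper. Because the same encoder instance is refit for every column above, only the last column's mapping stays on `label_encoder`; keeping one mapping per column, as the sketch returns, would make the codes reversible.

```python
def encode_like_label_encoder(values):
    """Map each distinct value to its index in sorted order (hypothetical helper)."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Same labels as in the notebook, listed in age order
age_labels = ['Criança', 'Adolescente', 'Jovem Adulto', 'Adulto', 'Meia-idade', 'Idoso']
codes, age_type_mapping = encode_like_label_encoder(age_labels)

# The codes are alphabetical, so 'Criança' (the youngest range) gets 2, not 0
print(age_type_mapping)
print(codes)
```

If the ordering of the age ranges matters to the model, an explicit mapping in age order may be a better fit than alphabetical codes.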


In [11]:
df

Unnamed: 0,gender,neighbourhood,scholarship,sms_received,appointment,ailment,scheduled_month,scheduled_day,scheduled_hour,scheduled_day_of_week,scheduled_quarter,scheduled_week_of_year,appointment_month,appointment_day,appointment_hour,appointment_day_of_week,appointment_quarter,appointment_week_of_year,age_type
0,0,39,0,0,1,7,4,29,18,4,2,17,4,29,0,4,2,17,5
1,1,39,0,0,1,15,4,29,16,4,2,17,4,29,0,4,2,17,5
2,0,45,0,0,1,15,4,29,16,4,2,17,4,29,0,4,2,17,5
3,0,54,0,0,1,15,4,29,17,4,2,17,4,29,0,4,2,17,2
4,0,39,0,0,1,10,4,29,16,4,2,17,4,29,0,4,2,17,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109904,0,43,0,1,1,15,5,3,9,1,2,18,6,7,0,1,2,23,5
109905,0,43,0,1,1,15,5,3,7,1,2,18,6,7,0,1,2,23,5
109906,0,43,0,1,1,15,4,27,16,2,2,17,6,7,0,1,2,23,4
109907,0,43,0,1,1,15,4,27,15,2,2,17,6,7,0,1,2,23,1


In [13]:
df = df.rename(columns={'appointment': 'Target'})

In [14]:
other_columns = [col for col in df.columns if col != 'Target']

# Rename the other columns to 'Feature_0', 'Feature_1', etc.
new_column_names = ['Feature_' + str(i) for i in range(len(other_columns))]
rename_dict = dict(zip(other_columns, new_column_names))

df = df.rename(columns=rename_dict)

# Reorder so that the 'Target' column comes first
df = df[['Target'] + new_column_names]


In [15]:
df

Unnamed: 0,Target,Feature_0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Feature_11,Feature_12,Feature_13,Feature_14,Feature_15,Feature_16,Feature_17
0,1,0,39,0,0,7,4,29,18,4,2,17,4,29,0,4,2,17,5
1,1,1,39,0,0,15,4,29,16,4,2,17,4,29,0,4,2,17,5
2,1,0,45,0,0,15,4,29,16,4,2,17,4,29,0,4,2,17,5
3,1,0,54,0,0,15,4,29,17,4,2,17,4,29,0,4,2,17,2
4,1,0,39,0,0,10,4,29,16,4,2,17,4,29,0,4,2,17,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109904,1,0,43,0,1,15,5,3,9,1,2,18,6,7,0,1,2,23,5
109905,1,0,43,0,1,15,5,3,7,1,2,18,6,7,0,1,2,23,5
109906,1,0,43,0,1,15,4,27,16,2,2,17,6,7,0,1,2,23,4
109907,1,0,43,0,1,15,4,27,15,2,2,17,6,7,0,1,2,23,1


In [38]:
df['Target'].unique()

array([1, 0])

## Splitting the data

Split the data into training and test sets.

In [45]:
train_data = df.sample(frac=0.8, random_state=200)
test_data = df.drop(train_data.index)

Export to the format SageMaker needs for training.

In [46]:
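# Export the DataFrame to a CSV file.
# The built-in XGBoost algorithm expects CSV without a header row,
# with the label in the first column.
train_data.to_csv('train.csv', index=False, header=False)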

Upload the training file to S3, where the training job can read it.

In [47]:
# upload data to S3
!aws s3 cp train.csv s3://{default_bucket}/xgboost-classification/train.csv

upload: ./train.csv to s3://sagemaker-us-east-1-989944764342/xgboost-classification/train.csv


## Training the model

Many models are available for training, from Amazon's built-in ones to external ones.
Amazon SageMaker JumpStart offers a friendly way to reach different models for different purposes. It is also possible to get containers for various other models available in SageMaker.
SageMaker ships a dedicated library for some models, which is another way to reach the standard models.

For more details on the available models, see this [reference](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html).

In this project, we will train an XGBoost model.

In [48]:
training_path = f"s3://{default_bucket}/xgboost-classification/train.csv"
train_input = TrainingInput(training_path, content_type="text/csv")
model_path = f"s3://{default_bucket}/{s3_prefix}/xgb_model"

We configure XGBoost for binary logistic classification. The model outputs a probability between 0 and 1, and the cutoff, commonly set at 0.5, can be adjusted to suit the model's needs.
For example, for more precise "attendance" predictions we can raise the cutoff above 0.7, and so on.
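A minimal sketch of how such a cutoff could be applied to the probabilities returned by the model. The helper `apply_threshold` and the scores are made up for illustration and are not part of the notebook.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, otherwise 0."""
    return [1 if p > threshold else 0 for p in probabilities]


scores = [0.95, 0.72, 0.55, 0.30]  # made-up scores for illustration

default_labels = apply_threshold(scores)      # cutoff of 0.5
strict_labels = apply_threshold(scores, 0.7)  # stricter cutoff

print(default_labels)  # [1, 1, 1, 0]
print(strict_labels)   # [1, 1, 0, 0]
```

Raising the cutoff generally trades recall for precision on the positive class, so the right value depends on which kind of error is more costly.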

In [49]:
# Retrieve the XGBoost container image
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

# Configure the estimator
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

# Configure the hyperparameters
xgb_train.set_hyperparameters(
    objective="binary:logistic",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

In [50]:
# Fit model
xgb_train.fit({"train": train_input})

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-06-02-19-05-24-202


2024-06-02 19:05:24 Starting - Starting the training job...
2024-06-02 19:05:42 Starting - Preparing the instances for training...
2024-06-02 19:06:09 Downloading - Downloading input data...
2024-06-02 19:06:49 Downloading - Downloading the training image......
2024-06-02 19:07:50 Training - Training image download completed. Training in progress.
2024-06-02 19:07:50 Uploading - Uploading generated training model
[2024-06-02 19:07:39.781 ip-10-0-85-128.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','

## Creating the serverless inference endpoint

Now that the model is trained, we create an inference endpoint backed by serverless compute, meaning it does not use a dedicated instance.

In [51]:
# Retrieve model data from training job
model_artifacts = xgb_train.model_data
model_artifacts

's3://sagemaker-us-east-1-989944764342/xgboost-no-show/xgb_model/sagemaker-xgboost-2024-06-02-19-05-24-202/output/model.tar.gz'

In [52]:
model_name = "xgboost-serverless" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# dummy environment variables
byo_container_env_vars = {"SAGEMAKER_CONTAINER_LOG_LEVEL": "20", "SOME_ENV_VAR": "myEnvVar"}

create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "Mode": "SingleModel",
            "ModelDataUrl": model_artifacts,
            "Environment": byo_container_env_vars,
        }
    ],
    ExecutionRoleArn=role,
)

print("Model Arn: " + create_model_response["ModelArn"])

Model name: xgboost-serverless2024-06-02-19-08-40
Model Arn: arn:aws:sagemaker:us-east-1:989944764342:model/xgboost-serverless2024-06-02-19-08-40


In [53]:
xgboost_epc_name = "xgboost-serverless-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 1024,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

Endpoint Configuration Arn: arn:aws:sagemaker:us-east-1:989944764342:endpoint-config/xgboost-serverless-epc2024-06-02-19-08-43


In [55]:
endpoint_name = "xgboost-serverless-noshow" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=xgboost_epc_name,
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:989944764342:endpoint/xgboost-serverless-noshow2024-06-02-19-09-03


In [56]:
# wait for endpoint to reach a terminal state (InService) using describe endpoint
import time

describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(15)

describe_endpoint_response

Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService


{'EndpointName': 'xgboost-serverless-noshow2024-06-02-19-09-03',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:989944764342:endpoint/xgboost-serverless-noshow2024-06-02-19-09-03',
 'EndpointConfigName': 'xgboost-serverless-epc2024-06-02-19-08-43',
 'ProductionVariants': [{'VariantName': 'byoVariant',
   'DeployedImages': [{'SpecifiedImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3',
     'ResolvedImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost@sha256:b517ba0196d2579c5393750335ed104da74b11ad2a6f0ae5da6907a7221ff2e0',
     'ResolutionTime': datetime.datetime(2024, 6, 2, 19, 9, 4, 361000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 0,
   'CurrentServerlessConfig': {'MemorySizeInMB': 1024, 'MaxConcurrency': 1}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2024, 6, 2, 19, 9, 3, 603000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 6, 2, 
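The polling loop above runs until the status leaves `Creating`, with no upper bound and no check that the final state is `InService`. A hedged sketch of a bounded variant follows. The helper `wait_for_endpoint`, its default intervals, and the injected `describe` callable are assumptions made for illustration; in the notebook, `describe` could be `lambda: client.describe_endpoint(EndpointName=endpoint_name)`.

```python
import time


def wait_for_endpoint(describe, poll_seconds=15, timeout_seconds=900, sleep=time.sleep):
    """Poll `describe()` until the status leaves 'Creating' or the timeout expires.

    `describe` is any zero-argument callable returning a dict that contains
    an 'EndpointStatus' key (hypothetical interface for this sketch).
    """
    waited = 0
    response = describe()
    while response["EndpointStatus"] == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError("endpoint still 'Creating' after %d seconds" % waited)
        sleep(poll_seconds)
        waited += poll_seconds
        response = describe()
    # Any terminal status other than InService is treated as an error here
    if response["EndpointStatus"] != "InService":
        raise RuntimeError("endpoint ended in status %r" % response["EndpointStatus"])
    return response


# Simulated status sequence, so the sketch runs without AWS access
fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real waiting in the simulation
)
print(final["EndpointStatus"])
```

Injecting the `describe` callable keeps the helper independent of boto3, which makes it easy to exercise with a fake sequence as shown.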

In [58]:
linha_0 = test_data.iloc[0]
linha_0

Target         1
Feature_0      0
Feature_1     54
Feature_2      0
Feature_3      0
Feature_4     15
Feature_5      4
Feature_6     29
Feature_7     17
Feature_8      4
Feature_9      2
Feature_10    17
Feature_11     4
Feature_12    29
Feature_13     0
Feature_14     4
Feature_15     2
Feature_16    17
Feature_17     2
Name: 3, dtype: Int64

In [61]:
linha_bytes = ','.join(map(str, df.iloc[0, 1:].values)).encode()
linha_bytes

b'0,39,0,0,7,4,29,18,4,2,17,4,29,0,4,2,17,5'

In [69]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=linha_bytes,
    ContentType="text/csv",
)

print(response["Body"].read())

b'0.9499961733818054'


In [117]:
label_test = test_data['Target'].tolist()
prediction_test = []

In [87]:
len(test_data)

21982

In [95]:
indice_divisao = len(test_data) // 2

# Split the test DataFrame into two parts
df_parte1 = test_data.iloc[:indice_divisao]
df_parte2 = test_data.iloc[indice_divisao:]

In [102]:
import io

csv_file = io.StringIO()
# By default SageMaker expects comma-separated values
df_sem_primeira_coluna = test_data.iloc[:, 1:]  # drop 'Target', the first column

# No header and no index: only the feature values are sent
df_sem_primeira_coluna.to_csv(csv_file, sep=",", header=False, index=False)
my_payload_as_csv = csv_file.getvalue()
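SageMaker endpoints cap the size of a request body, so a larger test set might not fit in a single call. One option is to send the rows in batches, sketched below. The helper `csv_batches`, the batch size, and the sample rows are assumptions for illustration; with the notebook's data, the rows could come from `df_sem_primeira_coluna.values.tolist()`.

```python
def csv_batches(rows, batch_size):
    """Yield CSV payload strings, each holding at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One line per row, values joined by commas, no header
        yield "\n".join(",".join(str(value) for value in row) for row in chunk)


sample_rows = [[0, 39, 0], [1, 39, 0], [0, 45, 0]]  # made-up feature rows
payloads = list(csv_batches(sample_rows, batch_size=2))

for payload in payloads:
    print(repr(payload))
```

Each payload would then be passed as `Body` to `invoke_endpoint`, and the predictions concatenated in the same order as the batches.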

In [112]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=my_payload_as_csv,
    ContentType="text/csv")

In [113]:
response = response["Body"].read().decode('utf-8')

In [114]:
predictions = response.split(",")
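The split above assumes the response separates values with commas. Depending on the container version, the separator may instead be a newline; this is an assumption worth checking against the actual response. A tolerant parser, sketched here with the hypothetical helper `parse_predictions`, handles both.

```python
import re


def parse_predictions(body):
    """Split a text response on commas and/or whitespace and convert to floats."""
    tokens = re.split(r"[,\s]+", body.strip())
    return [float(token) for token in tokens if token]


comma_separated = parse_predictions("0.95,0.10,0.60")
newline_separated = parse_predictions("0.95\n0.10\n0.60\n")

print(comma_separated)
print(newline_separated)
```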

In [129]:
# Convert each probability to a class label using a 0.5 cutoff
prediction_test = [1 if float(item) > 0.5 else 0 for item in predictions]

In [130]:
len(label_test), len(prediction_test)

(21982, 21982)

In [131]:
from sklearn.metrics import accuracy_score, precision_score, f1_score

accuracy = accuracy_score(label_test, prediction_test)
precision = precision_score(label_test, prediction_test)
f1 = f1_score(label_test, prediction_test)

print("Accuracy: {0}\nPrecision: {1}\nF1: {2}".format(accuracy, precision, f1))

Accuracy: 0.7996997543444636
Precision: 0.801994301994302
F1: 0.8879900277290188
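Taken together, the reported precision and F1 imply a recall of roughly 0.99 (from F1 = 2PR / (P + R)), which suggests the model labels nearly every row as 1. Comparing accuracy against a majority-class baseline helps judge whether the model adds value over always predicting the most frequent class. The sketch below uses plain Python and made-up labels; `confusion_counts` and `summarize` are hypothetical helpers, not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, recall, and the majority-class baseline accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    positives = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / positives if positives else 0.0,
        # Accuracy obtained by always predicting the most frequent class
        "baseline_accuracy": max(positives, total - positives) / total,
    }


# Made-up example: a model that always predicts 1 on an 80/20 dataset
example_true = [1, 1, 1, 1, 0]
example_pred = [1, 1, 1, 1, 1]
summary = summarize(example_true, example_pred)
print(summary)
```

With the notebook's variables, the call would be `summarize(label_test, prediction_test)`; an accuracy close to the baseline would indicate that the model has learned little beyond the class balance.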


In [132]:
client.delete_model(ModelName=model_name)
client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name)
client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'e15647c4-cb82-4741-b7d3-cfe93598a299',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e15647c4-cb82-4741-b7d3-cfe93598a299',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 02 Jun 2024 19:47:04 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}