# Amazon SageMaker Studio Demo
_**Using Gradient Boosted Trees to Predict Family Expenses**_

---

This demo walks through some of the features of Amazon SageMaker Studio:

* [Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html)
  * Manage multiple experiments
  * Experiment with hyperparameters and visualizations
* [Model hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html)
  * Create endpoints to get predictions
* [SageMaker Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html)
  * Monitor model quality
  * Raise alerts when model quality degrades

---

## Contents

1. [Background](#Background) - Predict family expenses with XGBoost
1. [Data](#Data) - Prepare the dataset and upload it to S3
1. [Train](#Train) - Train the model
  - [Amazon SageMaker Experiments](#Amazon-SageMaker-Experiments)
  - [Amazon SageMaker Debugger](#Amazon-SageMaker-Debugger)
1. [Host](#Host-the-model)
1. [Model Monitor](#Amazon-SageMaker-Model-Monitor)

---

## Background

This notebook is an adaptation of [amazon-sagemaker-examples/aws_sagemaker_studio/getting_started/](https://github.com/awslabs/amazon-sagemaker-examples.git).

The family expenses model that was already implemented is adapted here for this demo.

In [20]:
import sys
!{sys.executable} -m pip install sagemaker -U
!{sys.executable} -m pip install sagemaker-experiments

Requirement already up-to-date: sagemaker in /opt/conda/lib/python3.7/site-packages (1.72.0)


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import boto3
import re

from sklearn.model_selection import train_test_split

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig
from sagemaker.model_monitor import DataCaptureConfig, DatasetFormat, DefaultModelMonitor
from sagemaker.s3 import S3Uploader, S3Downloader

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

In [22]:
sess = boto3.Session()
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()

---
## Data

The data used to train the family expenses model: `processed_data.csv`

In [23]:
local_data_path = './data/processed_data.csv'
data = pd.read_csv(local_data_path)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data



Unnamed: 0,col_0,DIRECTORIO,SECUENCIA_ENCUESTA_x,SECUENCIA_P_x,ORDEN_x,P6050,parentesco,P6020,genero,edad,sq_edad,rango_edad,P6210,P6240,P6390S1,P6510S1,P6510S2,P6585S1A1,P6585S1A2,P6585S2A1,P6585S2A2,P1653S1A1,P1653S1A2,P1653S2A1,P1653S2A2,P1653S3A1,P1653S3A2,P1653S4A1,P1653S4A2,P6750,P6760,P6779S1,P6920,P7510S2A1,P7513S6,P7513S7,P7513S6A1,P7513S7A1,P7513S7A2,VIVIENDA,SECUENCIA_ENCUESTA_y,SECUENCIA_P_y,ORDEN_y,REGION,DOMINIO,P3,P5747,P8520S1,estrato,P5090,dummy_vivienda,P5100S1,P5100S2,P5100S3,P5100S4,P5102,P5103,P5105,P1644M1,P1644M2,P1644M3,P1644M4,P1644M5,P1644M6,P1644M7,P1644S1,P1644S3,P5240,P5230,P6008,PERIODO,IT,ICGU,ICMUG,ICMDUG,GTUG,GCUG,GCMUG,gdu_alimentos_servicios,salario,sq_salario,ingreso_lulo,porcentaje_ingreso,gastos_básicos,salario_hogar,valor_arriendo,servicios_mercados,servicios_vivienda,porcentaje_gasto,valor_gasto,capc_mercado,gdp_alimentos_servicios,Desayuno_DFC,Almuerzo_DFC,Cena_DFC,Merienda_DFC,Desayuno_PFC,Almuerzo_PFC,Cena_PFC,Merienda_PFC,limpieza_cuidado,cuidado_jabones,cuidado_electrodom,alquiler_servicios,servicios_generales,ropa_accesorios,gastos_educacion,matriculas_educacion,medicamentos,consultas_médicas,servicios_salud,cocina_artículos,hogar_artículos,comunicaciones_transporte,otros_comunicacion,vehículo_productos,vehículo_servicios,cultura_recreación,recreación_fiesta,recreación_artículos,gastos_finanzas,misc_mensual,misc_anual,otra_vivienda,renovación_vivienda,gastos_viajes,personas_hogar,bin_porcentaje_ingreso,Desayuno_FC,Almuerzo_FC,Cena_FC,Merienda_FC,gastos_totales,gb_gastos_totales
0,4,118279,1,1,1,1,Jefe,1,Masculino,22,484,20 : 30,5,1,2395.0,,,,,70000.0,2.0,,,,,,,,,,,,1,,2,2,,,,76892|1413|06|0016|0003,1,1,1,PACÍFICA,YUMBO,1,2,1,1.0,3,0,,,,,,,,,,,,,,,,,2,2,3,201607,1.945000e+06,1.945000e+06,1.945000e+06,1.945000e+06,1.237871e+06,1.237871e+06,123787133333333,10038.0,1800000,3240000000000,Medio,1.000000,240038,1800000,230000,10038.0,0.0,0.133354,2.400380e+05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.150000e+04,174500.0,0.0,0.0,0.0,330000.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,1200000.0,0.0,350000.0,0.0,0.000000e+00,0.0,350000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,1,90 - 100,0.0,0.0,0.0,0.0,2.686038e+06,0.089365
1,11,118295,1,1,1,1,Jefe,1,Masculino,55,3025,50 : 60,6,1,8553.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,76892|1257|23|0008|0001,1,1,1,PACÍFICA,YUMBO,1,1,1,1.0,1,1,,,,,,,,1.0,,,,,,,,,2,2,2,201607,2.100000e+06,2.100000e+06,1.800000e+06,1.680000e+06,6.844633e+05,6.844633e+05,384463333333333,9630.0,1800000,3240000000000,Medio,1.000000,23630,1800000,0,23630.0,0.0,0.013128,2.363000e+04,14000.0,0.0,0.0,0.0,0.0,0.0,1000.0,0.0,0.0,0.0,0.000000e+00,37500.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,10000.0,0.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,1,90 - 100,1000.0,0.0,0.0,0.0,7.213000e+04,0.327603
2,30,118353,1,1,1,1,Jefe,1,Masculino,35,1225,30 : 40,5,1,8423.0,,,,,,,,,,,,,,,,,,1,,1,2,25000000.0,,,05001|1054|07|0035|0001,1,1,1,CENTRAL,MEDELLÍN Y A.M.,1,2,1,2.0,3,0,,,,,,,,,,,,,,,,,1,1,4,201607,2.525000e+06,2.525000e+06,2.275000e+06,2.275000e+06,9.405617e+05,9.405617e+05,940561666666667,101706.0,2200000,4840000000000,Alto,1.000000,384126,2200000,250000,134126.0,0.0,0.174603,3.841260e+05,12500.0,19920.0,0.0,0.0,0.0,0.0,13000.0,38520.0,16500.0,9760.0,6.500000e+03,3000.0,0.0,0.0,0.0,102000.000000,0.0,5833.333333,2000.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25000.000000,1,90 - 100,13000.0,38520.0,16500.0,9760.0,6.062393e+05,0.633621
3,31,118354,1,1,1,1,Jefe,1,Masculino,38,1444,30 : 40,5,1,8422.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,05001|1054|07|0037|0006,1,1,1,CENTRAL,MEDELLÍN Y A.M.,1,1,1,2.0,2,1,98.0,98.0,98.0,98.0,1.0,80000000.0,1.0,1.0,,,,,,,,,2,2,6,201607,2.892500e+06,2.892500e+06,2.267500e+06,2.067500e+06,8.834550e+06,2.137883e+06,1617050,510176.0,1500000,2250000000000,Medio,0.681818,881254,2200000,0,700996.0,180258.0,0.400570,6.008550e+05,84700.0,106120.0,0.0,0.0,0.0,0.0,38734.0,48150.0,25970.0,9844.0,4.990000e+04,55500.0,0.0,0.0,0.0,0.000000,0.0,10000.000000,2333.333333,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,3.000000e+04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,6,60 - 70,38734.0,48150.0,25970.0,9844.0,1.151685e+06,0.765186
4,38,118372,1,1,1,1,Jefe,1,Masculino,31,961,30 : 40,6,1,8412.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,05001|1106|02|0016|0001,1,1,1,CENTRAL,MEDELLÍN Y A.M.,1,1,1,3.0,1,1,,,,,,,,1.0,,,,,,,,,1,2,1,201607,5.450000e+06,5.450000e+06,4.700000e+06,4.300000e+06,2.287731e+06,2.249731e+06,149973133333333,379867.0,4700000,22090000000000,Muy Alto,1.000000,379867,4700000,0,379867.0,0.0,0.080823,3.798670e+05,0.0,0.0,55150.0,199598.0,77420.0,0.0,0.0,0.0,0.0,0.0,1.225000e+04,13000.0,0.0,0.0,0.0,132000.000000,0.0,0.000000,3500.000000,0.0,0.000000,55000.0,120000.0,0.0,0.0,0.0,3.800000e+04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64166.666667,1,90 - 100,55150.0,199598.0,77420.0,0.0,1.149952e+06,0.330333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12312,56587,734490,1,1,1,1,Jefe,1,Masculino,53,2809,50 : 60,6,1,8610.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,76001|1104|08|0017|0002,1,1,1,PACÍFICA,CALI,1,2,1,3.0,1,1,,,,,,,,1.0,,,,,,,,,2,2,1,201706,5.991667e+06,5.991667e+06,5.341667e+06,5.021667e+06,5.806343e+06,5.614676e+06,4964676,187036.0,4500000,20250000000000,Muy Alto,1.000000,187036,4500000,0,187036.0,0.0,0.041564,1.870360e+05,0.0,0.0,67060.0,329880.0,53500.0,0.0,0.0,0.0,0.0,0.0,1.336667e+04,38500.0,0.0,0.0,62500.0,480000.000000,0.0,0.000000,0.000000,0.0,10833.333333,0.0,0.0,0.0,0.0,0.0,1.291667e+05,0.0,0.0,0.0,800000.0,0.0,0.0,0.0,0.0,0.000000,1,90 - 100,67060.0,329880.0,53500.0,0.0,2.171843e+06,0.086119
12313,56590,734599,1,1,1,1,Jefe,1,Masculino,68,4624,60 : 70,6,1,8544.0,,,,,,,,,,,,,,,,,,3,,2,2,,,,76520|2001|03|0002|0001,1,1,1,PACÍFICA,OTRAS CABECERAS,1,1,1,5.0,3,0,,,,,,,,,,,,,,,,,1,2,3,201706,2.575000e+07,2.575000e+07,2.575000e+07,2.327000e+07,9.895049e+06,9.420049e+06,942004933333333,289866.0,4000000,16000000000000,Muy Alto,0.444444,1892186,9000000,1200000,692186.0,0.0,0.210243,8.409716e+05,0.0,402320.0,0.0,0.0,9600.0,0.0,39590.0,0.0,449400.0,7000.0,1.139400e+06,792599.0,255000.0,0.0,0.0,166666.666667,0.0,0.000000,43333.333333,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,1.091667e+06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,2,40 - 50,39590.0,0.0,459000.0,7000.0,5.886442e+06,0.321448
12314,56591,734599,2,1,2,2,Conyuge,2,Femenino,50,2500,50 : 60,6,1,9411.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,76520|2001|03|0002|0001,1,1,1,PACÍFICA,OTRAS CABECERAS,1,1,1,5.0,3,0,,,,,,,,,,,,,,,,,1,2,3,201706,2.575000e+07,2.575000e+07,2.575000e+07,2.327000e+07,9.895049e+06,9.420049e+06,942004933333333,289866.0,5000000,25000000000000,Muy Alto,0.555556,1892186,9000000,1200000,692186.0,0.0,0.210243,1.051214e+06,0.0,402320.0,0.0,0.0,9600.0,0.0,39590.0,0.0,449400.0,7000.0,1.139400e+06,792599.0,255000.0,0.0,0.0,166666.666667,0.0,0.000000,43333.333333,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,1.091667e+06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,2,50 - 60,39590.0,0.0,459000.0,7000.0,5.886442e+06,0.321448
12315,56599,742592,1,1,1,1,Jefe,2,Femenino,36,1296,30 : 40,6,6,6820.0,,,,,,,,,,,,,,,,,,1,,2,2,,,,11001|1026|42|0005|0020,1,1,1,BOGOTÁ,BOGOTÁ,1,2,1,5.0,1,1,,,,,,,,1.0,,,,,,,,,2,2,3,201706,9.850000e+06,9.850000e+06,6.350000e+06,6.030000e+06,1.000295e+07,9.527950e+06,6022950,439290.0,5000000,25000000000000,Muy Alto,1.000000,439290,5000000,0,439290.0,0.0,0.087858,4.392900e+05,0.0,0.0,0.0,308160.0,0.0,0.0,0.0,0.0,0.0,0.0,1.070000e+06,50000.0,0.0,623000.0,375000.0,0.000000,1300000.0,166666.666667,0.000000,0.0,104166.666667,50000.0,0.0,0.0,0.0,0.0,2.166667e+05,0.0,400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,1,90 - 100,0.0,308160.0,0.0,0.0,5.102950e+06,0.086085


In [24]:
#S3://lbk-analytics-dev/sagemaker/data
account_id = sess.client('sts', region_name=sess.region_name).get_caller_identity()["Account"]
bucket = 'lbk-analytics-dev'
prefix = 'sagemaker/xgboost'

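# Target first: the built-in XGBoost CSV format expects the label in the first column
df_model = data[['valor_gasto','edad','salario']]
# Drop rows without a target value; copy so the new column is added to an independent frame
df_model = df_model[~df_model['valor_gasto'].isnull()].copy()
df_model['edad_salario'] = df_model['edad'] * df_model['salario']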

train, validate = train_test_split(df_model, test_size=0.2, random_state=42)
train, test = train_test_split(train, test_size=0.2, random_state=42)

train.to_csv("data/train.csv", sep=',', decimal='.', index=False, header=False)
validate.to_csv("data/validate.csv", sep=',', decimal='.', index=False, header=False)
test.to_csv("data/test.csv", sep=',', decimal='.', index=False, header=False)

s3url = S3Uploader.upload('data/train.csv', 's3://{}/{}/{}'.format(bucket, prefix,'data'))
print(s3url)

s3url = S3Uploader.upload('data/validate.csv', 's3://{}/{}/{}'.format(bucket, prefix,'data'))
print(s3url)

s3url = S3Uploader.upload('data/test.csv', 's3://{}/{}/{}'.format(bucket, prefix,'data'))
print(s3url)

s3://lbk-analytics-dev/sagemaker/xgboost/data/train.csv
s3://lbk-analytics-dev/sagemaker/xgboost/data/validate.csv
s3://lbk-analytics-dev/sagemaker/xgboost/data/test.csv
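The two consecutive `train_test_split` calls above do not give an 80/20 split overall, because the second call takes 20% of the remaining 80%. The sketch below works out the resulting sizes with plain arithmetic. It assumes scikit-learn's rounding for a float `test_size` (the held-out part is rounded up); `split_sizes` and the 1,000-row example are hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, validate_frac=0.2, test_frac=0.2):
    """Return (train, test, validate) row counts for two chained splits.

    Mirrors the notebook: first hold out `validate_frac` of all rows,
    then hold out `test_frac` of what remains.
    Assumption: the held-out size is ceil(fraction * rows).
    """
    n_validate = math.ceil(validate_frac * n_rows)
    remaining = n_rows - n_validate
    n_test = math.ceil(test_frac * remaining)
    n_train = remaining - n_test
    return n_train, n_test, n_validate

# Hypothetical 1,000-row dataset: 64% train, 16% test, 20% validate
sizes = split_sizes(1000)
print(sizes)
```

In other words, about 64% of the rows end up in `train.csv`, 16% in `test.csv`, and 20% in `validate.csv`.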


---
## Train

We'll use the XGBoost library (gradient boosted decision trees) with the data stored in the bucket.

The location of the container that holds the XGBoost algorithm must be specified.

In [25]:
from sagemaker.amazon.amazon_estimator import get_image_uri
docker_image_name = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='1.0-1')



Because training uses CSV files, we create `s3_input` objects that the training function can consume.

In [26]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/data/train.csv'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/data/validate.csv'.format(bucket, prefix), content_type='csv')
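For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row, which is why the files above are written with `header=False` and `valor_gasto` first. The following is a minimal sketch of a sanity check you could run before uploading. `first_row_is_numeric` is a hypothetical helper, not a SageMaker API, and the sample values are taken from the first row of the table shown earlier.

```python
import csv
import io

def first_row_is_numeric(csv_text):
    """Return True when every field of the first row parses as a float.

    A header such as "valor_gasto,edad,salario" fails this check,
    which signals that the file still carries column names.
    """
    first_row = next(csv.reader(io.StringIO(csv_text)), None)
    if not first_row:
        return False
    try:
        for field in first_row:
            float(field)
    except ValueError:
        return False
    return True

with_header = "valor_gasto,edad,salario\n240038.0,22,1800000\n"
without_header = "240038.0,22,1800000\n"
print(first_row_is_numeric(with_header))     # False
print(first_row_is_numeric(without_header))  # True
```

Note that an empty field in the first row also fails the check, so a `False` result can mean either a header or a missing value.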



### Amazon SageMaker Experiments

Amazon SageMaker Experiments lets you track model training, organize related models, and log model configuration and parameters, so that you can review earlier models and compare them.

Each block of model training is called an "experiment trial", and trials can be compared with one another.

In [27]:
sess = sagemaker.session.Session()

create_date = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# family_expensives_experiment = Experiment.create(experiment_name="family-expensives-prediction-xgboost-{}".format(create_date), 
#                                               description="Using xgboost to predict family expensives", 
#                                               sagemaker_boto_client=boto3.client('sagemaker'))

family_expensives_experiment = Experiment.create(experiment_name="Experiment-family-expensives-prediction-xgboost", 
                                              description="Using xgboost to predict family expensives", 
                                              sagemaker_boto_client=boto3.client('sagemaker'))

In [28]:
# hyperparams = {"max_depth":5,
#                "subsample":0.8,
#                "num_round":600,
#                "eta":0.2,
#                "gamma":4,
#                "min_child_weight":6,
#                "silent":0,
#                "objective":'binary:logistic'}

# (                    
# n_estimators=10,
# missing=None,
# )

hyperparams = {"max_depth":20,
               "subsample":0.5,               
               "eta":0.2,
               "gamma":0.0,
               "min_child_weight":0.0,
               "silent":0,
               "max_delta_step":0.0,
               "colsample_bytree":1.0,
               "colsample_bylevel":1.0,
               "reg_alpha":0.0,
               "nthread":4,
               "scale_pos_weight":1.0,
               "base_score":0.5,
               "seed":1337,                              
               "num_round":10,
               "objective":'reg:squarederror'
}

#### Trial 1 - XGBoost

We'll use the XGBoost algorithm to train and deploy the model.

We create an estimator with the basic parameters, such as the instance type and instance count used for training, plus the path for the artifacts generated by the training run.

We create a `Trial` object to associate the experiment with the training run.

In [29]:
trial = Trial.create(trial_name="Trial-family-expensives-prediction-xgboost-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime())), 
                     experiment_name=family_expensives_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

xgb = sagemaker.estimator.Estimator(image_name=docker_image_name,
                                    role=role,
                                    hyperparameters=hyperparams,
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    base_job_name="demo-xgboost-customer-churn",
                                    sagemaker_session=sess)


xgb.fit({'train': s3_input_train,
         'validation': s3_input_validation}, 
        experiment_config={
            "ExperimentName": family_expensives_experiment.experiment_name, 
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }
       )

INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-22-42-57-074


2020-07-30 22:42:57 Starting - Starting the training job...
2020-07-30 22:43:00 Starting - Launching requested ML instances......
2020-07-30 22:44:15 Starting - Preparing the instances for training......
2020-07-30 22:45:07 Downloading - Downloading input data...
2020-07-30 22:45:30 Training - Downloading the training image..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[22:46:04] 7882x3 matrix with 23646 entries loaded from /opt/ml/input/data/t

#### Trial results

You can review the metrics, logs, and charts for each training run in the **Experiments** tab in Amazon SageMaker Studio.

To view the trials of an experiment, double-click the experiment. You can also view the trials of multiple experiments by selecting them with Ctrl and choosing "Open in trial component list" from the context menu.

The components are sorted so that the best model appears first.

#### Trial information and model download

With a trial selected, you can see its details, charts, and metrics, along with the S3 location of the model.

#### Modifying hyperparameters

We'll vary the `min_child_weight` parameter. For each value, we'll create a separate trial so that we can compare the results in Amazon SageMaker Studio later.

In [33]:
min_child_weights = [1, 2, 4, 8, 10]

for weight in min_child_weights:
    hyperparams["min_child_weight"] = weight
    trial = Trial.create(trial_name="Trial-family-expensives-prediction-xgboost-weight-{}-{}".format(weight,strftime("%Y-%m-%d-%H-%M-%S", gmtime())), 
                         experiment_name=family_expensives_experiment.experiment_name,
                         sagemaker_boto_client=boto3.client('sagemaker'))

    t_xgb = sagemaker.estimator.Estimator(image_name=docker_image_name,
                                          role=role,
                                          hyperparameters=hyperparams,
                                          train_instance_count=1, 
                                          train_instance_type='ml.m4.xlarge',
                                          output_path='s3://{}/{}/output'.format(bucket, prefix),
                                          base_job_name="demo-xgboost-customer-churn",
                                          sagemaker_session=sess)

    t_xgb.fit({'train': s3_input_train,
               'validation': s3_input_validation},
                wait=False,
                experiment_config={
                    "ExperimentName": family_expensives_experiment.experiment_name, 
                    "TrialName": trial.trial_name,
                    "TrialComponentDisplayName": "Training",
                }
               )

INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-23-45-39-536
INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-23-45-39-897
INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-23-45-43-845
INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-23-45-46-265
INFO:sagemaker:Creating training-job with name: demo-xgboost-customer-churn-2020-07-30-23-45-46-804
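The loop above updates the shared `hyperparams` dictionary in place, so once it finishes the dictionary holds the last value tried. The sketch below shows an alternative that builds one independent configuration per swept value. `sweep_configs` and its trial-name format are illustrative assumptions, not part of the SageMaker SDK; underscores are replaced with hyphens only to stay close to the notebook's naming.

```python
from time import gmtime, strftime

def sweep_configs(base_params, name, values, prefix="Trial"):
    """Yield (trial_name, params) pairs, one per swept value.

    Each params dict is a fresh copy, so `base_params` is never mutated.
    """
    stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    for value in values:
        params = {**base_params, name: value}
        trial_name = "{}-{}-{}-{}".format(prefix, name.replace("_", "-"), value, stamp)
        yield trial_name, params

# Reduced, hypothetical base configuration for the example
base = {"max_depth": 20, "eta": 0.2, "min_child_weight": 0.0}
configs = list(sweep_configs(base, "min_child_weight", [1, 2, 4, 8, 10]))

for trial_name, params in configs:
    print(trial_name, params["min_child_weight"])
```

Each `params` dictionary could then be passed as `hyperparameters` to a separate estimator, leaving the base configuration untouched for later cells.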


#### Create charts

To create a chart, multi-select the components. Because this is a sample training run and the data is sparse, there's not much to chart in a time series. However, we can create a scatter plot for the parameter sweep. The following image is an example.

![scatter plot example](./images/scatter_plot_example.png)

##### How to create a scatter plot

Multi-select components, then choose **Add chart**. In the **Chart Properties** panel, choose **Summary Statistics** as the **Data type**. For **Chart type**, choose **Scatter plot**. Choose the hyperparameter `min_child_weight` as the X-axis (because this is the hyperparameter that you are iterating on in this notebook). For Y-axis metrics, choose one of the validation metrics that the training job reports. Because this notebook trains with `reg:squarederror`, that is likely to be an RMSE metric such as `validation:rmse_last` or `validation:rmse_avg` rather than a classification error. Color them by choosing `trialComponentName`.

![create a scatter plot](./images/create_a_scatter_plot.gif)

You can also adjust the chart at any time by changing the components that are selected. And you can zoom in and out. Each item on the graph displays contextual information.

![adjust a scatter plot](./images/adjust_a_scatter_plot.gif)

---
## Host the model

Now that we've trained the model, let's deploy it to a hosted endpoint. To monitor the model after it's hosted and serving requests, we'll also add configurations to capture data that is being sent to the endpoint.

In [None]:
data_capture_prefix = '{}/datacapture'.format(prefix)

endpoint_name = "Endpoint-family-expensives-xgboost-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName = {}".format(endpoint_name))

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, 
                           instance_type='ml.m4.xlarge',
                           endpoint_name=endpoint_name,
                           data_capture_config=DataCaptureConfig(enable_capture=True,
                                                                 sampling_percentage=100,
                                                                 destination_s3_uri='s3://{}/{}'.format(bucket, data_capture_prefix)
                                                                )
                           )

### Invoke the deployed model

Now that we have a hosted endpoint running, we can make real-time predictions from our model by making an HTTP POST request. But first, we need to set up the serializer and deserializer for passing our test rows as CSV to the model behind the endpoint.

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None
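With `text/csv` as the content type, each request body is one comma-separated row of features in the same column order as the training files, without the label. The following is a rough sketch of that serialization step, assuming the three features used in this notebook (`edad`, `salario`, `edad_salario`). `to_csv_payload` is a hypothetical helper that only approximates what the SDK serializer does.

```python
def to_csv_payload(features):
    """Join one row of feature values into a CSV request body."""
    return ",".join(str(value) for value in features)

edad, salario = 22, 1800000
row = [edad, salario, edad * salario]  # same feature order as the training files
payload = to_csv_payload(row)
print(payload)  # 22,1800000,39600000
```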

In [None]:
print("Sending test traffic to the endpoint {}. \nPlease wait for a minute...".format(endpoint_name))

with open('data/test_sample.csv', 'r') as f:
    for row in f:
        payload = row.rstrip('\n')
        response = xgb_predictor.predict(data=payload)
        time.sleep(0.5)

### Verify that data is captured in Amazon S3

Because data capture is enabled on the endpoint, the requests we just sent should also have been captured for monitoring purposes.

Let's list the data capture files stored in Amazon S3. Expect to see different files from different time periods organized based on the hour in which the invocation occurred. The format of the Amazon S3 path is:

`s3://{destination-bucket-prefix}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl`
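To make the path format concrete, here is a small sketch that assembles the hourly prefix from its parts. The endpoint name and timestamp are hypothetical, `AllTraffic` is an assumed variant name, and the hour is assumed to be in UTC; check the prefixes that actually appear in your bucket.

```python
from datetime import datetime, timezone

def capture_prefix(destination, endpoint_name, variant_name, when):
    """Build the hourly S3 prefix under which capture files are written."""
    return "{}/{}/{}/{:%Y/%m/%d/%H}".format(
        destination.rstrip("/"), endpoint_name, variant_name, when
    )

prefix_for_hour = capture_prefix(
    "s3://lbk-analytics-dev/sagemaker/xgboost/datacapture",
    "Endpoint-family-expensives-xgboost-2020-07-30-23-50-00",  # hypothetical endpoint name
    "AllTraffic",  # assumed variant name
    datetime(2020, 7, 31, 0, 5, tzinfo=timezone.utc),  # hypothetical invocation time
)
print(prefix_for_hour)
```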

In [None]:
from time import sleep

current_endpoint_capture_prefix = '{}/{}'.format(data_capture_prefix, endpoint_name)
for _ in range(12): # wait up to a minute to see captures in S3
    capture_files = S3Downloader.list("s3://{}/{}".format(bucket, current_endpoint_capture_prefix))
    if capture_files:
        break
    sleep(5)

print("Found Data Capture Files:")
print(capture_files)

All the captured data is stored in SageMaker-specific JSON Lines files. Next, let's take a quick peek at the contents of a single line, pretty-printed as JSON, so that we can observe the format a little better.

In [None]:
capture_file = S3Downloader.read_file(capture_files[-1])

print("=====Single Data Capture====")
print(json.dumps(json.loads(capture_file.split('\n')[0]), indent=2)[:2000])

As you can see, each inference request is captured in one line in the jsonl file. The line contains both the input and output merged together. In our example, we provided the ContentType as `text/csv`, which is reflected in the `observedContentType` value. The `encoding` value shows the encoding that was used for the input and output payloads in the capture format.
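The sketch below shows one way to pull the input row and the prediction out of a single captured line. The sample record is made up for illustration, and the key names `captureData`, `endpointInput`, `endpointOutput`, and `eventMetadata` are an assumed layout; compare them with the output of the previous cell before relying on them.

```python
import json

# Illustrative record only: the values are invented and the layout is assumed.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "22,1800000,39600000",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv; charset=utf-8",
            "mode": "OUTPUT",
            "data": "240038.0",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {"eventId": "example-id", "inferenceTime": "2020-07-31T00:05:00Z"},
})

def summarize_capture(line):
    """Extract the input row, output value, content type, and encoding from one line."""
    capture = json.loads(line)["captureData"]
    return {
        "input": capture["endpointInput"]["data"],
        "output": capture["endpointOutput"]["data"],
        "content_type": capture["endpointInput"]["observedContentType"],
        "encoding": capture["endpointInput"]["encoding"],
    }

summary = summarize_capture(sample_line)
print(summary)
```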

To recap, we have observed how you can enable capturing the input and/or output payloads to an endpoint with a new parameter. We have also observed what the captured format looks like in S3. Let's continue to explore how SageMaker helps with monitoring the data collected in S3.

---
## Amazon SageMaker Model Monitor

Amazon SageMaker Model Monitor lets you monitor and evaluate the data observed by endpoints. It works like this:
1. We need to create a baseline that we can use to compare real-time traffic against. 
1. When a baseline is ready, we can set up a schedule to continuously evaluate and compare against the baseline.
1. We can send synthetic traffic to trigger alarms.

**Important**: It takes an hour or more to complete this section because the shortest monitoring polling time is one hour. The following graphic shows how the monitoring results look after running for a few hours and some of the errors triggered by synthetic traffic.

![model monitor example](./images/view_model_monitor_output.gif)

### Baselining and continuous monitoring

#### 1. Constraint suggestion with the baseline (training) dataset

The training dataset that you use to train a model is usually a good baseline dataset. Note that the training dataset data schema and the inference dataset schema must match exactly (for example, they should have the same number and type of features).

Using our training dataset, let's ask Amazon SageMaker Model Monitor to suggest a set of baseline `constraints` and generate descriptive `statistics` so we can explore the data. For this example, let's upload the training dataset that we used to train the model. We'll use the dataset file with column headers so we have descriptive feature names.

In [None]:
baseline_prefix = prefix + '/baselining'
baseline_data_prefix = baseline_prefix + '/data'
baseline_results_prefix = baseline_prefix + '/results'

baseline_data_uri = 's3://{}/{}'.format(bucket,baseline_data_prefix)
baseline_results_uri = 's3://{}/{}'.format(bucket, baseline_results_prefix)
print('Baseline data uri: {}'.format(baseline_data_uri))
print('Baseline results uri: {}'.format(baseline_results_uri))
baseline_data_path = S3Uploader.upload("data/training-dataset-with-header.csv", baseline_data_uri)

##### Create a baselining job with the training dataset

Now that we have the training data ready in S3, let's start a job to `suggest` constraints. To generate the constraints, the convenient helper starts a `ProcessingJob` using a ProcessingJob container provided by Amazon SageMaker.

In [None]:
my_default_monitor = DefaultModelMonitor(role=role,
                                         instance_count=1,
                                         instance_type='ml.m5.xlarge',
                                         volume_size_in_gb=20,
                                         max_runtime_in_seconds=3600,
                                        )

baseline_job = my_default_monitor.suggest_baseline(baseline_dataset=baseline_data_path,
                                                   dataset_format=DatasetFormat.csv(header=True),
                                                   output_s3_uri=baseline_results_uri,
                                                   wait=True
)

Once the job succeeds, we can explore the `baseline_results_uri` location in S3 to see what files were stored there.

In [None]:
print("Found Files:")
S3Downloader.list("s3://{}/{}".format(bucket, baseline_results_prefix))

We have a `constraints.json` file that has information about suggested constraints. We also have a `statistics.json` file, which contains statistical information about the data in the baseline.

In [None]:
baseline_job = my_default_monitor.latest_baselining_job
schema_df = pd.io.json.json_normalize(baseline_job.baseline_statistics().body_dict["features"])
schema_df.head(10)

In [None]:
constraints_df = pd.io.json.json_normalize(baseline_job.suggested_constraints().body_dict["features"])
constraints_df.head(10)

#### 2. Analyzing subsequent captures for data quality issues

Now that we've generated a baseline dataset and processed it to get baseline statistics and constraints, let's monitor and analyze the data being sent to the endpoint with monitoring schedules.

##### Create a schedule
First, let's create a monitoring schedule for the endpoint. The schedule specifies the cadence at which we want to run a new processing job so that we can compare recent data captures to the baseline.

In [None]:
# First, copy over some test scripts to the S3 bucket so that they can be used for pre and post processing
code_prefix = '{}/code'.format(prefix)
pre_processor_script = S3Uploader.upload('preprocessor.py', 's3://{}/{}'.format(bucket,code_prefix))
s3_code_postprocessor_uri = S3Uploader.upload('postprocessor.py', 's3://{}/{}'.format(bucket,code_prefix))

We are now ready to create a model monitoring schedule for the endpoint created earlier, using the baseline resources (constraints and statistics) generated above.

In [None]:
from sagemaker.model_monitor import CronExpressionGenerator
from time import gmtime, strftime

reports_prefix = '{}/reports'.format(prefix)
s3_report_path = 's3://{}/{}'.format(bucket,reports_prefix)

mon_schedule_name = 'demo-xgboost-customer-churn-model-schedule-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
my_default_monitor.create_monitoring_schedule(monitor_schedule_name=mon_schedule_name,
                                              endpoint_input=xgb_predictor.endpoint,
                                              #record_preprocessor_script=pre_processor_script,
                                              post_analytics_processor_script=s3_code_postprocessor_uri,
                                              output_s3_uri=s3_report_path,
                                              statistics=my_default_monitor.baseline_statistics(),
                                              constraints=my_default_monitor.suggested_constraints(),
                                              schedule_cron_expression=CronExpressionGenerator.hourly(),
                                              enable_cloudwatch_metrics=True,
                                             )

#### 3. Start generating some artificial traffic
The following block starts a thread to send some traffic to the endpoint. This allows us to continue to send traffic to the endpoint so that we'll have data continually captured for analysis. If there is no traffic, the monitoring jobs will start to fail later.

To terminate this thread, you need to stop the kernel.

In [None]:
from threading import Thread
from time import sleep  # imported here so this section can run on its own

runtime_client = boto3.client('runtime.sagemaker')

# (just repeating code from above for convenience/ able to run this section independently)
def invoke_endpoint(ep_name, file_name, runtime_client):
    with open(file_name, 'r') as f:
        for row in f:
            payload = row.rstrip('\n')
            response = runtime_client.invoke_endpoint(EndpointName=ep_name,
                                          ContentType='text/csv', 
                                          Body=payload)
            response['Body'].read()
            sleep(1)
            
def invoke_endpoint_forever():
    while True:
        invoke_endpoint(endpoint_name, 'data/test-dataset-input-cols.csv', runtime_client)
        
thread = Thread(target = invoke_endpoint_forever)
thread.start()

# Note that you need to stop the kernel to stop the invocations
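If stopping the kernel is inconvenient, a common alternative is to let the loop check a `threading.Event`, so the thread can be ended from another cell. This is a sketch of that pattern, not the notebook's own approach. `start_traffic` is a hypothetical helper, and the `send_one` callable here is a stand-in for the real endpoint call so that the example runs without AWS access.

```python
import threading
import time

def start_traffic(send_one, interval_seconds=1.0):
    """Run `send_one()` repeatedly in a background thread until stopped.

    Returns (thread, stop_event); call stop_event.set() to end the loop.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            send_one()
            # wait() returns early as soon as the event is set
            stop_event.wait(interval_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread, stop_event

# Stand-in for a real endpoint call
sent = []
thread, stop_event = start_traffic(lambda: sent.append("22,1800000,39600000"), 0.01)
time.sleep(0.05)
stop_event.set()
thread.join(timeout=1)
print("requests sent:", len(sent), "| still running:", thread.is_alive())
```

Marking the thread as a daemon also means it does not keep the Python process alive if the stop event is never set.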

##### List executions
Once the schedule is set up, jobs start at the specified intervals. The following code lists the last five executions. If you run this code soon after creating the hourly schedule, you might not see any executions listed. To see executions, you might have to wait until you cross the hour boundary (in UTC). The code includes the logic for waiting.

In [None]:
mon_executions = my_default_monitor.list_executions()
if len(mon_executions) == 0:
    print("We created an hourly schedule above and it will kick off executions ON the hour.\nWe will have to wait till we hit the hour...")

while len(mon_executions) == 0:
    print("Waiting for the 1st execution to happen...")
    time.sleep(60)
    mon_executions = my_default_monitor.list_executions()  
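To get a feel for how long the polling loop above may run, you can compute the time left until the next hour boundary in UTC. This is only a sketch: `seconds_until_next_hour` is a hypothetical helper, and the result is best read as a lower bound, since an execution may not appear at the exact top of the hour.

```python
from datetime import datetime, timedelta, timezone

def seconds_until_next_hour(now):
    """Return the whole seconds from `now` to the next top of the hour."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return int((next_hour - now).total_seconds())

# 22:42:57 UTC is 17 minutes and 3 seconds before 23:00:00
example = datetime(2020, 7, 30, 22, 42, 57, tzinfo=timezone.utc)
wait_example = seconds_until_next_hour(example)
print(wait_example)  # 1023

# Time remaining from the current moment
print(seconds_until_next_hour(datetime.now(timezone.utc)))
```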

In [None]:
#Evaluate the latest execution and list the generated reports
latest_execution = mon_executions[-1]
latest_execution.wait()

In [None]:
print("Latest execution result: {}".format(latest_execution.describe()['ExitMessage']))
report_uri = latest_execution.output.destination

print("Found Report Files:")
S3Downloader.list(report_uri)

In [None]:
#If there are any violations compared to the baseline, they will be generated here. Let's list the violations.
violations = my_default_monitor.latest_monitoring_constraint_violations()
pd.set_option('display.max_colwidth', -1)
constraints_df = pd.io.json.json_normalize(violations.body_dict["violations"])
constraints_df.head(10)
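When many violations are reported, a quick count per check type can be easier to read than the full table. The sketch below assumes that each entry in the `violations` list carries `feature_name` and `constraint_check_type` fields. The sample payload is invented for illustration, using the feature names from this notebook; with real results you could pass `violations.body_dict` instead.

```python
from collections import Counter

# Illustrative payload: check types and descriptions are made-up examples.
violations_doc = {
    "violations": [
        {"feature_name": "edad", "constraint_check_type": "data_type_check",
         "description": "example description"},
        {"feature_name": "salario", "constraint_check_type": "data_type_check",
         "description": "example description"},
        {"feature_name": "edad_salario", "constraint_check_type": "completeness_check",
         "description": "example description"},
    ]
}

def count_by_check_type(doc):
    """Count violations per constraint check type."""
    return Counter(v["constraint_check_type"] for v in doc.get("violations", []))

counts = count_by_check_type(violations_doc)
for check_type, count in counts.most_common():
    print(check_type, count)
```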

In [None]:
# You can plug the processing job ARN for a single monitoring execution into this notebook to see more detailed visualizations of the violations and the distribution statistics of the data capture that was processed in that execution
latest_execution.describe()['ProcessingJobArn']