# [Module 4.1] Creating an XGBoost Model


This notebook takes the dimensionality-reduced data from the previous step and trains an XGBoost model with the SageMaker built-in XGBoost algorithm.<br>
The training itself runs on a SageMaker cloud instance.

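This notebook performs the following steps:
- Retrieve the built-in XGBoost algorithm Docker image
- Prepare the dimensionality-reduced train and validation input data
- Train to create the XGBoost model
- Save the XGBoost model artifact path and the XGBoost Docker image path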


Upgrade the SageMaker SDK.

In [1]:
!pip install --upgrade sagemaker

Requirement already up-to-date: sagemaker in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (2.5.0)


In [2]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

role = sagemaker.get_execution_role()

In [3]:
%store -r
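
`%store -r` restores variables that earlier notebooks saved with `%store`. The cells below read `preprocessed_pca_train_path`, `preprocessed_pca_validation_path`, `bucket`, and `prefix`, so it can help to confirm they were restored before starting a training job. The following is a minimal sketch; the helper `missing_names` and the example values are hypothetical and not part of the original notebook.

```python
# Names that later cells in this notebook read; they are expected to be restored by %store -r.
REQUIRED_NAMES = [
    "preprocessed_pca_train_path",
    "preprocessed_pca_validation_path",
    "bucket",
    "prefix",
]


def missing_names(namespace, required=REQUIRED_NAMES):
    """Return the required names that are absent from the given namespace dict."""
    return [name for name in required if name not in namespace]


# Hypothetical namespace for illustration; in the notebook one would pass globals().
example_namespace = {"bucket": "my-bucket", "prefix": "my/prefix"}
print(missing_names(example_namespace))
```

In the notebook itself, `missing_names(globals())` should return an empty list; a non-empty list suggests that an earlier module was not run to completion.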

## Retrieving the built-in XGBoost algorithm Docker image


In [4]:
from sagemaker import image_uris, session
xgb_image = image_uris.retrieve("xgboost", session.Session().boto_region_name, version="latest")
print("xgb_image: ", xgb_image)

xgb_image:  306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:latest
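
The printed value is an Amazon ECR image URI of the form `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below splits the URI from the output above into those parts, which can be a quick way to confirm that the image region matches the session region. The helper `parse_ecr_uri` is hypothetical, and the pattern is an assumption that covers only the standard `amazonaws.com` domain.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri}")
    return match.groupdict()


# URI copied from the cell output above.
parts = parse_ecr_uri("306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:latest")
print(parts)
```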


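## Preparing the dimensionality-reduced train and validation input data

Create the objects that specify the S3 path, file format, and related settings of the dimensionality-reduced train and validation data produced in the previous notebook.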

In [5]:
s3_input_train_processed = sagemaker.inputs.TrainingInput(
    preprocessed_pca_train_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print("S3 Train input: \n")
print(s3_input_train_processed.config)
s3_input_validation_processed = sagemaker.inputs.TrainingInput(
    preprocessed_pca_validation_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print("\nS3 Validation input: \n")
print(s3_input_validation_processed.config)

S3 Train input: 

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-26-07-40-44-612', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}

S3 Validation input: 

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/transformtrain-pca-validation-output/pca-2020-08-26-10-50-14-889', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
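
With `content_type='text/csv'`, the built-in XGBoost algorithm expects CSV files without a header row and with the label in the first column (the training log further down reports `label_column=0`). The sketch below checks a small CSV sample against that layout before it is uploaded. The helper `check_xgboost_csv` and the sample rows are hypothetical, and the 0/1 label check assumes the binary objective used later in this notebook.

```python
import csv
import io


def check_xgboost_csv(text):
    """Validate a headerless numeric CSV whose first column is a 0/1 label.

    Returns (row_count, column_count); raises ValueError on any violation.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("empty input")
    n_cols = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} columns, expected {n_cols}")
        # float() raises ValueError on a header row or any non-numeric cell.
        values = [float(cell) for cell in row]
        if values[0] not in (0.0, 1.0):
            raise ValueError(f"row {i} has label {values[0]}, expected 0 or 1")
    return len(rows), n_cols


# Hypothetical sample: label first, then three feature columns.
sample = "1,0.5,-1.2,3.0\n0,0.1,0.4,-2.2\n1,-0.3,0.9,1.1\n"
print(check_xgboost_csv(sample))
```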


## Training to create the XGBoost model

#### The cell below takes about 5 minutes. Wait until the [*] marker next to the cell changes to [number] (e.g. [13]).

In [6]:
sess = sagemaker.Session()
instance_type = 'ml.m4.2xlarge'


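xgb = sagemaker.estimator.Estimator(xgb_image, # Built-in XGBoost Container
                                    role, 
                                    instance_count=1, 
                                    instance_type=instance_type,
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),  # where model.tar.gz is written
                                    sagemaker_session=sess
                                   )
xgb.set_hyperparameters(max_depth=5,                  # maximum depth of each tree
                        eta=0.2,                      # learning rate (step-size shrinkage)
                        gamma=4,                      # minimum loss reduction required to split a node
                        min_child_weight=6,           # minimum sum of instance weights in a child
                        subsample=0.8,                # fraction of training rows sampled per tree
                        silent=0,                     # 0 prints running messages
                        objective='binary:logistic',  # binary classification, outputs a probability
                        num_round=100,                # number of boosting rounds
                       )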


xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) 

2020-08-26 12:23:04 Starting - Starting the training job...
2020-08-26 12:23:25 Starting - Launching requested ML instances......
2020-08-26 12:24:31 Starting - Preparing the instances for training......
2020-08-26 12:25:20 Downloading - Downloading input data
2020-08-26 12:25:20 Training - Downloading the training image..
Arguments: train
[2020-08-26:12:25:40:INFO] Running standalone xgboost training.
[2020-08-26:12:25:40:INFO] File size need to be processed in the node: 1.46mb. Available memory size in the node: 24453.21mb
[2020-08-26:12:25:40:INFO] Determined delimiter of CSV input is ','
[12:25:40] S3DistributionType set as FullyReplicated
[12:25:40] 2333x25 matrix with 58325 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-08-26:12:25:40:INFO] Determined delimiter of CSV input is ','
[12:25:40] S3DistributionType set as FullyReplicated
[12:25:40] 666x25 matrix with 1

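
The training log reports the shape of the loaded data, for example `2333x25 matrix with 58325 entries` for the train channel. Since 2333 × 25 = 58325, every cell was loaded as an entry. The sketch below extracts those numbers from a log line and performs the same check. The helper `parse_matrix_line` is hypothetical; the log does not state whether the 25 columns include the label column, so the sketch makes no assumption about that.

```python
import re

# Matches log fragments such as "2333x25 matrix with 58325 entries".
MATRIX_LINE = re.compile(r"(?P<rows>\d+)x(?P<cols>\d+) matrix with (?P<entries>\d+) entries")


def parse_matrix_line(line):
    """Return (rows, cols, entries) from a matrix log line, or None if it does not match."""
    match = MATRIX_LINE.search(line)
    if match is None:
        return None
    return tuple(int(match.group(key)) for key in ("rows", "cols", "entries"))


# Train-channel line copied from the log above.
train_line = (
    "[12:25:40] 2333x25 matrix with 58325 entries loaded from "
    "/opt/ml/input/data/train?format=csv&label_column=0&delimiter=,"
)
rows, cols, entries = parse_matrix_line(train_line)
print(rows, cols, entries, rows * cols == entries)  # prints: 2333 25 58325 True
```

A line that is cut off before the word `entries`, such as the validation line in the output above, returns `None` rather than a guessed value.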
## Saving the XGBoost model artifact and XGBoost Docker image paths

These values are stored for building the inference pipeline later.

In [8]:
xgb_model_data = xgb.model_data
xgb_image_uri = xgb.image_uri
print("xgb_model_data: \n", xgb_model_data)
print("xgb_image_uri: \n", xgb_image_uri)

%store xgb_model_data
%store xgb_image_uri

xgb_model_data: 
 s3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/output/xgboost-2020-08-26-12-23-04-528/output/model.tar.gz
xgb_image_uri: 
 306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:latest
Stored 'xgb_model_data' (str)
Stored 'xgb_image_uri' (str)
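
The stored `xgb_model_data` value is an S3 URI. Its key follows the `output_path` passed to the estimator (`<prefix>/output`), followed by the training job name and `output/model.tar.gz`. The sketch below splits the URI from the output above into bucket and key, which is the form that S3 client calls generally take. The helper `split_s3_uri` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the cell output above.
bucket_name, key = split_s3_uri(
    "s3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/output/"
    "xgboost-2020-08-26-12-23-04-528/output/model.tar.gz"
)
print(bucket_name)
print(key)
```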
