# [Module 3.0] Training with the Preprocessed Data



Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

In [8]:
from sagemaker import get_execution_role

role = get_execution_role()


In [9]:
%store -r

---
## Train

Moving on to training, we first need to specify the location of the XGBoost algorithm container.

In [10]:
import boto3

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-2').


Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [11]:
s3_input_train_processed = sagemaker.session.s3_input(
    preprocessed_train_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_train_processed.config)
s3_input_validation_processed = sagemaker.session.s3_input(
    preprocessed_validation_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_validation_processed.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-16-00-11-2020-07-16-00-11-11-560', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-16-00-11-2020-07-16-00-11-19-808', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
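
The printed configuration is an ordinary nested dictionary describing one input channel. As a rough illustration only, the sketch below rebuilds the same structure in plain Python. The helper `make_channel_config` and the example URI are hypothetical and are not part of the SageMaker SDK, which builds this dictionary for you.

```python
def make_channel_config(s3_uri, content_type="text/csv",
                        distribution="FullyReplicated", s3_data_type="S3Prefix"):
    """Return a dict shaped like the `s3_input(...).config` output printed above.

    Hypothetical helper for illustration only.
    """
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": s3_data_type,
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
        "ContentType": content_type,
    }


# Placeholder URI, not a real bucket.
example = make_channel_config("s3://example-bucket/example-prefix/train")
print(example["DataSource"]["S3DataSource"]["S3Uri"])
```

With `S3Prefix`, every object under the prefix is treated as part of the channel, and `FullyReplicated` means each training instance receives a full copy of the data.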


Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains an additional model on the residuals of the previous rounds.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It shrinks the contribution of each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found on the project's GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
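
To make the interaction between `eta` and `num_round` concrete, here is a minimal toy sketch of boosting with one-split trees ("stumps") on a tiny 1-D dataset. It is not XGBoost: it uses squared error rather than the `binary:logistic` objective, has no regularization, and the data and function names are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a "stump") to the residuals."""
    best = None
    # Try every split that leaves both sides non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boosted_training_mse(xs, ys, eta, num_round):
    """Run `num_round` boosting rounds and return the training mean squared error."""
    preds = [0.0] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # eta shrinks each stump's contribution before it is added to the ensemble.
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up toy data: one feature, binary target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

for eta, num_round in [(0.5, 1), (0.05, 1), (0.2, 10), (0.2, 100)]:
    mse = boosted_training_mse(xs, ys, eta, num_round)
    print(f"eta={eta:<4} num_round={num_round:<3} training MSE={mse:.4f}")
```

In this toy setting a smaller `eta` moves the predictions less per round, and more rounds keep lowering the training error. The sketch measures training error only, so it says nothing about overfitting; in practice the two parameters are usually tuned together against the validation set.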

In [12]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) 

2020-07-16 00:12:32 Starting - Starting the training job......
2020-07-16 00:13:05 Starting - Launching requested ML instances......
2020-07-16 00:14:12 Starting - Preparing the instances for training......
2020-07-16 00:15:31 Downloading - Downloading input data
2020-07-16 00:15:31 Training - Downloading the training image...
2020-07-16 00:16:04 Uploading - Uploading generated training model
2020-07-16 00:16:04 Completed - Training job completed
Arguments: train
[2020-07-16:00:15:52:INFO] Running standalone xgboost training.
[2020-07-16:00:15:52:INFO] File size need to be processed in the node: 1.29mb. Available memory size in the node: 8497.36mb
[2020-07-16:00:15:52:INFO] Determined delimiter of CSV input is ','
[00:15:52] S3DistributionType set as FullyReplicated
[00:15:52] 2333x69 matrix with 160977 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-07-16:00:15:52:INFO] Determined delim

## Post-processing

In [58]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transform_postprocessor_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-postprocessing-output')
scikit_learn_post_process_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'inverse-label-transform'})
transformer_post_processing = scikit_learn_post_process_model.transformer(
    instance_count=1, 
    instance_type='local',
    assemble_with='Line',
    output_path=transform_postprocessor_path,
    accept='text/csv')

In [59]:
sklearn_preprocessor.model_data

's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-14-39-38-201/model.tar.gz'

In [60]:
scikit_learn_post_process_model.model_data

's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-14-39-38-201/model.tar.gz'

## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>
An inference pipeline can be set up with the `create_model()` call of the SageMaker client. In this example, we configure the pipeline model with three containers: the fitted Scikit-learn inference model (feature transformation), the fitted XGBoost model, and the post-processing model.
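
In a SageMaker inference pipeline the containers run in the order listed, and each container's response becomes the next container's request. The following sketch mimics that chaining with three plain Python functions. All three stage functions, the scoring rule, and the label names are hypothetical stand-ins, not the logic of the real containers.

```python
import csv
import io


def feature_transform(raw_csv):
    """Stage 1 stand-in: turn a raw CSV record into numeric features."""
    row = next(csv.reader(io.StringIO(raw_csv)))
    return ",".join(str(float(value)) for value in row)


def predict(features_csv):
    """Stage 2 stand-in: return a score as text (made-up rule, not XGBoost)."""
    values = [float(value) for value in features_csv.split(",")]
    return str(1.0 if sum(values) > 10 else 0.0)


def inverse_label_transform(score_text, threshold=0.5):
    """Stage 3 stand-in: map the numeric score back to a readable label."""
    return "churn" if float(score_text) > threshold else "no-churn"


def run_pipeline(payload, stages):
    """Feed the payload through each stage in order, like the Containers list."""
    for stage in stages:
        payload = stage(payload)
    return payload


stages = [feature_transform, predict, inverse_label_transform]
result = run_pipeline("3,4,5", stages)
print(result)
```

The order of the list matters in the same way as the order of `Containers` in the `create_model()` call: swapping two stages would hand a stage an input format it does not expect.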

In [61]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_name = 'churn-inference-pipeline-' + timestamp_prefix
client = boto3.client('sagemaker')
response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            'Image': sklearn_preprocessor.image_name,
            'ModelDataUrl': sklearn_preprocessor.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "feature-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
        {
            'Image': xgb.image_name,
            'ModelDataUrl': xgb.model_data,
            "Environment": {}
        },
        {
            'Image': scikit_learn_post_process_model.image,
            'ModelDataUrl': scikit_learn_post_process_model.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "inverse-label-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
    ],
    ExecutionRoleArn = role,
)
model_name

'churn-inference-pipeline-2020-07-15-14-43-39'

In [62]:
%store model_name

Stored 'model_name' (str)
