# [Module 2.0] Preprocessing Data to Generate Features



Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

# Preprocessing data <a class="anchor" id="Pre-processing"></a>

Before training the prediction model, we need to carry out typical preprocessing tasks on the input data, including cleaning, feature transformation, and feature selection. For example:
- `Phone` takes on too many unique values to be of any practical use. Parsing out the prefix might have some value, but without more context on how prefixes are allocated, we should avoid using it.
- `Area Code` shows up as a numeric feature, but it is categorical, so we should convert it to a non-numeric type.
- If we dig into the features and run a correlation analysis, we see several pairs of features that are highly correlated with one another. Including such pairs can create catastrophic problems in some machine learning algorithms, while in others it only introduces minor redundancy and bias. We should remove one feature from each highly correlated pair.
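
The three steps in the list above can be sketched with pandas. This is only an illustration on hypothetical toy data: the column values, the `drop_correlated` helper, and the 0.95 threshold are assumptions, not taken from `preprocessing.py`.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy rows; "Day Charge" is an exact multiple of "Day Mins".
toy = pd.DataFrame({
    "Phone": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "Area Code": [415, 415, 408, 510],
    "Day Mins": [265.1, 161.6, 243.4, 299.4],
    "CustServ Calls": [1, 1, 0, 2],
})
toy["Day Charge"] = toy["Day Mins"] * 0.17

toy = toy.drop(columns=["Phone"])                 # too many unique values
toy["Area Code"] = toy["Area Code"].astype(str)   # treat as categorical, not numeric
reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Which member of a correlated pair gets dropped depends on column order here; in practice you may prefer to keep the feature that is easier to interpret.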

We will use the Amazon SageMaker built-in Scikit-learn library for preprocessing (and also postprocessing), and then use the Amazon SageMaker built-in XGBoost algorithm for predictions. We'll deploy both the library and the algorithm on the same endpoint using the Amazon SageMaker Inference Pipelines feature, so you can pass raw input data directly to Amazon SageMaker. We'll also reuse the preprocessing code between training and inference to reduce development overhead and errors.

To run Scikit-learn on SageMaker, we use the `SKLearn` Estimator with a script as an entry point. The script is very similar to a training script you might run outside of SageMaker. Also, because this dataset is small, we run preprocessing in 'local' mode and upload the fitted transformer and the transformed data to S3.

In [14]:
%store -r

## Creating the Preprocessing Model (Local Test)

In [20]:
from sagemaker.sklearn.estimator import SKLearn
sagemaker_session = sagemaker.Session()
from sagemaker import get_execution_role

role = get_execution_role()

script_path = 'preprocessing.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="local")
sklearn_preprocessor.fit({'train': s3_input_train})

Creating tmpi91nxvib_algo-1-epjoa_1 ... done
Attaching to tmpi91nxvib_algo-1-epjoa_1
algo-1-epjoa_1  | 2020-07-15 06:33:46,793 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-epjoa_1  | 2020-07-15 06:33:46,796 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-epjoa_1  | 2020-07-15 06:33:46,804 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-epjoa_1  | 2020-07-15 06:33:46,906 sagemaker-containers INFO     Module preprocessing does not provide a setup.py. 
algo-1-epjoa_1  | Generating setup.py
algo-1-epjoa_1  | 2020-07-15 06:33:46,906 sagemaker-containers INFO     Generating setup.cfg
algo-1-epjoa_1  | 2020-07-15 06:33:46,906 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-epjoa_1  | 2020-07-15 06:33:46,906 sagemaker-containers INFO     Installing module with the following command:


### Preparing the training and validation dataset <a class="anchor" id="preprocess_train_data"></a>
Now that our preprocessor is fitted, let's preprocess our training and validation data. We'll use batch transform to preprocess the raw data directly and store the results back in S3.

### Creating the preprocessed training data (features)

In [22]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')

scikit_learn_inferencee_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'feature-transform'})
transformer_train = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], 
                            content_type='text/csv')
print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()
preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
print(preprocessed_train_path)

Attaching to tmpczieq7o9_algo-1-1w69h_1
algo-1-1w69h_1  | Processing /opt/ml/code
algo-1-1w69h_1  | Building wheels for collected packages: preprocessing
algo-1-1w69h_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-1w69h_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=9702 sha256=ba9f0d1060aef8afbb200ae46c3a344d76e94a28fb5fbea6d68ceee6563809f5
algo-1-1w69h_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-b58d3pwe/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-1w69h_1  | Successfully built preprocessing
algo-1-1w69h_1  | Installing collected packages: preprocessing
algo-1-1w69h_1  | Successfully installed preprocessing-1.0.0
algo-1-1w69h_1  |   import imp
algo-1-1w69h_1  | [2020-07-15 06:34:49 +0000] [72] [INFO] Starting gunicorn 19.9.0
algo-1-1w69h_1  | [2020-07-15 06:34:49 +0000

In [23]:
print(preprocessed_train_path)

s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-15-06-33-2020-07-15-06-34-45-905


### Creating the preprocessed validation data (features)

In [24]:
# Define a SKLearn Transformer from the trained SKLearn Estimator

transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-validation-output')
transformer_validation = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')
# Preprocess validation input
transformer_validation.transform(s3_input_validation.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')
print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()
preprocessed_validation_path = transformer_validation.output_path+transformer_validation.latest_transform_job.job_name
print(preprocessed_validation_path)


Attaching to tmp2qr8xre8_algo-1-0zoxo_1
algo-1-0zoxo_1  | Processing /opt/ml/code
algo-1-0zoxo_1  | Building wheels for collected packages: preprocessing
algo-1-0zoxo_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-0zoxo_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=9702 sha256=3ac2e3b89fa8457c32c9d2d2b5da5bff624580d64756db38e3edcdb9a6b17e95
algo-1-0zoxo_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-bttrbzky/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-0zoxo_1  | Successfully built preprocessing
algo-1-0zoxo_1  | Installing collected packages: preprocessing
algo-1-0zoxo_1  | Successfully installed preprocessing-1.0.0
algo-1-0zoxo_1  |   import imp
algo-1-0zoxo_1  | [2020-07-15 06:35:38 +0000] [72] [INFO] Starting gunicorn 19.9.0
algo-1-0zoxo_1  | [2020-07-15 06:35:38 +0000

In [27]:
preprocessed_validation_path

's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-15-06-33-2020-07-15-06-35-34-867'

---
## Train

Moving on to training, we first need to specify the location of the XGBoost algorithm container.

In [26]:
import boto3

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-2').


Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [28]:
s3_input_train_processed = sagemaker.session.s3_input(
    preprocessed_train_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_train_processed.config)
s3_input_validation_processed = sagemaker.session.s3_input(
    preprocessed_validation_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_validation_processed.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-15-06-33-2020-07-15-06-34-45-905', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-15-06-33-2020-07-15-06-35-34-867', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
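
For CSV input, the built-in XGBoost algorithm expects files with no header row and the label in the first column. A minimal sanity check along those lines is sketched below; the `check_xgboost_csv` helper and the sample rows are hypothetical and are not part of the SageMaker SDK.

```python
import csv
import io

def check_xgboost_csv(text):
    """Return (n_rows, n_features) if the CSV looks like headerless, label-first input."""
    rows = list(csv.reader(io.StringIO(text)))
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            raise ValueError("rows have different numbers of columns")
        for value in row:
            float(value)  # raises ValueError on a header line or a text field
    labels = {float(row[0]) for row in rows}
    if not labels <= {0.0, 1.0}:
        raise ValueError("first column should hold 0/1 labels for binary:logistic")
    return len(rows), width - 1

# Hypothetical preprocessed rows: label first, then two numeric features.
sample = "1,0.5,3\n0,1.2,7\n"
print(check_xgboost_csv(sample))
```

Running a check like this on a few downloaded lines of the batch transform output can catch a stray header or an untransformed text column before a training job is launched.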


Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains an additional model on the residuals of the previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It shrinks the contribution of each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
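
To make the roles of `num_round` and `eta` concrete, here is a minimal sketch of plain gradient boosting for squared error, using depth-1 trees (stumps) on hypothetical 1-D data. It is not XGBoost's actual algorithm, which adds regularization terms such as `gamma` and `min_child_weight`; it only shows how each round fits the residuals and how `eta` scales each step.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, num_round, eta):
    """Return the training MSE after num_round rounds of boosting with step size eta."""
    preds = [0.0] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # eta shrinks the contribution of each new stump.
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

mse_few = boost(xs, ys, num_round=5, eta=0.2)
mse_many = boost(xs, ys, num_round=50, eta=0.2)
mse_cautious = boost(xs, ys, num_round=5, eta=0.05)
print(mse_few, mse_many, mse_cautious)
```

On this data, more rounds lower the training error, and a smaller `eta` leaves more error after the same number of rounds. That is why a small `eta` is usually paired with a larger `num_round`.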

In [29]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) 

2020-07-15 06:37:54 Starting - Starting the training job...
2020-07-15 06:37:56 Starting - Launching requested ML instances......
2020-07-15 06:39:00 Starting - Preparing the instances for training...
2020-07-15 06:39:53 Downloading - Downloading input data...
2020-07-15 06:40:15 Training - Downloading the training image..
Arguments: train
[2020-07-15:06:40:34:INFO] Running standalone xgboost training.
[2020-07-15:06:40:34:INFO] File size need to be processed in the node: 1.29mb. Available memory size in the node: 8516.39mb
[2020-07-15:06:40:34:INFO] Determined delimiter of CSV input is ','
[06:40:34] S3DistributionType set as FullyReplicated
[06:40:34] 2333x69 matrix with 160977 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-07-15:06:40:34:INFO] Determined delimiter of CSV input is ','
[06:40:34] S3DistributionType set as FullyReplicated
[06:40:34] 666x69 matrix with 4

## Post-processing

In [30]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transform_postprocessor_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-postprocessing-output')
scikit_learn_post_process_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'inverse-label-transform'})
transformer_post_processing = scikit_learn_post_process_model.transformer(
    instance_count=1, 
    instance_type='local',
    assemble_with = 'Line',
    output_path = transform_postprocessor_path,
    accept = 'text/csv')

In [31]:
sklearn_preprocessor.model_data

's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-06-33-44-815/model.tar.gz'

In [32]:
scikit_learn_post_process_model.model_data

's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-06-33-44-815/model.tar.gz'
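
Both models point to the same `model.tar.gz`; the only difference between them is the `TRANSFORM_MODE` environment variable. One way a single script could serve both roles is to dispatch on that variable at prediction time. The sketch below assumes this design; the function names, label values, and the stand-in scaler are hypothetical and do not reproduce the actual `preprocessing.py`.

```python
import os

# Hypothetical fitted artefacts; a real script would load these from model.tar.gz.
LABELS = ["no-churn", "churn"]        # index -> original label text
FEATURE_MEANS = {"Day Mins": 180.0}   # stand-in for a fitted scaler

def feature_transform(row):
    """Turn a raw record into model features (here: centre one column)."""
    return [row["Day Mins"] - FEATURE_MEANS["Day Mins"]]

def inverse_label_transform(score, threshold=0.5):
    """Map a predicted probability back to the original label text."""
    return LABELS[int(score >= threshold)]

def predict(payload):
    """Dispatch on TRANSFORM_MODE so one artefact can serve both pipeline steps."""
    mode = os.environ.get("TRANSFORM_MODE")
    if mode == "feature-transform":
        return feature_transform(payload)
    if mode == "inverse-label-transform":
        return inverse_label_transform(payload)
    raise RuntimeError(f"Unsupported TRANSFORM_MODE: {mode!r}")

os.environ["TRANSFORM_MODE"] = "feature-transform"
features = predict({"Day Mins": 200.0})

os.environ["TRANSFORM_MODE"] = "inverse-label-transform"
label = predict(0.83)
print(features, label)
```

Keeping both directions in one script means the label encoding used during training and the decoding used after prediction cannot drift apart.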

## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>
A machine learning inference pipeline can be set up with `create_model()`. In this example, we configure our pipeline model with the fitted Scikit-learn preprocessing model, the fitted XGBoost model, and the postprocessing model.

In [34]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_name = 'churn-inference-pipeline-' + timestamp_prefix
client = boto3.client('sagemaker')
response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            'Image': sklearn_preprocessor.image_name,
            'ModelDataUrl': sklearn_preprocessor.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "feature-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
        {
            'Image': xgb.image_name,
            'ModelDataUrl': xgb.model_data,
            "Environment": {}
        },
        {
            'Image': scikit_learn_post_process_model.image,
            'ModelDataUrl': scikit_learn_post_process_model.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "inverse-label-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
    ],
    ExecutionRoleArn = role,
)
model_name

'churn-inference-pipeline-2020-07-15-06-46-36'

In [35]:
%store model_name

Stored 'model_name' (str)
