# [Module 9.0] Inference Pipeline Creation: Checking the Logs

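- This notebook runs the steps below and logs each inference stage to show how the Inference Pipeline Model works.
    - Create the Feature Transformer (the fitted preprocessing model)
    - Generate preprocessed data by passing the train data through the Feature Transformer
    - Generate preprocessed data by passing the validation data through the Feature Transformer
    - Train XGBoost
    - Create the Inference Pipeline Model (preprocessing, XGBoost, and post-processing models)
    - Create a real-time endpoint
    - Run inference on a single test record
- The notebook takes about 10 minutes to run.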

In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

## Creating the Feature Transformer (fitted preprocessing model)
The cell below does the following:
- Creates an Estimator named SKLearn.
    - Passes the training data in s3_input_train to SKLearn as input.
    - Specifies log-preprocessing.py, the source file that builds the fitted preprocessing model (Featurizer).
    - Sets instance_type = 'local' as the resource to use. (This uses the Docker Compose installation already present on the notebook instance.)
        - **A SageMaker cloud instance can be used instead of local mode (e.g., ml.m4.xlarge).**
        - **The XGBoost training step further below uses a SageMaker cloud instance.**
- When fitting of the SKLearn preprocessing model completes, the resulting model artifact (model.tar.gz) is stored at s3://{bucket_name}/{job_name}/model.tar.gz.
    - (e.g., s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz)
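The artifact location is an ordinary S3 URI, so it can be split into a bucket and an object key when a tool wants them separately. The following is a minimal sketch using only the standard library; `split_s3_uri` is a hypothetical helper name that does not come from the notebook, and the example URI is the one quoted in the list above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {!r}".format(uri))
    # The parsed path keeps its leading '/', which is not part of the key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri(
    "s3://sagemaker-us-east-2-057716757052/"
    "sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz"
)
print(bucket_name)
print(key)
```

The variable is called `bucket_name` rather than `bucket` so that pasting the sketch into the notebook would not overwrite the `bucket` variable used by later cells.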

#### The cell below takes about 1 minute. Please wait until the [*] marker of the cell changes to [number] (e.g., [3]).

In [3]:
from sagemaker.sklearn.estimator import SKLearn
sagemaker_session = sagemaker.Session()
from sagemaker import get_execution_role

role = get_execution_role()

script_path = 'log-preprocessing.py'
# instance_type = 'ml.m4.2xlarge'
instance_type = 'local'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type = instance_type
)
sklearn_preprocessor.fit({'train': s3_input_train})

This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


Creating tmpk7ycgxzh_algo-1-z9bwh_1 ... done
Attaching to tmpk7ycgxzh_algo-1-z9bwh_1
algo-1-z9bwh_1  | 2020-08-14 06:48:56,766 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-z9bwh_1  | 2020-08-14 06:48:56,768 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-z9bwh_1  | 2020-08-14 06:48:56,777 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-z9bwh_1  | 2020-08-14 06:48:56,898 sagemaker-containers INFO     Module log-preprocessing does not provide a setup.py.
algo-1-z9bwh_1  | Generating setup.py
algo-1-z9bwh_1  | 2020-08-14 06:48:56,898 sagemaker-containers INFO     Generating setup.cfg
algo-1-z9bwh_1  | 2020-08-14 06:48:56,898 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-z9bwh_1  | 2020-08-14 06:48:56,898 sagemaker-containers INFO     Installing module with the following comma

## Creating preprocessed training and validation data with the Feature Transformer

![Transformer_Train](img/Fig2.1.transformer_train.png)

### Creating the preprocessed training data (features)

#### The cell below takes about 1 minute. Please wait until the [*] marker of the cell changes to [number] (e.g., [4]).

In [4]:
# Set the output path
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'

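# Create the fitted preprocessing model under the name scikit_learn_inferencee_model.
# The TRANSFORM_MODE environment variable tells the script to run in preprocessing (feature-transform) mode.
    # At inference time, the post-processing step sets "TRANSFORM_MODE": "inverse-label-transform" instead.
    # The two steps could live in separate scripts, but a single source file (log-preprocessing.py) is used,
    # so the environment variable is what distinguishes them.
scikit_learn_inferencee_model = sklearn_preprocessor.create_model(
    env={'TRANSFORM_MODE': 'feature-transform'})
# Create the train Transformer from scikit_learn_inferencee_model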
transformer_train = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], 
                            content_type='text/csv')
print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()
preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
print(preprocessed_train_path)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmpgvalg9o0_algo-1-9tarl_1
algo-1-9tarl_1  | Processing /opt/ml/code
algo-1-9tarl_1  | Building wheels for collected packages: log-preprocessing
algo-1-9tarl_1  |   Building wheel for log-preprocessing (setup.py) ... done
algo-1-9tarl_1  |   Created wheel for log-preprocessing: filename=log_preprocessing-1.0.0-py2.py3-none-any.whl size=10220 sha256=7c35f452d381b05aec98b62ff30659221d3fa3d1a746f728acb752fe51e72470
algo-1-9tarl_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-m269pbya/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-9tarl_1  | Successfully built log-preprocessing
algo-1-9tarl_1  | Installing collected packages: log-preprocessing
algo-1-9tarl_1  | Successfully installed log-preprocessing-1.0.0
algo-1-9tarl_1  |   import imp
algo-1-9tarl_1  | [2020-08-14 06:49:02 +0000] [48] [INFO] Starting gunicorn 19.9.0
algo-1-9tarl_1  |

#### Checking the preprocessed training file

In [5]:
print(preprocessed_train_path)

s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-14-06-48-2020-08-14-06-48-59-299


In [6]:
! aws s3 ls {preprocessed_train_path} --recursive

2020-08-14 06:49:06    1054526 sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-14-06-48-2020-08-14-06-48-59-299/train.csv.out
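The listing above suggests a naming rule: the result sits under the output path plus the job name, and the object keeps the input file name with `.out` appended. The sketch below encodes that rule as a hypothetical helper, `transform_output_uri`. It mirrors only what this notebook shows for local mode; the layout on hosted instances may differ, and the bucket and job names in the example are placeholders.

```python
import posixpath


def transform_output_uri(output_path, job_name, input_uri):
    """Guess where a transform job wrote the result for one input object."""
    input_name = posixpath.basename(input_uri)
    # As in the notebook, output_path is expected to end with '/'.
    return "{}{}/{}.out".format(output_path, job_name, input_name)


first = transform_output_uri(
    "s3://example-bucket/prefix/out/", "job-1", "s3://example-bucket/prefix/train.csv"
)
print(first)

# Feeding an already transformed file into a second transform appends '.out' again.
second = transform_output_uri("s3://example-bucket/prefix/pca-out/", "job-2", first)
print(second)
```

The second call illustrates why a file named `train.csv.out.out` appears later, once the preprocessed file is itself passed through another transform job.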


In [7]:
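preprocessed_train_path_file = os.path.join(preprocessed_train_path, 'train.csv.out')
# Note: without header=None, pandas uses the first data row as the column labels (visible in the output).
df_pre_train = pd.read_csv(preprocessed_train_path_file)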
df_pre_train.head()


Unnamed: 0,0.0,0.11941369588439606,-0.5962380254245051,1.744368057672484,0.9789570533336895,-0.028992907038264654,-0.8931854019845896,-0.8017032037830547,-1.9825286353116254,-1.5305589315744583,...,0.0.48,0.0.49,0.0.50,0.0.51,0.0.52,0.0.53,1.0.1,0.0.54,1.0.2,0.0.55
0,0.0,-1.852652,-0.596238,0.140284,-0.310405,0.970689,-0.689888,0.146389,1.232901,0.124852,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,1.181295,-0.596238,1.83513,0.185503,0.030988,-0.639063,1.568529,-0.063643,-0.846802,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,0.776769,-0.596238,0.216227,0.334276,0.136954,1.393914,1.394712,-0.634123,0.844596,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,-0.234547,1.508734,-0.459859,0.483049,-0.230929,0.224952,1.056954,0.92173,-0.810815,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.0,0.751486,1.218393,0.231046,-0.756723,0.516833,0.275776,1.043127,-2.138114,0.232814,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


### Creating the preprocessed validation data (features)

In [8]:
# Set the output path
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-validation-output')
# Create the validation Transformer from scikit_learn_inferencee_model
transformer_validation = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')
# Preprocess validation input
transformer_validation.transform(s3_input_validation.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')
print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()
preprocessed_validation_path = transformer_validation.output_path+transformer_validation.latest_transform_job.job_name
print(preprocessed_validation_path)



Attaching to tmpzch47bdg_algo-1-tqzpt_1
algo-1-tqzpt_1  | Processing /opt/ml/code
algo-1-tqzpt_1  | Building wheels for collected packages: log-preprocessing
algo-1-tqzpt_1  |   Building wheel for log-preprocessing (setup.py) ... done
algo-1-tqzpt_1  |   Created wheel for log-preprocessing: filename=log_preprocessing-1.0.0-py2.py3-none-any.whl size=10220 sha256=efaea819ce1449e392ad9dd41a23ed80b5cc30b94632468e4718b3465f23a978
algo-1-tqzpt_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-mns4o6kz/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-tqzpt_1  | Successfully built log-preprocessing
algo-1-tqzpt_1  | Installing collected packages: log-preprocessing
algo-1-tqzpt_1  | Successfully installed log-preprocessing-1.0.0
algo-1-tqzpt_1  |   import imp
algo-1-tqzpt_1  | [2020-08-14 06:49:09 +0000] [48] [INFO] Starting gunicorn 19.9.0
algo-1-tqzpt_1  |

## PCA training

In [9]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'pca'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

057716757052
us-east-2
arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20191128T110038
sagemaker-us-east-2-057716757052
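The cell above reads the account ID with `role.split(':')[4]`, which works because an ARN is a colon-separated string of the form `arn:partition:service:region:account-id:resource`. A slightly more defensive sketch is shown here; `parse_arn` is a hypothetical helper, not part of boto3 or the SageMaker SDK.

```python
def parse_arn(arn):
    """Split an ARN into its six colon-separated fields."""
    # maxsplit=5 keeps any ':' inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("Not an ARN: {!r}".format(arn))
    keys = ("prefix", "partition", "service", "region", "account_id", "resource")
    return dict(zip(keys, parts))


fields = parse_arn(
    "arn:aws:iam::057716757052:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20191128T110038"
)
print(fields["account_id"])
print(fields["resource"])
```

For IAM ARNs the region field is empty, which is why two colons appear in a row before the account ID.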


In [10]:
! cp pca_byoc_train.py docker/code/

In [11]:
%%writefile docker/Dockerfile

FROM 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
    
# install python package
RUN pip install joblib


ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/ml/code:${PATH}"

# Copy training code
COPY code/* /opt/ml/code/
 
WORKDIR /opt/ml/code

# ENTRYPOINT ["python", "pca_train.py"]
# In order to use SageMaker environment variables, use the statement below
ENV SAGEMAKER_PROGRAM pca_byoc_train.py

Overwriting docker/Dockerfile


In [12]:
import os
os.environ['account_id'] = account_id
os.environ['region'] = region
os.environ['ecr_repository_name'] = ecr_repository_name

In [13]:
%%sh

ACCOUNT_ID=${account_id}
REGION=${region}
REPO_NAME=${ecr_repository_name}

echo $REGION
echo $ACCOUNT_ID
echo $REPO_NAME


# Get the login command from ECR in order to pull the SageMaker scikit-learn base image used in the Dockerfile
$(aws ecr get-login --registry-ids 257758044811 --region ${region} --no-include-email)



docker build -f docker/Dockerfile -t $REPO_NAME docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



us-east-2
057716757052
sagemaker-training-containers/pca
Login Succeeded
Sending build context to Docker daemon  11.26kB
Step 1/8 : FROM 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
 ---> 30adb1aa9af5
Step 2/8 : RUN pip install joblib
 ---> Using cache
 ---> 0786847c4f79
Step 3/8 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 7d94abd2b857
Step 4/8 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Using cache
 ---> 8696b5e742b3
Step 5/8 : ENV PATH="/opt/ml/code:${PATH}"
 ---> Using cache
 ---> daba2554dce8
Step 6/8 : COPY code/* /opt/ml/code/
 ---> Using cache
 ---> 9685910a18a5
Step 7/8 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> ae3f15597ed8
Step 8/8 : ENV SAGEMAKER_PROGRAM pca_byoc_train.py
 ---> Using cache
 ---> 2838d3d55148
Successfully built 2838d3d55148
Successfully tagged sagemaker-training-containers/pca:latest
Login Succeeded
{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-2:057716757052:repository/sagem

https://docs.docker.com/engine/reference/commandline/login/#credentials-store




In [14]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

057716757052.dkr.ecr.us-east-2.amazonaws.com/sagemaker-training-containers/pca:latest


In [15]:
preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
print("preprocessed_train_path_file: \n", preprocessed_train_path_file)
print("preprocessed_validation_path_file: \n", preprocessed_validation_path_file)

preprocessed_train_path_file: 
 s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-14-06-48-2020-08-14-06-48-59-299/train.csv.out
preprocessed_validation_path_file: 
 s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-14-06-48-2020-08-14-06-49-06-650/validation.csv.out


## PCA training

In [16]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


In [17]:
import pandas as pd
# preprocessed_train_path_file = 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-12-07-07-2020-08-12-07-07-08-229/train.csv.out'

churn_df = pd.read_csv(preprocessed_train_path_file, header=None)
churn_df.head()
train_y = churn_df.iloc[:,0]
train_X = churn_df.iloc[:,1:]

print("Shape of train_X: ", train_X.shape)
print("Shape of train_y: ", train_y.shape)

os.makedirs('./data', exist_ok =True)
np.savetxt('./data/churn-preprocessed.csv', train_X, delimiter=',',
           fmt='%1.5f'
          )

WORK_DIRECTORY = 'data'
prefix = 'Scikit-pca-custom'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY,
                                            key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
                                           )
print("train_input: ", train_input)


Shape of train_X:  (2333, 69)
Shape of train_y:  (2333,)
train_input:  s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/data


In [18]:
%%time

import sagemaker

instance_type = 'local'
# instance_type = 'ml.m4.xlarge'

pca_estimator = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type= instance_type,
                                    base_job_name=prefix)

pca_estimator.set_hyperparameters(n_components= 15)

train_config = sagemaker.session.s3_input(train_input, content_type='text/csv')

pca_estimator.fit({'train': train_config})

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


Creating tmpzt1l985a_algo-1-3c8kz_1 ... done
Attaching to tmpzt1l985a_algo-1-3c8kz_1
algo-1-3c8kz_1  | 2020-08-14 06:49:29,524 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-3c8kz_1  | 2020-08-14 06:49:29,527 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-3c8kz_1  | 2020-08-14 06:49:29,535 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-3c8kz_1  | 2020-08-14 06:49:29,536 sagemaker-containers INFO     Module pca_byoc_train does not provide a setup.py.
algo-1-3c8kz_1  | Generating setup.py
algo-1-3c8kz_1  | 2020-08-14 06:49:29,536 sagemaker-containers INFO     Generating setup.cfg
algo-1-3c8kz_1  | 2020-08-14 06:49:29,536 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-3c8kz_1  | 2020-08-14 06:49:29,536 sagemaker-containers INFO     Installing module with the following command:

## Transforming the training data with PCA

In [19]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


In [20]:
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-train-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the train Transformer from pca_model
transformer_train = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Apply the PCA transform to the preprocessed training input
transformer_train.transform(preprocessed_train_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()

preprocessed_pca_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name


Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmp3923ide9_algo-1-jh2xd_1
algo-1-jh2xd_1  | Processing /opt/ml/code
algo-1-jh2xd_1  | Building wheels for collected packages: pca-byoc-train
algo-1-jh2xd_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-jh2xd_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9478 sha256=8e352ccc9c9426d0b7959ec336ab90088d80b12faa0a0b710b516122417557b6
algo-1-jh2xd_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-rlwl2vbf/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-jh2xd_1  | Successfully built pca-byoc-train
algo-1-jh2xd_1  | Installing collected packages: pca-byoc-train
algo-1-jh2xd_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-jh2xd_1  |   import imp
algo-1-jh2xd_1  | [2020-08-14 06:49:35 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-jh2xd_1  | [2020-08-14 06:49:3

In [21]:
print(preprocessed_pca_train_path)

s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-14-06-49-32-137-2020-08-14-06-49-32-137


In [22]:
! aws s3 ls s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375 --recursive

2020-08-13 01:27:28     707835 Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375/train.csv.out.out


In [23]:
preprocessed_pca_train_path_file = '{}/train.csv.out.out'.format(preprocessed_pca_train_path)
pca_preoc_df = pd.read_csv(preprocessed_pca_train_path_file, header=None)
pca_preoc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,-0.823085,-0.108811,0.452843,-0.030507,-1.031997,-2.956747,-0.0513,-0.706446,1.064536,-0.43729,-0.507988,-0.181687,-0.070627,0.043467,0.092892
1,0.0,-0.343474,0.091423,1.949026,1.268235,0.009081,0.421556,-0.988975,0.868835,-0.542355,-0.341819,0.825626,-0.158794,-0.077455,-0.124417,-0.017133
2,1.0,-0.764309,0.011604,0.822738,-1.429299,-1.627301,-0.74156,0.550952,0.428517,-0.326231,-0.289266,-1.686701,-0.128033,-0.016965,-0.028397,-0.156041
3,0.0,-0.825983,-0.722031,-0.339858,-0.980679,-0.260783,0.35779,-0.669738,-1.122771,-1.451326,-0.006193,-1.091866,-0.134256,-0.024666,0.014783,-0.10382
4,0.0,1.830756,0.701878,0.194737,-1.351443,-0.729668,0.845136,0.150495,0.084293,0.180299,0.751587,0.435199,-0.075141,-0.047975,-0.0581,0.133976


## Transforming the validation data with PCA

In [24]:
preprocessed_validation_path

's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-14-06-48-2020-08-14-06-49-06-650'

In [25]:
import pandas as pd

preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
pre_df = pd.read_csv(preprocessed_validation_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(666, 70)
num_cols:  70


In [26]:


instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-validation-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the validation Transformer from pca_model
transformer_validation = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')


# Apply the PCA transform to the preprocessed validation input
transformer_validation.transform(preprocessed_validation_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()

preprocessed_pca_validation_path = transformer_validation.output_path + transformer_validation.latest_transform_job.job_name
print(preprocessed_pca_validation_path)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmpi3x9b6uq_algo-1-crw6p_1
algo-1-crw6p_1  | Processing /opt/ml/code
algo-1-crw6p_1  | Building wheels for collected packages: pca-byoc-train
algo-1-crw6p_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-crw6p_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9474 sha256=e90ee86dd84a20a18a277a810e4630fdc740da3c78f9359efea35330b3b05ad9
algo-1-crw6p_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-jxfo434e/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-crw6p_1  | Successfully built pca-byoc-train
algo-1-crw6p_1  | Installing collected packages: pca-byoc-train
algo-1-crw6p_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-crw6p_1  |   import imp
algo-1-crw6p_1  | [2020-08-14 06:49:42 +0000] [45] [INFO] Starting gunicorn 19.9.0
algo-1-crw6p_1  | [2020-08-14 06:49:4

In [27]:
preprocessed_pca_validation_path_file = '{}/validation.csv.out.out'.format(preprocessed_pca_validation_path)
pca_val_preoc_df = pd.read_csv(preprocessed_pca_validation_path_file, header=None)
pca_val_preoc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.647523,1.321312,-0.096829,0.255327,1.185455,-1.356901,-0.144986,-1.376,0.192108,-0.407086,1.423016,-0.110452,-0.094824,0.003522,0.225941
1,0.0,-0.568756,0.209782,0.928801,-0.421777,-1.249794,-1.183297,-2.058817,0.980869,-1.320562,-1.227984,0.1844,-0.143092,-0.057522,-0.088475,-0.051763
2,0.0,1.856171,-0.558975,-1.969703,0.315529,0.132519,-0.136767,-0.62357,0.832536,1.131302,1.985652,-0.23546,-0.172431,-0.021995,-0.061832,-0.13382
3,0.0,-0.681862,-1.326923,-0.849271,-1.372694,-0.960617,1.216868,-0.801562,2.278247,-0.030436,0.099985,-1.289597,-0.164497,-0.024283,-0.094379,0.040295
4,0.0,2.290559,0.265566,-0.89117,0.860266,0.113337,0.457285,-0.131537,-0.713534,-0.735986,0.54587,1.292899,-0.146048,0.970827,0.02332,-0.086288


---
## Train with XGBoost on SageMaker Cloud Instance
This step trains the SageMaker built-in XGBoost algorithm on the preprocessed data from above.<br>
The training itself runs on a SageMaker cloud instance.
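For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row. A quick local check before launching a cloud training job can catch a stray header or a wrong column count. The sketch below is one way to do this with the standard library; `check_xgboost_csv` is a hypothetical helper, and the two sample rows are shortened illustrative values, not real pipeline output.

```python
import csv
import io


def check_xgboost_csv(text, expected_features):
    """Validate label-first, header-free CSV text and return the row count."""
    rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # tolerate blank lines
        values = [float(v) for v in row]  # raises ValueError on a header row
        if len(values) != expected_features + 1:
            raise ValueError("line {}: expected {} columns, got {}".format(
                line_no, expected_features + 1, len(values)))
        if values[0] not in (0.0, 1.0):
            raise ValueError("line {}: label must be 0 or 1".format(line_no))
        rows += 1
    return rows


sample = "0.0,-0.82,-0.11\n1.0,-0.76,0.01\n"
print(check_xgboost_csv(sample, expected_features=2))
```

With 15 PCA components, `expected_features` would be 15, giving 16 columns per row including the label.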


Retrieve the built-in XGBoost algorithm container.

In [28]:
import boto3

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost', '1.0-1')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Create objects that specify the S3 paths, file format, and related settings of the preprocessed train and validation data (features).

In [29]:
s3_input_train_processed = sagemaker.session.s3_input(
    preprocessed_pca_train_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print("S3 Train input: \n")
print(s3_input_train_processed.config)
s3_input_validation_processed = sagemaker.session.s3_input(
    preprocessed_pca_validation_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print("\nS3 Validation input: \n")
print(s3_input_validation_processed.config)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


S3 Train input: 

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-14-06-49-32-137-2020-08-14-06-49-32-137', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}

S3 Validation input: 

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-validation-output/pca-2020-08-14-06-49-40-033-2020-08-14-06-49-40-033', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
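The printed `.config` values are plain nested dictionaries. To make that shape explicit, the sketch below builds an equivalent dictionary by hand; `channel_config` is a hypothetical helper written for illustration and is not an SDK function, and the example URI is a placeholder.

```python
def channel_config(s3_uri, content_type="text/csv",
                   distribution="FullyReplicated", s3_data_type="S3Prefix"):
    """Build a dict shaped like the `.config` values printed above."""
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": s3_data_type,
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
        "ContentType": content_type,
    }


config = channel_config("s3://example-bucket/pca-train-output/job-1")
print(config["DataSource"]["S3DataSource"]["S3Uri"])
print(config["ContentType"])
```

Because `S3DataType` is `S3Prefix`, the URI names a prefix rather than a single object, so every object under it is treated as part of the channel.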


#### The cell below takes about 5 minutes. Please wait until the [*] marker of the cell changes to [number] (e.g., [13]).

In [30]:
sess = sagemaker.Session()
instance_type = 'ml.m4.2xlarge'


xgb = sagemaker.estimator.Estimator(container, # Built-in XGBoost Container
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type= instance_type,
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess
                                   )
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100,
                       )


xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) 

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


2020-08-14 06:50:01 Starting - Starting the training job...
2020-08-14 06:50:04 Starting - Launching requested ML instances......
2020-08-14 06:51:08 Starting - Preparing the instances for training...
2020-08-14 06:51:56 Downloading - Downloading input data...
2020-08-14 06:52:20 Training - Downloading the training image..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[06:52:43] 2333x15 matrix with 34995 entries loaded from /opt/ml/input/data/trai

## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>

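As shown in the figure below, the preprocessing, trained-algorithm, and post-processing models created above are combined into a single model to build the Inference Pipeline. <br>
**Raw data can be sent without any manual preparation, and the single model returns the final prediction, True or False.**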

![Inference-pipeline](img/Fig2.2.inference_pipeline.png)


**A Machine Learning Model Pipeline (Inference Pipeline) is built from models created by calling create_model().** <br>
For example, it can be built from three models: the fitted Scikit-learn inference model, the fitted XGBoost model, and the post-processing model.
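Conceptually, an inference pipeline is function composition: each stage receives the output of the previous one. The sketch below illustrates only that idea in plain Python. Every stage is a toy stand-in written for this example (none is the notebook's real model code), and the 0.5 threshold is an assumption.

```python
from functools import reduce


def run_pipeline(stages, payload):
    """Feed the payload through each stage in order, like chained containers."""
    return reduce(lambda data, stage: stage(data), stages, payload)


# Toy stand-ins for the real containers.
def featurize(csv_line):
    return [float(x) for x in csv_line.split(",")]


def reduce_dims(features):
    return features[:2]  # pretend dimensionality reduction


def score(features):
    return sum(features) / len(features)  # pretend model score


def to_label(probability):
    return str(probability > 0.5)  # assumed threshold


stages = [featurize, reduce_dims, score, to_label]
print(run_pipeline(stages, "0.9,0.7,0.1"))
```

The order of the list matters in the same way the order of models matters in a pipeline: each stage must accept exactly what the previous stage emits.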

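The cell below creates the models; this notebook adds a PCA model, for four in total. Environment variables are passed when creating the preprocessing, PCA, and post-processing models.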

## Four-model pipeline

In [31]:
# Preprocessing model
scikit_learn_pre_process_model = sklearn_preprocessor.create_model(
    env={'TRANSFORM_MODE': 'feature-transform'})    

# PCA preprocessing model
pca_infer_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'inverse-label-transform', 'LENGTH_COLS': '69'})

# Trained model (XGBoost)
xgb_model = xgb.create_model()

# Post-processing model
scikit_learn_post_process_model = sklearn_preprocessor.create_model(
    env={'TRANSFORM_MODE': 'inverse-label-transform'})

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
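The model definitions above reuse one preprocessing script in two roles, selected only by the `TRANSFORM_MODE` environment variable. The actual script is not shown in this notebook, so the following is a hedged sketch of how such a dispatch might look. The function names, the float-casting "featurizer", and the 0.5 threshold are all assumptions made for illustration.

```python
import os

FEATURE_TRANSFORM = "feature-transform"
INVERSE_LABEL_TRANSFORM = "inverse-label-transform"


def resolve_transform_mode(environ=None):
    """Read TRANSFORM_MODE and reject values this sketch does not know."""
    environ = os.environ if environ is None else environ
    mode = environ.get("TRANSFORM_MODE")
    if mode not in (FEATURE_TRANSFORM, INVERSE_LABEL_TRANSFORM):
        raise ValueError("Unsupported TRANSFORM_MODE: {!r}".format(mode))
    return mode


def handle(values, environ=None, threshold=0.5):
    """Toy stand-in for the two code paths of a shared script."""
    mode = resolve_transform_mode(environ)
    if mode == FEATURE_TRANSFORM:
        # Placeholder "featurizer": cast every field to float.
        return [float(v) for v in values]
    # Placeholder "inverse label transform": score -> 'True' / 'False'.
    return [str(float(v) > threshold) for v in values]


print(handle(["1", "2.5"], {"TRANSFORM_MODE": FEATURE_TRANSFORM}))
print(handle([0.1, 0.9], {"TRANSFORM_MODE": INVERSE_LABEL_TRANSFORM}))
```

Failing loudly on an unknown or missing mode is a deliberate choice here: a typo in the `env` dictionary would otherwise silently run the wrong branch inside the container.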


In [32]:
print("Feature Transformer Model:\n {}".format(scikit_learn_pre_process_model.model_data))
print("\nPCA Model:\n {}".format(pca_infer_model.model_data))
print("\nXGBoost Model:\n {}".format(xgb_model.model_data))
print("\nPost-Processing Model :\n {}".format(scikit_learn_post_process_model.model_data))


print("env: ", pca_infer_model.env)
print("model_data: ", pca_infer_model.model_data)
print("name: ", pca_infer_model.name)

Feature Transformer Model:
 s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-08-14-06-48-54-829/model.tar.gz

PCA Model:
 s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom-2020-08-14-06-49-27-571/model.tar.gz

XGBoost Model:
 s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/output/sagemaker-xgboost-2020-08-14-06-50-01-822/output/model.tar.gz

Post-Processing Model :
 s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-08-14-06-48-54-829/model.tar.gz
env:  {'TRANSFORM_MODE': 'inverse-label-transform', 'LENGTH_COLS': '69'}
model_data:  s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom-2020-08-14-06-49-27-571/model.tar.gz
name:  None


Check the other configuration values of the preprocessing model.


**If you see an error like the one below, go to ECR --> select the repository (the Docker image you created) --> Permissions --> and add the policy below.**
"The repository of your image  does not grant ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, ecr:BatchCheckLayerAvailability permission to sagemaker.amazonaws.com service principal"

```
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "allowSageMakerToPull",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability"
            ]
        }
    ]
}
```
References:

Troubleshoot Amazon ECR Permissions for Inference Pipelines
- https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-troubleshoot.html
Setting a Repository Policy Statement
- https://docs.aws.amazon.com/AmazonECR/latest/userguide/set-repository-policy.html
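The repository policy can also be produced and checked in code before it is applied. The sketch below rebuilds the policy document shown above as a JSON string and verifies that the three pull actions named in the error message are granted to the SageMaker service principal. `build_pull_policy` and `missing_actions` are hypothetical helper names.

```python
import json

REQUIRED_ACTIONS = {
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "ecr:BatchCheckLayerAvailability",
}


def build_pull_policy():
    """Return the repository policy shown above as a JSON string."""
    policy = {
        "Version": "2008-10-17",
        "Statement": [{
            "Sid": "allowSageMakerToPull",
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": sorted(REQUIRED_ACTIONS),
        }],
    }
    return json.dumps(policy, indent=4)


def missing_actions(policy_text):
    """List required pull actions not granted to the SageMaker principal."""
    granted = set()
    for stmt in json.loads(policy_text).get("Statement", []):
        principal = stmt.get("Principal", {})
        if (stmt.get("Effect") == "Allow" and isinstance(principal, dict)
                and principal.get("Service") == "sagemaker.amazonaws.com"):
            actions = stmt.get("Action", [])
            granted.update([actions] if isinstance(actions, str) else actions)
    return sorted(REQUIRED_ACTIONS - granted)


policy_text = build_pull_policy()
print(missing_actions(policy_text))
```

As far as I know, the resulting string is the kind of value the boto3 ECR client accepts as `policyText` in `set_repository_policy`; that call is left out here so the sketch runs without AWS credentials.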

In [33]:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
import boto3

from time import gmtime, strftime

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_name = 'churn-model-inference-pipeline-' + timestamp_prefix



pipeline_model = PipelineModel(
    name = model_name,
    role = role,
    models = [
        scikit_learn_pre_process_model,
        pca_infer_model,        
        xgb_model,
        scikit_learn_post_process_model        
    ]
)

In [34]:
%%time

instance_type='ml.t2.medium'
# instance_type='local'
endpoint_name= 'churn-model-pipeline-endpoint-' + timestamp_prefix

deployed_model = pipeline_model.deploy(
    initial_instance_count=1, 
    instance_type= instance_type, 
    endpoint_name = endpoint_name,        
    wait = True
)

-------------------------!
CPU times: user 373 ms, sys: 28.4 ms, total: 402 ms
Wall time: 12min 33s
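Both the model name and the endpoint name are derived from a UTC timestamp so that each run creates uniquely named resources. The sketch below wraps that pattern in a hypothetical helper, `timestamped_name`, and adds a validity check. The 63-character limit and the allowed-character pattern are assumptions based on my understanding of the endpoint naming rules and should be verified against the API reference.

```python
import re
from time import gmtime, strftime

# Assumed constraints on endpoint names: at most 63 characters,
# alphanumerics separated by hyphens. Verify against the API reference.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LENGTH = 63


def timestamped_name(base, now=None):
    """Append a UTC timestamp to `base` and validate the result."""
    stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime() if now is None else now)
    name = "{}-{}".format(base.rstrip("-"), stamp)
    if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError("Invalid resource name: {!r}".format(name))
    return name


# A fixed time (the Unix epoch) makes the example reproducible.
print(timestamped_name("churn-model-pipeline-endpoint", gmtime(0)))
```

Checking the name locally is cheap compared with waiting several minutes for a deployment request to be rejected.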


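Deploying to a local endpoint (instance_type='local') raises an error. 
The presumed cause is that the preprocessing and post-processing models were created in 'local' mode <br>
while the XGBoost model was created in SageMaker hosted mode.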

In [35]:
from sagemaker.predictor import json_serializer, csv_serializer, json_deserializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON
import sagemaker
sagemaker_session = sagemaker.Session()

predictor = RealTimePredictor(
    endpoint = endpoint_name,
    sagemaker_session = sagemaker_session,
    serializer = csv_serializer,
    content_type = CONTENT_TYPE_CSV,
    accept = CONTENT_TYPE_JSON
)



In [36]:

def make_inference_format(sample):
    """Join the fields of one record into a single comma-separated string."""
    return ','.join(str(token) for token in sample)
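Joining fields with a bare comma works for this dataset because no field contains a comma or a quote. If that ever changes, the standard `csv` module handles quoting. The following sketch shows an alternative, `to_csv_line`, which is a hypothetical helper and not what the notebook uses; the `"Smith, John"` value is an invented example.

```python
import csv
import io


def to_csv_line(values):
    """Serialize one record as a CSV line, quoting fields when needed."""
    buffer = io.StringIO()
    # lineterminator="" keeps the result free of a trailing newline.
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


print(to_csv_line(["KS", 186, 510, "400-6454", "no"]))
print(to_csv_line(["Smith, John", 1]))
```

For records without special characters the output is identical to a plain comma join, so the two approaches are interchangeable on the churn data.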


In [39]:
test_df = pd.read_csv("churn_data/batch_transform_test.csv", header=None)

for i in range(10):
    sample = test_df.iloc[i,:]
    instance = make_inference_format(sample)
    print("instance: \n", instance)

    payload = instance
    churn_result = predictor.predict(payload)
    print("Churn result?: \n", churn_result)
    print("")

instance: 
 KS,186,510,400-6454,no,no,0,137.8,97,23.43,187.7,118,15.95,146.4,85,6.59,8.7,6,2.35,1
Churn result?: 
 b'False\n'

instance: 
 MA,132,415,343-5372,no,yes,25,113.2,96,19.24,269.9,107,22.94,229.1,87,10.31,7.1,7,1.92,2
Churn result?: 
 b'False\n'

instance: 
 MA,112,415,358-7379,no,yes,17,183.2,95,31.14,252.8,125,21.49,156.7,95,7.05,9.7,3,2.62,0
Churn result?: 
 b'False\n'

instance: 
 FL,91,510,387-9855,yes,yes,24,93.5,112,15.9,183.4,128,15.59,240.7,133,10.83,9.9,3,2.67,0
Churn result?: 
 b'False\n'

instance: 
 SC,22,408,331-5138,no,no,0,110.3,107,18.75,166.5,93,14.15,202.3,96,9.1,9.5,5,2.57,0
Churn result?: 
 b'False\n'

instance: 
 DC,102,415,402-9704,no,no,0,186.8,92,31.76,173.7,123,14.76,250.9,131,11.29,9.7,4,2.62,2
Churn result?: 
 b'False\n'

instance: 
 ME,118,408,384-8723,yes,yes,21,156.5,122,26.61,209.2,125,17.78,158.7,81,7.14,11.1,3,3.0,4
Churn result?: 
 b'True\n'

instance: 
 NM,178,415,398-1332,no,yes,35,175.4,88,29.82,190.0,65,16.15,138.7,94,6.24,10.5,3,2.84,2
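The endpoint responses in the output above arrive as raw bytes such as `b'False\n'`. Downstream code usually wants a boolean, so a small parser is sketched here. `parse_churn_response` is a hypothetical helper, and it assumes the body is always exactly `True` or `False` followed by optional whitespace, as in the output shown.

```python
def parse_churn_response(raw):
    """Convert a response body such as b'False\\n' into a bool."""
    text = raw.decode("utf-8").strip()
    if text not in ("True", "False"):
        raise ValueError("Unexpected response: {!r}".format(text))
    return text == "True"


print(parse_churn_response(b"False\n"))
print(parse_churn_response(b"True\n"))
```

Comparing against the literal strings avoids the common mistake of calling `bool()` on a non-empty string, which is always true.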
