# [Module 2.2] Training the Preprocessing Model and Preprocessing the Train and Validation Data

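This notebook trains a preprocessing model and uses it to preprocess the Train and Validation data, saving the results to S3. <br>
The preprocessed data is then provided as input for training the Custom PCA model. <br>
The notebook also stores information about the preprocessing model (the model artifact, the URL of the SKLearn Docker image used for training and inference, and so on) for later use when building the inference pipeline.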



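Specifically, this notebook performs the following tasks.

- Train the Feature Transformer (preprocessing model)
- Run inference with the Feature Transformer on the Train data to generate preprocessed data
- Run inference with the Feature Transformer on the Validation data to generate preprocessed data
- Save the model artifact and code paths
    - Store the values needed later when building the inference pipeline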

---
This notebook takes about 10 minutes to run.

---
Definition of an inference pipeline: An inference pipeline is an Amazon SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data.<br>
[Reference: inference pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html)
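
To make the "linear sequence" in this definition concrete, here is a minimal conceptual sketch in plain Python. It does not use the SageMaker API. The `run_pipeline` helper and the two stand-in stages are hypothetical names chosen for illustration, and each stage simply receives the previous stage's output.

```python
from typing import Callable, List

# A stage takes a request payload and returns a transformed payload.
Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], request: str) -> str:
    """Pass the request through each stage in order (hypothetical helper)."""
    payload = request
    for stage in stages:
        payload = stage(payload)
    return payload


def preprocess(csv_row: str) -> str:
    # Stand-in for a preprocessing container: trim whitespace around each field.
    return ",".join(field.strip() for field in csv_row.split(","))


def predict(csv_row: str) -> str:
    # Stand-in for a model container: returns the field count as a dummy result.
    return str(len(csv_row.split(",")))


result = run_pipeline([preprocess, predict], " 1.0 , 2.0 ,3.0 ")
print(result)
```

In an actual inference pipeline each stage is a container, and SageMaker forwards the response of one container as the request to the next.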


In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

## Training the Feature Transformer (Preprocessing Model)
The cell below does the following.
- Creates an Estimator called SKLearn.
    - Provides the training data in s3_input_train as input to SKLearn.
    - Specifies preprocessing.py, the source code that trains the "preprocessing model (Featurizer)".
    - Specifies instance_type = 'local' as the resource to use. (This uses the Docker Compose installation already present on the notebook instance.)
        - **Instead of local mode, a SageMaker cloud instance can also be used. (e.g., ml.m4.xlarge)**
        - **When the XGBoost algorithm is used later, a SageMaker cloud instance is used.**
- Once the SKLearn "preprocessing model" finishes training, the resulting model artifact file (model.tar.gz) is saved to s3://{bucket_name}/{job_name}/model.tar.gz.
    - (e.g., s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz)
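
The artifact location above follows a simple pattern, which the sketch below builds and then splits back into a bucket and key. The helper names `artifact_uri` and `split_s3_uri` are hypothetical, and the layout is taken from the example path in this notebook. Jobs run on cloud instances may use a different layout, so treat this as an assumption rather than a rule.

```python
from urllib.parse import urlparse


def artifact_uri(bucket_name: str, job_name: str) -> str:
    """Build the model artifact URI using the layout shown in this notebook."""
    return f"s3://{bucket_name}/{job_name}/model.tar.gz"


def split_s3_uri(uri: str):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = artifact_uri(
    "sagemaker-us-east-2-057716757052",
    "sagemaker-scikit-learn-2020-07-15-08-39-41-035",
)
bucket_name, key = split_s3_uri(uri)
print(uri)
print(bucket_name, key)
```

In the notebook itself there is no need to build this path by hand, because the estimator exposes it as `sklearn_preprocessor.model_data` after training.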

#### The cell below takes about 1 minute. Please wait until the [*] marker on the cell changes to a [number] (e.g., [3]).

In [3]:
%%time 

from sagemaker.sklearn.estimator import SKLearn
sagemaker_session = sagemaker.Session()
from sagemaker import get_execution_role

role = get_execution_role()

script_path = 'preprocessing.py'
# instance_type = 'ml.t2.medium'
instance_type = 'local'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    py_version='py3',
    framework_version="0.20.0",
    instance_type = instance_type
)
sklearn_preprocessor.fit({'train': s3_input_train})

Creating tmp0r6wwoea_algo-1-zj5te_1 ... done
Attaching to tmp0r6wwoea_algo-1-zj5te_1
algo-1-zj5te_1  | 2020-08-27 09:25:06,681 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-zj5te_1  | 2020-08-27 09:25:06,683 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-zj5te_1  | 2020-08-27 09:25:06,691 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-zj5te_1  | 2020-08-27 09:25:06,821 sagemaker-containers INFO     Module preprocessing does not provide a setup.py. 
algo-1-zj5te_1  | Generating setup.py
algo-1-zj5te_1  | 2020-08-27 09:25:06,822 sagemaker-containers INFO     Generating setup.cfg
algo-1-zj5te_1  | 2020-08-27 09:25:06,822 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-zj5te_1  | 2020-08-27 09:25:06,822 sagemaker-containers INFO     Installing module with the following command:


## Generating Preprocessed Train and Validation Data with the Feature Transformer

![TransformerTrain](img/Fig2.1.transformer_train.png)
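
The transform cells in this section configure the model through the `TRANSFORM_MODE` environment variable, so that a single script can serve both feature transformation and inverse label transformation. The sketch below shows how a script might branch on that variable. It illustrates the general pattern under stated assumptions and is not the contents of preprocessing.py. The `select_mode` helper and the choice of default are hypothetical.

```python
import os

FEATURE_TRANSFORM = "feature-transform"
INVERSE_LABEL_TRANSFORM = "inverse-label-transform"


def select_mode(environ=os.environ) -> str:
    """Return the transform mode requested through TRANSFORM_MODE."""
    # Assumption: fall back to feature transformation when the variable is unset.
    mode = environ.get("TRANSFORM_MODE", FEATURE_TRANSFORM)
    if mode not in (FEATURE_TRANSFORM, INVERSE_LABEL_TRANSFORM):
        raise ValueError(f"Unsupported TRANSFORM_MODE: {mode!r}")
    return mode


# A plain dict stands in for the container environment here.
print(select_mode({"TRANSFORM_MODE": "inverse-label-transform"}))
```

Keeping the switch in the environment means the same uploaded source can back two different SageMaker models, each created with its own `env` setting.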

### Creating the preprocessed Train data

#### The cell below takes about 1 minute. Please wait until the [*] marker on the cell changes to a [number] (e.g., [4]).

In [4]:
%%time

# Set the output path
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')
instance_type = 'local'
# instance_type = 'ml.t2.medium'

# Create the trained preprocessing model under the name scikit_learn_inferencee_model.
# The TRANSFORM_MODE environment variable tells the script that it runs in preprocessing mode.
# At inference time, the environment variable is set to "TRANSFORM_MODE": "inverse-label-transform".
# The two steps could be split into separate scripts, but an environment variable
# distinguishes them so that a single source file (preprocessing.py) can be used.
scikit_learn_inferencee_model = sklearn_preprocessor.create_model(
    env={'TRANSFORM_MODE': 'feature-transform'})
# Create the Train Transformer from scikit_learn_inferencee_model
transformer_train = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], 
                            content_type='text/csv')
print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()
preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
print(preprocessed_train_path)

Attaching to tmp1n58j1yr_algo-1-ww4ka_1
algo-1-ww4ka_1  | Processing /opt/ml/code
algo-1-ww4ka_1  | Building wheels for collected packages: preprocessing
algo-1-ww4ka_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-ww4ka_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=10224 sha256=dfcaafedd6c79d89f3a28379c3a56c30859c5199122826c5cf5a5560217ff0bc
algo-1-ww4ka_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-_ng1ijm9/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-ww4ka_1  | Successfully built preprocessing
algo-1-ww4ka_1  | Installing collected packages: preprocessing
algo-1-ww4ka_1  | Successfully installed preprocessing-1.0.0
algo-1-ww4ka_1  |   import imp
algo-1-ww4ka_1  | [2020-08-27 09:25:12 +0000] [49] [INFO] Starting gunicorn 19.9.0
algo-1-ww4ka_1  | [2020-08-27 09:25:12 +000

#### Check the preprocessed Train file

In [5]:
print(preprocessed_train_path)

s3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-27-09-25-2020-08-27-09-25-09-297


In [6]:
! aws s3 ls {preprocessed_train_path} --recursive

2020-08-27 09:25:16    1054526 sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-27-09-25-2020-08-27-09-25-09-297/train.csv.out
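
The listing reflects two naming rules: the result folder is the transformer's output path followed by the transform job name, and batch transform names each output object after its input object with `.out` appended. A minimal sketch of that composition follows. The helper name `transform_output_uri`, the bucket, and the job name are placeholders for illustration.

```python
import posixpath


def transform_output_uri(output_path: str, job_name: str, input_uri: str) -> str:
    """Compose the expected location of a batch transform result."""
    # The notebook concatenates output_path (which ends with '/') and the job name.
    job_folder = output_path + job_name
    # Each result object is named after its input object plus the '.out' suffix.
    input_name = posixpath.basename(input_uri)
    return posixpath.join(job_folder, input_name + ".out")


expected = transform_output_uri(
    "s3://example-bucket/example-prefix/transformtrain-train-output/",  # placeholder
    "example-transform-job",                                            # placeholder
    "s3://example-bucket/example-prefix/train.csv",                     # placeholder
)
print(expected)
```

`posixpath` is used instead of `os.path` so that the separator is always `/`, which is what S3 keys expect regardless of the operating system.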


In [7]:
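preprocessed_train_path_file = os.path.join(preprocessed_train_path, 'train.csv.out')
# Note: the transform output appears to have no header row, so read_csv treats the
# first record as column names here; pass header=None to keep it as data.
df_pre_train = pd.read_csv(preprocessed_train_path_file)
df_pre_train.head()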


Unnamed: 0,0.0,0.11941369588439606,-0.5962380254245051,1.744368057672484,0.9789570533336895,-0.028992907038264654,-0.8931854019845896,-0.8017032037830547,-1.9825286353116254,-1.5305589315744583,...,0.0.48,0.0.49,0.0.50,0.0.51,0.0.52,0.0.53,1.0.1,0.0.54,1.0.2,0.0.55
0,0.0,-1.852652,-0.596238,0.140284,-0.310405,0.970689,-0.689888,0.146389,1.232901,0.124852,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,1.181295,-0.596238,1.83513,0.185503,0.030988,-0.639063,1.568529,-0.063643,-0.846802,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,0.776769,-0.596238,0.216227,0.334276,0.136954,1.393914,1.394712,-0.634123,0.844596,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,-0.234547,1.508734,-0.459859,0.483049,-0.230929,0.224952,1.056954,0.92173,-0.810815,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.0,0.751486,1.218393,0.231046,-0.756723,0.516833,0.275776,1.043127,-2.138114,0.232814,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
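
The column labels in the table above are data values (for example `0.11941369588439606`) rather than names, which suggests that the transform output has no header row and that `read_csv` consumed the first record as the header. The sketch below shows the difference using a tiny in-memory CSV in place of the S3 file. The sample values are made up for illustration.

```python
import io

import pandas as pd

# Two records, no header row (made-up values).
sample = "0.0,0.12,-0.60\n1.0,1.18,-0.59\n"

# Default behavior: the first record becomes the column labels.
df_default = pd.read_csv(io.StringIO(sample))

# header=None keeps every record as data and labels the columns 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO(sample), header=None)

print(df_default.shape)
print(df_no_header.shape)
print(list(df_no_header.columns))
```

With `header=None` the row count matches the number of records in the file, which matters if the DataFrame is used for anything beyond a quick visual check.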


### Creating the preprocessed Validation data

In [8]:
# Set the output path
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-validation-output')
# Create the Validation Transformer from scikit_learn_inferencee_model
transformer_validation = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')
# Preprocess validation input
transformer_validation.transform(s3_input_validation.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')
print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()
preprocessed_validation_path = transformer_validation.output_path+transformer_validation.latest_transform_job.job_name
print(preprocessed_validation_path)


Attaching to tmpvl90k_bp_algo-1-n3h5g_1
algo-1-n3h5g_1  | Processing /opt/ml/code
algo-1-n3h5g_1  | Building wheels for collected packages: preprocessing
algo-1-n3h5g_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-n3h5g_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=10222 sha256=6b76a932cab23d0c6b40acce06eb7e9ad17eb6cb4aa71f879dbebee5c2ba8408
algo-1-n3h5g_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-bbzusv0j/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-n3h5g_1  | Successfully built preprocessing
algo-1-n3h5g_1  | Installing collected packages: preprocessing
algo-1-n3h5g_1  | Successfully installed preprocessing-1.0.0
algo-1-n3h5g_1  |   import imp
algo-1-n3h5g_1  | [2020-08-27 09:25:20 +0000] [49] [INFO] Starting gunicorn 19.9.0
algo-1-n3h5g_1  | [2020-08-27 09:25:20 +000

In [9]:
! aws s3 ls {preprocessed_validation_path} --recursive

## Saving the Model Artifact and Code Paths

This step stores values in variables so that they can be used later when creating the SageMaker Model for the Inference Pipeline.
There are two main kinds of values.

- The S3 path and file name of the code used by sklearn_preprocessor (the SKLearn Estimator). (This code contains everything used for both training and inference.)
- The trained model artifact (model.tar.gz) and the Docker image that provides the environment for running it (366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3).

In [10]:
# Store preprocess code location
preprocessor_uploaded_code_s3_prefix = sklearn_preprocessor.uploaded_code.s3_prefix
preprocessor_uploaded_code_script_name = sklearn_preprocessor.uploaded_code.script_name

print("preprocessor_uploaded_code_s3_prefix: \n", preprocessor_uploaded_code_s3_prefix)
print("preprocessor_uploaded_code_script_name: \n", preprocessor_uploaded_code_script_name)


preprocessor_uploaded_code_s3_prefix: 
 s3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2020-08-27-09-25-04-653/source/sourcedir.tar.gz
preprocessor_uploaded_code_script_name: 
 preprocessing.py


In [11]:
preprocessor_model_data = sklearn_preprocessor.model_data
preprocessor_image_name = sklearn_preprocessor.image_uri
print("preprocessor_model_data: \n", preprocessor_model_data)
print("preprocessor_image_name: \n", preprocessor_image_name)

%store preprocessed_train_path
%store preprocessed_validation_path
%store preprocessor_model_data
%store preprocessor_image_name

%store preprocessor_uploaded_code_s3_prefix 
%store preprocessor_uploaded_code_script_name 


preprocessor_model_data: 
 s3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2020-08-27-09-25-04-653/model.tar.gz
preprocessor_image_name: 
 366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
Stored 'preprocessed_train_path' (str)
Stored 'preprocessed_validation_path' (str)
Stored 'preprocessor_model_data' (str)
Stored 'preprocessor_image_name' (str)
Stored 'preprocessor_uploaded_code_s3_prefix' (str)
Stored 'preprocessor_uploaded_code_script_name' (str)
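
`%store` is an IPython magic, so other notebooks restore these values with `%store -r`. If a plain Python script also needs them, one option is to write the same values to a JSON file. This is an alternative to what the notebook does, sketched with placeholder values. The file name and the `save_values` and `load_values` helpers are hypothetical.

```python
import json
import os
import tempfile


def save_values(values: dict, path: str) -> None:
    """Write the values to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(values, f, indent=2)


def load_values(path: str) -> dict:
    """Read the values back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


values = {
    # Placeholder values; in the notebook these come from sklearn_preprocessor.
    "preprocessor_model_data": "s3://example-bucket/example-job/model.tar.gz",
    "preprocessor_uploaded_code_script_name": "preprocessing.py",
}

path = os.path.join(tempfile.mkdtemp(), "preprocessor_info.json")
save_values(values, path)
restored = load_values(path)
print(restored == values)
```

All of the stored values in this notebook are strings, so they round-trip through JSON without any custom serialization.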


In [12]:
%store

Stored variables and their in-db values:
bucket                                             -> 'sagemaker-ap-northeast-2-057716757052'
custom_pca_docker_image_uri                        -> '057716757052.dkr.ecr.ap-northeast-2.amazonaws.com
inference_pipeline_model_name                      -> 'churn-inference-pipeline-2020-08-27-09-03-30'
pca_image_uri                                      -> '057716757052.dkr.ecr.ap-northeast-2.amazonaws.com
pca_model_data                                     -> 's3://sagemaker-ap-northeast-2-057716757052/Scikit
prefix                                             -> 'sagemaker/customer-churn'
preprocessed_pca_train_path                        -> 's3://sagemaker-ap-northeast-2-057716757052/Scikit
preprocessed_pca_validation_path                   -> 's3://sagemaker-ap-northeast-2-057716757052/Scikit
preprocessed_train_path                            -> 's3://sagemaker-ap-northeast-2-057716757052/sagema
preprocessed_train_path_file                       ->