# [Module 9.0] Checking the Inference Pipeline Creation Logs

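- This notebook runs the steps below and logs each inference call, to show how the Inference Pipeline Model works.
    - Create the Feature Transformer (the preprocessing model)
    - Generate preprocessed data from the training data with the Feature Transformer
    - Generate preprocessed data from the validation data with the Feature Transformer
    - Train XGBoost
    - Create the Inference Pipeline Model (preprocessing, XGBoost, and post-processing models)
    - Create a real-time endpoint
    - Run inference on a single test record
- The notebook takes about 10 minutes to run.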

## Feature Transformer (preprocessing model) - log-preprocessing.py file
- Numerical data is standardized with <a href=https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html>StandardScaler</a>.
    * z = (x - u) / s
        * (z: the standardized value, used for training; x: each data value; u: the mean of the feature; s: the standard deviation of the feature)
- All features listed below, from Account Length through CustServ Calls, are preprocessed this way.
    - The imputer below replaces missing values with the median of the column.

```python
    numeric_features = list([
    'Account Length',
    'VMail Message',
    'Day Mins',
    'Day Calls',
    'Eve Mins',
    'Eve Calls',
    'Night Mins',
    'Night Calls',
    'Intl Mins',
    'Intl Calls',
    'CustServ Calls'])

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
```
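To make the two pipeline steps concrete, here is a minimal standard-library sketch of what median imputation followed by standardization computes for a single column. It is not how scikit-learn is implemented; the helper name `impute_and_standardize` and the sample values are made up for illustration. Like `StandardScaler`, it uses the population standard deviation.

```python
import statistics


def impute_and_standardize(values):
    """Fill None with the column median, then return z = (x - u) / s.

    Assumes the column is not constant, so s is non-zero.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    u = statistics.fmean(filled)      # mean of the column
    s = statistics.pstdev(filled)     # population standard deviation
    return [(x - u) / s for x in filled]


# Hypothetical column with one missing value.
column = [10.0, None, 20.0, 30.0]
z_scores = impute_and_standardize(column)
print(z_scores)
```

The missing entry is replaced by the median (20.0), which here equals the mean, so its standardized value is zero.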
- Categorical data is preprocessed with one-hot encoding. (For example, if male is 0 and female is 1, they are encoded as male: (1,0) and female: (0,1).)
    - This is applied to State, Area Code, Int'l Plan, and VMail Plan.
```python
    categorical_features = ['State','Area Code',"Int'l Plan",'VMail Plan']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
```
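As a rough illustration of the encoding above, the following standard-library sketch mimics the behavior configured in `categorical_transformer`: missing values become the constant `'missing'`, and a category not seen during fitting encodes to all zeros, which is the effect of `handle_unknown='ignore'`. The helper names and sample values are hypothetical.

```python
def fit_categories(values, fill_value="missing"):
    """Learn the sorted set of categories, treating None as fill_value."""
    return sorted({fill_value if v is None else v for v in values})


def one_hot(value, categories, fill_value="missing"):
    """Encode one value; a category unseen at fit time yields all zeros."""
    value = fill_value if value is None else value
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical training values for a yes/no column.
categories = fit_categories(["no", "yes", "no"])
print(categories)
print(one_hot("yes", categories))
print(one_hot("maybe", categories))  # unseen category
```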
- Finally, the numerical and categorical transformers are combined into a single transformer, which is fitted, and the fitted transformer model is uploaded to S3.
```python
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)],
        remainder="drop")

    preprocessor.fit(concat_data)

    joblib.dump(preprocessor, os.path.join(args.model_dir, "model.joblib"))
```
- The Phone feature is excluded from the preprocessing above. It is unique to each user, so it is unlikely to be meaningful as a feature.
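
The effect of `remainder="drop"` can be sketched without scikit-learn: each named group of columns goes through its own function, the results are concatenated in order, and any column not listed in a group (such as `Phone`) never reaches the output. The function `column_transform`, the placeholder transforms, and the sample row are assumptions for illustration only.

```python
def column_transform(row, groups):
    """Apply each (columns, func) group to a dict row and concatenate results.

    Columns that appear in no group are dropped, like remainder="drop".
    """
    output = []
    for columns, func in groups:
        output.extend(func([row[c] for c in columns]))
    return output


def scale_by_100(values):
    # Placeholder standing in for the fitted numeric pipeline.
    return [v / 100.0 for v in values]


def yes_no_flags(values):
    # Placeholder standing in for the fitted one-hot pipeline: (no, yes).
    flags = []
    for v in values:
        flags.extend([1.0, 0.0] if v == "no" else [0.0, 1.0])
    return flags


# Hypothetical row; "Phone" is not listed in any group below.
row = {"Day Mins": 150.0, "Phone": "382-4657", "VMail Plan": "yes"}
features = column_transform(row, [
    (["Day Mins"], scale_by_100),
    (["VMail Plan"], yes_no_flags),
])
print(features)
```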

In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

## Creating the Feature Transformer (preprocessing model)
The cell below does the following.
- Creates an SKLearn Estimator.
    - Passes the training data in s3_input_train as input to SKLearn.
    - Specifies log-preprocessing.py, the source code that builds the preprocessing model (Featurizer).
    - Sets instance_type = 'local' as the compute resource. (This uses the Docker Compose already installed on the notebook instance.)
        - **A SageMaker cloud instance can be used instead of local mode. (e.g., ml.m4.xlarge)**
        - **The XGBoost training algorithm used later runs on a SageMaker cloud instance.**
- When training of the SKLearn preprocessing model completes, the resulting model artifact (model.tar.gz) is saved to s3://{bucket_name}/{job_name}/model.tar.gz.
    - (e.g., s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz)
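
When inspecting or downloading the artifact, it can help to split such a URI into its bucket and key. A small standard-library sketch follows; `split_s3_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


# The example artifact location quoted above.
artifact_uri = (
    "s3://sagemaker-us-east-2-057716757052/"
    "sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz"
)
bucket_name, key = split_s3_uri(artifact_uri)
print(bucket_name)
print(key)
```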

#### The cell below takes about 1 minute. Please wait until the cell's [*] marker changes to [number] (e.g., [3]).

In [3]:
from sagemaker.sklearn.estimator import SKLearn
sagemaker_session = sagemaker.Session()
from sagemaker import get_execution_role

role = get_execution_role()

script_path = 'log-preprocessing.py'
# instance_type = 'ml.m4.2xlarge'
instance_type = 'local'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type = instance_type
)
sklearn_preprocessor.fit({'train': s3_input_train})

This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


Creating tmpjf1cvu47_algo-1-p6bj1_1 ... done
Attaching to tmpjf1cvu47_algo-1-p6bj1_1
algo-1-p6bj1_1  | 2020-08-12 07:07:05,709 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-p6bj1_1  | 2020-08-12 07:07:05,711 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-p6bj1_1  | 2020-08-12 07:07:05,720 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-p6bj1_1  | 2020-08-12 07:07:05,809 sagemaker-containers INFO     Module log-preprocessing does not provide a setup.py.
algo-1-p6bj1_1  | Generating setup.py
algo-1-p6bj1_1  | 2020-08-12 07:07:05,809 sagemaker-containers INFO     Generating setup.cfg
algo-1-p6bj1_1  | 2020-08-12 07:07:05,810 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-p6bj1_1  | 2020-08-12 07:07:05,810 sagemaker-containers INFO     Installing module with the following comma

## Generating Preprocessed Training and Validation Data with the Feature Transformer

![Transformer_Train](img/Fig2.1.transformer_train.png)

### Creating the Preprocessed Training Data (Features)
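
The cell in this section passes a `TRANSFORM_MODE` environment variable to the model so that one script can act both as the featurizer and as the label post-processor. The contents of log-preprocessing.py are not shown here, so the following is only a sketch of how such a switch might look; `select_transform` and the two placeholder functions are hypothetical, and only the two mode strings come from the notebook.

```python
import os


def feature_transform(record):
    # Placeholder: the real script would apply the fitted preprocessor.
    return "features(" + record + ")"


def inverse_label_transform(record):
    # Placeholder: the real script would map a prediction back to a label.
    return "label(" + record + ")"


def select_transform(environ=os.environ):
    """Pick the handler named by TRANSFORM_MODE, failing on unknown modes."""
    mode = environ.get("TRANSFORM_MODE")
    handlers = {
        "feature-transform": feature_transform,
        "inverse-label-transform": inverse_label_transform,
    }
    if mode not in handlers:
        raise RuntimeError("unsupported TRANSFORM_MODE: " + repr(mode))
    return handlers[mode]


# A plain dict stands in for the container environment.
handler = select_transform({"TRANSFORM_MODE": "feature-transform"})
print(handler("row"))
```

Failing loudly on an unset or misspelled mode is a design choice in this sketch; it makes a misconfigured container visible in the logs instead of silently picking a default.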

#### The cell below takes about 1 minute. Please wait until the cell's [*] marker changes to [number] (e.g., [4]).

In [4]:
# Specify the output path
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'

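# Create the preprocessing model under the name scikit_learn_inferencee_model.
# The TRANSFORM_MODE environment variable tells the script to run in preprocessing (feature-transform) mode.
#   At inference time, the environment variable is set to "TRANSFORM_MODE": "inverse-label-transform".
#   The two steps could live in separate scripts, but a single source file (log-preprocessing.py) is used and the environment variable selects the mode.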
scikit_learn_inferencee_model = sklearn_preprocessor.create_model(
    env={'TRANSFORM_MODE': 'feature-transform'})
# Create the training-data Transformer from scikit_learn_inferencee_model
transformer_train = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], 
                            content_type='text/csv')
print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()
preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
print(preprocessed_train_path)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmpnqdvrr9p_algo-1-4tc5s_1
algo-1-4tc5s_1  | Processing /opt/ml/code
algo-1-4tc5s_1  | Building wheels for collected packages: log-preprocessing
algo-1-4tc5s_1  |   Building wheel for log-preprocessing (setup.py) ... done
algo-1-4tc5s_1  |   Created wheel for log-preprocessing: filename=log_preprocessing-1.0.0-py2.py3-none-any.whl size=10221 sha256=a26db4631d8053b24b8a186ed1fec0c80bbc5b9cde871df219d302488b1d7af8
algo-1-4tc5s_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-fsuv95_4/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-4tc5s_1  | Successfully built log-preprocessing
algo-1-4tc5s_1  | Installing collected packages: log-preprocessing
algo-1-4tc5s_1  | Successfully installed log-preprocessing-1.0.0
algo-1-4tc5s_1  |   import imp
algo-1-4tc5s_1  | [2020-08-12 07:07:11 +0000] [72] [INFO] Starting gunicorn 19.9.0
algo-1-4tc5s_1  |

#### Checking the Preprocessed Training File

In [5]:
print(preprocessed_train_path)

s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-12-07-07-2020-08-12-07-07-08-229


In [6]:
! aws s3 ls {preprocessed_train_path} --recursive

2020-08-12 07:07:14    1054526 sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-12-07-07-2020-08-12-07-07-08-229/train.csv.out
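
The listing shows the naming convention batch transform uses: the output object carries the input file name with `.out` appended, here under the path printed by the previous cell. The notebook joins S3 paths with `os.path.join`, which works on the Linux notebook instance; a sketch using `posixpath` keeps forward slashes on any platform. The helper `transform_output_uri` and the bucket and prefix names below are hypothetical.

```python
import posixpath


def transform_output_uri(output_prefix, input_uri):
    """Return the expected output URI: <prefix>/<input file name>.out"""
    file_name = posixpath.basename(input_uri)
    return posixpath.join(output_prefix, file_name + ".out")


# Placeholder locations, not the notebook's real bucket or job name.
uri = transform_output_uri(
    "s3://my-bucket/prefix/transformtrain-train-output/job-name",
    "s3://my-bucket/prefix/train.csv",
)
print(uri)
```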


In [7]:
preprocessed_train_path_file = os.path.join(preprocessed_train_path, 'train.csv.out')
df_pre_train = pd.read_csv(preprocessed_train_path_file)
df_pre_train.head()


Unnamed: 0,0.0,0.11941369588439606,-0.5962380254245051,1.744368057672484,0.9789570533336895,-0.028992907038264654,-0.8931854019845896,-0.8017032037830547,-1.9825286353116254,-1.5305589315744583,...,0.0.48,0.0.49,0.0.50,0.0.51,0.0.52,0.0.53,1.0.1,0.0.54,1.0.2,0.0.55
0,0.0,-1.852652,-0.596238,0.140284,-0.310405,0.970689,-0.689888,0.146389,1.232901,0.124852,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,1.181295,-0.596238,1.83513,0.185503,0.030988,-0.639063,1.568529,-0.063643,-0.846802,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,0.776769,-0.596238,0.216227,0.334276,0.136954,1.393914,1.394712,-0.634123,0.844596,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,-0.234547,1.508734,-0.459859,0.483049,-0.230929,0.224952,1.056954,0.92173,-0.810815,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.0,0.751486,1.218393,0.231046,-0.756723,0.516833,0.275776,1.043127,-2.138114,0.232814,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
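
The column labels in this output are themselves data values (with suffixes such as `0.0.48` added to de-duplicate them), which suggests the transform output has no header row and the first record was consumed as the header. Passing `header=None` to `pd.read_csv` avoids that. The same idea is shown below as a standard-library sketch, using made-up sample text in place of the real file.

```python
import csv
import io

# Hypothetical two-row sample shaped like a headerless transform output.
sample = "0.0,0.119,-0.596\n1.0,1.181,-0.596\n"

# csv.reader treats every line as data, so no record is lost to a header.
rows = [
    [float(cell) for cell in record]
    for record in csv.reader(io.StringIO(sample))
]
print(len(rows))
print(rows[0])
```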
