# [Module 3.2] Train PCA Model

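- This notebook carries out the steps below and logs the inference results so that you can see how the Inference Pipeline Model works.
    - Create the Feature Transformer (the preprocessing model)
    - Run the train data through the Feature Transformer to produce preprocessed data
    - Run the validation data through the Feature Transformer to produce preprocessed data
    - Train XGBoost
    - Create the Inference Pipeline Model (preprocessing, XGBoost, and postprocessing models)
    - Create a real-time endpoint
    - Run inference on a single test record
- The notebook takes about 10 minutes to run.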

In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

## PCA Training

In [3]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'pca'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

057716757052
ap-northeast-2
arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20191128T110038
sagemaker-ap-northeast-2-057716757052
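
The account ID printed above is taken from the fifth colon-separated field of the execution role ARN. The sketch below isolates that parsing step; `parse_account_id` is a hypothetical helper name (not part of the SageMaker SDK), and the ARN is the one printed above.

```python
def parse_account_id(role_arn):
    """Return the AWS account ID embedded in an IAM role ARN.

    ARN layout: arn:partition:service:region:account-id:resource
    (the region field is empty for IAM, hence the '::').
    """
    fields = role_arn.split(":")
    if len(fields) < 6 or fields[0] != "arn":
        raise ValueError("not an ARN: {!r}".format(role_arn))
    return fields[4]


role_arn = (
    "arn:aws:iam::057716757052:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20191128T110038"
)
print(parse_account_id(role_arn))
```

Validating the field count first makes a malformed role string fail loudly instead of silently producing a wrong image URI later.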


In [4]:
! cp pca_byoc_train.py docker/code/

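<font color="red">If your current Region is not ap-northeast-2, you must change the Region in the base image URI to match it.</font><br>
Example: if you are in Ohio (us-east-2), change the line as follows. Note that the registry account ID also differs by Region.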
```
FROM 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
```
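
Because the Region and the registry account ID in the `FROM` line change together, it can help to derive the line from a small lookup table. The sketch below only knows the two Regions that appear in this notebook; `BASE_IMAGE_ACCOUNTS` and `base_image_from_line` are hypothetical names, and account IDs for any other Region would have to be looked up in the AWS documentation.

```python
# Registry account IDs for the SageMaker scikit-learn base image,
# limited to the two Regions mentioned in this notebook.
BASE_IMAGE_ACCOUNTS = {
    "ap-northeast-2": "366743142698",
    "us-east-2": "257758044811",
}


def base_image_from_line(region, tag="0.20.0-cpu-py3"):
    """Build the Dockerfile FROM line for the given Region."""
    if region not in BASE_IMAGE_ACCOUNTS:
        raise KeyError("no registry account known for Region " + region)
    account = BASE_IMAGE_ACCOUNTS[region]
    return "FROM {0}.dkr.ecr.{1}.amazonaws.com/sagemaker-scikit-learn:{2}".format(
        account, region, tag
    )


print(base_image_from_line("us-east-2"))
```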

In [5]:
%%writefile docker/Dockerfile

FROM 366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
    
# install python package
RUN pip install joblib


ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/ml/code:${PATH}"

# Copy training code
COPY code/* /opt/ml/code/
 
WORKDIR /opt/ml/code

# ENTRYPOINT ["python", "pca_train.py"]
# In order to use SageMaker environment variables, use the statement below
ENV SAGEMAKER_PROGRAM pca_byoc_train.py

Overwriting docker/Dockerfile


In [6]:
import os
os.environ['account_id'] = account_id
os.environ['region'] = region
os.environ['ecr_repository_name'] = ecr_repository_name

In [7]:
%%sh

ACCOUNT_ID=${account_id}
REGION=${region}
REPO_NAME=${ecr_repository_name}

echo $REGION
echo $ACCOUNT_ID
echo $REPO_NAME


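# Log in to the ECR registry that hosts the SageMaker scikit-learn base image.
# The registry ID should match the account in the Dockerfile's FROM line
# (366743142698 for ap-northeast-2; 257758044811 is the us-east-2 example above).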
$(aws ecr get-login --registry-ids 257758044811 --region ${region} --no-include-email)



docker build -f docker/Dockerfile -t $REPO_NAME docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



ap-northeast-2
057716757052
sagemaker-training-containers/pca
Login Succeeded
Sending build context to Docker daemon  11.26kB
Step 1/8 : FROM 366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
 ---> 30adb1aa9af5
Step 2/8 : RUN pip install joblib
 ---> Running in 20153568247f
Collecting joblib
  Downloading https://files.pythonhosted.org/packages/51/dd/0e015051b4a27ec5a58b02ab774059f3289a94b0906f880a3f9507e74f38/joblib-0.16.0-py3-none-any.whl (300kB)
Installing collected packages: joblib
Successfully installed joblib-0.16.0
Removing intermediate container 20153568247f
 ---> 59663d1629c3
Step 3/8 : ENV PYTHONUNBUFFERED=TRUE
 ---> Running in 36e7c7ec8e18
Removing intermediate container 36e7c7ec8e18
 ---> 605f60b5674c
Step 4/8 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Running in fdf5876bc4d9
Removing intermediate container fdf5876bc4d9
 ---> b70c6ea0613d
Step 5/8 : ENV PATH="/opt/ml/code:${PATH}"
 ---> Running in c8328a74f026
Removing intermediate contai

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



In [8]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/pca:latest


## Granting Permissions on the Docker Image

Go to the ECR console and select the Docker image created above.

![Fig.3.2.ECR-Repository](img/Fig.3.2.ECR-Repository.png)

Click "Permissions" on the left, then click "Edit policy JSON" at the top right.

![Fig.3.2.ECR-permission](img/Fig.3.2.ECR-permission.png)

Copy the JSON below and paste it into the "Edit JSON" box.
It allows sagemaker.amazonaws.com to perform specific actions on the Docker image.
```
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "allowSageMakerToPull",
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
```

![Fig.3.2.ECR-edit-json](img/Fig.3.2.ECR-edit-json.png)

The setup is complete when you see a screen like the one below.
![Fig.3.2.ECR-permission-finish](img/Fig.3.2.ECR-permission-finish.png)
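
The same policy can also be attached programmatically instead of through the console. The sketch below rebuilds the policy document shown above and hands it to the ECR `set_repository_policy` API through a boto3 client. `build_pull_policy` and `apply_pull_policy` are hypothetical helper names, and the client is passed in as an argument so the logic can be exercised without AWS credentials.

```python
import json


def build_pull_policy():
    """Return the repository policy that lets SageMaker pull the image."""
    return {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "allowSageMakerToPull",
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:BatchGetImage",
                    "ecr:GetDownloadUrlForLayer",
                ],
            }
        ],
    }


def apply_pull_policy(ecr_client, repository_name):
    """Attach the pull policy to an ECR repository via the given client."""
    return ecr_client.set_repository_policy(
        repositoryName=repository_name,
        policyText=json.dumps(build_pull_policy()),
    )


# Usage inside the notebook (requires AWS credentials):
#   import boto3
#   apply_pull_policy(boto3.client("ecr"), ecr_repository_name)
```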

In [9]:
preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
print("preprocessed_train_path_file: \n", preprocessed_train_path_file)
print("preprocessed_validation_path_file: \n", preprocessed_validation_path_file)

preprocessed_train_path_file: 
 s3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-20-03-14-2020-08-20-03-14-50-345/train.csv.out
preprocessed_validation_path_file: 
 s3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-20-03-14-2020-08-20-03-14-57-711/validation.csv.out


## PCA Transformation

In [10]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


In [11]:
import pandas as pd
# preprocessed_train_path_file = 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-12-07-07-2020-08-12-07-07-08-229/train.csv.out'

churn_df = pd.read_csv(preprocessed_train_path_file, header=None)
churn_df.head()
train_y = churn_df.iloc[:,0]
train_X = churn_df.iloc[:,1:]

print("Shape of train_X: ", train_X.shape)
print("Shape of train_y: ", train_y.shape)

os.makedirs('./data', exist_ok =True)
np.savetxt('./data/churn-preprocessed.csv', train_X, delimiter=',',
           fmt='%1.5f'
          )

WORK_DIRECTORY = 'data'
prefix = 'Scikit-pca-custom'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY,
                                            key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
                                           )
print("train_input: ", train_input)


Shape of train_X:  (2333, 69)
Shape of train_y:  (2333,)
train_input:  s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/data
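
The cell above treats column 0 as the label and the remaining 69 columns as features, then writes only the features with five decimal places. A dependency-free sketch of the same split on a tiny made-up table follows; the values are illustrative and are not taken from the churn data.

```python
import csv
import io

# Two illustrative rows: label in column 0, features afterwards.
rows = [
    [0.0, 1.25, -0.5, 3.0],
    [1.0, 0.75, 2.0, -1.0],
]

labels = [row[0] for row in rows]
features = [row[1:] for row in rows]

# Write the features only, formatted like np.savetxt(..., fmt='%1.5f').
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for feature_row in features:
    writer.writerow(["{:.5f}".format(value) for value in feature_row])
features_csv = buffer.getvalue()

print(labels)
print(features_csv)
```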


In [12]:
%%time

import sagemaker

instance_type = 'local'
# instance_type = 'ml.m4.xlarge'

pca_estimator = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type= instance_type,
                                    base_job_name=prefix)

pca_estimator.set_hyperparameters(n_components= 25)

train_config = sagemaker.session.s3_input(train_input, content_type='text/csv')

pca_estimator.fit({'train': train_config})

Creating tmpwa891249_algo-1-0rjb1_1 ... done
Attaching to tmpwa891249_algo-1-0rjb1_1
algo-1-0rjb1_1  | 2020-08-21 03:02:29,199 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-0rjb1_1  | 2020-08-21 03:02:29,202 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-0rjb1_1  | 2020-08-21 03:02:29,210 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-0rjb1_1  | 2020-08-21 03:02:29,211 sagemaker-containers INFO     Module pca_byoc_train does not provide a setup.py. 
algo-1-0rjb1_1  | Generating setup.py
algo-1-0rjb1_1  | 2020-08-21 03:02:29,211 sagemaker-containers INFO     Generating setup.cfg
algo-1-0rjb1_1  | 2020-08-21 03:02:29,212 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-0rjb1_1  | 2020-08-21 03:02:29,212 sagemaker-containers INFO     Installing module with the following command:

## PCA Train Transforming

In [13]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


In [14]:
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-train-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the train Transformer from the PCA model
transformer_train = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(preprocessed_train_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()

preprocessed_pca_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name


Attaching to tmpo5z1b6a1_algo-1-liv5z_1
algo-1-liv5z_1  | Processing /opt/ml/code
algo-1-liv5z_1  | Building wheels for collected packages: pca-byoc-train
algo-1-liv5z_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-liv5z_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9477 sha256=c1c80cffa67455f74ca8bc5353413d0db39710b1f0346c331a624c9dfa64f03f
algo-1-liv5z_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-mi5lz68k/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-liv5z_1  | Successfully built pca-byoc-train
algo-1-liv5z_1  | Installing collected packages: pca-byoc-train
algo-1-liv5z_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-liv5z_1  |   import imp
algo-1-liv5z_1  | [2020-08-21 03:02:35 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-liv5z_1  | [2020-08-21 03:02:3

In [15]:
print(preprocessed_pca_train_path)

s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-21-03-02-31-990-2020-08-21-03-02-31-990


In [16]:
! aws s3 ls s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375 --recursive

2020-08-13 01:27:28     707835 Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375/train.csv.out.out
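
In this notebook the transform result lands under `output_path` followed by the job name, and the output object keeps the input file name with `.out` appended. Since the input was already called `train.csv.out`, the PCA result becomes `train.csv.out.out`. A small sketch of that naming rule follows; `transform_output_uri` is a hypothetical helper, and the bucket and job names are placeholders.

```python
import posixpath


def transform_output_uri(output_path, job_name, input_uri):
    """Compose the S3 URI of a transform result as laid out in this notebook."""
    input_name = posixpath.basename(input_uri)
    # output_path may or may not end with '/', so normalize it first.
    return "{0}/{1}/{2}.out".format(output_path.rstrip("/"), job_name, input_name)


uri = transform_output_uri(
    "s3://my-bucket/Scikit-pca-custom/transformtrain-pca-train-output/",  # placeholder bucket
    "pca-job-name",  # placeholder job name
    "s3://my-bucket/some/prefix/train.csv.out",
)
print(uri)
```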


In [17]:
preprocessed_pca_train_path_file = '{}/train.csv.out.out'.format(preprocessed_pca_train_path)
pca_preoc_df = pd.read_csv(preprocessed_pca_train_path_file, header=None)
pca_preoc_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -0.822971 | -0.108986 | 0.452238 | -0.028701 | -1.031895 | -2.957447 | -0.042852 | -0.718277 | 1.05564 | ... | 0.070918 | 0.122288 | -0.039894 | -0.208358 | -0.512724 | -0.145938 | -0.285708 | -0.161326 | 0.166548 | 0.134311 |
| 1 | 0.0 | -0.343563 | 0.092688 | 1.946674 | 1.271797 | 0.008117 | 0.419165 | -0.99059 | 0.874162 | -0.533861 | ... | 0.128007 | -0.094468 | -0.27066 | -0.39555 | 0.643733 | -0.106463 | -0.255564 | 0.0167 | 0.086769 | 0.116975 |
| 2 | 1.0 | -0.764182 | 0.010289 | 0.825062 | -1.427365 | -1.627981 | -0.740463 | 0.554086 | 0.429989 | -0.32233 | ... | -0.061429 | -0.133886 | 0.11017 | -0.321982 | -0.001844 | 0.73389 | -0.003944 | 0.042556 | -0.057111 | 0.040024 |
| 3 | 0.0 | -0.825846 | -0.722819 | -0.338572 | -0.981696 | -0.260432 | 0.356843 | -0.670504 | -1.10858 | -1.461891 | ... | -0.060964 | 0.148803 | -0.254422 | -0.195045 | 0.031457 | -0.116513 | 0.713128 | -0.000927 | -0.068848 | 0.065247 |
| 4 | 0.0 | 1.830923 | 0.700579 | 0.197524 | -1.351677 | -0.7296 | 0.844843 | 0.148726 | 0.082408 | 0.181474 | ... | -0.043718 | -0.075812 | 0.137604 | 0.047247 | 0.036165 | -0.058655 | 0.190555 | 0.090646 | -0.130942 | -0.202309 |


## PCA Validation Transforming

In [18]:
preprocessed_validation_path

's3://sagemaker-ap-northeast-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-20-03-14-2020-08-20-03-14-57-711'

In [19]:
import pandas as pd

preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
pre_df = pd.read_csv(preprocessed_validation_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(666, 70)
num_cols:  70


In [20]:


instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-validation-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the validation Transformer from the PCA model
transformer_validation = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')


# Transform the validation input
transformer_validation.transform(preprocessed_validation_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()

preprocessed_pca_validation_path = transformer_validation.output_path + transformer_validation.latest_transform_job.job_name
print(preprocessed_pca_validation_path)

Attaching to tmphkox5xvz_algo-1-j3nal_1
algo-1-j3nal_1  | Processing /opt/ml/code
algo-1-j3nal_1  | Building wheels for collected packages: pca-byoc-train
algo-1-j3nal_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-j3nal_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9479 sha256=2fff9bb14846b492f2fc6967f966c8e867173be6b152363a47f7c2880435e4c3
algo-1-j3nal_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-se9rgn4c/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-j3nal_1  | Successfully built pca-byoc-train
algo-1-j3nal_1  | Installing collected packages: pca-byoc-train
algo-1-j3nal_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-j3nal_1  |   import imp
algo-1-j3nal_1  | [2020-08-21 03:02:44 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-j3nal_1  | [2020-08-21 03:02:4

In [21]:
preprocessed_pca_validation_path_file = '{}/validation.csv.out.out'.format(preprocessed_pca_validation_path)
pca_val_preoc_df = pd.read_csv(preprocessed_pca_validation_path_file, header=None)
pca_val_preoc_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.647741 | 1.321463 | -0.097282 | 0.254665 | 1.186017 | -1.35672 | -0.141434 | -1.377084 | 0.177782 | ... | 0.241058 | -0.055968 | 0.008347 | -0.00549 | 0.065375 | -0.095202 | -0.003368 | -0.233436 | -0.478926 | -0.02145 |
| 1 | 0.0 | -0.568643 | 0.209874 | 0.929371 | -0.420525 | -1.250525 | -1.188874 | -2.054817 | 0.993901 | -1.311588 | ... | -0.105239 | -0.117971 | 0.028713 | -0.064478 | 0.00962 | 0.009191 | -0.127253 | -0.051297 | -0.019462 | 0.019785 |
| 2 | 0.0 | 1.856334 | -0.55911 | -1.96966 | 0.311881 | 0.133223 | -0.14006 | -0.622796 | 0.821204 | 1.138125 | ... | -0.101777 | -0.010605 | -0.012165 | 0.003192 | 0.133012 | 0.022872 | -0.047962 | 0.038177 | -0.00468 | -0.084035 |
| 3 | 0.0 | -0.681761 | -1.327774 | -0.846023 | -1.374146 | -0.960982 | 1.212969 | -0.803956 | 2.278834 | -0.007486 | ... | -0.081463 | -0.05016 | -0.101598 | 0.069258 | 0.089074 | -0.053233 | -0.202646 | 0.109185 | -0.027187 | -0.050105 |
| 4 | 0.0 | 2.290679 | 0.266063 | -0.893022 | 0.858179 | 0.113871 | 0.458166 | -0.133466 | -0.705785 | -0.742677 | ... | -0.106594 | 0.018597 | 0.007495 | 0.001531 | 0.034039 | -0.009584 | -0.016881 | -0.000947 | -0.000392 | -0.00356 |


## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>

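As shown in the figure below, the three models created above (preprocessing, algorithm training, and postprocessing) are combined into a single model to build the Inference Pipeline. <br>
**Raw data can be fed in without any manual preparation, and this single model returns the final prediction as True or False.**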

![Inference-pipeline](img/Fig2.2.inference_pipeline.png)


**A Machine Learning Model Pipeline (Inference Pipeline) can be built by calling create_model().** <br>
For example, here it is built from three models: the fitted Scikit-learn inference model, the fitted XGBoost model, and the postprocessing model.

The three models are created below. Environment variables are supplied when the preprocessing and postprocessing models are created.
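
Conceptually, an inference pipeline is function composition: each stage receives the previous stage's output. The toy sketch below mimics that flow with plain Python functions. The three stage functions, the input values, and the 0.5 threshold are all made-up stand-ins, not the models built in this workshop.

```python
from functools import reduce


def preprocess(record):
    """Stand-in for the feature transformer: parse CSV text into floats."""
    return [float(value) for value in record.split(",")]


def predict(features):
    """Stand-in for the trained model: a clipped mean used as a score."""
    score = sum(features) / len(features)
    return min(max(score, 0.0), 1.0)


def postprocess(score, threshold=0.5):
    """Stand-in for postprocessing: turn the score into a True/False label."""
    return "True" if score >= threshold else "False"


def run_pipeline(stages, payload):
    """Feed the payload through each stage in order."""
    return reduce(lambda data, stage: stage(data), stages, payload)


stages = [preprocess, predict, postprocess]
print(run_pipeline(stages, "0.9,0.8,0.4"))
```

Keeping the stages as an ordered list mirrors how the pipeline's containers are chained: the caller sends raw input once and only sees the final label.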

In [22]:
pca_estimator.model_data
pca_estimator.image_name

'057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/pca:latest'

In [23]:
pca_model_data = pca_estimator.model_data
pca_image_name = pca_estimator.image_name
print("pca_model_data: \n", pca_model_data)
print("pca_image_name: \n", pca_image_name)

%store preprocessed_pca_train_path
%store preprocessed_pca_validation_path
%store pca_model_data
%store pca_image_name

pca_model_data: 
 s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom-2020-08-21-03-02-27-274/model.tar.gz
pca_image_name: 
 057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/pca:latest
Stored 'preprocessed_pca_train_path' (str)
Stored 'preprocessed_pca_validation_path' (str)
Stored 'pca_model_data' (str)
Stored 'pca_image_name' (str)


In [24]:
# ! aws s3 ls {preprocessed_pca_train_path} --recursive