# [Module 3.2] Train PCA Model

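- This notebook runs the steps below and logs the inference results so that you can see how the Inference Pipeline Model works.
    - Create the Feature Transformer (the preprocessing model)
    - Generate preprocessed data by passing the train data through the Feature Transformer
    - Generate preprocessed data by passing the validation data through the Feature Transformer
    - Train XGBoost
    - Create the Inference Pipeline Model (preprocessing, XGBoost, and postprocessing models)
    - Create a real-time endpoint
    - Run inference on a single test record
- The notebook takes about 10 minutes to run.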

In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

Unable to restore variable 'scikit_learn_pre_process_model', ignoring (use %store -d to forget!)
The error was: <class 'KeyError'>


## PCA Training

In [3]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'pca'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

057716757052
us-east-2
arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20191128T110038
sagemaker-us-east-2-057716757052
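
The account ID printed above comes from `role.split(':')[4]`, which relies on the fixed colon-separated layout of an ARN (`arn:partition:service:region:account-id:resource`). A minimal sketch of that parsing, using a hypothetical helper named `parse_arn` and the role ARN from the output above:

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into the five fields that follow the literal 'arn' prefix."""
    # maxsplit=5 keeps any colons inside the resource part intact.
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError("not an ARN: {!r}".format(arn))
    return Arn(partition, service, region, account_id, resource)


role_arn = (
    "arn:aws:iam::057716757052:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20191128T110038"
)
parsed = parse_arn(role_arn)
print(parsed.account_id)
print(repr(parsed.region))
```

IAM is a global service, so the region field of a role ARN is empty. That is why the notebook reads the region from `boto3.Session()` rather than from the role.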


In [4]:
! cp pca_byoc_train.py docker/code/

In [5]:
%%writefile docker/Dockerfile

FROM 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
    
# install python package
RUN pip install joblib


ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/ml/code:${PATH}"

# Copy training code
COPY code/* /opt/ml/code/
 
WORKDIR /opt/ml/code

# ENTRYPOINT ["python", "pca_train.py"]
# To use SageMaker environment variables, set the entry script with the statement below
ENV SAGEMAKER_PROGRAM pca_byoc_train.py

Overwriting docker/Dockerfile


In [6]:
import os
os.environ['account_id'] = account_id
os.environ['region'] = region
os.environ['ecr_repository_name'] = ecr_repository_name
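
Values written to `os.environ` are inherited by child processes, which is how the `%%sh` cell that follows can read `${account_id}`, `${region}`, and `${ecr_repository_name}`. A small sketch of the same mechanism outside Jupyter, assuming a POSIX `sh` is available; the variable name `demo_region` is made up for the illustration:

```python
import os
import subprocess

# Set a variable in the current (parent) Python process.
os.environ["demo_region"] = "us-east-2"

# A child shell started afterwards inherits the parent's environment.
result = subprocess.run(
    ["sh", "-c", "echo ${demo_region}"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
shell_output = result.stdout.strip()
print(shell_output)
```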

In [7]:
%%sh

ACCOUNT_ID=${account_id}
REGION=${region}
REPO_NAME=${ecr_repository_name}

echo $REGION
echo $ACCOUNT_ID
echo $REPO_NAME


# Log in to the ECR registry (257758044811) that hosts the SageMaker scikit-learn base image so it can be pulled
$(aws ecr get-login --registry-ids 257758044811 --region ${region} --no-include-email)



docker build -f docker/Dockerfile -t $REPO_NAME docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



us-east-2
057716757052
sagemaker-training-containers/pca
Login Succeeded
Sending build context to Docker daemon  11.78kB
Step 1/8 : FROM 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
 ---> 30adb1aa9af5
Step 2/8 : RUN pip install joblib
 ---> Using cache
 ---> 0786847c4f79
Step 3/8 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 7d94abd2b857
Step 4/8 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Using cache
 ---> 8696b5e742b3
Step 5/8 : ENV PATH="/opt/ml/code:${PATH}"
 ---> Using cache
 ---> daba2554dce8
Step 6/8 : COPY code/* /opt/ml/code/
 ---> Using cache
 ---> 9685910a18a5
Step 7/8 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> ae3f15597ed8
Step 8/8 : ENV SAGEMAKER_PROGRAM pca_byoc_train.py
 ---> Using cache
 ---> 2838d3d55148
Successfully built 2838d3d55148
Successfully tagged sagemaker-training-containers/pca:latest
Login Succeeded
{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-2:057716757052:repository/sagem

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



In [8]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

057716757052.dkr.ecr.us-east-2.amazonaws.com/sagemaker-training-containers/pca:latest
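
The image URI follows the private ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a sketch, the string formatting from the cell above can be wrapped in a small helper. The name `ecr_image_uri` is hypothetical, and the `amazonaws.com` suffix assumes a standard commercial region such as the `us-east-2` used here.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Format the URI of an image stored in a private ECR repository."""
    return "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(
        account_id, region, repository, tag
    )


uri = ecr_image_uri(
    account_id="057716757052",
    region="us-east-2",
    repository="sagemaker-training-containers/pca",
)
print(uri)
```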


In [9]:
preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
print("preprocessed_train_path_file: \n", preprocessed_train_path_file)
print("preprocessed_validation_path_file: \n", preprocessed_validation_path_file)

preprocessed_train_path_file: 
 s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-15-07-23-2020-08-15-07-23-07-341/train.csv.out
preprocessed_validation_path_file: 
 s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-15-07-23-2020-08-15-07-23-14-753/validation.csv.out


## PCA Transformation

In [10]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


In [11]:
import pandas as pd
# preprocessed_train_path_file = 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-08-12-07-07-2020-08-12-07-07-08-229/train.csv.out'

churn_df = pd.read_csv(preprocessed_train_path_file, header=None)
churn_df.head()
train_y = churn_df.iloc[:,0]
train_X = churn_df.iloc[:,1:]

print("Shape of train_X: ", train_X.shape)
print("Shape of train_y: ", train_y.shape)

os.makedirs('./data', exist_ok =True)
np.savetxt('./data/churn-preprocessed.csv', train_X, delimiter=',',
           fmt='%1.5f'
          )

WORK_DIRECTORY = 'data'
prefix = 'Scikit-pca-custom'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY,
                                            key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
                                           )
print("train_input: ", train_input)


Shape of train_X:  (2333, 69)
Shape of train_y:  (2333,)
train_input:  s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/data
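
The cell above drops the label column before writing the PCA training file, and `fmt='%1.5f'` rounds every value to five decimal places. A toy sketch of both steps with made-up numbers, written to an in-memory buffer instead of `./data/churn-preprocessed.csv`:

```python
import io

import numpy as np

# Toy stand-in for the preprocessed file: column 0 is the label, the rest are features.
rows = np.array([
    [0.0, 1.23456789, -0.5],
    [1.0, 2.0, 3.14159265],
])
labels = rows[:, 0]     # same idea as churn_df.iloc[:, 0]
features = rows[:, 1:]  # same idea as churn_df.iloc[:, 1:]

# Write only the features, five decimals each, comma separated.
buffer = io.StringIO()
np.savetxt(buffer, features, delimiter=",", fmt="%1.5f")
csv_text = buffer.getvalue()
print(csv_text)
```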


In [12]:
%%time

import sagemaker

instance_type = 'local'
# instance_type = 'ml.m4.xlarge'

pca_estimator = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type= instance_type,
                                    base_job_name=prefix)

pca_estimator.set_hyperparameters(n_components= 10)

train_config = sagemaker.session.s3_input(train_input, content_type='text/csv')

pca_estimator.fit({'train': train_config})

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


Creating tmp7uvkmrfh_algo-1-xd1tc_1 ... done
Attaching to tmp7uvkmrfh_algo-1-xd1tc_1
algo-1-xd1tc_1  | 2020-08-15 07:44:41,960 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-xd1tc_1  | 2020-08-15 07:44:41,963 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-xd1tc_1  | 2020-08-15 07:44:41,971 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-xd1tc_1  | 2020-08-15 07:44:41,972 sagemaker-containers INFO     Module pca_byoc_train does not provide a setup.py. 
algo-1-xd1tc_1  | Generating setup.py
algo-1-xd1tc_1  | 2020-08-15 07:44:41,972 sagemaker-containers INFO     Generating setup.cfg
algo-1-xd1tc_1  | 2020-08-15 07:44:41,972 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-xd1tc_1  | 2020-08-15 07:44:41,972 sagemaker-containers INFO     Installing module with the following command:

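The training job sets `n_components=10`, so the 69 feature columns are reduced to 10 principal components. The training script `pca_byoc_train.py` is not shown in this notebook, so the block below is not its implementation. It is a plain NumPy sketch of the centering-plus-SVD idea behind PCA, run on synthetic data with the same shape as `train_X`. The names `fit_pca` and `transform_pca` are hypothetical.

```python
import numpy as np


def fit_pca(X: np.ndarray, n_components: int):
    """Return (mean, components); components has shape (n_components, n_features)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform_pca(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project centered data onto the retained principal axes."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
X = rng.normal(size=(2333, 69))  # synthetic data, same shape as train_X
mean, components = fit_pca(X, n_components=10)
Z = transform_pca(X, mean, components)
print(Z.shape)
```
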
## Transforming Train Data with PCA

In [13]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
pre_df = pd.read_csv(preprocessed_train_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70
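
The column count computed here is handed to the PCA model container as the `LENGTH_COLS` environment variable in the next cell, through `create_model(env=...)`. The serving code is not shown in this notebook, so the following is only a guess at how such a script could use the value: if an incoming row has as many columns as the labeled file, the first column is treated as the label and set aside. The function name `split_label` is hypothetical.

```python
import os
from typing import List, Optional, Tuple


def split_label(row: List[float]) -> Tuple[Optional[float], List[float]]:
    """Separate an optional leading label from the feature values."""
    # LENGTH_COLS is assumed to be the column count of the labeled file (label + features).
    length_cols = int(os.environ.get("LENGTH_COLS", "0"))
    if length_cols and len(row) == length_cols:
        return row[0], row[1:]
    return None, row  # unlabeled request, for example at real-time inference


os.environ["LENGTH_COLS"] = "4"  # toy value; the notebook passes str(num_cols), i.e. "70"
labeled = split_label([1.0, 0.1, 0.2, 0.3])
unlabeled = split_label([0.1, 0.2, 0.3])
print(labeled)
print(unlabeled)
```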


In [14]:
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-train-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the train Transformer from pca_model
transformer_train = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Run the PCA transform on the preprocessed training data
transformer_train.transform(preprocessed_train_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()

preprocessed_pca_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name


Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmpvyce479d_algo-1-kuada_1
algo-1-kuada_1  | Processing /opt/ml/code
algo-1-kuada_1  | Building wheels for collected packages: pca-byoc-train
algo-1-kuada_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-kuada_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9474 sha256=e0cf974bccf3504cbe887371c84eaa28013182e8748d56f4deb56245af8a7893
algo-1-kuada_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-lyppfr03/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-kuada_1  | Successfully built pca-byoc-train
algo-1-kuada_1  | Installing collected packages: pca-byoc-train
algo-1-kuada_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-kuada_1  |   import imp
algo-1-kuada_1  | [2020-08-15 07:44:47 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-kuada_1  | [2020-08-15 07:44:4

In [15]:
print(preprocessed_pca_train_path)

s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-15-07-44-44-619-2020-08-15-07-44-44-619
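
Batch transform names each result after its input object plus an `.out` suffix, which is why the input `train.csv.out` comes back as `train.csv.out.out`. The sketch below rebuilds the full file URI the same way this notebook does (output path, then job name, then file name). The helper name `transform_output_file` is hypothetical, and the job-name subfolder simply mirrors the path printed above rather than a general rule.

```python
import posixpath


def transform_output_file(output_path: str, job_name: str, input_uri: str) -> str:
    """Build the URI of a transformed file from the job settings and the input URI."""
    input_name = posixpath.basename(input_uri)
    # The result object is named after the input object plus ".out".
    return "{}{}/{}.out".format(output_path, job_name, input_name)


result_uri = transform_output_file(
    output_path="s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/",
    job_name="pca-2020-08-15-07-44-44-619-2020-08-15-07-44-44-619",
    input_uri=(
        "s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/"
        "transformtrain-train-output/"
        "sagemaker-scikit-learn-2020-08-15-07-23-2020-08-15-07-23-07-341/train.csv.out"
    ),
)
print(result_uri)
```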


In [16]:
! aws s3 ls s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375 --recursive

2020-08-13 01:27:28     707835 Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375/train.csv.out.out


In [17]:
preprocessed_pca_train_path_file = '{}/train.csv.out.out'.format(preprocessed_pca_train_path)
pca_preoc_df = pd.read_csv(preprocessed_pca_train_path_file, header=None)
pca_preoc_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -0.823085 | -0.108811 | 0.452843 | -0.030507 | -1.031997 | -2.956747 | -0.0513 | -0.706446 | 1.064536 | -0.43729 |
| 1 | 0.0 | -0.343474 | 0.091423 | 1.949026 | 1.268235 | 0.009081 | 0.421556 | -0.988975 | 0.868835 | -0.542355 | -0.341819 |
| 2 | 1.0 | -0.764309 | 0.011604 | 0.822738 | -1.429299 | -1.627301 | -0.74156 | 0.550952 | 0.428517 | -0.326231 | -0.289266 |
| 3 | 0.0 | -0.825983 | -0.722031 | -0.339858 | -0.980679 | -0.260783 | 0.35779 | -0.669738 | -1.122771 | -1.451326 | -0.006193 |
| 4 | 0.0 | 1.830756 | 0.701878 | 0.194737 | -1.351443 | -0.729668 | 0.845136 | 0.150495 | 0.084293 | 0.180299 | 0.751587 |
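
Each transformed row has 11 values. Reading column 0 as the churn label and columns 1 to 10 as the principal components is an interpretation based on `n_components=10` and the label-first layout of the input file, not something the notebook states. Under that assumption, a row can be split as follows, using two rows copied from the output above:

```python
import csv
import io

# Two rows copied from the transformed training output (index column omitted).
sample = (
    "0.0,-0.823085,-0.108811,0.452843,-0.030507,-1.031997,"
    "-2.956747,-0.0513,-0.706446,1.064536,-0.43729\n"
    "1.0,-0.764309,0.011604,0.822738,-1.429299,-1.627301,"
    "-0.74156,0.550952,0.428517,-0.326231,-0.289266\n"
)

records = []
for fields in csv.reader(io.StringIO(sample)):
    values = [float(v) for v in fields]
    # Assumed layout: label first, then the 10 PCA components.
    records.append({"label": values[0], "components": values[1:]})

for record in records:
    print(record["label"], len(record["components"]))
```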


## Transforming Validation Data with PCA

In [18]:
preprocessed_validation_path

's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-08-15-07-23-2020-08-15-07-23-14-753'

In [19]:
import pandas as pd

preprocessed_validation_path_file = '{}/validation.csv.out'.format(preprocessed_validation_path)
pre_df = pd.read_csv(preprocessed_validation_path_file, header=None)
print(pre_df.shape)
num_cols = pre_df.shape[1]
print("num_cols: ", num_cols)

(666, 70)
num_cols:  70


In [20]:


instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-validation-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the validation Transformer from pca_model
transformer_validation = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')


# Run the PCA transform on the preprocessed validation data
transformer_validation.transform(preprocessed_validation_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()

preprocessed_pca_validation_path = transformer_validation.output_path + transformer_validation.latest_transform_job.job_name
print(preprocessed_pca_validation_path)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmpg6rpbia8_algo-1-xzlvl_1
algo-1-xzlvl_1  | Processing /opt/ml/code
algo-1-xzlvl_1  | Building wheels for collected packages: pca-byoc-train
algo-1-xzlvl_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-xzlvl_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9475 sha256=673f9d50fb7a73f7d53f01a90ef8e31c4ac9f2163a2b19433d5a4eee62f81c92
algo-1-xzlvl_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-6tz6akhh/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-xzlvl_1  | Successfully built pca-byoc-train
algo-1-xzlvl_1  | Installing collected packages: pca-byoc-train
algo-1-xzlvl_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-xzlvl_1  |   import imp
algo-1-xzlvl_1  | [2020-08-15 07:44:55 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-xzlvl_1  | [2020-08-15 07:44:5

In [21]:
preprocessed_pca_validation_path_file = '{}/validation.csv.out.out'.format(preprocessed_pca_validation_path)
pca_val_preoc_df = pd.read_csv(preprocessed_pca_validation_path_file, header=None)
pca_val_preoc_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.647523 | 1.321312 | -0.096829 | 0.255327 | 1.185455 | -1.356901 | -0.144986 | -1.376 | 0.192108 | -0.407087 |
| 1 | 0.0 | -0.568756 | 0.209782 | 0.928801 | -0.421777 | -1.249794 | -1.183297 | -2.058817 | 0.980869 | -1.320562 | -1.227984 |
| 2 | 0.0 | 1.856171 | -0.558975 | -1.969703 | 0.315529 | 0.132519 | -0.136767 | -0.62357 | 0.832536 | 1.131302 | 1.985652 |
| 3 | 0.0 | -0.681862 | -1.326923 | -0.849271 | -1.372694 | -0.960617 | 1.216868 | -0.801562 | 2.278247 | -0.030436 | 0.099985 |
| 4 | 0.0 | 2.290559 | 0.265566 | -0.89117 | 0.860266 | 0.113337 | 0.457285 | -0.131537 | -0.713534 | -0.735986 | 0.54587 |


## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>

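As the figure below shows, the three models created above (preprocessing, algorithm training, and postprocessing) are combined into a single model to build the Inference Pipeline. <br>
**Raw data can be sent in without any manual processing, and the single model returns the final prediction as True or False.**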

![Inference-pipeline](img/Fig2.2.inference_pipeline.png)


**A Machine Learning Model Pipeline (Inference Pipeline) can be created by calling create_model().** <br>
For example, here it is built from three models: the fitted Scikit-learn inference model, the fitted XGBoost model, and the postprocessing model.

The three models are created below. Environment variables are supplied when creating the preprocessing and postprocessing models.
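
In an inference pipeline, each stage receives the previous stage's output as its input. The block below is a conceptual sketch of that chaining in plain Python, not SageMaker API code. The three stage functions and `make_pipeline` are hypothetical stand-ins for the preprocessing, XGBoost, and postprocessing containers, and the scoring rule is a toy.

```python
from typing import Callable, List, Sequence

Stage = Callable[[object], object]


def make_pipeline(stages: Sequence[Stage]) -> Stage:
    """Chain stages so that each one consumes the previous stage's output."""
    def run(payload):
        for stage in stages:
            payload = stage(payload)
        return payload
    return run


def preprocess(raw: str) -> List[float]:
    # Stand-in for the preprocessing model: parse a CSV row into floats.
    return [float(v) for v in raw.split(",")]


def predict(features: List[float]) -> float:
    # Stand-in for the trained model: a toy score clipped to [0, 1].
    score = sum(features) / len(features)
    return min(max(score, 0.0), 1.0)


def postprocess(score: float) -> str:
    # Stand-in for the postprocessing model: map the score to True/False.
    return "True" if score >= 0.5 else "False"


pipeline = make_pipeline([preprocess, predict, postprocess])
high = pipeline("0.9,0.8,0.4")
low = pipeline("0.1,0.2,0.0")
print(high, low)
```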

In [22]:
pca_estimator.model_data
pca_estimator.image_name

'057716757052.dkr.ecr.us-east-2.amazonaws.com/sagemaker-training-containers/pca:latest'

In [23]:
pca_model_data = pca_estimator.model_data
pca_image_name = pca_estimator.image_name
print("pca_model_data: \n", pca_model_data)
print("pca_image_name: \n", pca_image_name)

%store preprocessed_pca_train_path
%store preprocessed_pca_validation_path
%store pca_model_data
%store pca_image_name

pca_model_data: 
 s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom-2020-08-15-07-44-39-957/model.tar.gz
pca_image_name: 
 057716757052.dkr.ecr.us-east-2.amazonaws.com/sagemaker-training-containers/pca:latest
Stored 'preprocessed_pca_train_path' (str)
Stored 'preprocessed_pca_validation_path' (str)
Stored 'pca_model_data' (str)
Stored 'pca_image_name' (str)


In [24]:
# ! aws s3 ls {preprocessed_pca_train_path} --recursive