# [Module 3.3] Training a Custom PCA Model and Reducing the Dimensionality of the Preprocessed Train and Validation Data

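In this notebook, we pull the custom PCA Docker image and train it to create a custom PCA model, then use that model to reduce the dimensionality of the Train and Validation data preprocessed in the previous notebook.<br>
We also store the path of the PCA model information (model artifact) so that it can be used later when building the inference pipeline.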

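Specifically, this notebook performs the following tasks.

- Prepare the data for training the custom PCA model
- Pull the custom PCA Docker image and train the custom PCA model
- Use the trained custom PCA model to reduce the dimensionality of the preprocessed Train input file
- Store the paths of the PCA model artifact and the dimension-reduced train and validation data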

---
This notebook takes about 10 minutes to run.

In [1]:
import sagemaker
import pandas as pd
import numpy as np
import os
import time
import json
from time import strftime, gmtime

In [2]:
%store -r

In [3]:
import boto3
import sagemaker
from sagemaker import get_execution_role

# Define custom docker image name
ecr_namespace = 'sagemaker-training-containers/'
prefix = 'pca'
ecr_repository_name = ecr_namespace + prefix
print("ecr_repository_name: ", ecr_repository_name)

role = get_execution_role()

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print("account_id: ", account_id)
print("region: ", region)
print("role: ", role)
print("bucket: ", bucket)

ecr_repository_name:  sagemaker-training-containers/pca
account_id:  057716757052
region:  ap-northeast-2
role:  arn:aws:iam::057716757052:role/service-role/AmazonSageMaker-ExecutionRole-20191128T110038
bucket:  sagemaker-ap-northeast-2-057716757052


## Preparing the Data for Training the Custom PCA Model

Load the file preprocessed in the previous notebook and check its shape.<br>
It has 2,333 rows and 70 columns.

In [4]:
import pandas as pd

churn_df = pd.read_csv(preprocessed_train_path_file, header=None)

print("preprocessed train shape", churn_df.shape)
num_cols = churn_df.shape[1]
print("# of feature columns: ", num_cols)
churn_df.head(2)

preprocessed train shape (2333, 70)
# of feature columns:  70


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,60,61,62,63,64,65,66,67,68,69
0,0.0,0.119414,-0.596238,1.744368,0.978957,-0.028993,-0.893185,-0.801703,-1.982529,-1.530559,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,-1.852652,-0.596238,0.140284,-0.310405,0.970689,-0.689888,0.146389,1.232901,0.124852,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


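Of the 70 columns, the very first one is the label (the Y value). The remaining 69 are the features (the X values).<br>
Here we separate X and Y and write X to a csv file.
The X values are used to train the PCA algorithm and produce the trained PCA model.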

In [5]:
train_y = churn_df.iloc[:,0]
train_X = churn_df.iloc[:,1:]

print("Shape of train_X: ", train_X.shape)
print("Shape of train_y: ", train_y.shape)

os.makedirs('./data', exist_ok =True)
np.savetxt('./data/churn-preprocessed.csv', train_X, delimiter=',',
           fmt='%1.5f'
          )

Shape of train_X:  (2333, 69)
Shape of train_y:  (2333,)


Upload the csv file of X to S3 and create the s3_input_train object.

In [6]:
WORK_DIRECTORY = 'data'
prefix = 'Scikit-pca-custom'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY,
                                            key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
                                           )
s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data = train_input,
    content_type= 'text/csv'
)

print("s3_input_train: ", s3_input_train.config)

s3_input_train:  {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/data', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}


In [7]:
!aws s3 ls s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/data --recursive
# !aws s3 rm s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/data --recursive    

2020-08-27 09:26:27    1301493 Scikit-pca-custom/data/churn-preprocessed.csv


## Pulling the Custom PCA Docker Image and Training the PCA Model

Create an Estimator and train it with the following arguments.
- The custom PCA Docker image created in a previous notebook
- instance_type set to local
- The hyperparameter n_components = 25
    - This reduces the 69 feature dimensions to 25 (69 --> 25).
- s3_input_train, which supplies the 69 features stored in S3 as the input data.
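
To make the effect of `n_components = 25` concrete, here is a minimal NumPy-only sketch of PCA projecting a 69-column matrix down to 25 columns. It is an illustration under stated assumptions, not the container's code: the real implementation lives in pca_byoc_train.py, which is not shown in this notebook, and the function name `pca_fit_transform` and the random stand-in data are hypothetical.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X onto its top principal components using SVD (illustrative only)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Centre the data; principal directions are defined on mean-centred features.
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Each row of the result is the sample expressed in the reduced coordinate system.
    return Xc @ components.T, components, mean

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 69))  # random stand-in for the 69 preprocessed features
Z_demo, components, mean = pca_fit_transform(X_demo, n_components=25)
print(X_demo.shape, "->", Z_demo.shape)
```

The number of rows is unchanged; only the feature dimension shrinks from 69 to 25.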

In [8]:
%%time

import sagemaker

instance_type = 'local'
# instance_type = 'ml.m4.xlarge'

pca_estimator = sagemaker.estimator.Estimator(custom_pca_docker_image_uri,
                                    role, 
                                    instance_count=1, 
                                    instance_type= instance_type,
                                    base_job_name=prefix)

pca_estimator.set_hyperparameters(n_components= 25)

pca_estimator.fit({'train': s3_input_train})

Creating tmpfpa7pnz7_algo-1-q0lvb_1 ... done
Attaching to tmpfpa7pnz7_algo-1-q0lvb_1
algo-1-q0lvb_1  | 2020-08-27 09:26:28,703 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-q0lvb_1  | 2020-08-27 09:26:28,705 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-q0lvb_1  | 2020-08-27 09:26:28,713 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-q0lvb_1  | 2020-08-27 09:26:28,713 sagemaker-containers INFO     Module pca_byoc_train does not provide a setup.py.
algo-1-q0lvb_1  | Generating setup.py
algo-1-q0lvb_1  | 2020-08-27 09:26:28,714 sagemaker-containers INFO     Generating setup.cfg
algo-1-q0lvb_1  | 2020-08-27 09:26:28,714 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-q0lvb_1  | 2020-08-27 09:26:28,714 sagemaker-containers INFO     Installing module with the following command:

## Reducing the Dimensionality of the Preprocessed Train Input File with the Trained Custom PCA Model

Get the number of columns in the preprocessed Train input file.

In [9]:
import pandas as pd

preprocessed_train_path_file = '{}/train.csv.out'.format(preprocessed_train_path)
churn_train_df = pd.read_csv(preprocessed_train_path_file, header=None)
num_cols = churn_train_df.shape[1]
print(churn_train_df.shape)
print("num_cols: ", num_cols)

(2333, 70)
num_cols:  70


### Reducing the Dimensionality of the Train Input File


The cell below does the following.
- (1) Calls the create_model() function of pca_estimator to create a SageMaker Model (assigned to the name pca_model).
    - The environment variables 'TRANSFORM_MODE': 'feature-transform' and 'LENGTH_COLS': str(num_cols) are passed in.
        - The custom Docker image was built with pca_byoc_train.py inside it. That script contains the logic that reads these environment variables and acts on them. See pca_byoc_train.py for details.

        - num_cols = 70 is passed in as an environment variable. Inside the predict_fn function of pca_byoc_train.py, the first (label) column is excluded and the remaining 69 columns are used as the input to the PCA algorithm.

- (2) Calls the transformer() function on the SageMaker Model pca_model to create a transformer object (transformer_train).

- (3) Runs the transform() function of transformer_train.
    - The input file (preprocessed_train_path_file) is passed in, the 69 features are reduced to 25, and the result is saved to transform_train_output_path.
    - This job calls the input_fn --> predict_fn --> output_fn functions in pca_byoc_train.py in that order. See the execution log below for details.
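
The label handling described above can be sketched as follows. This is a guess at the general shape of such a `predict_fn`, not the contents of pca_byoc_train.py: the names `predict_fn_sketch` and `StubPCA` are hypothetical, the stub simply keeps the first 25 features instead of running real PCA, and the rule "split off the first column when the width equals LENGTH_COLS" is an assumption made for illustration.

```python
import os
import numpy as np

class StubPCA:
    """Hypothetical stand-in for the trained PCA model; keeps the first k features."""
    def __init__(self, n_components):
        self.n_components = n_components

    def transform(self, X):
        return X[:, :self.n_components]

def predict_fn_sketch(input_data, model):
    """Assumed logic: drop the label column, reduce the features, re-attach the label."""
    data = np.asarray(input_data, dtype=float)
    if os.environ.get("TRANSFORM_MODE") == "feature-transform":
        length_cols = int(os.environ["LENGTH_COLS"])
        if data.shape[1] == length_cols:
            labels = data[:, :1]                      # first column is the label
            features = model.transform(data[:, 1:])   # remaining columns go through PCA
            return np.hstack([labels, features])
    # Assumed fallback: the input contains features only.
    return model.transform(data)

os.environ["TRANSFORM_MODE"] = "feature-transform"
os.environ["LENGTH_COLS"] = "70"
batch = np.arange(3 * 70, dtype=float).reshape(3, 70)
reduced = predict_fn_sketch(batch, StubPCA(n_components=25))
print(reduced.shape)
```

With 70 input columns and 25 components, each output row has 26 values: the label followed by the 25 reduced features, which matches the layout of the transform output shown later in this notebook.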

In [10]:
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-train-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the Train transformer from pca_model
transformer_train = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Reduce the dimensionality of the preprocessed training input
transformer_train.transform(preprocessed_train_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()

preprocessed_pca_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name


Attaching to tmppr7_s275_algo-1-ywhi4_1
algo-1-ywhi4_1  | Processing /opt/ml/code
algo-1-ywhi4_1  | Building wheels for collected packages: pca-byoc-train
algo-1-ywhi4_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-ywhi4_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9484 sha256=67c51aa831c03a3a20f991ca236d38e6db88b4f93d0ba2b69920c0d7295447cb
algo-1-ywhi4_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-p8rgj4t9/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-ywhi4_1  | Successfully built pca-byoc-train
algo-1-ywhi4_1  | Installing collected packages: pca-byoc-train
algo-1-ywhi4_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-ywhi4_1  |   import imp
algo-1-ywhi4_1  | [2020-08-27 09:26:34 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-ywhi4_1  | [2020-08-27 09:26:3

In [11]:
print(preprocessed_pca_train_path)

s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-27-09-26-31-312


In [12]:
! aws s3 ls s3://sagemaker-us-east-2-057716757052/Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375 --recursive

2020-08-13 01:27:28     707835 Scikit-pca-custom/transformtrain-pca-train-output/pca-2020-08-13-01-27-21-375-2020-08-13-01-27-21-375/train.csv.out.out
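
The object key above ends in `train.csv.out.out` because batch transform stores each result under the input file name with `.out` appended, and the input file here was already named `train.csv.out`. A small sketch of how the full output URI is composed; the helper name `transform_output_uri` and the bucket and job names in the example are hypothetical.

```python
import posixpath

def transform_output_uri(output_path, job_name, input_uri):
    """Compose the URI of a transform result the way this notebook does."""
    # output_path already ends with "/", so plain concatenation yields the job prefix.
    job_prefix = output_path + job_name
    # The result keeps the input file name and gains a ".out" suffix.
    input_name = posixpath.basename(input_uri)
    return "{}/{}.out".format(job_prefix, input_name)

uri = transform_output_uri(
    "s3://my-bucket/Scikit-pca-custom/transformtrain-pca-train-output/",
    "pca-job-name",
    "s3://my-bucket/preprocessed/train.csv.out",
)
print(uri)
```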


#### The features are reduced from 69 to 25. The first column is the label column, and the 25 columns after it are the dimension-reduced features.

In [13]:
preprocessed_pca_train_path_file = '{}/train.csv.out.out'.format(preprocessed_pca_train_path)
pca_preoc_df = pd.read_csv(preprocessed_pca_train_path_file, header=None)
pca_preoc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.0,-0.822971,-0.108986,0.452238,-0.028701,-1.031895,-2.957447,-0.042852,-0.718277,1.05564,...,0.208491,-0.11674,-0.000488,-0.19275,-0.136014,-0.112258,-0.201363,0.439746,-0.341187,-0.132092
1,0.0,-0.343563,0.092688,1.946674,1.271797,0.008117,0.419165,-0.99059,0.874162,-0.533861,...,0.093025,0.067919,-0.083135,0.672877,0.483212,0.030119,0.013241,0.234645,-0.151748,-0.082311
2,1.0,-0.764182,0.010289,0.825062,-1.427365,-1.627981,-0.740463,0.554086,0.429989,-0.32233,...,-0.136035,-0.207805,-0.075758,0.325582,-0.399338,-0.38358,0.559366,0.212113,0.059432,0.002064
3,0.0,-0.825846,-0.722819,-0.338572,-0.981696,-0.260432,0.356843,-0.670504,-1.10858,-1.461891,...,0.003084,0.121354,-0.062031,-0.000835,-0.273703,0.697754,0.110431,0.322106,0.063063,0.027535
4,0.0,1.830923,0.700579,0.197524,-1.351677,-0.7296,0.844843,0.148726,0.082408,0.181474,...,0.020706,-0.072349,0.027476,-0.0475,-0.006686,0.110502,0.066237,-0.161733,-0.037429,0.167787


## Reducing the Dimensionality of the Validation Input File

The cell below differs only in that the input is the Validation file. The dimensionality reduction itself is the same as for the preprocessed Train file.
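
Since the Train and Validation steps differ only in their input and output paths, the shared calls could be factored into one helper. The sketch below uses only the calls that already appear in this notebook (`create_model`, `transformer`, `transform`, `wait`); the function name `reduce_with_pca` is hypothetical, and it assumes `output_path` ends with `/`, as the paths in this notebook do.

```python
def reduce_with_pca(estimator, input_path, output_path, num_cols, instance_type="local"):
    """Run one transform job with the trained PCA model and return its output prefix."""
    # Create the SageMaker Model with the environment variables the container expects.
    model = estimator.create_model(
        env={"TRANSFORM_MODE": "feature-transform", "LENGTH_COLS": str(num_cols)}
    )
    # Create the transformer that writes csv output under output_path.
    transformer = model.transformer(
        instance_count=1,
        instance_type=instance_type,
        assemble_with="Line",
        output_path=output_path,
        accept="text/csv",
    )
    # Run the job on the input file and block until it finishes.
    transformer.transform(input_path, content_type="text/csv")
    transformer.wait()
    # output_path is assumed to end with "/", so concatenation gives the job prefix.
    return transformer.output_path + transformer.latest_transform_job.job_name
```

With such a helper, the validation step would reduce to a single call such as `reduce_with_pca(pca_estimator, preprocessed_validation_path_file, transform_validation_output_path, num_cols)`.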


In [14]:
instance_type = 'local'
# instance_type = 'ml.m4.2xlarge'
transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-pca-validation-output')

pca_model = pca_estimator.create_model(
    env={'TRANSFORM_MODE': 'feature-transform', 'LENGTH_COLS': str(num_cols)})

# Create the Validation transformer from pca_model
transformer_validation = pca_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')


# Reduce the dimensionality of the preprocessed validation input
transformer_validation.transform(preprocessed_validation_path_file, 
                            content_type='text/csv',                            
                           )

print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()

preprocessed_pca_validation_path = transformer_validation.output_path + transformer_validation.latest_transform_job.job_name
print(preprocessed_pca_validation_path)

Attaching to tmpvk23zsjb_algo-1-sfxzy_1
algo-1-sfxzy_1  | Processing /opt/ml/code
algo-1-sfxzy_1  | Building wheels for collected packages: pca-byoc-train
algo-1-sfxzy_1  |   Building wheel for pca-byoc-train (setup.py) ... done
algo-1-sfxzy_1  |   Created wheel for pca-byoc-train: filename=pca_byoc_train-1.0.0-py2.py3-none-any.whl size=9485 sha256=a49b65dc6727eead0427881c8639dca06d29aeff63f7aebde6e925edc0fa1db5
algo-1-sfxzy_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-ezwzckx4/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-sfxzy_1  | Successfully built pca-byoc-train
algo-1-sfxzy_1  | Installing collected packages: pca-byoc-train
algo-1-sfxzy_1  | Successfully installed pca-byoc-train-1.0.0
algo-1-sfxzy_1  |   import imp
algo-1-sfxzy_1  | [2020-08-27 09:26:43 +0000] [44] [INFO] Starting gunicorn 19.9.0
algo-1-sfxzy_1  | [2020-08-27 09:26:4

In [15]:
preprocessed_pca_validation_path_file = '{}/validation.csv.out.out'.format(preprocessed_pca_validation_path)
pca_val_preoc_df = pd.read_csv(preprocessed_pca_validation_path_file, header=None)
pca_val_preoc_df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.0,1.647741,1.321463,-0.097282,0.254665,1.186017,-1.35672,-0.141434,-1.377084,0.177782,...,-0.01193,0.039874,-0.124408,0.002135,0.23526,0.017927,0.201687,-0.103127,-0.051984,-0.178245
1,0.0,-0.568643,0.209874,0.929371,-0.420525,-1.250525,-1.188874,-2.054817,0.993901,-1.311588,...,-0.064601,0.002703,-0.029425,0.003413,0.027497,-0.032,0.000916,0.040338,0.06206,0.028352


## Storing the Paths of the Custom PCA Model Artifact and the Dimension-Reduced Train and Validation Data

This step stores the values in variables so that they can be used later when creating the SageMaker Model for the inference pipeline.

- Store the trained model artifact (model.tar.gz) and the Docker image that provides the environment to run it (057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/pca:latest).

The paths of the dimension-reduced train and validation data are stored as well.

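The cell below saves these values with `%store`. The three models are created from them later, and environment variables are supplied when the preprocessing and postprocessing models are created.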

In [16]:
pca_model_data = pca_estimator.model_data
pca_image_uri = pca_estimator.image_uri
print("pca_model_data: \n", pca_model_data)
print("pca_image_name: \n", pca_image_uri)

%store preprocessed_pca_train_path
%store preprocessed_pca_validation_path
%store pca_model_data
%store pca_image_uri

pca_model_data: 
 s3://sagemaker-ap-northeast-2-057716757052/Scikit-pca-custom-2020-08-27-09-26-26-801/model.tar.gz
pca_image_name: 
 057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/pca:latest
Stored 'preprocessed_pca_train_path' (str)
Stored 'preprocessed_pca_validation_path' (str)
Stored 'pca_model_data' (str)
Stored 'pca_image_uri' (str)
