# PCA Script Mode

How to implement PCA with Python and scikit-learn: Theory & Code
https://medium.com/ai-in-plain-english/how-to-implement-pca-with-python-and-scikit-learn-22f3de4e5983

Iris Training and Prediction with SageMaker Scikit-learn

- Scikit-learn script mode

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb

Amazon SageMaker Custom Training containers
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/custom-training-containers

Using Scikit-learn with the SageMaker Python SDK
https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#id2

Building your own algorithm container
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

Bring Your Own Model (XGBoost)
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/xgboost_bring_your_own_model

In [1]:
prefix = 'Scikit-pca'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

In [2]:
from sklearn import datasets
import os
import numpy as np

iris = datasets.load_iris()
train_X = iris.data
train_y = iris.target

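os.makedirs('./data', exist_ok=True)
# Write the four feature columns as comma-separated values with three decimals.
# A single format is applied to every column, so the delimiter argument takes effect.
np.savetxt('./data/iris.csv', train_X, delimiter=',', fmt='%1.3f')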
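
The training script later reads every file in the channel with `pd.read_csv(..., header=None)`, so it can help to confirm locally that the saved CSV parses back into a 150 x 4 numeric frame before uploading it. The following is a minimal sketch of that check; it writes to a temporary directory (an assumption made here so the example does not touch `./data`) and uses a single `%1.3f` format.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "iris.csv")
    # Save the feature matrix without a header row, three decimals per value.
    np.savetxt(csv_path, iris.data, delimiter=",", fmt="%1.3f")
    # Read it back the same way the training script does (no header row).
    round_trip = pd.read_csv(csv_path, header=None)

print("shape after round trip: ", round_trip.shape)
print("values preserved: ", np.allclose(round_trip.to_numpy(), iris.data))
```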


In [3]:
WORK_DIRECTORY = 'data'
train_input = sagemaker_session.upload_data(WORK_DIRECTORY,
                                            key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
                                           )

In [31]:
%%writefile pca_train.py

from __future__ import print_function

import argparse
import joblib
import os
import pandas as pd
import logging

from sklearn.decomposition import PCA

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    
    parser.add_argument('--n_components', type=int, default = 3)
    
    args = parser.parse_args()
    
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files]
    train_data = pd.concat(raw_data)
    
    pca = PCA(n_components = args.n_components)
    print("train shape: ", train_data.shape)
    X_new = pca.fit_transform(train_data)
    
    print("Component Variability: \n", pca.explained_variance_ratio_)
    
    joblib.dump(pca, os.path.join(args.model_dir, "model.joblib"))
    

def model_fn(model_dir):
    """
    Deserialize and return the fitted model.
    Note that the file name must match the one used when the model was serialized in the main block.
    """
    pca = joblib.load(os.path.join(model_dir, "model.joblib"))
    
    return pca  

def predict_fn(input_data, model):
    """Preprocess input data
    
    We implement this because the default predict_fn uses .predict(), but our model is a preprocessor
    so we want to use .transform().
    """
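    logging.info("predict_fn: ")
    # The model is a fitted PCA object, which exposes transform() rather than predict().
    components = model.transform(input_data)

    # f-string so the array is interpolated; a plain string would log the literal '{components}'.
    logging.info(f"predict_fn: PCA components: \n'{components}'")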
    return components
    
# If predict_fn is not defined, the container calls its default predict_fn.
# PCA does not provide a predict method, so a user-defined predict_fn is required.

# algo-1-dhteh_1  | 2020-08-10 14:15:55,970 ERROR - pca_train - Exception on /invocations [POST]
# algo-1-dhteh_1  | Traceback (most recent call last):
# algo-1-dhteh_1  |   File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
# algo-1-dhteh_1  |     return fn(*args, **kwargs)
# algo-1-dhteh_1  |   File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/serving.py", line 70, in default_predict_fn
# algo-1-dhteh_1  |     output = model.predict(input_data)
# algo-1-dhteh_1  | AttributeError: 'PCA' object has no attribute 'predict'

Overwriting pca_train.py
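
The traceback quoted at the end of the script comes from the default `predict_fn` calling `model.predict()`. A minimal local sketch, using synthetic data rather than the Iris set, shows why that fails for PCA and why the custom `predict_fn` calls `transform()` instead:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 20 samples with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

pca = PCA(n_components=2).fit(X)

# PCA is a transformer, so it has transform() but no predict().
has_predict = hasattr(pca, "predict")
has_transform = hasattr(pca, "transform")

# Projecting one sample yields one row with n_components columns.
projected = pca.transform(X[:1])

print("has predict: ", has_predict)
print("has transform: ", has_transform)
print("projected shape: ", projected.shape)
```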


In [32]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = 'pca_train.py'

instance_type = 'local'

sklearn = SKLearn(
    entry_point = script_path,
    framework_version = FRAMEWORK_VERSION,
    train_instance_type = instance_type,
    role = role,
#     sagemaker_session = sagemaker_session, # Exclude in local mode
    hyperparameters = {'n_components' : 2}
)

In [33]:
sklearn.fit({'train' : train_input}, wait=True)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


Creating tmpfq4c_n88_algo-1-n23ci_1 ... done
Attaching to tmpfq4c_n88_algo-1-n23ci_1
algo-1-n23ci_1  | 2020-08-10 14:25:24,579 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training
algo-1-n23ci_1  | 2020-08-10 14:25:24,581 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n23ci_1  | 2020-08-10 14:25:24,589 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-n23ci_1  | 2020-08-10 14:25:24,783 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n23ci_1  | 2020-08-10 14:25:24,792 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n23ci_1  | 2020-08-10 14:25:24,801 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n23ci_1  | 2020-08-10 14:25:24,811 sagemaker-training-toolkit INFO     Invoking us

In [34]:
print("model data: ", sklearn.model_data)

model data:  s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-08-10-14-25-22-772/model.tar.gz
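
The `model.tar.gz` artifact is expected to contain the `model.joblib` file written by the training script; at serving time the archive is unpacked into the model directory and `model_fn` loads it from there. The sketch below imitates that round trip locally. It assumes `model.joblib` sits at the root of the archive and uses synthetic data and temporary paths, none of which come from the actual training job.

```python
import os
import tarfile
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA


def model_fn(model_dir):
    """Load the fitted model, mirroring model_fn in pca_train.py."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
fitted = PCA(n_components=2).fit(X)

with tempfile.TemporaryDirectory() as work_dir:
    # Serialize the model and pack it the way a model artifact is laid out.
    joblib_path = os.path.join(work_dir, "model.joblib")
    joblib.dump(fitted, joblib_path)
    archive_path = os.path.join(work_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(joblib_path, arcname="model.joblib")

    # Unpack into a separate directory and load through model_fn.
    model_dir = os.path.join(work_dir, "extracted")
    os.makedirs(model_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)
    restored = model_fn(model_dir)

# The restored model should project the data exactly like the original.
same_output = np.allclose(restored.transform(X), fitted.transform(X))
print("restored model matches: ", same_output)
```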


In [36]:
instance_type = 'local'

predictor = sklearn.deploy(
    initial_instance_count = 1,
    instance_type = instance_type
)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


Attaching to tmp0kd81dtv_algo-1-esb4s_1
algo-1-esb4s_1  | 2020-08-10 14:26:19,731 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
algo-1-esb4s_1  | 2020-08-10 14:26:19,733 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
algo-1-esb4s_1  | 2020-08-10 14:26:19,734 INFO - sagemaker-containers - nginx config: 
algo-1-esb4s_1  | worker_processes auto;
algo-1-esb4s_1  | daemon off;
algo-1-esb4s_1  | pid /tmp/nginx.pid;
algo-1-esb4s_1  | error_log  /dev/stderr;
algo-1-esb4s_1  | 
algo-1-esb4s_1  | worker_rlimit_nofile 4096;
algo-1-esb4s_1  | 
algo-1-esb4s_1  | events {
algo-1-esb4s_1  |   worker_connections 2048;
algo-1-esb4s_1  | }
algo-1-esb4s_1  | 
algo-1-esb4s_1  | http {
algo-1-esb4s_1  |   include /etc/nginx/mime.types;
algo-1-esb4s_1  |   default_type application/octet-stream;

In [37]:
sample = train_X[0].reshape(1,-1) # Single Sample (1,-1)
print("Shape of sample: ", sample.shape)

Shape of sample:  (1, 4)


In [40]:
pca_components = predictor.predict(sample)

algo-1-esb4s_1  | 2020-08-10 14:27:11,004 INFO - root - predict_fn: 
algo-1-esb4s_1  | 2020-08-10 14:27:11,004 INFO - root - predict_fn: PCA components: 
algo-1-esb4s_1  | '{components}'
algo-1-esb4s_1  | 172.18.0.1 - - [10/Aug/2020:14:27:11 +0000] "POST /invocations HTTP/1.1" 200 144 "-" "-"


In [41]:
print("pca_components: ", pca_components)

pca_components:  [[-2.68412563  0.31939725]]
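
As a sanity check, the endpoint's answer can be compared with a PCA fitted directly in the notebook on the same Iris features. This is only a sketch: it fits on the in-memory `iris.data` rather than the uploaded CSV, and the sign of each component may differ between scikit-learn versions, so comparing absolute values is the safer test.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Same setting as the hyperparameter passed to the estimator.
pca = PCA(n_components=2)
pca.fit(iris.data)

# Project the first sample, shaped (1, 4) like the request sent to the endpoint.
sample = iris.data[0].reshape(1, -1)
local_components = pca.transform(sample)

print("local components: ", local_components)
```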
