# [Module 0] Preprocessing from Scratch, Local Docker, and Model Building Pipeline

In [208]:
import boto3
import sagemaker
import pandas as pd
import os


region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
model_package_group_name = "FraudScratchModelPackageGroupName"

# 1. Data Preparation
---

## Loading the Dataset and Uploading to S3

In [218]:
data_dir = '../data'
base_preproc_input_dir = 'opt/ml/processing/input'
os.makedirs(base_preproc_input_dir, exist_ok=True)



In [219]:

df_train = pd.read_csv(f"{data_dir}/train.csv", index_col=0)
df_test = pd.read_csv(f"{data_dir}/test.csv", index_col=0)
df_dataset = pd.concat([df_train, df_test], axis=0)
df_dataset = df_dataset.reset_index()

dataset_path = "{}/dataset.csv".format(base_preproc_input_dir)
df_dataset.to_csv(dataset_path, index=None)
print("df_train shape: ", df_train.shape)
print("df_test shape: ", df_test.shape)
dataset_df = pd.read_csv(dataset_path)
print("df_dataset shape: ", dataset_df.shape) # fraud column added

df_train shape:  (16000, 45)
df_test shape:  (4000, 45)
df_dataset shape:  (20000, 46)


Now upload the data to the default bucket. The `input_data_uri` variable stores the S3 location of the dataset.

In [220]:
local_path = f"{base_preproc_input_dir}/dataset.csv"
project_prefix = 'fraud2scratch'
data_prefix = 'fraud2scratch'
base_uri = f"s3://{default_bucket}/{data_prefix}"
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path, 
    desired_s3_uri=base_uri,
)
print(input_data_uri)

s3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/dataset.csv
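
The printed URI follows a simple pattern: the uploaded file keeps its base name and lands directly under `desired_s3_uri`. The sketch below reproduces only that naming rule and makes no AWS calls. The helper `expected_upload_uri` and the bucket name `my-bucket` are hypothetical and are not part of the SageMaker SDK.

```python
import posixpath


def expected_upload_uri(local_path: str, desired_s3_uri: str) -> str:
    """Return the S3 URI expected for a single uploaded file (hypothetical helper)."""
    file_name = posixpath.basename(local_path)
    return f"{desired_s3_uri.rstrip('/')}/{file_name}"


# 'my-bucket' is a placeholder, not the bucket used in this notebook.
uri = expected_upload_uri(
    "opt/ml/processing/input/dataset.csv",
    "s3://my-bucket/fraud2scratch",
)
print(uri)
```

A check like this can be handy for asserting that the value returned by the upload call is the location you intended before passing it to later steps.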


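# 2. Prototyping the Preprocessing Logic

## (1) Running the Preprocessing Logic in the Local Notebook

### Local Environment Setup

To test locally, create the same directory layout that the Docker container uses.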

In [221]:
import os
base_output_dir = 'opt/ml/processing/output'
# base_preproc_dir = 'opt/ml/processing'


base_train_dir = 'opt/ml/processing/output/train'
os.makedirs(base_train_dir, exist_ok=True)

base_validation_dir = 'opt/ml/processing/output/validation'
os.makedirs(base_validation_dir, exist_ok=True)

base_test_dir = 'opt/ml/processing/output/test'
os.makedirs(base_test_dir, exist_ok=True)
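
The three output folders differ only in the split name, so the same layout can also be built in a loop. This is a minimal sketch that creates the folders under a temporary scratch root instead of the notebook's working directory. `make_processing_dirs` and `scratch_root` are hypothetical names introduced for illustration.

```python
import os
import tempfile


def make_processing_dirs(root, splits=("train", "validation", "test")):
    """Create <root>/opt/ml/processing/output/<split> per split and return the paths."""
    output_dir = os.path.join(root, "opt", "ml", "processing", "output")
    paths = {}
    for split in splits:
        paths[split] = os.path.join(output_dir, split)
        os.makedirs(paths[split], exist_ok=True)  # safe to call repeatedly
    return paths


scratch_root = tempfile.mkdtemp()  # hypothetical scratch location
split_dirs = make_processing_dirs(scratch_root)
print(split_dirs["train"])
```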


The input data was saved to the local input folder in the data preparation step.

### Running the Script Locally

In [222]:
%%sh -s "$base_preproc_input_dir" "$base_output_dir"
python fraud/preprocessing.py \
--base_preproc_input_dir $1 \
--base_output_dir $2 

#! python fraud/preprocessing.py --base_preproc_input_dir {base_preproc_input_dir} --base_output_dir {base_output_dir} 


numpy version:  1.19.5
#############################################
args.base_output_dir: opt/ml/processing/output
args.base_preproc_input_dir: opt/ml/processing/input
args.label_column: fraud
input files: 
 ['opt/ml/processing/input/dataset.csv']
dataset sample 
    fraud  incident_type_theft  ...  collision_type_rear  collision_type_front
0      0                    0  ...                    1                     0
1      0                    0  ...                    1                     0

[2 rows x 46 columns]
df columns 
 Index(['fraud', 'incident_type_theft', 'policy_state_ca', 'policy_deductable',
       'num_witnesses', 'policy_state_or', 'incident_month',
       'customer_gender_female', 'num_insurers_past_5_years',
       'customer_gender_male', 'total_claim_amount',
       'authorities_contacted_police', 'incident_day', 'collision_type_side',
       'customer_age', 'customer_education', 'driver_relationship_child',
       'driver_relationship_spouse', 'injury_claim', 'inc
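
The log above shows three arguments: the two directories passed on the command line and `label_column`, which was not passed and so presumably comes from a default. `fraud/preprocessing.py` itself is not shown here, so the following is only a sketch of how such a command-line interface could be declared. The default values are assumptions inferred from the log.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that appear in the log (defaults are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_preproc_input_dir", type=str,
                        default="/opt/ml/processing/input")
    parser.add_argument("--base_output_dir", type=str,
                        default="/opt/ml/processing/output")
    parser.add_argument("--label_column", type=str, default="fraud")
    return parser.parse_args(argv)


# Same invocation as the local run: relative paths instead of container paths.
args = parse_args([
    "--base_preproc_input_dir", "opt/ml/processing/input",
    "--base_output_dir", "opt/ml/processing/output",
])
print(args.base_preproc_input_dir, args.base_output_dir, args.label_column)
```

Keeping the container paths as defaults means the same script can run unchanged inside a processing container, while the local run overrides them with relative paths.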

## (2) Running the Preprocessing Logic in a Local Docker Container



In [223]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

instance_type = 'local'
sklearn_processor = SKLearnProcessor(framework_version= "0.23-1",
                                     role=role,
                                     instance_type= instance_type,
                                     instance_count=1)

sklearn_processor.run(code='fraud/preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data_uri,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                               ProcessingOutput(source='/opt/ml/processing/output/validation'),
                               ProcessingOutput(source='/opt/ml/processing/output/test')]
                      ,wait=False
                     )


Job Name:  sagemaker-scikit-learn-2021-04-12-11-40-41-489
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/dataset.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-1', 'LocalPath': '/opt/ml/processing/output/train', 'S3UploadMode': 'EndOfJob'}}, {'

# 2. Prototyping the Training Logic

In [274]:
# Query the processing job once and reuse its output configuration.
# Outputs are indexed in the order they were declared: 0 = train, 1 = validation, 2 = test.
processing_outputs = sklearn_processor.latest_job.describe()['ProcessingOutputConfig']['Outputs']

prep_train_dir = processing_outputs[0]['S3Output']['S3Uri']
prep_train_output = f"{prep_train_dir}/train.csv"
print("prep_train_dir: ", prep_train_output)
s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data=prep_train_output,
    content_type='csv')

prep_test_dir = processing_outputs[2]['S3Output']['S3Uri']
prep_test_output = f"{prep_test_dir}/test.csv"
print("prep_test_output: ", prep_test_output)



prep_train_dir:  s3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-1/train.csv
prep_test_output:  s3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-3/test.csv
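
Picking outputs by position works, but it silently breaks if the order of the declared outputs changes. A slightly more defensive option is to look an output up by its container path. The sketch below runs against a hand-written dictionary that mimics the shape of the job description printed earlier. `find_output_uri`, the bucket `my-bucket`, and the job name `my-job` are hypothetical.

```python
def find_output_uri(job_description, local_path):
    """Return the S3 URI of the output whose container path equals local_path."""
    for output in job_description["ProcessingOutputConfig"]["Outputs"]:
        if output["S3Output"]["LocalPath"] == local_path:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"no processing output mapped to {local_path}")


# Hand-written stand-in for the dictionary returned by describe(); placeholder values only.
description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "output-1",
             "S3Output": {"S3Uri": "s3://my-bucket/my-job/output/output-1",
                          "LocalPath": "/opt/ml/processing/output/train"}},
            {"OutputName": "output-3",
             "S3Output": {"S3Uri": "s3://my-bucket/my-job/output/output-3",
                          "LocalPath": "/opt/ml/processing/output/test"}},
        ]
    }
}

test_uri = find_output_uri(description, "/opt/ml/processing/output/test")
print(f"{test_uri}/test.csv")
```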


In [245]:
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker.Session().default_bucket()  # replace with an existing bucket if needed



# Define IAM role
import boto3
from sagemaker import get_execution_role

role = get_execution_role()


In [246]:
from sagemaker import image_uris, session
# container = image_uris.retrieve("xgboost", session.Session().boto_region_name, version="latest")
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
)


print("image_uri: ", image_uri)

# train_instance_type = 'local'
train_instance_type = 'ml.m5.2xlarge'



from sagemaker import local

if train_instance_type == 'local':
    sess = local.LocalSession()
    print("local session is assigned")
else:
    sess = sagemaker.Session()
    print("SageMaker session is assigned")    

image_uri:  366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
SageMaker session is assigned


In [247]:
%%time 

xgb = sagemaker.estimator.Estimator(image_uri,
                                    role, 
                                    instance_count=1, 
                                    instance_type= train_instance_type,
                                    output_path='s3://{}/{}/output'.format(bucket, project_prefix),
                                    sagemaker_session= sess)
xgb.set_hyperparameters(max_depth=6, # default: 6
                        eta=0.3, # learning_rate, default : 0.3
                        alpha = 10, # L1 regularization, default: 0
                        gamma=0, # regularization, default : 0
                        colsample_bytree = 0.3,                         
                        min_child_weight=1, # regularization, default: 1, possible: 6
                        subsample=0.8, # default: 1
                        silent=0,
                        num_class = 5,
                        objective='multi:softmax',
                        num_round=100,
                        seed = 1000
                       )

# xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb.fit({'train': s3_input_train}, wait=False)

CPU times: user 20.2 ms, sys: 0 ns, total: 20.2 ms
Wall time: 243 ms


In [249]:
xgb.latest_training_job.wait(logs=False)


2021-04-12 11:51:43 Starting - Preparing the instances for training
2021-04-12 11:51:43 Downloading - Downloading input data
2021-04-12 11:51:43 Training - Training image download completed. Training in progress.
2021-04-12 11:51:43 Uploading - Uploading generated training model
2021-04-12 11:51:43 Completed - Training job completed


In [250]:
# prep_train = s3_input_train.config['DataSource']['S3DataSource']['S3Uri']
# prep_df = pd.read_csv(prep_train)
# prep_df.shape


# 3. Evaluation
---

### Environment Setup

In [281]:
import os
base_dir = 'opt/ml/processing'
os.makedirs(base_dir, exist_ok=True)

base_model_dir = 'opt/ml/processing/model'
base_model_path = f"{base_model_dir}/model.tar.gz"
os.makedirs(base_model_dir, exist_ok=True)

base_test_path = f"{base_test_dir}/test.csv"
print("base_test_path: ", base_test_path)
test_df = pd.read_csv(base_test_path)
print(test_df.shape)

base_test_path:  opt/ml/processing/output/test/test.csv
(2999, 46)


In [282]:

model_artifcat_path = xgb.latest_training_job.describe()['ModelArtifacts']['S3ModelArtifacts']
print("model_artifcat_path: \n", model_artifcat_path)
# model_artifcat_path = 's3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/model/pipelines-9ct6szmb6rb0-FraudScratchTrain-zQJzAc2pYM/output/model.tar.gz'
print("model_artifcat_path: ", model_artifcat_path)
! aws s3 cp  {model_artifcat_path} {base_model_dir}

model_artifcat_path: 
 s3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/output/sagemaker-xgboost-2021-04-12-11-48-20-620/output/model.tar.gz
model_artifcat_path:  s3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/output/sagemaker-xgboost-2021-04-12-11-48-20-620/output/model.tar.gz
download: s3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/output/sagemaker-xgboost-2021-04-12-11-48-20-620/output/model.tar.gz to opt/ml/processing/model/model.tar.gz
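
The downloaded artifact is a gzip-compressed tar archive, so it has to be unpacked before a model can be loaded from it. The sketch below shows the unpacking step with the standard `tarfile` module. It builds a tiny stand-in archive first so that it runs anywhere. The member name `xgboost-model` and the helper `extract_model` are assumptions for illustration and are not taken from `fraud/evaluation.py`.

```python
import os
import tarfile
import tempfile


def extract_model(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted file names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
    return sorted(os.listdir(dest_dir))


# Build a stand-in archive; the member name is an assumption, not a verified fact.
work_dir = tempfile.mkdtemp()
dummy_model = os.path.join(work_dir, "xgboost-model")
with open(dummy_model, "wb") as f:
    f.write(b"placeholder bytes, not a real model")

archive = os.path.join(work_dir, "model.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(dummy_model, arcname="xgboost-model")

extracted = extract_model(archive, os.path.join(work_dir, "extracted"))
print(extracted)
```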


### Running the Script Locally

In [284]:
output_evaluation_dir = 'opt/ml/processing/evaluation'
print("model_artifcat_path: ", model_artifcat_path)

model_artifcat_path:  s3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/output/sagemaker-xgboost-2021-04-12-11-48-20-620/output/model.tar.gz


In [287]:
%%sh -s "$base_dir" "$base_model_path" "$base_test_path" "$output_evaluation_dir"
python fraud/evaluation.py \
--base_dir $1 \
--model_path $2 \
--test_path $3 \
--output_evaluation_dir $4


#############################################
args.model_path: opt/ml/processing/model/model.tar.gz
args.test_path: opt/ml/processing/output/test/test.csv
args.output_evaluation_dir: opt/ml/processing/evaluation
****** All folder and files under opt/ml/processing ****** 
('opt/ml/processing', ['validation', 'train', 'test', 'evaluation', 'input', 'output', 'model'], [])
('opt/ml/processing/validation', [], ['validation.csv'])
('opt/ml/processing/train', [], ['train.csv'])
('opt/ml/processing/test', [], ['test.csv'])
('opt/ml/processing/evaluation', [], ['evaluation.json'])
('opt/ml/processing/input', ['.ipynb_checkpoints'], ['dataset.csv'])
('opt/ml/processing/input/.ipynb_checkpoints', [], [])
('opt/ml/processing/output', ['validation', 'train', 'test'], [])
('opt/ml/processing/output/validation', [], ['validation.csv'])
('opt/ml/processing/output/train', [], ['train.csv'])
('opt/ml/processing/output/test', [], ['test.csv'])
('opt/ml/processing/model', [], ['model.tar.gz'])
**********
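
Each line of the listing above is a `(directory, subdirectories, files)` tuple, which is the shape that `os.walk` yields. The evaluation script is not shown here, so this is only a sketch of how such a listing could be produced, run against a small temporary tree. `list_tree` and `scratch_root` are hypothetical names.

```python
import os
import tempfile


def list_tree(root):
    """Return one (dirpath, dirnames, filenames) tuple per directory under root."""
    return [
        (dirpath, sorted(dirnames), sorted(filenames))
        for dirpath, dirnames, filenames in os.walk(root)
    ]


# Small stand-in tree so the example runs anywhere.
scratch_root = tempfile.mkdtemp()
for folder, file_name in [("model", "model.tar.gz"), ("test", "test.csv")]:
    os.makedirs(os.path.join(scratch_root, folder), exist_ok=True)
    open(os.path.join(scratch_root, folder, file_name), "w").close()

tree = list_tree(scratch_root)
for entry in tree:
    print(entry)
```

Printing the tree at the start of a script is a cheap way to confirm that every input was mounted where the script expects it, both locally and inside the container.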

## Evaluation for the Model Building Pipeline

In [292]:
image_uri = '366743142698.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3'

In [293]:
from sagemaker.processing import ScriptProcessor

processing_instance_type = 'local'
eval_script_processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="script-fraud-scratch-eval",
    role=role,
)

In [294]:
#model_artifcat_path
prep_test_output

's3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-3/test.csv'

In [298]:
prep_test_output = 's3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-3/test.csv'
! aws s3 ls {prep_test_output}

2021-04-12 11:40:47    2684966 test.csv


In [301]:
eval_script_processor.run(
    inputs=[
        ProcessingInput(
            # source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            source=model_artifcat_path,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            # source=step_process.properties.ProcessingOutputConfig.Outputs[
            #     "test"
            # ].S3Output.S3Uri,
            source=prep_test_output,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="fraud/evaluation.py",
)


Job Name:  script-fraud-scratch-eval-2021-04-13-01-37-01-915
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/fraud2scratch/output/sagemaker-xgboost-2021-04-12-11-48-20-620/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/sagemaker-scikit-learn-2021-04-12-11-40-41-489/output/output-3/test.csv', 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-northeast-2-057716757052/script-fraud-scratch-eval-2021-04-13-01-37-01-915/input/code/evaluation.py', 'LocalPath': '/opt/ml/processin

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

# NOTE: step_process and step_train are the pipeline's processing and training
# steps. They are not defined in this notebook and must exist before this cell runs.
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)
step_eval = ProcessingStep(
    name="FraudScratchEval",
    processor=eval_script_processor,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="fraud/evaluation.py",
    property_files=[evaluation_report],
)
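
A `PropertyFile` points the pipeline at a JSON file written by the evaluation step so that later steps can read values from it. The layout of the `evaluation.json` produced by `fraud/evaluation.py` is not shown in this notebook, so the metric names and the value below are made-up placeholders. The sketch only illustrates writing such a report and reading one value back by a dotted key, using the hypothetical helpers `write_report` and `read_metric`.

```python
import json
import os
import tempfile


def write_report(output_dir, report):
    """Write report as evaluation.json inside output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "evaluation.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path


def read_metric(path, dotted_key):
    """Follow a dotted key such as 'a.b.c' through the nested JSON document."""
    with open(path) as f:
        node = json.load(f)
    for key in dotted_key.split("."):
        node = node[key]
    return node


# Placeholder structure and value; not a result from this notebook.
report = {"classification_metrics": {"accuracy": {"value": 0.5}}}
report_path = write_report(os.path.join(tempfile.mkdtemp(), "evaluation"), report)
accuracy = read_metric(report_path, "classification_metrics.accuracy.value")
print(accuracy)
```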