# [Module 1.2] Training in Local Mode and Script Mode (with SageMaker)

### All notebooks in this workshop run on the `conda_python3` kernel.

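This notebook covers the following steps:
- 1. Environment setup
- 2. Training in SageMaker local mode
- 3. Training in SageMaker host mode
- 4. Viewing experiment results
- 5. Saving the model artifact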

---

References:

- Using PyTorch with SageMaker
    - [Use PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)


- Amazon SageMaker Local Mode Examples
    - Local mode samples for TF, PyTorch, SKLearn, and SKLearn Processing Job
        - https://github.com/aws-samples/amazon-sagemaker-local-mode
    - PyTorch local mode
        - https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/pytorch_script_mode_local_training_and_serving/pytorch_script_mode_local_training_and_serving.py    



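# 1. Environment Setup

## Basic settings
Imported packages are automatically reloaded before code runs, so edits to them take effect without restarting the kernel.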

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('./src')

In [2]:
### Custom library
import config 

Set the bucket and folder (prefix).

In [28]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

use_default_bucket = True
if use_default_bucket:
    bucket = sagemaker_session.default_bucket()
else:
    bucket = '<Type your bucket>'
    
prefix = "KoElectra-HF"



# 2. Training in SageMaker Local Mode
#### Choose instance_type based on whether the local machine has a GPU

In [4]:
import os
import subprocess

# Default to CPU local mode; switch to GPU only if nvidia-smi runs successfully.
instance_type = "local"
try:
    if subprocess.call("nvidia-smi") == 0:
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so no GPU is assumed
    pass

print("Instance type = " + instance_type)

Instance type = local_gpu


## 2.1. How to write script-mode code
- ![script_mode_example](img/script_mode_example.png)

## 2.2. Review the training code
- The code below follows the typical script-mode layout.
- The training function, imported with `from train_lib import train`, is the same one used in the scratch version written **[without SageMaker]** in the previous notebook.
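
As a rough illustration of the script-mode convention, the sketch below shows how an entry point typically receives hyperparameters as command-line arguments and reads the data and model locations from the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables set by the training container. The function name `parse_script_mode_args`, the argument names `--train_data_dir` and `--model_dir`, and the fallback paths are assumptions for illustration; they are not taken from `src/train.py`.

```python
import argparse
import os


def parse_script_mode_args(argv=None):
    """Parse hyperparameters and container-provided locations (illustrative sketch)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments, e.g. --epochs 1
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)

    # Locations come from environment variables set by the training container.
    # The fallback paths are used only when the variables are absent.
    parser.add_argument(
        "--train_data_dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )

    # parse_known_args skips hyperparameters that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_script_mode_args(["--epochs", "3", "--learning_rate", "5e-5"])
print(args.epochs, args.train_batch_size, args.train_data_dir, args.model_dir)
```

Passing an explicit list to the parser makes the same function usable from a notebook cell for quick checks, while the container would call it with no arguments so that `sys.argv` is used.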


In [74]:
train_code = 'src/train.py'
!pygmentize {train_code}

import argparse
import os
import json
import sys

from train_lib import train

def parser_args():
    parser = argparse.ArgumentParser()

    # Default Setting
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=128)
    parser.add_argument("--learning_rate", type

## 2.3. Specify the location of the local dataset

In [19]:
local_inputs = config.train_data_dir
print("local_inputs: ", local_inputs)

local_inputs:  data/nsmc/train


In [20]:
local_inputs = {'train': f'file://{local_inputs}'}
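
A relative `file://` path such as the one above is resolved against the notebook's current working directory. If that is a concern, one option is to build an absolute URI instead. The helper `to_file_uri` below is a hypothetical name introduced for this sketch, not part of the workshop code or the SageMaker SDK.

```python
from pathlib import Path


def to_file_uri(local_dir):
    """Return an absolute file:// URI for a local directory (hypothetical helper)."""
    # resolve() makes the path absolute; as_uri() adds the file:// scheme
    return Path(local_dir).resolve().as_uri()


local_inputs_abs = {"train": to_file_uri("data/nsmc/train")}
print(local_inputs_abs)
```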

## 2.4. Run training in local mode
- The following two arguments tell the estimator to train in local mode.
```python
    instance_type=instance_type, # local_gpu or local
    sagemaker_session=sagemaker.LocalSession(), # use a local session
```

In [21]:
hyperparameters = {'epochs': 1, 
                   'train_batch_size' : 32,
                   'eval_batch_size' : 128,                   
                   'learning_rate': 5e-5,
                   'warmup_steps' : 0,
                   'tokenizer_id' : 'monologg/koelectra-small-v3-discriminator',
                   'model_id' : 'monologg/koelectra-small-v3-discriminator',     
                   'is_evaluation' : config.is_evaluation,
                   'eval_ratio' : config.eval_ratio,
                   'use_subset_train_sampler' : config.use_subset_train_sampler,
                   'log_interval' : 50,
                    }  

In [24]:
from sagemaker.pytorch import PyTorch

local_estimator = PyTorch(
    entry_point="train.py",
    source_dir='src',
    role=role,
    framework_version='1.8.1',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,  # local_gpu or local
    sagemaker_session=sagemaker.LocalSession(),  # use a local session (the keyword is sagemaker_session)
    hyperparameters=hyperparameters,
)
local_estimator.fit(local_inputs)

Creating k6vrq660pi-algo-1-e0ic8 ... 
Creating k6vrq660pi-algo-1-e0ic8 ... done
Attaching to k6vrq660pi-algo-1-e0ic8
k6vrq660pi-algo-1-e0ic8 | 2022-06-05 07:30:56,296 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
k6vrq660pi-algo-1-e0ic8 | 2022-06-05 07:30:56,337 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
k6vrq660pi-algo-1-e0ic8 | 2022-06-05 07:30:56,340 sagemaker_pytorch_container.training INFO     Invoking user training script.
k6vrq660pi-algo-1-e0ic8 | 2022-06-05 07:30:56,532 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
k6vrq660pi-algo-1-e0ic8 | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
k6vrq660pi-algo-1-e0ic8 | Collecting transformers==4.18.0
k6vrq660pi-algo-1-e0ic8 |   Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
     |████████████████████████████████| 4.0 MB 29.

# 3. Training in SageMaker Host Mode
- Change `instance_type` and the session.
- Provide the input data as `inputs`, pointing to S3 paths.
- Set `wait=False` to launch training asynchronously.
- Follow the progress with `host_estimator.logs()` below.

## 3.1. Upload the dataset to S3


In [29]:
local_inputs = config.train_data_dir
print("local_inputs: ", local_inputs)

local_inputs:  data/nsmc/train


In [31]:
s3_data_loc = sagemaker_session.upload_data(path=config.train_data_dir, bucket=bucket, 
                                       key_prefix=f"{prefix}/{config.train_data_dir}")
print("s3_data_loc: ", s3_data_loc)

s3_data_loc:  s3://sagemaker-us-east-1-057716757052/KoElectra-HF/data/nsmc/train


In [32]:
! aws s3 ls {s3_data_loc} --recursive

2022-06-05 11:16:14   13296169 KoElectra-HF/data/nsmc/train/ratings_train.txt


## 3.2. Specify the training and test data on S3

In [33]:
s3_inputs = {
            'train': f'{s3_data_loc}',
            #'test': f'{s3_data_loc}'
            }

print("s3_inputs: \n", s3_inputs)

s3_inputs: 
 {'train': 's3://sagemaker-us-east-1-057716757052/KoElectra-HF/data/nsmc/train'}


## 3.3. Experiment setup

### Experiment settings
- Amazon SageMaker Experiments is a SageMaker capability for organizing, tracking, comparing, and evaluating machine learning experiments.
- For details, see the developer guide: [Manage Machine Learning with Amazon SageMaker Experiments](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/experiments.html)
- SageMaker Experiments requires an additional package. If 1_Setup_Environment has not been run, install it with `!pip install --upgrade sagemaker-experiments`.
- Here the experiment is created through the boto3 API. The SageMaker Python SDK can do the same.


### Create the experiment

In [34]:
# !pip install --upgrade sagemaker-experiments
import boto3
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

from datetime import datetime

sm_client = boto3.client('sagemaker')


# Create a name for the experiment.
experiment_name = prefix

# Create the experiment if it does not exist; otherwise move on.
try:
    response = sm_client.describe_experiment(ExperimentName=experiment_name)
    print(f"Experiment:{experiment_name} already exists")

except sm_client.exceptions.ResourceNotFound:
    response = sm_client.create_experiment(
        ExperimentName = experiment_name,
        Description = 'Experiment for NCF',
    )
    print(f"Experiment:{experiment_name} is created")




Experiment:KoElectra-HF is created


### Hyperparameter settings
- Adjust `epochs` to control how long training runs.

In [61]:
host_hyperparameters = {'epochs': 5, 
                   'train_batch_size' : 32,
                   'eval_batch_size' : 128,                   
                   'learning_rate': 5e-5,
                   'warmup_steps' : 0,
                   'tokenizer_id' : 'monologg/koelectra-small-v3-discriminator',
                   'model_id' : 'monologg/koelectra-small-v3-discriminator',     
                   'is_evaluation' : config.is_evaluation,
                   'eval_ratio' : config.eval_ratio,
                   'use_subset_train_sampler' : False,
                   'log_interval' : 50,
                    }  

### Create a trial

In [62]:
from datetime import datetime
# Build the trial name
ts = datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')
trial_name = experiment_name + f"-{ts}"

# Create a trial inside the experiment.
response = sm_client.create_trial(
    ExperimentName = experiment_name,
    TrialName = trial_name,
)    

# Experiment config: experiment name and trial name
experiment_config = {
    'ExperimentName' : experiment_name,
    'TrialName' : trial_name,
    "TrialComponentDisplayName" : 'Training',
}    



## 3.4. Run training


### View training metrics in CloudWatch
- Developer guide
    - [Monitor and Analyze Training Jobs Using Amazon CloudWatch ](https://docs.amazonaws.cn/en_us/sagemaker/latest/dg/training-metrics.html#define-train-metrics)

In [63]:
metric_definitions=[
       {'Name': 'Accuracy', 'Regex': 'Acc=(.*?);'},
       {'Name': 'Loss', 'Regex': 'Loss=(.*?);'}        
    ]
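
To see what these regular expressions capture, the sketch below applies them to a single log line. The sample line and the helper `extract_metrics` are assumptions made for illustration; the actual format is whatever `train.py` prints, which is not shown in full here.

```python
import re

metric_definitions = [
    {"Name": "Accuracy", "Regex": "Acc=(.*?);"},
    {"Name": "Loss", "Regex": "Loss=(.*?);"},
]

# Hypothetical log line; the real training script may format it differently.
sample_log_line = "Epoch=1; Acc=0.7916; Loss=0.3157;"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # The first capture group holds the metric value
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(sample_log_line, metric_definitions))
```

Because each pattern ends with `;`, a line that omits the trailing semicolon after the value produces no match, so the training script's print format and the regex have to agree.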


In [64]:
from sagemaker.pytorch import PyTorch

instance_type = 'ml.p3.8xlarge'

host_estimator = PyTorch(
    entry_point="train.py",    
    source_dir='src',    
    role=role,
    framework_version='1.8.1',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,
    sagemaker_session=sagemaker.Session(),  # SageMaker session (the keyword is sagemaker_session)
    hyperparameters=host_hyperparameters,
    metric_definitions = metric_definitions
    
)
host_estimator.fit(s3_inputs, 
                   experiment_config = experiment_config, # pass the experiment config
                   wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: pytorch-training-2022-06-05-11-54-33-968


In [65]:
%%time

host_estimator.logs()

2022-06-05 11:54:34 Starting - Starting the training job...ProfilerReport-1654430074: InProgress
...
2022-06-05 11:55:17 Starting - Preparing the instances for training......
2022-06-05 11:56:31 Downloading - Downloading input data......
2022-06-05 11:57:18 Training - Downloading the training image.....................
2022-06-05 12:01:02 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-06-05 12:01:06,179 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-06-05 12:01:06,221 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-06-05 12:01:06,228 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-06-05 12:01:06,679 sagemaker-training-toolkit INFO     Installing dependencies from requirement

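# 4. View Experiment Results


Review the results of the experiment above.
- For each trial's training job you can check the training data used, the input hyperparameters, the evaluation metrics, and the location of the model artifact.
- **Everything below can also be inspected visually in SageMaker Studio.**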

In [66]:
from sagemaker.analytics import ExperimentAnalytics
import pandas as pd
pd.options.display.max_columns = 50
pd.options.display.max_rows = 5
pd.options.display.max_colwidth = 50

search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}


trial_component_analytics = ExperimentAnalytics(
    sagemaker_session= sagemaker_session,
    experiment_name= experiment_name,
    search_expression=search_expression,
)

trial_component_analytics.dataframe()

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,epochs,eval_batch_size,eval_ratio,is_evaluation,learning_rate,log_interval,model_id,sagemaker_container_log_level,sagemaker_job_name,sagemaker_program,sagemaker_region,sagemaker_submit_directory,tokenizer_id,train_batch_size,use_subset_train_sampler,warmup_steps,Accuracy - Min,Accuracy - Max,Accuracy - Avg,Accuracy - StdDev,Accuracy - Last,Accuracy - Count,CrossEntropyLoss_output_0_GLOBAL - Min,CrossEntropyLoss_output_0_GLOBAL - Max,CrossEntropyLoss_output_0_GLOBAL - Avg,CrossEntropyLoss_output_0_GLOBAL - StdDev,CrossEntropyLoss_output_0_GLOBAL - Last,CrossEntropyLoss_output_0_GLOBAL - Count,Loss - Min,Loss - Max,Loss - Avg,Loss - StdDev,Loss - Last,Loss - Count,train - MediaType,train - Value,SageMaker.DebugHookOutput - MediaType,SageMaker.DebugHookOutput - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value,Trials,Experiments
0,pytorch-training-2022-06-05-11-54-33-968-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,763104351884.dkr.ecr.us-east-1.amazonaws.com/p...,1.0,ml.p3.8xlarge,30.0,5.0,128.0,0.2,true,0.00005,50.0,"""monologg/koelectra-small-v3-discriminator""",20.0,"""pytorch-training-2022-06-05-11-54-33-968""","""train.py""","""us-east-1""","""s3://sagemaker-us-east-1-057716757052/pytorch...","""monologg/koelectra-small-v3-discriminator""",32.0,false,0.0,0.541611,0.791592,0.715158,0.096468,0.541611,10.0,0.692698,0.692698,0.692698,0.0,0.692698,1.0,0.315686,0.676168,0.500768,0.155789,0.676168,10.0,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,s3://sagemaker-us-east-1-057716757052/pytorch-...,[KoElectra-HF-2022-06-05-11-54-31-064827],[KoElectra-HF]
1,pytorch-training-2022-06-05-11-37-42-813-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,763104351884.dkr.ecr.us-east-1.amazonaws.com/p...,1.0,ml.p3.8xlarge,30.0,1.0,128.0,0.2,true,0.00005,50.0,"""monologg/koelectra-small-v3-discriminator""",20.0,"""pytorch-training-2022-06-05-11-37-42-813""","""train.py""","""us-east-1""","""s3://sagemaker-us-east-1-057716757052/pytorch...","""monologg/koelectra-small-v3-discriminator""",32.0,false,0.0,0.734673,0.734673,0.734673,0.000000,0.734673,2.0,0.693690,0.693690,0.693690,0.0,0.693690,1.0,0.675772,0.675772,0.675772,0.000000,0.675772,2.0,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,s3://sagemaker-us-east-1-057716757052/pytorch-...,[KoElectra-HF-2022-06-05-11-35-12-423903],[KoElectra-HF]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,pytorch-training-2022-06-05-11-22-18-494-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,763104351884.dkr.ecr.us-east-1.amazonaws.com/p...,1.0,ml.p3.2xlarge,30.0,1.0,128.0,0.2,true,0.00005,50.0,"""monologg/koelectra-small-v3-discriminator""",20.0,"""pytorch-training-2022-06-05-11-22-18-494""","""train.py""","""us-east-1""","""s3://sagemaker-us-east-1-057716757052/pytorch...","""monologg/koelectra-small-v3-discriminator""",32.0,true,0.0,0.687887,0.687887,0.687887,0.000000,0.687887,2.0,0.694898,0.694898,0.694898,0.0,0.694898,1.0,0.687613,0.687613,0.687613,0.000000,0.687613,2.0,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,s3://sagemaker-us-east-1-057716757052/pytorch-...,[KoElectra-HF-2022-06-05-11-17-51-008615],[KoElectra-HF]
5,pytorch-training-2022-06-05-11-20-17-118-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,763104351884.dkr.ecr.us-east-1.amazonaws.com/p...,1.0,ml.p3.2xlarge,30.0,1.0,128.0,0.2,true,0.00005,50.0,"""monologg/koelectra-small-v3-discriminator""",20.0,"""pytorch-training-2022-06-05-11-20-17-118""","""train.py""","""us-east-1""","""s3://sagemaker-us-east-1-057716757052/pytorch...","""monologg/koelectra-small-v3-discriminator""",32.0,true,0.0,,,,,,,,,,,,,,,,,,,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,,[KoElectra-HF-2022-06-05-11-17-51-008615],[KoElectra-HF]


### View trials ordered by the evaluation metric
- The table below lists trials sorted by the evaluation metric (maximum accuracy, descending).

In [71]:

trial_component_training_analytics = ExperimentAnalytics(
    sagemaker_session= sagemaker_session,
    experiment_name= experiment_name,
    search_expression=search_expression,
    sort_by="metrics.Accuracy.max",        
    sort_order="Descending",
    metric_names=["Accuracy"],    
    parameter_names=["epochs", "train_batch_size",
                    ],
)

trial_component_training_analytics.dataframe()

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,epochs,train_batch_size,Accuracy - Min,Accuracy - Max,Accuracy - Avg,Accuracy - StdDev,Accuracy - Last,Accuracy - Count,train - MediaType,train - Value,SageMaker.DebugHookOutput - MediaType,SageMaker.DebugHookOutput - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value,Trials,Experiments
0,pytorch-training-2022-06-05-11-54-33-968-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,5.0,32.0,0.541611,0.791592,0.715158,0.096468,0.541611,10.0,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,s3://sagemaker-us-east-1-057716757052/pytorch-...,[KoElectra-HF-2022-06-05-11-54-31-064827],[KoElectra-HF]
1,pytorch-training-2022-06-05-11-37-42-813-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,1.0,32.0,0.734673,0.734673,0.734673,0.000000,0.734673,2.0,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,s3://sagemaker-us-east-1-057716757052/pytorch-...,[KoElectra-HF-2022-06-05-11-35-12-423903],[KoElectra-HF]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,pytorch-training-2022-06-05-11-35-15-371-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,1.0,32.0,,,,,,,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,,[KoElectra-HF-2022-06-05-11-35-12-423903],[KoElectra-HF]
5,pytorch-training-2022-06-05-11-20-17-118-aws-t...,Training,arn:aws:sagemaker:us-east-1:057716757052:train...,1.0,32.0,,,,,,,,s3://sagemaker-us-east-1-057716757052/KoElectr...,,s3://sagemaker-us-east-1-057716757052/,,,[KoElectra-HF-2022-06-05-11-17-51-008615],[KoElectra-HF]
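
The choice of statistic matters when ranking trials. In the table above, the 5-epoch job has the highest maximum accuracy, but its last recorded accuracy is lower than that of the 1-epoch job. The plain-Python sketch below reproduces that ordering with values copied from the table; the dictionary keys and the helper `rank_trials` are illustrative names, not part of the SageMaker SDK.

```python
# Values copied from rows of the table above; the dict keys are illustrative.
trials = [
    {"job": "pytorch-training-2022-06-05-11-37-42-813", "acc_max": 0.734673, "acc_last": 0.734673},
    {"job": "pytorch-training-2022-06-05-11-54-33-968", "acc_max": 0.791592, "acc_last": 0.541611},
    {"job": "pytorch-training-2022-06-05-11-20-17-118", "acc_max": None, "acc_last": None},
]


def rank_trials(rows, key):
    """Sort rows by `key` in descending order, keeping rows without the metric last."""
    scored = [r for r in rows if r[key] is not None]
    unscored = [r for r in rows if r[key] is None]
    return sorted(scored, key=lambda r: r[key], reverse=True) + unscored


print([r["job"] for r in rank_trials(trials, "acc_max")])
print([r["job"] for r in rank_trials(trials, "acc_last")])
```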


# 5. Save the Model Artifact
- Store the S3 path of the trained model artifact so that it can be reused for inference.

In [72]:
artifact_path = host_estimator.model_data
print("artifact_path: ", artifact_path)

%store artifact_path

artifact_path:  s3://sagemaker-us-east-1-057716757052/pytorch-training-2022-06-05-11-54-33-968/output/model.tar.gz
Stored 'artifact_path' (str)
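
When the artifact is needed later, its S3 URI usually has to be split into a bucket and a key, for example before calling boto3's `download_file`. The sketch below does this with the standard library only; `split_s3_uri` and `example_uri` are hypothetical names introduced here, and the URI is the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    # netloc is the bucket; the path (minus the leading slash) is the object key
    return parsed.netloc, parsed.path.lstrip("/")


example_uri = (
    "s3://sagemaker-us-east-1-057716757052/"
    "pytorch-training-2022-06-05-11-54-33-968/output/model.tar.gz"
)
bucket_name, key = split_s3_uri(example_uri)
print(bucket_name)
print(key)
```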


Store other variables

In [73]:
%store bucket 
%store prefix

Stored 'bucket' (str)
Stored 'prefix' (str)
