# [Module 2.1] Training on a SageMaker Cluster (Running Without a VPC)

This notebook performs the following tasks.
- Run training on a SageMaker Hosting Cluster
- Save the name of the training job
    - It is used in the next notebook for model deployment and inference.
---

Obtain a SageMaker session and retrieve the role information.
- These two pieces of information are used to connect to the SageMaker Hosting Cluster.

In [5]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## Uploading Local Data to S3
Upload the local data to S3 so that it can be used as the input for training.

In [6]:
# dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
# display(dataset_location)
dataset_location = 's3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10'
dataset_location

's3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10'
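The hard-coded URI above bundles the bucket name and the key prefix into one string. When the two parts are needed separately (for example, to reuse the bucket for the output location), they can be split with the standard library. This is a minimal sketch; `split_s3_uri` is a hypothetical helper introduced here for illustration and is not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The parsed path starts with '/', which is not part of the S3 key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key_prefix = split_s3_uri(
    "s3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10"
)
print(bucket)
print(key_prefix)
```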

In [7]:
# efs_dir = '/home/ec2-user/efs/data'

# ! ls {efs_dir} -al
# ! aws s3 cp {dataset_location} {efs_dir} --recursive

In [8]:
from sagemaker.inputs import FileSystemInput

# Specify the EFS file system ID.
file_system_id = 'fs-38dc1558' # 'fs-xxxxxxxx'
print(f"EFS file-system-id: {file_system_id}")

# Specify directory path for input data on the file system. 
# You need to provide normalized and absolute path below.
train_file_system_directory_path = '/data/train'
eval_file_system_directory_path = '/data/eval'
validation_file_system_directory_path = '/data/validation'
print(f'EFS file-system data input path: {train_file_system_directory_path}')
print(f'EFS file-system data input path: {eval_file_system_directory_path}')
print(f'EFS file-system data input path: {validation_file_system_directory_path}')

# Specify the access mode used to mount the directory on the file system.
# Here the directory is mounted 'ro' (read-only).
file_system_access_mode = 'ro'

# Specify your file system type
file_system_type = 'EFS'

train = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=train_file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)

eval = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=eval_file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)

validation = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=validation_file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)

EFS file-system-id: fs-38dc1558
EFS file-system data input path: /data/train
EFS file-system data input path: /data/eval
EFS file-system data input path: /data/validation
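The comment in the cell above asks for a "normalized and absolute" directory path. One reasonable reading of that requirement can be checked locally before the job is submitted. The sketch below is an assumption about what "normalized" means here (no `..`, no trailing slash, no repeated separators); `is_normalized_absolute` is a hypothetical helper, not an SDK function.

```python
import posixpath


def is_normalized_absolute(path):
    """Return True if `path` is absolute and already in normalized form."""
    return posixpath.isabs(path) and posixpath.normpath(path) == path


# The three directories used in this notebook pass the check.
for candidate in ["/data/train", "/data/eval", "/data/validation"]:
    assert is_normalized_absolute(candidate), candidate

print(is_normalized_absolute("data/train"))      # relative path
print(is_normalized_absolute("/data/../train"))  # contains '..'
print(is_normalized_absolute("/data/train/"))    # trailing slash
```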


In [12]:
aws_region = 'ap-northeast-2'# aws-region-code e.g. us-east-1
s3_bucket  = 'sagemaker-ap-northeast-2-057716757052'# your-s3-bucket-name

In [13]:
prefix = "cifar10/efs" #prefix in your bucket
s3_output_location = f's3://{s3_bucket}/{prefix}/output'
print(f'S3 model output location: {s3_output_location}')

S3 model output location: s3://sagemaker-ap-northeast-2-057716757052/cifar10/efs/output


In [14]:
security_group_ids = ['sg-0192524ef63ec6138'] # ['sg-xxxxxxxx'] 
# subnets = ['subnet-0a84bcfa36d3981e6','subnet-0304abaaefc2b1c34','subnet-0a2204b79f378b178'] # [ 'subnet-xxxxxxx', 'subnet-xxxxxxx', 'subnet-xxxxxxx']
subnets = ['subnet-0a84bcfa36d3981e6'] # [ 'subnet-xxxxxxx', 'subnet-xxxxxxx', 'subnet-xxxxxxx']




In [None]:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(base_job_name='cifar10',
                       entry_point='cifar10_keras_sm_tf2.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='2.0.0',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={'epochs' : 1},
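                       # Pre-v2 parameter names; sagemaker>=2 renames them to
                       # instance_count / instance_type (see the warnings in the output).
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',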
                       output_path=s3_output_location,                       
                       subnets=subnets,
                       security_group_ids=security_group_ids,                       
                       sagemaker_session=sagemaker_session  # reuse the session created above
                      )

estimator.fit({'train': train,
               'validation': validation,
               'eval': eval,
              })
# estimator.fit({'train': 'file://data/train',
#                'validation': 'file://data/validation',
#                'eval': 'file://data/eval'})

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-23 13:44:11 Starting - Starting the training job...
2021-02-23 13:44:36 Starting - Launching requested ML instancesProfilerReport-1614087851: InProgress
......
2021-02-23 13:45:40 Starting - Preparing the instances for training....................................
2021-02-23 13:51:42 Failed - Training job failed
.
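The log above only reports that the job failed, not why. One way to look up the reason is the `DescribeTrainingJob` API, whose response carries `TrainingJobStatus` and, for failed jobs, `FailureReason`. The following is a minimal sketch; `training_failure_reason` is a hypothetical helper, and the client is passed in so the function can be exercised without AWS credentials.

```python
def training_failure_reason(sm_client, job_name):
    """Return (status, failure_reason) for a training job.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    # FailureReason is only present in the response when the job has failed.
    return response["TrainingJobStatus"], response.get("FailureReason")


# Usage against the real service (requires AWS credentials and boto3):
#   import boto3
#   status, reason = training_failure_reason(
#       boto3.client("sagemaker"), "<training-job-name>")
#   print(status, reason)
```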

# Set VPC_Mode to True or False
#### **[Important] Change this to True when running in VPC mode**

In [3]:
VPC_Mode = False

In [4]:
from sagemaker.tensorflow import TensorFlow

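# Note: train_instance_count / train_instance_type are the pre-v2 names;
# sagemaker>=2 renames them to instance_count / instance_type.
def retrieve_estimator(VPC_Mode):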
    if VPC_Mode:
        # In VPC mode, specify the subnets and security groups.
        estimator = TensorFlow(base_job_name='cifar10',
                               entry_point='cifar10_keras_sm_tf2.py',
                               source_dir='training_script',
                               role=role,
                               framework_version='2.0.0',
                               py_version='py3',
                               script_mode=True,                       
                               hyperparameters={'epochs': 2},
                               train_instance_count=1, 
                               train_instance_type='ml.p3.8xlarge',
                               subnets = ['subnet-090c1fad32165b0fa','subnet-0bd7cff3909c55018'],
                               security_group_ids = ['sg-0f45d634d80aef27e']                                              
                              )        
    else:
        estimator = TensorFlow(base_job_name='cifar10',
                               entry_point='cifar10_keras_sm_tf2.py',
                               source_dir='training_script',
                               role=role,
                               framework_version='2.0.0',
                               py_version='py3',
                               script_mode=True,                       
                               hyperparameters={'epochs': 2},
                               train_instance_count=1, 
                               train_instance_type='ml.p3.8xlarge')
    return estimator

estimator = retrieve_estimator(VPC_Mode)

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
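The two branches of `retrieve_estimator` differ only in the VPC arguments. As an alternative, the shared arguments can be built once and the VPC settings added conditionally. This is a sketch under stated assumptions: `build_estimator_kwargs` is a hypothetical helper, the subnet and security group IDs are placeholders, and the parameter names mirror the ones used in the notebook cell.

```python
def build_estimator_kwargs(vpc_mode, subnets=None, security_group_ids=None):
    """Return keyword arguments for the TensorFlow estimator."""
    kwargs = dict(
        base_job_name="cifar10",
        entry_point="cifar10_keras_sm_tf2.py",
        source_dir="training_script",
        framework_version="2.0.0",
        py_version="py3",
        script_mode=True,
        hyperparameters={"epochs": 2},
        train_instance_count=1,
        train_instance_type="ml.p3.8xlarge",
    )
    if vpc_mode:
        # VPC mode only makes sense with both settings present.
        if not subnets or not security_group_ids:
            raise ValueError("VPC mode needs subnets and security_group_ids")
        kwargs.update(subnets=subnets, security_group_ids=security_group_ids)
    return kwargs


no_vpc = build_estimator_kwargs(False)
with_vpc = build_estimator_kwargs(True, ["subnet-xxxxxxx"], ["sg-xxxxxxxx"])
print(sorted(set(with_vpc) - set(no_vpc)))  # ['security_group_ids', 'subnets']
```

In the notebook this would be used as `TensorFlow(role=role, **build_estimator_kwargs(VPC_Mode, ...))`, which keeps the two modes from drifting apart when a shared argument changes.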


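Run the training. This time, each channel (`train, validation, eval`) is given the S3 location where its data is stored.<br>
After training completes, also check the Billable seconds. Billable seconds is the time you are actually charged for when running the training.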
```
Billable seconds: <time>
```

For reference, training for 5 epochs on an `ml.p2.xlarge` instance takes about 6 to 7 minutes in total, of which about 3 to 4 minutes is spent on the actual training.
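Besides the training log, the same figures are available from the `DescribeTrainingJob` response, which includes `TrainingTimeInSeconds` and `BillableTimeInSeconds`. The sketch below summarizes such a response; `summarize_billing` is a hypothetical helper, and the sample values are made up for illustration, not measurements from this notebook.

```python
def summarize_billing(description):
    """Summarize training vs. billable time from a DescribeTrainingJob response."""
    training = description["TrainingTimeInSeconds"]
    billable = description["BillableTimeInSeconds"]
    return {
        "training_seconds": training,
        "billable_seconds": billable,
        # Below 1.0 only when the job was billed for less time than it ran
        # (for example, with managed spot training).
        "billable_ratio": billable / training if training else None,
    }


# Made-up values for illustration only.
sample = {"TrainingTimeInSeconds": 200, "BillableTimeInSeconds": 200}
print(summarize_billing(sample))
```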

In [5]:
%%time
estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})

2021-01-27 04:02:44 Starting - Starting the training job...
2021-01-27 04:03:08 Starting - Launching requested ML instancesProfilerReport-1611720164: InProgress
.........
2021-01-27 04:04:29 Starting - Preparing the instances for training......
2021-01-27 04:05:44 Downloading - Downloading input data
2021-01-27 04:05:44 Training - Downloading the training image...
2021-01-27 04:06:11 Training - Training image download completed. Training in progress..
2021-01-27 04:06:06,541 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2021-01-27 04:06:07,035 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "eval": "/opt/ml/input/data/eval",
        "validation": "/opt/ml/input/data/validation",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.t
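The `channel_input_dirs` entries in the training environment above are also exposed to the entry-point script as environment variables named `SM_CHANNEL_<CHANNEL>` (for example `SM_CHANNEL_TRAIN`). The sketch below shows one way a script might read them. It is an illustration only: the contents of `cifar10_keras_sm_tf2.py` are not shown in this notebook, and `parse_channel_args` and its argument names are assumptions.

```python
import argparse
import os


def parse_channel_args(argv=None):
    """Read channel directories, defaulting to the SM_CHANNEL_* variables."""
    parser = argparse.ArgumentParser()
    for channel in ("train", "validation", "eval"):
        parser.add_argument(
            f"--{channel}",
            type=str,
            default=os.environ.get(f"SM_CHANNEL_{channel.upper()}"),
        )
    # parse_known_args ignores hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the container environment shown in the log above.
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"
os.environ["SM_CHANNEL_VALIDATION"] = "/opt/ml/input/data/validation"
os.environ["SM_CHANNEL_EVAL"] = "/opt/ml/input/data/eval"

args = parse_channel_args(["--epochs", "2"])
print(args.train, args.validation, args.eval)
```

Because the defaults come from the environment, the same script can be run locally by passing `--train`, `--validation`, and `--eval` explicitly.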

## Saving training_job_name

Save the current training_job_name.
- The training job name gives access to the details of the training run and to the S3 path of the **Model Artifact** file produced by training.

In [6]:
# Use the public accessor instead of the private _current_job_name attribute.
train_job_name = estimator.latest_training_job.name

In [7]:
%store train_job_name

Stored 'train_job_name' (str)
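As a rough illustration of how the job name relates to the model artifact: SageMaker conventionally writes the artifact to `<output_path>/<job name>/output/model.tar.gz`. The sketch below builds that URI; `expected_model_artifact_uri` is a hypothetical helper and the bucket and job name are placeholders. The authoritative location is the `ModelArtifacts.S3ModelArtifacts` field of the `DescribeTrainingJob` response, so treat the constructed path as an expectation rather than a guarantee.

```python
def expected_model_artifact_uri(output_path, job_name):
    """Build the conventional S3 URI of a training job's model artifact."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder values; the real job name is the one stored in train_job_name.
uri = expected_model_artifact_uri("s3://<your-bucket>/", "<training-job-name>")
print(uri)
```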
