# [Module 2.0] Building the Model (run this in a No-VPC environment)



In [5]:
import tensorflow as tf
tf.__version__

'2.1.2'

# (1) Run the training locally

This script takes the arguments required for model training, listed below.

1. `model_dir` - path where logs and checkpoints are saved
2. `train, validation, eval` - paths where the TFRecord datasets are stored
3. `epochs` - number of epochs
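
As a rough illustration of how a training script could accept the arguments listed above, here is a minimal `argparse` sketch. It is not the contents of `cifar10_keras_ddp_tf2.py`, which this notebook does not show; the `parse_args` helper and the default of 10 epochs are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse the training arguments listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Where logs and checkpoints are written
    parser.add_argument("--model_dir", type=str, required=True)
    # Directories holding the TFRecord datasets
    parser.add_argument("--train", type=str, required=True)
    parser.add_argument("--validation", type=str, required=True)
    parser.add_argument("--eval", type=str, required=True)
    # Number of epochs; the default of 10 is an assumption for this sketch
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


# Same arguments as the local run, passed as a list instead of via the shell
args = parse_args([
    "--model_dir", "./logs",
    "--train", "data/train",
    "--validation", "data/validation",
    "--eval", "data/eval",
    "--epochs", "1",
])
print(args.model_dir, args.epochs)
```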

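Using the command below, train for just 1 epoch in the local notebook instance environment **<font color='red'>without calling any SageMaker APIs</font>**. For reference, this takes 2 min 20 s to 2 min 40 s on a MacBook Pro (15-inch, 2018) with a 2.6GHz Core i7 and 16GB of memory.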

In [2]:
%%time
!mkdir -p logs
!python training_script/cifar10_keras_ddp_tf2.py --model_dir ./logs \
                                         --train data/train \
                                         --validation data/validation \
                                         --eval data/eval \
                                         --epochs 1
!rm -rf logs

Step #0	Loss: 14.399995
Step #10	Loss: 14.481102
Step #20	Loss: 14.964404
Step #30	Loss: 14.229256
Step #40	Loss: 14.984791
CPU times: user 114 ms, sys: 71 ms, total: 185 ms
Wall time: 9.55 s


**<font color='blue'>This script is running in a notebook on SageMaker, but it runs the same way on your local computer as long as Python and Jupyter Notebook are installed correctly.</font>**

# (2) Use TensorFlow Script Mode



### Test your script locally (just like on your laptop)

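To test, run the new script with the same command as above and check that it runs as expected. <br>
When you call the SageMaker TensorFlow API, the environment variables are passed in automatically, but when testing in a local Jupyter notebook you must set them manually. (See the example code below.)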

```python
%env SM_MODEL_DIR=./logs
```
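
One way a script could pick up these variables is to use them as `argparse` defaults, so the same file works both with explicit flags and with only the `SM_*` environment set. The sketch below assumes that pattern; `parse_script_mode_args` is a hypothetical helper, not the actual code in `cifar10_keras_sm_ddp_tf2.py`.

```python
import argparse
import os


def parse_script_mode_args(argv=None):
    """Read settings from flags, falling back to SM_* environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str,
                        default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--eval", type=str,
                        default=os.environ.get("SM_CHANNEL_EVAL"))
    parser.add_argument("--num_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    # parse_known_args ignores any extra flags a launcher might append
    args, _unknown = parser.parse_known_args(argv)
    return args


# Mimic the %env lines used in the notebook cell
os.environ["SM_MODEL_DIR"] = "./logs"
os.environ["SM_CHANNEL_TRAIN"] = "data/train"
os.environ["SM_NUM_GPUS"] = "1"

args = parse_script_mode_args(["--epochs", "1"])
print(args.model_dir, args.train, args.num_gpus)
```

An explicit flag such as `--model_dir ./logs` still overrides the environment value, which matches how the cell below passes `--model_dir` on the command line.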

In [8]:
%%time
!mkdir -p logs   

# Number of GPUs on this machine
%env SM_NUM_GPUS=1
# Where to save the model
%env SM_MODEL_DIR=./logs
# Where the training data is
%env SM_CHANNEL_TRAIN=data/train
# Where the validation data is
%env SM_CHANNEL_VALIDATION=data/validation
# Where the evaluation data is
%env SM_CHANNEL_EVAL=data/eval

!python training_script/cifar10_keras_sm_ddp_tf2.py --model_dir ./logs --epochs 1
!rm -rf logs

env: SM_NUM_GPUS=1
env: SM_MODEL_DIR=./logs
env: SM_CHANNEL_TRAIN=data/train
env: SM_CHANNEL_VALIDATION=data/validation
env: SM_CHANNEL_EVAL=data/eval
Step #0	Loss: 4.030188
Step #10	Loss: 2.626105
Step #20	Loss: 2.318017
Step #30	Loss: 2.091323
Step #40	Loss: 2.048301
CPU times: user 157 ms, sys: 50.4 ms, total: 207 ms
Wall time: 11.6 s


# (3) Use SageMaker local for local testing


In [6]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

Create a TensorFlow Estimator instance from the SageMaker Python SDK using the `sagemaker.tensorflow.TensorFlow` class.
Its arguments let you change the hyperparameters and various other settings.

For details, see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator).

In [7]:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(base_job_name='cifar10',
                       entry_point='cifar10_keras_sm_ddp_tf2.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='2.0.0',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={'epochs' : 1},
                       train_instance_count=1, 
                       train_instance_type='local')

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
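
The warnings above say that `train_instance_count` and `train_instance_type` were renamed in `sagemaker>=2`. The sketch below shows the same local-mode settings with the v2 names `instance_count` and `instance_type` (the names the estimator in section (5) already uses). It builds the arguments as a plain dict so they can be inspected without AWS access; `local_estimator_kwargs` is a hypothetical helper, and `script_mode` is left out on the assumption that v2 no longer needs it.

```python
def local_estimator_kwargs(role, framework_version="2.0.0"):
    """Keyword arguments for a local-mode TensorFlow estimator (v2 names)."""
    return {
        "base_job_name": "cifar10",
        "entry_point": "cifar10_keras_sm_ddp_tf2.py",
        "source_dir": "training_script",
        "role": role,
        "framework_version": framework_version,
        "py_version": "py3",
        "hyperparameters": {"epochs": 1},
        # v2 names for train_instance_count / train_instance_type
        "instance_count": 1,
        "instance_type": "local",
    }


# Placeholder role string; in the notebook this would be get_execution_role()
kwargs = local_estimator_kwargs(role="<your-execution-role-arn>")
print(sorted(kwargs))

# In a SageMaker notebook the dict could then be passed on:
# from sagemaker.tensorflow import TensorFlow
# estimator = TensorFlow(**kwargs)
```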


Specify the three channels used for training and the path to each channel's data. **Because this runs in local mode, specify paths on the notebook instance instead of S3 paths.**
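
The channel-to-path mapping can also be generated rather than typed by hand. This is a minimal sketch; `channel_inputs` is a hypothetical helper, and the S3 prefix in the second call is a made-up placeholder.

```python
def channel_inputs(base, channels=("train", "validation", "eval")):
    """Map each channel name to '<base>/<channel>' (hypothetical helper)."""
    base = base.rstrip("/")
    return {name: "{}/{}".format(base, name) for name in channels}


# Local mode: file:// URIs that point at directories on the notebook instance
local_inputs = channel_inputs("file://data")
print(local_inputs)

# The same helper works for an S3 prefix (placeholder bucket name)
s3_inputs = channel_inputs("s3://my-example-bucket/data/DEMO-cifar10/")
print(s3_inputs["train"])
```

Either dict can be handed to `estimator.fit(...)` in place of the literal mapping.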

In [11]:
%%time
estimator.fit({'train': 'file://data/train',
               'validation': 'file://data/validation',
               'eval': 'file://data/eval'})

Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating mbce57qndc-algo-1-yek3h ... 
Creating mbce57qndc-algo-1-yek3h ... done
Attaching to mbce57qndc-algo-1-yek3h
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,268 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,276 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,482 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,500 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,517 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
mbce57qndc-algo-1-yek3h | 2021-02-22 06:10:16,529 sagemaker-con

# (4) Using SageMaker for faster training time


In [9]:
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

's3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10'

In [10]:
pip install --upgrade sagemaker

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [12]:
%%time

estimator = TensorFlow(base_job_name='cifar10-ddp',
                       entry_point='cifar10_keras_sm_ddp_tf2.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='2.3',
                       py_version='py37',
                       hyperparameters={'epochs': 1},
                       train_instance_count=1,
                       train_instance_type='ml.p3.16xlarge',
                       # Training using smdistributed.dataparallel Distributed Training Framework
                       # distribution={"smdistributed": {"dataparallel": {"enabled": True}}}
                      )

# `wait` is an argument of fit(), not of the estimator constructor
estimator.fit({'train': '{}/train'.format(dataset_location),
               'validation': '{}/validation'.format(dataset_location),
               'eval': '{}/eval'.format(dataset_location)},
              wait=True)

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-22 06:30:43 Starting - Starting the training job...
2021-02-22 06:31:06 Starting - Launching requested ML instancesProfilerReport-1613975442: InProgress
.........
2021-02-22 06:32:27 Starting - Preparing the instances for training.........
2021-02-22 06:34:08 Downloading - Downloading input data
2021-02-22 06:34:08 Training - Downloading the training image...........
2021-02-22 06:35:47.390393: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-02-22 06:35:47.396057: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-02-22 06:35:47.640057: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2021-02-22 06:35:47.734419: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializ

Run the training. This time, specify the S3 data location for each channel (`train, validation, eval`).<br>
After training completes, also check the Billable seconds. Billable seconds is the amount of time you are actually charged for the training run.
```
Billable seconds: <time>
```
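
Besides reading it from the log, the billable time can be pulled out of a `DescribeTrainingJob` response, which carries `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields. The sketch below works on a plain dict so it runs without AWS access; `training_time_summary` is a hypothetical helper and the numbers in `example` are made up.

```python
def training_time_summary(description):
    """Extract timing fields from a DescribeTrainingJob response dict."""
    return {
        "training_seconds": description.get("TrainingTimeInSeconds"),
        "billable_seconds": description.get("BillableTimeInSeconds"),
    }


# Stand-in response with made-up values. In the notebook, the dict could come from
# sagemaker_session.sagemaker_client.describe_training_job(
#     TrainingJobName=estimator.latest_training_job.name)
example = {
    "TrainingJobName": "cifar10-ddp-example",
    "TrainingTimeInSeconds": 200,
    "BillableTimeInSeconds": 200,
}

summary = training_time_summary(example)
print("Billable seconds:", summary["billable_seconds"])
```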

For reference, training for 5 epochs on an `ml.p2.xlarge` instance takes 6 to 7 minutes in total, of which 3 to 4 minutes is actual training time.

2021-02-22 06:11:46 Starting - Starting the training job...
2021-02-22 06:12:11 Starting - Launching requested ML instancesProfilerReport-1613974305: InProgress
.........
2021-02-22 06:13:46 Starting - Preparing the instances for training......
2021-02-22 06:14:37 Downloading - Downloading input data...
2021-02-22 06:15:13 Training - Downloading the training image...
2021-02-22 06:15:34 Training - Training image download completed. Training in progress.
2021-02-22 06:15:34,621 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2021-02-22 06:15:35,163 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "eval": "/opt/ml/input/data/eval",
        "validation": "/opt/ml/input/data/validation",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container

KeyboardInterrupt: 

In [15]:
%store dataset_location

Stored 'dataset_location' (str)


# (5) Distributed data parallel (DDP) training

In [22]:
from sagemaker.tensorflow import TensorFlow
pt_estimator = TensorFlow(
                        base_job_name='tensorflow2-smdataparallel-mnist',
                        source_dir='training_script',    
                        entry_point='cifar10_keras_sm_ddp_81.py',
                        role=role,
                        py_version='py37',
                        framework_version='2.3.1',
                        # For multi-node distributed training, set this to the number of instances. Example: 2
                        instance_count=2,
                        # For p3dn instances use ml.p3dn.24xlarge; for p4d instances use ml.p4d.24xlarge
                        instance_type= 'ml.p3.16xlarge',
                        sagemaker_session=sagemaker_session,
                        # Training using SMDataParallel Distributed Training Framework
                        distribution={'smdistributed':{
                                            'dataparallel':{
                                                    'enabled': True
                                             }
                                      }}
                        )
pt_estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})

2021-02-22 13:07:19 Starting - Starting the training job...
2021-02-22 13:07:44 Starting - Launching requested ML instancesProfilerReport-1613999239: InProgress
.........
2021-02-22 13:09:17 Starting - Preparing the instances for training.........
2021-02-22 13:10:46 Downloading - Downloading input data...
2021-02-22 13:11:06 Training - Downloading the training image............
2021-02-22 13:13:16 Training - Training image download completed. Training in progress..
2021-02-22 13:13:16.760556: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-02-22 13:13:16.765568: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-02-22 13:13:17.000480: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-02-

UnexpectedStatusException: Error for Training job tensorflow2-smdataparallel-mnist-2021-02-22-13-07-19-068: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
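
For a sense of the scale of the job configured in this section: data-parallel training typically runs one worker per GPU, so the number of workers is the instance count times the GPUs per instance. The sketch below assumes 8 GPUs per `ml.p3.16xlarge` instance; the per-GPU batch size of 128 is a made-up value, and both helper functions are hypothetical.

```python
def world_size(instance_count, gpus_per_instance):
    """Total number of data-parallel workers, assuming one worker per GPU."""
    return instance_count * gpus_per_instance


def global_batch_size(per_gpu_batch, instance_count, gpus_per_instance):
    """Samples consumed per step across all workers."""
    return per_gpu_batch * world_size(instance_count, gpus_per_instance)


# instance_count=2 as in the estimator above; 8 GPUs per instance is assumed
workers = world_size(instance_count=2, gpus_per_instance=8)
print(workers)  # 16

# The per-GPU batch size of 128 is an illustrative value only
print(global_batch_size(128, instance_count=2, gpus_per_instance=8))  # 2048
```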