# [Module 3.1] SageMaker DDP Model Training


### Important notes

- This example runs only on an **<font color="red">ml.p3.16xlarge</font>** notebook instance.
- Every notebook in this workshop uses the **<font color="red">conda_tensorflow2_p36</font>** kernel.

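This notebook performs the following steps:
- 1. Set up the basic environment
- 2. Upload the dataset to S3
- 3. Adapt the code to SageMaker script mode from the notebook
- 4. Train in SageMaker local mode
- 5. Train in SageMaker host mode
- 6. Store the model artifact path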

## References:

- The official SageMaker developer guide:
    - [Developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html)


- The Git repository of SageMaker distributed library examples:
    - [Official SageMaker distributed library examples](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training)
---

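# 1. Basic setup
The autoreload extension reloads imported packages before each cell runs, so edits to them take effect without restarting the kernel.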

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

In [3]:
import os
import subprocess

instance_type = "local_gpu"

print("Instance type = " + instance_type)

Instance type = local_gpu


In [4]:
%store -r train_dir
%store -r validation_dir
%store -r eval_dir
%store -r data_dir

# 2. Upload the dataset to S3


In [5]:
dataset_location = sagemaker_session.upload_data(path=data_dir, key_prefix='data/DEMO-cifar10')
display(dataset_location)

's3://sagemaker-us-east-1-057716757052/data/DEMO-cifar10'
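The uploaded location is later split into `train`, `validation`, and `eval` channels when training starts. A minimal sketch of that mapping follows; `build_channels` and the bucket name are hypothetical helpers for illustration, not part of the SageMaker SDK.

```python
def build_channels(dataset_location, names=("train", "validation", "eval")):
    """Map each channel name to its S3 prefix under dataset_location."""
    base = dataset_location.rstrip("/")  # avoid a double slash in the result
    return {name: f"{base}/{name}" for name in names}


# Hypothetical bucket; in the notebook this would be dataset_location.
channels = build_channels("s3://my-bucket/data/DEMO-cifar10")
print(channels["train"])
```

The resulting dictionary has the same shape as the argument passed to `fit()` later in this notebook.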

# 3. Prepare for model training

In [6]:
import os
import subprocess

instance_type = "local_gpu"
# instance_type = "ml.p3.8xlarge"

job_name ='cifar10-horovod'

## Remove old Docker containers from the system
- Use the commands below to free up disk space.

### Remove all stopped Docker containers

In [7]:
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G     0  241G   0% /dev/shm
/dev/xvda1      109G  108G  863M 100% /
/dev/xvdf       984G   25G  910G   3% /home/ec2-user/SageMaker


In [8]:
# ! docker container prune -f 
# ! rm -rf /tmp/tmp*
# ! df -h

### Remove all unused Docker images

In [9]:
# ! df -h
# ! docker image prune -f --all
# ! df -h

### Free up additional space

If you need to free up more space, run the following:
```
rm -rf /tmp/tmp*
```
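The `df -h` output above shows the root volume almost full, and the training image pulled in local mode is several gigabytes. As a rough sketch, free space can be checked from Python before starting a local-mode job. The 10 GiB threshold and the helper names are assumptions chosen for illustration.

```python
import shutil


def free_gib(path="/"):
    """Return the free space on the filesystem holding path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for_image(path="/", required_gib=10.0):
    """True if at least required_gib (an assumed threshold) is free."""
    return free_gib(path) >= required_gib


print(f"free on /: {free_gib('/'):.1f} GiB")
if not has_room_for_image("/"):
    print("Consider pruning Docker containers and images first.")
```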

# 4. Train in local mode
- Training runs on the current notebook instance.

In [10]:

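def calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count):
    """Divide the single-GPU learning rate by the total number of GPUs.

    This is the heuristic used in this example; tune it for real workloads.
    """
    total_gpu = num_gpu * train_instance_count

    multi_gpu_learning_rate = one_gpu_learning_rate / total_gpu
    print("multi_gpu_learning_rate: ", multi_gpu_learning_rate)

    return multi_gpu_learning_rate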

train_instance_type = 'ml.p3.16xlarge'
num_gpu = 8
train_instance_count = 1
one_gpu_learning_rate = 0.001 

multi_gpu_learning_rate = calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count )

multi_gpu_learning_rate:  0.000125


In [11]:
hyperparameters = {
                    'epochs' : 10,
                    'learning-rate' : f"{multi_gpu_learning_rate}",
                    'print-interval' : 100,
                    'train-batch-size': 64,    
                    'eval-batch-size': 512,        
                    'validation-batch-size': 512,
                  }
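In script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument (for example `--epochs 10`). The entry point `cifar10_tf2_sm_ddp.py` is not shown in this notebook, so the following is only a sketch of how such a script might parse the keys above. The defaults are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameter keys used in this notebook (defaults assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--print-interval", type=int, default=100)
    parser.add_argument("--train-batch-size", type=int, default=64)
    parser.add_argument("--eval-batch-size", type=int, default=512)
    parser.add_argument("--validation-batch-size", type=int, default=512)
    # Ignore any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--epochs", "10", "--learning-rate", "0.000125"])
print(args.epochs, args.learning_rate, args.train_batch_size)
```

argparse converts the hyphenated flags to underscore attribute names, so `--learning-rate` is read as `args.learning_rate`.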



In [13]:
from sagemaker.tensorflow import TensorFlow

job_name ='cifar10-sm-ddp'
estimator = TensorFlow(base_job_name= job_name,
                       entry_point='cifar10_tf2_sm_ddp.py',
                       source_dir='src',
                       role=role,
                       framework_version='2.4.1',
                       py_version='py37',
                       script_mode=True,                            
                       hyperparameters= hyperparameters,
                       train_instance_count=1,   # changed; named instance_count in sagemaker>=2
                       train_instance_type='local_gpu',   # named instance_type in sagemaker>=2
                       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
                       debugger_hook_config=False,                       
                      )


estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Creating le2gyt8gwa-algo-1-dopi5 ... 
Creating le2gyt8gwa-algo-1-dopi5 ... done
Attaching to le2gyt8gwa-algo-1-dopi5
le2gyt8gwa-algo-1-dopi5 | 2021-10-10 05:26:26.677031: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
le2gyt8gwa-algo-1-dopi5 | 2021-10-10 05:26:26.677227: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
le2gyt8gwa-algo-1-dopi5 | 2021-10-10 05:26:26.681562: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
le2gyt8gwa-algo-1-dopi5 | 2021-10-10 05:26:26.718624: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
le2gyt8gwa-algo-1-dopi5 | 2021-10-10 05:26:28,545 sagemaker-training-toolkit INFO     Imported framework sagemaker_

#### Confirm that the Docker image was downloaded in local mode

In [14]:
! docker image ls

REPOSITORY                                                         TAG                 IMAGE ID            CREATED             SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training   2.4.1-gpu-py37      8467bc1c5070        5 months ago        8.91GB


# 5. Train in host mode

## multi_gpu_learning_rate
- This value needs tuning according to the number of GPUs, the batch size, and the number of batches per epoch. It is used here only as an example, so tune it appropriately for real workloads.

In [15]:
train_instance_type = 'ml.p3.16xlarge'
num_gpu = 8
train_instance_count = 2
one_gpu_learning_rate = 0.001 

multi_gpu_learning_rate = calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count )

multi_gpu_learning_rate:  6.25e-05
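To see how the GPU count interacts with the batch size and the number of batches per epoch, the following sketch computes steps per epoch for one and two instances. It assumes that `train-batch-size` is a per-GPU value, that the data is sharded evenly across GPUs, and that the training split holds 40,000 images. All three are assumptions for illustration and are not confirmed by this notebook.

```python
import math


def steps_per_epoch(num_samples, per_gpu_batch, gpus_per_instance, instance_count):
    """Number of optimizer steps per epoch under even data sharding."""
    global_batch = per_gpu_batch * gpus_per_instance * instance_count
    return math.ceil(num_samples / global_batch)


NUM_TRAIN_SAMPLES = 40_000  # hypothetical training split size

for instances in (1, 2):
    steps = steps_per_epoch(NUM_TRAIN_SAMPLES, 64, 8, instances)
    print(f"instances={instances} steps_per_epoch={steps}")
```

Doubling the instance count roughly halves the number of steps per epoch, which is one reason the learning rate and epoch count may need retuning together.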


In [16]:
from sagemaker.tensorflow import TensorFlow
hyperparameters = {
                    'epochs' : 50,
                    'learning-rate' : f"{multi_gpu_learning_rate}",
                    'print-interval' : 100,
                    'train-batch-size': 64,    
                    'eval-batch-size': 512,        
                    'validation-batch-size': 512,
                  }


job_name ='cifar10-sm-ddp'

In [17]:

ddp_estimator = TensorFlow(base_job_name= job_name,
                       entry_point='cifar10_tf2_sm_ddp.py',
                       source_dir='src',
                       role=role,
                       framework_version='2.4.1',
                       py_version='py37',
                       script_mode=True,                            
                       hyperparameters= hyperparameters,
                       train_instance_count=train_instance_count,   # changed; named instance_count in sagemaker>=2
                       train_instance_type=train_instance_type,   # named instance_type in sagemaker>=2
                       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
                       debugger_hook_config=False,                       
                      )


ddp_estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)}, wait=False)

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
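Because `fit(..., wait=False)` returns immediately, the job keeps running in the background. One way to wait for it outside the notebook is to poll its status. The sketch below takes a `describe` callable so it can run offline; with boto3 this could be `lambda name: boto3.client("sagemaker").describe_training_job(TrainingJobName=name)`. The helper name, job name, and polling limits are assumptions.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=120):
    """Poll describe(job_name) until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = describe(job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Stand-in for the real API call so the sketch runs without AWS access.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(job_name):
    return {"TrainingJobStatus": next(_fake_statuses)}


print(wait_for_training_job(fake_describe, "cifar10-sm-ddp-example", poll_seconds=0))
```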


In [18]:
ddp_estimator.logs()

2021-10-10 05:27:45 Starting - Starting the training job...
2021-10-10 05:27:48 Starting - Launching requested ML instancesProfilerReport-1633843665: InProgress
.........
2021-10-10 05:29:38 Starting - Preparing the instances for training.........
2021-10-10 05:31:06 Downloading - Downloading input data...
2021-10-10 05:31:39 Training - Downloading the training image..............
2021-10-10 05:33:57.516542: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-10-10 05:33:57.523293: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-10-10 05:33:57.641516: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-10-10 05:33:57.760163: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Ini

# 6. Wrap-up

## Store the model artifact
- Store the S3 path of the trained model artifact so it can be used later for inference.

In [19]:
tf2_ddp_artifact_path = ddp_estimator.model_data
print("ddp_artifact_path: ", tf2_ddp_artifact_path)


%store tf2_ddp_artifact_path

ddp_artifact_path:  s3://sagemaker-us-east-1-057716757052/cifar10-sm-ddp-2021-10-10-05-27-44-881/output/model.tar.gz
Stored 'tf2_ddp_artifact_path' (str)


In [20]:
! aws s3 ls {tf2_ddp_artifact_path} --recursive

2021-10-10 05:39:42    6052287 cifar10-sm-ddp-2021-10-10-05-27-44-881/output/model.tar.gz
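Downstream steps often need the bucket and key separately rather than the full artifact URI. A small sketch of splitting an S3 URI with the standard library follows; `split_s3_uri` and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI; in the notebook this would be tf2_ddp_artifact_path.
bucket_name, key = split_s3_uri(
    "s3://my-bucket/cifar10-sm-ddp-example/output/model.tar.gz"
)
print(bucket_name, key)
```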
