# [Module 2.2] Distributed Training with TF Horovod (Local Mode and Host Mode)


All notebooks in this workshop use the **<font color="red">conda_tensorflow2_p36</font>** kernel.

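This notebook performs the following tasks.
- 1. Set up the basic environment
- 2. Upload the dataset to S3
- 3. Convert the notebook code to SageMaker script mode style
- 4. Train in SageMaker local mode
- 5. Train in SageMaker host mode
- 6. Save the model artifact path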


## Reference:
- This is the official TF2 example from the Horovod GitHub repository.
    - [Official Horovod example](https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_mnist.py)

---

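# 1. Basic Setup
The `autoreload` extension reloads imported modules before each cell runs, so edits to local source files take effect without restarting the kernel.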

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

In [3]:
import os
import subprocess

instance_type = "local_gpu"

print("Instance type = " + instance_type)

Instance type = local_gpu


In [4]:
%store -r train_dir
%store -r validation_dir
%store -r eval_dir
%store -r data_dir

# 2. Upload the Dataset to S3


In [5]:
dataset_location = sagemaker_session.upload_data(path=data_dir, key_prefix='data/DEMO-cifar10')
display(dataset_location)

's3://sagemaker-us-east-1-227612457811/data/DEMO-cifar10'
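The training cells later in this notebook pass three S3 channels (`train`, `validation`, `eval`) derived from `dataset_location`. As a minimal sketch, assuming the uploaded directory contains those three subfolders, the channel mapping could be built with a small helper. The name `build_channels` and the example bucket are hypothetical and not part of the SageMaker SDK.

```python
def build_channels(dataset_location, names=("train", "validation", "eval")):
    """Map each channel name to its S3 prefix under dataset_location."""
    base = dataset_location.rstrip("/")  # avoid a double slash in the result
    return {name: f"{base}/{name}" for name in names}


# Hypothetical bucket name, used only for illustration.
channels = build_channels("s3://example-bucket/data/DEMO-cifar10")
for name, uri in channels.items():
    print(name, "->", uri)
```

A dictionary of this shape is what `estimator.fit(...)` receives as its inputs in the cells below.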

# 3. Prepare for Model Training

In [6]:
import os
import subprocess

instance_type = "local_gpu"
# instance_type = "ml.p3.8xlarge"

job_name ='cifar10-horovod'

## Remove Old Docker Containers from the System
- Use the commands below to free up disk space.
- Uncomment them when needed.

### Remove all stopped Docker containers

In [7]:
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         30G   72K   30G   1% /dev
tmpfs            30G     0   30G   0% /dev/shm
/dev/xvda1      109G   98G   11G  91% /
/dev/xvdf        20G  571M   18G   4% /home/ec2-user/SageMaker


In [8]:
# ! docker container prune -f
# ! df -h

### Remove all unused Docker images

In [9]:
# ! df -h
# ! docker image prune -f --all
# ! df -h
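Free disk space can also be checked from Python before deciding whether to prune. The sketch below uses the standard library's `shutil.disk_usage`; the 10 GiB threshold is an assumption, chosen only because the training image listed later in this notebook is about 8.91GB, and `free_space_gib` is a hypothetical helper name.

```python
import shutil


def free_space_gib(path="/"):
    """Return the free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


REQUIRED_GIB = 10  # assumed threshold, roughly the size of one training image

free_gib = free_space_gib("/")
print(f"Free space on /: {free_gib:.1f} GiB")
if free_gib < REQUIRED_GIB:
    print("Low disk space: consider running the docker prune commands above.")
```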

# 4. Train in Local Mode
- Training runs on the current notebook instance.

In [11]:

def calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count ):
    total_gpu = num_gpu * train_instance_count

    multi_gpu_learning_rate = one_gpu_learning_rate / total_gpu
    print("multi_gpu_learning_rate: ", multi_gpu_learning_rate)
    
    return multi_gpu_learning_rate

train_instance_type = 'local_gpu'
num_gpu = 1
train_instance_count = 1
one_gpu_learning_rate = 0.001 

multi_gpu_learning_rate = calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count )

multi_gpu_learning_rate:  0.001
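To see what `calculate_learning_rate` implies for the two configurations used in this notebook, the sketch below tabulates the divided learning rate next to the global batch size. It assumes each Horovod process consumes one `train-batch-size` batch (64) per step, which is typical for data-parallel training but is not confirmed here because the training script is not shown. `scaled_settings` is a hypothetical helper.

```python
def scaled_settings(one_gpu_lr, per_gpu_batch, num_gpu, instance_count):
    """Summarize the learning rate and global batch size for a GPU layout."""
    total_gpu = num_gpu * instance_count
    return {
        "total_gpu": total_gpu,
        # Same rule as calculate_learning_rate: divide by the GPU count.
        "learning_rate": one_gpu_lr / total_gpu,
        # Assumption: every process draws its own batch each step.
        "global_batch_size": per_gpu_batch * total_gpu,
    }


local = scaled_settings(0.001, 64, num_gpu=1, instance_count=1)
hosted = scaled_settings(0.001, 64, num_gpu=4, instance_count=2)
print("local :", local)
print("hosted:", hosted)
```

Note that the official Horovod example referenced at the top multiplies the base learning rate by the number of workers, whereas this notebook divides it. Either value is only a starting point and should be tuned for the actual workload.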


In [12]:
from sagemaker.tensorflow import TensorFlow
hyperparameters = {
                    'epochs' : 1,
                    'learning-rate' : multi_gpu_learning_rate,
                    'print-interval' : 100,
                    'train-batch-size': 64,    
                    'eval-batch-size': 512,        
                    'validation-batch-size': 512,
                  }

distributions = {
    'mpi': {
        'enabled': True, 
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',
        'processes_per_host': num_gpu
    }
}

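# base_job_name is the prefix of the training job name shown in the SageMaker console.
# Note: sagemaker>=2 renames train_instance_count, train_instance_type, and distributions
# to instance_count, instance_type, and distribution (see the warnings in the output below).
estimator = TensorFlow(base_job_name='cifar10-tf-dist',
                       entry_point='cifar10_tf2_sm_horovod.py',
                       source_dir='src',
                       role=role,
                       framework_version='2.4.1',
                       py_version='py37',
                       script_mode=True,                            
                       hyperparameters= hyperparameters,
                       train_instance_count=train_instance_count,   # changed
                       train_instance_type=train_instance_type,
                       distributions=distributions # added
                      )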


estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})

distributions has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Creating xmn4yiyyrw-algo-1-twv56 ... 
Creating xmn4yiyyrw-algo-1-twv56 ... done
Attaching to xmn4yiyyrw-algo-1-twv56
xmn4yiyyrw-algo-1-twv56 | 2021-10-11 11:36:57.827877: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
xmn4yiyyrw-algo-1-twv56 | 2021-10-11 11:36:57.828070: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
xmn4yiyyrw-algo-1-twv56 | 2021-10-11 11:36:57.832327: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
xmn4yiyyrw-algo-1-twv56 | 2021-10-11 11:36:57.869090: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
xmn4yiyyrw-algo-1-twv56 | 2021-10-11 11:36:59,598 sagemaker-training-toolkit INFO     Imported framework sagemaker_

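The deprecation warnings in the output above come from three keyword arguments that SageMaker Python SDK v2 renamed. As a sketch, the mapping can be written down explicitly and applied to a dictionary of estimator arguments. `V2_RENAMES` and `to_v2_kwargs` are hypothetical names for illustration, not SDK functions.

```python
# Old keyword (SDK v1) -> new keyword (SDK v2), as named in the warnings above.
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "distributions": "distribution",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 keyword names replaced by v2 names."""
    return {V2_RENAMES.get(key, key): value for key, value in kwargs.items()}


legacy_kwargs = {
    "base_job_name": "cifar10-tf-dist",
    "train_instance_count": 1,
    "train_instance_type": "local_gpu",
    "distributions": {"mpi": {"enabled": True, "processes_per_host": 1}},
}
print(to_v2_kwargs(legacy_kwargs))
```

Passing the v2 names directly to the estimator should avoid the warnings. The cells in this notebook keep the original names so that they match the recorded output.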
#### Verify that the Docker image was downloaded in local mode

In [13]:
! docker image ls

REPOSITORY                                                         TAG                 IMAGE ID            CREATED             SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training   2.4.1-gpu-py37      8467bc1c5070        5 months ago        8.91GB


# 5. Train in Host Mode

<font color="red">`processes_per_host: 4` sets the number of processes launched on each EC2 instance, which here equals the number of GPUs available on one instance.</font>

- An ml.p3.8xlarge has 4 GPUs, so at most 4 can be used per instance.
- In the case below, each ml.p3.8xlarge has 4 GPUs and there are 2 instances, so 8 GPUs are used in total.
    - train_instance_type='ml.p3.8xlarge'
    - train_instance_count = 2




```
train_instance_type='ml.p3.8xlarge'
train_instance_count = 2

distributions = {
    'mpi': {
        'enabled': True, 
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',
        'processes_per_host': 4
    }
}
```
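The arithmetic behind this configuration can be sketched in a few lines: the total number of Horovod workers is `processes_per_host` multiplied by the instance count. The helpers `mpi_distribution` and `total_workers` are hypothetical, and the MPI option string is copied from the configuration above.

```python
def mpi_distribution(processes_per_host,
                     mpi_options="-verbose --NCCL_DEBUG=INFO"):
    """Build the MPI distribution dictionary used by the estimator."""
    return {
        "mpi": {
            "enabled": True,
            "custom_mpi_options": mpi_options,
            "processes_per_host": processes_per_host,
        }
    }


def total_workers(distribution, instance_count):
    """Total processes across all hosts (one per GPU in this notebook)."""
    return distribution["mpi"]["processes_per_host"] * instance_count


dist = mpi_distribution(processes_per_host=4)    # 4 GPUs on ml.p3.8xlarge
print(total_workers(dist, instance_count=2))     # 2 instances
```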

## multi_gpu_learning_rate
- The learning rate needs tuning according to the number of GPUs, the batch size, and the number of batches per epoch. The value here is only an example, so tune it appropriately for real workloads.

In [15]:
train_instance_type = 'ml.p3.8xlarge'
num_gpu = 4
train_instance_count = 2
total_num_gpu = num_gpu * train_instance_count
one_gpu_learning_rate = 0.001 

multi_gpu_learning_rate = calculate_learning_rate(one_gpu_learning_rate, num_gpu, train_instance_count )

multi_gpu_learning_rate:  0.000125


In [14]:
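# Regexes for extracting metrics from the training logs.
# Note: this list is not passed to the estimator below; to publish the metrics,
# pass metric_definitions=metric_definitions when constructing the estimator.
metric_definitions = [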
    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},
    {'Name': 'train:accuracy', 'Regex': 'acc: (.*?) '},
    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},
    {'Name': 'validation:accuracy', 'Regex': 'val_acc: (.*?) '}
]
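Each entry in `metric_definitions` pairs a metric name with a regular expression whose first capture group is the metric value. The sketch below applies the same patterns to a made-up log line with Python's `re` module. The log line is hypothetical, since the actual output format of the training script is not shown here.

```python
import re

metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},
    {'Name': 'train:accuracy', 'Regex': 'acc: (.*?) '},
    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},
    {'Name': 'validation:accuracy', 'Regex': 'val_acc: (.*?) '},
]

# Hypothetical log line; each value must be followed by a space to match.
sample_line = "step 50 - loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6000 - 20s"

extracted = {}
for definition in metric_definitions:
    match = re.search(definition['Regex'], sample_line)
    if match:
        extracted[definition['Name']] = float(match.group(1))
print(extracted)

# The unanchored 'loss: ' pattern also matches inside 'val_loss: '.
print(re.findall('loss: (.*?) ', sample_line))
```

Because `loss: ` and `acc: ` are substrings of `val_loss: ` and `val_acc: `, the training patterns can also capture validation values. A stricter pattern may be worth considering if the metrics look mixed up.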

In [16]:
hyperparameters = {
                    'epochs' : 40,
                    'learning-rate' : multi_gpu_learning_rate,
                    'print-interval' : 50,
                    'train-batch-size': 64,    
                    'eval-batch-size': 512,        
                    'validation-batch-size': 512,
                  }


distributions = {
    'mpi': {
        'enabled': True, 
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',
        'processes_per_host': num_gpu
    }
}



In [17]:
from sagemaker.tensorflow import TensorFlow



horovod_estimator = TensorFlow(base_job_name='cifar10-tf-dist',
                       entry_point='cifar10_tf2_sm_horovod.py',
                       source_dir='src',
                       role=role,
                       framework_version='2.4.1',
                       py_version='py37',
                       script_mode=True,                            
                       hyperparameters= hyperparameters,
                       train_instance_count=train_instance_count,   # changed
                       train_instance_type= train_instance_type,
                       distributions=distributions # added
                      )


horovod_estimator.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)}, wait=False)

distributions has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [18]:
horovod_estimator.logs()

2021-10-11 11:37:52 Starting - Starting the training job...
2021-10-11 11:38:21 Starting - Launching requested ML instancesProfilerReport-1633952272: InProgress
.........
2021-10-11 11:39:44 Starting - Preparing the instances for training......
2021-10-11 11:40:43 Downloading - Downloading input data...
2021-10-11 11:41:21 Training - Downloading the training image..............2021-10-11 11:43:36.122570: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-10-11 11:43:36.127899: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-10-11 11:43:36.225247: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-10-11 11:43:36.328498: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initia

# 6. Wrap-Up

## Save the Model Artifact
- Store the S3 path of the model artifact so that it can be used for inference.

In [19]:
tf2_horovod_artifact_path = horovod_estimator.model_data
print("horovod_artifact_path: ", tf2_horovod_artifact_path)


%store tf2_horovod_artifact_path

horovod_artifact_path:  s3://sagemaker-us-east-1-227612457811/cifar10-tf-dist-2021-10-11-11-37-52-205/output/model.tar.gz
Stored 'tf2_horovod_artifact_path' (str)


In [20]:
! aws s3 ls {tf2_horovod_artifact_path} --recursive

2021-10-11 11:45:57   12006151 cifar10-tf-dist-2021-10-11-11-37-52-205/output/model.tar.gz
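A later inference notebook may need the bucket and object key separately, for example to download the archive. As a minimal sketch, the artifact URI can be split with the standard library. `split_s3_uri` is a hypothetical helper and the URI below is a made-up example, not the path produced by this run.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical artifact path, used only for illustration.
bucket_name, key = split_s3_uri(
    "s3://example-bucket/cifar10-tf-dist-example/output/model.tar.gz"
)
print("bucket:", bucket_name)
print("key   :", key)
```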
