# [Module 1.6] 호로보드로 분산 훈련 (로컬 모드 및 호스트 모드)

### 본 워크샵의 모든 노트북은 `conda_python3` 여기에서 작업 합니다.

이 노트북은 아래와 같은 작업을 합니다.
- 준비 작업을 걸쳐서 현재의 노트북 인스턴스에서 로컬 모드로 호로보드로 모델 훈련
- 호스트 모드에서 2개의 인스턴스로 호로보드 모델 훈련
- 훈련된 모델 아티펙트 저장

## 참고:
- [파이토치 호로보드 공식 예시](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/pytorch_horovod_mnist)
- 세이지 메이커로 파이토치 사용 --> [Use PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)

---

# 1. 기본 세팅
사용하는 패키지는 import 시점에 다시 재로딩 합니다.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

In [3]:
import os
import subprocess

instance_type = "local"

try:
    if subprocess.call("nvidia-smi") == 0:
        ## Set type to GPU if one is present
        instance_type = "local_gpu"
except:
    pass

print("Instance type = " + instance_type)

Instance type = local_gpu


# 2. 데이터 세트를 S3에 업로드


In [4]:
inputs = sagemaker_session.upload_data(path="../data", bucket=bucket, key_prefix="data/cifar10")
print("s3 inputs: ", inputs)

s3 inputs:  s3://sagemaker-us-east-1-057716757052/data/cifar10


# 3. 모델 훈련 준비

In [5]:
import os
import subprocess

instance_type = "local_gpu"
# instance_type = "ml.p3.8xlarge"

job_name ='cifar10-horovod'

## 시스템의 이전 도커 컨테이너 삭제
- 아래와 같은 명령어를 사용하여 저장 공간을 확보 합니다.

### 도커 컨테이너 모두 삭제

In [6]:
! df -h
! docker container prune -f --all
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   91G   18G  85% /
/dev/xvdf       984G   23G  911G   3% /home/ec2-user/SageMaker
unknown flag: --all
See 'docker container prune --help'.
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   91G   18G  85% /
/dev/xvdf       984G   23G  911G   3% /home/ec2-user/SageMaker


### 도커 이미지 모두 삭제

In [7]:
! df -h
! docker image prune -f --all
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   91G   18G  85% /
/dev/xvdf       984G   23G  911G   3% /home/ec2-user/SageMaker
Total reclaimed space: 0B
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   91G   18G  85% /
/dev/xvdf       984G   23G  911G   3% /home/ec2-user/SageMaker


# 4. 로컬모드로 훈련 
- 현 실행 노트북 인스턴스에서 실행

In [8]:
from sagemaker.pytorch import PyTorch

cifar10_estimator = PyTorch(
    entry_point="train_horovod.py",    
    source_dir='source',    
    base_job_name = job_name,
    role=role,
    framework_version='1.6.0',
    py_version='py3',
    train_instance_count=1,
    train_instance_type=instance_type,
    hyperparameters={"epochs": 5, 
                     'lr': 0.001,
                     'batch-size': 64,
                     'log-interval' : 100,
                     "backend": "gloo",                     
                    },    
)
cifar10_estimator.fit({"training" : inputs})

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Creating w7brcs2cp0-algo-1-y7hwy ... 
Creating w7brcs2cp0-algo-1-y7hwy ... done
Attaching to w7brcs2cp0-algo-1-y7hwy
[36mw7brcs2cp0-algo-1-y7hwy |[0m 2021-09-27 11:26:40,048 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
[36mw7brcs2cp0-algo-1-y7hwy |[0m 2021-09-27 11:26:40,129 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
[36mw7brcs2cp0-algo-1-y7hwy |[0m 2021-09-27 11:26:40,132 sagemaker_pytorch_container.training INFO     Invoking user training script.
[36mw7brcs2cp0-algo-1-y7hwy |[0m 2021-09-27 11:26:40,298 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
[36mw7brcs2cp0-algo-1-y7hwy |[0m /opt/conda/bin/python3.6 -m pip install -r requirements.txt
[36mw7brcs2cp0-algo-1-y7hwy |[0m Collecting torchsummary==1.5.1
[36mw7brcs2cp0-algo-1-y7hwy |[0m   Downloading torchsummary-1.5.1-py3-none-any.whl (2.8 kB)
[36mw7brcs2cp0-algo-1-y7hwy |[0m Installing collec

#### 로컬모드에서 도커 이미지 다운로드 된 것을 확인

In [9]:
! docker image ls

REPOSITORY                                                      TAG                 IMAGE ID            CREATED             SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.6.0-gpu-py3       30e42e4701a4        5 months ago        8.6GB


# 5. 호스트 모드로 훈련

In [10]:
from sagemaker.pytorch import PyTorch

instance_type = 'ml.p3.8xlarge'

cifar10_estimator = PyTorch(
    entry_point="train_horovod.py",    
    source_dir='source',    
    base_job_name = job_name,
    role=role,
    framework_version='1.6.0',
    py_version='py3',
    train_instance_count=2,
    train_instance_type=instance_type,
    hyperparameters={"epochs": 10, 
                     'lr': 0.001,
                     'batch-size': 64,
                     "backend": "gloo",                     
                    },    
)
cifar10_estimator.fit({"training" : inputs, wait=False})

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-09-27 11:32:22 Starting - Starting the training job...
2021-09-27 11:32:46 Starting - Launching requested ML instancesProfilerReport-1632742342: InProgress
.........
2021-09-27 11:34:07 Starting - Preparing the instances for training.........
2021-09-27 11:35:47 Downloading - Downloading input data...
2021-09-27 11:36:07 Training - Downloading the training image.....[35mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[35mbash: no job control in this shell[0m
[35m2021-09-27 11:37:06,790 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[35m2021-09-27 11:37:06,833 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[35m2021-09-27 11:37:08,257 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[35m2021-09-27 11:37:08,666 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:[0m
[35m/opt/conda/bin/pyth

In [14]:
# cifar10_estimator.logs()

# 6. 정리 작업

## 모델 아티펙트 저장
- S3 에 저장된 모델 아티펙트를 저장하여 추론시 사용합니다.

In [15]:
horovod_artifact_path = cifar10_estimator.model_data
print("horovod_artifact_path: ", horovod_artifact_path)


%store horovod_artifact_path

horovod_artifact_path:  s3://sagemaker-us-east-1-057716757052/cifar10-horovod-2021-09-27-11-32-22-045/output/model.tar.gz
Stored 'horovod_artifact_path' (str)


In [16]:
! aws s3 ls {horovod_artifact_path} --recursive

2021-09-27 11:40:28     461492 cifar10-horovod-2021-09-27-11-32-22-045/output/model.tar.gz
