# [Module 1.7] Training a Model with SageMaker DDP

### All notebooks in this workshop run on the `conda_python3` kernel.

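This notebook does the following:
- After the preparation steps, trains the model with SageMaker distributed data parallel (DDP) in local mode on the current notebook instance
- Trains the model with SageMaker DDP in host mode on 2 instances
- Saves the trained model artifact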

## References:
- [Official PyTorch Horovod example](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/pytorch_horovod_mnist)
- Using PyTorch with SageMaker: [Use PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)

---

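# 1. Basic Setup
The `autoreload` extension reloads imported packages before each cell runs, so edits to local source files take effect without restarting the kernel.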

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

# 2. Upload the Dataset to S3


In [5]:
s3_inputs = sagemaker_session.upload_data(path="../data", bucket=bucket, key_prefix="data/cifar10")
print("s3 inputs: ", s3_inputs)

s3 inputs:  s3://sagemaker-us-east-1-189546603447/data/cifar10
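
As the output above shows, the returned value is an S3 URI made of the bucket name followed by the key prefix. A minimal sketch of that composition follows; `s3_uri` is a hypothetical helper introduced only for illustration and is not part of the SageMaker SDK.

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Compose an s3:// URI from a bucket and key prefix (hypothetical helper)."""
    # Strip stray slashes so "/data/cifar10/" and "data/cifar10" give the same URI.
    return f"s3://{bucket}/{key_prefix.strip('/')}"


# Bucket name taken from the output shown above.
print(s3_uri("sagemaker-us-east-1-189546603447", "data/cifar10"))
```

Building the URI this way can be handy when the same location has to be referenced without uploading the data again.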


# 3. Preparing for Model Training

## Removing old Docker containers from the system
- Use the commands below to free up disk space.

### Remove all stopped Docker containers

In [6]:
! df -h
! docker container prune -f 
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   95G   14G  88% /
/dev/xvdf       492G  1.3G  465G   1% /home/ec2-user/SageMaker
Total reclaimed space: 0B
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   95G   14G  88% /
/dev/xvdf       492G  1.3G  465G   1% /home/ec2-user/SageMaker


### Remove all unused Docker images

In [7]:
! df -h
! docker image prune -f --all
! df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   95G   14G  88% /
/dev/xvdf       492G  1.3G  465G   1% /home/ec2-user/SageMaker
Total reclaimed space: 0B
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        241G   80K  241G   1% /dev
tmpfs           241G  320K  241G   1% /dev/shm
/dev/xvda1      109G   95G   14G  88% /
/dev/xvdf       492G  1.3G  465G   1% /home/ec2-user/SageMaker


### Freeing additional space

If more space needs to be freed, run the following:
```
rm -rf /tmp/tmp*
```
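
Whether cleanup is needed at all can also be checked from Python before starting local mode. The sketch below uses only the standard library. The 15 GiB threshold is an assumption, chosen to sit a little above the 13.3GB training image listed later in this notebook, and `free_gib` and `has_room` are hypothetical helper names.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Return the free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path: str, required_gib: float) -> bool:
    """True if at least `required_gib` GiB are free on the filesystem of `path`."""
    return free_gib(path) >= required_gib


# Assumed threshold: slightly more than the 13.3GB image pulled in local mode.
REQUIRED_GIB = 15.0
print(f"free on /: {free_gib('/'):.1f} GiB, enough: {has_room('/', REQUIRED_GIB)}")
```

If the check reports too little space, the prune commands above are the first thing to try.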

# 4. Training in Local Mode
- Runs on the current notebook instance.

In [8]:
instance_type = "local_gpu"
sess = sagemaker.local.LocalSession()
inputs = 'file://../data'    
instance_count = 1
hyperparameters={"epochs": 2, 
                 'batch-size': 128,                     
                 'lr': 0.01,
                }    
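
SageMaker passes each entry of `hyperparameters` to the entry-point script as a command-line argument, for example `--epochs 2 --batch-size 128 --lr 0.01`. The script `train_ddp.py` is not shown in this notebook, so the following is only a sketch of how such a script might read these values; the defaults and the `parse_args` name are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    # argparse exposes "--batch-size" as the attribute `batch_size`.
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--lr", type=float, default=0.01)
    return parser.parse_args(argv)


# Simulate the arguments generated from the hyperparameters dict above.
args = parse_args(["--epochs", "2", "--batch-size", "128", "--lr", "0.01"])
print(args.epochs, args.batch_size, args.lr)
```

Keeping the dictionary keys identical to the script's flag names (including the hyphen in `batch-size`) avoids silent fallbacks to default values.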

In [9]:
from sagemaker.pytorch import PyTorch

job_name ='cifar10-sm-local-ddp'

estimator = PyTorch(
    entry_point="train_ddp.py",    
#     source_dir='source/ddp',    
    source_dir='source',        
    base_job_name = job_name,
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    instance_count= instance_count,
    instance_type= instance_type,
    sagemaker_session= sess,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters=hyperparameters,    
    debugger_hook_config=False,
)
estimator.fit({"training" : inputs}, wait=False)
#estimator.fit({"training" : "file://../data"})

Creating 1d90cppl4s-algo-1-6o5j2 ... 
Creating 1d90cppl4s-algo-1-6o5j2 ... done
Attaching to 1d90cppl4s-algo-1-6o5j2
1d90cppl4s-algo-1-6o5j2 | 2022-03-06 12:12:36,808 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
1d90cppl4s-algo-1-6o5j2 | 2022-03-06 12:12:36,887 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
1d90cppl4s-algo-1-6o5j2 | 2022-03-06 12:12:36,890 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
1d90cppl4s-algo-1-6o5j2 | 2022-03-06 12:12:36,890 sagemaker_pytorch_container.training INFO     Invoking user training script.
1d90cppl4s-algo-1-6o5j2 | 2022-03-06 12:12:37,113 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
1d90cppl4s-algo-1-6o5j2 | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
1d90cppl4s-algo-1-6o5j2 | Collecting torchsummary==1.5.1
1d90cppl4s-a

#### Confirm that the Docker image was downloaded in local mode

In [20]:
! docker image ls

REPOSITORY                                                      TAG                 IMAGE ID            CREATED             SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.8.1-gpu-py36      b4191cf0b8c9        2 months ago        13.3GB


# 5. Training in Host Mode

In [21]:
import os
import subprocess

job_name ='cifar10-sm-ddp'
instance_type="ml.p3.16xlarge"


sess = sagemaker.Session()
inputs = s3_inputs
instance_count = 2
hyperparameters={"epochs": 20, 
                 'batch-size': 128,                     
                 'lr': 0.01,
                }        
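
With two `ml.p3.16xlarge` instances, each of which has 8 GPUs, data-parallel training runs 16 worker processes. The sketch below works out the resulting global batch size under the assumption that `batch-size` is applied per GPU; whether that holds depends on `train_ddp.py`, which is not shown here. The helper names and the lookup table are introduced for illustration only.

```python
# GPU counts for the instance types mentioned in this notebook.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def world_size(instance_type: str, instance_count: int) -> int:
    """Total number of GPU workers across all instances."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def global_batch_size(per_gpu_batch: int, instance_type: str, instance_count: int) -> int:
    """Samples processed per optimizer step, assuming a per-GPU batch size."""
    return per_gpu_batch * world_size(instance_type, instance_count)


print(world_size("ml.p3.16xlarge", 2))              # 16
print(global_batch_size(128, "ml.p3.16xlarge", 2))  # 2048
```

Under that assumption the global batch is much larger than in the single-GPU case, which is one reason the learning rate is often revisited when scaling out.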


In [22]:
from sagemaker.pytorch import PyTorch

job_name ='cifar10-ddp'

estimator = PyTorch(
    entry_point="train_ddp.py",    
    source_dir='source',    
    base_job_name = job_name,
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count= instance_count,
    # For p3dn instances use ml.p3dn.24xlarge; for p4d instances use ml.p4d.24xlarge
    instance_type= instance_type,
    sagemaker_session= sess,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters=hyperparameters,    
    debugger_hook_config=False,
)
estimator.fit({"training" : inputs}, wait=False)
#estimator.fit({"training" : "file://../data"})

In [23]:
estimator.logs()

2021-09-27 14:04:34 Starting - Starting the training job...
2021-09-27 14:04:56 Starting - Launching requested ML instances
ProfilerReport-1632751472: InProgress
.........
2021-09-27 14:06:26 Starting - Preparing the instances for training.........
2021-09-27 14:08:04 Downloading - Downloading input data...
2021-09-27 14:08:18 Training - Downloading the training image......................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-09-27 14:12:05,497 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-09-27 14:12:05,574 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-09-27 14:12:08,604 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel


# 6. Cleanup

## Saving the model artifact
- Store the S3 path of the trained model artifact so it can be reused later for inference.

In [24]:
ddp_artifact_path = estimator.model_data
print("ddp_artifact_path: ", ddp_artifact_path)


%store ddp_artifact_path

ddp_artifact_path:  s3://sagemaker-us-east-1-057716757052/cifar10-ddp-2021-09-27-14-04-32-403/output/model.tar.gz
Stored 'ddp_artifact_path' (str)


In [25]:
! aws s3 ls {ddp_artifact_path} --recursive

2021-09-27 14:15:01     230774 cifar10-ddp-2021-09-27-14-04-32-403/output/model.tar.gz
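
The stored path is a plain `s3://` URI. Code that later downloads the artifact for inference usually needs the bucket and object key separately, so here is a minimal standard-library sketch of that split; `split_s3_uri` is a hypothetical helper, not a SageMaker API.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Artifact path taken from the output shown above.
bucket_name, key = split_s3_uri(
    "s3://sagemaker-us-east-1-057716757052/"
    "cifar10-ddp-2021-09-27-14-04-32-403/output/model.tar.gz"
)
print(bucket_name)
print(key)
```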
