# [Module 2.1] Distributed Training on SageMaker

This notebook uses the 'conda_python3' kernel.

---
This notebook trains on a single instance (ml.g4dn.12xlarge) using the multi-GPU support in PyTorch Lightning.

# 1. Environment Setup


## Basic Settings
Imported packages are reloaded automatically, so changes to local modules take effect without restarting the kernel.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('./scripts')

In [2]:
import sagemaker

sagemaker.__version__

# sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

## Parameter Settings

In [17]:
import torch
import os

epochs = 5
num_gpus = torch.cuda.device_count()
# model_dir = 'model'
# num_gpus = 4
# train_notebook = True

print("num_gpus: ", num_gpus)
print("epochs: ", epochs)



num_gpus:  8
epochs:  5
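
The `num_gpus` value printed above is later handed to the training script as `n_gpus`. As a minimal sketch, assuming the script builds a PyTorch Lightning `Trainer` from that number, the mapping might look like the following. The helper name `trainer_kwargs` is hypothetical and the actual `TFT_Train.py` may do this differently; the argument names (`accelerator`, `devices`, `strategy`, `max_epochs`) follow recent PyTorch Lightning releases, while older releases used `gpus=` instead.

```python
def trainer_kwargs(n_gpus: int, epochs: int) -> dict:
    """Build keyword arguments for pytorch_lightning.Trainer (hypothetical helper)."""
    if n_gpus > 1:
        # Several GPUs on one instance: one DDP process per GPU.
        return {"accelerator": "gpu", "devices": n_gpus,
                "strategy": "ddp", "max_epochs": epochs}
    if n_gpus == 1:
        # Single GPU: no distributed strategy needed.
        return {"accelerator": "gpu", "devices": 1, "max_epochs": epochs}
    # No GPU found: fall back to CPU training.
    return {"accelerator": "cpu", "devices": 1, "max_epochs": epochs}


# Inside the training script (requires pytorch_lightning to be installed):
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs(n_gpus=8, epochs=5))
print(trainer_kwargs(n_gpus=8, epochs=5))
```

Keeping the decision in a small function makes it easy to run the same script on a CPU-only machine for a quick smoke test.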


# 2. SageMaker Local Mode Training
#### Choose instance_type based on whether the local machine has a GPU

In [18]:
import os
import subprocess


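try:
    # nvidia-smi exits with 0 when a GPU driver is present
    if subprocess.call("nvidia-smi") == 0:
        instance_type = "local_gpu"
    else:
        instance_type = "local"
except OSError:
    # nvidia-smi is not installed, so fall back to CPU local mode
    instance_type = "local"

print("Instance type = " + instance_type)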

Mon Apr  3 14:24:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   29C    P0    40W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |      3MiB / 16384MiB |      0%      Default |
|       

## Run Training in Local Mode
- The two lines below select local mode training.
```python
    instance_type=instance_type,  # 'local_gpu' or 'local'
    sagemaker_session=sagemaker.LocalSession(),  # use a local session
```

In [19]:
hyperparameters = {'epochs': epochs, 
                   'n_gpus': num_gpus,
                    }  
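
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument (for example `--epochs 5 --n_gpus 8`). A minimal sketch of how the entry point might read them, assuming a plain `argparse` parser. The function name `parse_args` and the default values are illustrative and are not taken from `TFT_Train.py`; `SM_NUM_GPUS` and `SM_MODEL_DIR` are environment variables set by the SageMaker training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed by SageMaker (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=5)
    # Fall back to the GPU count reported by the container when n_gpus is not passed.
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", 0)))
    # Directory whose contents SageMaker packs into model.tar.gz.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    # parse_known_args tolerates extra arguments instead of exiting.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "5", "--n_gpus", "8"])
print(args.epochs, args.n_gpus, args.model_dir)
```

Reading `n_gpus` from `SM_NUM_GPUS` by default means the same script can run in the cloud job without that hyperparameter being set explicitly.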

In [20]:
from sagemaker.pytorch import PyTorch
import os
import subprocess


local_estimator = PyTorch(
    entry_point="TFT_Train.py",    
    source_dir='src',    
    role=role,
    framework_version='1.12.1',    
    py_version='py38',        
    instance_count=1,
    instance_type=instance_type,  # 'local_gpu' or 'local'
    sagemaker_session=sagemaker.LocalSession(),  # use a local session
    hyperparameters= hyperparameters               
    
)
local_estimator.fit()

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-04-03-14-24-10-049
INFO:sagemaker.local.local_session:Starting training job
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-56zlg:
    command: train
    container_name: ktn1e20lvt-algo-1-56zlg
    deploy:
      resources:
        reservations:
          devices:
          - capabilities:
            - gpu
    envir

Creating ktn1e20lvt-algo-1-56zlg ... 
Creating ktn1e20lvt-algo-1-56zlg ... done
Attaching to ktn1e20lvt-algo-1-56zlg
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,698 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,763 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,772 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,775 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,783 sagemaker_pytorch_container.training INFO     Invoking user training script.
ktn1e20lvt-algo-1-56zlg | 2023-04-03 14:24:13,848 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2Instance

INFO:root:creating /tmp/tmpf53v9rtk/artifacts/output/data
INFO:root:copying /tmp/tmpf53v9rtk/algo-1-56zlg/output/success -> /tmp/tmpf53v9rtk/artifacts/output
INFO:root:copying /tmp/tmpf53v9rtk/model/model.pth -> /tmp/tmpf53v9rtk/artifacts/model


ktn1e20lvt-algo-1-56zlg exited with code 0
Aborting on container exit...
===== Job Complete =====


# 3. SageMaker Cloud Mode


Links related to resource profiling:
- [Profiling setup](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configuration-for-profiling.html)
- [Debugger Python SDK](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_builtin_rule/tf-mnist-builtin-rule.html)
- [Open the Amazon SageMaker Debugger Insights Dashboard](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights.html)
- [New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger](https://aws.amazon.com/blogs/aws/profile-your-machine-learning-training-jobs-with-amazon-sagemaker-debugger/)

## Parameter Setup

In [24]:
instance_type = 'ml.g4dn.12xlarge'  # 4 NVIDIA T4 GPUs

epochs = 500
hyperparameters = {'epochs': epochs, 
                    }  

In [25]:
from sagemaker.pytorch import PyTorch
import os

from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs


profiler_config=ProfilerConfig(
    system_monitor_interval_millis=1000
)
rules=[
    # ProfilerRule.sagemaker(rule_configs.BuiltInRule())
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]


estimator = PyTorch(
    entry_point="TFT_Train.py",    
    source_dir='src',    
    role=role,
    framework_version='1.12.1',    
    py_version='py38',     
    instance_count=1,
    instance_type=instance_type,  # cloud instance type set above (ml.g4dn.12xlarge)
    sagemaker_session=sagemaker.Session(),
    hyperparameters= hyperparameters,
    profiler_config=profiler_config,
    rules=rules,
    
)
estimator.fit(wait=False)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-04-03-14-45-07-300
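
Because `fit(wait=False)` returns immediately, the job keeps running in the background. A minimal sketch for polling its status, assuming the `TrainingJobStatus` field returned by `describe_training_job`. The helper `wait_for_job` is hypothetical; it accepts any callable that returns such a response, so it can be tried without AWS access.

```python
import time

# Training job states after which the status no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=30, max_polls=1000):
    """Poll describe() until the job reaches a terminal state and return it."""
    for _ in range(max_polls):
        status = describe()["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not finish within the polling budget")


# Offline demonstration with a fake describe() that finishes on the third call.
_responses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_job(
    lambda: {"TrainingJobStatus": next(_responses)}, poll_seconds=0
)
print(final_status)

# With the real estimator (requires AWS access):
#   wait_for_job(estimator.latest_training_job.describe)
```

Calling `estimator.logs()`, as in the next cell, is the simpler option when streaming the full log is acceptable; polling is useful when only the final status matters.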


In [26]:
estimator.logs()

2023-04-03 14:45:15 Starting - Starting the training job...
2023-04-03 14:45:37 Starting - Preparing the instances for trainingProfilerReport: InProgress
......
2023-04-03 14:46:43 Downloading - Downloading input data
2023-04-03 14:46:43 Training - Downloading the training image..................
2023-04-03 14:49:38 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-03 14:50:24,783 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-03 14:50:24,819 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-04-03 14:50:24,828 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-03 14:50:24,831 sagemaker_pytorch_container.training INFO     Invoking user training script.

# 4. Check the Model Weights File

In [12]:
print("model artifact: \n", estimator.model_data)

model artifact: 
 s3://sagemaker-us-east-1-057716757052/pytorch-training-2023-03-07-06-33-33-202/output/model.tar.gz
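
The artifact is a `model.tar.gz` archive in S3. A minimal sketch for splitting the URI into bucket and key and then downloading and unpacking it, assuming `boto3` and valid AWS credentials are available. The helper names `split_s3_uri` and `download_and_extract` and the target directory are hypothetical; the archive presumably contains the `model.pth` file seen in the local mode log, but that is not verified here.

```python
import os
import tarfile
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_and_extract(s3_uri: str, target_dir: str = "model_artifact"):
    """Download model.tar.gz and unpack it (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(s3_uri)
    os.makedirs(target_dir, exist_ok=True)
    local_tar = os.path.join(target_dir, "model.tar.gz")
    boto3.client("s3").download_file(bucket, key, local_tar)
    with tarfile.open(local_tar, "r:gz") as tar:
        tar.extractall(target_dir)
    return target_dir


bucket, key = split_s3_uri(
    "s3://sagemaker-us-east-1-057716757052/"
    "pytorch-training-2023-03-07-06-33-33-202/output/model.tar.gz"
)
print(bucket)
print(key)

# With the real estimator (requires AWS access):
#   download_and_extract(estimator.model_data)
```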


# 5. SageMaker Debug Report

Log in to SageMaker Studio, click the Experiments menu, and then click Unassigned runs.
![sm_debug_01.png](img/sm_debug_01.png)

Click the run you executed. The example below uses the most recent run. Then click Debug in the left menu and click Training job at the bottom.
![sm_debug_02.png](img/sm_debug_02.png)

Click "Download report" to download the report.
![sm_debug_03.png](img/sm_debug_03.png)

## Debug Profiler Report

Click to download --> [profiler-report.pdf](img/profiler-report.pdf)
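
The report can also be located in S3 without the Studio UI. A minimal sketch, assuming the rule output is written under `<output_path>/<job_name>/rule-output`, next to the `output/model.tar.gz` layout shown earlier. The helper `rule_output_prefix` is hypothetical, and the bucket and job name below are copied from the outputs in this notebook purely for illustration; the exact subfolder names under `rule-output` should be checked by listing the prefix.

```python
def rule_output_prefix(output_path: str, job_name: str) -> str:
    """Return the assumed S3 prefix that holds Debugger rule output."""
    return f"{output_path.rstrip('/')}/{job_name}/rule-output"


prefix = rule_output_prefix(
    "s3://sagemaker-us-east-1-057716757052/",
    "pytorch-training-2023-04-03-14-45-07-300",
)
print(prefix)

# With the real estimator (requires AWS access):
#   from sagemaker.s3 import S3Downloader
#   prefix = rule_output_prefix(estimator.output_path,
#                               estimator.latest_training_job.job_name)
#   print(S3Downloader.list(prefix))            # inspect what the rule produced
#   S3Downloader.download(prefix, "rule-output")
```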
