# [Module 1.6] Horovod Training

All notebooks in this workshop run on the `conda_python3` kernel.

This notebook does the following:
- Trains a CIFAR-10 model with Horovod by running `train_horovod.py` through the SageMaker PyTorch estimator in local mode.

# PyTorch CIFAR-10 local training

In [12]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

In [14]:
import os
import subprocess

instance_type = "local"

try:
    # nvidia-smi exits with status 0 when a GPU driver is available.
    if subprocess.call("nvidia-smi") == 0:
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so stay on CPU local mode.
    pass

print("Instance type = " + instance_type)

Instance type = local_gpu


### Upload the data
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, is the S3 URI of the uploaded data; we pass it to the training job later.

In [15]:
inputs = sagemaker_session.upload_data(path="../data", bucket=bucket, key_prefix="data/cifar10")
print("s3 inputs: ", inputs)

s3 inputs:  s3://sagemaker-ap-northeast-2-057716757052/data/cifar10
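As the output above shows, the URI returned for this directory upload has the form `s3://<bucket>/<key_prefix>`. When a later step needs the bucket and key separately, the URI can be split with the standard library. The following is a minimal sketch; `split_s3_uri` is a hypothetical helper name, not part of the SageMaker SDK, and it makes no AWS calls.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: " + uri)
    # netloc holds the bucket name; path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# "my-bucket" is a placeholder bucket name.
bucket_name, key = split_s3_uri("s3://my-bucket/data/cifar10")
print(bucket_name, key)
```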


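# Construct a script for training
The training script, `train_horovod.py`, lives in the `source` directory and is not reproduced here. The following cells configure the estimator that runs it: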

In [16]:
# Force GPU local mode; switch to the commented line to train on a SageMaker instance.
instance_type = "local_gpu"
# instance_type = "ml.p3.8xlarge"

job_name = "cifar10-horovod"
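In script mode, SageMaker passes the estimator's `hyperparameters` to the entry point as command-line arguments and exposes the input channel and model output directory through environment variables such as `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR`. Since `train_horovod.py` is not shown in this notebook, the sketch below is only an assumption about how such a script might read those values. The `--data-dir` and `--model-dir` argument names and the `./data` and `./model` fallbacks are placeholders.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided directories (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters set on the estimator arrive as "--name value" pairs.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--backend", type=str, default="gloo")
    # Inside the training container SageMaker sets these environment variables;
    # the fallbacks are placeholders for running the script outside a container.
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    # parse_known_args ignores any extra arguments the platform may append.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--backend", "gloo"])
print(args.epochs, args.backend)
```

The channel name `training` used in the `fit` call is what gives the `SM_CHANNEL_TRAINING` variable its suffix.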

In [17]:
from sagemaker.pytorch import PyTorch

cifar10_estimator = PyTorch(
    entry_point="train_horovod.py",
    source_dir="source",
    base_job_name=job_name,
    role=role,
    framework_version="1.6.0",
    py_version="py3",
    # sagemaker>=2 renamed train_instance_count/train_instance_type to these names.
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={"epochs": 1, "backend": "gloo"},
)
cifar10_estimator.fit({"training": inputs})


Creating 4f3ecy1q7r-algo-1-1zxze ... done
Attaching to 4f3ecy1q7r-algo-1-1zxze
4f3ecy1q7r-algo-1-1zxze | 2021-06-08 02:10:31,698 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
4f3ecy1q7r-algo-1-1zxze | 2021-06-08 02:10:31,740 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
4f3ecy1q7r-algo-1-1zxze | 2021-06-08 02:10:31,743 sagemaker_pytorch_container.training INFO     Invoking user training script.
4f3ecy1q7r-algo-1-1zxze | 2021-06-08 02:10:31,926 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
4f3ecy1q7r-algo-1-1zxze | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
4f3ecy1q7r-algo-1-1zxze | Collecting torchsummary==1.5.1
4f3ecy1q7r-algo-1-1zxze |   Downloading torchsummary-1.5.1-py3-none-any.whl (2.8 kB)
4f3ecy1q7r-algo-1-1zxze | Collecting sagema

In [20]:
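# S3 URI of the model artifact produced by the training job.
horovod_artifact_path = cifar10_estimator.model_data
print("horovod_artifact_path: ", horovod_artifact_path)

# Persist the path so that other notebooks can restore it with %store -r.
%store horovod_artifact_path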

horovod_artifact_path:  s3://sagemaker-ap-northeast-2-057716757052/cifar10-horovod-2021-06-08-02-10-25-236/model.tar.gz
Stored 'horovod_artifact_path' (str)


In [21]:
! aws s3 ls {horovod_artifact_path} --recursive

2021-06-08 02:11:07     230777 cifar10-horovod-2021-06-08-02-10-25-236/model.tar.gz
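The listing confirms that `model.tar.gz` exists, but not what it contains. Once the archive has been downloaded to a local path, its members can be listed with the standard `tarfile` module. The sketch below is illustrative: `list_artifact_members` is a hypothetical helper, and the demo builds a stand-in archive with a placeholder file named `model.pth`, because the real artifact's contents are not shown in this notebook.

```python
import os
import tarfile
import tempfile


def list_artifact_members(archive_path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo on a stand-in archive; "model.pth" is a placeholder file name.
with tempfile.TemporaryDirectory() as tmp:
    weights = os.path.join(tmp, "model.pth")
    with open(weights, "wb") as f:
        f.write(b"placeholder")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(weights, arcname="model.pth")
    print(list_artifact_members(archive))
```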
