# [Module 1.5] Local Mode Training

All notebooks in this workshop run on the `conda_python3` kernel.

This notebook does the following:
- It trains the CIFAR-10 model with the SageMaker `PyTorch` estimator, first in local mode and then on a SageMaker training instance.

# PyTorch CIFAR-10 local training  



In [13]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [14]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar10"

role = sagemaker.get_execution_role()

In [15]:
import os
import subprocess

instance_type = "local"

try:
    if subprocess.call("nvidia-smi") == 0:
        # Use GPU local mode if an NVIDIA GPU is present
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so stay on CPU local mode
    pass

print("Instance type = " + instance_type)

Instance type = local_gpu
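
The detection step above can also be written without relying on an exception when `nvidia-smi` is missing. The following is a minimal sketch, not part of the original notebook; the helper name `detect_local_instance_type` is hypothetical, and it assumes that a successful `nvidia-smi` run means a usable GPU is present.

```python
import shutil
import subprocess


def detect_local_instance_type() -> str:
    """Return "local_gpu" if nvidia-smi runs successfully, otherwise "local"."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("nvidia-smi") is None:
        return "local"
    # Run nvidia-smi quietly; a zero exit code is taken to mean a GPU is usable.
    result = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return "local_gpu" if result.returncode == 0 else "local"


instance_type = detect_local_instance_type()
print("Instance type = " + instance_type)
```

Suppressing the command's output keeps the notebook cell clean, since only the exit code matters here.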


### Upload the data
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.

In [16]:
inputs = sagemaker_session.upload_data(path="../data", bucket=bucket, key_prefix="data/cifar10")
print("s3 inputs: ", inputs)

s3 inputs:  s3://sagemaker-ap-northeast-2-057716757052/data/cifar10
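
The returned URI in the output above has the form `s3://<bucket>/<key_prefix>`. The sketch below only illustrates that composition so the result can be predicted before uploading. The helper `expected_s3_uri` and the bucket name `my-example-bucket` are hypothetical, and the sketch assumes a directory upload, as in the cell above.

```python
def expected_s3_uri(bucket: str, key_prefix: str) -> str:
    """Compose the S3 URI expected from a directory upload (illustrative only)."""
    # Strip stray slashes so the prefix joins cleanly with the bucket name.
    return f"s3://{bucket}/{key_prefix.strip('/')}"


# "my-example-bucket" is a placeholder, not a real bucket.
print(expected_s3_uri("my-example-bucket", "data/cifar10"))
# prints: s3://my-example-bucket/data/cifar10
```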


# Construct a script for training 
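The training script is `source/train.py`, which the estimator below uses as its entry point. For local mode, the input channel points to the local data directory rather than S3: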

In [17]:
local_inputs = "file://../data"
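
Because a `file://` input only works if the directory exists on the notebook instance, it can help to validate the path before building the channel string. This is a minimal sketch under that assumption; the helper `local_channel` is hypothetical and not part of the SageMaker SDK.

```python
import os


def local_channel(path: str) -> str:
    """Return a file:// input string after checking the directory exists."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Local data directory not found: {path}")
    # Local mode accepts inputs of the form file://<path>, as used above.
    return "file://" + path


# Example (assumes ../data exists relative to the notebook):
# local_inputs = local_channel("../data")
```

Failing early here gives a clearer message than an error raised later inside the training container.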




In [18]:
from sagemaker.pytorch import PyTorch

# instance_type ("local" or "local_gpu") was detected in an earlier cell

local_cifar10_estimator = PyTorch(
    entry_point="train.py",
    source_dir="source",
    role=role,
    framework_version="1.6.0",
    py_version="py3",
    instance_count=1,
    instance_type=instance_type,
    # The keyword is sagemaker_session; LocalSession runs the job in a local container
    sagemaker_session=sagemaker.LocalSession(),
    hyperparameters={"epochs": 1, "lr": 0.1, "batch_size": 16},
)
# local_cifar10_estimator.fit(inputs)
local_cifar10_estimator.fit(local_inputs)

CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-training:1.6.0-gpu-py3']' returned non-zero exit status 1.
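
The error above does not state why the pull failed. One common cause, offered here only as an assumption, is that Docker is not authenticated against the ECR registry that hosts the image. The sketch below derives the registry host and region from the image URI in the error message and prints the corresponding AWS CLI login command. The helper `ecr_login_command` is hypothetical; it prints the command and does not run it.

```python
def ecr_login_command(image_uri: str) -> str:
    """Build the ECR docker-login command for the registry in image_uri."""
    registry = image_uri.split("/", 1)[0]
    # ECR registry hosts look like <account>.dkr.ecr.<region>.amazonaws.com
    parts = registry.split(".")
    if len(parts) < 6 or parts[1:3] != ["dkr", "ecr"]:
        raise ValueError(f"Not an ECR image URI: {image_uri}")
    region = parts[3]
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


# Image URI copied from the error message above.
image_uri = "763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-training:1.6.0-gpu-py3"
print(ecr_login_command(image_uri))
```

If the login succeeds and the pull still fails, the cause lies elsewhere (for example disk space or network access), which this sketch does not diagnose.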

## Training in SageMaker Host Mode

In [7]:
from sagemaker.pytorch import PyTorch

instance_type = 'ml.p3.2xlarge'

cifar10_estimator = PyTorch(
    entry_point="train.py",    
    source_dir='source',    
    role=role,
    framework_version='1.6.0',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={'epochs': 2, 
                     'lr': 0.01,
                     'batch_size': 16                     
                    }                      
    
)
cifar10_estimator.fit(inputs)

2021-07-28 12:16:02 Starting - Starting the training job...
2021-07-28 12:16:28 Starting - Launching requested ML instancesProfilerReport-1627474562: InProgress
......
2021-07-28 12:17:29 Starting - Preparing the instances for training............
2021-07-28 12:19:31 Downloading - Downloading input data
2021-07-28 12:19:31 Training - Downloading the training image.........
2021-07-28 12:21:03 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-07-28 12:21:03,925 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-07-28 12:21:03,950 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-07-28 12:21:03,961 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-07-28 12:21:04,451 sagemaker-training-to
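
When the container invokes the user training script, the estimator's `hyperparameters` are passed to it as command-line arguments. The contents of `source/train.py` are not shown in this notebook, so the following is only a sketch of how such a script might read them; the function name `parse_hyperparameters` and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters the estimator passes as --key value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=16)
    # parse_known_args ignores any extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulates the arguments produced by the host-mode estimator above.
args = parse_hyperparameters(["--epochs", "2", "--lr", "0.01", "--batch_size", "16"])
print(args.epochs, args.lr, args.batch_size)
# prints: 2 0.01 16
```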

## Saving the Model Artifact

In [11]:
artifact_path = cifar10_estimator.model_data
print("artifact_path: ", artifact_path)

%store artifact_path

artifact_path:  s3://sagemaker-ap-northeast-2-057716757052/pytorch-training-2021-07-28-12-16-02-088/output/model.tar.gz
Stored 'artifact_path' (str)
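
Later steps that download or inspect the artifact usually need the bucket and object key separately. This is a minimal sketch that splits an S3 URI such as `artifact_path` with the standard library; the helper `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split s3://bucket/key into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    # The path keeps a leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Artifact path copied from the cell output above.
example_uri = (
    "s3://sagemaker-ap-northeast-2-057716757052/"
    "pytorch-training-2021-07-28-12-16-02-088/output/model.tar.gz"
)
bucket_name, object_key = split_s3_uri(example_uri)
print(bucket_name)
print(object_key)
```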


In [12]:
# local_artifact_path = local_cifar10_estimator.model_data
# print("local_artifact_path: ", local_artifact_path)
# %store local_artifact_path

In [10]:
# ! aws s3 ls {local_artifact_path} --recursive