# [Module 3.1] Training with a Custom Container

---

Get the SageMaker session and the role information.
- These two pieces of information are used to connect to the SageMaker Hosting Cluster.

In [19]:
import os
import sagemaker
from sagemaker import get_execution_role

# Create the session once and reuse it for uploads and training.
sagemaker_session = sagemaker.Session()

role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'byoc-cifar10'


## Uploading local data to S3
Upload the local data to S3 so that it can be used as input during training.

In [20]:
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

's3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10'
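The training cells in this notebook pass one S3 URI per channel, each formed by appending the channel name to `dataset_location`. A minimal sketch of that step, assuming a hypothetical helper `build_channel_inputs` (not part of the SageMaker SDK) and a placeholder bucket name:

```python
def build_channel_inputs(dataset_location, channels=("train", "validation", "eval")):
    """Map each channel name to '<dataset_location>/<channel>'.

    Hypothetical helper for illustration. It only formats strings and
    does not check that the S3 prefixes exist.
    """
    base = dataset_location.rstrip("/")
    return {channel: f"{base}/{channel}" for channel in channels}


# Placeholder URI; substitute the value returned by upload_data().
inputs = build_channel_inputs("s3://example-bucket/data/DEMO-cifar10")
print(inputs["train"])
```

The resulting dictionary has the same shape as the argument given to `est.fit(...)` in the training cells.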

In [21]:
%store -r container_image_uri
print (container_image_uri)

057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest


In [22]:
train_code_file = 'cifar10_keras_sm_tf2.py'

<h3>Training with a custom SDK framework estimator</h3>

Running training with the generic Estimator, as a later section of this notebook does, requires uploading the code to Amazon S3 and injecting reserved hyperparameters by hand. To simplify this, you can define a custom framework estimator with the Amazon SageMaker Python SDK and run training with that class, which takes care of these tasks.

## Specifying the output bucket
Specify the bucket where results such as model artifacts are stored. If no output bucket is specified, the results are stored in the SageMaker default bucket.

In [25]:
output_bucket = 'gonsoo-share' # Replace with your own bucket
output_prefix = 'cifar10-output'
s3_output_path = f's3://{output_bucket}/{output_prefix}'
print(s3_output_path)

s3://gonsoo-share/cifar10-output
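The fallback described above (use the given output bucket, otherwise the default bucket) can be sketched as a small function. `resolve_output_path` is a hypothetical helper and the bucket names are placeholders; the exact layout the SDK uses under the default bucket is not shown here.

```python
def resolve_output_path(default_bucket, output_bucket=None, output_prefix=""):
    """Return the S3 URI under which artifacts would be written.

    Hypothetical helper: falls back to `default_bucket` when no
    output bucket is given.
    """
    bucket = output_bucket or default_bucket
    prefix = output_prefix.strip("/")
    return f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"


# Explicit bucket and prefix.
explicit_path = resolve_output_path("example-default-bucket", "example-bucket", "cifar10-output")
# No bucket given, so the default bucket is used.
fallback_path = resolve_output_path("example-default-bucket")
print(explicit_path)
print(fallback_path)
```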


In [34]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None
        
import sagemaker

# 'model_dir' : './logs' is a dummy value, kept for compatibility with the existing code.
# The actual model artifacts are saved to s3_output_path.
est = CustomFramework(image_name=container_image_uri,
                      entry_point=train_code_file,
                      source_dir='source_dir/',
                      output_path = s3_output_path,                                             
                      role=role,
                      train_instance_count=1, 
                      train_instance_type='local', # we use local mode
                      #train_instance_type='ml.m5.xlarge',
                      base_job_name='byoc-cifar10',
                      hyperparameters={'epochs': "1", 
                                       'model_dir' : './logs'
                                      },    

                      
                     )

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})


train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
image_name has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
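The warnings above say only that the parameters were renamed. According to the SDK v2 migration guide they link to, the new names are `instance_count`, `instance_type`, and `image_uri`. A minimal sketch of translating the old keyword names, assuming a hypothetical helper `to_v2_kwargs` that is not part of the SDK:

```python
# v1 -> v2 keyword names for the three parameters named in the warnings.
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "image_name": "image_uri",
}


def to_v2_kwargs(kwargs):
    """Return a copy of `kwargs` with v1 names replaced by v2 names (hypothetical helper)."""
    return {V2_RENAMES.get(key, key): value for key, value in kwargs.items()}


# Placeholder values for illustration only.
v1_kwargs = {"train_instance_count": 1, "train_instance_type": "local", "role": "example-role"}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

Keys without a v2 replacement, such as `role`, pass through unchanged.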


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating avmm6df9lw-algo-1-ijk1e ... 
Creating avmm6df9lw-algo-1-ijk1e ... done
Attaching to avmm6df9lw-algo-1-ijk1e
avmm6df9lw-algo-1-ijk1e | 2021-02-27 03:20:39,416 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
avmm6df9lw-algo-1-ijk1e | /usr/bin/python3 -m pip install -r requirements.txt
avmm6df9lw-algo-1-ijk1e |   from cryptography.utils import int_from_bytes
avmm6df9lw-algo-1-ijk1e |   from cryptography.utils import int_from_bytes
avmm6df9lw-algo-1-ijk1e | Collecting pandas<2
avmm6df9lw-algo-1-ijk1e |   Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
avmm6df9lw-algo-1-ijk1e | Collecting pytz>=2017.2
avmm6df9lw-algo-1-ijk1e |   Downloading pytz-2021.1-py2.py3-none-

## Running from code in S3

Save the current training_job_name.
- The training_job_name gives access to the training details and to the S3 path of the **Model Artifact** file that training produces.

In [6]:
import os
import tarfile
import tempfile  # required by the mkstemp() fallback below

def create_tar_file(source_files, target=None):
    # Write to `target` if given, otherwise to a new temporary file.
    if target:
        filename = target
    else:
        fd, filename = tempfile.mkstemp()
        os.close(fd)  # tarfile reopens the path, so release the descriptor

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Store each file at the root of the archive under its base name.
            t.add(sf, arcname=os.path.basename(sf))
    return filename

create_tar_file(["source_dir/cifar10_keras_sm_tf2.py", "source_dir/requirements.txt"], "sourcedir.tar.gz")

'sourcedir.tar.gz'
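Because each file is added under its base name, the entry point should sit at the root of the archive rather than under `source_dir/`. A minimal sketch for checking this, using a throwaway file in a temporary directory (the file name `train.py` and the helper `list_tar_members` are assumptions for illustration):

```python
import os
import tarfile
import tempfile


def list_tar_members(tar_path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        return sorted(archive.getnames())


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical script stored in a nested directory.
    script = os.path.join(workdir, "nested", "train.py")
    os.makedirs(os.path.dirname(script))
    with open(script, "w") as handle:
        handle.write("print('hello')\n")

    # Archive it under its base name, as create_tar_file does.
    archive_path = os.path.join(workdir, "sourcedir.tar.gz")
    with tarfile.open(archive_path, mode="w:gz") as archive:
        archive.add(script, arcname=os.path.basename(script))

    members = list_tar_members(archive_path)

print(members)
```

The listing contains only the base name, with no directory component.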

In [7]:
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print(sources)
! rm sourcedir.tar.gz

s3://sagemaker-ap-northeast-2-057716757052/byoc-cifar10/code/sourcedir.tar.gz


When starting the training job, we need to tell the sagemaker-training-toolkit library where the sources are stored in Amazon S3 and which module to invoke. These parameters are specified through the following reserved hyperparameters (they are injected automatically when using the framework estimators of the Amazon SageMaker Python SDK):
<ul>
    <li>sagemaker_program</li>
    <li>sagemaker_submit_directory</li>
</ul>

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).
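To make the correspondence with CreateTrainingJob more concrete, the sketch below assembles a request as a plain dictionary. The field names follow the CreateTrainingJob API reference; the helper `build_training_job_request`, the placeholder values, and the volume size and runtime limit are assumptions for illustration. The SDK may set additional fields, and nothing is sent to AWS here.

```python
import json


def build_training_job_request(job_name, image_uri, role_arn, hyperparameters,
                               channels, output_path,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble a CreateTrainingJob-style request as a plain dict (sketch only)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "HyperParameters": hyperparameters,
        "InputDataConfig": [
            {
                "ChannelName": name,
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
            for name, uri in channels.items()
        ],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,  # illustrative value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},  # illustrative value
    }


# All values below are placeholders.
request = build_training_job_request(
    job_name="byoc-cifar10-example",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<repository>:latest",
    role_arn="arn:aws:iam::<account>:role/<role-name>",
    hyperparameters={"sagemaker_program": '"cifar10_keras_sm_tf2.py"', "epochs": '"1"'},
    channels={"train": "s3://example-bucket/data/train"},
    output_path="s3://example-bucket/cifar10-output",
)
print(json.dumps(request, indent=2))
```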

In [35]:
import sagemaker
import json
# JSON encode hyperparameters.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

# 'model_dir' : './logs' is a dummy value, kept for compatibility with the existing code.
# The actual model artifacts are saved to s3_output_path.
hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": train_code_file,
    "sagemaker_submit_directory": sources,
    'epochs': "1", 
    'model_dir' : './logs'    
    })

hyperparameters

{'sagemaker_program': '"cifar10_keras_sm_tf2.py"',
 'sagemaker_submit_directory': '"s3://sagemaker-ap-northeast-2-057716757052/byoc-cifar10/code/sourcedir.tar.gz"',
 'epochs': '"1"',
 'model_dir': '"./logs"'}
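The extra double quotes in the values above come from `json.dumps`, which serializes each Python string as a JSON string. A minimal sketch of the round trip, assuming a hypothetical inverse helper `json_decode_hyperparameters` and an illustrative `batch_size` entry that is not used in this notebook (how the toolkit decodes values internally is not shown here):

```python
import json


def json_decode_hyperparameters(encoded):
    """Inverse of the JSON encoding step: parse each value back to a Python object."""
    return {key: json.loads(value) for key, value in encoded.items()}


# Encode the same way as json_encode_hyperparameters does.
original = {"epochs": "1", "batch_size": 64}
encoded = {str(k): json.dumps(v) for k, v in original.items()}
decoded = json_decode_hyperparameters(encoded)

print(encoded)
print(decoded)
```

Strings gain surrounding quotes when encoded while numbers do not, so decoding restores the original types.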

In [36]:

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='local',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters,
                                    output_path = s3_output_path,                                                                                 
                                   )

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})



train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating gpunmvy05m-algo-1-thr3v ... 
Creating gpunmvy05m-algo-1-thr3v ... done
Attaching to gpunmvy05m-algo-1-thr3v
gpunmvy05m-algo-1-thr3v | 2021-02-27 03:22:03,321 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
gpunmvy05m-algo-1-thr3v | /usr/bin/python3 -m pip install -r requirements.txt
gpunmvy05m-algo-1-thr3v |   from cryptography.utils import int_from_bytes
gpunmvy05m-algo-1-thr3v |   from cryptography.utils import int_from_bytes
gpunmvy05m-algo-1-thr3v | Collecting pandas<2
gpunmvy05m-algo-1-thr3v |   Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
gpunmvy05m-algo-1-thr3v | Collecting pytz>=2017.2
gpunmvy05m-algo-1-thr3v |   Downloading pytz-2021.1-py2.py3-none-any.wh

## Training on SageMaker Cloud with a custom container

In [37]:
# 'model_dir' : './logs' is a dummy value, kept for compatibility with the existing code.
# The actual model artifacts are saved to s3_output_path.
hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": train_code_file,
    "sagemaker_submit_directory": sources,
    'epochs': "5", 
    'model_dir' : './logs'    
    })


est = sagemaker.estimator.Estimator(container_image_uri, # custom container
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='ml.p2.xlarge',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters,
                                    output_path = s3_output_path,                                                                                 
                                   )

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})




train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-27 03:23:30 Starting - Starting the training job...
2021-02-27 03:23:54 Starting - Launching requested ML instances
ProfilerReport-1614396210: InProgress
......
2021-02-27 03:24:55 Starting - Preparing the instances for training......
2021-02-27 03:25:56 Downloading - Downloading input data...
2021-02-27 03:26:16 Training - Downloading the training image...........
2021-02-27 03:28:11,411 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/bin/python3 -m pip install -r requirements.txt
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Collecting pandas<2
  Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
Collecting pytz>=2017.2
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: pytz, pandas

2021-02-27 03:28:18 Training - Training image download completed. Training in progress.
Successfully