# [Module 2.1] Training on a SageMaker Cluster (running without a VPC)

This notebook performs the following tasks.
- Runs training on the SageMaker Hosting Cluster
- Saves the name of the training job
    - The name is used in the next notebook for model deployment and inference.
---

Get the SageMaker session and the role information.
- These two pieces of information are used to connect to the SageMaker Hosting Cluster.

In [6]:
# import boto3
# import sagemaker
# from sagemaker import get_execution_role

# ecr_namespace = 'sagemaker-training-containers/'
# prefix = 'tf-script-mode-container-2'

# ecr_repository_name = ecr_namespace + prefix
# role = get_execution_role()
# account_id = role.split(':')[4]
# region = boto3.Session().region_name
# sagemaker_session = sagemaker.session.Session()
# bucket = sagemaker_session.default_bucket()

# print(account_id)
# print(region)
# print(role)
# print(bucket)

In [23]:
import os
import sagemaker
from sagemaker import get_execution_role

# Create the SageMaker session once and reuse it below.
sagemaker_session = sagemaker.Session()

role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'byoc-cifar10'


## Uploading local data to S3
Upload the local data to S3 so that it can be used as input during training.

In [2]:
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

's3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10'
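The uploaded prefix is later split into one input per channel when `fit()` is called. A minimal sketch of that mapping, assuming the `train`, `validation`, and `eval` subfolders used later in this notebook exist under the uploaded prefix; `build_channels` is a hypothetical helper, not part of the SageMaker SDK.

```python
def build_channels(dataset_location, channel_names=("train", "validation", "eval")):
    """Map each channel name to its S3 prefix under dataset_location."""
    base = dataset_location.rstrip("/")
    return {name: "{}/{}".format(base, name) for name in channel_names}


# Example using the S3 URI printed above.
channels = build_channels("s3://sagemaker-ap-northeast-2-057716757052/data/DEMO-cifar10")
for name, uri in channels.items():
    print(name, "->", uri)
```

The resulting dictionary has the same shape as the argument passed to `est.fit(...)` in the cells below.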

In [5]:
%store -r container_image_uri
print (container_image_uri)

057716757052.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-training-containers/tf-script-mode-container-2:latest


In [27]:
train_code_file = 'cifar10_keras_sm_tf2.py'

<h3>Training with a custom SDK framework estimator</h3>

As you have seen, in the previous steps we had to upload our code to Amazon S3 and then inject reserved hyperparameters to execute training. In order to facilitate this task, you can also try defining a custom framework estimator using the Amazon SageMaker Python SDK and run training with that class, which will take care of managing these tasks.

In [17]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_name=None,
        distributions=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_name=None,
        **kwargs
    ):
        return None
        
import sagemaker

est = CustomFramework(image_name=container_image_uri,
                      entry_point=train_code_file,
                      source_dir='source_dir/',
                      role=role,
                      train_instance_count=1, 
                      train_instance_type='local', # we use local mode
                      #train_instance_type='ml.m5.xlarge',
                      base_job_name='byoc-cifar10',
                      hyperparameters={'epochs': "1", 
                                       'model_dir' : './logs'
                                      },                      
                        )

# train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
# val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

# est.fit({'train': train_config, 'validation': val_config })

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})


train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
image_name has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
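The warnings above indicate that the cell uses SageMaker Python SDK v1 argument names on an SDK v2 installation. A small sketch of the three renames reported in the warnings; `V1_TO_V2` and `to_v2_kwargs` are hypothetical helpers for illustration only, and the new names (`instance_count`, `instance_type`, `image_uri`) come from the v2 migration guide linked in the warnings.

```python
# Renames reported by the deprecation warnings (v1 name -> v2 name).
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "image_name": "image_uri",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V1_TO_V2.get(key, key): value for key, value in kwargs.items()}


# Placeholder values; only the key names matter here.
v1_kwargs = {
    "image_name": "example-image-uri",
    "train_instance_count": 1,
    "train_instance_type": "local",
    "base_job_name": "byoc-cifar10",
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

Arguments that were not renamed, such as `base_job_name`, pass through unchanged.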


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating z8qbmpgrj6-algo-1-yuqh4 ... 
Creating z8qbmpgrj6-algo-1-yuqh4 ... done
Attaching to z8qbmpgrj6-algo-1-yuqh4
z8qbmpgrj6-algo-1-yuqh4 | 2021-02-25 11:43:31,794 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
z8qbmpgrj6-algo-1-yuqh4 | /usr/bin/python3 -m pip install -r requirements.txt
z8qbmpgrj6-algo-1-yuqh4 |   from cryptography.utils import int_from_bytes
z8qbmpgrj6-algo-1-yuqh4 |   from cryptography.utils import int_from_bytes
z8qbmpgrj6-algo-1-yuqh4 | Collecting pandas<2
z8qbmpgrj6-algo-1-yuqh4 |   Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
     |████████████████████████████████| 9.5 MB 2.6 MB/s eta 0:00:01
z8qbmpgrj6-algo-1-yuqh4 | Collecting pytz>=2017.2
z8qbmpgrj6-algo-1-yuqh4 |   Downloading pytz-2021.1-py2.py3-none-any.wh

## Running from code in S3

Save the current training_job_name.
- The training_job_name gives access to the training details and to the S3 path of the **Model Artifact** file produced by training.

In [32]:
import os
import tarfile
import tempfile  # needed by the mkstemp() fallback in create_tar_file

def create_tar_file(source_files, target=None):
    if target:
        filename = target
    else:
        _, filename = tempfile.mkstemp()

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # Add all files from the directory into the root of the directory structure of the tar
            t.add(sf, arcname=os.path.basename(sf))
    return filename

create_tar_file(["source_dir/cifar10_keras_sm_tf2.py", "source_dir/requirements.txt"], "sourcedir.tar.gz")

'sourcedir.tar.gz'
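Because `create_tar_file` stores each file under its base name, the files end up at the root of the archive rather than under `source_dir/`. A minimal self-contained sketch for checking that layout; it builds a throwaway archive in a temporary directory, and both `list_tar_members` and the `train.py` file name are hypothetical.

```python
import os
import tarfile
import tempfile


def list_tar_members(archive_path):
    """Return the sorted member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as t:
        return sorted(t.getnames())


with tempfile.TemporaryDirectory() as tmp:
    # Create a placeholder script inside a nested directory.
    src_dir = os.path.join(tmp, "source_dir")
    os.makedirs(src_dir)
    script = os.path.join(src_dir, "train.py")
    with open(script, "w") as f:
        f.write("print('hello')\n")

    # Archive it the same way create_tar_file does: base name only.
    archive = os.path.join(tmp, "sourcedir.tar.gz")
    with tarfile.open(archive, mode="w:gz") as t:
        t.add(script, arcname=os.path.basename(script))

    members = list_tar_members(archive)
    print(members)
```

The same helper can be pointed at `sourcedir.tar.gz` before it is uploaded and removed.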

In [33]:
sources = sagemaker_session.upload_data('sourcedir.tar.gz', bucket, prefix + '/code')
print(sources)
! rm sourcedir.tar.gz

s3://sagemaker-ap-northeast-2-057716757052/byoc-cifar10/code/sourcedir.tar.gz


When starting the training job, we need to let the sagemaker-training-toolkit library know where the sources are stored in Amazon S3 and what is the module to be invoked. These parameters are specified through the following reserved hyperparameters (these reserved hyperparameters are injected automatically when using framework estimators of the Amazon SageMaker Python SDK):
<ul>
    <li>sagemaker_program</li>
    <li>sagemaker_submit_directory</li>
</ul>

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).

In [34]:
import sagemaker
import json
# JSON encode hyperparameters.
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": train_code_file,
    "sagemaker_submit_directory": sources,
    'epochs': "1", 
    'model_dir' : './logs'    
    })

hyperparameters

{'sagemaker_program': '"cifar10_keras_sm_tf2.py"',
 'sagemaker_submit_directory': '"s3://sagemaker-ap-northeast-2-057716757052/byoc-cifar10/code/sourcedir.tar.gz"',
 'epochs': '"1"',
 'model_dir': '"./logs"'}
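Each value above is a JSON-encoded string, which is why the quotes appear doubled. A short sketch of the round trip, assuming the consumer decodes every value with `json.loads`; `json_decode_hyperparameters` is a hypothetical helper shown only for illustration.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    """JSON-encode every value, as in the cell above."""
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


def json_decode_hyperparameters(encoded):
    """Reverse the encoding by JSON-decoding every value."""
    return {k: json.loads(v) for (k, v) in encoded.items()}


original = {"epochs": "1", "model_dir": "./logs"}
encoded = json_encode_hyperparameters(original)
decoded = json_decode_hyperparameters(encoded)
print(encoded)
print(decoded)
```

Note that `epochs` is passed as the string `"1"`, so it decodes back to a string rather than an integer.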

In [35]:

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='local',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)

# train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
# val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

# est.fit({'train': train_config, 'validation': val_config })

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})



train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
Creating xtn3in2wip-algo-1-qqfo4 ... 
Creating xtn3in2wip-algo-1-qqfo4 ... done
Attaching to xtn3in2wip-algo-1-qqfo4
xtn3in2wip-algo-1-qqfo4 | 2021-02-25 12:01:22,647 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
xtn3in2wip-algo-1-qqfo4 | /usr/bin/python3 -m pip install -r requirements.txt
xtn3in2wip-algo-1-qqfo4 |   from cryptography.utils import int_from_bytes
xtn3in2wip-algo-1-qqfo4 |   from cryptography.utils import int_from_bytes
xtn3in2wip-algo-1-qqfo4 | Collecting pandas<2
xtn3in2wip-algo-1-qqfo4 |   Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
     |████████████████████████████████| 9.5 MB 2.4 MB/s eta 0:00:01
xtn3in2wip-algo-1-qqfo4 | Collecting pytz>=2017.2
xtn3in2wip-algo-1-qqfo4 |   Downloading pytz-2021.1-py2.py3-none-any.wh

## Training on SageMaker Cloud with a custom container

In [36]:

est = sagemaker.estimator.Estimator(container_image_uri, # custom container
                                    role,
                                    train_instance_count=1, 
                                    train_instance_type='ml.p2.xlarge',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)

est.fit({'train':'{}/train'.format(dataset_location),
              'validation':'{}/validation'.format(dataset_location),
              'eval':'{}/eval'.format(dataset_location)})




train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-25 12:03:51 Starting - Starting the training job...
2021-02-25 12:04:15 Starting - Launching requested ML instancesProfilerReport-1614254631: InProgress
......
2021-02-25 12:05:16 Starting - Preparing the instances for training......
2021-02-25 12:06:17 Downloading - Downloading input data...
2021-02-25 12:06:37 Training - Downloading the training image...........
2021-02-25 12:08:32,555 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/bin/python3 -m pip install -r requirements.txt
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Collecting pandas<2
  Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
Collecting pytz>=2017.2
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.1.5 pytz-2021.1
You should consider upgrading via the '/usr

In [6]:
# The estimator created above is named `est`; read the job name through the public attribute.
train_job_name = est.latest_training_job.name

In [7]:
%store train_job_name

Stored 'train_job_name' (str)
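With the job name stored, a later notebook can look up where the model artifact was written. A hedged sketch built around the boto3 `describe_training_job` call; `get_model_artifact_uri` is a hypothetical helper, and the stub client, bucket, and job name below are made up so the example runs without AWS credentials.

```python
def get_model_artifact_uri(sm_client, job_name):
    """Return the S3 URI of the model artifact for a training job."""
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    return response["ModelArtifacts"]["S3ModelArtifacts"]


class StubSageMakerClient:
    """Stand-in for boto3.client('sagemaker'); returns a made-up response."""

    def describe_training_job(self, TrainingJobName):
        return {
            "TrainingJobName": TrainingJobName,
            "ModelArtifacts": {
                "S3ModelArtifacts": "s3://example-bucket/{}/output/model.tar.gz".format(
                    TrainingJobName
                )
            },
        }


artifact_uri = get_model_artifact_uri(StubSageMakerClient(), "example-job")
print(artifact_uri)
```

In a real session, pass `boto3.client('sagemaker')` and the stored `train_job_name` instead of the stub and the example name.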
