# [Module 3.4.2] Training with a Custom Docker Image

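This notebook runs training in script mode.
It covers the following steps:

- Point the training job at the data in S3
- Set the training parameters
- Create an Estimator and set the path to the training Docker image
- Set train_instance_type to the 'ml.p3.2xlarge' instance
- Train
- Check the model artifacts created in S3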

---
This notebook takes **about 35 minutes** to run. <br>
Training on two ml.p3.2xlarge instances takes about 30 minutes.
To shorten the run time, lower `epochs` from 400.


In [1]:
%store -r

In [2]:
import os
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Input Data Configuration

In [4]:
s3_input_train_data = sagemaker.inputs.TrainingInput(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.inputs.TrainingInput(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.inputs.TrainingInput(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
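
The comment on the Estimator's instance count notes that `ShardedByS3Key` needs at least as many input files as training instances. A minimal sketch of that pre-flight check follows. The helper name `check_shard_count` and the example object keys are hypothetical and only illustrate the idea.

```python
def check_shard_count(object_keys, instance_count):
    """Return True when there are at least as many S3 data objects as instances."""
    # Skip "directory" placeholder keys, which carry no data.
    data_keys = [key for key in object_keys if not key.endswith("/")]
    return len(data_keys) >= instance_count


# Hypothetical object keys under a training prefix (illustrative only).
example_keys = [
    "output/bert-train/part-0.tfrecord",
    "output/bert-train/part-1.tfrecord",
]

print(check_shard_count(example_keys, instance_count=2))
print(check_shard_count(example_keys, instance_count=3))
```

In practice the keys could be listed with `boto3.client("s3").list_objects_v2(Bucket=..., Prefix=...)` before the job is submitted.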


In [5]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-079636235537/checkpoints/76a85ac4-d992-492e-accd-babe2d58f8f5/


In [6]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
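
To see what these regular expressions capture, the sketch below applies them to a single log line with Python's `re.search`. The log line is a hypothetical Keras-style example; the real format depends on what the training script prints, and SageMaker's own log scanning may behave differently from this local check.

```python
import re

metrics_definitions = [
    {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "train:accuracy", "Regex": "accuracy: ([0-9\\.]+)"},
    {"Name": "validation:loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
]

# Hypothetical log line (an assumption, not taken from the actual job logs).
log_line = (
    "100/100 - 45s - loss: 1.2345 - accuracy: 0.3120 "
    "- val_loss: 1.3456 - val_accuracy: 0.3200"
)


def extract_metrics(line, definitions):
    """Apply each regex to the line and keep the first captured value."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(log_line, metrics_definitions)
print(metrics)
```

Note that the pattern `loss: ` is also a substring of `val_loss: `, so the unanchored training patterns can match validation values when a line contains only the latter. Testing the patterns locally like this makes such overlaps easy to spot.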

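## Training (using two ml.p3.2xlarge instances)
The chart below shows the train and validation accuracy obtained with the hyperparameter settings below.<br>
To view it, open the SageMaker console, go to Training > Training jobs in the left menu, and click the training job; the chart appears at the bottom of the page.

Validation accuracy is currently about 32%. <br>
The main reason it is low is the small amount of training data.<br>
Raising validation accuracy requires preparing more data.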

![Fig.3.4.Train-AccuracyChart](img/Fig.3.4.Train-Accuracy-Chart.png)

## Parameters

In [7]:
epochs = 400

learning_rate = 4e-4
epsilon = 1e-8

steps = 100

train_steps_per_epoch = steps
validation_steps = int(steps / 2)
test_steps = int(steps / 2)

train_batch_size = 128
validation_batch_size = 128
test_batch_size = 128

train_instance_count = 2  # modified by gonsoo
train_instance_type = 'ml.p3.2xlarge'
train_volume_size = 1024

use_xla = True
use_amp = True

max_seq_length = 32
freeze_bert_layer = True

enable_checkpointing = True
input_mode = 'Pipe'
run_validation = True
run_test = True
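
As a rough guide to how these settings affect run time, the sketch below works out how many examples each epoch consumes. It assumes every step processes exactly one batch; whether the counts are per instance or global depends on the training script, which is not shown here.

```python
epochs = 400
steps = 100
train_batch_size = 128
validation_batch_size = 128

train_steps_per_epoch = steps
validation_steps = steps // 2

# Assumption: one step consumes exactly one batch.
train_examples_per_epoch = train_steps_per_epoch * train_batch_size
validation_examples_per_epoch = validation_steps * validation_batch_size
total_train_steps = epochs * train_steps_per_epoch

print("train examples per epoch:", train_examples_per_epoch)
print("validation examples per epoch:", validation_examples_per_epoch)
print("total training steps:", total_train_steps)
```

Under this assumption, halving `epochs` halves the total number of training steps, which is why reducing it is the simplest way to shorten the run.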

Copy the training container image URI from the ECR console, as shown below.

![ECR-Training-Container](img/ecr-train-container.png)

<font color="red">**If an error occurs, go to the ECR console shown above and compare the image path there with the output of print(ecr_image) below.**</font><br>

In [8]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region_name = boto3.session.Session().region_name
print("account_id, region_name: ", account_id, region_name)

account_id, region_name:  079636235537 us-east-1


In [9]:
ecr_image = "{}.dkr.ecr.{}.amazonaws.com/bert2tweet:latest".format(account_id, region_name)
print(ecr_image)

079636235537.dkr.ecr.us-east-1.amazonaws.com/bert2tweet:latest
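
When comparing the image path against the ECR console, it can help to split the URI into its parts. The sketch below does this with a regular expression. The helper `parse_ecr_image_uri` is hypothetical, and the pattern assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout used in this notebook.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError("Not a recognized ECR image URI: {}".format(uri))
    return match.groupdict()


parts = parse_ecr_image_uri(
    "079636235537.dkr.ecr.us-east-1.amazonaws.com/bert2tweet:latest"
)
print(parts)
```

Each field can then be checked individually: the account and region against the notebook session, the repository name and tag against what the ECR console lists.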


In [11]:
from sagemaker.estimator import Estimator


# sagemaker>=2 renamed train_instance_count, train_instance_type and train_volume_size
# to instance_count, instance_type and volume_size.
# See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
estimator = Estimator(image_uri=ecr_image,
                      role=sagemaker.get_execution_role(),
                      instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                      instance_type=train_instance_type,
                      volume_size=train_volume_size,
                      checkpoint_s3_uri=checkpoint_s3_uri, # Not supported in local mode
                      hyperparameters={'epochs': epochs,
                                       'learning_rate': learning_rate,
                                       'epsilon': epsilon,
                                       'train_batch_size': train_batch_size,
                                       'validation_batch_size': validation_batch_size,
                                       'test_batch_size': test_batch_size,
                                       'train_steps_per_epoch': train_steps_per_epoch,
                                       'run_validation': run_validation,
                                       'validation_steps': validation_steps,
                                       'test_steps': test_steps,
                                       'use_xla': use_xla,
                                       'use_amp': use_amp,
                                       'max_seq_length': max_seq_length,
                                       'freeze_bert_layer': freeze_bert_layer,
                                       'enable_checkpointing': enable_checkpointing
                                       },
                      input_mode=input_mode,
                      metric_definitions=metrics_definitions
                      )


In [12]:
inputs = {'train': s3_input_train_data,
          'validation': s3_input_validation_data,
          'test': s3_input_test_data}

# wait=False returns immediately; the training job keeps running in SageMaker.
estimator.fit(inputs, wait=False)

## Check Training Job Status

In [13]:
latest_training_job = estimator.latest_training_job

In [14]:
latest_training_job.describe()

{'TrainingJobName': 'bert2tweet-2021-04-05-00-37-06-999',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:079636235537:training-job/bert2tweet-2021-04-05-00-37-06-999',
 'TrainingJobStatus': 'InProgress',
 'SecondaryStatus': 'Starting',
 'HyperParameters': {'enable_checkpointing': 'True',
  'epochs': '400',
  'epsilon': '1e-08',
  'freeze_bert_layer': 'True',
  'learning_rate': '0.0004',
  'max_seq_length': '32',
  'run_validation': 'True',
  'test_batch_size': '128',
  'test_steps': '50',
  'train_batch_size': '128',
  'train_steps_per_epoch': '100',
  'use_amp': 'True',
  'use_xla': 'True',
  'validation_batch_size': '128',
  'validation_steps': '50'},
 'AlgorithmSpecification': {'TrainingImage': '079636235537.dkr.ecr.us-east-1.amazonaws.com/bert2tweet:latest',
  'TrainingInputMode': 'Pipe',
  'MetricDefinitions': [{'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
   {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
   {'Name': 'validation:loss', 'Regex': 'val_loss: (

In [None]:
estimator.latest_training_job.wait(logs=False)


2021-04-05 00:37:09 Starting - Launching requested ML instances...............
2021-04-05 00:38:38 Starting - Preparing the instances for training................
2021-04-05 00:40:03 Downloading - Downloading input data.
2021-04-05 00:40:12 Training - Downloading the training image.............................
2021-04-05 00:42:40 Training - Training image download completed. Training in progress...................................................................................................................
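
Because the job was started with `wait=False`, its status can also be polled explicitly. The sketch below is one way to do that, under the assumption that the caller passes in a describe function returning a dict with a `TrainingJobStatus` key, as `describe_training_job` on the boto3 SageMaker client does. The helper name `wait_for_training_job` and the fake describe function are hypothetical and exist only so the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe_fn, job_name, poll_seconds=30, max_polls=200):
    """Poll describe_fn until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = describe_fn(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Job {} did not finish in time".format(job_name))


# Fake describe function standing in for the real client call (illustrative only).
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {
        "TrainingJobName": TrainingJobName,
        "TrainingJobStatus": next(_fake_statuses),
    }


final_status = wait_for_training_job(fake_describe, "bert2tweet-example", poll_seconds=0)
print(final_status)
```

With real credentials, `boto3.client("sagemaker").describe_training_job` can be passed as `describe_fn` together with the job name from `estimator.latest_training_job`.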

## Check the Generated Model Artifacts

In [None]:
training_job_name = estimator.latest_training_job.job_name
model_artifact_path = "s3://{}/{}/{}".format(bucket,training_job_name,'output' )
! aws s3 ls {model_artifact_path} --recursive
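
Once the job completes, the `describe()` response also reports the artifact location under `ModelArtifacts` > `S3ModelArtifacts`. A small helper for splitting such a URI into bucket and key is sketched below. The function name `split_s3_uri` is hypothetical, and the example URI is illustrative, built from the bucket and job name shown earlier in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Illustrative URI; the actual value comes from the training job description.
example_uri = (
    "s3://sagemaker-us-east-1-079636235537/"
    "bert2tweet-2021-04-05-00-37-06-999/output/model.tar.gz"
)
artifact_bucket, artifact_key = split_s3_uri(example_uri)
print(artifact_bucket)
print(artifact_key)
```

The bucket and key can then be handed to any S3 client call that takes them separately, such as a download of the model archive.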

In [None]:
%store training_job_name