# [Module 3.3] Train a BERT Model with TensorFlow in Script Mode

This notebook runs training in script mode.
It covers the following steps:

- Specify the S3 location of the training data
- Set the training parameters
- Create an Estimator and specify tf_script_bert_tweet.py as the training script
- Set train_instance_type to the 'ml.p3.2xlarge' instance
- Run the Estimator in script mode
- Check the model artifact created in S3


- For details on using script mode, see the API documentation below
- Script Mode Ref:
    - https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow
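
For orientation, a script-mode entry point is an ordinary Python program: hyperparameters arrive as command-line arguments, and data and model locations are exposed through `SM_*` environment variables. The sketch below is not the actual `tf_script_bert_tweet.py`, which is not shown in this notebook. The argument names mirror a few of the hyperparameters used later, while the `str_to_bool` helper and the `--sm_model_dir` argument name are assumptions made for illustration. With Pipe input mode, as used in this notebook, the real script probably reads its channels differently than from a directory.

```python
import argparse
import os


def str_to_bool(value):
    """Interpret 'true'/'false'-style strings, since CLI arguments arrive as text."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters (names mirror some of those passed to the estimator).
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=1e-5)
    parser.add_argument('--use_xla', type=str_to_bool, default=False)
    # Locations that SageMaker exposes through environment variables.
    parser.add_argument('--train_data', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation_data', default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--sm_model_dir', default=os.environ.get('SM_MODEL_DIR'))
    # Ignore any arguments this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    print(parse_args())
```

Using `parse_known_args` keeps the sketch tolerant of extra arguments that the training container may append.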

In [1]:
%store -r

In [2]:
import os
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

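## Configure the input data
Unlike local mode, distribution is set to 'ShardedByS3Key' here, so each training instance receives a subset of the S3 objects rather than a full copy.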

In [3]:
s3_input_train_data = sagemaker.s3_input(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.s3_input(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.s3_input(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-31-07-19-19-437/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-31-07-19-19-437/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-31-07-19-19-437/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
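
To build intuition for `ShardedByS3Key`, the following sketch splits a list of object keys across instances. It is illustrative only: the round-robin assignment and the key names are assumptions, not SageMaker's documented algorithm. The point it makes is that with fewer objects than instances, at least one instance ends up with no data.

```python
def shard_keys(keys, instance_count):
    """Assign S3 object keys to instances round-robin (illustrative only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object keys under the training prefix.
keys = [
    'bert-train/part-0.tfrecord',
    'bert-train/part-1.tfrecord',
    'bert-train/part-2.tfrecord',
]

# Three files over two instances: every instance gets at least one file.
for instance, shard in enumerate(shard_keys(keys, instance_count=2)):
    print(instance, shard)

# One file over two instances: the second instance receives nothing.
print(shard_keys(keys[:1], instance_count=2))
```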


In [4]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-2-057716757052/checkpoints/f8ded034-8ba0-47a2-8b3c-765222f0e3e5/


In [5]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
# Test metrics (currently disabled):
#     {'Name': 'test:loss', 'Regex': 'Test Average loss: (.*?),'},
#     {'Name': 'test:accuracy', 'Regex': 'Test Accuracy: (.*?)%;'},
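
Before launching a job, it can help to check the metric regular expressions locally. The sketch below applies the same four patterns to a sample log line. The log line is a hypothetical Keras-style progress line; the real output of the training script may be formatted differently.

```python
import re

# Same patterns as in the cell above, copied so the sketch is self-contained.
sample_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# Hypothetical log line, for illustration only.
log_line = ('100/100 - 45s - loss: 0.6931 - accuracy: 0.5012 '
            '- val_loss: 0.6899 - val_accuracy: 0.5234')


def extract_metrics(line, definitions):
    """Return {metric name: first captured value} for every pattern that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


print(extract_metrics(log_line, sample_definitions))
```

Note that the `loss: ` pattern also matches inside `val_loss: `, and `accuracy: ` inside `val_accuracy: `. `re.search` returns the first match on the line; how SageMaker resolves several matches on one line is not something this sketch establishes.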

## Parameters

In [6]:

MAX_SEQ_LENGTH = 32

# Optimization settings
epochs = 20
max_seq_length = 32
learning_rate = 1e-5
epsilon = 0.00000001

# Batch sizes
train_batch_size = 128
validation_batch_size = 128
test_batch_size = 128

# Number of batches per epoch / per evaluation pass
train_steps_per_epoch = 100
validation_steps = 50
test_steps = 5

# Training cluster
train_instance_count = 2  # modified by gonsoo
train_instance_type = 'ml.p3.2xlarge'
train_volume_size = 1024

# Performance options: XLA compilation and automatic mixed precision
use_xla = True
use_amp = True

freeze_bert_layer = False

enable_checkpointing = True
input_mode = 'Pipe'
run_validation = True
run_test = True
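
As a quick sanity check on these settings, the arithmetic below works out how many examples each phase reads. It assumes the training script runs exactly the configured number of steps and that the batch sizes apply per training process; whether they are per instance or global depends on the script, which is not shown here. The helper name `examples_consumed` is hypothetical.

```python
def examples_consumed(batch_size, steps, epochs=1):
    """Number of examples read when running `steps` batches for `epochs` epochs."""
    return batch_size * steps * epochs


train_per_epoch = examples_consumed(128, 100)            # 128 * 100      = 12,800
train_total = examples_consumed(128, 100, epochs=20)     # 12,800 * 20    = 256,000
validation_per_pass = examples_consumed(128, 50)         # 128 * 50       = 6,400
test_per_pass = examples_consumed(128, 5)                # 128 * 5        = 640

print(train_per_epoch, train_total, validation_per_pass, test_per_pass)
```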

In [7]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
                       entry_point='tf_script_bert_tweet.py', 
#                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=sagemaker.get_execution_role(),
                       # Make sure there are at least this many input files; otherwise the ShardedByS3Key
                       # distribution strategy fails the job because an instance receives no data.
                       train_instance_count=train_instance_count,
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
                       checkpoint_s3_uri=checkpoint_s3_uri, # Not supported in local mode
                       py_version='py3',
                       framework_version='2.1.0',
                       script_mode = True,
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'run_validation' : run_validation,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_checkpointing': enable_checkpointing
                                        },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions
                      )
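
The estimator above uses SageMaker Python SDK v1 argument names. For readers on SDK v2, the sketch below renames the v1 keyword arguments that appear in this notebook. It is a small illustrative helper, not part of the SDK; the mapping covers only the arguments used here, and dropping `script_mode` reflects my understanding that v2 no longer accepts it. Please check the v2 migration guide before relying on it.

```python
# v1 -> v2 keyword renames relevant to this notebook (not an exhaustive list).
V1_TO_V2_ESTIMATOR_ARGS = {
    'train_instance_count': 'instance_count',
    'train_instance_type': 'instance_type',
    'train_volume_size': 'volume_size',
}

# Arguments dropped in this sketch on the assumption that v2 no longer accepts them.
V1_ONLY_ARGS = {'script_mode'}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of the keyword arguments using SDK v2 names."""
    v2_kwargs = {}
    for name, value in v1_kwargs.items():
        if name in V1_ONLY_ARGS:
            continue
        v2_kwargs[V1_TO_V2_ESTIMATOR_ARGS.get(name, name)] = value
    return v2_kwargs


print(to_v2_kwargs({
    'train_instance_count': 2,
    'train_instance_type': 'ml.p3.2xlarge',
    'script_mode': True,
    'py_version': 'py3',
}))
```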

In [8]:
inputs = {'train': s3_input_train_data,
          'validation': s3_input_validation_data,
          'test': s3_input_test_data}
estimator.fit(inputs,
              wait=False)

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


## Check the training job status

In [9]:
estimator.latest_training_job.describe()

{'TrainingJobName': 'tensorflow-training-2020-07-31-08-16-07-141',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-2:057716757052:training-job/tensorflow-training-2020-07-31-08-16-07-141',
 'TrainingJobStatus': 'InProgress',
 'SecondaryStatus': 'Starting',
 'HyperParameters': {'enable_checkpointing': 'true',
  'epochs': '20',
  'epsilon': '1e-08',
  'freeze_bert_layer': 'false',
  'learning_rate': '1e-05',
  'max_seq_length': '32',
  'model_dir': '"s3://sagemaker-us-east-2-057716757052/tensorflow-training-2020-07-31-08-16-07-141/model"',
  'run_validation': 'true',
  'sagemaker_container_log_level': '20',
  'sagemaker_enable_cloudwatch_metrics': 'false',
  'sagemaker_job_name': '"tensorflow-training-2020-07-31-08-16-07-141"',
  'sagemaker_program': '"tf_script_bert_tweet.py"',
  'sagemaker_region': '"us-east-2"',
  'sagemaker_submit_directory': '"s3://sagemaker-us-east-2-057716757052/tensorflow-training-2020-07-31-08-16-07-141/source/sourcedir.tar.gz"',
  'test_batch_size': '128',
  'tes

## Wait until training completes

In [10]:
estimator.latest_training_job.wait(logs=False)


2020-07-31 08:16:07 Starting - Starting the training job
2020-07-31 08:16:09 Starting - Launching requested ML instances............
2020-07-31 08:17:13 Starting - Preparing the instances for training...........
2020-07-31 08:18:16 Downloading - Downloading input data..
2020-07-31 08:18:28 Training - Downloading the training image...............
2020-07-31 08:19:49 Training - Training image download completed. Training in progress..................................................................................
2020-07-31 08:26:41 Uploading - Uploading generated training model........................
2020-07-31 08:28:47 Completed - Training job completed
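
If finer control over waiting is needed, the job status can also be polled by hand. The sketch below is one possible approach, under the assumption that a job counts as finished once `TrainingJobStatus` is `Completed`, `Failed`, or `Stopped`. The function name `wait_for_training_job` and its parameters are hypothetical; `describe` is any callable that returns a description dictionary like the one printed earlier.

```python
import time

# Statuses after which the job no longer changes.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(describe, poll_seconds=30, max_polls=1000):
    """Poll `describe()` until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        description = describe()
        status = description['TrainingJobStatus']
        print(status, '-', description.get('SecondaryStatus'))
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('Training job did not finish within the polling limit')
```

In this notebook it could be called as `wait_for_training_job(estimator.latest_training_job.describe)`.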


## Check the generated model artifact

In [11]:
training_job_name = estimator.latest_training_job.job_name

In [12]:
model_artifact_path = "s3://{}/{}/{}".format(bucket, training_job_name, 'output')

In [13]:
! aws s3 ls {model_artifact_path} --recursive

2020-07-31 08:28:37  989714162 tensorflow-training-2020-07-31-08-16-07-141/output/model.tar.gz
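
Once `model.tar.gz` has been copied locally (for example with `aws s3 cp`), its contents can be inspected without extracting it. The sketch below assumes the archive sits at a local path of your choosing; the helper name `list_model_archive` is hypothetical.

```python
import tarfile


def list_model_archive(archive_path):
    """Return the member names inside a gzip-compressed model archive."""
    with tarfile.open(archive_path, 'r:gz') as archive:
        return archive.getnames()


# Example usage, assuming the archive was downloaded to the working directory:
# print(list_model_archive('model.tar.gz'))
```

Listing the members first is a cheap way to confirm that the saved model files are present before extracting roughly 1 GB of data.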


In [14]:
%store training_job_name

Stored 'training_job_name' (str)
