# [Module 3.3] Train a BERT Model in Script Mode (Fine-tuning)

This notebook runs training in script mode (Bring Your Own Script).
It covers the following steps:

- Point to the training data in S3
- Set the training hyperparameters
- Create an Estimator and specify tf_script_bert_tweet.py as the training script
- Set train_instance_type to the 'ml.p3.2xlarge' instance
- Run the Estimator in script mode
- Check the model artifact created in S3


- For details on using script mode, see the API documentation below.
- Script Mode Ref:
    - https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow
    
---
This notebook takes **about 35 minutes** to run. <br>
Training on two ml.p3.2xlarge instances takes about 30 minutes.
To shorten the run time, lower epochs from its current value of 400.

    

In [1]:
%store -r

In [2]:
import os
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Configure the Input Data
Unlike local mode, the inputs here are created with distribution='ShardedByS3Key'.

In [4]:
s3_input_train_data = sagemaker.inputs.TrainingInput(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.inputs.TrainingInput(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.inputs.TrainingInput(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-079636235537/sagemaker-scikit-learn-2021-04-04-14-21-14-399/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
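
With `ShardedByS3Key`, each training instance receives only a subset of the S3 objects under the prefix instead of a full copy, so the prefix needs at least as many objects as there are instances. The sketch below illustrates the idea with a simple round-robin split. The object keys and the `shard_keys` helper are hypothetical, and SageMaker's actual assignment of keys to instances may differ.

```python
def shard_keys(keys, instance_count):
    """Split object keys into one disjoint subset per instance (illustrative round-robin only)."""
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards


# Hypothetical object keys under the bert-train prefix
keys = [
    "bert-train/part-0.tfrecord",
    "bert-train/part-1.tfrecord",
    "bert-train/part-2.tfrecord",
]

shards = shard_keys(keys, instance_count=2)
for idx, shard in enumerate(shards):
    print("instance", idx, "->", shard)

# With fewer objects than instances, at least one instance is left without data
starved = [s for s in shard_keys(keys[:1], instance_count=2) if not s]
print("instances without data:", len(starved))
```
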


In [5]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-079636235537/checkpoints/fb436142-5f68-47d4-b191-5f9cbe3a449b/


In [6]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
#     {'Name': 'test:loss', 'Regex': 'Test Average loss: (.*?),'},
#     {'Name': 'test:accuracy', 'Regex': 'Test Accuracy: (.*?)%;'}
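
To see what these patterns capture, the following sketch applies them to a single log line with Python's `re` module. The sample line is hypothetical (a Keras-style progress line), since the real format depends on what tf_script_bert_tweet.py prints. It also shows that the unanchored `loss:` pattern matches inside `val_loss:`, which is worth keeping in mind when reading the train:loss metric.

```python
import re

# Hypothetical Keras-style progress line; the actual log format depends on the training script
sample_log = "100/100 - 12s - loss: 1.5123 - accuracy: 0.3105 - val_loss: 1.4987 - val_accuracy: 0.3220"

metric_patterns = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# First match of each pattern in the line
first_match = {}
for definition in metric_patterns:
    match = re.search(definition['Regex'], sample_log)
    if match:
        first_match[definition['Name']] = float(match.group(1))
print(first_match)

# All matches of the train:loss pattern: it also picks up the val_loss value
all_loss_matches = re.findall('loss: ([0-9\\.]+)', sample_log)
print(all_loss_matches)
```
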

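## Training (Using Two ml.p3.2xlarge Instances)
The chart below shows the validation and training accuracy obtained with the hyperparameter settings that follow.<br>
To see this chart, open the SageMaker console, go to Training --> Training jobs in the left menu, and click the training job you ran; the chart appears near the bottom of the page.

The validation accuracy is currently about 32%. <br>
The main reason it is low is that the amount of training data is small.<br>
Raising the validation accuracy requires preparing more data.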

![Fig.3.3.BYOS-Train-Accuracy](img/Fig.3.3.BYOS-Train-Accuracy.png)

## Parameters

In [7]:
epochs= 400

learning_rate = 4e-4
epsilon=0.00000001

steps = 100
train_steps_per_epoch= steps
validation_steps= int(steps / 2)
test_steps= int(steps / 2)

train_batch_size=128
validation_batch_size=128
test_batch_size=128

train_instance_count=2 # modified by gonsoo
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024

use_xla=True
use_amp=True

max_seq_length = 32

freeze_bert_layer= True

enable_checkpointing=True
input_mode='Pipe'
run_validation=True
run_test=True
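
As a rough sanity check on these settings, the arithmetic below works out how many examples one instance would consume. It assumes the training script processes exactly `train_steps_per_epoch` batches per epoch and `validation_steps` batches per validation pass, which is an assumption about tf_script_bert_tweet.py rather than something shown in this notebook.

```python
# Values copied from the parameter cell above
steps = 100
epochs = 400
train_batch_size = 128
validation_batch_size = 128

train_steps_per_epoch = steps
validation_steps = steps // 2

# Examples consumed per epoch and per validation pass (per instance, under the stated assumption)
train_examples_per_epoch = train_steps_per_epoch * train_batch_size
validation_examples_per_pass = validation_steps * validation_batch_size
total_train_examples = train_examples_per_epoch * epochs

print("train examples per epoch:", train_examples_per_epoch)
print("validation examples per pass:", validation_examples_per_pass)
print("train examples over all epochs:", total_train_examples)
```

Because the total scales linearly with `epochs`, halving `epochs` roughly halves the amount of training work.
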

In [8]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
                       entry_point='tf_script_bert_tweet.py', 
#                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=sagemaker.get_execution_role(),
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files, or the ShardedByS3Key distribution strategy will fail the job because no data is available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
                       checkpoint_s3_uri=checkpoint_s3_uri, # Not supported in local mode
                       py_version='py3',
                       framework_version='2.1.0',
                       script_mode = True,
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'run_validation' : run_validation,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_checkpointing': enable_checkpointing
                                        },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions
                      )

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [9]:
inputs = {'train': s3_input_train_data,
          'validation': s3_input_validation_data,
          'test': s3_input_test_data}

# wait=False returns immediately; the training job keeps running in the background
estimator.fit(inputs, wait=False)

## Check the Training Job Status

In [10]:
estimator.latest_training_job.describe()

{'TrainingJobName': 'tensorflow-training-2021-04-04-14-32-48-991',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:079636235537:training-job/tensorflow-training-2021-04-04-14-32-48-991',
 'TrainingJobStatus': 'InProgress',
 'SecondaryStatus': 'Starting',
 'HyperParameters': {'enable_checkpointing': 'true',
  'epochs': '400',
  'epsilon': '1e-08',
  'freeze_bert_layer': 'true',
  'learning_rate': '0.0004',
  'max_seq_length': '32',
  'model_dir': '"s3://sagemaker-us-east-1-079636235537/tensorflow-training-2021-04-04-14-32-48-991/model"',
  'run_validation': 'true',
  'sagemaker_container_log_level': '20',
  'sagemaker_job_name': '"tensorflow-training-2021-04-04-14-32-48-991"',
  'sagemaker_program': '"tf_script_bert_tweet.py"',
  'sagemaker_region': '"us-east-1"',
  'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-079636235537/tensorflow-training-2021-04-04-14-32-48-991/source/sourcedir.tar.gz"',
  'test_batch_size': '128',
  'test_steps': '50',
  'train_batch_size': '128',
  't
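
The full `describe()` response is long, so it can help to pull out just the status fields. The sketch below does this on a plain dictionary; `summarize_job` and `is_finished` are hypothetical helper names, and the sample dictionary only copies the three fields shown in the output above.

```python
def summarize_job(description):
    """Pick the name and status fields out of a describe() response dictionary."""
    return {
        'name': description.get('TrainingJobName'),
        'status': description.get('TrainingJobStatus'),
        'secondary_status': description.get('SecondaryStatus'),
    }


def is_finished(description):
    """Return True once the job has reached a terminal status."""
    return description.get('TrainingJobStatus') in ('Completed', 'Failed', 'Stopped')


# Sample values copied from the describe() output above
sample_description = {
    'TrainingJobName': 'tensorflow-training-2021-04-04-14-32-48-991',
    'TrainingJobStatus': 'InProgress',
    'SecondaryStatus': 'Starting',
}

summary = summarize_job(sample_description)
print(summary)
print("finished:", is_finished(sample_description))
```

In the notebook, the same helpers could be called with `estimator.latest_training_job.describe()` as the argument.
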

## Wait Until Training Completes

In [11]:
%%time
estimator.latest_training_job.wait(logs=False)


2021-04-04 14:32:51 Starting - Launching requested ML instances..........
2021-04-04 14:34:00 Starting - Preparing the instances for training................
2021-04-04 14:35:21 Downloading - Downloading input data.....
2021-04-04 14:35:56 Training - Downloading the training image.......
2021-04-04 14:36:36 Training - Training image download completed. Training in progress........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
2021-04-04 15:18:48 Uploading - Uploading generated training model.....................
2021-04-04 15:20:40 Completed -

## Check the Generated Model Artifact

In [12]:
training_job_name = estimator.latest_training_job.job_name

In [13]:
model_artifact_path = "s3://{}/{}/{}".format(bucket,training_job_name,'output' )

In [14]:
! aws s3 ls {model_artifact_path} --recursive

2021-04-04 15:20:32  996723430 tensorflow-training-2021-04-04-14-32-48-991/output/model.tar.gz
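
The listing shows that the artifact lands at `<job name>/output/model.tar.gz` in the bucket. As a small sketch, the helper below builds that URI and converts the listed size to MiB; `model_artifact_uri` is a hypothetical name, and the layout is assumed to follow the default output location seen in this listing. The `estimator.model_data` attribute of the SageMaker Python SDK should give the same URI once the job has completed.

```python
def model_artifact_uri(bucket, training_job_name):
    """Build the S3 URI of model.tar.gz, assuming the default <job name>/output/ layout."""
    return "s3://{}/{}/output/model.tar.gz".format(bucket, training_job_name)


# Values copied from the outputs shown in this notebook
example_bucket = "sagemaker-us-east-1-079636235537"
example_job_name = "tensorflow-training-2021-04-04-14-32-48-991"
artifact_size_bytes = 996723430

artifact_uri = model_artifact_uri(example_bucket, example_job_name)
artifact_size_mib = artifact_size_bytes / (1024 ** 2)

print(artifact_uri)
print("size: {:.1f} MiB".format(artifact_size_mib))
```
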


In [15]:
%store training_job_name

Stored 'training_job_name' (str)
