# Train a BERT Model with TensorFlow in Local Mode
- To use script mode, see the API documentation below.
- Script Mode Ref:
    - https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow

## Local Mode Training <a class="anchor" id="LocalModeTraining"></a>

In SageMaker, local mode is a convenient way to confirm that your code behaves as expected on the local machine before you run it on a more powerful SageMaker-managed cluster. Local mode training requires docker-compose, or nvidia-docker-compose on a GPU instance. The commands in the next cell install and configure docker-compose or nvidia-docker-compose in this notebook environment.
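Before or after running the setup script, it can help to see which container tools are already on the `PATH`. The following is a minimal sketch and not part of the original notebook: the helper name `find_container_tools` is hypothetical, and it only reports whether each executable can be found, not whether the Docker daemon is actually running.

```python
import shutil

def find_container_tools(names=("docker", "docker-compose", "nvidia-docker-compose")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Print one line per tool so a missing dependency is easy to spot.
for tool, path in find_container_tools().items():
    print(f"{tool}: {path or 'not found'}")
```

On a CPU-only instance, a missing `nvidia-docker-compose` entry is expected and should not block local mode.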

In [1]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json    
!/bin/bash ./local_mode_setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


In [2]:
import os
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

In [3]:
# s3_input_train_data = sagemaker.s3_input(s3_data=processed_train_data_s3_uri, 
#                                          distribution='ShardedByS3Key') 
# s3_input_validation_data = sagemaker.s3_input(s3_data=processed_validation_data_s3_uri, 
#                                               distribution='ShardedByS3Key')
# s3_input_test_data = sagemaker.s3_input(s3_data=processed_test_data_s3_uri, 
#                                         distribution='ShardedByS3Key')

# print(s3_input_train_data.config)
# print(s3_input_validation_data.config)
# print(s3_input_test_data.config)

In [4]:
train_dir = 'data/output/bert/train'
validation_dir = 'data/output/bert/validation'
test_dir = 'data/output/bert/test'
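Because local mode reads these directories straight from disk, a quick check that each one exists and holds at least one file can catch a wrong path before a container starts. This is a sketch under assumptions: `summarize_channels` is a hypothetical helper, and it counts every regular file without checking the file format.

```python
from pathlib import Path

def summarize_channels(channel_dirs):
    """Return {channel: file_count}; raise if a directory is missing or empty."""
    summary = {}
    for channel, directory in channel_dirs.items():
        path = Path(directory)
        if not path.is_dir():
            raise FileNotFoundError(f"{channel}: directory not found: {path}")
        # Count regular files at any depth below the channel directory.
        files = [p for p in path.rglob("*") if p.is_file()]
        if not files:
            raise ValueError(f"{channel}: no files under {path}")
        summary[channel] = len(files)
    return summary

# Example call with the variables defined in the cell above:
# summarize_channels({'train': train_dir,
#                     'validation': validation_dir,
#                     'test': test_dir})
```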


In [5]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-2-057716757052/checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/
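Note that the printed URI already ends with `/`, so appending another `/` later yields a different, empty-named prefix. The sketch below shows one way to split and rebuild such URIs without stray slashes. The helper names `split_s3_uri` and `join_s3_uri` and the bucket `example-bucket` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def join_s3_uri(bucket, *parts):
    """Join parts under a bucket, dropping stray slashes; the result ends with '/'."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket] + cleaned) + "/"

print(split_s3_uri("s3://example-bucket/checkpoints/abc/"))
print(join_s3_uri("example-bucket", "checkpoints/", "/abc/"))
```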


In [6]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
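To see what these patterns capture, they can be tried against a log line with Python's `re` module. The sample line below is made up in the style of a Keras progress line, not taken from this run. One thing the sketch makes visible: the `loss:` and `accuracy:` patterns also match inside `val_loss:` and `val_accuracy:`. How the metrics service chooses among several matches is not covered in this notebook, so the stricter `\b` variants at the end are only a suggestion to verify.

```python
import re

metric_regexes = {
    "train:loss": r"loss: ([0-9\.]+)",
    "train:accuracy": r"accuracy: ([0-9\.]+)",
    "validation:loss": r"val_loss: ([0-9\.]+)",
    "validation:accuracy": r"val_accuracy: ([0-9\.]+)",
}

# Hypothetical log line for illustration only.
sample_line = ("1/1 [==============================] - 45s 45s/step - "
               "loss: 0.6931 - accuracy: 0.5000 - "
               "val_loss: 0.7012 - val_accuracy: 0.4800")

def extract_all(line, regexes):
    """Return every captured value per metric name."""
    return {name: re.findall(pattern, line) for name, pattern in regexes.items()}

matches = extract_all(sample_line, metric_regexes)
print(matches["train:loss"])       # two values: the train loss and the val_loss

# A word boundary keeps 'loss:' from matching inside 'val_loss:'.
strict = extract_all(sample_line, {
    "train:loss": r"\bloss: ([0-9\.]+)",
    "train:accuracy": r"\baccuracy: ([0-9\.]+)",
})
print(strict["train:loss"])        # only the train loss
```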

In [7]:
epochs=1
max_seq_length = 128
learning_rate=0.00001
epsilon=0.00000001
model_dir = checkpoint_s3_uri

train_batch_size=128
validation_batch_size=128
test_batch_size=128
# train_steps_per_epoch=1000
train_steps_per_epoch=1
# validation_steps=100
validation_steps=1
# test_steps=100
test_steps=1
# train_instance_count=2 # modified by gonsoo
# train_instance_type='ml.p3.2xlarge'
train_instance_type='local'
train_instance_count=1
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=False
enable_sagemaker_debugger=False
enable_checkpointing=True
# enable_tensorboard=True
# input_mode='Pipe'
input_mode='File'
# run_validation=True
run_test=True

In [8]:
from sagemaker.tensorflow import TensorFlow

local_estimator = TensorFlow(entry_point='tf_script_BERT_tweet.py', 
#                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=sagemaker.get_execution_role(),
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
#                        train_use_spot_instances=True,
#                        train_max_wait=7200, # Seconds to wait for spot instances to become available
#                        checkpoint_s3_uri=checkpoint_s3_uri, # Not supported in local mode
                       model_dir = model_dir,
                       py_version='py3',
                       framework_version='2.1.0',
                       script_mode = True,
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'model_dir': model_dir,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_checkpointing': enable_checkpointing
                                        },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions
                      )
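In script mode, the entries of `hyperparameters` reach the entry point as command-line arguments such as `--epochs 1`. The entry point `tf_script_BERT_tweet.py` is not shown in this notebook, so the following is only a sketch of how such a script might parse a few of them. The function names are hypothetical. Boolean values arrive as strings, which is why the sketch avoids `type=bool`: that would treat the string `'False'` as true.

```python
import argparse

def str_to_bool(value):
    """Convert 'True'/'False' style strings to a real bool."""
    lowered = str(value).strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def parse_hyperparameters(argv=None):
    """Parse a subset of the hyperparameters set on the estimator."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--train_batch_size", type=int, default=128)
    parser.add_argument("--use_xla", type=str_to_bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    # Arguments this sketch does not declare (for example --model_dir) are ignored.
    args, _unknown = parser.parse_known_args(argv)
    return args

args = parse_hyperparameters(["--epochs", "1", "--use_xla", "True",
                              "--freeze_bert_layer", "False"])
print(args.epochs, args.use_xla, args.freeze_bert_layer)
```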

In [9]:
inputs = {'train': f'file://{train_dir}',
          'validation': f'file://{validation_dir}',
          'test': f'file://{test_dir}'}
local_estimator.fit(inputs)         

Creating tmpjp3bxgut_algo-1-w6n6d_1 ... done
Attaching to tmpjp3bxgut_algo-1-w6n6d_1
algo-1-w6n6d_1  | 2020-06-26 23:37:56,128 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-w6n6d_1  | 2020-06-26 23:37:56,137 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-w6n6d_1  | 2020-06-26 23:37:56,311 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-w6n6d_1  | 2020-06-26 23:37:56,328 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-w6n6d_1  | 2020-06-26 23:37:56,343 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-w6n6d_1  | 2020-06-26 23:37:56,352 sagemaker-containers INFO     Invoking user script
algo-1-w6n6d_1  |
algo-1-w6n6d_1  | Training Env:
algo-1-w6n6d_1  |
algo-1-w6n6d_1  | {
algo-1-w6n6d_1  |

In [10]:
! aws s3 ls {checkpoint_s3_uri}

In [11]:
checkpoint_s3_uri

's3://sagemaker-us-east-2-057716757052/checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/'

In [15]:
! aws s3 ls s3://sagemaker-us-east-2-057716757052/checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/ --recursive

2020-06-26 23:39:13          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/
2020-06-26 23:39:13          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/
2020-06-26 23:39:13          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/
2020-06-26 23:39:13          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/
2020-06-26 23:39:18          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/assets/
2020-06-26 23:39:18    4787957 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/saved_model.pb
2020-06-26 23:39:13          0 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/variables/
2020-06-26 23:39:16  803602808 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/variables/variables.data-00000-of-00001
2020-06-26 23:39:17      23052 checkpoints/007a0b72-c4dc-4b82-9658-d5d513380525/tensorflow/saved_model/0/variables/variable