# [Module 3.2] Train a BERT Model with TensorFlow in Local Mode

This notebook performs the tasks listed below. Local mode is used mainly to verify that the training script's logic is correct. Once the logic has been verified in local mode, training proceeds with Script Mode (BYOS, Bring Your Own Script) or, if needed, BYOC (Bring Your Own Container).

- Point the training data to S3
- Set the training hyperparameters
- Create an Estimator and specify the tf_script_bert_tweet.py training script
- Run the Estimator in local mode
    
---
This notebook takes about 3 minutes to run.

## Local Mode Training <a class="anchor" id="LocalModeTraining"></a>

Local mode in SageMaker is a convenient way to check locally that your code behaves as expected before running it on a more powerful cluster managed by SageMaker. Local mode training requires docker-compose, or nvidia-docker-compose on GPU instances. The commands in the next cell install and configure docker-compose or nvidia-docker-compose in this notebook environment.
    
To use script mode, refer to the API documentation below.
- Script Mode Ref:
    - https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow
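Since local mode runs on either CPU or GPU, it can help to choose the instance type programmatically. The following is a minimal sketch, assuming that a successful `nvidia-smi` call is a good enough signal that a GPU is usable. The helper names `pick_local_instance_type` and `docker_compose_available` are hypothetical and not part of the SageMaker SDK; only the values `'local'` and `'local_gpu'` are SageMaker local mode instance types.

```python
import shutil
import subprocess


def pick_local_instance_type():
    """Return 'local_gpu' if nvidia-smi runs successfully, otherwise 'local'."""
    if shutil.which("nvidia-smi") is None:
        return "local"
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat any failure to run the tool as "no usable GPU".
        return "local"
    return "local_gpu" if result.returncode == 0 else "local"


def docker_compose_available():
    """Return True if a docker-compose executable is found on PATH."""
    return shutil.which("docker-compose") is not None


print("instance type:", pick_local_instance_type())
print("docker-compose found:", docker_compose_available())
```

The returned value could be assigned to `train_instance_type` in the Parameters cell instead of the hard-coded `'local'`.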
    

In [1]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json    
!/bin/bash ./local_mode_setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


In [2]:
import os
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

In [3]:
%store -r

## Input Data Configuration

In [4]:

s3_input_train_data = sagemaker.s3_input(s3_data=processed_train_data_s3_uri) 
s3_input_validation_data = sagemaker.s3_input(s3_data=processed_validation_data_s3_uri)
s3_input_test_data = sagemaker.s3_input(s3_data=processed_test_data_s3_uri)


print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-343441690612/sagemaker-scikit-learn-2020-08-17-09-41-31-333/output/bert-train', 'S3DataDistributionType': 'FullyReplicated'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-343441690612/sagemaker-scikit-learn-2020-08-17-09-41-31-333/output/bert-validation', 'S3DataDistributionType': 'FullyReplicated'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-northeast-2-343441690612/sagemaker-scikit-learn-2020-08-17-09-41-31-333/output/bert-test', 'S3DataDistributionType': 'FullyReplicated'}}}
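The configs printed above all use `S3DataDistributionType: FullyReplicated`, meaning every training instance receives the full dataset. The alternative, `ShardedByS3Key`, gives each instance only a subset of the objects. The sketch below illustrates the difference in plain Python. The round-robin assignment and the object keys are assumptions made for illustration only; the actual assignment performed by the service may differ.

```python
def assign_files(keys, instance_count, distribution="FullyReplicated"):
    """Illustrate which object keys each instance would see.

    Round-robin sharding is an assumption for illustration, not the
    documented behavior of the service.
    """
    if distribution == "FullyReplicated":
        # Every instance gets a copy of every object.
        return [list(keys) for _ in range(instance_count)]
    if distribution == "ShardedByS3Key":
        shards = [[] for _ in range(instance_count)]
        for i, key in enumerate(sorted(keys)):
            shards[i % instance_count].append(key)
        return shards
    raise ValueError("unknown distribution: {}".format(distribution))


# Hypothetical object keys under the bert-train prefix.
keys = ["bert-train/part-0", "bert-train/part-1", "bert-train/part-2"]

print(assign_files(keys, 2, "FullyReplicated"))
print(assign_files(keys, 2, "ShardedByS3Key"))
# With fewer objects than instances, at least one shard stays empty.
print(assign_files(keys[:1], 2, "ShardedByS3Key"))
```

The last call shows why a sharded job needs at least as many input files as instances: an instance with an empty shard has no data to train on.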


Generate a UUID and use it to build the S3 prefix under which checkpoint files will be stored.

In [5]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-ap-northeast-2-343441690612/checkpoints/760f4189-c5cb-4b74-97ce-f1966ca7e58f/


Define metrics so that they can be monitored in CloudWatch.

In [6]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
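Each `Regex` above is applied to the training log, and the first capture group becomes the metric value. A quick way to sanity-check the patterns is to run them against a sample line with Python's `re` module. The log line below is a hypothetical Keras-style progress line, not output captured from this notebook, and `extract_metrics` is a helper written only for this check.

```python
import re

metrics_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# Hypothetical log line in the style Keras prints at the end of an epoch.
sample_log_line = (
    "1/1 [==============================] - 30s 30s/step"
    " - loss: 0.6931 - accuracy: 0.5000"
    " - val_loss: 0.6920 - val_accuracy: 0.5100"
)


def extract_metrics(line, definitions):
    """Return {metric name: value} using the first match of each regex."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


print(extract_metrics(sample_log_line, metrics_definitions))

# The train:loss pattern also matches inside "val_loss: ...".
print(re.findall(metrics_definitions[0]['Regex'], sample_log_line))
```

Note that the `train:loss` and `train:accuracy` patterns are substrings of the validation patterns, so they match the validation values as well. On this sample line the first match happens to be the training value, but a stricter pattern may be worth considering if the metrics look wrong.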

## Parameters

In [7]:
epochs=1
max_seq_length = 32
learning_rate=0.00001
epsilon=0.00000001

train_batch_size=128
validation_batch_size=128
test_batch_size=128

train_steps_per_epoch=1
validation_steps=1
test_steps=1

train_instance_type='local'
train_instance_count=1
train_volume_size=1024

use_xla=True
use_amp=True

freeze_bert_layer=False

enable_sagemaker_debugger=False
enable_checkpointing=True

# input_mode='Pipe'
input_mode='File'
run_validation=True
run_test=True

In [8]:
from sagemaker.tensorflow import TensorFlow

local_estimator = TensorFlow(entry_point='tf_script_bert_tweet.py', 
#                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=sagemaker.get_execution_role(),
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
#                        train_use_spot_instances=True, # Not supported in local mode
#                        train_max_wait=7200, # Seconds to wait for spot instances to become available
#                        checkpoint_s3_uri=checkpoint_s3_uri, # Not supported in local mode
                       py_version='py3',
                       framework_version='2.1.0',
                       script_mode = True,
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_checkpointing': enable_checkpointing
                                        },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions
                      )
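In script mode, the entries of `hyperparameters` are passed to the entry point as command-line arguments of the form `--name value`, so the training script typically reads them with `argparse`. The contents of `tf_script_bert_tweet.py` are not shown in this notebook, so the following is only a sketch of how such a script might parse a few of these values. The helper `str2bool` and the defaults are assumptions. It is needed because values arrive as strings, and `bool("False")` evaluates to `True` in Python.

```python
import argparse


def str2bool(value):
    """Interpret 'True'/'False' style strings as booleans."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--epsilon", type=float, default=1e-8)
    parser.add_argument("--train_batch_size", type=int, default=128)
    parser.add_argument("--max_seq_length", type=int, default=32)
    parser.add_argument("--use_xla", type=str2bool, default=False)
    parser.add_argument("--use_amp", type=str2bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str2bool, default=False)
    parser.add_argument("--enable_checkpointing", type=str2bool, default=False)
    # Ignore arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments the container might pass to the script.
args = parse_args([
    "--epochs", "1",
    "--learning_rate", "1e-05",
    "--use_xla", "True",
    "--freeze_bert_layer", "False",
    "--validation_steps", "1",
])
print(args.epochs, args.learning_rate, args.use_xla, args.freeze_bert_layer)
```

Using `parse_known_args` keeps the sketch tolerant of hyperparameters it does not declare, such as `validation_steps` in the simulated call.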

## Specify Input Locations and Run Training

In [9]:
# S3
inputs={'train': s3_input_train_data, 
        'validation': s3_input_validation_data,
         'test': s3_input_test_data
              }

# To use local files instead, uncomment the lines below
# train_dir = 'data/output/bert/train'
# validation_dir = 'data/output/bert/validation'
# test_dir = 'data/output/bert/test'

# inputs = {'train': f'file://{train_dir}',
#           'validation': f'file://{validation_dir}',
#           'test': f'file://{test_dir}'}

local_estimator.fit(inputs)         

Creating tmpikr_h4cd_algo-1-usz92_1 ... done
Attaching to tmpikr_h4cd_algo-1-usz92_1
algo-1-usz92_1  | 2020-08-17 09:53:25,369 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-usz92_1  | 2020-08-17 09:53:25,376 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-usz92_1  | 2020-08-17 09:53:25,566 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-usz92_1  | 2020-08-17 09:53:25,580 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-usz92_1  | 2020-08-17 09:53:25,592 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-usz92_1  | 2020-08-17 09:53:25,601 sagemaker-containers INFO     Invoking user script
algo-1-usz92_1  |
algo-1-usz92_1  | Training Env:
algo-1-usz92_1  |
algo-1-usz92_1  | {
algo-1-usz92_1  |