## Use a SageMaker training job for fine-tuning

This notebook acts as a client that calls the remote (GPU/compute) resources.

Kernel requirement: conda_python3

In [None]:
# Upgrade the SageMaker Python SDK
# Restart the kernel after the install finishes
!pip install --upgrade sagemaker

Create a directory for the training script and other resources.
Its contents are uploaded to the training instances by the Estimator API call.

In [None]:
!rm -rf src
!mkdir -p src

In [None]:
# Copy instead of move, so rerunning the `rm -rf src` cell above does not delete the originals
!cp s5cmd src/
!cp alpaca_data.json src/
!cp train.py src/
!cp requirements.txt src/
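
The staging step can also be written in Python, which makes it safe to rerun. The sketch below recreates the target directory and copies the files into it, so the originals survive when `src/` is wiped and rebuilt. The helper name `stage_sources` is hypothetical, and the file list in the usage note simply mirrors the cell above.

```python
import shutil
from pathlib import Path


def stage_sources(files, dest="src"):
    """Recreate `dest` and copy each existing file in `files` into it.

    Returns the names of files that were missing, so the caller can
    decide whether to abort before launching a job.
    """
    dest_dir = Path(dest)
    # Start from a clean directory, like `rm -rf src && mkdir -p src`.
    shutil.rmtree(dest_dir, ignore_errors=True)
    dest_dir.mkdir(parents=True)

    missing = []
    for name in files:
        src = Path(name)
        if src.is_file():
            # copy2 keeps the mode bits, so an executable such as s5cmd stays executable.
            shutil.copy2(src, dest_dir / src.name)
        else:
            missing.append(name)
    return missing
```

Calling `stage_sources(["s5cmd", "alpaca_data.json", "train.py", "requirements.txt"])` should return an empty list when all four files are present in the notebook's working directory.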

General configuration for the SageMaker session

In [None]:
import sagemaker
import boto3
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()
region = sess.boto_session.region_name

Core API call: initialize the SageMaker Estimator, which acts as the client-side configuration and provisions remote compute resources on demand.

In [None]:
import time
from sagemaker.estimator import Estimator
from datetime import datetime

# Pre-built Docker images: https://github.com/aws/deep-learning-containers/blob/master/available_images.md
# Note: this URI is pinned to us-east-1; use the URI for your own region if the job runs elsewhere.
image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker'

instance_count = 1
instance_type = 'ml.g5.2xlarge'

ts_str = str(datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))
model_output_path = f's3://{sagemaker_default_bucket}/output-models/bloke-llama2-7b-qlora/{ts_str}/' 

environment = {
    # 'NODE_NUMBER':str(instance_count),
    'MODEL_S3_PATH': f's3://{sagemaker_default_bucket}/bloke-llama2-7b-fp16/*', # source model files
    'OUTPUT_MODEL_S3_PATH': model_output_path # destination
}

hyp_param = {
    'seed':99,
    'data_dir':'/opt/ml/code/alpaca_data.json', # use /opt/ml/input/data/trainabc if data source is s3
    'per_device_train_batch_size':1,
    'max_steps':20
}

estimator = Estimator(role=role,
                      entry_point='train.py',
                      source_dir='./src',
                      base_job_name='llama2-qlora-train',
                      instance_count=instance_count,
                      instance_type=instance_type,
                      image_uri=image_uri,
                      environment=environment,
                      hyperparameters=hyp_param,
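                      max_run=2*24*3600, # maximum job runtime (2 days here); raising the limit, up to 28 days, requires a quota-increase ticket
                      keep_alive_period_in_seconds=3600, # warm pool: keeps the instance and image for the next training job (rolling renewal, max 1 hour); requires a quota to be enabled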
                      disable_profiler=True,
                      debugger_hook_config=False)
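
For context on how these settings reach the script: in script mode, SageMaker passes each entry in `hyperparameters` to the entry point as a command-line argument (for example `--seed 99`) and exposes the `environment` entries as environment variables. `train.py` itself is not shown in this notebook, so the following is only a minimal sketch of how a script might read them; the function name `parse_job_config` and the default values are assumptions.

```python
import argparse
import os


def parse_job_config(argv=None, env=None):
    """Collect hyperparameters (CLI args) and S3 paths (environment variables)."""
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser()
    # Argument names mirror the keys of the `hyp_param` dict; defaults are illustrative.
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--data_dir", type=str, default="/opt/ml/code/alpaca_data.json")
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--max_steps", type=int, default=-1)
    # parse_known_args tolerates unexpected extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)

    config = vars(args)
    # Variable names match the `environment` dict passed to the Estimator.
    config["model_s3_path"] = env.get("MODEL_S3_PATH")
    config["output_model_s3_path"] = env.get("OUTPUT_MODEL_S3_PATH")
    return config
```

Keeping the S3 locations in environment variables and the tuning knobs in hyperparameters, as the Estimator call does, lets the script separate "where the files are" from "how to train".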


Core API call: trigger the actual training job configured above.

The S3 path of the training data can be passed in as an input channel. In that case, the corresponding path inside the training container must be adjusted as well (the `data_dir` hyperparameter defined above).

In [None]:
# data in channel will be automatically copied to training node, e.g. /opt/ml/input/data/trainabc
# input_channel = {'trainabc': 's3://<s3_bucket>/datasets/cn_alpaca_jsonline_data/'}
# estimator.fit(input_channel)

estimator.fit()
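
When a channel is passed to `fit`, SageMaker downloads it to `/opt/ml/input/data/<channel_name>` on the training instance, and the SageMaker training toolkit also exposes that directory through an `SM_CHANNEL_<CHANNEL_NAME>` environment variable (name upper-cased). A hedged sketch of resolving the data location inside the script follows; `resolve_data_dir` is a hypothetical helper, and `trainabc` is just the channel name used in the commented-out example above.

```python
import os


def resolve_data_dir(channel, default, env=None):
    """Return the channel's local directory if the channel was provided, else `default`."""
    env = os.environ if env is None else env
    # The training toolkit sets SM_CHANNEL_<NAME> for every input channel.
    return env.get(f"SM_CHANNEL_{channel.upper()}", default)
```

With no channel attached, `resolve_data_dir("trainabc", "/opt/ml/code/alpaca_data.json")` falls back to the file bundled in `src/`, which matches the plain `estimator.fit()` call.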

After the training job finishes, manually copy the S3 path where the tuned model is saved; it is used later in the model hosting process.

In [None]:
# Copy the model output path to LMI option.s3url
print('PATH for LMI inference option.s3url:')
print(model_output_path)

In [None]:
!aws s3 ls {model_output_path}
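
The same check can be done from Python, which is handy when the path should be validated before deployment. This is a sketch under assumptions: `split_s3_uri` and `list_model_files` are hypothetical helpers, and the listing call needs credentials with read access to the bucket.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix/' into ('bucket', 'key/prefix/')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def list_model_files(uri):
    """Return the object keys stored under an S3 URI."""
    import boto3  # imported lazily so split_s3_uri works without AWS libraries

    bucket, prefix = split_s3_uri(uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

In the notebook, `list_model_files(model_output_path)` would return the uploaded keys; an empty list suggests the job did not write anything to that prefix.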