# Fine-tune GPTJ on SageMaker Training with DeepSpeed

In this notebook we will fine-tune GPTJ on the processed [quotes dataset](https://www.kaggle.com/datasets/akmittal/quotes-dataset). This notebook makes use of [this GitHub repo](https://github.com/Xirider/finetune-gpt2xl), whose Dockerfile has been adapted to be compatible with SageMaker.

The dataset uses the same format we want to use at inference time:

`<Category>: <AIGeneratedQuote>`

Example: 
*We give GPTJ a category and it must generate a quote*

`love: <AIGeneratedQuote>`

Take a look at the quote dataset we will be using in this notebook. 
1. [train.csv](https://raw.githubusercontent.com/marckarp/amazon-sagemaker-fine-tune-gptj/main/Finetune_GPTNEO_GPTJ6B/quotes_dataset/train.csv)
2. [validation.csv](https://raw.githubusercontent.com/marckarp/amazon-sagemaker-fine-tune-gptj/main/Finetune_GPTNEO_GPTJ6B/quotes_dataset/validation.csv)

To use your own data, create train and validation files in the same format for the task you want to accomplish.

For more details on preparing a dataset please see [this link](https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B/tree/main/finetuning_repo#preparing-a-dataset).
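
As a minimal sketch of preparing your own data in the `<Category>: <Quote>` format, the snippet below writes category/quote pairs to CSV. The single `text` column header, the helper names, and the sample rows are assumptions made for illustration; check the dataset preparation guide linked above for the exact layout the training script expects.

```python
import csv
import io


def format_example(category: str, quote: str) -> str:
    """Join a category and a quote into the '<Category>: <Quote>' format."""
    return f"{category.strip()}: {quote.strip()}"


def write_dataset(rows, handle) -> int:
    """Write (category, quote) pairs as one-column CSV rows; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(["text"])  # Assumed column name, not confirmed by this notebook.
    count = 0
    for category, quote in rows:
        writer.writerow([format_example(category, quote)])
        count += 1
    return count


# Hypothetical sample rows, used only to show the output format.
sample_rows = [
    ("love", "Love is patient."),
    ("life", "Life rewards the curious."),
]

# An in-memory buffer stands in for a real file such as train.csv.
buffer = io.StringIO()
written = write_dataset(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open("train.csv", "w", newline="")` and pass the handle instead of the in-memory buffer.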

In [None]:
!pip install sagemaker boto3 --upgrade

## Build & Push the container for SageMaker Training

To fine-tune GPTJ we need a Docker container with DeepSpeed installed. 
The Dockerfile is adapted from [this repo](https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B/blob/main/Dockerfile) to be compatible with SageMaker. 

Below we define the DeepSpeed CLI command that runs inside our SageMaker Training Job. It is parameterized with environment variables so that we can tune and customize the parameters when we kick off a SageMaker Training Job. 
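
Because the script substitutes environment variables directly into the command line, a variable that is not set expands to an empty string and changes the arguments that `run_clm.py` receives. The following is a hedged sketch of a pre-flight check you could run on the environment dictionary before creating a Training Job. The variable names are the ones the train script reads; the helper name and the example dictionary are hypothetical.

```python
# Variable names read by the deepspeed command in the train script.
REQUIRED_VARS = [
    "num_gpus", "deepspeed", "evaluation_strategy", "output_dir",
    "num_train_epochs", "eval_steps", "gradient_accumulation_steps",
    "per_device_train_batch_size", "use_fast_tokenizer", "learning_rate",
    "warmup_steps", "save_total_limit", "save_steps", "save_strategy",
    "tokenizer_name", "load_best_model_at_end", "block_size", "weight_decay",
]


def missing_variables(environment: dict) -> list:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not environment.get(name)]


# Hypothetical, deliberately incomplete environment to show the check.
example_env = {"num_gpus": "4", "deepspeed": "ds_config_stage3.json"}
print(missing_variables(example_env))
```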



In [None]:
%%writefile ../Finetune_GPTNEO_GPTJ6B/train
#!/bin/bash

df -h
cd finetuning_repo

deepspeed --num_gpus=$num_gpus run_clm.py \
    --deepspeed $deepspeed \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --train_file /opt/ml/input/data/train/train.csv \
    --validation_file /opt/ml/input/data/validation/validation.csv \
    --do_train \
    --do_eval \
    --fp16 \
    --overwrite_cache \
    --evaluation_strategy=$evaluation_strategy \
    --output_dir $output_dir \
    --num_train_epochs $num_train_epochs \
    --eval_steps $eval_steps \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --per_device_train_batch_size $per_device_train_batch_size \
    --use_fast_tokenizer $use_fast_tokenizer \
    --learning_rate $learning_rate \
    --warmup_steps $warmup_steps \
    --save_total_limit $save_total_limit \
    --save_steps $save_steps \
    --save_strategy $save_strategy \
    --tokenizer_name $tokenizer_name \
    --load_best_model_at_end=$load_best_model_at_end \
    --block_size=$block_size \
    --weight_decay=$weight_decay

We set the train file we created above as executable. Once we have all our files ready we can build and push our image to ECR. 

In [None]:
%%sh 

cd ../Finetune_GPTNEO_GPTJ6B
chmod +x train


In [None]:
%%sh
cd ../Finetune_GPTNEO_GPTJ6B
./build_push_image.sh

## SageMaker Training

Once the image has been pushed to ECR we can kick off the Training Job. First we need to create a SageMaker Estimator object, which contains the information required to start a Training Job. 

There are a number of parameters you can tune. Set `num_gpus` to the number of GPUs available on the instance type you choose. The DeepSpeed configuration file is also parameterized. 

There are three options to choose from:
1. ds_config_stage1.json
2. ds_config_stage2.json
3. ds_config_stage3.json

See the [DeepSpeed section](https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B/tree/main/finetuning_repo#deepspeed) of the fine-tuning repo for details.


Training and fine-tuning a model is an experimental science. Expect to try different values for hyperparameters such as the learning rate and weight decay.
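
When tuning, it helps to keep track of the effective (global) batch size, which in data-parallel training is the per-device batch size multiplied by the number of GPUs and the gradient accumulation steps. The sketch below computes it for the values this notebook passes to the Training Job; the function name is illustrative only.

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Number of samples that contribute to each optimizer step."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps


# Values from this notebook: num_gpus=4, per_device_train_batch_size=4,
# gradient_accumulation_steps=1.
global_batch = effective_batch_size(4, 4, 1)
print(global_batch)  # 16
```

If you change the instance type or the accumulation steps, this number changes too, and the learning rate may need to be revisited.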

The Training Job is also configured to emit metrics such as `eval_loss`. The regular expressions used to extract these metrics from the container logs are specified in `metric_definitions`. You can monitor these metrics and stop the Training Job if the loss stops improving.
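
To make the metric extraction concrete, the sketch below applies the same `eval_loss` pattern to a few log lines and then checks whether the loss has stopped improving. The log lines are hypothetical examples shaped like a dictionary printed by the training script, and `has_stalled` with its `patience` parameter is an illustrative helper, not part of SageMaker.

```python
import re

# Same pattern as the 'eval:loss' entry in metric_definitions.
EVAL_LOSS_PATTERN = re.compile(r"'eval_loss': ([0-9]+\.[0-9]+)")


def extract_eval_losses(log_lines):
    """Collect every eval_loss value found in the given log lines."""
    losses = []
    for line in log_lines:
        match = EVAL_LOSS_PATTERN.search(line)
        if match:
            losses.append(float(match.group(1)))
    return losses


def has_stalled(losses, patience=3):
    """True if none of the last `patience` values beat the best earlier value."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) >= best_before


# Hypothetical log lines, for illustration only.
sample_logs = [
    "{'eval_loss': 2.9141, 'eval_runtime': 10.52, 'epoch': 1.0}",
    "{'loss': 2.7000, 'learning_rate': 5e-06, 'epoch': 1.5}",
    "{'eval_loss': 2.6502, 'eval_runtime': 10.48, 'epoch': 2.0}",
    "{'eval_loss': 2.6610, 'eval_runtime': 10.50, 'epoch': 3.0}",
    "{'eval_loss': 2.6733, 'eval_runtime': 10.47, 'epoch': 4.0}",
    "{'eval_loss': 2.6801, 'eval_runtime': 10.51, 'epoch': 5.0}",
]

losses = extract_eval_losses(sample_logs)
print(losses)
print(has_stalled(losses))
```

Note that the pattern only matches values written with a decimal point, so a loss logged as an integer would not be captured.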



In [12]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker import local  # Only needed for the commented-out local mode below

sagemaker_session = sagemaker.Session()

# local_sagemaker_session = local.LocalSession()

role = get_execution_role()

account = sagemaker_session.boto_session.client('sts').get_caller_identity()['Account']
region = sagemaker_session.boto_session.region_name

# ECR URI of the training image
image = '{}.dkr.ecr.{}.amazonaws.com/gptj-finetune:latest'.format(account, region)

bucket = sagemaker_session.default_bucket()  # Use the session's default S3 bucket
prefix = 'DEMO-fine-tune-GPTJ'

sm_model = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    # instance_type='local_gpu',
    # sagemaker_session=local_sagemaker_session,
    sagemaker_session=sagemaker_session,
    # instance_type='ml.g5.48xlarge',
    instance_type="ml.g5.12xlarge",
    # These values are read as environment variables by the train script
    environment={
        "num_gpus": "4",
        "deepspeed": "ds_config_stage3.json",
        "evaluation_strategy": "steps",
        "output_dir": "/opt/ml/checkpoints/",
        "num_train_epochs": "12",
        "eval_steps": "20",
        "gradient_accumulation_steps": "1",
        "per_device_train_batch_size": "4",
        "use_fast_tokenizer": "False",
        "learning_rate": "5e-06",
        "warmup_steps": "10",
        "save_total_limit": "1",
        "save_steps": "20",
        "save_strategy": "steps",
        "tokenizer_name": "gpt2",
        "load_best_model_at_end": "True",
        "block_size": "2048",
        "weight_decay": "0.1",
    },
    checkpoint_s3_uri=f"s3://{bucket}/fine-tune-GPTJ/first-run/checkpoint/",
    output_path=f"s3://{bucket}/fine-tune-GPTJ/first-run/",
    # Raw strings keep the regex backslashes intact
    metric_definitions=[
        {'Name': 'eval:loss', 'Regex': r"'eval_loss': ([0-9]+\.[0-9]+)"},
        {'Name': 'eval:runtime', 'Regex': r"'eval_runtime': ([0-9]+\.[0-9]+)"},
        {'Name': 'eval:samples_per_second', 'Regex': r"'eval_samples_per_second': ([0-9]+\.[0-9]+)"},
        {'Name': 'eval:eval_steps_per_second', 'Regex': r"'eval_steps_per_second': ([0-9]+\.[0-9]+)"},
    ],
)


INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


We fine-tune GPTJ on the processed quotes dataset, so we first upload the train and validation sets to S3. 

In [13]:
train_s3=f"s3://{bucket}/fine-tune-GPTJ/datasets/train/train.csv"
val_s3=f"s3://{bucket}/fine-tune-GPTJ/datasets/validation/validation.csv"


!aws s3 cp ../Finetune_GPTNEO_GPTJ6B/quotes_dataset/train.csv $train_s3
!aws s3 cp ../Finetune_GPTNEO_GPTJ6B/quotes_dataset/validation.csv $val_s3


upload: ../Finetune_GPTNEO_GPTJ6B/quotes_dataset/train.csv to s3://sagemaker-us-east-1-171503325295/fine-tune-GPTJ/datasets/train/train.csv
upload: ../Finetune_GPTNEO_GPTJ6B/quotes_dataset/validation.csv to s3://sagemaker-us-east-1-171503325295/fine-tune-GPTJ/datasets/validation/validation.csv


The container we built expects two datasets: a train dataset and a validation dataset. Here we set the training input channels "train" and "validation", each pointing to its respective S3 location for SageMaker to use during training. SageMaker handles downloading the datasets from S3 to the container on our behalf. 
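
SageMaker makes each input channel available inside the container under `/opt/ml/input/data/<channel name>/`, which is why the train script reads `/opt/ml/input/data/train/train.csv` and `/opt/ml/input/data/validation/validation.csv`. A small sketch of that mapping follows; the helper name is hypothetical.

```python
from pathlib import PurePosixPath

# Directory under which SageMaker places the data for each input channel.
INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_file_path(channel: str, filename: str) -> str:
    """Return the in-container path of a file downloaded for a channel."""
    return str(INPUT_ROOT / channel / filename)


print(channel_file_path("train", "train.csv"))
print(channel_file_path("validation", "validation.csv"))
```

If you rename a channel in the dictionary passed to `.fit()`, the paths in the train script need to change to match.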

Finally, we kick off the job with the `.fit()` method. 

In [14]:
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    train_s3, content_type="csv"
)
validation_input = TrainingInput(
    val_s3, content_type="csv"
)

sm_model.fit({"train": train_input, "validation": validation_input}, wait=False)


INFO:sagemaker:Creating training-job with name: gptj-finetune-2023-02-10-20-31-13-476
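
Because `.fit()` is called with `wait=False`, the call returns as soon as the job is created and the notebook does not block while training runs. One way to check on the job later is to poll its status with the SageMaker `describe_training_job` API. The sketch below shows that loop; the function name, the polling defaults, and the fake client (included only so the example runs without AWS access) are assumptions.

```python
import time

# Statuses after which a SageMaker Training Job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, max_polls=720):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


class FakeSageMakerClient:
    """Stand-in for boto3.client('sagemaker'); returns a scripted status sequence."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {
            "TrainingJobName": TrainingJobName,
            "TrainingJobStatus": next(self._statuses),
        }


fake_client = FakeSageMakerClient(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(fake_client, "example-job", poll_seconds=0)
print(final_status)
```

With real credentials you would pass `boto3.client("sagemaker")` and the job name from the log line above in place of the fake client and `example-job`.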


In [8]:
s3_checkpoints = f"s3://{bucket}/fine-tune-GPTJ/first-run/checkpoint/"  # Same location as checkpoint_s3_uri
!aws s3 ls $s3_checkpoints

                           PRE checkpoint-120/
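
The listing shows a prefix named `checkpoint-<step>`, so `checkpoint-120/` corresponds to training step 120, a multiple of the `save_steps` value of 20. If several checkpoint prefixes are present, a sketch like the following picks the one with the highest step. The helper name is hypothetical, and the sample list is made up apart from `checkpoint-120/`.

```python
import re

# Matches names such as 'checkpoint-120/' and captures the step number.
CHECKPOINT_PATTERN = re.compile(r"checkpoint-(\d+)/?$")


def latest_checkpoint(prefixes):
    """Return the prefix with the highest step number, or None if none match."""
    best_step = -1
    best_prefix = None
    for prefix in prefixes:
        name = prefix.strip()
        match = CHECKPOINT_PATTERN.search(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_prefix = name
    return best_prefix


# Hypothetical listing; only 'checkpoint-120/' appears in the output above.
sample_prefixes = ["checkpoint-20/", "checkpoint-120/", "checkpoint-100/"]
newest = latest_checkpoint(sample_prefixes)
print(newest)
```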
