## Fine-tuning large foundation models with Axolotl

[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.

Features:

- Train various Hugging Face models such as Llama, Pythia, Falcon, and MPT
- Supports full fine-tuning, LoRA, QLoRA, ReLoRA, and GPTQ
- Customize configurations using a simple YAML file or CLI overrides
- Load different dataset formats, use custom formats, or bring your own tokenized datasets
- Integrates with xFormers, Flash Attention, RoPE scaling, and multipacking
- Works with a single GPU or multiple GPUs via FSDP or DeepSpeed
- Easily run with Docker locally or on the cloud

In [2]:
%pip install -Uq sagemaker
%pip install -Uq datasets

Note: you may need to restart the kernel to use updated packages.


In [31]:
import boto3
import sagemaker
import json
from sagemaker import Model, image_uris, serializers, deserializers
import time
from pathlib import Path
from utils import download_model

boto3_session=boto3.session.Session(region_name="us-east-1")
# boto3_session=boto3.session.Session()

smr = boto3_session.client("sagemaker-runtime") # sagemaker runtime client for invoking the endpoint
sm = boto3_session.client("sagemaker") 
s3_rsr = boto3_session.resource("s3")
role = sagemaker.get_execution_role()  

sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # default S3 bucket for this SageMaker session
region = sess.boto_region_name  # AWS region of the session

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


### Download Model

In [32]:
# download the model from the Hugging Face Hub (skipped if it already exists locally)
local_model_path = download_model("TheBloke/Llama-2-13B-fp16", "./Llama2-13B")

Model already exists at Llama2-13B
Skipping download
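
The `download_model` helper is imported from a local `utils` module that this notebook does not show. The sketch below is a guess at what such a helper might look like, assuming it wraps `huggingface_hub.snapshot_download` and skips the download when the target directory already contains files. The function body is hypothetical; only the name, the call signature, and the printed messages come from the cell above.

```python
from pathlib import Path


def download_model(repo_id: str, local_dir: str) -> Path:
    """Hypothetical stand-in for utils.download_model (the real helper is not shown)."""
    target = Path(local_dir)
    # Skip the download when the directory already holds files.
    if target.is_dir() and any(target.iterdir()):
        print(f"Model already exists at {target}")
        print("Skipping download")
        return target
    # Imported lazily so the skip path works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

Returning a `Path` matters here because a later cell calls `local_model_path.as_posix()` when uploading to S3.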


In [33]:
# check if the model has already been uploaded to the S3 bucket. If not, upload it.
model_prefix = "Llama2-13B"

if list(s3_rsr.Bucket(bucket).objects.filter(Prefix=model_prefix)):
    print("Model already exists on the S3 bucket")
    print(f"If you want to upload a new model, please delete the existing model from the S3 bucket with the following command: \n !aws s3 rm --recursive s3://{bucket}/{model_prefix}")
    s3_model_location = f"s3://{bucket}/{model_prefix}"
else:
    s3_model_location = sess.upload_data(path=local_model_path.as_posix(), bucket=bucket, key_prefix=model_prefix)

Model already exists on the S3 bucket
If you want to upload a new model, please delete the existing model from the S3 bucket with the following command: 
 !aws s3 rm --recursive s3://sagemaker-us-east-1-376678947624/Llama2-13B
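
The existence check above builds a full list of every object under the prefix, which can be slow for a model with many shards. A lighter variant, sketched here under the assumption that the argument is a boto3 `Bucket` resource, asks for at most one matching key. The helper name `prefix_exists` is made up for this example.

```python
def prefix_exists(bucket_resource, prefix: str) -> bool:
    """Return True if at least one object key starts with `prefix`."""
    # limit(1) stops after the first match instead of listing every object.
    for _ in bucket_resource.objects.filter(Prefix=prefix).limit(1):
        return True
    return False
```

In this notebook it could be called as `prefix_exists(s3_rsr.Bucket(bucket), model_prefix)`. Note that S3 prefixes are plain string matches, so `"Llama2-13B"` would also match a key such as `Llama2-13B-chat/...`; a trailing slash makes the check stricter.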


### Download Data and upload to S3

In [34]:
import datasets

# download the training data garage-bAInd/Open-Platypus using the Hugging Face datasets library and save it as JSON Lines
dataset = datasets.load_dataset("garage-bAInd/Open-Platypus")
print(dataset)

data_path = Path("data")
data_path.mkdir(exist_ok=True)

dataset["train"].to_pandas().to_json("data/Open-Platypus.json", orient="records", lines=True)
s3_data = sess.upload_data(path="data/Open-Platypus.json", bucket=bucket, key_prefix="data")

print(f"Uploaded training data file to {s3_data}")

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction', 'data_source'],
        num_rows: 24926
    })
})
Uploaded training data file to s3://sagemaker-us-east-1-376678947624/data/Open-Platypus.json
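
Because the file is written with `orient="records", lines=True`, each line should be one standalone JSON object with the four features printed above. A quick sanity check before launching a multi-hour job might look like the following sketch; the helper name `check_jsonl` and the 100-row sample size are arbitrary choices for illustration.

```python
import json
from pathlib import Path

# Feature names taken from the DatasetDict printed above.
EXPECTED_FIELDS = {"input", "output", "instruction", "data_source"}


def check_jsonl(path, expected_fields=EXPECTED_FIELDS, max_rows=100):
    """Parse up to `max_rows` lines and return how many were checked."""
    checked = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if checked >= max_rows:
                break
            record = json.loads(line)  # raises if a line is not valid JSON
            missing = expected_fields - record.keys()
            if missing:
                raise ValueError(f"row {checked} is missing fields: {sorted(missing)}")
            checked += 1
    return checked
```

Calling `check_jsonl("data/Open-Platypus.json")` returns the number of rows inspected, or raises on the first malformed row.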


In [35]:
!aws s3 ls $s3_data

2023-10-25 23:39:35   32640413 Open-Platypus.json


In [36]:
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig
import time

str_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())

tb_output_config = TensorBoardOutputConfig(s3_output_path=f"s3://{bucket}/llama7b-pixiu/tensorboard/{str_time}",
    container_local_output_path="/opt/ml/output/tensorboard")

hyperparameters = {
    "config": "llama2-13b-qlora.yml",
    "deepspeed": "axolotl/deepspeed/zero2.json"
}


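estimator = PyTorch(
    source_dir="src",  # local directory uploaded to the training container
    entry_point="axolotl/src/axolotl/cli/train.py",  # Axolotl training CLI, relative to source_dir
    sagemaker_session=sess,
    role=role,
    instance_count=2,  # number of training instances
    hyperparameters=hyperparameters,
    instance_type="ml.g5.12xlarge",  # 4 NVIDIA A10G GPUs per instance
    framework_version="2.0.1",
    py_version="py310",
    disable_profiler=True,
    max_run=60*60*24*2,  # maximum training time in seconds (2 days)
    keep_alive_period_in_seconds=3600,  # keep instances warm for 1 hour after the job ends
    tensorboard_output_config=tb_output_config,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp",
        "LIBRARY_PATH": "/opt/conda/lib/",
        "TRANSFORMERS_CACHE": "/tmp",
        "NCCL_P2P_LEVEL": "NVL",
    },
    distribution={"torch_distributed": {"enabled": True}},  # launch with torchrun across all GPUs
)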
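
The `config` hyperparameter points at `llama2-13b-qlora.yml`, which is not included in this notebook. The sketch below writes a plausible QLoRA config so the moving parts are visible. Every value in it is an assumption chosen for illustration, not the notebook's actual settings. The paths rely on the SageMaker convention that an input channel named `X` is mounted at `/opt/ml/input/data/X` and that the contents of `/opt/ml/model` are uploaded to S3 when the job ends; the exact location of the dataset file inside the `train` channel should be verified in the job logs.

```python
from pathlib import Path

# Every value below is an illustrative assumption, not the notebook's actual config.
QLORA_CONFIG = """\
base_model: /opt/ml/input/data/model   # SageMaker mounts the "model" channel here
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: /opt/ml/input/data/train/Open-Platypus.json   # "train" channel
    ds_type: json
    type: alpaca
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0002
bf16: true
output_dir: /opt/ml/model   # contents are uploaded to S3 when the job ends
"""


def write_config(path):
    """Write the sketch config to `path` and return it as a Path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(QLORA_CONFIG, encoding="utf-8")
    return out
```

Since the entry point runs from `source_dir`, the file would presumably live at `src/llama2-13b-qlora.yml`, for example via `write_config("src/llama2-13b-qlora.yml")`. The `alpaca` dataset type is used here because Open-Platypus rows carry `instruction`, `input`, and `output` fields.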

In [None]:
estimator.fit({"model": s3_model_location, "train": s3_data})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker:Creating training-job with name: pytorch-training-2023-10-25-23-39-39-073


2023-10-25 23:39:41 Starting - Starting the training job...
2023-10-25 23:40:05 Starting - Preparing the instances for training.........
2023-10-25 23:41:18 Downloading - Downloading input data.....................
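
`estimator.fit` streams logs until the job finishes, but a run of this size can outlive the notebook kernel. One way to pick the job up again later is to poll the SageMaker `DescribeTrainingJob` API, as in the sketch below. The helper name `wait_for_training_job` and the 60-second poll interval are assumptions made for this example; the client is passed in so that the `sm` client created earlier can be reused.

```python
import time

# Terminal values of TrainingJobStatus.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll DescribeTrainingJob until the job reaches a terminal state.

    Returns (status, model_artifact_uri); the URI is None if no artifact is reported.
    """
    while True:
        desc = sm_client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status, desc.get("ModelArtifacts", {}).get("S3ModelArtifacts")
        sleep(poll_seconds)
```

For the job started above, this could be called as `wait_for_training_job(sm, "pytorch-training-2023-10-25-23-39-39-073")`, or with `estimator.latest_training_job.name` while the estimator object is still in memory.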