## Manage Fine-Tuned Open-Source LLMs Using SageMaker Model Registry

In [3]:
!pip install -Uq datasets
!pip install -Uq transformers
!pip install -Uq accelerate
!pip install -Uq boto3
!pip install -Uq sagemaker

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.29.14 requires botocore==1.31.14, but you have botocore 1.31.36 which is incompatible.[0m[31m
[0m

## Setup

In [4]:
import os
import glob
import boto3
import pprint
from tqdm import tqdm
import sagemaker
from sagemaker.collection import Collection



In [5]:
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
sm_client = boto3.client('sagemaker', region_name=region)
model_collector = Collection(sagemaker_session=sagemaker_session)

## Define Parameters 

In [6]:
# define base model name
model_id = "Mikael110/llama-2-13b-guanaco-fp16" 
# define a base dataset to finetune this base model
dataset_name = "databricks/databricks-dolly-15k"

# s3 prefix
s3_key_prefix = model_id.replace('/', '-')
# model collection name
model_registry_name = s3_key_prefix
model_group_for_base = "llama-2-13b"  # all llama-2 variants are grouped under this collection

# fine-tuned variants are named <base group name>-<dataset name>
model_group_for_finetune = f"{model_group_for_base}-{dataset_name.split('/')[-1]}"
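To make the naming scheme concrete, the following sketch recomputes the derived names from the two identifiers defined above. It only uses string operations, so it runs without any AWS access; the values in the comments follow directly from those definitions.

```python
model_id = "Mikael110/llama-2-13b-guanaco-fp16"
dataset_name = "databricks/databricks-dolly-15k"

# "/" is replaced so the model id can be used as an S3 key prefix and registry name
s3_key_prefix = model_id.replace("/", "-")
model_group_for_base = "llama-2-13b"
# fine-tuned variants: <base group name>-<dataset name without the organisation>
model_group_for_finetune = f"{model_group_for_base}-{dataset_name.split('/')[-1]}"

print(s3_key_prefix)             # Mikael110-llama-2-13b-guanaco-fp16
print(model_group_for_finetune)  # llama-2-13b-databricks-dolly-15k
```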

## Prepare Dataset

In [7]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub: the first 5% is used for training, the last 5% for validation
train_dataset = load_dataset(dataset_name, split="train[:5%]")
validation_dataset = load_dataset(dataset_name, split="train[95%:]")

print(f"Training size: {len(train_dataset)} | Validation size: {len(validation_dataset)}")
print("\nTraining sample:\n")
print(train_dataset[randrange(len(train_dataset))])
print("\nValidation sample:\n")
print(validation_dataset[randrange(len(validation_dataset))])

Training size: 751 | Validation size: 751

Training sample:

{'instruction': 'What are the official languages of the United Nations?', 'context': "The official languages of the United Nations are the six languages that are used in UN meetings and in which all official UN documents are written. In the six languages, four are the official language or national language of permanent members in the Security Council, while the remaining two are used due to the large number of their speakers. In alphabetical order of the Latin alphabet, they are:\n\nArabic (Modern Standard Arabic) – official or national language of several countries in the Middle East and North Africa, and used in the Arab world.\nChinese (Mandarin Chinese in simplified Chinese characters) – official language of the People's Republic of China.\nEnglish – majority and de facto official language of the United Kingdom, the United States and Australia, and majority (de jure) official language of Canada and New Zealand. It is also

In [8]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt
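As a quick check of the prompt layout, the sketch below applies the same formatting logic to two hypothetical records (they are made up for illustration and are not taken from the dataset). The function is repeated so the block runs on its own; note how the `### Context` section is dropped when the context is empty.

```python
def format_dolly(sample):
    """Same prompt layout as the cell above, repeated so this sketch is self-contained."""
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# hypothetical records, for illustration only
with_context = {"instruction": "Name the capital.", "context": "France is in Europe.", "response": "Paris"}
without_context = {"instruction": "Say hello.", "context": "", "response": "Hello!"}

print(format_dolly(with_context))
print("---")
print(format_dolly(without_context))
```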

In [9]:
from random import randrange

print(format_dolly(train_dataset[randrange(len(train_dataset))]))

### Instruction
Why are so many US Hollywood films made in the state of Georgia?

### Answer
While film in the United States as an audience artform started in New York it moved to Los Angeles for the stated reason of longer and sunnier days.  Another part of the story was a patent dispute in 1898 between Thomas Edison and his patent for the Kinetograph.  Edison stated that he effectively had a say in how films were made and wanted royalties for their use.  Moving from New York near where Edison was based to Los Angeles was a simple way to grow the industry and distance themselves from a patent war.  

This mobility at the start of the modern film era due to costs continued through the 1960's and 1970's with westerns being shot in Spain.  The 1980's and 1990's continued this trend primarily shooting in Canada or Mexico.  At the turn of the century state legislators in the United States started provided steep tax rebates.  The practical outworking of this is studios are encouraged to hir

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [11]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
# train
train_dataset = train_dataset.map(template_dataset, remove_columns=list(train_dataset.features))
# validation
validation_dataset = validation_dataset.map(template_dataset, remove_columns=list(validation_dataset.features))
# print random sample (randint is inclusive on both ends, so the upper bound is len - 1)
print(validation_dataset[randint(0, len(validation_dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in this batch
    # (0 when the batch is shorter than one chunk, so everything is carried over)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset

# training
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

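# validation (reset the carry-over so leftover training tokens are not prepended to the validation set)
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
lm_valid_dataset = validation_dataset.map(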
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(validation_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print the number of validation samples before chunking
print(f"Total number of samples: {len(validation_dataset)}")

### Instruction
What are some fun things to do around Seattle on a warm Summer day?

### Answer
Here are few things that someone might enjoy on a warm Summer day in Seattle: walking around Seattle center, taking a trip up to the top of the Space Needle to get a unique perspective of the city, taking a ferry ride to Bainbridge Island to walk around the small town and shop at the small unique shops, visit Green Lake park just north of the city and take a stroll around the picturesque scenery, or visit Pike place market and see the many items venders have for sale.</s>
Total number of samples: 751
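The packing step can be illustrated on toy token ids. This is a minimal sketch of the same idea (concatenate, cut into fixed-size chunks, carry the leftover into the next batch), written without a global variable; the function name `pack` and the tiny `chunk_length` are assumptions chosen for readability, not part of the notebook.

```python
from itertools import chain


def pack(batch, remainder, chunk_length):
    """Concatenate tokenized samples and cut them into chunks of chunk_length.

    Returns the chunks and the leftover tokens to prepend to the next batch.
    """
    # flatten each column and prepend what was left over from the previous batch
    joined = {k: remainder.get(k, []) + list(chain(*v)) for k, v in batch.items()}
    total = len(joined["input_ids"])
    usable = (total // chunk_length) * chunk_length  # 0 if the batch is too short
    chunks = {
        k: [t[i : i + chunk_length] for i in range(0, usable, chunk_length)]
        for k, t in joined.items()
    }
    leftover = {k: t[usable:] for k, t in joined.items()}
    # for causal language modelling the labels are a copy of the inputs
    chunks["labels"] = [list(c) for c in chunks["input_ids"]]
    return chunks, leftover


batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
chunks, leftover = pack(batch, remainder={}, chunk_length=4)
print(chunks["input_ids"])    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover["input_ids"])  # [9]

# a batch shorter than one chunk yields no chunks and is carried over entirely
short = {"input_ids": [[10, 11]], "attention_mask": [[1, 1]]}
chunks2, leftover2 = pack(short, remainder=leftover, chunk_length=4)
print(chunks2["input_ids"], leftover2["input_ids"])  # [] [9, 10, 11]
```

Returning the leftover explicitly makes it easy to start from an empty carry-over for each dataset split.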


## Upload dataset to S3

In [12]:
# save train_dataset to s3
training_input_path = f's3://{default_bucket}/{s3_key_prefix}/dataset/train'
lm_train_dataset.save_to_disk(training_input_path)

print(f"saving training dataset to: {training_input_path}")

# save validation dataset to s3
validation_input_path = f's3://{default_bucket}/{s3_key_prefix}/dataset/validation'
lm_valid_dataset.save_to_disk(validation_input_path)

print(f"saving validation dataset to: {validation_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/78 [00:00<?, ? examples/s]

saving training dataset to: s3://sagemaker-us-west-2-376678947624/Mikael110-llama-2-13b-guanaco-fp16/dataset/train


Saving the dataset (0/1 shards):   0%|          | 0/71 [00:00<?, ? examples/s]

saving validation dataset to: s3://sagemaker-us-west-2-376678947624/Mikael110-llama-2-13b-guanaco-fp16/dataset/validation


## Register Base model into Model Registry

In [13]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_save_dir = f"./base_model/{model_id}"
os.makedirs(base_model_save_dir, exist_ok=True)

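# load first, then save: save_pretrained() does not return the tokenizer or model,
# so its result must not be assigned back to these variables
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(base_model_save_dir)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
model.save_pretrained(base_model_save_dir)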

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


In [14]:
import torch

# release the base model and free cached GPU memory
del model
torch.cuda.empty_cache()

## Tar and upload the model to S3

In [15]:
model_tar_filename = f"{model_id.replace('/', '-')}.tar.gz"
print(f"Model tar file name: {model_tar_filename}")

Model tar file name: Mikael110-llama-2-13b-guanaco-fp16.tar.gz


In [16]:
%%time
!cd ./base_model && tar -czvf ./{model_tar_filename} ./{model_id}

./Mikael110/llama-2-13b-guanaco-fp16/
./Mikael110/llama-2-13b-guanaco-fp16/pytorch_model-00001-of-00003.bin
./Mikael110/llama-2-13b-guanaco-fp16/tokenizer_config.json
./Mikael110/llama-2-13b-guanaco-fp16/pytorch_model-00003-of-00003.bin
./Mikael110/llama-2-13b-guanaco-fp16/config.json
./Mikael110/llama-2-13b-guanaco-fp16/pytorch_model.bin.index.json
./Mikael110/llama-2-13b-guanaco-fp16/pytorch_model-00002-of-00003.bin
./Mikael110/llama-2-13b-guanaco-fp16/special_tokens_map.json
./Mikael110/llama-2-13b-guanaco-fp16/tokenizer.json
CPU times: user 3.6 s, sys: 585 ms, total: 4.19 s
Wall time: 5min 27s


In [17]:
%%time
model_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=f"./base_model/{model_tar_filename}",
    desired_s3_uri=f's3://{default_bucket}/{s3_key_prefix}/models/base',
)
print(model_data_uri)

s3://sagemaker-us-west-2-376678947624/Mikael110-llama-2-13b-guanaco-fp16/models/base/Mikael110-llama-2-13b-guanaco-fp16.tar.gz
CPU times: user 2min 27s, sys: 2min 38s, total: 5min 6s
Wall time: 1min 53s


## Create Base Model Package Group

In [19]:
# Model Package Group Vars
base_package_group_name = model_id.replace('/', '-0-')
base_package_group_desc = "Source: https://huggingface.co/Mikael110/llama-2-13b-guanaco-fp16"
base_tags = [
    { 
        "Key": "modelType",
        "Value": "BaseModel"
    },
    { 
        "Key": "fineTuned",
        "Value": "False"
    },
    { 
        "Key": "sourceDataset",
        "Value": "None"
    }
]

model_package_group_input_dict = {
    "ModelPackageGroupName" : base_package_group_name,
    "ModelPackageGroupDescription" : base_package_group_desc,
    "Tags": base_tags
    
}
create_model_package_group_response = sm_client.create_model_package_group(
    **model_package_group_input_dict
)
print(f'Created ModelPackageGroup Arn : {create_model_package_group_response["ModelPackageGroupArn"]}')

# note: despite its name, this variable holds the group's ARN
base_model_pkg_group_name = create_model_package_group_response["ModelPackageGroupArn"]

Created ModelPackageGroup Arn : arn:aws:sagemaker:us-west-2:376678947624:model-package-group/Mikael110-0-llama-2-13b-guanaco-fp16
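Re-running the cell above typically fails once the group already exists. One possible way to make the step repeatable is to look the group up first and create it only when it is missing. The sketch below assumes the boto3 calls `list_model_package_groups` (with its `NameContains` filter) and `create_model_package_group`; the helper name `ensure_model_package_group` and the `InMemorySageMaker` class are hypothetical, the latter being a stand-in so the example runs offline.

```python
def ensure_model_package_group(sm_client, name, description, tags):
    """Return the group's ARN, creating the group only when it does not exist yet."""
    # NameContains is a substring filter, so an exact comparison is still needed.
    # Only the first page of results is inspected, which is enough for a sketch.
    existing = sm_client.list_model_package_groups(NameContains=name)
    for summary in existing["ModelPackageGroupSummaryList"]:
        if summary["ModelPackageGroupName"] == name:
            return summary["ModelPackageGroupArn"]
    created = sm_client.create_model_package_group(
        ModelPackageGroupName=name,
        ModelPackageGroupDescription=description,
        Tags=tags,
    )
    return created["ModelPackageGroupArn"]


class InMemorySageMaker:
    """Hypothetical stand-in for the boto3 SageMaker client (offline demo only)."""

    def __init__(self):
        self.groups = {}

    def list_model_package_groups(self, NameContains=""):
        matches = [
            {"ModelPackageGroupName": n, "ModelPackageGroupArn": arn}
            for n, arn in self.groups.items()
            if NameContains in n
        ]
        return {"ModelPackageGroupSummaryList": matches}

    def create_model_package_group(self, ModelPackageGroupName, ModelPackageGroupDescription, Tags):
        arn = f"arn:aws:sagemaker:region:000000000000:model-package-group/{ModelPackageGroupName}"
        self.groups[ModelPackageGroupName] = arn
        return {"ModelPackageGroupArn": arn}


client = InMemorySageMaker()
first = ensure_model_package_group(client, "demo-group", "demo description", [])
second = ensure_model_package_group(client, "demo-group", "demo description", [])
print(first == second)  # True: the second call reuses the existing group
```

In the notebook, `sm_client` would be passed in place of the stand-in.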


## Register the Base Model

In [20]:
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.28',
    pytorch_version='2.0',  
    py_version='py310',
    model_data=model_data_uri,
    role=role,
)

In [21]:
_response = huggingface_model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    transform_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    model_package_group_name=base_model_pkg_group_name,
    approval_status="Approved"
)
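Once versions are registered, a consumer usually wants the most recent approved one. The following sketch assumes the boto3 `list_model_packages` call with its `ModelPackageGroupName`, `ModelApprovalStatus`, `SortBy`, `SortOrder` and `MaxResults` parameters; the helper name and the `InMemoryRegistry` stand-in (used so the block runs without AWS access) are hypothetical.

```python
def latest_approved_package_arn(sm_client, group_name):
    """Return the ARN of the newest approved model package in a group, or None."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response["ModelPackageSummaryList"]
    return packages[0]["ModelPackageArn"] if packages else None


class InMemoryRegistry:
    """Hypothetical stand-in that mimics the filtering and sorting of the real call."""

    def __init__(self, packages):
        self.packages = packages

    def list_model_packages(self, ModelPackageGroupName, ModelApprovalStatus, SortBy, SortOrder, MaxResults):
        rows = [
            p for p in self.packages
            if p["ModelPackageGroupName"] == ModelPackageGroupName
            and p["ModelApprovalStatus"] == ModelApprovalStatus
        ]
        rows.sort(key=lambda p: p[SortBy], reverse=(SortOrder == "Descending"))
        return {"ModelPackageSummaryList": rows[:MaxResults]}


registry = InMemoryRegistry([
    {"ModelPackageGroupName": "g", "ModelApprovalStatus": "Approved", "CreationTime": 1, "ModelPackageArn": "arn/1"},
    {"ModelPackageGroupName": "g", "ModelApprovalStatus": "Approved", "CreationTime": 2, "ModelPackageArn": "arn/2"},
    {"ModelPackageGroupName": "g", "ModelApprovalStatus": "PendingManualApproval", "CreationTime": 3, "ModelPackageArn": "arn/3"},
])
print(latest_approved_package_arn(registry, "g"))        # arn/2
print(latest_approved_package_arn(registry, "missing"))  # None
```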

### Add Base Model to Model Collection

In [24]:
# create model collection
base_collection = model_collector.create(
    collection_name=model_group_for_base + '-0'
)

In [25]:
_response = model_collector.add_model_groups(
    collection_name=base_collection["Arn"], 
    model_groups=[base_model_pkg_group_name]
)

print(f"Model collection creation status: {_response}")

Model collection creation status: {'added_groups': ['arn:aws:sagemaker:us-west-2:376678947624:model-package-group/Mikael110-0-llama-2-13b-guanaco-fp16'], 'failure': []}


## Train LLM

In [38]:
from datetime import datetime
from sagemaker.huggingface import HuggingFace
from sagemaker.experiments.run import Run

# define Training Job Name 
time_suffix = datetime.now().strftime('%y%m%d%H%M')
job_name = f'huggingface-qlora-{time_suffix}'
experiments_name = f"exp-{model_id.replace('/', '-')}"
run_name = f"qlora-finetune-run-{time_suffix}"

with Run(
    experiment_name=experiments_name, 
    run_name=run_name, 
    sagemaker_session=sagemaker_session
) as run:
    # create the Estimator
    huggingface_estimator = HuggingFace(
        entry_point='finetune_llm.py',      
        source_dir='code',         
        instance_type='ml.g5.12xlarge',   
        instance_count=1,       
        role=role,
        base_job_name=job_name,          # the name of the training job
        volume_size=300,               
        transformers_version='4.28',            
        pytorch_version='2.0',             
        py_version='py310',           
        hyperparameters={
            'base_model_group_name': base_package_group_name,
            'model_id': model_id,                             
            'dataset_path': '/opt/ml/input/data/training',    
            'epochs': 1,                                      
            'per_device_train_batch_size': 2,                 
            'lr': 1e-4,
            'merge_weights':True,
            'region':region,
        },
        sagemaker_session=sagemaker_session
    )

    # starting the train job with our uploaded datasets as input
    data = {
        'training': training_input_path, 
        'validation': validation_input_path
    }
    huggingface_estimator.fit(
        data, 
        wait=True
    )
    
    run.log_parameters(data)  

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2308291858-2023-08-29-18-58-34-489


Using provided s3_resource
2023-08-29 18:58:34 Starting - Starting the training job...
2023-08-29 18:58:59 Starting - Preparing the instances for training.........
2023-08-29 19:00:10 Downloading - Downloading input data...
2023-08-29 19:01:00 Training - Downloading the training image...............
2023-08-29 19:03:11 Training - Training image download completed. Training in progress.......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-08-29 19:04:20,780 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-08-29 19:04:20,811 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-29 19:04:20,819 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-08-29 19:04:20,821 sagemaker_pytorch_container.training INFO     Invoking user training script.
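The `hyperparameters` dictionary is handed to the entry point as command-line arguments, with every value serialized as a string. The actual `code/finetune_llm.py` is not shown in this notebook, so the following is only a sketch of how such a script might read them; the argument names mirror the dictionary keys used above, while the defaults and the `str2bool` helper are assumptions.

```python
import argparse


def str2bool(value):
    """Values arrive as strings, so 'True' / 'False' have to be parsed explicitly."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--base_model_group_name", type=str, default=None)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--merge_weights", type=str2bool, default=True)
    parser.add_argument("--region", type=str, default=None)
    # ignore any extra arguments the training container may add
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--model_id", "Mikael110/llama-2-13b-guanaco-fp16",
    "--epochs", "1",
    "--lr", "0.0001",
    "--merge_weights", "False",
])
print(args.model_id, args.epochs, args.lr, args.merge_weights)
```

Parsing booleans by hand avoids the common pitfall where `type=bool` turns the string `"False"` into `True`.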

## Save Fine-Tuned Model into Model Registry

In [31]:
# Model Package Group Vars
ft_package_group_name = f"{model_id.replace('/', '--')}-finetuned"
ft_package_group_desc = "QLoRA for model Mikael110/llama-2-13b-guanaco-fp16"
ft_tags = [
    { 
        "Key": "modelType",
        "Value": "FineTunedModel"
    },
    { 
        "Key": "fineTuned",
        "Value": "True"
    },
    { 
        "Key": "sourceDataset",
        "Value": f"{dataset_name}"
    }
]

model_package_group_input_dict = {
    "ModelPackageGroupName" : ft_package_group_name,
    "ModelPackageGroupDescription" : ft_package_group_desc,
    "Tags": ft_tags
    
}
create_model_package_group_response = sm_client.create_model_package_group(
    **model_package_group_input_dict
)
print(f'Created ModelPackageGroup Arn : {create_model_package_group_response["ModelPackageGroupArn"]}')

# note: despite its name, this variable holds the group's ARN
ft_model_pkg_group_name = create_model_package_group_response["ModelPackageGroupArn"]

Created ModelPackageGroup Arn : arn:aws:sagemaker:us-west-2:376678947624:model-package-group/Mikael110--llama-2-13b-guanaco-fp16-finetuned


In [40]:
model_package = huggingface_estimator.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=[
        "ml.p2.16xlarge", 
        "ml.p3.16xlarge", 
        "ml.g4dn.4xlarge", 
        "ml.g4dn.8xlarge", 
        "ml.g4dn.12xlarge", 
        "ml.g4dn.16xlarge"
    ],
    model_package_group_name=ft_model_pkg_group_name,
    approval_status="Approved"
)

## Deploy the model

In [43]:
model_path = model_package.model_data.replace('/model.tar.gz', '')

In [44]:
!sed -i "s@option.s3url=.*@option.s3url={model_path}@g" deepspeed/serving.properties

In [45]:
!rm -rf `find . -type d -name .ipynb_checkpoints`

In [48]:
!rm -f model.tar.gz
!tar czvf model.tar.gz -C deepspeed .
s3_code_artifact_deepspeed = sagemaker_session.upload_data("model.tar.gz", default_bucket, f"{s3_key_prefix}/inference")
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")

./
./model.py
./requirements.txt
./serving.properties
S3 Code or Model tar for deepspeed uploaded to --- > s3://sagemaker-us-west-2-376678947624/Mikael110-llama-2-13b-guanaco-fp16/inference/model.tar.gz
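The `sed` substitution used before packaging can also be done in Python, which is easier to test. This is a minimal sketch: the properties text below is a hypothetical example (the real `deepspeed/serving.properties` is not shown in this notebook), and `set_s3url` is an illustrative helper name.

```python
import re


def set_s3url(properties_text, model_path):
    """Point option.s3url at model_path, mirroring the sed expression."""
    # a function is used as the replacement so model_path is inserted literally
    return re.sub(
        r"option\.s3url=.*",
        lambda _match: f"option.s3url={model_path}",
        properties_text,
    )


# hypothetical file contents, for illustration only
example = "engine=DeepSpeed\noption.s3url=PLACEHOLDER\noption.tensor_parallel_degree=4\n"
updated = set_s3url(example, "s3://my-bucket/my-prefix/output")
print(updated)
```

Only the `option.s3url` line changes; all other settings are left untouched.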


### Define the serving container
Here we define the container used for inference. We use the SageMaker Large Model Inference (LMI) container with DeepSpeed.

In [49]:
inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118


In [52]:
time_suffix = datetime.now().strftime('%y%m%d%H%M')
model_name_ds = f'{model_group_for_base}-{time_suffix}'

create_model_response = sm_client.create_model(
    ModelName=model_name_ds,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_deepspeed},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws:sagemaker:us-west-2:376678947624:model/llama-2-13b-2308292148


In [53]:
endpoint_config_name = f"{model_name_ds}-config"
endpoint_name = f"{model_name_ds}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_ds,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:376678947624:endpoint-config/llama-2-13b-2308292148-config',
 'ResponseMetadata': {'RequestId': '4be097af-1285-4f96-95c4-30aaf67b1f5e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4be097af-1285-4f96-95c4-30aaf67b1f5e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '110',
   'date': 'Tue, 29 Aug 2023 21:49:35 GMT'},
  'RetryAttempts': 0}}

In [54]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:376678947624:endpoint/llama-2-13b-2308292148-endpoint


In [55]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:376678947624:endpoint/llama-2-13b-2308292148-endpoint
Status: InService
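The polling loop above waits indefinitely and treats any non-`Creating` status as success. A slightly more defensive variant, sketched below, adds a timeout and raises if the endpoint ends in a state other than `InService`. The helper name, its defaults and the `ScriptedClient` stand-in (which replays a fixed list of statuses so the block runs offline) are assumptions for illustration.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint leaves 'Creating' or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status != "Creating":
            if status != "InService":
                reason = resp.get("FailureReason", "no reason given")
                raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
            return resp
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still creating after {timeout_seconds}s")
        time.sleep(poll_seconds)


class ScriptedClient:
    """Hypothetical stand-in that returns a scripted sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses), "EndpointArn": f"arn:demo/{EndpointName}"}


resp = wait_for_endpoint(
    ScriptedClient(["Creating", "Creating", "InService"]),
    "demo-endpoint",
    poll_seconds=0,
    timeout_seconds=5,
)
print(resp["EndpointStatus"])  # InService
```

boto3 also ships a built-in waiter, obtainable with `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when custom status printing is not needed.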


### Run Inference

Large models such as Llama 2 have a very high accelerator memory footprint, so a very large input payload or a long generated output can cause out-of-memory errors. The inference example below is calibrated to work on the ml.g5.12xlarge instance within the SageMaker response time limit of 60 seconds. If increasing the input length or generation length leads to CUDA out-of-memory errors, reduce those lengths again or move to an instance type with more accelerator memory.

In [92]:
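from random import randrange
from datasets import load_dataset

# validation_dataset was reduced to a single "text" column during templating,
# so reload the raw split to get the instruction/context/response fields back
raw_validation_dataset = load_dataset(dataset_name, split="train[95%:]")
sample = raw_validation_dataset[randrange(len(raw_validation_dataset))]

instruction = f"### Instruction\n{sample['instruction']}"
context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
response = "### Answer\n"
# join all the parts together
prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])

prompt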

'### Instruction\nWhat are some ideas to keep the mind active as I get older?\n\n### Answer\n'

In [93]:
import json
smr_client = boto3.client("sagemaker-runtime")

data = {
    "text": prompt,
    "properties": {
        "min_length": 10,
        "max_length": 100,
        "do_sample": True,
    },
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(data),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

generated_text = outputs[0]['generated_text']
generated_text

'### Instruction\nWhat are some ideas to keep the mind active as I get older?\n\n### Answer\nOne of the most common issues people face as they age is cognitive decline. This can be due to a variety of factors such as injury, disease, or simply the natural aging process. However, there are also things that you can do to keep your mind sharp and active as you age. Here are some ideas:\n\n1. Challenge yourself mentally on a'

In [94]:
ground_truth = sample['response']
print(f"GroundTruth -> {ground_truth}")

GroundTruth -> It is important to keep the mind active, and many ways to do so. Board games are a great way to engage the mind, especially games such as Scrabble and chess. Alternatively there are many mobile applications that have solo games such as Wordle and Sudoku that are good for daily mental exercises. And don't forget that reading and conversing daily is a great way to keep the mind engaged.
