Llama Inference using TGI #3168

@Akhil-ender

Description

System Info

I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:

Error:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards: 86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]

Error: DownloadError

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)

After this log, I get an error that the endpoint did not pass health checks.
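For context, the traceback fails inside download_and_unload_peft, which suggests TGI is treating AkhilenderK/Nutrition_Med_Llama_V2 as a PEFT adapter repo rather than a full model. A minimal sketch of merging such an adapter into its base weights locally before deploying, assuming the repo holds a LoRA adapter trained on top of Llama 3 8B (the base model id and output path here are illustrative, not taken from the issue):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model, adjust to the adapter's actual base
adapter_id = "AkhilenderK/Nutrition_Med_Llama_V2"

# Load the base model, apply the adapter, and fold the adapter weights in
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save plain (non-PEFT) weights that TGI can load directly; upload to S3 or the Hub
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("merged-model")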

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)

# sagemaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800
)

Expected behavior

The model should be deployed to the endpoint, and the endpoint should pass its health checks.
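For reference, once the endpoint is healthy it should be callable through the predictor returned by deploy(). A minimal sketch of an invocation, assuming the standard TGI request schema (the prompt and generation parameters are illustrative):

# Query the deployed TGI endpoint via the predictor returned by deploy()
response = llm.predict({
    "inputs": "What are good dietary sources of vitamin D?",  # illustrative prompt
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(response)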
