Llama Inference using TGI #3168

@Akhil-ender

Description

System Info

I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:

Error:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards: 86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]

Error: DownloadError

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)

After this log, I get an error that the endpoint did not pass health checks.
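For context, the traceback fails inside download_and_unload_peft, which suggests TGI is treating AkhilenderK/Nutrition_Med_Llama_V2 as a PEFT adapter repo rather than a full model. A minimal sketch of merging such an adapter into its base weights locally before deploying, assuming the repo holds a LoRA adapter trained on top of Llama 3 8B (the base model id and output path here are illustrative, not taken from the issue):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model, adjust to the adapter's actual base
adapter_id = "AkhilenderK/Nutrition_Med_Llama_V2"

# Load the base model, apply the adapter, and fold the adapter weights in
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save plain (non-PEFT) weights that TGI can load directly; upload to S3 or the Hub
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(adapter_id).save_pretrained("merged-model")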

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)

# sagemaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800
)

Expected behavior

The model should be deployed to the endpoint, and the endpoint should pass its health checks.
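For reference, once the endpoint is healthy it should be callable through the predictor returned by deploy(). A minimal sketch of an invocation, assuming the standard TGI request schema (the prompt and generation parameters are illustrative):

# Query the deployed TGI endpoint via the predictor returned by deploy()
response = llm.predict({
    "inputs": "What are good dietary sources of vitamin D?",  # illustrative prompt
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
})
print(response)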
