Description
System Info
I am trying to deploy a pretrained Llama 3 8B model as a SageMaker endpoint on an ml.g5.2xlarge instance, and I am getting the following error:
Error:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:00<00:04, 1.29it/s]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:01<00:03, 1.31it/s]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:02<00:03, 1.30it/s]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:03<00:02, 1.29it/s]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:03<00:01, 1.30it/s]
Loading checkpoint shards: 86%|████████▌ | 6/7 [00:04<00:00, 1.30it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.35it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:05<00:00, 1.32it/s]
Error: DownloadError
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 127, in download_weights
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
After this log, I get an error that the endpoint did not pass health checks.
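For context, the failing call, download_and_unload_peft, is the step where TGI treats the repository as a PEFT adapter: it loads the base model, merges the adapter weights into it, and saves the merged model for serving. Below is a minimal sketch of reproducing that step outside the container, assuming the repo is a LoRA/PEFT adapter; the output path is a placeholder:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_id = "AkhilenderK/Nutrition_Med_Llama_V2"  # adapter repo from this report

# Load the adapter together with its base model, then merge the adapter
# weights into the base weights (roughly what download_and_unload_peft does)
model = AutoPeftModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.merge_and_unload()

# Save the merged model (and the tokenizer, if the repo ships one) so it
# can be served as a plain causal LM
model.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(model_id).save_pretrained("./merged-model")

If this reproduces the DownloadError locally, the problem is likely in the adapter repository itself rather than in the SageMaker setup.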
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
import sagemaker
import boto3
from sagemaker.huggingface import get_huggingface_llm_image_uri, HuggingFaceModel
import json
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
# retrieve the LLM image URI
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)
# SageMaker config
instance_type = "ml.p3.2xlarge"
number_of_gpu = 1
health_check_timeout = 300
# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "AkhilenderK/Nutrition_Med_Llama_V2",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
}
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800
)
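If the deployment goes through, a quick smoke test against the returned predictor looks like the sketch below; the prompt and generation parameters are placeholders:

response = llm.predict({
    "inputs": "What foods are rich in vitamin D?",
    "parameters": {"max_new_tokens": 128}
})
print(response)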
Expected behavior
The model should be deployed to the endpoint.