ModelError while deploying FlanT5-xl #21402

Closed · 2 of 4 tasks
RonLek opened this issue Feb 1, 2023 · 18 comments

@RonLek
RonLek commented Feb 1, 2023

System Info

transformers_version==4.17.0
Platform = SageMaker Notebook
python==3.9.0

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Amazon SageMaker deployment script in AWS for flan-t5-xl:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'google/flan-t5-xl',
	'HF_TASK':'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	transformers_version='4.17.0',
	pytorch_version='1.10.2',
	py_version='py38',
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
	'inputs': "The answer to the universe is"
})

Results in

---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
/tmp/ipykernel_20116/1338286066.py in <cell line: 26>()
     24 )
     25 
---> 26 predictor.predict({
     27         'inputs': "The answer to the universe is"
     28 })

~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
    159             data, initial_args, target_model, target_variant, inference_id
    160         )
--> 161         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    162         return self._handle_response(response)
    163 

~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    528                 )
    529             # The "self" in this scope is referring to the BaseClient.
--> 530             return self._make_api_call(operation_name, kwargs)
    531 
    532         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    958             error_code = parsed_response.get("Error", {}).get("Code")
    959             error_class = self.exceptions.from_code(error_code)
--> 960             raise error_class(parsed_response, operation_name)
    961         else:
    962             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}
"

From an existing issue, I suspected this might be due to the use of transformers==4.17.0; however, when I use the exact same script to deploy the flan-t5-large model, it works without any issues.

Expected behavior

The model should be deployed on AWS SageMaker without any issues.

@younesbelkada (Contributor)

younesbelkada commented Feb 1, 2023

Hello @RonLek

Thanks for the issue!
Note that starting with flan-t5-xl, the model weights are sharded.
Loading sharded weights was only added after transformers==4.17.0 (specifically in transformers==4.18.0: https://github.com/huggingface/transformers/releases/tag/v4.18.0), so I think the fix should be to update the transformers version to a more recent one, e.g. 4.25.0 or 4.26.0.
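
As a quick local sanity check (a minimal sketch of my own, not part of this comment; it assumes a recent transformers release and enough RAM to hold the full-precision flan-t5-xl weights):

# Sketch: confirm that the sharded flan-t5-xl checkpoint loads once
# transformers >= 4.18.0 is installed.
# pip install "transformers>=4.18.0" torch sentencepiece
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# With >= 4.18.0 the sharded pytorch_model-*.bin files are fetched and assembled automatically.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The answer to the universe is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))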

@valentinboyanov

valentinboyanov commented Feb 1, 2023

Hi @younesbelkada and @RonLek ! I have the same issue deploying google/flan-t5-xxl on SageMaker.

I've tried updating to transformers==4.26.0 by providing code/requirements.txt through s3://sagemaker-eu-north-1-***/model.tar.gz:

# Hub Model configuration. https://huggingface.co/models
hub: dict = {"HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text2text-generation"}

# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
    env=hub,
    role=role,
)

Observing the AWS logs I can see that transformers==4.26.0 was installed:

This is an experimental beta features, which allows downloading model from the Hugging Face Hub on start up. It loads the model defined in the env var `HF_MODEL_ID`
/opt/conda/lib/python3.8/site-packages/huggingface_hub/file_download.py:588: FutureWarning: `cached_download` is the legacy way to download files from the HF hub, please consider upgrading to `hf_hub_download`  warnings.warn(
#015Downloading:   0%\|          \| 0.00/11.0k [00:00<?, ?B/s]#015Downloading: 100%\|██████████\| 11.0k/11.0k [00:00<00:00, 5.49MB/s]
#015Downloading:   0%\|          \| 0.00/674 [00:00<?, ?B/s]#015Downloading: 100%\|██████████\| 674/674 [00:00<00:00, 663kB/s]
#015Downloading:   0%\|          \| 0.00/2.20k [00:00<?, ?B/s]#015Downloading: 100%\|██████████\| 2.20k/2.20k [00:00<00:00, 2.24MB/s]
#015Downloading:   0%\|          \| 0.00/792k [00:00<?, ?B/s]#015Downloading: 100%\|██████████\| 792k/792k [00:00<00:00, 43.5MB/s]
#015Downloading:   0%\|          \| 0.00/2.42M [00:00<?, ?B/s]#015Downloading:   0%\|          \| 4.10k/2.42M [00:00<01:04, 37.5kB/s]#015Downloading:   1%\|          \| 28.7k/2.42M [00:00<00:16, 147kB/s] #015Downloading:   4%\|▎         \| 86.0k/2.42M [00:00<00:07, 318kB/s]#015Downloading:   9%\|▊         \| 209k/2.42M [00:00<00:03, 633kB/s] #015Downloading:  18%\|█▊        \| 438k/2.42M [00:00<00:01, 1.16MB/s]#015Downloading:  37%\|███▋      \| 897k/2.42M [00:00<00:00, 2.18MB/s]#015Downloading:  76%\|███████▌  \| 1.83M/2.42M [00:00<00:00, 4.24MB/s]#015Downloading: 100%\|██████████\| 2.42M/2.42M [00:00<00:00, 3.12MB/s]
#015Downloading:   0%\|          \| 0.00/2.54k [00:00<?, ?B/s]#015Downloading: 100%\|██████████\| 2.54k/2.54k [00:00<00:00, 2.62MB/s]
WARNING - Overwriting /.sagemaker/mms/models/google__flan-t5-xxl ...
Collecting transformers==4.26.0  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 65.9 MB/s eta 0:00:00
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.28.1)
Collecting huggingface-hub<1.0,>=0.11.0  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 kB 46.0 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.23.3)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (0.13.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (21.3)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (6.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.8.0)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.13)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.3.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.0.9)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.26.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.24)
Installing collected packages: huggingface-hub, transformers  Attempting uninstall: huggingface-hub    Found existing installation: huggingface-hub 0.10.0    Uninstalling huggingface-hub-0.10.0:      Successfully uninstalled huggingface-hub-0.10.0  Attempting uninstall: transformers    Found existing installation: transformers 4.17.0    Uninstalling transformers-4.17.0:      Successfully uninstalled transformers-4.17.0
Successfully installed huggingface-hub-0.12.0 transformers-4.26.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip available: 22.2.2 -> 23.0
[notice] To update, run: pip install --upgrade pip
Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
2023-02-01T15:46:06,090 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 3461 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2023-02-01T15:46:06,140 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-google__flan-t5-xxl
2023-02-01T15:46:06,204 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /.sagemaker/mms/models/google__flan-t5-xxl --model-name google__flan-t5-xxl --preload-model false --tmp-dir /home/model-server/tmp
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 47
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.10
2023-02-01T15:46:06,206 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model google__flan-t5-xxl loaded.
2023-02-01T15:46:06,210 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,219 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,226 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,278 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,281 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,284 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,290 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,298 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2023-02-01T15:46:06,302 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2023-02-01T15:46:08,478 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000000-084f36d4c5a81b10-639dfd41
2023-02-01T15:46:08,491 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2081
2023-02-01T15:46:08,493 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-1
2023-02-01T15:46:08,499 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000001-c96df6d4c5a81b10-276a10eb
2023-02-01T15:46:08,500 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2089
2023-02-01T15:46:08,500 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-3
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000004-12e7f154c5a81b12-fe262c46
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2101
2023-02-01T15:46:08,513 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-4
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000003-6582f154c5a81b12-273338b8
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2150
2023-02-01T15:46:08,561 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-2
2023-02-01T15:46:10,450 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 7
2023-02-01T15:46:15,412 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0
2023-02-01T15:46:20,411 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0

But I got the same error when trying to run inference:

botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}

AWS logs:

2023-02-01T15:49:59,831 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.initialize(context)
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 77, in initialize
2023-02-01T15:49:59,832 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 1
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.model = self.load(self.model_dir)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 104, in load
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py", line 272, in get_pipeline
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     framework, model = infer_framework_load_model(
2023-02-01T15:49:59,834 [INFO ] W-9000-google__flan-t5-xxl ACCESS_LOG - /169.254.178.2:59002 "POST /invocations HTTP/1.1" 400 13
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/mms/service.py", line 108, in predict
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 243, in handle
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise PredictionException(str(e), 400)
2023-02-01T15:49:59,837 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>). : 400

@younesbelkada (Contributor)

Hello @valentinboyanov

I can see in your script that:

HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
    env=hub,
    role=role,
)

Can you update transformers_version to the correct value? I suspect this is what is causing the issue.

@valentinboyanov

valentinboyanov commented Feb 1, 2023

@younesbelkada if I change it, I'm unable to deploy at all:

    raise ValueError(
ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.

This is why I've followed the instructions by Heiko Hotz (marshmellow77) in this comment to provide a requirements.txt file that lets me specify the dependencies I want installed in the container.
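
For reference, the error message also points at an alternative route: upgrading the sagemaker SDK so that a newer Hugging Face container can be requested directly. A rough sketch of that is below; the transformers_version/pytorch_version/py_version triplet is an assumption and has to match an actually available Hugging Face Deep Learning Container.

# Hypothetical alternative (requires `pip install -U sagemaker` first): request a
# newer Hugging Face DLC directly instead of patching transformers via requirements.txt.
# The version triplet below is an assumption and must correspond to an existing DLC.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()
huggingface_model = HuggingFaceModel(
    transformers_version="4.26",   # assumed to be supported after upgrading the SDK
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text2text-generation"},
    role=role,
)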

@philschmid (Member)

@valentinboyanov what is the content of your model_data="s3://sagemaker-eu-north-1-***/model.tar.gz"? Could you please share the folder structure?

@valentinboyanov

valentinboyanov commented Feb 1, 2023

@philschmid yes, here it goes:

➜  model tree .
.
└── code
    └── requirements.txt

1 directory, 1 file
➜  model cat code/requirements.txt 
transformers==4.26.0%     

@philschmid (Member)

When you provide the model_data keyword, you also have to include an inference.py and the model weights in the archive.
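
For illustration, a rough sketch of what such an archive could look like (this is my reading of the Hugging Face inference toolkit conventions, not the exact layout philschmid prescribes; the model_fn/predict_fn hook names are assumptions based on the toolkit's handler service):

# code/inference.py -- hypothetical minimal handler. Assumed model.tar.gz layout:
#
#   model.tar.gz
#   |-- config.json, pytorch_model-*.bin, tokenizer files, ...
#   `-- code/
#       |-- inference.py
#       `-- requirements.txt   (e.g. transformers==4.26.0)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    # model_dir is the unpacked model.tar.gz on the endpoint
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}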

@RonLek (Author)

RonLek commented Feb 2, 2023

@philschmid what should be the contents of inference.py in the case of the flan-t5-xl model? Can this be an empty file if I don't intend to change anything from the hub model? There doesn't seem to be such a file included in the Hugging Face repository.

@valentinboyanov I can confirm I'm seeing the same. From the CloudWatch logs it seems that 4.17.0 is uninstalled and replaced with the version specified in the requirements.txt file.

@younesbelkada if I change it, I'm unable to deploy at all:

    raise ValueError(
ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.

This is why I've followed the instructions by Heiko Hotz (marshmellow77) in this comment to provide a requirements.txt file that lets me specify the dependencies I want installed in the container.

@rafaelsf80

I'm having the same "Could not load model ... with any of the following classes" error (AutoModelForSeq2SeqLM and T5ForConditionalGeneration) when using a Docker container for inference of a flan-t5-xxl-sharded-fp16 model:

The code works without Docker, but if I build and run docker run --gpus all -p 7080:7080 flan-t5-xxl-sharded-fp16:latest, the error is the following:

[2023-02-05 21:33:53 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-02-05 21:33:53 +0000] [1] [INFO] Listening at: http://0.0.0.0:7080 (1)
[2023-02-05 21:33:53 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2023-02-05 21:33:53 +0000] [7] [INFO] Booting worker with pid: 7
[2023-02-05 21:34:01 +0000] [7] [INFO] Is CUDA available: True
[2023-02-05 21:34:01 +0000] [7] [INFO] CUDA device: NVIDIA A100-SXM4-40GB
[2023-02-05 21:34:01 +0000] [7] [INFO] Loading model
[2023-02-05 21:34:02 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.9/site-packages/uvicorn/workers.py", line 66, in init_process
    super(UvicornWorker, self).init_process()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
    mod = importlib.import_module(module)
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/app/main.py", line 29, in <module>
    pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"})
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
    framework, model = infer_framework_load_model(
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model ../flan-t5-xxl-sharded-fp16 with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
[2023-02-05 21:34:02 +0000] [7] [INFO] Worker exiting (pid: 7)
[2023-02-05 21:34:04 +0000] [1] [INFO] Shutting down: Master
[2023-02-05 21:34:04 +0000] [1] [INFO] Reason: Worker failed to boot.

The Dockerfile is the following:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install torch==1.13.0 transformers==4.26.0 sentencepiece torchvision torchaudio accelerate==0.15.0 bitsandbytes-cuda113

COPY ./app /app
COPY ./flan-t5-xxl-sharded-fp16/ /flan-t5-xxl-sharded-fp16

EXPOSE 7080

# Start the app
CMD ["gunicorn", "-b", "0.0.0.0:7080", "main:app","--workers","1","--timeout","180","-k","uvicorn.workers.UvicornWorker"]

The code of app/main.py is the following:

from fastapi import FastAPI, Request
from fastapi.logger import logger

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, T5ForConditionalGeneration 

import json
import logging
import numpy as np
import os
import torch

from transformers import pipeline

app = FastAPI()

gunicorn_logger = logging.getLogger('gunicorn.error')
logger.handlers = gunicorn_logger.handlers

if __name__ != "main":
    logger.setLevel(gunicorn_logger.level)
else:
    logger.setLevel(logging.INFO)

logger.info(f"Is CUDA available: {torch.cuda.is_available()}")
logger.info(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")

logger.info("Loading model")

# error is in this line
pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"}) 

# extra code removed
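
One way to surface the underlying cause (my own suggestion, not something from the thread): load the checkpoint with the concrete class instead of pipeline(), since the generic "Could not load model ... with any of the following classes" message effectively hides the original exception.

# Debugging sketch, assuming it is run inside the same image:
# loading the model directly shows the real traceback that pipeline() hides.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "/flan-t5-xxl-sharded-fp16",   # absolute path of the COPY target in the Dockerfile above
    load_in_8bit=True,
    device_map="auto",
)
print(type(model))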

@RonLek (Author)

RonLek commented Feb 8, 2023

@philschmid @younesbelkada just wanted to follow up on this.

@philschmid what should be the contents of inference.py in the case of the flan-t5-xl model? There doesn't seem to be such a file included in the Hugging Face repository.

@valentinboyanov I can confirm I'm seeing the same. From the CloudWatch logs it seems that 4.17.0 is uninstalled and replaced with the version specified in the requirements.txt file.

@younesbelkada if I change it, I'm unable to deploy at all:

    raise ValueError(
ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.

This is why I've followed the instructions by Heiko Hotz (marshmellow77) in this comment to provide a requirements.txt file that lets me specify the dependencies I want installed in the container.

@philschmid (Member)

@RonLek I am planning to create an example. I'll post it here once it is ready.

@philschmid (Member)

@RonLek done: https://www.philschmid.de/deploy-flan-t5-sagemaker

@RonLek (Author)

RonLek commented Feb 8, 2023

This works! Thanks a ton @philschmid for the prompt response 🚀

@RonLek (Author)

RonLek commented Feb 12, 2023

@philschmid just curious. Would there be a similar sharded model repo for flan-t5-xl?

@philschmid (Member)

If you check this blog post, there is a code snippet on how to do this for t5-11b: https://www.philschmid.de/deploy-t5-11b

import torch
from transformers import AutoModelWithLMHead
from huggingface_hub import HfApi

# load model as float16
model = AutoModelWithLMHead.from_pretrained("t5-11b", torch_dtype=torch.float16, low_cpu_mem_usage=True)
# shard model and push to hub
model.save_pretrained("sharded", max_shard_size="2000MB")

@RonLek (Author)

RonLek commented Feb 16, 2023

Thanks! This worked 🔥

@rags1357

@philschmid thanks for the guidance here. While deploying your solution on SageMaker I noticed that it works great on g5 instances but not on p3 instances (p3.8xlarge). Also, do we know when direct deployment from the HF Hub will work out of the box?
The error is below:

The model fails to load because the required bitsandbytes library reports "The installed version of bitsandbytes was compiled without GPU support." on the p3 instance, which leads to the error below when you invoke the model:
2023-02-25T01:24:28,714 [INFO ] W-model-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: 'NoneType' object has no attribute 'cget_col_row_stats' : 400

@github-actions

github-actions bot commented Apr 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
