
FileNotFoundError when using an s3 bucket as the model_dir with HuggingFace model server #3423

Closed
kevinmingtarja opened this issue Feb 9, 2024 · 2 comments · Fixed by #3424

kevinmingtarja (Contributor) commented Feb 9, 2024

/kind bug

First of all, thank you for the work on KServe! Playing around with it has been delightful so far. However, we found a small bug while testing out the HuggingFace model server (which we're aware is a very new addition).

What steps did you take and what happened:

  1. Created an InferenceService using the HuggingFace model server (yaml pasted below)
  2. Specified an s3 bucket as the model_dir (I suspect this might happen for anything that's not a local dir)
  3. Observed that the model is successfully downloaded to a tmp directory and loaded, but then encountered the FileNotFoundError right after

Logs:

% k logs huggingface-predictor-00003-deployment-8659bb8b9-m945b
Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of s3://kserve-test-models/classifier to local
INFO:root:Downloaded object classifier/config.json to /tmp/tmpckx_trr1/config.json
...
INFO:root:Successfully copied s3://kserve-test-models/classifier to /tmp/tmpckx_trr1
INFO:kserve:successfully loaded tokenizer for task: 4
INFO:kserve:successfully loaded huggingface model from path /tmp/tmpckx_trr1
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/huggingfaceserver/huggingfaceserver/__main__.py", line 69, in <module>
    kserve.ModelServer(registered_models=HuggingfaceModelRepository(args.model_dir)).start(
  File "/huggingfaceserver/huggingfaceserver/huggingface_model_repository.py", line 24, in __init__
    self.load_models()
  File "/kserve/kserve/model_repository.py", line 37, in load_models
    for name in os.listdir(self.models_dir):
FileNotFoundError: [Errno 2] No such file or directory: 's3://kserve-test-models/spam-classifier'
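
For context, the failing call is easy to reproduce outside of KServe: os.listdir() only understands local filesystem paths, so handing it a URI raises exactly this error. A minimal standalone sketch (not KServe code):

import os

# os.listdir() treats its argument as a local path; an s3:// URI does not
# exist on the local filesystem, hence the FileNotFoundError.
os.listdir("s3://kserve-test-models/spam-classifier")
# FileNotFoundError: [Errno 2] No such file or directory: 's3://kserve-test-models/spam-classifier'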

What did you expect to happen:

I expected this to work, as the model was successfully downloaded and loaded. But I did find a temporary workaround (below), and I think I know where the issue is!

What's the InferenceService yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface
spec:
  predictor:
    serviceAccountName: huggingface-sa
    containers:
    - args:
      - --model_name=spam-classifier
      # - --model_id=xyz (see workaround below)
      - --model_dir=s3://kserve-test-models/classifier
      - --tensor_input_names=input_ids
      image: kserve/huggingfaceserver:latest
      name: kserve-container

Anything else you would like to add:

A temporary workaround I found is to supply the model_id argument. It can have any value, as the model_dir will override it anyway during loading:

def load(self) -> bool:
    model_id_or_path = self.model_id
    if self.model_dir:
        model_id_or_path = pathlib.Path(Storage.download(self.model_dir))
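
In other words, once model_dir is set, the value of model_id never influences which weights are loaded; it only affects which branch is taken in __main__ (shown further below), which is why any dummy value is sufficient.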

I have verified that this workaround works (logs below).
% k logs huggingface-predictor-00004-deployment-946b4d6c8-pk5nj -f
Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of s3://kserve-test-models/classifier to local
INFO:root:Downloaded object classifier/config.json to /tmp/tmppwjsica7/config.json
...
INFO:kserve:successfully loaded tokenizer for task: 4
INFO:kserve:successfully loaded huggingface model from path /tmp/tmppwjsica7
INFO:kserve:Registering model: classifier
INFO:kserve:Setting max asyncio worker threads as 5
INFO:kserve:Starting uvicorn with 1 workers
2024-02-09 18:57:33.228 uvicorn.error INFO:     Started server process [1]
2024-02-09 18:57:33.229 uvicorn.error INFO:     Waiting for application startup.
2024-02-09 18:57:33.234 1 kserve INFO [start():62] Starting gRPC server on [::]:8081
2024-02-09 18:57:33.234 uvicorn.error INFO:     Application startup complete.
2024-02-09 18:57:33.235 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

I think the issue is here:

try:
    model.load()
except ModelMissingError:
    logging.error(f"fail to locate model file for model {args.model_name} under dir {args.model_dir},"
                  f"trying loading from model repository.")

if not args.model_id:
    kserve.ModelServer(registered_models=HuggingfaceModelRepository(args.model_dir)).start(
        [model] if model.ready else [])
else:
    kserve.ModelServer().start([model] if model.ready else [])

  1. model.load() will succeed, so execution falls through the try/except to the if check below
  2. It checks args.model_id, which is empty, so we go inside the if block
  3. It then tries to instantiate HuggingfaceModelRepository with model_dir, which points to an s3 bucket rather than a local directory, causing the FileNotFoundError in os.listdir
  4. This is how I came up with the workaround of passing model_id: it makes the else block run instead, which is safe because the model did load successfully, so kserve.ModelServer().start([model] if model.ready else []) works fine
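
Given that analysis, a guard along the following lines would avoid the problem. This is just a minimal sketch of the idea on my part (#3424 may well implement the fix differently): only fall back to scanning the model repository when the initial load did not produce a ready model.

if model.ready:
    # The model was already downloaded from model_dir and loaded, so there is
    # no reason to also treat model_dir as a local repository directory.
    kserve.ModelServer().start([model])
elif not args.model_id:
    # The initial load failed and no model_id was given: fall back to treating
    # model_dir as a local model repository and register whatever it finds.
    kserve.ModelServer(
        registered_models=HuggingfaceModelRepository(args.model_dir)
    ).start([])
else:
    kserve.ModelServer().start([])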

Environment:

  • Cloud Environment: aws
  • Kubernetes version (kubectl version): v1.27.9-eks-5e0fdde
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS
oss-prow-bot (bot) added the kind/bug label Feb 9, 2024
terrytangyuan added a commit to terrytangyuan/kserve that referenced this issue Feb 9, 2024
… loaded. Fixes kserve#3423

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
terrytangyuan (Member) commented:

Thanks for the detailed report. I sent a fix in #3424.

kevinmingtarja (Contributor, Author) commented:

Thanks for the fix, @terrytangyuan!

yuzisun pushed a commit that referenced this issue Mar 12, 2024
… loaded. Fixes #3423 (#3424)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
tjandy98 pushed a commit to tjandy98/kserve that referenced this issue Apr 10, 2024
… loaded. Fixes kserve#3423 (kserve#3424)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>