
FileNotFoundError when using an s3 bucket as the model_dir with HuggingFace model server #3423

Closed
kevinmingtarja opened this issue Feb 9, 2024 · 2 comments · Fixed by #3424

kevinmingtarja (Contributor) commented Feb 9, 2024

/kind bug

First of all, thank you for the work on KServe! Playing around with it has been delightful so far. However, we found a small bug while testing out the HuggingFace model server (which we're aware is a very new addition).

What steps did you take and what happened:

  1. Created an InferenceService using the HuggingFace model server (yaml pasted below)
  2. Specified an s3 bucket as the model_dir (I suspect this might happen for anything that's not a local dir)
  3. Observed that the model is successfully downloaded to a tmp directory and loaded, but then encountered the FileNotFoundError right after

Logs:

% k logs huggingface-predictor-00003-deployment-8659bb8b9-m945b
Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of s3://kserve-test-models/classifier to local
INFO:root:Downloaded object classifier/config.json to /tmp/tmpckx_trr1/config.json
...
INFO:root:Successfully copied s3://kserve-test-models/classifier to /tmp/tmpckx_trr1
INFO:kserve:successfully loaded tokenizer for task: 4
INFO:kserve:successfully loaded huggingface model from path /tmp/tmpckx_trr1
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/huggingfaceserver/huggingfaceserver/__main__.py", line 69, in <module>
    kserve.ModelServer(registered_models=HuggingfaceModelRepository(args.model_dir)).start(
  File "/huggingfaceserver/huggingfaceserver/huggingface_model_repository.py", line 24, in __init__
    self.load_models()
  File "/kserve/kserve/model_repository.py", line 37, in load_models
    for name in os.listdir(self.models_dir):
FileNotFoundError: [Errno 2] No such file or directory: 's3://kserve-test-models/spam-classifier'
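
For context, the failing call is easy to reproduce outside of KServe: os.listdir() only understands local filesystem paths, so handing it a URI raises exactly this error. A minimal standalone sketch (not KServe code):

import os

# os.listdir() treats its argument as a local path; an s3:// URI does not
# exist on the local filesystem, hence the FileNotFoundError.
os.listdir("s3://kserve-test-models/spam-classifier")
# FileNotFoundError: [Errno 2] No such file or directory: 's3://kserve-test-models/spam-classifier'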

What did you expect to happen:

I expected this to work, as the model was successfully downloaded and loaded. But I did find a temporary workaround (below), and I think I know where the issue is!

What's the InferenceService yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface
spec:
  predictor:
    serviceAccountName: huggingface-sa
    containers:
    - args:
      - --model_name=spam-classifier
      # - --model_id=xyz (see workaround below)
      - --model_dir=s3://kserve-test-models/classifier
      - --tensor_input_names=input_ids
      image: kserve/huggingfaceserver:latest
      name: kserve-container

Anything else you would like to add:

A temporary workaround I found is to supply the model_id argument. It can have any value, as the model_dir will override it anyway during loading:

def load(self) -> bool:
    model_id_or_path = self.model_id
    if self.model_dir:
        model_id_or_path = pathlib.Path(Storage.download(self.model_dir))
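
In other words, once model_dir is set, the value of model_id never influences which weights are loaded; it only affects which branch is taken in __main__ (shown further below), which is why any dummy value is sufficient.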

I have verified that this workaround works (logs below).
% k logs huggingface-predictor-00004-deployment-946b4d6c8-pk5nj -f
Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of s3://kserve-test-models/classifier to local
INFO:root:Downloaded object classifier/config.json to /tmp/tmppwjsica7/config.json
...
INFO:kserve:successfully loaded tokenizer for task: 4
INFO:kserve:successfully loaded huggingface model from path /tmp/tmppwjsica7
INFO:kserve:Registering model: classifier
INFO:kserve:Setting max asyncio worker threads as 5
INFO:kserve:Starting uvicorn with 1 workers
2024-02-09 18:57:33.228 uvicorn.error INFO:     Started server process [1]
2024-02-09 18:57:33.229 uvicorn.error INFO:     Waiting for application startup.
2024-02-09 18:57:33.234 1 kserve INFO [start():62] Starting gRPC server on [::]:8081
2024-02-09 18:57:33.234 uvicorn.error INFO:     Application startup complete.
2024-02-09 18:57:33.235 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

I think the issue is here:

try:
    model.load()
except ModelMissingError:
    logging.error(f"fail to locate model file for model {args.model_name} under dir {args.model_dir},"
                  f"trying loading from model repository.")

if not args.model_id:
    kserve.ModelServer(registered_models=HuggingfaceModelRepository(args.model_dir)).start(
        [model] if model.ready else [])
else:
    kserve.ModelServer().start([model] if model.ready else [])

  1. model.load() will succeed, so execution falls through the try/except to the if check below
  2. It checks args.model_id, which is empty, so we go inside the if block
  3. It then tries to instantiate HuggingfaceModelRepository with model_dir, which points to an s3 bucket rather than a local directory, causing the FileNotFoundError in os.listdir
  4. This is how I came up with the workaround of passing model_id: it makes the else block run instead, which is safe because the model did load successfully, so kserve.ModelServer().start([model] if model.ready else []) works fine
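
Given that analysis, a guard along the following lines would avoid the problem. This is just a minimal sketch of the idea on my part (#3424 may well implement the fix differently): only fall back to scanning the model repository when the initial load did not produce a ready model.

if model.ready:
    # The model was already downloaded from model_dir and loaded, so there is
    # no reason to also treat model_dir as a local repository directory.
    kserve.ModelServer().start([model])
elif not args.model_id:
    # The initial load failed and no model_id was given: fall back to treating
    # model_dir as a local model repository and register whatever it finds.
    kserve.ModelServer(
        registered_models=HuggingfaceModelRepository(args.model_dir)
    ).start([])
else:
    kserve.ModelServer().start([])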

Environment:

  • Cloud Environment: aws
  • Kubernetes version (kubectl version): v1.27.9-eks-5e0fdde
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS
oss-prow-bot (bot) added the kind/bug label Feb 9, 2024
terrytangyuan added a commit to terrytangyuan/kserve that referenced this issue Feb 9, 2024
… loaded. Fixes kserve#3423

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
terrytangyuan (Member) commented:

Thanks for the detailed report. I sent a fix in #3424.

kevinmingtarja (Contributor, Author) commented:

Thanks for the fix, @terrytangyuan!

yuzisun pushed a commit that referenced this issue Mar 12, 2024
… loaded. Fixes #3423 (#3424)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
tjandy98 pushed a commit to tjandy98/kserve that referenced this issue Apr 10, 2024
… loaded. Fixes kserve#3423 (kserve#3424)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>