
KServe does not catch the Caikit runtime status correctly when a subprocess (tgis) has issues #88

Closed
kpouget opened this issue Sep 22, 2023 · 8 comments
Labels
kind/bug Something isn't working rhods-2.5

Comments

@kpouget

kpouget commented Sep 22, 2023

When I create a ServingRuntime+InferenceService with some incorrect parameters, Caikit cannot load the model.

{"channel": "MODEL-LOADER", "exception": null, "level": "error", "log_code": "<RUN62912924E>", "message": "load failed when processing path: /mnt/models/flan-t5-small-caikit with error: RuntimeError('TGIS failed to boot up with the model. See logs for details')", "model_id": "flan-t5-small-caikit", "num_indent": 0, "thread_id": 140660900353792, "timestamp": "2023-09-21T19:39:45.781105"}

This part is expected. However, the InferenceService still shows the model as Loaded, which is unexpected:

  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
    transitionStatus: UpToDate
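To make the discrepancy concrete, here is a minimal sketch of the kind of health check a client might run against the status above. The dict literal stands in for what `kubectl get inferenceservice -o json` would return, and the helper function is hypothetical, not part of KServe:

```python
# Hypothetical check against the modelStatus fields shown above.
# The dict mirrors the YAML reported by the InferenceService.
model_status = {
    "copies": {"failedCopies": 0, "totalCopies": 1},
    "states": {"activeModelState": "Loaded", "targetModelState": "Loaded"},
    "transitionStatus": "UpToDate",
}

def model_looks_healthy(status: dict) -> bool:
    """True only if no copies failed and the active model is Loaded."""
    return (
        status["copies"]["failedCopies"] == 0
        and status["states"]["activeModelState"] == "Loaded"
        and status["transitionStatus"] == "UpToDate"
    )

# Reports healthy even though TGIS failed to boot -- which is the bug.
print(model_looks_healthy(model_status))
```

Any consumer keying off these fields would conclude the model is serving, even though the TGIS subprocess never came up.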
@heyselbi
Contributor

heyselbi commented Oct 5, 2023

We will ping IBM to see what the expected behavior is.

@kpouget
Author

kpouget commented Oct 9, 2023

I see that many text-generation processes remain <defunct> while the serving runtime keeps running after it hits this error:

sh-5.1$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1000850+       1  0.5  0.0 19143780 611752 ?     Ssl  09:02   0:07 python3 -m caikit.runtime.grpc_server
1000850+     401  0.0  0.0      0     0 ?        Z    09:03   0:01 [text-generation] <defunct>
1000850+     403  0.0  0.0      0     0 ?        Z    09:03   0:01 [text-generation] <defunct>
1000850+     405  0.0  0.0      0     0 ?        Z    09:03   0:00 [text-generation] <defunct>

This means the Python caikit.runtime.grpc_server process does not wait() on its text-generation children, which explains why it does not detect that text-generation failed ...
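The missing wait() can be reproduced in a few lines of plain Python, independent of the Caikit codebase (a POSIX-only sketch; the exit code 7 is arbitrary):

```python
import os
import time

# A child that exits stays a zombie (<defunct> in ps) until the
# parent wait()s for it -- exactly the state of the leftover
# [text-generation] processes in the listing above.
pid = os.fork()
if pid == 0:
    os._exit(7)  # child: die with a nonzero status, like a crashed TGIS

time.sleep(0.1)
# At this point `ps` would show the child with STAT "Z" / <defunct>,
# because the parent has not reaped it yet.

# waitpid() both removes the zombie and surfaces the failure; a parent
# that never calls it cannot notice that the subprocess died.
_, status = os.waitpid(pid, 0)
print("child exit code:", os.waitstatus_to_exitcode(status))
```

A runtime that reaped its children this way would see the nonzero exit status and could propagate a FailedToLoad state instead of Loaded.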

@Xaenalt
Contributor

Xaenalt commented Oct 10, 2023

This should be handled by the container splitting (this sprint)

@danielezonca
Contributor

danielezonca commented Oct 10, 2023

This is the ticket for reference :)

@dtrifiro
Contributor

When running caikit+tgis in this configuration, this is no longer an issue.

@kpouget
Author

kpouget commented Oct 26, 2023

@dtrifiro what about when running caikit+tgis in the single container? 🤔

@Xaenalt
Contributor

Xaenalt commented Oct 26, 2023

The single-container architecture was only a stopgap on the way to the current architecture

@lugi0

lugi0 commented Dec 11, 2023

Verified in RHOAI 2.5 RC4: if I now cause a failure while loading a model (e.g. by giving a wrong path in the deployment modal), the InferenceService sets the modelStatus to FailedToLoad with relevant error messages.
