
KServe does not catch the Caikit runtime status correctly when a subprocess (tgis) has issues #88

Closed
kpouget opened this issue Sep 22, 2023 · 8 comments
Labels
kind/bug Something isn't working rhods-2.5

Comments

@kpouget

kpouget commented Sep 22, 2023

When I create a ServingRuntime+InferenceService with some incorrect parameters, Caikit cannot load the model.

{"channel": "MODEL-LOADER", "exception": null, "level": "error", "log_code": "<RUN62912924E>", "message": "load failed when processing path: /mnt/models/flan-t5-small-caikit with error: RuntimeError('TGIS failed to boot up with the model. See logs for details')", "model_id": "flan-t5-small-caikit", "num_indent": 0, "thread_id": 140660900353792, "timestamp": "2023-09-21T19:39:45.781105"}

This part is expected. However, the InferenceService still shows the model as Loaded, which is unexpected:

  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
    transitionStatus: UpToDate
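To make the discrepancy concrete, here is a minimal sketch of the kind of health check a client might run against the status above. The dict literal stands in for what `kubectl get inferenceservice -o json` would return, and the helper function is hypothetical, not part of KServe:

```python
# Hypothetical check against the modelStatus fields shown above.
# The dict mirrors the YAML reported by the InferenceService.
model_status = {
    "copies": {"failedCopies": 0, "totalCopies": 1},
    "states": {"activeModelState": "Loaded", "targetModelState": "Loaded"},
    "transitionStatus": "UpToDate",
}

def model_looks_healthy(status: dict) -> bool:
    """True only if no copies failed and the active model is Loaded."""
    return (
        status["copies"]["failedCopies"] == 0
        and status["states"]["activeModelState"] == "Loaded"
        and status["transitionStatus"] == "UpToDate"
    )

# Reports healthy even though TGIS failed to boot -- which is the bug.
print(model_looks_healthy(model_status))
```

Any consumer keying off these fields would conclude the model is serving, even though the TGIS subprocess never came up.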
@heyselbi
Contributor

heyselbi commented Oct 5, 2023

We will ping IBM to see what the expected behavior is.

@kpouget
Author

kpouget commented Oct 9, 2023

I see that many text-generation processes remain <defunct> while the serving runtime keeps running after it hits this error:

sh-5.1$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1000850+       1  0.5  0.0 19143780 611752 ?     Ssl  09:02   0:07 python3 -m caikit.runtime.grpc_server
1000850+     401  0.0  0.0      0     0 ?        Z    09:03   0:01 [text-generation] <defunct>
1000850+     403  0.0  0.0      0     0 ?        Z    09:03   0:01 [text-generation] <defunct>
1000850+     405  0.0  0.0      0     0 ?        Z    09:03   0:00 [text-generation] <defunct>

This means the Python caikit.runtime.grpc_server process does not wait() on its text-generation children, which explains why it does not detect that text-generation failed ...
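The missing wait() can be reproduced in a few lines of plain Python, independent of the Caikit codebase (a POSIX-only sketch; the exit code 7 is arbitrary):

```python
import os
import time

# A child that exits stays a zombie (<defunct> in ps) until the
# parent wait()s for it -- exactly the state of the leftover
# [text-generation] processes in the listing above.
pid = os.fork()
if pid == 0:
    os._exit(7)  # child: die with a nonzero status, like a crashed TGIS

time.sleep(0.1)
# At this point `ps` would show the child with STAT "Z" / <defunct>,
# because the parent has not reaped it yet.

# waitpid() both removes the zombie and surfaces the failure; a parent
# that never calls it cannot notice that the subprocess died.
_, status = os.waitpid(pid, 0)
print("child exit code:", os.waitstatus_to_exitcode(status))
```

A runtime that reaped its children this way would see the nonzero exit status and could propagate a FailedToLoad state instead of Loaded.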

@Xaenalt
Contributor

Xaenalt commented Oct 10, 2023

This should be handled by the container splitting (this sprint)

@danielezonca
Contributor

danielezonca commented Oct 10, 2023

This is the ticket for reference :)

@dtrifiro
Contributor

When running caikit+tgis in this configuration, this is no longer an issue.

@kpouget
Author

kpouget commented Oct 26, 2023

@dtrifiro what about when running caikit+tgis in the single container? 🤔

@Xaenalt
Contributor

Xaenalt commented Oct 26, 2023

The single-container architecture was only a stopgap on the way to the current architecture

@lugi0

lugi0 commented Dec 11, 2023

Verified in RHOAI 2.5 RC4: if I now cause a failure while loading a model (e.g. by giving a wrong path in the deployment modal), the InferenceService sets the modelStatus to FailedToLoad with relevant error messages.
