Hi folks,
not sure if this has already been discussed elsewhere; I didn't find anything in my research except the parallel-inference doc. At Wikimedia we are trying to port models and their respective libraries (built years ago without the async concept in mind) to KServe, and so far we have encountered a lot of scalability bottlenecks (and related gotchas).

In our case we try to use predictors and transformers as much as possible, together with Python async libraries like `aiohttp`, `aiokafka`, etc. The usual architecture that we follow is a single model for each predictor instance and, where possible (or where it makes sense), `preprocess` offloaded to a transformer (so a single model for each isvc resource declared). A rough sketch of this setup is below.
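To make the pattern concrete, here is a minimal sketch of such a transformer, assuming KServe's Python `Model`/`ModelServer` API (exact method signatures vary between KServe versions, and the feature-store URL is a made-up placeholder):

```python
import aiohttp
from kserve import Model, ModelServer


class AsyncTransformer(Model):
    """Transformer doing async I/O in preprocess, forwarding to a predictor."""

    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # The base Model forwards predict() to this host when it is set.
        self.predictor_host = predictor_host
        self.ready = True

    async def preprocess(self, payload, headers=None):
        # I/O-bound enrichment can be awaited, so the event loop keeps
        # serving other requests while we wait on the network.
        async with aiohttp.ClientSession() as session:
            async with session.get("http://feature-store.example/features") as resp:
                features = await resp.json()
        return {"instances": [features]}


if __name__ == "__main__":
    model = AsyncTransformer("my-model", predictor_host="my-model-predictor")
    ModelServer().start([model])
```

This keeps `preprocess` non-blocking only as long as everything in it is truly async; any synchronous CPU-bound call in there stalls the whole event loop, which is exactly the bottleneck described next.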
We hit some bottlenecks when CPU-bound code needed to be executed. We are aware that, due to the GIL, there are some intrinsic Python limitations in running parallel code, so we tried a few workarounds along those lines.
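One classic workaround, sketched here with a hypothetical CPU-heavy `score_text` function, is to hand the CPU-bound part to a `concurrent.futures.ProcessPoolExecutor`: each worker process has its own interpreter and GIL, so the event loop stays responsive while the heavy work runs elsewhere.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def score_text(text: str) -> float:
    # Hypothetical stand-in for CPU-heavy work (tokenization,
    # feature extraction, model scoring, ...).
    return float(sum(ord(c) for c in text) % 100)


# Workers are spawned lazily on first submit.
pool = ProcessPoolExecutor(max_workers=2)


async def handle_request(text: str) -> float:
    loop = asyncio.get_running_loop()
    # The work runs in a separate process with its own GIL, while this
    # process's event loop keeps serving other coroutines.
    return await loop.run_in_executor(pool, score_text, text)


if __name__ == "__main__":
    print(asyncio.run(handle_request("example payload")))
```

The same `run_in_executor` pattern can be used inside an async `preprocess` or `predict` to keep the server responsive.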
We would really be interested to hear about the experience of other teams and companies, since this seems to be a common problem to solve when using KServe.
Thanks in advance!
Replies: 1 comment

Any comments/suggestions? :)