Streaming responses from scaled-to-zero Inference Endpoints return undefined
#549
Comments
We did not add the option yet, but it's definitely possible if needed.
It would be nice for API compatibility between the Inference API and Inference Endpoints!
Should fix part of #549. The other part is fixed in the backend.
@jinnovation you will now get a 503 in streaming requests with the most recent version. With your current version, you can get a 503 if you set `retry_on_error: false`:

```js
const response = hf.textGenerationStream(
  {
    inputs: experimental_buildLlama2Prompt([
      {
        role: "user",
        content: "hello",
      },
    ]),
  },
  {
    retry_on_error: false,
  }
);
```

Soon the inference endpoint backend will be updated so that, by default, a call blocks while the model loads.
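For reference, a minimal sketch (not from the thread) of consuming the stream with `retry_on_error: false` and handling the 503 yourself; the endpoint URL and token are placeholders:

```ts
import { HfInferenceEndpoint } from "@huggingface/inference";

// Hypothetical placeholders for your endpoint URL and access token.
const hf = new HfInferenceEndpoint(
  "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
  "hf_..."
);

try {
  // With retry_on_error: false, the client surfaces the 503 from a
  // scaled-to-zero endpoint instead of retrying internally.
  for await (const output of hf.textGenerationStream(
    { inputs: "hello" },
    { retry_on_error: false }
  )) {
    process.stdout.write(output.token.text);
  }
} catch (err) {
  // The endpoint is still initializing; back off and retry as appropriate.
  console.error("Endpoint not ready yet:", err);
}
```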
Fantastic! Thank you.
Shortly after huggingface#549, the inference endpoint backend was updated to block by default on model loading. This PR adds documentation explaining how to circumvent that blocking so that the user, if desired, can handle the 500 errors themselves.
It's available on the backend.
Hey folks, what's the recommended way to deal with the initializing period of a scaled-to-zero Inference Endpoint when using `textGenerationStream()`?

For example, when using `textGeneration()` directly, we get the 503 error directly, which we can catch and maybe retry on. However, when using `textGenerationStream()`, fetching the next response chunk appears to succeed, except the `.value` of that chunk happens to be `undefined`. Reproducing code:
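A minimal sketch of the repro, assuming `@huggingface/inference` with a placeholder endpoint URL and token:

```ts
import { HfInferenceEndpoint } from "@huggingface/inference";

// Hypothetical placeholders for endpoint URL and access token.
const hf = new HfInferenceEndpoint(
  "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
  "hf_..."
);

const stream = hf.textGenerationStream({ inputs: "hello" });

// Pull a chunk manually via the async-iterator protocol.
const chunk = await stream.next();

// While the scaled-to-zero endpoint is still initializing, next()
// resolves, but chunk.value is undefined instead of a token.
console.log(chunk.value);
```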
Running it in a terminal, `.value` comes back `undefined` rather than generated tokens.

My (possibly naive) expectation would be for the generator returned by `textGenerationStream()` to wait for the corresponding Inference Endpoint to fully initialize -- maybe with exponential backoff, as in the sketch below -- before allowing the chunk to be returned via `.next()`. But I am a novice here, so I could be missing something. 😆

Is this expected behavior or a bug?
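For illustration, a rough sketch (my assumption, not an existing API) of what that caller-side backoff could look like:

```ts
import { HfInferenceEndpoint } from "@huggingface/inference";

// Hypothetical placeholders for endpoint URL and access token.
const hf = new HfInferenceEndpoint(
  "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
  "hf_..."
);

// Retry the whole stream with exponential backoff until the endpoint
// is warm; retry_on_error: false makes the 503 observable here.
async function streamWithBackoff(inputs: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      for await (const output of hf.textGenerationStream(
        { inputs },
        { retry_on_error: false }
      )) {
        process.stdout.write(output.token.text);
      }
      return;
    } catch (err) {
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
    }
  }
  throw new Error("Endpoint did not become ready in time");
}

await streamWithBackoff("hello");
```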