
Streaming responses from scaled-to-zero Inference Endpoints return undefined #549

Closed
jinnovation opened this issue Mar 13, 2024 · 6 comments

Comments

@jinnovation (Contributor)

Hey folks, what's the recommended way to deal with the initializing period of a scaled-to-zero Inference Endpoint when using .textGenerationStream()?

For example, when using .textGeneration() directly, we get a 503 error that we can catch and maybe retry on.
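
(Roughly what I mean -- a minimal retry sketch; the backoff values and the assumption that the 503 surfaces as a thrown Error are just for illustration:)

import { HfInferenceEndpoint } from "@huggingface/inference";

const hf = new HfInferenceEndpoint("DEPLOYED_LLAMA2_ENDPOINT", "API_TOKEN");

// Sketch: retry .textGeneration() while the endpoint scales up from zero.
// Assumes the 503 arrives as a thrown Error; backoff values are illustrative only.
async function generateWithRetry(prompt: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await hf.textGeneration({ inputs: prompt });
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // give up after the last attempt
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); // exponential backoff
    }
  }
}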

However, when using .textGenerationStream(), fetching the next response chunk appears to succeed, except that the chunk's .value is undefined. Reproducing code:

import { HfInferenceEndpoint } from "@huggingface/inference";
import { experimental_buildLlama2Prompt } from "ai/prompts";

const hf = new HfInferenceEndpoint(
  "DEPLOYED_LLAMA2_ENDPOINT",
  "API_TOKEN",
);

const response = hf.textGenerationStream({
  inputs: experimental_buildLlama2Prompt([
    {
      role: "user",
      content: "hello",
    },
  ]),
});

response.next().then((res) => {
  console.log("print next chunk");
  console.log(res.value);
  console.log("done");
});

Running via terminal results in the following:

> npx ts-node hf-test.ts
print next chunk
undefined
done

My (possibly naive) expectation would be for the generator returned by textGenerationStream() to wait for the corresponding Inference Endpoint to fully initialize -- maybe with exponential backoff -- before allowing the chunk to be returned via .next(). But I am a novice here, so I could be missing something. 😆
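
(Concretely, something like this sketch of re-creating the stream with backoff until the first chunk is defined -- the helper name and backoff values are hypothetical, and it reuses the hf client from the snippet above:)

// Hypothetical sketch of the behavior I'd expect: re-create the stream and retry
// the first chunk with exponential backoff until the endpoint is warm.
async function firstChunkWithBackoff(inputs: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const stream = hf.textGenerationStream({ inputs });
    const { value } = await stream.next();
    if (value !== undefined) return { stream, first: value }; // endpoint is ready
    // Endpoint still initializing: wait, then start a fresh stream.
    await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
  }
  throw new Error("Inference Endpoint did not become ready in time");
}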

Is this expected behavior or a bug?

@coyotte508 (Member)

@co42 @Narsil, do HF Endpoints support wait_for_model like the Inference API?

@co42 commented Mar 14, 2024

We haven't added the option yet, but it's definitely possible if needed.

@coyotte508 (Member)

It would be nice for API compatibility between the Inference API and Inference Endpoints!

coyotte508 added a commit that referenced this issue Mar 14, 2024
Should fix part of #549. The other part is fixed in the backend.
@coyotte508 (Member) commented Mar 14, 2024

@jinnovation you will now get a 503 on streaming requests with the most recent version, 2.6.5.

With your current version, you can get a 503 if you set retry_on_error to false:

const response = hf.textGenerationStream(
  {
    inputs: experimental_buildLlama2Prompt([
      {
        role: "user",
        content: "hello",
      },
    ]),
  },
  {
    retry_on_error: false,
  },
);

Soon the inference endpoint backend will be updated so that, by default, a call with @huggingface/inference will wait until the model is loaded (you can disable this behavior with retry_on_error: false to handle the 503 yourself).
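
If you do handle it yourself, roughly something like this sketch could work (how the 503 surfaces -- a thrown Error here -- and the backoff values are assumptions for illustration, not the exact error shape):

// Sketch: with retry_on_error: false, catch the error while the endpoint is
// starting up and retry with a simple backoff. Error handling and backoff
// values here are illustrative assumptions.
async function streamWhenReady(prompt: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const stream = hf.textGenerationStream(
        { inputs: prompt },
        { retry_on_error: false },
      );
      for await (const chunk of stream) {
        process.stdout.write(chunk.token.text); // print tokens as they arrive
      }
      return;
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
    }
  }
}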

@jinnovation (Contributor, Author)

Soon the inference endpoint backend will be updated so that, by default, a call with @huggingface/inference will wait until the model is loaded

Fantastic! Thank you.

jinnovation added a commit to jinnovation/huggingface.js that referenced this issue Mar 14, 2024
Shortly after huggingface#549, the inference endpoint backend was updated to block by default on model loading. This PR adds documentation explaining how to circumvent that blocking so that the user, if desired, can handle the 503 errors themselves.
@co42 commented Mar 14, 2024

It's available on the backend

coyotte508 pushed a commit that referenced this issue Mar 14, 2024