
Streaming responses from scaled-to-zero Inference Endpoints return undefined #549

Closed
jinnovation opened this issue Mar 13, 2024 · 6 comments

Comments

@jinnovation (Contributor)

Hey folks, what's the recommended way to deal with the initializing period of a scaled-to-zero Inference Endpoint when using .textGenerationStream()?

For example, when using .textGeneration() directly, we get a 503 error that we can catch and maybe retry on.
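
(Roughly what I mean -- a minimal retry sketch; the backoff values and the assumption that the 503 surfaces as a thrown Error are just for illustration:)

import { HfInferenceEndpoint } from "@huggingface/inference";

const hf = new HfInferenceEndpoint("DEPLOYED_LLAMA2_ENDPOINT", "API_TOKEN");

// Sketch: retry .textGeneration() while the endpoint scales up from zero.
// Assumes the 503 arrives as a thrown Error; backoff values are illustrative only.
async function generateWithRetry(prompt: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await hf.textGeneration({ inputs: prompt });
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // give up after the last attempt
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); // exponential backoff
    }
  }
}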

However, when using .textGenerationStream(), fetching the next response chunk appears to succeed, except that the chunk's .value is undefined. Reproducing code:

import { HfInferenceEndpoint } from "@huggingface/inference";
import { experimental_buildLlama2Prompt } from "ai/prompts";

const hf = new HfInferenceEndpoint(
  "DEPLOYED_LLAMA2_ENDPOINT",
  "API_TOKEN",
);

const response = hf.textGenerationStream({
  inputs: experimental_buildLlama2Prompt([
    {
      role: "user",
      content: "hello",
    },
  ]),
});

response.next().then((res) => {
  console.log("print next chunk");
  console.log(res.value);
  console.log("done");
});

Running via terminal results in the following:

> npx ts-node hf-test.ts
print next chunk
undefined
done

My (possibly naive) expectation would be for the generator returned by textGenerationStream() to wait for the corresponding Inference Endpoint to fully initialize -- maybe with exponential backoff -- before allowing the chunk to be returned via .next(). But I am a novice here, so I could be missing something. 😆
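
(Concretely, something like this sketch of re-creating the stream with backoff until the first chunk is defined -- the helper name and backoff values are hypothetical, and it reuses the hf client from the snippet above:)

// Hypothetical sketch of the behavior I'd expect: re-create the stream and retry
// the first chunk with exponential backoff until the endpoint is warm.
async function firstChunkWithBackoff(inputs: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const stream = hf.textGenerationStream({ inputs });
    const { value } = await stream.next();
    if (value !== undefined) return { stream, first: value }; // endpoint is ready
    // Endpoint still initializing: wait, then start a fresh stream.
    await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
  }
  throw new Error("Inference Endpoint did not become ready in time");
}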

Is this expected behavior or a bug?

@coyotte508 (Member)

@co42 @Narsil, do HF Endpoints support wait_for_model like the Inference API?

@co42 commented Mar 14, 2024

We haven't added the option yet, but it's definitely possible if needed.

@coyotte508 (Member)

It would be nice for API compatibility between the Inference API and Inference Endpoints!

coyotte508 added a commit that referenced this issue Mar 14, 2024
Should fix part of #549. The other part is fixed in the backend.
@coyotte508 (Member) commented Mar 14, 2024

@jinnovation you will now get a 503 on streaming requests with the most recent version, 2.6.5.

With your current version, you can get a 503 if you set retry_on_error to false:

const response = hf.textGenerationStream(
  {
    inputs: experimental_buildLlama2Prompt([
      {
        role: "user",
        content: "hello",
      },
    ]),
  },
  {
    retry_on_error: false,
  },
);

Soon the inference endpoint backend will be updated so that, by default, a call with @huggingface/inference will wait until the model is loaded (you can disable this behavior with retry_on_error: false to handle the 503 yourself).
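
If you do handle it yourself, roughly something like this sketch could work (how the 503 surfaces -- a thrown Error here -- and the backoff values are assumptions for illustration, not the exact error shape):

// Sketch: with retry_on_error: false, catch the error while the endpoint is
// starting up and retry with a simple backoff. Error handling and backoff
// values here are illustrative assumptions.
async function streamWhenReady(prompt: string, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const stream = hf.textGenerationStream(
        { inputs: prompt },
        { retry_on_error: false },
      );
      for await (const chunk of stream) {
        process.stdout.write(chunk.token.text); // print tokens as they arrive
      }
      return;
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
    }
  }
}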

@jinnovation (Contributor, Author)

Soon the inference endpoint backend will be updated so that, by default, a call with @huggingface/inference will wait until the model is loaded

Fantastic! Thank you.

jinnovation added a commit to jinnovation/huggingface.js that referenced this issue Mar 14, 2024
Shortly after huggingface#549, the inference endpoint backend was updated to block by default on model loading. This PR adds documentation explaining how to circumvent that blocking so that the user, if desired, can handle the 503 errors themselves.
@co42 commented Mar 14, 2024

It's available on the backend

coyotte508 pushed a commit that referenced this issue Mar 14, 2024