
feat: cancel stream generation if client disappears #792

Merged (4 commits) on Jul 24, 2023

Conversation

@tmm1 (Contributor) commented on Jul 23, 2023

Description

I added Context and Cancel fields to OpenAIRequest, and ensured that the new per-request context is wired up to the gRPC client.
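A minimal Go sketch of the idea (the Context/Cancel fields and the OpenAIRequest name come from this description; everything else, including the helper name, is an assumption, not the actual LocalAI code):

```go
package main

import "context"

// OpenAIRequest sketch: only the cancellation-related fields are shown;
// the other fields here are illustrative, not the full request schema.
type OpenAIRequest struct {
	Model  string `json:"model"`
	Stream bool   `json:"stream"`

	// Per-request lifetime: gRPC calls made with Context are aborted
	// when Cancel fires (e.g. because the HTTP client disconnected).
	Context context.Context    `json:"-"`
	Cancel  context.CancelFunc `json:"-"`
}

// newOpenAIRequest (hypothetical helper) derives the per-request context
// from a parent context, so the backend call can be cancelled later.
func newOpenAIRequest(parent context.Context) *OpenAIRequest {
	ctx, cancel := context.WithCancel(parent)
	return &OpenAIRequest{Context: ctx, Cancel: cancel}
}

func main() {
	req := newOpenAIRequest(context.Background())
	// req.Context would be passed to the gRPC backend client here
	// instead of a background context.
	defer req.Cancel()
	_ = req
}
```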

Now if I ctrl+c my curl request while it's streaming, the LocalAI server will show:

7:15PM DBG Sending chunk: {"object":"chat.completion.chunk","model":"llama2","choices":[{"index":0,"delta":{"content":" a"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

7:15PM DBG Sending chunk failed: connection closed
Error rpc error: code = Canceled desc = context canceled
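A rough sketch of how that flow can work on the server side, assuming a chunk channel and a write callback (a hypothetical shape, not the actual LocalAI streaming handler): when writing a chunk fails because the connection is closed, the per-request context is cancelled, and the in-flight gRPC call returns the Canceled error shown above.

```go
package main

import (
	"context"
	"fmt"
)

// streamChunks forwards generated chunks to the client and cancels the
// per-request context as soon as a write fails (client disappeared).
func streamChunks(ctx context.Context, cancel context.CancelFunc,
	chunks <-chan string, write func(string) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case chunk, ok := <-chunks:
			if !ok {
				return
			}
			if err := write(chunk); err != nil {
				fmt.Println("Sending chunk failed:", err)
				cancel() // propagates to the gRPC backend call
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	chunks := make(chan string, 2)
	chunks <- " a"
	chunks <- " b"
	close(chunks)

	// Simulate a client that goes away after the first chunk.
	calls := 0
	write := func(s string) error {
		calls++
		if calls > 1 {
			return fmt.Errorf("connection closed")
		}
		fmt.Println("Sending chunk:", s)
		return nil
	}
	streamChunks(ctx, cancel, chunks, write)
	fmt.Println("request context:", ctx.Err()) // context canceled
}
```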

Notes for Reviewers

The next request after a cancellation is acting weird and produces some garbage tokens.

7:21PM DBG Sending chunk: {"object":"chat.completion.chunk","model":"llama2","choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

7:21PM DBG Loading model llama from llama-2-13b-chat.ggmlv3.q4_K_M.bin
7:21PM DBG Model already loaded in memory: llama-2-13b-chat.ggmlv3.q4_K_M.bin
7:21PM DBG Model already loaded in memory: llama-2-13b-chat.ggmlv3.q4_K_M.bin
7:21PM DBG Sending chunk: {"object":"chat.completion.chunk","model":"llama2","choices":[{"index":0,"delta":{"content":" to"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

Error rpc error: code = Unavailable desc = error reading from server: EOF

The third request spins up the backend again and works, until another request is cancelled anyway.

Signed commits

  • Yes, I signed my commits.

Review thread on api/openai/request.go (outdated, resolved)
@mudler (Owner) commented on Jul 23, 2023

Thanks! Nice addition. Just a small note from my side regarding the context: if we keep using the top-level context, we are also sure that we can cancel it (particularly useful for tests).
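A small sketch of that suggestion as I read it (an assumed interpretation, not code from this PR): deriving the per-request context from the application's top-level context means that cancelling the top-level context also cancels every in-flight generation, which is handy in tests and on shutdown.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// Top-level application context (e.g. created at server startup).
	appCtx, appCancel := context.WithCancel(context.Background())

	// Per-request context is derived from appCtx, not context.Background().
	reqCtx, reqCancel := context.WithCancel(appCtx)
	defer reqCancel()

	// Cancelling the parent cancels the child as well.
	appCancel()
	<-reqCtx.Done()
	fmt.Println(reqCtx.Err()) // context canceled
}
```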

@mudler changed the title from "Cancel stream generation if client disappears" to "feat: cancel stream generation if client disappears" on Jul 23, 2023
@mudler added the enhancement label on Jul 23, 2023
@tmm1 (Contributor, Author) commented on Jul 23, 2023

The next request after a cancellation is acting weird and produces some garbage tokens.

@mudler any thoughts on what might be causing this or how to go about fixing it?

@mudler (Owner) commented on Jul 24, 2023

The next request after a cancellation is acting weird and produces some garbage tokens.

@mudler any thoughts on what might be causing this or how to go about fixing it?

My guess is that, since llama.cpp doesn't support multiple concurrent states, the process is probably still busy inferencing when another request tries to book the same context, and that fails.
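Purely as an illustration of that constraint (not code from this PR or from llama.cpp's bindings): with a single inference state, access to the backend has to be serialized, e.g. behind a mutex; a second request grabbing the state while a cancelled generation is still draining is exactly the kind of clash described above.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a toy stand-in for a llama.cpp process that owns exactly
// one inference state, so requests must take turns using it.
type backend struct {
	mu sync.Mutex // guards the single inference state
}

func (b *backend) predict(prompt string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return "reply to " + prompt
}

func main() {
	b := &backend{}
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Println(b.predict(fmt.Sprintf("request %d", i)))
		}(i)
	}
	wg.Wait()
}
```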

@mudler enabled auto-merge (squash) on Jul 24, 2023 at 19:54
@mudler disabled auto-merge on Jul 24, 2023 at 21:10
@mudler merged commit 12fe093 into mudler:master on Jul 24, 2023
8 checks passed