
Conversation

@vansangpfiev (Contributor) commented Apr 14, 2024

Issue: stream and non-stream requests are currently not handled on the same thread. This leads to an `Error during inference` error, which only happens on some older machines when a non-stream request is sent while the stream response has not yet completed.

20240413 08:01:52.241000 UTC 1476 INFO  Request 1: Reached result stop - llamaCPP.cc:387
20240413 08:01:52.241000 UTC 1476 INFO  Request 1: End of result - llamaCPP.cc:353
20240413 08:01:52.273000 UTC 10300 INFO  Request 1: Task completed, release it - llamaCPP.cc:431
20240413 08:01:52.273000 UTC 10300 INFO  Request 1: Inference completed - llamaCPP.cc:434
20240413 08:01:53.261000 UTC 1476 DEBUG [fromRequest] Request: {"messages":[{"role":"system","content":"The conversation below is for a text summarization, user asks assistant to summarize a text and assistant should response in just less than 10 words"},{"role":"user","content":"Summarize in a 5-word Title. Give the title only. \"hi\""}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":false} - models/chat_completion_request.h:23
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Stop words:null
 - llamaCPP.cc:213
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Non stream, waiting for respone - llamaCPP.cc:439
20240413 08:01:53.261000 UTC 1400 DEBUG [update_slots] slot 0 released (22 tokens in cache) - context/llama_server_context.h:1601
20240413 08:01:53.262000 UTC 1400 DEBUG [launch_slot_with_data] slot 0 is processing [task id: 3] - context/llama_server_context.h:873
20240413 08:01:53.263000 UTC 1400 DEBUG [update_slots] slot 0 : in cache: 1 tokens | to process: 58 tokens - context/llama_server_context.h:1735
20240413 08:01:53.263000 UTC 1400 DEBUG [update_slots] slot 0 : kv cache rm - [1, end) - context/llama_server_context.h:1740
20240413 08:01:59.250000 UTC 11144 DEBUG [fromRequest] Request: {"messages":[{"role":"user","content":"hi"},{"role":"assistant","content":"Hello! How can I help you today?"},{"role":"user","content":"how are you today?"}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":true,"temperature":0.7,"top_p":0.95,"max_tokens":2048,"stop":["<endofstring>"],"frequency_penalty":0,"presence_penalty":0} - models/chat_completion_request.h:23
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Stop words:[
	"<endofstring>"
]
 - llamaCPP.cc:213
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Streamed, waiting for respone - llamaCPP.cc:333
20240413 08:01:59.250000 UTC 10300 INFO  Request 3: Wait for task to be released:4 - llamaCPP.cc:427
20240413 08:01:59.250000 UTC 11144 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
20240413 08:02:02.677000 UTC 1400 DEBUG [process_tasks] slot unavailable - context/llama_server_context.h:1468
20240413 08:02:02.677000 UTC 11144 ERROR Request 3: Error during inference - llamaCPP.cc:400

2024-04-13T08:02:03.927Z [NITRO]::Debug: 20240413 08:02:02.756000 UTC 10300 INFO  Request 3: Task completed, release it - llamaCPP.cc:431
20240413 08:02:02.756000 UTC 10300 INFO  Request 3: Inference completed - llamaCPP.cc:434
20240413 08:02:03.693000 UTC 11144 DEBUG [fromRequest] Request: {"messages":[{"role":"system","content":"The conversation below is for a text summarization, user asks assistant to summarize a text and assistant should response in just less than 10 words"},{"role":"user","content":"Summarize in a 5-word Title. Give the title only. \"hi\""}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":false} - models/chat_completion_request.h:23
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Stop words:null
 - llamaCPP.cc:213
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Non stream, waiting for respone - llamaCPP.cc:439
20240413 08:02:03.927000 UTC 1400 DEBUG [process_tasks] slot unavailable - context/llama_server_context.h:1468
20240413 08:02:03.927000 UTC 11144 ERROR Request 4: Error during inference - llamaCPP.cc:453
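The logs above show a non-stream request (Request 4) failing with `slot unavailable` while a streamed request (Request 3) is still in flight. Below is a minimal sketch of one way to serialize both kinds of requests onto a single worker thread so a non-stream request cannot start until the previous response has completed. The names `InferenceTask`, `InferenceQueue`, and `Submit` are hypothetical illustrations, not the actual Nitro implementation.

```cpp
// Sketch: a single worker thread drains a queue of inference tasks so that
// stream and non-stream requests never race against each other.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

struct InferenceTask {
  bool stream = false;        // true for streamed responses
  std::function<void()> run;  // the actual inference work
};

class InferenceQueue {
 public:
  InferenceQueue() : worker_([this] { Loop(); }) {}

  ~InferenceQueue() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Both stream and non-stream requests go through the same queue, so a
  // non-stream request can never start while a stream response is still
  // being produced.
  void Submit(InferenceTask task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      InferenceTask task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task.run();  // runs to completion before the next task is dequeued
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<InferenceTask> tasks_;
  bool done_ = false;
  std::thread worker_;  // declared last so other members are ready when it starts
};

int main() {
  InferenceQueue queue;
  queue.Submit({true, [] { std::cout << "stream request done\n"; }});
  queue.Submit({false, [] { std::cout << "non-stream request done\n"; }});
  // Destructor drains the queue and joins the worker thread.
}
```

With this arrangement the second task only begins after the first one returns, which is the property the logs show being violated when the two request types were dispatched on different threads.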

@vansangpfiev requested a review from @tikikun on April 15, 2024 01:12
@tikikun (Contributor) left a comment

LGTM

@vansangpfiev merged commit 8c63530 into main on Apr 15, 2024
@vansangpfiev deleted the fix-race-cond-inference branch on July 8, 2024 05:40