
Conversation

@vansangpfiev (Contributor) commented Apr 14, 2024

Issue: stream and non-stream requests are currently not handled on the same thread. This leads to an `Error during inference` error, which only happens on some older machines when a non-stream request is sent while the stream response has not yet completed.

20240413 08:01:52.241000 UTC 1476 INFO  Request 1: Reached result stop - llamaCPP.cc:387
20240413 08:01:52.241000 UTC 1476 INFO  Request 1: End of result - llamaCPP.cc:353
20240413 08:01:52.273000 UTC 10300 INFO  Request 1: Task completed, release it - llamaCPP.cc:431
20240413 08:01:52.273000 UTC 10300 INFO  Request 1: Inference completed - llamaCPP.cc:434
20240413 08:01:53.261000 UTC 1476 DEBUG [fromRequest] Request: {"messages":[{"role":"system","content":"The conversation below is for a text summarization, user asks assistant to summarize a text and assistant should response in just less than 10 words"},{"role":"user","content":"Summarize in a 5-word Title. Give the title only. \"hi\""}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":false} - models/chat_completion_request.h:23
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Stop words:null
 - llamaCPP.cc:213
20240413 08:01:53.261000 UTC 1476 INFO  Request 2: Non stream, waiting for respone - llamaCPP.cc:439
20240413 08:01:53.261000 UTC 1400 DEBUG [update_slots] slot 0 released (22 tokens in cache) - context/llama_server_context.h:1601
20240413 08:01:53.262000 UTC 1400 DEBUG [launch_slot_with_data] slot 0 is processing [task id: 3] - context/llama_server_context.h:873
20240413 08:01:53.263000 UTC 1400 DEBUG [update_slots] slot 0 : in cache: 1 tokens | to process: 58 tokens - context/llama_server_context.h:1735
20240413 08:01:53.263000 UTC 1400 DEBUG [update_slots] slot 0 : kv cache rm - [1, end) - context/llama_server_context.h:1740
20240413 08:01:59.250000 UTC 11144 DEBUG [fromRequest] Request: {"messages":[{"role":"user","content":"hi"},{"role":"assistant","content":"Hello! How can I help you today?"},{"role":"user","content":"how are you today?"}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":true,"temperature":0.7,"top_p":0.95,"max_tokens":2048,"stop":["<endofstring>"],"frequency_penalty":0,"presence_penalty":0} - models/chat_completion_request.h:23
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Stop words:[
	"<endofstring>"
]
 - llamaCPP.cc:213
20240413 08:01:59.250000 UTC 11144 INFO  Request 3: Streamed, waiting for respone - llamaCPP.cc:333
20240413 08:01:59.250000 UTC 10300 INFO  Request 3: Wait for task to be released:4 - llamaCPP.cc:427
20240413 08:01:59.250000 UTC 11144 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
20240413 08:02:02.677000 UTC 1400 DEBUG [process_tasks] slot unavailable - context/llama_server_context.h:1468
20240413 08:02:02.677000 UTC 11144 ERROR Request 3: Error during inference - llamaCPP.cc:400

2024-04-13T08:02:03.927Z [NITRO]::Debug: 20240413 08:02:02.756000 UTC 10300 INFO  Request 3: Task completed, release it - llamaCPP.cc:431
20240413 08:02:02.756000 UTC 10300 INFO  Request 3: Inference completed - llamaCPP.cc:434
20240413 08:02:03.693000 UTC 11144 DEBUG [fromRequest] Request: {"messages":[{"role":"system","content":"The conversation below is for a text summarization, user asks assistant to summarize a text and assistant should response in just less than 10 words"},{"role":"user","content":"Summarize in a 5-word Title. Give the title only. \"hi\""}],"model":"mistral-7b-instruct-v0.1.Q4_0","stream":false} - models/chat_completion_request.h:23
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Generating reponse for inference request - llamaCPP.cc:192
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Stop words:null
 - llamaCPP.cc:213
20240413 08:02:03.693000 UTC 11144 INFO  Request 4: Non stream, waiting for respone - llamaCPP.cc:439
20240413 08:02:03.927000 UTC 1400 DEBUG [process_tasks] slot unavailable - context/llama_server_context.h:1468
20240413 08:02:03.927000 UTC 11144 ERROR Request 4: Error during inference - llamaCPP.cc:453
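The logs above show a non-stream request (Request 4) failing with `slot unavailable` while a streamed request (Request 3) is still in flight. Below is a minimal sketch of one way to serialize both kinds of requests onto a single worker thread so a non-stream request cannot start until the previous response has completed. The names `InferenceTask`, `InferenceQueue`, and `Submit` are hypothetical illustrations, not the actual Nitro implementation.

```cpp
// Sketch: a single worker thread drains a queue of inference tasks so that
// stream and non-stream requests never race against each other.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

struct InferenceTask {
  bool stream = false;        // true for streamed responses
  std::function<void()> run;  // the actual inference work
};

class InferenceQueue {
 public:
  InferenceQueue() : worker_([this] { Loop(); }) {}

  ~InferenceQueue() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Both stream and non-stream requests go through the same queue, so a
  // non-stream request can never start while a stream response is still
  // being produced.
  void Submit(InferenceTask task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      InferenceTask task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task.run();  // runs to completion before the next task is dequeued
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<InferenceTask> tasks_;
  bool done_ = false;
  std::thread worker_;  // declared last so other members are ready when it starts
};

int main() {
  InferenceQueue queue;
  queue.Submit({true, [] { std::cout << "stream request done\n"; }});
  queue.Submit({false, [] { std::cout << "non-stream request done\n"; }});
  // Destructor drains the queue and joins the worker thread.
}
```

With this arrangement the second task only begins after the first one returns, which is the property the logs show being violated when the two request types were dispatched on different threads.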

@vansangpfiev requested a review from @tikikun on April 15, 2024 01:12
@tikikun (Contributor) left a comment

LGTM

@vansangpfiev merged commit 8c63530 into main on Apr 15, 2024
@vansangpfiev deleted the fix-race-cond-inference branch on July 8, 2024 05:40