issue: conversation abruptly stops across multiple models and backends with many tool calls (REPEATABLE) #24915
Replies: 11 comments 29 replies
-
🔍 Related Issues Found
I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions:
💡 If your issue is a duplicate, please close it and add any additional details to the existing issue instead. This comment was generated automatically. React with 👍 if helpful, 👎 if not.
-
Are you hitting the context limit? OWUI doesn't really tell you if you are; it just stops, like you're describing. The only way to know is if your backend records how much context you're using. If a request fails during a tool call, it won't show up in OWUI, even with usage enabled.
-
For example, opening a single large file consumes 100k context for me.
-
#23466 and #24607 - Not related: my chats don't show printed tool calls; in my case the model just stops generating or won't continue.
#20896 - May be related, but in their use case a CLI coding agent uses Open WebUI as the API backend, so their setup may make troubleshooting more difficult.
#21768 - May be related.
#23863 - Not related; switching to Default tool calling doesn't fix my issue.
-
No, I am not. The context is only 11k to 15k tokens when it stops, and my window size (KV cache size) is 160K+. I am also not hitting the per-generation limit, as confirmed by my vLLM logs. I even tried setting a very high (65k) token generation limit to see if it helps, and it did not.
(APIServer pid=1) INFO 05-19 17:19:08 [logger.py:63] Received request chatcmpl-a8d4c651970416da: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
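For anyone who wants to check the per-request limits the same way, here is a small sketch that pulls the sampling parameters out of a vLLM "Received request" log line. `parse_sampling_params` is a hypothetical helper, and the regex assumes the log shape quoted above (flat `key=value` pairs with no commas inside values), so treat it as a rough aid rather than a robust parser.

```python
import re


def parse_sampling_params(log_line):
    """Extract key=value pairs from the SamplingParams(...) part of a log line.

    Assumes the flat format seen in the log above; values are kept as strings.
    Returns an empty dict if the line has no SamplingParams section.
    """
    match = re.search(r"SamplingParams\((.*?)\), lora_request", log_line)
    if match is None:
        return {}
    # Each parameter looks like name=value, separated by ", ".
    return dict(re.findall(r"(\w+)=([^,]*)", match.group(1)))


# The "Received request" line quoted in the comment above.
line = (
    "(APIServer pid=1) INFO 05-19 17:19:08 [logger.py:63] Received request "
    "chatcmpl-a8d4c651970416da: params: SamplingParams(n=1, presence_penalty=0.0, "
    "frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, "
    "top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], "
    "thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, "
    "max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, "
    "skip_special_tokens=True, spaces_between_special_tokens=True, "
    "structured_outputs=None, extra_args=None), lora_request: None."
)

params = parse_sampling_params(line)
print(params["max_tokens"], params["temperature"])
```

Comparing `max_tokens` against the length of the last response is one way to rule out a generation cap on the backend side.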
-
The files I am working with are created by the prompt. They contain only about 10-30 characters each, and the steps only modify them; they don't get bigger.
-
Okay... I can't replicate this. Maybe someone else can?
-
I also cannot replicate. This has been reported a few times in the past, and every time it turned out to be a provider/upstream issue in the inference layer. Sending to discussions for now because it is not replicable here.
-
Actually, I can confirm the bug. That's unfortunate.
-
Found something, potentially.
-
@vektorprime set CHAT_RESPONSE_MAX_TOOL_CALL_RETRIES to 9999 as an env var for Open WebUI. It is not supposed to behave this way, and I am still investigating, but this is the workaround for now.
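To illustrate why a cap on tool-call rounds could produce exactly this symptom, here is a minimal sketch of a generic tool loop. It is not Open WebUI's code: `run_tool_loop`, `model_step`, and the cap of 30 are hypothetical names and values chosen for illustration, and what CHAT_RESPONSE_MAX_TOOL_CALL_RETRIES really controls is still under investigation.

```python
def run_tool_loop(model_step, max_tool_rounds):
    """Drive a model until it stops requesting tools or the round cap is hit.

    model_step(round_index) returns True if the model asked for another
    tool call in that round. Returns (rounds_completed, stopped_by_cap).
    """
    rounds = 0
    while rounds < max_tool_rounds:
        if not model_step(rounds):
            # The model finished on its own; the cap played no role.
            return rounds, False
        rounds += 1
    # The cap was reached while the model still wanted to call tools.
    # From the user's side this looks like the chat simply stopping.
    return rounds, True


def wants_tools(round_index):
    """Stand-in for a model that needs 40 tool rounds to finish its task list."""
    return round_index < 40


print(run_tool_loop(wants_tools, 30))    # hypothetical low cap
print(run_tool_loop(wants_tools, 9999))  # cap raised as in the workaround
```

With the low cap the loop ends while the model still has a pending tool call, which would match a conversation that stops with no error and resumes when prompted to continue.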
-
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.9.5
Ollama Version (if applicable)
NA
Operating System
Ubuntu 24
Browser (if applicable)
Latest firefox
Confirmation
Expected Behavior
The model should continue generating and tool calling, but it abruptly stops only when interfacing through open-webui.
Actual Behavior
Just stops. I have to prompt it to continue or something similar.
Here's an example of me prompting it to continue.
Steps to Reproduce
Quick summary:
I am using open-webui as the frontend to my locally hosted setup. I am consistently seeing conversations stop even though the model is supposed to continue generating. This occurs with both vLLM and llama-cpp as the backend, and with both Qwen3.6 and Gemma4 models.
System with ALL software up to date:
Ubuntu 24
Docker image of open-webui
How to reproduce:
Make sure native tool calling is enabled for your model
Disable web search and other tools for the conversation so they don't get in the way
Enable open-terminal (for file writing and access)
Use either llama-cpp or vLLM as the backend
Use any model; I first noticed it on Gemma 4 31B, and I mainly use Qwen3.7 27B Q8 (I tried many quants and chat templates)
Paste the following prompt, and you'll see the conversation just stop between tasks 10 and 18. For me it is almost always closer to the upper end of that range.
Here's how I paste my prompt:
The prompt:
The logs & screenshots section will show what it looks like.
If you try this with llama-cpp as the backend, it does the same thing. If you run the same model with the exact same settings and prompt but use the llama-server webui (with a similar MCP setup), it works just fine.
Logs & Screenshots
Here's what it looks like when it stops:
Here's what vLLM shows at the end:
(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:92] Generated response chatcmpl-82807bd2f5345ab6 (streaming complete): output: '\n\n\n\nT9: no\n\nTask 10: In beta.txt, replace yellow with gold. Print the full contents joined by commas.\n\n<tool_call>\n<function=run_command>\n<parameter=command>\npython3 -c "\nlines = open('/home/user/beta.txt').read().strip().split('\n')\nlines = [l for l in lines if l.strip()]\nlines = [l.replace('yellow','gold') if l == 'yellow' else l for l in lines]\nopen('/home/user/beta.txt','w').write('\n'.join(lines) + '\n')\nprint(','.join(lines))\n"\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:63] Received request chatcmpl-8418f4846e0da28f: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:56966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 16:58:45 [async_llm.py:415] Added request chatcmpl-8418f4846e0da28f-8cc2de91.
(APIServer pid=1) INFO 05-19 16:58:48 [logger.py:92] Generated response chatcmpl-8418f4846e0da28f (streaming complete): output: 'The task 10 command is running. Let me wait for it.\n\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-165845-6531de\n\n<parameter=wait>\n3\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Here's ANOTHER run with a new conversation, same exact settings, model etc. In this one there's a function call that never seems to run or show up:
(APIServer pid=1) INFO 05-19 17:15:48 [logger.py:92] Generated response chatcmpl-883f6dde7c01e292 (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So N=5.\n\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-171546-21a6eb\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 17:15:49 [logger.py:63] Received request chatcmpl-a233d880ee7773ab: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:52996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 17:15:49 [async_llm.py:415] Added request chatcmpl-a233d880ee7773ab-8930fb05.
(APIServer pid=1) INFO 05-19 17:15:51 [logger.py:92] Generated response chatcmpl-a233d880ee7773ab (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So colors=5.\n\n\n<tool_call>\n<function=run_command>\n<parameter=command>\necho "colors=5" >> /home/user/beta.txt && tail -n 1 /home/user/beta.txt\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 17:15:51 [loggers.py:271] Engine 000: Avg prompt throughput: 112.2 tokens/s, Avg generation throughput: 35.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 79.5%
And here's the screenshot for the second run:
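Both runs above end the same way: the last "Generated response" contains a `<tool_call>`, and no "Received request" follows it. A rough sketch for spotting that pattern in a saved vLLM log is below. `find_dangling_tool_call` is a hypothetical helper, the sample lines are abbreviated from the logs above, and the check assumes that a later request means the frontend continued the loop.

```python
import re

# The two vLLM log line shapes that matter here (assumed from the logs above).
GENERATED = re.compile(r"Generated response (chatcmpl-[0-9a-f]+)")
RECEIVED = re.compile(r"Received request (chatcmpl-[0-9a-f]+)")


def find_dangling_tool_call(log_lines):
    """Return the id of the final response if it ended in an unanswered tool call.

    A response counts as dangling when it contains <tool_call> and no
    "Received request" line appears after it. Returns None otherwise.
    """
    last_id = None
    last_had_tool_call = False
    for line in log_lines:
        generated = GENERATED.search(line)
        if generated:
            last_id = generated.group(1)
            last_had_tool_call = "<tool_call>" in line
        elif RECEIVED.search(line):
            # A new request arrived, so the frontend kept the loop going.
            last_id = None
            last_had_tool_call = False
    return last_id if last_had_tool_call else None


# Abbreviated sample shaped like the second run in this report.
sample = [
    "INFO [logger.py:92] Generated response chatcmpl-883f6dde7c01e292 "
    "(streaming complete): output: '<tool_call>\\n<function=get_process_status>'",
    "INFO [logger.py:63] Received request chatcmpl-a233d880ee7773ab: params: ...",
    "INFO [logger.py:92] Generated response chatcmpl-a233d880ee7773ab "
    "(streaming complete): output: '<tool_call>\\n<function=run_command>'",
]

print(find_dangling_tool_call(sample))
```

If this returns an id, the backend produced a tool call that the frontend never followed up on, which points away from the inference layer.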
Additional Information
We are not hitting a token generation limit, and the finish_reason in vLLM shows streaming_complete. There's supposed to be another