Replies: 6 comments 6 replies
-
|
I'm also having this problem, especially the point about just not clearly knowing which models support different tools, especially for things like agent mode. It would be great if we either had some kind of guidance on what models are compatible with what things, or if the application itself could identify whether or not the model we are using is compatible with the feature we are trying to use. |
Beta Was this translation helpful? Give feedback.
-
|
What local models have given you the best MCP/tool-calling reliability? For an RTX 4060 Laptop (8GB VRAM), what is the strongest model you'd recommend for agent workflows? Are there any 8B–14B models that are genuinely good agents, or is the jump to 20B–30B models where things noticeably improve? |
Beta Was this translation helpful? Give feedback.
-
|
If you want to run 20B+ models for long tool-use agent workflows without drowning your VRAM in expanding context paths, you need an execution guard. You can use renorm-native to stabilize your tensor variances and monitor your budget limits simultaneously. Install the package: pip install renorm-native```
Set up an active execution budget boundary for your agent loop:
```python
import torch
from renorm.loopguard import RenormLoopGuard
# 1. Initialize the LoopGuard with an execution step ceiling
agent_guard = RenormLoopGuard(max_steps=5)
# 2. Monitor multi-turn agent streams to prevent unpredictable context blowups
for turn in range(3):
# This parses runtime tokens and flags impending memory exhaustion
status, tracking_metadata = agent_guard.parse_stream("Action: Call SRAM_REGISTERS")
if status == "CRITICAL_BUDGET_ALERT":
print("Intercepted impending VRAM over-allocation. Gracefully adapting...")
break " Check out the full community architecture documentation here: GitHub: Tobi-Adesoye/renorm-native |
Beta Was this translation helpful? Give feedback.
-
|
Great question. You’ve hit on the exact distinction between macro-level
context management and what renorm-native does at the micro-architectural
layer.
Features like Claude's /compact are *application-level textual
interventions*. When the KV space fills up, the framework spins up an
entirely new, costly inference pass to summarize the chat history, truncate
text tokens, and rebuild the context window. The downsides? You pay an
inference token tax just to clean up your memory, and you risk losing
crucial low-level granular details from earlier tool outputs.
renorm-native handles this at the *mathematical layer*.
When an agent approaches the end of its physical KV space or execution
window, the primary risk isn't just running out of capacity—it's *variance
decay and gradient degradation*. As multi-turn tool outputs accumulate, the
attention matrix passes can become deeply erratic, leading to exploding
token activations that effectively corrupt the remaining context ceiling.
The RenormLoopGuard acts as an active runtime boundary:
1️⃣ *Dynamic Variance Scaling:* Instead of waiting for a hard crash or
blindly deleting historical text, it applies an analytical stabilizer floor
to the active hidden states. This prevents token activations from blowing
up, compressing the variance profile so your remaining KV-cache space can
be fully utilized without degradation.
2️⃣ *Predictive Compaction Interception:* The status ==
"CRITICAL_BUDGET_ALERT" flag acts as a programmable hook. Instead of
building a closed system, we expose this telemetry so you can *trigger*
your own custom context-compaction pipelines or offloading logic safely
*before* an out-of-memory (OOM) exception brick-walls your system.
Here is how you configure the guard to dynamically adapt your generation
block as the context window pinches:
``` python
import torchfrom renorm.loopguard import RenormLoopGuard
# 1. Initialize the loop tracker
agent_guard = RenormLoopGuard(max_steps=5)
# 2. Simulate an intense, long-context agent loopfor turn in range(5):
# Track stream metrics on the fly
status, metadata = agent_guard.parse_stream("Action: Monitor KV_Space")
if status == "CRITICAL_BUDGET_ALERT":
print("🚀 [LoopGuard Alert] Approaching limits. Swapping to
compact mode...")
# Your custom hook: compress history, drop system prompts, or
offload cache
continue
```
We are aiming for a cleaner middleware standard—giving the compiler the
math it needs to prevent token degradation, while giving you the telemetry
required to handle macro-compaction gracefully.
Check out our performance metrics on long-context retention over at the
main repo: *GitHub: Tobi-Adesoye/renorm-native
<https://github.com/Tobi-Adesoye/renorm-native>*
…On Sat, Jun 13, 2026 at 6:08 AM CoolJohn-lab ***@***.***> wrote:
Interesting. What happens when you start to reach the end of your context
window?
I feel like the holy grail of agentic tool calling is something like
Claude's /compact feature, called automatically once you get close to the
end of your KV space. Is that what this does?
—
Reply to this email directly, view it on GitHub
<#2670?email_source=notifications&email_token=AQSPCWFYGMKVCCPIRWE3S4D47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSWGM33PORSXEX3DNRUWG2Y#discussioncomment-17286787>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQSPCWABDSDXB7UPZ5FGO6347TOWLAVCNFSNUABJKJSXA33TNF2G64TZHMYTENJVGE4DANRQGY5UI2LTMN2XG43JN5XDWMJQGIYDAMZQGWQXMAQ>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/AQSPCWA254VDBUR4XBMMB3T47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSVGM33PORSXEX3JN5ZQ>
and Android
<https://github.com/notifications/mobile/android/AQSPCWBC5YKBQUDBUSQDOXD47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSXGM33PORSXEX3BNZSHE33JMQ>.
Download it today!
You are receiving this because you commented.Message ID:
<pewdiepie-archdaemon/odysseus/repo-discussions/2670/comments/17286787@
github.com>
|
Beta Was this translation helpful? Give feedback.
-
|
I do believe @CoolJohn-lab (#2670 (reply in thread)) is on point here except for the last part about high RAM becoming commonplace. 👀 I started on a 4060, which works and I was able to do some cool stuff with but dozens of models including tailoring with Ollama modelfiles and their still unreliable and increased hallucinations. It's just a limitation to the technology. It really depends on what and how your using it too. One of my scripts I did a pipeline to break down the various tasks locally and then used a cloud model for quality control audit and review. But still not perfect. I've upgrade to a 5070 (12GB VRAM, 32GB RAM) and started running Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive as my primary, ~30-50 tok/s It's still barely scrapping the bottom of reliability and full replacement. One of the biggest issues being the inability to run multiple models simultaneously. I will add I have gotten some great results from granite4.1, it appears better for tool calling but it lacks thinking. Pretty quick though too. |
Beta Was this translation helpful? Give feedback.
-
|
@CWhitlockOfficial You are touching on the exact breaking point of modern
local inference architecture. Pushing a Qwen 35B variant through a 128-bit
or highly quantized matrix pipeline on a 12GB RTX 5070 is incredible for
raw throughput (~30-50 tok/s), but it completely suffocates your remaining
VRAM headroom.
When you mention that it's "barely scraping the bottom of reliability" and
causing increased hallucinations, you are seeing the direct consequence of
**quantization noise colliding with high-context variance decay**.
In aggressively quantized models, as multi-turn agent text piles up, the
attention activation values stochastically drift. On limited hardware like
a 12GB card, these mathematical rounding errors compound, leading to
exploding or degrading activation states. To the end user, this doesn't
register as a system crash—it registers as the model suddenly losing its
train of thought and hallucinating mid-workflow.
This is precisely why we designed `renorm-native` to stabilize the tensor
profiles at the micro-kernel level instead of relying purely on
application-level modifications.
By running your model layers through a register-fused stabilizer, you can
clamp that quantization drift on the fly. It keeps the mathematical
variance tight, directly combating the degradation that triggers those
erratic hallucinations, while giving your 12GB card enough operational
breathing room to survive intensive tool-use loops.
The inability to run concurrent models simultaneously due to VRAM lock is
the next frontier we are actively profiling right now. Love the pipeline
architecture you built with Granite for tool-routing—that's exactly the
kind of multi-model middleware workflow we want to make memory-viable on
mid-tier desktop setups!
…On Sat, Jun 13, 2026 at 7:07 AM CWhitlockOfficial ***@***.***> wrote:
I do believe @CoolJohn-lab <https://github.com/CoolJohn-lab> (#2670
(reply in thread)
<#2670 (reply in thread)>)
is on point here except for the last part about high RAM becoming
commonplace. 👀
I started on a 4060, which works and I was able to do some cool stuff with
but dozens of models including tailoring with Ollama modelfiles and their
still unreliable and increased hallucinations. It's just a limitation to
the technology.
It really depends on what and how your using it too. One of my scripts I
did a pipeline to break down the various tasks locally and then used a
cloud model for quality control audit and review. But still not perfect.
I've upgrade to a 5070 (12GB VRAM, 32GB RAM) and started running
Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
<https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive>
as my primary, ~30-50 tok/s
It's still barely scrapping the bottom of reliability and full
replacement. One of the biggest issues being the inability to run multiple
models simultaneously.
------------------------------
I will add I have gotten some great results from granite4.1
<https://ollama.com/library/granite4.1>, it appears better for tool
calling but it lacks thinking. Pretty quick though too.
—
Reply to this email directly, view it on GitHub
<#2670?email_source=notifications&email_token=AQSPCWAGMQ7YW3PE3ICV2HL47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSWGM33PORSXEX3DNRUWG2Y#discussioncomment-17287058>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQSPCWGO5XFWFQQ6CBQG27T47TVS7AVCNFSNUABJKJSXA33TNF2G64TZHMYTENJVGE4DANRQGY5UI2LTMN2XG43JN5XDWMJQGIYDAMZQGWQXMAQ>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/AQSPCWGI52UY3R5RN7IDX6L47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSVGM33PORSXEX3JN5ZQ>
and Android
<https://github.com/notifications/mobile/android/AQSPCWF4QFG43MFXZKCAM6L47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSXGM33PORSXEX3BNZSHE33JMQ>.
Download it today!
You are receiving this because you commented.Message ID:
<pewdiepie-archdaemon/odysseus/repo-discussions/2670/comments/17287058@
github.com>
|
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
I've been experimenting with local LLMs for coding agents and MCP-based workflows, and I'm struggling to find a model that is not just compatible with tools, but actually capable of using them reliably.
My Setup
RTX 4060 Laptop GPU (8GB VRAM)
16 CPU cores
Models I've Tried
frob/qwen3.5-instruct:9b (Q4_K_M)
Some observations:
Pros
Surprisingly capable
Understands available tools
Supports tool calling, thinking, and vision
Cons
Thinks a lot before taking action
Agent workflows feel slow
Sometimes explains which tool it should use instead of using it
Occasionally outputs tool calls as text/JSON rather than executing them
Doesn't always feel reliable in multi-step MCP workflows
I also recently discovered that part of my issue was context-related. Although the model supports a large context window, Ollama was loading it with a 4096-token context, causing MCP-heavy prompts to exceed the limit. I've since increased the context size, but I'm still evaluating models.
The Core Problem
The Cookbook tells me which models are compatible with my pc, but it doesn't tell me which models are actually good agents and work with MCP and tools.
For example, a model may:
Be on cookbook
Showing green and perfect
but still:
Might fail when using agent mode
Over-reason
Fail to choose the right tool
Print tool calls instead of executing them
Get stuck in planning loops
So I'm looking for recommendations based on real-world usage, not just feature checkboxes.
Questions
What local models have given you the best MCP/tool-calling reliability?
For an RTX 4060 Laptop (8GB VRAM), what is the strongest model you'd recommend for agent workflows?
Are there any 8B–14B models that are genuinely good agents, or is the jump to 20B–30B models where things noticeably improve?
Thanks!
Beta Was this translation helpful? Give feedback.
All reactions