Need advice on local models for MCP/tool use and agent workflows #2670

mdhvdhage · 2026-06-04T21:07:52Z

mdhvdhage
Jun 4, 2026

I've been experimenting with local LLMs for coding agents and MCP-based workflows, and I'm struggling to find a model that is not just compatible with tools, but actually capable of using them reliably.

My Setup
RTX 4060 Laptop GPU (8GB VRAM)
16 CPU cores
Models I've Tried

frob/qwen3.5-instruct:9b (Q4_K_M)

Some observations:

Pros

Surprisingly capable
Understands available tools
Supports tool calling, thinking, and vision

Cons

Thinks a lot before taking action
Agent workflows feel slow
Sometimes explains which tool it should use instead of using it
Occasionally outputs tool calls as text/JSON rather than executing them
Doesn't always feel reliable in multi-step MCP workflows

I also recently discovered that part of my issue was context-related. Although the model supports a large context window, Ollama was loading it with a 4096-token context, causing MCP-heavy prompts to exceed the limit. I've since increased the context size, but I'm still evaluating models.

The Core Problem

The Cookbook tells me which models are compatible with my pc, but it doesn't tell me which models are actually good agents and work with MCP and tools.

For example, a model may:

Be on cookbook
Showing green and perfect

but still:

Might fail when using agent mode
Over-reason
Fail to choose the right tool
Print tool calls instead of executing them
Get stuck in planning loops

So I'm looking for recommendations based on real-world usage, not just feature checkboxes.

Questions
What local models have given you the best MCP/tool-calling reliability?
For an RTX 4060 Laptop (8GB VRAM), what is the strongest model you'd recommend for agent workflows?
Are there any 8B–14B models that are genuinely good agents, or is the jump to 20B–30B models where things noticeably improve?

Thanks!

webbben · 2026-06-05T00:56:43Z

webbben
Jun 5, 2026

I'm also having this problem, especially the point about just not clearly knowing which models support different tools, especially for things like agent mode. It would be great if we either had some kind of guidance on what models are compatible with what things, or if the application itself could identify whether or not the model we are using is compatible with the feature we are trying to use.

4 replies

jorgeporragas Jun 5, 2026

Same here. I've been using Qwen3:14B and it often forgets it can look up things online or modify my calendar. I've been testing to see if I have to add skills so that it understands how to use the Python functions that come with Odysseus or if it will get the hang of things on its own.

mdhvdhage Jun 5, 2026
Author

yeah man, Felix got super gpus and could afford them. but i guess its difficult for us

jorgeporragas Jun 5, 2026

I think he's noticed which explains his latest post on wanting to simplify the system prompt. The initial prompt that tells the agent how to use Odysseus is very large on its own so the model has to read all of that first and then try to do what you asked it to and it just runs out of tokens. But actually using the tools should be relatively simple, so I think if they optimize the system prompt then smaller models should be quite good at doing what they need to do.

The main issue I've noticed with Odysseus so far is that the smaller agents simply aren't able to get the hang of Odysseus within their resource constraints. I'm hopeful they can fix it though.

mdhvdhage Jun 5, 2026
Author

True I too noticed that..
When i pull a model from ollama it runs perfectly in the terminal but fails to run in Odysseus...

jorgeporragas · 2026-06-05T19:59:08Z

jorgeporragas
Jun 5, 2026

What local models have given you the best MCP/tool-calling reliability?
So far, none of the ones I've tested. The smaller models are struggling to understand how to use Odysseus because the system prompt is too large, it seems.

For an RTX 4060 Laptop (8GB VRAM), what is the strongest model you'd recommend for agent workflows?
From what I've gathered in the discussion you should try to use a model that fits entirely in your GPU and still has enough memory to spare for context.

Are there any 8B–14B models that are genuinely good agents, or is the jump to 20B–30B models where things noticeably improve?
I tried some larger models. They are considerably better at reasoning and understanding how to use Odysseus (aka how to use the Python functions like manage_memory or manage_calendar ) from what I've noticed, but the problem is that they're harder to run and having a larger model and giving it a large enough context window to fit the already 10K+ tokens the model has to read before it even gets to your prompt is going to get resource intensive fast.

1 reply

CoolJohn-lab Jun 6, 2026

No, there aren't any good agentic 8B param models. Text editing, reasoning, web searching, all that stuff, sure, but agentic tool calling is about the hardest thing you can ask a AI model to do, outside of generating images or video.

16GB VRAM and 8GB of weights in a heavily optimised MoE model should really be the minimum spec for doing agentic work like Oddy does. And even then it is fairly limited and you will still bump up against confusing issues, wrong output and hallucinations.

A machine with 32GB VRAM or a MacBook M2 or above with 32GB+ of unified memory and then we are starting to get to the level where agentic is starting to cook. Heavily quantised 27B+ is then possible. With enough system RAM and VRAM combined, say 128GB of RAM and 32GB VRam, much bigger models are possible with severe MoE offloading and clever sharding strategies, but will run very slowly. Errors will be more rare and output higher quality and more detailed.

64GB of VRAM or unified memory and you start to be able to run the very capable Qwen 3.6 27B models entirely in VRAM, and you're really cooking at that point. With enough system RAM and MoE sharding, slow 80B param models are possible.
You're still limited to relatively simple tool-calling agentic processes though, you can't get too crazy. Everything in Oddy will work, but it still brain farts now and then.

For complex and serious agentic workflows that "just work" this is really out of reach still for the average consumer, 128GB and above models are out there, and some of them rival Claude or ChatGPT for agentic capability, but until consumer laptops ship with that much VRAM or unified memory plus AI accelerator chips as standard, that sort of capability is out of reach.

We don't have to wait too long though. In 2-3 years we will start to see 256GB as minimum spec for a laptop and 1TB+ unified memory will be commonplace on affordable devices.

Tobi-Adesoye · 2026-06-13T00:19:10Z

Tobi-Adesoye
Jun 13, 2026

If you want to run 20B+ models for long tool-use agent workflows without drowning your VRAM in expanding context paths, you need an execution guard. You can use renorm-native to stabilize your tensor variances and monitor your budget limits simultaneously.

Install the package:

pip install renorm-native```

Set up an active execution budget boundary for your agent loop:

```python
import torch
from renorm.loopguard import RenormLoopGuard

# 1. Initialize the LoopGuard with an execution step ceiling
agent_guard = RenormLoopGuard(max_steps=5)

# 2. Monitor multi-turn agent streams to prevent unpredictable context blowups
for turn in range(3):
    # This parses runtime tokens and flags impending memory exhaustion
    status, tracking_metadata = agent_guard.parse_stream("Action: Call SRAM_REGISTERS")
    if status == "CRITICAL_BUDGET_ALERT":
        print("Intercepted impending VRAM over-allocation. Gracefully adapting...")
        break

"

Check out the full community architecture documentation here: GitHub: Tobi-Adesoye/renorm-native

1 reply

CoolJohn-lab Jun 13, 2026

Interesting. What happens when you start to reach the end of your context window?

I feel like the holy grail of agentic tool calling is something like Claude's /compact feature, called automatically once you get close to the end of your KV space. Is that what this does?

Tobi-Adesoye · 2026-06-13T05:35:43Z

Tobi-Adesoye
Jun 13, 2026

Great question. You’ve hit on the exact distinction between macro-level context management and what renorm-native does at the micro-architectural layer. Features like Claude's /compact are *application-level textual interventions*. When the KV space fills up, the framework spins up an entirely new, costly inference pass to summarize the chat history, truncate text tokens, and rebuild the context window. The downsides? You pay an inference token tax just to clean up your memory, and you risk losing crucial low-level granular details from earlier tool outputs. renorm-native handles this at the *mathematical layer*. When an agent approaches the end of its physical KV space or execution window, the primary risk isn't just running out of capacity—it's *variance decay and gradient degradation*. As multi-turn tool outputs accumulate, the attention matrix passes can become deeply erratic, leading to exploding token activations that effectively corrupt the remaining context ceiling. The RenormLoopGuard acts as an active runtime boundary: 1️⃣ *Dynamic Variance Scaling:* Instead of waiting for a hard crash or blindly deleting historical text, it applies an analytical stabilizer floor to the active hidden states. This prevents token activations from blowing up, compressing the variance profile so your remaining KV-cache space can be fully utilized without degradation. 2️⃣ *Predictive Compaction Interception:* The status == "CRITICAL_BUDGET_ALERT" flag acts as a programmable hook. Instead of building a closed system, we expose this telemetry so you can *trigger* your own custom context-compaction pipelines or offloading logic safely *before* an out-of-memory (OOM) exception brick-walls your system. Here is how you configure the guard to dynamically adapt your generation block as the context window pinches: ``` python import torchfrom renorm.loopguard import RenormLoopGuard # 1. Initialize the loop tracker agent_guard = RenormLoopGuard(max_steps=5) # 2. Simulate an intense, long-context agent loopfor turn in range(5): # Track stream metrics on the fly status, metadata = agent_guard.parse_stream("Action: Monitor KV_Space") if status == "CRITICAL_BUDGET_ALERT": print("🚀 [LoopGuard Alert] Approaching limits. Swapping to compact mode...") # Your custom hook: compress history, drop system prompts, or offload cache continue ``` We are aiming for a cleaner middleware standard—giving the compiler the math it needs to prevent token degradation, while giving you the telemetry required to handle macro-compaction gracefully. Check out our performance metrics on long-context retention over at the main repo: *GitHub: Tobi-Adesoye/renorm-native <https://github.com/Tobi-Adesoye/renorm-native>*

…

On Sat, Jun 13, 2026 at 6:08 AM CoolJohn-lab ***@***.***> wrote: Interesting. What happens when you start to reach the end of your context window? I feel like the holy grail of agentic tool calling is something like Claude's /compact feature, called automatically once you get close to the end of your KV space. Is that what this does? — Reply to this email directly, view it on GitHub <#2670?email_source=notifications&email_token=AQSPCWFYGMKVCCPIRWE3S4D47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSWGM33PORSXEX3DNRUWG2Y#discussioncomment-17286787>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSPCWABDSDXB7UPZ5FGO6347TOWLAVCNFSNUABJKJSXA33TNF2G64TZHMYTENJVGE4DANRQGY5UI2LTMN2XG43JN5XDWMJQGIYDAMZQGWQXMAQ> . Triage notifications, keep track of coding agent tasks and review pull requests on the go with GitHub Mobile for iOS <https://github.com/notifications/mobile/ios/AQSPCWA254VDBUR4XBMMB3T47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSVGM33PORSXEX3JN5ZQ> and Android <https://github.com/notifications/mobile/android/AQSPCWBC5YKBQUDBUSQDOXD47TOWLA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3DOOBXUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSXGM33PORSXEX3BNZSHE33JMQ>. Download it today! You are receiving this because you commented.Message ID: <pewdiepie-archdaemon/odysseus/repo-discussions/2670/comments/17286787@ github.com>

0 replies

CWhitlockOfficial · 2026-06-13T06:07:23Z

CWhitlockOfficial
Jun 13, 2026

I do believe @CoolJohn-lab (#2670 (reply in thread)) is on point here except for the last part about high RAM becoming commonplace. 👀

I started on a 4060, which works and I was able to do some cool stuff with but dozens of models including tailoring with Ollama modelfiles and their still unreliable and increased hallucinations. It's just a limitation to the technology.

It really depends on what and how your using it too. One of my scripts I did a pipeline to break down the various tasks locally and then used a cloud model for quality control audit and review. But still not perfect.

I've upgrade to a 5070 (12GB VRAM, 32GB RAM) and started running Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive as my primary, ~30-50 tok/s

It's still barely scrapping the bottom of reliability and full replacement. One of the biggest issues being the inability to run multiple models simultaneously.

I will add I have gotten some great results from granite4.1, it appears better for tool calling but it lacks thinking. Pretty quick though too.

0 replies

Tobi-Adesoye · 2026-06-13T07:32:10Z

Tobi-Adesoye
Jun 13, 2026

@CWhitlockOfficial You are touching on the exact breaking point of modern local inference architecture. Pushing a Qwen 35B variant through a 128-bit or highly quantized matrix pipeline on a 12GB RTX 5070 is incredible for raw throughput (~30-50 tok/s), but it completely suffocates your remaining VRAM headroom. When you mention that it's "barely scraping the bottom of reliability" and causing increased hallucinations, you are seeing the direct consequence of **quantization noise colliding with high-context variance decay**. In aggressively quantized models, as multi-turn agent text piles up, the attention activation values stochastically drift. On limited hardware like a 12GB card, these mathematical rounding errors compound, leading to exploding or degrading activation states. To the end user, this doesn't register as a system crash—it registers as the model suddenly losing its train of thought and hallucinating mid-workflow. This is precisely why we designed `renorm-native` to stabilize the tensor profiles at the micro-kernel level instead of relying purely on application-level modifications. By running your model layers through a register-fused stabilizer, you can clamp that quantization drift on the fly. It keeps the mathematical variance tight, directly combating the degradation that triggers those erratic hallucinations, while giving your 12GB card enough operational breathing room to survive intensive tool-use loops. The inability to run concurrent models simultaneously due to VRAM lock is the next frontier we are actively profiling right now. Love the pipeline architecture you built with Granite for tool-routing—that's exactly the kind of multi-model middleware workflow we want to make memory-viable on mid-tier desktop setups!

…

On Sat, Jun 13, 2026 at 7:07 AM CWhitlockOfficial ***@***.***> wrote: I do believe @CoolJohn-lab <https://github.com/CoolJohn-lab> (#2670 (reply in thread) <#2670 (reply in thread)>) is on point here except for the last part about high RAM becoming commonplace. 👀 I started on a 4060, which works and I was able to do some cool stuff with but dozens of models including tailoring with Ollama modelfiles and their still unreliable and increased hallucinations. It's just a limitation to the technology. It really depends on what and how your using it too. One of my scripts I did a pipeline to break down the various tasks locally and then used a cloud model for quality control audit and review. But still not perfect. I've upgrade to a 5070 (12GB VRAM, 32GB RAM) and started running Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive <https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive> as my primary, ~30-50 tok/s It's still barely scrapping the bottom of reliability and full replacement. One of the biggest issues being the inability to run multiple models simultaneously. ------------------------------ I will add I have gotten some great results from granite4.1 <https://ollama.com/library/granite4.1>, it appears better for tool calling but it lacks thinking. Pretty quick though too. — Reply to this email directly, view it on GitHub <#2670?email_source=notifications&email_token=AQSPCWAGMQ7YW3PE3ICV2HL47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSWGM33PORSXEX3DNRUWG2Y#discussioncomment-17287058>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSPCWGO5XFWFQQ6CBQG27T47TVS7AVCNFSNUABJKJSXA33TNF2G64TZHMYTENJVGE4DANRQGY5UI2LTMN2XG43JN5XDWMJQGIYDAMZQGWQXMAQ> . Triage notifications, keep track of coding agent tasks and review pull requests on the go with GitHub Mobile for iOS <https://github.com/notifications/mobile/ios/AQSPCWGI52UY3R5RN7IDX6L47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSVGM33PORSXEX3JN5ZQ> and Android <https://github.com/notifications/mobile/android/AQSPCWF4QFG43MFXZKCAM6L47TVS7A5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNZSHA3TANJYUZZGKYLTN5XKOY3PNVWWK3TUUVSXMZLOOSXGM33PORSXEX3BNZSHE33JMQ>. Download it today! You are receiving this because you commented.Message ID: <pewdiepie-archdaemon/odysseus/repo-discussions/2670/comments/17287058@ github.com>

0 replies

Need advice on local models for MCP/tool use and agent workflows #2670

Uh oh!

Replies: 6 comments · 6 replies

Uh oh!

Uh oh!

Uh oh!

mdhvdhage Jun 5, 2026 Author

Uh oh!

Uh oh!

mdhvdhage Jun 5, 2026 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Replies: 6 comments 6 replies

mdhvdhage Jun 5, 2026
Author

mdhvdhage Jun 5, 2026
Author