Idea: Odysseus finetune for smaller models (1–14B), collecting samples #2724

pewdiepie-archdaemon · 2026-06-05T01:48:49Z

pewdiepie-archdaemon
Jun 5, 2026
Maintainer

Hey everyone,

Right now Odysseus only really works well with larger models. The agent system prompt has grown to around 10K tokens at this point — that's the tool surface, the per-tool guidance, and the safety/style notes all baked in. Smaller models with shorter context windows or weaker instruction-following struggle to keep all of that loaded and do real work on top of it.

I'd like to fix that by training a finetune specifically for Odysseus. The goal: a model in the 1B–14B range that can handle the same tasks a larger general-purpose model would — chat, agent tool use, manage_notes/calendar/tasks/memory, deep research, document workflows, cookbook decisions. Self-hosted users on smaller hardware shouldn't be locked out of the agent path just because the prompt got too heavy.

A nice side benefit: it would also solve the "which model should I actually run?" question a lot of people have when setting up. "Run Odysseus-7B" is a much clearer answer than "pick a 7B that's good at tool use AND instruction-following AND has long context AND has been trained on the right kinds of data AND…"

Rough first plan:

Target ~20K samples across the major Odysseus task surfaces
Cover a range of base sizes — 1B, 3B, 7B, 14B — so people can pick what fits their hardware
Open the dataset alongside the model so the finetune is reproducible

Where I'd love help — sample gathering, ideally split per task so people can focus on what they know best. Real traces from your own use (anonymized, opt-in) or synthetic samples both work. Reply with the area you'd want to take on:

Research samples — deep_research multi-step flows
Task creation samples — manage_tasks scheduled / cron flows
Notes & reminder samples — manage_notes with due_date
Calendar event samples — manage_calendar with reminder_minutes
Memory consolidation samples — manage_memory edit/audit
Document workflow samples — open, edit, AI-assist
Agent tool use traces — general agent-mode chats with multi-tool sequences
Cookbook decision samples — model picking, serving choices

If you take a slice, gather as much as you can in that lane. We can sort out the format/quality bar in a follow-up once people self-assign.

The biggest open question for the discussion: is this even necessary?

It's possible the right answer isn't a finetune at all — maybe it's slimming the system prompt, or loading per-tool prompts just-in-time, or better Cookbook recommendations on which existing models work well. Genuinely open to "you don't need to train anything" if that's where the discussion lands.

— Felix

alteixeira20 · 2026-06-05T02:17:53Z

alteixeira20
Jun 5, 2026
Collaborator

I like the direction, but I’d be careful not to treat finetuning as the first fix before we prove where the bottleneck really is.

If the agent prompt is already around 10K tokens, part of the problem may be architectural rather than model-specific: too much static tool/context guidance is being loaded before we know which task path the user actually needs. A finetune might help smaller models follow Odysseus conventions better, but it may also hide the fact that the prompt/tool surface itself needs to be slimmer and more modular.

My instinct would be to approach this in stages:

measure where the prompt tokens are actually going;
define a small eval suite for the major Odysseus task surfaces;
benchmark a few existing 1B/3B/7B/14B models against those evals;
try prompt slimming / per-tool or just-in-time tool guidance;
only then decide whether a dedicated finetune is still needed.

That would also make the dataset work much safer. If we collect samples before having evals and a clear trace format, we risk building a dataset that reflects today’s prompt shape too closely instead of the behavior we actually want Odysseus models to learn.

So I’m not against the finetune idea. I think it could be valuable, especially for self-hosted users. I’d just prefer to make the first milestone an evidence-gathering and prompt-surface reduction pass, then use that to decide whether training is the right next step or whether a lighter architecture change gets us most of the benefit.

Happy to expand on any of those areas if useful; I’m mostly thinking in terms of keeping this measurable and avoiding locking the project into the wrong fix too early.

Best regards,
Alexandre.

2 replies

pewdiepie-archdaemon Jun 5, 2026
Maintainer Author

Alexandre — you're right, and I'm with you on the split.

Two problems, two fixes:

1. Prompt bloat. ~10K tokens of static tool guidance loaded every turn regardless of what the user is actually trying to do. That's a project-side problem and the fix is to slim + modularize the surface — per-tool prompts, just-in-time loading by task path, cut redundancy. I want to do this regardless of where the finetune discussion lands. Plan: audit where the tokens actually go, cut, then test each skill/MCP usage individually before reassembling, with a small eval suite alongside so the cuts are measurable rather than vibes-based.

2. "Which model do I run?" Even with a slimmer prompt, self-hosters still hit "I have no idea which 7B actually works well with Odysseus." A finetune could answer that — but as you point out, better Cookbook recommendations + a public eval scoreboard might answer it without ever training a model. So finetune goes to backlog.

One thing I'd still want to start in parallel, lightly: sample collection. Real trace data takes weeks to accumulate, and the eval suite from (1) is going to need example traces to score against anyway. So the sample work isn't wasted even if we never train — it becomes the eval set + a public dataset for whoever does want to train. Just won't be the immediate priority.

Rough order:

Prompt audit + cuts (measurable, near-term)
Small eval suite, built as the cuts go in
Benchmark existing 1B–14B candidates (Qwen3 / Llama 3 / DeepSeek / Gemma) — answer "which model to run" with data, not vibes
Sample collection in the background, format informed by the eval suite (so we don't bake in today's prompt shape)
Then decide if a dedicated finetune is still on the table

Thanks for the steer — exactly the kind of pushback I was hoping for when I posted this.

— Felix

alteixeira20 Jun 5, 2026
Collaborator

I think that split is the right way to frame it.

I opened #2750 to track the prompt-bloat side specifically. I tried to frame it as a parent roadmap/tracker rather than a giant implementation issue: measurement first, then small low-risk slimming PRs, then eval-guided modular/JIT guidance before touching anything more behavior-sensitive.

That is roughly how I interpreted your comment, with a bit of added structure around the PR order. I’d appreciate your take on whether the order/scope looks right before I start opening PRs against it.

My instinct for the first PR is still measurement-only: assembled prompt token breakdown diagnostics, with no prompt wording changes and no tool-selection changes. Once we have that baseline, the first actual slimming target looks like MCP prose filtering.

Zaptosis · 2026-06-05T02:22:48Z

Zaptosis
Jun 5, 2026

@pewdiepie-archdaemon look at the proposal to re-license as AGPL, its better at protecting against big tech abuses & ensuring the source code remains free.

Right now with a MIT license, OpenAI could take your projects code & implement these features into their closed source program despite it being the open source work of yourself & your community.

Based on what how you've spoke about these issues in your videos, I imagine once you look into this you'll become a full on copyleft crusader & promoting those licenses. Its such a cool legal concept that changes must remain open. Its a very powerful too to fight back against big tech & the closing off of the internet.

Also yes big models, small models, all good, very yes yes

0 replies

undergroundrap · 2026-06-05T05:06:22Z

undergroundrap
Jun 5, 2026

This makes sense to me too. Since #2750 is already tracking the prompt-bloat side, I’m happy to help with the measurement-first pass rather than jumping straight into prompt rewrites.

A focused first contribution I could take: diagnostics that report the assembled agent prompt/token breakdown by section/tool surface, with no behavior or wording changes. That would give the slimming/eval work a baseline and make later PRs easier to review.

If you already have a preferred shape for that output, I can follow it; otherwise I can propose something small and non-invasive on #2750 before opening a PR.

4 replies

alteixeira20 Jun 5, 2026
Collaborator

That seems like a great start to me, and yes feel free to elaborate please.

undergroundrap Jun 5, 2026

Great. I’d frame the first pass as measurement-only and explicitly non-behavioral.

What I’m thinking:

Add a small diagnostic path that builds the same agent prompt the runtime would send, then reports char/token breakdowns by source section.
Keep it read-only: no prompt wording changes, no tool-selection changes, no routing changes.
Make the output useful for later slimming PRs by showing which surfaces are actually expensive.

A useful breakdown might be:

base agent preamble / rules;
API-agent rules if present;
builtin tool sections;
MCP tool descriptions;
skill index / skill snippets;
document or active-context injection;
integration-specific prompt additions;
date/time/user-context preface.

For each section, I’d report something like:

section name;
character count;
estimated token count using the existing estimator where available;
percentage of total assembled prompt;
maybe top N largest tool sections so we can see where the bloat really is.

I’d keep the first PR boring on purpose: probably a helper plus focused tests, and either a CLI/dev-only route/log output depending on what fits the existing code best. The goal would be to create a baseline that later PRs can cite before cutting or modularizing anything.

After that, the next PRs can be small and evidence-based: “this section is X% of the prompt, here is a low-risk reduction, here is the before/after token count.”

If that shape sounds right, I can write up the exact proposed output format on #2750 before touching code.

alteixeira20 Jun 5, 2026
Collaborator

That shape sounds right to me.

I’d keep this first pass strictly measurement-only: no prompt wording changes, no routing changes, no tool-selection changes. Just a boring diagnostic that shows where the assembled prompt budget is going, with enough section-level detail to make later slimming PRs measurable.

The breakdown you listed looks like a good starting point:

base agent preamble/rules;
API-agent additions if present;
builtin tool sections;
MCP tool descriptions;
skill index/snippets;
active document/context injection;
integration-specific additions;
date/time/user-context preface;
top largest sections/tools by estimated token count.

I think the exact output format and implementation shape should move to #2750, though, so this thread can stay focused on the broader finetune vs prompt-slimming direction. Could you add a refined version of that proposal there before opening a PR?

From my side, the ideal first PR would be small and non-invasive: helper + focused tests, producing a clear baseline that later PRs can cite when reducing or modularizing prompt sections.

undergroundrap Jun 5, 2026

Sounds good. I’ll move the concrete diagnostic/output proposal over to #2750 and keep this thread focused on the broader direction.

I’ll keep the first implementation proposal measurement-only: helper + focused tests, with no prompt wording, routing, or tool-selection changes.

jorgeporragas · 2026-06-05T20:51:07Z

jorgeporragas
Jun 5, 2026

I want to help.

I've been using Oddyseus on a custom rig:

Ryzen 5 5600X
32GB DDR4 RAM
Radeon RX 9600 XT 16GB
1TB NVMe SSD
Fedora 44

and running different tests with different agents, particularly testing it to see if it can run the workspace using my Obsidian vault, create files and manage my calendar autonomously. I've also been documenting all my findings on a personal markdown file.

I'm new to self-hosting LLM models but I'd gladly gather samples from my particular use case as long as I know the proper way to share them, I don't want to spam the forums or throw useless data at you.

2 replies

johanne-ks-hub Jun 7, 2026

I would love to see when you get Obsidian and Odysseus to work together. I unfortunalty know little to nothing about coding, but did manage to get Odysseus to work on my Mac. Would be ideal for my workflow if I could integrate my Obsidian vaults with Odysseus.

jorgeporragas Jun 7, 2026

So far, I've had limited success, sadly. Odysseus has its own file structure, but it also includes built-in functions to read other files on your computer and index documents. Unfortunately, the AI rarely reads them correctly—or even finds them—even when given explicit file paths.

I haven't yet figured out whether this is an issue with how Odysseus is built or a limitation of the LLMs I'm using. I got Gemini 2.5 Flash running through the API, and while it's much better than Qwen 3 14B, it still occasionally struggles to see files or claims it doesn't have access to my computer. As a result, it sometimes fails to recognize the Python functions built into Odysseus.

My current assumption is that this is more of an Odysseus problem than an LLM problem. Since I'm running models locally, I have to stick to smaller models in the 7B–14B range to ensure my GPU can handle both the model and a reasonably large context window. I'll keep testing both local models and Gemini to see what I find.

If I get it working, I'll let you know!

bitboody · 2026-06-06T03:30:17Z

bitboody
Jun 6, 2026

@pewdiepie-archdaemon Instead of manual logs, we should introduce a 'Trace Collector' dev-tool built into Odysseus. It would allow users to flag successful agent interactions as dataset-ready with ease. This sets a low barrier to contribution and keeps the dataset quality consistent.

— Abdelrahman

4 replies

bitboody Jun 6, 2026

If you believe this is issue-worthy, I'd be happy to create one.

pewdiepie-archdaemon Jun 6, 2026
Maintainer Author

That would be super helpful tbh

bitboody Jun 6, 2026

On it 🫡

bitboody Jun 6, 2026

Done: #3060 — opened an issue for the proposal. @pewdiepie-archdaemon

ghost · 2026-06-07T02:41:53Z

ghost
Jun 7, 2026

1 reply

alteixeira20 Jun 7, 2026
Collaborator

Don't be lonely your security concerns made me get out of bed xDD

VykosMolt · 2026-06-07T09:35:49Z

VykosMolt
Jun 7, 2026

I think Odysseus should seriously test Ouro / Ouro-RLTT as a small-model backend.

Ouro is a looped language model, which means it gets extra effective reasoning depth by reusing latent computation over multiple internal steps before producing tokens. Parameter count undersells it badly.

Think of it like looking at a puzzle, Claude, or others like it, will look at it once and try to solve it, while Ouro will look at it once; do a part of it, look at it again and do another part and so on..

The original Ouro paper reports that the 1.4B and 2.6B models match up to 12B-class SOTA dense models across a range of benchmarks. That is the actual reason I think it belongs in this discussion: it is not merely a compressed small model, it is a different scaling direction. Dense small models mostly trade capability away to fit local hardware. Looped models spend the parameter budget differently: less static mass, more iterative latent computation.

Princetons RLTT version strengthens that further. Standard RL methods like GRPO reward only the final output/final latent state, which is a bad fit for looped models because the reasoning is distributed across the internal trajectory. RLTT instaed rewards the latent thought trajectory itself. Ouro RLTT improves substantially over GRPO on hard math/reasoning benchmarks.

I can share a Princeton/Ouro-RLTT checkpoint for testing if people are interested.

I would still separate two things: I am not claiming Ouro-RLTT automatically beats Qwen/Coder models at tool calling, file editing, or strict agent execution. Those are separate skills but for our niche, looped models are far more interesting than ordinary dense models at the same parameter count. Ouro-2.6B is closer to a 12B-class reasoning model in a 2.6B body, and the RLTT version pushes that advantage further beyond.

1 reply

VykosMolt Jun 7, 2026

Two more points which I forgot to add originally:
-Loaded Ouro 2.6B Thinking takes up around 5.7GB VRAM
-Looped models dont magically conjure up more knowledge, they do however manipulate knowledge much much more efficiently.

Idea: Odysseus finetune for smaller models (1–14B), collecting samples #2724

Uh oh!

pewdiepie-archdaemon Jun 5, 2026 Maintainer

Replies: 7 comments · 14 replies

Uh oh!

alteixeira20 Jun 5, 2026 Collaborator

Uh oh!

pewdiepie-archdaemon Jun 5, 2026 Maintainer Author

Uh oh!

alteixeira20 Jun 5, 2026 Collaborator

Uh oh!

Uh oh!

Uh oh!

alteixeira20 Jun 5, 2026 Collaborator

Uh oh!

Uh oh!

alteixeira20 Jun 5, 2026 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pewdiepie-archdaemon Jun 6, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alteixeira20 Jun 7, 2026 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pewdiepie-archdaemon
Jun 5, 2026
Maintainer

Replies: 7 comments 14 replies

alteixeira20
Jun 5, 2026
Collaborator

pewdiepie-archdaemon Jun 5, 2026
Maintainer Author

alteixeira20 Jun 5, 2026
Collaborator

alteixeira20 Jun 5, 2026
Collaborator

alteixeira20 Jun 5, 2026
Collaborator

pewdiepie-archdaemon Jun 6, 2026
Maintainer Author

alteixeira20 Jun 7, 2026
Collaborator