Idea: Odysseus finetune for smaller models (1–14B), collecting samples #2724
Replies: 7 comments 14 replies
-
|
I like the direction, but I’d be careful not to treat finetuning as the first fix before we prove where the bottleneck really is. If the agent prompt is already around 10K tokens, part of the problem may be architectural rather than model-specific: too much static tool/context guidance is being loaded before we know which task path the user actually needs. A finetune might help smaller models follow Odysseus conventions better, but it may also hide the fact that the prompt/tool surface itself needs to be slimmer and more modular. My instinct would be to approach this in stages:
That would also make the dataset work much safer. If we collect samples before having evals and a clear trace format, we risk building a dataset that reflects today’s prompt shape too closely instead of the behavior we actually want Odysseus models to learn. So I’m not against the finetune idea. I think it could be valuable, especially for self-hosted users. I’d just prefer to make the first milestone an evidence-gathering and prompt-surface reduction pass, then use that to decide whether training is the right next step or whether a lighter architecture change gets us most of the benefit. Happy to expand on any of those areas if useful; I’m mostly thinking in terms of keeping this measurable and avoiding locking the project into the wrong fix too early. Best regards, |
Beta Was this translation helpful? Give feedback.
-
|
@pewdiepie-archdaemon look at the proposal to re-license as AGPL, its better at protecting against big tech abuses & ensuring the source code remains free. Right now with a MIT license, OpenAI could take your projects code & implement these features into their closed source program despite it being the open source work of yourself & your community. Based on what how you've spoke about these issues in your videos, I imagine once you look into this you'll become a full on copyleft crusader & promoting those licenses. Its such a cool legal concept that changes must remain open. Its a very powerful too to fight back against big tech & the closing off of the internet. Also yes big models, small models, all good, very yes yes |
Beta Was this translation helpful? Give feedback.
-
|
This makes sense to me too. Since #2750 is already tracking the prompt-bloat side, I’m happy to help with the measurement-first pass rather than jumping straight into prompt rewrites. A focused first contribution I could take: diagnostics that report the assembled agent prompt/token breakdown by section/tool surface, with no behavior or wording changes. That would give the slimming/eval work a baseline and make later PRs easier to review. If you already have a preferred shape for that output, I can follow it; otherwise I can propose something small and non-invasive on #2750 before opening a PR. |
Beta Was this translation helpful? Give feedback.
-
|
I want to help. I've been using Oddyseus on a custom rig:
and running different tests with different agents, particularly testing it to see if it can run the workspace using my Obsidian vault, create files and manage my calendar autonomously. I've also been documenting all my findings on a personal markdown file. I'm new to self-hosting LLM models but I'd gladly gather samples from my particular use case as long as I know the proper way to share them, I don't want to spam the forums or throw useless data at you. |
Beta Was this translation helpful? Give feedback.
-
|
@pewdiepie-archdaemon Instead of manual logs, we should introduce a 'Trace Collector' dev-tool built into Odysseus. It would allow users to flag successful agent interactions as dataset-ready with ease. This sets a low barrier to contribution and keeps the dataset quality consistent. — Abdelrahman |
Beta Was this translation helpful? Give feedback.
-
|
I think Odysseus should seriously test Ouro / Ouro-RLTT as a small-model backend. Ouro is a looped language model, which means it gets extra effective reasoning depth by reusing latent computation over multiple internal steps before producing tokens. Parameter count undersells it badly. Think of it like looking at a puzzle, Claude, or others like it, will look at it once and try to solve it, while Ouro will look at it once; do a part of it, look at it again and do another part and so on.. The original Ouro paper reports that the 1.4B and 2.6B models match up to 12B-class SOTA dense models across a range of benchmarks. That is the actual reason I think it belongs in this discussion: it is not merely a compressed small model, it is a different scaling direction. Dense small models mostly trade capability away to fit local hardware. Looped models spend the parameter budget differently: less static mass, more iterative latent computation. Princetons RLTT version strengthens that further. Standard RL methods like GRPO reward only the final output/final latent state, which is a bad fit for looped models because the reasoning is distributed across the internal trajectory. RLTT instaed rewards the latent thought trajectory itself. Ouro RLTT improves substantially over GRPO on hard math/reasoning benchmarks. I can share a Princeton/Ouro-RLTT checkpoint for testing if people are interested. I would still separate two things: I am not claiming Ouro-RLTT automatically beats Qwen/Coder models at tool calling, file editing, or strict agent execution. Those are separate skills but for our niche, looped models are far more interesting than ordinary dense models at the same parameter count. Ouro-2.6B is closer to a 12B-class reasoning model in a 2.6B body, and the RLTT version pushes that advantage further beyond. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Hey everyone,
Right now Odysseus only really works well with larger models. The agent system prompt has grown to around 10K tokens at this point — that's the tool surface, the per-tool guidance, and the safety/style notes all baked in. Smaller models with shorter context windows or weaker instruction-following struggle to keep all of that loaded and do real work on top of it.
I'd like to fix that by training a finetune specifically for Odysseus. The goal: a model in the 1B–14B range that can handle the same tasks a larger general-purpose model would — chat, agent tool use, manage_notes/calendar/tasks/memory, deep research, document workflows, cookbook decisions. Self-hosted users on smaller hardware shouldn't be locked out of the agent path just because the prompt got too heavy.
A nice side benefit: it would also solve the "which model should I actually run?" question a lot of people have when setting up. "Run Odysseus-7B" is a much clearer answer than "pick a 7B that's good at tool use AND instruction-following AND has long context AND has been trained on the right kinds of data AND…"
Rough first plan:
Where I'd love help — sample gathering, ideally split per task so people can focus on what they know best. Real traces from your own use (anonymized, opt-in) or synthetic samples both work. Reply with the area you'd want to take on:
If you take a slice, gather as much as you can in that lane. We can sort out the format/quality bar in a follow-up once people self-assign.
The biggest open question for the discussion: is this even necessary?
It's possible the right answer isn't a finetune at all — maybe it's slimming the system prompt, or loading per-tool prompts just-in-time, or better Cookbook recommendations on which existing models work well. Genuinely open to "you don't need to train anything" if that's where the discussion lands.
— Felix
Beta Was this translation helpful? Give feedback.
All reactions