Ornith-1.0-35B: a 35B-A3B agentic-coding fine-tune that edges the base Qwen MoE at real coding, full 262K on 2× 3090 (🧪 experimental) #480
Replies: 10 comments 16 replies
-
I'm going to run the FP8 on DeepSWE when they release it.
-
I wonder how the smaller quants work on a single RTX 3090. I'm guessing there's not much room for context, which is OK for smaller bug fixes.
-
This is a useful comparison, especially because the difference shows up more on real coding than on the general 8-pack. The aider-polyglot edge is modest, but it is exactly the kind of signal I would want to validate with more workflow-shaped runs rather than only static benchmarks. For an agentic-coding fine-tune, I would probably compare Ornith vs base Qwen on:

The single-card question from @tomByrer is also interesting. A smaller quant on one 3090 may be less attractive for huge context, but still useful for scoped bug fixing if the workflow keeps retrieval/context tight. For coding models, the key metric may not be "which model scores higher globally," but which one produces an accepted change with fewer retries, lower latency, and less wasted context.

I'd be interested to see the same benchmark packs compared against a managed OpenAI-compatible endpoint too, because that would make the local vs cloud tradeoff easier to reason about: local control and privacy vs managed latency, reliability, scaling, and cost per completed workflow.
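To make "cost per completed workflow" concrete, here is a rough sketch of how it could be tallied. Everything in it is an assumption for illustration: the `WorkflowRun` record, its field names, and the choice to charge failed runs to the accepted ones are mine, not something from the benchmark packs in this thread.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    """One agent attempt at a task (hypothetical record layout)."""
    accepted: bool        # did tests or a reviewer accept the final change?
    retries: int          # how many times the agent had to retry
    latency_s: float      # wall-clock time for the whole attempt
    context_tokens: int   # tokens consumed across the attempt


def summarize(runs):
    """Spread the cost of ALL runs over the accepted ones only."""
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        # Nothing was accepted, so per-accept costs are undefined.
        return {"accept_rate": 0.0, "retries_per_accept": None,
                "latency_s_per_accept": None, "tokens_per_accept": None}
    return {
        "accept_rate": accepted / len(runs),
        "retries_per_accept": sum(r.retries for r in runs) / accepted,
        "latency_s_per_accept": sum(r.latency_s for r in runs) / accepted,
        "tokens_per_accept": sum(r.context_tokens for r in runs) / accepted,
    }


example = summarize([
    WorkflowRun(accepted=True, retries=1, latency_s=60.0, context_tokens=1000),
    WorkflowRun(accepted=False, retries=3, latency_s=120.0, context_tokens=3000),
])
```

Dividing by accepted runs rather than all runs is deliberate: a model that fails fast and often looks cheap per attempt but expensive per accepted change, which is the number that matters when comparing local against a managed endpoint.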
-
First tests look promising! Not a single loop with a huge codebase!
-
While this model is better than base 3.6 35B and even 27B at tool calling and building things, it lacks the knowledge to fix issues, including those in its own builds. That is my conclusion after a few hours of testing. Has anyone else tested it?
-
After intensive testing, I can say with confidence that Ornith-1.0-35B is sadly another flop.
-
I used Ornith on a real low-level embedded-code audit task and then had the outputs judged by my GPT-5.5 x-high / Codex high-effort session. The codebase itself is not the point here; I used it only as a hard test case because it has datasheets, Zephyr/NCS reference code, register maps, hardware-specific traps, and enough complexity to expose hallucinations.

Setup:

For comparison, I also looked at outputs from other models/tools:

Ranking from the judge pass

What Ornith did well

Ornith was not useless or shallow. It did find real issues, including some that required deeper understanding than just grepping the code.

The first run was sharper in places. It found some real security/crypto-style problems and pointed at some real hardware-abstraction risks. It showed that the model can reason about low-level embedded code and not just summarize files.

The second run was better overall. It was more cautious, better structured, and less likely to present every suspicion as a guaranteed critical bug. As a practical engineering aid, the second run was the more useful one.

My short read:

What Ornith did wrong

The main problem was still hallucinated certainty. In the first run, it made at least one major register-map claim that was simply wrong when checked against the datasheet. It also diagnosed one real issue but put it in the wrong branch/context. That is the dangerous pattern: the model can be "near" the truth while still giving implementation guidance that would break low-level code if followed blindly.

The second run reduced this problem but did not eliminate it. Some issues were stale, overstated, or needed more precise reference checking before implementation. So I would not let Ornith directly patch MCU register code without a judge pass. It is good at surfacing suspects; it is not yet reliable enough to be the final authority.

How it compared with the others

Owl_alpha was the best overall in this test. It produced the highest signal-to-noise ratio and found the most confirmed true issues. Again, I only included it because the free mode was available; this was not an "open model only" comparison.

Qwen35B was useful for roadmap-style thinking and test planning. It was less useful as a strict bug audit because it mixed current issues, historical issues, and design opinions.

Qwen27B int4 autoround was the noisiest. It had some real ideas, but it also hallucinated several low-level hardware facts with high confidence. That made it unsafe to implement directly.

Conclusion about Ornith

Ornith looks promising, especially on the second run. It is clearly capable of finding real issues in a hard codebase with datasheets and reference implementations. But the important caveat is that it still needs supervision. It should be treated as a strong reviewer or bug-scout, not as an autonomous final judge for low-level embedded work.

My final take:
-
Guys, after spending some days with Ornith 35B, I would say this model is better than Qwen 27B!
-
Adding a field report from my dual-3090 Pi setup.

Short version: Ornith-1.0-35B Q8_0 is now stable for me in plain mainline llama.cpp, and it stopped the failure mode where my local setup would eventually collapse into printing an endless stream of slashes.

This is not a clean benchmark suite yet, just a real workflow report while I keep hammering it with work.

What was failing for me

With the default Club 3090 Ornith path on my machine, the model was usable at first but would sometimes loop after longer agent sessions. The obvious failure mode was an infinite slash stream. I saw it more often after long Pi sessions or when subagents got involved.

I initially thought this might be SGLang, tokenizer, or GGUF handling, so I tried to get SGLang working too. SGLang served the model and reported 262K context, but the GGUF path was not reliable for me. The tokenizer was not the root cause either; using the GGUF-derived/Qwen-compatible tokenizer did not fix the bad generation there.

What finally stabilized it for me was going back to llama.cpp and making the serving + Pi settings stricter.

Stable llama.cpp settings

This is the command shape I am using now. It is plain mainline llama.cpp in Docker:

docker run -d --gpus all \
--name llama-cpp-ornith-35b-q8 \
--restart unless-stopped \
--shm-size 32g \
-p 8090:8000 \
-v "$MODEL_DIR:/models:ro" \
ghcr.io/ggml-org/llama.cpp:server-cuda \
--host 0.0.0.0 \
--port 8000 \
-m /models/ornith-1.0-35b-gguf-q8/ornith-1.0-35b-Q8_0.gguf \
-c 262144 \
-n 16384 \
-b 2048 \
-ub 256 \
-ngl 99 \
-ts 0.55,0.45 \
-fa on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 \
--jinja \
--reasoning auto \
--reasoning-format deepseek \
--reasoning-budget 4096 \
--reasoning-preserve \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--repeat-penalty 1.08 \
--repeat-last-n 4096 \
--presence-penalty 0.0 \
--frequency-penalty 0.0

The important differences from what I had before:

I have not needed DRY sampling yet, but I left hooks for experiments like:

--dry-multiplier <value> --dry-allowed-length 2 --dry-penalty-last-n 4096

So far, with the settings above, the slash loop has not returned.

Pi fix: force local-only subagents

There was a second issue that was easy to miss: Pi itself was not always staying local. In one session, the main model was definitely local Ornith, but Ornith generated a subagent call like this:

{
"subagent_type": "general-purpose",
"model": "sonnet",
"run_in_background": true
}

The Pi subagents plugin treats that "model" value as a real model request, which is how a session could leave the local model. The fix was to make Pi local-only and scope subagents:
{
"defaultProvider": "local-11-ornith-q8",
"defaultModel": "ornith-1.0-35b",
"enabledModels": [
"local-11-ornith-q8/ornith-1.0-35b"
]
}
{
"scopeModels": true
}

Pi model entry:

{
"local-11-ornith-q8": {
"baseUrl": "http://127.0.0.1:8090/v1",
"api": "openai-completions",
"apiKey": "llama.cpp",
"compat": {
"supportsDeveloperRole": false,
"supportsReasoningEffort": false,
"maxTokensField": "max_tokens",
"thinkingFormat": "deepseek"
},
"models": [
{
"id": "ornith-1.0-35b",
"name": "Ornith-1.0-35B Q8_0 (llama.cpp, 262K, safer)",
"contextWindow": 262144,
"maxTokens": 16384,
"reasoning": true,
"input": ["text"],
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}

I also removed remote providers/credentials from Pi on this machine:
After that, Pi had only the local Ornith model available.

Real workflow results so far

The most useful test for me has been what I call my modem/router admin dashboard test. This is my own modem/router, and I give the agent the local IP plus the admin credentials I already own. It is not breaking a password or bypassing security; it is normal authenticated access.

The task is to use Python to log in, inspect the available UI/API data, write scripts to collect status/config information, and build custom dashboards/summary tooling around that data. I am intentionally not posting the real password here, but the test does involve giving the model the modem/router IP and password inside the private local Pi session.

Qwen3.6-27B could barely pass that test, and it took roughly an hour. Ornith-1.0-35B Q8_0 with the llama.cpp settings above completed it in under 30 minutes and produced a more usable script/workflow.

I am also getting better results than Qwen3.6-27B on game-writing tasks. Examples: Mario-3D-style prototypes, Counter-Strike-style prototypes, and similar larger interactive scenes. Ornith tends to produce more complete glue code and better follow-through across files. I would currently rate it above Qwen3.6-27B for this kind of agentic/game coding, though I am still testing and giving it heavier workloads.

My current read

For my rig, the model itself looks good. The unstable part was the serving/runtime envelope:

I will keep testing with heavier work and post more results when I have them.
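For anyone who wants a client-side safety net against the slash-stream failure described above, here is a minimal heuristic sketch. It is not part of llama.cpp or Pi; the function name `looks_degenerate` and the thresholds (`tail_chars`, `max_period`) are assumptions picked for illustration and would need tuning on real output.

```python
def looks_degenerate(text, tail_chars=400, max_period=8):
    """Return True if the last `tail_chars` characters of `text` are a
    single short unit (1..max_period chars) repeated end to end."""
    tail = text[-tail_chars:]
    if len(tail) < tail_chars:
        return False  # not enough output yet to judge
    for period in range(1, max_period + 1):
        unit = tail[:period]
        repeats = -(-len(tail) // period)  # ceiling division
        # Rebuild the tail from the candidate unit and compare exactly.
        if (unit * repeats)[:len(tail)] == tail:
            return True
    return False


stuck = looks_degenerate("def main():\n    pass\n" + "/" * 500)
healthy = looks_degenerate("".join(str(i) for i in range(300)))
```

A streaming client could call this on the accumulated output every few hundred tokens and cancel the request when it returns True, instead of waiting for the `-n 16384` cap. It only catches exact short-period repetition, so it complements the repeat-penalty settings rather than replacing them.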
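The local-only idea from the Pi fix can also be expressed as a tiny allow-list check. This is a sketch of the concept only: `force_local_model` is a hypothetical helper, not the Pi subagents plugin API, and I am assuming the subagent call arrives as a plain dict shaped like the JSON the model generated.

```python
def force_local_model(tool_call, enabled_models, default_model):
    """Return a copy of `tool_call` whose "model" is guaranteed to be
    one of `enabled_models`; anything else is replaced by `default_model`."""
    call = dict(tool_call)  # shallow copy so the caller's dict is untouched
    if call.get("model") not in enabled_models:
        call["model"] = default_model
    return call


LOCAL = "local-11-ornith-q8/ornith-1.0-35b"
generated = {
    "subagent_type": "general-purpose",
    "model": "sonnet",
    "run_in_background": True,
}
scoped = force_local_model(generated, {LOCAL}, LOCAL)
```

The point of the design is that the model's own output is treated as untrusted input: whatever model name it invents, the call can only resolve to something on the enabled list.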
-
edit:
-
🧪 Experimental: a DeepReinforce agentic-coding RL fine-tune of Qwen3.6-35B-A3B. It ties the base Qwen MoE on general capability (8-pack 105 vs 110, within noise) but edges it on real coding (aider-polyglot-30 15/30 vs the base's 12-13), running the full 262K on dual cards at MoE speed. Posted as the coding-leaning 35B-A3B option. See "What'd help".
We just wired up ik-llama/ornith35b-dual: DeepReinforce's Ornith-1.0-35B, a 35B-A3B Qwen3-Next MoE (qwen35moe: 8 full-attention + 32 GDN/DeltaNet layers, ~3B active) agentic-coding RL fine-tune, served as the official Q8_0 GGUF. All credit for the model goes to DeepReinforce; ik_llama.cpp (ikawrakow) for the engine + the drafter-free ngram self-spec (see Credits).

The headline: ties our production qwen3.6-35b-a3b on the 8-pack (105 vs 110) but beats it on aider-polyglot-30 (15/30 vs ~13) at ~109 tok/s decode on dual 3090, full 262K.
What it's for: coding-leaning, not a general upgrade
We already ship qwen3.6-35b-a3b (byteshape 110/150, fp8 dual production), and Ornith ties it on general capability (8-pack 105, within the ±5-7 noise band; thinking-ON adds nothing: 105 = 105). Where the agentic-coding RL shows up is real coding: aider-polyglot-30 15/30 vs the base's 12-13, corroborated by a perfect bugfind 15/15. So reach for Ornith-35B when your workload is coding / debugging; for general use, the base is just as good. (The aider edge is modest, +2 on a 30-exercise, n=1 run, so treat it as a coding lean, not a landslide. Thinking-ON aider also lands 15/30, so reasoning adds nothing to coding either, consistent with the 8-pack.)
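To show why +2 on 30 exercises is only a lean, here is a small sketch that puts a 95% Wilson score interval around each pass rate. The choice of the Wilson interval and the use of 13/30 for the base are illustrative assumptions; this is not how the numbers above were produced.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


ornith_lo, ornith_hi = wilson_interval(15, 30)   # Ornith on aider-polyglot-30
base_lo, base_hi = wilson_interval(13, 30)       # base Qwen, upper end of 12-13

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = ornith_lo <= base_hi and base_lo <= ornith_hi
```

With only 30 exercises each interval spans roughly a third of the scale, so the two overlap heavily. More exercises or repeated runs would be needed before reading the gap as more than a lean.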
Results Card: 2× RTX 3090, ik_llama.cpp, temp 0.6
Serving
ik-llama/ornith35b-dual · Q8_0
n_max 32)
verify-stress 8/8 (NIAH up to 240K) · soak-continuous PASS (0 growth, 0/100 silent-empty, p50 decode 112, 100.7% retention).
Quality: 8-pack /150 (benchlocal-cli --full) + aider-polyglot-30
Takeaways
ngram-map-k accepts (~0.59) where EAGLE/DFlash are KV-rollback-blocked on Qwen3-Next.
n_max=32 + a balanced -ts to fit the verify buffer.
Requirements
--spec-type self-spec build); GGUF, so no transformers/vLLM.
temperature 0.6, top_p 0.95, top_k 20.
Getting it / Run it
Serves an OpenAI-compatible API on :8071 (model ornith-1.0-35b). ngram is opt-in (SPEC_TYPE_ARGS='--spec-type ngram-map-k:n_max=32,...'); Q8 leaves tight 262K headroom, so lower n_max / ctx if you enable it.
What'd help
Credits
ngram-map-k self-spec.