🧠 Ornith-1.0-35B — a 35B-A3B agentic-coding fine-tune that edges the base Qwen MoE at real coding, full 262K on 2× 3090 (🧪 experimental) #480

noonghunna · 2026-06-26T03:55:31Z

noonghunna
Jun 26, 2026
Maintainer

🧪 Experimental — a DeepReinforce agentic-coding RL fine-tune of Qwen3.6-35B-A3B. It ties the base Qwen MoE on general capability (8-pack 105 vs 110, within noise) but edges it on real coding (aider-polyglot-30 15/30 vs the base's 12–13), running the full 262K on dual cards at MoE speed. Posted as the coding-leaning 35B-A3B option. See "What'd help".

We just wired up ik-llama/ornith35b-dual — DeepReinforce's Ornith-1.0-35B, a 35B-A3B Qwen3-Next MoE (qwen35moe: 8 full-attention + 32 GDN/DeltaNet layers, ~3B active) agentic-coding RL fine-tune, served as the official Q8_0 GGUF. All credit for the model goes to DeepReinforce; ik_llama.cpp (ikawrakow) for the engine + the drafter-free ngram self-spec — see Credits.

The headline: ties our production qwen3.6-35b-a3b on the 8-pack (105 vs 110) but beats it on aider-polyglot-30 — 15/30 vs ~13 — at ~109 tok/s decode on dual 3090, full 262K.

What it's for — coding-leaning, not a general upgrade

We already ship qwen3.6-35b-a3b (byteshape 110/150, fp8 dual production), and Ornith ties it on general capability (8-pack 105, within the ±5–7 noise band; thinking-ON adds nothing — 105 = 105). Where the agentic-coding RL shows up is real coding: aider-polyglot-30 15/30 vs the base's 12–13, corroborated by a perfect bugfind 15/15. So reach for Ornith-35B when your workload is coding / debugging; for general use, the base is just as good. (The aider edge is modest — +2 on a 30-exercise, n=1 run — so treat it as a coding lean, not a landslide. Thinking-ON aider also lands 15/30, so reasoning adds nothing to coding either, consistent with the 8-pack.)

🎴 Results Card — 2× RTX 3090, ik_llama.cpp, temp 0.6

① Serving

Config	Spec-dec	KV / ctx	decode TPS (narr / code)	VRAM
`ik-llama/ornith35b-dual` · Q8_0	ngram, drafter-free (`n_max 32`)	q8_0 / 262K	108.8 / 108.6 (TTFT 191/214 ms)	~22 GB/card

verify-stress 8/8 (NIAH→240K) · soak-continuous PASS (0 growth, 0/100 silent-empty, p50 decode 112, 100.7% retention).

② Quality — 8-pack /150 (`benchlocal-cli --full`) + aider-polyglot-30

Pack	Ornith-35B (OFF / ON)	qwen3.6-35b-a3b (base, OFF)
toolcall-15	13 / 12	15
instructfollow-15	12 / 15	14
structoutput-15	14 / 14	14
dataextract-15	10 / 8	13
reasonmath-15	13 / 12	13
bugfind-15	15 / 15	13
hermesagent-20	12 / 11	11
cli-40	16 / 18	17
TOTAL 8-pack	105 / 105	110
aider-polyglot-30 (real coding)	15/30 OFF · 15/30 ON	13/30 (fp8 dual) · 12/30 (apex)

③ Takeaways

Edges the base at real coding — aider-30 15/30 vs ~13, with a perfect bugfind 15/15 (100%). That's the agentic-coding RL doing exactly what it should.
General-capability 8-pack is a wash — 105 ≈ 110, and reasoning-ON buys nothing (105 = 105). The fine-tune's value is coding, not breadth.
ngram works on the DeltaNet hybrid — no MTP head in the GGUF, yet ik_llama's drafter-free ngram-map-k accepts (~0.59) where EAGLE/DFlash are KV-rollback-blocked on Qwen3-Next.
262K KV is cheap (~2.7 GB) — the hybrid MoE holds attention KV on only ~10/40 layers. Q8 weights nearly fill dual, so ngram needs n_max=32 + a balanced -ts to fit the verify buffer.
Fast for a 35B — ~109 tok/s on dual because only ~3B params are active per token.

Requirements

2× 24 GB GPU (Q8_0 ≈ 35 GB — dual-card only).
ik_llama.cpp (the --spec-type self-spec build) — GGUF, so no transformers/vLLM.
Sampling (per the card): temperature 0.6, top_p 0.95, top_k 20.

Getting it / Run it

git clone <repo> && cd club-3090
WEIGHTS=deepreinforce-q8 ./scripts/setup.sh ornith-1.0-35b      # fetch the official Q8_0 GGUF
./scripts/switch.sh --force ik-llama/ornith35b-dual            # --force: 🧪 experimental

Serves an OpenAI-compatible API on :8071 (model ornith-1.0-35b). ngram is opt-in (SPEC_TYPE_ARGS='--spec-type ngram-map-k:n_max=32,...') — Q8 leaves tight 262K headroom, so lower n_max / ctx if you enable it.

What'd help

More aider / real-coding cross-rig data — the axis where Ornith-35B separates from the base; a second run pins down how real the +2 is.
Does the agentic-coding fine-tune help your coding workflow vs plain qwen3.6-35b-a3b? That's the whole question.

Credits

Model: DeepReinforce — Ornith-1.0-35B agentic-coding RL fine-tune + official Q8_0 GGUF.
Base architecture: Qwen3.6-35B-A3B (Qwen3-Next MoE).
Engine: ik_llama.cpp (ikawrakow) — drafter-free ngram-map-k self-spec.

Whamp · 2026-06-26T06:04:25Z

Whamp
Jun 26, 2026

I'm going to run the FP8 on DeepSWE when they release it.

0 replies

tomByrer · 2026-06-26T07:08:09Z

tomByrer
Jun 26, 2026

I wonder how the smaller quants work on a single RTX3090? I'm guessing not much room for context? Which is OK for smaller bug fixing...

4 replies

noonghunna Jun 27, 2026
Maintainer Author

Good question. The 35B here is Q8 (~35 GB) → dual-card only; it won't fit one 3090 alongside any KV. But its single-card sibling Ornith-1.0-9B (ik-llama/ornith9b-single, Q4_K_M ~5.5 GB) actually flips your worry around — it's small enough that context isn't the constraint: we ran it at the full 262K on one card with ngram spec-dec. So for scoped single-card bug-fixing you get plenty of context headroom; the 9B is the pick there, and the 35B-dual is for when you want the bigger model's coding edge at 262K. (A sub-Q8 35B quant could open a single-card 35B lane, but none's on disk yet — and it'd trade the fidelity that gives Ornith its edge.)

tomByrer Jun 28, 2026

Someone made a 3bit & 4_k_xs quants. With/without MTP:
https://huggingface.co/LordNeel/Ornith-1.0-35B-GGUF-llamacpp-tp1

Looking around Reddit, seems 3/5 or more think (at least the bigger quants) are better than Qwen's base 35B, though some say not so good tool-calling. Some also say Qwen 27B is still better than both.

noonghunna Jun 28, 2026
Maintainer Author

Tom had an interesting share on this.

https://x.com/no_stp_on_snek/status/2070474656788910183?s=20

tomByrer Jun 28, 2026

https://x.com/no_stp_on_snek/status/2070474656788910183?s=20

Thanks! I wonder if the "doing legitimate requests" can be improved by a Heritic version?
I'll have to ask what quants.

richardchen874-sys · 2026-06-27T04:17:32Z

richardchen874-sys
Jun 27, 2026

This is a useful comparison, especially because the difference shows up more on real coding than on the general 8-pack.

The aider-polyglot edge is modest, but it is exactly the kind of signal I would want to validate with more workflow-shaped runs rather than only static benchmarks.

For an agentic-coding fine-tune, I would probably compare Ornith vs base Qwen on:

cost / power per completed coding task
accepted patch rate
retry count before a usable fix
tool-call reliability
structured-output stability
latency during multi-step coding loops
context retention over longer sessions
whether the model needs thinking mode at all
whether bugfixing improves more than greenfield coding

The single-card question from @tomByrer is also interesting. A smaller quant on one 3090 may be less attractive for huge context, but still useful for scoped bug fixing if the workflow keeps retrieval/context tight.

For coding models, the key metric may not be “which model scores higher globally,” but which one produces an accepted change with fewer retries, lower latency, and less wasted context.

I’d be interested to see the same benchmark packs compared against a managed OpenAI-compatible endpoint too, because that would make the local vs cloud tradeoff easier to reason about: local control and privacy vs managed latency, reliability, scaling, and cost per completed workflow.

3 replies

noonghunna Jun 27, 2026
Maintainer Author

Spot on for a coding model — "accepted change with fewer retries, lower latency, less wasted context" is a better target than a global score, and it's exactly why the aider edge (real edits) mattered more here than the 8-pack. I just put the full run recipe for the local-vs-managed comparison on the #417 thread — the 8-pack + aider-polyglot against any OpenAI-compatible endpoint, with rate-limit pacing — so let's keep that head-to-head in one place. Run Ornith (or base qwen3.6-27b) through it and the numbers fold straight in. 🙏

richardchen874-sys Jun 30, 2026

That organization makes sense.

I agree this thread should stay focused on Ornith itself: whether the agentic-coding fine-tune actually improves real coding behavior versus the base Qwen MoE, especially on aider-polyglot and bug-finding style tasks.

The broader local-vs-managed endpoint comparison fits much better in #417, since that thread already has the cloud-run recipe, pacing / retry controls, spend guard, and JSON-based comparison flow.

I’ve already dropped the first managed endpoint deterministic-pack results there, and I’ll keep any follow-up work in that thread as well — especially reasoning-state checks and the heavier workflow packs like hermesagent-20, cli-40, and aider-polyglot-30.

For this Ornith thread, I’ll treat the key signal as coding-specific: accepted changes, retry behavior, latency, context waste, and whether the fine-tune improves real edits more than it improves general benchmark score.

tomByrer Jul 1, 2026

YouTuber tested Ornith vs Qwen 3.6 35b Q4_K_M on 16GB. Found Ornith faster, & just as good, maybe slightly better than Qwen. Exception was 'Human Eval' (5yo python questions) where Ornith ran out of thinking budget on some questions, but % of correct answers matched Qwen.

lolren · 2026-06-27T13:47:48Z

lolren
Jun 27, 2026

First tests looks promising! Not a single loop with a huge codebase!
Handles tasks perfectly!
Also, passes the 3D Mario challenge better than 3.6-35B-A3, and similar or better than 27B !

0 replies

lolren · 2026-06-27T20:26:08Z

lolren
Jun 27, 2026

While this app is better than base 3.6 35B and even 27B at tool calling, and building stuff, it lacks knowledge to fix issues, like it's own builds as well. this is my conclusion after a few hours of testing. Anyone tested as well ?

1 reply

tomByrer Jun 28, 2026

A redditor had good experience with his (single) math knowledge test. Others say not so good with tool calling, but perhaps your harness is better?

lolren · 2026-06-28T19:03:01Z

lolren
Jun 28, 2026

After intensive testing, I can say with confidence that Ornith-1.0-35B is sadly another flop.
Good in tool calling, doesn't loop. Terrible at getting complex stuff right, or fixing stuff.
Unless someone has better experience, which I would love to hear, I would say that Qwen base models are better than any finetune I've tried!

3 replies

tomByrer Jun 29, 2026

doesn't loop

IMHO that is an advantage; the harness should 'loop' (really use a Finite State Machine to loop), not the AI.
The AI may loop the wrong direction for hours if left alone without a harness keeping it in rails.

These are local models, not frontier with 20 subagents & 1M of context.

Terrible at getting complex stuff right

That is a bummer; might need another model to plan.

bad at fixing stuff

Is it bad at making & running tests?

tomByrer Jun 30, 2026

@lolren try this: https://www.reddit.com/r/LocalLLaMA/comments/1uj9viw/been_running_qwen3627b_through_a_3critic_harness/

lolren Jun 30, 2026

Thank you @tomByrer and sorry for the late reply. I've been very busy lately.
I've been putting lately Ornith 1.0-35B trough some serious tests, and the results are not bad.
The tests involves close to 400k lines of code , to analyse every function, compared to the datasheet and another open source implementation, and write a report at the end.
The results are not bad, no loops, sprung multiple agents.. From time to time, I get a issue with llamacpp, which just outputs gibberish. (/////////////////////// repeating forever). But this is not a model issue.
When it works, the results are very good, I've ran the test multiple times and compare it with Qwen 27B , Owl Alpha (free on open router). Test ran for hours each time.
At tool calling, the model is better than 27B , but somehow 27B is smarter and picks stuff better than Ornith.

lolren · 2026-06-30T13:45:40Z

lolren
Jun 30, 2026

I used Ornith on a real low-level embedded-code audit task and then had the outputs judged by my GPT-5.5 x-high / Codex high-effort session. The codebase itself is not the point here; I used it only as a hard test case because it has datasheets, Zephyr/NCS reference code, register maps, hardware-specific traps, and enough complexity to expose hallucinations.

Setup:

Prompt size: 4 lines.
Ornith was launched from the LM control panel.
The available reference corpus was large: about 199,502 text/source files and 43,814,634 lines after excluding git metadata, binaries, PDFs, images, archives, build outputs and release artifacts.
The corpus included the target source, datasheet text, Zephyr mainline code and NCS/Nordic reference material.
I ran Ornith twice on the same style of task:
- Ornith first run: 436-line report.
- Ornith second run: 940-line report.

For comparison, I also looked at outputs from other models/tools:

Owl_alpha: 753-line report. This was included because its free mode was available; I am not claiming it is an open model.
Qwen27B int4 autoround: 1,007-line report.
Qwen-3.6-35B-Q8: three reports/plans, 766, 873 and 834 lines.

Ranking from the judge pass

Rank	Model/report	Judge result
1	Owl_alpha	Best overall signal. It found the most true issues with the least dangerous noise. Was included in the test because it's free!
2	Ornith second run	Strong second. More careful than the first run, fewer wild claims, better wording.
3	Ornith first run	Had some very good deep finds, but also more reckless hallucinations.
4	Qwen-3.6-35B-Q8 reports	Useful for planning and broad context, but less reliable as a direct bug-finding audit.
5	Qwen27B int4 autoround	Most noisy. Some good ideas, but too many high-confidence hallucinations.

What Ornith did well

Ornith was not useless or shallow. It did find real issues, including some that required deeper understanding than just grepping the code.

The first run was sharper in places. It found some real security/crypto-style problems and pointed at some real hardware-abstraction risks. It showed that the model can reason about low-level embedded code and not just summarize files.

The second run was better overall. It was more cautious, better structured, and less likely to present every suspicion as a guaranteed critical bug. As a practical engineering aid, the second run was the more useful one.

My short read:

First run: higher-risk, higher-variance. Some excellent finds, but more reckless.
Second run: less flashy, more useful. Better as an engineering checklist.

What Ornith did wrong

The main problem was still hallucinated certainty.

In the first run, it made at least one major register-map claim that was simply wrong when checked against the datasheet. It also diagnosed one real issue but put it in the wrong branch/context. That is the dangerous pattern: the model can be “near” the truth while still giving implementation guidance that would break low-level code if followed blindly.

The second run reduced this problem but did not eliminate it. Some issues were stale, overstated, or needed more precise reference checking before implementation.

So I would not let Ornith directly patch MCU register code without a judge pass. It is good at surfacing suspects; it is not yet reliable enough to be the final authority.

How it compared with the others

Owl_alpha was the best overall in this test. It produced the highest signal-to-noise ratio and found the most confirmed true issues. Again, I only included it because the free mode was available; this was not an “open model only” comparison.

Qwen35B was useful for roadmap-style thinking and test planning. It was less useful as a strict bug audit because it mixed current issues, historical issues, and design opinions.

Qwen27B int4 autoround was the noisiest. It had some real ideas, but it also hallucinated several low-level hardware facts with high confidence. That made it unsafe to implement directly.

Conclusion about Ornith

Ornith looks promising, especially on the second run. It is clearly capable of finding real issues in a hard codebase with datasheets and reference implementations.

But the important caveat is that it still needs supervision. It should be treated as a strong reviewer or bug-scout, not as an autonomous final judge for low-level embedded work.

My final take:

Ornith first run: useful but reckless.
Ornith second run: genuinely useful and much safer.
Best use case: generate a candidate bug list, then have a stronger verifier check each claim against source and datasheets.
Biggest weakness: confident hallucinations on hardware/register details.
Overall: good model, not magic, and noticeably better on the second pass.

1 reply

noonghunna Jun 30, 2026
Maintainer Author

Thank you for sharing your findings @lolren

lolren · 2026-07-01T07:22:40Z

lolren
Jul 1, 2026

Guys, after spending some days with Ornith 35B, I would say this model is better than Qwen 27B!
No tools calling issues, no loops, crazy fast on 2x3090!
Has anyone tested this model as his main local agent ?

1 reply

noonghunna Jul 1, 2026
Maintainer Author

wait till they release the 31b dense model. I reckon that will be a better comparison with 27b.
Ornith 35B is certainly better at agentic coding by the looks of the benchmark and community feedback i've seen so far.
As long as it doesn't get stuck in loops, could certainly be a daily driver.

lolren · 2026-07-01T09:21:50Z

lolren
Jul 1, 2026

Adding a field report from my dual-3090 Pi setup. Short version: Ornith-1.0-35B Q8_0 is now stable for me in plain mainline llama.cpp, and it stopped the failure mode where my local setup would eventually collapse into printing ////////// forever.

This is not a clean benchmark suite yet, just a real workflow report while I keep hammering it with work.

What was failing for me

With the default Club 3090 Ornith path on my machine, the model was usable at first but would sometimes loop after longer agent sessions. The obvious failure mode was an infinite slash stream:

///////////////////////////////////////////////
///////////////////////////////////////////////
...

I saw it more often after long Pi sessions or when subagents got involved. I initially thought this might be SGLang, tokenizer, or GGUF handling, so I tried to get SGLang working too. SGLang served the model and reported 262K context, but the GGUF path was not reliable for me. The tokenizer was not the root cause either; using the GGUF-derived/Qwen-compatible tokenizer did not fix the bad generation there.

What finally stabilized it for me was going back to llama.cpp and making the serving + Pi settings stricter.

Stable llama.cpp settings

This is the command shape I am using now. It is plain ghcr.io/ggml-org/llama.cpp:server-cuda, not ik_llama, and no ngram/spec-dec path:

docker run -d --gpus all \
  --name llama-cpp-ornith-35b-q8 \
  --restart unless-stopped \
  --shm-size 32g \
  -p 8090:8000 \
  -v "$MODEL_DIR:/models:ro" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  --host 0.0.0.0 \
  --port 8000 \
  -m /models/ornith-1.0-35b-gguf-q8/ornith-1.0-35b-Q8_0.gguf \
  -c 262144 \
  -n 16384 \
  -b 2048 \
  -ub 256 \
  -ngl 99 \
  -ts 0.55,0.45 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -np 1 \
  --jinja \
  --reasoning auto \
  --reasoning-format deepseek \
  --reasoning-budget 4096 \
  --reasoning-preserve \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --repeat-penalty 1.08 \
  --repeat-last-n 4096 \
  --presence-penalty 0.0 \
  --frequency-penalty 0.0

The important differences from what I had before:

-n 16384: do not let agent sessions generate forever.
--reasoning auto --reasoning-format deepseek: Pi sees reasoning_content cleanly.
--reasoning-budget 4096: keeps the reasoning channel bounded.
--reasoning-preserve and --chat-template-kwargs '{"preserve_thinking":true}': this seemed important for long sessions, especially after previous assistant reasoning/tool turns are in context.
--repeat-penalty 1.08 --repeat-last-n 4096: this is probably the most direct mitigation for the slash-loop attractor. The Club-style closest Ornith defaults I saw use repeat penalty 1.0; bumping it slightly over a large window made a visible difference here.
Full 262144 context stayed enabled.
KV is q8_0/q8_0.
Sampling is the card-style temp 0.6, top_p 0.95, top_k 20.

I have not needed DRY sampling yet, but I left hooks for experiments like:

--dry-multiplier <value> --dry-allowed-length 2 --dry-penalty-last-n 4096

So far, with the settings above, the ////////// runaway has not reproduced.

Pi fix: force local-only subagents

There was a second issue that was easy to miss: Pi itself was not always staying local.

In one session, the main model was definitely local Ornith:

provider: local-11-ornith-q8
model: ornith-1.0-35b

but Ornith generated Agent tool calls with:

{
  "subagent_type": "general-purpose",
  "model": "sonnet",
  "run_in_background": true
}

The Pi subagents plugin treats "sonnet" as a fuzzy model override, so if OpenRouter/Sonnet is available it can resolve remotely. That made my "local Ornith" run unexpectedly spin subagents on OpenRouter.

The fix was to make Pi local-only and scope subagents:

~/.pi/agent/settings.json:

{
  "defaultProvider": "local-11-ornith-q8",
  "defaultModel": "ornith-1.0-35b",
  "enabledModels": [
    "local-11-ornith-q8/ornith-1.0-35b"
  ]
}

~/.pi/agent/subagents.json:

{
  "scopeModels": true
}

Pi model entry:

{
  "local-11-ornith-q8": {
    "baseUrl": "http://127.0.0.1:8090/v1",
    "api": "openai-completions",
    "apiKey": "llama.cpp",
    "compat": {
      "supportsDeveloperRole": false,
      "supportsReasoningEffort": false,
      "maxTokensField": "max_tokens",
      "thinkingFormat": "deepseek"
    },
    "models": [
      {
        "id": "ornith-1.0-35b",
        "name": "Ornith-1.0-35B Q8_0 (llama.cpp, 262K, safer)",
        "contextWindow": 262144,
        "maxTokens": 16384,
        "reasoning": true,
        "input": ["text"],
        "cost": {
          "input": 0,
          "output": 0,
          "cacheRead": 0,
          "cacheWrite": 0
        }
      }
    ]
  }
}

I also removed remote providers/credentials from Pi on this machine:

removed openrouter, deepseek, and zai from ~/.pi/agent/models.json
cleared stored remote auth from ~/.pi/agent/auth.json
removed global OPENROUTER_API_KEY, OPENAI_API_KEY, OPENAI_BASE_URL, and similar remote exports from shell startup
added a system rule telling the agent not to set Agent.model; subagents should inherit the current local model

After that, pi --list-models --offline only showed my local providers, and a Pi smoke test hit local Ornith cleanly.

Real workflow results so far

The most useful test for me has been what I call my modem/router admin dashboard test. This is my own modem/router, and I give the agent the local IP plus the admin credentials I already own. It is not breaking a password or bypassing security; it is normal authenticated access. The task is to use Python to log in, inspect the available UI/API data, write scripts to collect status/config information, and build custom dashboards/summary tooling around that data. I am intentionally not posting the real password here, but the test does involve giving the model the modem/router IP and password inside the private local Pi session.

Qwen3.6-27B could barely pass that test, and it took roughly an hour. Ornith-1.0-35B Q8_0 with the llama.cpp settings above completed it in under 30 minutes and produced a more usable script/workflow.

I am also getting better results than Qwen3.6-27B on game-writing tasks. Examples: Mario-3D-style prototypes, Counter-Strike-style prototypes, and similar larger interactive scenes. Ornith tends to produce more complete glue code and better follow-through across files. I would currently rate it above Qwen3.6-27B for this kind of agentic/game coding, though I am still testing and giving it heavier workloads.

My current read

For my rig, the model itself looks good. The unstable part was the serving/runtime envelope:

SGLang GGUF was not usable for me yet.
Default/local Club 3090 Ornith settings could loop into ////////// on my setup.
Plain llama.cpp with bounded reasoning, preserved thinking, deepseek reasoning parsing, a 16K output cap, and a 4096-token repeat window is stable so far.
Pi needed a separate local-only/subagent-scope fix, otherwise the model could ask for sonnet and accidentally leave the local stack.

I will keep testing with heavier work and post more results when I have them.

2 replies

lolren Jul 1, 2026

Just tested my work Huawei Portable Router! it aced it in about 20 minutes after I've given the IP and password. Plain 35B always fails on this test.
There is something to this model!

lolren Jul 1, 2026

I've given Ornith 35B a task that is almost impossible, just to see it fail, and to check if it starts looping or stalls ..
It's at it for 2 hours, still trying, but no loops! the above settings work!

lolren · 2026-07-01T18:07:34Z

lolren
Jul 1, 2026

docker run -d --gpus all \ --name llama-cpp-ornith-35b-q8 \ --restart unless-stopped \ --shm-size 32g \ -p 8090:8000 \ -v "$MODEL_DIR:/models:ro" \ ghcr.io/ggml-org/llama.cpp:server-cuda \ --host 0.0.0.0 \ --port 8000 \ -m /models/ornith-1.0-35b-gguf-q8/ornith-1.0-35b-Q8_0.gguf \ -c 262144 \ -n 16384 \ -b 2048 \ -ub 256 \ -ngl 99 \ -ts 0.55,0.45 \ -fa on \ --cache-type-k f16 \ --cache-type-v f16 \ -np 1 \ --jinja \ --reasoning auto \ --reasoning-format deepseek \ --reasoning-budget 4096 \ --reasoning-preserve \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --repeat-penalty 1.08 \ --repeat-last-n 4096 \ --presence-penalty 0.0 \ --frequency-penalty 0.0

edit:
cache type f16 is better
after switching to f16 kv cache, its better at fixing stuff!

1 reply

noonghunna Jul 2, 2026
Maintainer Author

good going @lolren

i'm still tied up migrating composes to the new vllm, will give this a shot once i'm free from my battles.

🧠 Ornith-1.0-35B — a 35B-A3B agentic-coding fine-tune that edges the base Qwen MoE at real coding, full 262K on 2× 3090 (🧪 experimental) #480

Uh oh!

noonghunna Jun 26, 2026 Maintainer

What it's for — coding-leaning, not a general upgrade

🎴 Results Card — 2× RTX 3090, ik_llama.cpp, temp 0.6

① Serving

② Quality — 8-pack /150 (benchlocal-cli --full) + aider-polyglot-30

③ Takeaways

Requirements

Getting it / Run it

What'd help

Credits

Replies: 10 comments · 16 replies

Uh oh!

Uh oh!

Uh oh!

noonghunna Jun 27, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna Jun 28, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

noonghunna Jun 27, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Ranking from the judge pass

What Ornith did well

What Ornith did wrong

How it compared with the others

Conclusion about Ornith

Uh oh!

noonghunna Jun 30, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna Jul 1, 2026 Maintainer Author

Uh oh!

What was failing for me

Stable llama.cpp settings

Pi fix: force local-only subagents

Real workflow results so far

My current read

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

noonghunna Jul 2, 2026 Maintainer Author

noonghunna
Jun 26, 2026
Maintainer

② Quality — 8-pack /150 (`benchlocal-cli --full`) + aider-polyglot-30

Replies: 10 comments 16 replies

noonghunna Jun 27, 2026
Maintainer Author

noonghunna Jun 28, 2026
Maintainer Author

noonghunna Jun 27, 2026
Maintainer Author

noonghunna Jun 30, 2026
Maintainer Author

noonghunna Jul 1, 2026
Maintainer Author

noonghunna Jul 2, 2026
Maintainer Author