Tools for model merging, expert pruning, differential competence-map extraction, and GGUF quantization — built on top of `transformers`, `safetensors`, and `llama.cpp`. The kit started as a fork-flavored alternative to mergekit (the `omnimerge_v2` recipe) and grew to cover Gemma 4 MoE surgery and Qwen3.5 hybrid-attention frankenmerges.
Status: research code. Not packaged for pip yet. Scripts assume the directory layout described below; paths inside scripts may need editing for your environment.
| Path | Purpose |
|---|---|
| `omnimergekit.py` | The main merge script. Methods: `dare_ties`, `omnimerge_v2`, plus features `obim`, `darex`, `emr`, `fisher` (drop-and-rescale sketched just below). |
| `competence/` | Differential competence-map pipeline: extract per-source, per-task Fisher signal → combine across tasks → feed into `omnimergekit.py --fisher`. |
| `gemma4/` | Gemma 4 MoE surgery: expert drop, DERN-style redistribution, CD maps for contribution-aware quants, hybrid-expert assembly. |
| `quantization/` | `quantize_gguf.py` (multi-tier quants with imatrix), `convert_to_4bit.py`, `publish_model.py` (HF push with frontmatter). |
| `eval/` | Eval drivers: GPQA Diamond, LiveCodeBench (`lcb_llama_server.py`), HE/MBPP rescore-with-fence-strip helpers. |
| `recipes/` | End-to-end pipelines: 4B MicroCoder series, Gemma 4 109e/98e/120e/128e, 27B Omnimerge. |
| `pod/` | RunPod / Vast.ai helpers (setup, parallel run, retrieve, README publish). |
| `docs/` | Method docs, experiment journals, recipe deep-dives. |
| `experiments/` | Per-experiment notebooks and logs. |
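For orientation, here is a minimal sketch of the drop-and-rescale idea behind `dare_ties`-style methods, with a crude TIES-style sign election on top. This is the published DARE formulation, not this repo's implementation; tensor shapes and the election rule are illustrative (`density` plays the role of the `--density` flag):

```python
import torch

def dare_delta(theta_src: torch.Tensor, theta_base: torch.Tensor,
               density: float, generator=None) -> torch.Tensor:
    """DARE: randomly drop (1 - density) of the task vector, rescale the rest."""
    delta = theta_src - theta_base                       # task vector
    mask = torch.rand(delta.shape, generator=generator) < density
    return delta * mask / density                        # unbiased rescale

# Toy usage: two sources merged onto a base with a TIES-style sign consensus.
base = torch.zeros(4, 4)
srcs = [torch.randn(4, 4), torch.randn(4, 4)]
weights = [0.6, 0.4]
deltas = [w * dare_delta(s, base, density=0.53) for w, s in zip(weights, srcs)]
sign = torch.sign(sum(deltas))                           # elected sign per param
merged = base + sum(d * (torch.sign(d) == sign) for d in deltas)
```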
Differential competence-map pipeline, end to end:

```bash
# 1. Run HE / MBPP / etc. eval on each source model, save lm-eval samples_*.jsonl
# 2. Extract per-source Fisher signal, restricted to docs THAT source uniquely solved
python competence/competence_extract.py \
    --model $SRC1 --samples $EVAL/source1/samples_humaneval_*.jsonl \
    --task he --keep-doc-ids 1,4,7,11 \
    --output maps/source1__he.safetensors \
    --max-len 4096 --chunk-len 1280   # chunked grad-accum: full context on small VRAM
# 3. Combine per-task maps (per source) into a single competence map
python competence/competence_combine.py \
    --map "source1:humaneval:results.json:maps/source1__he.safetensors" \
    --map "source1:mbpp:results.json:maps/source1__mbpp.safetensors" \
    --raw-rate --signal weight_taylor \
    --output-dir maps/combined/
```
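The extracted signal is essentially a diagonal empirical Fisher: a running sum of squared gradients of the LM loss over exactly the docs that source uniquely solved. A minimal sketch of that accumulation, assuming an HF-style causal LM with labels in the batch (illustrative only; the real script adds `--chunk-len` chunking so long contexts fit in VRAM):

```python
import torch

def diagonal_fisher(model, doc_batches):
    """Diagonal empirical Fisher: running sum of squared per-doc gradients."""
    fisher = {n: torch.zeros_like(p, device="cpu")
              for n, p in model.named_parameters() if p.requires_grad}
    for batch in doc_batches:                 # only the uniquely-solved docs
        model.zero_grad(set_to_none=True)
        out = model(**batch)                  # HF causal LM: loss from labels
        out.loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach().float().cpu().pow(2)
    return fisher
```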
```bash
# 4. Merge with Fisher-aware omnimerge_v2
python omnimergekit.py \
    --base $BASE --source $SRC1 --source $SRC2 --source $SRC3 \
    --output merged/ \
    --method omnimerge_v2 --v2-features fisher,darex \
    --weights 0.35,0.40,0.25 --density 0.53 --darex-q 0.85 \
    --fisher "maps/combined/source1.safetensors,maps/combined/source2.safetensors,maps/combined/source3.safetensors" \
    --pr682-turbo --skip-patterns "model.visual,mtp.layers" --device cuda
```

See docs/METHOD_omnimerge_v2.md for the math (OBIM-lite + DAREx-q + EMR election + Fisher).
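Conceptually, the `--fisher` maps turn the merge into a per-parameter precision-weighted average: parameters a source demonstrably relied on for its unique wins get more say. A minimal sketch of Fisher-weighted averaging (illustrative; `omnimerge_v2` layers this with OBIM-lite, DAREx-q, and EMR as per the method doc):

```python
import torch

def fisher_merge(thetas: list[torch.Tensor], fishers: list[torch.Tensor],
                 weights: list[float], eps: float = 1e-8) -> torch.Tensor:
    """Per-parameter average, with each source scaled by weight * Fisher."""
    num = sum(w * f * t for w, f, t in zip(weights, fishers, thetas))
    den = sum(w * f for w, f in zip(weights, fishers)) + eps
    return num / den
```

With flat Fisher maps this degrades gracefully to the plain `--weights` average.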
Example: prune Gemma 4 from 128 to 109 experts, then quantize with contribution-aware maps:

```bash
# Drop the weakest 19 experts per layer (128e → 109e), preserving routing semantics
python gemma4/expert_pruning/expert_drop.py \
    --model gemma-4-26B-A4B-it --n-keep 109 \
    --analysis gemma4/neuron_analysis/expert_neuron_v4.json \
    --output gemma-4-A4B-109e/ --recalibrate-router 2000

# Quantize with contribution-aware (CD) maps
python gemma4/cd_maps/generate_cd_maps_from_contribution.py \
    --analysis gemma4/neuron_analysis/expert_neuron_v4.json \
    --output cd_maps/
python quantization/quantize_gguf.py gemma-4-A4B-109e \
    --tier CD-Q4_K_M --cd-maps cd_maps/ \
    --imatrix calibration.dat --out gemma-4-A4B-109e-CD-Q4_K_M.gguf
```

See docs/METHOD_gemma4_pruning.md for the full method.
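Mechanically, the expert-drop step reduces to: score each expert per layer from the neuron-analysis file, keep the top `n_keep`, and slice the expert stacks and matching router rows together so routing indices stay consistent. A toy sketch of that slicing under a hypothetical tensor layout (the real script also recalibrates the router, per `--recalibrate-router`):

```python
import torch

def drop_experts(layer: dict[str, torch.Tensor], scores: torch.Tensor,
                 n_keep: int) -> dict[str, torch.Tensor]:
    """Keep the n_keep highest-scoring experts; slice weights and router together."""
    keep = torch.topk(scores, n_keep).indices.sort().values   # preserve expert order
    return {
        "experts.w": layer["experts.w"][keep],          # [E, d_ff, d_model] -> [K, ...]
        "router.weight": layer["router.weight"][keep],  # one router row per kept expert
    }

# Toy layer: 8 experts, keep 6.
layer = {"experts.w": torch.randn(8, 16, 32), "router.weight": torch.randn(8, 32)}
pruned = drop_experts(layer, scores=torch.rand(8), n_keep=6)
assert pruned["experts.w"].shape[0] == pruned["router.weight"].shape[0] == 6
```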
For "small student + larger teacher + small task corpus" scenarios that pair
naturally with merge recipes (merge-then-distill is the standard order),
see docs/METHOD_kl_distillation.md.
It covers FKL / RKL / GKD / Hybrid losses, the on-policy vs off-policy
trade-off, the recommended starter recipe, and the cross-tokenizer case.
Findings backed by full HE-164 + MBPP-378 evals on DS-Coder-1.3B-Instruct
(student) with DS-Coder-6.7B-Instruct (same-vocab teacher). Forward KL on
student-sampled positions ("DistillSpec") is the recommended default —
it beat reverse-KL (MiniLLM) by 1.8 pp and SFT by 4.3 pp HE at the same
recipe. Cross-vocab plumbing factored into the standalone library
mann1x/cross-tokenizer-distill.
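For concreteness, the recommended default reduces to: sample completions from the student, then minimize KL(teacher || student) at each sampled position. A minimal same-vocab sketch of that loss with assumed tensor shapes (illustrative; not the method doc's or library's actual API):

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over real (non-padding) positions.

    Logits: [batch, seq, vocab]; mask: [batch, seq] float, 1.0 = real token.
    Sequences are assumed sampled from the student (on-policy)."""
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    kl = (t.exp() * (t - s)).sum(-1)          # per-position KL
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```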
These lessons are baked into the recipe scripts; each cost real compute or a published-model rollback to learn.
- Always pass `--use_cache <path>` and `--log_samples` to lm_eval. Without the sqlite cache, any death (PEG parser, OOM, network blip) restarts from 0. Without samples, you can't tell whether `pass@1 = 0` means "model bad" or "scorer crashed on markdown fences".
- `imatrix.dat` MUST be archived next to every quant. Recomputing it takes 15-20 min of GPU time and depends on calibration data + seed. Lose it → the quant cannot be reproduced bit-for-bit.
- Never run lm_eval on a chat model via `/v1/completions` without an explicit chat template. Gemma 4 / Qwen3.5 reasoning variants emit fenced code that scorers can't `exec()`. Use `/v1/chat/completions` + `apply_chat_template`, or rescore samples with the fence-strip helpers in `eval/` (sketched after this list).
- Gemma 4 needs `--reasoning-format deepseek --reasoning-budget 8192` when served via `llama-server`. Without the budget, it emits malformed channel tokens and crashes eval mid-run.
- Qwen3.5 has hybrid linear/full attention. Without `flash-linear-attention` and `causal-conv1d` installed, gradient extraction OOMs at >1k context. Either install them or use chunked grad-accum (`competence_extract.py --chunk-len`).
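The fence-strip rescoring referenced above is mechanically simple: pull the first fenced code block out of a completion before it reaches the `exec()`-based scorer. A minimal sketch of the core idea (the actual helpers in `eval/` likely cover more edge cases, e.g. channel-token cleanup):

```python
import re

FENCE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)

def strip_fences(completion: str) -> str:
    """Return the first fenced code block's body, or the text unchanged."""
    m = FENCE.search(completion)
    return m.group(1) if m else completion

assert strip_fences("```python\nprint('hi')\n```") == "print('hi')\n"
```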
- `ManniX-ITA/Qwen3.5-27B-Omnimerge-v2` — 27B frankenmerge (4 sources, OBIM-lite + DAREx-q + EMR + Fisher).
- `ManniX-ITA/Qwen3.6-27B-Omnimerge-v3a` — cross-base v3a (Qwen3.6 base + 3 Qwen3.5 sources).
- `ManniX-ITA/Qwen3.6-27B-Omnimerge-v3b` — same-base v3b.
- Gemma 4 A4B 109e — pruned MoE (128e → 109e), 75.25% → 71.72% GPQA Diamond, ~12 GB at Q4_K_M.
MIT (see LICENSE).
If you use this in published work:
```bibtex
@misc{omnimergekit,
  author       = {Calpini, Federico},
  title        = {omnimergekit: model merging, expert pruning, and differential competence maps},
  year         = {2026},
  howpublished = {\url{https://github.com/mann1x/omnimergekit}}
}
```