### GPU requirements
MergeKit's evo evaluators are Ray actors that request 1 GPU each. Even the serial path reserves a GPU in Ray. Running evo fully on CPU would require non-trivial code changes (actor resource specs, evaluation backend). This notebook therefore keeps Part A as a read-only walkthrough and provides Part B as a CPU-only portable template you can run locally.
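
If you are unsure which part applies to your machine, a quick check like the following can help. It is only a sketch: it assumes PyTorch is installed, and the variable name `has_gpu` is illustrative rather than part of MergeKit.

```python
import torch

# True only if PyTorch can see at least one CUDA device.
has_gpu = torch.cuda.is_available()

if has_gpu:
    print("CUDA device found: the GPU evo run described in Part A may be possible.")
else:
    print("No CUDA device found: use the CPU-only template in Part B.")
```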

### Genotype → MergeConfiguration demo (safe to run on CPU)
This cell uses MergeKit's public API to: (1) build a `ModelGenome` from a simple definition, (2) generate an initial genotype, (3) convert it to a `MergeConfiguration`, and (4) render the YAML.

In [2]:
# Minimal, CPU-safe demo: construct genome definition and convert a genotype to MergeConfiguration
from mergekit.evo.genome import ModelGenome, ModelGenomeDefinition
from mergekit.common import ModelReference
import torch

# Use two small public models as references (only their configs are read here).
# You can replace with your own. This cell does not load weights.
genome_def = ModelGenomeDefinition(
    models=[
        ModelReference(model="Qwen/Qwen2.5-0.5B"),
        ModelReference(model="Qwen/Qwen2.5-0.5B-Instruct"),
    ],
    base_model=ModelReference(model="Qwen/Qwen2.5-0.5B"),
    merge_method="dare_ties",
    layer_granularity=2,  # group layers for fewer parameters
    normalize=None,        # auto for supported methods
    allow_negative_weights=True,
    filters=None,
    smooth=False,
)

genome = ModelGenome(genome_def, trust_remote_code=False)
x0 = genome.initial_genotype(random=False)  # shape: [groups, models, param_sets, params]
print('genotype shape:', tuple(x0.shape))

merge_cfg = genome.genotype_merge_config(x0)
print('merge method:', merge_cfg.merge_method)
print('num referenced models:', len(merge_cfg.referenced_models()))
print('slices:', len(merge_cfg.slices or []))
print('Merge configuration YAML:')
print(merge_cfg.to_yaml())


genotype shape: (12, 2, 1, 2)
merge method: dare_ties
num referenced models: 2
slices: 12
Merge configuration YAML:
base_model: Qwen/Qwen2.5-0.5B
dtype: bfloat16
merge_method: dare_ties
parameters:
  int8_mask: 1.0
  normalize: 1.0
slices:
- sources:
  - layer_range: [0, 2]
    model: Qwen/Qwen2.5-0.5B
    parameters:
      density: 1.0
      weight: 0.5
  - layer_range: [0, 2]
    model: Qwen/Qwen2.5-0.5B-Instruct
    parameters:
      density: 1.0
      weight: 0.5
- sources:
  - layer_range: [2, 4]
    model: Qwen/Qwen2.5-0.5B
    parameters:
      density: 1.0
      weight: 0.5
  - layer_range: [2, 4]
    model: Qwen/Qwen2.5-0.5B-Instruct
    parameters:
      density: 1.0
      weight: 0.5
- sources:
  - layer_range: [4, 6]
    model: Qwen/Qwen2.5-0.5B
    parameters:
      density: 1.0
      weight: 0.5
  - layer_range: [4, 6]
    model: Qwen/Qwen2.5-0.5B-Instruct
    parameters:
      density: 1.0
      weight: 0.5
- sources:
  - layer_range: [6, 8]
    model: Qwen/Qwen2.5-0.5B


### What happens next (on GPU runs)
- The `MergeConfiguration` feeds `MergePlanner`, which builds a tensor DAG per-parameter and per-layer: gather → merge op → save.
- The evaluator merges on-disk or in-memory, then calls `lm_eval.simple_evaluate` to compute metrics.
- CMA-ES in `mergekit.scripts.evolve` optimizes the genotype by repeatedly evaluating populations and updating the distribution.
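
The evaluate-then-update loop in the last bullet can be illustrated with a deliberately simplified Gaussian evolution strategy. This is not CMA-ES (there is no covariance adaptation) and it is not MergeKit's code; `toy_es`, `objective`, and the quadratic score are hypothetical stand-ins for "merge the models and run the benchmark".

```python
import numpy as np

def toy_es(objective, x0, sigma=0.3, popsize=16, elite=4, iters=40, seed=0):
    """Simplified Gaussian ES: sample, evaluate, move the mean toward the best."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # "ask": sample a population around the current mean
        pop = mean + sigma * rng.standard_normal((popsize, mean.size))
        # evaluate each candidate (a real run would merge and benchmark here)
        scores = np.array([objective(x) for x in pop])
        # "tell": recenter the distribution on the highest-scoring candidates
        best = pop[np.argsort(scores)[-elite:]]
        mean = best.mean(axis=0)
        sigma *= 0.95  # fixed decay; CMA-ES adapts step size and covariance instead
    return mean

# Hypothetical objective: higher is better, maximized at `target`.
target = np.array([0.7, 0.3])
found = toy_es(lambda x: -np.sum((x - target) ** 2), x0=[0.5, 0.5])
print(found)
```

The flat vector plays the role of the genotype. In a real run each `objective` call is expensive, which is why the number of function evaluations is the budget that matters.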

## Part B — Portable evolutionary merging for any PyTorch model (CPU)
We'll implement a small CPU-only evolutionary search that mirrors MergeKit's ideas:
- A genotype controlling per-layer weights (and optional density).
- A merge function that combines parent state_dicts using those params.
- A fitness function (validation loss/accuracy).
- A simple evolutionary optimizer (CMA-ES if available, falling back to stochastic hill climbing).
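
As a warm-up for the merge function, here is a minimal sketch of the simplest possible operator: a single global interpolation between two parents' `state_dict`s. The helper name `lerp_state_dicts` is hypothetical, and it assumes both parents share identical keys and shapes.

```python
import torch
import torch.nn as nn

def lerp_state_dicts(sd_a, sd_b, t):
    """Return (1 - t) * sd_a + t * sd_b for every tensor (keys and shapes must match)."""
    return {k: torch.lerp(sd_a[k].float(), sd_b[k].float(), t) for k in sd_a}

torch.manual_seed(0)
a, b = nn.Linear(4, 2), nn.Linear(4, 2)

# Build a child model whose parameters are 75% parent A and 25% parent B.
child = nn.Linear(4, 2)
child.load_state_dict(lerp_state_dicts(a.state_dict(), b.state_dict(), t=0.25))
print(child.weight)
```

The evolutionary version generalizes this by replacing the single scalar `t` with a genotype that holds separate parameters per layer group.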

In [3]:
# Toy dataset and model (CPU)
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import math, random

torch.manual_seed(0)
device = 'cpu'

# Synthetic 2D classification: label is 1 when both coordinates share a sign
def make_data(n=1024):
    x = torch.rand(n, 2) * 2 - 1  # [-1,1]^2
    y = (x[:,0] * x[:,1] > 0).long()  # XOR-ish
    return x, y

Xtr, ytr = make_data(2048)
Xva, yva = make_data(512)
train_loader = DataLoader(TensorDataset(Xtr, ytr), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(Xva, yva), batch_size=256)

class MLP(nn.Module):
    def __init__(self, d=2, h=32, c=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, h), nn.ReLU(),
            nn.Linear(h, h), nn.ReLU(),
            nn.Linear(h, c)
        )
    def forward(self, x): return self.net(x)

def train_one(model, steps=200, lr=1e-2):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    it = iter(train_loader)
    for _ in range(steps):
        try: xb, yb = next(it)
        except StopIteration: it = iter(train_loader); xb, yb = next(it)
        opt.zero_grad()
        loss = F.cross_entropy(model(xb), yb)
        loss.backward(); opt.step()

def eval_acc(model):
    model.eval(); corr=tot=0
    with torch.no_grad():
        for xb, yb in val_loader:
            pred = model(xb).argmax(-1)
            corr += (pred==yb).sum().item(); tot += yb.numel()
    return corr/tot

# Train two different parents
parent_a = MLP(); parent_b = MLP()
train_one(parent_a, steps=300)
train_one(parent_b, steps=300, lr=8e-3)
print('Parent A acc:', eval_acc(parent_a))
print('Parent B acc:', eval_acc(parent_b))

Parent A acc: 0.974609375
Parent B acc: 0.98046875


In [4]:
# Genome encoding: [n_layer_groups, n_models, n_param_sets(=1), n_params]
# We'll mirror MergeKit's 'dare_ties' shape but implement a simplified op:
#  - param 0: weight in [0, +inf) (abs applied);
#  - param 1: density in [0,1] (sparsifies the delta merge).

import numpy as np

layers = [n for n,_ in parent_a.named_parameters()]
# Group linear layers by layer index for a coarse "layer_granularity"
layer_groups = [
    [n for n in layers if 'net.0' in n or 'net.1' in n],
    [n for n in layers if 'net.2' in n or 'net.3' in n],
    [n for n in layers if 'net.4' in n],
]
n_groups = len(layer_groups); n_models=2; n_sets=1; n_params=2

def init_genotype():
    g = torch.zeros(n_groups, n_models, n_sets, n_params)
    g[:,:,:,0] = 0.5  # equal weights
    g[:,:,:,1] = 1.0  # full density
    return g

def apply_genotype(parent_a, parent_b, g):
    # Simplified "dare_ties-like": base + sum_i w_i * sparsify(delta_i, density_i)
    # where delta_i = parent_i - base and base = parent_a. Parent A's delta is
    # therefore zero, so only parent B's weight and density affect the result.
    g = g.clone()
    g[..., 0] = g[..., 0].abs()               # weight >= 0
    g[..., 1] = g[..., 1].abs().clamp(0, 1)   # density in [0, 1]
    base_sd = parent_a.state_dict(); b_sd = parent_b.state_dict()
    out = MLP(); out_sd = out.state_dict()
    for gi, names in enumerate(layer_groups):
        # Index the full 4-D genotype: [group, model, param_set, param]
        w_b = g[gi, 1, 0, 0].item()
        d = g[gi, 1, 0, 1].item() if n_params > 1 else 1.0
        for name in names:
            if name not in out_sd: continue
            A = base_sd[name]; B = b_sd[name]
            delta = B - A
            # Keep the k largest-magnitude entries of delta (fraction d)
            k = max(1, int(delta.numel()*d))
            flat = delta.flatten().abs()
            if k < flat.numel():
                # The k-th largest value is the (n-k+1)-th smallest
                thresh = flat.kthvalue(flat.numel()-k+1).values
                mask = (delta.abs() >= thresh)
            else:
                mask = torch.ones_like(delta, dtype=torch.bool)
            merged = A + w_b * (delta * mask)
            out_sd[name] = merged.to(out_sd[name].dtype)
    out.load_state_dict(out_sd)
    return out

def fitness(g):
    model = apply_genotype(parent_a, parent_b, g)
    return eval_acc(model)  # higher is better

g0 = init_genotype()
print('Initial fitness:', fitness(g0))
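
The density parameter is easiest to understand in isolation. The sketch below shows magnitude-based sparsification of a delta tensor, the variant this notebook's toy operator uses; it is not necessarily what MergeKit's `dare_ties` does internally. The helper name `sparsify_topk` is hypothetical.

```python
import torch

def sparsify_topk(delta: torch.Tensor, density: float) -> torch.Tensor:
    """Keep the `density` fraction of entries with the largest magnitude; zero the rest."""
    k = max(1, int(delta.numel() * density))
    if k >= delta.numel():
        return delta.clone()
    # Indices (into the flattened tensor) of the k largest-magnitude entries
    idx = delta.abs().flatten().topk(k).indices
    mask = torch.zeros(delta.numel(), dtype=torch.bool)
    mask[idx] = True
    return torch.where(mask.view_as(delta), delta, torch.zeros_like(delta))

delta = torch.tensor([[0.1, -2.0, 0.3],
                      [4.0, -0.05, 1.0]])
sparse = sparsify_topk(delta, density=0.5)
print(sparse)
```

With `density=0.5` on six entries, the three largest magnitudes (4.0, -2.0, 1.0) survive and the rest become zero. Using `topk` keeps exactly `k` entries, so there is no threshold off-by-one to get wrong.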

In [None]:
# Evolutionary search: CMA-ES if available, else simple stochastic hill climbing
try:
    import cma
    use_cma = True
except Exception:
    use_cma = False

def run_cma_es(iters=10, sigma0=0.2, popsize=None):
    x0 = g0.view(-1).numpy()
    es = cma.CMAEvolutionStrategy(x0, sigma0, {'popsize': popsize} if popsize else {})
    best = (-1.0, x0)
    for _ in range(iters):
        xs = es.ask()
        fits = []
        for x in xs:
            g = torch.tensor(x, dtype=torch.float32).view_as(g0)
            fits.append(fitness(g))  # accuracy: higher is better
        es.tell(xs, [-f for f in fits])  # CMA-ES minimizes, so negate
        bi = int(np.argmax(fits)); bfit = fits[bi]
        if bfit > best[0]: best = (bfit, xs[bi])
    return best[0], torch.tensor(best[1]).view_as(g0)

def run_hill_climb(iters=50, samples=16, step=0.2):
    best_g = g0.clone(); best_f = fitness(best_g)
    for _ in range(iters):
        cand = [best_g + step*torch.randn_like(best_g) for _ in range(samples)]
        vals = [fitness(g) for g in cand]
        bi = int(np.argmax(vals))
        if vals[bi] > best_f: best_f, best_g = vals[bi], cand[bi]
    return best_f, best_g

best_score, best_g = (run_cma_es(iters=15) if use_cma else run_hill_climb())
print('Best fitness:', best_score)

### Adapting to your own architecture
- Extract and align parent `state_dict`s; ensure parameter name and shape match. If not, write mapping code.
- Choose layer groups (granularity). You can group by blocks or specific submodules.
- Define your merge operator over tensors: e.g., base + w_i * masked deltas; or SLERP-like for normalized vectors.
- Encode merge parameters into a genotype tensor consistent with your operator.
- Define a fitness function appropriate for your task (loss, accuracy, BLEU, etc.).
- Swap the toy optimizer with CMA-ES or another EA for stronger search.
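
For the first step in the list above, a small pre-flight check can catch mismatched parents before any merging is attempted. This is a minimal sketch; `check_alignment` is a hypothetical helper, and it only compares names and shapes, not dtypes or semantics.

```python
import torch.nn as nn

def check_alignment(sd_a, sd_b):
    """Return a list of problems; an empty list means the parents can be merged key by key."""
    problems = []
    for name in sorted(set(sd_a) | set(sd_b)):
        if name not in sd_a or name not in sd_b:
            missing = "A" if name not in sd_a else "B"
            problems.append(f"{name}: missing in parent {missing}")
        elif sd_a[name].shape != sd_b[name].shape:
            problems.append(
                f"{name}: shape {tuple(sd_a[name].shape)} vs {tuple(sd_b[name].shape)}"
            )
    return problems

same = check_alignment(nn.Linear(4, 2).state_dict(), nn.Linear(4, 2).state_dict())
diff = check_alignment(nn.Linear(4, 2).state_dict(), nn.Linear(4, 3).state_dict())
print(same)
print(diff)
```

If the list is non-empty, that is the point at which to write the name or shape mapping code before defining layer groups.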

## Appendix — Running MergeKit evo on GPU
Prepare a YAML like the one in your working Colab, then run the CLI from this repo. Example flags: `--storage-path`, `--max-fevals`, `--num-gpus`, `--no-vllm`, `--allow-crimes` (if optimizing benchmarks), `--no-wandb`, `--save-final-model`.