# Merge algorithms

## 1. SLERP

**Spherical Linear Interpolation** (SLERP) is a method used to smoothly interpolate between two vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside.

There are several reasons to prefer SLERP over a traditional linear interpolation. For example, in high-dimensional spaces, linear interpolation can lead to a decrease in the magnitude of the interpolated vector (i.e., it reduces the scale of weights). Moreover, the change in direction of the weights often represents more meaningful information (like feature learning and representation) than the magnitude of change.

SLERP is implemented using the following steps:

1. Normalize the input vectors to unit length, so that they represent directions rather than magnitudes.
2. Calculate the angle between these vectors using their dot product.
3. If the vectors are nearly collinear, fall back to linear interpolation for efficiency. Otherwise, compute scale factors from the interpolation factor t (t=0 returns 100% of the first vector, t=1 returns 100% of the second vector) and the angle between the vectors.
4. Use these factors to weight the original vectors, then sum them to obtain the interpolated vector.

SLERP is currently the most popular merging method, but it is limited to combining only two models at a time. It is still possible to hierarchically combine multiple models.
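
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not mergekit's implementation: the function name `slerp`, the collinearity threshold `dot_threshold` and the small constant `eps` are assumptions made for the example, and the inputs are treated as flat weight vectors.

```python
import numpy as np

def slerp(t, v0, v1, dot_threshold=0.9995, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    dot_threshold and eps are illustrative defaults, not values from the text.
    """
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)

    # Step 1: unit-length copies, used only to measure the angle.
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)

    # Step 2: the dot product of unit vectors is the cosine of the angle.
    dot = float(np.clip(np.sum(u0 * u1), -1.0, 1.0))

    # Step 3: nearly collinear vectors fall back to linear interpolation.
    if abs(dot) > dot_threshold:
        return (1.0 - t) * v0 + t * v1

    theta = np.arccos(dot)
    sin_theta = np.sin(theta)
    s0 = np.sin((1.0 - t) * theta) / sin_theta
    s1 = np.sin(t * theta) / sin_theta

    # Step 4: weight the original vectors by the scale factors and sum them.
    return s0 * v0 + s1 * v1


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
midpoint = slerp(0.5, a, b)
```

Halfway between two orthogonal unit vectors, the SLERP result keeps unit length, whereas plain linear interpolation would shrink it to a norm of about 0.71.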

## 2. TIES

Introduced by Yadav et al., TIES-Merging is designed to efficiently merge multiple task-specific models into a single multitask model. It addresses two main challenges in model merging:

* Redundancy in model parameters: It identifies and eliminates redundant parameters within task-specific models. This is achieved by focusing on the changes made during fine-tuning, identifying the top-k% most significant changes, and discarding the rest.
* Disagreement between parameter signs: Conflicts arise when different models suggest opposing adjustments to the same parameter. TIES-Merging resolves these conflicts by creating a unified sign vector that represents the most dominant direction of change across all models.

TIES-Merging is divided into the following three steps:

1. Trim: Reduces redundancy in task-specific models by retaining only a fraction of the most significant parameters (the density parameter) and resetting the rest to zero.
2. Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.
3. Disjoint Merge: Averages parameter values that align with the unified sign vector, excluding zero values.

Unlike SLERP, TIES can merge multiple models at a time.
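
To make the three steps concrete, here is a small NumPy sketch on flat weight vectors. It is a simplified reading of the description above, not mergekit's code: the function name `ties_merge` is hypothetical, per-model weights and any final scaling factor are left out, and the trim step keeps the top `density` fraction of each delta by absolute value.

```python
import numpy as np

def ties_merge(base, finetuned, density=0.5):
    """Illustrative TIES-style merge of 1-D weight vectors (hypothetical helper)."""
    base = np.asarray(base, dtype=np.float64)
    # Work on the changes made during fine-tuning (delta weights).
    deltas = np.stack([np.asarray(w, dtype=np.float64) - base for w in finetuned])

    # Trim: keep only the k largest-magnitude changes of each model.
    k = max(1, int(round(density * base.size)))
    trimmed = np.zeros_like(deltas)
    for i, delta in enumerate(deltas):
        keep = np.argsort(np.abs(delta))[-k:]
        trimmed[i, keep] = delta[keep]

    # Elect sign: the sign of the summed deltas is the direction with the
    # larger cumulative magnitude.
    elected = np.sign(trimmed.sum(axis=0))

    # Disjoint merge: average the non-zero entries that agree with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = agree.sum(axis=0)
    merged_delta = (trimmed * agree).sum(axis=0) / np.maximum(counts, 1)

    return base + merged_delta


base = np.zeros(4)
model_a = np.array([1.0, 0.1, -2.0, 0.0])
model_b = np.array([3.0, -0.2, 1.0, 0.05])
merged = ties_merge(base, [model_a, model_b], density=0.5)
```

In this toy example, the two models agree on the first parameter, so their changes are averaged. They disagree on the third one, so only the change with the larger magnitude survives.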

With this config, we use Mistral-7B as the base model to calculate the delta weights. We merge the same two models, mistral-ft-optimized-1218 (weight 0.5) and NeuralHermes-2.5-Mistral-7B (weight 0.3), with normalization. Here, a density of 0.5 means that we’re only retaining 50% of the fine-tuned parameters of each model (the other half comes from the base model).

Note that the sum of the weights is not equal to 1 in the config, but the normalize: true parameter will automatically normalize them internally. This config is inspired by the parameters provided by the author of OpenHermes-2.5-neural-chat-7b-v3-1-7B.
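
As a rough illustration of what normalization does to the weights 0.5 and 0.3, the snippet below divides each weight by their sum. This is the simplest possible reading and an assumption on our part: the function `normalize_weights` is hypothetical, and mergekit may apply normalization differently internally (for example per parameter rather than globally).

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to 1 (hypothetical helper)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]


# Weights from the text: 0.5 and 0.3, which sum to 0.8.
normalized = normalize_weights([0.5, 0.3])
```

Under this assumption, the effective weights become roughly 0.625 and 0.375, which keeps the 5:3 ratio between the two models.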

You can find the final model on the Hugging Face Hub at mlabonne/NeuralPipe-7B-ties.

## 3. DARE

Introduced by Yu et al. (2023), DARE uses an approach similar to TIES with two main differences:

* Pruning: DARE randomly resets fine-tuned weights to their original values (those of the base model).
* Rescaling: DARE rescales the weights to keep the expectations of model outputs approximately unchanged. It adds the rescaled weights of both (or more) models to the weights of the base model with a scale factor.

Mergekit’s implementation of this method has two flavours: with the sign election step of TIES (dare_ties) or without (dare_linear).
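
The pruning and rescaling steps can be sketched as follows. This is an illustrative approximation of the variant without sign election, not mergekit's implementation: the names `dare_delta` and `dare_linear_sketch` are hypothetical, the tensors are flat NumPy arrays, and `density` is taken to be the probability of keeping each fine-tuned change.

```python
import numpy as np

def dare_delta(base, finetuned, density, rng):
    """Randomly drop delta weights, then rescale the survivors by 1 / density."""
    delta = np.asarray(finetuned, dtype=np.float64) - np.asarray(base, dtype=np.float64)
    # Pruning: each change is kept with probability `density`; dropped entries
    # become zero, i.e. the weight falls back to its base-model value.
    keep = rng.random(delta.shape) < density
    # Rescaling: dividing by `density` keeps the expected delta unchanged.
    return np.where(keep, delta / density, 0.0)

def dare_linear_sketch(base, models, weights, density, seed=0):
    """Add the weighted, pruned and rescaled deltas to the base weights."""
    rng = np.random.default_rng(seed)
    merged = np.asarray(base, dtype=np.float64).copy()
    for weight, model in zip(weights, models):
        merged += weight * dare_delta(base, model, density, rng)
    return merged


base = np.zeros(100_000)
finetuned = np.ones(100_000)
pruned = dare_delta(base, finetuned, density=0.53, rng=np.random.default_rng(0))
```

Even though roughly half of the entries are reset, the mean of `pruned` stays close to the mean of the original delta, which is the point of the rescaling step.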

In this configuration, we merge three different models based on Mistral-7B using dare_ties. This time, I chose weights that sum to 1 (the sum should be between 0.9 and 1.1). The density parameter is a little higher than what’s recommended in the paper (<0.5), but it looks like it gives consistently better results (see this discussion).
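
The rule of thumb about the weight sum is easy to check before launching a merge. The helper below is a hypothetical convenience, not part of mergekit. It works on the kind of dictionary list that parsing the YAML config would produce, with values mirroring the dare_ties configuration used in this notebook.

```python
def check_weight_sum(models, low=0.9, high=1.1):
    """Return the total weight and whether it falls inside [low, high]."""
    # The base model has no parameters, so it is skipped.
    weights = [m["parameters"]["weight"] for m in models if "parameters" in m]
    total = sum(weights)
    return total, low <= total <= high


models = [
    {"model": "mistralai/Mistral-7B-v0.1"},
    {"model": "samir-fama/SamirGPT-v1", "parameters": {"density": 0.53, "weight": 0.4}},
    {"model": "abacusai/Slerp-CM-mist-dpo", "parameters": {"density": 0.53, "weight": 0.3}},
    {"model": "EmbeddedLLM/Mistral-7B-Merge-14-v0.2", "parameters": {"density": 0.53, "weight": 0.3}},
]
total, ok = check_weight_sum(models)
```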

You can find it on the Hugging Face Hub at mlabonne/Daredevil-7B. It’s also the best merge model in this article, outperforming even Marcoro14-7B-slerp.

## 4. Passthrough

The passthrough method differs significantly from the previous ones. By concatenating layers from different LLMs, it can produce models with an exotic number of parameters (e.g., 9B with two 7B parameter models). These models are often referred to as “frankenmerges” or “Frankenstein models” by the community.

This technique is very experimental, but it managed to create impressive models, like goliath-120b using two Llama 2 70B models. The recently released SOLAR-10.7B-v1.0 also uses the same idea, called depth up-scaling in their paper.

The resulting frankenmerge will have all the 32 layers from the first model and 8 additional layers from the second model. This creates a frankenmerge with a total of 40 layers and 8.99B parameters. This config is inspired by GML-Mistral-merged-v1.
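
The 8.99B figure can be checked with a quick parameter count. The sketch below assumes the usual Mistral-7B dimensions (hidden size 4096, MLP size 14336, key/value projection width 1024, vocabulary of 32,000 tokens, untied input and output embeddings, no biases). These values are not stated in this section, so treat them as assumptions, and `mistral_like_param_count` as a hypothetical helper.

```python
def mistral_like_param_count(num_layers, hidden=4096, intermediate=14336,
                             kv_dim=1024, vocab=32000):
    """Count parameters of a Mistral-style decoder (assumed dimensions)."""
    # Attention: query and output projections plus smaller key and value projections.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # MLP: gate, up and down projections.
    mlp = 3 * hidden * intermediate
    # Two normalization layers per block.
    norms = 2 * hidden
    per_layer = attention + mlp + norms
    # Input embeddings, output head, and the final normalization layer.
    shared = 2 * vocab * hidden + hidden
    return num_layers * per_layer + shared


original = mistral_like_param_count(32)   # 7,241,732,096 parameters
merged = mistral_like_param_count(40)     # 8,986,628,096 parameters
```

Each extra layer adds about 218M parameters, so 8 additional layers take the model from roughly 7.24B to 8.99B parameters.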

You can find the final model on the Hugging Face Hub at mlabonne/NeuralPipe-9B-merged.

In [2]:
!git clone https://github.com/cg123/mergekit.git
!cd mergekit && pip install -q -e .


In [7]:
import yaml

# Name of the merged model (matches the dare_ties configuration below)
MODEL_NAME = "Daredevil-7B"
yaml_config = """
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: samir-fama/SamirGPT-v1
    parameters:
      density: 0.53
      weight: 0.4
  - model: abacusai/Slerp-CM-mist-dpo
    parameters:
      density: 0.53
      weight: 0.3
  - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.2
    parameters:
      density: 0.53
      weight: 0.3
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  int8_mask: true
dtype: bfloat16
"""

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

We run the merge command with the following parameters:

* --copy-tokenizer to copy the tokenizer from the base model
* --allow-crimes and --out-shard-size to chunk the models into smaller shards that can be computed on a CPU with low RAM
* --lazy-unpickle to enable the experimental lazy unpickler for lower memory usage

In addition, some models can require the --trust_remote_code flag (this is not the case with Mistral-7B).

This command will download the weights of all the models listed in the merge configuration and run the selected merge method (it should take ~10 minutes).

In [11]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
!mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle

The model is now merged and saved in the merge directory. Before uploading it, we can create a README file with all the information required for reproducibility. The following code block defines a Jinja template and automatically fills it with the data from the merge configuration.

In [None]:
!pip install -qU huggingface_hub

from huggingface_hub import ModelCard, ModelCardData
from jinja2 import Template

username = "mlabonne"

template_text = """
---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [mergekit](https://github.com/cg123/mergekit):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```
"""

# Create a Jinja template object
jinja_template = Template(template_text.strip())

# Get list of models from config
data = yaml.safe_load(yaml_config)
if "models" in data:
    # Top-level model list: keep every model that has parameters (this skips the base model)
    models = [m["model"] for m in data["models"] if "parameters" in m]
elif "parameters" in data:
    # Slice-based config with top-level parameters: read all sources of the first slice
    models = [source["model"] for source in data["slices"][0]["sources"]]
elif "slices" in data:
    # Slice-based config without top-level parameters: take the first source of every slice
    models = [s["sources"][0]["model"] for s in data["slices"]]
else:
    raise Exception("No models or slices found in yaml config")

# Fill the template
content = jinja_template.render(
    model_name=MODEL_NAME,
    models=models,
    yaml_config=yaml_config,
    username=username,
)

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')
