<a href="https://colab.research.google.com/github/jankovicsandras/ml/blob/main/LazierMergekit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🥱😴 LazierMergekit

> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

This notebook allows you to easily merge multiple models using [mergekit](https://github.com/cg123/mergekit). To evaluate your merges, see [🧐 LLM AutoEval](https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing#scrollTo=elyxjYI_rY5W).

This version adds [llama.cpp](https://github.com/ggerganov/llama.cpp) quantization: the merged model is converted to GGUF and quantized to Q5_K_M.

*Special thanks to [@cg123](https://github.com/cg123) for this library and [@mrfakename](https://gist.github.com/fakerybakery) who told me about sharding (see his [Gist](https://gist.github.com/fakerybakery/d30a4d31b4f914757c1381166b9c683b)).*

In [3]:
MODEL_NAME = "Mindentural-7B-240124"
yaml_config = """
slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
      - model: HyperbeeAI/Tulpar-7b-v2
        layer_range: [0, 32]
merge_method: slerp
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
"""
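
The `t` lists in the config above are easier to read with a small sketch. It assumes (my reading of mergekit's gradient parameters, not something stated in this notebook) that a list of values is stretched linearly across the 32 layers, and that `t = 0` keeps the base model's tensor while `t = 1` keeps the other model's. `expand_gradient` and `slerp` are hypothetical helper names written in plain Python for illustration; they are not mergekit's implementation.

```python
import math


def expand_gradient(anchors, num_layers):
    """Linearly interpolate a short list of anchor values across num_layers layers."""
    if len(anchors) == 1 or num_layers == 1:
        return [float(anchors[0])] * num_layers
    values = []
    for layer in range(num_layers):
        # Position of this layer on the anchor axis, from 0 to len(anchors) - 1.
        pos = layer * (len(anchors) - 1) / (num_layers - 1)
        lo = min(int(pos), len(anchors) - 2)
        frac = pos - lo
        values.append(anchors[lo] * (1 - frac) + anchors[lo + 1] * frac)
    return values


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the two vectors, clamped for acos.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1 + eps)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    if math.sin(theta) < eps:
        # Nearly colinear vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Per-layer t values for the two filters in the config (32 layers assumed).
self_attn_t = expand_gradient([0, 0.5, 0.3, 0.7, 1], 32)
mlp_t = expand_gradient([1, 0.5, 0.7, 0.3, 0], 32)
print([round(t, 2) for t in self_attn_t])
print([round(t, 2) for t in mlp_t])

# Toy example: halfway between two orthogonal unit vectors.
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Under these assumptions the two schedules mirror each other: in layers where the attention weights lean toward one model, the MLP weights lean toward the other, and every tensor that matches neither filter uses the constant `0.5`.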

In [4]:
# @title ## Run merge

# @markdown ### Runtime type
# @markdown Select your runtime (CPU, High RAM, GPU)

runtime = "CPU" # @param ["CPU", "CPU + High-RAM", "GPU"]

# @markdown ### Mergekit arguments
# @markdown Use the `main` branch by default, [`mixtral`](https://github.com/cg123/mergekit/blob/mixtral/moe.md) if you want to create a Mixture of Experts.

branch = "main" # @param ["main", "mixtral"]
trust_remote_code = False # @param {type:"boolean"}

# Install mergekit
if branch == "main":
    !git clone https://github.com/cg123/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
elif branch == "mixtral":
    !git clone -b mixtral https://github.com/cg123/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
    !pip install -qqq -U transformers --progress-bar off

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

# Base CLI
if branch == "main":
    cli = "mergekit-yaml config.yaml merge --copy-tokenizer"
elif branch == "mixtral":
    cli = "mergekit-moe config.yaml merge --copy-tokenizer"

# Additional arguments
if runtime == "CPU":
    cli += " --allow-crimes --out-shard-size 1B --lazy-unpickle"
elif runtime == "GPU":
    cli += " --cuda --low-cpu-memory"
if trust_remote_code:
    cli += " --trust-remote-code"

print(cli)

# Merge models
!{cli}

fatal: destination path 'mergekit' already exists and is not an empty directory.
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for mergekit (pyproject.toml) ... done
mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle
Warmup loader cache:   0% 0/2 [00:00<?, ?it/s]
Fetching 8 files: 100% 8/8 [00:00<00:00, 9675.44it/s]
Warmup loader cache:  50% 1/2 [00:00<00:00,  6.21it/s]
Fetching 8 files: 100% 8/8 [00:00<00:00, 41070.30it/s]
Warmup loader cache: 100% 2/2 [00:00<00:00,  6.33it/s]
100% 1457/1457 [09:01<00:00,  2.69it/s]
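
The flag selection in the cell above can also be read as a single function, which makes it easier to check which command each form setting produces. This is a minimal sketch that mirrors the cell's logic; `build_merge_cli` is a hypothetical name, and the flags are copied from the cell rather than verified against any particular mergekit release.

```python
def build_merge_cli(branch="main", runtime="CPU", trust_remote_code=False):
    """Assemble the merge command from the form values used in this notebook."""
    commands = {"main": "mergekit-yaml", "mixtral": "mergekit-moe"}
    if branch not in commands:
        raise ValueError(f"Unknown branch: {branch!r}")

    parts = [commands[branch], "config.yaml", "merge", "--copy-tokenizer"]
    if runtime == "CPU":
        parts += ["--allow-crimes", "--out-shard-size", "1B", "--lazy-unpickle"]
    elif runtime == "GPU":
        parts += ["--cuda", "--low-cpu-memory"]
    # "CPU + High-RAM" adds no extra flags, as in the cell above.
    if trust_remote_code:
        parts.append("--trust-remote-code")
    return " ".join(parts)


print(build_merge_cli())
print(build_merge_cli(branch="mixtral", runtime="GPU", trust_remote_code=True))
```

With the default arguments the function returns the same command that the cell printed before starting the merge.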


In [21]:
!rm -rf /root/.cache/huggingface/hub # IMPORTANT! Deleting the cached source models to free up space
!ls -la
!ls -la merge

total 36
drwxr-xr-x  1 root root 4096 Jan 24 13:09 .
drwxr-xr-x  1 root root 4096 Jan 24 12:15 ..
drwxr-xr-x  1 root root 4096 Jan 19 14:19 .config
-rw-r--r--  1 root root  398 Jan 24 12:29 config.yaml
drwxr-xr-x 21 root root 4096 Jan 24 12:48 llama.cpp
drwxr-xr-x  2 root root 4096 Jan 24 12:38 merge
drwxr-xr-x  1 root root 4096 Jan 19 14:20 sample_data
total 14146392
drwxr-xr-x 2 root root       4096 Jan 24 12:38 .
drwxr-xr-x 1 root root       4096 Jan 24 13:09 ..
-rw-r--r-- 1 root root        651 Jan 24 12:38 config.json
-rw-r--r-- 1 root root        398 Jan 24 12:38 mergekit_config.yml
-rw-r--r-- 1 root root 1912681584 Jan 24 12:30 model-00001-of-00008.safetensors
-rw-r--r-- 1 root root 1979781456 Jan 24 12:31 model-00002-of-00008.safetensors
-rw-r--r-- 1 root root 1946243968 Jan 24 12:33 model-00003-of-00008.safetensors
-rw-r--r-- 1 root root 1979781416 Jan 24 12:34 model-00004-of-00008.safetensors
-rw-r--r-- 1 root root 1862349080 Jan 24 12:35 model-00005-of-00008.safetensors
-rw-
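
Deleting the Hugging Face cache matters because the merged shards and the later GGUF files share the same disk. A quick way to see how much room is left before converting is sketched below; `free_gib` is a hypothetical helper, and the required size is taken from the `ggml-model-f16.gguf` entry in the file listing later in this notebook.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


# Size of ggml-model-f16.gguf as reported by `ls -la merge` in this notebook.
needed_gib = 14_484_731_456 / 2**30

print(f"free: {free_gib():.1f} GiB, needed for the f16 GGUF: {needed_gib:.1f} GiB")
if free_gib() < needed_gib:
    print("Not enough free space; delete cached source models first.")
```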

In [7]:
# Llama.cpp install
! git clone https://github.com/ggerganov/llama.cpp
! cd llama.cpp && make
! python3 -m pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 16800, done.
remote: Counting objects: 100% (6191/6191), done.
remote: Compressing objects: 100% (348/348), done.
remote: Total 16800 (delta 6039), reused 5882 (delta 5843), pack-reused 10609
Receiving objects: 100% (16800/16800), 19.04 MiB | 19.60 MiB/s, done.
Resolving deltas: 100% (11729/11729), done.


In [22]:
# converting to Llama.cpp format
!python3 llama.cpp/convert.py merge

Loading model file merge/model-00001-of-00008.safetensors
Loading model file merge/model-00001-of-00008.safetensors
Loading model file merge/model-00002-of-00008.safetensors
Loading model file merge/model-00003-of-00008.safetensors
Loading model file merge/model-00004-of-00008.safetensors
Loading model file merge/model-00005-of-00008.safetensors
Loading model file merge/model-00006-of-00008.safetensors
Loading model file merge/model-00007-of-00008.safetensors
Loading model file merge/model-00008-of-00008.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('merge'))
Found vocab files: {'tokenizer.model': PosixPath('merge/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': PosixPath('merge/tokenizer.json')}
Loading vocab file '

In [23]:
# quantizing to q5_k_m GGUF with Llama.cpp
! llama.cpp/quantize merge/ggml-model-f16.gguf merge/m_q5_k_m.gguf q5_k_m

main: build = 1963 (c9b316c7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'merge/ggml-model-f16.gguf' to 'merge/m_q5_k_m.gguf' as Q5_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from merge/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 1433

In [24]:
! ls -la merge |grep gguf

-rw-r--r-- 1 root root 14484731456 Jan 24 13:22 ggml-model-f16.gguf
-rw-r--r-- 1 root root  5131409024 Jan 24 13:42 m_q5_k_m.gguf
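
The two file sizes above give a rough idea of what Q5_K_M costs per weight. The arithmetic below is only an estimate: it assumes the f16 file stores about 16 bits per weight and ignores metadata and any tensors that the quantizer keeps at a different precision.

```python
f16_bytes = 14_484_731_456  # ggml-model-f16.gguf
q5_bytes = 5_131_409_024    # m_q5_k_m.gguf

ratio = q5_bytes / f16_bytes
approx_bits_per_weight = 16 * ratio  # assumes roughly 16 bits per weight in the f16 file

print(f"size ratio: {ratio:.3f}")
print(f"approx. bits per weight: {approx_bits_per_weight:.2f}")
```

The quantized file is a bit over a third of the f16 size, which works out to a little more than 5.5 bits per weight under the assumption above.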


In [None]:
# https://stackoverflow.com/questions/48774285/how-to-download-file-created-in-colaboratory-workspace
from google.colab import files
files.download('merge/m_q5_k_m.gguf')

In [None]:
# @title ## Upload model to Hugging Face { display-mode: "form" }
# @markdown Enter your HF username and the name of the Colab secret that stores your [Hugging Face access token](https://huggingface.co/settings/tokens).
username = 'mlabonne' # @param {type:"string"}
token = 'HF_TOKEN' # @param {type:"string"}
license = "apache-2.0" # @param ["apache-2.0", "cc-by-nc-4.0", "mit", "openrail"] {allow-input: true}

!pip install -qU huggingface_hub

import yaml

from huggingface_hub import ModelCard, ModelCardData, HfApi
from google.colab import userdata
from jinja2 import Template

if branch == "main":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Get list of models from config
    data = yaml.safe_load(yaml_config)
    if "models" in data:
        models = [m["model"] for m in data["models"] if "parameters" in m]
    elif "parameters" in data:
        models = [source["model"] for source in data["slices"][0]["sources"]]
    elif "slices" in data:
        models = [s["sources"][0]["model"] for s in data["slices"]]
    else:
        raise Exception("No models or slices found in yaml config")

    # Fill the template
    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license,
    )

elif branch == "mixtral":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- moe
- frankenmoe
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a Mixture of Experts (MoE) made with the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers bitsandbytes accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Fill the template
    data = yaml.safe_load(yaml_config)
    models = [model['source_model'] for model in data['experts']]

    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license
    )

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get(token))

# Upload merge folder
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,
)
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)
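
The model list that ends up in the model card depends on the shape of the YAML config, so it can help to test that logic separately from the upload. The sketch below restates the three cases from the `main` branch as a function that takes an already parsed config; `extract_models` is a hypothetical name, and the config is written as a plain dict so the example needs no YAML parser.

```python
def extract_models(data):
    """Return the source model names from a parsed mergekit config."""
    if "models" in data:
        # Configs with a top-level `models` list: keep entries that carry parameters.
        return [m["model"] for m in data["models"] if "parameters" in m]
    if "parameters" in data:
        # Configs like the SLERP one in this notebook: one slice, several sources.
        return [source["model"] for source in data["slices"][0]["sources"]]
    if "slices" in data:
        # Configs with several slices: first source of each slice.
        return [s["sources"][0]["model"] for s in data["slices"]]
    raise ValueError("No models or slices found in config")


# The SLERP config from the first cell, as it looks after parsing.
config = {
    "slices": [
        {
            "sources": [
                {"model": "OpenPipe/mistral-ft-optimized-1218", "layer_range": [0, 32]},
                {"model": "HyperbeeAI/Tulpar-7b-v2", "layer_range": [0, 32]},
            ]
        }
    ],
    "merge_method": "slerp",
    "base_model": "OpenPipe/mistral-ft-optimized-1218",
    "parameters": {"t": [{"value": 0.5}]},
    "dtype": "bfloat16",
}

print(extract_models(config))
```

Note that in the first case a model listed without a `parameters` key is dropped from the card, which matches the filter in the cell above.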