# Merging models

This notebook shows how to create your own mixture of experts with mergekit.

I used the following code to create kaitchup/Maixtchup-4x7b, which uses four different Mistral 7B models as experts.



In [1]:
!git clone -b mixtral https://github.com/cg123/mergekit.git
!cd mergekit && pip install -e .
!pip install --upgrade transformers



In [None]:
merge_config = """
base_model: mistralai/Mistral-7B-Instruct-v0.2
dtype: float16
gate_mode: cheap_embed
experts:
  - source_model: mlabonne/AlphaMonarch-7B
    positive_prompts:
    - "chat"
    - "assistant"
    - "tell me"
    - "explain"
    - "I want"
  - source_model: beowolx/CodeNinja-1.0-OpenChat-7B
    positive_prompts:
    - "code"
    - "python"
    - "javascript"
    - "programming"
    - "algorithm"
"""

with open('config.yaml', 'w') as f:
    f.write(merge_config)
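
If you want to try different sets of experts, it can be convenient to generate the YAML above from Python data rather than editing the string by hand. The following is a minimal sketch, not part of mergekit: `build_moe_config` is a hypothetical helper, and requiring at least one positive prompt per expert is my own assumption. It only reproduces the fields used in the config above.

```python
import json


def build_moe_config(base_model, experts, dtype="float16", gate_mode="cheap_embed"):
    """Return YAML text with the same fields as the config above.

    `experts` is a list of (source_model, positive_prompts) pairs.
    Hypothetical helper for illustration; it is not a mergekit API.
    """
    lines = [
        f"base_model: {base_model}",
        f"dtype: {dtype}",
        f"gate_mode: {gate_mode}",
        "experts:",
    ]
    for source_model, prompts in experts:
        if not prompts:
            # Assumption: every expert should carry at least one prompt.
            raise ValueError(f"expert {source_model} has no positive prompts")
        lines.append(f"  - source_model: {source_model}")
        lines.append("    positive_prompts:")
        # json.dumps produces a double-quoted string, which is also valid YAML.
        lines.extend(f"    - {json.dumps(p)}" for p in prompts)
    return "\n".join(lines) + "\n"


config_text = build_moe_config(
    "mistralai/Mistral-7B-Instruct-v0.2",
    [
        ("mlabonne/AlphaMonarch-7B",
         ["chat", "assistant", "tell me", "explain", "I want"]),
        ("beowolx/CodeNinja-1.0-OpenChat-7B",
         ["code", "python", "javascript", "programming", "algorithm"]),
    ],
)
print(config_text)
```

The resulting `config_text` can be written to `config.yaml` exactly like `merge_config` above.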

The merge itself only requires a CPU, but you will need a lot of free disk space because all the experts have to be downloaded.
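
To get a feel for how much space "a lot" is, here is a rough back-of-the-envelope sketch. It assumes each checkpoint has about 7.24 billion parameters stored in 16 bits (2 bytes each); both numbers are assumptions on my part, the function names are hypothetical, and the estimate covers only the downloads. The merged model written to the output directory needs additional space on top of this.

```python
import shutil


def estimate_download_gb(n_models, params_billion=7.24, bytes_per_param=2):
    """Rough total size in GB of n_models checkpoints.

    params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param per model.
    """
    return n_models * params_billion * bytes_per_param


def free_disk_gb(path="."):
    """Free space in GB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


needed = estimate_download_gb(3)  # base model + the two experts in the config
available = free_disk_gb(".")
print(f"Estimated downloads: ~{needed:.1f} GB, free space: {available:.1f} GB")
if available < needed:
    print("Warning: probably not enough free disk space for the downloads alone.")
```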

In [None]:
!mergekit-moe config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle --trust-remote-code

The output stream was truncated and contains only the last 5000 lines.
model-00007-of-00008.safetensors:  64% 1.27G/1.98G [00:32<00:22, 31.5MB/s]
model-00001-of-00008.safetensors:  67% 1.26G/1.89G [00:32<00:20, 31.3MB/s]
model-00002-of-00008.safetensors:  65% 1.26G/1.95G [00:32<00:21, 31.5MB/s]
model-00005-of-00008.safetensors:  64% 1.27G/1.98G [00:32<00:22, 31.2MB/s]
model-00004-of-00008.safetensors:  68% 1.33G/1.95G [00:32<00:19, 31.6MB/s]
model-00006-of-00008.safetensors:  65% 1.26G/1.95G [00:32<00:21, 32.0MB/s]
model-00003-of-00008.safetensors:  64% 1.27G/1.98G [00:32<00:23, 30.7MB/s]

This is the code I used to test inference and push the model to the HF Hub. For this part, you will need at least 24 GB of GPU VRAM.

# Inference

In [2]:
!pip install bitsandbytes accelerate



In [1]:
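from transformers import AutoTokenizer, BitsAndBytesConfig
import transformers
import torch

# Directory written by mergekit-moe. To test your own model, replace this value with the model directory path.
model_path = "./merge"

# Load the weights in 4-bit through bitsandbytes. Passing `load_in_4bit` directly is deprecated.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_path)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_path,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.float16, "quantization_config": quantization_config},
)

# Format the conversation with the tokenizer's chat template, then sample a reply.
messages = [{"role": "user", "content": "Do you know how to cook pasta?"}]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])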

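
To make the `apply_chat_template` step less opaque, here is a small sketch of what the prompt string looks like for a single user message. It assumes the merged model uses the chat template of the base model, Mistral-7B-Instruct-v0.2, as copied by `--copy-tokenizer`; that is an assumption, and `format_single_turn` is a hypothetical helper rather than a transformers function. Always prefer the tokenizer's own template in real code.

```python
def format_single_turn(user_message, bos_token="<s>"):
    """Approximate the Mistral instruct prompt for one user message.

    Hypothetical illustration only; the tokenizer's chat template
    remains the source of truth.
    """
    return f"{bos_token}[INST] {user_message} [/INST]"


prompt_text = format_single_turn("Do you know how to cook pasta?")
print(prompt_text)
# <s>[INST] Do you know how to cook pasta? [/INST]
```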

# Push to HF Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
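# The loaded model object lives on the pipeline, so push it from there.
# Put your Hub repository ID between the quotes.
pipeline.model.push_to_hub("")
tokenizer.push_to_hub("")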