
Support for Mistral models #4111

Closed
kalomaze opened this issue Sep 27, 2023 · 18 comments
Labels: enhancement (New feature or request), stale

Comments

@kalomaze
Contributor

kalomaze commented Sep 27, 2023

A new set of 7B foundation models that are claimed to beat all 13B Llama 2 models on benchmarks.

https://huggingface.co/mistralai/Mistral-7B-v0.1
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

I am unsure whether anything special about the tokenizer, or about how the context extension is designed, makes this a unique challenge, since this is not a typical Llama 2 training run.

@kalomaze added the enhancement (New feature or request) label on Sep 27, 2023
@SrVill

SrVill commented Sep 27, 2023

And please don't forget to add the Mistral instruction template.

@rahimnathwani

rahimnathwani commented Sep 28, 2023

The requirements for the instruction template seem to be:

  • <s> at the beginning of the conversation, but not later.
  • [INST] before the user message
  • [/INST] before the bot message
  • Bot ends each message with </s>

I'm using Mistral right now with the settings in the attached screenshot. I put the <s> part in the 'Command for chat-instruct mode' box, but that is not saved as part of the instruction template.

[screenshot: chat settings with the Mistral instruction template]

I'm not sure how to handle the </s> part, but it doesn't show up as part of the bot's responses, at least in chat-instruct mode with the cai-chat theme.

That sentence-end character also doesn't show up in the notebook view, e.g.

<s>[INST]Tell me the name of Mary J Blige's first album[/INST] The name of Mary J. Blige's first studio album is "What's the 411?" It was released on August 26, 1992, by Puffy Records and became her debut solo album after previously recording with the group Children of the Corn.[INST]Tell me more about that group[/INST] Children of the Corn were an American hip hop group composed of Mary J. Blige, K-Ci and K-Poke, DJ Clue, and Redman. They formed in New York City during the mid-1980s and gained popularity with their debut single "I Can't Quit You Baby" which became a dance club hit in 1987. The group continued to release music throughout the 1990s, including several successful albums, before disbanding in 2000.
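
Putting the requirements above together, here is a minimal sketch of how the raw prompt could be assembled (this is not the webui's code; the helper name and the turns structure are invented for illustration):

# Sketch: assemble a Mistral-Instruct prompt from the rules listed above.
def build_mistral_prompt(turns):
    """turns: list of (user_message, bot_message_or_None) pairs, oldest first."""
    prompt = "<s>"  # BOS appears once, at the start of the conversation
    for user_msg, bot_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if bot_msg is not None:
            prompt += f" {bot_msg}</s>"  # each completed bot turn ends with EOS
    return prompt

print(build_mistral_prompt([
    ("Tell me the name of Mary J Blige's first album", '"What\'s the 411?"'),
    ("Tell me more about that group", None),  # last turn: awaiting the model's reply
]))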

@travco

travco commented Sep 29, 2023

This should work as a YAML instruction template for Mistral-Instruct. Mistral also appears to work out of the box if you use a GGUF file with llama.cpp as the loader. I'm not 100% sure whether llama.cpp is properly doing sliding-window and grouped-query attention (I haven't kept up with the state of that as much as I'd like), but it will take a larger (8K+) context window without breaking.

user: '[INST] '
bot: ' [/INST]'
turn_template: <s><|user|><|user-message|><|bot|>\n<|bot-message|></s>\n

The sentence-end character won't show up on the first "turn" if you export the chat to notebook view, but will be placed before the next chat "turn" when you send it.
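
For reference, substituting the user and bot strings into that turn_template produces a rendered turn like the one below (a sketch of the placeholder substitution; the example messages are invented):

# Sketch: expand the turn_template above by literal placeholder substitution.
template = "<s><|user|><|user-message|><|bot|>\n<|bot-message|></s>\n"
user_prefix, bot_prefix = "[INST] ", " [/INST]"

rendered = (template
            .replace("<|user|>", user_prefix)
            .replace("<|bot|>", bot_prefix)
            .replace("<|user-message|>", "Tell me about Mistral 7B.")
            .replace("<|bot-message|>", "Mistral 7B is a 7-billion-parameter model..."))
print(rendered)
# <s>[INST] Tell me about Mistral 7B. [/INST]
# Mistral 7B is a 7-billion-parameter model...</s>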

@fat-tire

fat-tire commented Oct 1, 2023

Just to pipe in here-- TheBloke/Mistral-7B-Instruct-v0.1-AWQ seems to work alright with ooba. There is some occasional discontinuity between the question I asked and the answer. Sometimes it seems to answer questions from earlier and sometimes it gets answers factually wrong... but it works.

@DeSinc

DeSinc commented Oct 3, 2023

> Just to pipe in here-- TheBloke/Mistral-7B-Instruct-v0.1-AWQ seems to work alright with ooba. There is some occasional discontinuity between the question I asked and the answer. Sometimes it seems to answer questions from earlier and sometimes it gets answers factually wrong... but it works.

How did you get it to work? I tried TheBloke/dolphin-2.0-mistral-7B-AWQ and it gives the explicit message that it's not supported:

TypeError: mistral isn't supported yet.

I updated ooba and tried that specific model too, and got the same error.

@fat-tire

fat-tire commented Oct 3, 2023

Apply PR #3999 and it should work. I use n_batch set to 1 and check no_inject_fused_attention to avoid memory errors.
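
For context, no_inject_fused_attention corresponds (as far as I can tell) to loading the AWQ model without fused layers. Loading the same model directly with the AutoAWQ Python API would look roughly like this (a sketch outside the webui; the model path is only an example):

# Sketch: load an AWQ-quantized Mistral without fused attention layers,
# which is roughly what checking no_inject_fused_attention does.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"  # example path
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)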

@lijunle

lijunle commented Oct 4, 2023

TheBloke just released the GPTQ version.

https://huggingface.co/TheBloke/Mistralic-7B-1-GPTQ

@anammari

anammari commented Oct 5, 2023

None of the GPTQ models have loaded successfully for me.

I've tried mistral_7b_instruct_v0.1_gptq and mistralic_7b_1_gptq and tried all available loaders.

I then applied PR #3999 and downloaded the mistral_7b_instruct_v0.1_awq model, which loaded successfully with the AutoAWQ loader. However, the LLM stays unresponsive in both chat and chat-instruct modes.

The conda console also throws an error after I submit a prompt in chat:

Traceback (most recent call last):
  File "C:\Users\Ahmad\github\text-generation-webui\modules\callbacks.py", line 56, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "C:\Users\Ahmad\github\text-generation-webui\modules\text_generation.py", line 347, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\Ahmad\.conda\envs\textgen\lib\site-packages\awq\models\base.py", line 36, in generate
    return self.model.generate(*args, **kwargs)
  File "C:\Users\Ahmad\.conda\envs\textgen\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Ahmad\.conda\envs\textgen\lib\site-packages\transformers\generation\utils.py", line 1652, in generate
    return self.sample(
  File "C:\Users\Ahmad\.conda\envs\textgen\lib\site-packages\transformers\generation\utils.py", line 2770, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
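
That error means the sampler received NaN/inf probabilities from the model. A quick, generic way to check whether the loaded model is emitting bad logits (a diagnostic sketch of my own, not specific to this webui; it assumes model and tokenizer are already loaded as Hugging Face-style objects on the GPU):

# Diagnostic sketch: look for NaN/inf in the model's logits, which is what
# makes torch.multinomial fail as in the traceback above.
import torch

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
print("any NaN:", torch.isnan(logits).any().item())
print("any inf:", torch.isinf(logits).any().item())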

@Pozaza

Pozaza commented Oct 6, 2023

Solution 1 (speed will be about 2x slower):

I got it to work by modifying two files in the installed packages:

  • venv\Lib\site-packages\auto_gptq\modeling\_const.py
  • venv\Lib\site-packages\auto_gptq\modeling\auto.py

modified files: https://gist.github.com/Pozaza/c8335bbcbbd4a73dd3bec1a9644b6865

Solution 2:

  1. Activate your virtual environment
  2. pip install git+https://github.com/huggingface/transformers (a quick check that this worked is sketched right after this list)
  3. Start the webui and choose any ExLlama model loader
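
After step 2, this is one way to confirm that the installed transformers build actually includes the Mistral architecture (a simple check of my own, not from the thread):

# If this import fails, the installed transformers version predates Mistral support
# and the webui will keep reporting that mistral isn't supported.
import transformers
print(transformers.__version__)
from transformers import MistralForCausalLM  # present from transformers 4.34 onward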

@Thireus
Contributor

Thireus commented Oct 6, 2023

Mistral GPTQ models give me the following error when loaded with exllamav2:

RuntimeError: q_weight and gptq_qzeros have incompatible shapes

Anyone facing the same issue?

@InfernalWraith

TheBloke/Mistral-7B-OpenOrca-GPTQ worked for me with the ExLlamav2_HF model loader out of the box. It also worked with the Transformers model loader, though I got the following error and had to tick the disable_exllama option to get it to work:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.
You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

Using an RTX 3070, I get about 11.5 tokens/s with ExLlamav2_HF, whereas with Transformers I get about 4.5.
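
For reference, the disable_exllama option corresponds to the Transformers GPTQ quantization config mentioned in the error. Loading the model directly that way would look roughly like this (a sketch based on the Transformers API at the time; the model id is only an example):

# Sketch: load a GPTQ model with the ExLlama kernel disabled, which is what
# ticking disable_exllama maps to in the quantization config object.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Mistral-7B-OpenOrca-GPTQ"  # example
quant_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)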

@yquemener

For those who want to try @InfernalWraith's solution: if you don't have disable_exllama as an option, "auto-devices" does the trick as well.

@github-actions github-actions bot added the stale label Dec 11, 2023

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

@controldev

> Mistral GPTQ models give me the following error when loaded with exllamav2:
>
> RuntimeError: q_weight and gptq_qzeros have incompatible shapes
>
> Anyone facing the same issue?

This is still a problem.

@oliverban

I'm getting this as well? :(

@oldmanjk

Same problem here:

Traceback (most recent call last):
  File "/home/j/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/text-generation-webui/modules/models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/text-generation-webui/modules/models.py", line 380, in ExLlamav2_HF_loader
    return Exllamav2HF.from_pretrained(model_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/text-generation-webui/modules/exllamav2_hf.py", line 181, in from_pretrained
    return Exllamav2HF(config)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/j/text-generation-webui/modules/exllamav2_hf.py", line 50, in __init__
    self.ex_model.load(split)
  File "/home/j/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 266, in load
    for item in f: x = item
  File "/home/j/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 284, in load_gen
    module.load()
  File "/home/j/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 189, in load
    self.q_proj.load()
  File "/home/j/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/linear.py", line 55, in load
    self.q_handle = ext.make_q_matrix(w, self.temp_dq)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/ext.py", line 236, in make_q_matrix
    return ext_c.make_q_matrix(w["qweight"],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: q_weight and gptq_qzeros have incompatible shapes

@bryankruman

I'm also getting the same problem as many users above...
RuntimeError: q_weight and gptq_qzeros have incompatible shapes

@Ragnarok700

Also unable to load with error:
RuntimeError: q_weight and gptq_qzeros have incompatible shapes
