In short, I am getting a truncation length of 2048 instead of 8192 when loading the 8.0bpw h8 exl2 in ooba, but the full 8192 when loading the equivalent Q8_0 GGUF. Updating ooba did not help, though I wonder whether the exllamav2 components need a refresh to sync with upstream.
Below is the config.json of the loaded exl2; I manually corrected the calibration length from 2048 to 8192, but that did not resolve the issue. There is no apparent source for the 2048 truncation value at load time. The Q8_0 GGUF of the same model loads with the correct 8192 context length.
Currently: download the safetensors version, quantize with exllamav2 to 8.0bpw h8, then load in ooba.
Screenshot
No response
Logs
```
12:15:34-126265 INFO Loading "kunoichi-lemon-royale-7B-8.0bpw_h8_exl2"
12:15:47-116039 INFO LOADER: "ExLlamav2_HF"
12:15:47-117824 INFO TRUNCATION LENGTH: 2048
12:15:47-118828 INFO INSTRUCTION TEMPLATE: "Alpaca"
12:15:47-119831 INFO Loaded the model in 12.99 seconds.
```
System Info
Windows 11, Nvidia RTX 4060 Ti 16GB.
I've identified what I think is the issue: the `quantization_config` key added by more recent EXL2 conversions, which records the number of bits used to quantize the model. In EXL2 this value is a float, but TGW parses it as an int, which raises an exception in `update_model_parameters`. The function aborts before it reaches the `max_seq_len` key, so that key is never applied and the default value of 2048 is used instead.
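The abort described above can be sketched as follows. This is a simplified stand-in, not TGW's actual `update_model_parameters` code, and it assumes the bits value reaches the cast as a float-formatted string like `"8.0"` (a bare float would convert cleanly):

```python
# Simplified stand-in for the failure flow: if casting the 'wbits' value
# raises before 'max_seq_len' is processed, the default truncation length
# of 2048 is never overridden.
def apply_model_settings(config, defaults):
    settings = dict(defaults)
    try:
        for key, value in config.items():
            if key == "wbits":
                settings[key] = int(value)  # int("8.0") raises ValueError
            else:
                settings[key] = value
    except ValueError:
        pass  # the update is abandoned; remaining keys are never applied
    return settings

defaults = {"max_seq_len": 2048}
config = {"wbits": "8.0", "max_seq_len": 8192}  # EXL2 stores bits as a float
print(apply_model_settings(config, defaults)["max_seq_len"])  # -> 2048
```

With an integer-valued `wbits` (e.g. `"8"`), the loop completes and `max_seq_len` is set to 8192, which matches the behavior seen with other quant formats.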
A quick fix would be to change line 199 of `modules/models_settings.py` from:

```python
value = int(value)
```

to:

```python
value = int(float(value))
```
This seems to work here, but I don't have a clear picture of where else the `wbits` entry could be used, or whether a better fix would be to skip the `"quantization_config.bits"` -> `"wbits"` mapping for EXL2 models when reading the config. @oobabooga ?
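To illustrate the difference between the two casts (this is plain Python semantics, independent of TGW): `int()` rejects float-formatted strings, while routing through `float()` first accepts both forms.

```python
# int() raises ValueError on float-formatted strings; int(float()) does not.
def cast_naive(value):
    return int(value)          # behavior of the original line 199

def cast_fixed(value):
    return int(float(value))   # the proposed fix

try:
    cast_naive("8.0")
except ValueError as e:
    print("naive cast failed:", e)

print(cast_fixed("8.0"))  # -> 8
print(cast_fixed(8.0))    # -> 8
```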
Describe the bug
I've not uploaded the exl2 quant yet because of the issue, but the safetensors version is here:
https://huggingface.co/grimjim/kunoichi-lemon-royale-7B