exl2 of merge loads in with 2K truncation length instead of 8K for GGUF? #5750

Closed
1 task done
jim-plus opened this issue Mar 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jim-plus

Describe the bug

I've not uploaded the exl2 quant yet because of the issue, but the safetensors version is here:
https://huggingface.co/grimjim/kunoichi-lemon-royale-7B

In short, I am getting a truncation length of 2048 instead of 8192 when loading the 8.0bpw h8 exl2 in ooba, but I get the full 8192 when loading the Q8_0 GGUF equivalent. Updating ooba did not help, though I wonder if the exllamav2 components need a refresh to sync with upstream.

Below is the config.json of the loaded exl2. I manually corrected the calibration length from 2048 to 8192, but that did not resolve the issue, and there is no apparent source in the config for the 2048 truncation value at load time. The Q8_0 GGUF of the model correctly loads with 8192 context length.

{
    "_name_or_path": "SanjiWatsuki/Kunoichi-7B",
    "architectures": [
        "MistralForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "max_position_embeddings": 8192,
    "model_type": "mistral",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "sliding_window": 4096,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.38.2",
    "use_cache": true,
    "vocab_size": 32000,
    "quantization_config": {
        "quant_method": "exl2",
        "version": "0.0.16",
        "bits": 8.0,
        "head_bits": 6,
        "calibration": {
            "rows": 100,
            "length": 8192,
            "dataset": "(default)"
        }
    }
}

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Currently: download the safetensors version, quantize it with exllamav2 to 8.0bpw h8, then load it in ooba.

Screenshot

No response

Logs

12:15:34-126265 INFO     Loading "kunoichi-lemon-royale-7B-8.0bpw_h8_exl2"
12:15:47-116039 INFO     LOADER: "ExLlamav2_HF"
12:15:47-117824 INFO     TRUNCATION LENGTH: 2048
12:15:47-118828 INFO     INSTRUCTION TEMPLATE: "Alpaca"
12:15:47-119831 INFO     Loaded the model in 12.99 seconds.

System Info

Windows 11, Nvidia RTX4060Ti 16GB.
jim-plus added the bug (Something isn't working) label on Mar 25, 2024
@jim-plus (Author)

Also, I set max_seq_len to 8192 when loading the model, but that is not being respected.

@oldmanjk

Can confirm. Exl2 context definitely borked

@turboderp (Contributor)

I've identified what I think is the issue. The problem is the quantization_config key added by more recent EXL2 conversions, which lists the number of bits used to quantize the model. In EXL2 this value is a float, but TGW parses it as an int, leading to an exception in update_model_parameters. The function then aborts before it reaches the max_seq_len key, so that setting is never updated and the default value of 2048 is used instead.
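
For illustration, here is a minimal sketch of the failure, assuming the bits value reaches that cast as the string "8.0" rather than a numeric type (an assumption about how TGW passes the setting along):

    value = "8.0"
    int(value)         # raises ValueError: invalid literal for int() with base 10: '8.0'
    int(float(value))  # returns 8: parse as float first, then truncate to int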

A quick fix would be to change line 199 of modules/models_settings.py from:

            value = int(value)

to:

            value = int(float(value))

This seems to work here, but I don't have a clear picture of where else the wbits entry could be used, or whether a better approach would be to not map "quantization_config.bits" -> "wbits" for EXL2 models when reading the config. @oobabooga ?
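
If the second approach were preferred, a hypothetical sketch of skipping that mapping might look like the following (the config and model_settings names are placeholders, not the actual models_settings.py code):

    # Only map quantization_config.bits -> wbits when the quant method is not
    # EXL2, since EXL2 stores "bits" as a float such as 8.0 rather than an
    # integer bit width.
    quant_config = config.get('quantization_config', {})
    if quant_config.get('quant_method') != 'exl2' and 'bits' in quant_config:
        model_settings['wbits'] = quant_config['bits']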
