In short, I am getting a truncation length of 2048 instead of 8192 when loading the 8.0bpw h8 exl2 in ooba, but the full 8192 when loading the equivalent Q8_0 GGUF. Updating ooba did not help, though I wonder whether the exllamav2 components need a refresh to sync with upstream.
Below is the config.json of the loaded exl2; I manually corrected the calibration length from 2048 to 8192, but that did not resolve the issue. There is no apparent source for the 2048 truncation value at load time. The Q8_0 GGUF of the same model loads with the correct 8192 context length.
Currently: download the safetensors version, quantize with exllamav2 to 8.0bpw h8, then load in ooba.
Screenshot
No response
Logs
```
12:15:34-126265 INFO Loading "kunoichi-lemon-royale-7B-8.0bpw_h8_exl2"
12:15:47-116039 INFO LOADER: "ExLlamav2_HF"
12:15:47-117824 INFO TRUNCATION LENGTH: 2048
12:15:47-118828 INFO INSTRUCTION TEMPLATE: "Alpaca"
12:15:47-119831 INFO Loaded the model in 12.99 seconds.
```
System Info
Windows 11, Nvidia RTX 4060 Ti 16GB.
I've identified what I think is the issue: the `quantization_config` key added by more recent EXL2 conversions, which records the number of bits used to quantize the model. In EXL2 this value is a float, but TGW parses it as an int, which raises an exception in `update_model_parameters`. The function aborts before it reaches the `max_seq_len` key, so that key is never applied and the default value of 2048 is used instead.
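The abort described above can be sketched as follows. This is a simplified stand-in, not TGW's actual `update_model_parameters` code, and it assumes the bits value reaches the cast as a float-formatted string like `"8.0"` (a bare float would convert cleanly):

```python
# Simplified stand-in for the failure flow: if casting the 'wbits' value
# raises before 'max_seq_len' is processed, the default truncation length
# of 2048 is never overridden.
def apply_model_settings(config, defaults):
    settings = dict(defaults)
    try:
        for key, value in config.items():
            if key == "wbits":
                settings[key] = int(value)  # int("8.0") raises ValueError
            else:
                settings[key] = value
    except ValueError:
        pass  # the update is abandoned; remaining keys are never applied
    return settings

defaults = {"max_seq_len": 2048}
config = {"wbits": "8.0", "max_seq_len": 8192}  # EXL2 stores bits as a float
print(apply_model_settings(config, defaults)["max_seq_len"])  # -> 2048
```

With an integer-valued `wbits` (e.g. `"8"`), the loop completes and `max_seq_len` is set to 8192, which matches the behavior seen with other quant formats.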
A quick fix would be to change line 199 of `modules/models_settings.py` from:

```python
value = int(value)
```

to:

```python
value = int(float(value))
```
This seems to work here, but I don't have a clear picture of where else the `wbits` entry could be used, or whether a better fix would be to skip the `"quantization_config.bits"` -> `"wbits"` mapping for EXL2 models when reading the config. @oobabooga ?
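To illustrate the difference between the two casts (this is plain Python semantics, independent of TGW): `int()` rejects float-formatted strings, while routing through `float()` first accepts both forms.

```python
# int() raises ValueError on float-formatted strings; int(float()) does not.
def cast_naive(value):
    return int(value)          # behavior of the original line 199

def cast_fixed(value):
    return int(float(value))   # the proposed fix

try:
    cast_naive("8.0")
except ValueError as e:
    print("naive cast failed:", e)

print(cast_fixed("8.0"))  # -> 8
print(cast_fixed(8.0))    # -> 8
```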
Describe the bug
I've not uploaded the exl2 quant yet because of the issue, but the safetensors version is here:
https://huggingface.co/grimjim/kunoichi-lemon-royale-7B