
server.py not starting with GPTQ latest git 534edc7 #445

Closed
1 task done
alexl83 opened this issue Mar 19, 2023 · 34 comments
Labels
bug Something isn't working

Comments

@alexl83

alexl83 commented Mar 19, 2023

Describe the bug

Launching the latest text-generation-webui code with the latest qwopqwop200/GPTQ-for-LLaMa throws a Python error:


Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
    import opt
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
    model = load_quant(args.model, args.load, args.wbits, args.groupsize))
                                                                         ^
SyntaxError: unmatched ')'
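
For reference, the traceback points at a line in the upstream opt.py with one closing parenthesis too many; the minimal correction (the one implied by "after correcting SyntaxError" further down in this thread) would be:

```python
# repositories/GPTQ-for-LLaMa/opt.py, at the line cited in the traceback
# (sketch of the one-line fix only; the rest of the file is unchanged)
model = load_quant(args.model, args.load, args.wbits, args.groupsize)  # trailing ')' removed
```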

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

cd text-generation-webui/repositories/GPTQ-for-LLaMa
git pull
pip install -r requirements.txt
python setup_cuda.py install
cd ../..
python server.py --auto-devices --gpu-memory 16 --gptq-bits 4 --cai-chat --listen --extensions gallery llama_prompts --model llama-30b --settings ~/oobabooga/settings.json

Screenshot

No response

Logs

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
    import opt
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
    model = load_quant(args.model, args.load, args.wbits, args.groupsize))
                                                                         ^
SyntaxError: unmatched ')'


System Info

Ryzen 7700X
RTX 4090
Ubuntu 22.10 amd64
micromamba environment
python 3.10.9
pytorch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1

alexl83 added the bug Something isn't working label Mar 19, 2023
alexl83 changed the title from "server.py not starting with GPQT latest git 534edc7" to "server.py not starting with GPTQ latest git 534edc7" Mar 19, 2023
@alexl83
Author

alexl83 commented Mar 19, 2023

As it seems, 'load_quant()' in 'modules/GPTQ_loader.py' needs to pass one more (new) positional argument to qwopqwop200/GPTQ-for-LLaMa: 'groupsize'

After correcting the SyntaxError, here's the trace:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'

@MillionthOdin16

Change model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits) to
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)

According to the argument documentation, -1 sets the default group size.

@alexl83
Author

alexl83 commented Mar 19, 2023

Thanks, passing the value triggers another exception:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 246, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.q_proj.qzeros", ...
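
One way to tell whether a given .pt checkpoint was produced by the old or the new GPTQ-for-LLaMa format is to look for the qzeros entries that the error above says are missing. A minimal sketch (the checkpoint path is just an example; note this loads the whole file into RAM):

```python
import torch

# Adjust the path to your own 4-bit checkpoint.
sd = torch.load("models/llama-30b-4bit.pt", map_location="cpu")

has_qzeros = any(k.endswith(".qzeros") for k in sd.keys())
print("contains qzeros (new group-size format):", has_qzeros)
print("sample keys:", list(sd.keys())[:5])
```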

@MillionthOdin16

MillionthOdin16 commented Mar 19, 2023

Yeah, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version, without any of today's changes, and it works fine. I saw the same error as you before the rollback.

@alexl83
Author

alexl83 commented Mar 19, 2023

Yeah, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version, without any of today's changes, and it works fine.

Will do the same for now; I'd be curious to understand whether re-quantizing the models with today's code would fix the loading.
Thanks for helping out! :)

@RedTopper

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

@MillionthOdin16

I actually don't know anymore... It seems like it might be more broken than I thought. I'm using the pre-quantized models from HF, so you might be right about versions, Alex.

(text-generation-webui) PS text-generation-webui> python server.py --model llama-7b --load-in-4bit  --auto-devices   
Warning: --load-in-4bit is deprecated and will be removed. Use --gptq-bits 4 instead.

Loading llama-7b...
Loading model ...
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 2.71 seconds.
text-generation-webui\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "text-generation-webui\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "text-generation-webui\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "text-generation-webui\modules\callbacks.py", line 65, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "text-generation-webui\modules\text_generation.py", line 199, in generate_with_callback
    shared.model.generate(**kwargs)
  File "text-generation-webui\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
    return self.sample(
  File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
    outputs = self(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None

Invoked with: tensor([[ 0.0436, -0.0149,  0.0150,  ...,  0.0267,  0.0112, -0.0011],
        [ 0.0032, -0.0213,  0.0215,  ...,  0.0320, -0.0013, -0.0199],
        [-0.0021,  0.0065, -0.0123,  ...,  0.0199, -0.0018, -0.0081],
        ...,
        [ 0.0074,  0.0389,  0.0164,  ..., -0.0429, -0.0018, -0.0133],
        [ 0.0305,  0.0061,  0.0262,  ...,  0.0096,  0.0096,  0.0033],
        [-0.0431, -0.0260,  0.0012,  ...,  0.0075, -0.0076, -0.0037]],
       device='cuda:0'), tensor([[ 2004248423,  2020046951,  1734903431,  ..., -2024113529,
         -1772648858,  1988708488],
        [ 2004318071,  1985447543,  1719101303,  ...,  1738958728,
          1734834296,  1988584549],
        [-2006481289, -2038991241,  2003200134,  ..., -1734780278,
         -2055714936, -1401572265],
        ...,
        [-2022213769, -2021226889,  1735947895,  ...,  2002357398,
          1483176039, -1215859063],
        [ 2005366614, -2022148249,  1752733576,  ...,   394557864,
          1986418055,  1483962710],
        [ 1735820935,  1988720743, -2056755593,  ..., -1468438152,
          1718123383,  1150911352]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0318],
        [0.0154],
        [0.0123],
        ...,
        [0.0191],
        [0.0206],
        [0.0137]], device='cuda:0'), tensor([[0.2229],
        [0.1079],
        [0.0860],
        ...,
        [0.1529],
        [0.1439],
        [0.0960]], device='cuda:0')
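
Reading the TypeError above: the freshly compiled quant_cuda kernel now expects six arguments for vecquant4matmul (five tensors plus an int, presumably the new group size), while the quant.py forward pass still passes five tensors, so the compiled extension and the Python code calling it are out of sync. One way to check which signature the installed kernel actually exports (a sketch; it assumes the extension was built and installed as quant_cuda by setup_cuda.py):

```python
# Inspect the pybind11-generated signature of the installed GPTQ CUDA kernel.
import quant_cuda

help(quant_cuda.vecquant4matmul)
# If this shows six parameters (..., arg5: int) while the checked-out quant.py
# only passes five, the kernel and the Python sources come from different commits.
```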

@MillionthOdin16

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

@oobabooga
Owner

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

This solves it for me.

This bug report is in the wrong repository, by the way. You should tell @qwopqwop200 about it.

@RedTopper

RedTopper commented Mar 19, 2023

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

Yes, it's working for me with that specific commit.

Specifically, it's set up like this right now: https://github.com/RedTopper/Text-Generation-Webui-Podman/blob/main/Containerfile#L14-L15

@MillionthOdin16

Awesome. Thanks

@alexl83
Author

alexl83 commented Mar 19, 2023

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
It's what I'm using for my container at the moment.

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

Prediction is broken for me too with yesterday's commit:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Done.
Loaded the model in 6.81 seconds.
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 65, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 201, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 198, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
    1. (arg0: at::Tensor, arg1: at::Tensor, arg2: at::Tensor, arg3: at::Tensor, arg4: at::Tensor, arg5: int) -> None

Invoked with: tensor([[-0.0500, -0.0130, -0.0012,  ...,  0.0039, -0.0046, -0.0232],
        [-0.0420,  0.0025, -0.0313,  ..., -0.0309,  0.0211, -0.0179],
        [-0.0116,  0.0273,  0.0387,  ...,  0.0043, -0.0025,  0.0179],
        ...,
        [-0.0071, -0.0465, -0.0059,  ...,  0.0018,  0.0062, -0.0076],
        [-0.0218,  0.0511, -0.0048,  ...,  0.0093,  0.0003,  0.0119],
        [ 0.0235, -0.0288, -0.0288,  ..., -0.0232, -0.0172,  0.0103]],
       device='cuda:0'), tensor([[ 1719302009,  2004449128,  1234793881,  ..., -2019973256,
         -1502063032,  2037938296],
        [ 2019915367,  2004252535,  1750500728,  ..., -1736926794,
           965175426, -1465341558],
        [-1753778313, -2005497737, -1215805527,  ..., -2005514360,
          1450617205, -2020972629],
        ...,
        [ 2005431670,  1701348758,  1790806215,  ..., -1967744889,
          1970501769,  2055776885],
        [ 1718114184,  1970689672,  1183483512,  ...,  2053671319,
         -1752840856,  1570348373],
        [ 1734838390,  2022205543,  1734843030,  ..., -1737918327,
          2002028378, -1500927849]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0111],
        [0.0150],
        [0.0077],
        ...,
        [0.0194],
        [0.0119],
        [0.0131]], device='cuda:0'), tensor([[0.0779],
        [0.1051],
        [0.0613],
        ...,
        [0.1551],
        [0.0830],
        [0.1045]], device='cuda:0')

@MillionthOdin16

I wonder if they are actually testing on a quantized model, or a non-quantized one. I don't know where to go from here haha

@alexl83
Author

alexl83 commented Mar 19, 2023

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda.py install

Today's changes break things however

@iChristGit

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda_py install

Today's changes break things however

I also have the same issue; the last line in your reply isn't working.

@alexl83
Author

alexl83 commented Mar 19, 2023

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda_py install

Today's changes break things however

I also have the same issue; the last line in your reply isn't working.

fixed typo:
python install_cuda.py install

@RedTopper

I 'fixed' inference by: <snip>

That would make sense: you also need to rebuild the CUDA package with the .cpp files from that commit. The container starts fresh with each build, so the compiled version always matches the Python code used in the repo.

@MillionthOdin16

Awesome! Worked for me too. I completely forgot to rebuild the kernel -_-

@alexl83
Author

alexl83 commented Mar 19, 2023

In any case, I reported qwopqwop200/GPTQ-for-LLaMa#62 to qwopqwop200/GPTQ-for-LLaMa

@alexl83
Author

alexl83 commented Mar 19, 2023

qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code

I'll test and report back ;-)

@oobabooga
Owner

qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code

@zoidbb help?

@alexl83
Author

alexl83 commented Mar 19, 2023

To sum up:

- latest GPTQ-for-LLaMa code
- re-quantized HF LLaMA model(s) to 4-bit GPTQ
- changed modules/GPTQ_loader.py from
  model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
  to
  model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)

Works for me, tested with LLaMA-7B and LLaMA-13B.
Tomorrow I'm going to re-quantize 30B/65B.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 20, 2023

So this is why I couldn't load the models after I fixed the ) bug.

But now we can quantize with different group sizes. Which one is best for performance and coherence? I hate that I have to redo this, btw.

@terbo

terbo commented Mar 20, 2023

Re-quantize means running python llama.py ..\..\models\llama-13b-hf c4 --wbits 2 --groupsize 128 --save ..\..\models\llama13b-2bit.pt from GPTQ-for-LLaMa?

This requires a ton of VRAM; I have two 8GB cards, but it only maxes out one card's memory.
How can this be done locally? I previously downloaded the decapoda-research files.

Edit: nvm, found a 13B model with the LoRA integrated that loads.

@satvikpendem

@alexl83 Would you be able to host the fixed quantized files somewhere, perhaps on Hugging Face?

@jllllll
Contributor

jllllll commented Mar 20, 2023

When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.

@oobabooga
Owner

I recommend using the previous GPTQ commit for now:

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
python setup_cuda.py install

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation

@xNul
Contributor

xNul commented Mar 21, 2023

When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.

I noticed this as well. I was going off of the Reddit thread at the time, but I guess it is wrong.

@KnoBuddy

KnoBuddy commented Mar 21, 2023

I keep getting: "CUDA Extension not installed." I'm on Windows 11 native. I have used the older commit (git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4) of GPTQ and made sure to install the .whl correctly. CUDA is certainly installed: running import torch; torch.cuda.is_available() in Python returns True.

This is my first time installing LLaMA, so I'm not sure if this is just a perfect storm of changes happening or what. It appears that GPTQ_loader.py was changed yesterday to "model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)" (see post), and yet it still doesn't seem to work with the current branch of GPTQ.

Something about re-quantization too? No idea what my issue is. I'm sure there is a whole lot more I'm missing since I'm just now diving in today.
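
Note: "CUDA Extension not installed." is printed by GPTQ-for-LLaMa itself when its compiled quant_cuda module fails to import; torch.cuda.is_available() being True only means PyTorch sees the GPU, not that the GPTQ kernel is installed. A quick sanity check from the same environment (a sketch; quant_cuda is the module name that setup_cuda.py / the wheel installs):

```python
import torch
print("torch sees CUDA:", torch.cuda.is_available())

try:
    import quant_cuda  # the extension built by GPTQ-for-LLaMa's setup_cuda.py or the .whl
    print("quant_cuda found at:", quant_cuda.__file__)
except ImportError as err:
    # This failing import is what leads to the "CUDA Extension not installed." message.
    print("quant_cuda missing or broken:", err)
```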

@xNul
Contributor

xNul commented Mar 21, 2023

@KnoBuddy if you delete your environment and files and roll back text-generation-webui to two days ago, these instructions I made should work for you. You might be able to replace the python setup_cuda.py install line with installing the .whl. If installing the .whl doesn't work, then try the python setup_cuda.py install line. If that returns some compiler-missing error, you need to install VS BuildTools like I mention in the instructions.

@jllllll
Contributor

jllllll commented Mar 22, 2023

@KnoBuddy "CUDA Extension not installed." is specifically referring to GPTQ-for-LLaMa. I've had this issue before after installing an outdated wheel. I uploaded a Windows wheel yesterday, along with the batch script that I use to install everything above that:
#457 (comment)
Maybe that will work for you; if not, I can try compiling a new wheel, but that wheel should work. If you use the batch script, make sure not to run it as admin. If you have issues with permissions and need to run it as admin, add a cd /D command pointing to your current directory just after the first call line. Also, make sure to install the .whl file while it is inside the GPTQ-for-LLaMa folder. I've had issues with it not installing properly outside that folder.

@xNul
Contributor

xNul commented Mar 22, 2023

@jllllll what does installing ninja before CUDA compilation do?

@jllllll
Contributor

jllllll commented Mar 22, 2023

@jllllll what does installing ninja before CUDA compilation do?

When doing the compilation without ninja, there is a message saying that the compilation would be faster with ninja. I don't notice much difference, but I install it anyway.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 22, 2023

ninja sets compile time parameters.
