
server.py not starting with GPTQ latest git 534edc7 #445

Closed
1 task done
alexl83 opened this issue Mar 19, 2023 · 34 comments
Labels
bug Something isn't working

Comments

@alexl83

alexl83 commented Mar 19, 2023

Describe the bug

Launching the latest text-generation-webui code with the latest qwopqwop200/GPTQ-for-LLaMa throws a Python error:


Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
    import opt
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
    model = load_quant(args.model, args.load, args.wbits, args.groupsize))
                                                                         ^
SyntaxError: unmatched ')'
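
For reference, the traceback points at a line in the upstream opt.py with one closing parenthesis too many; the minimal correction (the one implied by "after correcting SyntaxError" further down in this thread) would be:

```python
# repositories/GPTQ-for-LLaMa/opt.py, at the line cited in the traceback
# (sketch of the one-line fix only; the rest of the file is unchanged)
model = load_quant(args.model, args.load, args.wbits, args.groupsize)  # trailing ')' removed
```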

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

cd text-generation-webui/repositories/GPTQ-for-LLaMa
git pull
pip install -r requirements.txt
python setup_cuda.py install
cd ../..
python server.py --auto-devices --gpu-memory 16 --gptq-bits 4 --cai-chat --listen --extensions gallery llama_prompts --model llama-30b --settings ~/oobabooga/settings.json

Screenshot

No response

Logs

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 98, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 11, in <module>
    import opt
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/opt.py", line 424
    model = load_quant(args.model, args.load, args.wbits, args.groupsize))
                                                                         ^
SyntaxError: unmatched ')'


System Info

Ryzen 7700X
RTX 4090
Ubuntu 22.10 amd64
micromamba environment
python 3.10.9
pytorch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1

alexl83 added the bug Something isn't working label Mar 19, 2023
alexl83 changed the title from "server.py not starting with GPQT latest git 534edc7" to "server.py not starting with GPTQ latest git 534edc7" Mar 19, 2023
@alexl83
Author

alexl83 commented Mar 19, 2023

As it seems, 'load_quant()' in 'modules/GPTQ_loader.py' needs to pass one more (new) positional argument to qwopqwop200/GPTQ-for-LLaMa: 'groupsize'

After correcting the SyntaxError, here's the trace:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'

@MillionthOdin16

Change model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits) to
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)

According to the argument documentation, -1 sets the default group size.

@alexl83
Author

alexl83 commented Mar 19, 2023

Thanks, passing the value triggers another exception:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "/home/alex/oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 55, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 246, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.q_proj.qzeros", ...
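
One way to tell whether a given .pt checkpoint was produced by the old or the new GPTQ-for-LLaMa format is to look for the qzeros entries that the error above says are missing. A minimal sketch (the checkpoint path is just an example; note this loads the whole file into RAM):

```python
import torch

# Adjust the path to your own 4-bit checkpoint.
sd = torch.load("models/llama-30b-4bit.pt", map_location="cpu")

has_qzeros = any(k.endswith(".qzeros") for k in sd.keys())
print("contains qzeros (new group-size format):", has_qzeros)
print("sample keys:", list(sd.keys())[:5])
```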

@MillionthOdin16

MillionthOdin16 commented Mar 19, 2023

Yeah, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version, without any of today's changes, and it works fine. I saw the same error as you before the rollback.

@alexl83
Author

alexl83 commented Mar 19, 2023

Yeah, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version, without any of today's changes, and it works fine.

Will do the same for now; I'd be curious to understand whether re-quantizing the models with today's code would fix the loading.
Thanks for helping out! :)

@RedTopper

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

@MillionthOdin16

I actually don't know anymore... It seems like it might be more broken than I thought. I'm using the pre-quantized models from HF, so you might be right about versions, Alex.

(text-generation-webui) PS text-generation-webui> python server.py --model llama-7b --load-in-4bit  --auto-devices   
Warning: --load-in-4bit is deprecated and will be removed. Use --gptq-bits 4 instead.

Loading llama-7b...
Loading model ...
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 2.71 seconds.
text-generation-webui\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "text-generation-webui\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "text-generation-webui\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "text-generation-webui\modules\callbacks.py", line 65, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "text-generation-webui\modules\text_generation.py", line 199, in generate_with_callback
    shared.model.generate(**kwargs)
  File "text-generation-webui\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
    return self.sample(
  File "text-generation-webui\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
    outputs = self(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\lib\site-packages\transformers\models\llama\modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "text-generation-webui\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None

Invoked with: tensor([[ 0.0436, -0.0149,  0.0150,  ...,  0.0267,  0.0112, -0.0011],
        [ 0.0032, -0.0213,  0.0215,  ...,  0.0320, -0.0013, -0.0199],
        [-0.0021,  0.0065, -0.0123,  ...,  0.0199, -0.0018, -0.0081],
        ...,
        [ 0.0074,  0.0389,  0.0164,  ..., -0.0429, -0.0018, -0.0133],
        [ 0.0305,  0.0061,  0.0262,  ...,  0.0096,  0.0096,  0.0033],
        [-0.0431, -0.0260,  0.0012,  ...,  0.0075, -0.0076, -0.0037]],
       device='cuda:0'), tensor([[ 2004248423,  2020046951,  1734903431,  ..., -2024113529,
         -1772648858,  1988708488],
        [ 2004318071,  1985447543,  1719101303,  ...,  1738958728,
          1734834296,  1988584549],
        [-2006481289, -2038991241,  2003200134,  ..., -1734780278,
         -2055714936, -1401572265],
        ...,
        [-2022213769, -2021226889,  1735947895,  ...,  2002357398,
          1483176039, -1215859063],
        [ 2005366614, -2022148249,  1752733576,  ...,   394557864,
          1986418055,  1483962710],
        [ 1735820935,  1988720743, -2056755593,  ..., -1468438152,
          1718123383,  1150911352]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0318],
        [0.0154],
        [0.0123],
        ...,
        [0.0191],
        [0.0206],
        [0.0137]], device='cuda:0'), tensor([[0.2229],
        [0.1079],
        [0.0860],
        ...,
        [0.1529],
        [0.1439],
        [0.0960]], device='cuda:0')
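
Reading the TypeError above: the freshly compiled quant_cuda kernel now expects six arguments for vecquant4matmul (five tensors plus an int, presumably the new group size), while the quant.py forward pass still passes five tensors, so the compiled extension and the Python code calling it are out of sync. One way to check which signature the installed kernel actually exports (a sketch; it assumes the extension was built and installed as quant_cuda by setup_cuda.py):

```python
# Inspect the pybind11-generated signature of the installed GPTQ CUDA kernel.
import quant_cuda

help(quant_cuda.vecquant4matmul)
# If this shows six parameters (..., arg5: int) while the checked-out quant.py
# only passes five, the kernel and the Python sources come from different commits.
```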

@MillionthOdin16

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

@oobabooga
Owner

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4

It's what I'm using for my container at the moment.

This solves it for me.

This bug report is in the wrong repository, by the way. You should tell @qwopqwop200 about it.

@RedTopper

RedTopper commented Mar 19, 2023

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

Yes, it's working for me with that specific commit.

Specifically, it's set up like this right now: https://github.com/RedTopper/Text-Generation-Webui-Podman/blob/main/Containerfile#L14-L15

@MillionthOdin16

Awesome. Thanks

@alexl83
Author

alexl83 commented Mar 19, 2023

If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course)

git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4

Corresponds to this commit yesterday: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/468c47c01b4fe370616747b6d69a2d3f48bab5e4
It's what I'm using for my container at the moment.

Did you get the model to output predictions in your container? Mine appears to load the model, but throws an error on prediction.

Prediction is broken for me too with yesterday's commit:

Loading settings from /home/alex/oobabooga/settings.json...
Loading llama-30b...
Loading model ...
Done.
Loaded the model in 6.81 seconds.
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 65, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 201, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 198, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
    1. (arg0: at::Tensor, arg1: at::Tensor, arg2: at::Tensor, arg3: at::Tensor, arg4: at::Tensor, arg5: int) -> None

Invoked with: tensor([[-0.0500, -0.0130, -0.0012,  ...,  0.0039, -0.0046, -0.0232],
        [-0.0420,  0.0025, -0.0313,  ..., -0.0309,  0.0211, -0.0179],
        [-0.0116,  0.0273,  0.0387,  ...,  0.0043, -0.0025,  0.0179],
        ...,
        [-0.0071, -0.0465, -0.0059,  ...,  0.0018,  0.0062, -0.0076],
        [-0.0218,  0.0511, -0.0048,  ...,  0.0093,  0.0003,  0.0119],
        [ 0.0235, -0.0288, -0.0288,  ..., -0.0232, -0.0172,  0.0103]],
       device='cuda:0'), tensor([[ 1719302009,  2004449128,  1234793881,  ..., -2019973256,
         -1502063032,  2037938296],
        [ 2019915367,  2004252535,  1750500728,  ..., -1736926794,
           965175426, -1465341558],
        [-1753778313, -2005497737, -1215805527,  ..., -2005514360,
          1450617205, -2020972629],
        ...,
        [ 2005431670,  1701348758,  1790806215,  ..., -1967744889,
          1970501769,  2055776885],
        [ 1718114184,  1970689672,  1183483512,  ...,  2053671319,
         -1752840856,  1570348373],
        [ 1734838390,  2022205543,  1734843030,  ..., -1737918327,
          2002028378, -1500927849]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0111],
        [0.0150],
        [0.0077],
        ...,
        [0.0194],
        [0.0119],
        [0.0131]], device='cuda:0'), tensor([[0.0779],
        [0.1051],
        [0.0613],
        ...,
        [0.1551],
        [0.0830],
        [0.1045]], device='cuda:0')

@MillionthOdin16

I wonder if they are actually testing on a quantized model, or a non-quantized one. I don't know where to go from here haha

@alexl83
Author

alexl83 commented Mar 19, 2023

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda.py install

Today's changes break things however

@iChristGit

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda_py install

Today's changes break things however

I also have the same issue; the last line in your reply isn't working.

@alexl83
Author

alexl83 commented Mar 19, 2023

I 'fixed' inference by:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python install_cuda_py install

Today's changes break things however

I also have the same issue; the last line in your reply isn't working.

fixed typo:
python install_cuda.py install

@RedTopper

I 'fixed' inference by: <snip>

That would make sense: you also need to rebuild the CUDA package with the .cpp files from that commit. The container starts fresh with each build, so the compiled version always matches the Python code used in the repo.

@MillionthOdin16

Awesome! Worked for me too. I completely forgot to rebuild the kernel -_-

@alexl83
Author

alexl83 commented Mar 19, 2023

In any case, I reported qwopqwop200/GPTQ-for-LLaMa#62 to qwopqwop200/GPTQ-for-LLaMa

@alexl83
Author

alexl83 commented Mar 19, 2023

qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code

I'll test and report back ;-)

@oobabooga
Owner

qwopqwop200 replied: as of today, LLaMA models need to be re-quantized to work with the newest code

@zoidbb help?

@alexl83
Author

alexl83 commented Mar 19, 2023

To sum up:

- latest GPTQ-for-LLaMa code
- re-quantized HF LLaMA model(s) to 4-bit GPTQ
- changed modules/GPTQ_loader.py from
  model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
  to
  model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)

Works for me, tested with LLaMA-7B and LLaMA-13B.
Tomorrow I'm going to re-quantize 30B/65B.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 20, 2023

So this is why I couldn't load the models after I fixed the ) bug.

But now we can quantize with different group sizes. Which one is best for performance and coherence? I hate that I have to redo this, btw.

@terbo

terbo commented Mar 20, 2023

Re-quantize means running python llama.py ..\..\models\llama-13b-hf c4 --wbits 2 --groupsize 128 --save ..\..\models\llama13b-2bit.pt from GPTQ-for-LLaMa?

This requires a ton of VRAM; I have two 8GB cards, but it only maxes out one card's memory.
How can this be done locally? I previously downloaded the decapoda-research files.

Edit: nvm, found a 13B model with the LoRA integrated that loads.

@satvikpendem

@alexl83 Would you be able to host the fixed quantized files somewhere, perhaps on Hugging Face?

@jllllll
Contributor

jllllll commented Mar 20, 2023

When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.

@oobabooga
Owner

I recommend using the previous GPTQ commit for now:

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
python setup_cuda.py install

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation

@xNul
Contributor

xNul commented Mar 21, 2023

When recompiling GPTQ on Windows, I accidentally forgot to use the x64 native tools cmd. It then successfully compiled using Visual Studio 2022 on its own, which is interesting considering everyone has been saying that only VS 2019 will work.

I noticed this as well. I was going off of the Reddit thread at the time, but I guess it is wrong.

@KnoBuddy

KnoBuddy commented Mar 21, 2023

I keep getting: "CUDA Extension not installed." I'm on Windows 11 native. I have used the older commit (git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4) of GPTQ and made sure to install the .whl correctly. CUDA is certainly installed: running import torch; torch.cuda.is_available() in Python returns True.

This is my first time installing LLaMA, so I'm not sure if this is just a perfect storm of changes happening or what. It appears that GPTQ_loader.py was changed yesterday to "model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)" (see post), and yet it still doesn't seem to work with the current branch of GPTQ.

Something about re-quantization too? No idea what my issue is. I'm sure there is a whole lot more I'm missing since I'm just now diving in today.
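
Note: "CUDA Extension not installed." is printed by GPTQ-for-LLaMa itself when its compiled quant_cuda module fails to import; torch.cuda.is_available() being True only means PyTorch sees the GPU, not that the GPTQ kernel is installed. A quick sanity check from the same environment (a sketch; quant_cuda is the module name that setup_cuda.py / the wheel installs):

```python
import torch
print("torch sees CUDA:", torch.cuda.is_available())

try:
    import quant_cuda  # the extension built by GPTQ-for-LLaMa's setup_cuda.py or the .whl
    print("quant_cuda found at:", quant_cuda.__file__)
except ImportError as err:
    # This failing import is what leads to the "CUDA Extension not installed." message.
    print("quant_cuda missing or broken:", err)
```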

@xNul
Contributor

xNul commented Mar 21, 2023

@KnoBuddy if you delete your environment and files and roll back text-generation-webui to two days ago, these instructions I made should work for you. You might be able to replace the python setup_cuda.py install line with installing the .whl. If installing the .whl doesn't work, then try the python setup_cuda.py install line. If that returns some compiler-missing error, you need to install VS BuildTools like I mention in the instructions.

@jllllll
Contributor

jllllll commented Mar 22, 2023

@KnoBuddy "CUDA Extension not installed." is specifically referring to GPTQ-for-LLaMa. I've had this issue before after installing an outdated wheel. I uploaded a Windows wheel yesterday, along with the batch script that I use to install everything above that:
#457 (comment)
Maybe that will work for you; if not, I can try compiling a new wheel, but that wheel should work. If you use the batch script, make sure not to run it as admin. If you have issues with permissions and need to run it as admin, add a cd /D command pointing to your current directory just after the first call line. Also, make sure to install the .whl file while it is inside the GPTQ-for-LLaMa folder. I've had issues with it not installing properly outside that folder.

@xNul
Contributor

xNul commented Mar 22, 2023

@jllllll what does installing ninja before CUDA compilation do?

@jllllll
Contributor

jllllll commented Mar 22, 2023

@jllllll what does installing ninja before CUDA compilation do?

When doing the compilation without ninja, there is a message saying that the compilation would be faster with ninja. I don't notice much difference, but I install it anyway.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 22, 2023

ninja sets compile time parameters.
