
Backend cleanup #6025

Merged: 11 commits into dev on May 21, 2024

Conversation

@oobabooga (Owner) commented May 19, 2024

  • Bump AutoAWQ, AutoGPTQ, HQQ, AQLM and transformers to the latest versions.
  • Remove QuIP#. Its last commit was 3 months ago, there are no Llama-3 QuIP# models on HF, and there are still no precompiled wheels for the library after months. AQLM performs similarly and is more actively maintained.
  • Remove GPTQ-for-LLaMa.
    • It doesn't work with the current PyTorch version and I didn't manage to recompile it with GitHub Actions easily.
    • It was kept for compatibility with old NVIDIA GPUs, but I don't think anyone still uses it.
  • Make AutoGPTQ functional again by removing the inject_fused_attention option (apparently it has a bug); see the sketch after this list.
  • HQQ is not functional (I couldn't load mobiuslabsgmbh_Llama-2-7b-chat-hf_1bitgs8_hqq, which is the most popular HQQ model on HF), but I'm keeping the loader for now because it's in active development and promising.
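
For reference, here's a minimal sketch of what loading a GPTQ checkpoint through AutoGPTQ looks like with fused attention injection left disabled, which is effectively what removing the inject_fused_attention option amounts to. This is not the webui's actual loader code, the model path is just an example, and the exact keyword arguments depend on the installed auto_gptq version:

```python
# Minimal sketch, not the webui's loader code: load a GPTQ checkpoint with
# AutoGPTQ while keeping fused attention injection disabled (the buggy option
# that this PR stops exposing).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/Llama-2-7B-Chat-GPTQ"  # example GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,  # keep the fused attention injection off
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```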

@Ph0rk0z (Contributor) commented May 20, 2024

Any big reason to dump these? I mean, people probably have models in those formats. One day they will pull and then randomly lose the ability to run them. It doesn't seem like there's any code that has made these backends a hassle yet, besides the fact that they exist.

> I don't think anyone still uses it.

Not a lot of Maxwell users, but IIRC, it was the only way for them.

@oobabooga (Owner, Author):
> Any big reason to dump these?

Keeping the repository/documentation clean, not having to fix an obsolete backend, not forcing people to download the wheels for each pip install -r requirements.txt, not having an untested backend (does it even work with Llama-3?).

I'll remove it since I think it only serves an imaginary type of user. People with old GPUs are running llama.cpp nowadays, not GPTQ-for-LLaMa.

@oobabooga merged commit bd7cc42 into dev on May 21, 2024
@GlobalMeltdown commented Jun 10, 2024

I use GPTQ and prefer it over GGUF because, performance-wise, it's much better on my 3070 Ti. It's a shame TheBloke was pretty much the only good source for GPTQ. But honestly, why would I want to play with balancing GGUFs when I can just one-and-done it with GPTQ (either it fits or it doesn't), and the inference speed and responses are great compared to similar GGUFs.

So I hope you're joking about removing its functionality. Mind you, I only use ExLlamav2_HF to run them, not previous iterations, so if ExLlamav2_HF remains I have no qualms.
