
Backend cleanup #6025

Merged: 11 commits into dev on May 21, 2024

Conversation

@oobabooga (Owner) commented May 19, 2024

  • Bump AutoAWQ, AutoGPTQ, HQQ, AQLM and transformers to the latest versions.
  • Remove QuIP#. Its last commit was 3 months ago, there are no Llama-3 QuIP# models on HF, and there are still no precompiled wheels for the library after months. AQLM performs similarly and is more actively maintained.
  • Remove GPTQ-for-LLaMa.
    • It doesn't work with the current PyTorch version and I didn't manage to recompile it with GitHub Actions easily.
    • It was kept for compatibility with old NVIDIA GPUs, but I don't think anyone still uses it.
  • Make AutoGPTQ functional again by removing the inject_fused_attention option (apparently it has a bug); see the sketch after this list.
  • HQQ is not functional (I couldn't load mobiuslabsgmbh_Llama-2-7b-chat-hf_1bitgs8_hqq, which is the most popular HQQ model on HF), but I'm keeping the loader for now because it's in active development and promising.
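
For reference, here's a minimal sketch of what loading a GPTQ checkpoint through AutoGPTQ looks like with fused attention injection left disabled, which is effectively what removing the inject_fused_attention option amounts to. This is not the webui's actual loader code, the model path is just an example, and the exact keyword arguments depend on the installed auto_gptq version:

```python
# Minimal sketch, not the webui's loader code: load a GPTQ checkpoint with
# AutoGPTQ while keeping fused attention injection disabled (the buggy option
# that this PR stops exposing).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/Llama-2-7B-Chat-GPTQ"  # example GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,  # keep the fused attention injection off
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```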

@Ph0rk0z (Contributor) commented May 20, 2024

Any big reason to dump these? I mean, people probably have models in those formats. One day they will pull and then randomly lose the ability to run them. It doesn't seem like there's any code that has made these backends a hassle yet, besides the fact that they exist.

> I don't think anyone still uses it.

Not a lot of Maxwell users, but IIRC, it was the only way for them.

@oobabooga (Owner, Author):
> Any big reason to dump these?

Keeping the repository/documentation clean, not having to fix an obsolete backend, not forcing people to download the wheels for each pip install -r requirements.txt, not having an untested backend (does it even work with Llama-3?).

I'll remove it since I think it only serves an imaginary type of user. People with old GPUs are running llama.cpp nowadays, not GPTQ-for-LLaMa.

@oobabooga merged commit bd7cc42 into dev on May 21, 2024
@GlobalMeltdown commented Jun 10, 2024

I use GPTQ and prefer it over GGUF because, performance-wise, it's much better on my 3070 Ti. It's a shame TheBloke was pretty much the only good source for GPTQ. But honestly, why would I want to play with balancing GGUFs when I can just one-and-done it with GPTQ (either it fits or it doesn't), and the inference speed and responses are great compared to similar GGUFs.

So I hope you're joking about removing its functionality. Mind you, I only use ExLlamav2_HF to run them, not previous iterations, so if ExLlamav2_HF remains I have no qualms.
