
Project Roadmap #57

Open
16 of 36 tasks
tgaddair opened this issue Nov 22, 2023 · 32 comments
Labels
enhancement New feature or request

Comments

@tgaddair
Contributor

tgaddair commented Nov 22, 2023

WIP project roadmap for LoRAX. We'll continue to update this over time.

v0.10

  • Speculative decoding adapters
  • AQLM

v0.11

  • Prefix caching
  • Embedding endpoint
  • BERT support
  • Embedding adapters
  • Classification adapters

Previous Releases

v0.9

  • Adapter memory pool

Backlog

Models

  • Llama
  • Mistral
  • GPT2
  • Qwen
  • Mixtral
  • Phi
  • Bloom
  • BERT
  • Stable-Diffusion

Adapters

Throughput / Latency

  • Paged Attention v2
  • Lookahead Decoding
  • SGMV with variable ranks
  • SGMV with tensor parallelism

Quantization

  • bitsandbytes
  • GPT-Q
  • AWQ

Usability

  • Prebuilt server wheels
  • SkyPilot usage guide
  • Example notebooks
@tgaddair tgaddair added the enhancement New feature or request label Nov 22, 2023
@tgaddair tgaddair pinned this issue Nov 22, 2023
@RileyCodes

Is AWQ supported?

@tgaddair
Contributor Author

Hey @RileyCodes, not yet, will add that to the roadmap!

@abhibst

abhibst commented Nov 23, 2023

Have we tested bitsandbytes quantization?

@tgaddair
Contributor Author

Hey @abhibst, I've done some basic sanity checks on it, but haven't tested it very thoroughly. Please feel free to report any issues you encounter and I'll take a look!

@abhibst

abhibst commented Nov 23, 2023

Sure, thanks for confirming.

@arnavgarg1 arnavgarg1 unpinned this issue Nov 28, 2023
@tgaddair tgaddair pinned this issue Nov 29, 2023
@sansavision

How would you go about adding this in Stable Diffusion? I am really interested in experimenting with that.

@tgaddair
Contributor Author

Hey @sansavision, at a high level it would look a lot like the LoRA pipeline used in Diffusers: https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/pipelines/text_to_image.py#L25

A v0 shouldn't be too bad: we would basically just run a single forward pass to generate the image, perform postprocessing (as part of the existing Prefill step), and short-circuit the Decode step.
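That short-circuit idea can be sketched abstractly. The names below (`ImageRequest`, `run_diffusion_pipeline`, `prefill`, `decode`) are illustrative stand-ins, not actual LoRAX internals:

```python
from dataclasses import dataclass


@dataclass
class ImageRequest:
    prompt: str


def run_diffusion_pipeline(prompt: str) -> bytes:
    # Stand-in for a single Diffusers text-to-image call with LoRA
    # weights loaded; returns encoded image bytes.
    return f"<png for: {prompt}>".encode()


def prefill(request: ImageRequest) -> tuple[bytes, bool]:
    # The whole generation happens inside the existing Prefill step:
    # one pass through the pipeline plus postprocessing, after which
    # the request is marked finished.
    image = run_diffusion_pipeline(request.prompt)
    return image, True


def decode(request: ImageRequest) -> None:
    # Never reached: Prefill already marked the request finished,
    # so the scheduler short-circuits the Decode loop.
    raise AssertionError("Decode is skipped for image generation")


image, finished = prefill(ImageRequest(prompt="a watercolor fox"))
assert finished  # no Decode step needed
```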

@flozi00
Collaborator

flozi00 commented Dec 3, 2023

If no one has started, I will start working on AWQ tomorrow.

@tgaddair
Contributor Author

tgaddair commented Dec 3, 2023

Nice! Thanks @flozi00, that would be awesome!

@SamGalanakis

Any plans to support vision transformers from Hugging Face / timm? There are a lot of potential use cases for deploying many classifiers. If not, what would that entail? I'd be open to contributing if possible.

@tgaddair
Contributor Author

tgaddair commented Dec 6, 2023

Hey @SamGalanakis, great suggestion! The plan at the moment is to start by supporting text classifiers. Once that framework is in place, it should hopefully be relatively straightforward to support image classifiers as well. Happy to start a thread on Discord to discuss!

@flozi00
Collaborator

flozi00 commented Dec 6, 2023

Whisper would also be very cool 😄

@SamGalanakis

@tgaddair OK, clear. Joined the Discord and will look out for it!

@Hap-Zhang

Hi @tgaddair, could I ask how long it will take to support the Stable Diffusion model?

@tgaddair
Contributor Author

Hey @Hap-Zhang, the plan at the moment is to add it after we add support for embedding generation and text classification. Both of those are planned for January 2024, so in the next month.

@Hap-Zhang

@tgaddair Okay, got it. Thank you very much for your efforts. Stay tuned for it.

@AdithyanI

If we could have OpenAI-compatible endpoints, that would be great too, so we could use this as a drop-in replacement for OpenAI models :)

@tgaddair
Contributor Author

tgaddair commented Jan 8, 2024

Hey @AdithyanI, yes, this should be coming this week or next! See #145 to follow progress.

@AdithyanI

AdithyanI commented Jan 8, 2024

@tgaddair oh wow that would be awesome! Thank you so much for the work here.
If you need someone to test it out, let me know. Happy to help.

Is the Discord still open for others to join? :)
I followed the link in the repo, and it says it's expired.

@tgaddair
Contributor Author

tgaddair commented Jan 9, 2024

@AdithyanI this should be landing some time today :)

#170

@tgaddair
Contributor Author

tgaddair commented Jan 9, 2024

Hey @AdithyanI, the Discord should be available. Are you using this link?

https://discord.gg/CBgdrGnZjy

@AdithyanI

@tgaddair I asked the Outlines repo authors to add support for this: outlines-dev/outlines#523
Then it would be great to have text-guided generation :)


I don't know how hard it would be to integrate that here.
Do you folks know if this is something that can be supported by LoRAX?

@tgaddair
Contributor Author

Thanks for starting the Outlines thread @AdithyanI! Looks like the maintainer created an issue #176. Excited to explore this integration!

@K-Mistele

Would it be possible to add context length-scaling methods like Self-Extend, RoPE scaling, and/or YaRN scaling? I know that llama.cpp has a good implementation of these in their server, and Self-Extend in particular is much more stable than RoPE or YaRN scaling. Having long context or doing context enhancement is super important for RAG applications.
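For reference, the core of linear RoPE scaling (position interpolation) is only a few lines. This is a generic illustration of the technique, not LoRAX or llama.cpp code, and `head_dim`/`base` defaults are arbitrary:

```python
def rope_angles(position: int, head_dim: int = 8,
                base: float = 10000.0, scale: float = 1.0) -> list[float]:
    # Linear RoPE scaling ("position interpolation"): dividing the
    # position index by `scale` squeezes a longer context back into
    # the position range the model was trained on.
    p = position / scale
    return [p / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


# With scale=4, position 16384 yields the same rotary angles as
# position 4096 does unscaled, so a 4k-trained model can address 16k.
assert rope_angles(16384, scale=4.0) == rope_angles(4096, scale=1.0)
```

Self-Extend and YaRN refine this idea (grouped positions and frequency-dependent interpolation, respectively) rather than applying one uniform divisor.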

@thincal
Contributor

thincal commented Feb 26, 2024

Regarding supported models, could you consider ChatGLM3? @tgaddair

@thincal
Contributor

thincal commented Mar 10, 2024

  • LongLoRA

It seems that LongLoRA's proposed shifted short attention is compatible with Flash Attention and is not required during inference (ref: https://huggingface.co/Yukang/Llama-2-13b-longlora-8k#highlights). If that is true, could you share the planned support on the LoRAX inference side? Thanks @tgaddair

@remiconnesson

remiconnesson commented Mar 17, 2024

Do you plan on supporting AQLM to serve LoRAs for Mixtral Instruct with LoRAX?

@tgaddair
Contributor Author

Hey @thincal, the last thing we need to support LongLoRA, if I remember correctly, is #231 which @geoffreyangus is planning to pick up next week.

@remiconnesson, we have PR #233 from @flozi00 for AQLM. It's pretty close to landing, but just needs a little additional work to finish it up. If no one else picks it up, I can probably take a look in the next week or two.

@amir-in-a-cynch

Are T5-based models on the roadmap?

@remiconnesson

@tgaddair

@remiconnesson, we have PR #233 from @flozi00 for AQLM. It's pretty close to landing, but just needs a little additional work to finish it up. If no one else picks it up, I can probably take a look in the next week or two.

Hello :) How far do you think we are from this PR being merged? :)

@tgaddair
Contributor Author

tgaddair commented Apr 3, 2024

Hey @remiconnesson, will probably be the next thing I take a look at after wrapping up speculative decoding this week.

@amir-in-a-cynch we can definitely add T5 to the roadmap!

@tomrance

Hello, will you integrate/merge/migrate to the latest Hugging Face text-generation-inference now that it is back under the Apache 2.0 license?
