Support for mosaicml/mpt-30b-instruct model #491
Did you try `--trust-remote-code` while running the Docker image?
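(For anyone landing here: a hypothetical launch command showing where that flag goes. The image tag and flag names are assumptions based on TGI's launcher CLI and may differ for your version.)

```shell
# Hypothetical invocation; adjust the image tag and model id for your setup.
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mosaicml/mpt-30b-instruct \
    --trust-remote-code
```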
It's very slow. This model is not supported for sharding at the moment in text-generation-inference.
Then try a rudimentary implementation of it: you can use Rust or JS for the router and Python for inference, and copy the custom kernels from the repo and modify them as needed. MPT already has a flash attention implementation in its "remote code" file; use that, along with `batch_encode_plus` and `batch_decode` on the tokenizer. Implement batching on the router server and voilà, you have your own server ready for inference.
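The comment above sketches a DIY server. As a rough illustration of the router-side batching it describes (this is not TGI code; all names here are made up for the example, and `generate_fn` stands in for the tokenize/generate/decode round-trip):

```python
# Hypothetical sketch of the micro-batching loop a custom router could run:
# requests arriving close together are grouped and sent to the model once.
from dataclasses import dataclass, field
from queue import Queue, Empty

@dataclass
class Request:
    prompt: str
    result: list = field(default_factory=list)

def collect_batch(queue: Queue, max_batch_size: int = 8) -> list:
    """Drain up to max_batch_size pending requests without blocking."""
    batch = []
    while len(batch) < max_batch_size:
        try:
            batch.append(queue.get_nowait())
        except Empty:
            break
    return batch

def run_batch(batch, generate_fn):
    # In a real server, generate_fn would call tokenizer.batch_encode_plus,
    # model.generate, and tokenizer.batch_decode on all prompts at once.
    outputs = generate_fn([r.prompt for r in batch])
    for req, out in zip(batch, outputs):
        req.result.append(out)

if __name__ == "__main__":
    q = Queue()
    reqs = [Request(p) for p in ["hello", "world", "foo"]]
    for r in reqs:
        q.put(r)
    batch = collect_batch(q)
    run_batch(batch, lambda prompts: [p.upper() for p in prompts])
    print([r.result[0] for r in reqs])  # ['HELLO', 'WORLD', 'FOO']
```

A real router would run this loop continuously and attach futures to each request so HTTP handlers can await their results.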
Maybe you could write me one as an example?
I am working on one right now; if you would like to help out, my Discord is mantrakp.
I am also very interested in this.
(And can we expect an optimized TGI implementation soon?)
Take example from the other models we have done; there are also some relevant files in the repo. If you succeed, PRs are welcome!
Is MPT even supported? See #290.
It's supported on a "best effort" basis. I started some work to actually support it, but it means rewriting flash attention (the CUDA version) with added bias, which may take some time.
Sad news: I didn't succeed. The MPT model is a bit different; I tried loading it, but it didn't work as expected and keeps mixing up tokens. I am looking forward to your implementation, Narsil. Sorry for the wait, and thanks in advance :D
Can you give some guidance on how you started writing the flash attention part? Also, what are your thoughts on implementing dynamic batching for this, since it only supports one concurrent request for now with AutoModel?
This won't require work once we have flash attention.
Please correct me if I'm wrong, but do we need to implement this, given that the mpt-30b models already ship flash attention support in their config?

```python
import torch
import transformers

name = 'mosaicml/mpt-30b-chat'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # change this to use triton-based FlashAttention
config.init_device = 'cuda:0'  # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    trust_remote_code=True,
)
```
Because it doesn't implement the flash attention we want. That is Triton's flash attention, which doesn't support "unpadded" batching, which is the one necessary for TGI to work nicely (removing padding removes a LOT of issues and unnecessary memory, and speeds up inference much more than flash attention by itself). Flash attention actually doesn't play that big a role in speeding things up at inference, since most of the time is spent in
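For context on the "unpadded" batching mentioned above: instead of padding every sequence to the longest one, the sequences are packed into one flat buffer with boundaries recorded in a cumulative-lengths array (the `cu_seqlens` style used by flash attention's variable-length interface). A toy illustration of the two layouts, not the actual TGI code:

```python
# Illustration of padded vs. "unpadded" (packed) batching.
# Padding wastes memory and compute on pad tokens; packing keeps only real
# tokens and uses cu_seqlens-style offsets to find sequence boundaries.
def pad_batch(seqs, pad_id=0):
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def pack_batch(seqs):
    packed, cu_seqlens = [], [0]
    for s in seqs:
        packed.extend(s)
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens

seqs = [[1, 2, 3], [4], [5, 6]]
print(pad_batch(seqs))   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(pack_batch(seqs))  # ([1, 2, 3, 4, 5, 6], [0, 3, 4, 6])
```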
Here is the non-flash version (as a temporary measure, since modifying the kernel is taking more time than I anticipated): #514. This should enable sharding at least.
Many thanks for this. Looking forward to the flash class too.
Triton is the only flash attention implementation that supports ALiBi, if I understand this correctly. So for TGI, if we want to use MPT with ALiBi, does that leave us with just the native PyTorch implementation?
We will fork and add it ourselves to the flash attention CUDA kernels.
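For reference, the bias such a kernel would need to add is straightforward to state. This is a plain-Python sketch of the ALiBi slopes and bias matrix following the formula from the ALiBi paper for a power-of-two number of heads (not kernel code, and the function names are made up for the example):

```python
def alibi_slopes(n_heads):
    # Standard ALiBi slopes for a power-of-two number of heads:
    # slope_i = 2 ** (-8 * i / n_heads) for i = 1..n_heads.
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(n_heads, seq_len):
    # Per-head bias added to attention scores before softmax:
    # slope * -(query_pos - key_pos) for the causal (k <= q) positions.
    slopes = alibi_slopes(n_heads)
    return [[[slope * -(q - k) if k <= q else 0.0
              for k in range(seq_len)]
             for q in range(seq_len)]
            for slope in slopes]

print(alibi_slopes(4))  # [0.25, 0.0625, 0.015625, 0.00390625]
```

The point of the fork is to add this bias term inside the fused flash attention CUDA kernel, since the stock kernel computes scores without any additive bias.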
# What does this PR do?

This adds a non-flash version of MPT. Flash is harder because we need to create a bias-ready CUDA kernel of flash attention.

Fixes huggingface/text-generation-inference#361
Fixes huggingface/text-generation-inference#491
Fixes huggingface/text-generation-inference#290
Feature request
I was wondering if there will be support for the newly released mpt-30b-instruct model.
Motivation
It's not possible to use the mosaicml/mpt-30b-instruct model:

`ValueError: sharded is not supported for AutoModel`
Your contribution
I am not sure how support for new LLM models can be added. If there is a step-by-step guide on where to start, that would be great, and I could contribute.