
How do I use trust_remote_code=True for Mosaic models? #361

Closed
cfregly opened this issue May 23, 2023 · 6 comments · Fixed by #514

Comments


cfregly commented May 23, 2023

Feature request

import torch
from transformers import AutoConfig, AutoModelForCausalLM

pretrained_model_dir = 'mosaicml/mpt-7b'

# from_config builds the architecture only; the weights are loaded separately
# (e.g. with load_checkpoint_and_dispatch, see the linked thread below)
config = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float16)

https://discuss.huggingface.co/t/how-to-use-trust-remote-code-true-with-load-checkpoint-and-dispatch/39849/1
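
For reference, the more common route is to load the published checkpoint directly with from_pretrained, which pulls the weights and the custom modelling code in one call. A minimal sketch using plain transformers (not TGI), assuming a CUDA GPU with enough memory for the fp16 weights:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # MPT ships custom modelling code on the Hub
    torch_dtype=torch.float16,
).to("cuda")

inputs = tokenizer("MosaicML MPT-7B is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))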

Motivation

model-specific params: MPT checkpoints ship custom modelling code, so loading them requires trust_remote_code=True

Your contribution

sure

@OlivierDehaene (Member)

#363 is a first step toward supporting mosaicml/mpt-7b. However, it seems that the past_key_values layout used by this model is not compatible with this repository. I will try some workarounds in the coming days.
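
For readers unfamiliar with the term: in stock transformers, past_key_values is a per-layer tuple of (key, value) tensors. A minimal sketch of that conventional layout is below (shapes are illustrative, not MPT's); MPT's remote modelling code arranges its cache differently, which is the mismatch described above.

import torch

# Conventional transformers KV-cache layout (illustrative shapes only)
batch, num_heads, seq_len, head_dim, num_layers = 1, 32, 8, 128, 32
past_key_values = tuple(
    (
        torch.zeros(batch, num_heads, seq_len, head_dim),  # keys for layer i
        torch.zeros(batch, num_heads, seq_len, head_dim),  # values for layer i
    )
    for _ in range(num_layers)
)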


metemadi commented May 26, 2023

Hi! I am able to get the server to run properly with an MPT model (converted to HF format using the scripts in their llm-foundry repo). I can run generation fine using normal Python and HF generate(), but the generation server (using the same generation parameters) sadly doesn't work; it just returns an empty string. Wondering if you ran into the same problem? Thank you!

Edit: it seems the issue is that decoding is stopping prematurely due to the EOS token. Is there a way to make it just generate up to the sequence length? Looking through the code, there appears to be something in Rust-land to turn this on (it looks like it's there for testing), but I can't pass anything through the REST endpoint to trigger this behavior.
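
For reference, generation parameters go to the server through the JSON body of the /generate route. A minimal sketch with Python requests (the server address and parameter values here are assumptions): max_new_tokens caps the output length, but as noted above there is no request-level switch to keep generating past the EOS token.

import requests

# Assumes a TGI server listening on localhost:8080
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "My favourite condiment is",
        "parameters": {"max_new_tokens": 64, "do_sample": True, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])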

@harryjulian

@metemadi did you ever get any further with this?

I managed to stand up MPT-7B in the container but I was also only getting a single returned token.

@harryjulian

@OlivierDehaene Do you know exactly what's causing the issues with the MPT model? I'm looking at making a fix.


metemadi commented Jun 19, 2023

Hi @harryjulian and others! Unfortunately, no. I can run the code (I take an MPT model I trained, use the helper script in the llm-foundry repo to convert it to an HF checkpoint, then import it like any other HF model), but it just generates <|endoftext|> the whole time, which stops generation. I even tricked the model (by modifying the configs) into not treating <|endoftext|> as a stop token, but guess what happened? It just generated a bunch of <|endoftext|> tokens (up to the number of new tokens requested). So I threw in the towel and just used the generate() function for my application :) I even tried another library (gpt4all) to get a chat interface, which also has an HF import option, and that gave me a different error. A huge thank you to everyone for looking at this; the llm-foundry tools are dead simple to use for training (and very fast), and I would love to get custom MPT checkpoints working with your amazing high-performance streaming inference library!

Narsil (Collaborator) commented Jul 1, 2023

PR #514 should help run MPT models on TGI.

It doesn't use flash (yet) because that requires forking and extending the flash attention kernel.
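
Once that lands, querying an MPT model served by TGI should look the same as for any other model. A minimal sketch with the text-generation Python client (pip install text-generation), assuming a server is already running locally with the MPT checkpoint:

from text_generation import Client

# Assumes a TGI server listening on port 8080
client = Client("http://localhost:8080")
response = client.generate("MosaicML MPT-7B is", max_new_tokens=32)
print(response.generated_text)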

OlivierDehaene pushed a commit that referenced this issue Jul 3, 2023
# What does this PR do?

This adds a non-flash version of MPT. Flash is harder because we need to create a bias-ready CUDA kernel for flash attention.

Fixes #361
Fixes #491
Fixes #290
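
For context on the "bias ready" remark: MPT applies ALiBi, an additive per-head bias on the attention scores. The non-flash path can materialize the score matrix and add the bias, roughly as in the sketch below (an illustration, not TGI's actual code), whereas the stock flash attention kernel has no input for such a bias, hence the need to fork and extend it.

import math
import torch

def attention_with_alibi(q, k, v, alibi_bias):
    # q, k, v: [batch, heads, seq, head_dim]; alibi_bias: [1, heads, seq, seq]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + alibi_bias  # the extra additive term a flash kernel must support
    return torch.softmax(scores, dim=-1) @ v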