
Add integration with Hugging Face transformers #713

Closed
saattrupdan opened this issue Feb 28, 2024 · 14 comments · Fixed by #728

Comments

@saattrupdan
Contributor

Presentation of the new feature

It should be possible to use the transformers package for inference with generative models and simply add structured generation from outlines as a "plugin", rather than needing to wrap all models in outlines-specific classes, as the current approach seems to require.

Instead, transformers supports a prefix_allowed_tokens_fn argument in the generate method: a function that returns the tokens allowed to be generated at a given step. The outlines package could thus provide a simple function/class that can be passed as this argument, analogous to the current vLLM integration with logits_processors.
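For reference, the transformers-side hook looks roughly like this (a minimal sketch with a toy whitelist constraint; "gpt2" is just a placeholder model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy constraint: only ever allow a small whitelist of tokens.
allowed_ids = tokenizer(" yes no maybe", add_special_tokens=False).input_ids

def prefix_allowed_tokens_fn(batch_id: int, input_ids: torch.Tensor) -> list[int]:
    # Called at every decoding step; returns the token ids that may be
    # generated next, given everything decoded so far in `input_ids`.
    return allowed_ids

inputs = tokenizer("Answer:", return_tensors="pt")
output = model.generate(
    **inputs,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    max_new_tokens=5,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

An outlines-provided callable with this signature could then be dropped straight into existing generate calls.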

Where does it fit in Outlines?

This allows easier integration into the inference frameworks that people already use, making outlines more useful to more people.

Are you willing to open a PR?

I would be willing to implement this in a PR, yes. The implementation would be very similar to the current vLLM integration. If I am going to do this, I would need some guidance on preferred directory structures, however. The vLLM integration is inside the serve directory. Should vLLM and this transformers integration be moved into a separate integrations directory, perhaps?

@rlouf
Member

rlouf commented Feb 28, 2024

We have a very good reason to not use the generate method. Why would you want to do that?

@saattrupdan
Contributor Author

We have a very good reason to not use the generate method. Why would you want to do that?

Oh, I see, how come?

And my use case is part of a larger evaluation framework, where one of the tasks requires structured generation. I use vLLM but fall back to transformers if the model architecture is not supported by vLLM. It would thus be quite convenient to plug support into the existing code, rather than work with a new model abstraction.

The lmfe structured generation package also features a convenience function for integrating with transformers, so I thought you might not mind supporting a similar integration as well?

@rlouf
Member

rlouf commented Feb 28, 2024

Oh, I see, how come?

Among other things, we need to implement more sampling algorithms than transformers provides, see #673.

And my use case is part of a larger evaluation framework, where one of the tasks requires structured generation. I use vLLM but fall back to transformers if the model architecture is not supported by vLLM. It would thus be quite convenient to plug support into the existing code, rather than work with a new model abstraction.

How do you currently use vLLM in your code?

@saattrupdan
Contributor Author

saattrupdan commented Feb 29, 2024

Among other things, we need to implement more sampling algorithms than transformers provides, see #673.

I see. For my case, I would either need to use outlines for all non-vLLM generation (including non-structured), or have outlines plug into the generate method of Hugging Face. Would outlines be able to be used for all generation?

How do you currently use vLLM in your code?

I wrap vLLM models in a VLLMModel class here, making their API compatible with regular PreTrainedModel from transformers. I then have a convenience function get_ner_logits_processors, which uses outlines to build the logits processors to be included in the vllm.LLM.generate call.
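Concretely, the vLLM side looks roughly like the following sketch (the model and schema are placeholders, and the exact JSONLogitsProcessor constructor arguments may differ between outlines versions):

```python
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from outlines.serve.vllm import JSONLogitsProcessor  # import path under the serve directory

class Person(BaseModel):
    name: str
    age: int

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model

# Constructor arguments are assumed here; check the outlines version in use.
logits_processor = JSONLogitsProcessor(Person, llm)

sampling_params = SamplingParams(max_tokens=128, logits_processors=[logits_processor])
outputs = llm.generate(["Describe a fictional person as JSON."], sampling_params)
print(outputs[0].outputs[0].text)
```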

I've now added my own wrapper for the prefix_allowed_tokens_fn in the PreTrainedModel.generate call, using the convenience function get_ner_prefix_allowed_tokens_fn. This uses a custom JSONPrefixAllowedTokens, adapted from your JSONLogitsProcessor, and something like that is what I was hoping to import directly from outlines rather than copying your source code and tweaking it.

Does that make sense? If this seems too out of scope for outlines then feel free to close this issue, and I'll just use my own wrapper 🙂

@rlouf
Member

rlouf commented Mar 1, 2024

Would outlines be able to be used for all generation?

It should, via generate.text. What kwargs do you pass to generate?
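Roughly like this (the model name is a placeholder, and the call shape may differ slightly across outlines versions):

```python
import outlines

# Placeholder model; any transformers-compatible causal LM should work.
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Unstructured generation goes through the same interface as the structured generators.
generator = outlines.generate.text(model)
answer = generator("Write a short haiku about hedgehogs.", max_tokens=50)
print(answer)
```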

I wrap vLLM models in a VLLMModel class here, making their API compatible with regular PreTrainedModel from transformers. I then have a convenience function get_ner_logits_processors, which uses outlines to build the logits processors to be included in the vllm.LLM.generate call.

I've now added my own wrapper for the prefix_allowed_tokens_fn in the PreTrainedModel.generate call, using the convenience function get_ner_prefix_allowed_tokens_fn. This uses a custom JSONPrefixAllowedTokens, adapted from your JSONLogitsProcessor, and something like that is what I was hoping to import directly from outlines rather than copying your source code and tweaking it.

Would an outlines.models.vllm alias for all this help?

@saattrupdan
Contributor Author

@rlouf Just looked a bit more into the code base.

If the generation speed is at least as fast as with transformers, then I think it could be useful to use your SequenceGenerator for the generation. I would need to be able to have it return logprobs as well, however (and turn this on and off, depending on the data). Basically, if the generation function supports everything in a Hugging Face GenerationConfig, that'd be great.

But it sounds like there's no interest from your side in integrating into transformers, only in wrapping around it? In that case I think I'll stick with my own prefix_allowed_tokens_fn I linked to above.

@rlouf
Member

rlouf commented Mar 1, 2024

If the generation speed is at least as fast as with transformers, then I think it could be useful to use your SequenceGenerator for the generation.

As far as I know, yes.

But it sounds like there's no interest from your side in integrating into transformers, only in wrapping around it? In that case I think I'll stick with my own prefix_allowed_tokens_fn I linked to above.

I'm not sure what you mean here.

@saattrupdan
Contributor Author

saattrupdan commented Mar 1, 2024

I'm not sure what you mean here.

I just mean using the API from Hugging Face transformers, but plugging in your structured generation functionality, without using your Transformer abstraction or anything like that. Literally do everything exactly as you normally would with transformers, but plug in support for structured generation from outlines.

That's what I love about your vLLM integration, as that's precisely what you do: you allow me to continue working with the vLLM API, and all I have to change in my code is to import your JSONLogitsProcessor and plug it into my vLLM config. Two lines of code changed in the code base.

But to do that for Hugging Face transformers is not as easy, since your suggestion was essentially to go from using their API to your custom outlines API. If you had a simple JSONPrefixAllowedFn class (or something like that), then I could do the same as I do with vLLM: Simply import it and plug it into my generate call. Two lines of code changed, and it just works.
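Something like the sketch below is the shape I have in mind; the JSONPrefixAllowedTokens import is hypothetical (the module path, class name, and constructor arguments are all illustrative, mirroring my own wrapper above):

```python
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical: this class does not exist in outlines yet; its name, module
# path and constructor arguments are illustrative only.
from outlines.processors.transformers import JSONPrefixAllowedTokens

class Person(BaseModel):
    name: str
    age: int

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Apart from the import, these are the only changes to an existing code base:
prefix_fn = JSONPrefixAllowedTokens(schema=Person, tokenizer=tokenizer)

inputs = tokenizer("Describe a fictional person as JSON:", return_tensors="pt")
output = model.generate(
    **inputs,
    prefix_allowed_tokens_fn=prefix_fn,
    max_new_tokens=128,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```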

Using your outlines.generate could also work, but in that case I'd love to just pass in my (Hugging Face) model along with the exact same arguments that I would pass to a generate call. This would just require a lot more maintenance from your side though, since Hugging Face changes their API all the time. That's why I think viewing outlines as a "plug-in", as described above, seems a lot more robust in the long run.

@rlouf
Member

rlouf commented Mar 1, 2024

That's why I think viewing outlines as a "plug-in", as described above, seems a lot more robust in the long run.

I would agree if we were not working on features that are not available in transformers, such as #667 #657 #673. We can still integrate the way you suggest with transformers, via logits processors. A middle ground solution would be to add transformers-specific logits processors to the library. Would that work for you?
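For concreteness, this is roughly the transformers-side hook such processors would plug into (a minimal sketch with a toy processor; an outlines processor would instead compute the mask from its structured-generation state):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class BlockTokenProcessor(LogitsProcessor):
    """Toy processor that forbids a single token id at every decoding step."""

    def __init__(self, blocked_token_id: int):
        self.blocked_token_id = blocked_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Mask out the blocked token by setting its logit to -inf.
        scores[:, self.blocked_token_id] = -float("inf")
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello", return_tensors="pt")
output = model.generate(
    **inputs,
    logits_processor=LogitsProcessorList([BlockTokenProcessor(tokenizer.eos_token_id)]),
    max_new_tokens=10,
)
```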

@saattrupdan
Contributor Author

I would agree if we were not working on features that are not available in transformers, such as #667 #657 #673.

Yeah, I completely get that. Down the line this might very well mean that outlines will be the go-to generation API for me!

We can still integrate the way you suggest with transformers, via logits processors. A middle ground solution would be to add transformers-specific logits processors to the library. Would that work for you?

Yeah, I think this is exactly what I would like! The closest analogue of logits processors in transformers' generate call is the prefix-allowed-tokens function, and the two are very similar. I could open a PR that adds this. Whereabouts in the code base would it fit? serve/transformers.py?

@rlouf
Member

rlouf commented Mar 1, 2024

Yeah, I think this is exactly what I would like! The closest analogue of logits processors in transformers' generate call is the prefix-allowed-tokens function, and the two are very similar. I could open a PR that adds this. Whereabouts in the code base would it fit? serve/transformers.py?

Yes. I'm not sure where it would make the most sense, maybe in a high-level processors module?

@saattrupdan
Contributor Author

I'm not sure where it would make the most sense, maybe in a high-level processors module?

Sounds good to me. Should I move the vLLM processors as well in that case? I could keep a reference to them in serve/vllm.py to ensure backwards compatibility.

@rlouf
Member

rlouf commented Mar 2, 2024

Yes that would be great, thank you!

@saattrupdan
Contributor Author

PR open now: #728
