This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Does this work for LLaMA models? #57

Closed
marcoripa96 opened this issue Mar 17, 2023 · 21 comments
Labels
enhancement New feature or request

Comments


marcoripa96 commented Mar 17, 2023

I want to stream the output of LLaMA models. Is that possible using this?


peakji commented Mar 17, 2023

The implementation of LLaMA has just been merged into HF transformers, but it seems that it did not make it into the latest release.

Basaran will support LLaMA as soon as it is available in HF transformers.


deep-diver commented Mar 21, 2023

I am building a Gradio app with LLaMA. May I borrow your token generator code?

I have checked that StreamModel works nicely with LLaMA, and my project is open source under the Apache license.


peakji commented Mar 21, 2023

> I am building a Gradio app with LLaMA. May I borrow your token generator code?

Yes, of course! Basaran is released under the MIT License.

> I have checked that StreamModel works nicely with LLaMA...

If that's the case, do you think it's necessary for us to release Basaran on PyPI for developers to use it as a library? This would be more maintainable than copying and pasting.

Also, which LLaMA model are you testing? Could you share the link to the model on the Hugging Face Hub so that we can test it as well? Thanks!

@deep-diver

I have fine-tuned the 30B version of Alpaca-LoRA! I am currently running a public demo:

https://notebooksf.jarvislabs.ai/43j3x9FSS8Tg0sqvMlDgKPo9vsoSTTKRsX4RIdC3tNd6qeQ6ktlA0tyWRAR3fe_l/

And here is the repo for this app: https://github.com/deep-diver/Alpaca-LoRA-Serve

Super, thanks! A separate package would be wonderful!!

@marcoripa96

I was about to ask for the same thing: a separate package with the streaming classes and helpers.

@deep-diver did you manage to make it work with Alpaca-LoRA? I tried (using Alpaca-LoRA 7B) by copying and pasting Basaran's classes and helpers, and I always ended up with an error caused by the attention mask tensor shape being different from what was required during inference.
Do you have the code where you tried that? Thanks a lot!

@deep-diver

Check this out. I will share it when it is done; some minor issues remain.

https://twitter.com/algo_diver/status/1638079375085305856


peakji commented Mar 21, 2023

Basaran is now available as a library on PyPI. To use it programmatically, install it with pip:

pip install basaran

Use the load_model function to load the specified model and generate streaming output by calling the model:

from basaran.model import load_model

model = load_model("user/repo")

for choice in model("once upon a time"):
    print(choice)

The examples directory contains examples of using Basaran as a library.

@marcoripa96

Any idea why this error happens?

[screenshot of the error traceback]

I'll leave the colab link here: https://colab.research.google.com/drive/1RBmL1tsAnKZhKoHkHkKiT07625gw-4Fg?usp=sharing


peakji commented Mar 22, 2023

@marcoripa96 The error seems to be related to the model itself. Can you run the model without using Basaran?

AFAIK, LLaMA support in HF Transformers is still in active development.

@deep-diver

Check out the updated demo: https://notebookse.jarvislabs.ai/BuOu_VbEuUHb09VEVHhfnFq4-PMhBRVCcfHBRCOrq7c4O9GI4dIGoidvNf76UsRL

I am curious how I can speed up inference.


peakji commented Mar 22, 2023

> I am curious how I can speed up inference.

@deep-diver Have you tried quantization? Basaran provides two quantization options, load_in_8bit (INT8) and half_precision (FP16), but I'm not sure if LLaMA's implementation supports them.
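For reference, a minimal sketch of how those two options might be passed when using Basaran as a library (assuming load_model accepts them as keyword arguments; "user/llama-repo" is a placeholder model name):

from basaran.model import load_model

# Assumption: load_model forwards these flags to the underlying loader.
model = load_model("user/llama-repo", load_in_8bit=True)       # INT8
# model = load_model("user/llama-repo", half_precision=True)   # FP16 instead

for choice in model("once upon a time"):
    print(choice)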

@deep-diver

I am using peft for 8-bit, since FP16 wouldn't let the 30B model fit into 40GB (which is the limit of VRAM that I have).

@deep-diver

Just curious: does StreamModel support all parameters in GenerationConfig?


peakji commented Mar 22, 2023

> I am using peft for 8-bit, since FP16 wouldn't let the 30B model fit into 40GB (which is the limit of VRAM that I have).

Then I guess this is probably the fastest speed possible under the current resource constraints.

> Just curious: does StreamModel support all parameters in GenerationConfig?

No. All model arguments are supported, but most generation options (like top_k) are not.
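To illustrate the distinction, a hedged sketch of a call using the OpenAI-style sampling options that Basaran mirrors (parameter names such as max_tokens, temperature, and top_p are assumed from the OpenAI completion API; top_k has no equivalent there):

from basaran.model import load_model

model = load_model("user/llama-repo")  # placeholder model name

# OpenAI-style sampling options are the supported surface;
# GenerationConfig-only options such as top_k are not.
for choice in model("once upon a time", max_tokens=64, temperature=0.7, top_p=0.9):
    print(choice)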

@deep-diver

Is that because of technical difficulties? Like it being hard to monkey-patch the current implementation of transformers?


peakji commented Mar 22, 2023

Most generation options require a reimplementation to support streaming output; simply copying code from transformers won't work.

Considering that Basaran's goal is to be compatible with the OpenAI API, the common arguments were prioritized for implementation.
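A minimal sketch (not Basaran's actual code) of why that is: an option like top-p has to be reimplemented inside the per-token decoding loop so that output can be yielded as it is generated. This uses a small placeholder model (gpt2) for illustration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def stream(prompt, max_tokens=32, top_p=0.9, temperature=1.0):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1] / temperature
        probs = torch.softmax(logits, dim=-1)
        # Nucleus (top-p) filtering has to be reapplied at every step.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p
        keep[0] = True  # always keep the most probable token
        filtered = torch.zeros_like(probs)
        filtered[sorted_idx[keep]] = sorted_probs[keep]
        next_id = torch.multinomial(filtered / filtered.sum(), 1)
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)
        yield tokenizer.decode(next_id)  # emit each token as it is sampled

for piece in stream("once upon a time"):
    print(piece, end="", flush=True)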

@deep-diver

Thanks for the clarification. I can see you have made a huge contribution to the open source world with this one. I really want to say thank you! :)

peakji added the enhancement (New feature or request) label Mar 28, 2023

lslslslslslslslslslsls commented Mar 30, 2023

I tried to host Alpaca-LoRA with Basaran. Basically, it works well. However, it fails to concatenate subwords to the original word when decoding. It seems the LLaMA tokenizer always outputs a space when decoding a single token, which breaks the whitespace-handling logic in streaming decoding: https://github.com/hyperonym/basaran/blob/master/basaran/tokenizer.py#L41
[screenshot of the streamed output with broken subword concatenation]

Is there any way to turn off the whitespace prefix, or to refine the whitespace-handling logic?
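A small repro sketch of the reported behavior (the checkpoint name is a placeholder; any SentencePiece-based LLaMA tokenizer should behave similarly), assuming the issue is the "▁" word-boundary marker turning into a leading space on isolated decodes:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
ids = tokenizer.encode("hello world", add_special_tokens=False)

# Decoding tokens one at a time can surface a leading space (from the
# SentencePiece "▁" marker) that a full decode would merge away.
print([tokenizer.decode([i]) for i in ids])
print(tokenizer.decode(ids))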


deep-diver commented Mar 30, 2023

Have a look at my repo: https://github.com/deep-diver/Alpaca-LoRA-Serve

I also run a couple of demos; you can find the info in the README.


peakji commented Mar 30, 2023

@oreo-yum Thanks for the feedback; the screenshot is very helpful!

In case there are last-minute changes, we plan to add full support only after the new version of Hugging Face Transformers, which includes the LLaMA implementation, is released.

In the meantime, we will study and modify the logic of StreamTokenizer. Don't worry, it's just another weird tokenizer...


peakji commented Apr 14, 2023

Basaran v0.15.3 now officially supports LLaMA! Tested with Enoch/llama-7b-hf (Dockerfile) and zpn/llama-7b.

Theoretically, it should support any LLaMA-based model, such as Alpaca. Just keep in mind to avoid outdated models using legacy class names and configs, like decapoda-research/llama-7b-hf.
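For example, the library usage shown earlier in this thread should now work with one of the tested checkpoints:

from basaran.model import load_model

# Enoch/llama-7b-hf is one of the checkpoints reported as tested above.
model = load_model("Enoch/llama-7b-hf")
for choice in model("once upon a time"):
    print(choice)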

[screenshot of the Basaran playground]
