Does this work for LLaMA models? #57
The implementation of LLaMA has just been merged into HF Transformers, but it seems that it did not make it into the latest release. Basaran will support LLaMA as soon as it is available in HF Transformers.
I am building a Gradio app with LLaMA. Can I borrow your token generator code? I have checked
Yes, of course! Basaran is released under the MIT License.
If that's the case, do you think it's necessary for us to release Basaran on PyPI for developers to use it as a library? This would be more maintainable than copying and pasting. Also, which LLaMA model are you testing? Could you share the link to the model on the Hugging Face Hub so that we can test it as well? Thanks!
I have fine-tuned the 30B version of Alpaca-LoRA! I am currently running a public demo: https://notebooksf.jarvislabs.ai/43j3x9FSS8Tg0sqvMlDgKPo9vsoSTTKRsX4RIdC3tNd6qeQ6ktlA0tyWRAR3fe_l/ And here is the repo for this app. Super thanks! Yay, a separate package would be wonderful!!
I was about to ask the same thing: to provide a separate package with the streaming classes and helpers. @deep-diver did you manage to make it work with Alpaca-LoRA? I tried (using Alpaca-LoRA 7B) by copying and pasting Basaran's classes and helpers, and I always ended up with an error caused by the attention mask tensor having a different shape from what was required during inference.
Check this out. I will share when it is done. Some minor issues remain.
Basaran is now available as a library on PyPI. To use it programmatically, install it with `pip install basaran`. Use the `load_model` function to load a model:

```python
from basaran.model import load_model

model = load_model("user/repo")
for choice in model("once upon a time"):
    print(choice)
```

The examples directory contains examples of using Basaran as a library.
Any idea why this error happens? I'll leave the Colab link here: https://colab.research.google.com/drive/1RBmL1tsAnKZhKoHkHkKiT07625gw-4Fg?usp=sharing
@marcoripa96 The error seems to be related to the model itself. Can you run the model without using Basaran? AFAIK, LLaMA support in HF Transformers is still in active development.
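For reference, a minimal way to sanity-check the checkpoint outside Basaran with plain Transformers (the model ID below is a placeholder, not the one from the Colab):

```python
from transformers import pipeline

# Load the same checkpoint directly with Transformers; if this fails too,
# the problem is in the model or its configs rather than in Basaran.
pipe = pipeline("text-generation", model="user/repo")  # placeholder model ID
print(pipe("once upon a time", max_new_tokens=20))
```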
Check out the updated demo. I am curious, how can I boost the inference speed?
@deep-diver Have you tried quantization? Basaran provides two quantization options: half precision and 8-bit loading.
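For illustration, here is a rough sketch of what those two options do under the hood, written with plain Hugging Face Transformers (placeholder model ID; this is not Basaran's own configuration interface):

```python
import torch
from transformers import AutoModelForCausalLM

# Half precision (fp16): roughly halves memory compared to fp32.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "user/repo",                # placeholder model ID
    torch_dtype=torch.float16,
    device_map="auto",
)

# 8-bit loading via bitsandbytes: cuts memory roughly in half again.
model_int8 = AutoModelForCausalLM.from_pretrained(
    "user/repo",
    load_in_8bit=True,
    device_map="auto",
)
```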
I am using PEFT for 8-bit, since fp16 wouldn't let the 30B model fit into 40GB (which is the limit of VRAM that I have).
Just curious, does StreamModel support all parameters in GenerationConfig?
Then I guess this is probably the fastest speed possible under the current resource constraints.
No. All model arguments are supported, but most generation options (like ...) are not.
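For example, the common OpenAI-style arguments can be passed directly to the model call. A sketch, assuming OpenAI-compatible names (the exact supported set is an assumption, so check the README):

```python
from basaran.model import load_model

model = load_model("user/repo")  # placeholder model ID
# temperature, top_p, and max_tokens mirror OpenAI's naming.
for choice in model("once upon a time", temperature=0.7, top_p=0.9, max_tokens=64):
    print(choice)
```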
Is that because of some technical difficulty? Like it being hard to monkey-patch the current implementation?
Most generation options require a reimplementation to support streaming output; simply copying code from HF Transformers is not sufficient. Considering that Basaran's goal is to be compatible with the OpenAI API, the common arguments were prioritized for implementation.
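To make that concrete, here is a rough sketch (not Basaran's actual code) of what a streaming loop looks like: each generation option has to be re-applied at every step instead of once inside a single generate() call.

```python
import torch

def stream_sample(model, tokenizer, prompt, max_tokens=32, temperature=1.0):
    """Yield decoded tokens one by one instead of returning a full string.
    No KV cache is used here, for brevity."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_tokens):
        with torch.no_grad():
            # Logits for the next token only; temperature is applied per step.
            logits = model(ids).logits[:, -1, :] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        yield tokenizer.decode(next_id[0])
        if next_id.item() == tokenizer.eos_token_id:
            break
```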
Thanks for the clarification. I can see you made a huge contribution to the open source world with this one. I really want to say thank you! :)
I tried to host Alpaca-LoRA with Basaran. Basically, it works well. However, it fails to concatenate subwords to the original word when decoding. It seems the LLaMA tokenizer always outputs a space when decoding a single token, which breaks the whitespace-handling logic in streaming decoding: https://github.com/hyperonym/basaran/blob/master/basaran/tokenizer.py#L41 Is there any approach to turn off the whitespace prefix, or to refine the whitespace-handling logic?
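One common workaround (a rough sketch, not Basaran's actual implementation) is to decode the growing prefix of token IDs and emit only the new suffix, so the tokenizer itself resolves whitespace:

```python
def stream_decode(tokenizer, token_ids):
    """Yield text deltas by re-decoding the full prefix at each step."""
    text = ""
    for end in range(1, len(token_ids) + 1):
        decoded = tokenizer.decode(token_ids[:end])
        # Emit only what was appended by the newest token.
        yield decoded[len(text):]
        text = decoded
```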
Have a look at my repo. I also run a couple of demos. You can find the info in the README.
@oreo-yum Thanks for the feedback, the screenshot is very helpful! To guard against last-minute changes, we plan to add full support only after the new version of Hugging Face Transformers, which includes the LLaMA implementation, is released. Before that, we will study and modify the whitespace-handling logic in basaran/tokenizer.py.
Basaran v0.15.3 now officially supports LLaMA! Tested with Enoch/llama-7b-hf (Dockerfile) and zpn/llama-7b. Theoretically, it should support any LLaMA-based model, such as Alpaca. Just keep in mind to avoid outdated models that use legacy class names and configs, like decapoda-research/llama-7b-hf.
I want to stream the output of LLaMA models. Is that possible using this?
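Yes, that is exactly the library usage shown above. A minimal sketch combining it with one of the model IDs tested in the announcement:

```python
from basaran.model import load_model

# One of the checkpoints reported as tested above.
model = load_model("Enoch/llama-7b-hf")
for choice in model("once upon a time"):
    print(choice)
```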