This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Does this work for LLaMA models? #57

Closed
marcoripa96 opened this issue Mar 17, 2023 · 21 comments
Labels
enhancement New feature or request

Comments


marcoripa96 commented Mar 17, 2023

I want to stream the output of LLaMA models. Is that possible using this?


peakji commented Mar 17, 2023

The implementation of LLaMA has just been merged into HF transformers, but it seems that it did not make it into the latest release.

Basaran will support LLaMA as soon as it is available in HF transformers.


deep-diver commented Mar 21, 2023

I am building a Gradio app with LLaMA. May I borrow your token generator code?

I have checked that StreamModel works nicely with LLaMA, and my project is open source under the Apache license.


peakji commented Mar 21, 2023

> I am building a Gradio app with LLaMA. May I borrow your token generator code?

Yes, of course! Basaran is released under the MIT License.

> I have checked that StreamModel works nicely with LLaMA...

If that's the case, do you think it's necessary for us to release Basaran on PyPI for developers to use it as a library? This would be more maintainable than copying and pasting.

Also, which LLaMA model are you testing? Could you share the link to the model on the Hugging Face Hub so that we can test it as well? Thanks!

@deep-diver

I have fine-tuned the 30B version of Alpaca-LoRA! I am currently running a public demo:

https://notebooksf.jarvislabs.ai/43j3x9FSS8Tg0sqvMlDgKPo9vsoSTTKRsX4RIdC3tNd6qeQ6ktlA0tyWRAR3fe_l/

And here is the repo for this app: https://github.com/deep-diver/Alpaca-LoRA-Serve

Super, thanks! A separate package would be wonderful!!

@marcoripa96

I was about to ask for the same thing: a separate package with the streaming classes and helpers.

@deep-diver did you manage to make it work with Alpaca-LoRA? I tried (using Alpaca-LoRA 7B) by copying and pasting Basaran's classes and helpers, and I always ended up with an error caused by the attention mask tensor shape being different from what was required during inference.
Do you have the code where you tried that? Thanks a lot!

@deep-diver

Check this out. I will share it when it is done; some minor issues remain.

https://twitter.com/algo_diver/status/1638079375085305856


peakji commented Mar 21, 2023

Basaran is now available as a library on PyPI. To use it programmatically, install it with pip:

pip install basaran

Use the load_model function to load the specified model and generate streaming output by calling the model:

from basaran.model import load_model

model = load_model("user/repo")

for choice in model("once upon a time"):
    print(choice)

The examples directory contains examples of using Basaran as a library.

@marcoripa96

Any idea why this error happens?

[screenshot of the error traceback]

I'll leave the colab link here: https://colab.research.google.com/drive/1RBmL1tsAnKZhKoHkHkKiT07625gw-4Fg?usp=sharing


peakji commented Mar 22, 2023

@marcoripa96 The error seems to be related to the model itself. Can you run the model without using Basaran?

AFAIK, LLaMA support in HF Transformers is still in active development.

@deep-diver

Check out the updated demo: https://notebookse.jarvislabs.ai/BuOu_VbEuUHb09VEVHhfnFq4-PMhBRVCcfHBRCOrq7c4O9GI4dIGoidvNf76UsRL

I am curious how I can speed up inference.


peakji commented Mar 22, 2023

> I am curious how I can speed up inference.

@deep-diver Have you tried quantization? Basaran provides two quantization options, load_in_8bit (INT8) and half_precision (FP16), but I'm not sure if LLaMA's implementation supports them.
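For reference, a minimal sketch of how those two options might be passed when using Basaran as a library (assuming load_model accepts them as keyword arguments; "user/llama-repo" is a placeholder model name):

from basaran.model import load_model

# Assumption: load_model forwards these flags to the underlying loader.
model = load_model("user/llama-repo", load_in_8bit=True)       # INT8
# model = load_model("user/llama-repo", half_precision=True)   # FP16 instead

for choice in model("once upon a time"):
    print(choice)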

@deep-diver

I am using peft for 8-bit, since FP16 wouldn't let the 30B model fit into 40GB (which is the limit of VRAM that I have).

@deep-diver

Just curious: does StreamModel support all parameters in GenerationConfig?


peakji commented Mar 22, 2023

> I am using peft for 8-bit, since FP16 wouldn't let the 30B model fit into 40GB (which is the limit of VRAM that I have).

Then I guess this is probably the fastest speed possible under the current resource constraints.

> Just curious: does StreamModel support all parameters in GenerationConfig?

No. All model arguments are supported, but most generation options (like top_k) are not.
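To illustrate the distinction, a hedged sketch of a call using the OpenAI-style sampling options that Basaran mirrors (parameter names such as max_tokens, temperature, and top_p are assumed from the OpenAI completion API; top_k has no equivalent there):

from basaran.model import load_model

model = load_model("user/llama-repo")  # placeholder model name

# OpenAI-style sampling options are the supported surface;
# GenerationConfig-only options such as top_k are not.
for choice in model("once upon a time", max_tokens=64, temperature=0.7, top_p=0.9):
    print(choice)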

@deep-diver

Is that because of technical difficulties? Like it being hard to monkey-patch the current implementation of transformers?


peakji commented Mar 22, 2023

Most generation options require a reimplementation to support streaming output; simply copying code from transformers won't work.

Considering that Basaran's goal is to be compatible with the OpenAI API, the common arguments were prioritized for implementation.
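A minimal sketch (not Basaran's actual code) of why that is: an option like top-p has to be reimplemented inside the per-token decoding loop so that output can be yielded as it is generated. This uses a small placeholder model (gpt2) for illustration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def stream(prompt, max_tokens=32, top_p=0.9, temperature=1.0):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1] / temperature
        probs = torch.softmax(logits, dim=-1)
        # Nucleus (top-p) filtering has to be reapplied at every step.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p
        keep[0] = True  # always keep the most probable token
        filtered = torch.zeros_like(probs)
        filtered[sorted_idx[keep]] = sorted_probs[keep]
        next_id = torch.multinomial(filtered / filtered.sum(), 1)
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)
        yield tokenizer.decode(next_id)  # emit each token as it is sampled

for piece in stream("once upon a time"):
    print(piece, end="", flush=True)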

@deep-diver

Thanks for the clarification. I can see you have made a huge contribution to the open source world with this one. I really want to say thank you! :)

peakji added the enhancement (New feature or request) label Mar 28, 2023

lslslslslslslslslslsls commented Mar 30, 2023

I tried to host Alpaca-LoRA with Basaran. Basically, it works well. However, it fails to concatenate subwords to the original word when decoding. It seems the LLaMA tokenizer always outputs a space when decoding a single token, which breaks the whitespace-handling logic in streaming decoding: https://github.com/hyperonym/basaran/blob/master/basaran/tokenizer.py#L41
[screenshot of the streamed output with broken subword concatenation]

Is there any way to turn off the whitespace prefix, or to refine the whitespace-handling logic?
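A small repro sketch of the reported behavior (the checkpoint name is a placeholder; any SentencePiece-based LLaMA tokenizer should behave similarly), assuming the issue is the "▁" word-boundary marker turning into a leading space on isolated decodes:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
ids = tokenizer.encode("hello world", add_special_tokens=False)

# Decoding tokens one at a time can surface a leading space (from the
# SentencePiece "▁" marker) that a full decode would merge away.
print([tokenizer.decode([i]) for i in ids])
print(tokenizer.decode(ids))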


deep-diver commented Mar 30, 2023

Have a look at my repo: https://github.com/deep-diver/Alpaca-LoRA-Serve

I also run a couple of demos; you can find the info in the README.


peakji commented Mar 30, 2023

@oreo-yum Thanks for the feedback; the screenshot is very helpful!

In case there are last-minute changes, we plan to add full support only after the new version of Hugging Face Transformers, which includes the LLaMA implementation, is released.

In the meantime, we will study and modify the logic of StreamTokenizer. Don't worry, it's just another weird tokenizer...


peakji commented Apr 14, 2023

Basaran v0.15.3 now officially supports LLaMA! Tested with Enoch/llama-7b-hf (Dockerfile) and zpn/llama-7b.

Theoretically, it should support any LLaMA-based model, such as Alpaca. Just keep in mind to avoid outdated models using legacy class names and configs, like decapoda-research/llama-7b-hf.
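For example, the library usage shown earlier in this thread should now work with one of the tested checkpoints:

from basaran.model import load_model

# Enoch/llama-7b-hf is one of the checkpoints reported as tested above.
model = load_model("Enoch/llama-7b-hf")
for choice in model("once upon a time"):
    print(choice)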

[screenshot of the Basaran playground]
