Optimizations #19
Hey, thanks for doing this... Could you open a PR, and I can have a look? I believe doing this would resolve #1 as well, and allow CPU/MacOS folks to use it, so it would be a great contribution!
It should also resolve #7!
Hi, I think this is a misunderstanding: I did not implement anything, I just tried to use the --use_kv_cache arg provided in the
Ah, sorry for the confusion... So, the fastest inference is the default we've currently got, NOT "vanilla"... The current default is flash decoding (ref: https://crfm.stanford.edu/2023/10/12/flashdecoding.html). Are you running this on an NVIDIA GPU? We do have implementations for non-flash-attention-based KV caching (e.g. if you don't have an NVIDIA GPU), which is what should be getting utilized when you try to change
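For context, a minimal sketch of what non-flash KV caching does at each decode step; this is illustrative PyTorch, not the repo's implementation, and the function name and tensor shapes are assumptions.

```python
# Minimal sketch of non-flash KV caching for a single attention layer (illustrative only).
import torch
import torch.nn.functional as F

def decode_step(q, k_new, v_new, cache):
    """One autoregressive decode step.

    q, k_new, v_new: (batch, n_heads, 1, head_dim) for the newest token only.
    cache: dict holding the keys/values of all previous tokens, or empty on the first step.
    """
    if "k" in cache:
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)  # grow the cache along the sequence dim
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    else:
        cache["k"], cache["v"] = k_new, v_new

    # The single query is the last position, so it may attend to the whole cache;
    # no causal mask is needed within this step.
    out = F.scaled_dot_product_attention(q, cache["k"], cache["v"])
    return out, cache
```

Flash decoding computes the same attention result, but parallelises the reduction over the cached keys/values so each decode step stays fast even at long contexts.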
No problem, thank you for the fast responses! :)
Yes, on a 4090. What is the fastest inference possible? It takes around 8 seconds for me to generate this text as speech:
For text LLMs, a 1B model can be as fast as 300 tokens/second with flash attention; is it possible to optimize this model for low latency too? Thank you!
In fact, it always takes around 8 seconds, no matter the input text length.
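One rough way to check throughput is to time a single generation call and divide by the number of tokens produced. `generate_fn` below is a stand-in for whatever entry point is being called, not this repo's API; it is assumed to return the number of generated tokens.

```python
# Minimal throughput check (illustrative): time one generation call and report tokens/s.
import time

def tokens_per_second(generate_fn, text: str) -> float:
    start = time.perf_counter()
    n_tokens = generate_fn(text)          # hypothetical: returns the number of generated tokens
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.2f}s total, {n_tokens / elapsed:.1f} tokens/s")
    return n_tokens / elapsed
```

A wall-clock time that stays constant regardless of input length can be a sign that a fixed number of tokens is always being generated (for example, an end-of-text token that is never reached).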
100% - lots of optimisations to be done here! re your questions:
@SinanAkkoyun, out of interest, what are you building with TTS?
@vatsalaggarwal (I encountered hallucinations with "How?" and some other short prompts when cloning a different voice embedding; the EOT token is sometimes not generated, which allows hallucinations. But as soon as you release fine-tuning, I see no issue there.) Batching is nice, but I am building a real-time voice assistant for my visualization company (@sidroopdaska thanks for your interest); 11labs offers sub-200ms responses, which is necessary for long sentences, and which I would love to see with your model! :) I did not look all that much into the arch; is the following possible?

- Streaming the generated audio in small chunks, so time-to-first-audio stays low (see the sketch below)
- Quantising the model for faster inference

With those, the model should be able to be extremely low latency and extremely high quality. I will do my best to work with you on that.
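A minimal sketch of the kind of token streaming being asked about, assuming a standard autoregressive loop with a KV cache; the model interface here is hypothetical, not this repo's API.

```python
# Illustrative only: yield tokens from a causal LM as they are produced, so
# downstream stages can start working before generation finishes.
import torch

def stream_tokens(model, input_ids, max_new_tokens=512, eot_id=None):
    """Yield tokens from a causal LM one at a time.

    Hypothetical interface: model(ids, cache=...) returns (logits, updated_cache).
    """
    cache = None
    ids = input_ids                         # (batch=1, prompt_len)
    with torch.inference_mode():
        for _ in range(max_new_tokens):
            logits, cache = model(ids, cache=cache)
            next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy, for simplicity
            if eot_id is not None and next_id.item() == eot_id:
                break                       # stop at end-of-text
            yield next_id
            ids = next_id                   # with a KV cache, only the newest token is fed back
```

Each yielded token (or small chunk of tokens) could then be handed to the rest of the synthesis pipeline, so audio playback can start before generation finishes.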
I am excited to collaborate further; feel free to connect with me on Discord.
Very nice, thanks, will check it out when I get a sec! Are you able to share the reference you had an issue with? I might be able to help! Yeah, totally get your point about streaming latency. So, we've got 4 models stacked on top of each other:
1) causal LLM 1B - naturally streamable, but the current implementation is not faster than real-time, so as you suggested that would need to be improved.
2) non-causal LLM 15Mn - super fast, and we also have a streamable version we can push after we've finished testing it.
3) MultiBand Diffusion - supposed to support streaming by default, but needs to be played with to enable this.
4) DeepFilterNet - a super tiny model, shouldn't be a problem I think.

In terms of quantisation and the backbone of the architecture: it's roughly a GPT2 with some changes:
I've sent you the reference and output WAV by e-mail. Thank you for the support, but with fine-tuning this should be no problem anymore! Can you say when you will release the fine-tuning script? That sounds very promising! I am really looking forward to the streaming capabilities; those would be phenomenal given your speech quality! Then quantization would not be needed anymore and one could rely on precise bf16 outputs. Thank you very much! May I ask why you did not use the Llama2 architecture as a base (it utilizes SwiGLU and RMSNorm), and how long the training took for over 100k hours of speech?
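For reference, the two Llama2 components mentioned here are typically defined roughly as follows; this is an illustrative sketch, not code from this repo.

```python
# Reference sketches of RMSNorm and SwiGLU as used in Llama-style models (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalises by the root-mean-square of the features; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back down with W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```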
thanks! we're still trying to work it out as we have a few things to do... i think it's likely someone could get peft/lora working quicker than we'll be able to push this, but we'll try our best... (a rough sketch of that route is below)
awesome!
yeah, he had to swap components of llama2 in/out to make the various pieces work properly... will try to write about it! on training time, it depended on the number and type of GPUs, so that one is also hard to answer...
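The peft/lora route mentioned above could look roughly like the sketch below, using the Hugging Face peft library. GPT-2 is only a stand-in backbone (the model is described above as "roughly a GPT2 with some changes"), and the target module names would need to match whatever the actual 1B model defines.

```python
# Illustrative LoRA fine-tuning setup via Hugging Face `peft`.
# GPT-2 is a stand-in backbone; adapt the model and `target_modules` to the real 1B causal LM.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # rank of the low-rank adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused QKV projection; other backbones use different names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small adapter weights require gradients
```

Only the adapter matrices are trained, which is why this route tends to be quicker to get working than a full fine-tuning release.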
By the way @SinanAkkoyun, I sent you a DM request on Discord.
Great, thanks!
Personally, my wife and I just want to do free local TTS for Chrome text reading. Textbooks for school, mostly.
Hey! Thank you so, so much for this repo and the great work; this is what the world needs right now. I have been waiting for such a great foundation model for years!
When trying to use the vanilla KV cache (I suppose that's the fastest inference?), I get this error:
I would be super grateful for help, thanks!