feature: unbuffered token stream #109
Comments
The go-gpt4all-j backend also supports an unbuffered token stream, so we should be (almost) all good, as 2 out of 3 backends support it so far.
Implemented for the llama.cpp backend!
This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!
LocalAI follows the OpenAI specs, so tokens are pushed via SSE (server-sent events) streams. It currently works already for …
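A minimal sketch of how a client might consume such a stream, assuming a LocalAI instance on `localhost:8080` exposing the OpenAI-compatible `/v1/chat/completions` endpoint (the model name below is just a placeholder):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed local endpoint and model name; adjust for your own setup.
	body := []byte(`{"model": "ggml-gpt4all-j", "messages": [{"role": "user", "content": "Hello"}], "stream": true}`)
	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// SSE: each event arrives as a "data: {...}" line; the stream ends with "data: [DONE]".
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		// Each chunk carries a partial completion (delta) per the OpenAI streaming spec.
		fmt.Println(payload)
	}
}
```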
I am getting memory issues with LocalAI on 16 GiB of RAM, even with the smaller chat model. This only happens with `"stream": true`. I guess this is related to buffering being enabled in the request? If I make a single POST request without `"stream": true`, the request works properly and I receive a proper answer. I am using GPT4All models.
Now this should be quite easy, at least for the llama.cpp backend: go-skynet/go-llama.cpp#28. Thanks to @noxer's contribution ( ❤️ ), it's now just a matter of wiring things up in the SSE callback here in the server. A rough sketch of that wiring is shown below.
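A minimal sketch of the wiring, assuming the backend exposes a per-token callback (the `predictWithCallback` name and signature below are hypothetical stand-ins, not the actual go-llama.cpp API): the callback pushes each generated token onto a channel, and the HTTP handler flushes it to the client as an SSE `data:` event instead of buffering the whole completion.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// predictWithCallback stands in for a backend call (e.g. go-llama.cpp after PR #28)
// that invokes cb once per generated token. Name and signature are hypothetical.
func predictWithCallback(prompt string, cb func(token string)) {
	for _, tok := range []string{"Hello", ",", " world", "!"} {
		cb(tok)
	}
}

func streamHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	tokens := make(chan string)
	go func() {
		defer close(tokens)
		predictWithCallback("Hello", func(token string) { tokens <- token })
	}()

	// Each token is flushed immediately as its own SSE event, so the client
	// sees output as it is generated rather than after the full response.
	for tok := range tokens {
		fmt.Fprintf(w, "data: %s\n\n", tok)
		flusher.Flush()
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
	flusher.Flush()
}

func main() {
	http.HandleFunc("/v1/completions", streamHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```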