feature: unbuffered token stream #109

Open · 2 of 4 tasks
mudler opened this issue Apr 27, 2023 · 6 comments
Labels: good first issue (Good for newcomers), up for grabs (Tickets that no-one is currently working on)

Comments

@mudler (Owner) commented Apr 27, 2023

Now this should be quite easy, at least for the llama.cpp backend: thanks to @noxer's contribution in go-skynet/go-llama.cpp#28 ( ❤️ ), it's now just a matter of wiring things up in the SSE callback here in the server (a rough sketch follows the task list below).

  • go-llama.cpp
  • gpt4all.cpp
  • gpt2.cpp
  • rwkv.cpp
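
For illustration only, here is a minimal, hypothetical sketch of that wiring: it assumes the per-token callback (SetTokenCallback) added in go-skynet/go-llama.cpp#28 and uses plain net/http for the SSE response; the model path, route, and port are placeholders, and the actual LocalAI server code may differ.

package main

import (
    "fmt"
    "log"
    "net/http"

    llama "github.com/go-skynet/go-llama.cpp"
)

// streamHandler pushes each generated token to the client as an SSE event
// instead of buffering the whole completion.
func streamHandler(l *llama.LLama) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/event-stream")
        w.Header().Set("Cache-Control", "no-cache")

        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }

        // Per-token callback (go-skynet/go-llama.cpp#28): emit each token as
        // soon as it is produced and flush it to the client.
        l.SetTokenCallback(func(token string) bool {
            fmt.Fprintf(w, "data: %s\n\n", token)
            flusher.Flush()
            return true // keep generating
        })

        if _, err := l.Predict(r.URL.Query().Get("prompt")); err != nil {
            fmt.Fprintf(w, "data: [ERROR] %v\n\n", err)
        }
        fmt.Fprint(w, "data: [DONE]\n\n")
        flusher.Flush()
    }
}

func main() {
    // Placeholder model path; in LocalAI the model comes from the model gallery/config.
    l, err := llama.New("ggml-model.bin", llama.SetContext(512))
    if err != nil {
        panic(err)
    }
    http.HandleFunc("/stream", streamHandler(l))
    log.Fatal(http.ListenAndServe(":8080", nil))
}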
@mudler added the good first issue and up for grabs labels on Apr 28, 2023
@mudler (Owner, Author) commented May 1, 2023

The go-gpt4all-j backend also has support for unbuffered token streams, so we should be (almost) all good, as 2 out of 3 backends support it so far.

@mudler (Owner, Author) commented May 2, 2023

Implemented for the llama.cpp backend!

@mudler mentioned this issue May 3, 2023
@apiad commented May 3, 2023

This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

@mudler (Owner, Author) commented May 6, 2023

> This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

LocalAI follows the OpenAI API specs, so tokens are pushed to the client via SSE (server-sent events) streams. It currently works for llama.cpp models such as Vicuna, Alpaca, WizardLM, and the like.
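
As an illustration only (the endpoint path, port, and model name below are assumptions, not something confirmed in this issue), a client could request a streamed completion and read the SSE chunks like this:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Example request body; the model name is a placeholder.
    body := []byte(`{"model": "ggml-gpt4all-j", "stream": true, "messages": [{"role": "user", "content": "Hello!"}]}`)

    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The response is an SSE stream: one "data: ..." line per chunk, terminated by "[DONE]".
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.HasPrefix(line, "data: ") {
            continue
        }
        payload := strings.TrimPrefix(line, "data: ")
        if payload == "[DONE]" {
            break
        }
        fmt.Println(payload) // raw JSON chunk carrying the next token delta
    }
}

Following the OpenAI streaming convention, each "data:" line carries a JSON chunk with the next token delta, and "[DONE]" marks the end of the stream.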

@EchedelleLR commented
I am getting memory issues with LocalAI on 16 GiB of RAM, even with the smaller chat model.

This only happens with "stream": true. I guess this is related to buffering being enabled in the request?

If I make a single POST request without "stream": true, the request works properly and I receive a proper answer.

I am using GPT4All models.

@localai-bot (Contributor) commented
⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

But... I can also be funny or helpful 😸 and I can generally give good tips, or point you to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support replying (yet)!

⚠️⚠️⚠️⚠️⚠️

To fix the issue, we need to update the llamaOpts variable with the correct options based on the given model options. Here's an updated version of the code that should work for all three backends:

import (
    "github.com/go-skynet/LocalAI/pkg/grpc/base"
    pb "github.com/go-skynet/LocalAI/pkg/grpc/proto"
    llama "github.com/go-skynet/go-llama.cpp"
)

// LLM wraps a go-llama.cpp model behind LocalAI's gRPC backend interface.
type LLM struct {
    base.Base
    llama *llama.LLama
}

// Load translates the gRPC ModelOptions into go-llama.cpp model options.
func (llm *LLM) Load(opts *pb.ModelOptions) error {
    llamaOpts := []llama.ModelOption{}
    if opts.ContextSize != 0 {
        llamaOpts = append(llamaOpts, llama.SetContext(int(opts.ContextSize)))
    }
    if opts.F16Memory {
        llamaOpts = append(llamaOpts, llama.EnableF16Memory)
    }
    if opts.Embeddings {
        llamaOpts = append(llamaOpts, llama.EnableEmbeddings)
    }
    if opts.NGPULayers != 0 {
        llamaOpts = append(llamaOpts, llama.SetGPULayers(int(opts.NGPULayers)))
    }
    // Add backend-specific options depending on the detected acceleration backend.
    switch backend := llama.GetBackend(); backend {
    case llama.BackendCUDA:
        if llama.IsNVIDIA() {
            llamaOpts = append(llamaOpts, llama.EnableTensorRT)
        }
    case llama.BackendOpenCL:
        llamaOpts = append(llamaOpts, llama.SetPlatform("opencl"))
    case llama.BackendVulkan:
        llamaOpts = append(llamaOpts, llama.SetPlatform("vulkan"))
    }
    llm.llama.SetOptions(llamaOpts...)
    return nil
}

This code checks the backend and adds any necessary options based on the given model options. It also uses the GetBackend() method to determine the current backend being used, which is needed for some of the llama options like EnableTensorRT.

Sources:
