feature: unbuffered token stream #109

Open · 2 of 4 tasks
mudler opened this issue Apr 27, 2023 · 6 comments
Labels: good first issue (Good for newcomers), up for grabs (Tickets that no-one is currently working on)

Comments

@mudler (Owner) commented Apr 27, 2023

Now this should be quite easy, at least for the llama.cpp backend: thanks to @noxer's contribution in go-skynet/go-llama.cpp#28 ( ❤️ ), it's now just a matter of wiring things up in the SSE callback here in the server (a rough sketch follows the task list below).

  • go-llama.cpp
  • gpt4all.cpp
  • gpt2.cpp
  • rwkv.cpp
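
For illustration only, here is a minimal, hypothetical sketch of that wiring: it assumes the per-token callback (SetTokenCallback) added in go-skynet/go-llama.cpp#28 and uses plain net/http for the SSE response; the model path, route, and port are placeholders, and the actual LocalAI server code may differ.

package main

import (
    "fmt"
    "log"
    "net/http"

    llama "github.com/go-skynet/go-llama.cpp"
)

// streamHandler pushes each generated token to the client as an SSE event
// instead of buffering the whole completion.
func streamHandler(l *llama.LLama) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/event-stream")
        w.Header().Set("Cache-Control", "no-cache")

        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }

        // Per-token callback (go-skynet/go-llama.cpp#28): emit each token as
        // soon as it is produced and flush it to the client.
        l.SetTokenCallback(func(token string) bool {
            fmt.Fprintf(w, "data: %s\n\n", token)
            flusher.Flush()
            return true // keep generating
        })

        if _, err := l.Predict(r.URL.Query().Get("prompt")); err != nil {
            fmt.Fprintf(w, "data: [ERROR] %v\n\n", err)
        }
        fmt.Fprint(w, "data: [DONE]\n\n")
        flusher.Flush()
    }
}

func main() {
    // Placeholder model path; in LocalAI the model comes from the model gallery/config.
    l, err := llama.New("ggml-model.bin", llama.SetContext(512))
    if err != nil {
        panic(err)
    }
    http.HandleFunc("/stream", streamHandler(l))
    log.Fatal(http.ListenAndServe(":8080", nil))
}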
@mudler added the good first issue and up for grabs labels on Apr 28, 2023
@mudler (Owner, Author) commented May 1, 2023

The go-gpt4all-j backend also has support for unbuffered token streams, so we should be (almost) all good, as 2 out of 3 backends support it so far.

@mudler (Owner, Author) commented May 2, 2023

Implemented for the llama.cpp backend!

@mudler mentioned this issue May 3, 2023
@apiad commented May 3, 2023

This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

@mudler (Owner, Author) commented May 6, 2023

> This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

LocalAI follows the OpenAI API specs, so tokens are pushed to the client via SSE (server-sent events) streams. It currently works for llama.cpp models such as Vicuna, Alpaca, WizardLM, and the like.
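
As an illustration only (the endpoint path, port, and model name below are assumptions, not something confirmed in this issue), a client could request a streamed completion and read the SSE chunks like this:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Example request body; the model name is a placeholder.
    body := []byte(`{"model": "ggml-gpt4all-j", "stream": true, "messages": [{"role": "user", "content": "Hello!"}]}`)

    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The response is an SSE stream: one "data: ..." line per chunk, terminated by "[DONE]".
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.HasPrefix(line, "data: ") {
            continue
        }
        payload := strings.TrimPrefix(line, "data: ")
        if payload == "[DONE]" {
            break
        }
        fmt.Println(payload) // raw JSON chunk carrying the next token delta
    }
}

Following the OpenAI streaming convention, each "data:" line carries a JSON chunk with the next token delta, and "[DONE]" marks the end of the stream.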

@EchedelleLR commented
I am getting memory issues with LocalAI on 16 GiB of RAM, even with the smaller chat model.

This only happens with "stream": true. I guess this is related to buffering being enabled in the request?

If I make a single POST request without "stream": true, the request works properly and I receive a proper answer.

I am using GPT4All models.

@localai-bot (Contributor) commented
⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

But... I can also be funny or helpful 😸 and I can generally give good tips, or point you to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support replying (yet)!

⚠️⚠️⚠️⚠️⚠️

To fix the issue, we need to update the llamaOpts variable with the correct options based on the given model options. Here's an updated version of the code that should work for all three backends:

import (
    "github.com/go-skynet/LocalAI/pkg/grpc/base"
    pb "github.com/go-skynet/LocalAI/pkg/grpc/proto"
    llama "github.com/go-skynet/go-llama.cpp"
)

// LLM wraps a go-llama.cpp model behind LocalAI's gRPC backend interface.
type LLM struct {
    base.Base
    llama *llama.LLama
}

// Load translates the gRPC ModelOptions into go-llama.cpp model options.
func (llm *LLM) Load(opts *pb.ModelOptions) error {
    llamaOpts := []llama.ModelOption{}
    if opts.ContextSize != 0 {
        llamaOpts = append(llamaOpts, llama.SetContext(int(opts.ContextSize)))
    }
    if opts.F16Memory {
        llamaOpts = append(llamaOpts, llama.EnableF16Memory)
    }
    if opts.Embeddings {
        llamaOpts = append(llamaOpts, llama.EnableEmbeddings)
    }
    if opts.NGPULayers != 0 {
        llamaOpts = append(llamaOpts, llama.SetGPULayers(int(opts.NGPULayers)))
    }
    // Add backend-specific options depending on the detected acceleration backend.
    switch backend := llama.GetBackend(); backend {
    case llama.BackendCUDA:
        if llama.IsNVIDIA() {
            llamaOpts = append(llamaOpts, llama.EnableTensorRT)
        }
    case llama.BackendOpenCL:
        llamaOpts = append(llamaOpts, llama.SetPlatform("opencl"))
    case llama.BackendVulkan:
        llamaOpts = append(llamaOpts, llama.SetPlatform("vulkan"))
    }
    llm.llama.SetOptions(llamaOpts...)
    return nil
}

This code checks the backend and adds any necessary options based on the given model options. It also uses the GetBackend() method to determine the current backend being used, which is needed for some of the llama options like EnableTensorRT.

Sources:
