Usage Features #863
Conversation
…ly that is broken
That's looking good! Thanks again @dave-gray101 for your great contributions! I'll have a look at the details later, but on a quick glance it looks OK to me. Just a style note: maybe let's call it feature flags instead of Chicken? It's probably more idiomatic. Re remote backends: LocalAI has the […]. So, for instance, both specifying a remote URL and specifying a file work, and the backend is managed by LocalAI:
To try it out, you can, for instance, after running […]. Back to your open question: to support remote backends, I guess later we could make the backend report metrics via gRPC, and stop assuming that we are getting stats from the main process.
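For context, registering a backend either by remote address or by local path presumably looks something like the following. This is a hedged sketch: the `--external-grpc-backends` flag and the backend names here are assumptions for illustration, not verified against this PR.

```shell
# Register an external gRPC backend running on a remote host
# (assumed "name:address" syntax)
local-ai --external-grpc-backends "remote-llama:192.168.1.10:50051"

# Or point LocalAI at a local backend script, in which case
# LocalAI itself manages the backend process
local-ai --external-grpc-backends "my-backend:/usr/share/local-ai/backends/my-backend.py"
```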
…tokens are ints, not strings, so fix this
… a hack, but worth testing if it makes horizontal scaling simpler and easier
This is looking good overall! Just a few nits that shouldn't be blockers, as they could be follow-ups. The real one is testing this on a GPU, as it supersedes #896, which I was holding off merging for the same reason (manual QA). @dave-gray101, would you mind me merging #906 in the meantime? It might conflict with your PR, but it is actually just about running […]. If you can't test on GPU, I'll give it a shot, if not today then tomorrow at the latest.
Hey @mudler! From my testing, CUDA seems to work fine with this branch. Granted... my test criteria consist of "normal output comes out, and the process manager shows GPU load".
"The Usage Feature PR"
This PR implements a few semi-related features I've been working on. While they aren't all linked together exactly, they lay the groundwork for some of the multi-node, multi-model scaling work I'm interested in continuing to prototype. List of features (as best I remember):
- `/backend/monitor` — returns memory usage and status for a backend.
- `--preload-backend-only` — used to skip starting up the LocalAI server, and only start the preloaded gRPC backends.
- Backend locking moved from the `api` to the backends themselves — handled by the base golang gRPC server, but external / python backends are not locked until we know we need to.