Performance issues with consumers that have multiple filter subjects #4888

Closed
svenfoo opened this issue Dec 15, 2023 · 21 comments · Fixed by #5274
Labels: defect (Suspected defect such as a bug or regression), enhancement (Enhancement to existing functionality)

svenfoo (Contributor) commented Dec 15, 2023:

Observed behavior

We are experiencing performance issues on the NATS server when creating consumers with many (on the order of 100) filter subjects.

To give you an idea of the setup, here's the stream information:

Information for Stream beam-instance-config created 2023-11-17 07:17:47

             Subjects: config.beam-instance.>
             Replicas: 1
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: 1
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats
               Leader: nats-0

State:

             Messages: 8,307
                Bytes: 2.9 MiB
             FirstSeq: 419,820 @ 2023-12-13T14:28:25 UTC
              LastSeq: 438,109 @ 2023-12-15T14:25:07 UTC
     Deleted Messages: 9,983
     Active Consumers: 36
   Number of Subjects: 8,307

If we create the 36 consumers on that stream with a single filter subject such as config.beam-instance.*.> so that each consumer gets all the messages, performance is okay. As you can see above, the stream only has a single message per subject, and the consumers are using DeliverLastPerSubjectPolicy. So all in all, 8,307 messages are sent to each of the 36 consumers. I've taken a profile that covers the consumer subscription and the delivery of all messages:
profile004.svg


Now we would like to avoid sending all messages to all consumers. Each consumer actually only needs a subset of the messages, on average about 1,000. So we changed the consumer subscription so that each consumer subscribes using around 100 filter subjects. Each filter subject contains a wildcard and looks like this, for example: config.beam-instance.44742b24-9b67-11ee-800a-b445062a7e37.>. So instead of matching all UUIDs with a `*`, we select the messages for the roughly one hundred UUIDs that the consumer is actually interested in.
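
For illustration, here is a minimal sketch of how such a consumer can be created with the nats.go jetstream API; the stream name matches the one above, while the durable name, the concrete UUIDs, and the use of a pull consumer are placeholders rather than our exact production code:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// One filter subject per UUID this consumer cares about; in our setup
	// this list has roughly 100 entries. The UUIDs here are placeholders.
	filters := []string{
		"config.beam-instance.44742b24-9b67-11ee-800a-b445062a7e37.>",
		"config.beam-instance.00000000-0000-0000-0000-000000000001.>",
		// ... ~100 more
	}

	cons, err := js.CreateOrUpdateConsumer(ctx, "beam-instance-config", jetstream.ConsumerConfig{
		Durable:        "client-a", // placeholder name
		FilterSubjects: filters,
		DeliverPolicy:  jetstream.DeliverLastPerSubjectPolicy,
		AckPolicy:      jetstream.AckExplicitPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Receive only the latest message per matching subject.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		fmt.Println("received", msg.Subject())
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	time.Sleep(10 * time.Second)
}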

We expected this to improve performance and reduce the load on the NATS server that is leading this stream. Instead, we measured that performance degraded, and we see the NATS server struggling to keep up. According to the server logs, it is falling behind while handling the subscriptions:

Falling behind in health check, commit 486029 != applied 485993
JetStream is not current with the meta leader

I've also taken a profile of this situation, and it shows a very different picture:
profile003

As you can see, the server now uses a lot more CPU cycles, and a good deal of the time is spent in the function tokenizeSubjectIntoSlice(). This is what I am hoping to address with the changes proposed in #4886.

Expected behavior

We expected to see a performance win by doing more fine-grained subscriptions.

Server and client version

Server v2.10.7 using the 2.10.7-alpine image
Go client v1.31.0

Host environment

Linux 6.4
Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz

Steps to reproduce

No response

svenfoo added the defect label on Dec 15, 2023
derekcollison (Member) commented:

We can take a look, but in general that many filtered subjects might warrant a design review. Wildcards can be used to dramatically lessen the number of subjects needed.

svenfoo (Contributor, Author) commented Dec 16, 2023:

> We can take a look, but in general that many filtered subjects might warrant a design review. Wildcards can be used to dramatically lessen the number of subjects needed.

We are using a wildcard solution right now, but it leads to each consumer receiving far more messages than needed. The clients then look at the subject of each message they get and drop all messages they are not interested in. This works, but it creates more network traffic and client load than necessary. We would like to move the filtering to the server. The API allows us to do that, but the performance issues on the server side don't.

Maybe I'm missing something obvious, but I don't see how we can implement this with fewer filter subjects. We have a large number of UUIDs, and for each UUID there is a small number of messages:

config.<UUID>.abc
...
config.<UUID>.foo
config.<UUID>.bar

Client A needs the messages for a set of UUIDs, let's call it Set_A. This set has approx. 100 members. Client B needs a different set of messages, Set_B. The sets potentially overlap, though. If they were disjoint, we could use a scheme like

config.<client-id>.abc
...
config.<client-id>.foo
config.<client-id>.bar

But that is not the case: client B is also interested in some, but not all, of the UUIDs that client A is interested in. Putting the same messages on the stream multiple times (with different subjects) would blow up the stream. It would also mean that we could no longer change one message and have all interested clients receive an update.

So what we are doing right now is to have each client install a single consumer with a wildcard subject that matches all messages. What we are aiming for is to have each client install a single consumer that filters for the UUIDs that it needs to know about. We already tried having each client install a (single-filter-subject) consumer per UUID; that meant we ended up with about 100 consumers per client, which performed even worse.

Note that the stream info shown above is from a small test setup; the attached profiles were also taken from it. Even with that small setup I can already see the NATS server falling behind in the logs. Our actual production use case has more consumers (approx. 3,600) and more messages (approx. 30,000) on that stream, and we want to be able to support even larger setups.

derekcollison (Member) commented:

Designing subject spaces can be challenging.

The best results we have seen come from working closely with our partners and customers on the initial design of the system to achieve their goals. They bring the domain expertise and desired outcomes and the Synadia team brings their expertise in distributed systems and NATS.io tech stack.

svenfoo (Contributor, Author) commented Dec 22, 2023:

We have created a benchmark that makes it easy to reproduce the problem reported here:
https://github.com/holoplot/nats-bench

Jarema (Member) commented Dec 22, 2023:

@svenfoo Thanks! Will take a look at it.

mvrhov commented Jan 9, 2024:

We have a very similar problem to the one described above. Each of our servers is interested in 100-350 subjects (the subjects are random on the consumer, and the producer doesn't know where to send them). In total there are about 20,000 different subjects. CPU usage is pretty high with one KV, even though the plan was to use about 10 different Key-Value streams. The payload is relatively small; I think the median is around 2 KiB.
If nothing improves in this regard in the next few months, we'll go back to the drawing board and will probably not use NATS anymore. We'll reconsider and see what we can do to improve things further with gRPC and the Redis stack.

derekcollison (Member) commented Jan 9, 2024:

@mvrhov why are you using consumers with KV? KV Gets use direct get mechanisms that avoid consumers altogether. These are being used with millions of subjects.
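
For example, a minimal sketch of a KV read that goes through the direct get path and never creates a consumer (bucket and key names are placeholders; this assumes the newer nats.go jetstream API):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Bucket and key names are placeholders.
	kv, err := js.KeyValue(ctx, "instance-config")
	if err != nil {
		log.Fatal(err)
	}

	// A KV Get resolves to a direct get against the backing stream;
	// no consumer is created for this read.
	entry, err := kv.Get(ctx, "44742b24-9b67-11ee-800a-b445062a7e37")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("revision %d: %s\n", entry.Revision(), entry.Value())
}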

derekcollison (Member) commented Jan 9, 2024:

@svenfoo Can we switch to email? derek@synadia.com. Thanks.

mvrhov commented Jan 9, 2024:

> @mvrhov why are you using consumers with KV? KV Gets use direct get mechanisms that avoid consumers altogether. These are being used with millions of subjects.

The consumer needs to be notified of the specific KV changes that happen, so we are filtering on the key changes. And since consumers can crash, the consumer that takes over should be able to get the last value. (Well, we actually store the last two values per key, as they might be needed for debugging purposes.)

derekcollison (Member) commented:

Are you using KV watchers?
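
For reference, a minimal sketch of what a KV watcher looks like with the newer nats.go jetstream API; the bucket name and key pattern are placeholders. It replays the latest value per matching key (which covers the takeover case described above) and then keeps delivering subsequent changes on the same channel:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	// Bucket name is a placeholder.
	kv, err := js.KeyValue(ctx, "instance-config")
	if err != nil {
		log.Fatal(err)
	}

	// Watch a placeholder key pattern. The watcher first delivers the
	// latest value for each matching key, then streams later updates.
	w, err := kv.Watch(ctx, "config.*")
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for entry := range w.Updates() {
		if entry == nil {
			// All current values have been delivered; later changes would
			// keep arriving on the same channel. Stop here for the example.
			break
		}
		fmt.Printf("%s @ rev %d = %s\n", entry.Key(), entry.Revision(), entry.Value())
	}
}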

jnmoyne (Contributor) commented Jan 10, 2024:

@mvrhov can you expand on "CPU is pretty high with one KV" and on how your use case uses the subjects, e.g. how many KV watchers you are creating and what the update rate is?
Regardless, one thing to keep in mind is that you can easily scale your NATS cluster horizontally to spread the KV buckets (streams) across its servers.

mvrhov commented Jan 10, 2024:

We go under the hood and do a watch on the JetStream directly, as we need to update the list of keys we are interested in (via UpdateConsumer) when we start a new consumer.

subOpts := []nats.SubOpt{
	nats.BindStream(kvBucketNamePre + bucketName),
	nats.OrderedConsumer(),
	nats.DeliverNew(),
	nats.ConsumerFilterSubjects(emptySubject), // this is intentionally "empty", so that we don't subscribe to all messages in stream
}

sub, err := js.Subscribe("", kv.receive, subOpts...)
if err != nil {
	return nil, fmt.Errorf("kv subscribe: %w", err)
}

We did use Watch before, and the CPU usage was even worse.

We created only one KV bucket with about 15,000 keys. The update rate for those keys is once every 3-15 minutes. In total we wanted to have 4 KV buckets, 3 PubSub streams, and 2 Work Queues. Most of those are very spiky in nature. The number of keys/items in each of those is approx. 15,000, and it would grow by approx. 5,000 items per year. IMHO these are pretty small numbers and shouldn't cause such high CPU usage.

We don't want to scale this out, as everything works with gRPC and Redis. We'd like to change the architecture, but not at the expense of using more resources.

With Watch we had 41k subscribers for one KV; after switching to filtered subjects we have 40 subscribers for one KV.

With Watch: (screenshot attached)

With filtered subjects: (screenshot attached)

Edit: it's way better with filtered subjects. I had planned to test this further in the next 10 days, but then I stumbled across this issue, and it looks just like what we observed.

derekcollison (Member) commented:

The preferred method is KV watchers.

What client language and version?
What server version?

mvrhov commented Jan 10, 2024:

With KV watchers we used pkg.go.dev/github.com/nats-io/nats.go@v1.28.0 and NATS 2.10.1

Now with filtering it's pkg.go.dev/github.com/nats-io/nats.go@v1.21.0 and NATS 2.10.4

In all tests the Go version is always 1.21.x.

derekcollison (Member) commented:

Please upgrade to 2.10.7 (2.10.8 will be released today and we encourage all users to upgrade to that once it lands).

derekcollison (Member) commented:

Also, the latest nats.go is 1.31.0.

mvrhov commented Jan 10, 2024:

Will be done automatically over the next few days as we are doing a release soon.

mvrhov commented Jan 10, 2024:

Ah, it was a typo; of course we use 1.31 and not 1.21. I copy-pasted the string and changed only one number.

github-actions bot added the stale label on Mar 17, 2024
derekcollison self-assigned this on Mar 29, 2024
derekcollison added the enhancement label and removed the stale label on Mar 29, 2024
derekcollison added a commit that referenced this issue Apr 4, 2024
This change introduces a new LoadNextMsgMulti into the store layer. It
is passed a sublist that is maintained by the consumer. The store layer
matches potential messages across any positive match in the sublist.

Resolves: #4888 

Signed-off-by: Derek Collison <derek@nats.io>
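
Conceptually, the consumer-maintained sublist lets the store layer ask, for each candidate message, "does this subject match any of the consumer's filters?". A purely illustrative sketch of that check in Go follows; it is a naive linear scan for clarity, not the server's actual sublist implementation:

package main

import (
	"fmt"
	"strings"
)

// matchFilter reports whether a literal subject matches a single filter that
// may contain the NATS wildcards '*' (exactly one token) and '>' (one or more
// remaining tokens). Purely illustrative; not the server's code.
func matchFilter(subject, filter string) bool {
	st := strings.Split(subject, ".")
	ft := strings.Split(filter, ".")
	for i, f := range ft {
		if f == ">" {
			return i < len(st) // '>' must match at least one remaining token
		}
		if i >= len(st) {
			return false
		}
		if f != "*" && f != st[i] {
			return false
		}
	}
	return len(st) == len(ft)
}

// matchAny is the "any positive match" question that the sublist answers
// efficiently for a consumer with many filter subjects.
func matchAny(subject string, filters []string) bool {
	for _, f := range filters {
		if matchFilter(subject, f) {
			return true
		}
	}
	return false
}

func main() {
	filters := []string{
		"config.beam-instance.44742b24-9b67-11ee-800a-b445062a7e37.>",
		"config.beam-instance.ffffffff-0000-0000-0000-000000000000.>",
	}
	fmt.Println(matchAny("config.beam-instance.44742b24-9b67-11ee-800a-b445062a7e37.network", filters)) // true
	fmt.Println(matchAny("config.beam-instance.deadbeef-0000-0000-0000-000000000000.network", filters)) // false
}
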
nickchomey commented Apr 4, 2024:

For whatever it is worth, I am also planning to use a JS consumer with filter subjects to "watch" many KV subkeys, rather than use multiple KV watchers (though I'll benchmark both when I implement it).
I am making an SSE mechanism for Caddy + NATS, whereby browser clients create an SSE connection with Caddy, and the nats.go client in the Caddy plugin creates a single JS consumer with filter subjects and adds a subject for each new SSE user's ID to the filter (e.g. Users.{user id}.*). When there is a change, it sends the update to the relevant user's browser via the open SSE connection.

There will be many users in total, but fewer (on a relative basis) connected at any given time, so it doesn't make sense to watch the majority of users who are not connected (but whose keys will be changing based on things that others do). I can't think of any other way this could be modeled than by user ID that would still allow a single wildcard KV watch. And I figure it should be more performant to create a single JS consumer with filter subjects than a new KV watch for each user.
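
A rough sketch of the consumer update I have in mind when a user connects, using the nats.go jetstream API; the stream name, consumer name, and subject scheme here are placeholders, and it assumes the shared consumer already exists:

package main

import (
	"context"
	"fmt"

	"github.com/nats-io/nats.go/jetstream"
)

// addUserFilter extends the shared consumer's filter list when a new SSE user
// connects. "USERS", "sse-fanout", and the Users.{user id}.* scheme are
// placeholders for whatever the plugin ends up using.
func addUserFilter(ctx context.Context, js jetstream.JetStream, userID string) (jetstream.Consumer, error) {
	cons, err := js.Consumer(ctx, "USERS", "sse-fanout")
	if err != nil {
		return nil, err
	}
	info, err := cons.Info(ctx)
	if err != nil {
		return nil, err
	}

	cfg := info.Config
	cfg.FilterSubjects = append(cfg.FilterSubjects, fmt.Sprintf("Users.%s.*", userID))

	// Re-issuing the extended config updates the existing consumer in place.
	return js.CreateOrUpdateConsumer(ctx, "USERS", cfg)
}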

derekcollison (Member) commented:

We should add the ability to have a single KV watcher take multiple filters.

/cc @Jarema @piotrpio

Jarema (Member) commented Apr 5, 2024:

@nickchomey nats-io/nats-architecture-and-design#277
