Slow clients slow down the whole broker #95

Closed
alexsporn opened this issue Sep 1, 2022 · 6 comments · Fixed by #97
Labels: discussion (Something to be discussed)

Comments

@alexsporn (Contributor) commented Sep 1, 2022

We are using the MQTT broker and publishing messages directly to all clients via the broker's Publish() func.
This func adds a new publish packet to the inlineMessages.pub buffered channel (size 1024), and the inlineClient() loop publishes those packets to all subscribed clients.
For each subscribed client this calls client.WritePacket(), which ultimately calls Write() on the client's writer.

If a single subscribed client is too slow, that client's write buffer fills up and the whole inlineClient() loop hangs until the buffer has space again (see awaitEmpty inside Write()). Shortly after, the inlineMessages.pub buffered channel fills up as well, and further calls to Publish() hang.

This means a single slow client (even one using QoS 0 with no guarantees of receiving packets) can make the whole broker wait indefinitely and not deliver any more packets to any client.

A possible workaround would be, instead of waiting for the buffer to be freed, to return a "client buffer full" error and skip sending the packet to that client. If the client is using QoS 1/2, the inflight message retry mechanism should re-deliver the message.
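To illustrate the idea, here is a toy sketch of a fail-fast write. TryWrite, ErrBufferFull, and the ring type are hypothetical stand-ins, not the broker's actual circular-buffer API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBufferFull is the proposed "client buffer full" error (name hypothetical).
var ErrBufferFull = errors.New("client buffer full")

// ring is a toy stand-in for the client's circular write buffer.
type ring struct {
	buf  []byte
	used int
}

// TryWrite fails fast when the buffer lacks space, instead of blocking
// (as awaitEmpty inside Write() does) until the reader drains it.
func (r *ring) TryWrite(p []byte) (int, error) {
	if r.used+len(p) > len(r.buf) {
		return 0, ErrBufferFull
	}
	r.used += copy(r.buf[r.used:], p)
	return len(p), nil
}

func main() {
	r := &ring{buf: make([]byte, 8)}
	for i := 0; i < 3; i++ {
		if _, err := r.TryWrite([]byte("abcde")); err != nil {
			// Skip this client; QoS 1/2 retry logic re-delivers later.
			fmt.Println("packet", i, "->", err)
			continue
		}
		fmt.Println("packet", i, "-> written")
	}
}
```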

What do you think? I can write a PR with these changes. Or do you have a better solution to this problem?

@mochi-co (Collaborator) commented Sep 2, 2022

Hi @alexsporn! This is very interesting - the possibility never occurred to me.

Currently I am inclined to think the best solution is the one you have described:

  1. If the buffer is full, then writing the message should fail and return an error to the embedding platform.
  2. The QOS of the inline-publisher is always 2 (exactly once), so we don't have to modify how this is handled.
  3. If the QOS of the receiving client subscription is 1/2, then the message should be added to the client's inflight messages queue.

Perhaps we should also make the buffer size for inline publish a value in server.Options. @alexsporn, what's the use case that triggered this?

In the meantime I have increased the buffer to 4096 in v1.3.2 👍🏻
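For illustration, exposing that size through server.Options could look roughly like this; the InlinePubBufferSize field and the surrounding types are hypothetical, not the actual API:

```go
package main

// Packet stands in for the broker's publish packet type (illustrative).
type Packet struct{ Payload []byte }

// Options sketches a configurable inline-publish buffer size; the
// InlinePubBufferSize field is hypothetical, not the real server.Options.
type Options struct {
	InlinePubBufferSize int // 0 falls back to the default
}

// Server holds the inline publish channel, as inlineMessages.pub does.
type Server struct {
	inlinePub chan Packet
}

func New(opts Options) *Server {
	size := opts.InlinePubBufferSize
	if size <= 0 {
		size = 4096 // the default mentioned above (raised in v1.3.2)
	}
	return &Server{inlinePub: make(chan Packet, size)}
}

func main() {
	s := New(Options{InlinePubBufferSize: 8192})
	_ = s
}
```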

@alexsporn (Contributor, Author)

Hi @mochi-co, thanks for looking into the issue.

We are using MQTT over WebSocket as a Pub/Sub mechanism to listen to messages processed by our node software.
We faced some issues on one of the nodes running the MQTT broker, which has a JavaScript client (QoS 0) that is always connected and receives all messages unfiltered (between 50 and 300 a second). Because this client slowed down the broker and blocked the Publish() function from enqueueing any more messages, the node itself slowed down and stopped processing messages.

Initially I thought it could be an issue in how we handle and publish the incoming messages, so I set out to reproduce the bug. Using a JavaScript client (https://github.com/mqttjs/MQTT.js), publishing about 2000 packets a second, and forcing the client to sleep between incoming messages to simulate slow processing of each packet, I could reproduce the MQTT broker lockup. Normally I'd say this would be no issue, but it can be used as a denial-of-service attack on public brokers.

With the proposed change, a slow QoS 0 client will no longer affect other connected clients or slow down the broker. As soon as the slow client frees up enough buffer space, it will start receiving messages again.

If the slow client is using QoS 1/2, this opens up another "attack vector" against the broker: with a long InflightTTL (it defaults to 24 hours), a couple of slow clients can quickly drive up the broker's memory usage, since all pending packets stay in the inflight messages queue.

I totally understand that QoS 1/2 give certain guarantees on how MQTT behaves, but a slow client should not influence the broker's performance. Maybe we need a maximum number of inflight messages per client?
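For instance, a capped per-client inflight store might look roughly like this; the max field and boolean Set are hypothetical, not the actual mochi-mqtt structures:

```go
package main

import (
	"fmt"
	"sync"
)

// Message stands in for a stored QoS 1/2 publish packet (illustrative).
type Message struct{ Payload []byte }

// Inflight sketches a per-client inflight store with a size cap, so a
// handful of slow clients cannot grow the broker's memory without bound.
type Inflight struct {
	sync.Mutex
	internal map[uint16]Message
	max      int // 0 = unlimited
}

// Set stores a message, refusing new entries once the cap is reached.
func (i *Inflight) Set(key uint16, msg Message) bool {
	i.Lock()
	defer i.Unlock()
	if _, exists := i.internal[key]; !exists && i.max > 0 && len(i.internal) >= i.max {
		return false // cap reached; caller must drop or reject the packet
	}
	i.internal[key] = msg
	return true
}

func main() {
	q := &Inflight{internal: map[uint16]Message{}, max: 2}
	for id := uint16(1); id <= 3; id++ {
		fmt.Printf("packet %d stored: %v\n", id, q.Set(id, Message{}))
	}
}
```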

What do you think?

@mochi-co (Collaborator) commented Sep 7, 2022

Hi @alexsporn, thanks for your comprehensive reply :) My apologies for not replying to this earlier; I have been very busy lately...

I absolutely agree with all of the issues you've highlighted here, and have been trying to think about the best way to handle this and ensure we don't create any unintended consequences.

I plan to look into it more thoroughly between now and the weekend if I get some time, but tentatively I think the correct (even expected) behaviour would be to drop the packet if the QOS is 0 and the client buffer is full, and otherwise to add it to the inflight queue. This should apply both to inline-message publishing by the embedding service and to messages a client publishes to the broker which are then delegated out to subscribing clients.
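As a toy sketch of that delegation step (hypothetical types; the real broker writes through its circular buffer rather than a channel):

```go
package main

import "fmt"

// packet and client are illustrative stand-ins for the broker's types.
type packet struct {
	ID  uint16
	Qos byte
}

type client struct {
	out      chan packet       // buffered write queue
	inflight map[uint16]packet // pending QoS 1/2 messages
}

// deliver drops QoS 0 packets when the client's buffer is full, and
// parks QoS 1/2 packets in the inflight queue for re-delivery.
func deliver(c *client, pk packet) {
	select {
	case c.out <- pk:
	default: // buffer full: the client is too slow
		if pk.Qos > 0 {
			c.inflight[pk.ID] = pk
			return
		}
		fmt.Printf("dropping QoS 0 packet %d\n", pk.ID)
	}
}

func main() {
	c := &client{out: make(chan packet, 1), inflight: map[uint16]packet{}}
	deliver(c, packet{ID: 1, Qos: 0}) // fills the buffer
	deliver(c, packet{ID: 2, Qos: 0}) // dropped
	deliver(c, packet{ID: 3, Qos: 1}) // parked inflight
	fmt.Println("inflight size:", len(c.inflight))
}
```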

A brief review of the code suggests that writing to clients is blocking (inasmuch as we wait to write to the client's buffer if it's full). This makes me suspect that a client publishing to a topic with many subscribers could theoretically block until all clients are iterated, which is not ideal. I will have a think about how we might alleviate this bottleneck.
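One possible direction, sketched below with hypothetical types, is to give each client a dedicated writer goroutine draining its own queue, so the fan-out loop only ever pays for an enqueue rather than a socket write:

```go
package main

import (
	"fmt"
	"time"
)

// subscriber is an illustrative stand-in for a connected client: a
// dedicated goroutine drains its queue so slow sockets never block
// the fan-out loop. This is one possible direction, not the broker's
// actual design.
type subscriber struct {
	id    string
	queue chan []byte
}

func newSubscriber(id string) *subscriber {
	s := &subscriber{id: id, queue: make(chan []byte, 64)}
	go func() {
		for pkt := range s.queue {
			time.Sleep(10 * time.Millisecond) // stand-in for a slow network write
			fmt.Printf("%s <- %d bytes\n", s.id, len(pkt))
		}
	}()
	return s
}

func main() {
	subs := []*subscriber{newSubscriber("a"), newSubscriber("b")}
	for _, s := range subs {
		s.queue <- []byte("hello") // fan-out returns immediately
	}
	time.Sleep(50 * time.Millisecond) // demo only: let the writers drain
}
```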

@mochi-co (Collaborator)

@alexsporn I merged your recent PR, can you try pulling down master and seeing if the problem still exists? :) Thank you!

@mochi-co (Collaborator)

@alexsporn I've reverted #97 and reopened this issue as the solution for #97 causes the broker to stall (as per #101) under heavy load. I believe this may be related to the broker dropping acks if the queue is full rather than waiting.

@mochi-co (Collaborator)

This issue has been resolved in v2.0.0
