
Load a backlog of messages on the cluster before upgrade #64

Closed
ferozjilla opened this issue Mar 26, 2020 · 7 comments
Comments

@ferozjilla
Contributor

ferozjilla commented Mar 26, 2020

Is your feature request related to a problem? Please describe.

At the moment, we do not load a backlog of messages in our cluster before we upgrade and test the results. This can be seen on these lines:

  • starting RabbitTestTool without any backlog here
  • running rollout restart on the StatefulSet as soon as consumers connect here

Loading a backlog of messages is useful because it exercises our logic that nodes which are critical to synchronisation must wait for the sync to complete before being rolled. Otherwise, messages may be lost.

Describe the solution you'd like

The solution has two parts:

  • working out a value to set for the backlog of messages
  • setting the backlog of messages

The size of the backlog

From the RabbitMQ memory docs, we know that paging starts at 50% of the memory high watermark (with the default paging ratio). The idea here is that paging adds further time to the synchronisation. This in turn increases the likelihood that nodes need to wait for the sync before they can be rolled. So, let's set the backlog above 50% of the memory high watermark to create this situation.

RabbitMQ internals and Maths 🤓

  • Work out the absolute value of the high watermark: vm_memory_high_watermark.absolute
  • Confirm the fraction of the high watermark at which paging starts by looking at vm_memory_high_watermark_paging_ratio (0.5 by default)
  • Calculate 70% of the absolute value of the high watermark (a quick sketch of this calculation follows the list).
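
For illustration, a minimal Python sketch of this calculation. It assumes the absolute high watermark has already been read from a node (for example, from rabbitmqctl status); the watermark value below is a placeholder, not a value from our cluster.

```python
# Minimal sketch: derive the backlog target from the node's memory limits.
# The watermark value is a placeholder; read the real value from the node first.
vm_memory_high_watermark_absolute = 1_000_000_000   # bytes (placeholder)
vm_memory_high_watermark_paging_ratio = 0.5         # RabbitMQ default: paging starts at 50% of the watermark

paging_threshold_bytes = vm_memory_high_watermark_absolute * vm_memory_high_watermark_paging_ratio
backlog_target_bytes = int(vm_memory_high_watermark_absolute * 0.7)  # 70% target, above the paging threshold

assert backlog_target_bytes > paging_threshold_bytes, "target must sit above the paging threshold"
print(f"paging starts at ~{paging_threshold_bytes / 1024**2:.0f} MiB, "
      f"backlog target ~{backlog_target_bytes / 1024**2:.0f} MiB")
```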

Setting the value

The RabbitTestTool has a flag to set the initial backlog (initialPublish perhaps), and the topology file also includes the size of each message. Set this combination such that (number of messages) * (size of a message) is about 70% of the high watermark, the value calculated in the previous section.
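
A hedged sketch of that sizing, continuing the placeholder numbers from above; the initialPublish name is only the guess made above and should be checked against the RabbitTestTool documentation.

```python
# Minimal sketch: turn the byte target into an initial message count.
# The actual flag name for the initial backlog (initialPublish?) needs verifying,
# and message_size_bytes must match the size set in the topology file.
backlog_target_bytes = 700_000_000   # 70% of a hypothetical 1 GB high watermark
message_size_bytes = 16              # per-message payload size from the topology file

initial_publish = backlog_target_bytes // message_size_bytes
print(f"initial backlog: ~{initial_publish:,} messages "
      f"(~{initial_publish * message_size_bytes / 1024**2:.0f} MiB of payload)")
```

Note that this payload-only sizing underestimates the real memory footprint; as the later comments show, the per-message metadata (>= 720 bytes) dominates when the payload is only 16 bytes.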

At this point, we have a backlog, and messages are being paged to disk.

@ferozjilla ferozjilla added the upgrades Any work related to upgrades label Mar 26, 2020
@ferozjilla ferozjilla added this to To do in RabbitMQ Cluster Kubernetes Operator via automation Mar 26, 2020
@ferozjilla ferozjilla self-assigned this Apr 29, 2020
@Zerpet Zerpet self-assigned this Apr 29, 2020
@Zerpet
Collaborator

Zerpet commented Apr 29, 2020

I'm having a bit of a 🤯 here. The initial backlog is set to 100k, the message size is set to 16 bytes, and we have 4 queues, 2 mirrored (with ha-all) and 2 quorum, as of here:

https://github.com/pivotal/rabbitmq-for-kubernetes-upgrades/blob/bdbde65ac75c41c3b34e592fe8f69d4b9339fe78/topologies/direct-safe.json#L12-L14

Therefore each node should have 100k messages per queue (leader or mirror), times 16 bytes each, therefore:

100,000 x 4 x 16 = 6,400,000 bytes
6,400,000 bytes / 1024 / 1024 ≈ 6.1 MB

However, I observe the node memory going up to ~600 MB 🤯 Moreover, the memory report from a node shows ~89 MB for the quorum queues and ~100-ish MB for the mirrors. These figures change over time as the backlog is being drained. Still, what the 🤯

@ferozjilla
Contributor Author

ferozjilla commented Apr 29, 2020

We could look at the Erlang Grafana dashboard made by the Core team to see where the memory is being used. Admittedly, the maths is an oversimplification, since it does not account for how Erlang uses memory.

Also, Gerhard's TGIR: RMQ ate my RAM

@mkuratczyk
Collaborator

Definitely reach out to our friends - there are known rough edges, especially with quorum queues, so this could well be one of them (known or not yet known).

@Zerpet
Collaborator

Zerpet commented Apr 29, 2020

Context

We reached out to the Core team with our analysis and expectations. We deployed Prometheus and Grafana in dev2-bunny and could not observe anything outstanding or obvious that explains the behaviour. We are waiting for the Core team to provide some insight into the memory utilisation.

We did a rolling restart of a 3-node RMQ cluster with 1.5M ready messages of size 16 bytes on each node, using 3 classic mirrored queues. We observed that nodes 1 and 2 rolled fairly quickly and pushed the queue masters to node 0. Subsequently, node 0 became mirror-sync critical and stayed in the Terminating state for some time, until the other two nodes finished synchronising the queues.

The problem was made worse by the memory usage being close to the high memory watermark (one node was OOM-killed), and the synchronisation took a relatively long time (> 5 minutes). Even though RabbitMQ was not unavailable per se, since we were able to connect to it, the queues were effectively "unavailable" because they were synchronising for a very long time.

Conclusions

  • Our preStop hook is working as intended
  • It's not wise to roll out the cluster when the memory usage is close to the high memory watermark

@Zerpet
Collaborator

Zerpet commented Apr 29, 2020

And the answer to the mystery is in RabbitMQ docs:

  • Payload: >= 1 byte, variable size, typically few hundred bytes to a few hundred kilobytes
  • Protocol attributes: >= 0 bytes, variable size, contains headers, priority, timestamp, reply to, etc.
  • RabbitMQ metadata: >= 720 bytes, variable size, contains exchange, routing keys, message properties, persistence, redelivery status, etc.
  • RabbitMQ message ordering structure: 16 bytes

If we consider minimum values for metadata and attributes, we get a message size of 736 bytes. If we instead estimate 1024 bytes of metadata, the message size would be 1040 bytes. Multiplied by 1.5M messages, this is roughly 1 GB and 1.4 GB respectively.
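
For reference, a quick arithmetic check of those figures, using the 16-byte payload from the topology and the metadata values quoted above:

```python
# Quick check of the per-message size and total backlog estimates above.
payload = 16               # bytes, as configured in the topology
metadata_min = 720         # bytes, the documented minimum RabbitMQ metadata per message
metadata_estimate = 1024   # bytes, the rougher estimate used above
messages = 1_500_000       # ready messages per node in the test

low = messages * (payload + metadata_min)        # 736 bytes per message
high = messages * (payload + metadata_estimate)  # 1040 bytes per message
print(f"~{low / 1024**3:.2f} GiB to ~{high / 1024**3:.2f} GiB")  # roughly 1 GB and 1.4 GB
```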

@j4mcs j4mcs moved this from To do to In progress in RabbitMQ Cluster Kubernetes Operator Apr 30, 2020
@Zerpet Zerpet assigned Zerpet and unassigned ferozjilla Apr 30, 2020
@Zerpet
Collaborator

Zerpet commented May 1, 2020

Context

Tweaked the default values in the run-test.sh file in https://github.com/pivotal/rabbitmq-for-kubernetes-upgrades/commit/199e51d58cffbdb959d2296a194fc7fef3b69d71. This script is used mostly in the pipeline, so it makes sense to adapt it rather than the topology file. The topology file has no delay in consumer consumption, which is desirable in most cases, and an even value for the initial backlog. All these values can be tweaked via command-line arguments.

Using an initial backlog of 120,000 messages per queue of size 16 bytes, we are able to generate a load of ~70-80% of the high watermark (~800 MB). This link sheds light on how to calculate the total message size. We are using four queues, two of each type, quorum and mirrored. The publisher rate is set to 100 messages per second and the consumer processing time to 1 millisecond. With these restrictions, we are able to keep ready messages in the queues at all times for the test duration (120 seconds).

The unavailability period is still set to 30 seconds. Lower values feel too aggressive and may report false positives. We could consider testing with a 20-second threshold, although we should first explore how long a leader election or master relocation takes in our setup, to ensure we are not setting too tight a value.

The following screenshots show the memory available before hitting the high memory watermark, the number of ready messages, and the number of incoming/outgoing messages.

[Screenshots: memory and ready messages; incoming messages; outgoing messages]

@Zerpet
Collaborator

Zerpet commented May 5, 2020

Verified today that the pipeline is running well with an initial backlog of messages, according to the tool configuration. We have to let it run and generate some data to analyse whether there are any signs of data loss or unavailability.

@Zerpet Zerpet closed this as completed May 5, 2020
RabbitMQ Cluster Kubernetes Operator automation moved this from In progress to Done May 5, 2020