
Upgrade to RabbitMQ server 3.7 is broken #72

Closed
bgandon opened this issue Mar 29, 2018 · 5 comments
@bgandon
Contributor

bgandon commented Mar 29, 2018

Base issue

Here is the listing of /var/vcap/packages after an upgrade to the release v240.0.0:

# ls -l
total 32
lrwxrwxrwx 1 root root 73 Mar 14 22:17 bosh-dns -> /var/vcap/data/packages/bosh-dns/67b977dca6e9b86cad54af73c08150b34e99d309
lrwxrwxrwx 1 root root 71 Mar 14 22:17 erlang -> /var/vcap/data/packages/erlang/1c7771c7774d4c7c97a1ba9a666b5ad5fb45a0c2
lrwxrwxrwx 1 root root 78 Mar 14 22:17 node_exporter -> /var/vcap/data/packages/node_exporter/923a6fbd61d30904b8ff3da59fdba3e57fc2743a
lrwxrwxrwx 1 root root 80 Mar 14 22:17 rabbitmq-common -> /var/vcap/data/packages/rabbitmq-common/30344ae448f136ceda4ac0fb595d561674514d9f
lrwxrwxrwx 1 root root 80 Mar 14 22:17 rabbitmq-server -> /var/vcap/data/packages/rabbitmq-server/e2bba8e813a3677de354efdedca9da94e4d12cb9
lrwxrwxrwx 1 root root 84 Mar 28 16:11 rabbitmq-server-3.6 -> /var/vcap/data/packages/rabbitmq-server-3.6/ccf881b215c3493c2f215d4c51af19e585e8ddc9
lrwxrwxrwx 1 root root 84 Mar 28 16:11 rabbitmq-server-3.7 -> /var/vcap/data/packages/rabbitmq-server-3.7/9d526adf0dd98198e4a1d73ca00cd1002473ff27
lrwxrwxrwx 1 root root 93 Mar 28 16:11 rabbitmq-upgrade-preparation -> /var/vcap/data/packages/rabbitmq-upgrade-preparation/61fc96b42e4be83a9065284a00608be67eb0cba7

The rabbitmq-server link still points at the former package e2bba8e... from the deployment of the previous version 238 of the Bosh Release. Note that Bosh keeps previous packages around in order to speed up any subsequent rollback.

And here is what configure_rmq_version() (from the pre-start.bash template) has created:

# ls -l /var/vcap/packages/rabbitmq-server/rabbitmq-server* 
lrwxrwxrwx 1 root root 42 Mar 28 16:11 /var/vcap/packages/rabbitmq-server/rabbitmq-server-3.7 -> /var/vcap/packages/rabbitmq-server-3.7

In pre-start.bash, using ln without first removing any pre-existing /var/vcap/packages/rabbitmq-server (file, directory, or link) is a classic pitfall:

configure_rmq_version() {
  ln -f -s /var/vcap/packages/rabbitmq-server-"$RMQ_SERVER_VERSION" /var/vcap/packages/rabbitmq-server
}

When the destination of ln -s is an existing symlink that resolves to a directory, ln dereferences it and creates the new link inside that directory, which is exactly what the listing above shows. Instead, any existing /var/vcap/packages/rabbitmq-server link should be removed first, and the ln invocation should then not even need the -f flag.
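
For illustration, here is a minimal, self-contained reproduction of that pitfall, using throwaway paths under /tmp rather than the real Bosh directories:

# Throwaway directories standing in for two versioned packages (illustration only)
mkdir -p /tmp/ln-demo/pkg-3.6 /tmp/ln-demo/pkg-3.7
cd /tmp/ln-demo

# First deploy: create the "current" link, pointing at the 3.6 package
ln -s /tmp/ln-demo/pkg-3.6 current

# Later deploy: repoint the link with `ln -f -s`, as pre-start.bash does.
# Because `current` is a symlink to a directory, ln dereferences it and
# creates the new link *inside* pkg-3.6 instead of replacing `current`.
ln -f -s /tmp/ln-demo/pkg-3.7 current

ls -l current pkg-3.6
# `current` still points at pkg-3.6, and pkg-3.6 now contains a stray pkg-3.7 link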

Further upgrade issue(s)

I tried to fix the Bosh Release with this code:

configure_rmq_version() {
  rm -rf /var/vcap/packages/rabbitmq-server
  ln -s /var/vcap/packages/rabbitmq-server-"$RMQ_SERVER_VERSION" /var/vcap/packages/rabbitmq-server
}
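
As an aside, another way to avoid the pitfall (assuming GNU ln, as shipped on Bosh stemcells) is the -n/--no-dereference flag, which keeps ln from descending into the existing symlink's target directory, so that -f replaces the link in place without a separate rm:

# Alternative sketch, assuming GNU coreutils ln:
# -n treats an existing symlink destination as the link itself, not as a directory to enter
ln -sfn /var/vcap/packages/rabbitmq-server-"$RMQ_SERVER_VERSION" /var/vcap/packages/rabbitmq-server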

But then I hit an issue when actually upgrading the cluster: the canary node in my deployment fails to start the rabbitmq-server job with the new RabbitMQ 3.7 binary.

To my understanding, configure_rmq_version() is called after rabbitmq-config-vars.bash has been sourced, because the actual server version is defined there. The problem is that the link should be created earlier, to ensure that no operation involving rabbitmqctl runs before the package link is properly in place.

Currently, run_rabbitmq_upgrade_preparation_shutdown_cluster is called before configure_rmq_version(), so when rabbitmq-upgrade-preparation runs, it uses the rabbitmqctl from the previous deployment (because the link to the new one has not been set yet).

On a fresh deployment, this should not even work, because rabbitmq-upgrade-preparation is not supposed to find any /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl while the /var/vcap/packages/rabbitmq-server link has not been created yet. I didn't test this case, though.
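
To illustrate the ordering I'm suggesting, here is a rough sketch; only the function and file names already mentioned in this issue are real, and the path and the surrounding steps of the actual pre-start.bash are assumed or elided:

# 1. Load the configuration, which defines RMQ_SERVER_VERSION
#    (path assumed; adjust to wherever the rabbitmq-config-vars.bash template is rendered)
source /var/vcap/jobs/rabbitmq-server/bin/rabbitmq-config-vars.bash

# 2. Point /var/vcap/packages/rabbitmq-server at the right versioned package
#    *before* anything tries to call rabbitmqctl through that link
configure_rmq_version

# 3. Only then run the upgrade preparation, which relies on
#    /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl
run_rabbitmq_upgrade_preparation_shutdown_cluster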

Anyway, I'll submit my work-in-progress patch and let you dig into the issue further.

Overall, the Bash scripts are too complicated to understand

This Bosh Release is hard to debug, especially the Bash scripts of the rabbitmq-server job templates. Though they are individually well written and the functions properly named (which is obviously the result of good programming skills in the first place), the scripts are too complicated as a whole. They need to be refactored in order to simplify things. I'm still trying to figure out which awesome features of this Bosh Release could possibly lead to such tangled code.

To share some experience from the Cassandra Bosh Release: there, we leveraged the move to BPM to drive a major cut into bloated Bash scripts, and even that situation was nowhere near the number of script lines we see in these rabbitmq-server job templates.

The issue I'm raising here really looks like a consequence of this complexity, hence the remark.

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@mkuratczyk
Contributor

Hi Benjamin. Thank you very much for the detailed report. There are at least two issues here:

  1. The link is not set correctly. We'll definitely investigate that. In our testing, when upgrading to v240, the link was set as expected, but in your situation it clearly was not. We'll try to understand what caused this different behaviour.

  2. Canary startup failure. This is most likely caused by RabbitMQ's lack of support for in-place 3.6-to-3.7 upgrades, which is the very reason we decided to package both versions in the bosh release. Basically, your canary 3.7 node can't join the existing 3.6 cluster. There is a new version property which defaults to 3.6, so upgrading to v240 should not have caused any issues by itself. You ran into this because, as I understand it, you changed the link manually to 3.7. You can read more about the recommended way of migrating to 3.7 here: http://www.rabbitmq.com/blue-green-upgrade.html

We are definitely considering BPM and other ways to simplify this bosh release, as it has indeed grown more complex than we'd like it to be, especially now with 3.6 and 3.7 packaged together. However, we prioritised shipping 3.7 in the bosh release, since 3.6 will soon be deprecated.

@bgandon
Contributor Author

bgandon commented Mar 29, 2018

Hi Michal,

Indeed, I blindly set the version: "3.7" property in my deployment manifest. The reason is that there was no warning about this in the v240.0.0 or v239.0.0 release notes. On the contrary, the note about the upcoming sunset of version 3.6 led me to upgrade as soon as I could.

Moreover, as I had seen some upgrade-related code in the release, I thought a 3.6-to-3.7 in-place upgrade would be supported and smooth. (As I'm no RabbitMQ expert, I wasn't aware of the recommended blue-green way of upgrading a cluster either.)

For my own use case, the solution is easy: I'm not running a production cluster, so I can wipe it out and rebuild it from scratch. But with one of my clients (relying on the PCF tile), we are not far from going to production. We'll need to think soon and carefully about a proper upgrade path!

Benjamin

@mkuratczyk
Contributor

Hi, I've added a warning to the v239 release notes (that is the version which added the 3.7 package).

As for the tile, I sent you a message on Slack to discuss the details. Our immediate plan is for the tile to simply prevent you from doing what you just did manually (if you have 3.6, it should stay on 3.6 but there are new on-demand plans with 3.7 available).

@mkuratczyk
Contributor

#73 has been merged. I've edited the commit message to better reflect the problem (it's not related to upgrading to 3.7; the same thing happens if you remain on 3.6: the symlink is still not created successfully without this fix). Thank you!
