Upgrade to RabbitMQ server 3.7 is broken #72
Comments
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story. The labels on this GitHub issue will be updated when the story is started.
Hi Benjamin. Thank you very much for a detailed report. There are at least two issues here:
We are definitely considering BPM and other ways to simplify this bosh release, as it has indeed grown more complex than we'd like, especially now with 3.6 and 3.7 packaged together. However, we prioritised shipping 3.7 in the bosh release since 3.6 will soon be deprecated.
Hi Michal, Indeed, I've blindly set the Moreover, as I had seen there is some upgrade-related code in the release, I thought the 3.6-to-3.7 in-place upgrade would be supported and smooth. (As I'm no RabbitMQ expert, I wasn't aware of the recommended blue-green way of upgrading a cluster either.) For my own use case, the solution is easy. I'm not running a production cluster, so I can wipe it out and rebuild it from scratch. But with one of my clients (relying on the PCF tile), we are not far from going to production. We'll need to think soon and carefully about a proper upgrade path! Benjamin
Hi, I've added a warning to the v239 release notes (this version added the 3.7 package). As for the tile, I sent you a message on Slack to discuss the details. Our immediate plan is for the tile to simply prevent you from doing what you just did manually (if you have 3.6, it should stay on 3.6, but there are new on-demand plans with 3.7 available).
#73 has been merged. I've edited the commit message to better reflect the problem (it's not related to upgrading to 3.7 - it's just the same if you remain on 3.6 - the symlink is still not created successfully without this fix). Thank you!
Base issue
Here is the listing of `/var/vcap/packages` after an upgrade to the release v240.0.0:
The `rabbitmq-server` link is pointing at the former package `e2bba8e...` from the deployment of the previous version 238 of the Bosh Release. Here you need to know that Bosh keeps the previous packages around in order to speed up any subsequent rollback.
And here is what `configure_rmq_version()` (from the `pre-start.bash` template) has created:
In `pre-start.bash`, the use of `ln` without removing any pre-existing `/var/vcap/packages/rabbitmq-server` (file or directory or link) is a classical tricky case:
Instead, any existing `/var/vcap/packages/rabbitmq-server` link should be removed first, and the `ln` invocation should not need the `-f` flag.
Further upgrade issue(s)
I tried to fix the Bosh Release with this code:
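(The snippet originally posted here is not reproduced. What follows is not that patch, only a minimal sketch of the remove-then-link approach described under "Base issue". The helper name `relink_package`, the scratch directory, and the `rabbitmq-server-3.6` / `rabbitmq-server-3.7` directory names are assumptions made for illustration; real Bosh package directories are named differently.)

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not taken from the release): point link $2 at target $1,
# removing any existing file or symlink at $2 first so that ln needs no -f.
relink_package() {
  target="$1"
  link="$2"
  # Refuse to clobber a real directory; a symlink to a directory is fine.
  if [ -d "$link" ] && [ ! -L "$link" ]; then
    echo "refusing to replace real directory: $link" >&2
    return 1
  fi
  rm -f "$link"
  ln -s "$target" "$link"
}

# Scratch directory standing in for /var/vcap/packages (assumption).
demo="$(mktemp -d)"
mkdir "$demo/rabbitmq-server-3.6" "$demo/rabbitmq-server-3.7"
ln -s "$demo/rabbitmq-server-3.6" "$demo/rabbitmq-server"

# Pitfall: the destination is a symlink to a directory, so ln follows it and
# creates the new link inside the old package directory, even with -f.
ln -sf "$demo/rabbitmq-server-3.7" "$demo/rabbitmq-server"
pitfall_target="$(readlink "$demo/rabbitmq-server")"

# Fix: remove the old link first, then create the new one.
relink_package "$demo/rabbitmq-server-3.7" "$demo/rabbitmq-server"
fixed_target="$(readlink "$demo/rabbitmq-server")"

echo "after ln -sf:  $pitfall_target"
echo "after relink:  $fixed_target"
```

After the plain `ln -sf`, the `rabbitmq-server` link still points at the old directory and a stray link appears inside it; after `relink_package`, the link points at the new directory.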
But then I hit an issue when actually upgrading the cluster. The canary node in my deployment fails at starting the `rabbitmq-server` job with the new RabbitMQ 3.7 binary.
To my understanding, `configure_rmq_version()` is called after `rabbitmq-config-vars.bash` has been loaded, because the actual engine version is defined there. The problem is that this should be done earlier, in order to ensure that no operations involving `rabbitmqctl` are made before the package link is properly created.
Currently, `run_rabbitmq_upgrade_preparation_shutdown_cluster` is called before `configure_rmq_version()`, and when `rabbitmq-upgrade-preparation` is run, it uses the `rabbitmqctl` from the previous deployment (because the link to the new one is not set yet).
Normally, a fresh new deployment should not even work, because `rabbitmq-upgrade-preparation` is not supposed to find any `/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl` since the `/var/vcap/packages/rabbitmq-server` link is not created yet. I didn't test this case though.
Anyway, I'll submit my work-in-progress patch and let you dig into the issue further.
Globally, the Bash scripts are too complicated to understand
This Bosh Release is hard to debug, especially the Bash scripts from the `rabbitmq-server` job templates: though they are individually well-written, and functions are individually properly named (which is obviously the result of good programming skills in the first place), the scripts are too complicated as a whole. They need to be refactored in order to simplify things. Currently, I'm still trying to figure out which awesome features of this Bosh Release could possibly lead to such tangled code.
As a return on experience with the Cassandra Bosh Release, we leveraged the move to BPM to drive a major cut into bloated Bash scripts. But the situation there was not even close to the amount of script lines we can see in these `rabbitmq-server` job templates.
The current issue I'm raising here really looks like a consequence of this complexity. Thus the remark.