[BUG] RabbitMQ fails on upgrade when 2 nodes are specified that are not clustered. #1979

Closed
przemyslavic opened this issue Jan 18, 2021 · 3 comments

@przemyslavic
Collaborator

przemyslavic commented Jan 18, 2021

Describe the bug
The RabbitMQ upgrade process fails during the upgrade to v3.8.9.
The failing configuration has at least 2 non-clustered nodes.

How to reproduce
Steps to reproduce the behavior:

  1. Deploy a 0.8 cluster with the RabbitMQ component enabled (at least 2 VMs)
  2. Upgrade the cluster to the develop branch.

Expected behavior
The cluster has been upgraded successfully.

Config files

---
kind: epiphany-cluster
name: default
provider: <provider>
specification:
  components:
    rabbitmq:
      count: 2
---
kind: configuration/rabbitmq
title: "RabbitMQ"
provider: <provider>
name: default
specification:
  rabbitmq_plugins:
    - rabbitmq_management_agent
    - rabbitmq_management
  cluster:
    is_clustered: false

Environment

  • Cloud provider: [all]
  • OS: [all]

Additional context
The upgrade process fails on TASK [upgrade : RabbitMQ | Join a node to the cluster].

2021-01-17T00:53:40.4385876Z 00:53:40 INFO cli.engine.ansible.AnsibleCommand - TASK [upgrade : RabbitMQ | Join a node to the cluster] *************************
2021-01-17T00:53:40.7005828Z 00:53:40 INFO cli.engine.ansible.AnsibleCommand - skipping: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2021-01-17T00:53:42.1028562Z 00:53:42 ERROR cli.engine.ansible.AnsibleCommand - fatal: [ec2-yy-yy-yy-yy.eu-west-3.compute.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["rabbitmqctl", "join_cluster", "rabbit@ec2-xx-xx-xx-xx"], "delta": "0:00:00.541081", "end": "2021-01-17 00:53:41.997904", "msg": "non-zero return code", "rc": 69, "start": "2021-01-17 00:53:41.456823", "stderr": "Error: unable to perform an operation on node 'rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com'. Please see diagnostics information and suggestions below.\n\nMost common reasons for this are:\n\n * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)\n * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)\n * Target node is not running\n\nIn addition to the diagnostics info below:\n\n * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more\n * Consult server logs on node rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com\n * If target node is configured to use long node names, don't forget to use --longnames with CLI tools\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com']\n\nrabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com:\n  * connected to epmd (port 4369) on ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com\n  * epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic \n  * TCP connection succeeded but Erlang distribution failed \n\n  * Authentication failed (rejected by the remote node), please check the Erlang cookie\n\n\nCurrent node details:\n * node name: 'rabbitmqcli-2076-rabbit@ec2-yy-yy-yy-yy.eu-west-3.compute.amazonaws.com'\n * effective user's home directory: /var/lib/rabbitmq\n * Erlang cookie hash: <hash>==", "stderr_lines": ["Error: unable to perform an operation on node 'rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com'. Please see diagnostics information and suggestions below.", "", "Most common reasons for this are:", "", " * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)", " * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)", " * Target node is not running", "", "In addition to the diagnostics info below:", "", " * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more", " * Consult server logs on node rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com", " * If target node is configured to use long node names, don't forget to use --longnames with CLI tools", "", "DIAGNOSTICS", "===========", "", "attempted to contact: ['rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com']", "", "rabbit@ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com:", "  * connected to epmd (port 4369) on ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com", "  * epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic ", "  * TCP connection succeeded but Erlang distribution failed ", "", "  * Authentication failed (rejected by the remote node), please check the Erlang cookie", "", "", "Current node details:", " * node name: 'rabbitmqcli-2076-rabbit@ec2-yy-yy-yy-yy.eu-west-3.compute.amazonaws.com'", " * effective user's home directory: /var/lib/rabbitmq", " * Erlang cookie hash: <hash>=="], "stdout": "Clustering node rabbit@ec2-yy-yy-yy-yy.eu-west-3.compute.amazonaws.com with rabbit@ec2-xx-xx-xx-xx", "stdout_lines": ["Clustering node rabbit@ec2-yy-yy-yy-yy.eu-west-3.compute.amazonaws.com with rabbit@ec2-xx-xx-xx-xx"]}
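
For triage, the failure signature in the log above can be recognized programmatically. The following is a minimal Python sketch, assuming the JSON part of the Ansible result has already been loaded into a dict. The helper name is_cookie_auth_failure is hypothetical, and matching on rc 69 plus the "Authentication failed" phrase is a heuristic taken from this one log, not a documented RabbitMQ contract.

```python
def is_cookie_auth_failure(result: dict) -> bool:
    """Return True when an Ansible command result resembles the failure in this issue."""
    # Heuristic only: the return code and the stderr phrase are copied from
    # the log in this issue and may differ between RabbitMQ versions.
    cmd = result.get("cmd", [])
    ran_join = cmd[:2] == ["rabbitmqctl", "join_cluster"]
    auth_rejected = "Authentication failed" in result.get("stderr", "")
    return ran_join and result.get("rc") == 69 and auth_rejected


# Abbreviated result, using only values that appear in the log.
sample = {
    "cmd": ["rabbitmqctl", "join_cluster", "rabbit@ec2-xx-xx-xx-xx"],
    "rc": 69,
    "stderr": "  * Authentication failed (rejected by the remote node), "
              "please check the Erlang cookie",
}
print(is_cookie_auth_failure(sample))  # prints True
```

With non-clustered nodes the two hosts are not expected to share an Erlang cookie, which would explain why the join attempt is rejected rather than timing out.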

The command rabbitmqctl join_cluster is run even though we are not creating a cluster. We need to check the specification and the value of the is_clustered parameter before running cluster-related tasks.
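
The proposed check can be sketched as follows. This is only an illustration in Python of the decision logic, assuming the configuration/rabbitmq specification has been parsed into a dict; the function name should_join_cluster and the choice to treat a missing is_clustered key as false are assumptions. In the upgrade role itself this would presumably be a `when:` condition on the task, and the actual fix (PR #1989) is not reproduced here.

```python
def should_join_cluster(rabbitmq_spec: dict, node_count: int) -> bool:
    """Decide whether the join_cluster step should run for this deployment."""
    # Assumption: a missing "cluster" section or "is_clustered" key means
    # the nodes are standalone.
    clustered = bool(rabbitmq_spec.get("cluster", {}).get("is_clustered", False))
    # Joining only makes sense when clustering is requested and there is
    # more than one node.
    return clustered and node_count > 1


# Specification from the "Config files" section of this issue.
spec = {
    "rabbitmq_plugins": ["rabbitmq_management_agent", "rabbitmq_management"],
    "cluster": {"is_clustered": False},
}
print(should_join_cluster(spec, node_count=2))  # prints False
```

For the configuration in this report the guard returns false, so the join step would be skipped on both nodes.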

@przemyslavic przemyslavic added this to the S20210128 milestone Jan 18, 2021
@przemyslavic przemyslavic added this to To do in All Work via automation Jan 18, 2021
@przemyslavic przemyslavic added this to Needs triage in Bugs via automation Jan 18, 2021
@atsikham
Contributor

Verified that the upgrade role does not check for the non-clustered option. Previously added this task as a dependency for backporting - #1933, #1934, #1935.

@atsikham atsikham self-assigned this Jan 19, 2021
@atsikham atsikham removed their assignment Jan 19, 2021
to-bar added a commit to to-bar/epiphany that referenced this issue Jan 19, 2021
to-bar added a commit that referenced this issue Jan 19, 2021
* Deprecate Elasticsearch OSS v6

* Add #1979 to known issues
@atsikham atsikham assigned atsikham and unassigned atsikham Jan 20, 2021
@przemyslavic przemyslavic self-assigned this Jan 26, 2021
@przemyslavic
Collaborator Author

Tested together with #1984.
✅ upgrade from 3.7.10 to 3.8.9
✅ upgrade from 3.8.3 to 3.8.9
✅ 1 RabbitMQ node
✅ 2 RabbitMQ nodes
✅ 2 RabbitMQ nodes clustered
Azure/AWS x Ubuntu/RHEL

@atsikham
Contributor

PR #1989

@mkyc mkyc closed this as completed Jan 26, 2021
Bugs automation moved this from Needs triage to Closed Jan 26, 2021
All Work automation moved this from To do to DoD Check Jan 26, 2021

3 participants