[BUG] Error upgrading AWS/RHEL/calico cluster from 0.6 to 0.7 #1461

Closed
przemyslavic opened this issue Jul 14, 2020 · 5 comments · Fixed by #1472

@przemyslavic
Collaborator

przemyslavic commented Jul 14, 2020

Describe the bug
The cluster cannot be upgraded from version 0.6 to version 0.7 in the following configuration: AWS/RHEL/calico.
Ansible fails on task [kubernetes_master : Check if NetworkManager service is loaded].

To Reproduce
Steps to reproduce the behavior:

  1. execute epicli apply -f test.yml using epiphanyplatform/epicli:0.6.0 docker image (configuration given below)
  2. execute epicli upgrade -b /path/to/build/directory/ (from develop branch)

Expected behavior
The cluster has been upgraded without errors.

Config files
Configuration that should be included in the yaml file:

---
kind: configuration/kubernetes-master
name: default
provider: aws
specification:
  advanced:
    networking:
      plugin: calico

OS (please complete the following information):

  • OS: [e.g. RHEL7]

Cloud Environment (please complete the following information):

  • Cloud Provider [AWS]

Additional context
Log:

2020-07-13T19:55:34.5656419Z 19:55:34 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Apply network plugin configured by user] *************
2020-07-13T19:55:34.7674392Z 19:55:34 INFO cli.engine.ansible.AnsibleCommand - included: /shared/build/06todevawrhcalico/ansible/roles/kubernetes_master/tasks/./cni-plugins/calico.yml for ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com
2020-07-13T19:55:34.8351576Z 19:55:34 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-13T19:55:34.8358102Z 19:55:34 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Check if /etc/NetworkManager/conf.d exists] **********
2020-07-13T19:55:36.2312123Z 19:55:36 INFO cli.engine.ansible.AnsibleCommand - ok: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-13T19:55:36.3470696Z 19:55:36 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-13T19:55:36.3471764Z 19:55:36 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Copy calico.conf to /etc/NetworkManager/conf.d] ******
2020-07-13T19:55:39.1072062Z 19:55:39 INFO cli.engine.ansible.AnsibleCommand - changed: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-13T19:55:39.1818824Z 19:55:39 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-13T19:55:39.1823770Z 19:55:39 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Check if NetworkManager service is loaded] ***********
2020-07-13T19:55:40.5262268Z 19:55:40 INFO cli.engine.ansible.AnsibleCommand - fatal: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]: FAILED! => {"changed": false, "cmd": "set -o pipefail && systemctl list-units --type=service --state=loaded | grep -q NetworkManager.service", "delta": "0:00:00.027016", "end": "2020-07-13 19:55:40.309699", "msg": "non-zero return code", "rc": 141, "start": "2020-07-13 19:55:40.282683", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

The command set -o pipefail && systemctl list-units --type=service --state=loaded | grep -q NetworkManager.service exits with a non-zero return code (rc 141).

@przemyslavic
Collaborator Author

I performed additional tests. Not every build in this configuration fails, so the failure is intermittent.
Log of successful run:

2020-07-14T10:52:15.4736599Z 10:52:15 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Apply network plugin configured by user] *************
2020-07-14T10:52:15.5438077Z 10:52:15 INFO cli.engine.ansible.AnsibleCommand - included: /shared/build/06todevawrhcalico/ansible/roles/kubernetes_master/tasks/./cni-plugins/calico.yml for ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com
2020-07-14T10:52:15.5781389Z 10:52:15 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:15.5789711Z 10:52:15 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Check if /etc/NetworkManager/conf.d exists] **********
2020-07-14T10:52:16.9333872Z 10:52:16 INFO cli.engine.ansible.AnsibleCommand - ok: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-14T10:52:16.9770122Z 10:52:16 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:16.9780119Z 10:52:16 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Copy calico.conf to /etc/NetworkManager/conf.d] ******
2020-07-14T10:52:19.6510874Z 10:52:19 INFO cli.engine.ansible.AnsibleCommand - changed: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-14T10:52:19.7295285Z 10:52:19 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:19.7296236Z 10:52:19 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Check if NetworkManager service is loaded] ***********
2020-07-14T10:52:21.0531788Z 10:52:21 INFO cli.engine.ansible.AnsibleCommand - ok: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-14T10:52:21.0776623Z 10:52:21 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:21.0777798Z 10:52:21 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Reload NetworkManager service] ***********************
2020-07-14T10:52:23.1329992Z 10:52:23 INFO cli.engine.ansible.AnsibleCommand - changed: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-14T10:52:23.1536247Z 10:52:23 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:23.1537234Z 10:52:23 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Create calico deployment] ****************************
2020-07-14T10:52:25.7732369Z 10:52:25 INFO cli.engine.ansible.AnsibleCommand - changed: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]
2020-07-14T10:52:25.8042662Z 10:52:25 INFO cli.engine.ansible.AnsibleCommand - 
2020-07-14T10:52:25.8043988Z 10:52:25 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Apply calico definition] *****************************
2020-07-14T10:52:27.9514499Z 10:52:27 INFO cli.engine.ansible.AnsibleCommand - changed: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]

It is possible that adding a retry to this task will solve the problem.

@rafzei
Contributor

rafzei commented Jul 14, 2020

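Hmm, what I see there is return code 141, which means a pipe error (128 + 13, i.e. the process was killed by SIGPIPE).
Long story short: grep is too fast or systemctl is too slow :)
Long story long: it looks like grep -q exits with rc 0 as soon as it has found a match, while systemctl is still writing its output to the pipe. systemctl then receives SIGPIPE, and with pipefail set its status becomes the status of the whole pipeline.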
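
The race described above can be reproduced outside Ansible. The following is a minimal bash sketch, not the failing task itself: `yes` and `head -n 1` are stand-ins chosen for illustration (an endless writer and a reader that quits early), and the function names are hypothetical. With the default SIGPIPE disposition the pipeline status is typically 141; if the calling environment ignores SIGPIPE, the writer fails with a different non-zero status instead.

```shell
#!/usr/bin/env bash
# Sketch only: `yes` and `head` are stand-ins, not the commands from the failing task.

# Print the exit status of a pipeline whose reader quits before the writer is done.
pipeline_status() {
  local rc=0
  (
    # The subshell keeps pipefail from leaking into the caller.
    set -o pipefail
    # `yes` plays the writer that is still producing output (systemctl list-units);
    # `head -n 1` plays the reader that exits early (grep -q).
    yes NetworkManager.service | head -n 1 >/dev/null
  ) || rc=$?
  echo "$rc"
}

# Map an exit status above 128 back to a signal name using the shell's `kill -l`.
signal_for_status() {
  kill -l "$1"
}

status=$(pipeline_status 2>/dev/null)
echo "pipeline exit status: $status"
echo "128 + 13 = $((128 + 13)), signal for 141: SIG$(signal_for_status 141)"
```

Without `set -o pipefail` the same pipeline would report only the reader's status (0), which is why the failure shows up only in tasks that enable it.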

@przemyslavic
Collaborator Author

przemyslavic commented Jul 15, 2020

Or we can run the command without the pipe and without grep, and then check its output.

[ec2-user@ec2-15-188-239-115 ~]$ systemctl list-units --type=service --state=loaded NetworkManager.service
UNIT                   LOAD   ACTIVE SUB     DESCRIPTION
NetworkManager.service loaded active running Network Manager

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

[ec2-user@ec2-15-188-239-115 ~]$ systemctl list-units --type=service --state=loaded test.service
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
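
One way to check the captured output, sketched in bash under a few assumptions: `unit_listed` is a hypothetical helper (not part of epicli), and it assumes the unit name is the first whitespace-separated column of its row, as in the output above. Rows that systemd prefixes with a status marker would need extra handling.

```shell
#!/usr/bin/env bash
# Sketch only: unit_listed is a hypothetical helper, not part of epicli.

# Succeed if the captured `systemctl list-units` output has a row whose
# first column is exactly the given unit name.
unit_listed() {
  local output=$1 unit=$2
  # awk reads its whole input before exiting, and the input is a here-string
  # rather than a running process, so no writer is left to receive SIGPIPE.
  awk -v unit="$unit" '$1 == unit { found = 1 } END { exit !found }' <<<"$output"
}

# Intended use on a host with systemd (not executed here):
#   output=$(systemctl list-units --type=service --state=loaded NetworkManager.service)
#   unit_listed "$output" NetworkManager.service && echo "loaded"
```

Because the systemctl output is fully captured before it is inspected, the result no longer depends on which process finishes first.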

@rafzei rafzei self-assigned this Jul 15, 2020
@rafzei rafzei linked a pull request Jul 16, 2020 that will close this issue
@rafzei
Contributor

rafzei commented Jul 16, 2020

We don't need to check whether the NM service is loaded in order to reload it. We have to check the service state: if it is running, reload the configuration; if it is not running, skip reloading.
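
As a rough bash illustration of that logic (not the change made in the linked pull request, whose contents are not shown here): `reload_if_running` is a hypothetical helper, the second parameter exists only so the command can be swapped out for testing, and the sketch assumes the unit supports `systemctl reload`.

```shell
#!/usr/bin/env bash
# Sketch only: reload_if_running is a hypothetical helper, not the fix from the PR.

# Reload a unit only when it is currently active; otherwise skip it.
# $1: unit name; $2: control command (defaults to systemctl, overridable for tests).
reload_if_running() {
  local unit=$1 ctl=${2:-systemctl}
  # `systemctl is-active --quiet` prints nothing and returns 0 only if the unit is active.
  if "$ctl" is-active --quiet "$unit"; then
    "$ctl" reload "$unit"
  else
    echo "skipping reload: $unit is not running"
  fi
}

# Intended use on a host with systemd (not executed here):
#   reload_if_running NetworkManager.service
```

No pipe is involved, so the check cannot fail with rc 141 regardless of how long systemctl takes.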

@rafzei rafzei reopened this Jul 17, 2020
@przemyslavic przemyslavic self-assigned this Jul 17, 2020
@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
@przemyslavic
Collaborator Author

The cluster has been upgraded several times in the specified configuration, and the reported problem was not observed after the fix.

@toszo toszo closed this as completed Jul 21, 2020