
Offline control plane recover #10660

Merged

Conversation

yuha0
Contributor

@yuha0 yuha0 commented Nov 28, 2023

/kind bug

What this PR does / why we need it:

  1. bug fix:

    Add `ignore_unreachable: true` to the "Remove etcd data dir" task, so that the playbook does not fail with a node "unreachable" state when the broken etcd node is offline.

  2. documentation

    • The runbook steps are in two different paragraphs in two different sections. Combine them to make runbook steps clear.

    • At the bottom there's a suggestion:

      * If your new control plane nodes have new ip addresses you may have to change settings in various places.

      I find the wording a bit confusing, which makes it not very useful as a suggestion to users:

      • what does "may" mean? Do I need to update IPs or not?
      • what are these "various places" exactly? How do I know that I didn't miss any places?
        In fact, the suggestion was added in 48a1828 4 years ago, but a new task added 2 years ago, in ee0f1e9, automatically updates the API server args with the updated etcd node IP addresses. Please let me know if I am wrong, but from what I found in the repo, and based on my experience using the playbook last week, this suggestion seems to be unnecessary at this point.

Which issue(s) this PR fixes:

Fixes #10649

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes running `recover-control-plane.yml` with offline broken etcd nodes.

ignore_errors ignores errors that occur within the "file" module. However, when
the target node is offline, the playbook will still fail at this task
with a node "unreachable" state. Setting "ignore_unreachable: true" allows
the playbook to bypass offline nodes and proceed with the recovery
tasks on the remaining online nodes.
The suggestion was added in 48a1828 4
years ago, but a new task added 2 years ago, in
ee0f1e9, automatically updates the API
server args with the updated etcd node IP addresses. This suggestion is no
longer needed.
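
For illustration, here is a minimal sketch of the shape of the patched task after this change. The field values are taken from the discussion below and from standard Kubespray variables (e.g. etcd_data_dir); it is not a verbatim copy of the task in the repo:

- name: Remove etcd data dir
  file:
    path: "{{ etcd_data_dir }}"
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  ignore_errors: true       # tolerates failures inside the file module on reachable hosts
  ignore_unreachable: true  # added by this PR: skip broken nodes that cannot be contacted at all
  when: groups['broken_etcd']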
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2023

linux-foundation-easycla bot commented Nov 28, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 28, 2023
@k8s-ci-robot
Contributor

Welcome @yuha0!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 28, 2023
@k8s-ci-robot
Contributor

Hi @yuha0. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 28, 2023
@yankay
Member

yankay commented Dec 7, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 7, 2023
@VannTen
Contributor

VannTen commented Dec 11, 2023

Why do you add ignore_unreachable only to that task? I mean, in cases where the node is unreachable, the other task with ignore_errors would have the same problem.

@yuha0
Contributor Author

yuha0 commented Dec 14, 2023

Hi @VannTen thanks for looking into this.

I mean, in cases where the node is unreachable, the other task with ignore_errors would have the same problem.

Not exactly.

First of all, my understanding is that, in Ansible, with_items executes the task on each item in the list and returns only a single status code (it's up to the user to register the result and iterate over each item's return code). So in this case, whether ignore_errors/ignore_unreachable is needed depends on whether you want to tolerate all nodes in the group being unreachable or having task execution errors.
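
As a purely illustrative sketch of that behaviour (the task, the /tmp/example path, and the cleanup variable name are hypothetical, not from Kubespray), a registered loop result exposes one entry per item under results:

- name: Clean up a file on each broken node
  file:
    path: /tmp/example
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  register: cleanup
  ignore_errors: true
  ignore_unreachable: true  # so the loop can finish even if a broken node is offline

- name: Report the outcome of each loop iteration
  debug:
    msg: "{{ item.item }}: failed={{ item.failed | default(false) }}, unreachable={{ item.unreachable | default(false) }}"
  with_items: "{{ cleanup.results }}"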

When I ran the playbook a few weeks ago, I only needed to ignore unreachable on that one single task, because that particular task runs only on broken nodes and not on healthy ones:

delegate_to: "{{ item }}"
with_items: "{{ groups['broken_etcd'] }}"

However, other tasks run on healthy etcd nodes as well. For example, the task immediately after the one above is:

- name: Delete old certificates
  shell: "rm {{ etcd_cert_dir }}/*{{ item }}*"
  with_items: "{{ groups['broken_etcd'] }}"
  register: delete_old_cerificates
  ignore_errors: true
  when: groups['broken_etcd']

which removes the broken etcd nodes' certs from the healthy nodes (fwiw, I think this task would be nicer if implemented with the file module and state: absent rather than an rm shell command without -f, for idempotence, but that's not relevant to this issue/PR). I don't think leaving those old, unused cert files in that directory is harmful, but I do expect the task to execute successfully on the healthy etcd nodes -- if a healthy node is somehow unreachable, then the user has a bigger problem and this playbook won't be helpful at all.
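
For what it's worth, a rough sketch of that idempotent alternative (purely illustrative, not part of this PR; it reuses the same etcd_cert_dir and broken_etcd names, and the task names are made up) could look like:

- name: Find certificates belonging to broken etcd nodes
  find:
    paths: "{{ etcd_cert_dir }}"
    patterns: "*{{ item }}*"
  with_items: "{{ groups['broken_etcd'] }}"
  register: old_certs
  when: groups['broken_etcd']

- name: Delete old certificates
  file:
    path: "{{ item.path }}"
    state: absent
  with_items: "{{ old_certs.results | selectattr('files', 'defined') | map(attribute='files') | flatten }}"
  when: groups['broken_etcd']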

Here's a small POC:

inventory.ini:

[healthy_etcd]
a_reachable_node
[broken_etcd]
an_unreachable_node

task.yaml:

- hosts: all
  tasks:
  - name: remove file with file module
    ignore_errors: true
    ignore_unreachable: true
    delegate_to: "{{ item }}"
    with_items: "{{ groups['broken_etcd'] }}"
    file:
      path: /tmp/test
      state: absent
    when:
    - groups['broken_etcd']
  - name: remove file with shell module
    ignore_errors: true
    shell: "rm -f /tmp/test"
    with_items: "{{ groups['broken_etcd'] }}"
    when:
    - groups['broken_etcd']
  - name: post removal
    debug:
      msg: "previous tasks have been passed/skipped without problem"

This play only works when ignore_unreachable is true in the first task.
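
For reference, assuming the two files above are saved as inventory.ini and task.yaml and the host names resolve, the POC can be run with a standard invocation:

ansible-playbook -i inventory.ini task.yaml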

@VannTen
Contributor

VannTen commented Jan 11, 2024

I see, that makes sense, at least enough sense for me
/lgtm
/assign @floryut @yankay
(for approval)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
Member

@floryut floryut left a comment


I see, that makes sense, at least enough sense for me /lgtm /assign @floryut @yankay (for approval)

lgtm for me 👍

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, yuha0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit 0e971a3 into kubernetes-sigs:master Jan 22, 2024
63 checks passed
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
* ignore_unreachable for etcd dir cleanup

ignore_errors ignores errors that occur within the "file" module. However, when
the target node is offline, the playbook will still fail at this task
with a node "unreachable" state. Setting "ignore_unreachable: true" allows
the playbook to bypass offline nodes and proceed with the recovery
tasks on the remaining online nodes.

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

The suggestion was added in 48a1828 4
years ago, but a new task added 2 years ago, in
ee0f1e9, automatically updates the API
server args with the updated etcd node IP addresses. This suggestion is no
longer needed.

Successfully merging this pull request may close these issues.

recover-control-plane.yml does not work when the broken node is offline
5 participants