
Make sure the master services are always restarted #7551

Conversation

@ingvagabund (Member) commented Mar 16, 2018

Due to #5367, the master services are restarted and running right after the etcd cluster is scaled up. The original assumption was that both master services would be stopped, so this refactoring change causes the master services not to be restarted. As a result, master-config.yaml is configured to use the etcd3 backend, but the change is not reflected on the masters until they are restarted.

So here is what is going on:

  1. the first etcd member is migrated
  2. all remaining members are cleansed (their /var/lib/etcd/member is removed, ...)
  3. the etcd cluster is scaled up (so all etcd members are on v3) [1]
  4. master-config.yaml is updated and the restart master api and restart master controllers handlers are run [1] (not right away, but eventually), which restarts both master services, so they are now running
  5. master-config.yaml is updated to use the etcd3 storage back-end [3]
  6. The master services are supposed to be restarted, but they are not because they are already running. So `state: started` is not enough to restart them. The state needs to be set to `restarted` so the services are restarted and the updated configuration takes effect.

[1] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L111
[2] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_master/tasks/update_etcd_client_urls.yml#L6
[3] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L141
[4] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L150
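The fix in step 6 boils down to changing the service task's `state`. A minimal sketch of the two variants (the task names and the `openshift_service_type` variable are illustrative here, not the exact playbook content):

```yaml
# "started" is a no-op when the service is already running, so the
# etcd3 storage backend written to master-config.yaml is never loaded.
- name: Start master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: started

# "restarted" always bounces the service, so the updated configuration
# takes effect even if the service is already running.
- name: Restart master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: restarted
```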

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 16, 2018
@michaelgugino (Contributor) left a comment


The master services shouldn't be running at this point, but hopefully this will save us from some regressions in the future just in case.

PR #7556 fixes the underlying issue.

@openshift-ci-robot commented Mar 16, 2018

@ingvagabund: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/gcp | 8dbab24 | link | /test gcp |
| ci/openshift-jenkins/system-containers | 8dbab24 | link | /test system-containers |
| ci/openshift-jenkins/extended_conformance_install_crio | 8dbab24 | link | /test crio |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ingvagabund (Member, Author):

/cherrypick release-3.9

@openshift-cherrypick-robot

@ingvagabund: once the present PR merges, I will cherry-pick it on top of release-3.9 in a new PR and assign it to you.

In response to this:

> /cherrypick release-3.9

@liggitt (Contributor) commented Mar 16, 2018

> 1. the first etcd member is migrated
> 2. all remaining members are cleansed (their /var/lib/etcd/member is removed, ...)
> 3. the etcd cluster is scaled up (so all etcd members are on v3) [1]
> 4. master-config.yaml is updated and the restart master api and restart master controllers handlers are run [1] (not right away, but eventually), which restarts both master services, so they are now running
> 5. master-config.yaml is updated to use the etcd3 storage back-end [3]
> 6. The master services are supposed to be restarted, but they are not because they are already running. So `state: started` is not enough to restart them. The state needs to be set to `restarted` so the services are restarted and the updated configuration takes effect.

What is the difference between the master-config.yaml update in step 4 and step 5? The apiservers should never be started in v2 mode after the migration is performed (even if they are forced to restart in v3 later).

@liggitt (Contributor) commented Mar 16, 2018

we need something like #7544 done as part of the migration to ensure that an api server never starts in v2 mode after a migration is run

@michaelgugino (Contributor):

> we need something like #7544 done as part of the migration to ensure that an api server never starts in v2 mode after a migration is run

The api servers weren't supposed to be restarted yet; this commit is intended to hopefully back-stop a regression in the future (which shouldn't happen anyway).

@liggitt (Contributor) commented Mar 16, 2018

> The api servers weren't supposed to be restarted yet; this commit is intended to hopefully back-stop a regression in the future (which shouldn't happen anyway).

Understood, but we also need to ensure that nothing else (host restart, running another playbook, etc) can possibly start and run an apiserver in v2 mode post migration.

@eparis (Member) commented Mar 16, 2018

Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

@michaelgugino (Contributor):

> Understood, but we also need to ensure that nothing else (host restart, running another playbook, etc) can possibly start and run an apiserver in v2 mode post migration.

This is not possible if the playbook runs to completion, and not completing would be very atypical.

> Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

This is not a bad idea. We'd need to make sure we back up the systemd files, though, because I believe we are already placing the unit files in /etc/systemd/system, so if we mask without backing up the service units, there will be no restarting ;)
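With Ansible >= 2.4 the systemd module does handle masking, so the stop-and-mask / unmask-and-start flow suggested above could look roughly like this (a sketch only; the service name and the `openshift_service_type` variable are illustrative, and note that `systemctl mask` refuses to mask a unit whose file already lives in /etc/systemd/system, which I believe is exactly the backup concern raised here):

```yaml
# Stop and mask the master services so nothing (a monitoring agent,
# a host reboot, another playbook) can start them mid-migration.
- name: Stop and mask master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: stopped
    masked: yes

# ... etcd migration and master-config.yaml update happen here ...

# Unmask and restart once the config points at the v3 backend.
- name: Unmask and restart master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    masked: no
    state: restarted
```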

@sdodson (Member) commented Mar 20, 2018

> Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

Just catching up on things here. First, since the etcd v2-to-v3 migration must be completed prior to upgrading to 3.7, the only relevant versions of the playbooks are the release-3.7 and release-3.6 branches; newer branches are out of date and lack @vrutkovs's fixes that forgo the error-prone scaleup playbook. We should remove the migration tooling from the master branch.

Looks like ansible-2.4 was the first release to support masking services. We mask services elsewhere via the systemctl command; we could do that here, and I think we should. I suspect some customers may have monitoring that attempts to restart the service, and we should protect against that. I also agree that using `restarted` for the final step is a good idea, though hopefully it is no longer relevant once we've masked the service.
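On the release-3.6/3.7 branches, which may run under Ansible older than 2.4 (where the systemd module has no `masked` option), masking via the systemctl command, as mentioned above, could be sketched like this (illustrative tasks, not the actual playbook content):

```yaml
# Mask the unit directly with systemctl (pre-2.4 systemd module
# has no "masked" parameter).
- name: Mask master API service
  command: systemctl mask {{ openshift_service_type }}-master-api

# Later, after the migration completes, undo it and restart.
- name: Unmask master API service
  command: systemctl unmask {{ openshift_service_type }}-master-api
```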

@vrutkovs (Member):

> Instead of stopping them, should we stop and mask them? Then unmask and start?

OpenShift can be configured to restart the machine instead of the service, so masking the services might be unnecessary.

While this PR makes sense, it should be submitted against the release-3.6 branch and cherry-picked to release-3.7. The etcd migration code will soon be removed from master (and probably from `release-3.9` too).

@vrutkovs (Member):

Created PRs for 3.6 and 3.7 branches

Generally these precautions would be unnecessary (as the playbook stops the masters and doesn't call scaleup anymore), but there might be a case where this happens.

Closing this PR as it's not required in master; the migration code will soon be removed.

@vrutkovs vrutkovs closed this Mar 27, 2018
@ingvagabund ingvagabund deleted the restart-masters-after-etcd-v3-backend-changed branch March 27, 2018 13:18
Labels
affects_3.6 affects_3.7 priority/P1 size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

8 participants