Make sure the master service are restarted always #7551
Conversation
The master services shouldn't be running at this point, but hopefully this will save us from some regressions in the future just in case.
PR #7556 fixes the underlying issue.
@ingvagabund: The following tests failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cherrypick release-3.9
@ingvagabund: once the present PR merges, I will cherry-pick it on top of release-3.9 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What is the difference between the master-config.yaml update in step 4 and step 5? The apiservers should never be started in v2 mode after the migration is performed (even if they are forced to restart in v3 later).
We need something like #7544 done as part of the migration to ensure that an apiserver never starts in v2 mode after a migration is run.
The API servers weren't supposed to be restarted yet; this commit is intended as a back-stop against a future regression (which shouldn't happen anyway).
Understood, but we also need to ensure that nothing else (host restart, running another playbook, etc.) can possibly start and run an apiserver in v2 mode post-migration.
Instead of stopping them, should we stop and mask them? Then unmask and start? Does the Ansible systemd module handle mask?
This is not possible if the playbook runs to completion, and an incomplete run would not be typical.
This is not a bad idea. We'd need to make sure we back up the systemd files though, because I believe we are already placing the unit files in /etc/systemd/system, so if we mask without backing up the service units, there will be no restarting ;)
Just catching up on things here. First, since etcd v2 to v3 migration must be complete prior to upgrading to 3.7, the only relevant playbook branches are release-3.7 and release-3.6; newer branches are out of date and lack @vrutkovs' fixes to forego the error-prone scaleup playbook. We should remove the migration tooling from the master branch. It looks like ansible-2.4 was the first to support masking services; we mask services elsewhere via the systemctl command, and we could do that here, and I think we should. I suspect some customers may have monitoring that attempts to restart the service, and we should protect against that. I also agree that using restarted for the final step is a good idea, though hopefully it's no longer relevant once we've masked the service.
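To make the masking idea concrete, here is a hedged sketch of what masking via the systemctl command could look like in a playbook. This is illustrative only, not code from the PR: the task names, the `openshift_service_type` variable, and the exact service unit names are assumptions.

```yaml
# Illustrative sketch only: mask the master services during migration so
# nothing (monitoring, a host restart, another playbook) can start an
# apiserver in v2 mode mid-migration. Uses `command: systemctl mask`
# rather than the systemd module's masking support, since the latter
# requires a newer Ansible than these branches target.
- name: Mask master services during migration
  command: systemctl mask {{ item }}
  with_items:
  - "{{ openshift_service_type }}-master-api"
  - "{{ openshift_service_type }}-master-controllers"

# After the migration completes, unmask so the services can be
# restarted with the etcd3 configuration.
- name: Unmask master services after migration
  command: systemctl unmask {{ item }}
  with_items:
  - "{{ openshift_service_type }}-master-api"
  - "{{ openshift_service_type }}-master-controllers"
```

Note that masking a service whose unit file lives in /etc/systemd/system is the case the comment above warns about: the mask symlink can shadow the real unit, so the unit files would need to be backed up first.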
OpenShift can be configured to restart the machine instead of the service, so masking services might be unnecessary. While this PR makes sense, it should be submitted against
Created PRs for the 3.6 and 3.7 branches. Generally these precautions would be unnecessary (as the playbook stops masters and doesn't call ...). Closing this PR as it's not required in master; the migrate code will soon be removed.
Due to #5367, the master services are restarted and running right after the etcd cluster is scaled up. The original assumption was that both master services were stopped, so that refactoring causes the master services not to be restarted here. As a result, master-config.yaml is configured to use the etcd3 backend, but the change is not reflected in the masters until they are restarted.
So here is what is going on:
- ... `/var/lib/etcd/member` is removed, ...
- the `restart master api` and `restart master controllers` handlers are run [1] (not right away but eventually), which restarts both master services, so they are running now
- ... `state: started` is not enough to restart them. The state needs to be set to `restarted` so the services are restarted and the updated configuration takes effect.
[1] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L111
[2] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_master/tasks/update_etcd_client_urls.yml#L6
[3] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L141
[4] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L150
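The `started` vs. `restarted` distinction above can be sketched as a one-line change to the final service task. This is an illustrative fragment, not the PR's diff; the task name and the `openshift_service_type` variable are assumptions.

```yaml
# Illustrative sketch of the fix: with `state: started`, Ansible is a
# no-op when the service is already running, so a master that was
# unexpectedly restarted earlier keeps serving with its stale etcd2
# configuration. `state: restarted` forces a restart, so the etcd3
# settings in master-config.yaml actually take effect.
- name: Restart master API so the etcd3 configuration is picked up
  service:
    name: "{{ openshift_service_type }}-master-api"
    state: restarted   # was: started
```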