
Make sure the master services are always restarted #7551

Conversation

@ingvagabund (Member) commented Mar 16, 2018

Due to #5367, the master services are restarted and running right after the etcd cluster is scaled up. The original assumption was that both master services would be stopped, so this refactoring change causes the master services not to be restarted. As a result, master-config.yaml is configured to use the etcd3 backend, but the change is not reflected on the masters until they are restarted.

So here is what is going on:

  1. the first etcd member is migrated
  2. all remaining members are cleansed (their /var/lib/etcd/member is removed, ...)
  3. the etcd cluster is scaled up (so all etcd members are on v3) [1]
  4. master-config.yaml is updated and the restart master api and restart master controllers handlers are run [1] (not right away, but eventually), which restarts both master services, so they are now running
  5. master-config.yaml is updated to use the etcd3 storage back-end [3]
  6. The master services are supposed to be restarted, but they are not because they are already running. So `state: started` is not enough to restart them. The state needs to be set to `restarted` so the services are restarted and the updated configuration takes effect.

[1] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L111
[2] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_master/tasks/update_etcd_client_urls.yml#L6
[3] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L141
[4] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-etcd/private/migrate.yml#L150
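The fix in step 6 boils down to changing the service task's `state`. A minimal sketch of the two variants (the task names and the `openshift_service_type` variable are illustrative here, not the exact playbook content):

```yaml
# "started" is a no-op when the service is already running, so the
# etcd3 storage backend written to master-config.yaml is never loaded.
- name: Start master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: started

# "restarted" always bounces the service, so the updated configuration
# takes effect even if the service is already running.
- name: Restart master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: restarted
```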

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 16, 2018
@michaelgugino (Contributor) left a comment


The master services shouldn't be running at this point, but hopefully this will save us from some regressions in the future just in case.

PR #7556 fixes the underlying issue.

@openshift-ci-robot commented Mar 16, 2018

@ingvagabund: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/gcp | 8dbab24 | link | /test gcp |
| ci/openshift-jenkins/system-containers | 8dbab24 | link | /test system-containers |
| ci/openshift-jenkins/extended_conformance_install_crio | 8dbab24 | link | /test crio |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ingvagabund (Member, Author):

/cherrypick release-3.9

@openshift-cherrypick-robot

@ingvagabund: once the present PR merges, I will cherry-pick it on top of release-3.9 in a new PR and assign it to you.

In response to this:

> /cherrypick release-3.9

@liggitt (Contributor) commented Mar 16, 2018

> 1. the first etcd member is migrated
> 2. all remaining members are cleansed (their /var/lib/etcd/member is removed, ...)
> 3. the etcd cluster is scaled up (so all etcd members are on v3) [1]
> 4. master-config.yaml is updated and the restart master api and restart master controllers handlers are run [1] (not right away, but eventually), which restarts both master services, so they are now running
> 5. master-config.yaml is updated to use the etcd3 storage back-end [3]
> 6. The master services are supposed to be restarted, but they are not because they are already running. So `state: started` is not enough to restart them. The state needs to be set to `restarted` so the services are restarted and the updated configuration takes effect.

What is the difference between the master-config.yaml update in step 4 and step 5? The apiservers should never be started in v2 mode after the migration is performed (even if they are forced to restart in v3 later).

@liggitt (Contributor) commented Mar 16, 2018

we need something like #7544 done as part of the migration to ensure that an api server never starts in v2 mode after a migration is run

@michaelgugino (Contributor):

> we need something like #7544 done as part of the migration to ensure that an api server never starts in v2 mode after a migration is run

The api servers weren't supposed to be restarted yet; this commit is intended to hopefully back-stop a regression in the future (which shouldn't happen anyway).

@liggitt (Contributor) commented Mar 16, 2018

> The api servers weren't supposed to be restarted yet; this commit is intended to hopefully back-stop a regression in the future (which shouldn't happen anyway).

Understood, but we also need to ensure that nothing else (host restart, running another playbook, etc) can possibly start and run an apiserver in v2 mode post migration.

@eparis (Member) commented Mar 16, 2018

Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

@michaelgugino (Contributor):

> Understood, but we also need to ensure that nothing else (host restart, running another playbook, etc) can possibly start and run an apiserver in v2 mode post migration.

This is not possible if the playbook runs to completion, and not completing would be very atypical.

> Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

This is not a bad idea. We'd need to make sure we back up the systemd files, though, because I believe we are already placing the unit files in /etc/systemd/system, so if we mask without backing up the service units, there will be no restarting ;)
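With Ansible >= 2.4 the systemd module does handle masking, so the stop-and-mask / unmask-and-start flow suggested above could look roughly like this (a sketch only; the service name and the `openshift_service_type` variable are illustrative, and note that `systemctl mask` refuses to mask a unit whose file already lives in /etc/systemd/system, which I believe is exactly the backup concern raised here):

```yaml
# Stop and mask the master services so nothing (a monitoring agent,
# a host reboot, another playbook) can start them mid-migration.
- name: Stop and mask master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    state: stopped
    masked: yes

# ... etcd migration and master-config.yaml update happen here ...

# Unmask and restart once the config points at the v3 backend.
- name: Unmask and restart master API
  systemd:
    name: "{{ openshift_service_type }}-master-api"
    masked: no
    state: restarted
```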

@sdodson (Member) commented Mar 20, 2018

> Instead of stopping them, should we stop and mask them? Then unmask and start? Does the ansible systemd module handle mask?

Just catching up on things here. First, since the etcd v2-to-v3 migration must be completed prior to upgrading to 3.7, the only relevant versions of the playbooks are the release-3.7 and release-3.6 branches; newer branches are out of date and lack @vrutkovs's fixes that forgo the error-prone scaleup playbook. We should remove the migration tooling from the master branch.

Looks like ansible-2.4 was the first release to support masking services. We mask services elsewhere via the systemctl command; we could do that here, and I think we should. I suspect some customers may have monitoring that attempts to restart the service, and we should protect against that. I also agree that using `restarted` for the final step is a good idea, though hopefully it is no longer relevant once we've masked the service.
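On the release-3.6/3.7 branches, which may run under Ansible older than 2.4 (where the systemd module has no `masked` option), masking via the systemctl command, as mentioned above, could be sketched like this (illustrative tasks, not the actual playbook content):

```yaml
# Mask the unit directly with systemctl (pre-2.4 systemd module
# has no "masked" parameter).
- name: Mask master API service
  command: systemctl mask {{ openshift_service_type }}-master-api

# Later, after the migration completes, undo it and restart.
- name: Unmask master API service
  command: systemctl unmask {{ openshift_service_type }}-master-api
```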

@vrutkovs (Member):

> Instead of stopping them, should we stop and mask them? Then unmask and start?

OpenShift can be configured to restart the machine instead of the service, so masking the services might be unnecessary.

While this PR makes sense, it should be submitted against the release-3.6 branch and cherry-picked to release-3.7. The etcd migration code will soon be removed from master (and probably from `release-3.9` too).

@vrutkovs (Member):

Created PRs for 3.6 and 3.7 branches

Generally these precautions would be unnecessary (as the playbook stops the masters and doesn't call scaleup anymore), but there might be a case where this happens.

Closing this PR as it's not required in master; the migration code will soon be removed.

@vrutkovs vrutkovs closed this Mar 27, 2018
@ingvagabund ingvagabund deleted the restart-masters-after-etcd-v3-backend-changed branch March 27, 2018 13:18
Labels
affects_3.6 affects_3.7 priority/P1 size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

8 participants