
Bug 1741817: mcd: Add /run/machine-config-daemon-force stamp file #1086

Merged
merged 1 commit into openshift:master on Aug 27, 2019

Conversation

cgwalters
Member

This causes the MCD to skip validating against currentConfig (or pendingConfig).

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1741817

A long time ago, #245 introduced the current model.
One aspect of this is that we need to avoid reboot loops; that was
a real-world problem early in OpenShift development, although
it is probably unlikely to recur today.

Another problem is that we can't simply reconcile by default because
we don't have a mechanism to coordinate reboots:
#662 (comment)

However, this PR should aid disaster recovery scenarios and others
where administrators want the MCD to "just do it".

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 26, 2019
@kikisdeliveryservice
Contributor

cc: @rphillips

@cgwalters
Member Author

To be clear, the DR instructions should then include:

$ touch /run/machine-config-daemon-force

@rphillips
Contributor

/cc @bergerhoffer

@kikisdeliveryservice
Contributor

To be clear, the DR instructions should then include:

$ touch /run/machine-config-daemon-force

Noted on the BZ, but we're looking for the above command to appear as step 9.e in the DR instructions, something like:
d. Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes.
e. Add file to force the MCD to accept this certificate update:
$ touch /run/machine-config-daemon-force

@kikisdeliveryservice kikisdeliveryservice changed the title daemon: Add /run/machine-config-daemon-force stamp file Bug 1741817: mcd: Add /run/machine-config-daemon-force stamp file Aug 26, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 26, 2019
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1741817, which is valid. The bug has been moved to the POST state.

In response to this:

Bug 1741817: mcd: Add /run/machine-config-daemon-force stamp file

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom
Member

runcom commented Aug 26, 2019

VPC errors on e2e-aws-op

/approve

looks really good

@cgwalters
Member Author

I didn't test manually (or add an e2e test) yet - deploying this locally I hit a very slow drain in a default aws cluster, MCD on the final worker:

I0826 20:52:59.790642   26695 update.go:984] Update prepared; beginning drain
I0826 20:52:59.806819   26695 update.go:89] cordoned node "ip-10-0-163-146.us-east-2.compute.internal"
I0826 20:52:59.900324   26695 update.go:93] ignoring DaemonSet-managed pods: tuned-tljbr, dns-default-nvw66, node-ca-gnwvl, machine-config-daemon-bbh6f, node-exporter-dtx8v, multus-457nl, ovs-ztl85, sdn-dbf7f; deleting pods with local storage: alertmanager-main-1, alertmanager-main-2, kube-state-metrics-5c5cbcdc8-s6l0
I0826 20:53:07.660461   26695 update.go:89] pod "image-registry-5c746c44d5-pwsb7" removed (evicted)
I0826 20:53:08.060889   26695 update.go:89] pod "redhat-operators-bcdf945cf-4br8c" removed (evicted)
I0826 20:53:10.460443   26695 update.go:89] pod "router-default-bcdbc8466-zhbnh" removed (evicted)
I0826 20:53:10.661613   26695 update.go:89] pod "prometheus-k8s-0" removed (evicted)
I0826 20:53:10.860619   26695 update.go:89] pod "kube-state-metrics-5c5cbcdc8-s6l85" removed (evicted)
I0826 20:53:11.260544   26695 update.go:89] pod "community-operators-5fddc68748-vvltk" removed (evicted)
I0826 20:53:11.860548   26695 update.go:89] pod "certified-operators-7d75b75747-wb6mn" removed (evicted)
I0826 20:53:12.463396   26695 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0826 20:53:12.660590   26695 update.go:89] pod "prometheus-adapter-589d7c5776-kvfr7" removed (evicted)
I0826 20:53:13.462698   26695 update.go:89] pod "alertmanager-main-1" removed (evicted)
I0826 20:53:14.122589   26695 update.go:89] pod "prometheus-operator-56cc67bc95-lmjct" removed (evicted)
I0826 20:53:14.460483   26695 update.go:89] pod "openshift-state-metrics-5f48947c54-t9kff" removed (evicted)
I0826 21:03:09.123262   26695 update.go:89] pod "downloads-5fdcfc8686-tjs5l" removed (evicted)

We really need to roll up into status something like "draining node ip-10-0-163-146.us-east-2.compute.internal".

@runcom
Member

runcom commented Aug 26, 2019

We really need to roll up into status something like "draining node ip-10-0-163-146.us-east-2.compute.internal".

like a "Working reason"

@runcom
Member

runcom commented Aug 26, 2019

I0826 21:03:09.123262   26695 update.go:89] pod "downloads-5fdcfc8686-tjs5l" removed (evicted)

This looks like the pod that made it slow - and I think it's not responding to the drain (same bug as the registry one perhaps, I'll check)

@rphillips
Contributor

/retest
/lgtm

Doc PR: openshift/openshift-docs#16399

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, rphillips, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -892,8 +892,12 @@ func (dn *Daemon) checkStateOnFirstRun() error {
 		glog.Infof("Validating against current config %s", state.currentConfig.GetName())
 		expectedConfig = state.currentConfig
 	}
 	if !dn.validateOnDiskState(expectedConfig) {
-		return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+		if _, err := os.Stat(constants.MachineConfigDaemonForceFile); err != nil {
Review comment (Contributor):

Shouldn't we be more specific about the type of error here? Also, is there a way to force this when some issue prevents reading or creating the file successfully?

@openshift-merge-robot openshift-merge-robot merged commit 9e770a7 into openshift:master Aug 27, 2019
@openshift-ci-robot
Contributor

@cgwalters: All pull requests linked via external trackers have merged. Bugzilla bug 1741817 has been moved to the MODIFIED state.

In response to this:

Bug 1741817: mcd: Add /run/machine-config-daemon-force stamp file


@rphillips
Contributor

/cherrypick release-4.1

@openshift-cherrypick-robot

@rphillips: #1086 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/daemon/constants/constants.go
M	pkg/daemon/daemon.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/daemon/daemon.go
Auto-merging pkg/daemon/constants/constants.go
CONFLICT (content): Merge conflict in pkg/daemon/constants/constants.go
Patch failed at 0001 daemon: Add /run/machine-config-daemon-force stamp file

In response to this:

/cherrypick release-4.1


openshift-merge-robot added a commit that referenced this pull request Sep 6, 2019
[release-4.1] Bug 1749271: Backport mcd: Add /run/machine-config-daemon-force stampfile #1086
@cgwalters
Member Author

After some discussion on this I feel like this PR was wrong: what we really want is something more like "automatically reconcile any changes we detect rather than go degraded".

@runcom
Member

runcom commented Jul 1, 2020

The only thing we need to make sure of is that whatever we're force-applying is reflected in a rendered MachineConfig (as it usually is). I think one of the DR scenarios advises modifying the files on the host and using the force file, which is ineffective if those changes aren't in the rendered machine config as well.
