Bug 1814397: pkg/daemon: fix deletion of stale files #1586
Conversation
@cgwalters ptal 🙏

We have a serious bug in how we back up "original" files and restore them. Here, "original" means files that ship with RHCOS: think of a default chrony or other system daemon configuration file. When the MCD kicks in and writes to those files, we want to back up the original (shipped-with-RHCOS) one in order to restore it if a user deletes the MC that modified it (this was the initial bug reported in GitHub at openshift#782).

However, the patch that fixed openshift#782 caused the following: if you shipped a file with just _one_ MC, removing that MC would wipe the file out, and that works. However, if you later modified that file again with another MC, a backup file would be created for the first MC's version, and when the file was deleted by deleting the second MC, the MCD would restore the file shipped with the first MC instead of wiping it out completely, which it should have done since that file was never meant to be backed up: it wasn't on RHCOS to begin with.

This patch now differentiates between files that are already on RHCOS (on-disk, so to speak) and files that are shipped with an MC. For the former, the MCD creates a backup as it does today; for the latter, the MCD creates a placeholder file that tells it to just get rid of the file altogether (along with all the necessary checks and actions to create those backup files).

The issue popped up on upgrade paths where the new manifests rendered by the MCO don't contain a certain file. The MCD notices that and goes ahead and tries to remove the file. However, it finds a backup file there (created for an MC-shipped file that other MCs later modified) and tries to restore it (also failing with an "invalid cross-device link" error, but that's another issue which I'm fixing here as well by using cp directly).

Really hoping all the above makes sense.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom: This pull request references Bugzilla bug 1814397, which is valid. The bug has been moved to the POST state and has been updated to refer to this pull request using the external bug tracker. 3 validation(s) were run on this bug.
@runcom: This pull request references Bugzilla bug 1814397, which is valid. 3 validation(s) were run on this bug.
/retest
> However, if you modified that file later again with another MC, a backup file will be created for the first MC

I don't understand this. Why do we back up the file if it's not actually newly modified? Isn't that the root problem here?

When deleting the second MC, shouldn't it notice that the desired file contents have a new target and reconcile back to the first MC?
```go
// create a noorig file that tells the MCD that the file wasn't present on disk in RHCOS
// so it can just remove it when deleting stale data, as opposed to restoring a file
// that was shipped _with_ RHCOS (e.g. a default chrony config).
```
This is somewhat of a corner case, but: does this correctly handle an upgrade whose base OSTree introduces a config file which we previously had modified? We'd write the `noorig` file, but then if it becomes stale we'd just nuke it entirely instead of restoring the base one, right?

This would be a lot easier if we used `/usr/etc` as the canonical "host" defaults, but yeah, I understand we need to support traditional systems too and it's easier to have the same path for both.
Thinking more on this: unless I'm misunderstanding, I'm not sure using `noorig` is the right approach, since it's signaling something that can change. One could imagine the opposite scenario as well: a file we previously shipped in the host and modified with an MC is now dropped from the host; once you nuke the MC, do we restore the original file instead of deleting it?
> does this correctly handle an upgrade whose base OSTree introduces a config file which we previously had modified?

Hmm, now that I think about it, if we are rebooting into a new oscontainer with a new version of a file, how are we consolidating that today?

And yeah, this would be another ugly scenario. Correct me if I'm wrong, but I think this would only happen if we somehow had both a new MC and the OSTree write the same file, right?
```go
	}
	glog.V(2).Infof("Removing file %q completely", f.Path)
} else if _, err := os.Stat(origFileName(f.Path)); err == nil {
	if out, err := exec.Command("cp", "-a", "--reflink=auto", origFileName(f.Path), f.Path).CombinedOutput(); err != nil {
```
Minor: I see the `exec.Command("cp", ...)` is used elsewhere too. It could probably use factoring out to make it cleaner to call.
Right now, if we want to fix existing clusters, we instead need to iterate over all versions of non-rendered MCs to see whether the file was written at any point, and if not, wipe it from disk. This is ugly, but the proposed solution won't fix existing clusters.

As a side note, do you know why we don't back up systemd units?
```go
	return nil
}
if _, err := os.Stat(fpath); err != nil {
```
So while this would work for newly created clusters, existing clusters would still run into this problem after upgrading to this version, since we've already lost the fact that the file originally didn't exist.
This will be covered by patching release-1 (in this case 4.3) - see #1587 <- that patch makes a best effort to understand which files were on the host/shipped by an RPM like chrony and the like.
No, that's not the problem. This issue only occurs when

This backup is originally meant to fix scenarios like:
there's a BZ we need to fix for that - it's tracked for now, and you're right, we should be doing that for systemd units (and dropins as well)

That's what I mean though. If we fix just that, then the current orig approach still works fine (notwithstanding the issues mentioned here) and this whole issue goes away, right?

that's what this does, right? It makes sure to create the backup file only if the file isn't already there
@runcom: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It's worth looking at how ostree handles the base problem here. The way it works is we always check out the new defaults, then calculate a diff between the current defaults and the current state, and apply that diff to the new defaults. The most important thing here is that we're not explicitly trying to delete files that no longer apply - we're always starting from the new base. It's also how ostree handles the base `/etc`.

Whereas this is trying to mutate the current state directly into a new state.

On a related topic, this "mutate the current" is also extremely hard to do transactionally.

I think long term what we want to do is capture what I'll call the "local diff" - any changes made on the node that are not part of the MCO. Then on upgrades, we:

Then in no case do we need to even care about a diff versus the old MCO state.

But this is harder to do while we need to care about !CoreOS.
It should be really pretty easy to put together a set of unit tests for this, right? Or perhaps better, a more "blackbox" test that runs in a pod:

Basically
Had a chat with @runcom about this, and I think we're on the same page. I had trouble trying to formulate what bothered me, but I think I can pinpoint it now: creating an `.mcdnoorig` file isn't actually necessary; the condition we care about (whether we're doing a modification on top of another MC) can be calculated by just analyzing the generated MCs, since we have both the old and new generated configs at hand.

And given that `.mcdnoorig` files don't actually track the right thing (as per the discussions in #1586 (comment)), it wasn't clear to me whether we were doing more harm than good. I think I've convinced myself, though, that the amount of harm is about the same. :)

Anyway, I think this approach works too as a stopgap, and I appreciate that it's easier to achieve than doing another diff as described above.
```go
	return nil
}
if _, err := os.Stat(fpath); err != nil {
```
Should check specifically for `ENOENT` here: https://golang.org/pkg/os/#IsNotExist
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlebon, runcom

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Did we close this PR by mistake 🤔

This has been superseded by #1593
An e2e test exercising this can be consulted at #1590.