Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2020003: Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.mcdorig' #2681

Merged

Conversation

jkyros
Copy link
Contributor

@jkyros jkyros commented Jul 19, 2021

This PR fixes a condition where:

  1. The mcd backs up a relative/dangling symlink while backing up original files during a configuration rollout
  2. The next time configuration gets rolled out, the mcd checks to see if it has already backed up the original files
  3. os.Stat returns an error against the previously backed-up symlink because the symlink's target does not exist
  4. The mcd interprets that error as the file not existing and tries to copy over it, resulting in "cp: not writing through dangling symlink"

This PR adds a symlink check before the file existence check, to make sure we don't try to back up over the dangling symlink.

This also adds a unit test for original file backups that tests the symlink case as well as the "normal" case. I broke the backup file paths out into package variables so I could override them for the test.

This PR does NOT address the problem of a user undesirably using machine config to overwrite symlinks with files, it only prevents the MCO from degrading in the event that such a configuration were to occur.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2020003

@openshift-ci openshift-ci bot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jul 19, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 19, 2021

@jkyros: This pull request references Bugzilla bug 1970959, which is invalid:

  • expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1970959: Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.mcdorig'

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jul 19, 2021
@jkyros
Copy link
Contributor Author

jkyros commented Jul 19, 2021

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jul 19, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 19, 2021

@jkyros: This pull request references Bugzilla bug 1970959, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @mike-nguyen

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from mike-nguyen July 19, 2021 15:54
Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question

@@ -1605,11 +1610,11 @@ func (dn *Daemon) writeFiles(files []ign3types.File) error {
}

func origParentDir() string {
return filepath.Join("/etc", "machine-config-daemon", "orig")
return origParentDirPath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you did this for testing (?), but is there a reason you can't just mock the return value of this func? instead of having the var and func both existing at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am testing to make sure createOrigFile() is working properly, and I need a way to get it to write its files to a temporary directory during the test, but the file locations are essentially hard-coded because of those function calls.

Maybe I shouldn't worry about the test (or maybe it should be an e2e test because it does file i/o?). I don't know where you want to draw the line.

Anyway, my options as I saw them were:

  1. make origParentDir/noOrigParentDir function pointers and then mock/override their functions for the test:
    var origParentDir = func() string { return filepath.Join("/etc", "machine-config-daemon", "orig") }
  2. make package variables for the strings and have createOrigFile use those strings directly instead of the function calls ( I figured that would be "too much change" for no benefit )
  3. have origParentDir/noOrigParentDir/createOrigFile take an path/prefix argument so I could redirect where they put their files ( again, I felt like that was too much change for no benefit)
  4. make package variables for the path strings and just have the existing functions return them

I went with option 4, but I can do option 1 if that's preferable?

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, I think this generally LGTM, thanks for the tests!

I think the current variable paths method is fine, but will defer to others for a review as well.

One question below:

err = createOrigFile(relativeSymlink, relativeSymlink)
assert.Nil(t, err)

// Finally, make sure we can restore the relative symlink if we rollback
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, even if it is a dangling symlink, we still restore it now correctly right?

I see we do a cp -a --reflink=auto so it seems that it should restore the (still dangling) symlink

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we still restore it correctly. I was worried about that too! :)

Like in the case of /etc/issue, the symlink wasn't "wrong", it was only dangling because of where we put it, if we restored it, it wouldn't be dangling anymore, it would just be relative (and correct).

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2021
@@ -58,6 +58,11 @@ const (
postConfigChangeActionReloadCrio = "reload crio"
)

var (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variable are not supposed to be updated, so this could be made const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I would have made them const if they weren't paths being assembled with filepath.Join (which I assume was done as good practice for cross-platform separators, but I don't know how likely it is that this will need to be cross platform)
  • If I make them constants, I can't assign to them in the test to redirect the files for testing

@sinnykumari
Copy link
Contributor

sinnykumari commented Jul 28, 2021

I am not sure if this is the right way to fix the bug. I looked at the bug and it seems, user is trying to update content of /etc/issue by applying a MachineConfig. On a 4.8 cluster RHCOS node, I see:

$ ls -l /etc/issue
 lrwxrwxrwx. 1 root root 16 Jul 28 14:38 /etc/issue -> ../usr/lib/issue

/etc/issue is symlink to /usr/lib/issue file. On RHCOS system, /usr is read-only filesystem, so in no way MCO would be able to edit the content of file. Ideally I think we should not even try to create either a backup of original file or try to write updated content for read-only content. MCO should either return an error that updating <file> is not possible due to read-only content and go either degraded or make it as info. I am not sure what would be the correct messaging: going degraded or just log info because this would ideally work for RHEL worker nodes in the pool.

@cgwalters thoughts?

@jkyros
Copy link
Contributor Author

jkyros commented Jul 28, 2021

I see what you're saying. writeFileAtomically's behavior is to replace the symlink with a file, not to write the file contents to the symlink's target

func writeFileAtomically(fpath string, b []byte, dirMode, fileMode os.FileMode, uid, gid int) error {

because under the hood it's a "rename" operation not a direct write.

func (t *PendingFile) CloseAtomicallyReplace() error {

I was assuming this was "desired behavior" 😄

Also...the compliance operator seems to encourage modifying /etc/issue on RHCOS as part of its audits -- at least one of the users that hit this was using the compliance operator -- so while the current behavior is maybe non-intuitive/incorrect, if we do change this to act on the target file vs the symlink itself, there will be some consequence.

@kikisdeliveryservice
Copy link
Contributor

while this is being sorted out
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 28, 2021
@yuqi-zhang
Copy link
Contributor

I think there's 2 issues here:

  1. handling of dangling symlinks in the MCD in general
  2. writing to locations we shouldn't be writing to

Which are somewhat tangential. We don't really do a great job of guarding against 2, so it's on a per-error basis. I agree with Sinny in that in the particular case of the BZ the user should have probably just written to /etc/issue.d. I am ok with the PR as a fix to 1, for other potential dangling symlinks that would normally be in writable locations.

As for the compliance operator, if we overwrite /etc/issue directly, technically its ok because the location of /usr/lib/issue would just not get used at all, and the symlink just becomes a file. However if we want to move compliance operator to using issue.d, maybe we can move the BZ and get their thoughts on it as well.

Would that be an ok path? (merge this for 1, move BZ for 2)

@sinnykumari
Copy link
Contributor

@jkyros I agree that MCO is not really doing good job in updating symlink files.

@yuqi-zhang sure, we can go ahead with this PR for in general improvement and move this bug to compliance team to see if they can update instructions to write to /etc/issue.d/ instead of in /etc/issue file. Suggesting to directly write directly into /etc/issue by making symlink to a file shouldn't be ideally done since that file is coming a spart of OS update and may cause issue(RHCOS team would have better idea)

In addition to above two points, I think we should also create a story for MCO to discuss about right way of handling symlink files for both cases editable and read-only.

Based on above discussion, this PR would need update in messaging and perhaps creating a new bz if we want to backport to older releases.

@openshift-merge-robot
Copy link
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

@openshift-ci openshift-ci bot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Sep 6, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 6, 2021

@openshift-merge-robot: This pull request references Bugzilla bug 1970959, which is invalid:

  • expected the bug to target the "4.10.0" release, but it targets "4.9.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jkyros
Copy link
Contributor Author

jkyros commented Nov 2, 2021

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Nov 2, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 2, 2021

@jkyros: This pull request references Bugzilla bug 1970959, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (rioliu@redhat.com), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Add symlink check before existence check for original file backup,
so the mcd regards dangling symlinks as present and doesn't try to
overwrite them on later updates. This will resolve a bug where the mcd
would back up a relative symlink and then try (and fail) to overwrite
it next time config got applied, causing the mcd to degrade.

This does NOT address the problem of a user undesirably overwriting
symlinks with files, it only prevents the MCO from degrading in the
event that such a write were to happen.

Fixes:  https://bugzilla.redhat.com/show_bug.cgi?id=1970959
@sinnykumari
Copy link
Contributor

Thanks John for making all the changes. LGTM
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2021
@sinnykumari
Copy link
Contributor

/retest

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 4, 2021

@jkyros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-workers-rhel7 27dddb6 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-workers-rhel8 27dddb6 link false /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-upgrade-single-node 27dddb6 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive 27dddb6 link false /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

15 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit b0511ee into openshift:master Nov 4, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 4, 2021

@jkyros: All pull requests linked via external trackers have merged:

Bugzilla bug 2020003 has been moved to the MODIFIED state.

In response to this:

Bug 2020003: Node upgrade stuck due to not writing through dangling symlink '/etc/machine-config-daemon/orig/etc/issue.mcdorig'

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants