pkg/daemon: add atomic files writing and remove file system abstraction #401

Open
wants to merge 1 commit into
base: master
from

6 participants
Member

runcom commented Feb 11, 2019

pkg/daemon: add atomic files writing and remove file system abstraction

This patch makes sure we write files and the SSH keys file
atomically so that, if and when we are able to reconcile a failing machine,
we don't end up having old files lying around.

The last thing this patch does is get rid of the FileSystemClient abstraction.
I've replaced that with normal first-class function assignments so we don't
regress in testing. We would need to move those tests to e2e anyway.

Closes #101

@cgwalters @kikisdeliveryservice @ashcrow ptal

Signed-off-by: Antonio Murdaca <runcom@linux.com>


openshift-ci-robot commented Feb 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom


@runcom runcom force-pushed the runcom:file-leak-fix branch from b558b36 to 4657d41 Feb 11, 2019

@runcom runcom changed the title from "pkg/daemon: fix files leak" to "pkg/daemon: atomically write files" Feb 11, 2019

@runcom runcom changed the title from "pkg/daemon: atomically write files" to "pkg/daemon: fix files leak" Feb 11, 2019

Member

ashcrow commented Feb 11, 2019

I'm cool with replacing the abstraction with functions that can be mocked for testing.

Member

ashcrow commented Feb 11, 2019

@runcom runcom changed the title from "pkg/daemon: fix files leak" to "pkg/daemon: fix potential files leak" Feb 11, 2019

@runcom runcom force-pushed the runcom:file-leak-fix branch 3 times, most recently from b039868 to 3eeea91 Feb 11, 2019

Contributor

cgwalters commented Feb 11, 2019

"fix files leak" is a bit understated of a commit message title 😉

"Remove filesystem mock, fix bugs"? dunno.

Personally I am not a huge fan of the current unit testing we have...I tend to prefer "real" testing and make that as convenient as possible. When the mocking starts to get nontrivial (as is the case not only with the filesystem but also our mock kube client stuff) I think the value proposition gets a lot more uncertain.

OTOH it can be easier to test error conditions via mocking. On the other other hand...a few strategic points to inject errors can go a long way.

I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

Contributor

cgwalters commented Feb 11, 2019

Something that did change of course since the unit testing here was introduced is that we have our own e2e-aws-op where we can test functionality "for real". EDIT: And another thing that changed is clusters install reliably so one can use them for testing.

Member Author

runcom commented Feb 11, 2019

> "fix files leak" is a bit understated of a commit message title

added "potential", even though, on any error, we end up leaving files around :)

> Personally I am not a huge fan of the current unit testing we have...I tend to prefer "real" testing and make that as convenient as possible. When the mocking starts to get nontrivial (as is the case not only with the filesystem but also our mock kube client stuff) I think the value proposition gets a lot more uncertain.

I believe the abstraction was introduced because there was no e2e; now that we have it, we can indeed test it for real

> I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

I'm not sure about that; this mainly adds atomic file writing to prevent files lying around, the defer file.Close() isn't really doing nothing per-se (I spotted that later, was already adding atomic file write).

Member

ashcrow commented Feb 11, 2019

> I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

That sounds reasonable to me.

Member Author

runcom commented Feb 11, 2019

> Something that did change of course since the unit testing here was introduced is that we have our own e2e-aws-op where we can test functionality "for real".

this is indeed why I've added Closes #101 as well in the first comment

Member Author

runcom commented Feb 11, 2019

#402

@runcom runcom force-pushed the runcom:file-leak-fix branch from 3eeea91 to 6145839 Feb 11, 2019

@runcom runcom changed the title from "pkg/daemon: fix potential files leak" to "pkg/daemon: add atomic files writing and remove file system abstraction" Feb 11, 2019

@runcom runcom force-pushed the runcom:file-leak-fix branch 2 times, most recently from 71657cc to 1a089bf Feb 11, 2019

Member Author

runcom commented Feb 11, 2019

/retest

Member Author

runcom commented Feb 11, 2019

e2e(s) rock 🚀

--- FAIL: TestMCDeployed (308.89s)
	mcd_test.go:146: machine config didn't result in file being on any worker: timed out waiting for the condition

I0211 18:44:43.251341    5348 daemon.go:657] Unable to apply update: rename /tmp/file-write126687226 /etc/mytestconf: invalid cross-device link
E0211 18:44:43.251467    5348 writer.go:91] Marking degraded due to: rename /tmp/file-write126687226 /etc/mytestconf: invalid cross-device link

@runcom runcom force-pushed the runcom:file-leak-fix branch from 1a089bf to 24de6b0 Feb 11, 2019

pkg/daemon/update.go Outdated
@@ -187,6 +185,7 @@ func New(
exitCh: exitCh,
stopCh: stopCh,
}
dn.atomicSSHKeysWriter = dn.atomicallyWriteSSHKey


kikisdeliveryservice Feb 12, 2019

Member

Could you explain what this is for?


runcom Feb 12, 2019

Author Member

this is a lightweight version of the filesystem client: the client existed mainly so we could unit test the flow, and we shouldn't wrap Go's file management API just for that. By having a function which deals with writes only, we're able to mock it out in unit tests until we have proper e2e in place for these scenarios.
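As a rough illustration of the pattern described in this reply, the sketch below stores the write function in a struct field so a test can replace it. It assumes a trimmed-down `Daemon` type; `sshKeyWriter`, `writeSSHKeys`, `realWriteSSHKeys` and `updateSSHKeys` are hypothetical names, not the identifiers used in this PR, and the default writer here is a plain (non-atomic) placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// sshKeyWriter is a hypothetical signature for the swappable write function.
type sshKeyWriter func(path string, keys []byte) error

// Daemon is a hypothetical, trimmed-down stand-in for the real daemon type.
type Daemon struct {
	// writeSSHKeys defaults to the real writer and can be replaced in tests.
	writeSSHKeys sshKeyWriter
}

// NewDaemon wires the real writer into the function field.
func NewDaemon() *Daemon {
	d := &Daemon{}
	d.writeSSHKeys = d.realWriteSSHKeys
	return d
}

// realWriteSSHKeys is a placeholder for the production writer.
func (d *Daemon) realWriteSSHKeys(path string, keys []byte) error {
	return os.WriteFile(path, keys, 0o600)
}

// updateSSHKeys holds the logic under test and only touches the disk
// through the function field.
func (d *Daemon) updateSSHKeys(path string, keys []byte) error {
	if len(keys) == 0 {
		return errors.New("no keys to write")
	}
	return d.writeSSHKeys(path, keys)
}

func main() {
	d := NewDaemon()

	// In a unit test, swap the writer for a fake that records its input.
	var gotPath string
	d.writeSSHKeys = func(path string, keys []byte) error {
		gotPath = path
		return nil
	}

	err := d.updateSSHKeys("/home/core/.ssh/authorized_keys", []byte("placeholder-key"))
	fmt.Println(gotPath, err)
}
```

The same field can be pointed at a function that returns an error, which gives a cheap way to exercise failure paths without a full filesystem mock.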


kikisdeliveryservice Feb 12, 2019

Member

ah thanks! I missed the note on line 87.

@runcom runcom force-pushed the runcom:file-leak-fix branch from 24de6b0 to 7567af4 Feb 12, 2019

Member Author

runcom commented Feb 12, 2019

(rebased 😈 )


authkeypath := filepath.Join(sshDirPath, "authorized_keys")
// we're also appending all keys for any user to core, so for now


kikisdeliveryservice Feb 12, 2019

Member

Not quite: "we are appending all new SSHAuthorizedKeys of user core and passing to atomicallyWriteSSHKeys to write."


runcom Feb 12, 2019

Author Member

mmm https://github.com/openshift/machine-config-operator/pull/401/files#diff-06961b075f1753956d802ba954d2cfb5R686 seems to suggest that we're ranging through every user in Users and grabbing their ssh keys as well


kikisdeliveryservice Feb 12, 2019

Member

no we are ranging over the SSHAuthorizedKeys


runcom Feb 12, 2019

Author Member

well, no the last one actually


runcom Feb 12, 2019

Author Member

there's no check around len(newUsers) so that range can be wrong right?


runcom Feb 12, 2019

Author Member

alrighty, thanks, just saw your previous comment 😄


kikisdeliveryservice Feb 12, 2019

Member

Yeah I have a cleanup PR taking that erroneous [0] out :)


runcom Feb 12, 2019

Author Member

no wait, the len is for newUsers to be empty, but what if it has more than 1 user? we'd fall into #401 (comment) as per your previous comment


kikisdeliveryservice Feb 12, 2019

Member

I think it's prob better to go over this with you outside of this pr discussion thread :)

tho i'd like this pr to wait until my cleanup pr hits and you can rebase on it. though it sounds like this one is going to be held anyway?


runcom Feb 12, 2019

Author Member

well, sure, I can def wait for your PR to land first. As for holding this up, for now yea, but it's still something we need to have imo (nodes can go down, lose connection and stuff like that)

Member Author

runcom commented Feb 13, 2019

/retest

Member

ashcrow commented Feb 13, 2019

Rebase needed

@runcom runcom force-pushed the runcom:file-leak-fix branch 2 times, most recently from c143b0c to 0ff4904 Feb 17, 2019

Member Author

runcom commented Feb 17, 2019

rebased finally

@runcom runcom force-pushed the runcom:file-leak-fix branch 2 times, most recently from a5e0d77 to eb033b8 Feb 17, 2019

Member Author

runcom commented Feb 17, 2019

this now writes atomically for systemd units and dropins as well

@runcom runcom force-pushed the runcom:file-leak-fix branch 5 times, most recently from 797eb8d to b96b21b Feb 17, 2019

@runcom runcom force-pushed the runcom:file-leak-fix branch from b96b21b to 611d2a0 Feb 17, 2019

@openshift-ci-robot openshift-ci-robot added size/XL and removed size/L labels Feb 17, 2019

@runcom runcom force-pushed the runcom:file-leak-fix branch 3 times, most recently from 1bcd584 to 177655b Feb 17, 2019

Member Author

runcom commented Feb 17, 2019

Unit flake tracked here: #449

/retest

pkg/daemon: add atomic files writing and remove file system abstraction
This patch makes sure we write files
atomically so that, if and when we are able to reconcile a failing machine,
we don't end up having old files lying around.

The last thing this patch does is get rid of the FileSystemClient abstraction.
I've replaced that with normal first-class function assignments so we don't
regress in testing. We would need to move those tests to e2e anyway.

Signed-off-by: Antonio Murdaca <runcom@linux.com>

@runcom runcom force-pushed the runcom:file-leak-fix branch from 177655b to de623c6 Feb 17, 2019
