
[FIX JENKINS-34855] Create a FileChannelWriter and use it for AtomicFileWriter "enforced" flushing #3170

Merged
merged 6 commits into jenkinsci:master from the JENKINS-34855 branch on Jan 12, 2018

Conversation

@batmat (Member) commented Dec 1, 2017

See JENKINS-34855.

Followup of #2548 & #3165

After a lot of reading and some testing, the only reliable ways to force a local device sync from Java code appear to be either FileChannel.force(boolean) or FileDescriptor.sync().

So even if the issues around this have generally been seen more when using a remote FS, I think we can hope that making sure we actually flush to the local disk will at least make that issue (i.e. the zero-length files seen in the wild) overall far less likely to happen.

This change rewires the current internal core writer in AtomicFileWriter to delegate to a FileChannel, giving us access to FileChannel.force().
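
For illustration only, here is a minimal sketch of the write-force-rename sequence this enables. It is not the actual AtomicFileWriter code; the class and method names are made up for the example.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    class DurableWriteSketch {
        // Write to a temporary file, force it to the device, then atomically rename it.
        static void writeDurably(Path destination, byte[] content) throws IOException {
            Path tmp = Files.createTempFile(destination.getParent(), "atomic", ".tmp");
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                channel.write(ByteBuffer.wrap(content));
                channel.force(true); // flush data and metadata to the device before the rename
            }
            Files.move(tmp, destination,
                    StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        }
    }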

Proposed changelog entries

  • Internal: JENKINS-34855 AtomicFileWriter now uses a FileChannel internally and forces a disk sync when committing.

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developers, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or an explanation of why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

ToDos after merge

Desired reviewers

@reviewbybees esp. @jtnord

@batmat batmat requested a review from jtnord December 1, 2017 22:55
@batmat batmat added the work-in-progress The PR is under active development, not ready to the final review label Dec 2, 2017
@@ -24,7 +24,7 @@
  * <strong>really</strong> bad performance regressions.
  */
 @Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 100 * 1000L)
batmat (Member Author):

FTR:

[screenshot]

Commit: And disable FileChannel.force() in the case of AtomicFileWriter.
See the committed javadoc for the full why.
@batmat batmat added needs-more-reviews Complex change, which would benefit from more eyes and removed work-in-progress The PR is under active development, not ready to the final review labels Dec 3, 2017
@amuniz (Member) left a comment

Did you see some flakiness after the fix in #2548? I wonder whether a "corner case" is worth adding this complexity for... I've seen the "atomic not being atomic" issue on network file systems only, never locally.

@@ -24,7 +24,7 @@
  * <strong>really</strong> bad performance regressions.
  */
 @Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 80 * 1000L)
Member:

Why? Has this worsened performance by ~2x?

batmat (Member Author):

Worse than that: around 30x or more for this very case. I just reduced the timeout to 50 seconds, since the test actually takes ~35 seconds in CI.

But as explained in the commit log, I still think the overall change is an improvement. Before, we were quicker but unsafe; that is the whole point of this change... :-|

(Note: I am mainly bumping the timeout so CI does not fail; I was never able to see this test fail on my machine [it takes ~17 seconds with this change] other than by running it in Docker and aggressively reducing IOPS to 40 using Docker resource constraints.)

Since we now *actually* sync to disk, this isn't really a regression IMO. Given the very role of AtomicFileWriter, it looks more like a feature or an improvement than an issue.

So I suspect that 1) we won't see much bad impact in the wild on most setups (as we generally already recommend SSDs and the like, given how Jenkins works), and 2) we win in terms of integrity by making sure (well, raising the chances, let's say) that we do write to disk before renaming files.
@batmat (Member Author) commented Dec 4, 2017

"corner case"

@amuniz This fix is actually more like the main one; #2548 was only about the atomic rename. But it could still happen that we rename a file whose bits have not yet been flushed through the myriad layers of FS caching and networking.

The whole issue may indeed be a corner case, granted, but it is a dramatic one when it happens, as you know (zero-length config files can just take down the whole instance).

I manually built and ran Jenkins on my machine with this change, and couldn't see any noticeable degradation. And that makes sense, even though, for instance, AtomicFileWriterPerfTest does indeed go from 1 second to 35 seconds.

Taken out of context, that can look like just a horrible perf regression.

But if one looks at the test (I wrote it only a few days ago to monitor this), it is doing 1000 Jenkins.save() calls. So a Jenkins.save() now takes (on the CI setup) ~35 milliseconds, instead of ~1 millisecond previously.

35x is a huge variation, but the absolute value IMO tells us that this shouldn't be an actual issue, knowing also that between those two numbers we have added a far better effort to make sure important things are actually written to disk.

@Override
public void close() throws IOException {
    if (channel.isOpen()) {
        channel.force(true);
Member:

For the case where the file is being thrown away (abort is called rather than commit), we could get away with not calling force. (That is, AtomicFileWriter.commit() could call a force method which does a flush and a force, and then subsequently call close; a sketch follows below.)

batmat (Member Author):

Will do as a followup.
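
A minimal sketch of what that followup could look like. The class, fields, and method names below are hypothetical, not the actual Jenkins code: only the commit path pays for the device sync, while abort simply discards the temporary file.

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    class AtomicWriteSketch {
        private final Writer writer;        // writes into tmp through the FileChannel
        private final FileChannel channel;
        private final Path tmp;
        private final Path destination;

        AtomicWriteSketch(Writer writer, FileChannel channel, Path tmp, Path destination) {
            this.writer = writer;
            this.channel = channel;
            this.tmp = tmp;
            this.destination = destination;
        }

        void commit() throws IOException {
            writer.flush();
            channel.force(true);            // make the content durable before the rename
            writer.close();
            Files.move(tmp, destination, StandardCopyOption.ATOMIC_MOVE);
        }

        void abort() throws IOException {
            writer.close();                 // no force(): this file is being thrown away
            Files.deleteIfExists(tmp);
        }
    }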

@jtnord (Member) commented Dec 6, 2017

@amuniz

I've seen the "atomic not being atomic" issue on network file systems only, never locally.

I've seen it locally; it depends on your file system of choice, especially since some flush metadata before the data (or independently of it)... https://www.spinics.net/lists/xfs/msg09930.html

@batmat (Member Author) commented Dec 6, 2017

@jenkinsci/code-reviewers for possibly more eyes or testing.

dwnusbaum previously approved these changes Dec 7, 2017

@dwnusbaum (Member) left a comment

🐝

@svanoort (Member) left a comment

While I think this does ensure that commit() really commits to disk, I feel that, due to the performance discrepancy, the enforced sync-to-disk should be controlled by a system property, with the default set to off.

That way for cases where we need ABSOLUTE integrity, we can set the property on (covering original issues), but we don't wreck the majority of masters where that is not the case.

I can tell you that, because this behavior AFAICT is on for all XmlFiles and Pipeline hits those heavily until JENKINS-47170 lands, the second this lands no master will be able to run Pipeline at an acceptable rate.

Or is this already covered by something else in the other changes that I've missed?

Edit: opt-in enforced flushing for the most critical config files (Jenkins core configs + job config) would also be acceptable IMO, especially if a system property allows you to disable autosync in case we encounter edge cases. It's more the files related to builds, or those that are touched often, where it's a major problem.

@jtnord (Member) commented Dec 7, 2017

That way for cases where we need ABSOLUTE integrity, we can set the property on (covering original issues), but we don't wreck the majority of masters where that is not the case.

@svanoort you cannot make requiring integrity vs. not requiring it a user choice; it is just not an option.

Most people may think "oh yeah", but then Jenkins starts with an empty / missing config.xml, everything goes to pot, and it is Jenkins' fault.

Code that does not need to care about integrity has no business using AtomicFileWriter to begin with, as the sole purpose of that class was to ensure the integrity of the system.

If you ask people to set a flag once they have data loss, you will be screamed at: "WHY WAS THAT NOT THE DEFAULT FOR SAKE".

If you want a flag, then it should be of the form "I_DO_NOT_CARE_ABOUT_SYSTEM_INTEGRITY_GIVE_ME_PERFORMANCE" and require a value of "I_ACCEPT_THE_RISK" before we throw integrity out of the window.

That is, you can OPT out of integrity, but you never have to opt into it when it is the expected behaviour.

@jglick (Member) commented Dec 7, 2017

I would tend to agree. Jenkins should be safe and correct by default. If there is a demonstrable I/O performance issue under heavy load conditions involving Pipeline, well, we can always have a system property to opt out of integrity guarantees, but better would be for the Pipeline FlowNode storage engine to selectively avoid AtomicFileWriter under conditions where it knows loss of a file is tolerable. (The same might be true of build.xml for Runs generally: if these get truncated, well, life moves on.)

@batmat (Member Author) commented Dec 7, 2017

@jglick @svanoort which plugin(s) do you think already have tests (or could have) that would assess the impact this would have (running with -Djenkins.version=2.94-SNAPSHOT)?

As explained above, though the impact is big in proportion, especially on ridiculously slow disks, we are still talking about milliseconds in absolute value. So, for instance, if flow nodes are saved in a single build.xml (?), I tend to think the impact shouldn't be gigantic.
I disabled the sync on flush() for this purpose (because XmlFile transitively calls flush for every single write or so), or do we call AtomicFileWriter.commit() potentially dozens or hundreds of times per second from Pipeline code?

@svanoort (Member) commented Dec 8, 2017

@batmat I ought to be able to test it against Pipeline with Hydra, AKA The Scalability Lab soonishly -- there's a setting I need to benchmark anyway that turns OFF all the AtomicWriter behavior for Pipeline, so that'll be an easy approach to before/after comparison.

The problem with FlowNodes is that they're (currently) saved by default in a whole bunch of tiny XML files, and those small overheads already add up in a big way for them. Now, I have a mode that does away with that nonsense but people have to know that this is the problem and they need to select it.

I suspect that's not the only place where this overhead will pose a problem or where there's XML saved often.

In general, I'm not a fan of changes that introduce huge, non-optional performance regressions to protect against a rare failure mode (tied to operating-system-level failures, here). For moderate-to-large-scale users, these performance changes are more likely to bring down a master than an edge case like this. I remind you all of the ongoing pain we've had with user lookup performance: "just" a performance regression there has claimed the lives of many production masters at large companies, between JENKINS-35493 and its recent kin JENKINS-47429.

Code that does not need to care about integrity has no business using AtomicFileWriter to begin with - as the sole purpose of that was to ensure integrity of the system.

@jtnord The problem is that XmlFile.write(Object) is what many things use by default for persisting XStream content (a brief illustration follows this comment).

I think @batmat did a really clever fix here, but we need to ensure that the fix is only applied where it's critical, i.e. top-level Jenkins configs or job configs, or that at least people can turn it off.

I believe the phrase here is "First Do No Harm"?
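
As an illustration only (this is not Jenkins source; the path and the saved object are placeholders), the common save path that funnels through XmlFile.write(Object) and, underneath it, AtomicFileWriter looks roughly like this:

    import hudson.XmlFile;
    import java.io.File;
    import java.io.IOException;

    class SavePathSketch {
        // Any object XStream can serialize ends up on disk through this path.
        static void save(Object state) throws IOException {
            XmlFile config = new XmlFile(new File("/var/lib/jenkins", "config.xml"));
            config.write(state); // serializes via XStream and writes through AtomicFileWriter
        }
    }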

@stephenc (Member) commented Dec 8, 2017

"First Do No Harm"

Which is exactly picking integrity over performance (as the default... it's fine to allow users to opt in to picking performance over integrity).

@batmat (Member Author) commented Jan 4, 2018

Updated as 3a5189e: the flush & close flags were reversed.

    LOGGER.finest("Flush is forced");
    channel.force(true);
} else {
    LOGGER.finest("Force disabled on flush(), no-op");
olivergondza (Member):

Should we not do some flush here? Otherwise turning the flag off would result in behaviour more broken than it was before.

batmat (Member Author):

@olivergondza good question.

TL;DR: FileChannel.write() already achieves the equivalent of BufferedWriter.write() + BufferedWriter.flush().

FileChannel, contrary to the previously used Files.newBufferedWriter, has no Java-level buffering. So FileChannel.write() "writes to disk", or rather sends data to the OS/disk cache, depending on the system configuration.

So flushing does not make as much sense here, or at least not with the same meaning as when one calls BufferedWriter.flush(). And by the way, even BufferedWriter.flush() does not actually force a full flush to disk: it only moves data from the JVM buffer/RAM to the next level, with no guarantee that things end up on the disk. That is the whole point of this change.

FileChannel.force() actually guarantees everything is flushed to the disk; BufferedWriter.write() + BufferedWriter.flush() does not.
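
A small self-contained sketch of that distinction (the path and strings are placeholders): BufferedWriter.flush() only pushes the JVM-level buffer to the OS cache, whereas FileChannel.force(true) asks the OS to push the data (and metadata) down to the device.

    import java.io.BufferedWriter;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    class FlushVersusForce {
        static void demo(Path path) throws Exception {
            // BufferedWriter: write() fills a JVM-level buffer; flush() hands it to the OS cache.
            try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
                bw.write("some data");
                bw.flush(); // no guarantee the bytes have reached the device yet
            }

            // FileChannel: write() already goes to the OS cache; force(true) syncs to the device.
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
                ch.write(StandardCharsets.UTF_8.encode("more data"));
                ch.force(true); // durable once this returns (modulo hardware write caches)
            }
        }
    }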

olivergondza (Member):

I see. So, if I understand correctly, the data should be sent to disk more eagerly than before (avoiding Java-level caching) even when the flag is turned off. Not that I would object to the change; I am merely pointing out that there is no way for users to switch back to the old behaviour, which we know has worked fine for ages, in case this does not work as expected.

@jtnord (Member) commented Jan 5, 2018

switch to the old behaviour we know was working fine for ages in case this would not work as expected.

forceOnFlush defaults to false from AtomicFileWriter, so we are not pushing to disk on a flush; thus it should behave very similarly (the data has gone from Java to the OS), although possibly via different underlying native calls.

By default, from AtomicFileWriter we are only making sure the data is on disk when the Writer is close()d, and if you want to simulate the old broken behaviour you can: see here and here.

@dwnusbaum dwnusbaum dismissed their stale review January 5, 2018 14:08

Reviewed before most recent changes

@@ -51,6 +54,15 @@

private static final Logger LOGGER = Logger.getLogger(AtomicFileWriter.class.getName());

private static /* final */ boolean LOSE_INTEGRITY = SystemProperties.getBoolean(
        AtomicFileWriter.class.getName() + ".LOSE_DATA_INTEGRITY_FOR_PERFORMANCE");
Member:

This name feels really overly dramatic. Could we call it "DISABLE_FORCED_FLUSH" instead?

Member:

Basic rule: an option should tell people something about what it does. This isn't a very descriptive name IMO, and it isn't really accurate either, since it's not globally destroying data integrity; it's modifying one specific behavior with respect to OS caching.

batmat (Member Author):

I agree to adjust this, but I think our current behaviour is also not 100% safe by default, which it should probably be. I could have put something like REDUCE_LIKELINESS_TO_LOSE_DATA_WHEN_THINGS_GO_WRONG, and that would indeed have been closer to reality, but well :-).

DISABLE_FORCED_FLUSH feels not quite dramatic enough. My thinking and understanding of this is that we really would like it not to be disabled apart from emergencies and very special cases, so let's not encourage that.

But OK, let's rename the option, and still leave a (less dramatic too, OK) message on class loading.


static {
    if (LOSE_INTEGRITY) {
        LOGGER.log(Level.SEVERE, "LOSE_DATA_INTEGRITY_FOR_PERFORMANCE flag used, YOU RISK LOSING DATA.");
Member:

Again, overly dramatic. I think we really only need to warn people about this case happening when the OS or storage is yanked rather than cleanly shut down.

if (LOSE_INTEGRITY) {
    // We should log this at WARNING, but as this code is called often, that would create huge logs on production systems.
    // Hence a similar message is logged once, above, in the static block.
    LOGGER.log(Level.FINE, "WARNING: YOU SET A FLAG THAT COULD LEAD TO DATA LOSS.");
Member:

Seriously, we don't need to warn on every write. The data loss happens when you're yanking storage or the OS out from under the software -- most applications aren't expected to behave well in that scenario.

Member:

We should be issuing a more specific warning, something like "virtualization or storage failures can cause data loss if this is enabled". My concern is that in settings where users do have reliable storage and are bitten by the I/O performance drop, they will still be too scared to use this even if we explicitly tell 'em to.

Commit: As requested especially by Sam
@batmat (Member Author) commented Jan 7, 2018

@svanoort drama adjusted ;-)

@batmat batmat requested a review from svanoort January 7, 2018 22:28
@svanoort (Member) left a comment

Nice work here @batmat. I think with the clearer warning & flag name we are in good shape.

@oleg-nenashev oleg-nenashev added on-hold This pull request depends on another event/release, and it cannot be merged right now ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes labels Jan 8, 2018
@oleg-nenashev (Member):

Please give me some time to review it, guys. I think it's generally fine to go into the next LTS.

@oleg-nenashev oleg-nenashev self-assigned this Jan 8, 2018
@batmat batmat added on-hold This pull request depends on another event/release, and it cannot be merged right now and removed on-hold This pull request depends on another event/release, and it cannot be merged right now labels Jan 8, 2018
ghost commented Jan 11, 2018

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

@batmat (Member Author) commented Jan 12, 2018

@reviewbybees done.

@oleg-nenashev oleg-nenashev removed their assignment Jan 12, 2018
@oleg-nenashev oleg-nenashev removed the on-hold This pull request depends on another event/release, and it cannot be merged right now label Jan 12, 2018
@oleg-nenashev (Member):

Won't have time for it this week, sorry.
Feel free to go forward if you feel it's ready.

@batmat (Member Author) commented Jan 12, 2018

Thank you very much for trying @oleg-nenashev.

Yes, I think it's ready. It's been reviewed pretty thoroughly by many people, and I've tested it quite a lot myself. I would like it to land in a weekly release sooner rather than later, so that we can detect any (hopefully no) unforeseen impacts in the wild and fix them all, so that JENKINS-34855 can be backported and land in the next LTS.

So I will move forward and merge this. Please ping me if this is deemed to have caused any issues once released.

Thank you everyone for the deep reviews.

@oleg-nenashev (Member) left a comment

I see no obvious issues in the code.

@batmat batmat merged commit 438e929 into jenkinsci:master Jan 12, 2018
@batmat batmat deleted the JENKINS-34855 branch January 12, 2018 09:30