
[FIX JENKINS-34855] Create a FileChannelWriter and use it for AtomicFileWriter "enforced" flushing #3170

Merged
merged 6 commits into jenkinsci:master from the JENKINS-34855 branch on Jan 12, 2018

Conversation

@batmat (Member) commented Dec 1, 2017

See JENKINS-34855.

Followup of #2548 & #3165

After a lot of reading and some testing, the only reliable ways to force a local device sync from Java code appear to be either FileChannel.force(boolean) or FileDescriptor.sync().

So even if the issues around this have generally been seen more when using a remote FS, I think we can hope that making sure we actually flush to the local disk will at least make that issue (i.e. the zero-length files seen in the wild) overall far less likely to happen.

This change rewires the current internal core writer in AtomicFileWriter to delegate to a FileChannel, giving us access to FileChannel.force().
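
For illustration only, here is a minimal sketch of the write-force-rename sequence this enables. It is not the actual AtomicFileWriter code; the class and method names are made up for the example.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    class DurableWriteSketch {
        // Write to a temporary file, force it to the device, then atomically rename it.
        static void writeDurably(Path destination, byte[] content) throws IOException {
            Path tmp = Files.createTempFile(destination.getParent(), "atomic", ".tmp");
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                channel.write(ByteBuffer.wrap(content));
                channel.force(true); // flush data and metadata to the device before the rename
            }
            Files.move(tmp, destination,
                    StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        }
    }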

Proposed changelog entries

  • Internal: JENKINS-34855 AtomicFileWriter now uses a FileChannel internally and forces a disk sync when committing.

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developers, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or an explanation of why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

ToDos after merge

Desired reviewers

@reviewbybees esp. @jtnord

@batmat batmat requested a review from jtnord December 1, 2017 22:55
@batmat batmat added the work-in-progress The PR is under active development, not ready to the final review label Dec 2, 2017
@@ -24,7 +24,7 @@
  * <strong>really</strong> bad performance regressions.
  */
 @Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 100 * 1000L)
batmat (Member Author):

FTR:

[screenshot]

Commit: And disable FileChannel.force() in the case of AtomicFileWriter.
See the committed javadoc for the full why.
@batmat batmat added needs-more-reviews Complex change, which would benefit from more eyes and removed work-in-progress The PR is under active development, not ready to the final review labels Dec 3, 2017
@amuniz (Member) left a comment

Did you see some flakiness after the fix in #2548? I wonder whether a "corner case" is worth adding this complexity for... I've seen the "atomic not being atomic" issue on network file systems only, never locally.

@@ -24,7 +24,7 @@
  * <strong>really</strong> bad performance regressions.
  */
 @Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 80 * 1000L)
Member:

Why? Has this worsened performance by ~2x?

batmat (Member Author):

Worse than that: around 30x or more for this very case. I just reduced the timeout to 50 seconds, since the test actually takes ~35 seconds in CI.

But as explained in the commit log, I still think the overall change is an improvement. Before, we were quicker but unsafe; that is the whole point of this change... :-|

(Note: I am mainly bumping the timeout so CI does not fail; I was never able to see this test fail on my machine [it takes ~17 seconds with this change] other than by running it in Docker and aggressively reducing IOPS to 40 using Docker resource constraints.)

Since we now *actually* sync to disk, this isn't really a regression IMO. Given the very role of AtomicFileWriter, it looks more like a feature or an improvement than an issue.

So I suspect that 1) we won't see much bad impact in the wild on most setups (as we generally already recommend SSDs and the like, given how Jenkins works), and 2) we win in terms of integrity by making sure (well, raising the chances, let's say) that we do write to disk before renaming files.
@batmat (Member Author) commented Dec 4, 2017

"corner case"

@amuniz This fix is actually more like the main one; #2548 was only about the atomic rename. But it could still happen that we rename a file whose bits have not yet been flushed through the myriad layers of FS caching and networking.

The whole issue may indeed be a corner case, granted, but it is a dramatic one when it happens, as you know (zero-length config files can just take down the whole instance).

I manually built and ran Jenkins on my machine with this change, and couldn't see any noticeable degradation. And that makes sense, even though, for instance, AtomicFileWriterPerfTest does indeed go from 1 second to 35 seconds.

Taken out of context, that can look like just a horrible perf regression.

But if one looks at the test (I wrote it only a few days ago to monitor this), it is doing 1000 Jenkins.save() calls. So a Jenkins.save() now takes (on the CI setup) ~35 milliseconds, instead of ~1 millisecond previously.

35x is a huge variation, but the absolute value IMO tells us that this shouldn't be an actual issue, knowing also that between those two numbers we have added a far better effort to make sure important things are actually written to disk.

@Override
public void close() throws IOException {
    if (channel.isOpen()) {
        channel.force(true);
Member:

For the case where the file is being thrown away (abort is called rather than commit), we could get away with not calling force. (That is, AtomicFileWriter.commit() could call a force method which does a flush and a force, and then subsequently call close; a sketch follows below.)

batmat (Member Author):

Will do as a followup.
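
A minimal sketch of what that followup could look like. The class, fields, and method names below are hypothetical, not the actual Jenkins code: only the commit path pays for the device sync, while abort simply discards the temporary file.

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    class AtomicWriteSketch {
        private final Writer writer;        // writes into tmp through the FileChannel
        private final FileChannel channel;
        private final Path tmp;
        private final Path destination;

        AtomicWriteSketch(Writer writer, FileChannel channel, Path tmp, Path destination) {
            this.writer = writer;
            this.channel = channel;
            this.tmp = tmp;
            this.destination = destination;
        }

        void commit() throws IOException {
            writer.flush();
            channel.force(true);            // make the content durable before the rename
            writer.close();
            Files.move(tmp, destination, StandardCopyOption.ATOMIC_MOVE);
        }

        void abort() throws IOException {
            writer.close();                 // no force(): this file is being thrown away
            Files.deleteIfExists(tmp);
        }
    }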

@jtnord (Member) commented Dec 6, 2017

@amuniz

I've seen the "atomic not being atomic" issue on network file systems only, never locally.

I've seen it locally; it depends on your file system of choice, especially since some flush metadata before the data (or independently of it)... https://www.spinics.net/lists/xfs/msg09930.html

@batmat (Member Author) commented Dec 6, 2017

@jenkinsci/code-reviewers for possibly more eyes or testing.

dwnusbaum previously approved these changes Dec 7, 2017

@dwnusbaum (Member) left a comment

🐝

@svanoort (Member) left a comment

While I think this does ensure that commit() really commits to disk, I feel that, due to the performance discrepancy, the enforced sync-to-disk should be controlled by a system property, with the default set to off.

That way for cases where we need ABSOLUTE integrity, we can set the property on (covering original issues), but we don't wreck the majority of masters where that is not the case.

I can tell you that, because this behavior AFAICT is on for all XmlFiles and Pipeline hits those heavily until JENKINS-47170 lands, the second this lands no master will be able to run Pipeline at an acceptable rate.

Or is this already covered by something else in the other changes that I've missed?

Edit: opt-in enforced flushing for the most critical config files (Jenkins core configs + job config) would also be acceptable IMO, especially if a system property allows you to disable autosync in case we encounter edge cases. It's more the files related to builds, or those that are touched often, where it's a major problem.

@jtnord (Member) commented Dec 7, 2017

That way for cases where we need ABSOLUTE integrity, we can set the property on (covering original issues), but we don't wreck the majority of masters where that is not the case.

@svanoort you cannot make requiring integrity vs. not requiring it a user choice; it is just not an option.

Most people may think "oh yeah", but then Jenkins starts with an empty / missing config.xml, everything goes to pot, and it is Jenkins' fault.

Code that does not need to care about integrity has no business using AtomicFileWriter to begin with, as the sole purpose of that class was to ensure the integrity of the system.

If you ask people to set a flag once they have data loss, you will be screamed at: "WHY WAS THAT NOT THE DEFAULT FOR SAKE".

If you want a flag, then it should be of the form "I_DO_NOT_CARE_ABOUT_SYSTEM_INTEGRITY_GIVE_ME_PERFORMANCE" and require a value of "I_ACCEPT_THE_RISK" before we throw integrity out of the window.

That is, you can OPT out of integrity, but you never have to opt into it when it is the expected behaviour.

@jglick (Member) commented Dec 7, 2017

I would tend to agree. Jenkins should be safe and correct by default. If there is a demonstrable I/O performance issue under heavy load conditions involving Pipeline, well, we can always have a system property to opt out of integrity guarantees, but better would be for the Pipeline FlowNode storage engine to selectively avoid AtomicFileWriter under conditions where it knows loss of a file is tolerable. (The same might be true of build.xml for Runs generally: if these get truncated, well, life moves on.)

@batmat (Member Author) commented Dec 7, 2017

@jglick @svanoort which plugin(s) do you think already have tests (or could have) that would assess the impact this would have (running with -Djenkins.version=2.94-SNAPSHOT)?

As explained above, though the impact is big in proportion, especially on ridiculously slow disks, we are still talking about milliseconds in absolute value. So, for instance, if flow nodes are saved in a single build.xml (?), I tend to think the impact shouldn't be gigantic.
I disabled the sync on flush() for this purpose (because XmlFile transitively calls flush for every single write or so), or do we call AtomicFileWriter.commit() potentially dozens or hundreds of times per second from Pipeline code?

@svanoort (Member) commented Dec 8, 2017

@batmat I ought to be able to test it against Pipeline with Hydra, AKA The Scalability Lab soonishly -- there's a setting I need to benchmark anyway that turns OFF all the AtomicWriter behavior for Pipeline, so that'll be an easy approach to before/after comparison.

The problem with FlowNodes is that they're (currently) saved by default in a whole bunch of tiny XML files, and those small overheads already add up in a big way for them. Now, I have a mode that does away with that nonsense but people have to know that this is the problem and they need to select it.

I suspect that's not the only place where this overhead will pose a problem or where there's XML saved often.

In general, I'm not a fan of changes that introduce huge, non-optional performance regressions to protect against a rare failure mode (tied to operating-system-level failures, here). For moderate-to-large-scale users, these performance changes are more likely to bring down a master than an edge case like this. I remind you all of the ongoing pain we've had with user lookup performance: "just" a performance regression there has claimed the lives of many production masters at large companies, between JENKINS-35493 and its recent kin JENKINS-47429.

Code that does not need to care about integrity has no business using AtomicFileWriter to begin with - as the sole purpose of that was to ensure integrity of the system.

@jtnord The problem is that XmlFile.write(Object) is what many things use by default for persisting XStream content (a brief illustration follows this comment).

I think @batmat did a really clever fix here, but we need to ensure that the fix is only applied where it's critical, i.e. top-level Jenkins configs or job configs, or that at least people can turn it off.

I believe the phrase here is "First Do No Harm"?
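
As an illustration only (this is not Jenkins source; the path and the saved object are placeholders), the common save path that funnels through XmlFile.write(Object) and, underneath it, AtomicFileWriter looks roughly like this:

    import hudson.XmlFile;
    import java.io.File;
    import java.io.IOException;

    class SavePathSketch {
        // Any object XStream can serialize ends up on disk through this path.
        static void save(Object state) throws IOException {
            XmlFile config = new XmlFile(new File("/var/lib/jenkins", "config.xml"));
            config.write(state); // serializes via XStream and writes through AtomicFileWriter
        }
    }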

@stephenc (Member) commented Dec 8, 2017

"First Do No Harm"

Which is exactly picking integrity over performance (as the default... it's fine to allow users to opt in to picking performance over integrity).

@batmat (Member Author) commented Jan 4, 2018

Updated as 3a5189e: the flush & close flags were reversed.

    LOGGER.finest("Flush is forced");
    channel.force(true);
} else {
    LOGGER.finest("Force disabled on flush(), no-op");
olivergondza (Member):

Should we not do some flush here? Otherwise turning the flag off would result in behaviour more broken than it was before.

batmat (Member Author):

@olivergondza good question.

TL;DR: FileChannel.write() already achieves the equivalent of BufferedWriter.write() + BufferedWriter.flush().

FileChannel, contrary to the previously used Files.newBufferedWriter, has no Java-level buffering. So FileChannel.write() "writes to disk", or rather sends data to the OS/disk cache, depending on the system configuration.

So flushing does not make as much sense here, or at least not with the same meaning as when one calls BufferedWriter.flush(). And by the way, even BufferedWriter.flush() does not actually force a full flush to disk: it only moves data from the JVM buffer/RAM to the next level, with no guarantee that things end up on the disk. That is the whole point of this change.

FileChannel.force() actually guarantees everything is flushed to the disk; BufferedWriter.write() + BufferedWriter.flush() does not.
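
A small self-contained sketch of that distinction (the path and strings are placeholders): BufferedWriter.flush() only pushes the JVM-level buffer to the OS cache, whereas FileChannel.force(true) asks the OS to push the data (and metadata) down to the device.

    import java.io.BufferedWriter;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    class FlushVersusForce {
        static void demo(Path path) throws Exception {
            // BufferedWriter: write() fills a JVM-level buffer; flush() hands it to the OS cache.
            try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
                bw.write("some data");
                bw.flush(); // no guarantee the bytes have reached the device yet
            }

            // FileChannel: write() already goes to the OS cache; force(true) syncs to the device.
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
                ch.write(StandardCharsets.UTF_8.encode("more data"));
                ch.force(true); // durable once this returns (modulo hardware write caches)
            }
        }
    }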

olivergondza (Member):

I see. So, if I understand correctly, the data should be sent to disk more eagerly than before (avoiding Java-level caching) even when the flag is turned off. Not that I would object to the change; I am merely pointing out that there is no way for users to switch back to the old behaviour, which we know has worked fine for ages, in case this does not work as expected.

@jtnord (Member) commented Jan 5, 2018

switch to the old behaviour we know was working fine for ages in case this would not work as expected.

forceOnFlush defaults to false from AtomicFileWriter, so we are not pushing to disk on a flush; thus it should behave very similarly (the data has gone from Java to the OS), although possibly via different underlying native calls.

By default, from AtomicFileWriter we are only making sure the data is on disk when the Writer is close()d, and if you want to simulate the old broken behaviour you can: see here and here.

@dwnusbaum dwnusbaum dismissed their stale review January 5, 2018 14:08

Reviewed before most recent changes

@@ -51,6 +54,15 @@

private static final Logger LOGGER = Logger.getLogger(AtomicFileWriter.class.getName());

private static /* final */ boolean LOSE_INTEGRITY = SystemProperties.getBoolean(
        AtomicFileWriter.class.getName() + ".LOSE_DATA_INTEGRITY_FOR_PERFORMANCE");
Member:

This name feels really overly dramatic. Could we call it "DISABLE_FORCED_FLUSH" instead?

Member:

Basic rule: an option should tell people something about what it does. This isn't a very descriptive name IMO, and it isn't really accurate either, since it's not globally destroying data integrity; it's modifying one specific behavior with respect to OS caching.

batmat (Member Author):

I agree to adjust this, but I think our current behaviour is also not 100% safe by default, which it should probably be. I could have put something like REDUCE_LIKELINESS_TO_LOSE_DATA_WHEN_THINGS_GO_WRONG, and that would indeed have been closer to reality, but well :-).

DISABLE_FORCED_FLUSH feels not quite dramatic enough. My thinking and understanding of this is that we really would like it not to be disabled apart from emergencies and very special cases, so let's not encourage that.

But OK, let's rename the option, and still leave a (less dramatic too, OK) message on class loading.


static {
    if (LOSE_INTEGRITY) {
        LOGGER.log(Level.SEVERE, "LOSE_DATA_INTEGRITY_FOR_PERFORMANCE flag used, YOU RISK LOSING DATA.");
Member:

Again, overly dramatic. I think we really only need to warn people about this case happening when the OS or storage is yanked rather than cleanly shut down.

if (LOSE_INTEGRITY) {
    // We should log this at WARNING, but as this code is called often, that would create huge logs on production systems.
    // Hence a similar message is logged once, above, in the static block.
    LOGGER.log(Level.FINE, "WARNING: YOU SET A FLAG THAT COULD LEAD TO DATA LOSS.");
Member:

Seriously, we don't need to warn on every write. The data loss happens when you're yanking storage or the OS out from under the software -- most applications aren't expected to behave well in that scenario.

Member:

We should be issuing a more specific warning, something like "virtualization or storage failures can cause data loss if this is enabled". My concern is that in settings where users do have reliable storage and are bitten by the I/O performance drop, they will still be too scared to use this even if we explicitly tell 'em to.

Commit: As requested especially by Sam
@batmat (Member Author) commented Jan 7, 2018

@svanoort drama adjusted ;-)

@batmat batmat requested a review from svanoort January 7, 2018 22:28
@svanoort (Member) left a comment

Nice work here @batmat. I think with the clearer warning & flag name we are in good shape.

@oleg-nenashev oleg-nenashev added on-hold This pull request depends on another event/release, and it cannot be merged right now ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes labels Jan 8, 2018
@oleg-nenashev (Member):

Please give me some time to review it, guys. I think it's generally fine to go into the next LTS.

@oleg-nenashev oleg-nenashev self-assigned this Jan 8, 2018
@batmat batmat added on-hold This pull request depends on another event/release, and it cannot be merged right now and removed on-hold This pull request depends on another event/release, and it cannot be merged right now labels Jan 8, 2018
ghost commented Jan 11, 2018

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

@batmat (Member Author) commented Jan 12, 2018

@reviewbybees done.

@oleg-nenashev oleg-nenashev removed their assignment Jan 12, 2018
@oleg-nenashev oleg-nenashev removed the on-hold This pull request depends on another event/release, and it cannot be merged right now label Jan 12, 2018
@oleg-nenashev (Member):

Won't have time for it this week, sorry.
Feel free to go forward if you feel it's ready.

@batmat (Member Author) commented Jan 12, 2018

Thank you very much for trying @oleg-nenashev.

Yes, I think it's ready. It's been reviewed pretty thoroughly by many people, and I've tested it quite a lot myself. I would like it to land in a weekly release sooner rather than later, so that we can detect any (hopefully no) unforeseen impacts in the wild and fix them all, so that JENKINS-34855 can be backported and land in the next LTS.

So I will move forward and merge this. Please ping me if this is deemed to have caused any issues once released.

Thank you everyone for the deep reviews.

@oleg-nenashev (Member) left a comment

I see no obvious issues in the code.

@batmat batmat merged commit 438e929 into jenkinsci:master Jan 12, 2018
@batmat batmat deleted the JENKINS-34855 branch January 12, 2018 09:30