[FIX JENKINS-34855] Create a FileChannelWriter and use it for AtomicFileWriter "enforced" flushing #3170
Conversation
@@ -24,7 +24,7 @@
 * <strong>really</strong> bad performance regressions.
 */
@Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 100 * 1000L)
And disable FileChannel.force() in the case of AtomicFileWriter. See the committed javadoc for the full why.
Did you see some flakiness after the fix in #2548? I wonder if a "corner case" is worth adding this complexity... I've seen the "atomic not being atomic" issue on network file systems only, never locally.
@@ -24,7 +24,7 @@
 * <strong>really</strong> bad performance regressions.
 */
@Issue("JENKINS-34855")
-@Test(timeout = 30 * 1000L)
+@Test(timeout = 80 * 1000L)
Why? Has this worsened performance by ~2x?
Worse than that, around 30x or more for that very case. I just reduced it to 50 seconds, since it actually takes ~35 seconds in CI.
But as explained in the commit log, I still think the overall change is an improvement. Before, we were quicker, but unsafe; that is the whole point of this... :-|
(Note: I'm mainly bumping this for CI not to fail, since I was never able to see this test fail on my machine [it takes ~17 seconds with this change] other than by running in Docker and aggressively reducing IOPS to 40 using Docker resource constraints.) Since we now *actually* sync to disk, this isn't really a regression IMO. Given the very role of AtomicFileWriter, it looks more like a feature or an improvement than an issue. So I suspect that 1) we won't see a lot of bad impact in the wild on most setups (as we generally already recommend SSDs and so on, as per how Jenkins works) and 2) we win anyway in terms of integrity by making sure (well, raising the chances, let's say) that we do write to disk before renaming files.
@amuniz This fix is actually more like the main one; #2548 is only about the atomic rename. But it could still happen that we rename something that still has bits not flushed in the myriad layers of FS caching and networking. The whole issue may indeed be a corner case, granted, but it is a dramatic one when it happens, as you know ("zero-length"ed config files can just take down the whole instance). I manually built and ran Jenkins on my machine with this change, and couldn't see any noticeable degradation. Taken out of context, this can look like just a horrible perf regression. But if one looks at the test (and I wrote it only a few days ago to monitor this), it is doing 1000 writes in a row; 35x is a huge variation, but the absolute value IMO tells us that this shouldn't be an actual issue, knowing also that between those two numbers we added a far better effort to make sure important things are actually written to disk.
@Override
public void close() throws IOException {
    if (channel.isOpen()) {
        channel.force(true);
For the case where the file is being thrown away (abort is called rather than commit) we could get away with not calling force. (That is, ATF.commit could call a force method which does a flush, force, and then subsequently call close.)
Will do as a followup.
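The commit/abort split suggested above could look roughly like this. This is a hypothetical sketch under assumed names (the class `SketchAtomicWriter` and its methods are invented for illustration, not the actual `FileChannelWriter`/`AtomicFileWriter` code in the PR):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

class SketchAtomicWriter implements AutoCloseable {
    private final Path tmp;
    private final FileChannel channel;
    private final Writer writer;

    SketchAtomicWriter(Path destination) throws IOException {
        // Write to a sibling temp file, renamed into place on commit.
        this.tmp = destination.resolveSibling(destination.getFileName() + ".tmp");
        this.channel = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.writer = Channels.newWriter(channel, StandardCharsets.UTF_8.newEncoder(), -1);
    }

    void write(String s) throws IOException {
        writer.write(s);
    }

    /** Commit path: flush Java buffers, force to the device, close, then rename. */
    void commit(Path destination) throws IOException {
        writer.flush();          // Java-side buffers -> OS
        channel.force(true);     // OS page cache -> device, metadata included
        writer.close();
        Files.move(tmp, destination, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Abort path: the file is thrown away, so force() is skipped entirely. */
    @Override
    public void close() throws IOException {
        if (channel.isOpen()) {
            writer.close();
        }
        Files.deleteIfExists(tmp);
    }
}
```

The point of the design is that the expensive `force(true)` happens exactly once, and only on the path where the data will actually be kept.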
I've seen it locally; it depends on your file system of choice, especially if some flush metadata before data (or independently)... https://www.spinics.net/lists/xfs/msg09930.html
@jenkinsci/code-reviewers for possibly more eyes or testing.
🐝
While I think this does ensure commit() really commits, I feel that due to the performance discrepancy, the enforced sync-to-disk should be controlled by a system property, with the default set to off.
That way for cases where we need ABSOLUTE integrity, we can set the property on (covering original issues), but we don't wreck the majority of masters where that is not the case.
I can tell you that because this behavior AFAICT is on for all XmlFiles, and Pipeline hits those heavily until JENKINS-47170 lands, the second this lands no master will be able to run Pipeline at an acceptable rate.
Or is this already covered by something else in the other changes that I've missed?
Edit: opt-in enforced flushing for the most critical config files (Jenkins core configs + job config) would also be acceptable IMO, especially if a system property allows disabling autosync in case we encounter edge cases. It's more the files related to builds, or files that are touched often, where it's a major problem.
@svanoort you can not control requiring integrity vs not, it is just not an option. Most people may think "oh yeah", but then Jenkins starts with an empty / missing config.xml and everything goes to pot, and it is Jenkins' fault. Code that does not need to care about integrity has no business using AtomicFileWriter in the first place. If you ask people to set a flag once they have data loss, you will be screamed at: "WHY WAS THAT NOT THE DEFAULT FOR SAKE". If you want a flag, then it should be of the form that lets you OPT out of integrity; you never have to opt into it when it is the expected behaviour.
I would tend to agree. Jenkins should be safe and correct by default. If there is a demonstrable I/O performance issue under heavy load conditions involving Pipeline, well, we can always have a system property to opt out of integrity guarantees, but better would be for the Pipeline…
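The opt-out shape being argued for here can be sketched with a plain system property. The property name below is hypothetical, purely for illustration, and is not the name the PR ended up using:

```java
class OptOutFlag {
    // Hypothetical property name, for illustration only.
    static final boolean DISABLE_FORCE =
            Boolean.getBoolean("hypothetical.AtomicFileWriter.disableForcedFlush");

    // Integrity is the default; an admin must explicitly opt out of it
    // with -Dhypothetical.AtomicFileWriter.disableForcedFlush=true.
    static String mode() {
        return DISABLE_FORCE ? "fast, unsafe" : "safe, forced sync";
    }
}
```

With no `-D` flag set, `mode()` returns the safe behaviour, which is exactly the "opt out of integrity, never opt into it" rule described above.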
@jglick @svanoort which plugin(s) do you think already have tests (or could have them?) that would assess the impact this would have (running with -Djenkins.version=2.94-SNAPSHOT)? As explained above, though the impact is big in proportion, especially on ridiculously slow disks, we're still talking about milliseconds in absolute values. So for instance if flownodes are saved in a single build.xml (?), I tend to think the impact shouldn't be gigantic?
@batmat I ought to be able to test it against Pipeline with Hydra, AKA The Scalability Lab, soonishly; there's a setting I need to benchmark anyway that turns OFF all the AtomicWriter behavior for Pipeline, so that'll be an easy approach to a before/after comparison. The problem with FlowNodes is that they're (currently) saved by default in a whole bunch of tiny XML files, and those small overheads already add up in a big way for them. Now, I have a mode that does away with that nonsense, but people have to know that this is the problem and they need to select it. I suspect that's not the only place where this overhead will pose a problem or where XML is saved often. In general, I'm not a fan of changes that introduce huge, non-optional performance regressions to protect against a rare failure mode (tied to operating-system-level failures, here). For moderate-to-large scale users, these performance changes are more likely to bring down a master than an edge case like this. I remind you all here about the ongoing pain we've had with user lookup performance: "just" a performance regression there has claimed the lives of many production masters at large companies, c.f. JENKINS-35493 and its recent kin JENKINS-47429.
@jtnord The problem is that I think @batmat did a really clever fix here, but we need to ensure that the fix is only applied where it's critical, i.e. top-level Jenkins configs or build configs, or that at least people can turn it off. I believe the phrase here is "First, do no harm"?
Which is exactly picking integrity over performance (as the default... it's fine to allow users to opt in to picking performance over integrity).
Updated as 3a5189e: flush & close flags were reversed.
    LOGGER.finest("Flush is forced");
    channel.force(true);
} else {
    LOGGER.finest("Force disabled on flush(), no-op");
Should we not do some flush here? Otherwise turning the flag off would result in behaviour more broken than it was before.
@olivergondza good question.
TL;DR: FileChannel.write() already achieves the equivalent of BufferedWriter.write() + BufferedWriter.flush().
FileChannel, contrary to the previous Files.newBufferedWriter, has no Java-level buffering. So FileChannel.write() writes to disk (well, actually sends data to the OS/disk cache, depending on the system config). So flushing does not make as much sense here, or at least not with the same meaning as when one calls BufferedWriter.flush(). And BTW, even BufferedWriter.flush() does not actually force a full flush to disk; it is more like moving data from the JVM cache/RAM to the next level, with no guarantee things end up on the disk. That is the whole point of this change.
FileChannel.force() actually guarantees everything is flushed to the disk. BufferedWriter.write() + BufferedWriter.flush() does not.
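The distinction above can be shown in isolation. A minimal sketch (the helper name `writeDurably` is invented here, not part of the PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ForceDemo {
    static void writeDurably(Path file, String content) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // No Java-level buffer here: write() hands the bytes straight to
            // the OS, which is already what BufferedWriter.write() + flush()
            // achieves with an extra copy in between.
            ch.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
            // Only force() asks the kernel to push its page cache to the
            // device; 'true' also syncs file metadata (roughly the
            // fsync(2) vs fdatasync(2) distinction).
            ch.force(true);
        }
    }
}
```

Dropping the `force(true)` line leaves the data in the OS cache, which is the pre-PR state of affairs this change addresses.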
I see. So, if I understand correctly, the data should be sent to disk more eagerly than before (avoiding Java-level caching) even when turned off. Not that I would object to the change; I am merely pointing out there is no way for users to switch to the old behaviour, which we know was working fine for ages, in case this does not work as expected.
> switch to the old behaviour we know was working fine for ages in case this would not work as expected.

forceOnFlush defaults to false from AtomicFileWriter, so we are not pushing to disk on a flush; thus it should behave very similarly (the data has gone from Java to the OS), although possibly via different underlying native calls. By default, from AtomicFileWriter we are only making sure the data is on disk when the Writer is close()d, and if you want to simulate the old broken behaviour you can: see here and here.
@@ -51,6 +54,15 @@
 private static final Logger LOGGER = Logger.getLogger(AtomicFileWriter.class.getName());

+private static /* final */ boolean LOSE_INTEGRITY = SystemProperties.getBoolean(
+        AtomicFileWriter.class.getName() + ".LOSE_DATA_INTEGRITY_FOR_PERFORMANCE");
This name feels really overly dramatic. Could we call it "DISABLE_FORCED_FLUSH" instead?
Basic rule: an option should tell people something about what it does -- this isn't a very descriptive name IMO and isn't really accurate either since it's not globally destroying data integrity, it's modifying one specific behavior wrt the OS caching.
I agree to adjust this, but I think our current behaviour is also not 100% safe by default, which it should probably be. I could have put something like REDUCE_LIKELINESS_TO_LOSE_DATA_WHEN_THINGS_GO_WRONG and that would indeed have been closer to reality, but well :-).
DISABLE_FORCED_FLUSH feels a bit not dramatic enough. My thinking is that we really would like this to stay enabled apart from emergencies and very special cases, so let's not encourage disabling it.
But OK, let's rename the option, and still leave a (less dramatic too, OK) message on class loading.
static {
    if (LOSE_INTEGRITY) {
        LOGGER.log(Level.SEVERE, "LOSE_DATA_INTEGRITY_FOR_PERFORMANCE flag used, YOU RISK LOSING DATA.");
Again, overly dramatic. I think we really only need to warn people about this case happening when the OS or storage is yanked rather than cleanly shut down.
if (LOSE_INTEGRITY) {
    // We should log this in WARNING, but as this code is called often, that would create huge logs on prod systems.
    // Hence a similar log is logged once above in a static block.
    LOGGER.log(Level.FINE, "WARNING: YOU SET A FLAG THAT COULD LEAD TO DATA LOSS.");
Seriously, we don't need to warn on every write. The data loss happens when you're yanking storage/OS out from under the software -- most applications aren't expected to behave in that scenario.
We should be issuing a more specific warning, something like "virtualization or storage failures can cause data loss if this is enabled." My concern is that in settings where users do have reliable storage and are bitten by the I/O performance drop, they will still be too scared to use this even if we explicitly tell 'em to.
As requested especially by Sam
@svanoort drama adjusted ;-)
Nice work here @batmat. I think with the clearer warning & flag name we are in good shape.
Please give me some time to review it, guys. I think it's generally fine to go into the next LTS.
This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.
@reviewbybees done. |
Won't have time for it this week, sorry.
Thank you very much for trying @oleg-nenashev. Yes, I think it's ready. It's been reviewed pretty thoroughly by many people, and I've tested it quite a lot myself. I would like it to land sooner rather than later in the weekly so that we can detect any (hopefully no) unforeseen impacts in the wild, and fix them all so that JENKINS-34855 can be backported and land in the next LTS. So I will move forward and merge this. Please ping me if this is deemed to have caused any issues once released. Thank you everyone for the deep reviews.
I see no obvious issues in the code.
See JENKINS-34855.
Follow-up of #2548 & #3165.
After a lot of reading and some testing, the only reliable ways to force a local device sync from Java code appear to be either FileChannel.force(boolean) or FileDescriptor.sync(). So even if the issues around this have generally been seen more when using remote file systems, I think we can hope (?) that making sure we actually flush to the local disk could at least make that issue (i.e. the zero-length files seen in the wild) overall (far?) less likely.

This change rewires the current internal core writer in AtomicFileWriter to delegate to a FileChannel, giving us access to FileChannel.force().
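The two primitives named in the description are interchangeable for this purpose. A minimal sketch of the FileDescriptor.sync() variant (the helper name `writeAndSync` is invented for illustration):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class SyncDemo {
    static void writeAndSync(File file, String content) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(content.getBytes(StandardCharsets.UTF_8)); // JVM -> OS
            // Block until the OS has pushed its buffers for this descriptor
            // to the underlying device, like fsync(2).
            out.getFD().sync();
        }
    }
}
```

The PR uses the FileChannel route instead, since AtomicFileWriter already works at the Writer/channel level, but the durability guarantee is the same.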
Proposed changelog entries
Internal: JENKINS-34855: AtomicFileWriter now uses a FileChannel internally and forces a disk sync when committing.
Submitter checklist
* Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
ToDos after merge
Desired reviewers
@reviewbybees esp. @jtnord