This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Fix race condition in the linux:shrinker splat test. #196

Closed
wants to merge 2 commits into from

Conversation

stevenj
Contributor

@stevenj stevenj commented Dec 3, 2012

The Linux shrinker splat test has a race condition: there is no guarantee that, by the time the write of "2" to /proc/sys/vm/drop_caches returns, the kernel has actually invoked the registered shrinker.
The patch adds a wait event so the test blocks until either the shrinker has run or a 1-second timeout expires.
If the timeout fires, something is genuinely wrong, because the kernel should not need a full second to run the shrinker callback.
Once the shrinker has run and shrunk the count down to 0, the main test resumes and the race is closed.
The only weakness I can see in this fix is if it could legitimately take longer than 1 second to execute, but I don't think that is likely.

It also fixes a small problem in the failsafe code which would trigger if the linux:shrinker test was run multiple times. This change takes my error rate from roughly 1 failure in 6 runs to 0 failures in 10,000 runs.

This fixes issues #182 and #96.
It also fixes one of the failures reported in #125.

@behlendorf
Contributor

@stevenj Can you please clean up and re-push these commits based on the review comments?

There is no guarantee that when the call to write "2" to /proc/sys/vm/drop_caches returns that the kernel has called the shrinker which has been defined.

Also, I'm not convinced of this. My reading of the upstream kernel source suggests the caller will block in the handler until most of the cache has been reclaimed, which is why the test was originally written this way. See https://github.com/torvalds/linux/blob/master/fs/drop_caches.c#L52

@stevenj
Contributor Author

stevenj commented Dec 5, 2012

But you are not synchronously calling anything; you are echoing 2 into a virtual file system. Are you sure the echo will not return before the cache flush happens? I figured it was reasonable not to assume synchronicity here, because you have to consider what bash does, what echo does, what all the I/O redirection does, and what the VFS handling does: is caching or asynchronicity introduced by any of that? I don't know the answers, but I definitely saw cases where the echo returned before any reclaiming occurred, so I highly doubt it is synchronous as implemented.

Initially all I did was put an msleep after the echo and the problem went away. I use a wait queue because it is more reliable than an arbitrary sleep, and it is also faster. I also know this method is used elsewhere in ZFS; it is on my list to check those callers and make sure they are not expecting synchronicity when it doesn't seem guaranteed in this test case.

I will fix the patch and resubmit tonight. Thanks for the comments.

@behlendorf
Contributor

@stevenj How about #200? It's your original patch with some minor tweaking/cleanup. It certainly solves the issue and obviously won't cause harm. I haven't dug too deeply into whether it really is synchronous or not.

@behlendorf behlendorf closed this Dec 11, 2012
@behlendorf
Contributor

@stevenj See commit 8842263, @nedbass identified the reason why the user mode helpers are racier on newer kernels. However, since we still have the issue with stdio buffering your improvements are probably needed as well.
