-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os: reading the contents of a directory can result in "lost" files #24015
Comments
Is this repeatable? Are any files being added or removed to the directory while your program is running? |
It is completely reproducible... on our particular server. I just tested it on a Fedora 27 system and it doesn't happen. We're going to try it out on Ubuntu as well. There may be other knobs to fiddle with as well in terms of mount options. If we discover anything useful I'll comment here. |
@ianlancetaylor FYI, I'm not the only one to run across this kind of issue: https://stackoverflow.com/questions/46719753/golang-bug-ioutil-readdir-listing-files-on-cifs-share-or-doing-something-wro |
Sorry, didn't answer your other question: no, the directory is completely unchanged and the mount point is effectively inaccessible to everybody but me, so I'm fairly certain this isn't an issue of files changing. |
Additional context: glibc readdir() defaults to 32KB buffers and uses getdents64(). There are some old kernel bugs (fixed before 3.10, which CentOS 7 should be on) that might contribute to this. |
Interesting - our RHEL 7 system reports Yes, I misspoke initially - it's RHEL 7, not CentOS 7. I'll update that in the description. |
FWIW: Using readdir in PHP or C, and os.listdir in Python all seem to work, but in C and PHP, listings are returned in a one-item-per-call way, so the lower-level system calls may not be equivalent. And I have no idea what the low-level Python code does (probably just uses the C function, which I suspect PHP also does). But I don't really know how to tell if this is truly a Go bug other than trying other languages, as this is far lower-level than I usually try to debug.... |
Forgot to add that we tested on a separate RHEL 7 system and had the same results, so it isn't just something broken with our environment. |
Try |
It's tough to say the precise problem, as I just don't understand enough of what is or isn't correct. The C version is calling getdents, whereas Go calls getdents64. The C version uses a 32k block where Go is using a 4k block. Since C uses a 32k block, all the responses show up in a single block, whereas Go of course has to call getdents64 one extra time. One potentially interesting difference is that Go calls
Another oddity which (hopefully) points to the problem is that Go's output suggests a skipped record in the
I reformatted these for display, but clearly When running on Fedora 27, the Another interesting thing I noticed is that when I tested a different directory, one with far more files, there's a pattern. On the first call, the bit at the end (I'm assuming this is the number of bytes put into the buffer) is 4080, and the next-to-last When the C program runs against the same huge directory, every I really don't know where to go from here. I still have no idea if this is a kernel or Go bug. |
Another very unfortunate note: using CIFS v1.0 doesn't fix this problem in all cases, it seems. I just noticed on the really huge directory, Go is still skipping files. I have no idea why the CIFS version fixes one particular case but not the other. |
Great analysis. A giant gist would have been okay, but I appreciate not having to read through it myself. I don't know much about this, and it seems like maybe @rmmh has some background here. But to me this is unambiguously a kernel bug. getdents works like read: the kernel maintains state for the FD, and it's responsible for giving you things that fit in the buffer you provided, then continuing from that point when you call again. Assuming nothing strange happened (a seek or some such) in between the two getdents64 calls, the kernel should have given the next file in the next call. Based on the second part of your comment, I suspect the bug is in the interaction between the size of the replies from the CIFS server and the size of the buffer passed to getdents(). It sounds like the CIFS server gives a reply of some size, and then the kernel chunks that into getdents() results, and misses the dentry at each chunk boundary. |
Try rebuilding after setting go/src/os/dir_unix.go's blockSize to 32KB and see if that fixes things. It's probably a kernel bug (since it occurs on 3.10 and not 4.x). I don't know the policy on workarounds for old kernel bugs, but you could easily make the argument that a larger buffer is more efficient for listing directories (~128 files per syscall is small). I would suggest writing a benchmark that does filepath.Walk, but filepath.Walk is probably too slow to notice a difference, since it calls lstat on each file (#16399). |
@heschik I think you're right - I thought strace was reporting what was sent to I'll do some more hacking at the go source to see about a 32k blockSize. And I can probably do a benchmark that just lists a bunch of hard-coded directories - we have hundreds (maybe thousands) of these things that the Go application is reading, so I have no shortage of test material. |
Benchmark results:
These tests are far from scientific, but across dozens of repeated runs the general trends were the same. Benchmarking aside, the 32k buffer does seem to give me the right readings in all cases I've tested on the CIFS mount. I'm not sure if this is an acceptable thing to change in core, given it's likely a kernel bug, but if nothing else I'm going to have to hack up our local Go source in order to prevent file loss. |
@jechols I can reproduce this on RHEL 7.4. Did you report it at https://bugzilla.redhat.com or as a ticket? I suspect 32k hides the issue, but if is possible to get the kernel client to receive even larger listings from the server in one operation, it might be that the issue is still there. I skimmed the kernel code, and I find this line suspicious, since rc = cifs_filldir(current_entry, file, filldir,
direntry, tmp_buf, max_len);
if (rc == -EOVERFLOW) {
rc = 0;
break;
}
file->f_pos++; from https://elixir.bootlin.com/linux/v3.10.108/source/fs/cifs/readdir.c#L869 Upstream, it seems this was changed in 3.11 when
from https://lwn.net/Articles/557314/ If I change the code in RHEL7.4 to if (rc) {
rc = 0;
break;
} there are no lost files. |
@chlunde that is excellent information. Our plan was to submit a bug to Red Hat, but we didn't have the time (or expertise) to do a very quick analysis. Your information should be really helpful! Thanks! |
I've had an rclone user report identical symptoms to this on the forum. The symptoms are a missing file in directory listings. I note that on my system at least, none of the normal tools I also mashed together a little C program which calls |
@chlunde I'm waiting for my supervisor to get the ball rolling, which may take a while to get from him to the campus central IT people and from them to RedHat. I don't really understand the red tape, but it sounds like something I need to let them handle, not go and do it myself.... |
It sounds like the general consensus is that this is a kernel bug. It doesn't sound like there is anything we do to fix it in Go, so closing. Please comment if you disagree. |
@ianlancetaylor I agree this isn't Go's fault, but given the relative difficulty of understanding and fixing the problem for those of us using CentOS / RHEL, and the prevalence of RHEL 7, I feel like a solution in Go may be worthwhile. I'm still uncertain if / when this will be fixed at the OS level given that 90% of apps use the C function. Remember that this bug only appears in Go. PHP, Python, Ruby, and others use the C function which has a large enough buffer that the bug is not triggered. If this isn't addressed in the Go source, I fear Go users will come to believe it's a Go defect. And those who discover it isn't may still avoid Go out of fear of encountering similar problems since Go tends to avoid tried-and-true low-level libraries. Run into a problem like this in production, and it will leave a very bad taste in your mouth. Don't get me wrong, I think this serves to show the importance of having multiple approaches instead of all these languages using the same core library, but others may not see it this way. |
@ianlancetaylor, I didn't address your belief that this can't be fixed in Go. It can by upping the buffer size as I did in a test commit: https://github.com/uoregon-libraries/go/commit/1b4c5990626653f9f02754b27e9df61f6d46961e |
@jechols I reported this to Red Hat, and a bz has been created: https://bugzilla.redhat.com/show_bug.cgi?id=1577640 The proper way to increase priority for this kernel bug is to create a support case and ask for higher priority for this bugzilla issue, and maybe ask for status directly on bugzilla too. If you can't get a support account, even though they do not cost your organization anything, at least go to bugzilla. |
As far as I can see, increasing the buffer size just hides the bug, by making it only occur when there are more files in the directory, or when the files have longer names. That's not a fix. It's sort of an anti-fix. If your tools are not reliable, it's better to know that they are not reliable. If there is an approach that actually works around the problem in all cases, I'm certainly open to it. |
@chlunde, thanks! I'll point that out to my supervisor in an effort to get the university pushing it. Acadamia moves slowly, so I'm still waiting to hear back :-/ @ianlancetaylor, I don't disagree, I just don't like that people will believe this is a Go issue. But you are right, it would just hide, not fix, the problem. It's worth noting that when the buffer size is increased, the bug cannot be reproduced on my end, because CIFS is returning only 5k responses. So there is (currently) no way for even an 8k buffer to hit this bug. From one of my early posts:
I suspect this is why the bug has never been noticed to users of the C library. |
Thanks, that does make it sounds like increasing the buffer size to more than 5760 will paper over the problem reliably for Go users. I'll do that. |
Change https://golang.org/cl/121756 mentions this issue: |
Reportedly CIFS on RHEL 7 can fail to report files if directories are read in 4K increments. While this seems to be a CIFS or RHEL bug, reportedly CIFS does not return more than 5760 bytes in a block, so reading in 8K increments should hide the problem from users with minimal cost. Fixes #24015 Change-Id: Iaf9f00ffe338d379c819ed9edcd4cc9834e3b0f7 Reviewed-on: https://go-review.googlesource.com/121756 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Hello! I also have this problem. |
Which file system are you running this on? And what kernel? This is likely a linux kernel bug rather than a go bug IMHO. |
@yuan1238y This issue is closed and we believe that it is fixed. Also the problem described here was clearly due to a small buffer interacting with a kernel bug. If you are still having trouble with Go version 12, please open a new issue with all the details required to reproduce the problem. Thanks. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?1.9 and 1.10
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?Linux (
CentOSRHEL 7), amd64What did you do?
https://play.golang.org/p/g_0Xgifb9nD
var infos, err = ioutil.ReadDir(i.Location)
i.Location is a CIFS-mounted directory using "vers=3.0" in /etc/fstab's list of options
What did you expect to see?
All files listed in "infos"
What did you see instead?
Most files were listed, but one was missing.
This appears to be related to the version of CIFS in use and/or block size. I dug into the guts of the Go code and changed https://github.com/golang/go/blob/master/src/os/dir_unix.go#L51 to allocate 10x the buffer size, and the operation was successful. I added some debugging output to the loop which populates
d.buf
(https://github.com/golang/go/blob/master/src/os/dir_unix.go#L62-L75) and noticed that the file prior to the missing one was right at the end of one buffer-fill operation, and the file directly after the missing one was at the beginning of the next buffer-fill operation, leading me to believe that somewhere in the buffer setup, something is being cut off and lost somehow. I'm not sure why this is only the case in this precise setup (CIFS 3.0).I wasn't able to debug further, but it is important to note that you can work around this issue by forcing CIFS version 1.0 via "vers=1.0" when mounting.
The text was updated successfully, but these errors were encountered: