Premature free() in sftp_readdir_async() #7

Closed
ShapeShifter499 opened this Issue Feb 16, 2016 · 21 comments

Comments

Projects
None yet
3 participants

Hi,

With some help from IRC channel ##networking on Freenode it seems there maybe a bug with how sshfs uses fuse. Maybe with 6a2d06e Someone suggested running sshfs with the flag "-o sync_readdir" and it seems to help with the crash I was experiencing. Log of the original crash below.

[raansu@Y40-80 ~]$ sshfs --version
SSHFS version 2.5
FUSE library version: 2.9.5
fusermount version: 2.9.5
using FUSE kernel interface version 7.19
[raansu@Y40-80 ~]$ uname -a
Linux Y40-80 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux
[raansu@Y40-80 ~]$ 
Snippet of following command output is below "gdb --args /usr/bin/sshfs raansu-chromebook@192.168.1.5:/mnt /mnt -o reconnect,debug -C -o workaround=all"

*** Error in `/usr/bin/sshfs': double free or corruption (fasttop): 0x00007fffe8000900 ***
unique: 94124, opcode: READ (15), nodeid: 4, insize: 80, pid: 20076
read[140736549326944] 4096 bytes from 74526720 flags: 0x8002
   read[140736549326944] 4096 bytes from 74526720
   unique: 94124, success, outsize: 4112
======= Backtrace: =========
[backtrace without file names and line numbers snipped]

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6456700 (LWP 19974)]
0x00007ffff70ff5f8 in raise () from /usr/lib/libc.so.6
(gdb)
Contributor

Nikratio commented Feb 16, 2016

Sorry, that backtrace doesn't contain debugging symbols and is thus pretty much useless. Could you please install non-stripped versions of libfuse and sshfs (the procedure depends on your distribution, but compiling from source should always work)? Ideally, please try to reproduce this when running under valgrind. The problem is that the glibc error only appears long after the damage has been done, valgrind should tell us exactly when things go wrong.

Non stripped? Sorry I'm more of a 'end user' of Linux, would compiling straight from git be enough?

EDIT: Um duh you already said it in the comment above and I missed it while reading oops

It seems like my Linux distro Arch Linux already pulls from git for their packages. I'm not sure what they did different that I wouldn't do.
FUSE: https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/fuse

SSHFS: https://projects.archlinux.org/svntogit/community.git/tree/trunk/PKGBUILD?h=packages/sshfs

EDIT: Suppose I'll try running what I have through Valgrind

Contributor

Nikratio commented Feb 16, 2016

Try to run file on your sshfs and libfuse, it should show this (look at the end of the line):

sshfs: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=013de4f71447e974ec08bc4b784671a3072cf0c8, not stripped
lib/.libs/libfuse.so.2.9.4: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=bb4a5de2097be5e4e1725247da7d8972725a8c71, not stripped

not this:

/lib/x86_64-linux-gnu/libfuse.so.2.9.3: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=f189495dbaf2b058015ecb4e8f992da27b589ac3, stripped
/usr/bin/sshfs: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=494863c988c204bafaf02a3907ceb91e140235b7, stripped

I have stripped, now what?

SSHFS

[raansu@Y40-80 ~]$ file /usr/bin/sshfs
/usr/bin/sshfs: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=d94d72f1f69133fe2de4e492045eb8e833891ad4, stripped

LibFuse

[raansu@Y40-80 ~]$ file /lib/libfuse.so*
/lib/libfuse.so:       symbolic link to libfuse.so.2.9.4
/lib/libfuse.so.2:     symbolic link to libfuse.so.2.9.4
/lib/libfuse.so.2.9.4: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=71ee1d27319e5385121bf8d547b73d23522652b0, stripped
[raansu@Y40-80 ~]$ 
Contributor

Nikratio commented Feb 16, 2016

Now install unstripped packages using the appropriate means for your distribution (on Debian, that would be apt install <package>-dbg), or compile from source instead.

I'll compile later and get things going. Seems my distro provides some dev and dbg packages but not for fuse and sshfs. Expect an update on that sometime tomorrow.

P.S. I'm currently running a load over sshfs through Valgrind with the versions I got from the distro. Exactly same command as the one I used with the gdb log but just valgrind in place of gdb. However this time around I have yet to hit a crash. I don't know enough to understand why, any ideas?

Just to add, before running sshfs through valgrind. I would experience the crash hours after I told Digikam to go and organize over 22k in photos and videos by the folder structure year/month/date of when it was taken or created using files in SSHFS mounted file system from the remote machine. Sometime into this SSHFS would core dump and shortly after Digikam would core dump. Only I saw just a empty desktop when I came back, no Digikam running. Below are one of the core dumps I saw in "journalctl -b"

Feb 15 05:03:32 Y40-80 systemd-coredump[4083]: Process 1805 (sshfs) of user 1000 dumped core.

                                               Stack trace of thread 4021:
[stuff without file names and line numbers snipped]

Tried to trigger the bug again without Valgrind and got a different gdb output but same issue, SSHFS is dying without a known reason.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6456700 (LWP 32675)]
0x00007ffff70ff5f8 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff70ff5f8 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7100a7a in abort () from /usr/lib/libc.so.6
[backtrace without file names and line numbers snipped]
(gdb)
Contributor

Nikratio commented Feb 17, 2016

On Feb 16 2016, Lance notifications@github.com wrote:

Tried to trigger the bug again without Valgrind and got a different
gdb output but same issue, SSHFS is dying without a known reason.
[...]

Sorry, without debugging symbols all the information you're providing
here is unusable for me...

Best,
-Nikolaus

GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

         »Time flies like an arrow, fruit flies like a Banana.«

Ok so couldn't sleep, compiled the correct packages. Hopefully what you need is below.

SSHFS

[raansu@Y40-80 ~]$ file /usr/bin/sshfs 
/usr/bin/sshfs: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=4e5b9f09176aff746011e3b1287d682bd31d61c7, not stripped
[raansu@Y40-80 ~]$ 

LibFUSE

[raansu@Y40-80 ~]$ file /usr/lib/libfuse.so*
/usr/lib/libfuse.so:       symbolic link to libfuse.so.2.9.4
/usr/lib/libfuse.so.2:     symbolic link to libfuse.so.2.9.4
/usr/lib/libfuse.so.2.9.4: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=42b59ea36f962cfcab8baaa6b024fe5638917d24, not stripped
[raansu@Y40-80 ~]$ 
[raansu@Y40-80 ~]$ sshfs --version
SSHFS version 2.5
FUSE library version: 2.9.5
fusermount version: 2.9.5
using FUSE kernel interface version 7.19
[raansu@Y40-80 ~]$ uname -a
Linux Y40-80 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux
[raansu@Y40-80 ~]$ 

Now for the crash log with the updated non stripped binaries
"gdb --args /usr/bin/sshfs raansu-chromebook@192.168.1.5:/mnt /mnt -o reconnect,debug -C -o workaround=all"

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4c53700 (LWP 9366)]
0x00007ffff70ff5f8 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff70ff5f8 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7100a7a in abort () from /usr/lib/libc.so.6
#2  0x00007ffff713e05a in __libc_message () from /usr/lib/libc.so.6
#3  0x00007ffff71439a6 in malloc_printerr () from /usr/lib/libc.so.6
#4  0x00007ffff714418e in _int_free () from /usr/lib/libc.so.6
#5  0x000000000040416d in buf_free (buf=0x7fffe40198b8) at sshfs.c:520
#6  request_free (req=0x7fffe4019880) at sshfs.c:1272
#7  0x00000000004060ac in sftp_request_wait (req=0x7fffe4019880, type=type@entry=12 '\f', expect_type=expect_type@entry=104 'h', outbuf=outbuf@entry=0x7ffff4c52b40) at sshfs.c:1824
#8  0x000000000040bb75 in sftp_readdir_async (filler=<optimized out>, h=<optimized out>, handle=<optimized out>) at sshfs.c:2096
#9  sshfs_getdir (path=<optimized out>, h=0x7ffff4c52ba0, filler=0x40c4d0 <cache_dirfill>) at sshfs.c:2157
#10 0x000000000040c72a in cache_getdir (path=0x7fffe400a450 "/unified-jbod/Pictures/Birthdays/Mom's Birthdays", h=0x7ffff4c52c00, filler=0x7ffff7ba6680 <fill_dir_old>) at cache.c:327
#11 0x00007ffff7babc93 in fuse_fs_readdir (fs=0x618e70, path=0x7fffe400a450 "/unified-jbod/Pictures/Birthdays/Mom's Birthdays", buf=0x7fffe0579b30, filler=0x7ffff7ba66f0 <fill_dir>, off=0, 
    fi=0x7ffff4c52c80) at fuse.c:2019
#12 0x00007ffff7babdcc in readdir_fill (fi=0x7ffff4c52c80, dh=0x7fffe0579b30, off=0, size=4096, ino=4204, req=0x7fffe4006c50, f=0x618d10) at fuse.c:3467
#13 fuse_lib_readdir (req=0x7fffe4006c50, ino=4204, size=4096, off=0, llfi=<optimized out>) at fuse.c:3493
#14 0x00007ffff7bb29e6 in do_readdir (req=<optimized out>, nodeid=<optimized out>, inarg=<optimized out>) at fuse_lowlevel.c:1390
#15 0x00007ffff7bb3f69 in fuse_ll_process_buf (data=0x619000, buf=0x7ffff4c52f00, ch=<optimized out>) at fuse_lowlevel.c:2443
#16 0x00007ffff7bb0728 in fuse_do_work (data=0x7fffec01f0a0) at fuse_loop_mt.c:117
#17 0x00007ffff74774a4 in start_thread () from /usr/lib/libpthread.so.0
#18 0x00007ffff71b513d in clone () from /usr/lib/libc.so.6
(gdb) 
Contributor

Nikratio commented Feb 17, 2016

Yep, this backtrace is useful. But unfortunately it only tells us that you're most likely seeing the same issue that you referenced above. Something corrupts libc's malloc pool, and this corruption later causes the segfault for which you got the stacktrace. Valgrind should be able to tell us when the actual corruption occurs (unless it's a multithreading issue after all, but I don't think so).

@Nikratio Nikratio added the bug label Jun 5, 2016

Contributor

Nikratio commented Jun 3, 2017

I'm closing this issue for now, since I can't locally reproduce it and without a valgrind log there is little I can do to fix it. Please feel free to re-open with additional information.

@Nikratio Nikratio closed this Jun 3, 2017

@Nikratio Nikratio added the needs-info label Jun 3, 2017

mikemol commented Jul 7, 2017

I started encountering this earlier this week. Running with sshfs now. If anyone has a good way to do a systemd+fstab-driven sshfs mount under valgrind automatically, I'm all ears. For now, here's my command line, for anyone else who stumbles across the same problem:

time valgrind --tool=memcheck --read-var-info=yes --track-origins=yes --leak-check=full --show-leak-kinds=all --partial-loads-ok=no --expensive-definedness-checks=yes -v sshfs user@server /path/to/mount -o idmap=user -o transform_symlinks -o allow_other -o reconnect -f > sshfs.valgring.log 2>&1

If there's any additional flags I ought to be using, let me know.

mikemol commented Jul 10, 2017

==19702== Invalid read of size 4
==19702==    at 0x114AE6: sftp_readdir_async (sshfs.c:2092)
==19702==    by 0x114AE6: sshfs_getdir (sshfs.c:2157)
==19702==    by 0x1157F1: cache_getdir (cache.c:327)                                                                                                                                                                                                                                                                     
==19702==    by 0x4E42972: fuse_fs_readdir (fuse.c:2014)                                                                                                                                                                                                                                                                 
==19702==    by 0x4E42AAA: readdir_fill (fuse.c:3460)                                                                                                                                                                                                                                                                    
==19702==    by 0x4E42AAA: fuse_lib_readdir (fuse.c:3486)                                                                                                                                                                                                                                                                
==19702==    by 0x4E496A5: do_readdir (fuse_lowlevel.c:1389)                                                                                                                                                                                                                                                             
==19702==    by 0x4E4ABDA: fuse_ll_process_buf (fuse_lowlevel.c:2441)                                                                                                                                                                                                                                                    
==19702==    by 0x4E47470: fuse_do_work (fuse_loop_mt.c:117)                                                                                                                                                                                                                                                             
==19702==    by 0x55B2DC4: start_thread (pthread_create.c:308)                                                                                                                                                                                                                                                           
==19702==    by 0x58BE76C: clone (clone.S:113)                                                                                                                                                                                                                                                                           
==19702==  Address 0x7e4c4c0 is 0 bytes inside a block of size 136 free'd                                                                                                                                                                                                                                                
==19702==    at 0x4C29CDD: free (vg_replace_malloc.c:530)                                                                                                                                                                                                                                                                
==19702==    by 0x52C341D: g_free (gmem.c:189)                                                                                                                                                                                                                                                                           
==19702==    by 0x10E055: process_one_request (sshfs.c:1384)                                                                                                                                                                                                                                                             
==19702==    by 0x10E055: process_requests (sshfs.c:1414)                                                                                                                                                                                                                                                                
==19702==    by 0x55B2DC4: start_thread (pthread_create.c:308)                                                                                                                                                                                                                                                           
==19702==    by 0x58BE76C: clone (clone.S:113)                                                                                                                                                                                                                                                                           
==19702==  Block was alloc'd at                                                                                                                                                                                                                                                                                          
==19702==    at 0x4C2A975: calloc (vg_replace_malloc.c:711)                                                                                                                                                                                                                                                              
==19702==    by 0x52C3365: g_malloc0 (gmem.c:124)                                                                                                                                                                                                                                                                        
==19702==    by 0x11027A: sftp_request_send (sshfs.c:1835)                                                                                                                                                                                                                                                               
==19702==    by 0x1149B7: sftp_readdir_send (sshfs.c:2039)                                                                                                                                                                                                                                                               
==19702==    by 0x1149B7: sftp_readdir_async (sshfs.c:2067)                                                                                                                                                                                                                                                              
==19702==    by 0x1149B7: sshfs_getdir (sshfs.c:2157)                                                                                                                                                                                                                                                                    
==19702==    by 0x1157F1: cache_getdir (cache.c:327)                                                                                                                                                                                                                                                                     
==19702==    by 0x4E42972: fuse_fs_readdir (fuse.c:2014)                                                                                                                                                                                                                                                                 
==19702==    by 0x4E42AAA: readdir_fill (fuse.c:3460)                                                                                                                                                                                                                                                                    
==19702==    by 0x4E42AAA: fuse_lib_readdir (fuse.c:3486)                                                                                                                                                                                                                                                                
==19702==    by 0x4E496A5: do_readdir (fuse_lowlevel.c:1389)                                                                                                                                                                                                                                                             
==19702==    by 0x4E4ABDA: fuse_ll_process_buf (fuse_lowlevel.c:2441)                                                                                                                                                                                                                                                    
==19702==    by 0x4E47470: fuse_do_work (fuse_loop_mt.c:117)                                                                                                                                                                                                                                                             
==19702==    by 0x55B2DC4: start_thread (pthread_create.c:308)                                                                                                                                                                                                                                                           
==19702==    by 0x58BE76C: clone (clone.S:113)
Contributor

Nikratio commented Jul 12, 2017

@mikemol Thanks! What sshfs version is this?

@Nikratio Nikratio reopened this Jul 12, 2017

mikemol commented Jul 12, 2017

fuse-sshfs-2.5-1.el7.x86_64

Latest version from EPEL on CentOS 7.

@Nikratio Nikratio self-assigned this Jul 12, 2017

@Nikratio Nikratio removed the needs-info label Jul 12, 2017

@Nikratio Nikratio changed the title from sshfs' fuse_fs_readdir bug to Premature free() in sftp_readdir_async() Jul 12, 2017

Contributor

Nikratio commented Jul 12, 2017

Thanks! Looks like sftp_readdir_async() tries to access req->want_reply. However, sftp_process_one_request() frees req if req->want_reply under the same conditions. I'll have to think about how to best fix this.

mikemol commented Jul 12, 2017

Added to EPEL's bugzilla so they're aware when a fix is available, and can incorporate it: https://bugzilla.redhat.com/show_bug.cgi?id=1470193

Contributor

Nikratio commented Jul 12, 2017

Could you test if the following patch solves the problem?

diff --git a/sshfs.c b/sshfs.c
index 96b98..5b31c 100644
--- a/sshfs.c
+++ b/sshfs.c
@@ -2085,11 +2085,15 @@ static int sftp_readdir_async(struct buffer *handle, fuse_cache_dirh_t
                        outstanding--;
 
                        if (done) {
+                               /* We need to cache want_reply, since processing
+                                  thread may free req if want_reply is 0 */
+                               int want_reply;
                                pthread_mutex_lock(&sshfs.lock);
                                if (sshfs_req_pending(req))
                                        req->want_reply = 0;
+                               want_reply = req->want_reply;
                                pthread_mutex_unlock(&sshfs.lock);
-                               if (!req->want_reply)
+                               if (!want_reply)
                                        continue;
                        }

mikemol commented Jul 12, 2017

That's the solution I was going to suggest, and would fix the race condition. I'm not in a position to test it right now; my build host is unavailable.

However, given that it's clearly a race condition, you should be able to exacerbate it by dropping a sleep(1) between the unlock and the subsequent if. Should be able to make it pretty repeatable that way.

@Nikratio Nikratio closed this in 82766d1 Jul 12, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment