
Transport endpoint is not connected #151

Closed
karibertils opened this issue Dec 12, 2020 · 40 comments

@karibertils

I randomly, but persistently, get "cannot access '/media': Transport endpoint is not connected".

This usually seems to happen under heavy load, such as when scanning very many small files. Under lighter load rar2fs remains stable.

It feels like this is related to some timeout: if the underlying fs is unresponsive for X amount of time, then it happens. The underlying fs might hang for a while under heavy load, but it never stops working.

Is there any way to improve this?

@hasse69
Owner

hasse69 commented Dec 12, 2020

Thanks for the issue report.

We need to understand a bit more about the effect of the error message you get (which is coming from FUSE itself).
When you see this error, can you confirm that the rar2fs process is still running? As I read your report it seems it is, just not responsive, and that it eventually becomes responsive again without you having to restart the mount point?
I have never had a report of such behavior before, and normally when you see a Transport endpoint is not connected error it means the FUSE file system crashed or terminated in an unclean fashion.

Are you stacking FUSE file systems or is rar2fs the only thing mounted on top of your underlying fs?
When you get this error, would it be possible to run strace on the running rar2fs process so that we can tell what it is doing at that time?

Do you use the --seek-length=1 mount option? If not, can you reproduce with that?
Have you tried the -owarmup mount option to see if it makes any difference?


@hasse69 hasse69 self-assigned this Dec 12, 2020
@karibertils
Author

Yes, the rar2fs process is still running when it happens, but I can't enter the folder.

I am not stacking FUSE file systems. Here are the options I am using:
rar2fs --date-rar --no-expand-cbr --seek-length=1 --hist-size=40 --iob-size=32 -o max_readahead=128M,max_background=100,congestion_threshold=100,allow_other,rw,umask=000,nonempty /data /media

I think it was the same when using -owarmup. I will check it again.

I will run strace on the process next time it happens and report back.

@hasse69
Owner

hasse69 commented Dec 12, 2020

Can you also try to reduce the number of mount options to a minimum?
It is very hard to see the effects of all those options and how they might interfere with each other.
Try to stick to only these ones if possible:

--seek-length=1 -oallow_other,rw,umask=000,nonempty

It is a bit weird that FUSE would not be able to reach the process; it is multi-threaded by default, and FUSE can spawn numerous threads for file system operations unless some other critical action is in progress and it needs to block. I am also a bit puzzled about whether this happens during the archive scanning (cache population) procedure or during actual extraction from an archive. Once the cache is in effect, nothing should really become as busy as you describe.
Please also check with ps or top whether there are a lot of spawned rar2fs threads, and use lsof | grep rar2fs | wc -l to see if we have ended up with a huge number of open file descriptors owned by rar2fs. Output from ulimit -a could also be interesting to see.
I am starting to think we might have hit some system limit here. top might also tell whether you are starting to run out of free memory.

@karibertils
Author

OK, I have reduced the options to --date-rar --seek-length=1 -o allow_other,rw,umask=000

I usually run find in the folder to populate the cache right after mounting, and haven't noticed problems during that. It usually takes some time until the issue occurs. Sometimes it happens 2-3 times a day; other times it happens after running fine for many days.

Here are some outputs.

# rar2fs --version
rar2fs v1.29.2-git2c47c3a (DLL version 8)    Copyright (C) 2009 Hans Beckerus
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under
certain conditions; see <http://www.gnu.org/licenses/> for details.
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19
# ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       385749
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  12350560
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 385749
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

Output from lsof -n|grep rar2fs | wc -l is 408 right now and rar2fs has 11 threads in top. No issues yet.

I attached strace to the process. I see nothing except the lines below, no matter what is being done on the file system. I don't know if that might change when the error occurs?

nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0

@hasse69
Owner

hasse69 commented Dec 12, 2020

I usually run find in the folder to populate the cache right after mounting

Try to avoid that. Using find forces access through your mount point, which means FUSE has to deal with every single file access. That is why -owarmup is the preferred method: it does exactly what find does but without involving FUSE, which drastically improves responsiveness while the cache is being populated. The warmup is also a lot faster than the find approach.
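
As a rough illustration of why that matters (this is not the actual rar2fs warmup code, and cache_populate() below is a hypothetical stand-in for whatever populates the path cache): a warmup walks the backing source tree directly inside the file system process, so the traversal itself never passes through the FUSE kernel interface, whereas find generates a FUSE request for every entry it touches.

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <sys/stat.h>

/* Hypothetical hook standing in for the real cache population logic. */
static void cache_populate(const char *path)
{
        (void)path;
}

static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
        (void)sb; (void)typeflag; (void)ftwbuf;
        cache_populate(path);
        return 0;                       /* keep walking */
}

/* Walk the source tree (e.g. /data), never the mount point (/media),
 * so the traversal itself does not generate any FUSE traffic. */
static int warmup(const char *source_root)
{
        return nftw(source_root, visit, 64, FTW_PHYS | FTW_MOUNT);
}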

Do you need the --date-rar option or can you disable that too just to make sure?

The number of open files should drop unless you are in the middle of a complete mount point scan. It is not expected to have that many open file descriptors on an idling file system.

There are no obvious system limits that look suspicious.

I need to check the strace output; I did not expect it to behave as you describe.

@hasse69
Owner

hasse69 commented Dec 12, 2020

Also, would it be possible to run rar2fs in the foreground using -f, just to see whether it says something that is otherwise lost when daemonized? And to push it a bit further, use -d instead of -f to see each FUSE operation being called. Be aware that your console is going to be heavily spammed with trace output.

@hasse69
Owner

hasse69 commented Dec 12, 2020

OK, for strace you need to find the actual thread handling the system calls; in my case it was the third one from the top.

# ps -Tef  | grep rar2fs
20026 20026 19953  0 14:32 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1 
20026 20029 19953  0 14:32 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20030 19953  0 14:32 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20174 19953  0 14:34 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20175 19953  0 14:34 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1 
20026 20180 19953  0 14:35 pts/0    00:00:00 ./src/rar2fs -f --date-rar --seek-length=1 

It is easily spotted by trying strace on each pid until you find the one blocking in a read() call.
Note that you must make sure the file system is not being accessed while searching for it, or you might never find it.
But once you end up in the situation where it fails to respond, you need to strace all threads regardless, since there might be multiple file system processing threads active simultaneously; that is the whole point of multi-threaded mode.

@karibertils
Author

Yes, I need --date-rar. Now using --date-rar --seek-length=1 -o allow_other,rw,umask=000,warmup -d and sending stderr to a logfile.

I found the thread blocking on read() and am logging strace on that as well. I will check strace on all threads next time the issue comes up.

@karibertils
Author

The process crashed now.

I did think I saw rar2fs still running on previous occasions, but now I'm thinking I might have been mistaken and it has been crashing all along. What I saw could have been an unrelated dangling instance left running after a lazy unmount. I didn't really pay much attention to it until now, since the crashes keep happening.

Last lines in the rar2fs log

ioctl(14, TCGETS, 0x7f4034b825f0)       = -1 ENOTTY (Inappropriate ioctl for device)
read(14, "Rar!\32\7\0", 7)              = 7
lseek(14, 0, SEEK_CUR)                  = 7
read(14, "\361\373s\1\0\r\0", 7)        = 7
read(14, "\0\0\0\0\0\0", 6)             = 6
lseek(14, 20, SEEK_SET)                 = 20
lseek(14, 0, SEEK_CUR)                  = 20
lseek(14, 0, SEEK_CUR)                  = 20
read(14, "\216\221t\303\200K\0", 7)     = 7
read(14, "\r\360\372\28\313\3@\2S\327g|\262\31\6A\0240+\0 \0\0\0XXXXXXX"..., 68) = 68
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1188, ...}) = 0
lseek(14, 20, SEEK_SET)                 = 20
futex(0x562c558b792c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x562c558b792c, FUTEX_WAKE_PRIVATE, 1) = 1
+++ killed by SIGABRT +++

Last lines from the strace I had running. Maybe I should run it on more threads next time?

LOOKUP /+DL
getattr /+DL
   unique: 730678, error: -2 (No such file or directory), outsize: 16
unique: 730680, opcode: MKDIR (9), nodeid: 1, insize: 52, pid: 882028
mkdir /+DL 0755 umask=0022
getattr /+DL
   NODEID: 44589
   unique: 730680, success, outsize: 144
unique: 730682, opcode: OPENDIR (27), nodeid: 44589, insize: 48, pid: 882028
opendir flags: 0x38800 /+DL
   opendir[139913326466368] flags: 0x38800 /+DL
   unique: 730682, success, outsize: 32
unique: 730684, opcode: RELEASEDIR (29), nodeid: 44589, insize: 64, pid: 0
releasedir[139913326466368] flags: 0x0
   unique: 730684, success, outsize: 16
unique: 730686, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 882033
getattr /
   unique: 730686, success, outsize: 120
unique: 730688, opcode: LOOKUP (1), nodeid: 1, insize: 45, pid: 882033
LOOKUP /.git
getattr /.git
double free or corruption (fasttop)

It crashed pretty much exactly when I did mkdir +DL this time, but usually I'm not creating folders when it happens.
I'm going to see if I can reproduce this on another server, to rule out hardware issues.

@hasse69
Owner

hasse69 commented Dec 12, 2020

At first glance, yes, you seem to have found a nasty bug here. It would be great if we could narrow it down a bit.
The double free or corruption (fasttop) message is not a good sign.

@hasse69
Owner

hasse69 commented Dec 12, 2020

I checked your previous posts; can you please change this

-c: core file size (blocks)         0

using something like ulimit -c unlimited.
Right now you are effectively blocking core dumps from being created, and a core dump might give us more information about the crash.

@karibertils
Author

Ok using unlimited now.

@hasse69
Owner

hasse69 commented Dec 12, 2020

We might need some extra debug symbols, so configure the package using --enable-debug=1.
And make sure to run the built binary, not the installed one, since the latter will be stripped.

@karibertils
Author

Ok

@hasse69
Owner

hasse69 commented Dec 13, 2020 via email

@karibertils
Author

Ok

It crashed again, but I didn't realize ulimit -c unlimited needed to be run in the same shell the rar2fs process was started in, so no core dump. I thought it was a global setting; anyway, it's set up properly now.

The crash message was different this time: munmap_chunk(): invalid pointer

I am running it through gdb now. Let's see what we get from that.

@hasse69
Owner

hasse69 commented Dec 13, 2020

I have spotted an error I have not seen before; I need to look into this.

==21603== Thread 4:
==21603== Invalid read of size 8
==21603==    at 0x413B03: hashtable_entry_delete_subkeys (hashtable.c:242)
==21603==    by 0x4140CE: dircache_get (dircache.c:195)
==21603==    by 0x41D62E: rar2_readdir (rar2fs.c:3373)
==21603==    by 0x4E40408: fuse_fs_readdir (fuse.c:2009)
==21603==    by 0x4E4065B: fuse_lib_readdir (fuse.c:3467)
==21603==    by 0x4E489C5: do_readdir (fuse_lowlevel.c:1390)
==21603==    by 0x4E49BB3: fuse_ll_process_buf (fuse_lowlevel.c:2442)
==21603==    by 0x4E46966: fuse_do_work (fuse_loop_mt.c:117)
==21603==    by 0x5794E99: start_thread (pthread_create.c:308)
==21603==    by 0x5A9DCCC: clone (clone.S:112)
==21603==  Address 0x626b158 is 24 bytes inside a block of size 32 free'd
==21603==    at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21603==    by 0x413B56: hashtable_entry_delete_subkeys (hashtable.c:245)
==21603==    by 0x4140CE: dircache_get (dircache.c:195)
==21603==    by 0x41D62E: rar2_readdir (rar2fs.c:3373)
==21603==    by 0x4E40408: fuse_fs_readdir (fuse.c:2009)
==21603==    by 0x4E4065B: fuse_lib_readdir (fuse.c:3467)
==21603==    by 0x4E489C5: do_readdir (fuse_lowlevel.c:1390)
==21603==    by 0x4E49BB3: fuse_ll_process_buf (fuse_lowlevel.c:2442)
==21603==    by 0x4E46966: fuse_do_work (fuse_loop_mt.c:117)
==21603==    by 0x5794E99: start_thread (pthread_create.c:308)
==21603==    by 0x5A9DCCC: clone (clone.S:112)

@hasse69
Owner

hasse69 commented Dec 13, 2020

Can you try this patch?

diff --git a/src/hashtable.c b/src/hashtable.c
index db8d137..2e6d326 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -238,11 +238,14 @@ void hashtable_entry_delete_subkeys(void *h, const char *key, uint32_t hash)

         for (i = 0; i < ht->size; i++) {
                 b = &ht->bucket[i];
+rewind:
                 p = b;
                 while (p->next) {
                         p = p->next;
-                        if (strstr(p->key, key) == p->key)
+                        if (strstr(p->key, key) == p->key) {
                                 hashtable_entry_delete_hash(h, p->key, p->hash);
+                                goto rewind;
+                        }
                 }
                 if (b->key && (strstr(b->key, key) == b->key))
                         hashtable_entry_delete_hash(h, b->key, b->hash);
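
For readers following along, the pattern valgrind flagged is deleting a list node while still iterating over it: the loop advances with p = p->next, but the deletion it just performed freed the node p still points at, so the next advance reads freed memory. The goto rewind in the patch avoids that by restarting the chain walk after every deletion. An equivalent way to avoid the use-after-free, shown here only as an illustration and not as the rar2fs code, is to save the successor before anything can be freed:

#include <stdlib.h>

struct entry {
        struct entry *next;             /* collision chain */
        char *key;
};

/* Delete every chain node whose key matches, without ever touching
 * a node after it has been freed. bucket_head is a sentinel head. */
static void chain_delete_matching(struct entry *bucket_head,
                                  int (*match)(const char *key))
{
        struct entry *prev = bucket_head;
        struct entry *p = bucket_head->next;

        while (p) {
                struct entry *next = p->next;   /* saved before any free */
                if (match(p->key)) {
                        prev->next = next;      /* unlink ... */
                        free(p->key);
                        free(p);                /* ... then free */
                } else {
                        prev = p;
                }
                p = next;
        }
}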

@karibertils
Author

Sure I'll give it a try

@karibertils
Author

Still stable so far.

@hasse69
Owner

hasse69 commented Dec 15, 2020

I take it that it is still working without hiccups?

@karibertils
Author

Yes, I have been stress testing a lot and no problems so far. At this point I'm 99% sure the patch solved it.

hasse69 pushed a commit that referenced this issue Dec 15, 2020
When new data is added to a directory and all affected cache entries are
invalidated, there is a potential access of already freed memory.
This issue was spotted using valgrind and was for some reason overlooked
after commit c8af7494106705b547b7c545aee2fdd10aeec3ac was introduced.

==21603== Invalid read of size 8
==21603==  Address 0x626b158 is 24 bytes inside a block of size 32 free'd

Resolves issue: #151

Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
@hasse69
Owner

hasse69 commented Dec 15, 2020

I have pushed a patch to master now. It is similar to, but not the same as, the one posted here, since that one was rather ugly.
I will close this issue now, but feel free to re-open it if you find problems with the official patch. I will also submit a new patch release, v1.29.3, soon, since I consider this a rather severe issue.

@hasse69 hasse69 closed this as completed Dec 15, 2020
hasse69 pushed a commit that referenced this issue Dec 15, 2020
When new data is added to a directory and all affected cache entries are
invalidated, there is a potential access of already freed memory.
This issue was spotted using valgrind and was for some reason overlooked
after commit 4b1d308 was introduced.

==21603== Invalid read of size 8
==21603==  Address 0x626b158 is 24 bytes inside a block of size 32 free'd

Resolves issue: #151

Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
@karibertils
Author

I have still been running into crashes periodically, but really randomly. It can be stable for weeks, then crash a few times in a row, and then be stable again for weeks.

But I was working on a Node.js script making symlinks and I noticed it was crashing rar2fs every time.

I stripped the script down to a minimal dummy version which exposes the same bug:

const fs = require('fs')

for (let i = 0; i < 100; i++) {
	for (let k = 0; k < 100; k++) {
		fs.mkdir(`/media/crash/${i}`, err => {
			fs.readdir(`/media/crash/${i}`, err => {
				fs.symlink(`/${i}`, `/media/crash/${i}/${k}`, err => {
					// ignore
				})
			})
		})
	}
}

Here's the log after running it:

feb 14 20:07:29 rocky rar2fs[967764]: Got signal 11, faulty address is (nil), from 0x558021004d3e
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x14d3e) [0x558021004d3e]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fb9c9f0a730]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7fb9c9dd89bd]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x5580210054aa]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x558021005884]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x558021005cda]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55802100e5ef]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55802100eaf7]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7fb9ca0c3250]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7fb9ca0c33ce]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7fb9ca0ce0b8]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7fb9ca0cad5c]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7fb9c9efffa3]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fb9c9e304cf]

I am running
rar2fs --date-rar --no-expand-cbr --seek-length=1 -o noatime,allow_other,rw,umask=000 /data /media

# rar2fs --version
rar2fs v1.29.4-git16009ec (DLL version 8)    Copyright (C) 2009 Hans Beckerus
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under
certain conditions; see <http://www.gnu.org/licenses/> for details.
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19

@hasse69 hasse69 reopened this Feb 14, 2021
@hasse69
Owner

hasse69 commented Feb 14, 2021

It seems there is still some problem in this area. I will see what I can make of your reproducer.

@hasse69
Owner

hasse69 commented Feb 14, 2021

Can you confirm whether your source folder was in fact empty (except for the crash directory) while executing your JavaScript?
I tried reproducing the same problem using the following bash script:

#!/bin/bash

for i in {0..99}
do
   for k in {0..99}
   do
       mkdir d/crash/$i
       ls d/crash/$i
       ln -sf /$i d/crash/$i/$k
   done
done

Mounted using:

# ./src/rar2fs -f --date-rar --no-expand-cbr --seek-length=1 -o noatime,allow_other,rw,umask=000 s/issue151 d

Can you reproduce using this script? I believe I got the essence of it, right?

@hasse69
Owner

hasse69 commented Feb 14, 2021

Also, it would be good if you could run in the foreground (-f) and use gdb to trap the error. A bt from gdb would give us a little more information about exactly which pointer has gone rogue. Since I cannot reproduce this, my guess is that it has something to do with a hash collision unique to your setup.

@karibertils
Author

No, that bash script does not crash.

But this one works for me:

#!/bin/bash

for i in {0..99}
do
   for k in {0..99}
   do
       mkdir d/crash/$i &
       ls d/crash/$i &
       ln -sf /$i d/crash/$i/$k &
   done
done

Got a huge crash with that one

feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7f16aebbc9bd]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x55f0917134aa]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x55f091713884]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0x1576ce) [0x7f16aec726ce]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x55f091713cda]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_get_hash+0x73) [0x55f091713115]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x7e) [0x55f091713b21]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1f57a) [0x55f09171d57a]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55f09171c5ef]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(fuse_fs_readdir+0x67) [0x7f16aeeaa1a7]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55f09171caf7]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7f16aeea7250]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7f16aeea73ce]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xf368) [0x7f16aeeaa368]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7f16aeeb20b8]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x15fb2) [0x7f16aeeb0fb2]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7f16aeeb20b8]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7f16aeeaed5c]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7f16aeeaed5c]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f16aece3fa3]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7f16aebbc9bd]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x55f0917134aa]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f16aece3fa3]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f16aec144cf]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f16aec144cf]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x55f091713884]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x55f091713cda]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55f09171c5ef]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55f09171caf7]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7f16aeea7250]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7f16aeea73ce]

@karibertils
Author

Here is output from gdb

Thread 227 "rar2fs" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff650f700 (LWP 2549064)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt 
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c25535 in __GI_abort () at abort.c:79
#2  0x00007ffff7c7c508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7d8728d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7c82c1a in malloc_printerr (str=str@entry=0x7ffff7d88fb0 "double free or corruption (fasttop)") at malloc.c:5341
#4  0x00007ffff7c845d7 in _int_free (av=0x7fff64000020, p=0x7fff6403a020, have_lock=<optimized out>) at malloc.c:4258
#5  0x00005555555692c6 in hashtable_entry_delete_hash (h=0x7fffe8008da0, key=0x7fff10020fd0 "P&\004\234\376\177", hash=3330833281) at hashtable.c:158
#6  0x0000555555569533 in hashtable_entry_delete_subkeys (h=0x7fffe8008da0, key=0x7ffe9c042630 "/crash/6", hash=3330833281) at hashtable.c:250
#7  0x0000555555569884 in dircache_invalidate (path=0x7ffe9c042630 "/crash/6") at dircache.c:121
#8  0x0000555555569cda in dircache_get (path=0x7ffe9c042630 "/crash/6") at dircache.c:195
#9  0x00005555555725ef in syncdir (path=0x7ffe9c042630 "/crash/6") at rar2fs.c:3082
#10 0x0000555555572af7 in rar2_getattr (path=0x7ffe8c07c830 "/crash/6/9", stbuf=0x7ffff650ec40) at rar2fs.c:3203
#11 0x00007ffff7f8f250 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#12 0x00007ffff7f8f3ce in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#13 0x00007ffff7f9a0b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#14 0x00007ffff7f96d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#15 0x00007ffff7dcbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#16 0x00007ffff7cfc4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Here is the previous gdb output
https://pastebin.com/9abvUmSf

@hasse69
Owner

hasse69 commented Feb 14, 2021

I now use a bucket size of 1 (i.e. every single entry in the cache will collide) and still no sign of a crash, not even with your variant of the script. So the question remains: was your mount point empty before you tried this or not?
Also, can you try to reproduce when mounting using the -s flag?
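
For context on the bucket-size experiment: in a chained hash table the bucket index is the key's hash modulo the number of buckets, so a table size of 1 maps every key to the same bucket and every operation has to walk the collision chain. A minimal sketch of that indexing (not the actual rar2fs hashtable.c code):

#include <stddef.h>
#include <stdint.h>

struct entry {
        struct entry *next;             /* collision chain */
        const char *key;
        uint32_t hash;
};

struct hashtable {
        struct entry *bucket;           /* array of `size` chain heads */
        size_t size;
};

/* With size == 1 this is always &ht->bucket[0], so inserts, lookups
 * and deletions all exercise the chain-walking (collision) code. */
static struct entry *bucket_for(struct hashtable *ht, uint32_t hash)
{
        return &ht->bucket[hash % ht->size];
}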

@karibertils
Author

No, the mount point has thousands of files and folders, but the /media/crash folder is empty.

The crash does not happen when using the -s flag.

@hasse69
Owner

hasse69 commented Feb 15, 2021

So then we can conclude this is some thread concurrency issue. There is a lock missing somewhere.

@hasse69
Owner

hasse69 commented Feb 15, 2021

If possible, can you try this patch (against master/HEAD) to see if we enter a function that looks a bit suspicious?
I have my doubts, though, that this is a function you ever actually reach.

diff --git a/src/rar2fs.c b/src/rar2fs.c
index 1e6f23b..6e56b22 100644
--- a/src/rar2fs.c
+++ b/src/rar2fs.c
@@ -2601,6 +2601,7 @@ static void __listrar_tocache_forcedir(struct filecache_entry *entry_p,
  ****************************************************************************/
 static inline void __listrar_cachedir(const char *mp)
 {
+fprintf(stderr, "ZZZ __listrar_cachedir\n");
         if (!dircache_get(mp))
                (void)dircache_alloc(mp);
 }

You need to run using -f to see the output on your terminal.

@karibertils
Author

Yes, that function is entered around 100 times when I cd to the /media folder initially.

@karibertils
Author

karibertils commented Feb 15, 2021

Seems this is related to the underlying fs I'm using on the /data mountpoint.

I am using CephFS, which is a fully POSIX-compliant distributed fs that normally behaves the same as local filesystems for everything I throw at it. It's way better than NFS in my experience, but it seems something in rar2fs conflicts with it, unfortunately.

I tried several other file systems and cannot reproduce the issue at all, except with NTFS: it also crashes with the same bash script every time, but the backtrace looks a bit different.

The __listrar_cachedir function is not entered when using NTFS, but it is entered when using CephFS.

#0  __strstr_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strstr-sse2-unaligned.S:40
#1  0x00005555555694aa in hashtable_entry_delete_subkeys (h=0x7fffe8008da0, key=0x7fff6c008dc0 "/crash/3", hash=3330833281) at hashtable.c:244
#2  0x0000555555569884 in dircache_invalidate (path=0x7fff6c008dc0 "/crash/3") at dircache.c:121
#3  0x0000555555569cda in dircache_get (path=0x7fff6c008dc0 "/crash/3") at dircache.c:195
#4  0x000055555557260f in syncdir (path=0x7fff6c008dc0 "/crash/3") at rar2fs.c:3083
#5  0x0000555555572b17 in rar2_getattr (path=0x7ffee0021df0 "/crash/3/34", stbuf=0x7ffef8ff8c40) at rar2fs.c:3204
#6  0x00007ffff7f8f250 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#7  0x00007ffff7f8f3ce in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#8  0x00007ffff7f9a0b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#9  0x00007ffff7f96d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#10 0x00007ffff7dcbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#11 0x00007ffff7cfc4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I am unable to reproduce the issue when using the -s flag, with either CephFS or NTFS as the underlying fs. Verified 10+ times.

@hasse69
Owner

hasse69 commented Feb 15, 2021

Please try the patch below to see if it has any effect.

diff --git a/src/rar2fs.c b/src/rar2fs.c
index 1e6f23b..4b1dc7c 100644
--- a/src/rar2fs.c
+++ b/src/rar2fs.c
@@ -2601,8 +2601,10 @@ static void __listrar_tocache_forcedir(struct filecache_entry *entry_p,
  ****************************************************************************/
 static inline void __listrar_cachedir(const char *mp)
 {
+        pthread_rwlock_wrlock(&dir_access_lock);
         if (!dircache_get(mp))
                (void)dircache_alloc(mp);
+        pthread_rwlock_unlock(&dir_access_lock);
 }

 /*!
@@ -2615,13 +2617,15 @@ static void __listrar_cachedirentry(const char *mp)
         char *tmp = safe_path;
         safe_path = __gnu_dirname(safe_path);
         if (CHRCMP(safe_path, '/')) {
+                pthread_rwlock_wrlock(&dir_access_lock);
                 struct dircache_entry *dce = dircache_get(safe_path);
                 if (dce) {
-                        char *tmp = strdup(mp);
-                        dir_entry_add(&dce->dir_entry_list, basename(tmp),
+                        char *tmp2 = strdup(mp);
+                        dir_entry_add(&dce->dir_entry_list, basename(tmp2),
                                       NULL, DIR_E_RAR);
-                        free(tmp);
+                        free(tmp2);
                 }
+                pthread_rwlock_unlock(&dir_access_lock);
         }
         free(tmp);
 }
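
For readers following the locking discussion: the discipline a rwlock gives you here is that threads that only read the directory cache (getattr/readdir) take the lock shared and may run in parallel, while any path that frees or replaces entries takes it exclusively, so no reader can ever walk a chain that another thread is tearing down. The interleaved backtraces earlier in the thread suggest exactly that kind of overlap, with two threads inside dircache_invalidate() at once. A minimal sketch of the discipline, using hypothetical names rather than the real rar2fs functions:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct cache_entry {
        struct cache_entry *next;
        char *path;
};

static struct cache_entry *cache_head;
static pthread_rwlock_t cache_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Read side: many getattr/readdir threads may hold this in parallel. */
static int cache_contains(const char *path)
{
        int found = 0;

        pthread_rwlock_rdlock(&cache_lock);
        for (struct cache_entry *e = cache_head; e; e = e->next) {
                if (!strcmp(e->path, path)) {
                        found = 1;
                        break;
                }
        }
        pthread_rwlock_unlock(&cache_lock);
        return found;
}

/* Write side: invalidation frees nodes, so it must exclude all readers. */
static void cache_invalidate(const char *path)
{
        pthread_rwlock_wrlock(&cache_lock);
        for (struct cache_entry **pp = &cache_head; *pp; ) {
                struct cache_entry *e = *pp;
                if (!strcmp(e->path, path)) {
                        *pp = e->next;          /* unlink before freeing */
                        free(e->path);
                        free(e);
                } else {
                        pp = &e->next;
                }
        }
        pthread_rwlock_unlock(&cache_lock);
}

The patch above takes the write lock around the cache-update helpers; the commit that eventually resolved the issue (quoted further down in this thread) applies the same protection to the path that detects and invalidates stale cache entries.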

@karibertils
Author

Same behaviour after the patch, with both NTFS and CephFS.

@hasse69
Owner

hasse69 commented Feb 15, 2021

I found a race condition using valgrind/helgrind that might be related. Please try the attached patch as well.

151.patch.txt

# patch -p1 < 151.patch.txt

@karibertils
Author

Yep, that patch solves it! Awesome job 😄

@hasse69
Owner

hasse69 commented Feb 15, 2021

I will push the patch to master and then close this issue.
From what I understand it was fairly easy to reproduce, so I am rather confident in this change now that you have verified it. One key to solving this issue was obviously your excellent reproducer (script). Even though I could not reproduce the exact same crash, it was enough for valgrind to trap it. So, awesome job from your side as well! 👍

hasse69 pushed a commit that referenced this issue Feb 15, 2021
Commit 33af7aa did not completely
solve the problem(s) reported in issue #151.
After running valgrind/helgrind a race condition was spotted with
the following signature:

==29979== Possible data race during write of size 8 at 0x62E2F20 by thread #7
==29979== Locks held: none
==29979==    at 0x411D6B: hashtable_entry_delete_hash (string3.h:52)
==29979==    by 0x412180: hashtable_entry_delete_subkeys (hashtable.c:250)
==29979==    by 0x4126EE: dircache_get (dircache.c:195)
==29979==    by 0x41BC63: syncdir (rar2fs.c:3082)
==29979==    by 0x41BF12: rar2_getattr (rar2fs.c:3203)
==29979==    by 0x4E4534F: lookup_path (fuse.c:2472)

The problem is caused by a missing rwlock when stale cache entries are
detected and invalidated.
This patch also adds a few missing locks, spotted by pure inspection
of the code, that were not part of the use case covered by the
valgrind/helgrind test run.

Resolves-issue: #151
Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
@hasse69 hasse69 closed this as completed Feb 15, 2021