File lost, no handle? #362

Open
jSML4ThWwBID69YC opened this issue May 10, 2020 · 22 comments

Comments

@jSML4ThWwBID69YC

Have you read through the available documentation and open GitHub issues?

Yes

Is this a BUG report, FEATURE request, or a QUESTION? Who is the intended audience?

BUG report

System information

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

All components were built from the FreeBSD ports tree.

moosefs3-master-3.0.112_1
moosefs3-cgiserv-3.0.112_1
moosefs3-chunkserver-3.0.112_1
moosefs3-client-3.0.112_1
fusefs-libs3-3.9.1

Operating system (distribution) and kernel version.

FreeBSD 12.1p3

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

MooseFS master server: 1x
CPU: Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz
RAM: 256GB
HW DISK: 4x 1.2TB in raidz1

Moosefs chunk servers: 5x identical
CPU: Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
RAM: 32GB
HW DISK: 4x 8TB in raidz1.
The MooseFS disk is a single folder (a ZFS dataset with a reservation and quota) mounted at /storage/chunk
DATA: storage/chunk 889G 18.1T 889G /storage/chunk

MooseFS client servers: 2x identical
CPU: Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
RAM: 192GB
HW DISK: 4x 1.2TB configured as raidz1

Network

All servers are connected with dual 10 Gb uplinks, configured as LACP across two Cisco switches in a stack. The network cards are Dell-branded Intel cards using the 'ix' driver. MTU is set to 9198.

How much data is tracked by moosefs master (order of magnitude)?

  • All fs objects: 10831413
  • Total space: 95 TiB
  • Free space: 90 TiB
  • RAM used: 3.2 GiB
  • last metadata save duration: ~4.9s

Describe the problem you observed.

It appears that the file handle can be lost, causing files and folders to become unmanageable. I can't replicate it on demand; it happens sporadically. Here's the output from one such event.

The initial command and error happened during the deletion of a folder with multiple subfolders and files. Here's the one that triggered the issue this time.

rm -rf /apps/searx-env/
rm: /apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor/packaging: Directory not empty
rm: /apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor: Directory not empty
rm: /apps/searx-env/lib/python3.8/site-packages/setuptools: Directory not empty
rm: /apps/searx-env/lib/python3.8/site-packages: Directory not empty
rm: /apps/searx-env/lib/python3.8: Directory not empty
rm: /apps/searx-env/lib: Directory not empty
rm: /apps/searx-env/: Directory not empty

This is the deepest directory from the above error.

cd /apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor/packaging
pwd
/apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor/packaging

This is what exists in the packaging folder.

ls -al
ls: __pycache__: Resource temporarily unavailable
total 1
drwxr-s---  3 5137  80  1 May 10 13:33 .
drwxr-s---  3 5137  80  1 May 10 13:33 ..

I'm not sure what 'Resource temporarily unavailable' means here, but the file cannot be managed with normal tools anymore.

Here's the oplog output during the ls -l command.

05.10 13:45:09.614446: uid:0 gid:0 pid:6141 cmd:open (2147483633) (internal node: OPLOG): OK (1,0)
#
#
05.10 13:45:12.327980: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.328419: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,bin): OK (1.0,2886024,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.328964: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.329102: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,libexec): OK (1.0,3139980,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.329303: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.329438: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,etc): OK (1.0,2890092,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.329578: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.329714: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,var): OK (1.0,5342374,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.329916: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.330077: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,lib): OK (1.0,3139922,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.330265: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.330427: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.330979: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.331212: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.331376: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,dev): OK (1.0,2890083,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.331614: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.331773: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,usr): OK (1.0,5342372,1.0,[drwxr-xr-x:0040755,2,0,0,1587830676,1525905461,1587830676,1])
05.10 13:45:12.332015: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.332320: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.332524: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.332727: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.332954: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.333285: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.333437: uid:5137 gid:5137 pid:7511 cmd:lookup (2885983,home): OK (1.0,2902410,1.0,[drwxr-x---:0040750,7,5137,5137,1587830676,1589050134,1589050134,2009720])
05.10 13:45:12.333488: uid:5137 gid:5137 pid:7511 cmd:access (2902410,0x1): OK
05.10 13:45:12.333608: uid:5137 gid:5137 pid:7511 cmd:lookup (2902410,.termcap.db): ENOENT (No such file or directory)
05.10 13:45:12.333731: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.333750: uid:5137 gid:5137 pid:7511 cmd:access (2902410,0x1): OK
05.10 13:45:12.333881: uid:5137 gid:5137 pid:7511 cmd:lookup (2902410,.termcap): ENOENT (No such file or directory)
05.10 13:45:12.333987: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK
05.10 13:45:12.334469: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.334606: uid:5137 gid:5137 pid:7511 cmd:getattr (3149970) [no handle]: OK (1.0,[drwxr-s---:0042750,3,5137,80,1589117582,1589117614,1589117614,1])
05.10 13:45:12.334754: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.334915: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x4): OK
05.10 13:45:12.335062: uid:5137 gid:5137 pid:7511 cmd:opendir (3149970): OK [handle:0000000C]
05.10 13:45:12.335194: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.335337: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.335469: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x4): OK
05.10 13:45:12.335603: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.335726: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x4): OK
05.10 13:45:12.335909: uid:5137 gid:5137 pid:7511 cmd:statfs (1): OK (104453604638720,99272582511104,99272582511104,0,0,8950890)
05.10 13:45:12.336026: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.336138: uid:5137 gid:5137 pid:7511 cmd:readdir (3149970,4096,137) [handle:0000000C]: OK (104)
05.10 13:45:12.336270: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.336400: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.336420: uid:5137 gid:5137 pid:7511 cmd:getattr (3149931) [no handle] (using open dir cache): OK (1.0,[drwxr-s---:0042750,3,5137,80,1589117582,1589117614,1589117614,1])
05.10 13:45:12.336548: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.336568: uid:5137 gid:5137 pid:7511 cmd:lookup (3149970,__pycache__) (using open dir cache): OK (1.0,3150067,1.0,[drwxr-s---:0042750,2,5137,80,1589117582,1589117582,1589117582,1])
05.10 13:45:12.336595: uid:5137 gid:5137 pid:7511 cmd:readdir (3149970,4096,137) [handle:0000000C]: OK (no data)
05.10 13:45:12.336737: uid:5137 gid:5137 pid:7511 cmd:access (3149970,0x1): OK
05.10 13:45:12.339550: uid:5137 gid:5137 pid:7511 cmd:access (2885983,0x1): OK

At this point, the entire /apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor/packaging structure is undeletable because of the __pycache__ entry, which simultaneously appears to exist and not exist.

I've seen this with different file/folder combinations as well.

Other MooseFS mount points are not affected, even for the same directory structure. For example, using a separate mount and running ls works.

ls -al /apps/searx-env/lib/python3.8/site-packages/setuptools/_vendor/packaging/
total 2
drwxr-s---  3 5137  www  1 May 10 13:33 .
drwxr-s---  3 5137  www  1 May 10 13:33 ..
drwxr-s---  2 5137  www  1 May 10 13:33 __pycache__

The only fix I've found is to unmount the affected mfsmount and remount it. This may be related to #350 or #354, as it seems the file handle is being lost.
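For reference, a sketch of how the oplog above can be captured and of the remount workaround - the mount point, export, and password below are placeholders, not the exact values used here:

# Stream the per-mount operation log; the "(internal node: OPLOG)" open
# in the log above corresponds to reading this virtual file.
cat /path/to/mfs/mount/.oplog

# Workaround once a mount is affected: force-unmount and mount it again
# (arguments modelled on the mount command quoted later in this thread).
umount -f /path/to/mfs/mount
mfsmount IP:/export -o mfspassword=<password> /path/to/mfs/mount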

@chogata
Member

chogata commented May 11, 2020

The oplog you sent here shows that everything is OK on the MooseFS side - when asked about the file __pycache__, it answers OK. So it's not MooseFS responding 'Resource temporarily unavailable', it's the FreeBSD kernel.

You write "it happens sporadically", which means you don't have a foolproof way to repeat it. But do you have any scenario in which this happens more often than not?

chogata added the 'platform specific' and 'upstream/kernel (upstream OS kernel issue)' labels on May 11, 2020
@jSML4ThWwBID69YC
Author

Hello @chogata

I am unfortunately not able to replicate it on demand, but it has happened repeatedly over the last several months of testing. The 'Resource temporarily unavailable' error seems to be EAGAIN (35) from 'man errno'. The issue affects multiple mfsmounts across multiple servers; it does not, however, affect every mfsmount. The difference seems to be whether there is software, or processes, running that interact with the files through write actions.

For example, there are fifteen different mount points across three servers through which the files can be accessed. Three of the mfsmounts, across two servers, show the error. All other mfsmounts work properly. The three mount points showing the issue are also the only ones that have write interactions with the files.

From what I can tell, a write action failed when I tried to delete the searx-env folder and its subfolders. The issue then manifested on each of the mfsmounts where processes were running that also tried to write to the files. Mfsmounts where no writing is attempted do not seem affected.

I've left the system running, with the issue still happening, so I can run further tests. Any ideas are most welcome.

@jSML4ThWwBID69YC
Author

Hello @chogata,

I've done additional testing over the last few weeks. From what I can tell, the file system action happens correctly on MooseFS, but the fusefs cache is somehow becoming corrupted. I've been able to replicate the issue several times this week by doing large amounts of small-file transfers using rsync. I've also seen it happen when creating several layers of nested folders using a script, though that only happened once. It seems like cache consistency with what's actually on MooseFS is lost.

During my tests, I mounted MooseFS in two different ways.

  1. Multiple MooseFS mounts on a single server referencing the same export. When done this way, any given mount point may show the issue while other mount points do not. The issue seems specific to each mount's cache.

  2. A single MooseFS mount overlaid with nullfs to make it available in several locations (see the sketch after this list). Note: nullfs was mounted with -o nocache to avoid FreeBSD bug https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=245688. The issue still happens in this case, so it does not seem to be triggered simply by using multiple mounts on the same server.
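A sketch of the second layout, with placeholder paths (/mfs for the single MooseFS mount and /jails/www for one of the nullfs targets):

# One MooseFS mount of the export...
mfsmount IP:/web -o mfspassword=<password> /mfs
# ...re-exposed in another location via nullfs, with caching disabled
# to work around PR 245688.
mount_nullfs -o nocache /mfs /jails/www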

The only fuse-related error I've found is fuse_vnop_lookup: WB cache incoherent on /path/to/mfs/mount. I can't confirm whether it's related, but I'm adding it here just in case. It happens rarely, and the last instance was back on May 23, before 3.0.113, so it might have been something else.

I've now set all systems to 'mfscachemode=DIRECT'. I haven't been able to replicate the issue in the week since switching to direct mode - another indicator that this is a cache-related issue.

I hope this helps to track down the root cause and find a resolution. For now, MooseFS doesn't appear safe to use with the default mfscachemode=FBSDAUTO.

@chogata
Copy link
Member

chogata commented Jun 8, 2020

Thank you for all this new information - we will try to track the issue again.

@jSML4ThWwBID69YC
Author

This just happened again while using mfscachemode=DIRECT. The mount command is:

mfsmount IP:/web -o mfspassword=password -o mfsmkdircopysgid=1 -o mfscachemode=DIRECT /storage/chunk

It happened while unzipping the Drupal 9 zip from https://www.drupal.org/project/drupal/releases/9.0.0.

Here's the last bit of output from the unzip process.

 extracting: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/showblocks/images/block_h4.png  
 extracting: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/showblocks/images/block_h5.png  
 extracting: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/showblocks/images/block_h6.png  
 extracting: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/showblocks/images/block_p.png  
 extracting: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/showblocks/images/block_pre.png  
   creating: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/
   creating: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/
unzip: open('drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js'): Resource temporarily unavailable

I tried to delete the directory and received this.

 rm -rf drupal-9.0.0/
rm: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js: Resource temporarily unavailable
rm: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs: Directory not empty
rm: drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog: Directory not empty
rm: drupal-9.0.0/core/assets/vendor/ckeditor/plugins: Directory not empty
rm: drupal-9.0.0/core/assets/vendor/ckeditor: Directory not empty
rm: drupal-9.0.0/core/assets/vendor: Directory not empty
rm: drupal-9.0.0/core/assets: Directory not empty
rm: drupal-9.0.0/core: Directory not empty
rm: drupal-9.0.0/: Directory not empty

Here's the mfsfileinfo for the sourcedialog.js file from three different physical machines all accessing the same export.

The original mount:

mfsfileinfo /<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js
/storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js: realpath error on (/storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js): EAGAIN (Resource temporarily unavailable)

Server two:

mfsfileinfo /storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js
/storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js:
        no chunks - empty file

Server three:

mfsfileinfo /storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js
/storage/chunk/<snip>/home/drupal-9.0.0/core/assets/vendor/ckeditor/plugins/sourcedialog/dialogs/sourcedialog.js:
        no chunks - empty file

I deleted the drupal-9.0.0 folder from a different mfs mount than the one the issue started on. It deletes just fine, even though it would not delete on the initial mount. I then attempted to unzip the archive again on the initial mfs mount. It failed the same way, but with a different file. Again, deleting it from a secondary mfs mount works. A few more tests show the same repeat pattern. Lastly, I attempted the same unzip from the other two mount points. It failed each time, and now I can't delete it without creating a fourth mfs mount, or rebooting servers to clear it out.

After rebooting the server, I was able to delete the files, and run the unzip again. It worked correctly this time.

The systems were using nullfs mounts on top of the mfs one in this case. I've switched back to using multiple mfs mounts without nullfs. I've also left mfscachemode at default since the issue has now happened in direct mode.

Other details:
The mfs mounts are using a /web subfolder.
The mfs mount has mfsmkdircopysgid=1 set.
The master server has ATIME_MODE = 4 set.
I'm using kern.timecounter.hardware=HPET for the time counter (see the config sketch after this list).
NTP is set and working.
This patch has been applied: https://github.com/moosefs/moosefs/commit/81d0d29044ad197b3ea61fa89bb88d607bb79772.diff
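For completeness, a sketch of where the ATIME_MODE and time counter settings above live, assuming the FreeBSD port's default config locations:

# /usr/local/etc/mfs/mfsmaster.cfg (on the master)
ATIME_MODE = 4

# /etc/sysctl.conf (on the clients)
kern.timecounter.hardware=HPET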

@jSML4ThWwBID69YC
Author

It happened again this morning. I used rsync to copy a bunch of files over the network from a non-MFS disk to an MFS mount. After the copy, I attempted to delete some of the unneeded data that was copied and received the 'Resource temporarily unavailable' error again.

In this case, I was using mfscachemode=DIRECT and multiple mfs mounts on a single server. Only one of the mfs mounts exhibited the issue. Rebooting the server resolved the issue.
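The rsync workload was of this general shape - a sketch with placeholder paths, not the exact commands run:

# Copy a large tree of small files from a local (non-MFS) disk onto the
# MooseFS mount, then remove part of what was copied; the delete step is
# where 'Resource temporarily unavailable' shows up.
rsync -a /local/data/ /path/to/mfs/mount/data/
rm -rf /path/to/mfs/mount/data/unneeded/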

@chogata
Member

chogata commented Jun 16, 2020

Okay, that's good to know. Our test instance is currently running a long batch of intense testing; it needs to finish before I can start testing this issue, so it might be a couple more days. But I do still have this on my todo list.

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Jun 18, 2020

I've run into additional issues that seem related.

1: Using mfscachemode=AUTO or FBSDAUTO on FreeBSD 12.1p6 leads to data corruption: changes to files do not propagate to other mfs mount points. This is easily replicated; I can open a separate issue for it if needed. The 'Resource temporarily unavailable' error from this issue also happens sporadically.

The logs show this message repeatedly.

kernel: fuse_vnop_lookup: WB cache incoherent on /path/to/mfs/mount.

2: Using mfscachemode=DIRECT exhibits the 'Resource temporarily unavailable' issue sporadically.

I've found I can trigger this most often by doing large amounts of rapid file manipulation, for example using rsync on a large number of small files. Secondly, I've found that compiling programs such as Node.js directly on the mfs mount will eventually fail, leaving a zombie process. I don't see the 'Resource temporarily unavailable' error when that happens, but compiling works when using AUTO/FBSDAUTO mode.

The fuse_vnop_lookup message does not happen when using DIRECT mode.

3: Using mfscachemode=NEVER seems to avoid issues 1 and 2. File changes propagate correctly and the 'Resource temporarily unavailable' issue does not happen.

The fuse_vnop_lookup message still happens, but significantly less often than when using AUTO/FBSDAUTO mode. In fact, it only happens on an mfs mount being accessed by Nginx.

At this point, I'm setting all the servers to mfscachemode=NEVER and will continue to see whether I can replicate any of the above errors.

Other information:

The vfs.fusefs.data_cache_mode sysctl is set to 1 in all of the above cases. According to the man page (man fusefs), it only applies to pre-7.23 FUSE file systems. Is MooseFS pre-7.23, or something higher? Does vfs.fusefs.data_cache_mode override the mfscachemode setting?

The value of sysctl vfs.fusefs.stats.lookup_cache_hits is continually incrementing even with mfscachemode=NEVER set. It seems that some type of caching is still happening even though it's set not to.

For example, I ran this three times in a row.

root:~ # sysctl vfs.fusefs.stats.lookup_cache_hits
vfs.fusefs.stats.lookup_cache_hits: 275748965

root:~ # sysctl vfs.fusefs.stats.lookup_cache_hits
vfs.fusefs.stats.lookup_cache_hits: 277048903

root:~ # sysctl vfs.fusefs.stats.lookup_cache_hits
vfs.fusefs.stats.lookup_cache_hits: 277080166
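A simple way to watch that counter is to sample it in a loop, for example:

# Print the fusefs lookup-cache hit counter every 10 seconds; if it keeps
# climbing while only MooseFS mounts are active, some lookup caching is
# still happening despite mfscachemode=NEVER.
while true; do
    sysctl -n vfs.fusefs.stats.lookup_cache_hits
    sleep 10
done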

I'll report again if I can replicate any of the issues while mfscachemode=NEVER is set. Thanks for putting this on your list. Please let me know if there is anything else I can do to help.

@jSML4ThWwBID69YC
Author

The fuse_vnop_lookup message is happening at the top of the hour, every hour - essentially at the same time the master is doing its metadata dump.

Jun 18 03:00:13 web02 kernel: fuse_vnop_lookup: WB cache incoherent on /path/to/mfs/mount!
Jun 18 04:00:02 web02 syslogd: last message repeated 1 times
Jun 18 05:00:11 web02 syslogd: last message repeated 1 times
Jun 18 06:00:12 web02 syslogd: last message repeated 1 times
Jun 18 06:00:12 web02 syslogd: last message repeated 1 times
Jun 18 07:00:07 web02 syslogd: last message repeated 1 times
Jun 18 08:00:09 web02 syslogd: last message repeated 1 times
Jun 18 09:00:02 web02 syslogd: last message repeated 1 times
Jun 18 10:00:01 web02 syslogd: last message repeated 1 times
Jun 18 11:00:05 web02 syslogd: last message repeated 1 times
Jun 18 12:00:12 web02 syslogd: last message repeated 1 times
Jun 18 12:00:12 web02 syslogd: last message repeated 1 times
Jun 18 13:00:08 web02 syslogd: last message repeated 1 times
Jun 18 14:00:02 web02 syslogd: last message repeated 1 times
Jun 18 15:00:02 web02 syslogd: last message repeated 1 times

@chogata
Member

chogata commented Jun 25, 2020

I've been trying to replicate your initial issue (Resource temporarily unavailable) for a couple of days now, with no success. No amount of copying/deleting, rsyncing/deleting or unpacking/deleting is yielding any bad effects, including trying to delete (rm -rf) a directory tree on one client, while another is still actively writing to it. So, just to make sure: you reported this first on .112, but you then mentioned .113, so I assume the problem still exists on .113 for you?
And just FYI, this looks to me like a completely separate issue from the one you reported in #374.

@jSML4ThWwBID69YC
Author

It happened again last night; this time a simple WordPress plugin update triggered it. There was an additional error as well.

mfsmount[17652]: writeworker: can't set TCP_NODELAY: ECONNRESET (Connection reset by peer)

I've seen that error and its counterpart on different days, but hadn't associated them with this issue. Here's the other one.

mfsmount[51736]: readworker: can't set TCP_NODELAY: ECONNRESET (Connection reset by peer)

Here are the sysctl.conf values that differ from the defaults.

# Host Security Settings
hw.mds_disable=3
security.bsd.see_other_uids=0
security.bsd.see_other_gids=0
security.bsd.unprivileged_read_msgbuf=0
security.bsd.unprivileged_proc_debug=0
security.bsd.see_jail_proc=0
security.bsd.hardlink_check_gid=0
security.bsd.hardlink_check_uid=1
security.bsd.stack_guard_page=1
security.bsd.map_at_zero=1
kern.randompid = 5339
kern.elf64.nxstack = 1
kern.coredump=0

# Set the congestion control algorithm
net.inet.tcp.cc.algorithm=newreno # Was HTCP. Changed back to newreno for testing. 

# Network values
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
kern.ipc.somaxconn=4096

I've also experimented with using different loader.conf options, but the issue shows up either way.

loader.conf 
#net.inet.tcp.soreceive_stream="1"
#net.isr.bindthreads="1"

@chogata
Member

chogata commented Jun 29, 2020

security.bsd.see_other_uids=0
security.bsd.see_other_gids=0

This might be a problem and could possibly explain the 'Resource temporarily unavailable' error. On a local file system on a single operating system, when you open a session and cd into a directory, then open another session and delete this directory, there will be no error and the directory will be deleted - BUT the inode of this directory is remembered by the system and never reused until it ceases to be the CWD of your first session.

On a network filesystem like MooseFS, there is no single system that "knows" all CWDs. The OS on the machine with your first MooseFS mount does not know what happens on your other machine with your other MooseFS mount, so as long as a directory deleted from MooseFS is no longer the CWD of any process on this first machine, this OS thinks it's okay to reuse this inode number. But if it is the CWD of a process on the second machine and we re-use it, we have a problem. So it falls to MooseFS to "ask" (through clients) about all CWDs on all machines that have this instance of MooseFS mounted, keep this list, and make sure inodes from this list are never reused as long as they remain on the list.
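A minimal illustration of that scenario, assuming two machines (or two mounts of the same export) with placeholder paths:

# Machine A: a shell keeps the directory as its current working directory
cd /mfsA/some/dir          # ...and just sits there

# Machine B: the same directory is removed through its own mount
rm -rf /mfsB/some/dir

# Machine A's OS never saw the delete; if the inode number is re-used for
# a new object while A still holds it as a CWD, the two views conflict.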

We had a lot of issues with that on some versions of the Linux kernel, but so far not on FreeBSD. However, we always run FreeBSD with default security options, which do not include hiding other uids' processes...

This calls for more tests; I will try to squeeze them in somewhere this week. I'm not saying this is for sure the cause, but IF it is, it might turn out we just cannot support this setting. Tests will tell.

In the meantime, since this might turn out to be a different issue altogether, I would like to ask if you could mount your mounts with some cache options disabled, specifically these: -o mfsattrcacheto=0 -o mfsentrycacheto=0 -o mfsdirentrycacheto=0 -o mfsnegentrycacheto=0, and see if you still encounter the 'Resource temporarily unavailable' error from time to time? I have to warn you, though, that disabling all caches will most probably have some negative impact on performance.
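Based on the mount command quoted earlier in this thread, that request amounts to something like the following (a sketch - the IP, export, password, and mount point are placeholders):

mfsmount IP:/web \
    -o mfspassword=<password> \
    -o mfsmkdircopysgid=1 \
    -o mfsattrcacheto=0 \
    -o mfsentrycacheto=0 \
    -o mfsdirentrycacheto=0 \
    -o mfsnegentrycacheto=0 \
    /storage/chunk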

The fuse_vnop_lookup: WB cache incoherent looks like a separate issue and most probably will be fixed together with #374.

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Jul 3, 2020

The mfsmount program is run as root, which can see all users' processes regardless of the security.bsd.see_* settings.

I temporarily set the requested cache options to 0 on a single node. The network bandwidth jumped from an average of 20 Mb/s to 100 Mb/s.

There is new hardware being deployed this month. I'm setting aside a server to run stress tests on. I'll include testing with all the cache features set to 0 once the server is available.

@jSML4ThWwBID69YC
Author

Hello,

Even with new replacement hardware, this continues to happen. I can't keep a single server up for more than 48 hours without this issue occurring. What can I do to help track this down for a resolution?

@chogata
Member

chogata commented Jul 24, 2020

I have tried many times to repeat your problem. I succeeded ONCE. I tried to repeat exactly what I did then, but no luck. We can try to raise the issue with FreeBSD, but I'm afraid without a repeatable scenario they won't be able to do anything about it...

@jSML4ThWwBID69YC
Author

Hello @chogata

I'm glad to hear you were able to replicate it at least once. May I ask what steps you took that exhibited the issue?

I've noticed this problem has become more common as the traffic and file changes have increased. I'm still trying to find a way to manually replicate this, but I suspect a very specific combination of things has to happen.

Thank you for sticking with it. I'll follow up once I have a way to replicate this.

@jSML4ThWwBID69YC
Author

This is still an issue in 3.0.114. Unfortunately, while it continues to happen regularly, I'm still unable to consistently replicate it by hand.

@jSML4ThWwBID69YC
Author

@chogata

Unfortunately, this is still an issue on 3.0.115 with FreeBSD 12.1-RELEASE-p10. Can you provide any additional troubleshooting advice?

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Nov 6, 2020

Hello @chogata

I've tried testing mfsmount with -o mfsattrcacheto=0 -o mfsentrycacheto=0 -o mfsdirentrycacheto=0 -o mfsnegentrycacheto=0 set. That unfortunately makes MooseFS unusable.

I've continued testing with the normal mount options and found something interesting. I have not been able to create the issue, or catch it happening, within twenty-four hours of a client server reboot, but I can replicate it by hand once the server has been up for at least 24 hours - and the longer the server is up, the easier it is. My test is simply to extract a tarball on one mfsmount while deleting the extracted files from another mfsmount (see the sketch below). Rinse, wash, repeat, and the issue will eventually show. When it shows up, it happens all at once and does not get better or worse after that. Unmounting mfs is the only resolution.
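An automated version of that test could look roughly like this (a sketch with placeholder mount points /mfs1 and /mfs2 of the same export, and an arbitrary tarball):

# Extract on one mount while deleting the result through a second mount;
# on an affected client this eventually returns
# 'Resource temporarily unavailable'.
while true; do
    tar -xf /tmp/test.tar.gz -C /mfs1/scratch
    rm -rf /mfs2/scratch/*
done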

Again, I can't make it happen if the server has been up for less than 24 hours. The closer I get to 48 hours, the easier it is to trigger.

This is happening on servers with a fairly consistent workload. I wonder if perhaps some other type of limit is being hit, causing a disconnect between the mfs client and the underlying fusefs? CPU, memory, and bandwidth are well below their limits. Here's the average bandwidth usage over the last 30 days for the various MooseFS-related systems.

Clients: 10 Mb/s
Chunk servers: 1 Mb/s
Master: 25 Mb/s
There are frequent spikes into the hundreds of Mb/s, but those are the exception.

Of note, this continues to happen on FreeBSD 12.2 clients too.

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Nov 7, 2020

My test is simply to extract a tarball on one mfsmount while deleting the extracted files from another mfsmount. Rinse, wash, repeat, and the issue will eventually show.

Running the above test on a freshly rebooted client server took 28 hours and 16 minutes to trigger. Note: I was not running the test in an automated fashion; I was simply running it by hand whenever I had a few minutes.

$ uptime
 8:10PM  up 1 day,  4:16, 0 users, load averages: 1.34, 1.64, 1.63

This is on FreeBSD 12.2, which seems to be easier to trigger on than 12.1 was. Since I can now replicate this somewhat reliably, is there anything else that can be done to resolve it?

@chogata
Member

chogata commented Aug 10, 2021

A bit of an explanation for anybody interested in this thread :)

On a network filesystem it can happen that an inode gets deleted on one client (mount) while it is still in use on another client (mount) on a different machine. If this inode is, for some reason, remembered in the kernel cache on this second machine, it cannot be re-used.

There are many instances of such behaviour. One of them is when a directory deleted on one machine is the CWD - current working directory - of a process on another machine. Those inodes are reserved and cannot be re-used in MooseFS until they stop being a CWD. This is enough for all systems except FreeBSD, which, for some reason, also remembers the parents of those CWD directories and disallows their re-use. It's a very specific, FreeBSD-only behaviour, and it took us quite some time to identify it as the cause of the problems described in this thread. We introduced a special, FreeBSD-only workaround for this problem, so hopefully our FreeBSD users won't experience the 'Resource temporarily unavailable' problem anymore with 3.0.116. We encourage everybody to test and share the results with us.

chogata added the 'fix committed / please test' label and removed the 'upstream/kernel (upstream OS kernel issue)' label on Aug 10, 2021
@jkiebzak

nice detective work.
