
Mfsmount slows down over time, or basic usage to the point of being unusable #347

Closed
jSML4ThWwBID69YC opened this issue Mar 21, 2020 · 18 comments

Comments

@jSML4ThWwBID69YC

jSML4ThWwBID69YC commented Mar 21, 2020

Have you read through available documentation and open Github issues?

Yes

Note: https://moosefs.com/wp-content/uploads/2018/08/MooseFS-Hardware-Guide-v.0.9.pdf is a 404 error.

Is this a BUG report, FEATURE request, or a QUESTION? Who is the intended audience?

Question and maybe a bug.

System information

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

All components built from the FreeBSD ports tree.

moosefs3-master-3.0.111_2
moosefs3-cgiserv-3.0.111_2
moosefs3-chunkserver-3.0.111_2
moosefs3-client-3.0.111_2
fusefs-libs3-3.9.1

Operating system (distribution) and kernel version.

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

MooseFS master server: 1x
CPU: Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz
RAM: 256GB
HW DISK: 4x 1.2TB in raidz1
OS: FreeBSD 12.1p2

Moosefs chunk servers: 5x identical
CPU: Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
RAM: 32GB
HW DISK: 4x 8TB in raidz1.
The MooseFS chunk disk is a single ZFS dataset with a reservation and quota, mounted at /storage/chunk.
DATA: storage/chunk 889G 18.1T 889G /storage/chunk

MooseFS client servers: 2x identical
CPU: Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
RAM: 192GB
HW DISK: 4x 1.2TB configured as raidz1

Memory resources:
The master server shows 240GB of free memory on average.
All five chunk servers show roughly 10GB of free memory at any time.
Both client servers average around 170GB of free memory even when everything is operating slowly.

Network

All servers have two 10 Gb uplinks, configured as LACP across two Cisco switches in a stack. The network cards are Dell-branded Intel cards using the 'ix' driver.

I've run iperf3 tests several times during a time when the mfsmount is slow. The tests are between a client and the master server. Here are the high and low results.

HIGH
Accepted connection from 192.168.0.32, port 32634
[ 5] local 192.168.0.50 port 5201 connected to 192.168.0.32 port 32635
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.00 GBytes 8.62 Gbits/sec
[ 5] 1.00-2.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 2.00-3.00 sec 1.15 GBytes 9.86 Gbits/sec
[ 5] 3.00-4.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 4.00-5.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 5.00-6.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 6.00-7.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 7.00-8.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 8.00-9.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 9.00-10.00 sec 1.15 GBytes 9.88 Gbits/sec
[ 5] 10.00-10.04 sec 43.3 MBytes 9.88 Gbits/sec


[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 11.4 GBytes 9.76 Gbits/sec receiver

LOW
Accepted connection from 192.168.0.32, port 33533
[ 5] local 192.168.0.50 port 5201 connected to 192.168.0.32 port 33535
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 992 MBytes 8.32 Gbits/sec
[ 5] 1.00-2.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 2.00-3.00 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 3.00-4.00 sec 782 MBytes 6.56 Gbits/sec
[ 5] 4.00-5.01 sec 1.03 GBytes 8.78 Gbits/sec
[ 5] 5.01-6.00 sec 906 MBytes 7.66 Gbits/sec
[ 5] 6.00-7.00 sec 1.15 GBytes 9.88 Gbits/sec
[ 5] 7.00-8.00 sec 1.15 GBytes 9.88 Gbits/sec
[ 5] 8.00-9.00 sec 783 MBytes 6.57 Gbits/sec
[ 5] 9.00-10.00 sec 838 MBytes 7.00 Gbits/sec


[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.03 sec 9.83 GBytes 8.42 Gbits/sec receiver

How much data is tracked by moosefs master (order of magnitude)?

  • All fs objects: 10838421
  • Total space: 95 TiB
  • Free space: 91 TiB
  • RAM used: 3.9 GiB
  • last metadata save duration: ~3.5s

Describe the problem you observed.

The mfsmount slows down, over time or with basic usage, to the point of being unusable. Rebooting the client server restores performance.

Can you reproduce it? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes. A simple test is creating a tarball of a directory. After a fresh system boot the creation is quick. After several hours, the creation slows down and eventually reaches a point of not being able to complete.

For example, this command: time tar -czf backup.tar.gz /folder1 /folder2

On a fresh boot:
real 3m3.088s
user 0m18.313s
sys 0m4.670s

After the server has been in use:
real 35m25.855s
user 0m21.521s
sys 0m5.037s

For comparison, running the same test on local disk is significantly faster and consistent over time.
real 0m19.482s
user 0m9.681s
sys 0m4.164s

Include any warning/errors/backtraces from the system logs

The mfsmount log only says this at startup.

Mar 21 17:24:06 web00 mfsmount[84308]: monotonic clock function: clock_gettime
Mar 21 17:24:06 web00 mfsmount[84308]: monotonic clock speed: 16967 ops / 10 mili seconds
Mar 21 17:24:08 web00 mfsmount[84308]: my st_dev: 3976265474

Configuration files

mfsmount.cfg:

mfscachemode=AUTO
mfsmaster=IP
mfspassword=password
mfssubfolder=web
mfsmkdircopysgid=1
/storage/chunk

mfsexports.cfg:


*                       /       rw,alldirs,admin,maproot=0:0
*                       /web       rw,alldirs,admin,maproot=0:0,maxtrashtime=0s,password=PASSWORD,mingoal=2

Everything else is as shipped with the exception of IP binding and connection options.

NOTES:
The client servers handle traffic and processing for fifty low-traffic websites. During normal operation the sites are fine, but the tar stress test takes them all offline due to the slow mfsmount.

I'm not sure if this is a bug in MooseFS, FreeBSD, or simply a configuration issue. The network, disk I/O, CPU usage and memory are all essentially idle or at very low usage.

Can anybody help pinpoint why mfsmount becomes so slow over time or with usage?

@jSML4ThWwBID69YC
Author

Here is a simplified test on a freshly booted client server. I'm using the same MooseFS master, chunk, and network configuration as above. The test is to repeatedly tarball a basic Nextcloud installation.

The script dumps the Nextcloud database from a different server and tarballs up the public_html and private_html folders.

Here are the results.

[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	1m28.270s
user	0m15.617s
sys	0m3.855s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	2m21.262s
user	0m16.010s
sys	0m4.617s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	5m3.893s
user	0m17.513s
sys	0m5.497s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	4m23.723s
user	0m17.129s
sys	0m6.019s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	3m44.522s
user	0m16.631s
sys	0m6.580s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	4m0.305s
user	0m16.402s
sys	0m6.476s
[ssh00 /]$  time /usr/local/bin/scripts/nextcloud-backup.sh 
Maintenance mode enabled
Starting Backup
Maintenance mode disabled
Backup complete. Download /home/autobackup/auto-backup-21-21.tar.gz to your local computer for safe keeping.

real	60m39.528s
user	0m17.375s
sys	0m6.638s

Note: the performance continues to be bad even if I shut down everything but the mfsmount. A reboot of the server is needed to bring performance back.

@jSML4ThWwBID69YC
Author

I've repeated the above tests on an idle server using DIRECT and AUTO mode. The same slowdown happens either way.

I welcome any ideas on steps to resolve this.

@chogata
Member

chogata commented Mar 24, 2020

I personally often use FreeBSD client to run tests and I never noticed this problem, but I usually work on an instance that is otherwise Linux only.

Is it tar-specific, or do you observe other operations slowing down? Would you mind doing a simple stress test with dd: make a loop that creates a file (approximately the size of your test backup tarballs) from /dev/zero, and time dd performance over time?
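The suggested loop might look roughly like this. This is a hedged sketch, not an official test: MNT, the pass count, and the file size are assumptions; point MNT at the mfsmount being tested.

```shell
#!/bin/sh
# Sketch of a dd stress loop: repeatedly write a zero-filled file and
# time each pass to see whether throughput degrades over time.
# MNT is an assumption: set it to the mfsmount under test (defaults to /tmp).
MNT="${MNT:-/tmp}"
i=1
while [ "$i" -le 5 ]; do
    # Write a 256 MiB file of zeros; bs is given as a plain byte count so
    # the same command works with both BSD and GNU dd.
    /usr/bin/time -p dd if=/dev/zero of="$MNT/ddtest.$i" bs=1048576 count=256 2>&1 | tail -n 3
    i=$((i + 1))
done
rm -f "$MNT"/ddtest.*
```

Comparing the "real" time of the first and last passes shows whether sequential writes alone trigger the slowdown.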

@jSML4ThWwBID69YC
Author

I ran the requested test. Unfortunately, running dd in a loop is not enough to cause any significant difference in mfsmount speed. It appears that accessing a large number of different files in rapid sequence is enough to trigger this. Using tar on a large enough directory will do it; so will using rsync to sync a large number of changes. In both cases, the entire mfsmount slows down, affecting all other related processes. The client system resources remain close to idle.

Depending on the command and the time it takes, I may need to run it several times in a row before performance starts dropping. For example, I created a tarball of a Drupal-based website. The first run took 1m58s and increased to 4m52s by the eighth run. In one earlier test, I saw 60+ minutes for something that takes under two minutes on a fresh boot. So far the only errors I can find are rare instances of these:

mfsmount[56908]: writeworker: can't set TCP_NODELAY: ECONNRESET (Connection reset by peer)
mfsmount[88647]: sysctl(kern.proc.filedesc) error: EBUSY (Device or resource busy) 

I also found an issue with AUTO mode while testing, and I'll open a separate report on that.

@chogata
Member

chogata commented Mar 25, 2020

It appears accessing a large amount of different files in rapid sequence is enough to trigger this.

This is a very good hint. Let me do some testing and I will get back to you.

@chogata
Member

chogata commented Mar 25, 2020

And one more question:

A reboot of the server is needed to bring performance back.

Is reboot of the server really necessary? Umounting mfs share and mounting it back doesn't help?

@jSML4ThWwBID69YC
Author

Is reboot of the server really necessary? Umounting mfs share and mounting it back doesn't help?

I'll test that in the next few hours. There are lots of nullfs mounts layered over the mfsmount, which makes rebooting the faster option.

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Mar 25, 2020

TL;DR: Umounting and mounting it back resolves the performance issue for a short time.

I've tested two situations with how the mfs mount is accessed.

1: Using the mfsmount directly does not appear to show any performance issues. I can mount and umount the mfs share without issue.

2: Using nullfs mounts on top of the mfs mount to share data to containers (jails) seems to be an issue. For testing, I did the following.

  • Nullfs mount the mfs mount into a jail running SSH.
  • Login into the SSH jail and run the tarball tests.
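Those steps might look roughly like this. The jail name and paths are hypothetical, based on the mount point mentioned in the configs above.

```shell
# Hypothetical layout: mfsmount at /storage/chunk, jail root at /jails/ssh00.
# Expose a subfolder of the mfs mount inside the jail via nullfs:
mount -t nullfs /storage/chunk/web /jails/ssh00/storage/web

# Inside the jail, run the same tarball test over the nullfs view:
jexec ssh00 sh -c 'time tar -czf /tmp/backup.tar.gz /storage/web'
```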

The performance issue happens when accessing the data over the nullfs mount on top of the mfs mount.

Next, I shut down the jail and attempted to remove the nullfs and mfs mounts. The nullfs mounts took a long time to unmount, and the mfs mount would not unmount at all, giving this error:

umount: unmount of /storage/chunk failed: Device busy

Note: nothing was accessing the mount, as far as I'm aware.

I eventually had to force-umount the mfs mount. Mounting everything again and redoing the test showed restored performance until it eventually slowed down again.

I should note that I also use nullfs mounts on top of ZFS storage, and those do not suffer performance issues.

@chogata
Member

chogata commented Mar 26, 2020

Okay, so it looks like there is a bug in nullfs...

If you read a file via nullfs (with cat, for example), nullfs will not send a release command after it finishes using it. Worse, that file will never get another release, even if you subsequently read it directly from the original (MooseFS) mountpoint. If you first read it from MooseFS without "touching" it via nullfs, it's fine and the release arrives (even if nullfs is already mounted), but once you "touch" it via nullfs, it never gets released. So the reference counters in MooseFS never drop to zero and the file stays open.
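The behaviour described can be sketched as a minimal reproduction. The paths are assumptions; run it on a FreeBSD client with a MooseFS mount available.

```shell
# Assumed layout: MooseFS mounted at /mnt/mfs, nullfs view at /mnt/null.
mount -t nullfs /mnt/mfs /mnt/null

cat /mnt/mfs/a.txt  > /dev/null   # direct read: release arrives, open count drops
cat /mnt/null/a.txt > /dev/null   # read via nullfs: no release is sent
cat /mnt/mfs/a.txt  > /dev/null   # even a later direct read no longer releases it
# The file now stays "open" in the CGI mount tab until a (forced) umount.
```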

Check the mount tab in your CGI, open files column...

@jSML4ThWwBID69YC
Author

I've removed the mfscachemode and mfssubfolder settings and retested. The performance drop continues to happen, though bug #350 does not.

Here are the two client servers open file statistics.

web00 = 59,618 open files.
web02 = 15,822 open files.

The web00 server is the one I'm running the tarball tests on.

The MooseFS-based client servers are configured identically to the ZFS-only servers. In both cases there are hundreds of nullfs mounts in use; the only difference is whether it's a ZFS filesystem or an MFS one under the nullfs. I don't experience any issues on the ZFS-only servers, even under high load.

@jSML4ThWwBID69YC
Author

Hello,

I checked the mount tab again and the open file count is increasing over time.

web00 = 67,642
web02 = 23,283

I've now removed the nullfs mounts in favor of a separate mfs mount for each needed location. This requires six to ten separate mfs mounts per physical machine. Are there any repercussions for doing this? Does the FUSE cache work together or separately across the mounts?

I'll report back after it's had 10+ hours to run, and I'll check if I'm still seeing the high open file count. For now, it does look like the number of open files is changing, including going down at times.

@chogata
Member

chogata commented Mar 27, 2020

The only drawback of multiple mounts is increased RAM usage: each mount needs its own cache. And no, they don't share the FUSE cache, because FUSE has no way of knowing that it's the same fs.
Other than that, it works great; we even recommend using more than one mount on clients with heavy operation loads, for efficiency reasons.

As for your specific use case: I'm no expert on jails in FreeBSD, I've only had a passing acquaintance with them, but as far as I know you can use symbolic links to pass fragments of the filesystem as mountpoints to them. Wouldn't that solve your problem?

@jSML4ThWwBID69YC
Author

Hello,

Using multiple mfs mounts without nullfs seems to work around the issue. At the high end, I'm seeing around 5000 open files. The number of open files is going up and down depending on usage.
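For reference, the separate subfolder mounts might be set up roughly like this. The master address, password, subfolders, and jail paths are placeholders modeled on the configs quoted earlier; check mfsmount(8) on your version for exact option spelling.

```shell
# One mfsmount per shared location, instead of nullfs views over one mount.
mfsmount /jails/site1/storage -H 192.168.0.50 -S /web/site1 -o mfspassword=PASSWORD
mfsmount /jails/site2/storage -H 192.168.0.50 -S /web/site2 -o mfspassword=PASSWORD
# Each mount keeps its own FUSE cache and its own open-file accounting,
# at the cost of extra RAM per mount.
```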

Symbolic links won't work in this case as the data is being mounted into chrooted environments where the stat and realpath values have to match the chroot environment.

I don't see the issue using nullfs on normal filesystem mounts, so I suspect this is directly related to MooseFS, or perhaps to FUSE mounts in general. Can you put together a simplified test case for the nullfs issue? This seems like something that needs to be reported to the FreeBSD developers.

Regarding disk quotas: do I need to set the quota at each mfs mount point for subfolders, or will doing it for a single mount point apply across them all? I've noticed the CGI interface uses the MooseFS path instead of the mount point path, so I'm hopeful I only need to apply it once. Can the quota be set at the master server using the MooseFS path instead of a client path?

Thank you for all of your assistance in this. At the very least there is a work around until the nullfs issue can be fixed.

@acid-maker acid-maker added the upstream/kernel upstream OS kernel issue label Apr 6, 2020
@chogata
Member

chogata commented Apr 7, 2020

Hi,
A quota set for one directory in MooseFS is observed in all its subdirectories, so if you have, for example, this structure on your instance:

/dir1/subdir1
/dir1/subdir2
/dir1/subdir3

You can set a quota for dir1, and the sum of all data recorded in subdir1, subdir2 and subdir3 cannot exceed it. But it's not shared equally, so it doesn't mean subdir1 can only have 1/3 of the total quota. If you need separate quotas for the subdirectories, you need to define them.
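As a sketch, quotas of this sort are normally set with the MooseFS client tools. The tool names here (mfssetquota, mfsgetquota) and the uppercase-for-hard-limit convention match the MooseFS 3 client tools as I understand them, but verify against the man pages on your version; the paths and sizes are placeholders.

```shell
# Hard size quota on the parent: subdir1..3 together must stay under 100 GiB.
mfssetquota -S 100G /mnt/mfs/dir1

# Separate per-subdirectory quota, if one subtree needs its own cap:
mfssetquota -S 40G /mnt/mfs/dir1/subdir1

# Inspect current quota settings and usage:
mfsgetquota /mnt/mfs/dir1
```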

We will try to make the bug report to FreeBSD this week about the problem with nullfs. There is definitely an issue there.

@pkonopelko
Member

Note: https://moosefs.com/wp-content/uploads/2018/08/MooseFS-Hardware-Guide-v.0.9.pdf is a 404 error.

The Hardware Guide URL redirect is fixed now.

@chogata
Member

chogata commented Apr 17, 2020

@jSML4ThWwBID69YC
Author

jSML4ThWwBID69YC commented Apr 17, 2020

Thank you. I've followed up on the FreeBSD side with further tests. Mounting nullfs with -o nocache seems to allow cmd:release to happen.
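The workaround described might look like this; the paths are hypothetical, reusing the layout from the earlier jail tests.

```shell
# -o nocache reportedly stops nullfs from caching vnodes, so open/release
# operations pass through to the underlying FUSE (MooseFS) mount.
mount -t nullfs -o nocache /storage/chunk/web /jails/ssh00/storage/web
```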

@jSML4ThWwBID69YC
Author

This is confirmed fixed in FreeBSD 12.2.
