rclone using too much memory #2157

Closed
calisro opened this issue Mar 20, 2018 · 43 comments

calisro (Member) commented Mar 20, 2018

rclone v1.39-211-g572ee5ecβ

  • os/arch: linux/amd64
  • go version: go1.10

I recently upgraded from rclone v1.38-235-g2a01fa9fβ to rclone v1.39-211-g572ee5ecβ.

I'm using the following mount:
export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
  mount robgs-cryptp:Media $GS_RCLONE \
  --allow-other \
  --default-permissions \
  --gid $gid --uid $uid \
  --max-read-ahead 1024k \
  --buffer-size 50M \
  --dir-cache-time=72h \
  --umask $UMASK 2>&1 > /data/log/debug &

This worked flawlessly before. With the most recent version it uses WAY too much memory, causing OOM issues on my Linux server. The above uses upwards of 7 GB of resident memory before my OOM killer terminates the mount while activity is being read from or written to the mount. I used to be able to crank the buffer-size up to 150M without issues. There is NOTHING else different between the setups. I can restore my old version and re-run the batch process without issues, then replace it with the new version, and it consistently terminates due to OOM after consuming everything left.

If I restore the old version and rerun the exact same process, rclone consistently uses no more than 1.8G.

I can provide logs but there is nothing in them; they simply show transfers. Even in the fuse debug output there is nothing but normal activity. There seems to be either a leak, or the memory usage has changed DRASTICALLY between the versions, making things unusable. I've rolled back until this is sorted.

remusb (Collaborator) commented Mar 20, 2018

Are you using cache? If yes and you're also using Plex, can you disable that feature and see if it does the same thing?

It will be enough to just remove the configs from the section and start again.

Later edit: Even easier would be to simply run with -v.

calisro (Member, Author) commented Mar 20, 2018

No, I'm not using cache, and it normally runs with -v. The older version didn't use cache, as I don't think that was implemented yet. The new one is using the same remotes and setup as the old; the only change is the executable upgrade.

remusb (Collaborator) commented Mar 20, 2018

Hmm, OK. There was another report in the forum and that person uses cache. I was trying to establish a correlation, but it seems to be coming from somewhere else.

ncw (Member) commented Mar 20, 2018

I'd like to replicate this if possible. Can you describe the activity that causes the problem more?

Or maybe you've got a script I could run?

Which provider are you using? And are you using crypt too?

ncw added the bug label Mar 20, 2018
ncw added this to the v1.41 milestone Mar 20, 2018
calisro (Member, Author) commented Mar 20, 2018

So the batch script is actually just a set of rclone commands to sync stuff up against the mount points.

SETUP:
I have a local mount point called /data/Media1
I have the rclone remote mounted on /data/Media2

This is my mount:

export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
  mount robgs-cryptp:Media /data/Media2 \
  --allow-other \
  --default-permissions \
  --gid $gid --uid $uid \
  --max-read-ahead 1024k \
  --buffer-size 50M \
  --dir-cache-time=72h \
  --umask $UMASK 2>&1 > /data/log/debug &

BATCH:
The batch script simply copies new data from one to the other using this command, among others. This is where most of the activity starts and where the OOM starts to happen:

export RCLONE_STATS=5m
export RCLONE_CONFIG=/home/robert/.rclone.conf
export RCLONE_ONE_FILE_SYSTEM=true
export RCLONE_RETRIES=10
export RCLONE_BWLIMIT="07:00,5M 23:00,10M"
export RCLONE_TRANSFERS=8
export RCLONE_CHECKERS=16
export RCLONE_SKIP_LINKS=true
export RCLONE_DRIVE_USE_TRASH=false
rclone \
   -v \
   --modify-window 1ms \
   --exclude=**nfo \
 copy /data/Media1 /data/Media2 

I'm using Google as a provider, and it is using crypt as well. Media2 is simply synced with Media1, but I don't delete, so it becomes a superset of what is copied to Media1. New files get put in Media1 and then replicated via this process.

danielloader (Contributor) commented Mar 20, 2018

I am experiencing this too; I only noticed as my IOWAIT shot up when I hit swap and rclone pages were being pushed to disk.
As for replication, I can't seem to reproduce it other than by mounting the remote with crypt/cache and trying to play media on Plex; after about 5 minutes it thrashes the disk as memory shoots up from 200MB to the full 2GB on the VPS.

This wasn't the behaviour in the later betas of 1.39; has anything changed in that respect?

ncw (Member) commented Mar 20, 2018

I managed to replicate this...

First I made a directory with 1000 files in it using

make_test_files -max-size 1000000 test-files

(this is one of the rclone tools)

I then mounted up

rclone mount /tmp/test1 /mnt/tmp/

and ran this little script to copy the files in and out of the mount

#!/bin/bash

for i in `seq 100`; do
    echo "round $i"
    rclone -v move /tmp/test-files/ /mnt/tmp/test-files
    rclone -v move /mnt/tmp/test-files /tmp/test-files/
    echo "waiting for 10s... (round $i)"
    sleep 10s
done

v1.39 uses a pretty constant 4MB, v1.40 goes up and up!

danielloader (Contributor) commented Mar 20, 2018

https://i.imgur.com/Lvm01QB.png
Graph from when I replicated your steps; looks like a memory leak alright. Once RAM is exhausted it's just 100% IOWAIT, stalled as it tries to use swap.

ncw (Member) commented Mar 20, 2018

OK here is the pprof memory usage - rclone was at about 2GB RSS when I took this.

So it looks like it might be a bug in the fuse library... though I'm not 100% sure about that.

[pprof memory graph: profile001]

(pprof) list allocMessage
Total: 1.83GB
ROUTINE ======================== github.com/ncw/rclone/vendor/bazil.org/fuse.allocMessage in /home/ncw/go/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go
    1.76GB     1.76GB (flat, cum) 96.03% of Total
         .          .    419:var reqPool = sync.Pool{
         .          .    420:	New: allocMessage,
         .          .    421:}
         .          .    422:
         .          .    423:func allocMessage() interface{} {
    1.76GB     1.76GB    424:	m := &message{buf: make([]byte, bufSize)}
         .          .    425:	m.hdr = (*inHeader)(unsafe.Pointer(&m.buf[0]))
         .          .    426:	return m
         .          .    427:}
         .          .    428:
         .          .    429:func getMessage(c *Conn) *message {

remusb (Collaborator) commented Mar 20, 2018

Was it updated just before the release? It's a bit weird no one noticed this in the last betas.

A simple listing of the vendor directory shows that it wasn't.

calisro (Member, Author) commented Mar 20, 2018

I have this issue on 1.39 though.

calisro (Member, Author) commented Mar 20, 2018

rclone v1.39-211-g572ee5ecβ

ncw (Member) commented Mar 20, 2018

I used the magic of git bisect to narrow it down to this commit fc32fee

commit fc32fee4add93e7a3bda180c7c1088f93c759d13
Author: Nick Craig-Wood <nick@craig-wood.com>
Date:   Fri Mar 2 16:39:42 2018 +0000

    mount, cmount: add --attr-timeout to control attribute caching in kernel
    
    This flag allows the attribute caching in the kernel to be controlled.
    The default is 0s - no caching - which is recommended for filesystems
    which can change outside the control of the kernel.
    
    Previously this was at the default meaning it was 60s for mount and 1s
    for cmount.  This showed strange effects when files changed on the
    remote not via the kernel.  For instance Caddy would serve corrupted
    files for a while when serving from an rclone mount when a file
    changed on the remote.

Which is both good and bad.

Good because there is a workaround (--attr-timeout 1s), but bad because this is the second bug I've bisected to the same commit today.

I tried --attr-timeout 1ms and that still leaked a bit of memory, whereas --attr-timeout 1s seems OK.

I suspect this is a bug in the fuse library as rclone cmount (which is based off libfuse) doesn't show the same behaviour - it works fine with --attr-timeout 0.

I'll report a bug upstream in a moment...
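
In the meantime, a mount using the workaround would look roughly like this (a sketch only, not taken from the thread; remote:path and /mnt/point are placeholders, other flags per your own setup):

# workaround sketch: let the kernel cache attributes for 1s,
# which sidesteps the leak bisected above
rclone mount remote:path /mnt/point \
  --attr-timeout 1s \
  --buffer-size 50M \
  -v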

danielloader (Contributor) commented:

I've not seen cmount mentioned before, what's the difference?

ncw (Member) commented Mar 20, 2018

cmount is a libfuse-based mount; it works just the same. It isn't normally compiled in for Linux as it requires cgo, which would require a cross-compile environment for all the OSes. Build it in with go build -tags cmount (see the sketch below). It provides the mount command on Windows.
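
For reference, building it in looks roughly like this (a sketch; assumes a checkout of the rclone source under GOPATH, a C toolchain, cgo and the libfuse development headers):

# build rclone with the libfuse-based cmount command compiled in (needs cgo)
cd "$GOPATH/src/github.com/ncw/rclone"
go build -tags cmount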

ncw (Member) commented Mar 20, 2018

Upstream bug report: bazil/fuse#196

danielloader (Contributor) commented:

Are we likely to run into any issues at 1s, especially with vfs caching enabled and the cache layer?

calisro (Member, Author) commented Mar 21, 2018

I haven't had an OOM termination or the excessive memory usage since I've added --attr-timeout 1s.

@ncw Can you tell me what running with that timeout can negatively affect?

danielloader (Contributor) commented:

I imagine it just means that for a second at any given time there's a chance the data on the remote will be out of sync with the local view, and thus streaming the file may go awry. At least that's how I read it.

calisro (Member, Author) commented Mar 21, 2018

I'll also enable cache and make sure it's stable. I tried this before and ran into the same issue before reverting to a non-cached mount again.

ncw (Member) commented Mar 22, 2018

@Daniel-Loader wrote:

Are we likely to run into any issues at 1s, especially with vfs caching enabled and the cache layer?

Probably not. For previous rclone versions the value was 60s (or 1s on Windows).

What it means is that if you upload a new file not via the mount then there is a 1s window of opportunity for the kernel to be caching stale data. In practice what this means is that the kernel gets the length of the file wrong and you either get a truncated file or a file with corruption on the end.

1s is quite a small window and the kernel would have to be caching the inode for it to be a problem, so I think you are very unlikely to see it. In fact I didn't have any reports of problems with it at 60s, though people may not have realised there was a problem as retrying will have fixed it.

@calisro Assuming this fixes your problem, I think I'll set the default back to 1s.
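
To make the stale-attribute window described above concrete, here is a rough illustration (a sketch, not from the thread; remote:path and /mnt/point are placeholders):

# change a file on the remote directly, bypassing the mount
rclone copyto ./new-version.bin remote:path/file.bin
# then stat it through the mount straight away
stat /mnt/point/file.bin
# for up to --attr-timeout the kernel may still report the old size, so a
# read started inside that window can see a truncated or corrupted tail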

hklcf commented Mar 22, 2018

If rclone uses all of the RAM on your system, it'll crash the mounted drive, so I wrote a script to check and remount it.

if [ ! -e "$RCLONE_MOUNT_PATH/mounted" ]; then
  DOREMOUNT   # call your remount function/script here
fi

calisro (Member, Author) commented Mar 22, 2018

Still working well in both cache and non-cached mounts

neik1 commented Mar 22, 2018

I am also running into "out of memory" errors. Unfortunately the debug log doesn't show much information.
It is happening with rclone v1.39-155-g62e72801β and 1.40 stable as well. I'm using the following mount command:

rclone_1.40 mount --log-level DEBUG --log-file /tmp/log.txt --allow-non-empty --allow-other --buffer-size 32M --dir-cache-time 8760h --poll-interval 24h --vfs-cache-mode writes --attr-timeout 1s --vfs-cache-max-age 0m upload: /media/cry

Edit: Usually I do not use --attr-timeout 1s, but after reading that it solved the problems for calisro I gave it a try, without success.

ncw (Member) commented Mar 22, 2018

@neik1 what is your access pattern to cause this problem? Can you show me how to replicate it? Can you post a log somewhere?

hklcf commented Mar 22, 2018

[memory usage graph]

Heavy I/O test: when rclone uses roughly 90% of system RAM, the mounted drive crashes and I can't read from or write to it.

neik1 commented Mar 22, 2018

@ncw: Well, my setup is running rclone + Emby on Ubuntu (x64). I restarted both yesterday evening, and today at about 11AM it crashed with the OOM error. In between there was a library scan (to check for updated files) and 2 files were streamed successfully. On the third one it crashed. There was never more than one user streaming simultaneously.

Log -> https://1drv.ms/t/s!AoPn9ceb766mgYspfoZqWaZExTlsEQ

After analysing the log a bit I was surprised that there are over 30 lines saying "open file" (see line 13 of the log).

@hklcf, this probably isn't the right place, but anyway: what are you using to get such a nice graph of your memory usage?

hklcf commented Mar 22, 2018

@neik1 This chart is provided by the DigitalOcean control panel. NetData is also a good choice :blush:

ncw (Member) commented Mar 22, 2018

@neik1 thanks for the log :-)

You have 95 goroutines running, which seems like a fair number...

I think you have 22 open files. Each of these may be using --buffer-size 32M of memory, so that is 700M accounted for. How much memory does your machine have?

Could you try --attr-timeout 60s and see if that helps?

If you can run rclone with --memprofile mem.prof, quit the mount before it crashes, and then attach the mem.prof along with the rclone binary you used in a .zip file, I can analyse it and produce one of those graphs above.
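
For anyone who wants to inspect such a profile themselves, the analysis is roughly as follows (assuming Go is installed; the binary and profile names are examples):

# interactive inspection of the heap profile against the exact binary used
go tool pprof rclone mem.prof
# inside pprof: top, list allocMessage, web, ...

# or render a graph like the one earlier in the thread (needs graphviz)
go tool pprof -png rclone mem.prof > profile.png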

ncw (Member) commented Mar 22, 2018

Here is my cheap and nasty memory monitoring tool

#!/bin/sh
# Monitor memory usage of the given command

COMMAND="$1"

if [ "$COMMAND" = "" ]; then
   echo "Syntax: $0 command"
   exit 1
fi

headers=""
while [ 1 ]; do
    ps $headers -C "${COMMAND}" -o pid,rss,vsz,cmd
    sleep 1
    headers="--no-headers"
done

Run it as memusage rclone and it will show the memory of any rclone processes that are running, writing a log line every second.
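
If you want it to keep logging after you log out (an aside, not part of the original comment), something like nohup works:

# keep the monitor running after the SSH session closes (names are examples)
nohup ./memusage.sh rclone > /tmp/mem.log 2>&1 &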

neik1 commented Mar 22, 2018

@ncw, how come I have 22 open files when I am just streaming one specific file?
My machine has just 2GB of memory.

Stopping rclone right before it crashes will be difficult because my sister and my parents stream to their devices as well, and I can't monitor it 24/7. Is it possible to create a script (or something similar) that monitors rclone and, if it exceeds let's say 500-600MB, kills it to create the memprofile?

I restarted all the processes and put --attr-timeout 60s --memprofile /tmp/mem.prof and removed --buffer-size 32M.

Sorry for the stupid question, but what exactly do I need to do with that script? Just save it and run it like ./script.sh?
Edit: Running it like that leads to -> Syntax: mem.sh command

memusage rclone shows me that -> -bash: memusage: command not found
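
For the record, a watchdog along the lines asked about above could look roughly like this (a hypothetical sketch, not from the thread; the threshold, process name and mount point all need adjusting, and it assumes rclone writes the --memprofile file when the mount shuts down cleanly):

#!/bin/sh
# unmount rclone once its resident memory passes a threshold so that
# --memprofile gets a chance to write its profile on a clean exit
LIMIT_KB=614400              # ~600MB
MOUNTPOINT="/media/cry"      # mount point from the command above

while true; do
    RSS=$(ps -C rclone_1.40 -o rss= | sort -n | tail -1)
    if [ -n "$RSS" ] && [ "$RSS" -gt "$LIMIT_KB" ]; then
        fusermount -u "$MOUNTPOINT"   # triggers a clean shutdown of the mount
        break
    fi
    sleep 10
done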

hklcf commented Mar 22, 2018

@neik1 save the script to memusage.sh and run ./memusage.sh rclone

danielloader (Contributor) commented Mar 22, 2018

Worst case, if you can't get memory usage to behave well with 2GB, try mounting with the --cache-chunk-no-memory flag. There is a performance cost, but if you have more local storage space than RAM, which is usually a given, it may help. (Well, it depends; on an SSD I'd imagine the chunks are network limited rather than storage limited.)
I'd rather wait another second or so for a file to start playing on Plex than worry about OOM killing the mount.

neik1 commented Mar 22, 2018

@hklcf, thanks that helped. Set it up and it's now running.

@Daniel-Loader: Hi Daniel,
so, basically you are telling me 2GB isn't enough for rclone? I assumed rclone would buffer the 32MB + a bit overhead maybe and that would be it.
Anyway, thanks for that hint. Will give it a try if the problem persists.

seuffert (Contributor) commented Mar 22, 2018

I assumed rclone would buffer the 32MB + a bit overhead maybe and that would be it.

--buffer-size is per open file on a mount

neik1 commented Mar 22, 2018

OK! But how is it possible that there were so many files open, leading to that issue, when there was only one stream going? That's the point I do not understand yet.

danielloader (Contributor) commented Mar 22, 2018

Well, if there's a library scan and it's not doing it sequentially, it might be doing mediainfo lookups on multiple files at once. You could try changing the buffer size down to 8MB and see if it retains enough performance. Alternatively, as said, write the 32MB buffers to disk as a swap of sorts.

I know on Plex you can opt out of chapter/thumbnail creation, which would read the whole file on scans; can Emby opt out too?

neik1 commented Mar 22, 2018

With this adapted mount command it crashed again:

rclone_1.40 mount --log-level DEBUG --log-file /tmp/log.txt --allow-non-empty --allow-other --dir-cache-time 8760h --poll-interval 24h --vfs-cache-mode writes --attr-timeout 60s --memprofile /tmp/mem.prof --vfs-cache-max-age 0m upload: /media/cry
(The difference from the first is that --attr-timeout was set to 60s, --memprofile was added and --buffer-size 32M was removed.)

Log -> https://1drv.ms/t/s!AoPn9ceb766mgYsqBGX5KjSKxFVeiA
During the crash there was only one stream ongoing and nothing else.
Unfortunately, the memusage script apparently stopped when the SSH connection disconnected. How can I start it so that it stays alive? I used ./memusage.sh > /tmp/mem.log &

My next step will be trying to avoid the memory cache entirely with --cache-chunk-no-memory
Edit: Is this an rclone cache specific flag? I am not using cache!

@Daniel-Loader, I am not using chapter/thumbnail creation. The library scan for new files takes about 8 minutes (scraping of new files included), so it's actually pretty fast, I suppose.

danielloader (Contributor) commented:

Yeah, that's incredibly fast for a non-cached remote media scan, depending on library size!

hklcf commented Mar 23, 2018

@neik1 Use screen so it survives the SSH disconnect.

neik1 commented Mar 23, 2018

Yeah, well... In the end it doesn't seem to be an rclone issue. In my case it seems to be an Emby problem with a specific client.
Going to try to narrow it down with the developers over there and will come back to report. ;-)

Edit: I am saying this because Emby crashed again, but this time rclone didn't (probably because of the --cache-chunk-no-memory flag) and was only using 80MB of memory.

ncw closed this as completed in 98a9246 Mar 24, 2018
ncw (Member) commented Mar 24, 2018

I've committed a fix to change the default --attr-timeout to 1s and made notes in the docs about what happens when you set it to 0s.

Hopefully upstream will come up with a fix eventually, but this will do for the moment.

ncw (Member) commented Mar 24, 2018

This will be in

https://beta.rclone.org/v1.40-018-g98a92460/ (uploaded in 15-30 mins)
