rclone using too much memory #2157

Closed
calisro opened this issue Mar 20, 2018 · 43 comments

calisro (Member) commented Mar 20, 2018

rclone v1.39-211-g572ee5ecβ

  • os/arch: linux/amd64
  • go version: go1.10

I recently upgraded from rclone v1.38-235-g2a01fa9fβ to rclone v1.39-211-g572ee5ecβ.

I'm using the following mount:
export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
  mount robgs-cryptp:Media $GS_RCLONE \
  --allow-other \
  --default-permissions \
  --gid $gid --uid $uid \
  --max-read-ahead 1024k \
  --buffer-size 50M \
  --dir-cache-time=72h \
  --umask $UMASK 2>&1 > /data/log/debug &

This worked flawlessly before. With the most recent version it uses WAY too much memory, causing OOM issues on my Linux server. The above uses upwards of 7 GB of resident memory before my OOM killer terminates the mount while activity is being read from or written to the mount. I used to be able to crank the buffer-size up to 150M without issues. There is NOTHING else different between the setups. I can restore my old version and re-run the batch process without issues, then replace it with the new version, and it consistently terminates due to OOM after consuming everything left.

If I restore the old version and rerun the exact same process, rclone consistently uses no more than 1.8G.

I can provide logs but there is nothing in them; they simply show transfers. Even in the fuse debug output there is nothing but normal activity. There seems to be either a leak, or the memory usage has changed DRASTICALLY between the versions, making things unusable. I've rolled back until this is sorted.

remusb (Collaborator) commented Mar 20, 2018

Are you using cache? If yes and you're also using Plex, can you disable that feature and see if it does the same thing?

It will be enough to just remove the configs from the section and start again.

Later edit: Even easier would be to simply run with -v.

calisro (Member, Author) commented Mar 20, 2018

No, I'm not using cache, and it normally runs with -v. The older version didn't use cache, as I don't think that was implemented yet. The new one is using the same remotes and setup as the old; the only change is the executable upgrade.

remusb (Collaborator) commented Mar 20, 2018

Hmm, OK. There was another report in the forum and that person uses cache. I was trying to establish a correlation, but it seems to be coming from somewhere else.

ncw (Member) commented Mar 20, 2018

I'd like to replicate this if possible. Can you describe the activity that causes the problem more?

Or maybe you've got a script I could run?

Which provider are you using? And are you using crypt too?

ncw added the bug label Mar 20, 2018
ncw added this to the v1.41 milestone Mar 20, 2018
calisro (Member, Author) commented Mar 20, 2018

So the batch script is actually just a set of rclone commands to sync stuff up against the mount points.

SETUP:
I have a local mount point called /data/Media1
I have the rclone remote mounted on /data/Media2

This is my mount:

export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
  mount robgs-cryptp:Media /data/Media2 \
  --allow-other \
  --default-permissions \
  --gid $gid --uid $uid \
  --max-read-ahead 1024k \
  --buffer-size 50M \
  --dir-cache-time=72h \
  --umask $UMASK 2>&1 > /data/log/debug &

BATCH:
The batch script simply copies new data from one to the other using this command, among others. This is where most of the activity starts and where the OOM starts to happen:

export RCLONE_STATS=5m
export RCLONE_CONFIG=/home/robert/.rclone.conf
export RCLONE_ONE_FILE_SYSTEM=true
export RCLONE_RETRIES=10
export RCLONE_BWLIMIT="07:00,5M 23:00,10M"
export RCLONE_TRANSFERS=8
export RCLONE_CHECKERS=16
export RCLONE_SKIP_LINKS=true
export RCLONE_DRIVE_USE_TRASH=false
rclone \
   -v \
   --modify-window 1ms \
   --exclude=**nfo \
 copy /data/Media1 /data/Media2 

I'm using Google as a provider, and it is using crypt as well. Media2 is simply synced with Media1, but I don't delete, so it becomes a superset of what is copied to Media1. New files get put in Media1 and then replicated via this process.

danielloader (Contributor) commented Mar 20, 2018

I am experiencing this too; I only noticed as my IOWAIT shot up when I hit swap and rclone pages were being pushed to disk.
As for replication, I can't seem to reproduce it other than by mounting the remote with crypt/cache and trying to play media on Plex; after about 5 minutes it thrashes the disk as memory shoots up from 200MB to the full 2GB on the VPS.

This wasn't the behaviour in the later betas of 1.39; has anything changed in that respect?

ncw (Member) commented Mar 20, 2018

I managed to replicate this...

First I made a directory with 1000 files in it using

make_test_files -max-size 1000000 test-files

(this is one of the rclone tools)

I then mounted up

rclone mount /tmp/test1 /mnt/tmp/

and ran this little script to copy the files in and out of the mount

#!/bin/bash

for i in `seq 100`; do
    echo "round $i"
    rclone -v move /tmp/test-files/ /mnt/tmp/test-files
    rclone -v move /mnt/tmp/test-files /tmp/test-files/
    echo "waiting for 10s... (round $i)"
    sleep 10s
done

v1.39 uses a pretty constant 4MB, v1.40 goes up and up!

danielloader (Contributor) commented Mar 20, 2018

https://i.imgur.com/Lvm01QB.png
Graph from when I replicated your steps; looks like a memory leak alright. Once RAM is exhausted it's just 100% IOWAIT, stalled as it tries to use swap.

ncw (Member) commented Mar 20, 2018

OK here is the pprof memory usage - rclone was at about 2GB RSS when I took this.

So it looks like it might be a bug in the fuse library... though I'm not 100% sure about that.

[pprof memory graph: profile001]

(pprof) list allocMessage
Total: 1.83GB
ROUTINE ======================== github.com/ncw/rclone/vendor/bazil.org/fuse.allocMessage in /home/ncw/go/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go
    1.76GB     1.76GB (flat, cum) 96.03% of Total
         .          .    419:var reqPool = sync.Pool{
         .          .    420:	New: allocMessage,
         .          .    421:}
         .          .    422:
         .          .    423:func allocMessage() interface{} {
    1.76GB     1.76GB    424:	m := &message{buf: make([]byte, bufSize)}
         .          .    425:	m.hdr = (*inHeader)(unsafe.Pointer(&m.buf[0]))
         .          .    426:	return m
         .          .    427:}
         .          .    428:
         .          .    429:func getMessage(c *Conn) *message {

remusb (Collaborator) commented Mar 20, 2018

Was it updated just before the release? It's a bit weird no one noticed this in the last betas.

A simple listing of the vendor directory shows that it wasn't.

calisro (Member, Author) commented Mar 20, 2018

I have this issue on 1.39 though.

calisro (Member, Author) commented Mar 20, 2018

rclone v1.39-211-g572ee5ecβ

ncw (Member) commented Mar 20, 2018

I used the magic of git bisect to narrow it down to this commit fc32fee

commit fc32fee4add93e7a3bda180c7c1088f93c759d13
Author: Nick Craig-Wood <nick@craig-wood.com>
Date:   Fri Mar 2 16:39:42 2018 +0000

    mount, cmount: add --attr-timeout to control attribute caching in kernel
    
    This flag allows the attribute caching in the kernel to be controlled.
    The default is 0s - no caching - which is recommended for filesystems
    which can change outside the control of the kernel.
    
    Previously this was at the default meaning it was 60s for mount and 1s
    for cmount.  This showed strange effects when files changed on the
    remote not via the kernel.  For instance Caddy would serve corrupted
    files for a while when serving from an rclone mount when a file
    changed on the remote.

Which is both good and bad.

Good because there is a workaround (--attr-timeout 1s), but bad because this is the second bug I've bisected to the same commit today.

I tried --attr-timeout 1ms and that still leaked a bit of memory, whereas --attr-timeout 1s seems OK.

I suspect this is a bug in the fuse library as rclone cmount (which is based off libfuse) doesn't show the same behaviour - it works fine with --attr-timeout 0.

I'll report a bug upstream in a moment...
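
In the meantime, a mount using the workaround would look roughly like this (a sketch only, not taken from the thread; remote:path and /mnt/point are placeholders, other flags per your own setup):

# workaround sketch: let the kernel cache attributes for 1s,
# which sidesteps the leak bisected above
rclone mount remote:path /mnt/point \
  --attr-timeout 1s \
  --buffer-size 50M \
  -v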

danielloader (Contributor) commented:

I've not seen cmount mentioned before, what's the difference?

ncw (Member) commented Mar 20, 2018

cmount is a libfuse-based mount; it works just the same. It isn't normally compiled in for Linux as it requires cgo, which would require a cross-compile environment for all the OSes. Build it in with go build -tags cmount (see the sketch below). It provides the mount command on Windows.
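
For reference, building it in looks roughly like this (a sketch; assumes a checkout of the rclone source under GOPATH, a C toolchain, cgo and the libfuse development headers):

# build rclone with the libfuse-based cmount command compiled in (needs cgo)
cd "$GOPATH/src/github.com/ncw/rclone"
go build -tags cmount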

ncw (Member) commented Mar 20, 2018

Upstream bug report: bazil/fuse#196

danielloader (Contributor) commented:

Are we likely to run into any issues at 1s, especially with vfs caching enabled and the cache layer?

calisro (Member, Author) commented Mar 21, 2018

I haven't had an OOM termination or the excessive memory usage since I've added --attr-timeout 1s.

@ncw Can you tell me what running with that timeout can negatively affect?

danielloader (Contributor) commented:

I imagine it just means that for a second at any given time there's a chance the data on the remote will be out of sync with the local view, and thus streaming the file may go awry. At least that's how I read it.

calisro (Member, Author) commented Mar 21, 2018

I'll also enable cache and make sure it's stable. I tried this before and ran into the same issue before reverting to a non-cached mount again.

ncw (Member) commented Mar 22, 2018

@Daniel-Loader wrote:

Are we likely to run into any issues at 1s, especially with vfs caching enabled and the cache layer?

Probably not. For previous rclone versions the value was 60s (or 1s on Windows).

What it means is that if you upload a new file not via the mount then there is a 1s window of opportunity for the kernel to be caching stale data. In practice what this means is that the kernel gets the length of the file wrong and you either get a truncated file or a file with corruption on the end.

1s is quite a small window and the kernel would have to be caching the inode for it to be a problem, so I think you are very unlikely to see it. In fact I didn't have any reports of problems with it at 60s, though people may not have realised there was a problem as retrying will have fixed it.

@calisro Assuming this fixes your problem, I think I'll set the default back to 1s.
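
To make the stale-attribute window described above concrete, here is a rough illustration (a sketch, not from the thread; remote:path and /mnt/point are placeholders):

# change a file on the remote directly, bypassing the mount
rclone copyto ./new-version.bin remote:path/file.bin
# then stat it through the mount straight away
stat /mnt/point/file.bin
# for up to --attr-timeout the kernel may still report the old size, so a
# read started inside that window can see a truncated or corrupted tail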

hklcf commented Mar 22, 2018

If rclone uses all of the RAM on your system, it'll crash the mounted drive, so I wrote a script to check and remount it.

if [ ! -e "$RCLONE_MOUNT_PATH/mounted" ]; then
  DOREMOUNT   # call your remount function/script here
fi

calisro (Member, Author) commented Mar 22, 2018

Still working well in both cache and non-cached mounts

neik1 commented Mar 22, 2018

I am also running into "out of memory" errors. Unfortunately the debug log doesn't show much information.
It is happening with rclone v1.39-155-g62e72801β and 1.40 stable as well. I'm using the following mount command:

rclone_1.40 mount --log-level DEBUG --log-file /tmp/log.txt --allow-non-empty --allow-other --buffer-size 32M --dir-cache-time 8760h --poll-interval 24h --vfs-cache-mode writes --attr-timeout 1s --vfs-cache-max-age 0m upload: /media/cry

Edit: Usually I do not use --attr-timeout 1s, but after reading that it solved the problems for calisro I gave it a try, without success.

ncw (Member) commented Mar 22, 2018

@neik1 what is your access pattern to cause this problem? Can you show me how to replicate it? Can you post a log somewhere?

hklcf commented Mar 22, 2018

[memory usage graph]

Heavy I/O test: when rclone uses roughly 90% of system RAM, the mounted drive crashes and I can't read from or write to it.

neik1 commented Mar 22, 2018

@ncw: Well, my setup is running rclone + Emby on Ubuntu (x64). I restarted both yesterday evening, and today at about 11AM it crashed with the OOM error. In between there was a library scan (to check for updated files) and 2 files were streamed successfully. On the third one it crashed. There was never more than one user streaming simultaneously.

Log -> https://1drv.ms/t/s!AoPn9ceb766mgYspfoZqWaZExTlsEQ

After analysing the log a bit I was surprised that there are over 30 lines saying "open file" (see line 13 of the log).

@hklcf, this probably isn't the right place, but anyway: what are you using to get such a nice graph of your memory usage?

hklcf commented Mar 22, 2018

@neik1 This chart is provided by the DigitalOcean control panel. NetData is also a good choice :blush:

ncw (Member) commented Mar 22, 2018

@neik1 thanks for the log :-)

You have 95 goroutines running, which seems like a fair number...

I think you have 22 open files. Each of these may be using --buffer-size 32M of memory, so that is 700M accounted for. How much memory does your machine have?

Could you try --attr-timeout 60s and see if that helps?

If you can run rclone with --memprofile mem.prof, quit the mount before it crashes, and then attach the mem.prof along with the rclone binary you used in a .zip file, I can analyse it and produce one of those graphs above.
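
For anyone who wants to inspect such a profile themselves, the analysis is roughly as follows (assuming Go is installed; the binary and profile names are examples):

# interactive inspection of the heap profile against the exact binary used
go tool pprof rclone mem.prof
# inside pprof: top, list allocMessage, web, ...

# or render a graph like the one earlier in the thread (needs graphviz)
go tool pprof -png rclone mem.prof > profile.png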

ncw (Member) commented Mar 22, 2018

Here is my cheap and nasty memory monitoring tool

#!/bin/sh
# Monitor memory usage of the given command

COMMAND="$1"

if [ "$COMMAND" = "" ]; then
   echo "Syntax: $0 command"
   exit 1
fi

headers=""
while [ 1 ]; do
    ps $headers -C "${COMMAND}" -o pid,rss,vsz,cmd
    sleep 1
    headers="--no-headers"
done

Run it as memusage rclone and it will show the memory of any rclone processes that are running, writing a log line every second.
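
If you want it to keep logging after you log out (an aside, not part of the original comment), something like nohup works:

# keep the monitor running after the SSH session closes (names are examples)
nohup ./memusage.sh rclone > /tmp/mem.log 2>&1 &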

neik1 commented Mar 22, 2018

@ncw, how come I have 22 open files when I am just streaming one specific file?
My machine has just 2GB of memory.

Stopping rclone right before it crashes will be difficult because my sister and my parents stream to their devices as well, and I can't monitor it 24/7. Is it possible to create a script (or something similar) that monitors rclone and, if it exceeds let's say 500-600MB, kills it to create the memprofile?

I restarted all the processes and put --attr-timeout 60s --memprofile /tmp/mem.prof and removed --buffer-size 32M.

Sorry for the stupid question, but what exactly do I need to do with that script? Just save it and run it like ./script.sh?
Edit: Running it like that leads to -> Syntax: mem.sh command

memusage rclone shows me that -> -bash: memusage: command not found
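
For the record, a watchdog along the lines asked about above could look roughly like this (a hypothetical sketch, not from the thread; the threshold, process name and mount point all need adjusting, and it assumes rclone writes the --memprofile file when the mount shuts down cleanly):

#!/bin/sh
# unmount rclone once its resident memory passes a threshold so that
# --memprofile gets a chance to write its profile on a clean exit
LIMIT_KB=614400              # ~600MB
MOUNTPOINT="/media/cry"      # mount point from the command above

while true; do
    RSS=$(ps -C rclone_1.40 -o rss= | sort -n | tail -1)
    if [ -n "$RSS" ] && [ "$RSS" -gt "$LIMIT_KB" ]; then
        fusermount -u "$MOUNTPOINT"   # triggers a clean shutdown of the mount
        break
    fi
    sleep 10
done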

hklcf commented Mar 22, 2018

@neik1 save the script to memusage.sh and run ./memusage.sh rclone

danielloader (Contributor) commented Mar 22, 2018

Worst case, if you can't get memory usage to behave well with 2GB, try mounting with the --cache-chunk-no-memory flag. There is a performance cost, but if you have more local storage space than RAM, which is usually a given, it may help. (Well, it depends; on an SSD I'd imagine the chunks are network limited rather than storage limited.)
I'd rather wait another second or so for a file to start playing on Plex than worry about OOM killing the mount.

neik1 commented Mar 22, 2018

@hklcf, thanks that helped. Set it up and it's now running.

@Daniel-Loader: Hi Daniel,
so, basically you are telling me 2GB isn't enough for rclone? I assumed rclone would buffer the 32MB + a bit overhead maybe and that would be it.
Anyway, thanks for that hint. Will give it a try if the problem persists.

seuffert (Contributor) commented Mar 22, 2018

I assumed rclone would buffer the 32MB + a bit overhead maybe and that would be it.

--buffer-size is per open file on a mount

neik1 commented Mar 22, 2018

OK! But how is it possible that there were so many files open, leading to that issue, when there was only one stream going? That's the point I do not understand yet.

danielloader (Contributor) commented Mar 22, 2018

Well, if there's a library scan and it's not doing it sequentially, it might be doing mediainfo lookups on multiple files at once. You could try changing the buffer size down to 8MB and see if it retains enough performance. Alternatively, as said, write the 32MB buffers to disk as a swap of sorts.

I know on Plex you can opt out of chapter/thumbnail creation, which would read the whole file on scans; can Emby opt out too?

neik1 commented Mar 22, 2018

With this adapted mount command it crashed again:

rclone_1.40 mount --log-level DEBUG --log-file /tmp/log.txt --allow-non-empty --allow-other --dir-cache-time 8760h --poll-interval 24h --vfs-cache-mode writes --attr-timeout 60s --memprofile /tmp/mem.prof --vfs-cache-max-age 0m upload: /media/cry
(The difference from the first is that --attr-timeout was set to 60s, --memprofile was added and --buffer-size 32M was removed.)

Log -> https://1drv.ms/t/s!AoPn9ceb766mgYsqBGX5KjSKxFVeiA
During the crash there was only one stream ongoing and nothing else.
Unfortunately, the memusage script apparently stopped when the SSH connection disconnected. How can I start it so that it stays alive? I used ./memusage.sh > /tmp/mem.log &

My next step will be trying to avoid the memory cache entirely with --cache-chunk-no-memory
Edit: Is this an rclone cache specific flag? I am not using cache!

@Daniel-Loader, I am not using chapter/thumbnail creation. The library scan for new files takes about 8 minutes (scraping of new files included), so it's actually pretty fast, I suppose.

danielloader (Contributor) commented:

Yeah, that's incredibly fast for a non-cached remote media scan, depending on library size!

hklcf commented Mar 23, 2018

@neik1 Use screen so it survives the SSH disconnect.

neik1 commented Mar 23, 2018

Yeah, well... In the end it doesn't seem to be an rclone issue. In my case it seems to be an Emby problem with a specific client.
Going to try to narrow it down with the developers over there and will come back to report. ;-)

Edit: I am saying this because Emby crashed again, but this time rclone didn't (probably because of the --cache-chunk-no-memory flag) and was only using 80MB of memory.

ncw closed this as completed in 98a9246 Mar 24, 2018
ncw (Member) commented Mar 24, 2018

I've committed a fix to change the default --attr-timeout to 1s and made notes in the docs about what happens when you set it to 0s.

Hopefully upstream will come up with a fix eventually, but this will do for the moment.

ncw (Member) commented Mar 24, 2018

This will be in

https://beta.rclone.org/v1.40-018-g98a92460/ (uploaded in 15-30 mins)
