
Weird behavior during rsync and failure #100

Closed
mathieuruellan opened this issue Oct 12, 2016 · 16 comments

@mathieuruellan

mathieuruellan commented Oct 12, 2016

Hello, I'm using the latest version of svfs (0.8.2, i386) on Debian stable.
I mount hubiC via FUSE like this:

hubicfuse on /mnt/hubic type fuse.svfs (rw,nosuid,nodev,relatime,user_id=1000,group_id=100,default_permissions,allow_other,user=mathieu)

I'm doing an rsync like this:

rsync -rtW --inplace --progress /mnt/readynas/backup/biniou/2016.10/ /mnt/hubic/backup/biniou/2016.10/

The command fails. After a timeout error (visible in debug mode), it continues at an implausible upload rate (my bandwidth is capped at 600 KB/s).

DEBU[2016-10-12T18:14:12+02:00] -> [ID=0xc43] Write 131064           source=fuse
DEBU[2016-10-12T18:14:12+02:00] <- Write [ID=0xc44 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 131072 @268042232 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:12+02:00] -> [ID=0xc44] Write 131072           source=fuse
DEBU[2016-10-12T18:14:12+02:00] <- Write [ID=0xc45 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 8 @268173304 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:12+02:00] -> [ID=0xc45] Write 8                source=fuse
DEBU[2016-10-12T18:14:12+02:00] <- Write [ID=0xc46 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 131064 @268173312 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:13+02:00] -> [ID=0xc46] Write 131064           source=fuse
DEBU[2016-10-12T18:14:13+02:00] <- Write [ID=0xc47 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 131072 @268304376 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:13+02:00] -> [ID=0xc47] Write 131072           source=fuse
DEBU[2016-10-12T18:14:13+02:00] <- Write [ID=0xc48 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 8 @268435448 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:13+02:00] -> [ID=0xc48] Write 8                source=fuse
DEBU[2016-10-12T18:14:13+02:00] <- Write [ID=0xc49 Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 131064 @268435456 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:32+02:00] -> [ID=0xc49] Write error=EIO: Timeout when reading or writing data  source=fuse
DEBU[2016-10-12T18:14:32+02:00] <- Flush [ID=0xc4a Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x1 fl=0x0 lk=0x2bd9a4e794970fc6  source=fuse
DEBU[2016-10-12T18:14:32+02:00] -> [ID=0xc4a] Flush                  source=fuse
DEBU[2016-10-12T18:14:32+02:00] <- Release [ID=0xc4b Node=0x5 Uid=0 Gid=0 Pid=0] 0x1 fl=OpenReadOnly rfl=0 owner=0x0  source=fuse
DEBU[2016-10-12T18:14:32+02:00] -> [ID=0xc4b] Release                source=fuse
DEBU[2016-10-12T18:14:32+02:00] <- Write [ID=0xc4c Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 131064 @268435456 fl=WriteLockOwner lock=3159737928114442182 ffl=OpenWriteOnly  source=fuse
DEBU[2016-10-12T18:14:47+02:00] -> [ID=0xc4c] Write error=EIO: Timeout when reading or writing data  source=fuse
DEBU[2016-10-12T18:14:47+02:00] <- Flush [ID=0xc4d Node=0x5 Uid=1000 Gid=100 Pid=7479] 0x2 fl=0x0 lk=0x2bd9a4e794970fc6  source=fuse
DEBU[2016-10-12T18:14:47+02:00] -> [ID=0xc4d] Flush                  source=fuse
    268,795,904  25%   79.31kB/s    2:49:08  DEBU[2016-10-12T18:14:47+02:00] <- Release [ID=0xc4e Node=0x5 Uid=0 Gid=0 Pid=0] 0x2 fl=OpenWriteOnly rfl=0x2 owner=0xc8253648  source=fuse
DEBU[2016-10-12T18:14:47+02:00] -> [ID=0xc4e] Release                source=fuse
  1,073,741,824 100%    2.05MB/s    0:08:18 (xfr#1, to-chk=61/63)
2016.10.01.FULL.10.dar
  1,073,741,824 100%   34.95MB/s    0:00:29 (xfr#2, to-chk=60/63)
2016.10.01.FULL.11.dar
  1,073,741,824 100%   36.16MB/s    0:00:28 (xfr#3, to-chk=59/63)
2016.10.01.FULL.12.dar
  1,073,741,824 100%   35.74MB/s    0:00:28 (xfr#4, to-chk=58/63)
2016.10.01.FULL.13.dar
  1,073,741,824 100%   35.59MB/s    0:00:28 (xfr#5, to-chk=57/63)
2016.10.01.FULL.14.dar
  1,073,741,824 100%   38.07MB/s    0:00:26 (xfr#6, to-chk=56/63)
2016.10.01.FULL.15.dar
  1,073,741,824 100%   39.27MB/s    0:00:26 (xfr#7, to-chk=55/63)
2016.10.01.FULL.16.dar
  1,073,741,824 100%   38.56MB/s    0:00:26 (xfr#8, to-chk=54/63)
2016.10.01.FULL.17.dar
  1,073,741,824 100%   39.44MB/s    0:00:25 (xfr#9, to-chk=53/63)


...

DB.PIWIGO.2016.10.12.sql.gz
             20 100%    0.03kB/s    0:00:00 (xfr#62, to-chk=0/63)
rsync: write failed on "/mnt/hubic/backup/biniou/2016.10/2016.10.01.FULL.1.dar": Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.1]
@mathieuruellan changed the title from "Weird behavior during rsync. failure" to "Weird behavior during rsync and failure" on Oct 12, 2016
@xlucas
Contributor

xlucas commented Oct 12, 2016

Hello, with hubiC you should try lowering the default segment_size mount option to something like 64 (MB).

What happens here is: once the uploaded size reaches 256 MB (the default segment size), SVFS moves the uploaded part to a segment container server-side and creates a manifest referring to it before continuing to transfer data. When this happens, you will see a noticeable pause in the upload: that's the time taken for the data to be copied elsewhere. Unfortunately, such operations are slow at the moment, so a timeout is reached before completion.

Lowering this value at mount time will help these operations complete faster.
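
For illustration, such a mount could use an fstab entry in the style of the one posted further down in this thread (credentials and paths are placeholders), with segment_size added to the options:

hubicfuse /mnt/hubic svfs hubic_auth=XXXXXXXXX,hubic_token=XXXXXXXX,container=default,allow_other,default_permissions,uid=mathieu,user,segment_size=64 0 0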

@xlucas added the usage label on Oct 12, 2016
@mathieuruellan
Author

That workaround is not really restrictive (on the customer side).
Thank you.

@romanrm

romanrm commented Oct 13, 2016

Still, it is strange that rsync proceeds at dozens of MB/s after getting the failure (and apparently copies into nowhere). I have encountered this too.

@mathieuruellan
Author

It's not better with segment_size=64 :(

Here is my fstab line:
hubicfuse /mnt/hubic svfs hubic_auth=XXXXXXXXX,hubic_token=XXXXXXXX,container=default,allow_other,default_permissions,uid=mathieu,user,extra_attr,segment_size=64 0 0

@xlucas
Contributor

xlucas commented Oct 21, 2016

What if you lower it even more? Does the problem disappear at some point, or not at all?

@romanrm

romanrm commented Oct 21, 2016

I would suggest also testing another approach, i.e. not using segments at all by setting segment_size to 5120 (and not uploading files larger than 5 GB).
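
For instance, reusing the fstab entry from the earlier comment (credentials are placeholders), that would look like:

hubicfuse /mnt/hubic svfs hubic_auth=XXXXXXXXX,hubic_token=XXXXXXXX,container=default,allow_other,default_permissions,uid=mathieu,user,extra_attr,segment_size=5120 0 0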

@mathieuruellan
Author

Thank you!
I will try both suggestions and give you feedback afterwards!

@mathieuruellan
Author

Well, I've been trying with segment_size set to 5120 since this morning, and no failure so far.

@mathieuruellan
Author

Failure after 20h of upload.

@xlucas
Contributor

xlucas commented Nov 7, 2016

Were your files all < 5 GB?

@mathieuruellan
Author

Yes! My backup files are split into 1 GB files.

@mathieuruellan
Author

mathieuruellan commented Nov 7, 2016

Since the last settings change, it's really better: it still crashes, but less often.

@xlucas
Contributor

xlucas commented Nov 22, 2016

At this point, I'm afraid the issue is more related to the current performance of the hubiC infrastructure than to a problem in svfs.

@ghost

ghost commented Nov 30, 2016

How can we get rsync to hard-fail in this situation? The I/O error makes rsync return an error exit code, but it still keeps pushing data to nowhere until there is nothing left to copy.

For example, if rsync is used in a script to copy 1 TB of data and an error occurs after copying 1 GB, it will not stop: it will keep copying the remaining 999 GB (which consumes time and I/O resources) to what is effectively /dev/null, before realising something went wrong and eventually trying again.

Does this come from rsync (which would need an option to stop properly) or from svfs?

@xlucas
Contributor

xlucas commented Jan 6, 2017

One way to get around this would be to poll for an "object moved" event after a COPY request is sent to Swift, and let the user control the polling timeout with an option.
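
A minimal sketch of that idea in Go, assuming a hypothetical segmentCopied helper (not svfs's actual API) that asks Swift whether the copied object is visible yet; the polling timeout and interval would be the user-controlled options:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errCopyTimeout = errors.New("timed out waiting for segment copy")

// segmentCopied is a hypothetical helper: a real implementation would issue
// a HEAD request to Swift to check whether the copied object exists yet.
func segmentCopied(container, object string) (bool, error) {
	return true, nil // stub for illustration
}

// waitForCopy polls after a COPY request until the destination object shows
// up or the user-configurable pollTimeout expires.
func waitForCopy(container, object string, pollTimeout, pollInterval time.Duration) error {
	deadline := time.Now().Add(pollTimeout)
	for time.Now().Before(deadline) {
		ok, err := segmentCopied(container, object)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		time.Sleep(pollInterval)
	}
	return errCopyTimeout
}

func main() {
	// Example: wait up to 5 minutes, checking every 2 seconds (names are illustrative).
	if err := waitForCopy("default_segments", "backup/part-0001", 5*time.Minute, 2*time.Second); err != nil {
		fmt.Println("copy not confirmed:", err)
	}
}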

@xlucas
Contributor

xlucas commented Feb 28, 2017

The issue actually comes from the underlying Swift library SVFS uses.

In particular, SVFS exposes several options coming from this layer as mount options, such as connect_timeout, request_timeout, user_agent, etc.

It turns out that the (very) badly named connection timeout of this library is actually a timeout covering the entire request processing (from connecting to sending the request, getting a reply, etc.), while the equally badly named request timeout is only used while reading a request body.

So, for instance, passing connect_timeout=5m should fix the issue.
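
For example, appending it to the fstab entry shown earlier in this thread (credentials are placeholders) would give:

hubicfuse /mnt/hubic svfs hubic_auth=XXXXXXXXX,hubic_token=XXXXXXXX,container=default,allow_other,default_permissions,uid=mathieu,user,extra_attr,segment_size=64,connect_timeout=5m 0 0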

I'm working on submitting a PR to rename these timeouts and separate connection from processing.
In the meantime, the SVFS default for the connect_timeout mount option will be changed from 15s to 5m.
