Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsync hangs on particular file #2138

Open
Wishmesh opened this issue May 21, 2017 · 111 comments
Open

rsync hangs on particular file #2138

Wishmesh opened this issue May 21, 2017 · 111 comments
Labels

Comments

@Wishmesh
Copy link

  • Your Windows build number: Microsoft Windows [Version 10.0.15063] + BOW 16.04 (reinstalled today)

  • What you're doing and what's happening: I am trying to rsync folders between two Windows (both BashOnWindows). I am using following commands:

rsync --archive --stats --progress --delete /mnt/u/1-folder /mnt/u/2nd-folder some_name@192.168.99.17:/mnt/u/2-folder/
rsync -rl --progress --delete ...
rsync -rl --delete ...
  • What's wrong / what should be happening instead: It works almost flawlessly. But sometimes (at least 5 different cases) it stucks on particular file. When stopped (Ctrl+C) and resumed, it always hangs on the same file. Usually it hangs only on .pch and .pdb files (about 20MB). See the example below. I stop it with Ctrl+C (see the ^C).
        560,502 100%    1.63MB/s    0:00:00 (xfr#78, ir-chk=1056/39982)
some-path/more-path/cl.command.1.tlog
          6,060 100%   17.83kB/s    0:00:00 (xfr#79, ir-chk=1055/39982)
some-path/more-path/some-file.Build.CppClean.log
          1,817 100%    5.33kB/s    0:00:00 (xfr#80, ir-chk=1054/39982)
some-path/more-path/some-file.pch
         32,768   0%   95.81kB/s    0:00:46  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.0]
[sender] io timeout after 60 seconds -- exiting

I tried to wait for more than 24h... it stays on the one file.

Also, strange thing is:

  • it does not fail, if I move files around to different folders
  • as a workaround I can move offending files one by one with rsync (then rsync works again)
  • Strace of the failing command, if applicable: Ohhh.. Sorry. Tried to reproduce. But I can't. Because I moved files around. If you can not resolve this without strace, then I will wait till the next time and will provide strace here in comments.
    (If <cmd> is failing, then run strace -o strace.txt -ff <cmd>, and post the strace.txt output here)
@sunjoong
Copy link

@Wishmesh - What name and path name of that particular file? I had a strange experience with file named "ext" and directory named "ext"; My anti-virus blocked that file and directory.

@Wishmesh
Copy link
Author

@sunjoong - no. Anti-virus have these folders excluded. Path looks like:

X:\1-prj\some\simple\folders\here\Release_x64\appname.pch
or
X:\1-prj\some\simple\folders\here\Release_x64\appname.pdb

There were other extensions too, but I do not remember of what type they was...

@therealkenc
Copy link
Collaborator

therealkenc commented May 23, 2017

Unfortunately I can't repro here. Strange files to be having trouble with. Their size isn't that special, nor should that be a problem anyway. TCP streams are very reliable on WSL these days. Any chance those files are open in something (anything) in Windows? Or special in any way you can think of, versus the other files in your tree?

One thing you might try is to do it old school and see what happens:

tar cpf - /mnt/u/somewhere | (ssh some_name@192.168.99.17 "cd remote_dir; tar xpf -")

That doesn't get you all the rsync goodness, but it will help eliminate filesystem and networking problems. If that works, try rsync without the --delete flag and see if that helps. Otherwise, shrug 🤷.

@Wishmesh
Copy link
Author

Wishmesh commented May 23, 2017

Any chance those files are open in something (anything) Windows?

No. It stuck on the same file again, and again. Even after reboot. Antivirus disabled for that drive.

Ok. Will try out your suggestions next time it happens.

@therealkenc
Copy link
Collaborator

therealkenc commented May 23, 2017

It's not antivirus; and even if it were, those files still aren't very special compared to all the object files you must have in there. I rsync around enormous trees (like chromium) somewhat regularly, but that is on VolFS (/home/me). I rsynced a random Visual Studio project directory DrvFS to DrvFS (granted localhost to localhost) trying to repro just before I posted but it went fine. Anyway, bonne chance.

@Wishmesh
Copy link
Author

I have 2 folders (about 10GB):

  1. 36193 files and 6142 folders
  2. 33734 files and 5975 folders

Nothing extraordinary. And "file being used / not accessible" is less probable, because.... When it stuck, I can copy the same stuck file using the same rsync syntax, but specifying filename -- syncing by one file works... and then after that, again rsync works for some days or weeks without any problem.

@therealkenc
Copy link
Collaborator

...for some days or weeks without any problem

Yeah that's going to be a bitch to track. So obviously it's going to work for the one-off test I did. Sigh...

That the fail is sticky even after reboot is very relevant. That means it pretty much has to be a "filesystem thing", but I couldn't speculate what exactly. Appreciate you reporting though, so at least it is on the books.

@Wishmesh
Copy link
Author

That means it pretty much has to be a "filesystem thing"

But why rsync then works if I sync the same files one by one? And both - source and destination FS are NTFS, if that matters.

@therealkenc
Copy link
Collaborator

therealkenc commented May 23, 2017

But why rsync then works if I sync the same files one by one?

Dunno, but it pretty much has to be a "filesystem thing" by deductive reasoning. Because your DrvFS (NTFS) filesystem is the only thing that can hold the sticky failure state across a reboot. Your sync one-by-one fix is, for whatever reason, getting things back into a sane state. Next time it fails, try deleting the problematic files on the server that you are pushing to and doing a full sync, instead of syncing them one at a time.

Handwaving here, the whole process of deleting files on DrvFS is not straightforward, because delete semantics are different in Unix and Windows. That why the --delete option is a red flag. Or, at least for lack of a better working theory atm.

@Wishmesh
Copy link
Author

Wishmesh commented May 24, 2017

Yeah that's going to be a bitch to track.

Let us try....

  1. Today I tried hard to reproduce and I succeeded

Try 1
some/folders/sub/folders/src/Debug/rc.read.1.tlog
          2,896 100%   10.79kB/s    0:00:00 (xfr#24, ir-chk=1011/41352)
some/folders/sub/folders/src/Debug/rc.write.1.tlog
            250 100%    0.93kB/s    0:00:00 (xfr#25, ir-chk=1010/41352)
some/folders/sub/folders/src/Debug/vc100.idb
          1,200   0%    4.47kB/s    0:05:24

Try2
some/folders/sub/folders/src/Debug_x64/some-name.log
          3,331 100%   28.04kB/s    0:00:00 (xfr#25, ir-chk=1034/41390)
some/folders/sub/folders/src/Debug_x64/some-name.obj
      1,397,284 100%   11.01MB/s    0:00:00 (xfr#26, ir-chk=1033/41390)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%   85.17kB/s    0:05:33

Try3
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%    0.00kB/s    0:00:00
  1. The strange thing is, it sometimes shows 5+ minutes required transfer time almost immediately.

  2. When it stuck at some-name.pch, I checked MD5 of file on client and server -- they differ.

  3. When I said that state remains after reboot I somehow forgot about server state. I blamed client. So now I rebooted the server too -- both computers rebooted. So now it is confirmed -- it persists after reboot of both -- server and client.

sending incremental file list
deleting some/folders/sub/folders/src/some-name.opensdf
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%   26.62MB/s    0:00:01 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%   51.02kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%   56.04kB/s    0:08:25  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Checked offending file on the server. The file some-name.pch is missing on the server.

  2. Trying rsync again. It still stuck on the same file (even it is not on the server):

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  303.08MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  187.97kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%  233.58kB/s    0:02:01  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Notice that before the some-name.pch file it synced .sdf and .suo file again. I am not deeply familiar with rsync. It writes multiple files at once? Why send .sdf and .suo again, when they were synced last time successfully?

  2. Deleting entire folder on the server: some/folders/sub/folders/src/Debug_x64

  3. Trying rsync again:

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  305.40MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  185.19kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/Base64.obj
         10,358 100%   11.60kB/s    0:00:00 (xfr#3, ir-chk=1045/41389)
some/folders/sub/folders/src/Debug_x64/CL.read.1.tlog
         45,746 100%   12.38MB/s    0:00:00 (xfr#4, ir-chk=1044/41389)
some/folders/sub/folders/src/Debug_x64/CL.write.1.tlog
          3,618 100%  588.87kB/s    0:00:00 (xfr#5, ir-chk=1043/41389)
some/folders/sub/folders/src/Debug_x64/**********.obj
         70,915 100%    1.50MB/s    0:00:00 (xfr#6, ir-chk=1042/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
         15,243 100%  236.28kB/s    0:00:00 (xfr#7, ir-chk=1041/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
^Crsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

Now it stuck on the some.obj file.

  1. Ctrl+C, try again... again stuck on the same.obj file...

  2. Remembered about strace...

strace -o strace.txt -ff rsync with params

Got 2 strace files. Unfortunately I cannot share them here as is... But will provide you with parts I you will need.

First file seems to come from ssh. 178974 bytes long.

dup2(3, 0)                              = 0
close(4)                                = 0
close(5)                                = 0
dup2(6, 1)                              = 1
close(3)                                = 0
close(6)                                = 0
fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(0, F_SETFL, O_RDWR)               = 0
execve("/usr/local/sbin/ssh", ["ssh", "-l", ....
execve("/usr/bin/ssh", ["ssh", "-l", .....
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd49ca20000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
...
... skip here ...
...
clock_gettime(CLOCK_BOOTTIME, {2396, 116297000}) = 0
read(3, "\216k__\225G!\263\265G}\374T\256\253\346\233\373>\305z][a\rb\32\211T\20\270\300"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 116528000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 116644000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 116915000}) = 0
read(3, "\303\220\n\231\212\234\355+\277\34n5\\^\21O\326\366\324\372N\0\225\276n\313\355U\357\347`\335"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117163000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 117278000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 117552000}) = 0
write(5, "\374\377\0\7,K\252G\377\n\260\23[\24DW@\312\233#\244\216\241\333\2300e\334L\24\212_"..., 16384) = 16384
read(3, "2\35\202\326T\32$F\276i\324\327k[\312\256\362\224/\255/z\247\17\243\261\346)\3464\312\236"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117901000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118047000}) = 0
select(7, [3 4], [3], NULL, NULL)       = 2 (in [3], out [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 118324000}) = 0
read(3, "\5\260\374\270\275\305\350\270\355E^|\243Ku\216\177\3\266&\317@\10\335t\362\261kP\360Ts"..., 8192) = 8192
write(3, "7\2571u\265?\347\4\212Ig]\t\21<JC\332.\243\206\257k\246\216\266\252\5\v\r'N"..., 36) = 36
clock_gettime(CLOCK_BOOTTIME, {2396, 118728000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118843000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 119116000}) = 0
write(5, ";aW|v\223Yu\26\253w0\334\330\212\344\205\273\246^\370\"x@13\327<\240.G<"..., 16384) = 16384
read(3, "\376i\255\251\245\361ft\321\325\351_\264\337\333\361\363W`\37\235\361\267\3709\216\335\4\357\313H\336"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 119464000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 119579000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 119841000}) = 0
read(3, "\340\305\217i\325f2\267\246\211?Q\2gm\33\246\325\f\35\365\241d\322\255\370\300\275$\312\275D"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 120085000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 120200000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 120474000}) = 0
write(5, "#\3W+\246\246FS\23\330\253?\207\1u\323\253\351\2563h3\26\301\3171\317}\244\352\212-"..., 16384

2nd file from rsync. 9277601 bytes long:

execve("/usr/bin/rsync", ["rsync", "--archive", "--stats", "--progress", "--delete", ......
brk(NULL)                               = 0x7fffe73de000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f993b200000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21423, ...}) = 0
mmap(NULL, 21423, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f993b206000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\20\0\0\0\0\0\0"..., 832) = 832
...
...
...
open("some/folders/sub/folders/src/Debug_x64/*********.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=70915, ...}) = 0
write(1, "some/folders/sub/*******"..., 64) = 64
read(3, "d\206\265\0\225C%Yj\256\0\0\317\2\0\0\0\0\0\0.drectve\0\0\0\0"..., 70915) = 70915
gettimeofday({1495618771, 246185}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         32,768  46%   50.71kB/"..., 46) = 46
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999999})
read(5, "\rT\v\320\16u\203&#0\25\352\353\313\204#\5x;", 19) = 19
write(4, "Q\216\0\7\377\376^\0S\0:\0\\\0001\0-\0P\0R\0J\0\\\0S\0O\0F\0"..., 36437) = 36437
gettimeofday({1495618771, 246814}, NULL) = 0
gettimeofday({1495618771, 246891}, NULL) = 0
gettimeofday({1495618771, 246961}, NULL) = 0
write(1, "\r         70,915 100%  109.58kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/*******.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=15243, ...}) = 0
write(1, "some/folders/sub/*****"..., 65) = 65
read(3, "d\206\17\0\225C%Yf)\0\0f\0\0\0\0\0\0\0.drectve\0\0\0\0"..., 15243) = 15243
gettimeofday({1495618771, 247785}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         15,243 100%   23.52kB/"..., 46) = 46
gettimeofday({1495618771, 248074}, NULL) = 0
write(1, "\r         15,243 100%   23.52kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/some.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=1071815, ...}) = 0
write(1, "some/folders/sub/********"..., 61) = 61
read(3, "d\206s\6\225C%Y\206\321\t\0\343\36\0\0\0\0\0\0.drectve\0\0\0\0"..., 262144) = 262144
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999998})
read(5, "\32\371\177\2075\240'\347\247\223\245\2647\310\252\372\301\251\205\303T\324%R=\253\n\257\26\251\371\235"..., 38) = 38
write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484
  • "******" and "...." means, I obscured some names/parts.
  1. The client and the server is now in the state, where I can reliably repeat the hang :)
    So if I can somehow help further, just give me instructions.

@Wishmesh
Copy link
Author

And one more thing. I am starting to suspect that it is hanging only on the files, that are created with Visual Studio that is running as Elevated process under admin privileges. But it is not easy to reproduce, because in most time it sync these files easily.

@sunjoong
Copy link

@Wishmesh - I don't know why that happened (hmm.... perhaps timestamp or file closing problem??), but remember you said;

I am trying to rsync folders between two Windows

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

@therealkenc
Copy link
Collaborator

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

Those writes are likely blocking because the send buffers are full. Which means it is the server (receiving) side where you are going to need to do the strace. Which is going to be a [. . .]

Visual Studio that is running as Elevated process under admin privileges.

That could matter. If you insist on running VS as admin, then you are probably going to want to try running bash with elevated privileges too; assuming you are not already. That might not (probably won't) unstick a stuck rsync, but, spitballing, it might avoid whatever is getting you in the state in the first place. That goes double if you are accessing those files with elevated privilege on the server as well, as opposed to just pushing them there for backups.

  1. Checked offending file on the server. The file some-name.pch is missing on the server.

For giggles take a peek in %appdatalocal%/lxss/temp (look but don't touch) on the server and see if there is anything untoward in there. See #1940. You have a totally unrelated scenario, of course, but you do have a situation with deletes going on under failure conditions. Failure conditions, which are probably uncommon enough to explain why you go days or weeks without seeing a problem.

Hopefully the devs take a look. Even more hopefully the problem just gets fixed magically with ongoing improvements, which happens a lot around here.

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

Well yeah. Or Cygwin rsync. But where would be the fun in that.

@Wishmesh
Copy link
Author

Robocopy isn't an option, because I need to sync files to non-Windows machines too.
Cygwin sucks. Bash on Windows is king 👍
And of course I can workaround the problem, but fix of this issue would be better ;)

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000000_devzero 0000000000000001_unlink 0000000000000002_unlink 0000000000000003_unlink 0000000000000004_unlink 0000000000000005_tmpfs 0000000000000006_unlink 0000000000000007_tmpfs 0000000000000008_unlink 0000000000000009_tmpfs 000000000000000a_unlink 000000000000000b_tmpfs 000000000000000c_unlink 000000000000000d_tmpfs 000000000000000e_unlink 000000000000000f_unlink 0000000000000010_unlink

Also, tried to run Bash as Elevated admin. It still stops on the same file. Cannot say if running as elevated would avoid getting stuck in the first place.

On server files are not touched / accessed. Like in the backup.

@sunjoong
Copy link

sunjoong commented May 24, 2017

@Wishmesh

Cygwin sucks. Bash on Windows is king 👍

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

@therealkenc
Copy link
Collaborator

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000001_unlink .... etc

What are the size of those files? Or more specifically, do they look familiar in the context of your errant files (pch etc).

On server files are not touched / accessed. Like in the backup.

Wish you wouldn't have said that. That negates most theories related to elevated privileges. You client strace looks clean. You don't have any of the usual fs related gotchas there. Since you are blocking on a network write, by definition either the server isn't reading or the bytes, or the bytes didn't get there in the first place.

If your use case is backups, and there aren't other circumstances (like not having space on C:), you might want to push to /home or /var on the server. Eliminates a variable, anyway.

@sunjoong
Copy link

@Wishmesh - I think @therealkenc points good one. You might sync files under /mnt/<dir> directory. What if under /home/<dir> or /var/<dir>?

@Wishmesh
Copy link
Author

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

  1. Unable to compare with Msys2.
  2. Of course VS closed. As I said. Full reboot for both - client and server. Nothing running except services... Start Bash... run rsync... hung...

What are the size of those files?

These are folders in the temp directory... I said - nothing unusual, because they are identical to what I see in other unrelated to this problem PC.

About syncing to /home ... just tried the same folder and of course rsync works. Remember I said in the first posts... that if I try to rsync individual files (that are stuck), then it unstacks... and everything works again from this point.

So I currently have stuck the following folder:
some/folders/sub/folders/src/Debug_x64/some-file.obj
on the file some-file.obj

I just did rsync to /home/user/testing ... and it works
I also tried to rsync to /mnt/drive/some-other-path ... and of course it works.
Remember, it need many tries to get to the "stuck state" -- some days or weeks.

After testing these I run original rsync command:
rsync /mnt/u/some/ ...
And it stuck on the same file:
some/folders/sub/folders/src/Debug_x64/some-file.obj

So I suspect that if I will run:
rsync /mnt/u/some/folders/ ...
or
rsync /mnt/u/some/folders/sub/ ...
... it will work... and the stuck state would be gone. Because rsync on individual file unstacks...

Do you have other ideas to try, before I issue rsync with "deeper" path?

@sunjoong
Copy link

sunjoong commented May 24, 2017

@Wishmesh - I could not figure it out why that happen but... think there are still many traps in using DrvFS, i.e., /mnt/<dir>; DrvFS does not fully compatible with VolFS, i.e., /home/<dir> or /var/<dir>, and rsync of WSL is a real linux binary and considered to be run within linux filesystem (in this case, VolFS), so it might not fully compatiable with DrvFS yet. However, Cygwin or Msys2 use (modified) windows program (and considered to be able to be run within Windows filesystem); They don't have the same concept of DrvFS and VolFS (but they use another approach.)

I thought, if you could fix the point of condition when that problem occurs, i.e., if you could reproduce it when you want it, you could compare it with non DrvFS case. But... you look like to say it's impossible.

And... I don't understand @therealkenc's saying enogh but... guess it might be about sort of limits during programs running. You know, there are some directories in %localappdata%\lxss\temp; These directoris are changing contents during programs running.

@Wishmesh
Copy link
Author

I couldn't resist to try.... So I have one file stuck:
/mnt/x/1/2/3/4/5/some-file.ext

  1. running rsync /mnt/x/1 user@IP:/mnt/x/abc/
    result -- hang

  2. running rsync /mnt/x/1/2 user@IP:/mnt/x/abc/1/
    result -- hang

  3. running rsync /mnt/x/1/2/3 user@IP:/mnt/x/abc/1/2/
    result -- hang

  4. running rsync /mnt/x/1/2/3/4/5 user@IP:/mnt/x/abc/1/2/3/4/
    result -- hang

  5. running rsync /mnt/x/1/2/3/4/5/some-file.ext user@IP:/mnt/x/abc/1/2/3/4/5/
    result -- success

  6. running command from the step 1 again
    rsync /mnt/x/1 user@IP:/mnt/x/abc/
    success

Issue is gone till the next time :(

@Wishmesh
Copy link
Author

Wishmesh commented May 24, 2017

And if it is filesystem issue, then why syncing files one by one works?

@sunjoong
Copy link

@Wishmesh

And if it is filesystem issue, then why syncing files one by one works?

Good question; You know, people say good question when they don't know a quite answer.

I cannot be sure that's filesystem issue; I'm just guessing it might be or not and it could be timestamp related issue if it might be a filesystem issue. Of course, it might not be timestamp related, it might even not filesystem issue, it might be a issue I could not imagine. But I know there is a timestamp issue in DrvFS itself that make a strange thing, and found another timestamp issue of rsync (but it might not fit on your case) https://askubuntu.com/questions/112863/rsync-not-working-between-ntfs-fat-and-ext .

@therealkenc
Copy link
Collaborator

therealkenc commented May 24, 2017

About syncing to /home ... just tried the same folder and of course rsync works.

Sorry my poor explanation. When I said "you might want to push to /home or /var on the server", I meant as operating procedure moving forward. Something is getting borked on the server side. We don't know what, because no server side strace. But I think you'll have a lower probability of borkage on the server if you push to VolFS, because smart money says this is a DrvFS "filesystem thing". [You'll keep using DrvFS on the client of course, because the whole point is having access in Visual Studio. There's no problem we can see on the client.]

Do you have other ideas to try, before I issue rsync with "deeper" path?

Once you push to any different folder without corruption you're golden. Go deeper, sideways, to a whole different server, whatever.

Hold on to the corrupt folder in its pristine corrupt state if you have the space. The devs are going to need diagnostics of some kind on that tree, because no one will be able to repro this locally. You've got a real bug. But note that the chances of anyone reading this far down the thread is rapidly decreasing.

@sunilmut
Copy link
Member

Adding @SvenGroot to see if he can help out with this one. Also, marking this one as a bug.

@sunilmut sunilmut added the bug label Jun 12, 2017
@SvenGroot
Copy link
Member

@Wishmesh Could you follow the steps from section 8 at https://github.com/Microsoft/BashOnWindows/blob/master/CONTRIBUTING.md, and provide us with the log files? That may provide some insight in to what is happening.

@therealkenc
Copy link
Collaborator

His problem is on the server-side of an WSL-to-WSL rsync push. There's gotta be a way to collect a server-side strace but it isn't immediately obvious to me. The client-side he supplied above. It hangs on:

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

...which is not of much use, because it is not failing, just blocking. The problem isn't network-related because the problem persists across a reboot and only once the DrvFS filesystem on the server-side has gotten into a corrupt state.

@SvenGroot
Copy link
Member

Since the server is also WSL, logs of both sides would be helpful. Strace would be helpful but I can see why that would be difficult in this scenario. However, ETL logs should be easy if you have access to the server.

Thanks,
Sven

@Wishmesh
Copy link
Author

Thanks for looking into this!

  1. I tried to get the server-side strace with the following command:
    sudo strace -ff -o mylog /usr/sbin/sshd
    I got two files - mylog.41 and mylog.42, however I do not see the relevant lines in contrast to client strace logs.

If you have idea how to strace rsync on server, let me know.

  1. I got some log files using logman command. Particularly the lxcore_kernel.etl seems to contain something useful, but I am unable to read it using EventViewer... missing some decoder perhaps.
    As it seems that it contains private info, I do not feel comfortable to share it here. Maybe I can share it to some email with @ microsoft.com at the end?

@jlahijani
Copy link

I'm experiencing this issue as well. It occurs on this image.
test

@mumbleskates
Copy link

@squeaksvolvo: @0xabu's killall solution is immensely cleaner than this.

@aressler38
Copy link

This is still an issue in Microsoft Windows 10 Enterprise 10.0.17763 Build 17763. I am using rsync in the VS Code window, and it gets stuck. Resizing the window seems to unstick it for a little while.

@eroller
Copy link

eroller commented Jun 22, 2020

This just started happening for me today. I'm not sure why now, maybe a delayed Windows Update from my domain admins? This is totally unacceptable.

Windows Version 10.0.17763 Build 17763
WSL VERSION="18.04.2 LTS (Bionic Beaver)"

@xianwenchen
Copy link

I have the same problem when rsyncing from /home/ to a remote computer.

Under WSL2 terminal, if I ctrl + z and then fg, I could resume rsync to transfer the next file. However, I have to repeat ctrl + z and fg many times to go through a sync.

@MichaelHipp
Copy link

I have the same problem.

Why on earth has this not been fixed???

@acmuller
Copy link

It's amazing that nothing has been done about this after this amount of time. I was hoping that WSL2 would fix it, but WSL2 doesn't even support SSH to a remote server. For the time being, I'm sticking with my VMWare-based Linuxes.

@123benni
Copy link

123benni commented Oct 14, 2020

This is a really annoying issue. I'm experiencing this with WSL Debian on Windows 10 Enterprise. The workarounds to resize the WSL window works perfectly well. However, it is really annoying.

Some hints for reproducing the issue:
It is actually quite easy to generate files that cannot be rsynced on WSL. I use MATLAB to programmatically generate (a huge amout of) images. Every now and then I rerun my scripts so the already existing images are overwritten with new content. Every time this happens the file which was overwritten is now unable to rsync on WSL. So in order to run rsync (through 100-1000 of files which all hang) I have to keep resizing the WSL window for a mere 10 minutes.

I think it has to do with the timestamps of the files: Whenever a file is overwritten the "created date" stays the same however the "updated date" is changed. I'm quite sure that this is at the root of the issue.

So please some developer go ahead and investigate this issue. Thank you.

In the meantime: if you resize vertically you get more window updates than if you resize horizontally. So you get a higher file throughput rate. 😲 😬 😆

@kiu
Copy link

kiu commented Oct 14, 2020

Using the workaround mentioned in this bug report, I am using this to run my regular backup. Works for me.

#!/bin/bash
IGN_WIN='--exclude Cache --exclude .cache --exclude cache --exclude cache2 --exclude NTUSER.DAT --exclude ntuser.dat.LOG1 --exclude ntuser.dat.LOG2'

/usr/bin/rsync -vxrlHpEogDth -W --delete-after --delete-excluded --ignore-errors $IGN_WIN  -e 'ssh -i /root/.ssh/id_rsa.winbackup' /mnt/c/Users/ winbackup@xxx:/backup 1> /root/winbackup.1.log 2> /root/winbackup.2.log &
pid=$!

while /bin/kill -0 $pid 2>/dev/null; do
    /usr/bin/killall -CHLD ssh
    /bin/sleep 0.1
done

@therealkenc
Copy link
Collaborator

There is definitely (scare quote) "something" going on here; I have seen the rsync hang as well on WSL1, but not recently on WSL2 (for some value of recent, call it ~a year). That doesn't mean it isn't still a problem, natch. There has been no authoritative explanation of why the SIGCHLD work-around works, that I know of, anyway. Doing a ctrl-z (SIGTSTP) + fg or resizing the terminal (SIGWINCH) also unblocks. Signals are "totally different" on WSL2 though, so the fact the problem appears to persist on WSL2 (at least for some) makes the problem all the more curious.

If someone can catch rsync hanging on WSL2, maybe post a screencap after a ctrl-c with a cat /proc/version as a bump. [That the problem persists on WSL1 isn't in question, or the issue would be closed already.]

@MichaelHipp
Copy link

FWIW, I have 3 systems running WSL2 on Win10 x64. Only 1 of the 3 seems to experience the hanging problem. But maybe it is specific to some particular file attribute or content, so the other 2 may trip over it at some point.

@cwallraven
Copy link

For me it was working perfectly with the -W [whole-file transfer] workaround for quite some time and with one of the recent windows updates it came back on WSL. It happens pretty predictably when multiple, small files need to be transferred for me [like small, 4KB-large images, for example].

@jwinterm
Copy link

jwinterm commented Dec 5, 2020

Same issue for me - random hangs when running rsync on windows ubuntu subsytem for linux to a remote debian machine. Just verified it is working if I run rsync inside of msys it works as expected. Running:
Microsoft Windows [Version 10.0.19041.630]

@bitbound
Copy link

I'm having the same issue. I'm using rsync to deploy ASP.NET Core apps, from WSL Ubuntu 20.04 on Windows 10 to an Ubuntu 20.04 VPS.

This has always been an issue for me, starting with 18.04 on both ends.

If I deploy with the same command using an Ubuntu VM in Azure DevOps or GitHub Actions, it works fine.

rsync -r -v /mnt/d/Publish/ root@example.com:/var/www/example-app/

As others have mentioned, if I use the -W parameter, there's no issue. But it takes longer and uses more bandwidth, of course.

@Nigel-Caughey
Copy link

This has been affecting me as well transferring to WSL2 ( Ubuntu 20.04 -- Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020 ) from OSX

io timeout after 3 seconds -- exiting:03
rsync error: timeout in data send/receive (code 30) at /AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-54.120.1/rsync/io.c(164) [sender=2.6.9]

Adding the -W did solve the issue, thanks !!.

@usovalx
Copy link

usovalx commented Mar 3, 2021

As a confirmation, just ran into this issue on fresh W10 install, where rsync of large files (win->lin) would hang.
Spend few hours trying to debug it myself, until I noticed that my WSL was created as WSL1 instance, instead of WSL2.
Things seem to work fine in WSL2.

As a possible reason as to why signals help - strace(ing) some of the hanged rsync instances were stuck in "select" calls.
Assuming there is some race condition somewhere in select & pipe combo, where write to pipe won't wake up the select on another side, this would explain why resizing terminal (or sending SIGCHLD) helps - signal will interrupt the select, and subsequent call into it will notice data in the pipe and resume copying.

@yakoder
Copy link

yakoder commented Aug 16, 2021

@usovalx Thanks for that info. Found a fresh win10(home) install w/ a fresh WSL (openSUSE) install was 🙄 using WSL1... (pretty darn sure I installed WSL2, but it wasn't default for :sigh: "reasons"... fixed that... until finding your comment I was thinking I was using WSL2).
Back to topic: After converting from wsl1 to wsl2, rsync finished successfully. Was getting stuck on same file multiple times. File seemed to be about a "third" of the way thru, but that's only gut feel.
Had used these options (in case it matters): rsync -v --stats -rAEtlh --ignore-errors --delete

edit: Just dawned on me that the issue might've been due to rsync'ing from /mnt/c (I see a few other similar comments, so under the "more data == better analysis" adding that tidbit here).

@andrew521
Copy link

andrew521 commented Mar 2, 2022

I am having the same issue. Windows 11, sshd on wsl2 - ubuntu, about 40gb sync. It was working flawlessly until couple of days ago but now rsync timesout always on the same directory.

  • if I remove the directory, it goes further but still hangs on another file / directory, always the same.
  • if I rename the directory, it hangs on the newly named
  • -z and -W seems to help a bit but still not 100% reliable
  • killall solution does not help.

Has anyone figured this one out?

@eykamp
Copy link

eykamp commented Mar 15, 2022

I found that resizing the window will free things up for a few files.

@usovalx
Copy link

usovalx commented Mar 15, 2022

@andrew521 I haven't tried win11 myself, but this behavior feels very similar to what I had encountered with WSL1.
Double-check that your WSL instance is correct version by running wsl -l -v in cmd window.

Otherwise try running your rsync under strace - this may give some idea what is the issue at hand.
Run strace -o rsync.trace rsync ...... and then post last few kb of the trace file rsync.trace as gist.

Copy link
Contributor

This issue has been automatically closed since it has not had any activity for the past year. If you're still experiencing this issue please re-file this as a new issue or feature request.

Thank you!

@bunnie
Copy link

bunnie commented Feb 1, 2024

I definitely still have this problem.

@AdrianEddy
Copy link

I still have this problem too on WSL1

@nvsystems
Copy link

This bug is certainly not solved. But I guess WSL1 is EOL and it will never be solved. Probably one of the gremlins that made Microsoft change the entire approach to the new WSL2 hotness.

@bunnie
Copy link

bunnie commented Feb 15, 2024

I guess all tickets are being closed automatically now without a week of response from the ticket opener.

I have re-opened this issue at #11157 in case anyone wants to follow-on the journey of this clown car.

@craigloewen-msft
Copy link
Member

Reopened and will track it here! Thank you for the feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests