parsyncfp 1.67 and earlier sometimes exits before all rsync processes have completed #35
Hi Ryan,
Sorry for the delay. I've tried to reproduce your problem on a couple of
machines using different interfaces and networks and I'm afraid I can't.
The child rsyncs always end before the parent pfp. If I kill the parent
pfp manually (via ^C) or with an explicit kill from another terminal, all
the rsyncs die with it:
[[
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.5]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.6]..
INFO: Starting rsync for chunkfile [/home/hjm/.parsyncfp/fpcache/f.7]..
         | Elapsed |   1m   | [ wlp3s0]  MB/s | Running || Susp'd |     Chunks [2020-05-18]
  Time   | time(m) |  Load  | TCP / RDMA  out |  PIDs   ||  PIDs  | [UpTo] of [ToDo]
X11 forwarding request failed on channel 0
12.31.59     0.05     1.01     1.28 / 0.00        8    <>    0       [8] of [24]
12.32.03     0.12     1.01     1.12 / 0.00        8    <>    0       [14] of [24]
12.32.06     0.17     1.01     1.14 / 0.00        8    <>    0       [14] of [24]
12.32.09     0.22     1.25     1.11 / 0.00        8    <>    0       [14] of [24]
12.32.12     0.27     1.25     1.12 / 0.00        8    <>    0       [14] of [24]
12.32.15     0.32     1.15     1.12 / 0.00        8    <>    0       [14] of [24]
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [sender=3.1.3]
rsync: [sender] write error: Broken pipe (32)
]]
This is the way these processes are supposed to work, and in my hands that's
the way they do work. Could the processes that you're detecting be from
another pfp? I can't match the log IDs.
The fpart error is explained by its author like this:
[[
Hmmm... That error can only be triggered when using arbitrary values
(fpart's option -a), which asks fpart *not* to crawl a FS but instead read
lines containing something like :
size path
values. The error is then triggered when sscanf() fails reading a line. Is
pfp using that option ?
It might be interesting to check the exact log line with a tool that
displays special characters (e.g. 'cat -bet logfile') to see if there is a
whitespace or something else.
Finally, it is possible to build fpart with the '--enable-debug' option. It
will display info while crawling the filesystem. It may help us better
understand what's happening.
]]
And that option is used when fpart is taking lists of files to generate
chunks. It's quite possible that there's something in your list that it
doesn't like, such as a weird file name (like '^s' or one of the many wacko
names that are possible to create via random mouse events; I have several
like that).
You can recompile fpart to enable the extended debugging if you want to try
to track that down. It is supposed to print out the offending name, so the
fact that it isn't implies that it's a non-printable character or
whitespace.
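One way to hunt for the offending entry without rebuilding fpart is to scan the list file directly. The sketch below assumes each record looks like "size path", as the fpart author describes above; the regular expression is only an approximation for illustration and is not fpart's actual sscanf() format, and `find_bad_lines` is a hypothetical helper name.

```python
import re
import sys

# Assumed record shape: "<size><whitespace><path>". This approximates the
# "size path" description in the thread; it is NOT fpart's real sscanf() format.
RECORD = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def find_bad_lines(lines):
    """Return (line_number, repr_of_line) for each line that does not parse."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not RECORD.match(text):
            # repr() makes tabs, carriage returns and empty lines visible.
            bad.append((number, repr(text)))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # newline="\n" splits on "\n" only, so stray carriage returns stay visible;
    # surrogateescape tolerates file names that are not valid UTF-8.
    with open(sys.argv[1], newline="\n", errors="surrogateescape") as handle:
        for number, text in find_bad_lines(handle):
            print(f"line {number}: {text}")
```

Run against the list passed to --fromlist, this prints the line number of anything that is blank, whitespace-only, or missing a leading size, which may narrow down where the record went wrong.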
Let me know if you see the pfp errors in other contexts or if you verify
that the surviving rsyncs are children of the parent pfp that exited.
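To check parentage, something like the following could be run while the parent pfp is still alive. It is a rough sketch that parses the text printed by `ps -eo pid,ppid,comm` and walks each rsync's ancestor chain; `rsync_descendants` is a hypothetical helper, not part of parsyncfp. Once the parent has exited, orphaned processes are re-parented (typically to PID 1), so the ancestry can no longer be recovered this way.

```python
def rsync_descendants(ps_output, ancestor_pid):
    """Return sorted PIDs of rsync processes that descend from ancestor_pid.

    ps_output is the text printed by `ps -eo pid,ppid,comm` (header included).
    """
    parent = {}
    name = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, ppid, comm = int(fields[0]), int(fields[1]), fields[2].strip()
        parent[pid] = ppid
        name[pid] = comm

    found = []
    for pid, comm in name.items():
        if comm != "rsync":
            continue
        seen = set()
        current = pid
        # Walk up the parent chain until we hit the ancestor or run out.
        while current in parent and current not in seen:
            seen.add(current)
            current = parent[current]
            if current == ancestor_pid:
                found.append(pid)
                break
    return sorted(found)
```

For example, `rsync_descendants(subprocess.run(["ps", "-eo", "pid,ppid,comm"], capture_output=True, text=True).stdout, pfp_pid)` would list the rsyncs started, directly or through an intermediate shell, by the process whose PID is `pfp_pid`.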
harry
--
Harry Mangalam
|
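Yes, the child rsyncs are absolutely children of the PFP process that's still running. We don't run multiples at all on this system, and have only been running one at a time on the system that does run more than one, and you can see that the command lines for the rsync processes above correspond to the parsyncfp command line.
I'll be starting up a new round tomorrow night. It looks like the critical line here is 928, so if there's some instrumentation it would be helpful to add here, let me know.
Re: fpart, yes, we use the PFP option to read file sizes from a list. Here is our full command line for PFP:
/usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait
--altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12
--rsyncopts '-a -e "ssh -x -c ***@***.*** -o Compression=no"'
--maxload 96 --chunksize=5T
--fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean
--trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme
$BACKUPHOST:/zdata/gss/$SNAPSHOT
(I notice --dispose c doesn't seem to work either, but maybe I'm not specifying that correctly?)
I do know that we have some filenames with junk in them -- mostly carriage returns at the end. I tried what you suggested on the fpart log and he's right:
***@***.*** .parsyncfp-backups-projectsc]# cat -bet fpcache/fpart.log.23.01.04_2020-05-12 | grep error
33 error parsing input values: ^I$
Do you know which part should be the problem part? The output makes it unclear whether it's the one before or after, but neither the f.30 file nor the f.31 file seems to contain a ^I
...
32 Filled part #30: size = 5502396364451, 52938 file(s)$
33 error parsing input values: ^I$
34 Filled part #31: size = 4576145248049, 96973 file(s)$
Thanks for your assistance!
|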
You're right - line 928:
if ( $rPIDs eq "" && $sPIDs eq "" && $CUR_FPI >= $nbr_cur_fpc_fles && $FPART_RUNNING == 0 ) {
is the exit test. If you could dump all the variables inside that loop to
make sure that I'm using the right tests..?
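As a rough illustration of what such a dump could report, here is the same four-part test transliterated into Python, with each sub-condition named so a log line shows which one flipped. This is not parsyncfp code: the function name is hypothetical, and the meanings given to the variables (running PIDs, suspended PIDs, current chunk index, chunk-file count, fpart still running) are guesses from their names and from the status table earlier in the thread.

```python
def exit_test(r_pids, s_pids, cur_fpi, nbr_cur_fpc_files, fpart_running):
    """Mirror of the line-928 test; returns (should_exit, report_string)."""
    conditions = {
        # Variable meanings are assumptions inferred from their names.
        "rPIDs empty": r_pids == "",
        "sPIDs empty": s_pids == "",
        "CUR_FPI >= chunk file count": cur_fpi >= nbr_cur_fpc_files,
        "FPART_RUNNING == 0": fpart_running == 0,
    }
    report = ", ".join(f"{label}={value}" for label, value in conditions.items())
    return all(conditions.values()), report
```

Logging the report string on every pass through the loop, together with the raw values, would show whether the PID strings were genuinely empty at the moment the parent decided to exit.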
I could easily see that I messed up the test, but I have trouble seeing how
the child PIDs could escape the death of their parent, unless the parent is
started under nohup or something like it, as described here:
http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html
In that case the children also inherit the ignored hangup signal and can
survive the death of their parent. Are you starting your script with
something like that?
That same article does clarify that children DO NOT always die with their
parents (so the universe remains intact with your results ;), though that's
generally unusual.
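For context, children are not killed automatically when a parent exits; ^C takes them down because the terminal delivers SIGINT to the whole foreground process group. A minimal sketch of signalling a whole group explicitly is below. It is Unix-only, the helper names are hypothetical, and it is not how parsyncfp manages its rsyncs.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(argv):
    """Start argv as the leader of a new session (and so a new process group)."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM, timeout=10):
    """Signal every process in proc's group, then wait for proc itself."""
    os.killpg(os.getpgid(proc.pid), sig)
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    # Stand-in child: a Python process that just sleeps.
    child = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", kill_group(child))
```

Anything the child forks stays in the same group unless it deliberately moves itself, so one `killpg` reaches those descendants as well.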
I'll check what's happening to the '--dispose' option. I haven't been
paying any attention to it since I added it.
Re: the fpart error - I assume that you're not going to see the weird
filename in a chunkfile because it was an error and therefore was not
included in the chunking process. If that's a consideration, you can
probably cause fpart to print the whole fully-qualified filename and then
find out where it should have gone. Or bet that it won't crash rsync and
bypass that name-checking...?
Harry
On Mon, May 18, 2020 at 1:57 PM Ryan Novosielski ***@***.***> wrote:
Yes, the child rsyncs are absolutely children of the PFP process that's
still running. We don't run multiples at all on this system, and only have
been running one at a time otherwise on the system that does run more than
one, and you can see that the command lines for the rsync processes above
correspond to the parsyncfp command line.
I'll be starting up a new round tomorrow night. It looks like the critical
line here is 928, so if there's some instrumentation it would be helpful to
add here, let me know.
Re: fpart, yes, we use the PFP option to read file sizes from a list.
Here is our full command line for PFP:
/usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait
--altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12
--rsyncopts '-a -e "ssh -x -c ***@***.*** -o Compression=no"'
--maxload 96 --chunksize=5T
--fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean
--trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme
$BACKUPHOST:/zdata/gss/$SNAPSHOT
(I notice --dispose c doesn't seem to work either, but maybe I'm not
specifying that correctly?)
I do know that we have some filenames with junk in them -- mostly carriage
returns at the end. I tried what you suggested on the fpart log and he's
right:
***@***.*** .parsyncfp-backups-projectsc]# cat -bet fpcache/fpart.log.23.01.04_2020-05-12 | grep error
33 error parsing input values: ^I$
Do you know which part should be the problem part? The output makes it
unclear whether it's the one before or after, but neither the f.30 file
or f.31 seem to contain a ^I
...
32 Filled part #30: size = 5502396364451, 52938 file(s)$
33 error parsing input values: ^I$
34 Filled part #31: size = 4576145248049, 96973 file(s)$
Thanks for your assistance!
--
Harry Mangalam
|
I'll add prints of those PIDs, etc. before I run this again.
When this has gone well, I typically run my script via
I know at one time there was a bug where PFP wasn't careful to confirm that one of the
We'll see what it prints out this go 'round.
Re: the
|
This didn't happen on this run, though I ran it via
Here's the output I did capture, at any rate:
|
Here's some output from our most recent run on one of our three campuses. I mentioned in Issue 34 that there seems to be a bug where PFP sometimes exits while rsync processes are still running, which is incorrect. You can see in the below output that PFP exits, and then there are these lines:
What is happening during that time is a pgrep -x rsync loop that makes sure no rsyncs are still running; it was added by my colleague Bill Abbott, probably to deal with this problem. As you can see, it was several hours before all of the rsync processes completed. Here's the full output:
While this rsync loop was running, I checked for running rsync processes:
As you can see, they are numerous.
We still have an error message in the fpart.log -- it's not clear to me whether it's related, or how I would go about figuring out what is upsetting it:
But we have all of the cache files from the run, so we can look them over and run tests in the interim (the transfer finished very fast this week so I have till Tuesday night to tinker).
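The wrapper's actual loop is not shown in this thread. As an assumption-laden sketch of the idea, a wait loop around `pgrep -x rsync` might look like the following; the function names, the 60-second interval, and the optional check limit are all hypothetical.

```python
import subprocess
import time


def rsync_pids():
    """Return the PIDs printed by `pgrep -x rsync` (empty list when none match)."""
    result = subprocess.run(["pgrep", "-x", "rsync"], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def wait_until_gone(list_pids=rsync_pids, interval=60, sleep=time.sleep, max_checks=None):
    """Poll until list_pids() returns nothing; return the number of checks made."""
    checks = 0
    while True:
        pids = list_pids()
        checks += 1
        if not pids:
            return checks
        if max_checks is not None and checks >= max_checks:
            raise TimeoutError(f"still running after {checks} checks: {pids}")
        sleep(interval)
```

Note that matching on the bare name catches every rsync on the host, not only those started by PFP, so on a machine running more than one transfer it would wait for all of them.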