Intermittent problem displaying status output at normal verbosity, parsyncfp-1.61 #34
Comments
Here is STDOUT (.log.log) and STDERR (.err.log), both trimmed in the same places to the relevant portion. One thing that does jump out at me is that "[0m" that appears in both. I'm not sure of the origin of that. But you can see in the logs, one of the times it looks like parsyncfp maybe just exits way too soon (also evidenced by the fact that it spends a lot of time in that pgrep -x rsync loop) the first time, and works as expected the second time?
asb-parsyncfp-20200429.err.log <https://github.com/hjmangalam/parsyncfp/files/4554664/asb-parsyncfp-20200429.err.log>
asb-parsyncfp-20200429.log.log <https://github.com/hjmangalam/parsyncfp/files/4554665/asb-parsyncfp-20200429.log.log>
Hi Ryan,
Something is indeed weird here.
If you're feeding fpart a list of files, especially if you're including the
sizes (and you are, since you're using the '--trustme' option), it should
be processed very quickly, as it apparently is in the second (projects)
run, which adds chunks every few seconds.
In the first (cache) run, you're completing a chunk only about every minute. Are
there huge differences in the distribution of file sizes between the 2
filesystems?
Here are some suggestions that might show where things are going bad:
- What do the pfp and fpart logs say about the 2 transfers?
- Does the fpart log indicate that it completed correctly?
- What are the sizes of the fpcache files? If there was a huge difference
in the sizes of the files to transfer, there would be a huge difference in the
sizes of the cache files (which is noted at the end of the pfp run - IF it
terminated correctly). See the quick check sketched just after this list.
- Does the pfp log indicate some kind of trouble?
- Does the comparison between the pfp and fpart logs from the 'cache' run
vs the 'projects' run indicate where the trouble might lie?
- Could the input file lists be corrupted in some way between the first and
the second runs? (Unlikely, as I'd imagine they're generated by set
scripts, and fpart doesn't crash but just runs very long; still, it's worth
checking.)
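A concrete (if unofficial) way to do the cache-file and log checks above - just a sketch, assuming the default cache dir ~/.parsyncfp; substitute your altcache dir if you use one:
# list the fpart chunkfiles and their sizes from the last run
ls -lh ~/.parsyncfp/fpcache/
# scan everything pfp and fpart left behind for obvious trouble
grep -riE 'error|warn|fail' ~/.parsyncfp/ | less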
The odd "[0m" characters you noted in both logs are probably
part of the pfp terminal color-change escape sequence. I don't think that
would introduce such a difference, but .. I can't guarantee that it
couldn't..
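If the color codes ever get in the way of reading or diffing the captured logs, something like this strips the ANSI escape sequences (GNU sed assumed; the filename is just your attached log):
# remove ANSI color escapes such as ESC[0m from a captured log
sed 's/\x1b\[[0-9;]*m//g' asb-parsyncfp-20200429.err.log > err.nocolor.log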
I'll keep trying to come up with an explanation as I answer your other
email.
hjm
The "cache" filesystem and the "projects" filesystem do likely have a much different file size histogram. "cache" has 152970416 inodes in use, while "projects" has 115009601, and "projects" is ~2.5 times larger in disk space than cache. Lots of people have build directories for software in there, etc. One git repo can really up the numbers (we've been thinking we need to move to inode quotas on that filesystem). I'll have to save the |
Hmm - even so, that's an average of ~3.6s per chunk for projects and ~60s
per chunk for cache.
Re: the overwrite of the cache files, if you can afford the space, use
the FILESYSTEM variable to name the altcache so they don't get overwritten.
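Something along these lines in your wrapper function would do it (a sketch only; the variable and directory names here are placeholders, not taken from your script):
# keep a separate parsyncfp cache per filesystem so the 2nd run can't clobber the 1st
FILESYSTEM="$1"                           # e.g. "cache" or "projects"
ALTCACHE="$HOME/.parsyncfp-$FILESYSTEM"
mkdir -p "$ALTCACHE"
# then pass  --altcache="$ALTCACHE"  to the parsyncfp command in that function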
hjm
Yup; and I should have space. FYI, the list for the "cache" filesystem is itself 20G, while the "projects" filesystem list is 16GB. Of course that size also depends on the length of the PATH names, etc., but here are some line counts:
***@***.*** policy]# wc -l cache.list.allfiles.clean projectsp.list.allfiles.clean
164657014 cache.list.allfiles.clean
109127324 projectsp.list.allfiles.clean
273784338 total
OK - I found a few problems, now corrected. The worst one was that in
(partially) adding another option, I broke the list processing (and I found
some other minor problems in the process).
It now seems to work with lists (--fromlist), with lists with sizes
(--trustme), and with the --trimpath option the way it did in the past (and
perhaps with a little better error checking).
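For reference, an invocation exercising those three options together looks roughly like this (a sketch; the NP, list path, trim prefix, and target below are placeholders, not your actual values):
parsyncfp --NP=8 \
          --fromlist=cache.list.allfiles.clean \
          --trustme \
          --trimpath='/gpfs/cache' \
          remotehost:/backups/cache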
Please try the new version 1.67 and tell me how it goes.
hjm
Great! Will try it tomorrow, or sooner if the current runs complete.
So far so good! Will keep you posted, but the runs that have gotten to this point in the script are running well (the list generation takes a while).
Seems like the filesystem that wasn't showing status at normal verbosity routinely has a smaller number of file list "parts" than the number of rsyncs I have configured. Today it's 6 parts, as a result of a high chunk size and a relatively small filesystem. It may be that something else previously obscured that problem.
Any particular suggestions for the number of "parts" to ultimately shoot for? I guess too small and you risk a slow directory becoming the choke point for all of them simultaneously (a lot more likely for, say, 4 rsyncs than for 10).
I generally try to shoot for about 3-5 times the number of rsyncs. That
way the rsyncs are kept busy without too much churn, and fpart can generate
chunks quickly enough that you're not waiting forever to start.
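As a back-of-the-envelope way to turn that into a --chunksize (the mountpoint, NP, and the factor of 4 below are placeholders to adjust; I believe --chunksize also accepts suffixes like 100G):
NP=8                                             # parallel rsyncs (--NP)
CHUNKS=$((NP * 4))                               # aim for ~3-5 chunkfiles per rsync
USED=$(df -B1 --output=used /cache | tail -n 1)  # bytes to move, or take it from your GPFS list
echo "suggested --chunksize: $((USED / CHUNKS)) bytes"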
Are you using fpart to recurse the dirs, or are you using GPFS to generate
the lists that you want to transfer? Generating via lists (especially with the
--trustme option, where you provide the sizes) should be much faster than a full
recursion.
I realize that startup time is significant on large FSs and I'm still
thinking about how to improve on this delay. fpart does a very good job
already. The only thing that might improve this performance is
multi-threading the recursion, which I believe qdirstat does. But then all
those threads are stat-hammering at once and it's not clear whether that's going to
lead to a huge improvement.
Harry
By the way, please let me know if your pfp runs terminate correctly. They
have on all my tests of the latest version, but it's always /yours/ that
are the critical ones.
harry
Oh, I'm sorry that that wasn't clear: the list generation happens via GPFS, so well before we run PFP. Probably 90 mins of that before we even get to PFP on the "cache" or "projects" filesystem on the one campus. But so far both the "cache" and "projects" runs have run, displayed the transfer output, and the "cache" FS completed normally already. None of the other ones have completed yet, but they are behaving and relatively close to the end.
I did notice that the list generation inside PFP did seem to take a bit longer -- like I could watch it generate list parts, and it was over a minute each on one of the sites and about a minute total on the others (those used to be nearly instantaneous) -- but it's not an amount of time that matters given how long this runs for. It also seems like maybe this frmlist.tmp file is new, large, and potentially the source of some of the slowdown, as that directory doesn't have the best write performance. It also seems to contain exactly the same information as the file we feed in, but I'm assuming it does some sort of massaging/validation on that file?
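A quick way for me to confirm that tmp copy really is byte-for-byte identical to what we feed in (paths here are placeholders for our actual list and for wherever PFP put its copy):
INLIST=cache.list.allfiles.clean     # the list we pass to --fromlist
TMPCOPY=~/.parsyncfp/frmlist.tmp     # assumed location of PFP's working copy
cmp --silent "$INLIST" "$TMPCOPY" && echo identical || echo "differs (or missing)"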
This was a little weird, but it got past it.
We can leave this one open for another go-round, but I suspect it's solved in 1.67. No issue this time, and it seems likely it was caused by having a number of parts that was smaller than the number of rsync processes.
Had a different sort of issue this week. As you may recall, we've had the problem where the PFP runs forever waiting for a new chunkfile. This week, on the same/habitual trouble campus, PFP exited way before the rsyncs from PFP exited. I have all of the cache files so we can attempt to isolate the problem if that's helpful. Also, we have no backup currently running (the tuning we did improved the speed greatly), so I have a little more flexibility than usual.
Hmm. Since this is a new problem, could you start it in a new thread?
I'll wait for the new thread to answer substantially.
hjm
No problem; they sort of interrelate, and it was mentioned in the opening to this ticket.
But I agree, it's not really related to the title anyway. This past run did show the status output while it was running. I'll close this one if I don't see it reappear.
Sorry to be burying you in issues. I just want to get these reported so you know about them/so you can help if I'm just doing something wrong.
Our backup script runs parsyncfp twice on the only campus where I've been seeing this. The function does the following (for brevity I've stripped out all of the lines that are commented or just echo status, and have only shown that function and the place where it is called; other stuff above, below, and where the ellipses are has been omitted):
Possibly worth noting that the last while loop, I believe, is there to deal with another bug someplace where parsyncfp exits but for some reason hasn't waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed with 1.65+.
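Roughly, that final guard looks like this (a simplified sketch, not the exact code from our script):
# don't move on while any rsync is still running; note this matches ANY rsync
# on the host, not only the ones parsyncfp launched
while pgrep -x rsync > /dev/null; do
    sleep 60
done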
Anyhow, the weird behavior is this: as you can see above, that parallel() function is called twice, only with different arguments, so the parsyncfp command uses all of the same flags. On this most recent run, the first run, for the cache filesystem, printed no output whatsoever about progress -- no header line, and no status lines after it. However, the second run, against projects, did print the header and is outputting status lines at the expected interval.
I'm not entirely sure where to look at this for debugging. I don't think this version uses status. At some point I was running it without scut, but I now have that, so unless the PATH is not working correctly, I'm not sure what's going wrong. I'll attach the relevant output (we log STDOUT and STDERR to two different files).