Intermittent problem displaying status output at normal verbosity, parsyncfp-1.61 #34
Comments
Here is STDOUT (.log.log) and STDERR (.err.log), both trimmed in the same places to the relevant portion. One thing that does jump out at me is that "[0m" that appears in both. I'm not sure of the origin of that. But you can see in the logs, one of the times it looks like parsyncfp maybe just exits way too soon (also evidenced by the fact that it spends a lot of time in that pgrep -x rsync loop) the first time, and works as expected the second time?
asb-parsyncfp-20200429.err.log <https://github.com/hjmangalam/parsyncfp/files/4554664/asb-parsyncfp-20200429.err.log>
asb-parsyncfp-20200429.log.log <https://github.com/hjmangalam/parsyncfp/files/4554665/asb-parsyncfp-20200429.log.log>
Hi Ryan,
Something is indeed weird here.
If you're feeding fpart a list of files, especially if you're including the
sizes (and you are, since you're using the '--trustme' option), it should
be processed very quickly, as it apparently is in the second (projects)
run, which adds chunks every few seconds.
In the first (cache) run, you're completing a chunk only about every minute. Are
there huge differences in the distribution of file sizes between the 2
filesystems?
Here are some suggestions that might show where things are going bad:
- What do the pfp and fpart logs say about the 2 transfers?
- Does the fpart log indicate that it completed correctly?
- What are the sizes of the fpcache files? If there was a huge difference
in the sizes of the files to transfer, there would be a huge difference in the
sizes of the cache files (which is noted at the end of the pfp run - IF it
terminated correctly). See the quick check sketched just after this list.
- Does the pfp log indicate some kind of trouble?
- Does the comparison between the pfp and fpart logs from the 'cache' run
vs the 'projects' run indicate where the trouble might lie?
- Could the input file lists be corrupted in some way between the first and
the second runs? (Unlikely, as I'd imagine they're generated by set
scripts, and fpart doesn't crash but just runs very long; still, it's worth
checking.)
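A concrete (if unofficial) way to do the cache-file and log checks above - just a sketch, assuming the default cache dir ~/.parsyncfp; substitute your altcache dir if you use one:
# list the fpart chunkfiles and their sizes from the last run
ls -lh ~/.parsyncfp/fpcache/
# scan everything pfp and fpart left behind for obvious trouble
grep -riE 'error|warn|fail' ~/.parsyncfp/ | less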
The odd "[0m" characters you noted in both logs are probably
part of the pfp terminal color-change escape sequence. I don't think that
would introduce such a difference, but .. I can't guarantee that it
couldn't..
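If the color codes ever get in the way of reading or diffing the captured logs, something like this strips the ANSI escape sequences (GNU sed assumed; the filename is just your attached log):
# remove ANSI color escapes such as ESC[0m from a captured log
sed 's/\x1b\[[0-9;]*m//g' asb-parsyncfp-20200429.err.log > err.nocolor.log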
I'll keep trying to come up with an explanation as I answer your other
email.
hjm
The "cache" filesystem and the "projects" filesystem do likely have a much different file size histogram. "cache" has 152970416 inodes in use, while "projects" has 115009601, and "projects" is ~2.5 times larger in disk space than cache. Lots of people have build directories for software in there, etc. One git repo can really up the numbers (we've been thinking we need to move to inode quotas on that filesystem). I'll have to save the |
Hmm - even so, that's an average of ~3.6s per chunk for projects and ~60s
per chunk for cache.
Re: the overwrite of the cache files, if you can afford the space, use
the FILESYSTEM variable to name the altcache so they don't get overwritten.
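Something along these lines in your wrapper function would do it (a sketch only; the variable and directory names here are placeholders, not taken from your script):
# keep a separate parsyncfp cache per filesystem so the 2nd run can't clobber the 1st
FILESYSTEM="$1"                           # e.g. "cache" or "projects"
ALTCACHE="$HOME/.parsyncfp-$FILESYSTEM"
mkdir -p "$ALTCACHE"
# then pass  --altcache="$ALTCACHE"  to the parsyncfp command in that function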
hjm
Yup; and I should have space. FYI, the list for the "cache" filesystem is itself 20G, while the "projects" filesystem list is 16GB. Of course that size also depends on the length of the PATH names, etc., but here are some line counts:
***@***.*** policy]# wc -l cache.list.allfiles.clean projectsp.list.allfiles.clean
164657014 cache.list.allfiles.clean
109127324 projectsp.list.allfiles.clean
273784338 total
OK - I found a few problems, now corrected. The worst one was that in
(partially) adding another option, I broke the list processing (and I found
some other minor problems in the process).
It now seems to work with lists (--fromlist), with lists with sizes
(--trustme), and with the --trimpath option the way it did in the past (and
perhaps with a little better error checking).
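For reference, an invocation exercising those three options together looks roughly like this (a sketch; the NP, list path, trim prefix, and target below are placeholders, not your actual values):
parsyncfp --NP=8 \
          --fromlist=cache.list.allfiles.clean \
          --trustme \
          --trimpath='/gpfs/cache' \
          remotehost:/backups/cache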
Please try the new version 1.67 and tell me how it goes.
hjm
Great! Will try it tomorrow, or sooner if the current runs complete.
So far so good! Will keep you posted, but the runs that have gotten to this point in the script are running well (the list generation takes a while).
Seems like the filesystem that wasn't showing status at normal verbosity routinely has a smaller number of file list "parts" than the number of rsyncs I have configured. Today it's 6 parts, as a result of a high chunk size and a relatively small filesystem. It may be that something else previously obscured that problem.
Any particular suggestions for the number of "parts" to ultimately shoot for? I guess too small and you risk a slow directory becoming the choke point for all of them simultaneously (a lot more likely for, say, 4 rsyncs than for 10).
I generally try to shoot for about 3-5 times the number of rsyncs. That
way the rsyncs are kept busy without too much churn, and fpart can generate
chunks quickly enough that you're not waiting forever to start.
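As a back-of-the-envelope way to turn that into a --chunksize (the mountpoint, NP, and the factor of 4 below are placeholders to adjust; I believe --chunksize also accepts suffixes like 100G):
NP=8                                             # parallel rsyncs (--NP)
CHUNKS=$((NP * 4))                               # aim for ~3-5 chunkfiles per rsync
USED=$(df -B1 --output=used /cache | tail -n 1)  # bytes to move, or take it from your GPFS list
echo "suggested --chunksize: $((USED / CHUNKS)) bytes"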
Are you using fpart to recurse the dirs, or are you using GPFS to generate
the lists that you want to transfer? Generating via lists (especially with the
--trustme option, where you provide the sizes) should be much faster than a full
recursion.
I realize that startup time is significant on large FSs and I'm still
thinking about how to improve on this delay. fpart does a very good job
already. The only thing that might improve this performance is
multi-threading the recursion, which I believe qdirstat does. But then all
those threads are stat-hammering at once and it's not clear whether that's going to
lead to a huge improvement.
Harry
By the way, please let me know if your pfp runs terminate correctly. They
have on all my tests of the latest version, but it's always /yours/ that
are the critical ones.
harry
Oh, I'm sorry that that wasn't clear: the list generation happens via GPFS, so well before we run PFP. Probably 90 mins of that before we even get to PFP on the "cache" or "projects" filesystem on the one campus. But so far both the "cache" and "projects" runs have run, displayed the transfer output, and the "cache" FS completed normally already. None of the other ones have completed yet, but they are behaving and relatively close to the end.
I did notice that the list generation inside PFP did seem to take a bit longer -- like I could watch it generate list parts, and it was over a minute each on one of the sites and about a minute total on the others (those used to be nearly instantaneous) -- but it's not an amount of time that matters given how long this runs for. It also seems like maybe this frmlist.tmp file is new, large, and potentially the source of some of the slowdown, as that directory doesn't have the best write performance. It also seems to contain exactly the same information as the file we feed in, but I'm assuming it does some sort of massaging/validation on that file?
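A quick way for me to confirm that tmp copy really is byte-for-byte identical to what we feed in (paths here are placeholders for our actual list and for wherever PFP put its copy):
INLIST=cache.list.allfiles.clean     # the list we pass to --fromlist
TMPCOPY=~/.parsyncfp/frmlist.tmp     # assumed location of PFP's working copy
cmp --silent "$INLIST" "$TMPCOPY" && echo identical || echo "differs (or missing)"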
This was a little weird, but it got past it.
We can leave this one open for another go-round, but I suspect it's solved in 1.67. No issue this time, and it seems likely it was caused by having a number of parts that was smaller than the number of rsync processes.
Had a different sort of issue this week. As you may recall, we've had the problem where the PFP runs forever waiting for a new chunkfile. This week, on the same/habitual trouble campus, PFP exited way before the rsyncs from PFP exited. I have all of the cache files so we can attempt to isolate the problem if that's helpful. Also, we have no backup currently running (the tuning we did improved the speed greatly), so I have a little more flexibility than usual.
Hmm. Since this is a new problem, could you start it in a new thread?
I'll wait for the new thread to answer substantially.
hjm
No problem; they sort of interrelate, and it was mentioned in the opening to this ticket.
But I agree, it's not really related to the title anyway. This past run did show the status output while it was running. I'll close this one if I don't see it reappear.
Sorry to be burying you in issues. I just want to get these reported so you know about them/so you can help if I'm just doing something wrong.
Our backup script runs parsyncfp twice on the only campus where I've been seeing this. The function does the following (for brevity I've stripped out all of the lines that are commented or just echo status, and have only shown that function and the place where it is called; other stuff above, below, and where the ellipses are has been omitted):
Possibly worth noting that the last while loop, I believe, is there to deal with another bug someplace where parsyncfp exits but for some reason hasn't waited for all of the rsync processes to exit. This might be related to some of the previous PID-related problems. I do believe I've seen it have to wait at this point recently, but that may be fixed with 1.65+.
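Roughly, that final guard looks like this (a simplified sketch, not the exact code from our script):
# don't move on while any rsync is still running; note this matches ANY rsync
# on the host, not only the ones parsyncfp launched
while pgrep -x rsync > /dev/null; do
    sleep 60
done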
Anyhow, the weird behavior is this: as you can see above, that parallel() function is called twice, only with different arguments, so the parsyncfp command uses all of the same flags. On this most recent run, the first run, for the cache filesystem, printed no output whatsoever about progress -- no header line, and no status lines after it. However, the second run, against projects, did print the header and is outputting status lines at the expected interval.
I'm not entirely sure where to look at this for debugging. I don't think this version uses status. At some point I was running it without scut, but I now have that, so unless the PATH is not working correctly, I'm not sure what's going wrong. I'll attach the relevant output (we log STDOUT and STDERR to two different files).