Can't seem to get parsyncfp 1.63 or newer to work right #33
Hi Ryan,
It'll take me some time to set up a similar situation on our cluster and I
have some things that need doing in the daylight, so I'll try to get to
this tonight. Thanks for such a complete error report.
Harry
On Tue, Apr 28, 2020 at 9:58 PM Ryan Novosielski wrote:
Here's an example of what happens, tested with parsyncfp 1.65 with the -d flag so we could see what is going on. It doesn't correctly do the fpart section, and it seems like it never generates a file list based on my input list:
```
MOUNTPOINT=projects
HOMEDIR=/gpfs/cache/home/root/policy
***@***.*** bin]# /usr/local/bin/parsyncfp-1.65 -d -i ens6 --checkperiod 1800 --nowait --altcache '/root/.parsyncfp-backups' -NP 12 --rsyncopts '-a -e "ssh -x -c ***@***.*** -o Compression=no"' --maxload 96 --chunksize=10T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT
DEBUG[359]: FPARTSIZE set to: [10T]
FPARTSIZE_N set to [10995116277800]
Press [ENTER] to continue.
DEBUG[380]: Using network interface [ens6] with connection quality [ 100000Mb/s]
Press [ENTER] to continue.
DEBUG[385]: 1m load is [0.85] and the 1m Load:#CPU ratio is [0.0354166666666667] ( [24] CPU cores).
OK to continue.
Press [ENTER] to continue.
WARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-backups/fpcache].
Enter ^C to stop this.
If you specified '--nowait', cache will be cleared in 3s regardless.
Otherwise, hit [Enter] and I'll clear them.
INFO: The fpart chunk files [/root/.parsyncfp-backups/fpcache/f*] are cleared .. continuing.
INFO: The TRIMPATH you specified exists, is a dir, and is readable.
INFO: Alternative file list is readable; converting list to chunks.
DEBUG[624]: fpart fork cmd:
[cd /projects/.snapshots/projectsc; fpart -v -L -s 10995116277800 -a -i /tmp/pfp.tmp -o /root/.parsyncfp-backups/fpcache/f 2> /root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29 & echo "${!}" > /root/.parsyncfp-backups/fpcache/FP_PIDFILE00.41.53_2020-04-29]
Press [ENTER] to continue.
INFO: Forking fpart with PID = [6639]. Check [/root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29] for errors if it hangs.
INFO: Waiting for fpart to be forked.
INFO: [/root/.parsyncfp-backups/fpcache/f.0] visible.
INFO: Waiting [1]s for chunk files to be written.
INFO: Starting the 1st [12] rsyncs ..
DEBUG[685]: Complete rsync cmd = [cd /projects/.snapshots/projectsc && rsync --bwlimit=1000000 -a -e "ssh -x -c ***@***.*** -o Compression=no" -a -s --log-file=/root/.parsyncfp-backups/rsync-logfile-00.41.53_2020-04-29_0 --files-from=/root/.parsyncfp-backups/fpcache/f.0 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups/fpcache/rsync-PIDs-00.41.53_2020-04-29]
Press [ENTER] to continue.
DEBUG[704]: OUT OF RSYNC STARTUP LOOP
Press [ENTER] to continue.
WARN:
The number of chunk files generated by fpart [1] < the # of rsync
processes you specified [12].
Did you check the dir tree / file list to make sure you're setting the chunk
size appropriately (--chunksize) ? It's currently set to [10T].
| Elapsed | 1m | [ ens6] MB/s | Running || Susp'd | Chunks [2020-04-29]
Time | time(m) | Load | TCP / RDMA out | PIDs || PIDs | [UpTo] of [ToDo]
DEBUG[1105]: sPIDs string in get_rPIDs() = []
Press [ENTER] to continue.
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(638) [sender=3.1.2]
***@***.*** bin]#
```
I'm not sure why this happens. Here's what the fpart log looks like (this wasn't the exact same run, but the output looked the same; I didn't save the original):
```
***@***.*** ~]# cat /root/.parsyncfp-backups/fpcache/fpart.log.00.40.37_2020-04-29
Examining filesystem...
error parsing input values:
Filled part #0: size = 0, 1 file(s)
1 file(s) found.
```
I'm not sure what's going on here, or how to debug it. Originally, I didn't have `stats` and `scut` installed. One complication on my system is that a `stats` command belonging to Confluent is installed and in the PATH. So I changed the PATH with `export PATH=...` so that it no longer contains the Confluent directory with its `stats`, but does contain `/usr/local/bin`, which is where I have the `stats`, `scut`, and `fpart` that `parsyncfp` requires. But I'm not sure whether that isn't working out somehow, or whether something else is going wrong.
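For reference, this is roughly how I'm adjusting and checking the lookup order (a sketch; I'm not showing my exact export line):

```
# Prepend /usr/local/bin so its stats, scut, and fpart shadow
# anything else in the PATH (e.g. Confluent's stats):
export PATH=/usr/local/bin:$PATH

# List every match in lookup order; the first hit is what parsyncfp gets:
type -a stats scut fpart
command -v stats    # should print /usr/local/bin/stats
```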
The only other thing I can think of is that the file specified by `--fromlist=` is actually formatted with a space between the file size and the path, instead of the <tab> specified in the documentation/usage info. This has never been a problem before, however.
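If that turns out to matter, the list is easy to convert (a sketch, assuming each line is `SIZE PATH` with a single space after the size; the filename is illustrative):

```
# Replace only the first space on each line with a tab, so
# "SIZE PATH" becomes "SIZE<tab>PATH" even if PATH contains spaces.
awk '{ sub(/ /, "\t"); print }' snapshot.list.allfiles.clean > snapshot.list.allfiles.tabbed
```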
For the moment, I've dropped back to 1.61 to continue my backups. Never tried 1.62.
Any help is welcome whenever it comes; no real rush for us. I won't be trying anything until next Tuesday most likely (we run this weekly at 23:00 EST/EDT on Tuesdays).
Hi Ryan,
Hmm - I haven't duplicated your bug, but I can see where this kind of
situation could go wrong.
If you ask for 12 rsyncs and only provide enough work for 1, it might very
well be that it hangs waiting for enough work for 12. It does give the
warning, but I can't quite remember how it handles that case. I'll check
in the morning. Again, too late for touching code here.
Actually, this is part of the same code refactoring I was thinking about
when I mentioned the faster startup, so it's a timely bug.
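To sketch the failure mode I have in mind (illustrative shell only, not parsyncfp's actual Perl; the cache path is from your run above): a startup loop that insists on one chunk file per requested rsync would spin forever when fpart emits only f.0.

```
# Schematic: wait until fpart has written one chunk file per rsync.
# With -NP 12 but only f.0 ever produced, this loop never exits.
NP=12
while [ "$(ls /root/.parsyncfp-backups/fpcache/f.* 2>/dev/null | wc -l)" -lt "$NP" ]; do
    sleep 1
done
```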
More tomorrow.
hjm
I'll have to have it try that transfer again when it wraps up work on that filesystem, to see what the content of the fpart files is; the amount of source data changes from week to week, theoretically, so it might not happen next week. I never considered what might happen if the parts are broken up into 10TB chunks and there aren't more than a couple of them.
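For a rough sense of scale: at `--chunksize=10T`, fpart cuts roughly one chunk file per 10 TB of listed data, so the tree would need well over 120 TB before all 12 rsyncs get their own chunk. A quick check against the fromlist (a sketch; assumes the size is column 1 in bytes, and the filename is illustrative):

```
# Sum the size column and estimate how many 10 TB chunks fpart
# would produce; you want roughly -NP of them (12 here) or more.
awk 'BEGIN { c = 10 * 2^40 }
     { total += $1 }
     END { printf "%.1f TB listed -> ~%d chunk(s) at 10 TB\n",
                  total / 2^40, int((total + c - 1) / c) }' \
    snapshot.list.allfiles.clean
```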
1.67 seems to have solved this problem.