-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overridable --max-jobs-run and compatibility with align-text checked in validate_data_dir.sh #4094
Conversation
Fixed formatting of an error message.
update my fork before new contribution
[src,scripts,egs] Add "chain2" scripts which enable more flexible egs…
This reverts commit f93c192.
Refresh before changing
sorry, I don't think this is useful -- this is not how usually the command
line argument work -- you sometimes need to override something that has
been set up before (as default)
y.
…On Thu, Jun 4, 2020 at 7:23 PM o-alexandre-felipe ***@***.***> wrote:
The current implementation was assuming the last value passed to
--max-jobs-run
The interpretation I am giving to the argument max-jobs-run is of a
constraint,
jobs-run <= 4, jobs-run <= 16, is satisfied with jobs-run=4, but not with
jobs-run=16
so if I run
run.pl --max-jobs-run 4 --max-jobs-run 16
I would expect that the number of jobs is at most 4
The need to change this arised when trying to limit the number of jobs of
a script externally
for instance I set
run_cmd=run.pl --max-jobs-run 4 # I have a reason for that, e.g. I don't
want to run out of memory
internally the script run something like
$run_cmd --max-jobs-run 16
The previous version of run.pl would start 16 jobs while the new version
will run only 4
------------------------------
You can view, comment on, or merge this pull request online at:
#4094
Commit Summary
- Fixed formatting of an error message.
- Merge pull request #1 from
o-alexandre-felipe/o-alexandre-felipe-patch-1
- Merge pull request #2 from kaldi-asr/master
- Minimal support for archlinux
- Minimal support for archlinux
- Minimal support for archlinux
- Try MKLLIBDIR=$MKLROOT/lib as well
- Merge pull request #3 from kaldi-asr/master
- Revert "[src] Fix wrong error message format in make_lexicon_fst.py"
- if many given, honor the minimum --max-jobs-run
- Merge pull request #4 from kaldi-asr/master
File Changes
- *M* egs/wsj/s5/utils/parallel/run.pl
<https://github.com/kaldi-asr/kaldi/pull/4094/files#diff-66af723956bde67946f8e42c380ca8f9>
(14)
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/4094.patch
- https://github.com/kaldi-asr/kaldi/pull/4094.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#4094>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACUKYXZTZFVBQD57VF4BWODRU7KAXANCNFSM4NS2P7KA>
.
|
actually, I understand your use case, I just feel like you should be adding
an extra parameter, as you are kinda overloading the meaning of
"max-jobs-run". Perhaps --limit-jobs N?
y.
…On Thu, Jun 4, 2020 at 10:07 PM Jan Trmal ***@***.***> wrote:
sorry, I don't think this is useful -- this is not how usually the command
line argument work -- you sometimes need to override something that has
been set up before (as default)
y.
On Thu, Jun 4, 2020 at 7:23 PM o-alexandre-felipe <
***@***.***> wrote:
> The current implementation was assuming the last value passed to
> --max-jobs-run
>
> The interpretation I am giving to the argument max-jobs-run is of a
> constraint,
> jobs-run <= 4, jobs-run <= 16, is satisfied with jobs-run=4, but not with
> jobs-run=16
>
> so if I run
> run.pl --max-jobs-run 4 --max-jobs-run 16
>
> I would expect that the number of jobs is at most 4
>
> The need to change this arised when trying to limit the number of jobs of
> a script externally
>
> for instance I set
>
> run_cmd=run.pl --max-jobs-run 4 # I have a reason for that, e.g. I don't
> want to run out of memory
> internally the script run something like
> $run_cmd --max-jobs-run 16
>
> The previous version of run.pl would start 16 jobs while the new version
> will run only 4
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
> #4094
> Commit Summary
>
> - Fixed formatting of an error message.
> - Merge pull request #1 from
> o-alexandre-felipe/o-alexandre-felipe-patch-1
> - Merge pull request #2 from kaldi-asr/master
> - Minimal support for archlinux
> - Minimal support for archlinux
> - Minimal support for archlinux
> - Try MKLLIBDIR=$MKLROOT/lib as well
> - Merge pull request #3 from kaldi-asr/master
> - Revert "[src] Fix wrong error message format in make_lexicon_fst.py"
> - if many given, honor the minimum --max-jobs-run
> - Merge pull request #4 from kaldi-asr/master
>
> File Changes
>
> - *M* egs/wsj/s5/utils/parallel/run.pl
> <https://github.com/kaldi-asr/kaldi/pull/4094/files#diff-66af723956bde67946f8e42c380ca8f9>
> (14)
>
> Patch Links:
>
> - https://github.com/kaldi-asr/kaldi/pull/4094.patch
> - https://github.com/kaldi-asr/kaldi/pull/4094.diff
>
> —
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly, view it on GitHub
> <#4094>, or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ACUKYXZTZFVBQD57VF4BWODRU7KAXANCNFSM4NS2P7KA>
> .
>
|
Actually I am just extending the current functionality, in a way that we don't have an unexpected behavior. Many scripts receive a run_script option, what is the side effect of allowing the user to send a parameterized run_script, without rewriting the source code. If you think this should not be supported it would be interesting to produce an error if the script is invoked with more than one option for --max-run-scripts (and maybe other options as well). |
align-text splits the text in space characters [ \t\n\r\f\v] and it fails if some character is not printable. Additional improvements
|
I would be OK to merge this, if it's tested and you're sure it won't break any existing setup. |
I am doing the tests, I will do some additional changes and post here. |
Thanks!
…On Thu, Jun 18, 2020 at 2:46 PM o-alexandre-felipe ***@***.***> wrote:
I would be OK to merge this, if it's tested and you're sure it won't break
any existing setup.
I am doing the tests, I will do some additional changes and post here.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#4094 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZFLO4FSMQMKBQNKKXBQQLRXGZ47ANCNFSM4NS2P7KA>
.
|
This is the results of the tests I am running locally for check_data_dir.sh # the proposed solution checking for non-printable characters
time script/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'
# Proposed backward compatible (simply use sort -c to check file order)'
time script/validate_data_dir.sh --no-feats --non-print . && echo 'Pass' || echo 'Fail'
# Current version
time utils/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'
mkdir -p good
cat > good/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF
cat > good/text <<EOF
1 this
2 is
3 my
4 test
EOF
cat > good/utt2spk <<EOF
1 1
2 1
3 2
4 2
EOF # some text lines are repeated but have different utterance-ids
mkdir -p repeat_text
cat > repeat_text/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
5 /dev/null
EOF
cat > repeat_text/text <<EOF
1 this
2 test
3 is
4 my
5 test
EOF
cat > repeat_text/utt2spk <<EOF
1 1
2 1
3 2
4 2
5 3
EOF # some utterance is repeated but their values are different
mkdir -p repeat_id
cat > repeat_id/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
3 /dev/null2
5 /dev/null
EOF
cat > repeat_id/text <<EOF
1 this
2 test
3 is
3 my
5 test
EOF
cat > repeat_id/utt2spk <<EOF
1 1
2 1
3 2
3 3
5 3
EOF mkdir -p speaker_out_of_order
cat > good/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF
cat > speaker_out_of_order/text <<EOF
1 this
2 is
3 my
4 test
EOF
cat > speaker_out_of_order/utt2spk <<EOF
1 1
2 2
3 1
4 1
EOF for dir in good repeat_text repeat_id speaker_out_of_order; do
echo "###### $dir ########"
for ver in script utils ; do
utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt
$ver/validate_data_dir.sh --no-feats $dir
done
done 2>1
|
SLEEP_TIME=1 function run
{
# the type of implementation that allows us to freely set any number of jobs
/usr/bin/time -f "%E ellapsed at run with $1" $1 \
JOB=1:32 tmp/JOB.log sleep ${SLEEP_TIME}\; echo JOB\;
} function run8
{
# the type prevents the user to control the parallelism
/usr/bin/time -f"%E ellapsed at run8 with $1" $1 --max-jobs-run 8 \
JOB=1:32 tmp/JOB.log sleep ${SLEEP_TIME}\; echo JOB\;
} for script in run run8; do
for ver in ../utils/parallel ../script ; do
for J1 in 32 8 2; do
run_cmd="$ver/run.pl --max-jobs-run ${J1}"
$script "$run_cmd"
done
done
done
ExplanationWe can see run times ~2 seconds when running 32 jobs in parallel, ~6 seconds when running 8 jobs in parallel, ~24 seconds when running 2 jobs in parallel. The run function simulates a script that does not pass the in each test we are passing a run script that is either the proposed ( When invoke run function, both scripts does the same. When we invoke the run8 function with the current version, regardless of When we invoke the run8 with the proposed version, the |
thanks! we don't have a place to put tests for these kinds of scripts. won't be adding one at this point. Let us know when you are confident it's ready to merge. |
Ready to merge |
The current implementation was assuming the last value passed to --max-jobs-run
The interpretation I am giving to the argument max-jobs-run is of a constraint,
jobs-run <= 4, jobs-run <= 16, is satisfied with jobs-run=4, but not with jobs-run=16
so if I run
run.pl --max-jobs-run 4 --max-jobs-run 16
I would expect that the number of jobs is at most 4
The need to change this arised when trying to limit the number of jobs of a script externally
for instance I set
The previous version of run.pl would start 16 jobs while the new version will run only 4