overridable --max-jobs-run and compatibility with align-text checked in validate_data_dir.sh #4094

o-alexandre-felipe · 2020-06-04T17:22:39Z

The current implementation honors only the last value passed to --max-jobs-run.

I interpret the max-jobs-run argument as a constraint: jobs-run <= 4 and jobs-run <= 16
are both satisfied by jobs-run=4, but not by jobs-run=16.

So if I run
run.pl --max-jobs-run 4 --max-jobs-run 16

I would expect the number of jobs to be at most 4.

The need for this change arose when I tried to limit the number of jobs of a script externally.

For instance, I set

run_cmd="run.pl --max-jobs-run 4" # I have a reason for that, e.g. I don't want to run out of memory
and internally the script runs something like
$run_cmd --max-jobs-run 16

The previous version of run.pl would start 16 jobs, while the new version will run only 4.
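
The "honor the minimum" rule described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not the Perl code in run.pl; the function name effective_max_jobs is hypothetical.

```shell
# Hypothetical helper (not part of run.pl): print the smallest value among
# all --max-jobs-run options, or an empty line if the option never appears.
effective_max_jobs() {
  limit=""
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--max-jobs-run" ] && [ "$#" -ge 2 ]; then
      # Keep the tightest constraint seen so far.
      if [ -z "$limit" ] || [ "$2" -lt "$limit" ]; then
        limit=$2
      fi
      shift   # skip the option name; the shift below skips its value
    fi
    shift
  done
  echo "$limit"
}

effective_max_jobs --max-jobs-run 4 --max-jobs-run 16   # prints 4
```

With this rule the order of the options does not matter: an outer limit of 4 survives an inner request for 16, and an inner limit of 8 survives an outer request for 32.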

Fixed formatting of an error message.


jtrmal · 2020-06-04T20:08:38Z

Sorry, I don't think this is useful. This is not how command-line arguments usually work: you sometimes need to override something that was set up earlier (as a default). y.


jtrmal · 2020-06-04T20:16:38Z

Actually, I understand your use case; I just feel you should add an extra parameter, since you are kind of overloading the meaning of "max-jobs-run". Perhaps --limit-jobs N? y.


o-alexandre-felipe · 2020-06-14T20:17:12Z

Actually, I am just extending the current functionality in a way that avoids unexpected behavior.

Many scripts accept a run_script option, which has the side effect of allowing the user to pass a parameterized run_script without rewriting the source code.

If you think this should not be supported, it would be worth producing an error when the script is invoked with more than one --max-jobs-run option (and maybe other options as well).

update 2020-06-16

o-alexandre-felipe · 2020-06-16T20:12:45Z

align-text splits the text on whitespace characters [ \t\n\r\f\v] and fails if some character is not printable.
I am adding an option to check the data directory's compatibility with it in egs/wsj/s5/utils/validate_data_dir.sh.
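
As a rough illustration of what such a check might look like, the sketch below counts lines that contain ASCII control characters outside the whitespace set listed above. It is an assumption-laden example, not the code added to validate_data_dir.sh; the function name count_nonprintable_lines is hypothetical, and bytes above 0x7f (e.g. UTF-8 text) are deliberately not flagged.

```shell
# Hypothetical helper: count the lines of a Kaldi-style text file that contain
# ASCII control characters other than the whitespace set [ \t\n\r\f\v].
count_nonprintable_lines() {
  # Bracket expression covering bytes 0x01-0x08, 0x0e-0x1f and 0x7f (DEL),
  # built with octal escapes so it also works in a plain POSIX shell.
  pattern=$(printf '[\001-\010\016-\037\177]')
  # LC_ALL=C makes the ranges byte-wise; grep -c exits 1 on a zero count,
  # so "|| true" keeps the function usable under "set -e".
  LC_ALL=C grep -c "$pattern" "$1" || true
}
```

A caller could compare the printed count against zero and report the number of offending lines.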

Additional improvements

Replaced a chain of if blocks with a case statement.
Replaced a needlessly complicated check of order and uniqueness with a single sort -uc command.
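
For readers unfamiliar with the idiom, here is a minimal sketch of checking order and uniqueness in one pass with sort -u -c. The function name, the -k1,1 key restriction and the message text are assumptions for illustration; they are not copied from validate_data_dir.sh.

```shell
# Hypothetical helper: succeed only if the first field of every line is in
# strictly increasing C-locale order, i.e. sorted and without duplicate keys.
check_sorted_and_uniq() {
  # -c checks instead of sorting; combined with -u it also rejects equal keys.
  if ! LC_ALL=C sort -k1,1 -u -c "$1" 2>/dev/null; then
    echo "file $1 is not sorted or has duplicates"
    return 1
  fi
}
```

Because sort -c stops at the first line that is out of order, it avoids sorting the file and comparing the result, which is where a speed-up would plausibly come from.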

danpovey · 2020-06-17T12:36:45Z

I would be OK to merge this, if it's tested and you're sure it won't break any existing setup.

o-alexandre-felipe · 2020-06-18T06:46:26Z

> I would be OK to merge this, if it's tested and you're sure it won't break any existing setup.

I am running the tests; I will make some additional changes and post here.

danpovey · 2020-06-18T06:54:25Z

Thanks!


o-alexandre-felipe · 2020-06-18T08:00:31Z

These are the results of the tests I am running locally for validate_data_dir.sh

# the proposed solution checking for non-printable characters
time script/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'

script/validate_data_dir.sh: text contains 16 lines with non-printable characters

real	0m5.333s
user	0m2.352s
sys	0m0.341s
Fail

# Proposed, backward compatible (simply use sort -c to check file order)
time script/validate_data_dir.sh --no-feats --non-print . && echo 'Pass' || echo 'Fail'

script/validate_data_dir.sh: Successfully validated data-directory .

real	1m35.941s
user	1m13.856s
sys	0m1.888s
Pass

# Current version
time utils/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'

utils/validate_data_dir.sh: Successfully validated data-directory .

real	1m39.713s
user	1m15.447s
sys	0m2.439s
Pass

### Unit tests

mkdir -p good
cat > good/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF

cat > good/text <<EOF
1 this
2 is
3 my
4 test
EOF

cat > good/utt2spk <<EOF
1 1
2 1
3 2
4 2
EOF

# some text lines are repeated but have different utterance-ids

mkdir -p repeat_text
cat > repeat_text/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
5 /dev/null
EOF

cat > repeat_text/text <<EOF
1 this
2 test
3 is
4 my
5 test
EOF

cat > repeat_text/utt2spk <<EOF
1 1
2 1
3 2
4 2
5 3
EOF

# some utterance-ids are repeated but their values are different

mkdir -p repeat_id
cat > repeat_id/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
3 /dev/null2
5 /dev/null
EOF

cat > repeat_id/text <<EOF
1 this
2 test
3 is
3 my
5 test
EOF

cat > repeat_id/utt2spk <<EOF
1 1
2 1
3 2
3 3
5 3
EOF

mkdir -p speaker_out_of_order
cat > speaker_out_of_order/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF

cat > speaker_out_of_order/text <<EOF
1 this
2 is
3 my
4 test
EOF

cat > speaker_out_of_order/utt2spk <<EOF
1 1
2 2
3 1
4 1
EOF

for dir in good repeat_text repeat_id speaker_out_of_order; do
  echo "###### $dir ########"
  for ver in script utils ; do
    utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt
    $ver/validate_data_dir.sh --no-feats $dir
  done
done 2>&1

###### good ########
script/validate_data_dir.sh: Successfully validated data-directory good
utils/validate_data_dir.sh: Successfully validated data-directory good
###### repeat_text ########
script/validate_data_dir.sh: Successfully validated data-directory repeat_text
utils/validate_data_dir.sh: Successfully validated data-directory repeat_text
###### repeat_id ########
script/validate_data_dir.sh: file repeat_id/utt2spk is not sorted or has duplicates
utils/validate_data_dir.sh: file repeat_id/utt2spk is not in sorted order or has duplicates
###### speaker_out_of_order ########
script/validate_data_dir.sh: utt2spk is not in sorted order when sorted first on speaker-id 
(fix this by making speaker-ids prefixes of utt-ids)
utils/validate_data_dir.sh: utt2spk is not in sorted order when sorted first on speaker-id 
(fix this by making speaker-ids prefixes of utt-ids)


Adding the check for non-printable characters cost about 1.5 seconds, but using sort -C to check order compensated for it, reducing the overall runtime by about 4%.

I was wondering if there is a directory for pushing tests.

o-alexandre-felipe · 2020-06-18T09:49:17Z

SLEEP_TIME=1

function run
{
  # the type of implementation that allows us to freely set any number of jobs
  /usr/bin/time -f "%E ellapsed at run with $1" $1 \
     JOB=1:32 tmp/JOB.log  sleep ${SLEEP_TIME}\; echo JOB\;
}

function run8 
{
  # the type of implementation that prevents the user from controlling the parallelism
  /usr/bin/time -f"%E ellapsed at run8 with $1" $1 --max-jobs-run 8 \
      JOB=1:32 tmp/JOB.log  sleep ${SLEEP_TIME}\; echo JOB\;
}

for script in run run8; do
  for ver in ../utils/parallel ../script; do
    for J1 in 32 8 2; do
      run_cmd="$ver/run.pl --max-jobs-run ${J1}"
      $script "$run_cmd"
    done
  done
done

0:01.88 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 32
0:06.28 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 8
0:23.40 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 2
0:01.81 ellapsed at run with ../script/run.pl --max-jobs-run 32
0:05.89 ellapsed at run with ../script/run.pl --max-jobs-run 8
0:24.17 ellapsed at run with ../script/run.pl --max-jobs-run 2
0:06.29 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 32
0:05.70 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 8
0:05.60 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 2
0:06.08 ellapsed at run8 with ../script/run.pl --max-jobs-run 32
0:05.72 ellapsed at run8 with ../script/run.pl --max-jobs-run 8
0:23.71 ellapsed at run8 with ../script/run.pl --max-jobs-run 2

Explanation

We can see run times of ~2 seconds when running 32 jobs in parallel, ~6 seconds when running 8 jobs in parallel, and ~24 seconds when running 2 jobs in parallel.

The run function simulates a script that does not pass the --max-jobs-run option to the run script.
The run8 function simulates a script that passes --max-jobs-run 8 to the run script.

In each test we pass a run script that is either the proposed (../script/run.pl) or the current (../utils/parallel/run.pl) version of run.pl, and specify the desired number of jobs.

When we invoke the run function, both versions behave the same.

When we invoke the run8 function with the current version, the script uses 8 parallel jobs regardless of the --max-jobs-run value specified outside.

When we invoke run8 with the proposed version, --max-jobs-run cannot be raised above 8 (run8 "../script/run.pl --max-jobs-run 32" took ~6 seconds, consistent with the run time for 8 parallel jobs), but it can be lowered below 8 (run8 "../script/run.pl --max-jobs-run 2" took ~24 seconds, consistent with the run time for 2 parallel jobs).

danpovey · 2020-06-18T11:03:45Z

thanks! we don't have a place to put tests for these kinds of scripts. won't be adding one at this point. Let us know when you are confident it's ready to merge.

o-alexandre-felipe · 2020-06-18T11:11:31Z

Ready to merge

o-alexandre-felipe added 11 commits December 29, 2019 10:29

Fixed formatting of an error message.

d479b48

Merge pull request #1 from o-alexandre-felipe/o-alexandre-felipe-patch-1

3465627

Fixed formatting of an error message.

Merge pull request #2 from kaldi-asr/master

3c0ddbd

update my fork before new contribution

Minimal support for archlinux

cf5cc89

Minimal support for archlinux

c6a8c21

Minimal support for archlinux

3254267

Try MKLLIBDIR=$MKLROOT/lib as well

655de42

Merge pull request #3 from kaldi-asr/master

4edf7b2

[src,scripts,egs] Add "chain2" scripts which enable more flexible egs…

Revert "[src] Fix wrong error message format in make_lexicon_fst.py"

eb2bf67

This reverts commit f93c192.

if many given, honor the minimum --max-jobs-run

7c2b8d9

Merge pull request #4 from kaldi-asr/master

a9ad729

Refresh before changing

o-alexandre-felipe added 3 commits June 16, 2020 21:02

Check for text compatibility with align-text

d897eb1

Merge branch 'master' of https://github.com/o-alexandre-felipe/kaldi

82dd00b

Merge pull request #5 from kaldi-asr/master

197069f

update 2020-06-16

o-alexandre-felipe changed the title ~~Prevent run.pl from overriding --max-jobs-run argument~~ overridable --max-jobs-run and compatibility with align-text checked in validate_data_dir.sh Jun 16, 2020

o-alexandre-felipe added 2 commits June 18, 2020 10:45

check_sorted_and_uniq to compare only keys

4ef4c33

Merge branch 'master' of https://github.com/o-alexandre-felipe/kaldi

9c0c2e5

danpovey merged commit 31c2bae into kaldi-asr:master Jun 18, 2020

kkm000 mentioned this pull request Jun 23, 2020

Errors running utils/validate_data_dir.sh on CSJ corpus #4126

Closed

Sentewolf mentioned this pull request Apr 30, 2021

validate_data_dir.sh checks for empty variable which already exists in parent environment #4511

Closed
