overridable --max-jobs-run and compatibility with align-text checked in validate_data_dir.sh #4094

Merged
merged 16 commits into from
Jun 18, 2020


o-alexandre-felipe
Contributor

The current implementation uses the last value passed to --max-jobs-run when the option is given more than once.

I interpret the --max-jobs-run argument as a constraint: the pair of constraints
jobs-run <= 4 and jobs-run <= 16 is satisfied by jobs-run=4, but not by jobs-run=16.

So if I run
run.pl --max-jobs-run 4 --max-jobs-run 16

I would expect the number of jobs to be at most 4.

The need for this change arose when trying to limit the number of jobs of a script externally.

For instance, I set

run_cmd="run.pl --max-jobs-run 4"  # I have a reason for that, e.g. I don't want to run out of memory
and internally the script runs something like
$run_cmd --max-jobs-run 16

The previous version of run.pl would start 16 jobs, while the new version will run only 4.
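
The constraint interpretation described above can be sketched in shell. This is only an illustration, not the change made to run.pl: `effective_max_jobs` is a hypothetical helper, and printing `unlimited` when the option is absent is an assumption made for the sketch.

```shell
#!/usr/bin/env bash
# Sketch only: effective_max_jobs is a hypothetical helper, not part of run.pl.
# It scans an argument list and prints the smallest value given to
# --max-jobs-run, or "unlimited" (an assumption) if the option never appears.
effective_max_jobs() {
  local limit=""
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--max-jobs-run" ] && [ "$#" -ge 2 ]; then
      # Every occurrence is a constraint, so keep the tightest one.
      if [ -z "$limit" ] || [ "$2" -lt "$limit" ]; then
        limit=$2
      fi
      shift 2
    else
      shift
    fi
  done
  echo "${limit:-unlimited}"
}

run_cmd="run.pl --max-jobs-run 4"   # limit chosen by the user
# The unquoted expansion is intentional so that $run_cmd splits into words.
effective_max_jobs $run_cmd --max-jobs-run 16
```

With the two limits above the helper prints 4, whichever order the options arrive in.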

@jtrmal
Contributor

jtrmal commented Jun 4, 2020 via email

@o-alexandre-felipe
Contributor Author

Actually, I am just extending the current functionality so that it does not behave unexpectedly.

Many scripts receive a run_script option, which has the side effect of allowing the user to pass a parameterized run_script without rewriting the source code.

If you think this should not be supported, it would be worth producing an error when the script is invoked with more than one value for --max-jobs-run (and maybe for other options as well).

@o-alexandre-felipe
Contributor Author

align-text splits the text on whitespace characters [ \t\n\r\f\v], and it fails if any character is not printable.
I am adding an option to check the compatibility of the data directory in egs/wsj/s5/utils/validate_data_dir.sh.

Additional improvements

  • Replaced a chain of if blocks with a case statement.
  • Replaced a needlessly complicated verification of order and uniqueness with a single sort -uc command.
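
The two checks mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the code from this pull request: both function names are hypothetical, and LC_ALL=C is chosen only to keep the example deterministic (it treats every non-ASCII byte as non-printable, which a real data directory with UTF-8 text would not want).

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and are not the code from this PR.

# Succeeds if the first field (the key) of every line is in strictly
# increasing byte order, i.e. the file is sorted and has no duplicate keys.
is_sorted_and_unique() {
  awk '{print $1}' "$1" | LC_ALL=C sort -uc >/dev/null 2>&1
}

# Prints the number of lines containing a character that is neither
# printable nor whitespace. LC_ALL=C is an assumption of this sketch.
count_non_printable_lines() {
  LC_ALL=C grep -c '[^[:print:][:space:]]' "$1"
}

tmp=$(mktemp -d)
printf '1 this\n2 is\n3 my\n4 test\n' > "$tmp/good"
printf '1 this\n3 is\n3 my\n' > "$tmp/repeat_id"
printf '1 this\n2 i\001s\n' > "$tmp/non_print"

is_sorted_and_unique "$tmp/good" && echo "good: sorted and unique"
is_sorted_and_unique "$tmp/repeat_id" || echo "repeat_id: not sorted or has duplicates"
echo "non_print: $(count_non_printable_lines "$tmp/non_print") line(s) with non-printable characters"
rm -rf "$tmp"
```

Combining -u with -c makes sort verify strict ordering, so one pass covers both the order check and the duplicate check.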

@o-alexandre-felipe changed the title from "Prevent run.pl from overriding --max-jobs-run argument" to "overridable --max-jobs-run and compatibility with align-text checked in validate_data_dir.sh" on Jun 16, 2020
@danpovey
Contributor

I would be OK to merge this, if it's tested and you're sure it won't break any existing setup.

@o-alexandre-felipe
Contributor Author

> I would be OK to merge this, if it's tested and you're sure it won't break any existing setup.

I am running the tests. I will make some additional changes and post the results here.

@danpovey
Contributor

danpovey commented Jun 18, 2020 via email

@o-alexandre-felipe
Contributor Author

These are the results of the tests I am running locally for validate_data_dir.sh:

# the proposed solution checking for non-printable characters
time script/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'
script/validate_data_dir.sh: text contains 16 lines with non-printable characters

real	0m5.333s
user	0m2.352s
sys	0m0.341s
Fail
# Proposed, backward compatible (simply uses sort -c to check file order)
time script/validate_data_dir.sh --no-feats --non-print . && echo 'Pass' || echo 'Fail'
script/validate_data_dir.sh: Successfully validated data-directory .

real	1m35.941s
user	1m13.856s
sys	0m1.888s
Pass
# Current version
time utils/validate_data_dir.sh --no-feats . && echo 'Pass' || echo 'Fail'
utils/validate_data_dir.sh: Successfully validated data-directory .

real	1m39.713s
user	1m15.447s
sys	0m2.439s
Pass
### Unit tests
mkdir -p good
cat > good/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF

cat > good/text <<EOF
1 this
2 is
3 my
4 test
EOF

cat > good/utt2spk <<EOF
1 1
2 1
3 2
4 2
EOF
# some text lines are repeated but have different utterance-ids

mkdir -p repeat_text
cat > repeat_text/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
5 /dev/null
EOF

cat > repeat_text/text <<EOF
1 this
2 test
3 is
4 my
5 test
EOF

cat > repeat_text/utt2spk <<EOF
1 1
2 1
3 2
4 2
5 3
EOF
# some utterance-ids are repeated but their values are different

mkdir -p repeat_id
cat > repeat_id/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
3 /dev/null2
5 /dev/null
EOF

cat > repeat_id/text <<EOF
1 this
2 test
3 is
3 my
5 test
EOF

cat > repeat_id/utt2spk <<EOF
1 1
2 1
3 2
3 3
5 3
EOF
mkdir -p speaker_out_of_order
cat > speaker_out_of_order/wav.scp <<EOF
1 /dev/null
2 /dev/null
3 /dev/null
4 /dev/null
EOF

cat > speaker_out_of_order/text <<EOF
1 this
2 is
3 my
4 test
EOF

cat > speaker_out_of_order/utt2spk <<EOF
1 1
2 2
3 1
4 1
EOF
for dir in good repeat_text repeat_id speaker_out_of_order; do
  echo "###### $dir ########"
  for ver in script utils ; do
    utils/utt2spk_to_spk2utt.pl "$dir/utt2spk" > "$dir/spk2utt"
    "$ver/validate_data_dir.sh" --no-feats "$dir"
  done
done 2>&1
###### good ########
script/validate_data_dir.sh: Successfully validated data-directory good
utils/validate_data_dir.sh: Successfully validated data-directory good
###### repeat_text ########
script/validate_data_dir.sh: Successfully validated data-directory repeat_text
utils/validate_data_dir.sh: Successfully validated data-directory repeat_text
###### repeat_id ########
script/validate_data_dir.sh: file repeat_id/utt2spk is not sorted or has duplicates
utils/validate_data_dir.sh: file repeat_id/utt2spk is not in sorted order or has duplicates
###### speaker_out_of_order ########
script/validate_data_dir.sh: utt2spk is not in sorted order when sorted first on speaker-id 
(fix this by making speaker-ids prefixes of utt-ids)
utils/validate_data_dir.sh: utt2spk is not in sorted order when sorted first on speaker-id 
(fix this by making speaker-ids prefixes of utt-ids)

Adding the check for non-printable characters costs about 1.5 seconds, but using sort -C to check the order more than compensates, reducing the overall runtime by about 4%.

I was wondering if there is a directory for pushing tests.

@o-alexandre-felipe
Contributor Author

SLEEP_TIME=1
function run
{
  # the type of implementation that lets the caller freely set any number of jobs
  /usr/bin/time -f "%E ellapsed at run with $1" $1 \
     JOB=1:32 tmp/JOB.log  sleep ${SLEEP_TIME}\; echo JOB\;
}
function run8
{
  # the type of implementation that prevents the caller from controlling the parallelism
  /usr/bin/time -f "%E ellapsed at run8 with $1" $1 --max-jobs-run 8 \
      JOB=1:32 tmp/JOB.log  sleep ${SLEEP_TIME}\; echo JOB\;
}
for script in run run8; do
  for ver in ../utils/parallel ../script ; do
    for J1 in 32 8 2; do
      run_cmd="$ver/run.pl --max-jobs-run ${J1}"
      $script "$run_cmd"
    done
  done
done
0:01.88 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 32
0:06.28 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 8
0:23.40 ellapsed at run with ../utils/parallel/run.pl --max-jobs-run 2
0:01.81 ellapsed at run with ../script/run.pl --max-jobs-run 32
0:05.89 ellapsed at run with ../script/run.pl --max-jobs-run 8
0:24.17 ellapsed at run with ../script/run.pl --max-jobs-run 2
0:06.29 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 32
0:05.70 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 8
0:05.60 ellapsed at run8 with ../utils/parallel/run.pl --max-jobs-run 2
0:06.08 ellapsed at run8 with ../script/run.pl --max-jobs-run 32
0:05.72 ellapsed at run8 with ../script/run.pl --max-jobs-run 8
0:23.71 ellapsed at run8 with ../script/run.pl --max-jobs-run 2

Explanation

Run times are ~2 seconds when running 32 jobs in parallel, ~6 seconds when running 8 jobs in parallel, and ~24 seconds when running 2 jobs in parallel.

The run function simulates a script that does not pass the --max-jobs-run option to the run script.
The run8 function simulates a script that passes --max-jobs-run 8 to the run script.

In each test we pass a run script that is either the proposed (../script/run.pl) or the current (../utils/parallel/run.pl) version of run.pl, and we specify the desired number of jobs.

When we invoke the run function, both versions behave the same.

When we invoke the run8 function with the current version, the script uses 8 parallel jobs regardless of the --max-jobs-run value specified outside it.

When we invoke run8 with the proposed version, --max-jobs-run cannot be raised above 8 (run8 "../script/run.pl --max-jobs-run 32" took ~6 seconds, consistent with the run time for 8 parallel jobs), but it can be lowered below 8 (run8 "../script/run.pl --max-jobs-run 2" took ~24 seconds, consistent with the run time for 2 parallel jobs).
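
The run8 timings can be read as the difference between two policies for combining the outer and inner limits. The sketch below only reproduces that reasoning; `current_policy` and `proposed_policy` are hypothetical names, and neither function is taken from run.pl.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers that model the two behaviours, given the
# outer limit (set in run_cmd) as $1 and the inner limit (set by the script) as $2.

# Current behaviour: the last value on the command line wins, i.e. the inner one.
current_policy() {
  echo "$2"
}

# Proposed behaviour: every value is a constraint, so the smaller one wins.
proposed_policy() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

inner=8
for outer in 32 8 2; do
  echo "outer=$outer inner=$inner current=$(current_policy "$outer" "$inner") proposed=$(proposed_policy "$outer" "$inner")"
done
```

Only the outer=2 row differs between the two policies (8 jobs versus 2), which is the one run8 case where the measured times diverge (~6 seconds versus ~24 seconds).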

@danpovey
Contributor

Thanks! We don't have a place to put tests for these kinds of scripts, and we won't be adding one at this point. Let us know when you are confident it's ready to merge.

@o-alexandre-felipe
Contributor Author

Ready to merge

@danpovey danpovey merged commit 31c2bae into kaldi-asr:master Jun 18, 2020