
Very slow xvector computation with all time spent on compilation #4271

Open · nshmyrev opened this issue Sep 19, 2020 · 11 comments
Labels: bug, stale

Comments

@nshmyrev (Contributor) commented Sep 19, 2020

While running Voxceleb with different architectures, I noticed that xvector extraction is very slow:

nnet3-xvector-compute --verbose=0 --use-gpu=no --min-chunk-size=25 --chunk-size=10000 \
--cache-capacity=64 "nnet3-copy \
--nnet-config=exp/xvector_nnet_1a/extract.config \
exp/xvector_nnet_1a/final.raw - |" "ark:apply-cmvn-sliding \
--norm-vars=false --center=true --cmn-window=300 \
scp:feats.scp ark:- | select-voiced-frames \
ark:- scp,s,cs:data/voxceleb1_test/split20/1/vad.scp ark:- |" \
ark,scp:exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.ark,exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.scp 
...
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00008, using chunk size  of 1136
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00009, using chunk size  of 812
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00010, using chunk size  of 456
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00011, using chunk size  of 420
....

LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00019, using chunk size  of 764
LOG (select-voiced-frames[5.5.669~1-b1d80]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 19 utterances, 0 had errors.
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:238) Time taken 15.0148s: real-time factor assuming 100 frames/sec is 0.108457
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:241) Done 19 utterances, failed for 0
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 12.9 seconds taken in nnet3 compilation total (breakdown: 12.7 compilation, 0.0195 optimization, 0 shortcut expansion, 0.0045 checking, 2.86e-06 computing indexes, 0.108 misc.) + 0 I/O.

Note that of the 15.0148 s of execution, 12.7 s were spent on compilation. A profiler confirms the issue: only 10% of the time is in the actual neural network computation.

It seems to be related to the variable length of the chunks: if I submit chunks of equal size with --min-chunk-size=400 --chunk-size=400, the computation is much faster and compilation is done only once.

I wonder what the proper approach to speed this up would be:

  1. Fix something inside the nnet3 compiler so it does not recompile the same computation again and again.
  2. Quantize chunk lengths to a fixed set of widths (probably steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively (see the sketch after this list).
  3. I see there is also nnet3-xvector-compute-batched, but it suffers from the same issue. Is it supposed to work faster?
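
For illustration, a minimal sketch of option 2 (hypothetical helper, not code from nnet3-xvector-compute.cc): round each utterance's chunk length down to a multiple of 100 frames, so the compiler cache only ever sees a small, fixed set of distinct chunk sizes. The trailing num_rows % 100 frames of each utterance are dropped, which is the trade-off discussed below.

#include <algorithm>
#include <cstdint>

// Hypothetical helper (not in Kaldi): quantize a requested chunk length to
// a multiple of `step` frames.  Assumes num_rows >= min_chunk; the last
// num_rows % step frames of the utterance are simply dropped.
int32_t QuantizedChunkSize(int32_t num_rows, int32_t min_chunk = 100,
                           int32_t max_chunk = 10000, int32_t step = 100) {
  int32_t chunk = std::min(num_rows, max_chunk);
  chunk = (chunk / step) * step;      // round down to a multiple of step
  return std::max(chunk, min_chunk);  // never fall below the minimum chunk
}

// With the utterances from the log above: 1136 -> 1100, 812 -> 800,
// 456 -> 400, 420 -> 400, i.e. four utterances but only three distinct
// computations to compile.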
nshmyrev added the bug label Sep 19, 2020
nshmyrev changed the title from "Very slow xvector computation" to "Very slow xvector computation with all time spent on compilation" Sep 19, 2020
@nshmyrev (Contributor, Author)

  2. Quantize chunk lengths to a fixed set of widths (probably steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively.

This idea drops a few frames, but in general it works pretty fast.

@danpovey (Contributor) commented Sep 19, 2020 via email

@nshmyrev (Contributor, Author) commented Sep 19, 2020

@danpovey it is the default chunk size in the voxceleb recipe, and it does improve accuracy over smaller chunks (I tried 400 instead of 10000; the EER is usually somewhat higher).

Also, if we set the chunk size to 400, do we need a cache capacity of 400 too, so that all the computations can be cached? It is 64 by default.
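
As a back-of-envelope check (my arithmetic, under the assumption that every distinct chunk length yields its own computation): with --min-chunk-size=25 and --chunk-size=400, the remainder chunk of each utterance can take any length between 25 and 400 frames, so up to 376 distinct computations may be requested, and the default --cache-capacity=64 can evict entries before they are reused. A capacity of roughly 400, or quantized chunk lengths as sketched above, would let every computation stay cached.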

@nshmyrev (Contributor, Author)

Overall, I don't quite like the way chunks are allocated in nnet3-xvector-compute: it looks like we first cut full slices of, say, 400 frames, and then average them with a very tiny 25-frame chunk at the end. I would rather arrange the chunks more uniformly while keeping their size the same, perhaps just by changing the hop size.
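
A sketch of that uniform arrangement (hypothetical helper describing the proposal, not what nnet3-xvector-compute does today; assumes num_rows >= chunk_size): keep every chunk the same size and spread the start offsets evenly, so neighbouring chunks overlap slightly instead of leaving a tiny tail chunk.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical: place n equal-size chunks with a uniform hop so they cover
// [0, num_rows) exactly; neighbouring chunks overlap instead of producing a
// short remainder chunk at the end.
std::vector<int32_t> UniformChunkOffsets(int32_t num_rows, int32_t chunk_size) {
  int32_t n = (num_rows + chunk_size - 1) / chunk_size;  // ceil division
  std::vector<int32_t> offsets;
  if (n == 1) { offsets.push_back(0); return offsets; }
  double hop = static_cast<double>(num_rows - chunk_size) / (n - 1);
  for (int32_t i = 0; i < n; i++)
    offsets.push_back(static_cast<int32_t>(std::round(i * hop)));
  return offsets;
}

// E.g. num_rows = 1136, chunk_size = 400 gives offsets {0, 368, 736}:
// three full 400-frame chunks with ~32 frames of overlap each, instead of
// 400 + 400 + 336 or a tiny trailing chunk.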

@danpovey (Contributor) commented Sep 21, 2020 via email

@nshmyrev (Contributor, Author)

Yes, I am looking into this. Unfortunately I see a small degradation here (EER 2.9 -> 3.0 compared to the baseline). I am not sure about the reason; I am investigating it.

@entn-at (Contributor) commented Sep 22, 2020

I believe there's also nnet3-xvector-compute-batched, which does its own chunking of the audio. The mean.vec/LDA/PLDA backend would likely have to be retrained on xvectors extracted with this binary, as they won't be the same as the ones computed by nnet3-xvector-compute.

@gorinars (Contributor)

I believe we hit this issue a few years ago, and a good speed-up was achieved by pre-computing the cache once and saving it to a file. If you pre-compute it for all segment lengths, there is no compilation overhead at inference time. I am not sure whether that was worth having in Kaldi master, but the reading part was in https://github.com/kaldi-asr/kaldi/pull/2303/files# . Precomputing the cache should be quite straightforward; I might have a small binary doing this if it's useful.
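
A rough sketch of such a pre-computation pass (assumptions: the WriteCache call is the presumed counterpart of the ReadCache support from PR #2303 and may not be in Kaldi master, so verify the exact names and signatures against your branch; the per-chunk ComputationRequest below mirrors the one nnet3-xvector-compute builds):

// Hypothetical pre-compilation pass: compile one computation per allowed
// chunk length, then serialize the compiler cache so inference-time
// binaries (via the cache-reading support from PR #2303) start warm.
#include <fstream>
#include "nnet3/nnet-optimize.h"

void PrecomputeCache(const kaldi::nnet3::Nnet &nnet,
                     const kaldi::nnet3::NnetOptimizeOptions &opts,
                     const std::string &cache_filename) {
  using namespace kaldi;
  using namespace kaldi::nnet3;
  CachingOptimizingCompilerOptions compiler_opts;
  compiler_opts.cache_capacity = 200;  // must exceed the number of requests
  CachingOptimizingCompiler compiler(nnet, opts, compiler_opts);
  for (int32 chunk = 100; chunk <= 10000; chunk += 100) {
    ComputationRequest request;
    request.need_model_derivative = false;
    request.store_component_stats = false;
    // One input spanning `chunk` frames and a single-frame output, as in
    // the per-chunk request built by nnet3-xvector-compute.cc.
    request.inputs.push_back(IoSpecification("input", 0, chunk));
    IoSpecification output_spec;
    output_spec.name = "output";
    output_spec.has_deriv = false;
    output_spec.indexes.resize(1);
    request.outputs.push_back(output_spec);
    compiler.Compile(request);  // compiles and caches this chunk size
  }
  std::ofstream os(cache_filename, std::ios::binary);
  // Assumed serialization hook; check that WriteCache exists in your
  // Kaldi branch before relying on this.
  compiler.WriteCache(os, /*binary=*/true);
}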

stale bot commented Dec 9, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label Dec 9, 2020
@nshmyrev (Contributor, Author)

stale bot removed the stale label Dec 13, 2020
stale bot commented Feb 11, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label Feb 11, 2021