
Very slow xvector computation with all time spent on compilation #4271

Open · nshmyrev opened this issue Sep 19, 2020 · 11 comments
Labels: bug, stale

Comments

@nshmyrev (Contributor) commented Sep 19, 2020

While running Voxceleb with different architectures, I noticed that xvector extraction is very slow:

nnet3-xvector-compute --verbose=0 --use-gpu=no --min-chunk-size=25 --chunk-size=10000 \
--cache-capacity=64 "nnet3-copy \
--nnet-config=exp/xvector_nnet_1a/extract.config \
exp/xvector_nnet_1a/final.raw - |" "ark:apply-cmvn-sliding \
--norm-vars=false --center=true --cmn-window=300 \
scp:feats.scp ark:- | select-voiced-frames \
ark:- scp,s,cs:data/voxceleb1_test/split20/1/vad.scp ark:- |" \
ark,scp:exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.ark,exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.scp 
...
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00008, using chunk size  of 1136
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00009, using chunk size  of 812
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00010, using chunk size  of 456
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00011, using chunk size  of 420
....

LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00019, using chunk size  of 764
LOG (select-voiced-frames[5.5.669~1-b1d80]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 19 utterances, 0 had errors.
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:238) Time taken 15.0148s: real-time factor assuming 100 frames/sec is 0.108457
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:241) Done 19 utterances, failed for 0
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 12.9 seconds taken in nnet3 compilation total (breakdown: 12.7 compilation, 0.0195 optimization, 0 shortcut expansion, 0.0045 checking, 2.86e-06 computing indexes, 0.108 misc.) + 0 I/O.

Note that of the 15.0148 s of execution, 12.7 s were spent on compilation. A profiler confirms the issue: only 10% of the time is in the actual neural network computation.

It seems to be related to the variable length of the chunks: if I submit chunks of equal size with --min-chunk-size=400 --chunk-size=400, the computation is much faster and compilation is done only once.

I wonder what the proper approach to speed this up would be:

  1. Fix something inside the nnet3 compiler so it does not recompile the same computation again and again.
  2. Quantize chunk lengths to a fixed set of widths (probably steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively (see the sketch after this list).
  3. I see there is also nnet3-xvector-compute-batched, but it suffers from the same issue. Is it supposed to work faster?
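
For illustration, a minimal sketch of option 2 (hypothetical helper, not code from nnet3-xvector-compute.cc): round each utterance's chunk length down to a multiple of 100 frames, so the compiler cache only ever sees a small, fixed set of distinct chunk sizes. The trailing num_rows % 100 frames of each utterance are dropped, which is the trade-off discussed below.

#include <algorithm>
#include <cstdint>

// Hypothetical helper (not in Kaldi): quantize a requested chunk length to
// a multiple of `step` frames.  Assumes num_rows >= min_chunk; the last
// num_rows % step frames of the utterance are simply dropped.
int32_t QuantizedChunkSize(int32_t num_rows, int32_t min_chunk = 100,
                           int32_t max_chunk = 10000, int32_t step = 100) {
  int32_t chunk = std::min(num_rows, max_chunk);
  chunk = (chunk / step) * step;      // round down to a multiple of step
  return std::max(chunk, min_chunk);  // never fall below the minimum chunk
}

// With the utterances from the log above: 1136 -> 1100, 812 -> 800,
// 456 -> 400, 420 -> 400, i.e. four utterances but only three distinct
// computations to compile.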
nshmyrev added the bug label Sep 19, 2020
nshmyrev changed the title from "Very slow xvector computation" to "Very slow xvector computation with all time spent on compilation" Sep 19, 2020
@nshmyrev (Contributor, Author)

  2. Quantize chunk lengths to a fixed set of widths (probably steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively.

This idea drops a few frames, but in general it works pretty fast.

@danpovey (Contributor) commented Sep 19, 2020 via email

@nshmyrev (Contributor, Author) commented Sep 19, 2020

@danpovey it is the default chunk size in the voxceleb recipe, and it does improve accuracy over smaller chunks (I tried 400 instead of 10000; the EER is usually somewhat higher).

Also, if we set the chunk size to 400, do we need a cache capacity of 400 too, so that all the computations can be cached? It is 64 by default.
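
As a back-of-envelope check (my arithmetic, under the assumption that every distinct chunk length yields its own computation): with --min-chunk-size=25 and --chunk-size=400, the remainder chunk of each utterance can take any length between 25 and 400 frames, so up to 376 distinct computations may be requested, and the default --cache-capacity=64 can evict entries before they are reused. A capacity of roughly 400, or quantized chunk lengths as sketched above, would let every computation stay cached.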

@nshmyrev (Contributor, Author)

Overall, I don't quite like the way chunks are allocated in nnet3-xvector-compute: it looks like we first cut full slices of, say, 400 frames, and then average them with a very tiny 25-frame chunk at the end. I would rather arrange the chunks more uniformly while keeping their size the same, perhaps just by changing the hop size.
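
A sketch of that uniform arrangement (hypothetical helper describing the proposal, not what nnet3-xvector-compute does today; assumes num_rows >= chunk_size): keep every chunk the same size and spread the start offsets evenly, so neighbouring chunks overlap slightly instead of leaving a tiny tail chunk.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical: place n equal-size chunks with a uniform hop so they cover
// [0, num_rows) exactly; neighbouring chunks overlap instead of producing a
// short remainder chunk at the end.
std::vector<int32_t> UniformChunkOffsets(int32_t num_rows, int32_t chunk_size) {
  int32_t n = (num_rows + chunk_size - 1) / chunk_size;  // ceil division
  std::vector<int32_t> offsets;
  if (n == 1) { offsets.push_back(0); return offsets; }
  double hop = static_cast<double>(num_rows - chunk_size) / (n - 1);
  for (int32_t i = 0; i < n; i++)
    offsets.push_back(static_cast<int32_t>(std::round(i * hop)));
  return offsets;
}

// E.g. num_rows = 1136, chunk_size = 400 gives offsets {0, 368, 736}:
// three full 400-frame chunks with ~32 frames of overlap each, instead of
// 400 + 400 + 336 or a tiny trailing chunk.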

@danpovey (Contributor) commented Sep 21, 2020 via email

@nshmyrev (Contributor, Author)

Yes, I am looking into this. Unfortunately I see a small degradation here (EER 2.9 -> 3.0 compared to the baseline). I am not sure about the reason; I am investigating it.

@entn-at (Contributor) commented Sep 22, 2020

I believe there's also nnet3-xvector-compute-batched, which does its own chunking of the audio. The mean.vec/LDA/PLDA backend would likely have to be retrained on xvectors extracted with this binary, as they won't be the same as the ones computed by nnet3-xvector-compute.

@gorinars (Contributor)

I believe we hit this issue a few years ago, and a good speed-up was achieved by pre-computing the cache once and saving it to a file. If you pre-compute it for all segment lengths, there is no compilation overhead at inference time. I am not sure whether that was worth having in Kaldi master, but the reading part was in https://github.com/kaldi-asr/kaldi/pull/2303/files# . Precomputing the cache should be quite straightforward; I might have a small binary doing this if it's useful.
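
A rough sketch of such a pre-computation pass (assumptions: the WriteCache call is the presumed counterpart of the ReadCache support from PR #2303 and may not be in Kaldi master, so verify the exact names and signatures against your branch; the per-chunk ComputationRequest below mirrors the one nnet3-xvector-compute builds):

// Hypothetical pre-compilation pass: compile one computation per allowed
// chunk length, then serialize the compiler cache so inference-time
// binaries (via the cache-reading support from PR #2303) start warm.
#include <fstream>
#include "nnet3/nnet-optimize.h"

void PrecomputeCache(const kaldi::nnet3::Nnet &nnet,
                     const kaldi::nnet3::NnetOptimizeOptions &opts,
                     const std::string &cache_filename) {
  using namespace kaldi;
  using namespace kaldi::nnet3;
  CachingOptimizingCompilerOptions compiler_opts;
  compiler_opts.cache_capacity = 200;  // must exceed the number of requests
  CachingOptimizingCompiler compiler(nnet, opts, compiler_opts);
  for (int32 chunk = 100; chunk <= 10000; chunk += 100) {
    ComputationRequest request;
    request.need_model_derivative = false;
    request.store_component_stats = false;
    // One input spanning `chunk` frames and a single-frame output, as in
    // the per-chunk request built by nnet3-xvector-compute.cc.
    request.inputs.push_back(IoSpecification("input", 0, chunk));
    IoSpecification output_spec;
    output_spec.name = "output";
    output_spec.has_deriv = false;
    output_spec.indexes.resize(1);
    request.outputs.push_back(output_spec);
    compiler.Compile(request);  // compiles and caches this chunk size
  }
  std::ofstream os(cache_filename, std::ios::binary);
  // Assumed serialization hook; check that WriteCache exists in your
  // Kaldi branch before relying on this.
  compiler.WriteCache(os, /*binary=*/true);
}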

stale bot commented Dec 9, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label Dec 9, 2020
@nshmyrev (Contributor, Author)

stale bot removed the stale label Dec 13, 2020
stale bot commented Feb 11, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label Feb 11, 2021