Issue training using the Aspire recipe #4041

Open
andrenatal opened this issue Apr 14, 2020 · 8 comments
Labels
bug stale Stale bot on the loose

Comments

@andrenatal

Hello,

I'm trying to train a model using the Aspire recipe, using the latest code from the master branch, but am encountering the following error when running local/chain/run_tdnn_lstm.sh. When I trained using local/chain/run_tdnn.sh, it worked fine.


steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --generate-egs-scp true                 --cmd "run.pl"                 --cmvn-opts "--norm-means=false --norm-vars=false"                 --online-ivector-dir "exp/nnet3/ivectors_train_rvb"                 --left-context 58                 --right-context 28                 --left-context-initial 18                 --right-context-final 28                 --left-tolerance '5'                 --right-tolerance '5'                 --frame-subsampling-factor 3                 --alignment-subsampling-factor 3                 --stage -10                 --frames-per-iter 1500000                 --frames-per-eg 160,140,110,80                 --srand 0                 data/train_rvb_hires exp/chain/tdnn_lstm_1a exp/chain/tri5a_train_rvb_lats exp/chain/tdnn_lstm_1a/egs
steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --generate-egs-scp true --cmd run.pl --cmvn-opts --norm-means=false --norm-vars=false --online-ivector-dir exp/nnet3/ivectors_train_rvb --left-context 58 --right-context 28 --left-context-initial 18 --right-context-final 28 --left-tolerance 5 --right-tolerance 5 --frame-subsampling-factor 3 --alignment-subsampling-factor 3 --stage -10 --frames-per-iter 1500000 --frames-per-eg 160,140,110,80 --srand 0 data/train_rvb_hires exp/chain/tdnn_lstm_1a exp/chain/tri5a_train_rvb_lats exp/chain/tdnn_lstm_1a/egs
steps/nnet3/chain/get_egs.sh: File data/train_rvb_hires/utt2uniq exists, so ensuring the hold-out set includes all perturbed versions of the same source utterance.
steps/nnet3/chain/get_egs.sh: Holding out 300 utterances in validation set and 300 in training diagnostic set, out of total 5614836.
steps/nnet3/chain/get_egs.sh: creating egs.  To ensure they are not deleted later you can do:  touch exp/chain/tdnn_lstm_1a/egs/.nodelete
steps/nnet3/chain/get_egs.sh: feature type is raw, with 'apply-cmvn'
tree-info exp/chain/tdnn_lstm_1a/tree
feat-to-dim scp:exp/nnet3/ivectors_train_rvb/ivector_online.scp -
steps/nnet3/chain/get_egs.sh: working out number of frames of training data
steps/nnet3/chain/get_egs.sh: working out feature dim
steps/nnet3/chain/get_egs.sh: creating 1374 archives, each with 18749 egs, with
steps/nnet3/chain/get_egs.sh:   160,140,110,80 labels per example, and (left,right) context = (58,28)
steps/nnet3/chain/get_egs.sh:   ... and (left-context-initial,right-context-final) = (18,28)
steps/nnet3/chain/get_egs.sh: Getting validation and training subset examples in background.
steps/nnet3/chain/get_egs.sh: Generating training examples on disk
run.pl: job failed, log is in exp/chain/tdnn_lstm_1a/egs/log/create_valid_subset.log

When I inspect the aforementioned log file, I see this:


# utils/filter_scp.pl exp/chain/tdnn_lstm_1a/egs/valid_uttlist exp/chain/tdnn_lstm_1a/egs/lat_special.scp | lattice-align-phones --replace-output-symb
ols=true exp/chain/tri5a_train_rvb_lats/final.mdl scp:- ark:- | chain-get-supervision --lattice-input=true --frame-subsampling-factor=3 --right-tolera
nce=5 --left-tolerance=5 exp/chain/tdnn_lstm_1a/tree exp/chain/tdnn_lstm_1a/0.trans_mdl ark:- ark:- | nnet3-chain-get-egs --online-ivectors=scp:exp/nn
et3/ivectors_train_rvb/ivector_online.scp --online-ivector-period=10 --srand=0 --left-context=58 --right-context=28 --num-frames=160,140,110,80 --fram
e-subsampling-factor=3 --compress=true --left-context-initial=18 --right-context-final=28 --normalization-fst-scale=1.0 exp/chain/tdnn_lstm_1a/normali
zation.fst "ark,s,cs:utils/filter_scp.pl exp/chain/tdnn_lstm_1a/egs/valid_uttlist data/train_rvb_hires/feats.scp | apply-cmvn --norm-means=false --nor
m-vars=false --utt2spk=ark:data/train_rvb_hires/utt2spk scp:data/train_rvb_hires/cmvn.scp scp:- ark:- |" ark,s,cs:- ark:exp/chain/tdnn_lstm_1a/egs/val
id_all.cegs
# Started at Mon Apr 13 22:01:36 PDT 2020
#
chain-get-supervision --lattice-input=true --frame-subsampling-factor=3 --right-tolerance=5 --left-tolerance=5 exp/chain/tdnn_lstm_1a/tree exp/chain/t
dnn_lstm_1a/0.trans_mdl ark:- ark:-
nnet3-chain-get-egs --online-ivectors=scp:exp/nnet3/ivectors_train_rvb/ivector_online.scp --online-ivector-period=10 --srand=0 --left-context=58 --rig
ht-context=28 --num-frames=160,140,110,80 --frame-subsampling-factor=3 --compress=true --left-context-initial=18 --right-context-final=28 --normalizat
ion-fst-scale=1.0 exp/chain/tdnn_lstm_1a/normalization.fst 'ark,s,cs:utils/filter_scp.pl exp/chain/tdnn_lstm_1a/egs/valid_uttlist data/train_rvb_hires
/feats.scp | apply-cmvn --norm-means=false --norm-vars=false --utt2spk=ark:data/train_rvb_hires/utt2spk scp:data/train_rvb_hires/cmvn.scp scp:- ark:-
|' ark,s,cs:- ark:exp/chain/tdnn_lstm_1a/egs/valid_all.cegs
LOG (nnet3-chain-get-egs[5.5.569~1-6f329]:ComputeDerived():nnet-example-utils.cc:335) Rounding up --num-frames=160,140,110,80 to multiples of --frame-
subsampling-factor=3, to: 162,141,111,81
lattice-align-phones --replace-output-symbols=true exp/chain/tri5a_train_rvb_lats/final.mdl scp:- ark:-
apply-cmvn --norm-means=false --norm-vars=false --utt2spk=ark:data/train_rvb_hires/utt2spk scp:data/train_rvb_hires/cmvn.scp scp:- ark:-
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_00123-A-041128-0411
78 because it is too short: 48 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_00325-A-034285-0343
60 because it is too short: 73 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_04633-B-000179-0002
59 because it is too short: 78 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_05509-B-049806-0498
81 because it is too short: 73 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_11038-B-022852-0229
26 because it is too short: 72 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev1-fe_03_11661-A-052912-0529
41 because it is too short: 27 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev2-fe_03_00123-A-041128-0411
78 because it is too short: 48 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:ProcessFile():nnet3-chain-get-egs.cc:134) Not producing egs for utterance rev2-fe_03_00325-A-034285-0343
60 because it is too short: 73 frames.
WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:main():nnet3-chain-get-egs.cc:386) No pdf-level posterior for key rev2-fe_03_03392-A-000991-001172
ERROR (nnet3-chain-get-egs[5.5.569~1-6f329]:FindKeyInternal():util/kaldi-table-inl.h:2149) You provided the "s" option  (sorted order), but keys are o
ut of order or duplicated: rev2-fe_03_03635-B-013708-013832 is followed by rev2-fe_03_03392-A-000991-001172: rspecifier is ark,s,cs:-

[ Stack-Trace: ]
nnet3-chain-get-egs(kaldi::MessageLogger::LogMessage() const+0xb42) [0x56104d2e1960]
nnet3-chain-get-egs(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x56104cf7a447]
nnet3-chain-get-egs(kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::chain::Supervision> >::FindKeyInternal(std::__cxx
11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x469) [0x56104cf8d909]
nnet3-chain-get-egs(kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::chain::Supervision> >::HasKey(std::__cxx11::basic
_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x9) [0x56104cf8db61]
nnet3-chain-get-egs(kaldi::RandomAccessTableReader<kaldi::KaldiObjectHolder<kaldi::chain::Supervision> >::HasKey(std::__cxx11::basic_string<char, std:
:char_traits<char>, std::allocator<char> > const&)+0x40) [0x56104cf81a62]
nnet3-chain-get-egs(main+0xe24) [0x56104cf76bfe]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f9ad2216b97]
nnet3-chain-get-egs(_start+0x2a) [0x56104cf75cfa]

WARNING (nnet3-chain-get-egs[5.5.569~1-6f329]:Close():kaldi-io.cc:515) Pipe utils/filter_scp.pl exp/chain/tdnn_lstm_1a/egs/valid_uttlist data/train_rv
b_hires/feats.scp | apply-cmvn --norm-means=false --norm-vars=false --utt2spk=ark:data/train_rvb_hires/utt2spk scp:data/train_rvb_hires/cmvn.scp scp:-
 ark:- | had nonzero return status 36096
LOG (nnet3-chain-get-egs[5.5.569~1-6f329]:~UtteranceSplitter():nnet-example-utils.cc:357) Split 127 utts, with total length 46459 frames (0.129053 hou
rs assuming 100 frames per second)
LOG (nnet3-chain-get-egs[5.5.569~1-6f329]:~UtteranceSplitter():nnet-example-utils.cc:366) Average chunk length was 132.473 frames; overlap between adj
acent chunks was 1.12357% of input length; length of output was 99.5135% of input length (minus overlap = 98.39%).
LOG (nnet3-chain-get-egs[5.5.569~1-6f329]:~UtteranceSplitter():nnet-example-utils.cc:382) Output frames are distributed among chunk-sizes as follows:
81 = 14.89%, 111 = 12.24%, 141 = 11.89%, 162 = 60.97%
kaldi::KaldiFatalError
# Accounting: time=10 threads=1
# Ended (code 255) at Mon Apr 13 22:01:46 PDT 2020, elapsed time 10 seconds
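
A side note on the `had nonzero return status 36096` warning in the log above: if that number is a raw POSIX wait status (an assumption; the log does not say so), it decodes to exit code 141, which by shell convention is 128 + 13, and signal 13 is SIGPIPE on Linux. That would suggest the feature pipe was simply cut off when `nnet3-chain-get-egs` aborted, making the warning a consequence of the sorted-order error rather than a separate problem. A minimal Python sketch of the decoding, using a hypothetical helper name:

```python
def describe_wait_status(status: int) -> str:
    """Interpret a raw POSIX wait status (hypothetical helper for illustration).

    The low 7 bits hold a terminating signal, if any; otherwise bits 8-15
    hold the exit code of the process.
    """
    term_signal = status & 0x7F
    if term_signal != 0:
        return f"killed by signal {term_signal}"
    exit_code = (status >> 8) & 0xFF
    if exit_code > 128:
        # Shells report "child died from signal N" as exit code 128 + N.
        return f"exit code {exit_code} (128 + signal {exit_code - 128})"
    return f"exit code {exit_code}"


if __name__ == "__main__":
    # 36096 is the status printed in the log; signal 13 is SIGPIPE on Linux.
    print(describe_wait_status(36096))
```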

So, has the recipe been kept up to date, and does it currently work with master, or should I just use fisher_english?

Thanks

@andrenatal andrenatal added the bug label Apr 14, 2020
@danpovey
Contributor

danpovey commented Apr 15, 2020 via email

@andrenatal
Author

andrenatal commented Apr 15, 2020

Hi @danpovey, thanks for the response. Yes, export LC_ALL=C is set in path.sh.

I'll run utils/validate_data_dir.sh to validate the dataset and post the results here.

Thanks.
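
For readers wondering why `LC_ALL=C` comes up at all: the `s` option in a specifier such as `ark,s,cs:-` assumes keys are in plain byte order, which is what `sort` produces under the C locale, while other locales may order the same keys differently. The Python sketch below illustrates the difference. The keys are hypothetical, and case-folding is only a rough stand-in for locale-aware collation, which differs in detail from locale to locale.

```python
def c_locale_sorted(keys):
    """Sort keys by their raw UTF-8 bytes, as `sort` does under LC_ALL=C."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))


def case_folding_sorted(keys):
    """Rough stand-in for a locale-aware sort (an assumption for illustration)."""
    return sorted(keys, key=str.casefold)


if __name__ == "__main__":
    # Hypothetical utterance keys, not taken from the Aspire data.
    keys = ["alice-utt1", "Bob-utt1", "carol-utt1"]
    print(c_locale_sorted(keys))      # uppercase letters sort before lowercase
    print(case_folding_sorted(keys))  # case is ignored when ordering
```

Note that the two keys named in the error message above differ only in their digits (`03635` is followed by `03392`), so they would be out of order under either scheme.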

@danpovey
Contributor

Looks to me like that file lat_special.scp may not be in sorted order. You'll have to trace back into how it was created and figure out why.
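
One way to test this hypothesis is to scan the scp file for the first key that is not strictly greater than its predecessor in byte order, which is the condition the error message complains about. A minimal Python sketch follows. The function name `find_order_violation` is hypothetical, and it assumes each non-blank line begins with the utterance key followed by whitespace.

```python
from typing import Iterable, Optional, Tuple


def find_order_violation(lines: Iterable[str]) -> Optional[Tuple[int, str, str]]:
    """Return (line_number, previous_key, key) for the first key that is not
    strictly greater than the key before it, or None if all keys are in order.

    Keys are compared as UTF-8 bytes, matching `sort` under LC_ALL=C.
    A repeated key is reported too, since duplicates also break the "s" option.
    """
    prev = None
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        key = line.split(None, 1)[0]
        if prev is not None and key.encode("utf-8") <= prev.encode("utf-8"):
            return lineno, prev, key
        prev = key
    return None


if __name__ == "__main__":
    # The keys are the two named in the error message; the second column is a
    # placeholder, not a real archive offset.
    sample = [
        "rev2-fe_03_03635-B-013708-013832 placeholder.ark:1",
        "rev2-fe_03_03392-A-000991-001172 placeholder.ark:2",
    ]
    print(find_order_violation(sample))
```

To check a real file, pass an open file handle instead of the sample list, for example `find_order_violation(open("exp/chain/tdnn_lstm_1a/egs/lat_special.scp"))`.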

@andrenatal
Author

Ok, thanks, will try to see what happened to this file.

I just ran validate_data_dir.sh and it worked fine:

utils/validate_data_dir.sh: Successfully validated data-directory data/train_rvb_hires/

@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
@johnjosephmorgan
Contributor

I just saw that this is the same issue I recently posted.

@stale stale bot removed the stale Stale bot on the loose label Jul 3, 2020
@johnjosephmorgan
Contributor

lat_special.scp is generated by lattice-copy. Should the ,s be added to the wspecifier there?
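
Independently of how lattice-copy writes the file, an scp file is a plain-text index of keys and locations, so one possible workaround (an assumption, not something confirmed in this thread) is to re-sort it by key in byte order before it is read with the `s` option. A minimal Python sketch follows, with a hypothetical function name. It raises on duplicate keys rather than silently dropping one, because a duplicate points to a problem upstream.

```python
from typing import Iterable, List


def sort_scp_lines(lines: Iterable[str]) -> List[str]:
    """Return the non-blank scp lines sorted by key in UTF-8 byte order.

    Raises ValueError if a key appears more than once.
    """
    entries = {}
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        key = stripped.split(None, 1)[0]
        if key in entries:
            raise ValueError(f"duplicate key: {key}")
        entries[key] = stripped
    return [entries[k] for k in sorted(entries, key=lambda k: k.encode("utf-8"))]


if __name__ == "__main__":
    # Keys are taken from the error message; the second column is a placeholder.
    sample = [
        "rev2-fe_03_03635-B-013708-013832 placeholder.ark:1",
        "rev2-fe_03_03392-A-000991-001172 placeholder.ark:2",
    ]
    for entry in sort_scp_lines(sample):
        print(entry)
```

This only treats the symptom; it does not explain why the file came out unsorted in the first place.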

@stale

stale bot commented Sep 1, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Sep 1, 2020