segment_long_utterances.sh failing on decode_segmentation #1629

nizmagu · 2017-05-18T14:51:27Z

I was trying to run segment_long_utterances.sh on six 5-hour-long files.
Upon reaching stage 4, I got the following message:

steps/cleanup/decode_segmentation.sh --beam 15.0 --lattice-beam 1.0 --nj 6 --cmd run.pl --mem 4G --skip-scoring true --allow-partial false exp/segment_train_long/graphs_uniform_seg exp/segment_train_long/train_long_uniform_seg exp/segment_train_long/lats
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt] 
steps/cleanup/decode_segmentation.sh: feature type is lda
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/decode.*.log

The log files show that decode_segmentation failed with this error:

ERROR (gmm-latgen-faster[5.1.92-a7e61]:FindKeyInternal():util/kaldi-table-inl.h:2122) You provided the "cs" option but are not calling with keys in sorted order: 2010_07_19_9050-1000000-1003000 < 2010_07_19_9050-585000-588000: rspecifier is ark,s,cs:apply-cmvn  --utt2spk=ark:exp/segment_train_long/train_long_uniform_seg/split6/2/utt2spk scp:exp/segment_train_long/train_long_uniform_seg/split6/2/cmvn.scp scp:exp/segment_train_long/train_long_uniform_seg/split6/2/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/segment_train_long/final.mat ark:- ark:- |

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::FindKeyInternal(std::string const&)
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
kaldi::RandomAccessTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
main
__libc_start_main
gmm-latgen-faster() [0x4639d9]

# Accounting: time=7958 threads=1
# Ended (code 255) at Thu May 18 15:16:07 IDT 2017, elapsed time 7958 seconds

I ran validate_data_dir.sh, and it reports that the files are in sorted order.
I suspected a locale problem, so I set export LANG= and export LC_ALL=C and checked with sort -c, to no avail.

How can I fix this issue?
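For readers hitting the same message: the two keys in the error differ only in the offset that follows the recording ID, and under byte-wise (C locale) comparison "1000000" sorts before "585000" because '1' < '5'. The sketch below uses the two keys from the log to show how a list can look ordered by start offset and still fail a byte-wise sortedness check. The helper names `is_c_sorted` and `start_offset` are hypothetical, and the idea that the keys were ordered by numeric offset is an assumption made for illustration, not something the thread confirms.

```python
# Keys copied from the error message in the log above.
earlier_key = "2010_07_19_9050-1000000-1003000"
later_key = "2010_07_19_9050-585000-588000"


def is_c_sorted(keys):
    """Return True if keys are in byte-wise (LC_ALL=C style) non-decreasing order."""
    encoded = [k.encode("utf-8") for k in keys]
    return all(a <= b for a, b in zip(encoded, encoded[1:]))


def start_offset(key):
    """Hypothetical helper: parse the start field from '<reco>-<start>-<end>'."""
    return int(key.rsplit("-", 2)[1])


# Ordering by numeric start offset puts 585000 before 1000000 ...
by_offset = sorted([earlier_key, later_key], key=start_offset)
# ... while byte-wise ordering compares the digits as characters.
by_bytes = sorted([earlier_key, later_key], key=lambda k: k.encode("utf-8"))

print(by_offset, is_c_sorted(by_offset))
print(by_bytes, is_c_sorted(by_bytes))
```

A check like this can be run on the key column of a split's feats.scp to see whether the order matches what a reader opened with the "s" and "cs" options expects.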


danpovey · 2017-05-18T17:03:59Z

This looks like an error in the script, not a user error; Vimal will fix it today hopefully.

vimalmanohar · 2017-05-18T17:59:47Z

@nizmagu Can you check if this solves the problem?

nizmagu · 2017-05-22T11:18:22Z

This solves the decode problem; however, a new problem came up.

The script crashed at stage 9 with the following error:
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/retrieve_similar_docs.*.log

Here is a sample log file:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text2tfidf-file=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:07:50 IDT 2017
#
usage: retrieve_similar_docs.py [-h] [--verbose {0,1,2,3}]
                                [--num-neighbors-to-search NUM_NEIGHBORS_TO_SEARCH]
                                [--neighbor-tfidf-threshold NEIGHBOR_TFIDF_THRESHOLD]
                                [--partial-doc-fraction PARTIAL_DOC_FRACTION]
                                --source-text-id2doc-ids
                                SOURCE_TEXT_ID2DOC_IDS
                                --query-id2source-text-id
                                QUERY_ID2SOURCE_TEXT_ID --source-text-id2tfidf
                                SOURCE_TEXT_ID2TFIDF --query-tfidf QUERY_TFIDF
                                --relevant-docs RELEVANT_DOCS
retrieve_similar_docs.py: error: argument --source-text-id2tfidf is required
# Accounting: time=0 threads=1
# Ended (code 2) at Mon May 22 14:07:50 IDT 2017, elapsed time 0 seconds

I tried to change --source-text2tfidf-file to --source-text-id2tfidf and this was the result:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text-id2tfidf=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:11:06 IDT 2017
#
2017-05-22 14:11:06,790 [retrieve_similar_docs.py:336 - run - INFO ] Retrieved similar documents for 0 queries
Traceback (most recent call last):
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 353, in <module>
    main()
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 348, in main
    args.relevant_docs, args.query_tfidf, args.source_tfidf]:
AttributeError: 'Namespace' object has no attribute 'source_tfidf'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:11:06 IDT 2017, elapsed time 0 seconds

args.source_tfidf seemed to only be referenced once (in the closing command), so I changed it again to args.source_text_id2tfidf.
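The AttributeError above is consistent with how argparse names attributes: when no explicit dest is given, the attribute name is derived from the long option, with dashes turned into underscores. The following is a minimal sketch of that behaviour, not the code of retrieve_similar_docs.py; only the option name and the two attribute names are taken from the log, and the file name passed in is a placeholder.

```python
import argparse

parser = argparse.ArgumentParser()
# No dest= is given, so argparse derives the attribute name from the
# long option string: "--source-text-id2tfidf" -> "source_text_id2tfidf".
parser.add_argument("--source-text-id2tfidf", required=True)

# Placeholder value; any string works for the purpose of this sketch.
args = parser.parse_args(["--source-text-id2tfidf=source2tf_idf.scp"])

print(args.source_text_id2tfidf)        # the derived attribute exists
print(hasattr(args, "source_tfidf"))    # the stale attribute name does not
```

Any code that still reads the old attribute name after an option is renamed fails in the same way, which is why both the option on the command line and the reference in the script had to change.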

Then the script crashed again:
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/get_ctm_edits.*.log

Here is the log file:

# steps/cleanup/internal/stitch_documents.py --query2docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt --input-documents=exp/segment_train_long/docs/split6/1/docs.txt --output-documents=- | steps/cleanup/internal/align_ctm_ref.py --eps-symbol="<eps>" --oov-word='<UNK>' --symbol-table=data/lang/words.txt --hyp-format=CTM --align-full-hyp=false --hyp=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm.1 --ref=- --output=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm_edits.1 
# Started at Mon May 22 14:15:01 IDT 2017
#
Traceback (most recent call last):
  File "steps/cleanup/internal/align_ctm_ref.py", line 615, in <module>
    main()
  File "steps/cleanup/internal/align_ctm_ref.py", line 598, in main
    args = get_args()
  File "steps/cleanup/internal/align_ctm_ref.py", line 103, in get_args
    "--reco2file-and-channel must be provided for "
RuntimeError: --reco2file-and-channel must be provided for hyp-format=CTM
usage: stitch_documents.py [-h] --query2docs QUERY2DOCS --input-documents
                           INPUT_DOCUMENTS --output-documents OUTPUT_DOCUMENTS
                           [--check-sorted-docs-per-query {true,false}]
stitch_documents.py: error: argument --input-documents: can't open 'exp/segment_train_long/docs/split6/1/docs.txt': [Errno 2] No such file or directory: 'exp/segment_train_long/docs/split6/1/docs.txt'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:15:01 IDT 2017, elapsed time 0 seconds
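The RuntimeError in this log is raised inside get_args, which suggests a check that one option becomes mandatory depending on the value of another. A minimal sketch of that pattern follows, assuming nothing about the real align_ctm_ref.py beyond the two option names and the message text shown in the traceback; the value passed for --reco2file-and-channel is a hypothetical file name.

```python
import argparse


def get_args(argv):
    """Parse arguments and enforce a requirement that depends on another option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hyp-format", required=True)
    parser.add_argument("--reco2file-and-channel", default=None)
    args = parser.parse_args(argv)

    # argparse cannot express "required only if", so the check is done by hand.
    if args.hyp_format == "CTM" and args.reco2file_and_channel is None:
        raise RuntimeError(
            "--reco2file-and-channel must be provided for hyp-format=CTM")
    return args


# Reproduces the failure mode from the log: CTM without the mapping file.
try:
    get_args(["--hyp-format=CTM"])
except RuntimeError as err:
    print(err)

# Supplying the option (hypothetical file name) satisfies the check.
ok = get_args(["--hyp-format=CTM",
               "--reco2file-and-channel=reco2file_and_channel"])
print(ok.reco2file_and_channel)
```

Under this reading, the calling script would need to pass --reco2file-and-channel whenever it requests --hyp-format=CTM. The second error in the same log, the missing docs.txt for stitch_documents.py, is a separate problem.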

vimalmanohar · 2017-05-22T16:41:52Z

I'll create a pull request soon.


vimalmanohar · 2017-05-23T18:37:53Z

I fixed some issues in #1639

nizmagu · 2017-05-25T11:52:19Z

That fixed the issue, thanks a lot!

danpovey · 2020-01-22T07:17:37Z

This issue may still exist for the _nnet3 versions of these scripts. See https://groups.google.com/d/msgid/kaldi-help/a783ff67-7cb4-4f7e-b2db-5b67ca032478%40googlegroups.com.

stale · 2020-06-19T06:37:12Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vimalmanohar mentioned this issue May 18, 2017

long_utts: Minor fix #1631

Merged

nizmagu closed this as completed May 25, 2017

danpovey reopened this Jan 22, 2020

stale bot added the stale label Jun 19, 2020
