segment_long_utterances.sh failing on decode_segmentation #1629

Open
nizmagu opened this issue May 18, 2017 · 8 comments
Labels
stale Stale bot on the loose

Comments

@nizmagu

nizmagu commented May 18, 2017

I was trying to run segment_long_utterances.sh on six 5-hour-long files.
Upon reaching stage 4, I got the following message:

steps/cleanup/decode_segmentation.sh --beam 15.0 --lattice-beam 1.0 --nj 6 --cmd run.pl --mem 4G --skip-scoring true --allow-partial false exp/segment_train_long/graphs_uniform_seg exp/segment_train_long/train_long_uniform_seg exp/segment_train_long/lats
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt] 
steps/cleanup/decode_segmentation.sh: feature type is lda
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/decode.*.log

The log files show that decode_segmentation failed with this error:

ERROR (gmm-latgen-faster[5.1.92-a7e61]:FindKeyInternal():util/kaldi-table-inl.h:2122) You provided the "cs" option but are not calling with keys in sorted order: 2010_07_19_9050-1000000-1003000 < 2010_07_19_9050-585000-588000: rspecifier is ark,s,cs:apply-cmvn  --utt2spk=ark:exp/segment_train_long/train_long_uniform_seg/split6/2/utt2spk scp:exp/segment_train_long/train_long_uniform_seg/split6/2/cmvn.scp scp:exp/segment_train_long/train_long_uniform_seg/split6/2/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/segment_train_long/final.mat ark:- ark:- |

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::FindKeyInternal(std::string const&)
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
kaldi::RandomAccessTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
main
__libc_start_main
gmm-latgen-faster() [0x4639d9]

# Accounting: time=7958 threads=1
# Ended (code 255) at Thu May 18 15:16:07 IDT 2017, elapsed time 7958 seconds

I ran validate_data_dir.sh, and it reports that the files are in sorted order.
I suspected a locale problem, so I ran export LANG= and export LC_ALL=C and checked the files with sort -c, to no avail.
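
The keys in the error message suggest why validate_data_dir.sh and sort -c can pass while the decoder still fails. The "s" and "cs" rspecifier options expect keys in plain string order, where "1000000" precedes "585000" because '1' < '5', whereas the decoder was evidently asked for the segments in numeric start-time order. A minimal Python sketch of that comparison follows; `first_unsorted_pair` is a hypothetical helper written for illustration, not part of Kaldi, and the request order is inferred from the error message.

```python
def first_unsorted_pair(keys):
    """Return the first adjacent (prev, cur) pair that is not strictly
    increasing in byte-wise (C locale) order, or None if all are."""
    for prev, cur in zip(keys, keys[1:]):
        if prev.encode("utf-8") >= cur.encode("utf-8"):
            return prev, cur
    return None


# Keys copied from the error message; the order is an assumption
# (numeric by segment start time, as the message implies).
requested = [
    "2010_07_19_9050-585000-588000",
    "2010_07_19_9050-1000000-1003000",
]

print(first_unsorted_pair(requested))          # the offending pair
print(first_unsorted_pair(sorted(requested)))  # None: string order is consistent
```

If this reading is right, the data directory itself is sorted correctly and the mismatch lies in the order in which the script requests the keys, which would be consistent with the problem being in the script rather than in the data.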

How can I fix this issue?

@danpovey
Contributor

This looks like an error in the script, not a user error; Vimal will fix it today hopefully.

@vimalmanohar
Contributor

@nizmagu Can you check if this solves the problem?

@nizmagu
Author

nizmagu commented May 22, 2017

This solves the decode problem, but a new problem came up.

The script crashed at stage 9 with the following error:
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/retrieve_similar_docs.*.log

Here is a sample log file:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text2tfidf-file=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:07:50 IDT 2017
#
usage: retrieve_similar_docs.py [-h] [--verbose {0,1,2,3}]
                                [--num-neighbors-to-search NUM_NEIGHBORS_TO_SEARCH]
                                [--neighbor-tfidf-threshold NEIGHBOR_TFIDF_THRESHOLD]
                                [--partial-doc-fraction PARTIAL_DOC_FRACTION]
                                --source-text-id2doc-ids
                                SOURCE_TEXT_ID2DOC_IDS
                                --query-id2source-text-id
                                QUERY_ID2SOURCE_TEXT_ID --source-text-id2tfidf
                                SOURCE_TEXT_ID2TFIDF --query-tfidf QUERY_TFIDF
                                --relevant-docs RELEVANT_DOCS
retrieve_similar_docs.py: error: argument --source-text-id2tfidf is required
# Accounting: time=0 threads=1
# Ended (code 2) at Mon May 22 14:07:50 IDT 2017, elapsed time 0 seconds

I changed --source-text2tfidf-file to --source-text-id2tfidf, and this was the result:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text-id2tfidf=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:11:06 IDT 2017
#
2017-05-22 14:11:06,790 [retrieve_similar_docs.py:336 - run - INFO ] Retrieved similar documents for 0 queries
Traceback (most recent call last):
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 353, in <module>
    main()
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 348, in main
    args.relevant_docs, args.query_tfidf, args.source_tfidf]:
AttributeError: 'Namespace' object has no attribute 'source_tfidf'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:11:06 IDT 2017, elapsed time 0 seconds

args.source_tfidf seemed to be referenced only once (in the closing code at the end of main()), so I changed that reference to args.source_text_id2tfidf.
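
The AttributeError follows from how argparse names attributes: unless dest= is given, an option such as --source-text-id2tfidf is stored under its option string with the dashes turned into underscores, so no source_tfidf attribute exists. A minimal sketch, with the parser reduced to two options and made-up file names (the real script's argument definitions are not shown in this thread):

```python
import argparse

# Reduced, hypothetical parser: only the option names match the usage
# message above; types and help strings in the real script are unknown.
parser = argparse.ArgumentParser()
parser.add_argument("--source-text-id2tfidf", required=True)
parser.add_argument("--query-tfidf", required=True)

args = parser.parse_args([
    "--source-text-id2tfidf=source2tf_idf.scp",  # illustrative file name
    "--query-tfidf=query_tf_idf.1.ark.txt",      # illustrative file name
])

# The attribute name is derived from the long option string.
print(args.source_text_id2tfidf)
# The name used at line 348 of the script was never defined by the parser.
print(hasattr(args, "source_tfidf"))
```

Renaming the reference, as done here, or declaring the option with an explicit dest="source_tfidf" would both make the two names agree.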

Then the script crashed again:
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/get_ctm_edits.*.log

Here is the log file:

# steps/cleanup/internal/stitch_documents.py --query2docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt --input-documents=exp/segment_train_long/docs/split6/1/docs.txt --output-documents=- | steps/cleanup/internal/align_ctm_ref.py --eps-symbol="<eps>" --oov-word='<UNK>' --symbol-table=data/lang/words.txt --hyp-format=CTM --align-full-hyp=false --hyp=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm.1 --ref=- --output=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm_edits.1 
# Started at Mon May 22 14:15:01 IDT 2017
#
Traceback (most recent call last):
  File "steps/cleanup/internal/align_ctm_ref.py", line 615, in <module>
    main()
  File "steps/cleanup/internal/align_ctm_ref.py", line 598, in main
    args = get_args()
  File "steps/cleanup/internal/align_ctm_ref.py", line 103, in get_args
    "--reco2file-and-channel must be provided for "
RuntimeError: --reco2file-and-channel must be provided for hyp-format=CTM
usage: stitch_documents.py [-h] --query2docs QUERY2DOCS --input-documents
                           INPUT_DOCUMENTS --output-documents OUTPUT_DOCUMENTS
                           [--check-sorted-docs-per-query {true,false}]
stitch_documents.py: error: argument --input-documents: can't open 'exp/segment_train_long/docs/split6/1/docs.txt': [Errno 2] No such file or directory: 'exp/segment_train_long/docs/split6/1/docs.txt'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:15:01 IDT 2017, elapsed time 0 seconds

@vimalmanohar
Contributor

vimalmanohar commented May 22, 2017 via email

@vimalmanohar
Contributor

I fixed some issues in #1639

@nizmagu
Author

nizmagu commented May 25, 2017

That fixed the issue, thanks a lot!

nizmagu closed this as completed May 25, 2017
@danpovey
Contributor

This issue may still exist for the _nnet3 versions of these scripts. See https://groups.google.com/d/msgid/kaldi-help/a783ff67-7cb4-4f7e-b2db-5b67ca032478%40googlegroups.com.

danpovey reopened this Jan 22, 2020
@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 19, 2020