Duplicated token seqs are used for rescoring #105

Closed
csukuangfj opened this issue Nov 3, 2021 · 1 comment · Fixed by #108


csukuangfj commented Nov 3, 2021

While implementing rescoring with a conformer LM, I found that there are duplicated token sequences.

The reason is that the following code in `icefall/icefall/decode.py` (lines 222 to 237 at commit 810b193):

```python
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path)
    word_seq = word_seq.remove_axis(word_seq.num_axes - 2)

# Each utterance has `num_paths` paths but some of them transduces
# to the same word sequence, so we need to remove repeated word
# sequences within an utterance. After removing repeats, each utterance
# contains different number of paths
#
# `new2old` is a 1-D torch.Tensor mapping from the output path index
# to the input path index.
_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)
```

does not remove 0s from `word_seq`.

Previous versions removed 0s from `word_seq`; see `icefall/icefall/decode.py` (lines 218 to 227 at commit abadc71):

```python
# word_seq is a k2.RaggedTensor sharing the same shape as `path`
# but it contains word IDs. Note that it also contains 0s and -1s.
# The last entry in each sublist is -1.
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path, remove_axis=True)

# Remove 0 (epsilon) and -1 from word_seq
word_seq = word_seq.remove_values_leq(0)
```

It does not affect the final WER, but it incurs extra unnecessary computation.
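To see why skipping the `remove_values_leq(0)` step leaves duplicates behind, here is a minimal sketch in plain Python (not using k2; the list-based `unique` helper below is hypothetical and only stands in for `RaggedTensor.unique`): two paths can transduce to the same word sequence while placing epsilons (0) at different positions, so they only compare equal after the 0s and the trailing -1 are stripped.

```python
# Two paths through the lattice: same words (5, 7), but the
# epsilons (0) fall in different positions, and each sublist
# ends with -1, mirroring the comment in the old code.
word_seqs = [
    [0, 5, 0, 7, -1],  # path A
    [5, 0, 0, 7, -1],  # path B: duplicate of A once 0s/-1 are removed
]

def unique(seqs):
    """Keep the first occurrence of each distinct sequence."""
    seen, out = set(), []
    for s in seqs:
        key = tuple(s)
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

# Without stripping, the two paths look different and both survive:
assert len(unique(word_seqs)) == 2

# After removing values <= 0 (epsilons and -1), as
# word_seq.remove_values_leq(0) did in the previous version,
# the duplicate is detected and removed:
stripped = [[t for t in s if t > 0] for s in word_seqs]
assert len(unique(stripped)) == 1
```

The rescoring cost grows with the number of surviving paths, so every undetected duplicate is a token sequence scored twice by the LM for no benefit.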

@csukuangfj
Collaborator Author

One consequence of the fix is that we can use a larger value for `num_paths`, which could potentially affect the WER.
