Duplicated token seqs are used for rescoring #105

Closed
csukuangfj opened this issue Nov 3, 2021 · 1 comment · Fixed by #108


csukuangfj commented Nov 3, 2021

While implementing rescoring with a conformer LM, I found that there are duplicated token sequences.

The reason is that the following code in `icefall/icefall/decode.py` (lines 222 to 237 at commit 810b193):

```python
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path)
    word_seq = word_seq.remove_axis(word_seq.num_axes - 2)

# Each utterance has `num_paths` paths but some of them transduces
# to the same word sequence, so we need to remove repeated word
# sequences within an utterance. After removing repeats, each utterance
# contains different number of paths
#
# `new2old` is a 1-D torch.Tensor mapping from the output path index
# to the input path index.
_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)
```

does not remove 0s from `word_seq`.

Previous versions removed 0s from `word_seq`; see `icefall/icefall/decode.py` (lines 218 to 227 at commit abadc71):

```python
# word_seq is a k2.RaggedTensor sharing the same shape as `path`
# but it contains word IDs. Note that it also contains 0s and -1s.
# The last entry in each sublist is -1.
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path, remove_axis=True)

# Remove 0 (epsilon) and -1 from word_seq
word_seq = word_seq.remove_values_leq(0)
```

It does not affect the final WER, but it incurs extra unnecessary computation.
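To see why skipping the `remove_values_leq(0)` step leaves duplicates behind, here is a minimal sketch in plain Python (not using k2; the list-based `unique` helper below is hypothetical and only stands in for `RaggedTensor.unique`): two paths can transduce to the same word sequence while placing epsilons (0) at different positions, so they only compare equal after the 0s and the trailing -1 are stripped.

```python
# Two paths through the lattice: same words (5, 7), but the
# epsilons (0) fall in different positions, and each sublist
# ends with -1, mirroring the comment in the old code.
word_seqs = [
    [0, 5, 0, 7, -1],  # path A
    [5, 0, 0, 7, -1],  # path B: duplicate of A once 0s/-1 are removed
]

def unique(seqs):
    """Keep the first occurrence of each distinct sequence."""
    seen, out = set(), []
    for s in seqs:
        key = tuple(s)
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

# Without stripping, the two paths look different and both survive:
assert len(unique(word_seqs)) == 2

# After removing values <= 0 (epsilons and -1), as
# word_seq.remove_values_leq(0) did in the previous version,
# the duplicate is detected and removed:
stripped = [[t for t in s if t > 0] for s in word_seqs]
assert len(unique(stripped)) == 1
```

The rescoring cost grows with the number of surviving paths, so every undetected duplicate is a token sequence scored twice by the LM for no benefit.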

@csukuangfj
Collaborator Author

One consequence of the fix is that we can use a larger value for `num_paths`, which could potentially affect the WER.
