
Support computing nbest oracle WER. #10

Merged (11 commits) on Aug 20, 2021

Conversation

csukuangfj (Collaborator)

The nbest oracle WER can help us evaluate different n-best rescoring methods,
as it is the best WER we could get if we had a perfect rescoring method.
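For reference, here is a minimal sketch of the idea in plain Python (not icefall's actual implementation; refs and nbest_lists are hypothetical inputs holding lists of words):

def edit_distance(ref, hyp):
    # Levenshtein distance between two word lists, single-row DP.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def oracle_wer(refs, nbest_lists):
    # For each utterance, keep the hypothesis closest to the reference;
    # the resulting WER lower-bounds what any rescoring method can achieve.
    errors = sum(min(edit_distance(ref, hyp) for hyp in hyps)
                 for ref, hyps in zip(refs, nbest_lists))
    return errors / sum(len(ref) for ref in refs)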

@@ -56,6 +57,15 @@ def get_parser():
"consecutive checkpoints before the checkpoint specified by "
"'--epoch'. ",
)

parser.add_argument(
"--scale",
Collaborator

If this scale is only used for the nbest-oracle mode, perhaps that should be clarified, e.g. via the name and the documentation? Right now it is a bit unclear whether it would affect other things.

Collaborator Author

I think it is also useful for other n-best rescoring methods, e.g., attention-decoder rescoring. Tuning this value can
change the number of unique paths in an n-best list, which can potentially affect the final WER.

I'm adding more documentation to clarify its usage.
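For reference, a minimal sketch of how the scale interacts with n-best sampling (illustrative only: the function below is made up, though k2.random_paths is a real k2 API):

import k2

def sample_nbest_paths(lattice: k2.Fsa, scale: float, num_paths: int):
    # Scale the arc scores before sampling. A smaller scale flattens the
    # score distribution, so k2.random_paths draws more diverse paths and
    # the resulting n-best list contains more unique ones.
    saved = lattice.scores.clone()
    lattice.scores *= scale
    paths = k2.random_paths(lattice, use_double_scores=True, num_paths=num_paths)
    lattice.scores = saved  # restore the original scores
    return paths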

@csukuangfj (Collaborator Author) commented Aug 18, 2021

The following screenshot shows the nbest oracle WER for different scale values on the LibriSpeech test-clean and test-other datasets.

Note:

  • A cell like "1.96 || 4.83" means the WER for test-clean is 1.96 and the WER for test-other is 4.83.
  • "lattice from HLG decoding" means the lattice comes from decoding with HLG alone, without LM rescoring and without the attention decoder.
  • "HLG + 4-gram whole lattice rescoring" means the lattice is the one obtained after 4-gram whole-lattice rescoring.
  • In both cases, the transformer attention decoder is not used.
  • For the model we are using to test the nbest oracle WER, its WER with the attention decoder is 2.76 || 6.4.

[Screenshot: nbest oracle WER for different scale values, 2021-08-18]

@csukuangfj (Collaborator Author) commented Aug 18, 2021

The number of unique paths increases when we use a smaller scale value. The following screenshots show this change.

@@ -0,0 +1,27 @@

Collaborator Author

This is how a pre-trained model can be used to transcribe a sound file.
@danpovey

It depends on

  • torchaudio, for reading sound files
  • kaldifeat, for feature extraction
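For reference, a minimal sketch of those two dependencies in action (the file path and option values here are illustrative, not necessarily what the script uses):

import torchaudio
import kaldifeat

# Read a sound file; torchaudio returns (num_channels, num_samples).
wave, sample_rate = torchaudio.load("test_wavs/1089-134686-0001.flac")

# Build an 80-dim fbank computer with kaldifeat.
opts = kaldifeat.FbankOptions()
opts.frame_opts.samp_freq = sample_rate
opts.mel_opts.num_bins = 80
fbank = kaldifeat.Fbank(opts)

features = fbank(wave[0])  # a (num_frames, 80) tensor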

Collaborator Author

Only HLG decoding with the transformer encoder output is added.
Do we need to use the attention decoder for rescoring?


Collaborator

This is great-- thanks!
Regarding using the attention decoder for rescoring-- yes, I'd like you to add that, because this will probably
be a main feature of the tutorial, and I think having good results is probably worthwhile.


features = features.unsqueeze(0)
logging.info("Decoding started")
features = fbank(waves)
Collaborator Author

Replacing torchaudio.compliance.kaldi with kaldifeat,
since kaldifeat makes it easier to extract features from
multiple sound files at the same time.
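A short sketch of why (setup values are illustrative): kaldifeat accepts a list of 1-D waveforms of different lengths, so a batch of files needs no padding at the feature-extraction stage.

import torchaudio
import kaldifeat

opts = kaldifeat.FbankOptions()
opts.frame_opts.samp_freq = 16000
opts.mel_opts.num_bins = 80
fbank = kaldifeat.Fbank(opts)

filenames = ["a.flac", "b.flac", "c.flac"]  # hypothetical files
waves = [torchaudio.load(f)[0][0] for f in filenames]  # 1-D tensors of varying lengths
features = fbank(waves)  # a list of (num_frames_i, 80) tensors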

Collaborator

Nice. Adding kaldifeat to Lhotse is still on my radar. I might remove all the other Kaldi-related feature extractors at the same time, but I don't think I'll be able to do it before the tutorial.

@csukuangfj (Collaborator Author)

Now it supports transcribing multiple files with LM rescoring and attention decoder rescoring.

Ready for review.

@danpovey (Collaborator) left a comment

Great!
Perhaps we can mention concretely where one might obtain this checkpoint, words.txt and HLG.pt, if someone were to try to run this without having trained the system? E.g. download location?

@csukuangfj (Collaborator Author) commented Aug 19, 2021

@pkufool

Could you please upload the following files:

  • best model with model averaging, without optimizer and scheduler information (a sketch of producing such a file follows this list)
  • data/lang_bpe/HLG.pt
  • data/lang_bpe/words.txt
  • data/lm/G_4_gram.pt
  • data/lang_bpe/tokens.txt (so we know the SOS and EOS IDs)
  • data/lang_bpe/bpe.model
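For the first item, a minimal sketch of how such a file could be produced, assuming each checkpoint stores its weights under a "model" key (icefall's actual helpers may differ):

import torch

filenames = ["exp/epoch-25.pt", "exp/epoch-26.pt"]  # hypothetical checkpoints
avg = None
for f in filenames:
    state = torch.load(f, map_location="cpu")["model"]
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()
for k in avg:
    avg[k] /= len(filenames)

# Save only the averaged weights: no optimizer or scheduler state.
torch.save({"model": avg}, "exp/averaged.pt")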

@csukuangfj (Collaborator Author)

Perhaps we can mention concretely where one might obtain this checkpoint, words.txt and HLG.pt, if someone were to try to run this without having trained the system? E.g. download location?

I just added some detailed documentation to show how to download and use a pre-trained model, uploaded by @pkufool.

You can find a preview by visiting
https://github.com/k2-fsa/icefall/blob/acefc703226997b0ecc543e7464cf698220ed4e2/egs/librispeech/ASR/conformer_ctc/README.md


I will also create a Colab notebook to show how to use the pre-trained model.


Ready to merge.

@danpovey (Collaborator)

Wow-- very nice and complete documentation!
LGTM.

@csukuangfj (Collaborator Author)

Here are the logs from using the CPU to transcribe the test waves. They are useful if someone wants to compare
the decoding time between CUDA and CPU from the logs, without running the code.

(1) HLG decoding

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
--words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
--HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac
2021-08-20 11:44:02,306 INFO [pretrained.py:217] device: cpu
2021-08-20 11:44:02,306 INFO [pretrained.py:219] Creating model
2021-08-20 11:44:03,210 INFO [pretrained.py:238] Loading HLG from ./tmp/conformer_ctc/data/lang_bpe/HLG.pt
2021-08-20 11:44:08,006 INFO [pretrained.py:255] Constructing Fbank computer
2021-08-20 11:44:08,008 INFO [pretrained.py:265] Reading sound files: ['./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0002.flac']
2021-08-20 11:44:08,017 INFO [pretrained.py:271] Decoding started
2021-08-20 11:44:18,029 INFO [pretrained.py:300] Use HLG decoding
2021-08-20 11:44:18,392 INFO [pretrained.py:339]
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN

./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION


2021-08-20 11:44:18,393 INFO [pretrained.py:341] Decoding Done

(2) HLG decoding + LM rescoring

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
--words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
--HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
--method whole-lattice-rescoring \
--G ./tmp/conformer_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 0.8 \
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac
2021-08-20 11:46:26,077 INFO [pretrained.py:217] device: cpu
2021-08-20 11:46:26,077 INFO [pretrained.py:219] Creating model
2021-08-20 11:46:26,980 INFO [pretrained.py:238] Loading HLG from ./tmp/conformer_ctc/data/lang_bpe/HLG.pt
2021-08-20 11:46:32,169 INFO [pretrained.py:246] Loading G from ./tmp/conformer_ctc/data/lm/G_4_gram.pt
2021-08-20 11:47:16,114 INFO [pretrained.py:255] Constructing Fbank computer
2021-08-20 11:47:16,118 INFO [pretrained.py:265] Reading sound files: ['./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0002.flac']
2021-08-20 11:47:16,129 INFO [pretrained.py:271] Decoding started
2021-08-20 11:47:26,052 INFO [pretrained.py:305] Use HLG decoding + LM rescoring
2021-08-20 11:47:27,805 INFO [pretrained.py:339]
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN

./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION


2021-08-20 11:47:27,806 INFO [pretrained.py:341] Decoding Done

(3) HLG decoding + LM rescoring + attention decoder rescoring

$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/conformer_ctc/exp/pretraind.pt \
--words-file ./tmp/conformer_ctc/data/lang_bpe/words.txt \
--HLG ./tmp/conformer_ctc/data/lang_bpe/HLG.pt \
--method attention-decoder \
--G ./tmp/conformer_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 1.3 \
--attention-decoder-scale 1.2 \
--lattice-score-scale 0.5 \
--num-paths 100 \
--sos-id 1 \
--eos-id 1 \
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac \
./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac
2021-08-20 11:50:58,383 INFO [pretrained.py:217] device: cpu
2021-08-20 11:50:58,383 INFO [pretrained.py:219] Creating model
2021-08-20 11:50:59,271 INFO [pretrained.py:238] Loading HLG from ./tmp/conformer_ctc/data/lang_bpe/HLG.pt
2021-08-20 11:51:05,072 INFO [pretrained.py:246] Loading G from ./tmp/conformer_ctc/data/lm/G_4_gram.pt
2021-08-20 11:51:49,799 INFO [pretrained.py:255] Constructing Fbank computer
2021-08-20 11:51:49,803 INFO [pretrained.py:265] Reading sound files: ['./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0001.flac', './tmp/conformer_ctc/test_wavs/1221-135766-0002.flac']
2021-08-20 11:51:49,813 INFO [pretrained.py:271] Decoding started
2021-08-20 11:52:00,036 INFO [pretrained.py:313] Use HLG + LM rescoring + attention decoder rescoring
2021-08-20 11:52:02,372 INFO [pretrained.py:339]
./tmp/conformer_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

./tmp/conformer_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN

./tmp/conformer_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION


2021-08-20 11:52:02,372 INFO [pretrained.py:341] Decoding Done

@csukuangfj csukuangfj merged commit 9d0cc9d into k2-fsa:master Aug 20, 2021
@danpovey (Collaborator)

BTW, for this thing where we transcribe the waves, it would be nice to know how much we are being affected by batches being too irregular. It should be possible to find out how big the WER impact of this is by changing the lhotse options for the sampler used in our test code.
In speechbrain, Mirco was working on ways to make the conformer code independent of the batching:
speechbrain/speechbrain#933
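For instance, a sketch of the kind of sampler change being suggested (the manifest path is hypothetical; sampler behavior as I understand lhotse's current API):

from lhotse import load_manifest
from lhotse.dataset import BucketingSampler, SingleCutSampler

cuts = load_manifest("data/cuts_test-clean.json.gz")  # hypothetical path

# Groups cuts of similar duration, so batches are fairly regular.
regular = BucketingSampler(cuts, max_duration=200, num_buckets=30)

# Draws cuts regardless of duration, so batches mix lengths freely.
irregular = SingleCutSampler(cuts, max_duration=200)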
