
Begin to use multiple datasets in training #213

Merged
10 commits merged into k2-fsa:master from the multiple-datasets branch on Feb 21, 2022

Conversation

@csukuangfj (Collaborator) commented Feb 15, 2022

See details at lhotse-speech/lhotse#554 (comment)

TODOs

  • Dataset preparation. Will use on-the-fly feature extraction (see the sketch after this list)
  • Build separate decoder+joiner for LibriSpeech and GigaSpeech
  • Train on LibriSpeech 100 hours
  • Decoding
  • Train on LibriSpeech 960 hours if using GigaSpeech in training turns out to be helpful
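For reference, here is a minimal sketch of on-the-fly feature extraction with lhotse, assuming the usual K2SpeechRecognitionDataset + sampler setup; the manifest path, sampler choice, and option values below are illustrative, not the ones used in this PR.

from torch.utils.data import DataLoader
from lhotse import CutSet, Fbank, FbankConfig
from lhotse.dataset import (
    DynamicBucketingSampler,
    K2SpeechRecognitionDataset,
    OnTheFlyFeatures,
)

# Hypothetical manifest path; the recipe defines its own.
cuts = CutSet.from_file("data/manifests/cuts_train.jsonl.gz")

dataset = K2SpeechRecognitionDataset(
    # Fbank features are computed from audio inside the dataloader workers,
    # so no precomputed feature files are needed on disk.
    input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80))),
)
sampler = DynamicBucketingSampler(cuts, max_duration=300, shuffle=True)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)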

"with training dataset. ",
)

group.add_argument(
Collaborator:

I wonder whether we should standardize the name?
It was asr_dataloader.py in another recipe.

@csukuangfj (Author):

Yes, reverted to the previous name.

@csukuangfj (Author):

Note: I am not going to use the changes in lhotse-speech/lhotse#565, which added support for multiplexing among CutSets, because there are utterances from different datasets in a batch if that method is used.
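To make the distinction concrete, here is a minimal sketch (not the PR's actual training loop) of drawing whole batches from one of two separate dataloaders, so that each batch stays homogeneous; the names libri_dl/giga_dl and the 0.2 probability are illustrative assumptions.

import random

def mixed_batches(libri_dl, giga_dl, giga_prob=0.2):
    # libri_dl and giga_dl are assumed to be two separate dataloaders,
    # one per corpus. Each yielded batch comes entirely from one dataset,
    # unlike multiplexing CutSets, which would interleave cuts from both.
    libri_iter, giga_iter = iter(libri_dl), iter(giga_dl)
    while True:
        use_giga = random.random() < giga_prob
        try:
            batch = next(giga_iter) if use_giga else next(libri_iter)
        except StopIteration:
            return  # stop when the chosen dataloader is exhausted
        yield ("giga" if use_giga else "libri"), batch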

@csukuangfj (Author) commented Feb 16, 2022

Here is the tensorboard log
https://tensorboard.dev/experiment/HRlmSpNCRhKd5NgpqerkNg/#scalars&_smoothingWeight=0

for the following training command:

export CUDA_VISIBLE_DEVICES="2,3"


./transducer_stateless_multi_datasets/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless_multi_datasets/exp-100-2 \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25

(screenshots of the tensorboard training curves)

It uses the S subset of GigaSpeech, which has 250 hours of data.
80% of the time it selects a batch from LibriSpeech and 20% of the time a batch from GigaSpeech.

You can see that the model starts to converge.

The transducer loss for GigaSpeech is higher than that for LibriSpeech. One possible reason is that training sees less GigaSpeech data.


The following shows the model architecture.

The encoder is shared between LibriSpeech and GigaSpeech, but they have separate decoder/joiner networks.
During training, a batch comes from either LibriSpeech or GigaSpeech. When it comes from LibriSpeech, only the LibriSpeech decoder/joiner are run; the GigaSpeech decoder/joiner are simply not used for that batch, and vice versa.

(architecture diagram: a shared encoder with a separate decoder and joiner per dataset)
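A minimal sketch of how such routing could look; the class, attribute, and argument names are made up for illustration, and the loss computation is omitted.

import torch.nn as nn

class MultiDatasetTransducer(nn.Module):
    # Illustrative only: a shared encoder with one decoder/joiner pair
    # per corpus, selected according to where the batch came from.
    def __init__(self, encoder, decoder_libri, joiner_libri,
                 decoder_giga, joiner_giga):
        super().__init__()
        self.encoder = encoder              # shared by both corpora
        self.decoder_libri = decoder_libri  # LibriSpeech-only heads
        self.joiner_libri = joiner_libri
        self.decoder_giga = decoder_giga    # GigaSpeech-only heads
        self.joiner_giga = joiner_giga

    def forward(self, features, feature_lens, targets, is_libri: bool):
        encoder_out, encoder_out_lens = self.encoder(features, feature_lens)
        # Route the batch to the heads of the corpus it came from;
        # the other decoder/joiner are simply not called for this batch.
        if is_libri:
            decoder_out = self.decoder_libri(targets)
            logits = self.joiner_libri(encoder_out, decoder_out)
        else:
            decoder_out = self.decoder_giga(targets)
            logits = self.joiner_giga(encoder_out, decoder_out)
        return logits, encoder_out_lens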

@csukuangfj (Author):

Here are the results for this PR so far:

| Decoding method | test-clean | test-other | Comment |
|---|---|---|---|
| this PR, greedy search (--max-sym-per-frame=1) | 7.19 | 18.89 | --epoch 20 --avg 7 |
| this PR, greedy search (--max-sym-per-frame=1) | 6.79 | 17.81 | --epoch 30 --avg 10 |
| baseline, greedy search (--max-sym-per-frame=1) | 7.65 | 20.69 | --epoch 39 --avg 17 |

You can see that integrating the GigaSpeech dataset into the training pipeline helps to reduce the WER and results in faster convergence.


The training command for this PR is given in #213 (comment), which is repeated below:

./transducer_stateless_multi_datasets/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless_multi_datasets/exp-100-2 \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25

The training command for the baseline is given below:
(The code for the baseline is from #200, which is equivalent to the code in master when --apply-frame-shift=0 and --ctc-weight=0.0 are used.)

export CUDA_VISIBLE_DEVICES="0,1"

./transducer_stateless/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless/exp-100-no-shift \
  --full-libri 0 \
  --max-duration 300 \
  --lr-factor 1 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --apply-frame-shift 0 \
  --modified-transducer-prob 0.25 \
  --ctc-weight 0.0

@danpovey (Collaborator):

Cool!!

@csukuangfj (Author) commented Feb 21, 2022

Here are the results for using train-clean-100 + S subset of GigaSpeech (250 hours):

| Decoding method | test-clean | test-other | Comment |
|---|---|---|---|
| greedy search (max sym per frame 1) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| greedy search (max sym per frame 2) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| greedy search (max sym per frame 3) | 6.34 | 16.7 | --epoch 57, --avg 17, --max-duration 100 |
| modified beam search (beam size 4) | 6.31 | 16.3 | --epoch 57, --avg 17, --max-duration 100 |

The training with --full-libri plus the L subset of GigaSpeech (2.5k hours) is still running; it may take some time to get the results.

A pre-trained model with train-clean-100 is available at https://huggingface.co/csukuangfj/icefall-asr-librispeech-100h-transducer-stateless-multi-datasets-bpe-500-2022-02-21

The tensorboard log can be found at https://tensorboard.dev/experiment/qUEKzMnrTZmOz1EXPda9RA/#scalars&_smoothingWeight=0


[EDITED]:
The results are competitive compared with the ones listed in

@csukuangfj changed the title from "WIP: Begin to use multiple datasets in training" to "Begin to use multiple datasets in training" on Feb 21, 2022
@danpovey (Collaborator):

Cool!!

@csukuangfj (Author) commented Feb 21, 2022

I will merge it and do some experiments based on it.

The results for the full LibriSpeech will be posted later.

@csukuangfj merged commit 2332ba3 into k2-fsa:master on Feb 21, 2022
@csukuangfj deleted the multiple-datasets branch on February 21, 2022 at 09:42
@pzelasko (Collaborator):

nice!
