Add LSTM for the multi-dataset setup. #558

Merged
merged 4 commits into k2-fsa:master from lstm-giga-libri on Sep 16, 2022

Conversation

csukuangfj (Collaborator)

Used ScaledLSTM from #479.

Current results:

(py38) kuangfangjun:greedy_search$ grep -r -n --color "best for test-clean" log-* | sort -n -k2 | head -n 5
log-decode-epoch-13-avg-2-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-10-35-19:17:greedy_search 2.93    best for test-clean
log-decode-epoch-13-avg-3-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-10-36-34:17:greedy_search 2.93    best for test-clean
log-decode-iter-286000-avg-6-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-27-23-51-34:17:greedy_search      2.94    best for test-clean
log-decode-epoch-12-avg-1-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-28-17-30-38:17:greedy_search 2.96    best for test-clean
log-decode-epoch-12-avg-2-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-28-17-31-57:17:greedy_search 2.96    best for test-clean
(py38) kuangfangjun:modified_beam_search$ grep -r -n --color "best for test-clean" log-* | sort -n -k2 | head -n 5
log-decode-epoch-13-avg-2-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-10-48-08:18:beam_size_4        2.85    best for test-clean
log-decode-epoch-13-avg-3-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-10-52-27:18:beam_size_4        2.88    best for test-clean
log-decode-epoch-12-avg-1-modified_beam_search-beam-size-4-use-averaged-model-2022-08-28-17-40-19:18:beam_size_4        2.89    best for test-clean
log-decode-iter-346000-avg-10-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-11-00-41:18:beam_size_4    2.89    best for test-clean
log-decode-epoch-12-avg-3-modified_beam_search-beam-size-4-use-averaged-model-2022-08-28-17-49-09:18:beam_size_4        2.9     best for test-clean
(py38) kuangfangjun:fast_beam_search$ grep -r -n --color "best for test-clean" log-* | sort -n -k2 | head -n 5
log-decode-iter-306000-avg-12-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-28-13-43-08:18:beam_4.0_max_contexts_4_max_states_8    2.88    best for test-clean
log-decode-iter-308000-avg-13-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-28-13-15-06:18:beam_4.0_max_contexts_4_max_states_8    2.89    best for test-clean
log-decode-iter-308000-avg-14-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-28-13-19-38:18:beam_4.0_max_contexts_4_max_states_8    2.89    best for test-clean
log-decode-epoch-13-avg-3-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-10-41-46:18:beam_4.0_max_contexts_4_max_states_8        2.9     best for test-clean
log-decode-iter-302000-avg-9-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-28-14-35-27:18:beam_4.0_max_contexts_4_max_states_8     2.9     best for test-clean

(py38) kuangfangjun:greedy_search$ grep -r -n --color "best for test-other" log-* | sort -n -k2 | head -n 5
log-decode-epoch-13-avg-3-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-10-36-34:26:greedy_search 7.68    best for test-other
log-decode-iter-348000-avg-12-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-11-39-15:26:greedy_search     7.69    best for test-other
log-decode-epoch-13-avg-2-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-10-35-19:26:greedy_search 7.7     best for test-other
log-decode-iter-346000-avg-7-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-10-34-47:26:greedy_search      7.71    best for test-other
log-decode-iter-348000-avg-11-context-2-max-sym-per-frame-1-use-averaged-model-2022-08-29-11-38-00:26:greedy_search     7.73    best for test-other
(py38) kuangfangjun:modified_beam_search$ grep -r -n --color "best for test-other" log-* | sort -n -k2 | head -n 5
log-decode-iter-348000-avg-12-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-12-08-39:28:beam_size_4    7.48    best for test-other
log-decode-iter-348000-avg-9-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-11-55-08:28:beam_size_4     7.49    best for test-other
log-decode-iter-346000-avg-8-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-10-52-03:28:beam_size_4     7.51    best for test-other
log-decode-iter-348000-avg-11-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-12-04-20:28:beam_size_4    7.51    best for test-other
log-decode-iter-348000-avg-8-modified_beam_search-beam-size-4-use-averaged-model-2022-08-29-11-50-19:28:beam_size_4     7.52    best for test-other
(py38) kuangfangjun:fast_beam_search$ grep -r -n --color "best for test-other" log-* | sort -n -k2 | head -n 5
log-decode-iter-346000-avg-10-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-10-45-43:28:beam_4.0_max_contexts_4_max_states_8    7.61    best for test-other
log-decode-iter-348000-avg-11-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-11-46-18:28:beam_4.0_max_contexts_4_max_states_8    7.62    best for test-other
log-decode-epoch-13-avg-1-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-10-37-48:28:beam_4.0_max_contexts_4_max_states_8        7.63    best for test-other
log-decode-iter-348000-avg-12-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-11-48-14:28:beam_4.0_max_contexts_4_max_states_8    7.63    best for test-other
log-decode-epoch-13-avg-3-beam-4.0-max-contexts-4-max-states-8-use-averaged-model-2022-08-29-10-41-46:28:beam_4.0_max_contexts_4_max_states_8        7.65    best for test-other

@csukuangfj (Collaborator, Author)

Note: Compared to the results of the streaming conformer model from
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#training-on-full-librispeech-use-giga_prob--09
the ScaledLSTM model gives competitive (or even better) results in the low-latency setup.

[Screenshot: Screen Shot 2022-08-29 at 7:11:29 PM]


Training command of this PR:

#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./lstm_transducer_stateless2/train.py \
  --world-size 8 \
  --num-epochs 35 \
  --start-epoch 10 \
  --full-libri 1 \
  --exp-dir lstm_transducer_stateless2/exp \
  --max-duration 500 \
  --use-fp16 0 \
  --lr-epochs 10 \
  --num-workers 2 \
  --giga-prob 0.9

Decoding command:

#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search fast_beam_search modified_beam_search; do
  for epoch in 13; do
    for avg in 1 2 3; do
      ./lstm_transducer_stateless2/decode.py \
        --epoch $epoch \
        --avg $avg \
        --exp-dir lstm_transducer_stateless2/exp \
        --max-duration 600 \
        --num-encoder-layers 12 \
        --rnn-hidden-size 1024 \
        --decoding-method $m \
        --use-averaged-model True \
        --beam 4 \
        --max-contexts 4 \
        --max-states 8 \
        --beam-size 4
    done
  done
done

@danpovey (Collaborator)

Cool!
I am hoping that after fixing gradient explosion issues, we can further improve these results.

@csukuangfj (Collaborator, Author)

Here are the results after training for more epochs (at epoch 16):

The WER for test-clean is 2.76. I think it will continue to decrease with further training.

[Screenshot: Screen Shot 2022-08-31 at 2:32:53 PM]

[Screenshot: Screen Shot 2022-08-31 at 2:34:17 PM]

@csukuangfj (Collaborator, Author)

Results for this PR:

decoding method                      | test-clean | test-other | comment
greedy search (max sym per frame 1)  | 2.78       | 7.36       | --iter 468000 --avg 16
modified_beam_search                 | 2.73       | 7.15       | --iter 468000 --avg 16
fast_beam_search                     | 2.76       | 7.31       | --iter 468000 --avg 16
greedy search (max sym per frame 1)  | 2.77       | 7.35       | --iter 472000 --avg 18
modified_beam_search                 | 2.75       | 7.08       | --iter 472000 --avg 18
fast_beam_search                     | 2.77       | 7.29       | --iter 472000 --avg 18

We use --iter rather than --epoch since --iter produces better results in this case.

The model is trained for 18 epochs.
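
For readers unfamiliar with the --iter/--avg options: in its plainest form, averaging several checkpoint-<iter>.pt files just means taking an element-wise mean of their parameter tensors. The sketch below only illustrates that idea; it is not the icefall implementation (with --use-averaged-model True, decode.py uses a running parameter average stored inside the checkpoints), and the file names and the "model" key are assumptions.

import torch

def average_checkpoints(filenames, key="model"):
    # Element-wise mean of the parameter tensors stored under `key` in each
    # checkpoint; non-floating-point entries (e.g. integer counters) are
    # simply taken from the first checkpoint.
    avg = {
        k: v.clone()
        for k, v in torch.load(filenames[0], map_location="cpu")[key].items()
    }
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")[key]
        for k, v in state.items():
            if v.is_floating_point():
                avg[k] += v
    for k, v in avg.items():
        if v.is_floating_point():
            avg[k] /= len(filenames)
    return avg

# Hypothetical usage: average the checkpoints selected by --iter/--avg and
# load the result into the model before decoding.
# filenames = [f"lstm_transducer_stateless2/exp/checkpoint-{it}.pt" for it in iters]
# model.load_state_dict(average_checkpoints(filenames))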


The training command is:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./lstm_transducer_stateless2/train.py \
  --world-size 8 \
  --num-epochs 35 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir lstm_transducer_stateless2/exp \
  --max-duration 500 \
  --use-fp16 0 \
  --lr-epochs 10 \
  --num-workers 2 \
  --giga-prob 0.9

Note: Training was stopped manually after epoch-18.pt was saved. Also, we resumed
training after epoch-9.pt was saved.

The tensorboard log can be found at
https://tensorboard.dev/experiment/1ziQ2LFmQY2mt4dlUr5dyA/

The decoding command is

for m in greedy_search fast_beam_search modified_beam_search; do
  for iter in 472000; do
    for avg in 8 10 12 14 16 18; do
      ./lstm_transducer_stateless2/decode.py \
        --iter $iter \
        --avg $avg \
        --exp-dir lstm_transducer_stateless2/exp \
        --max-duration 600 \
        --num-encoder-layers 12 \
        --rnn-hidden-size 1024 \
        --decoding-method $m \
        --use-averaged-model True \
        --beam 4 \
        --max-contexts 4 \
        --max-states 8 \
        --beam-size 4
    done
  done
done

Pretrained models, training logs, decoding logs, and decoding results
are available at
https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03

@danpovey (Collaborator) commented Sep 3, 2022

Nice!! Could you or @yaozengwei please run this model with --print-diagnostics=True? I want to see whether the gradients are blowing up.

@danpovey (Collaborator) commented Sep 4, 2022

I have been looking at the diagnostics file that we got on epoch 18 from running with --print-diagnostics=True.
It seems that gradient explosion is happening in only one layer: layer number 10. Quite possibly this is normal; we don't have very much to compare it with right now.
Firstly, the lstm and feed_forward modules of all layers have outputs of roughly the same magnitude, generally between 0.1 and 0.4 in RMS value. So we can compare the gradient magnitudes across layers without worrying about adjusting for differences in output magnitude.

At the output of the lstm layers, the gradient is dramatically bigger for layers 9 and below, by a factor
of about 2000. (It is a bit smaller for layers 3 and 4, about half of the usual value, but we can ignore that
for now; maybe the model was relying more on the feedforward module in those layers.) A rough sketch of how such per-layer gradient statistics could be collected is shown after the stats below.

grep 'lstm.grad\[0\]' diagnostics-start-from-epoch-18.txt | grep 'dim=2' | grep 'rms '
module=encoder.encoder.layers.0.lstm.grad[0], dim=2, size=512, rms percentiles: [1.5e+02 1.9e+02 1.9e+02 2e+02 2.1e+02 2.1e+02 2.2e+02 2.3e+02 2.4e+02 2.6e+02 4.9e+02], mean=2.2e+02, rms=2.2e+02
module=encoder.encoder.layers.1.lstm.grad[0], dim=2, size=512, rms percentiles: [1.2e+02 1.6e+02 1.6e+02 1.7e+02 1.8e+02 1.8e+02 1.9e+02 2e+02 2.1e+02 2.2e+02 3.8e+02], mean=1.9e+02, rms=1.9e+02
module=encoder.encoder.layers.2.lstm.grad[0], dim=2, size=512, rms percentiles: [17 20 21 22 23 23 24 25 26 28 46], mean=24, rms=24
module=encoder.encoder.layers.3.lstm.grad[0], dim=2, size=512, rms percentiles: [16 19 20 21 21 22 23 24 25 27 38], mean=23, rms=23
module=encoder.encoder.layers.4.lstm.grad[0], dim=2, size=512, rms percentiles: [1.2e+02 1.5e+02 1.6e+02 1.6e+02 1.7e+02 1.7e+02 1.8e+02 1.9e+02 1.9e+02 2.1e+02 2.7e+02], mean=1.8e+02, rms=1.8e+02
module=encoder.encoder.layers.5.lstm.grad[0], dim=2, size=512, rms percentiles: [1.2e+02 1.4e+02 1.5e+02 1.6e+02 1.6e+02 1.7e+02 1.7e+02 1.8e+02 1.9e+02 2e+02 2.6e+02], mean=1.7e+02, rms=1.7e+02
module=encoder.encoder.layers.6.lstm.grad[0], dim=2, size=512, rms percentiles: [1.1e+02 1.3e+02 1.4e+02 1.4e+02 1.5e+02 1.5e+02 1.6e+02 1.7e+02 1.7e+02 1.8e+02 2.3e+02], mean=1.6e+02, rms=1.6e+02
module=encoder.encoder.layers.7.lstm.grad[0], dim=2, size=512, rms percentiles: [1.2e+02 1.5e+02 1.5e+02 1.6e+02 1.7e+02 1.7e+02 1.8e+02 1.8e+02 1.9e+02 2e+02 2.5e+02], mean=1.7e+02, rms=1.7e+02
module=encoder.encoder.layers.8.lstm.grad[0], dim=2, size=512, rms percentiles: [1e+02 1.2e+02 1.3e+02 1.3e+02 1.3e+02 1.4e+02 1.4e+02 1.5e+02 1.5e+02 1.6e+02 2e+02], mean=1.4e+02, rms=1.4e+02
module=encoder.encoder.layers.9.lstm.grad[0], dim=2, size=512, rms percentiles: [85 1e+02 1.1e+02 1.2e+02 1.2e+02 1.2e+02 1.3e+02 1.3e+02 1.3e+02 1.4e+02 2.3e+02], mean=1.2e+02, rms=1.2e+02
module=encoder.encoder.layers.10.lstm.grad[0], dim=2, size=512, rms percentiles: [0.036 0.046 0.048 0.049 0.051 0.052 0.054 0.056 0.058 0.063 0.23], mean=0.054, rms=0.055
module=encoder.encoder.layers.11.lstm.grad[0], dim=2, size=512, rms percentiles: [0.042 0.048 0.05 0.053 0.056 0.058 0.061 0.064 0.07 0.077 0.59], mean=0.062, rms=0.067
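
As an aside, per-layer statistics like the ones above can be approximated outside of icefall with a couple of hooks that record the RMS of the gradient w.r.t. each lstm module's output. The sketch below is only a rough, hypothetical approximation, not the actual icefall diagnostics code; it assumes (T, N, C) activations and simply averages the per-batch RMS values.

import torch

def attach_grad_rms_hooks(model, suffix=".lstm"):
    # For every module whose name ends with `suffix`, record the per-channel RMS
    # (over dims 0 and 1 of a (T, N, C) tensor) of the gradient w.r.t. its output.
    stats = {}

    def make_hook(name):
        def forward_hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if out.requires_grad:
                out.register_hook(
                    lambda g: stats.setdefault(name, []).append(
                        g.detach().pow(2).mean(dim=(0, 1)).sqrt().cpu()
                    )
                )
        return forward_hook

    for name, m in model.named_modules():
        if name.endswith(suffix):
            m.register_forward_hook(make_hook(name))
    return stats

def print_rms_percentiles(stats):
    q = torch.linspace(0, 1, 11)  # 11 percentiles, as in the output above
    for name, chunks in sorted(stats.items()):
        rms = torch.stack(chunks).mean(dim=0)  # averaging per-batch RMS is a simplification
        print(f"{name}: rms percentiles={torch.quantile(rms, q)}, mean={rms.mean():.3g}")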

The same is true of the gradients w.r.t. the feedforward modules. If we look at the param grads, layer 11 has about 100 times smaller magnitude than most layers, but layer 10 actually has the largest param grad, about 10 times the normal value. This confirms that the gradient explosion is happening in layer 10, which is in fact the only place where it could be happening given the previous things I noticed.

grep 'lstm.weight_hh_l0.param_grad' diagnostics-start-from-epoch-18.txt | grep 'dim=1' | grep 'rms '
module=encoder.encoder.layers.0.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [1e+02 1.4e+02 1.6e+02 1.8e+02 2e+02 2.2e+02 2.5e+02 2.9e+02 3.4e+02 4.1e+02 8.6e+02], mean=2.6e+02, rms=2.8e+02
module=encoder.encoder.layers.1.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [53 80 90 1e+02 1.1e+02 1.3e+02 1.4e+02 1.6e+02 1.9e+02 2.3e+02 6.8e+02], mean=1.4e+02, rms=1.6e+02
module=encoder.encoder.layers.2.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [9.4 17 20 23 26 29 32 38 44 54 90], mean=33, rms=36
module=encoder.encoder.layers.3.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [7.7 12 13 15 16 17 18 20 22 25 1.2e+02], mean=18, rms=20
module=encoder.encoder.layers.4.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [90 1.4e+02 1.6e+02 1.8e+02 2.1e+02 2.4e+02 2.8e+02 3.3e+02 3.9e+02 4.6e+02 1.1e+03], mean=2.8e+02, rms=3.2e+02
module=encoder.encoder.layers.5.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [59 85 95 1.1e+02 1.2e+02 1.3e+02 1.4e+02 1.6e+02 1.8e+02 2.2e+02 4.5e+02], mean=1.4e+02, rms=1.6e+02
module=encoder.encoder.layers.6.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [1.1e+02 1.6e+02 1.8e+02 2e+02 2.2e+02 2.4e+02 2.8e+02 3.2e+02 3.8e+02 4.8e+02 1.1e+03], mean=2.9e+02, rms=3.2e+02
module=encoder.encoder.layers.7.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [59 97 1.1e+02 1.3e+02 1.4e+02 1.7e+02 1.9e+02 2.2e+02 2.5e+02 3.2e+02 5.6e+02], mean=1.9e+02, rms=2.1e+02
module=encoder.encoder.layers.8.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [1.8e+02 2.8e+02 3.3e+02 3.8e+02 4.2e+02 4.7e+02 5.3e+02 6.1e+02 7.1e+02 8.2e+02 1.6e+03], mean=5.2e+02, rms=5.7e+02
module=encoder.encoder.layers.9.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [1.4e+02 2.3e+02 2.8e+02 3.3e+02 3.7e+02 4.5e+02 5.4e+02 6.4e+02 7.5e+02 9.6e+02 1.5e+03], mean=5.3e+02, rms=6e+02
module=encoder.encoder.layers.10.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [7.9e+02 1.2e+03 1.4e+03 1.5e+03 1.7e+03 1.9e+03 2.2e+03 2.5e+03 2.9e+03 3.5e+03 6.8e+03], mean=2.2e+03, rms=2.4e+03
module=encoder.encoder.layers.11.lstm.weight_hh_l0.param_grad, dim=1, size=512, rms percentiles: [0.27 0.46 0.56 0.67 0.79 0.93 1.1 1.3 1.5 1.9 11], mean=1.1, rms=1.3

@danpovey (Collaborator) commented Sep 5, 2022

Something else: if we look at the stats over the batch dim, we can see whether some elements of the batch have larger gradients than others. Normally these distributions would be very flat, but here they are not: the last element in the list is enormously larger, meaning that a small number of sequences have a gradient about 1000 times larger than the rest. You can see that this does not happen for layers 10 and 11 (this is measured at the output of the lstm module); it only happens for the layers before layer 10.

grep 'lstm.grad\[0\]' diagnostics-start-from-epoch-18.txt | grep 'dim=1' | grep 'rms '
module=encoder.encoder.layers.0.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.071 0.18 0.23 0.27 0.33 0.37 0.47 0.61 0.88 1.7 1.1e+03], mean=9.3, rms=89
module=encoder.encoder.layers.1.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.049 0.14 0.16 0.19 0.23 0.26 0.34 0.44 0.65 1.2 9.2e+02], mean=7.8, rms=76
module=encoder.encoder.layers.2.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.042 0.12 0.14 0.16 0.19 0.22 0.25 0.32 0.44 0.97 1e+02], mean=1.6, rms=9.7
module=encoder.encoder.layers.3.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0074 0.1 0.12 0.14 0.17 0.19 0.22 0.28 0.4 0.73 1.1e+02], mean=1.3, rms=9.2
module=encoder.encoder.layers.4.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.014 0.081 0.11 0.13 0.16 0.18 0.21 0.3 0.48 1 8.5e+02], mean=7.3, rms=71
module=encoder.encoder.layers.5.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.045 0.091 0.11 0.12 0.15 0.16 0.2 0.29 0.45 0.93 8.2e+02], mean=6.6, rms=68
module=encoder.encoder.layers.6.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.046 0.082 0.094 0.11 0.13 0.14 0.17 0.25 0.39 0.92 7.3e+02], mean=6.5, rms=63
module=encoder.encoder.layers.7.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.011 0.067 0.093 0.11 0.13 0.15 0.17 0.21 0.36 0.86 8.2e+02], mean=7.1, rms=69
module=encoder.encoder.layers.8.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0079 0.035 0.066 0.076 0.092 0.11 0.13 0.18 0.28 0.6 6.9e+02], mean=5.7, rms=57
module=encoder.encoder.layers.9.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.033 0.054 0.066 0.076 0.087 0.1 0.11 0.18 0.27 0.71 6.3e+02], mean=5.1, rms=50
module=encoder.encoder.layers.10.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0024 0.02 0.025 0.027 0.033 0.039 0.046 0.052 0.062 0.074 0.17], mean=0.044, rms=0.052
module=encoder.encoder.layers.11.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.016 0.032 0.035 0.039 0.042 0.047 0.052 0.057 0.068 0.087 0.21], mean=0.055, rms=0.061

Because the batch sizes vary, the diagnostics code in effect just concatenates the per-sequence values across all the batches, rather than trying to align those dims somehow. So this is telling us that a few sequences have blown-up gradients while most sequences are fine. (A small sketch of this bookkeeping follows.)
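
Concretely, the bookkeeping amounts to computing one gradient-RMS value per sequence in each batch and concatenating those values across batches before taking percentiles. A minimal, hypothetical sketch (the shapes and the helper name are assumptions, not the icefall diagnostics code):

import torch

def per_sequence_rms_percentiles(grads):
    # `grads` is a list with one grad-w.r.t.-lstm-output tensor per batch,
    # each of shape (T, N_i, C) with a varying batch size N_i.
    # RMS over the time and channel dims gives one value per sequence; the
    # values from all batches are simply concatenated before taking percentiles.
    rms = torch.cat([g.pow(2).mean(dim=(0, 2)).sqrt() for g in grads])
    q = torch.linspace(0, 1, 11)
    return torch.quantile(rms, q), rms.mean(), rms.pow(2).mean().sqrt()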

@danpovey (Collaborator) commented Sep 5, 2022

Zengwei is using the "gradient filter" idea from #564, which zeroes out the gradient of any sequence that appears to have a far-too-large gradient, doing this at the input of the LSTM layer. He is continuing Fangjun's training run from here, but from epoch 19.
It seems to resolve the huge-gradient issue. Running with a threshold of 10.0, meaning that sequences whose gradient is more than 10x larger than the median have their gradient zeroed, we see the following gradient sizes; the rms value is now about 200 times smaller than before. (A minimal sketch of the filtering idea is given after the stats below.)

grep 'lstm.grad\[0\]' /ceph-zw/workspace/rnn/icefall_multi_dataset/egs/librispeech/ASR/lstm_transducer_stateless3/exp-threshold-10.0/diagnostics-epoch-19 | grep 'dim=1' | grep 'rms '

module=encoder.encoder.layers.0.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.04 0.13 0.17 0.19 0.22 0.25 0.29 0.36 0.47 0.67 1.5], mean=0.34, rms=0.42
module=encoder.encoder.layers.1.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.031 0.097 0.12 0.13 0.15 0.18 0.21 0.26 0.34 0.51 1.2], mean=0.25, rms=0.31
module=encoder.encoder.layers.10.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0025 0.022 0.026 0.029 0.032 0.038 0.044 0.049 0.062 0.072 0.17], mean=0.044, rms=0.051
module=encoder.encoder.layers.11.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.018 0.031 0.034 0.038 0.041 0.044 0.049 0.056 0.069 0.087 0.21], mean=0.054, rms=0.061
module=encoder.encoder.layers.2.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.028 0.089 0.11 0.12 0.14 0.16 0.17 0.2 0.26 0.33 0.74], mean=0.19, rms=0.22
module=encoder.encoder.layers.3.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0078 0.066 0.092 0.11 0.12 0.14 0.15 0.18 0.23 0.3 0.98], mean=0.17, rms=0.23
module=encoder.encoder.layers.4.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.014 0.062 0.081 0.093 0.1 0.12 0.13 0.16 0.21 0.3 1], mean=0.16, rms=0.21
module=encoder.encoder.layers.5.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.023 0.07 0.083 0.093 0.1 0.11 0.13 0.15 0.19 0.26 1.3], mean=0.16, rms=0.21
module=encoder.encoder.layers.6.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.021 0.062 0.072 0.079 0.088 0.099 0.11 0.13 0.17 0.24 1.7], mean=0.14, rms=0.19
module=encoder.encoder.layers.7.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0043 0.053 0.077 0.088 0.096 0.11 0.13 0.15 0.19 0.27 2.1], mean=0.15, rms=0.22
module=encoder.encoder.layers.8.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.0074 0.023 0.051 0.061 0.068 0.075 0.086 0.1 0.13 0.19 2.8], mean=0.11, rms=0.22
module=encoder.encoder.layers.9.lstm.grad[0], dim=1, size=5..51, rms percentiles: [0.019 0.044 0.05 0.056 0.061 0.068 0.077 0.089 0.1 0.14 3.1], mean=0.099, rms=0.22
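
For illustration, such a gradient filter can be written as a custom autograd function: the identity in the forward pass, while in the backward pass it zeroes the gradient of any sequence whose per-sequence gradient RMS exceeds the threshold times the batch median. The sketch below is only a minimal illustration of that idea, not the actual code from #564 (which also filters the recurrent parameter gradients and uses a soft mask); the (T, N, C) layout and the helper name are assumptions.

import torch

class GradientFilterFunction(torch.autograd.Function):
    # Identity in the forward pass; in the backward pass, zero the gradient of
    # sequences whose gradient RMS is more than `threshold` times the batch median.
    # The input is assumed to have shape (T, N, C).

    @staticmethod
    def forward(ctx, x, threshold=10.0):
        ctx.threshold = threshold
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Per-sequence RMS over the time and channel dims -> shape (N,)
        rms = grad_output.pow(2).mean(dim=(0, 2)).sqrt()
        mask = (rms <= ctx.threshold * rms.median()).to(grad_output.dtype)
        return grad_output * mask.view(1, -1, 1), None

def gradient_filter(x, threshold=10.0):
    return GradientFilterFunction.apply(x, threshold)

# Hypothetical usage: apply at the input of each LSTM layer so that a few
# outlier sequences cannot blow up the recurrent weight gradients.
# x = gradient_filter(x, threshold=10.0)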

@csukuangfj merged commit 97b3fc5 into k2-fsa:master on Sep 16, 2022
@csukuangfj deleted the lstm-giga-libri branch on September 16, 2022 at 10:40
@danpovey (Collaborator)

For the record: further experiments with various iterations of the gradient-filtering idea seem to make very little difference to the actual results. Oddly, the gradient explosion seems to have very little effect on convergence, even though it clearly does happen, and we are not even using gradient clipping. It must be quite rare; perhaps we were just unlucky to see it in the diagnostics that we looked at.

csukuangfj added a commit to csukuangfj/icefall that referenced this pull request Nov 14, 2022

* Support running icefall outside of a git tracked directory. (k2-fsa#470)

* Support running icefall outside of a git tracked directory.

* Minor fixes.

* Rand combine update result (k2-fsa#467)

* update RESULTS.md

* fix test code in pruned_transducer_stateless5/conformer.py

* minor fix

* delete doc

* fix style

* Simplified memory bank for Emformer (k2-fsa#440)

* init files

* use average value as memory vector for each chunk

* change tail padding length from right_context_length to chunk_length

* correct the files, ln -> cp

* fix bug in conv_emformer_transducer_stateless2/emformer.py

* fix doc in conv_emformer_transducer_stateless/emformer.py

* refactor init states for stream

* modify .flake8

* fix bug about memory mask when memory_size==0

* add @torch.jit.export for init_states function

* update RESULTS.md

* minor change

* update README.md

* modify doc

* replace torch.div() with <<

* fix bug, >> -> <<

* use i&i-1 to judge if it is a power of 2

* minor fix

* fix error in RESULTS.md

* update multi_quantization installation (k2-fsa#469)

* update multi_quantization installation

* Update egs/librispeech/ASR/pruned_transducer_stateless6/train.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* [Ready] [Recipes] add aishell2 (k2-fsa#465)

* add aishell2

* fix aishell2

* add manifest stats

* update prepare char dict

* fix lint

* setting max duration

* lint

* change context size to 1

* update result

* update hf link

* fix decoding comment

* add more decoding methods

* update result

* change context-size 2 default

* [WIP] Rnn-T LM nbest rescoring (k2-fsa#471)

* add compile_lg.py for aishell2 recipe (k2-fsa#481)

* Add RNN-LM rescoring in fast beam search (k2-fsa#475)

* fix for case of None stats

* Update conformer.py for aishell4 (k2-fsa#484)

* update conformer.py for aishell4

* update conformer.py

* add strict=False when model.load_state_dict

* CTC attention model with reworked Conformer encoder and reworked Transformer decoder (k2-fsa#462)

* ctc attention model with reworked conformer encoder and reworked transformer decoder

* remove unnecessary func

* resolve flake8 conflicts

* fix typos and modify the expr of ScaledEmbedding

* use original beam size

* minor changes to the scripts

* add rnn lm decoding

* minor changes

* check whether q k v weight is None

* check whether q k v weight is None

* check whether q k v weight is None

* style correction

* update results

* update results

* upload the decoding results of rnn-lm to the RESULTS

* upload the decoding results of rnn-lm to the RESULTS

* Update egs/librispeech/ASR/RESULTS.md

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/librispeech/ASR/RESULTS.md

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/librispeech/ASR/RESULTS.md

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update doc to add a link to Nadira Povey's YouTube channel. (k2-fsa#492)

* Update doc to add a link to Nadira Povey's YouTube channel.

* fix a typo

* Add stats about duration and padding proportion (k2-fsa#485)

* add stats about duration and padding proportion

* add  for utt_duration

* add stats for other recipes

* add stats for other 2 recipes

* modify doc

* minor change

* Add modified_beam_search for streaming decode (k2-fsa#489)

* Add modified_beam_search for pruned_transducer_stateless/streaming_decode.py

* refactor

* modified beam search for stateless3,4

* Fix comments

* Add real streaming CI

* Fix using G before assignment in pruned_transducer_stateless/decode.py (k2-fsa#494)

* Support using aidatatang_200zh optionally in aishell training (k2-fsa#495)

* Use aidatatang_200zh optionally in aishell training.

* Fix get_transducer_model() for aishell. (k2-fsa#497)

PR k2-fsa#495 introduces an error. This commit fixes it.

* [WIP] Pruned-transducer-stateless5-for-WenetSpeech (offline and streaming) (k2-fsa#447)

* pruned-rnnt5-for-wenetspeech

* style check

* style check

* add streaming conformer

* add streaming decode

* changes codes for fast_beam_search and export cpu jit

* add modified-beam-search for streaming decoding

* add modified-beam-search for streaming decoding

* change for streaming_beam_search.py

* add README.md and RESULTS.md

* change for style_check.yml

* do some changes

* do some changes for export.py

* add some decode commands for usage

* add streaming results on README.md

* [debug] raise remind when git-lfs not available (k2-fsa#504)

* [debug] raise remind when git-lfs not available

* modify comment

* correction for prepare.sh (k2-fsa#506)

* Set overwrite=True when extracting features in batches. (k2-fsa#487)

* correction for get rank id. (k2-fsa#507)

* Fix no attribute 'data' error.

* minor fixes

* correction for get rank id.

* Add other decoding methods (nbest, nbest oracle, nbest LG) for wenetspeech pruned rnnt2 (k2-fsa#482)

* add other decoding methods for wenetspeech

* changes for RESULTS.md

* add ngram-lm-scale=0.35 results

* set ngram-lm-scale=0.35 as default

* Update README.md

* add nbest-scale for flie name

* Support dynamic chunk streaming training in pruned_transducer_stateless5 (offline and streaming) (k2-fsa#454)

* support dynamic chunk streaming training

* Add simulate streaming decoding

* Support streaming decoding

* fix causal

* Minor fixes

* fix streaming decode; add results

* liear_fst_with_self_loops (k2-fsa#512)

* Support exporting to ONNX format (k2-fsa#501)

* WIP: Support exporting to ONNX format

* Minor fixes.

* Combine encoder/decoder/joiner into a single file.

* Revert merging three onnx models into a single one.

It's quite time consuming to extract a sub-graph from the combined
model. For instance, it takes more than one hour to extract
the encoder model.

* Update CI to test ONNX models.

* Decode with exported models.

* Fix typos.

* Add more doc.

* Remove ncnn as it is not fully tested yet.

* Fix as_strided for streaming conformer.

* Convert ScaledEmbedding to nn.Embedding for inference. (k2-fsa#517)

* Convert ScaledEmbedding to nn.Embedding for inference.

* Fix CI style issues.

* Fix preparing char based lang and add multiprocessing for wenetspeech text segmentation (k2-fsa#513)

* add multiprocessing for wenetspeech text segmentation

* Fix preparing char based lang for wenetspeech

* fix style

Co-authored-by: WeijiZhuang <zhuangweiji@xiaomi.com>

* change for pruned rnnt5 train.py (k2-fsa#519)

* fix about tensorboard (k2-fsa#516)

* fix metricstracker

* fix style

* Merging onnx models (k2-fsa#518)

* add export function of onnx-all-in-one to export.py

* add onnx_check script for all-in-one onnx model

* minor fix

* remove unused arguments

* add onnx-all-in-one test

* fix style

* fix style

* fix requirements

* fix input/output names

* fix installing onnx_graphsurgeon

* fix instaliing onnx_graphsurgeon

* revert to previous requirements.txt

* fix minor

* Fix loading sampler state dict. (k2-fsa#421)

* Fix loading sampler state dict.

* skip scan_pessimistic_batches_for_oom if params.start_batch > 0

* fix torchaudio version (k2-fsa#524)

* fix torchaudio version

* fix torchaudio version

* Fix computing averaged loss in the aishell recipe. (k2-fsa#523)

* Fix computing averaged loss in the aishell recipe.

* Set find_unused_parameters optionally.

* Sort results to make it more convenient to compare decoding results (k2-fsa#522)

* Sort result to make it more convenient to compare decoding results

* Add cut_id to recognition results

* add cut_id to results for all recipes

* Fix torch.jit.script

* Fix comments

* Minor fixes

* Fix torch.jit.tracing for Pytorch version before v1.9.0

* Add function display_and_save_batch in wenetspeech/pruned_transducer_stateless2/train.py (k2-fsa#528)

* Add function display_and_save_batch in egs/wenetspeech/ASR/pruned_transducer_stateless2/train.py

* Modify function: display_and_save_batch

* Delete empty line in pruned_transducer_stateless2/train.py

* Modify code format

* Filter non-finite losses (k2-fsa#525)

* Filter non-finite losses

* Fixes after review

* propagate changes from k2-fsa#525 to other librispeech recipes (k2-fsa#531)

* propagate changes from k2-fsa#525 to other librispeech recipes

* refactor display_and_save_batch to utils

* fixed typo

* reformat code style

* Fix not enough values to unpack error . (k2-fsa#533)

* Use ScaledLSTM as streaming encoder (k2-fsa#479)

* add ScaledLSTM

* add RNNEncoderLayer and RNNEncoder classes in lstm.py

* add RNN and Conv2dSubsampling classes in lstm.py

* hardcode bidirectional=False

* link from pruned_transducer_stateless2

* link scaling.py pruned_transducer_stateless2

* copy from pruned_transducer_stateless2

* modify decode.py pretrained.py test_model.py train.py

* copy streaming decoding files from pruned_transducer_stateless2

* modify streaming decoding files

* simplified code in ScaledLSTM

* flat weights after scaling

* pruned2 -> pruned4

* link __init__.py

* fix style

* remove add_model_arguments

* modify .flake8

* fix style

* fix scale value in scaling.py

* add random combiner for training deeper model

* add using proj_size

* add scaling converter for ScaledLSTM

* support jit trace

* add using averaged model in export.py

* modify test_model.py, test if the model can be successfully exported by jit.trace

* modify pretrained.py

* support streaming decoding

* fix model.py

* Add cut_id to recognition results

* Add cut_id to recognition results

* do not pad in Conv subsampling module; add tail padding during decoding.

* update RESULTS.md

* minor fix

* fix doc

* update README.md

* minor change, filter infinite loss

* remove the condition of raise error

* modify type hint for the return value in model.py

* minor change

* modify RESULTS.md

Co-authored-by: pkufool <wkang.pku@gmail.com>

* Update asr_datamodule.py (k2-fsa#538)

minor file names correction

* minor fixes to LSTM streaming model (k2-fsa#537)

* Pruned transducer stateless2 for AISHELL-1 (k2-fsa#536)

* Fix not enough values to unpack error .

* [WIP] Pruned transducer stateless2 for AISHELL-1

* fix the style issue

* code format for black

* add pruned-transducer-stateless2 results for AISHELL-1

* simplify result

* consider case of empty tensor (k2-fsa#540)

* fixed import quantization is none (k2-fsa#541)

Signed-off-by: shanguanma <nanr9544@gmail.com>

Signed-off-by: shanguanma <nanr9544@gmail.com>
Co-authored-by: shanguanma <nanr9544@gmail.com>

* fix typo for export jit script (k2-fsa#544)

* some small changes for aidatatang_200zh (k2-fsa#542)

* Update prepare.sh

* Update compute_fbank_aidatatang_200zh.py

* fixed no cut_id error in decode_dataset (k2-fsa#549)

* fixed import quantization is none

Signed-off-by: shanguanma <nanr9544@gmail.com>

* fixed no cut_id error in decode_dataset

Signed-off-by: shanguanma <nanr9544@gmail.com>

* fixed more than one "#"

Signed-off-by: shanguanma <nanr9544@gmail.com>

* fixed code style

Signed-off-by: shanguanma <nanr9544@gmail.com>

Signed-off-by: shanguanma <nanr9544@gmail.com>
Co-authored-by: shanguanma <nanr9544@gmail.com>

* Add clamping operation in Eve optimizer for all scalar weights to avoid non-stable training in some scenarios (k2-fsa#550). The clamping range is set to (-10, 2). Note that this change may cause unexpected effects if you resume training from a model that was trained without clamping.

* minor changes for correct path names && import module text2segments.py (k2-fsa#552)

* Update asr_datamodule.py

minor file names correction

* minor changes for correct path names && import module text2segments.py

* fix scaling converter test for decoder(predictor). (k2-fsa#553)

* Disable CUDA_LAUNCH_BLOCKING in wenetspeech recipes. (k2-fsa#554)

* Disable CUDA_LAUNCH_BLOCKING in wenetspeech recipes.

* minor fixes

* Check that read_manifests_if_cached returns a non-empty dict. (k2-fsa#555)

* Modified prepare_transcripts.py and preprare_lexicon.py of tedlium3 recipe (k2-fsa#567)

* Use modified ctc topo when vocab size is > 500 (k2-fsa#568)

* Add LSTM for the multi-dataset setup. (k2-fsa#558)

* Add LSTM for the multi-dataset setup.

* Add results

* fix style issues

* add missing file

* Adding Dockerfile for Ubuntu18.04-pytorch1.12.1-cuda11.3-cudnn8 (k2-fsa#572)

* Changed Dockerfile

* Update Dockerfile

* Dockerfile

* Update README.md

* Add Dockerfiles

* Update README.md

Removed misleading CUDA version, as the Ubuntu18.04-pytorch1.7.1-cuda11.0-cudnn8 Dockerfile can only support CUDA versions >11.0.

* support exporting to ncnn format via PNNX (k2-fsa#571)

* Small fixes to the transducer training doc (k2-fsa#575)

* Update kaldifeat in CI tests (k2-fsa#583)

* padding zeros (k2-fsa#591)

* Gradient filter for training lstm model (k2-fsa#564)

* init files

* add gradient filter module

* refact getting median value

* add cutoff for grad filter

* delete comments

* apply gradient filter in LSTM module, to filter both input and params

* fix typing and refactor

* filter with soft mask

* rename lstm_transducer_stateless2 to lstm_transducer_stateless3

* fix typos, and update RESULTS.md

* minor fix

* fix return typing

* fix typo

* Modified train.py of tedlium3 models (k2-fsa#597)

* Add dill to requirements.txt (k2-fsa#613)

* Add dill to requirements.txt

* Disable style check for python 3.7

* update docs (k2-fsa#611)

* update docs

Co-authored-by: unknown <mazhihao@jshcbd.cn>
Co-authored-by: KajiMaCN <moonlightshadowmzh@gmail.com>

* exporting projection layers of joiner separately for onnx (k2-fsa#584)

* exporting projection layers of joiner separately for onnx

* Remove all-in-one for onnx export (k2-fsa#614)

* Remove all-in-one for onnx export

* Exit on error for CI

* Modify ActivationBalancer for speed (k2-fsa#612)

* add a probability to apply ActivationBalancer

* minor fix

* minor fix

* Support exporting to ONNX for the wenetspeech recipe (k2-fsa#615)

* Support exporting to ONNX for the wenetspeech recipe

* Add doc about model export (k2-fsa#618)

* Add doc about model export

* fix typos

* Fix links in the doc (k2-fsa#619)

* fix type hints for decode.py (k2-fsa#623)

* Support exporting LSTM with projection to ONNX (k2-fsa#621)

* Support exporting LSTM with projection to ONNX

* Add missing files

* small fixes

* CSJ Data Preparation (k2-fsa#617)

* workspace setup

* csj prepare done

* Change compute_fbank_musan.py to soft link

* add description

* change lhotse prepare csj command

* split train-dev here

* Add header

* remove debug

* save manifest_statistics

* generate transcript in Lhotse

* update comments in config file

* fix number of parameters in RESULTS.md (k2-fsa#627)

* Add Shallow fusion in modified_beam_search (k2-fsa#630)

* Add utility for shallow fusion

* test batch size == 1 without shallow fusion

* Use shallow fusion for modified-beam-search

* Modified beam search with ngram rescoring

* Fix code according to review

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Add kaldifst to requirements.txt (k2-fsa#631)

* Install kaldifst for GitHub actions (k2-fsa#632)

* Install kaldifst for GitHub actions

* Update train.py (k2-fsa#635)

Add the missing step to add the arguments to the parser.

* Fix type hints for decode.py (k2-fsa#638)

* Fix type hints for decode.py

* Fix flake8

* fix typos (k2-fsa#639)

* Remove onnx and onnxruntime from requirements.txt (k2-fsa#640)

* Remove onnx and onnxruntime from requirements.txt

* Checkout the LM for aishell explicitly (k2-fsa#642)

* Get timestamps during decoding (k2-fsa#598)

* print out timestamps during decoding

* add word-level alignments

* support to compute mean symbol delay with word-level alignments

* print variance of symbol delay

* update doc

* support to compute delay for pruned_transducer_stateless4

* fix bug

* add doc

* remove tail padding for non-streaming models (k2-fsa#625)

* support RNNLM shallow fusion for LSTM transducer

* support RNNLM shallow fusion in stateless5

* update results

* update decoding commands

* update author info

* update

* include previous added decoding method

* minor fixes

* remove redundant test lines

* Update egs/librispeech/ASR/lstm_transducer_stateless2/decode.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update tdnn_lstm_ctc.rst (k2-fsa#647)

* Update README.md (k2-fsa#649)

* Update tdnn_lstm_ctc.rst (k2-fsa#648)

* fix torchaudio version in dockerfile (k2-fsa#653)

* fix torchaudio version in dockerfile

* remove kaldiio

* update docs

* Add fast_beam_search_LG (k2-fsa#622)

* Add fast_beam_search_LG

* add fast_beam_search_LG to commonly used recipes

* fix ci

* fix ci

* Fix error

* Fix LG log file name (k2-fsa#657)

* resolve conflict with timestamp feature

* resolve conflicts

* minor fixes

* remove testing file

* Apply delay penalty on transducer (k2-fsa#654)

* add delay penalty

* fix CI

* fix CI

* Refactor getting timestamps in fsa-based decoding (k2-fsa#660)

* refactor getting timestamps for fsa-based decoding

* fix doc

* fix bug

* add ctc_decode.py

* fix doc

Signed-off-by: shanguanma <nanr9544@gmail.com>
Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>
Co-authored-by: LIyong.Guo <839019390@qq.com>
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-authored-by: ezerhouni <61225408+ezerhouni@users.noreply.github.com>
Co-authored-by: Mingshuang Luo <37799481+luomingshuang@users.noreply.github.com>
Co-authored-by: Daniel Povey <dpovey@gmail.com>
Co-authored-by: Quandwang <quandwang@hotmail.com>
Co-authored-by: Wei Kang <wkang.pku@gmail.com>
Co-authored-by: boji123 <boji123@aliyun.com>
Co-authored-by: Lucky Wong <lekai.huang@gmail.com>
Co-authored-by: LIyong.Guo <guonwpu@qq.com>
Co-authored-by: Weiji Zhuang <zhuangweiji@foxmail.com>
Co-authored-by: WeijiZhuang <zhuangweiji@xiaomi.com>
Co-authored-by: Yunusemre <yunusemreozkose@gmail.com>
Co-authored-by: FNLPprojects <linxinzhulxz@gmail.com>
Co-authored-by: yangsuxia <34536059+yangsuxia@users.noreply.github.com>
Co-authored-by: marcoyang1998 <45973641+marcoyang1998@users.noreply.github.com>
Co-authored-by: rickychanhoyin <ricky.hoyin.chan@gmail.com>
Co-authored-by: Duo Ma <39255927+shanguanma@users.noreply.github.com>
Co-authored-by: shanguanma <nanr9544@gmail.com>
Co-authored-by: rxhmdia <41623136+rxhmdia@users.noreply.github.com>
Co-authored-by: kobenaxie <572745565@qq.com>
Co-authored-by: shcxlee <113081290+shcxlee@users.noreply.github.com>
Co-authored-by: Teo Wen Shen <36886809+teowenshen@users.noreply.github.com>
Co-authored-by: KajiMaCN <827272056@qq.com>
Co-authored-by: unknown <mazhihao@jshcbd.cn>
Co-authored-by: KajiMaCN <moonlightshadowmzh@gmail.com>
Co-authored-by: Yunusemre <yunusemre.ozkose@sestek.com>
Co-authored-by: Nagendra Goel <nagendra.goel@gmail.com>
Co-authored-by: marcoyang <marcoyang1998@gmail.com>
Co-authored-by: zr_jin <60612200+JinZr@users.noreply.github.com>