
Add the upgraded Zipformer model #1058

Merged: 20 commits into k2-fsa:master on May 19, 2023

Conversation

@yaozengwei (Collaborator) commented on May 12, 2023

This PR adds a new recipe for the upgraded Zipformer-Transducer model from @danpovey (see #1057 for the detailed commit history). Compared to the old recipe (pruned_transducer_stateless7), the new model achieves better accuracy with lower memory usage and faster computation.

We will mainly maintain this recipe going forward. Other features (e.g., CTC & attention-decoder model, multi-dataset training, language-model rescoring, delay penalty, etc.) will be added to this recipe.

Our models are trained with the pruned transducer loss on the full LibriSpeech dataset (with 0.9x and 1.1x speed perturbation), using automatic mixed-precision training.
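For reference, the speed perturbation is typically applied at feature-extraction time with lhotse; a minimal sketch (the manifest path below is hypothetical, the recipe's data-prep scripts define the real one):

```python
from lhotse import CutSet

# Hypothetical manifest path; see the recipe's local/ data-prep scripts for the real one.
cuts = CutSet.from_file("data/cuts_train.jsonl.gz")

# 0.9x and 1.1x speed perturbation, tripling the training data,
# matching the setup described above.
cuts = cuts + cuts.perturb_speed(0.9) + cuts.perturb_speed(1.1)
```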

1. Results of the non-streaming models at different scales, without LM rescoring. Each cell reports WER as test-clean & test-other.

   - The normal-scaled model (65.5M), max-duration=1000, ~1h7m per epoch on 4 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.27 & 5.1 | 2.25 & 5.06 | 2.25 & 5.04 | epoch-30-avg-9 |
     | 2.23 & 4.96 | 2.21 & 4.91 | 2.24 & 4.93 | epoch-40-avg-16 |

   - The small-scaled model (23.2M), max-duration=1500, ~1h30m per epoch on 2 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.64 & 6.14 | 2.6 & 6.01 | 2.62 & 6.06 | epoch-30-avg-8 |
     | 2.49 & 5.91 | 2.46 & 5.83 | 2.46 & 5.87 | epoch-40-avg-13 |

   - The large-scaled model (148.4M), max-duration=1000, ~1h20m per epoch on 4 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.12 & 4.91 | 2.11 & 4.9 | 2.13 & 4.93 | epoch-30-avg-9 |
     | 2.12 & 4.8 | 2.11 & 4.77 | 2.13 & 4.78 | epoch-40-avg-16 |
2. Results of the streaming model, without LM rescoring:

   - The normal-scaled model (66.1M), max-duration=1000, ~1h9m per epoch on 4 × V100-32GB GPUs
   - greedy_search decoding at epoch-30-avg-8
   - Chunk sizes count frames at the 50 Hz frame rate (so 16 frames = 320ms); a decoding sketch follows these tables

   Chunk size = 320ms:

   | WER (test-clean & test-other) | configuration | comment |
   | --- | --- | --- |
   | 3.21 & 8.17 | chunk-16-left-context-64 | simulated streaming (time masking) |
   | 3.18 & 8.12 | chunk-16-left-context-64 | real streaming (chunk-wise forward) |
   | 3.1 & 7.84 | chunk-16-left-context-128 | simulated streaming (time masking) |
   | 3.06 & 7.78 | chunk-16-left-context-128 | real streaming (chunk-wise forward) |

   Chunk size = 640ms:

   | WER (test-clean & test-other) | configuration | comment |
   | --- | --- | --- |
   | 2.84 & 7.24 | chunk-32-left-context-128 | simulated streaming (time masking) |
   | 2.85 & 7.25 | chunk-32-left-context-128 | real streaming (chunk-wise forward) |
   | 2.8 & 7.15 | chunk-32-left-context-256 | simulated streaming (time masking) |
   | 2.84 & 7.16 | chunk-32-left-context-256 | real streaming (chunk-wise forward) |
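For reference, a sketch of what the two decoding modes presumably look like on the command line; the flag names follow the recipe's usual conventions, but check zipformer/decode.py and zipformer/streaming_decode.py for the authoritative options:

```bash
# Simulated streaming (time masking): full-utterance forward with chunked attention masks
./zipformer/decode.py \
  --epoch 30 --avg 8 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --decoding-method greedy_search

# Real streaming: chunk-wise forward with cached encoder states
./zipformer/streaming_decode.py \
  --epoch 30 --avg 8 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --decoding-method greedy_search
```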

@yaozengwei (Collaborator, Author) commented on May 13, 2023

The training commands:

- Normal-scaled model (65.5M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --full-libri 1 \
  --max-duration 1000
```

- Small-scaled model (23.2M):

```bash
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-small \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --base-lr 0.04 \
  --full-libri 1 \
  --max-duration 1500
```

- Large-scaled model (148.4M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --full-libri 1 \
  --max-duration 1000
```

- Normal-scaled streaming model (66.1M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --full-libri 1 \
  --max-duration 1000
```
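The matching decoding commands are not restated in this comment; as a hedged sketch for the non-streaming normal-scaled model, assuming the usual icefall flags in zipformer/decode.py:

```bash
# Sketch only: decode epoch 40 averaged over the last 16 checkpoints,
# matching the epoch-40-avg-16 rows in the tables above.
./zipformer/decode.py \
  --epoch 40 --avg 16 \
  --exp-dir zipformer/exp \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4
```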

@yaozengwei (Collaborator, Author) commented on May 13, 2023

I used the NVTX profiler to visualize the event timelines of our four encoder models (the regular Conformer, the reworked Conformer, the old Zipformer, and the upgraded Zipformer) on a V100-32GB GPU. The models are in inference mode, and the input tensor of each batch has shape (20, 3000, 80).

[Timeline screenshots: regular Conformer, reworked Conformer, old Zipformer, and upgraded Zipformer (64.0 M)]
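For reference, a minimal sketch of how such NVTX ranges can be emitted from PyTorch and viewed with Nsight Systems; the stand-in encoder and loop below are illustrative, not the profiling script used for these screenshots:

```python
import torch
import torch.nn as nn

# Stand-in encoder; the actual comparison used the four Conformer/Zipformer variants.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256)).cuda().eval()
x = torch.randn(20, 3000, 80, device="cuda")  # the batch shape quoted above

with torch.inference_mode():
    for _ in range(10):
        torch.cuda.nvtx.range_push("encoder_forward")  # named region in the nsys timeline
        encoder(x)
        torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```

Run it under `nsys profile python profile_encoder.py` and open the report in Nsight Systems to see the labeled ranges.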

Review comment on the new code:

```python
causal=causal)

# TODO: remove it
self.bypass_scale = nn.Parameter(torch.full((embed_dim,), 0.5))
```

Collaborator: Shall we remove this? I think it's unused.

@yaozengwei (Collaborator, Author): Yes, but the trained model already has the bypass_scale parameter... If we removed it, loading the saved model and optimizer state_dicts would fail.
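For context, a minimal sketch (not from this PR) of why removing a registered parameter breaks strict checkpoint loading, and the `strict=False` escape hatch; the optimizer state_dict has no such escape hatch, which is the harder problem:

```python
import torch
import torch.nn as nn

class Old(nn.Module):
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.bypass_scale = nn.Parameter(torch.full((dim,), 0.5))  # later found unused

class New(nn.Module):
    """Same module with the unused parameter removed."""
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

state = Old().state_dict()
try:
    New().load_state_dict(state)  # strict=True by default
except RuntimeError as e:
    print(e)  # Unexpected key(s) in state_dict: "bypass_scale"

New().load_state_dict(state, strict=False)  # extra key is ignored
```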

@yaozengwei added the zipformer label (For the upgraded zipformer recipes) on May 18, 2023
Review comment on the new file header (@@ -0,0 +1,123 @@):

Collaborator: Suggested change:

```diff
-# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
+# Copyright 2021-2023 Xiaomi Corp. (authors: Zengwei Yao)
```

Review comment on lines 68 to 69:

```python
value_head_dim (int or Tuple[int]): dimension of value in each attention head
pos_head_dim (int or Tuple[int]): dimension of positional-encoding projection per
```

Collaborator: Please switch the order of the docs for value_head_dim and pos_head_dim to match the actual argument order.

Review comment on:

```python
respectively.
"""
def __init__(self, *args):
    assert len(args) >= 1
```

Collaborator: Suggested change:

```diff
-assert len(args) >= 1
+assert len(args) >= 1, len(args)
```

Review comment on:

```python
else:
    self.pairs = [ (float(x), float(y)) for x,y in args ]
    for (x,y) in self.pairs:
        assert isinstance(x, float) or isinstance(x, int)
```

Collaborator: Suggested change:

```diff
-assert isinstance(x, float) or isinstance(x, int)
+assert isinstance(x, (float, int)), type(x)
```

Review comment on:

```python
self.pairs = [ (float(x), float(y)) for x,y in args ]
for (x,y) in self.pairs:
    assert isinstance(x, float) or isinstance(x, int)
    assert isinstance(y, float) or isinstance(y, int)
```

Collaborator: Suggested change:

```diff
-assert isinstance(y, float) or isinstance(y, int)
+assert isinstance(y, (float, int)), type(y)
```

Review comment on:

```python
* [(sp[0], sp[1] + xp[1]) for sp, xp in zip(s.pairs, x.pairs)])

def max(self, x):
    if isinstance(x, float) or isinstance(x, int):
```

Collaborator: Suggested change:

```diff
-if isinstance(x, float) or isinstance(x, int):
+if isinstance(x, (float, int)):
```

Review comment on:

```python
include_crossings: if true, include in the x values positions
where the functions indicate by this and p crosss.
"""
assert isinstance(p, PiecewiseLinear)
```

Collaborator: Suggested change:

```diff
-assert isinstance(p, PiecewiseLinear)
+assert isinstance(p, PiecewiseLinear), type(p)
```

Review comment on:

```python
assert isinstance(p, PiecewiseLinear)

# get sorted x-values without repetition.
x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
```

Collaborator: Suggested change:

```diff
-x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
+x_vals = sorted(set([ x for x, _ in self.pairs ] + [ x for x, _ in p.pairs ]))
```

Review comment on:

```python
Example:
    self.dropout = ScheduledFloat((0.0, 0.2), (4000.0, 0.0), default=0.0)

`default` is used when self.batch_count is not set or in training or mode or in
```

Collaborator: Please fix the typo in "in training or mode".


Review comment on:

```python
def __float__(self):
    batch_count = self.batch_count
    if batch_count is None or not self.training or torch.jit.is_scripting():
```

Collaborator: Should it be `not torch.jit.is_scripting()`? Also, should we add `not torch.jit.is_tracing()`?

@yaozengwei (Collaborator, Author): Ok. Thanks.
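If the polarity is kept and only the tracing check is added, the amended condition would presumably read something like this (a sketch, not the committed diff):

```diff
-if batch_count is None or not self.training or torch.jit.is_scripting():
+if batch_count is None or not self.training or torch.jit.is_scripting() or torch.jit.is_tracing():
```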

@yaozengwei merged commit f18b539 into k2-fsa:master on May 19, 2023 (1 of 3 checks passed).
@OswaldoBornemann: May I ask when we could use the latest Zipformer? It seems that I could not find the related code yet.

@csukuangfj (Collaborator):

> May I ask when we could use the latest Zipformer?

Whenever you want.

Please see k2-fsa/sherpa#379

@PPGGG

@OswaldoBornemann:

> > May I ask when we could use the latest Zipformer?
>
> Whenever you want. Please see k2-fsa/sherpa#379

Yeah, that's great. But I would like to find the training recipe to train the latest Zipformer on my own dataset. Should I just use the zipformer folder under librispeech/ASR? Thanks.

@csukuangfj (Collaborator): Yes, you can find the usage in RESULTS.md.

@rookie0607: Hello developers, I have some questions about simulated streaming (time masking): how exactly is it implemented? Is there a paper or code for this? Best wishes. @yaozengwei

@JinZr (Collaborator) commented on Mar 20, 2024 via email.
