
zipformer wenetspeech #1130

Merged: 27 commits merged into k2-fsa:master on Jun 26, 2023
Conversation

@pkufool (Collaborator) commented Jun 15, 2023

This is the WenetSpeech recipe for the latest zipformer model (modeling with Chinese characters).

Non-streaming model

The training command (using the default medium-size model):

```bash
./zipformer/train.py \
  --world-size 6 \
  --num-epochs 12 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --lr-epochs 1.5 \
  --context-size 2 \
  --exp-dir zipformer/exp_L_context_2 \
  --causal 0 \
  --num-workers 8
```

Best results for each epoch (a sample decoding command is sketched after the table):

| Epoch | Greedy search (dev & net & meeting) | Modified beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| 4 | 7.83 & 8.86 & 13.73 | 7.75 & 8.81 & 13.67 | avg=1; blank-penalty=2 |
| 5 | 7.75 & 8.46 & 13.38 | 7.68 & 8.41 & 13.27 | avg=1; blank-penalty=2 |
| 6 | 7.72 & 8.19 & 13.16 | 7.62 & 8.14 & 13.06 | avg=1; blank-penalty=2 |
| 7 | 7.59 & 8.08 & 12.97 | 7.53 & 8.01 & 12.87 | avg=2; blank-penalty=2 |
| 8 | 7.68 & 7.87 & 12.96 | 7.61 & 7.81 & 12.88 | avg=1; blank-penalty=2 |
| 9 | 7.57 & 7.77 & 12.87 | 7.5 & 7.71 & 12.77 | avg=1; blank-penalty=2 |
| 10 | 7.45 & 7.7 & 12.69 | 7.39 & 7.63 & 12.59 | avg=2; blank-penalty=2 |
| 11 | 7.35 & 7.67 & 12.46 | 7.31 & 7.63 & 12.43 | avg=3; blank-penalty=2 |
| 12 | 7.36 & 7.65 & 12.43 | 7.32 & 7.61 & 12.35 | avg=4; blank-penalty=2 |
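For reference, the avg / blank-penalty settings above correspond to decoding runs along the following lines (a sketch only; the flag names are taken from the standard icefall zipformer decode.py, and the exact set of options may differ):

```bash
./zipformer/decode.py \
  --epoch 12 \
  --avg 4 \
  --exp-dir zipformer/exp_L_context_2 \
  --context-size 2 \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --blank-penalty 2
```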

The influence of blank-penalty (greedy-search results at epoch 12; a short sketch of what the flag does follows the table):

| blank-penalty | Dev | Test-Net | Test-Meeting |
|---|---|---|---|
| 0 | 8.58 | 7.84 | 14.64 |
| 1 | 7.82 | 7.64 | 13.08 |
| 1.5 | 7.55 | 7.62 | 12.63 |
| 2 | 7.36 | 7.65 | 12.43 |
| 2.5 | 7.24 | 7.77 | 12.37 |
| 3 | 7.24 | 7.94 | 12.48 |
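For readers wondering what the flag does mechanically: the blank penalty is subtracted from the blank logit at each decoding step before the next token is picked, which makes the decoder less eager to emit blank and so reduces deletions (at the cost of more insertions once the penalty gets large, which is consistent with the Test-Net column above). A minimal sketch of the idea, not the exact icefall code; the function name and shapes are illustrative:

```python
import torch


def greedy_step(joiner_logits: torch.Tensor, blank_id: int = 0,
                blank_penalty: float = 2.0) -> torch.Tensor:
    """One greedy-search step over joiner output of shape (batch, vocab_size).

    Subtracting a constant from the blank logit lowers the chance that blank
    wins the argmax, so more non-blank tokens are emitted (fewer deletions).
    """
    logits = joiner_logits.clone()
    logits[:, blank_id] -= blank_penalty
    return logits.argmax(dim=-1)  # blank_id here still means "emit nothing"
```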

Streaming model

The training command (using the default medium-size model):

```bash
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 12 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --lr-epochs 1.5 \
  --context-size 2 \
  --exp-dir zipformer/exp_L_causal_context_2 \
  --causal 1 \
  --num-workers 8
```

Best results for each epoch (--chunk-size=16; --left-context-frames=128)

| Epoch | Greedy search (dev & net & meeting) | Modified beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| 6 | 9.14 & 10.75 & 18.15 | 8.79 & 10.54 & 17.64 | avg=1; blank-penalty=1.5 |
| 7 | 9.11 & 10.61 & 17.86 | 8.8 & 10.42 & 17.29 | avg=1; blank-penalty=1.5 |
| 8 | 8.89 & 10.32 & 17.44 | 8.59 & 10.09 & 16.9 | avg=1; blank-penalty=1.5 |
| 9 | 8.86 & 10.11 & 17.35 | 8.55 & 9.87 & 16.76 | avg=1; blank-penalty=1.5 |
| 10 | 8.66 & 10.0 & 16.94 | 8.39 & 9.83 & 16.47 | avg=2; blank-penalty=1.5 |
| 11 | 8.58 & 9.92 & 16.67 | 8.32 & 9.77 & 16.27 | avg=3; blank-penalty=1.5 |
| 12 | 8.45 & 9.89 & 16.46 | 8.21 & 9.77 & 16.07 | avg=4; blank-penalty=1.5 |

The influence of blank-penalty (greedy-search results at epoch 12; --chunk-size=32, --left-context-frames=128):

| blank-penalty | Dev | Test-Net | Test-Meeting |
|---|---|---|---|
| 0 | 9.03 | 9.22 | 16.63 |
| 1 | 8.26 | 9.02 | 15.48 |
| 1.5 | 8.01 | 9.05 | 15.32 |
| 2 | 7.88 | 9.19 | 15.39 |
| 2.5 | 7.9 | 9.44 | 15.7 |
| 3 | 8.03 | 9.77 | 16.3 |

The decoding results for different latency settings (greedy-search results; a sample decoding command is sketched after these tables):

--chunk-size=16; --left-context-frames=64

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 9.17 | 10.91 | 18.78 | avg=1; blank-penalty=1.5 |
| 7 | 9.12 | 10.77 | 18.48 | avg=1; blank-penalty=1.5 |
| 8 | 8.95 | 10.48 | 18.12 | avg=1; blank-penalty=1.5 |
| 9 | 8.92 | 10.28 | 18.02 | avg=1; blank-penalty=1.5 |
| 10 | 8.73 | 10.15 | 17.58 | avg=2; blank-penalty=1.5 |
| 11 | 8.68 | 10.08 | 17.37 | avg=3; blank-penalty=1.5 |
| 12 | 8.54 | 10.04 | 17.16 | avg=4; blank-penalty=1.5 |

--chunk-size=32; --left-context-frames=128

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 8.7 | 9.86 | 16.83 | avg=1; blank-penalty=1.5 |
| 7 | 8.71 | 9.7 | 16.6 | avg=1; blank-penalty=1.5 |
| 8 | 8.52 | 9.46 | 16.23 | avg=1; blank-penalty=1.5 |
| 9 | 8.46 | 9.29 | 16.17 | avg=1; blank-penalty=1.5 |
| 10 | 8.25 | 9.14 | 15.74 | avg=2; blank-penalty=1.5 |
| 11 | 8.15 | 9.08 | 15.52 | avg=3; blank-penalty=1.5 |
| 12 | 8.01 | 9.05 | 15.32 | avg=4; blank-penalty=1.5 |

--chunk-size=64; --left-context-frames=256

| Epoch | Dev | Test-net | Test-meeting | Decoding setup |
|---|---|---|---|---|
| 6 | 8.36 | 9.18 | 15.5 | avg=1; blank-penalty=1.5 |
| 7 | 8.36 | 9.05 | 15.32 | avg=1; blank-penalty=1.5 |
| 8 | 8.16 | 8.85 | 14.96 | avg=1; blank-penalty=1.5 |
| 9 | 8.14 | 8.64 | 14.89 | avg=1; blank-penalty=1.5 |
| 10 | 7.91 | 8.54 | 14.54 | avg=2; blank-penalty=1.5 |
| 11 | 7.82 | 8.49 | 14.31 | avg=3; blank-penalty=1.5 |
| 12 | 7.67 | 8.47 | 14.07 | avg=4; blank-penalty=1.5 |
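The chunk-size / left-context-frames pairs above map onto the decoding command roughly as follows (again only a sketch, assuming the simulated-streaming path of the standard zipformer decode.py; the exact flags may differ):

```bash
./zipformer/decode.py \
  --epoch 12 \
  --avg 4 \
  --exp-dir zipformer/exp_L_causal_context_2 \
  --context-size 2 \
  --causal 1 \
  --chunk-size 32 \
  --left-context-frames 128 \
  --max-duration 600 \
  --decoding-method greedy_search \
  --blank-penalty 1.5
```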

@kobenaxie (Contributor):

Hi @pkufool, could you explain why the blank penalty improves accuracy?

@pkufool (Collaborator, Author) commented Jun 19, 2023

> Hi @pkufool, could you explain why the blank penalty improves accuracy?

We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

@pkufool (Collaborator, Author) commented Jun 23, 2023

The best results:

| Type | Greedy search (dev & net & meeting) | Beam search (dev & net & meeting) | Decoding setup |
|---|---|---|---|
| Non-streaming | 7.36 & 7.65 & 12.43 | 7.32 & 7.61 & 12.35 | --epoch=12 |
| Streaming | 8.45 & 9.89 & 16.46 | 8.21 & 9.77 & 16.07 | --epoch=12; --chunk-size=16; --left-context-frames=256 |
| Streaming | 8.0 & 9.0 & 15.11 | 7.84 & 8.94 & 14.92 | --epoch=12; --chunk-size=32; --left-context-frames=256 |

The model (non-streaming): https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
The model (streaming): https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615
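For anyone who wants to try the checkpoints, the repositories can be fetched with git-lfs in the usual way (a sketch; the URL is the non-streaming repo above, and the streaming repo works the same way):

```bash
# Requires git-lfs so that the .pt checkpoints are actually downloaded.
git lfs install
git clone https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
```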

Comparison with other open-source results:

| Toolkit | Dev | Test-Net | Test-Meeting | AIshell |
|---|---|---|---|---|
| Kaldi | 9.07 | 12.83 | 24.72 | 5.41 |
| Espnet | 9.70 | 8.90 | 15.90 | 3.90 |
| Wenet | 8.88 | 9.70 | 15.59 | 4.61 |
| Next-gen Kaldi | 7.32 | 7.61 | 12.35 | 3.7 |

Comparison with our previous results (pruned_transducer_stateless5):

| Model | Type | Greedy (dev & net & meeting) | Beam search (dev & net & meeting) | Setup |
|---|---|---|---|---|
| Reworked Conformer | Non-streaming | 8.22 & 9.03 & 14.54 | 8.17 & 9.04 & 14.44 | --epoch 4 |
| Zipformer | Non-streaming | 7.83 & 8.86 & 13.73 | 7.75 & 8.81 & 13.67 | --epoch 4 |
| Reworked Conformer | Streaming | 8.78 & 10.12 & 16.16 | 8.53 & 9.95 & 15.81 | --epoch 7; latency=320ms |
| Zipformer | Streaming | 8.35 & 9.59 & 16.26 | 8.35 & 9.46 & 15.85 | --epoch 7; latency=320ms |

@csukuangfj (Collaborator):

Did you use CTC loss during training?

@pkufool (Collaborator, Author) commented Jun 23, 2023

> Did you use CTC loss during training?

No.

@pkufool (Collaborator, Author) commented Jun 23, 2023

@csukuangfj I made some changes to export.py and export-onnx.py (accepting tokens.txt rather than bpe.model) so that they can be shared among different recipes; I think it is better to maintain only one copy of the exporting code.

@csukuangfj (Collaborator):

> I think it is better to maintain only one copy of the exporting code.

Agreed. You can use symlinks to avoid additional copies.

@pkufool (Collaborator, Author) commented Jun 23, 2023

> > I think it is better to maintain only one copy of the exporting code.
>
> Agreed. You can use symlinks to avoid additional copies.

Then please have a look at the changes under librispeech/ASR/zipformer to check whether they are OK, and whether I missed any other code that needs to be changed.
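For illustration, sharing the exporting code via symlinks could look roughly like this (a sketch; the relative paths assume the usual egs/&lt;dataset&gt;/ASR layout of icefall):

```bash
# Run from the wenetspeech zipformer recipe directory.
cd egs/wenetspeech/ASR/zipformer
ln -sf ../../../librispeech/ASR/zipformer/export.py export.py
ln -sf ../../../librispeech/ASR/zipformer/export-onnx.py export-onnx.py
```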

@pkufool added and then removed the ready label on Jun 24, 2023
@danpovey (Collaborator):

> > Hi @pkufool, could you explain why the blank penalty improves accuracy?
>
> We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

We don't really know why we had to compensate for deletion errors in this particular setup, because we haven't seen this effect in other zipformer examples or in other types of system on this data. If it recurs we may develop a better theory.

@pkufool added and then removed the ready label on Jun 25, 2023
@pkufool merged commit 219bba1 into k2-fsa:master on Jun 26, 2023
3 checks passed

Review comment on the changed code (the old line is replaced by the two new ones):

```diff
-params.vocab_size = sp.get_piece_size()
+token_table = k2.SymbolTable.from_file(params.tokens)
+params.vocab_size = num_tokens(token_table)
```

A Collaborator commented:

It should be

```python
params.vocab_size = num_tokens(token_table) + 1
```

The +1 is missing.
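For context, a minimal sketch of the corrected logic, assuming icefall's num_tokens helper does not count the id-0 (blank) symbol, so the extra +1 is needed for the joiner's output dimension to match what sp.get_piece_size() returns in the BPE recipes (the tokens.txt path below is illustrative):

```python
import k2
from icefall.utils import num_tokens  # num_tokens as used in the diff above; import path assumed

# tokens.txt maps each modeling unit (here: Chinese characters) to an integer id.
token_table = k2.SymbolTable.from_file("data/lang_char/tokens.txt")  # path is an assumption

# num_tokens() is assumed not to count the blank (id 0), hence the +1.
vocab_size = num_tokens(token_table) + 1
```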

@OswaldoBornemann:

@pkufool I noticed that you used 6 GPUs to train the zipformer2, and each epoch takes 22 hours. What kind of GPU did you use?

@yaozengwei (Collaborator):

> @pkufool I noticed that you used 6 GPUs to train the zipformer2, and each epoch takes 22 hours. What kind of GPU did you use?

32 GB NVIDIA Tesla V100.

@OswaldoBornemann:

I see. Have you noticed that the latest lhotse makes zipformer training much faster than before? Do you have a similar experience?

@pkufool (Collaborator, Author) commented Dec 25, 2023

> I see. Have you noticed that the latest lhotse makes zipformer training much faster than before? Do you have a similar experience?

We haven't trained it recently; we will try it. Thanks!

@xingchensong:

> > Hi @pkufool, could you explain why the blank penalty improves accuracy?
>
> We added the blank penalty because we saw a lot of deletion errors in the decoded results. It might be related to the subsampling mechanism in zipformer; @danpovey may have more to say.

Do the deletion errors usually occur at the beginning, middle, or end of the decoded result?

@xingchensong:

In the conformer model I have similarly encountered an exceptionally high proportion of deletion errors on test_meeting, and the majority of these errors are omissions of modal particles and redundant characters.


@xingchensong:

Hi guys, I added a similar penalty to a CTC-based conformer and found that it is really helpful.

I guess this is caused by the training dataset (WenetSpeech), which contains a lot of low-quality paired data.

For more info, please see wenet-e2e/wenet#2278.

@xingchensong:

cc @pkufool @danpovey

@xingchensong:

This is a very interesting phenomenon, and I believe it's worth our time to delve deeper into the underlying principles together.

@pkufool (Collaborator, Author) commented Jan 5, 2024

@xingchensong FYI, the blank penalty does not help on the zipformer large model (around 148M params). Yes, a very interesting phenomenon.

@pkufool (Collaborator, Author) commented Jan 5, 2024

@xingchensong What's your model size? We found that the small & medium zipformer need the blank penalty, but the large model (more powerful?) does not. Maybe you can try increasing the number of parameters to see whether the same holds for your model.

@xingchensong:

> @xingchensong What's your model size? We found that the small & medium zipformer need the blank penalty, but the large model (more powerful?) does not. Maybe you can try increasing the number of parameters to see whether the same holds for your model.

116.9M parameters, trained in unified streaming & non-streaming mode.

https://e2qq6pi6j9.feishu.cn/docx/EFpod2n30omXITx08OAcMSjlnxd

@SongLi89:

Hi, I used 4 GPUs to train the streaming zipformer model (rnnt loss) on WenetSpeech. The parameter settings are the same as those provided on Hugging Face (https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/tree/main/logs/training), and the training/validation loss looks good, but the test results (WER) are not as good as the pretrained model: about 1-4% WER higher than the models provided on Hugging Face at each epoch. For example, at epoch 9 (avg=1, chunk-size=32, left-context-frames=256) I got 19.6% WER on the MEETING test set.
The only difference I see is that I used 4 GPUs while the pretrained model used 8; is there any parameter that should be manually tuned based on the number of GPUs, or are there other possible reasons?
[attached: training/validation loss curves (zipformer_loss)]

@yaozengwei (Collaborator):

> Hi, I used 4 GPUs to train the streaming zipformer model (rnnt loss) on WenetSpeech. The parameter settings are the same as those provided on Hugging Face (https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/tree/main/logs/training), and the training/validation loss looks good, but the test results (WER) are not as good as the pretrained model: about 1-4% WER higher than the models provided on Hugging Face at each epoch. For example, at epoch 9 (avg=1, chunk-size=32, left-context-frames=256) I got 19.6% WER on the MEETING test set. The only difference I see is that I used 4 GPUs while the pretrained model used 8; is there any parameter that should be manually tuned based on the number of GPUs, or are there other possible reasons?

What value of max-duration are you using? And for the pretrained model, what is the result at epoch 9 with that decoding setup (avg=1, chunk-size=32, left-context-frames=256)?
