
Add the upgraded Zipformer model #1058

Merged: 20 commits into k2-fsa:master on May 19, 2023

Conversation

@yaozengwei (Collaborator) commented on May 12, 2023

This PR adds a new recipe for the upgraded Zipformer-Transducer model from @danpovey (see #1057 for the detailed commit history). Compared to the old recipe (pruned_transducer_stateless7), the new model achieves better accuracy with lower memory usage and faster computation.

We will mainly maintain this recipe going forward. Other features (e.g., CTC & attention-decoder model, multi-dataset training, language-model rescoring, delay penalty, etc.) will be added to this recipe.

Our models are trained with the pruned transducer loss on the full LibriSpeech dataset (with 0.9x and 1.1x speed perturbation), using automatic mixed-precision training.
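For reference, the speed perturbation is typically applied at feature-extraction time with lhotse; a minimal sketch (the manifest path below is hypothetical, the recipe's data-prep scripts define the real one):

```python
from lhotse import CutSet

# Hypothetical manifest path; see the recipe's local/ data-prep scripts for the real one.
cuts = CutSet.from_file("data/cuts_train.jsonl.gz")

# 0.9x and 1.1x speed perturbation, tripling the training data,
# matching the setup described above.
cuts = cuts + cuts.perturb_speed(0.9) + cuts.perturb_speed(1.1)
```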

1. Results of the non-streaming models at different scales, without LM rescoring. Each cell reports WER as test-clean & test-other.

   - The normal-scaled model (65.5M), max-duration=1000, ~1h7m per epoch on 4 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.27 & 5.1 | 2.25 & 5.06 | 2.25 & 5.04 | epoch-30-avg-9 |
     | 2.23 & 4.96 | 2.21 & 4.91 | 2.24 & 4.93 | epoch-40-avg-16 |

   - The small-scaled model (23.2M), max-duration=1500, ~1h30m per epoch on 2 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.64 & 6.14 | 2.6 & 6.01 | 2.62 & 6.06 | epoch-30-avg-8 |
     | 2.49 & 5.91 | 2.46 & 5.83 | 2.46 & 5.87 | epoch-40-avg-13 |

   - The large-scaled model (148.4M), max-duration=1000, ~1h20m per epoch on 4 × V100-32GB GPUs:

     | greedy_search | modified_beam_search | fast_beam_search | comment |
     | --- | --- | --- | --- |
     | 2.12 & 4.91 | 2.11 & 4.9 | 2.13 & 4.93 | epoch-30-avg-9 |
     | 2.12 & 4.8 | 2.11 & 4.77 | 2.13 & 4.78 | epoch-40-avg-16 |
2. Results of the streaming model, without LM rescoring:

   - The normal-scaled model (66.1M), max-duration=1000, ~1h9m per epoch on 4 × V100-32GB GPUs
   - greedy_search decoding at epoch-30-avg-8
   - Chunk sizes count frames at the 50 Hz frame rate (so 16 frames = 320ms); a decoding sketch follows these tables

   Chunk size = 320ms:

   | WER (test-clean & test-other) | configuration | comment |
   | --- | --- | --- |
   | 3.21 & 8.17 | chunk-16-left-context-64 | simulated streaming (time masking) |
   | 3.18 & 8.12 | chunk-16-left-context-64 | real streaming (chunk-wise forward) |
   | 3.1 & 7.84 | chunk-16-left-context-128 | simulated streaming (time masking) |
   | 3.06 & 7.78 | chunk-16-left-context-128 | real streaming (chunk-wise forward) |

   Chunk size = 640ms:

   | WER (test-clean & test-other) | configuration | comment |
   | --- | --- | --- |
   | 2.84 & 7.24 | chunk-32-left-context-128 | simulated streaming (time masking) |
   | 2.85 & 7.25 | chunk-32-left-context-128 | real streaming (chunk-wise forward) |
   | 2.8 & 7.15 | chunk-32-left-context-256 | simulated streaming (time masking) |
   | 2.84 & 7.16 | chunk-32-left-context-256 | real streaming (chunk-wise forward) |
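For reference, a sketch of what the two decoding modes presumably look like on the command line; the flag names follow the recipe's usual conventions, but check zipformer/decode.py and zipformer/streaming_decode.py for the authoritative options:

```bash
# Simulated streaming (time masking): full-utterance forward with chunked attention masks
./zipformer/decode.py \
  --epoch 30 --avg 8 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --decoding-method greedy_search

# Real streaming: chunk-wise forward with cached encoder states
./zipformer/streaming_decode.py \
  --epoch 30 --avg 8 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --decoding-method greedy_search
```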

@yaozengwei (Collaborator, Author) commented on May 13, 2023

The training commands:

- Normal-scaled model (65.5M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --full-libri 1 \
  --max-duration 1000
```

- Small-scaled model (23.2M):

```bash
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-small \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --base-lr 0.04 \
  --full-libri 1 \
  --max-duration 1500
```

- Large-scaled model (148.4M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --full-libri 1 \
  --max-duration 1000
```

- Normal-scaled streaming model (66.1M):

```bash
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --full-libri 1 \
  --max-duration 1000
```
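The matching decoding commands are not restated in this comment; as a hedged sketch for the non-streaming normal-scaled model, assuming the usual icefall flags in zipformer/decode.py:

```bash
# Sketch only: decode epoch 40 averaged over the last 16 checkpoints,
# matching the epoch-40-avg-16 rows in the tables above.
./zipformer/decode.py \
  --epoch 40 --avg 16 \
  --exp-dir zipformer/exp \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4
```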

@yaozengwei (Collaborator, Author) commented on May 13, 2023

I used the NVTX profiler to visualize the event timelines of our four encoder models (the regular Conformer, the reworked Conformer, the old Zipformer, and the upgraded Zipformer) on a V100-32GB GPU. The models are in inference mode, and the input tensor of each batch has shape (20, 3000, 80).

[Timeline screenshots: regular Conformer, reworked Conformer, old Zipformer, and upgraded Zipformer (64.0 M)]
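For reference, a minimal sketch of how such NVTX ranges can be emitted from PyTorch and viewed with Nsight Systems; the stand-in encoder and loop below are illustrative, not the profiling script used for these screenshots:

```python
import torch
import torch.nn as nn

# Stand-in encoder; the actual comparison used the four Conformer/Zipformer variants.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256)).cuda().eval()
x = torch.randn(20, 3000, 80, device="cuda")  # the batch shape quoted above

with torch.inference_mode():
    for _ in range(10):
        torch.cuda.nvtx.range_push("encoder_forward")  # named region in the nsys timeline
        encoder(x)
        torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```

Run it under `nsys profile python profile_encoder.py` and open the report in Nsight Systems to see the labeled ranges.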

Review comment on the new code:

```python
causal=causal)

# TODO: remove it
self.bypass_scale = nn.Parameter(torch.full((embed_dim,), 0.5))
```

Collaborator: Shall we remove this? I think it's unused.

@yaozengwei (Collaborator, Author): Yes, but the trained model already has the bypass_scale parameter... If we removed it, loading the saved model and optimizer state_dicts would fail.
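For context, a minimal sketch (not from this PR) of why removing a registered parameter breaks strict checkpoint loading, and the `strict=False` escape hatch; the optimizer state_dict has no such escape hatch, which is the harder problem:

```python
import torch
import torch.nn as nn

class Old(nn.Module):
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.bypass_scale = nn.Parameter(torch.full((dim,), 0.5))  # later found unused

class New(nn.Module):
    """Same module with the unused parameter removed."""
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

state = Old().state_dict()
try:
    New().load_state_dict(state)  # strict=True by default
except RuntimeError as e:
    print(e)  # Unexpected key(s) in state_dict: "bypass_scale"

New().load_state_dict(state, strict=False)  # extra key is ignored
```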

@yaozengwei added the zipformer label (For the upgraded zipformer recipes) on May 18, 2023
Review comment on the new file header (@@ -0,0 +1,123 @@):

Collaborator: Suggested change:

```diff
-# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
+# Copyright 2021-2023 Xiaomi Corp. (authors: Zengwei Yao)
```

Review comment on lines 68 to 69:

```python
value_head_dim (int or Tuple[int]): dimension of value in each attention head
pos_head_dim (int or Tuple[int]): dimension of positional-encoding projection per
```

Collaborator: Please switch the order of the docs for value_head_dim and pos_head_dim to match the actual argument order.

Review comment on:

```python
respectively.
"""
def __init__(self, *args):
    assert len(args) >= 1
```

Collaborator: Suggested change:

```diff
-assert len(args) >= 1
+assert len(args) >= 1, len(args)
```

Review comment on:

```python
else:
    self.pairs = [ (float(x), float(y)) for x,y in args ]
    for (x,y) in self.pairs:
        assert isinstance(x, float) or isinstance(x, int)
```

Collaborator: Suggested change:

```diff
-assert isinstance(x, float) or isinstance(x, int)
+assert isinstance(x, (float, int)), type(x)
```

Review comment on:

```python
self.pairs = [ (float(x), float(y)) for x,y in args ]
for (x,y) in self.pairs:
    assert isinstance(x, float) or isinstance(x, int)
    assert isinstance(y, float) or isinstance(y, int)
```

Collaborator: Suggested change:

```diff
-assert isinstance(y, float) or isinstance(y, int)
+assert isinstance(y, (float, int)), type(y)
```

Review comment on:

```python
* [(sp[0], sp[1] + xp[1]) for sp, xp in zip(s.pairs, x.pairs)])

def max(self, x):
    if isinstance(x, float) or isinstance(x, int):
```

Collaborator: Suggested change:

```diff
-if isinstance(x, float) or isinstance(x, int):
+if isinstance(x, (float, int)):
```

Review comment on:

```python
include_crossings: if true, include in the x values positions
where the functions indicate by this and p crosss.
"""
assert isinstance(p, PiecewiseLinear)
```

Collaborator: Suggested change:

```diff
-assert isinstance(p, PiecewiseLinear)
+assert isinstance(p, PiecewiseLinear), type(p)
```

Review comment on:

```python
assert isinstance(p, PiecewiseLinear)

# get sorted x-values without repetition.
x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
```

Collaborator: Suggested change:

```diff
-x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
+x_vals = sorted(set([ x for x, _ in self.pairs ] + [ x for x, _ in p.pairs ]))
```

Review comment on:

```python
Example:
    self.dropout = ScheduledFloat((0.0, 0.2), (4000.0, 0.0), default=0.0)

`default` is used when self.batch_count is not set or in training or mode or in
```

Collaborator: Please fix the typo in "in training or mode".


Review comment on:

```python
def __float__(self):
    batch_count = self.batch_count
    if batch_count is None or not self.training or torch.jit.is_scripting():
```

Collaborator: Should it be `not torch.jit.is_scripting()`? Also, should we add `not torch.jit.is_tracing()`?

@yaozengwei (Collaborator, Author): Ok. Thanks.
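If the polarity is kept and only the tracing check is added, the amended condition would presumably read something like this (a sketch, not the committed diff):

```diff
-if batch_count is None or not self.training or torch.jit.is_scripting():
+if batch_count is None or not self.training or torch.jit.is_scripting() or torch.jit.is_tracing():
```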

@yaozengwei merged commit f18b539 into k2-fsa:master on May 19, 2023 (1 of 3 checks passed).
@OswaldoBornemann: May I ask when we could use the latest Zipformer? It seems that I could not find the related code yet.

@csukuangfj (Collaborator):

> May I ask when we could use the latest Zipformer?

Whenever you want.

Please see k2-fsa/sherpa#379

@PPGGG

@OswaldoBornemann:

> > May I ask when we could use the latest Zipformer?
>
> Whenever you want. Please see k2-fsa/sherpa#379

Yeah, that's great. But I would like to find the training recipe to train the latest Zipformer on my own dataset. Should I just use the zipformer folder under librispeech/ASR? Thanks.

@csukuangfj (Collaborator): Yes, you can find the usage in RESULTS.md.

@rookie0607: Hello developers, I have some questions about simulated streaming (time masking): how exactly is it implemented? Is there a paper or code for this? Best wishes. @yaozengwei

@JinZr (Collaborator) commented on Mar 20, 2024 via email.
