
Best practices for backstitch #1942

Closed
danpovey opened this issue Oct 17, 2017 · 5 comments

Comments

@danpovey (Contributor)

@freewym, I want to get to a situation where most of our frequently used example scripts have suitable backstitch settings, as it does seem to give a reliable improvement.

I think rather than relying on you to do that, it may be a good idea to just make everyone aware of what the recommended settings are, with guidance for tuning them where applicable, and some idea of how much they are expected to improve the results. Can you please comment on this issue with answers to those questions?

@freewym (Contributor)

freewym commented Oct 17, 2017

To turn on backstitch training, only a few lines need to be added or changed in the shell script:

Pass the following options to steps/nnet3/chain/train.py:
--trainer.optimization.backstitch-training-scale $alpha \
--trainer.optimization.backstitch-training-interval $back_interval \

where a typical setting is:
alpha=0.3
back_interval=1

or, to get a speed-up at the cost of a potentially small degradation (which we observed in our SWBD experiments):
alpha=1.0
back_interval=4

Meanwhile, the value of num-epochs needs to be doubled when doing backstitch training (e.g., if num-epochs=4 with normal SGD training, then num-epochs=8 with backstitch training). If the valid objf has not converged after doubling num-epochs, increase it further until convergence.
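For concreteness, a minimal sketch of how the relevant part of a chain recipe might look with backstitch turned on is below. The directory variables ($train_data_dir, $tree_dir, $lat_dir, $dir) are placeholders for whatever the recipe already defines, and all of the recipe's other options are assumed to stay unchanged; only the backstitch options and the doubled num-epochs are the point here.

alpha=0.3        # backstitch scale; use alpha=1.0 with back_interval=4 for the faster variant
back_interval=1  # apply backstitch on every minibatch
num_epochs=8     # doubled relative to the recipe's normal-SGD value of 4

steps/nnet3/chain/train.py \
  --trainer.num-epochs $num_epochs \
  --trainer.optimization.backstitch-training-scale $alpha \
  --trainer.optimization.backstitch-training-interval $back_interval \
  --feat-dir $train_data_dir \
  --tree-dir $tree_dir \
  --lat-dir $lat_dir \
  --dir $dir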

For TDNN-LSTM recipes of the chain model, backstitch obtains ~10% relative WER improvement on SWBD, AMI-SDM and tedlium. For TDNN-LSTM cross-entropy models, the improvement is smaller (2-4%). For non-recurrent architectures (e.g., TDNN), the improvement may be even smaller.

Note that the recommended settings above apply to our ASR tasks with chain/cross-entropy models. Optimal settings may differ for other tasks such as image classification (e.g., in the CIFAR ResNet recipes, alpha=0.5, back-interval=1, and num-epochs is around 30% larger than in normal SGD training).

@danpovey (Contributor, Author)

danpovey commented Oct 17, 2017 via email

@freewym (Contributor)

freewym commented Oct 17, 2017

Most of the time with the same num-epochs backstitch is worse.
I can try increasing the init-learning-rate.

@stale
stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "stale (Stale bot on the loose)" label on Jun 19, 2020
@stale

stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale bot closed this as completed on Jul 19, 2020
kkm000 removed the "stale (Stale bot on the loose)" label on Jul 19, 2020