
Comparison between k2 CTC loss and PyTorch CTC loss #575

Closed
zhu-han opened this issue Jan 8, 2021 · 57 comments
@zhu-han

zhu-han commented Jan 8, 2021

Has anyone compared the performance of the k2 CTC loss implementation and the CTCLoss in PyTorch?

I wrote a K2CTCLoss based on k2 to replace torch.nn.CTCLoss and ran some experiments in ESPnet. They show a gap between K2CTCLoss and torch.nn.CTCLoss.

The experiments are conducted on LibriSpeech 100h with CTC as the only training criterion. The acoustic model is a BLSTM- or Transformer-based encoder. For the CTC modeling unit, I tried char and bpe 5000. Here are the main conclusions from my experiments:

  • K2CTCLoss works with the BLSTM-based acoustic model, though torch.nn.CTCLoss reduces the loss faster;

  • K2CTCLoss didn't work with the Transformer. When using bpe 5000 as the CTC modeling unit, the loss curve of K2CTCLoss looks like this:
    [figure: k2_ctc_loss curve]
    In comparison, torch.nn.CTCLoss with the Transformer looks like this:
    [figure: torch_ctc_loss curve]

  • The above conclusions hold whether the CTC modeling unit is char or bpe 5000.

  • In snowfall, the CTC implementation is (1) acoustic feature -> phone -> word. I ran an experiment using K2CTCLoss with a (2) acoustic feature -> char structure. The WERs are (1) 12.84% and (2) 15.99%, respectively. So I think the K2CTCLoss implementation itself should be fine.

Could anyone give me some advice on how to make it work better? And does anyone know why it doesn't work well with the Transformer? Thanks!

@danpovey
Collaborator

danpovey commented Jan 8, 2021

That's interesting.
It's possible that it could be a bug in k2, but there are many places it could be.
I checked the documentation for torch.nn.CTCLoss but it is a little vague so it's hard to know whether they are attempting to implement the same thing as us.
One thing you could do which would be helpful to us is to try to evaluate k2's version of the loss given the model trained with PyTorch. If it looks similar to PyTorch's loss, it would likely indicate a bug in computing derivatives.
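A minimal sketch of that check, assuming the K2CTCLoss wrapper linked later in this thread takes the same (log_probs, targets, input_lengths, target_lengths) arguments as torch.nn.CTCLoss (that signature, the constructor arguments, and the toy tensors below are assumptions, not the actual ESPnet code):

```python
import torch

# Hypothetical import: the k2-based drop-in replacement discussed in this thread;
# its constructor/call signature is assumed to mirror torch.nn.CTCLoss.
from espnet.nets.pytorch_backend.ctc_graph import K2CTCLoss

T, N, C, S = 100, 4, 50, 20                       # frames, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(2)   # stand-in for the trained model's output
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

torch_ctc = torch.nn.CTCLoss(blank=0, reduction='sum')
k2_ctc = K2CTCLoss(blank=0, reduction='sum')      # constructor arguments are an assumption

with torch.no_grad():
    print('torch loss:', torch_ctc(log_probs, targets, input_lengths, target_lengths).item())
    print('k2 loss   :', k2_ctc(log_probs, targets, input_lengths, target_lengths).item())
```

If the two forward values agree on a model trained with the PyTorch loss while training with K2CTCLoss still behaves differently, the problem is more likely in the backward pass.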

@danpovey
Collaborator

danpovey commented Jan 8, 2021

Also it would be nice if someone could compute the sum of the derivative (.grad) of our CTC loss and make sure the sum on each frame of each sequence is close to 1.0. [if we can somehow access the .grad w.r.t. the nnet output].
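A self-contained sketch of that per-frame check, shown here with torch.nn.functional.ctc_loss for concreteness (the same few lines apply to the k2-based loss once the .grad of its input tensor is accessible; the shapes and random inputs are placeholders):

```python
import torch
import torch.nn.functional as F

T, N, C, S = 50, 2, 10, 8
nnet_output = torch.randn(T, N, C, requires_grad=True)   # pre-softmax nnet output
log_probs = nnet_output.log_softmax(2)
log_probs.retain_grad()                                  # keep .grad w.r.t. the log-probs

targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction='sum')
loss.backward()

# One value per (frame, sequence): the gradient summed over the class dimension.
per_frame_sum = log_probs.grad.sum(dim=2)   # shape (T, N)
print(per_frame_sum)
```

As the rest of the thread works out, the k2 loss taken directly on log-probabilities gives per-frame sums of magnitude about 1, while PyTorch folds the softmax normalization into its backward and gives sums near 0.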

@zhu-han
Author

zhu-han commented Jan 8, 2021

Using a randomly initialized Transformer model, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.64 | 2765.41 |
| 4 | 2593.41 | 2595.71 |
| 5 | 2360.99 | 2363.25 |
| 6 | 2351.16 | 2346.32 |
| 7 | 3471.35 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.67 | 2955.29 |
| 10 | 2190.49 | 2189.26 |

Using a Transformer model trained with torch.nn.CTCLoss for 10 epochs, the losses for the first 10 iterations computed with K2CTCLoss and torch.nn.CTCLoss are:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.07 | 35.21 |
| 3 | 870.66 | 39.65 |
| 4 | 804.01 | 27.49 |
| 5 | 821.23 | 37.25 |
| 6 | 806.56 | 36.18 |
| 7 | 717.92 | 31.09 |
| 8 | 705.32 | 32.66 |
| 9 | 829.99 | 28.65 |
| 10 | 749.98 | 27.37 |

It seems that we get similar loss values with a randomly initialized model but not with a pretrained model.

@danpovey
Collaborator

danpovey commented Jan 8, 2021

Thanks a lot!! For the transformer model, can you clarify how you were training it? Was it with one of the two CTC losses?

@danpovey
Collaborator

danpovey commented Jan 8, 2021

And in the 2nd table, can you clarify whether you were training with the same loss function you were evaluating?
What I want is for you to train with one loss and also evaluate the objective with the other, to see whether the actual loss calculation is the same (there might be a bug in the derivative computation).

@zhu-han
Author

zhu-han commented Jan 8, 2021

In the 2nd table, the pretrained Transformer model was trained with torch.nn.CTCLoss only. After that, training and loss calculation used the same loss function.

@danpovey
Collaborator

danpovey commented Jan 8, 2021

OK, but what I want is for you to train with the torch loss and evaluate with the k2 CTC loss, with the same model. So the same code will evaluate the 2 objectives.

With the random Transformer model, what are the iterations? That is, what objective are you training with?

@zhu-han
Author

zhu-han commented Jan 8, 2021

Sorry for the misunderstanding. When training with torch.nn.CTCLoss and also evaluating K2CTCLoss, whether with a randomly initialized or a pretrained model, the two loss values are the same.

In the 1st table above (random Transformer model results), the two columns come from training with K2CTCLoss and torch.nn.CTCLoss as the objective, respectively.

@danpovey
Collaborator

danpovey commented Jan 8, 2021

So if you train with the PyTorch loss and evaluate also with the k2 one, you'll get the same value? Because in iteration 1 of your 2nd table, they're very different... if you showed iteration 0, would the k2 one be the same?

@zhu-han
Author

zhu-han commented Jan 8, 2021

I checked the code and found a bug which accidentally made the two loss functions the same.

The real result is: with a randomly initialized model, the two losses are similar. With a pretrained model, the two losses are very different. I will paste the results below.

@zhu-han
Author

zhu-han commented Jan 8, 2021

The training objective is torch.nn.CTCLoss, and the evaluation is performed with both K2CTCLoss and torch.nn.CTCLoss.

  • Using a randomly initialized Transformer model:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 3379.74 | 3385.51 |
| 2 | 2644.70 | 2643.39 |
| 3 | 2760.65 | 2765.41 |
| 4 | 2593.45 | 2595.71 |
| 5 | 2361.09 | 2363.25 |
| 6 | 2351.38 | 2346.32 |
| 7 | 3471.62 | 3478.80 |
| 8 | 2540.69 | 2540.17 |
| 9 | 2953.91 | 2955.29 |
| 10 | 2191.66 | 2189.26 |

  • Using a pretrained Transformer model trained with torch.nn.CTCLoss as the objective for 10 epochs:

| iteration | K2CTCLoss | torch.nn.CTCLoss |
| --- | --- | --- |
| 1 | 862.52 | 35.42 |
| 2 | 782.09 | 35.21 |
| 3 | 870.74 | 39.65 |
| 4 | 804.16 | 27.49 |
| 5 | 821.47 | 37.25 |
| 6 | 806.93 | 36.18 |
| 7 | 718.35 | 31.09 |
| 8 | 705.85 | 32.66 |
| 9 | 830.93 | 28.65 |
| 10 | 750.99 | 27.37 |

@danpovey
Collaborator

danpovey commented Jan 8, 2021

OK. Without seeing the code it will be hard to comment much further or help debug.
Fanjun says he will try to debug the derivatives of the k2 loss over the weekend.

@zhu-han
Author

zhu-han commented Jan 8, 2021

Thanks a lot for your help! If anyone is interested, my K2CTCLoss implementation is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py.

@danpovey
Collaborator

danpovey commented Jan 8, 2021

[re-posting directly, mail is unreliable.]
You are not using 'indices' to sort the FSAs in the graphs.
I'm not sure if our Fsa object has an operator [] that can take a Tensor, but it might.

Basically, your graphs are in the wrong order.
You could also possibly reorder targets and target_lengths before compiling the graph.

@danpovey
Collaborator

danpovey commented Jan 8, 2021 via email

@danpovey
Collaborator

danpovey commented Jan 8, 2021

possibly

decoding_graph = k2.index(decoding_graph, indices)

would work (not sure though)

@zhu-han
Author

zhu-han commented Jan 8, 2021

Thanks for your help! I will change my code accordingly and do the experiments.

@sw005320

sw005320 commented Jan 8, 2021

@zhu-han, thanks for sharing your interesting report.
I will also take a look at this.
We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.

@zhu-han
Author

zhu-han commented Jan 9, 2021

After fixing the graph order issue, K2CTCLoss works with the Transformer now. With bpe 500 as the CTC modeling unit, the loss curve looks like this:
[figure: modified_k2_ctc_loss curve]

And the previous results make sense now. Before batching, the training samples are sorted by input length, so with a smaller batch size all samples in a batch are more likely to have the same length. When the lengths are the same, the sorted text happens to match the order of the unsorted graphs. In my experiments, the BLSTM uses a smaller batch size than the Transformer (20 vs 256), so the BLSTM suffers less from this bug than the Transformer. That's why the BLSTM could work and the Transformer could not in the previous results.

Thanks a lot!

@zhu-han
Author

zhu-han commented Jan 9, 2021

@sw005320 My revised K2CTCLoss is in https://github.com/zhu-han/espnet-k2/blob/main/espnet/nets/pytorch_backend/ctc_graph.py. I will be glad to help on this.

@csukuangfj
Collaborator

I just added a gradient test for the k2 CTC loss. Please see #577

It shows that k2 CTC loss is identical to PyTorch CTC loss and warp-ctc when they are given the same input.

The gradients of k2 and PyTorch are also the same.

@zhu-han
Author

zhu-han commented Jan 10, 2021

Thanks! But since I found that models trained with the k2 CTC loss and the PyTorch CTC loss did show some differences, I added additional test cases based on test_random_case1 in ctc_gradients_test.py to check it. Here are some results:

  • When I run this test case directly, it passes;
  • When I change the parameters T and C to match my experiment's setup, i.e., T = 400 (a 16 s training sample with a 4× subsampling factor) and C = 5000 (bpe 5000 as the CTC modeling unit), the test case fails. Specifically, the gradient check assert torch.allclose(torch_activation.grad, k2_activation.grad, atol=1e-2) fails.
  • When I keep T as in the original and only change C to 5000, the gradient check passes. But when I keep C and change the sample length T to 400, the gradient check fails.

It seems that with longer samples, the difference is larger.

@zhu-han
Author

zhu-han commented Jan 10, 2021

And these are the results I got on LibriSpeech 100h using the PyTorch CTC loss and the k2 CTC loss:

  • PyTorch CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.1 | 35.9 |
| Hybrid CTC/Attention | 10.3 | 27.1 |

  • k2 CTC loss:

| Criterion | Test clean | Test other |
| --- | --- | --- |
| CTC | 17.3 | 36.4 |
| Hybrid CTC/Attention | 10.6 | 27.5 |

Detailed setup:

  • k2 CTC loss: in k2.intersect_dense(), output_beam = 10.0.
  • Training: for both criteria, SpecAugment is not used. For CTC, epochs = 30 and batch size = 256. For hybrid CTC/Attention, epochs = 80 and CTC weight = 0.3.
  • Decoding: for both criteria, the best 5 models based on validation performance are averaged to get the final model, beam size = 10, and no language model is used. For hybrid CTC/Attention, CTC weight = 0.4.
  • Model: for hybrid CTC/Attention, a Transformer with 12 encoder layers and 6 decoder layers; attention heads = 4, attention dimension = 256, feed-forward dimension = 2048. For CTC, the same encoder structure is used.

@danpovey
Collaborator

Cool!
Regarding the gradient-check: sometimes there can be roundoff error that causes the posteriors on some frames to sum to a number different than 1. Can you compute those sums? I.e. the sum of the grad, per frame...

@zhu-han
Author

zhu-han commented Jan 10, 2021

Given the same input, the PyTorch CTC gradient sum per frame is:
[ 0.0000e+00, 2.3842e-07, -3.5763e-07, -2.3842e-07, -3.5763e-07,...]
and the k2 CTC gradient sum per frame is:
[-1.1921e-06, -2.3842e-07, 1.0729e-06, 8.3447e-07, 4.7684e-07,...]

@danpovey
Collaborator

danpovey commented Jan 10, 2021 via email

@zhu-han
Author

zhu-han commented Jan 10, 2021

Those were already the results after the softmax.
For example, the torch gradient for one frame is:

[ -9.4860,   2.4738,   5.9179,   4.7736,   5.5900,   2.8961,   6.4206,
            4.4688,   2.8942,   4.0882, -74.9657,   4.3691,   5.7488,   6.3485,
            6.4876,   2.9647,   3.2492,   4.7775,   3.5132,   2.7532,   4.7165]

Its sum is 5.2452e-06.

k2 gradient of this same frame is:

[ -9.4859,   2.4738,   5.9179,   4.7736,   5.5900,   2.8961,   6.4206,
            4.4688,   2.8942,   4.0882, -74.9657,   4.3691,   5.7488,   6.3485,
            6.4876,   2.9647,   3.2492,   4.7775,   3.5132,   2.7532,   4.7165]

And its sum is -8.5831e-06.

The two gradients differ in only one value: -9.4860 vs -9.4859 in the first dimension.

@danpovey
Collaborator

danpovey commented Jan 10, 2021 via email

@zhu-han
Author

zhu-han commented Jan 10, 2021

Oh, I misunderstood that. I thought you meant the loss was computed prior to the softmax. I will update the results.

@zhu-han
Author

zhu-han commented Jan 10, 2021

When I set the learning rate to 1 and use the k2 CTC loss, the gradient sum per frame of the tensor after log_softmax is -1. I'm not sure whether that is what you want to check.

@danpovey
Collaborator

Yes that sounds right. See if the same is true of PyTorch's one; the error could be there.

@zhu-han
Author

zhu-han commented Jan 10, 2021

For PyTorch, these values are close to 0, i.e., [-4.7088e-6, -4.6492e-6, ...]

@danpovey
Collaborator

Ah, I guess it does the normalization internally.
It's unlikely, IMO, that there is a roundoff problem in k2, given what you say. More likely it is in PyTorch itself, and the WER differences are most likely tuning-dependent.

@csukuangfj
Collaborator

csukuangfj commented Jan 10, 2021

For the simplest case,

#                          blk   a    b    c    d
activation = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2], requires_grad=True)
log_probs = activation.log_softmax(-1)
log_probs.retain_grad()

And if the target label is a,

  • for k2, log_probs.grad is [0, -1, 0, 0, 0]. log_probs.grad.sum() is -1
  • for PyTorch, log_probs.grad is [0.2, -0.8, 0.2, 0.2, 0.2]. log_probs.grad.sum() is 0
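A runnable version of the PyTorch side of this toy case (the k2 numbers above are taken from the comment; the shapes below assume T = 1, a batch of 1, blank = 0, and a single target label a = 1):

```python
import torch
import torch.nn.functional as F

#                            blk   a    b    c    d
activation = torch.tensor([[[0.2, 0.2, 0.2, 0.2, 0.2]]], requires_grad=True)  # (T=1, N=1, C=5)
log_probs = activation.log_softmax(2)
log_probs.retain_grad()

targets = torch.tensor([1])            # the single label "a"
input_lengths = torch.tensor([1])
target_lengths = torch.tensor([1])

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction='sum')
loss.backward()

print(log_probs.grad)   # tensor([[[ 0.2, -0.8, 0.2, 0.2, 0.2]]]), per-frame sum 0
```

This reproduces the [0.2, -0.8, 0.2, 0.2, 0.2] reported above; the corresponding k2 gradient with respect to the same log-probs is the pure occupation term [0, -1, 0, 0, 0].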

@danpovey
Collaborator

PyTorch is obviously doing the log-softmax normalization as part of the CTC computation; in k2 those things are separate.

@danpovey
Collaborator

Do we know of any difference in speed?

@csukuangfj
Collaborator

> We (@brianyan918) are also working on comparing pytorch CTC, warpCTC, k2 CTC, and gtn CTC.

@sw005320 Could you share the progress with us? Does the comparison include speed differences?

@brianyan918

I tested these different CTC modes in ESPnet with these results on the VoxForge Italian eval set:

| Model | CER | WER |
| --- | --- | --- |
| Conformer (warpctc) | 8.5 | 30.0 |
| Conformer (pytorch) | 8.6 | 30.6 |
| Conformer (gtnctc) | 8.5 | 30.0 |
| Conformer (k2) | 8.7 | 30.8 |

Previously I was able to compare the speeds of pytorch vs warp vs gtn, but for k2 I used a different device. I'll provide an update with speed comparisons shortly.

@zhu-han
Author

zhu-han commented Jan 14, 2021

When training on LibriSpeech 100h for one epoch, the results are:

| Method | Time |
| --- | --- |
| PyTorch | 15.69 min |
| k2 | 17.78 min |

@danpovey
Collaborator

danpovey commented Jan 14, 2021 via email

@zhu-han
Author

zhu-han commented Jan 14, 2021

I followed https://k2.readthedocs.io/en/latest/installation.html#install-k2-from-source to install k2. Is this in release mode by default?

@csukuangfj
Collaborator

cmake -DCMAKE_BUILD_TYPE=Release ..

If you followed it step by step, then it is a Release build.

@zhu-han
Author

zhu-han commented Jan 14, 2021

Yes, it is in release mode then.

@csukuangfj
Collaborator

python3 -m k2.version

should tell you whether k2 was built in Release mode or in Debug mode.

@zhu-han
Author

zhu-han commented Jan 14, 2021

It shows Build type: Release.

@danpovey
Collaborator

danpovey commented Jan 14, 2021 via email

@zhu-han
Author

zhu-han commented Jan 14, 2021

Pulled on 2021/01/06.

@danpovey
Collaborator

danpovey commented Jan 14, 2021 via email

@csukuangfj
Collaborator

This pull request, #571 (comment), merged on Jan 8, made GetTransposeReordering 2-3x faster than before. I'm not sure how it would affect the training speed.

@zhu-han
Author

zhu-han commented Jan 14, 2021

I tried with the latest k2, and the training time is similar. The previous training time was 17.78 min and the latest one is 17.68 min.

@csukuangfj
Collaborator

Which version of CUDA Toolkit are you using? The change is enabled only for NVCC version > 10.1.105.

@zhu-han
Author

zhu-han commented Jan 14, 2021

I'm using CUDA 10.1, NVCC version 10.1.243

@danpovey
Collaborator

danpovey commented Jan 14, 2021 via email

@yaguanghu
Contributor

> I just added a gradient test for the k2 CTC loss. Please see #577
>
> It shows that k2 CTC loss is identical to PyTorch CTC loss and warp-ctc when they are given the same input.
>
> The gradients of k2 and PyTorch are also the same.

In test_case3, when I change the input to the torch_activation after the softmax and remove the softmax call below, the gradient of k2 does not seem identical to the PyTorch built-in CTC loss, e.g.
torch_activation = torch.tensor([[ [-5, -4, -3, -2, -1], [-10, -9, -8, -7, -6], [-15, -14, -13, -12, -11.], ]]).permute(1, 0, 2).detach().log_softmax(2).requires_grad_(True)
It seems strange; what might be the reason?

@danpovey
Collaborator

danpovey commented Feb 4, 2021

How different are they?
I'm not convinced that what we implemented is 100% the same as the standard CTC loss.
We may not be treating repeats of the same symbol quite the same way, e.g. "aa" at the nnet output
could represent either the single symbol "a" or "a" followed by "a".

@yaguanghu
Contributor

yaguanghu commented Feb 4, 2021

Repeated symbols are already handled by the current CTC topology just as the standard CTC loss does, and I don't think there's a difference between them.
I'm just curious why log_softmax affects the gradient.

@danpovey
Collaborator

danpovey commented Feb 5, 2021

It is definitely expected to affect the gradient. In our implementation we don't do that as part of the FSA computation; it is a separate component, so our CTC loss needs the log-softmax as its input.
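A small sketch of that point, reusing the toy case from earlier in the thread: back-propagating the k2-style gradient with respect to the log-probs ([0, -1, 0, 0, 0], i.e. minus the occupation probabilities) through an explicit log_softmax reproduces the gradient that PyTorch's CTC loss produces with its built-in normalization:

```python
import torch

activation = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2], requires_grad=True)
log_probs = activation.log_softmax(0)

# Gradient of a CTC loss taken directly on log-probs, for the toy case above:
# minus the occupation probabilities; its sum per frame is -1.
grad_wrt_log_probs = torch.tensor([0., -1., 0., 0., 0.])

# Chain it through the explicit log_softmax component.
log_probs.backward(grad_wrt_log_probs)
print(activation.grad)   # tensor([ 0.2, -0.8, 0.2, 0.2, 0.2]) -> sums to 0
```

Dropping the explicit log_softmax in front of the k2 loss, as in the modified test_case3 above, removes exactly this normalization term from the gradient, which is why the gradients no longer match PyTorch's.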

zhu-han closed this as completed Mar 8, 2021