Weights before softmax error in the weighted loss function #8
Comments
I'm not sure I understand your argument. The return value of
I think you may not have understood my argument because I made a mistake in my explanation; I should have read the code more closely before posting. Let me explain the problem in more detail (correctly this time ;) and outline the implementation.

Let's imagine you have ten times more pixels with label a than with label b. To balance this out, you want the gradients coming from pixels with label b to count ten times as much as the gradients coming from pixels with label a, so that when everything is added up, learning for both cases happens at the same speed. The gradients are the derivatives of the loss. The math is a bit involved, but I found this explanation of how to compute the derivatives of the last output layer when a softmax activation is followed by a cross-entropy loss. The basic takeaway is that the gradient is

∂L/∂o_i = p_i − y_i

where L is the loss, o_i is the output of the last layer (the logits), y_i is the label (vector), and p_i is the result of the softmax activation function:

p_i = e^{o_i} / Σ_j e^{o_j}

Now if you multiply the loss L by 10, the gradient is also multiplied by 10. But if you multiply the logits o_i by 10, that only influences the gradient through the softmax result p_i. Specifically, the way it is implemented now (and here is where I was wrong above), the values o_i (or o_{b,w,h,i}, to be consistent with TensorFlow dimensions) are multiplied by a weight w_i across the whole feature map, depending only on their class i and irrespective of the label of their pixel.

I sent you an (untested) PR with the implementation. You should multiply the label array by the weights and sum across the last dimension (the one that indexes the classes), so that you get a weight map corresponding to the pixel labels. Then reshape it into a 1D vector and multiply it elementwise with the loss. That way you get larger gradients for pixels whose label has a larger weight, and smaller gradients for pixels whose label has a smaller weight.
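The weight-map construction described above can be sketched in NumPy (the shapes, values, and variable names are illustrative; this is not the code from the PR):

```python
import numpy as np

# Hypothetical one-hot labels: batch=1, h=2, w=2, n_classes=2.
labels = np.array([[[[1, 0], [0, 1]],
                    [[1, 0], [1, 0]]]], dtype=np.float64)
class_weights = np.array([1.0, 10.0])  # count class 1 ten times as much

# Multiply the one-hot labels by the class weights and sum over the class
# axis: each pixel picks up the weight of *its own* label.
weight_map = np.sum(labels * class_weights, axis=-1)  # shape (1, 2, 2)

# Placeholder per-pixel cross-entropy values, standing in for the output of
# tf.nn.softmax_cross_entropy_with_logits.
pixel_loss = np.full((1, 2, 2), 0.5)

# Flatten both, weight elementwise, then average.
weighted_loss = np.mean(weight_map.reshape(-1) * pixel_loss.reshape(-1))
```

The pixel labeled with class 1 now contributes ten times as much to the mean loss, and therefore to the gradients, as each class-0 pixel.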
OK, that's interesting. I need to think about this a little bit. Thanks for the PR, I'll have a closer look.
I had a closer look at your PR and the referenced explanation. I think I understand the concept, but I'm struggling a bit with the implementation.
Hm, that's indeed not good :) As I said, I just coded it down in an editor and didn't have time to test it. I'll see if I have time to look at it again. But if you find an error, either in the concept or in the implementation, I'd of course be grateful.
I just pushed a small extension to the toy problem so that the U-Net has to segment background (85%), circles (12%), and rectangles (2%). Maybe this will help to track down the issue.
Some resources I found while looking at this issue:
Hi nicolov, thanks for also looking into this.
|
Yep, I agree with your analysis. I believe the
I just pushed a new branch to make it easier to add new cost functions. Furthermore, I also included an implementation of the Dice coefficient loss.
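For reference, a minimal NumPy sketch of a Dice coefficient loss (the function name, the `eps` smoothing term, and the example values are illustrative, not taken from the branch):

```python
import numpy as np

def dice_loss(probs, labels, eps=1e-5):
    """Soft Dice loss for flattened predicted probabilities and
    one-hot labels; eps avoids division by zero."""
    intersection = np.sum(probs * labels)
    union = np.sum(probs) + np.sum(labels)
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice

# Four pixels, binary foreground mask.
probs = np.array([0.9, 0.8, 0.1, 0.2])
labels = np.array([1.0, 1.0, 0.0, 0.0])
loss = dice_loss(probs, labels)
```

Because the Dice coefficient is a ratio of overlap to total mass, it is inherently less sensitive to class imbalance than an unweighted cross entropy.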
I merged the branch quite a while ago.
Hi,
in the implementation of the weighted loss function, the weights are applied to the logits before the softmax activation function. The result, for a two-class problem, is that the larger value after the softmax increases and the smaller value decreases. In other words, the network merely appears more confident in its predictions. If the weight was large and the prediction wrong, the gradients will also be larger, though not necessarily by the expected amount. If the prediction was right, however, the gradients will be smaller than they would have been otherwise.
To ensure correct scaling, the weights should be applied after the call to
tf.nn.softmax_cross_entropy_with_logits()
and before the call to tf.reduce_mean()
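As a sketch of the intended ordering, here is a NumPy equivalent (the logits, labels, and weights are placeholders, and the manual cross entropy stands in for tf.nn.softmax_cross_entropy_with_logits):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the class axis.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Three flattened pixels, two classes (illustrative values).
logits = np.array([[2.0, 1.0], [0.5, 1.5], [3.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # one-hot
weight_map = np.array([1.0, 10.0, 1.0])                  # per-pixel weights

# Per-pixel cross entropy, computed from the *unweighted* logits.
cross_entropy = -np.sum(labels * np.log(softmax(logits)), axis=-1)

# Weights applied after the cross entropy and before the mean reduction,
# so the loss (and hence the gradient) of each pixel scales linearly.
loss = np.mean(weight_map * cross_entropy)
```

With this ordering, multiplying a pixel's weight by 10 multiplies that pixel's contribution to the loss, and to ∂L/∂o_i, by exactly 10.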