In your experimental results, there are many teacher–student pairs.
In particular, for the KD method (Distilling the Knowledge in a Neural Network), the optimal setting (i.e., the temperature) may differ for each pair.
Does performance change much with the temperature?
A similar problem may affect not only KD but also the other methods. What do you think about this?
For all methods (including KD), I only tuned hyper-parameters on one of the pairs. After that, I kept those parameters fixed and evaluated on the other pairs.
T=4 is what I found optimal, and it is also consistent with previous works. I think you are right that the optimum might differ between pairs. On the other hand, the point of this benchmark is to test the generalization ability of different methods, i.e., whether you can use the same hyper-parameters on different models and still get good performance.
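For context, a minimal sketch of the KD loss being discussed, with the temperature parameter exposed. This is plain Python for illustration, not the benchmark's actual (likely PyTorch) implementation; the function names and the batch-free, single-sample shape are my own simplifications. The `T * T` scaling follows the original KD paper's convention of keeping soft-target gradient magnitudes comparable across temperatures:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: larger T gives softer probabilities."""
    m = max(z / T for z in logits)                     # subtract max for stability
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

With identical student and teacher logits the loss is zero regardless of T; sweeping T here (e.g. 1, 4, 16) on real logits is a quick way to see how much the soft targets, and hence the loss surface, change with temperature.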
Thank you for sharing the benchmark.