
self-distillation


References:

[1] Dong, B., Hou, J., Lu, Y., & Zhang, Z. (2019). Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval for Overparameterized Neural Network. arXiv:1910.01255. http://arxiv.org/abs/1910.01255

[2] Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in Neural Information Processing Systems 31, 8571–8580.

[3] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., & Ma, Z. (2019). Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. arXiv:1901.06523. http://arxiv.org/abs/1901.06523

I'm trying to analyze why distillation > early stopping in the frequency domain. All the code is based on PyTorch.

How to use?

The code includes the following scripts (you can run them in this order):

  1. generate.py. We choose CIFAR-10 as the dataset. This script generates a new dataset in which 40% of the labels are wrong; we denote it $\mathcal{D}$. You can modify the noise ratio yourself (a minimal sketch of this label-corruption step is shown after this list).
  2. train.py. Then you can train on the dataset $\mathcal{D}$ generated in step 1. (We change the loss into the one the article mentions.) Use python train.py --dis=1 to run with the distillation loss described in the paper, or python train.py --dis=0 to run without distillation (shake-shake only). cross_entropy.py is the authors' original code.
  3. newtxt.py. Collect all the training histories into one file.
  4. plthis.py. Plot the loss, accuracy, and the frequency-domain analysis.
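
As a reference, here is a minimal sketch of the kind of label corruption generate.py performs; the function name, seed handling, and the way the corrupted targets are written back are assumptions, not the actual implementation:

```python
import numpy as np
import torchvision

def corrupt_labels(labels, noise_ratio=0.4, num_classes=10, seed=0):
    """Randomly replace a fraction of the labels with a different class."""
    rng = np.random.RandomState(seed)
    labels = np.array(labels)
    n = len(labels)
    noisy_idx = rng.choice(n, size=int(noise_ratio * n), replace=False)
    for i in noisy_idx:
        wrong = rng.randint(num_classes - 1)          # draw uniformly from the other classes
        labels[i] = wrong if wrong < labels[i] else wrong + 1
    return labels

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
# The corrupted training set plays the role of D in the text.
train_set.targets = corrupt_labels(train_set.targets).tolist()
```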

Main idea of paper [1]

They propose a new distillation-based algorithm aimed at learning from datasets with noisy labels. In short, the network is trained on targets that combine its own predictions with the original labels. They use the NTK [2] to analyze why distillation > early stopping.
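
As an illustration of this target-mixing idea (not the exact loss from [1]; the mixing weight alpha and temperature T below are hypothetical hyperparameters), one could write:

```python
import torch
import torch.nn.functional as F

def distillation_targets(logits, labels, num_classes=10, alpha=0.5, T=2.0):
    """Mix the network's temperature-softened prediction with the (possibly wrong) one-hot label."""
    soft_pred = F.softmax(logits.detach() / T, dim=1)   # the network's "dark knowledge"
    one_hot = F.one_hot(labels, num_classes).float()    # the original label
    return alpha * soft_pred + (1.0 - alpha) * one_hot

def soft_cross_entropy(logits, targets):
    """Cross-entropy between the network output and the mixed soft target."""
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```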

Our work

We want to analyze this phenomenon in the frequency domain [3]. Paper [3] shows that DNNs tend to fit low-frequency components first (the Frequency Principle). Denote by $X$ the test dataset, by $Y$ the original test labels, and by $\mathcal{N}$ the neural network we train. We use Gaussian smoothing to separate the low-frequency component of the original labels $Y$, denoted $Y_{low}$; for each $y \in Y$, $y_{high} = y - y_{low}$, where $y_{i}^{low, \delta} = \frac{1}{C_{i}} \sum_{j=0}^{n-1} y_{j} G^{\delta}\left(x_{i}-x_{j}\right)$ (here $C_{i}$ is a normalization constant and $G^{\delta}$ a Gaussian kernel of width $\delta$).

We calculate the low-frequency ratio as: $\Delta=\frac{|y_i^{low}|}{|y_i^{low}|+|y_i^{high}|}$
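
A minimal sketch of this decomposition in PyTorch, assuming flattened test inputs, one-hot (or soft) labels, Euclidean distances inside the Gaussian kernel, and $C_i = \sum_j G^{\delta}(x_i - x_j)$ as the normalization (these choices are assumptions, not necessarily what plthis.py does):

```python
import torch

def low_high_split(X, Y, delta=1.0):
    """Split test labels into low/high-frequency parts via Gaussian smoothing.

    X: (n, d) flattened test inputs; Y: (n, k) one-hot (or soft) test labels.
    Implements y_i^{low,delta} = (1 / C_i) * sum_j y_j G^delta(x_i - x_j),
    with C_i = sum_j G^delta(x_i - x_j) assumed as the normalization.
    """
    dist2 = torch.cdist(X, X).pow(2)               # pairwise squared Euclidean distances
    G = torch.exp(-dist2 / (2.0 * delta ** 2))     # Gaussian kernel G^delta
    C = G.sum(dim=1, keepdim=True)                 # normalization constants C_i
    Y_low = G @ Y / C
    Y_high = Y - Y_low
    return Y_low, Y_high

def low_freq_ratio(Y_low, Y_high):
    """Delta = |y^low| / (|y^low| + |y^high|), averaged over the test set here."""
    low = Y_low.norm(dim=1)
    high = Y_high.norm(dim=1)
    return (low / (low + high)).mean()
```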

Main result

We can observe that self-distillation > early stopping, but we cannot observe much difference in the frequency domain. In these figures, dis_ means the result uses distillation, val_ means the result is on the test dataset, and $\Delta$ means the difference between the results with and without distillation.

Figures: accuracy, frequency, loss.
