Ref: [1] Dong, B., Hou, J., Lu, Y., & Zhang, Z. (2019). Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network. 1–22. Retrieved from

[2] Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 8571–8580.

[3] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., & Ma, Z. (2019). Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. Retrieved from

I'm trying to analyze, in the frequency domain, why distillation > early stopping. All the code is based on PyTorch.

How to use?

My code includes the following (you can run everything in this order):

  1. . We use CIFAR-10 as the dataset. This script generates a new dataset, denoted $\mathcal{D}$, in which 40% of the labels are wrong. You can modify it yourself.
  2. . Train on the dataset $\mathcal{D}$ generated in step 1. (We changed the loss to the one described in the paper [1].) Use python --dis=1 to train with that loss; to train without distillation (shake-shake only), turn the --dis option off. is the authors' original code.
  3. . Collect all the training histories into one file.
  4. . Plot the loss, the accuracy, and the frequency-domain analysis.
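
The label corruption in step 1 can be sketched as follows. This is only an illustration that uses the Python standard library, not the repository's script: the function name `corrupt_labels`, the seed, and the choice of symmetric noise (each corrupted label is replaced by a uniformly drawn different class) are assumptions.

```python
import random


def corrupt_labels(labels, num_classes=10, noise_rate=0.4, seed=0):
    """Return a noisy copy of `labels` and the indices that were changed.

    Hypothetical helper: a `noise_rate` fraction of the samples gets a
    label drawn uniformly from the classes other than the true one.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(noise_rate * len(noisy)))
    # Choose the samples to corrupt without replacement.
    flip_idx = rng.sample(range(len(noisy)), n_flip)
    for i in flip_idx:
        # Exclude the true class so every selected label is really wrong.
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy, sorted(flip_idx)


# Toy stand-in for the CIFAR-10 training labels (10 classes).
clean = [i % 10 for i in range(1000)]
noisy, flipped = corrupt_labels(clean)
```

Excluding the true class keeps the fraction of wrong labels at exactly the requested rate. If labels were instead redrawn from all ten classes, about one tenth of the selected samples would keep their correct label.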

Main idea of paper[1].

They propose a new distillation-based algorithm that aims to solve the problem of learning from a noisy dataset. In general, the algorithm learns from targets that combine the network's predictions with the original labels. In the paper, they use the NTK [2] to analyze why distillation > early stopping.
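
To make the idea of combining predictions with labels concrete, here is a minimal sketch. It assumes a plain convex combination with a fixed weight `alpha`, which is a hypothetical simplification and not necessarily the exact update rule of [1]; the function names are illustrative as well.

```python
import math


def mix_targets(one_hot, probs, alpha=0.5):
    """Blend a (possibly noisy) one-hot label with the network's prediction.

    `alpha` is an assumed mixing weight: 1.0 keeps the given label,
    0.0 trusts the network's prediction completely.
    """
    if len(one_hot) != len(probs):
        raise ValueError("label and prediction must have the same length")
    return [alpha * y + (1.0 - alpha) * p for y, p in zip(one_hot, probs)]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


noisy_label = [0.0, 1.0, 0.0]   # given label, possibly wrong
prediction = [0.7, 0.2, 0.1]    # softmax output of the network
target = mix_targets(noisy_label, prediction, alpha=0.5)
loss = cross_entropy(target, prediction)
```

Because both inputs sum to one, the mixed target is still a probability distribution, so it can be used directly as a soft label in a cross-entropy loss.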

Our work

We want to analyze this phenomenon in the frequency domain [3]. Paper [3] proposes that DNNs have the property that low-frequency components converge first. Denote by $X$ the test dataset, by $Y$ the original test labels, and by $\mathcal{N}$ the neural network we trained. We use Gaussian smoothing to separate the low-frequency components from the original labels $Y$, and denote them by $Y_{low}$: $y_{i}^{low, \delta}=\frac{1}{C_{i}} \sum_{j=0}^{n-1} y_{j} G^{\delta}\left(x_{i}-x_{j}\right)$, where $G^{\delta}$ is a Gaussian kernel of width $\delta$ and $C_{i}$ is a normalization factor. For each $y\in Y$, the high-frequency part is $y_{high}=y-y_{low}$.
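
The smoothing formula can be sketched in plain Python as below. The kernel form $G^{\delta}(x)=\exp(-|x|^2/(2\delta))$ and the normalization $C_i=\sum_j G^{\delta}(x_i-x_j)$ are assumptions made for this illustration, and the function names are hypothetical; the repository's own implementation may differ.

```python
import math


def gaussian_kernel(x_i, x_j, delta):
    """Assumed kernel: G^delta(x_i - x_j) = exp(-|x_i - x_j|^2 / (2 * delta))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * delta))


def low_frequency_part(xs, ys, delta):
    """Kernel-weighted average of the labels around every input x_i."""
    dim = len(ys[0])
    y_low = []
    for x_i in xs:
        weights = [gaussian_kernel(x_i, x_j, delta) for x_j in xs]
        # Assumed normalization C_i; it is at least 1 because the
        # kernel of x_i with itself equals 1.
        c_i = sum(weights)
        y_low.append(
            [sum(w * y[k] for w, y in zip(weights, ys)) / c_i for k in range(dim)]
        )
    return y_low


# Tiny example: three 1-D inputs with one-hot labels over two classes.
xs = [[0.0], [0.1], [5.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
y_sharp = low_frequency_part(xs, ys, delta=1e-6)   # almost no smoothing
y_smooth = low_frequency_part(xs, ys, delta=1e6)   # close to the global mean
```

A small `delta` leaves the labels nearly unchanged, while a large `delta` pulls every $y_i^{low}$ toward the average label. This double loop costs $O(n^2)$ kernel evaluations, so on the full CIFAR-10 test set a vectorized PyTorch version would be preferable.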

We calculate $\Delta$ as: $\Delta=\frac{|y_i^{low}|}{|y_i^{low}|+|y_i^{high}|}$
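
A small sketch of this ratio for a single sample follows. It assumes that $|\cdot|$ is the Euclidean norm over the class dimension and returns 0 when both parts vanish; both choices, like the function names, are assumptions for illustration.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def low_frequency_ratio(y, y_low):
    """|y_low| / (|y_low| + |y_high|) with y_high = y - y_low."""
    y_high = [a - b for a, b in zip(y, y_low)]
    low = l2_norm(y_low)
    high = l2_norm(y_high)
    total = low + high
    # Assumed convention: an all-zero signal gets ratio 0.
    return low / total if total > 0 else 0.0


# One-hot label and a hypothetical smoothed version of it.
ratio = low_frequency_ratio([1.0, 0.0], [0.6, 0.4])
```

The ratio lies in $[0, 1]$: it equals 1 when the signal has no high-frequency part and 0 when the low-frequency part vanishes.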

Main result

We can observe that self-distillation > early stopping, but we cannot observe much difference in the frequency domain. In these figures, dis_ means the result was obtained with distillation, val_ means it was measured on the test dataset, and $\Delta$ means the difference between the results with and without distillation.

Figures: accuracy, frequency, loss.




