
self-distillation


References:

[1] Dong, B., Hou, J., Lu, Y., & Zhang, Z. (2019). Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval for Overparameterized Neural Network. arXiv:1910.01255. http://arxiv.org/abs/1910.01255

[2] Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in Neural Information Processing Systems 31, 8571–8580.

[3] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., & Ma, Z. (2019). Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. arXiv:1901.06523. http://arxiv.org/abs/1901.06523

I'm trying to analyze why distillation > early stopping in the frequency domain. All the code is based on PyTorch.

How to use?

The code includes the following scripts (you can run them in this order):

  1. generate.py. We choose CIFAR-10 as the dataset. This script generates a new dataset in which 40% of the labels are wrong; we denote it $\mathcal{D}$. You can modify the noise ratio yourself (a minimal sketch of this label-corruption step is shown after this list).
  2. train.py. Then you can train on the dataset $\mathcal{D}$ generated in step 1. (We change the loss into the one the article mentions.) Use python train.py --dis=1 to run with the distillation loss described in the paper, or python train.py --dis=0 to run without distillation (shake-shake only). cross_entropy.py is the authors' original code.
  3. newtxt.py. Collect all the training histories into one file.
  4. plthis.py. Plot the loss, accuracy, and the frequency-domain analysis.
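
As a reference, here is a minimal sketch of the kind of label corruption generate.py performs; the function name, seed handling, and the way the corrupted targets are written back are assumptions, not the actual implementation:

```python
import numpy as np
import torchvision

def corrupt_labels(labels, noise_ratio=0.4, num_classes=10, seed=0):
    """Randomly replace a fraction of the labels with a different class."""
    rng = np.random.RandomState(seed)
    labels = np.array(labels)
    n = len(labels)
    noisy_idx = rng.choice(n, size=int(noise_ratio * n), replace=False)
    for i in noisy_idx:
        wrong = rng.randint(num_classes - 1)          # draw uniformly from the other classes
        labels[i] = wrong if wrong < labels[i] else wrong + 1
    return labels

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
# The corrupted training set plays the role of D in the text.
train_set.targets = corrupt_labels(train_set.targets).tolist()
```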

Main idea of paper [1]

They propose a new distillation-based algorithm aimed at learning from datasets with noisy labels. In short, the network is trained on targets that combine its own predictions with the original labels. They use the NTK [2] to analyze why distillation > early stopping.
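
As an illustration of this target-mixing idea (not the exact loss from [1]; the mixing weight alpha and temperature T below are hypothetical hyperparameters), one could write:

```python
import torch
import torch.nn.functional as F

def distillation_targets(logits, labels, num_classes=10, alpha=0.5, T=2.0):
    """Mix the network's temperature-softened prediction with the (possibly wrong) one-hot label."""
    soft_pred = F.softmax(logits.detach() / T, dim=1)   # the network's "dark knowledge"
    one_hot = F.one_hot(labels, num_classes).float()    # the original label
    return alpha * soft_pred + (1.0 - alpha) * one_hot

def soft_cross_entropy(logits, targets):
    """Cross-entropy between the network output and the mixed soft target."""
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```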

Our work

We want to analyze this phenomenon in the frequency domain [3]. Paper [3] shows that DNNs tend to fit low-frequency components first (the Frequency Principle). Denote by $X$ the test dataset, by $Y$ the original test labels, and by $\mathcal{N}$ the neural network we train. We use Gaussian smoothing to separate the low-frequency component of the original labels $Y$, denoted $Y_{low}$; for each $y \in Y$, $y_{high} = y - y_{low}$, where $y_{i}^{low, \delta} = \frac{1}{C_{i}} \sum_{j=0}^{n-1} y_{j} G^{\delta}\left(x_{i}-x_{j}\right)$ (here $C_{i}$ is a normalization constant and $G^{\delta}$ a Gaussian kernel of width $\delta$).

We calculate the low-frequency ratio as: $\Delta=\frac{|y_i^{low}|}{|y_i^{low}|+|y_i^{high}|}$
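
A minimal sketch of this decomposition in PyTorch, assuming flattened test inputs, one-hot (or soft) labels, Euclidean distances inside the Gaussian kernel, and $C_i = \sum_j G^{\delta}(x_i - x_j)$ as the normalization (these choices are assumptions, not necessarily what plthis.py does):

```python
import torch

def low_high_split(X, Y, delta=1.0):
    """Split test labels into low/high-frequency parts via Gaussian smoothing.

    X: (n, d) flattened test inputs; Y: (n, k) one-hot (or soft) test labels.
    Implements y_i^{low,delta} = (1 / C_i) * sum_j y_j G^delta(x_i - x_j),
    with C_i = sum_j G^delta(x_i - x_j) assumed as the normalization.
    """
    dist2 = torch.cdist(X, X).pow(2)               # pairwise squared Euclidean distances
    G = torch.exp(-dist2 / (2.0 * delta ** 2))     # Gaussian kernel G^delta
    C = G.sum(dim=1, keepdim=True)                 # normalization constants C_i
    Y_low = G @ Y / C
    Y_high = Y - Y_low
    return Y_low, Y_high

def low_freq_ratio(Y_low, Y_high):
    """Delta = |y^low| / (|y^low| + |y^high|), averaged over the test set here."""
    low = Y_low.norm(dim=1)
    high = Y_high.norm(dim=1)
    return (low / (low + high)).mean()
```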

Main result

We can observe that self-distillation > early stopping, but we cannot observe much difference in the frequency domain. In these figures, dis_ means the result uses distillation, val_ means the result is on the test dataset, and $\Delta$ means the difference between the results with and without distillation.

Figures: accuracy, frequency, loss.
