This repository has been archived by the owner on Sep 10, 2022. It is now read-only.

Problem about Cifar10 Experiments Reproduction #1

Open
cwfxcz opened this issue Mar 23, 2018 · 2 comments

Comments

@cwfxcz

cwfxcz commented Mar 23, 2018

Hi there,
Thanks for your great work on the Shampoo implementation in PyTorch. I'm trying to reproduce the CIFAR-10 results in the Shampoo paper, but I get much lower test accuracy. I have tried changing the learning rate from 0.01 to 10 (as the paper suggests), but I still get only about 85% accuracy. Here are my experimental results:

  • Network: ResNet32 on CIFAR-10
  • --momentum 0.9
  • --epsilon 1e-4
  • --batchSize 128

Results after 250 epochs:

| lr  | Training loss | Training acc | Testing loss | Testing acc |
|-----|---------------|--------------|--------------|-------------|
| 0.1 | 0.65          | 77.03%       | 0.68         | 76.39%      |
| 1   | 0.25          | 91.33%       | 0.57         | 84.04%      |
| 2   | 0.23          | 91.87%       | 0.72         | 82.02%      |
| 5   | 0.22          | 92.33%       | 0.75         | 82.04%      |

When training for 500 epochs with each of the learning rates above, the testing accuracy remains almost the same and still does not reach even 90%.

Any ideas or suggestions about this problem? Thanks for your time.

@moskomule
Owner

Thank you for your comprehensive experiments. Indeed, I also cannot reproduce the reported results with my implementation, even when using the average of gradients.
I'm still investigating the reason. If you find something, please let me know.

@cwfxcz
Author

cwfxcz commented Mar 26, 2018

Hi, I have a question about the Algorithm 2 code.
In the Shampoo paper, each dimension uses the original gradient (image: selection_132) to calculate the contraction (image: selection_131).

But in the code, the gradient is updated for each dimension, and the updated gradient is then used to calculate the contraction for the next dimension. Am I misunderstanding the code, or Algorithm 2 in the paper?
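
To make the two readings concrete, here is a minimal NumPy sketch (not the repository's code). It assumes Algorithm 2 accumulates each dimension's statistics from the original gradient while the preconditioned gradient is updated dimension by dimension. The function names `inv_root`, `mode_gram`, `mode_product`, and `precondition`, and the `use_original_grad` flag, are hypothetical and introduced only for illustration.

```python
import numpy as np


def inv_root(mat, power):
    """Return mat^(-1/power) for a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** (-1.0 / power)) @ eigvecs.T


def mode_gram(tensor, dim):
    """Contract `tensor` with itself over every axis except `dim`."""
    unfolded = np.moveaxis(tensor, dim, 0).reshape(tensor.shape[dim], -1)
    return unfolded @ unfolded.T


def mode_product(tensor, matrix, dim):
    """Multiply `matrix` into axis `dim` of `tensor`."""
    moved = np.moveaxis(tensor, dim, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, dim)


def precondition(grad, stats, use_original_grad):
    """One preconditioning pass over all dimensions of `grad`.

    use_original_grad=True:  statistics come from the untouched gradient
                             (my reading of Algorithm 2, an assumption).
    use_original_grad=False: statistics come from the partially
                             preconditioned gradient (the behaviour
                             described in this comment).
    """
    order = grad.ndim
    precond = grad.copy()
    new_stats = []
    for dim in range(order):
        source = grad if use_original_grad else precond
        h = stats[dim] + mode_gram(source, dim)
        new_stats.append(h)
        # The preconditioned gradient is updated in place in both variants.
        precond = mode_product(precond, inv_root(h, 2 * order), dim)
    return precond, new_stats
```

With `stats` initialised to `epsilon * I` for each dimension, the two variants agree on the first dimension's statistics and diverge from the second dimension onward, so the resulting updates differ.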
