[Update] Support SVD method to calculate M^{-1/p} #103

Merged
kozistr merged 47 commits into main from update/shampoo-optimizer on Feb 5, 2023

kozistr
Owner

@kozistr kozistr commented Feb 4, 2023

Problem (Why?)

In theory, the Schur-Newton method computes M^{-1/p} faster than the SVD method. In practice, overheads such as the iteration loop mean that SVD is much faster in some cases, so this PR supports the SVD method as well. (related to #100)
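
For reference, the identity behind the SVD route can be written out as below. This is a sketch that assumes M is a symmetric positive definite pre-conditioner statistic, so that its SVD and its eigendecomposition coincide; the symbols U and \sigma_i are introduced here for illustration and do not appear in the PR.

```latex
M = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_n)\, U^{\top}
\quad\Longrightarrow\quad
M^{-1/p} = U \,\mathrm{diag}\!\left(\sigma_1^{-1/p}, \dots, \sigma_n^{-1/p}\right) U^{\top}
```

Schur-Newton approximates the same matrix iteratively, which is where the loop overhead mentioned above comes from.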

Solution (What/How?)

  • compute the matrix power M^{-1/p} with SVD
    • run it on the GPU (not on the CPU this time)
  • perform a batched SVD when all pre-conditioners have the same shape. (maybe later, something like RaggedTensor could be used here.)
  • rename the new Shampoo optimizer to ScalableShampoo. (it is not actually scalable, though)
  • implement the original Shampoo optimizer.
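
The first two bullets can be illustrated with a minimal sketch. It is not the code merged in this PR: the function name `inverse_pth_root_svd` and the `eps` damping value are assumptions made for the example. It relies only on `torch.linalg.svd`, which is batched over leading dimensions and runs on whatever device the input lives on.

```python
import torch


def inverse_pth_root_svd(mat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Return mat^{-1/p} for symmetric PSD matrices of shape (..., n, n).

    Hypothetical helper for illustration; name and eps default are assumptions.
    """
    n = mat.shape[-1]
    # Damp the diagonal so that tiny singular values do not blow up.
    damped = mat + eps * torch.eye(n, dtype=mat.dtype, device=mat.device)
    # One call handles a whole stack of equally shaped matrices.
    u, s, vh = torch.linalg.svd(damped)
    # damped = U diag(s) Vh  =>  damped^{-1/p} = U diag(s^{-1/p}) Vh
    return (u * s.pow(-1.0 / p).unsqueeze(-2)) @ vh


# Four 8x8 symmetric PSD matrices stacked into one batch.
torch.manual_seed(0)
g = torch.randn(4, 8, 8, dtype=torch.float64)
stats = g @ g.transpose(-2, -1)
roots = inverse_pth_root_svd(stats, p=4)
print(roots.shape)
```

Stacking only works when every matrix has the same shape, which is why the batched path is limited to that case. Moving `stats` to a CUDA device before the call would run the same code on the GPU.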

Benchmark

Tested on i7-7700K + GTX 1060 6GB.

backbone: resmlp_12_distilled_224, bs: 16

x4.325 faster (A -> B)

  • AdamP: 3.73 iter / s
  • (old) Shampoo: 25s / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 256): 1.73s / iter -> A
  • Scalable Shampoo w/ SVD (block size = 256): 1.59 iter / s
  • Scalable Shampoo w/ SVD (block size = 512): 2.50 iter / s -> B

backbone: mixer_s32_224, bs: 8

x5.408 slower (A -> B)

  • AdamP: 3.85 iter / s
  • Scalable Shampoo w/ Schur-Newton (block size = 256): 1.68s / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 512): 1.05 iter / s -> A
  • Scalable Shampoo w/ SVD (block size = 256): 5.15s / iter -> B
  • Scalable Shampoo w/ SVD (block size = 512): 7.01s / iter

backbone: mixer_b16_224, bs: 2

x3.292 slower (A -> B)

  • AdamP: 3.15 iter / s
  • (old) Shampoo: over 2 mins / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 256): 5.11s / iter -> A
  • Scalable Shampoo w/ SVD (block size = 256): 16.82s / iter -> B
  • Scalable Shampoo w/ SVD (block size = 512): 32.47s / iter
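
The rows above mix two units (s / iter and iter / s), so the A -> B ratios are easy to misread. The following sketch shows one way to reproduce them from the A and B rows; the helper names are made up for this example and the numbers are copied from the lists above.

```python
def to_iter_per_sec(value: float, unit: str) -> float:
    """Normalize a benchmark reading to iterations per second."""
    if unit == "iter/s":
        return value
    if unit == "s/iter":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


def speedup(a: tuple, b: tuple) -> float:
    """Throughput of B divided by throughput of A (> 1 means B is faster)."""
    return to_iter_per_sec(*b) / to_iter_per_sec(*a)


# (value, unit) pairs for the rows marked A and B in each benchmark.
resmlp = speedup((1.73, "s/iter"), (2.50, "iter/s"))
mixer_s32 = speedup((1.05, "iter/s"), (5.15, "s/iter"))
mixer_b16 = speedup((5.11, "s/iter"), (16.82, "s/iter"))

print(f"resmlp_12_distilled_224: B is x{resmlp:.3f} faster than A")
print(f"mixer_s32_224: B is x{1 / mixer_s32:.3f} slower than A")
print(f"mixer_b16_224: B is x{1 / mixer_b16:.3f} slower than A")
```

In each benchmark, A is the fastest Schur-Newton configuration listed and B is the SVD configuration it is compared against.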

code

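    import torch
    from pytorch_optimizer import load_optimizer
    from timm import create_model
    from tqdm import tqdm

    # One configuration from the benchmark above; change these per run.
    backbone = 'resmlp_12_distilled_224'
    bs = 16
    block_size = 256
    use_svd = True

    model = create_model(backbone, pretrained=False, num_classes=1)
    model.train()
    model.cuda()

    optimizer = load_optimizer('scalableshampoo')(
        model.parameters(),
        start_preconditioning_step=1,
        block_size=block_size,
        use_svd=use_svd,
    )

    # Fixed random batch: only the step time matters here, not the loss value.
    inp = torch.randn((bs, 3, 224, 224), dtype=torch.float32).cuda()
    y = torch.randn((bs, 1), dtype=torch.float32).cuda()

    # The tqdm progress bar shows the iteration rate.
    for _ in tqdm(range(100)):
        optimizer.zero_grad()

        torch.nn.functional.binary_cross_entropy_with_logits(model(inp), y).backward()

        optimizer.step()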

Other changes (bug fixes, small refactors)

nope

Notes

nope

@kozistr kozistr added the enhancement (New feature or request) and feature (New features) labels Feb 4, 2023
@kozistr kozistr self-assigned this Feb 4, 2023
@codecov

codecov bot commented Feb 4, 2023

Codecov Report

Merging #103 (01b5c5a) into main (de06f63) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #103   +/-   ##
=======================================
  Coverage   99.70%   99.71%           
=======================================
  Files          39       39           
  Lines        3034     3125   +91     
=======================================
+ Hits         3025     3116   +91     
  Misses          9        9           

Impacted Files                                   Coverage Δ
pytorch_optimizer/__init__.py                    100.00% <100.00%> (ø)
pytorch_optimizer/optimizer/shampoo.py           100.00% <100.00%> (ø)
pytorch_optimizer/optimizer/shampoo_utils.py     100.00% <100.00%> (ø)


@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 4, 2023
@kozistr kozistr changed the title from "[Update] Use SVD to calculate M^{-1/p} instead of Schur-Newton method" to "[Update] Support SVD method to calculate M^{-1/p}" Feb 5, 2023
@kozistr kozistr merged commit 19c3df6 into main Feb 5, 2023
@kozistr kozistr deleted the update/shampoo-optimizer branch February 5, 2023 13:55