[Update] Support SVD method to calculate `M^{-1/p}` #103

kozistr · 2023-02-04T13:39:24Z

Problem (Why?)

In theory, the Schur-Newton method is faster than SVD for computing M^{-1/p}. In practice, because of the overhead of its iteration loop and other inefficiencies, SVD is much faster in some cases. So this PR adds support for the SVD method as well. (related to #100)
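For context on the loop mentioned above, here is a minimal sketch of a coupled Newton iteration for the inverse p-th root of a symmetric positive definite matrix. It is an illustration, not the code in this PR: the function name `inverse_pth_root_newton`, the `max_iters` and `tol` defaults, and the stopping rule are all assumptions made for the example.

```python
import torch


def inverse_pth_root_newton(
    mat: torch.Tensor, p: int, max_iters: int = 100, tol: float = 1e-6
) -> torch.Tensor:
    """Approximate mat^{-1/p} for a symmetric positive definite matrix (hypothetical helper)."""
    n = mat.shape[-1]
    identity = torch.eye(n, dtype=mat.dtype, device=mat.device)
    alpha = -1.0 / p

    # Scale so that the eigenvalues of mat_m lie in (0, (p + 1) / 2].
    z = (1 + p) / (2 * torch.linalg.matrix_norm(mat, ord=2))
    mat_m = mat * z
    mat_h = identity * z ** (1.0 / p)

    # Invariant: mat_h^p @ mat == mat_m. Once mat_m reaches I, mat_h is mat^{-1/p}.
    for _ in range(max_iters):
        if (mat_m - identity).abs().max() <= tol:
            break
        mat_t = (1 - alpha) * identity + alpha * mat_m
        mat_m = torch.linalg.matrix_power(mat_t, p) @ mat_m
        mat_h = mat_h @ mat_t

    return mat_h
```

Each pass runs several matrix products plus a convergence check that forces a host-side comparison, which is one plausible source of the loop overhead described above.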

Solution (What/How?)

- calculate the power of the matrix with SVD
  - compute on GPU (not CPU this time)
- perform batch SVD when the shapes of the pre-conditioners are all the same. (maybe later, RaggedTensor could be used here.)
- rename the new Shampoo optimizer to ScalableShampoo. (actually, not scalable though)
- implement the original Shampoo optimizer.
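The SVD route in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's `compute_power_svd`: the function name `inverse_pth_root_svd`, the `eps` clamp, and its default are hypothetical. It relies on the fact that for a symmetric positive semi-definite matrix M = U diag(s) V^T, the inverse root is U diag(s^{-1/p}) V^T. Because `torch.linalg.svd` accepts leading batch dimensions, same-shaped pre-conditioners can be stacked and handled in one call.

```python
import torch


def inverse_pth_root_svd(mat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Return mat^{-1/p} for symmetric PSD input of shape (..., n, n) (hypothetical helper)."""
    u, s, vh = torch.linalg.svd(mat)
    # Clamp tiny singular values so the negative power stays finite (assumed damping).
    inv_root = s.clamp_min(eps).pow(-1.0 / p)
    # Scale the columns of U by s^{-1/p}, then multiply by V^T.
    return (u * inv_root.unsqueeze(-2)) @ vh


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(4, 16, 16, device=device)
    # A stack of four same-shaped PSD matrices, processed as one batched SVD.
    stats = a @ a.transpose(-1, -2) + torch.eye(16, device=device)
    print(inverse_pth_root_svd(stats, p=4).shape)
```

Pre-conditioners with differing shapes cannot be stacked this way and would need one call per matrix.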

Benchmark

Tested on an i7-7700K + GTX 1060 6GB.

backbone: `resmlp_12_distilled_224`, bs: 16

x4.325 faster (A -> B)

AdamP: 3.73 iter / s
(old) Shampoo: 25s / iter
Scalable Shampoo w/ Schur-Newton (block size = 256): 1.73s / iter -> A
Scalable Shampoo w/ SVD (block size = 256): 1.59 iter / s
Scalable Shampoo w/ SVD (block size = 512): 2.50 iter / s -> B

backbone: `mixer_s32_224`, bs: 8

x5.408 slower (A -> B)

AdamP: 3.85 iter / s
Scalable Shampoo w/ Schur-Newton (block size = 256): 1.68s / iter
Scalable Shampoo w/ Schur-Newton (block size = 512): 1.05 iter / s -> A
Scalable Shampoo w/ SVD (block size = 256): 5.15s / iter -> B
Scalable Shampoo w/ SVD (block size = 512): 7.01s / iter

backbone: `mixer_b16_224`, bs: 2

x3.292 slower (A -> B)

AdamP: 3.15 iter / s
(old) Shampoo: over 2 mins / iter
Scalable Shampoo w/ Schur-Newton (block size = 256): 5.11s / iter -> A
Scalable Shampoo w/ SVD (block size = 256): 16.82s / iter -> B
Scalable Shampoo w/ SVD (block size = 512): 32.47s / iter
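
The results above mix two units (`iter / s` and `s / iter`), so the A -> B ratios are easiest to check after normalizing everything to seconds per iteration. The snippet below is a small sketch of that arithmetic. The helper names are hypothetical, and the 16.82 figure for `mixer_b16_224` is read as seconds per iteration.

```python
def seconds_per_iter(value: float, unit: str) -> float:
    """Normalize a benchmark figure to seconds per iteration."""
    if unit == "s/iter":
        return value
    if unit == "iter/s":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


def time_ratio(a: tuple, b: tuple) -> float:
    """Return time(B) / time(A): above 1 means B is slower, below 1 means B is faster."""
    return seconds_per_iter(*b) / seconds_per_iter(*a)


# (A, B) pairs as marked in the benchmark lists.
runs = {
    "resmlp_12_distilled_224": ((1.73, "s/iter"), (2.50, "iter/s")),
    "mixer_s32_224": ((1.05, "iter/s"), (5.15, "s/iter")),
    "mixer_b16_224": ((5.11, "s/iter"), (16.82, "s/iter")),
}

for name, (a, b) in runs.items():
    r = time_ratio(a, b)
    label = f"x{r:.3f} slower" if r > 1 else f"x{1 / r:.3f} faster"
    print(f"{name}: {label} (A -> B)")
```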

code

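    import torch
    from timm import create_model
    from tqdm import tqdm

    from pytorch_optimizer import load_optimizer

    # benchmark settings, changed per run (values here are from the first benchmark)
    backbone, bs = 'resmlp_12_distilled_224', 16
    block_size, use_svd = 256, True

    model = create_model(backbone, pretrained=False, num_classes=1)
    model.train()
    model.cuda()

    optimizer = load_optimizer('scalableshampoo')(
        model.parameters(),
        start_preconditioning_step=1,
        block_size=block_size,
        use_svd=use_svd,
    )

    # random inputs and targets: only the iteration speed is measured, not the loss
    inp = torch.randn((bs, 3, 224, 224), dtype=torch.float32).cuda()
    y = torch.randn((bs, 1), dtype=torch.float32).cuda()

    for _ in tqdm(range(100)):
        optimizer.zero_grad()

        torch.nn.functional.binary_cross_entropy_with_logits(model(inp), y).backward()

        optimizer.step()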

Other changes (bug fixes, small refactors)

nope

Notes

nope

codecov · 2023-02-04T13:41:16Z

Codecov Report

Merging #103 (01b5c5a) into main (de06f63) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #103   +/-   ##
=======================================
  Coverage   99.70%   99.71%           
=======================================
  Files          39       39           
  Lines        3034     3125   +91     
=======================================
+ Hits         3025     3116   +91     
  Misses          9        9

Impacted Files	Coverage Δ
pytorch_optimizer/__init__.py	`100.00% <100.00%> (ø)`
pytorch_optimizer/optimizer/shampoo.py	`100.00% <100.00%> (ø)`
pytorch_optimizer/optimizer/shampoo_utils.py	`100.00% <100.00%> (ø)`


kozistr added 7 commits February 4, 2023 22:14

update: grafting

c6c1c72

refactor: is_precondition_step

85e0cd2

feature: support SVD method to calculate M^{-1/p}

ca9de76

update: compute_power_svd

607f6ae

refactor: compute_power_schur_newton

385be0a

docs: compute_power_svd() docstring

fc020c1

feature: use_svd

e6595cd

kozistr added the enhancement (New feature or request) and feature (New features) labels Feb 4, 2023

kozistr self-assigned this Feb 4, 2023

pull-request-size bot added the size/M label Feb 4, 2023

kozistr added 11 commits February 4, 2023 22:44

update: test_compute_power

e6f6e1a

update: test_compute_power

4f5829a

update: test_shampoo_pre_conditioner

b56b794

docs: compute_power_svd docstring

010f8d1

update: compute_power_svd

49de081

docs: compute_power_schur_newton, _compute_power_svd

c85eb1e

update: compute_pre_conditioners

31fa001

fix: typo

b1b1dc3

update: Shampoo recipes

466f6e7

update: Shampoo recipes

bfa2cf4

update: recipes

06ae7b6

pull-request-size bot added size/L and removed size/M labels Feb 4, 2023

kozistr added 5 commits February 5, 2023 00:04

update: recipes

60ae998

update: recipes

d2c5316

feature: perform batch svd when the shapes of the pre-conditioners are all same

ebf1240

update: block_size to 512

c70fed0

docs: Shampoo docstring

a05ffd2

kozistr added 24 commits February 5, 2023 16:49

update: block size to 256

898a8ef

update: optimizers

347bffb

docs: Shampoo optimizer

b2fca31

update: shampoo recipe

5004def

update: test_get_supported_optimizers

42ebe2b

feature: Shampoo optimizer

21765ee

update: test_scalable_shampoo_optimizer

995bdd6

update: test_update_frequency

dc0b41e

update: test_bf16_gradient

cb16c18

update: Shampoo recipe

c94e7ac

update: NO_SPARSE_OPTIMIZERS

001a077

style: fix ERA001

5dbc7b5

update: __name__ to __str__

c15e503

update: cases

0e94d1f

update: compute_pre_conditioners

02696a7

update: default value of matrix_eps to 1e-6

a8a0c4b

update: copy to inv_pre_cond

0e879a4

update: Shampoo optimizer

ca2d5ae

docs: compute_pre_conditioners docstring

10ce92e

update: power_iter

ceef415

update: power_iter

2b9221e

update: compute_power_schur_newton

24282a9

update: use_svd to False

e650bae

update: test_scalable_shampoo_pre_conditioner

01b5c5a

kozistr changed the title ~~[Update] Use SVD to calculate M^{-1/p} instead of Schur-Newton method~~ [Update] Support SVD method to calculate M^{-1/p} Feb 5, 2023

kozistr merged commit 19c3df6 into main Feb 5, 2023

kozistr deleted the update/shampoo-optimizer branch February 5, 2023 13:55
