
Updated Shampoo uber slow performance #100

Open · redknightlois opened this issue Jan 31, 2023 · 10 comments
Labels: performance

@redknightlois

I just swapped out the Nero optimizer in my Lightning AI loop and gave the new Shampoo a try. There is something going on with it, as this card is typically able to do 2 iterations per second on almost anything. The old Shampoo was not fast either, but for a second-order optimizer, achieving half the iterations per second was expected.
(screenshot of the training progress attached)
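
For context, the swap itself is just the optimizer construction in the Lightning hook; a minimal sketch of what my loop does (toy model and hyperparameters, purely illustrative):

    import pytorch_lightning as pl
    from torch import nn
    from pytorch_optimizer import Shampoo  # Nero comes from the same package

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(128, 1)  # toy stand-in for the real backbone

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.mse_loss(self.net(x), y)

        def configure_optimizers(self):
            # previously: return Nero(self.parameters(), lr=1e-3)
            return Shampoo(self.parameters(), lr=1e-3)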

@kozistr kozistr self-assigned this Feb 1, 2023
@kozistr kozistr added the performance label Feb 1, 2023
@kozistr
Owner

kozistr commented Feb 1, 2023

Thanks for reporting!

Could you tell me the model (e.g. resnet50) and the parameters of the Shampoo optimizer?

Actually, I didn't test many configurations, but it seems that the pre-conditioning (based on the Google impl) is much slower than I expected. I'll figure it out.

@redknightlois
Author

redknightlois commented Feb 1, 2023

This is the configuration I am using for Mixer MLP

        "activation": "mish",
        "architecture": "mixer_mlp",
        "depth": 12,
        "expansion_factor": 2,
        "expansion_factor_token": 0.5,
        "feature_dropout": 0.2,         
        "latent_dim": 4096,
        "normalization": "none",
        "position_encoding": "none",

Feature size is token_size = 128, token_count = 16; this is roughly a 200M-parameter network.

@kozistr
Owner

kozistr commented Feb 1, 2023

I'm working on #101 and tested it on my local machine (GTX 1060 6GB).

  • backbone: resmlp_12_distilled_224 (about 15M params)
  • batch size: 4 (bigger bs causes OOM on my machine :( )
  • input size: (3, 224, 224)
  • iteration: 100

It took 3.48 s / iter, and I'd roughly guess the speed is within the expected range, although the compute_power() function, which computes G^{-1/p} using a coupled Newton iteration, still takes most of the time.
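
For reference, the coupled Newton iteration that compute_power() performs looks roughly like the sketch below (illustrative only, not the exact library code; the damping, tolerance, and iteration cap are made-up values):

    import torch

    def matrix_inverse_pth_root(g: torch.Tensor, p: int, eps: float = 1e-6,
                                max_iters: int = 100, tol: float = 1e-6) -> torch.Tensor:
        # coupled Newton iteration for G^{-1/p}: iterate (M, H) until M -> I,
        # at which point H approximates G^{-1/p}
        n = g.shape[0]
        identity = torch.eye(n, dtype=g.dtype, device=g.device)
        alpha = -1.0 / p

        g = g + eps * identity                    # damping for numerical stability
        z = (1 + p) / (2 * torch.linalg.norm(g))  # scaling so the iteration converges
        mat_m = g * z
        mat_h = identity * z ** (1.0 / p)

        for _ in range(max_iters):
            mat_m_i = (1 - alpha) * identity + alpha * mat_m
            mat_m = torch.linalg.matrix_power(mat_m_i, p) @ mat_m
            mat_h = mat_h @ mat_m_i
            if torch.max(torch.abs(mat_m - identity)) < tol:
                break

        return mat_h

Each step is a handful of dense matrix-matrix products per preconditioner, which is why this part dominates the optimizer time on large blocks.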

I'll check more and release a new version of the package, v2.4.0 (maybe soon).

Here's the benchmark code:

    import torch
    from timm import create_model
    from tqdm import tqdm

    from pytorch_optimizer import load_optimizer

    # small MLP-Mixer-style backbone with a single output logit
    model = create_model('resmlp_12_distilled_224', pretrained=False, num_classes=1)
    model.train()
    model.cuda()

    optimizer = load_optimizer('shampoo')(model.parameters())

    # dummy batch: bs=4, 3x224x224 inputs, binary targets
    inp = torch.zeros((4, 3, 224, 224), dtype=torch.float32).cuda()
    y = torch.ones((4, 1), dtype=torch.float32).cuda()

    for _ in tqdm(range(100)):
        optimizer.zero_grad()

        torch.nn.functional.binary_cross_entropy_with_logits(model(inp), y).backward()

        optimizer.step()

@kozistr
Owner

kozistr commented Feb 3, 2023

I released a new version, v2.4.0, with the fixes! Please check whether there's still a performance issue with your settings!

best regards

@redknightlois
Author

Much faster, but it's still taking 114 seconds per iteration. Same GPU model, but a slightly bigger model (300M parameters) in this case, as this is the GPU that just finished an epoch. For reference, Nero does 2 iterations per second.

@kozistr
Owner

kozistr commented Feb 3, 2023

> Much faster, but it's still taking 114 seconds per iteration. Same GPU model, but a slightly bigger model (300M parameters) in this case, as this is the GPU that just finished an epoch. For reference, Nero does 2 iterations per second.

Oh, thanks for testing. Then I guess there's still a problem with the preconditioner. Maybe only the JAX implementation could go well :sad-pepe: (the loop-with-if-statement implementation of the Schur-Newton method in PyTorch is really slow though :( ).

I'll investigate further:

  1. Roll back the Schur-Newton method to SVD (roughly the computation sketched below)
  2. Re-implement based on the old version of the Shampoo optimizer
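
For reference, item 1 would compute G^{-1/p} roughly like this (an illustrative sketch; for a symmetric PSD statistic, SVD coincides with an eigendecomposition):

    import torch

    def inverse_pth_root_svd(g: torch.Tensor, p: int, eps: float = 1e-12) -> torch.Tensor:
        # for symmetric PSD G = U diag(s) U^T, we have G^{-1/p} = U diag(s^{-1/p}) U^T
        u, s, _ = torch.linalg.svd(g)
        return u @ torch.diag(s.clamp_min(eps).pow(-1.0 / p)) @ u.t()

It is a single library call instead of a Python loop, but the SVD itself is O(n^3) per block, so it is not obviously cheaper.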

thanks in advance!

@redknightlois
Author

Let me know when you want me to test something.

@kozistr
Owner

kozistr commented Feb 6, 2023

I just deployed a new version v2.4.1 with some improvements! Change Log

In short,

  1. In my experiments, the SVD method is fast in a few cases; however, the Newton method is usually faster than SVD. (You can use the SVD method by setting the use_svd option to True.)
  2. Tuning block_size brings a meaningful speed gain.
  3. Schur-Newton or SVD takes 99.99% of the optimizer's time, and I venture a guess that it's hard to speed this up further unless the inverse matrix is computed in a distributed environment with lots of CPUs or XLA devices, as in the paper.
  4. The old Shampoo optimizer is back! (you can test both of them; see the snippet after this list)
    • load_optimizer('shampoo') -> old Shampoo optimizer
    • load_optimizer('scalableshampoo') -> new (scalable) Shampoo optimizer
    • or you can import them directly: from pytorch_optimizer import Shampoo, ScalableShampoo
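
A quick usage sketch of the two entry points (option names as described above; treat the exact signatures and values as illustrative):

    import torch
    from pytorch_optimizer import load_optimizer, ScalableShampoo

    model = torch.nn.Linear(10, 1)  # toy model

    # via the loader
    old_shampoo = load_optimizer('shampoo')(model.parameters())
    new_shampoo = load_optimizer('scalableshampoo')(model.parameters())

    # or via the class directly, tuning block_size / use_svd as discussed above
    opt = ScalableShampoo(model.parameters(), block_size=512, use_svd=False)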

Any feedback & requests are welcome!

Here are the benchmarks.

backbone: resmlp_12_distilled_224, bs: 16

x2.5 faster

  • AdamP: 3.73 iter / s
  • (old) Shampoo: over 25s / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 256): 1.68 s / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 512): 1.12 iter / s
  • Scalable Shampoo w/ SVD (block size = 256): 1.60 iter / s
  • Scalable Shampoo w/ SVD (block size = 512): 2.50 iter / s

backbone: mixer_b16_224, bs: 8

x0.5 faster

  • AdamP: 3.15 iter / s
  • Nero: 2.93 iter / s
  • (old) Shampoo: over 2 mins / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 256): 5.33 s / iter
  • Scalable Shampoo w/ Schur-Newton (block size = 512): 2.97 s / iter
  • Scalable Shampoo w/ SVD (block size = 256): 11.26 s / iter
  • Scalable Shampoo w/ SVD (block size = 512): 21.15 s / iter

@redknightlois
Author

Much better, but still too slow for the depth I am working at. Nero is doing a great job.

@kozistr
Owner

kozistr commented Apr 22, 2023

@redknightlois I did more work (#128, #129) on the Scalable Shampoo optimizer (code cleanup, PyTorch-level optimizations, new default parameters, ...) and just released v2.6.0.

Maybe it's much faster than before, because I changed the default value of preconditioning_compute_steps from 1 to 1000; preconditioning is the most compute-intensive part, and the authors said that computing it less often doesn't have a significant effect on convergence.
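
If the default interval is still too slow (or too coarse) for a given setup, it is just a constructor argument (values below are illustrative):

    import torch
    from pytorch_optimizer import ScalableShampoo

    model = torch.nn.Linear(10, 1)  # toy model

    # recompute the expensive G^{-1/p} preconditioners every 1000 steps (the new default);
    # lowering this trades speed for fresher preconditioners
    optimizer = ScalableShampoo(model.parameters(), preconditioning_compute_steps=1000)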

+) Also, I'm roughly guessing that the current implementation is nearly optimal for Scalable Shampoo (w/ synchronous preconditioner updates on a single GPU), so how about closing this issue for now? (If there's news, I'll re-open it or create another issue.)

If there are any requests, please feel free to use it and leave feedback anytime :)

Thank you!
