adafactor optimizer #106

isamu-isozaki · 2023-08-26T21:26:04Z

I'm planning to add the Adafactor optimizer used in the official implementation. The main benefit over Adam/AdamW is that we don't need 3x the model's VRAM; I think it takes a bit above 2x instead. I currently have the code up at https://github.com/isamu-isozaki/adafactor-pytorch and, after adding a Triton version, I will open a PR here!
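To make the 3x versus "a bit above 2x" estimate concrete, here is a rough element-count sketch. It is not taken from the linked repository. It assumes Adam/AdamW keep two full-size state buffers per parameter, and that Adafactor keeps a full-size momentum buffer plus one row vector and one column vector for each 2-D parameter. The function names are hypothetical, and gradients and activations are ignored.

```python
from math import prod

def adam_state_elems(shape):
    # Adam/AdamW: full-size first moment + full-size second moment.
    return 2 * prod(shape)

def adafactor_state_elems(shape, momentum=True):
    # Assumption: only 2-D parameters get a factored second moment;
    # everything else falls back to a full-size second moment.
    n = prod(shape)
    first = n if momentum else 0
    if len(shape) == 2:
        rows, cols = shape
        second = rows + cols  # one row vector + one column vector
    else:
        second = n
    return first + second

shape = (4096, 4096)  # hypothetical weight matrix
n = prod(shape)

# Total memory (weights + optimizer state) as a multiple of the weights.
adam_total = (n + adam_state_elems(shape)) / n
adafactor_total = (n + adafactor_state_elems(shape)) / n

print(f"Adam/AdamW: {adam_total:.4f}x, Adafactor: {adafactor_total:.4f}x")
```

With momentum disabled, the same count drops to only slightly above 1x for large matrices, at the cost of losing the first-moment estimate.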

isamu-isozaki · 2023-09-16T22:28:12Z

I finished at least the Python version! For Triton, it seems like multiplying a row matrix by a column matrix needs dimensions of 16, so a rank-1 product needs some more ideas.
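For reference, the rank-1 product in question is the second-moment reconstruction from the Adafactor paper: the estimate is the outer product of the row sums and the column sums, divided by the total sum. The sketch below is plain Python, not Triton, and is not the code from the linked repository. It only illustrates that each output element depends on one row statistic and one column statistic, so the product can be written as an elementwise multiply. Whether that sidesteps the size constraint in a Triton kernel is an assumption that has not been tested here. The function name is hypothetical.

```python
def factored_second_moment(grad_sq):
    """Rank-1 estimate of a matrix of squared gradients.

    grad_sq is a list of rows with non-negative entries.
    """
    row_sums = [sum(row) for row in grad_sq]
    col_sums = [sum(col) for col in zip(*grad_sq)]
    total = sum(row_sums)
    # Elementwise form of (row_sums as a column) x (col_sums as a row) / total:
    # entry (i, j) needs only row_sums[i] and col_sums[j], no matmul.
    return [[r * c / total for c in col_sums] for r in row_sums]

# A matrix that is already rank 1 is recovered exactly.
grad_sq = [[1.0, 2.0],
           [2.0, 4.0]]
v_hat = factored_second_moment(grad_sq)
print(v_hat)
```

For matrices that are not rank 1 the result is only an approximation, which is the trade Adafactor makes for storing two vectors instead of a full matrix.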
