8-bit LION, take 2 #514

Merged: 35 commits into main from davis/lion8b-v2 on Aug 24, 2023

Conversation

@dblalock (Contributor) commented Aug 10, 2023

Adds an 8-bit version of the LION optimizer. Also features 1 byte of (optional) auxiliary error correction state for each parameter to make pure bf16 training work.
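For orientation, here is a minimal usage sketch. It is not taken from this PR's diff; the `quantize` and `error_correction` keyword names are assumptions inferred from the description above and may not match the final API.

```python
import torch

from llmfoundry.optim import DecoupledLionW_8bit

# Requires a CUDA device for the quantized path; on CPU the optimizer falls back
# to unquantized state (see the design notes below).
model = torch.nn.Linear(1024, 1024, device='cuda', dtype=torch.bfloat16)
opt = DecoupledLionW_8bit(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-5,
    quantize=True,          # 8-bit momentum state (assumed keyword)
    error_correction=True,  # extra byte per param so pure-bf16 master weights train stably (assumed keyword)
)

loss = model(torch.randn(8, 1024, device='cuda', dtype=torch.bfloat16)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```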

Code changes:

  • Adds lion8b.py to llm-foundry/optim
  • Adds DecoupledLionW_8bit to llm-foundry/optim/__init__.py
  • Adds lion8b as an option in llm-foundry/optim/builders.py
  • Adds test_lion8b.py to the tests.
  • Adds mosaicml-turbo to the GPU dependencies in setup.py. This is the repo that currently holds all the CUDA kernels. These are in a separate repo for now to avoid complicating LLM foundry {install, deps, source code}.
  • Adds an optional master_weight_dtype field in train.py. If set to bf16 or fp16, the script calls model.to(dtype=<that dtype>) before training; this is safe when error correction is turned on (see the sketch after this list).
  • Tweaks config_utils.py to set FSDP's param_dtype to None if the master weights are already fp16/bf16.
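
A hedged sketch of how the two config hooks above might fit together: the "lion8b" option string and the master_weight_dtype values come from the PR description, while the function shapes below are invented for illustration and do not mirror the actual builders.py / train.py code.

```python
from typing import Optional

import torch

from llmfoundry.optim import DecoupledLionW_8bit


def build_optimizer(name: str, model: torch.nn.Module, **opt_kwargs):
    # 'lion8b' is the option string this PR adds to builders.py; the function
    # shape here is simplified for illustration.
    if name == 'lion8b':
        return DecoupledLionW_8bit(model.parameters(), **opt_kwargs)
    raise ValueError(f'Unknown optimizer: {name}')


def maybe_cast_master_weights(model: torch.nn.Module,
                              master_weight_dtype: Optional[str]) -> torch.nn.Module:
    # The new optional train.py field: cast the whole model to bf16/fp16 up front.
    # Only safe when the optimizer's error correction is enabled.
    dtypes = {'bf16': torch.bfloat16, 'fp16': torch.float16}
    if master_weight_dtype is not None:
        model = model.to(dtype=dtypes[master_weight_dtype])
    return model
```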

Non-obvious design choices:

  • Features the option to store a few extra bits per parameter to facilitate bf16 or fp16 master weights, which should save space and hopefully squeeze out a little more MFU.
  • There's an undocumented _fused arg that's needed for testing and maybe (?) should be part of the API in case someone wants to check whether the fused kernel is causing issues. I'd like to get rid of it once we fully trust the kernel logic.
  • This works fine on CPUs; it just doesn't quantize there. We could save some lines of code by requiring a GPU instead, but I'm worried about users wanting to prototype/debug locally.
  • We have a _MaybeQuantizedTensor class to hold the quantized optimizer states (a toy sketch of the idea follows below). It's a pretty good solution, but I don't love having a tensor-like object that doesn't implement the full Tensor interface.

There's enough test coverage here that I'm not super worried about these choices, but wanted to highlight them in case someone has strong opinions.
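
To make the last design note concrete, here is a toy sketch of the _MaybeQuantizedTensor idea. It is illustrative only: the real class in lion8b.py relies on the mosaicml-turbo CUDA kernels and implements more of the tensor-like surface than shown here.

```python
from typing import Optional

import torch


class MaybeQuantizedTensorSketch:
    """Holds optimizer state either as a plain float tensor (CPU fallback)
    or as int8 values plus per-row scales (GPU, quantized). Toy version."""

    def __init__(self, data: torch.Tensor, try_quantize: bool = True):
        self.data: Optional[torch.Tensor] = None
        self.quantized: Optional[torch.Tensor] = None
        self.scales: Optional[torch.Tensor] = None
        if try_quantize and data.is_cuda:
            # Toy symmetric int8 quantization; the real kernels live in mosaicml-turbo.
            absmax = data.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
            self.scales = absmax / 127.0
            self.quantized = torch.round(data / self.scales).to(torch.int8)
        else:
            self.data = data.float()

    def materialize(self) -> torch.Tensor:
        # Dequantize on demand so the LION update can run in float math.
        if self.data is not None:
            return self.data
        return self.quantized.float() * self.scales
```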

WandB report

@dblalock dblalock changed the title add decoupled lion8b optimizer + tests + builder option + deps [WIP] 8-bit LION, take 2 Aug 11, 2023
@dblalock dblalock requested a review from vchiley August 14, 2023 07:37
@dblalock dblalock marked this pull request as ready for review August 14, 2023 07:38
Review thread on tests/test_lion8b.py (outdated, resolved)
@dblalock dblalock changed the title [WIP] 8-bit LION, take 2 8-bit LION, take 2 Aug 14, 2023
Review thread on tests/test_lion8b.py (outdated, resolved)
Review thread on llmfoundry/optim/lion8b.py (outdated, resolved)
@dblalock dblalock requested a review from vchiley August 21, 2023 03:53
Review thread on tests/test_lion8b.py (outdated, resolved)
@vchiley (Contributor) commented Aug 22, 2023

High level design point
If DecoupledLionW_8bit is requested, it kind of seems like it should always be quantized.
If people are running on CPU or don't ask for the quantized version, they should be running DecoupledLionW, not DecoupledLionW_8bit...

Should we make sure that the state dict of DecoupledLionW_8bit can be loaded into DecoupledLionW (and vice versa), and then error out, directing people to use DecoupledLionW when they're in a setting where DecoupledLionW_8bit isn't available?
This setup would mean we could get rid of the _MaybeQuantizedTensor class (as well as potentially some other code).

Yes, this point should have been brought up a LONG time ago. I'm not saying we should do this, just opening the discussion.
The downside is obviously that it would be tougher to test quantized vs not quantized.
(The answer to this will probably be: we're going to enable the non-quantized version in the DecoupledLionW_8bit class as well 😄 )

Also, can the state dict of DecoupledLionW_8bit be loaded into DecoupledLionW (and vice versa)?
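
A test along these lines could answer the compatibility question directly. This is only a sketch, and whether it passes is exactly the open question; it assumes both classes are importable from llmfoundry.optim.

```python
import torch

from llmfoundry.optim import DecoupledLionW, DecoupledLionW_8bit


def _stepped_optimizer(opt_cls):
    model = torch.nn.Linear(16, 16)
    opt = opt_cls(model.parameters(), lr=1e-3)
    model(torch.randn(4, 16)).sum().backward()
    opt.step()  # populate momentum state
    return opt


def test_state_dict_roundtrip():
    lion = _stepped_optimizer(DecoupledLionW)
    lion8b = _stepped_optimizer(DecoupledLionW_8bit)
    lion8b.load_state_dict(lion.state_dict())  # fp32 LION state into 8-bit LION
    lion.load_state_dict(lion8b.state_dict())  # and the reverse direction
```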

@dblalock (Contributor, Author) commented Aug 22, 2023

@vchiley How would you feel about just making this the DecoupledLionW implementation in a future PR? So we'd have one optimizer with a quantize flag instead of two classes?

Asking because this informs whether I should rip out the flag. I like the idea of avoiding redundant code, but maybe we have reasons to not do this.
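
For concreteness, the proposal amounts to something like the following. This is a hypothetical future API, not part of this PR; DecoupledLionW does not currently take a quantize flag.

```python
import torch

from llmfoundry.optim import DecoupledLionW  # hypothetical: unified class with a quantize flag

model = torch.nn.Linear(32, 32)
# Quantize only where the CUDA kernels are available; fall back to plain LION math on CPU.
opt = DecoupledLionW(model.parameters(), lr=1e-4,
                     quantize=torch.cuda.is_available())  # hypothetical keyword
```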

@vchiley (Contributor) commented Aug 22, 2023

@bmosaicml originally wrote the LION implementation.
@bmosaicml do you have an opinion on this?

@bmosaicml (Contributor) left a comment


Incredible work Davis!

@dblalock (Contributor, Author)

@bmosaicml so are you good with "yes, leave the quantization flag and make this the single LION implementation"?

@dblalock dblalock merged commit 795ab4a into main Aug 24, 2023
9 checks passed
@dakinggg dakinggg deleted the davis/lion8b-v2 branch October 11, 2023 21:30