
[megatron_gpt2] checkpoint v3 #13508

Merged: 7 commits merged into huggingface:master from the megatron-lm-chkpt-v3 branch on Sep 20, 2021

Conversation

stas00 (Contributor) commented on Sep 10, 2021

The BigScience project generates Megatron-DeepSpeed checkpoints en masse, so we need better v3 checkpoint support for megatron_gpt2 (a direct Megatron-DeepSpeed to HF-style conversion script is also in the works).

This PR:

  • removes the requirement that the source checkpoint is a zip file. Most of the checkpoints aren't .zip files, so both formats are now supported (see the sketch after this list).
  • removes the need to manually feed the config when all the needed data is already in the checkpoint (while still supporting the old format).
  • disables a debug print that repeats for each layer.
  • switches to the default gelu activation instead of gelu_new, which is what the current Megatron-LM uses.
  • fixes a bug in the previous version: the hidden size is dimension 1, as can be seen in
    https://github.com/NVIDIA/Megatron-LM/blob/3860e995269df61d234ed910d4756e104e1ab844/megatron/model/language_model.py#L140-L141
    The previous script happened to work only because max_position_embeddings happened to be equal to hidden_size.
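
A minimal sketch of the first two points, assuming the Megatron-LM convention of saving the training argparse Namespace under the checkpoint's "args" key; the inner archive path and the helper names here are illustrative, not the exact PR code:

import zipfile

import torch


def load_megatron_checkpoint(path):
    """Load a Megatron checkpoint whether it is a .zip archive or a plain torch file."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path, "r") as archive:
            # assumed inner layout of a zipped Megatron checkpoint
            with archive.open("release/mp_rank_00/model_optim_rng.pt") as f:
                return torch.load(f, map_location="cpu")
    return torch.load(path, map_location="cpu")


def gpt2_config_overrides(input_state_dict):
    """Derive GPT2Config values from the checkpoint's own saved args instead of a user-supplied config."""
    margs = input_state_dict["args"]
    return {
        "n_embd": margs.hidden_size,                   # hidden size (dimension 1 of the embedding matrices)
        "n_positions": margs.max_position_embeddings,  # no longer conflated with hidden_size
        "n_layer": margs.num_layers,
        "n_head": margs.num_attention_heads,
        "activation_function": "gelu",                 # this PR uses gelu, not gelu_new
    }

With overrides like these, one could build the HF config as GPT2Config(**gpt2_config_overrides(state_dict)) rather than hand-writing a config file.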

@LysandreJik, @sgugger

@stas00 stas00 marked this pull request as draft September 10, 2021 03:43
@stas00 stas00 marked this pull request as ready for review September 18, 2021 01:53
@@ -80,6 +80,20 @@ def convert_megatron_checkpoint(args, input_state_dict, config):
# The converted output model.
output_state_dict = {}

nargs = input_state_dict["args"]
# from pprint import pprint
# pprint(vars(nargs))
stas00 (Contributor, Author) commented:

The debug prints are essential for quick troubleshooting when converting, so they are a feature and not a left-over; I hope it's ok if they remain there. Thank you!

sgugger (Collaborator) commented:

Can we use the logger then and set it at a debug level?

stas00 (Contributor, Author) commented Sep 20, 2021:

We don't know ahead of time which parts will need to be debugged, hence the debug prints are prepared for the developer in the various places, and a single on/off logger usually produces too much information.

The DeepSpeed framework has a ton of these. Sometimes they are commented out, but most of the time it uses a helper debug-print utility with a force flag that defaults to False, so the call stays in the code but does nothing unless the developer flips the flag to True:

print_rank0("some debug/trace", force=False)

I'd be happy to adopt this approach in transformers if it resonates better, but then we need to create such a convenience function (in another PR, perhaps).
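
For illustration, a minimal sketch of such a helper, assuming a hypothetical print_rank0 with a force flag (this is the pattern described above, not an existing transformers or DeepSpeed API):

import torch.distributed as dist


def print_rank0(message, force=False):
    """Debug/trace print that stays silent unless force=True, and then prints on rank 0 only.

    The call can live in the code permanently; flipping force=True at the
    call site is all that is needed to turn the trace on while debugging.
    """
    if not force:
        return
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        print(message)


print_rank0("weights reshaped", force=False)  # no-op in normal runs; flip to True while troubleshooting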

Also, this is similar to:

        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        execute_subprocess_async(cmd, env=self.get_env())

I have this in various extended tests and I use it quite often, but only when I need to. And I want the process to abort at that point, hence the die trigger (borrowed from Perl, where it's actually a built-in function; it works in Python as well because it's just an undefined symbol there, so evaluating it raises an error). The commented-out code saves a ton of time when debugging these tests.
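
A tiny illustration of why the trick works: die is not defined in Python, so evaluating it raises NameError and the run stops right after the debug line prints (the names here are illustrative):

def run(cmd, debug=False):
    if debug:
        # prints the command, then the undefined `die` raises NameError and aborts
        print(" ".join(cmd)); die  # noqa: F821 -- intentional abort
    print("executing:", cmd)


run(["echo", "hello"])                # normal path
# run(["echo", "hello"], debug=True)  # prints the command, then aborts with NameError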

stas00 (Contributor, Author) commented:

We will get back to this subject and revisit the code once we agree on the best approach.

sgugger (Collaborator) left a comment:

LGTM, thanks for fixing!

@stas00 stas00 merged commit 0af901e into huggingface:master Sep 20, 2021
@stas00 stas00 deleted the megatron-lm-chkpt-v3 branch September 20, 2021 15:50
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022
* [megatron_gpt2] checkpoint v3

* bug fix

* fixes

* switch to default gelu from gelu_new - which is what the current megatron-lm uses

* cleanup

* back compat
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022