
[megatron_gpt2] checkpoint v3 #13508

Merged: 7 commits merged into huggingface:master from the megatron-lm-chkpt-v3 branch on Sep 20, 2021

Conversation

stas00 (Contributor) commented on Sep 10, 2021

The BigScience project generates Megatron-DeepSpeed checkpoints en masse, so we need better v3 checkpoint support for megatron_gpt2 (a direct Megatron-DeepSpeed to HF-style conversion script is also in the works).

This PR:

  • removes the requirement that the source checkpoint is a zip file. Most of the checkpoints aren't .zip files, so both formats are now supported (see the sketch after this list).
  • removes the need to manually feed the config when all the needed data is already in the checkpoint (while still supporting the old format).
  • disables a debug print that repeats for each layer.
  • switches to the default gelu activation instead of gelu_new, which is what the current Megatron-LM uses.
  • fixes a bug in the previous version: the hidden size is dimension 1, as can be seen in
    https://github.com/NVIDIA/Megatron-LM/blob/3860e995269df61d234ed910d4756e104e1ab844/megatron/model/language_model.py#L140-L141
    The previous script happened to work only because max_position_embeddings happened to be equal to hidden_size.
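
A minimal sketch of the first two points, assuming the Megatron-LM convention of saving the training argparse Namespace under the checkpoint's "args" key; the inner archive path and the helper names here are illustrative, not the exact PR code:

import zipfile

import torch


def load_megatron_checkpoint(path):
    """Load a Megatron checkpoint whether it is a .zip archive or a plain torch file."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path, "r") as archive:
            # assumed inner layout of a zipped Megatron checkpoint
            with archive.open("release/mp_rank_00/model_optim_rng.pt") as f:
                return torch.load(f, map_location="cpu")
    return torch.load(path, map_location="cpu")


def gpt2_config_overrides(input_state_dict):
    """Derive GPT2Config values from the checkpoint's own saved args instead of a user-supplied config."""
    margs = input_state_dict["args"]
    return {
        "n_embd": margs.hidden_size,                   # hidden size (dimension 1 of the embedding matrices)
        "n_positions": margs.max_position_embeddings,  # no longer conflated with hidden_size
        "n_layer": margs.num_layers,
        "n_head": margs.num_attention_heads,
        "activation_function": "gelu",                 # this PR uses gelu, not gelu_new
    }

With overrides like these, one could build the HF config as GPT2Config(**gpt2_config_overrides(state_dict)) rather than hand-writing a config file.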

@LysandreJik, @sgugger

@stas00 stas00 marked this pull request as draft September 10, 2021 03:43
@stas00 stas00 marked this pull request as ready for review September 18, 2021 01:53
@@ -80,6 +80,20 @@ def convert_megatron_checkpoint(args, input_state_dict, config):
# The converted output model.
output_state_dict = {}

nargs = input_state_dict["args"]
# from pprint import pprint
# pprint(vars(nargs))
stas00 (Contributor, Author) commented:

The debug prints are essential for quick troubleshooting when converting, so they are a feature and not a left-over; I hope it's ok if they remain there. Thank you!

sgugger (Collaborator) commented:

Can we use the logger then and set it at a debug level?

stas00 (Contributor, Author) commented Sep 20, 2021:

We don't know ahead of time which parts will need to be debugged, hence the debug prints are prepared for the developer in the various places, and a single on/off logger usually produces too much information.

The DeepSpeed framework has a ton of these. Sometimes they are commented out, but most of the time it uses a helper debug-print utility with a force flag that defaults to False, so the call stays in the code but does nothing unless the developer flips the flag to True:

print_rank0("some debug/trace", force=False)

I'd be happy to adopt this approach in transformers if it resonates better, but then we need to create such a convenience function (in another PR, perhaps).
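
For illustration, a minimal sketch of such a helper, assuming a hypothetical print_rank0 with a force flag (this is the pattern described above, not an existing transformers or DeepSpeed API):

import torch.distributed as dist


def print_rank0(message, force=False):
    """Debug/trace print that stays silent unless force=True, and then prints on rank 0 only.

    The call can live in the code permanently; flipping force=True at the
    call site is all that is needed to turn the trace on while debugging.
    """
    if not force:
        return
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        print(message)


print_rank0("weights reshaped", force=False)  # no-op in normal runs; flip to True while troubleshooting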

Also, this is similar to:

        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        execute_subprocess_async(cmd, env=self.get_env())

I have this in various extended tests and I use it quite often, but only when I need to. And I want the process to abort at that point, hence the die trigger (borrowed from Perl, where it's actually a built-in function; it works in Python as well because it's just an undefined symbol there, so evaluating it raises an error). The commented-out code saves a ton of time when debugging these tests.
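
A tiny illustration of why the trick works: die is not defined in Python, so evaluating it raises NameError and the run stops right after the debug line prints (the names here are illustrative):

def run(cmd, debug=False):
    if debug:
        # prints the command, then the undefined `die` raises NameError and aborts
        print(" ".join(cmd)); die  # noqa: F821 -- intentional abort
    print("executing:", cmd)


run(["echo", "hello"])                # normal path
# run(["echo", "hello"], debug=True)  # prints the command, then aborts with NameError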

stas00 (Contributor, Author) commented:

We will get back to this subject and revisit the code once we agree on the best approach.

sgugger (Collaborator) left a comment:

LGTM, thanks for fixing!

@stas00 stas00 merged commit 0af901e into huggingface:master Sep 20, 2021
@stas00 stas00 deleted the megatron-lm-chkpt-v3 branch September 20, 2021 15:50
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022
* [megatron_gpt2] checkpoint v3

* bug fix

* fixes

* switch to default gelu from gelu_new - which is what the current megatron-lm uses

* cleanup

* back compat
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022