
Add final_layer_norm to OPT model #17785

Merged (8 commits, Jun 21, 2022)
Conversation

thomasw21 (Contributor) commented Jun 20, 2022

What does this PR do?

Fixes #17653, #17545

OPT models have a final_layer_norm: https://github.com/facebookresearch/metaseq/blob/e0c4f6b0e4c523906ad8d561f727e3f2ac3a8e73/metaseq/models/transformer.py#L466-L477
This PR updates the HF models and the conversion script to take that missing layer norm into account.

Test on OPT-125m (restored.pt file from patrickvonplaten/opt_metaseq_125m):

>>> model_path="fixed_opt_125m"
>>> prompt="Hello my name is"
>>> log_probs_with_ppl(model_path, prompt)
Input torch.Size([1, 5])
Logits torch.Size([1, 5, 50272])
torch.return_types.max(
values=tensor([[0.2398, 0.2326, 0.3332, 0.9363, 0.0097]], grad_fn=<MaxBackward0>),
indices=tensor([[ 100,    6,  766,   16, 1236]]))
argmax probability: [[0.23982257 0.23258895 0.33315504 0.9362957  0.00967377]]
argmax log probability: [[-1.4278558  -1.4584825  -1.0991473  -0.06582398 -4.6383367 ]]
argmax tokens: I, name is j
cross entropy loss: 4.051314830780029
ppl: 57.47297286987305
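For reference, a minimal sketch of what a log_probs_with_ppl-style check could look like; the helper itself is not included in this PR, so the implementation below (and loading the local fixed_opt_125m directory via AutoTokenizer/OPTForCausalLM) is an assumption:

# Hypothetical reconstruction of the log_probs_with_ppl helper used above; the
# real script is not shown in this PR, so treat this as a sketch only.
import torch
from transformers import AutoTokenizer, OPTForCausalLM

def log_probs_with_ppl(model_path, prompt):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = OPTForCausalLM.from_pretrained(model_path)

    inputs = tokenizer(prompt, return_tensors="pt")
    print("Input", inputs.input_ids.shape)

    outputs = model(**inputs, labels=inputs.input_ids)
    logits = outputs.logits
    print("Logits", logits.shape)

    # Probability and id of the most likely next token at every position.
    max_probs = torch.softmax(logits, dim=-1).max(dim=-1)
    print("argmax probability:", max_probs.values.detach().numpy())
    print("argmax tokens:", tokenizer.decode(max_probs.indices[0]))

    # Cross-entropy over the shifted labels gives the perplexity.
    print("cross entropy loss:", outputs.loss.item())
    print("ppl:", torch.exp(outputs.loss).item())

log_probs_with_ppl("fixed_opt_125m", "Hello my name is")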

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

HuggingFaceDocBuilderDev commented Jun 20, 2022

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker (Collaborator) left a comment

LGTM, we should update the TF and Flax checkpoints.

younesbelkada (Contributor) commented
Thank you very much for the fix! I think that we'll have to change the generation tests a bit for the other models as well

patrickvonplaten (Contributor) commented

Great find @thomasw21 - thanks a lot for fixing it!

I think the checkpoints were then also loaded incorrectly inside the metaseq codebase. Could you double-check that the following script gives identical results between fairseq and transformers: https://huggingface.co/patrickvonplaten/opt_metaseq_125m? The logits should match there (maybe an incorrect configuration in the metaseq model?).

Also could you please update the slow model tests?
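A rough sketch of the kind of parity check being asked for here (not part of the PR; the reference-logits file name and the example input ids are made up for illustration):

# Compare HF logits against reference logits dumped from metaseq for the same
# input ids. "metaseq_logits.pt" is a hypothetical file produced by running the
# metaseq side separately.
import torch
from transformers import OPTForCausalLM

hf_model = OPTForCausalLM.from_pretrained("fixed_opt_125m")
input_ids = torch.tensor([[2, 31414, 127, 766, 16]])  # illustrative ids only

hf_logits = hf_model(input_ids).logits
ref_logits = torch.load("metaseq_logits.pt")

# If both sides load the final layer norm, the logits should agree up to
# numerical noise.
print(torch.allclose(hf_logits, ref_logits, atol=1e-3))
print((hf_logits - ref_logits).abs().max())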

ArthurZucker (Collaborator) commented

@thomasw21 I can update the tests and check the outputs if you want

thomasw21 (Contributor, Author) commented Jun 21, 2022

@patrickvonplaten From what I understood, the logits-equality comparison tests were only done for the 350m model? @younesbelkada
I actually manually converted restored.pt from https://huggingface.co/patrickvonplaten/opt_metaseq_125m using the updated conversion script.

@ArthurZucker if you have the bandwidth, I'd appreciate it! Thanks!

gante mentioned this pull request on Jun 21, 2022.
patrickvonplaten added a commit to patrickvonplaten/metaseq that referenced this pull request on Jun 21, 2022:

If we don't transfer the `"decoder.version"` key to the singleton checkpoint, a very sneaky bug happens, which was found by @thomasw21 as part of this PR: huggingface/transformers#17785

If the `decoder.version` param is not present in the state_dict, then upon loading the singleton checkpoint the loaded layer_norm is set to `None` here: https://github.com/facebookresearch/metaseq/blob/e0c4f6b0e4c523906ad8d561f727e3f2ac3a8e73/metaseq/models/transformer.py#L932

So it's absolutely crucial that we include this variable.

I will update all of the converted HF checkpoints here later today, and then I think we can be sure that OPT works correctly 🥳
https://huggingface.co/models?other=opt_metasq
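To illustrate the point of that commit, here is a hedged sketch (not the actual metaseq patch) of carrying decoder.version over when building the singleton checkpoint; the function name and the placeholder version value are assumptions:

# Illustrative only: when consolidating shards into a singleton checkpoint,
# keep "decoder.version" so the metaseq loader does not take the legacy code
# path that sets the final layer_norm to None on load.
import torch

def build_singleton_state_dict(sharded_state_dicts):
    merged = {}
    for shard in sharded_state_dicts:
        merged.update(shard)
    # Placeholder value; in practice the key should be copied from the original
    # model shards rather than invented here.
    merged.setdefault("decoder.version", torch.tensor([3.0]))
    return merged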
@@ -492,7 +492,11 @@ def __init__(self, config: OPTConfig):
        else:
            self.project_in = None

        self.layer_norm = None
        if config.do_layer_norm_before:

A reviewer (Contributor) left a suggested change on this hunk:

Suggested change
-        if config.do_layer_norm_before:
+        if config.do_layer_norm_before and not config._remove_final_layer_norm:
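For context, a sketch of what the guarded initialization might look like once the suggestion above is applied; the attribute and config names follow the hunk shown here, and the surrounding class body is an assumption:

import torch.nn as nn

class OPTDecoderSketch(nn.Module):  # stand-in, not the real OPTDecoder
    def __init__(self, config):
        super().__init__()
        # Only create the final layer norm when the architecture is
        # pre-layer-norm and the backward-compatibility flag does not ask for
        # it to be removed.
        if config.do_layer_norm_before and not config._remove_final_layer_norm:
            self.layer_norm = nn.LayerNorm(config.hidden_size)
        else:
            self.layer_norm = None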

patrickvonplaten (Contributor) left a comment

Thanks for fixing it! Let's merge the PR once the checkpoints are correctly uploaded under @ArthurZucker's namespace

thomasw21 (Contributor, Author) commented Jun 21, 2022

@patrickvonplaten Yep, I've looked at the changes from your comment, feel free to merge them :D

younesbelkada (Contributor) commented

When releasing the patch, can we merge #17437 at the same time? The problem of NaNs in batched generation still persists with this fix, but is resolved by #17437.

stephenroller added a commit to facebookresearch/metaseq that referenced this pull request on Jun 21, 2022:
…checkpoint to run correctly (#164)

* Singleton checkpoint needs to include decoder.version
* Update convert_to_singleton.py

Co-authored-by: Stephen Roller <roller@fb.com>
ArthurZucker (Collaborator) commented

BTW @patrickvonplaten do you have the expected values for the slow test?

patrickvonplaten (Contributor) commented

BTW @patrickvonplaten do you have the expected values for the slow test?

Corrected the tests as well now

sgugger (Collaborator) left a comment

Thanks for investigating, fixing and humoring my push for backward compatibility :-)

patrickvonplaten (Contributor) commented

Good job @thomasw21 !

patrickvonplaten merged commit abc400b into main on Jun 21, 2022.
patrickvonplaten deleted the thomas/fix_opt branch on June 21, 2022 at 18:26.
sgugger pushed a commit that referenced this pull request Jun 21, 2022
* Add final_layer_norm to OPT model

* Add JAX and TF version

* Fix Keras name

* Woops

* Allow for non breaking change

* Apply suggestions from code review

* add tests

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
vfbd pushed a commit to VE-FORBRYDERNE/mesh-transformer-jax that referenced this pull request Jun 21, 2022
sriniiyer pushed a commit to facebookresearch/metaseq that referenced this pull request on Jun 21, 2022:
…checkpoint to run correctly (#164)
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request on Jun 25, 2022.
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request on Jun 29, 2022.
Successfully merging this pull request may close these issues.

Abnormal behavior of OPT except OPT-350m
6 participants