
Use cross_attention_hidden_size in Encoder-Decoder models #14378

Merged

Conversation


@ydshieh ydshieh commented Nov 12, 2021

What does this PR do?

  • Add a projection layer (enc_to_dec_proj) between the encoder and decoder in composite models, taking the attribute cross_attention_hidden_size into account (see the sketch after this list).

    • Add some PT/TF and PT/Flax equivalence tests in the TF/Flax composite model test scripts.
    • Also make some logging and ValueError messages consistent across the composite model scripts.
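
For readers unfamiliar with the attribute, here is a minimal, hedged sketch (not the PR's exact code) of what the projection does: when the encoder and decoder hidden sizes differ and the decoder does not declare cross_attention_hidden_size, the encoder hidden states are projected to the decoder's hidden size before they feed cross-attention.

import torch
from torch import nn

encoder_hidden_size = 768           # e.g. a BERT-base encoder (illustrative values)
decoder_hidden_size = 1024          # e.g. a larger decoder
cross_attention_hidden_size = None  # the decoder config attribute this PR checks

if encoder_hidden_size != decoder_hidden_size and cross_attention_hidden_size is None:
    enc_to_dec_proj = nn.Linear(encoder_hidden_size, decoder_hidden_size)

encoder_hidden_states = torch.randn(2, 16, encoder_hidden_size)  # (batch, seq_len, dim)
projected = enc_to_dec_proj(encoder_hidden_states)
print(projected.shape)  # torch.Size([2, 16, 1024])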

@ydshieh ydshieh changed the title [WIP] Use cross_attention_hidden_size in Encoder-Decoder models Use cross_attention_hidden_size in Encoder-Decoder models Nov 13, 2021
@ydshieh ydshieh marked this pull request as ready for review November 13, 2021 11:35

ydshieh commented Nov 13, 2021

I ran the slow tests for all the encoder-decoder model test scripts (e.g. RUN_SLOW=1 python -m pytest ...), and they pass.

BTW, is there an easy way to run all the cross tests in a test script, i.e. to disable @is_pt_tf_cross_test or @is_pt_flax_cross_test?

encoder(encoder.dummy_inputs)
decoder(decoder.dummy_inputs)
tf_model = TFEncoderDecoderModel(encoder=encoder, decoder=decoder)


@ydshieh ydshieh Nov 13, 2021


The TF encoder-decoder model family doesn't work smoothly with checkpoint loading and requires some hacks to make it work.

In the case here, if a TF composite model (whose weights are created under the scope of the top model) saves its encoder/decoder components separately, the two checkpoints will contain the top model name, i.e. the encoder/decoder checkpoint weights will begin with tf_encoder_decoder_model. This causes problems when we want to load them again, in particular in from_encoder_decoder_pretrained.

However, if a TF composite model is constructed from encoder and decoder models that are built first, their weight names don't contain the top model name, and we can save the two components and reload them.

P.S.: Once PR #14016 is merged, the equivalence tests need to be reworked in order to pass.
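
For context, here is a hedged, self-contained variant of the construction order shown in the hunk above (the BERT classes and randomly initialized configs are chosen purely for illustration and are not taken from the PR):

from transformers import BertConfig, TFBertModel, TFBertLMHeadModel, TFEncoderDecoderModel

encoder = TFBertModel(BertConfig())
decoder = TFBertLMHeadModel(BertConfig(is_decoder=True, add_cross_attention=True))

# Running the dummy inputs builds the weights under each component's own name
# scope rather than under the composite model's name, so the two components can
# later be saved and reloaded separately.
encoder(encoder.dummy_inputs)
decoder(decoder.dummy_inputs)
tf_model = TFEncoderDecoderModel(encoder=encoder, decoder=decoder)

# Quick check of the weight-name prefixes described above; per the explanation,
# no encoder weight should start with the composite model's name.
print(any(w.name.startswith(tf_model.name) for w in encoder.weights))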

# self.assertTrue(config.hidden_size != decoder_config.hidden_size)
# self.check_equivalence_pt_to_tf(config, decoder_config, inputs_dict)
# self.check_equivalence_tf_to_pt(config, decoder_config, inputs_dict)

@ydshieh (Collaborator Author)

There is no easy way to deal with enc_to_dec_proj for TF composite models with regard to checkpoint loading when we need to load the encoder/decoder components separately.

if (
    self.encoder.config.hidden_size != self.decoder.config.hidden_size
    and self.decoder.config.cross_attention_hidden_size is None
):

@ydshieh ydshieh Nov 13, 2021


Made this block the same as in other encoder/decoder models.

if (
    self.encoder.config.hidden_size != self.decoder.config.hidden_size
    and self.decoder.config.cross_attention_hidden_size is None
):

@ydshieh ydshieh Nov 13, 2021


Made this block the same as in other encoder/decoder models.

f"No `encoder_model` is passed to kwargs: {kwargs_encoder}. "
f"In this case make sure that `encoder_pretrained_model_name_or_path` defined"
"If `encoder_model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has "
"to be defined."
@ydshieh (Collaborator Author)

Changed it to be the same as the corresponding occurrence in the other encoder-decoder models (a sketch of the harmonized check follows).
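
For reference, a hedged sketch of what the harmonized check looks like; _check_encoder_args is a hypothetical helper used only for this illustration (the message mirrors the one quoted above), not a real function in the library:

def _check_encoder_args(encoder_model=None, encoder_pretrained_model_name_or_path=None):
    # Mirror of the harmonized error message used across the composite models.
    if encoder_model is None and encoder_pretrained_model_name_or_path is None:
        raise ValueError(
            "If `encoder_model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has "
            "to be defined."
        )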

@ydshieh ydshieh changed the title Use cross_attention_hidden_size in Encoder-Decoder models [WIP] Use cross_attention_hidden_size in Encoder-Decoder models Nov 14, 2021
@ydshieh ydshieh marked this pull request as draft November 14, 2021 12:46
@ydshieh ydshieh changed the title [WIP] Use cross_attention_hidden_size in Encoder-Decoder models Use cross_attention_hidden_size in Encoder-Decoder models Nov 14, 2021
@ydshieh ydshieh marked this pull request as ready for review November 14, 2021 15:07

@NielsRogge NielsRogge left a comment


LGTM! Thanks for adding this consistency.


patrickvonplaten commented Nov 29, 2021

Hey @ydshieh,

Sadly, we need to slightly update this PR for the speech encoder-decoder classes so that it is compatible with the newly introduced variable config.output_hidden_size, shown here:

self.encoder_output_dim = getattr(config.encoder, "output_hidden_size", config.encoder.hidden_size)

The other files can stay the same :-)


ydshieh commented Nov 29, 2021

> Hey @ydshieh,
>
> Sadly, we need to slightly update this PR for the speech encoder-decoder classes so that it is compatible with the newly introduced variable config.output_hidden_size, shown here:
>
> self.encoder_output_dim = getattr(config.encoder, "output_hidden_size", config.encoder.hidden_size)
>
> The other files can stay the same :-)

No problem, @patrickvonplaten. But I have a slight doubt about this line:

self.enc_to_dec_proj = nn.Linear(self.encoder.config.hidden_size, self.decoder.config.hidden_size)

Should it be

self.enc_to_dec_proj = nn.Linear(self.encoder_output_dim, self.decoder.config.hidden_size)

if config.output_hidden_size is introduced in the config and used here? I haven't gone through the speech model in detail, but it looks more natural to do so.
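
For illustration, a hedged sketch of the variant being asked about (the configs and sizes below are chosen for demonstration only, not taken from the PR): derive the encoder's output dimension with the getattr fallback quoted above, then build the projection from it rather than from hidden_size directly.

from torch import nn
from transformers import BertConfig, Wav2Vec2Config

# A speech encoder config that defines output_hidden_size (relevant when an
# adapter changes the encoder's output dimension); the values are illustrative.
config_encoder = Wav2Vec2Config(add_adapter=True, output_hidden_size=512)
config_decoder = BertConfig(hidden_size=1024)

encoder_output_dim = getattr(config_encoder, "output_hidden_size", config_encoder.hidden_size)
enc_to_dec_proj = nn.Linear(encoder_output_dim, config_decoder.hidden_size)
print(enc_to_dec_proj)  # Linear(in_features=512, out_features=1024, bias=True)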

@ydshieh ydshieh force-pushed the add_cross_attention_hidden_size branch from f5c0df5 to 7b9d31a on November 29, 2021 at 17:23

ydshieh commented Nov 29, 2021

I made the necessary updates where config.output_hidden_size is involved.
I didn't change the line

self.enc_to_dec_proj = nn.Linear(self.encoder.config.hidden_size, self.decoder.config.hidden_size)

despite the slight doubt mentioned above.


ydshieh commented Nov 29, 2021

(Fixed)

The failing TF/PyTorch test is due to #14016 being merged into master (and I rebased this PR onto master), which is expected. I will take care of this issue.

@ydshieh ydshieh force-pushed the add_cross_attention_hidden_size branch from d78af6a to f178e1d on December 2, 2021 at 16:11

@patrickvonplaten patrickvonplaten left a comment


Thanks a lot for working on this!

@patrickvonplaten patrickvonplaten merged commit 4cdb67c into huggingface:master Dec 6, 2021
@ydshieh ydshieh deleted the add_cross_attention_hidden_size branch May 5, 2022 10:36