
Some questions about constrained decoding #20

Closed
rela0426 opened this issue Jun 26, 2022 · 4 comments

Comments

@rela0426

rela0426 commented Jun 26, 2022

Hello, Mr. Lu. In the constrained decoding algorithm, there is a check that is not clear to me. Could you help explain it?

def check_state(self, tgt_generated):
    # Treat a trailing pad token as the signal that decoding is just starting.
    if tgt_generated[-1] == self.tokenizer.pad_token_id:
        return 'start', -1

Here, tgt_generated[-1] == self.tokenizer.pad_token_id means 'start'. Why? Can we substitute decoder_start_token_id for self.tokenizer.pad_token_id, or just use the value 0?

In my opinion, if tgt_generated[-1] == self.tokenizer.pad_token_id, the last token is a pad token, so generation should be entering the end phase rather than the start phase. Would it therefore be better to detect the start of generation with decoder_start_token_id? Is that right?
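For concreteness, the variant I have in mind would look something like this (just a sketch; I am assuming the decoder start id is available, e.g. from the model config):

def check_state(self, tgt_generated):
    # Hypothetical variant: treat the configured decoder start token,
    # not the pad token, as the marker for the start of generation.
    if tgt_generated[-1] == self.decoder_start_token_id:
        return 'start', -1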

@luyaojie
Owner

luyaojie commented Jun 26, 2022

Hi,

This is because T5 uses pad_token_id as the start token when generating decoder_input_ids, and eos_token as the end token.
For other tokenizers, decoder_start_token_id is the better choice.
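You can confirm this quickly with HuggingFace transformers (illustrative snippet):

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
# T5 sets decoder_start_token_id == pad_token_id == 0, so the pad token
# doubles as the decoder start token.
print(model.config.decoder_start_token_id)  # 0
print(model.config.pad_token_id)            # 0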

@rela0426
Author

> Hi,
>
> This is because T5 uses pad_token_id as the start token when generating decoder_input_ids. For other tokenizers, decoder_start_token_id is the better choice.

I use XLMRobertaTokenizer; it takes cls_token as the start token, sep_token as the end token, and pad_token as the padding token. To adapt it to T5Model, I added:

# Map the XLM-R special tokens onto the T5 config
config.eos_token_id = tokenizer.eos_token_id
config.pad_token_id = tokenizer.pad_token_id

Does that mean my check should be tgt_generated[-1] == self.tokenizer.cls_token_id?

At present, the program runs without the constrained decoding algorithm and the results are OK; with the constrained decoding algorithm, the F1 score is 0. What could be the problem?

@luyaojie
Owner

I think there is no need to add config.eos_token_id = tokenizer.eos_token_id.

You can rewrite the constrained decoding based on the XLMRobertaTokenizer since, as you stated, cls_token is the start token and sep_token is the end token.
Modifying the special symbols in the original constrained-decoding code should work.
For example, change type_start/type_end/pad_token_id/eos_token_id according to the generation state and the XLMRobertaTokenizer, as in the sketch below.
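A rough sketch of that substitution (the XLM-R side is my assumption; match the variable names to the actual constrained-decoding code):

# Hypothetical remapping of the special ids used by the decoding state machine.
# T5: pad_token_id starts decoding, eos_token_id ends it.
# XLM-R: cls_token_id starts decoding, sep_token_id ends it.
start_token_id = tokenizer.cls_token_id  # replaces T5's pad_token_id
end_token_id = tokenizer.sep_token_id    # replaces T5's eos_token_id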

For the F=0 problem, it is better to analyze the generated content.
For example, one possibility is that all generations are empty (no event): <extra_id_0> <extra_id_1>.
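One way to inspect them (a sketch, assuming a HuggingFace tokenizer and a batch of generated ids):

# Hypothetical debugging helper: decode the predictions and count how many
# collapse to the empty structure "<extra_id_0> <extra_id_1>".
# (Spacing after decode may vary; normalize before matching if needed.)
def count_empty_generations(tokenizer, generated_ids):
    decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
    empty = sum("<extra_id_0> <extra_id_1>" in text for text in decoded)
    print(f"{empty} / {len(decoded)} generations are empty")
    return decoded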

@rela0426
Copy link
Author

> I think there is no need to add config.eos_token_id = tokenizer.eos_token_id.
>
> You can rewrite the constrained decoding based on the XLMRobertaTokenizer since, as you stated, cls_token is the start token and sep_token is the end token. Modifying the special symbols in the original constrained-decoding code should work. For example, change type_start/type_end/pad_token_id/eos_token_id according to the generation state and the XLMRobertaTokenizer.
>
> For the F=0 problem, it is better to analyze the generated content. For example, one possibility is that all generations are empty (no event): <extra_id_0> <extra_id_1>.

Thank you for your analysis. I think I have some ideas now!
Wish you every success in your work!
