
Too Long Problems #22

Open
MT010104 opened this issue Nov 20, 2022 · 8 comments

Comments

@MT010104

MT010104 commented Nov 20, 2022

[screenshot attached]
There are some long problems in APPS, so I truncated them after encoding. But the model's output takes the form "problem + answer", so the output is necessarily longer than the input. The output's max_length is set to 1024 - len(input_ids). In practice, for the output to fit, the input must be much shorter than 1024; otherwise we won't get a complete answer even when no error is raised. Is that right? Also, why is the output's max_length set to 1024 - len(input_ids)?
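The arithmetic behind the question can be sketched as follows (a hypothetical helper, not code from the APPS repo; `CONTEXT` stands for GPT-2's 1024-token window):

```python
# Sketch of the output-budget arithmetic discussed above. With a 1024-token
# context window, setting max_length = 1024 leaves only
# (1024 - len(input_ids)) tokens for the generated answer.
CONTEXT = 1024  # GPT-2's maximum sequence length

def output_budget(input_len: int, context: int = CONTEXT) -> int:
    """Tokens left for generation when max_length equals the context size."""
    return max(context - input_len, 0)

# A short problem leaves plenty of room for the answer...
assert output_budget(200) == 824
# ...but a near-limit problem leaves almost none, so the answer gets cut off.
assert output_budget(1000) == 24
```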

@xksteven
Collaborator

Where in the code is this? It's hard to search from pictures.

@MT010104
Author

generate_gpt_codes.py #176

@xksteven
Collaborator

xksteven commented Dec 1, 2022

We can make that into a variable that gets passed in, and possibly throw a warning when the value is less than some threshold, say 10. Let me know if you think this would be a good compromise.

@MT010104
Author

MT010104 commented Dec 2, 2022

[screenshot attached]
Sorry, I couldn't make sense of it. Long inputs lead to longer outputs. But if “max_length” is set larger such as 2048, the same error will occur.

@xksteven
Collaborator

xksteven commented Dec 2, 2022

Do you have a proposal then?

Some models have a max input size that needs to be respected.

You can consider preprocessing your inputs to have fewer tokens.

Open to alternative suggestions.
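The preprocessing suggestion could look like the following (a minimal sketch with a hypothetical helper; real code would operate on tokenizer output rather than a plain list):

```python
def shorten_problem(tokens: list, reserve: int, context: int = 1024) -> list:
    """Drop tokens from the start of the problem so that `reserve` tokens
    remain available for the generated solution (illustrative only)."""
    keep = context - reserve
    return tokens[-keep:] if len(tokens) > keep else tokens

# A 1500-token problem shortened to leave 512 tokens of generation room:
tokens = [f"t{i}" for i in range(1500)]
shortened = shorten_problem(tokens, reserve=512)
assert len(shortened) == 512  # 1024 - 512
# Short problems pass through unchanged.
assert shorten_problem(tokens[:100], reserve=512) == tokens[:100]
```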

@NUAAZXY

NUAAZXY commented Dec 2, 2022

I encountered the same problem. In the end, I set truncation=True while encoding, but will this cause other problems?
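One thing to watch with truncation=True: Hugging Face tokenizers truncate from the end by default (truncation_side="right"), so the tail of the problem statement, often the examples and constraints, is what gets dropped. A plain-list sketch of the two behaviours (illustrative only, not the HF implementation):

```python
def truncate(tokens, max_len, side="right"):
    """Mimic tokenizer truncation: keep at most max_len tokens, dropping
    from the chosen side (sketch, not the Hugging Face implementation)."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[:max_len] if side == "right" else tokens[-max_len:]

tokens = list(range(10))
assert truncate(tokens, 4, side="right") == [0, 1, 2, 3]  # drops the ending
assert truncate(tokens, 4, side="left") == [6, 7, 8, 9]   # drops the beginning
```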

@MT010104
Author

MT010104 commented Dec 2, 2022

I used the data provided by APPS and did not change it. As far as I know, GPT-2 has a max input size of 1024 and GPT-Neo has a max input size of 2048, but neither imposes an explicit requirement on the output. If there is really no way around it, we may simply discard these very long questions.
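Discarding the over-long questions could be as simple as a filter over encoded lengths (a sketch; `tokenize` is a placeholder for a real tokenizer, str.split here for illustration):

```python
def filter_long(problems, tokenize, limit=1024):
    """Keep only problems whose encoded length fits in the context window
    (hypothetical helper, not from the APPS repo)."""
    return [p for p in problems if len(tokenize(p)) <= limit]

problems = ["word " * 2000, "short problem"]
kept = filter_long(problems, str.split)  # str.split stands in for a tokenizer
assert kept == ["short problem"]
```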

@xksteven
Collaborator

Sorry for the confusion.
We used GPT-Neo, which has a max length of 2048, so by capping the input to 1024 tokens it still had 1024 tokens of room to output a solution.
For GPT-2 I can see how this is an issue; I'll have to double-check how we evaluated that. We had to truncate from the beginning of the questions and set the max input length to be smaller.
