Generate: slow assisted generation test #23125
Conversation
@@ -397,10 +397,6 @@ class RobertaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin
    )
    fx_compatible = True

    @unittest.skip(reason="Fix me @gante")
    def test_assisted_greedy_search_matches_greedy_search(self):
(this one was skipped because it was flaky)
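The skip pattern in the hunk above can be sketched in isolation. This is a hypothetical minimal test class (the real test lives in `RobertaModelTest` in the transformers test suite); it shows that a `@unittest.skip`-decorated test is collected and reported as skipped rather than run:

```python
import unittest

# Hypothetical stand-in for the real test class; the skipped body never executes.
class ExampleGenerationTest(unittest.TestCase):
    @unittest.skip("Fix me @gante")  # temporarily disabled because it is flaky
    def test_assisted_greedy_search_matches_greedy_search(self):
        self.fail("this body never executes while the skip is in place")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleGenerationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The test still shows up in the run (`result.testsRun == 1`), so the skip stays visible in CI output instead of silently disappearing.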
I know that we want to keep the PR CI clean, but I am not a big fan of seeing the daily CI report being flaky either. It has been clear that it's hard to focus on the daily CI report, although things got a bit better with a recent PR that adds a diff between 2 reports.
OK for me, and we can discuss/find a way to deal with this flaky situation 🔥 .
@@ -1457,6 +1457,7 @@ def test_contrastive_generate_dict_outputs_use_cache(self):
    for output in (output_contrastive, output_generate):
        self._check_outputs(output, input_ids, model.config, use_cache=True)

    @slow  # TODO(Joao): remove this. Some models (e.g. data2vec, xcom, roberta) have an error rate between 1 and 10%.
Just wondering: since this test uses batch size 1 and a short sequence, could we run it 10 times and check the results against a ratio, similar to #22996?
(Well, it depends on how fast one test run is, however.)
The test is quite fast -- running it on all models takes ~14 seconds on my machine. With a low flaky ratio (<1% on average), repeating the test would not be expensive.
Nevertheless, since I found that some specific models are the cause, I'd like to defer the repetition approach until after I explore the issue :D
Does the error persist with non-random models? Maybe the test would be less flaky if run on a pretrained checkpoint (and since we are marking it as slow, it would be OK to use those).
Good for me to mark them as slow in any case while investigating this further.
Definitely! However, I think there is a deeper problem here: the logits diverge far more than I'd expect on some models, and it's odd that those models rely on the same base code (roberta). After I finish preparing the release for assisted generation, I'll get back to sorting out the related bugs.
What does this PR do?
test_assisted_decoding_matches_greedy_search fails once in a while, which blocks development. This PR removes the blocker by moving it to a slow test.
Why a slow test (and not a redesign of the test, or the flaky decorator)?
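The `@slow` marker used throughout this PR can be sketched as an opt-in, environment-gated skip. This is a simplified stand-alone sketch, not the actual transformers implementation, and the `RUN_SLOW` variable name is an assumption here:

```python
import os
import unittest

def slow(test_case):
    # Sketch of a `slow` marker: the test only runs when the (assumed)
    # RUN_SLOW environment variable is explicitly set to "1".
    return unittest.skipUnless(
        os.environ.get("RUN_SLOW", "0") == "1",
        "test is slow; set RUN_SLOW=1 to run it",
    )(test_case)

class ExampleTest(unittest.TestCase):
    @slow
    def test_assisted_decoding_matches_greedy_search(self):
        self.assertTrue(True)  # placeholder body for the sketch

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With this gating, the flaky test no longer blocks the fast PR CI but still runs in the opt-in slow suite, where failures can be tracked without blocking development.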