
Generate: slow assisted generation test #23125

Merged — gante merged 3 commits into huggingface:main from assisted_flakyness on May 3, 2023

Conversation

@gante (Member) commented on May 3, 2023

What does this PR do?

test_assisted_decoding_matches_greedy_search fails once in a while, which blocks development. This PR removes the blocker by moving it to a slow test.
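For context, a minimal sketch of what moving the test behind the slow decorator looks like, assuming the standard `transformers.testing_utils.slow` decorator (which skips the test unless the `RUN_SLOW` environment variable is set); this is illustrative, not the exact diff in this PR:

```python
# Illustrative sketch -- not the exact change in this PR.
from transformers.testing_utils import slow


class GenerationTesterMixin:
    @slow  # skipped on PR CI; only runs when RUN_SLOW is set (e.g. in the daily CI)
    def test_assisted_decoding_matches_greedy_search(self):
        ...
```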

Why a slow test (and not redesign the test or add the flaky decorator)?

  1. It is impossible to remove 100% of the non-determinism in this test. Some form of masking has to be used by design, which means there is always a chance for the generations to diverge. When the generated sequences do diverge, the scores should be very similar at the step where they diverge, since the mismatches are caused by the very small values within the numerical attention masks.
  2. I've tried to add the check above (when the sequences diverge, the scores should be similar; a rough sketch of such a check follows this list)... but some models still failed that check quite hard when the sequences didn't match. Some well-established models run it without observed failures (e.g. 10k runs on GPT2 = 0 sequence mismatches). Others, especially roberta-based models, fail at a high rate. This means I need to explore it further.
  3. Since there is something to be explored, I believe the slow decorator is more appropriate: we can track failures without risking red CI on PRs.
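A rough sketch of the check described in point 2, with hypothetical names and an illustrative tolerance (not the code that was actually tried):

```python
import torch


def check_scores_at_divergence(greedy_tokens, assisted_tokens, greedy_scores, assisted_scores, atol=1e-3):
    # greedy_tokens / assisted_tokens: 1D tensors of newly generated token ids,
    # so index i lines up with entry i of the per-step score tuples.
    num_steps = min(len(greedy_tokens), len(assisted_tokens))
    for step in range(num_steps):
        if greedy_tokens[step] != assisted_tokens[step]:
            # At the first diverging step the score distributions should still be
            # nearly identical: the mismatch should only come from the tiny
            # numerical differences introduced by the attention masking.
            torch.testing.assert_close(
                greedy_scores[step], assisted_scores[step], atol=atol, rtol=0.0
            )
            return
```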

@gante gante requested review from sgugger and ydshieh May 3, 2023 10:26
@@ -397,10 +397,6 @@ class RobertaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMi
)
fx_compatible = True

@unittest.skip(reason="Fix me @gante")
def test_assisted_greedy_search_matches_greedy_search(self):
@gante (Member Author) commented on this change:

(this one was skipped because it was flaky)

@HuggingFaceDocBuilderDev commented on May 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh (Collaborator) left a comment:

I know that we want to keep the PR CI clean, but I am not a super fan of seeing the daily CI report be flaky either. It has been clear that it's hard to focus on the daily CI report, although things got a bit better with a recent PR that adds a diff between 2 PRs.

OK for me, and we can discuss/find a way to deal with this flaky situation 🔥 .

@@ -1457,6 +1457,7 @@ def test_contrastive_generate_dict_outputs_use_cache(self):
for output in (output_contrastive, output_generate):
self._check_outputs(output, input_ids, model.config, use_cache=True)

@slow # TODO(Joao): remove this. Some models (e.g. data2vec, xcom, roberta) have an error rate between 1 and 10%.
@ydshieh (Collaborator) commented on this line on May 3, 2023:

Just wondering: since this test uses batch size 1 and a short sequence, could we run it 10 times and check the results against a ratio, similar to #22996?

(well, it depends on how fast one test run is, however)
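A minimal sketch of that repeat-and-check-ratio idea (the run count and allowed failure ratio below are illustrative assumptions, not values taken from #22996):

```python
def assert_mostly_passes(test_fn, num_runs=10, max_failure_ratio=0.1):
    # Re-run a mildly flaky test and only fail if failures are too frequent.
    failures = 0
    for _ in range(num_runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    assert failures / num_runs <= max_failure_ratio, (
        f"{failures}/{num_runs} runs failed (allowed ratio: {max_failure_ratio})"
    )
```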

@gante (Member Author) replied:

The test is quite fast -- running it on all models takes ~14 seconds on my machine. With a low flaky ratio (<1% on average), repeating the test would not be expensive.

Nevertheless, since I found that some specific models are the cause, I'd like to defer adding repetition until after I explore the issue :D

@sgugger (Collaborator) left a comment:

Does the error persist with non-random models? Maybe the test will be less flaky if done on a pretrained checkpoint (and since we are marking it as slow, it would be OK to use those).

@sgugger (Collaborator) left another comment:

Good for me to mark them as slow in any case, while investigating this further.

@gante (Member Author) commented on May 3, 2023:

> Maybe the test will be less flaky if done on a pretrained checkpoint

Definitely! However, I think there is a deeper problem here: the logits diverge far more than I'd expect on some models, and it's odd that those models rely on the same base code (roberta). After I finish preparing the release for assisted generation, I'll get back to sorting out the related bugs.

gante merged commit ce31e3c into huggingface:main on May 3, 2023
4 checks passed
gante deleted the assisted_flakyness branch on May 3, 2023, 13:24
qywu pushed a commit to qywu/transformers that referenced this pull request May 3, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023