
Generate: slow assisted generation test #23125

Merged — gante merged 3 commits into huggingface:main from assisted_flakyness on May 3, 2023

Conversation

@gante (Member) commented on May 3, 2023

What does this PR do?

test_assisted_decoding_matches_greedy_search fails once in a while, which blocks development. This PR removes the blocker by moving it to a slow test.
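For context, a minimal sketch of what moving the test behind the slow decorator looks like, assuming the standard `transformers.testing_utils.slow` decorator (which skips the test unless the `RUN_SLOW` environment variable is set); this is illustrative, not the exact diff in this PR:

```python
# Illustrative sketch -- not the exact change in this PR.
from transformers.testing_utils import slow


class GenerationTesterMixin:
    @slow  # skipped on PR CI; only runs when RUN_SLOW is set (e.g. in the daily CI)
    def test_assisted_decoding_matches_greedy_search(self):
        ...
```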

Why a slow test (and not redesign the test or add the flaky decorator)?

  1. It is impossible to remove 100% of the non-determinism in this test. Some form of masking has to be used by design, which means there is always a chance for the generations to diverge. When the generated sequences do diverge, the scores should be very similar at the step where they diverge, since the mismatches are caused by the very small values within the numerical attention masks.
  2. I've tried to add the check above (when the sequences diverge, the scores should be similar; a rough sketch of such a check follows this list)... but some models still failed that check quite hard when the sequences didn't match. Some well-established models run it without observed failures (e.g. 10k runs on GPT2 = 0 sequence mismatches). Others, especially roberta-based models, fail at a high rate. This means I need to explore it further.
  3. Since there is something to be explored, I believe the slow decorator is more appropriate: we can track failures without risking red CI on PRs.
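A rough sketch of the check described in point 2, with hypothetical names and an illustrative tolerance (not the code that was actually tried):

```python
import torch


def check_scores_at_divergence(greedy_tokens, assisted_tokens, greedy_scores, assisted_scores, atol=1e-3):
    # greedy_tokens / assisted_tokens: 1D tensors of newly generated token ids,
    # so index i lines up with entry i of the per-step score tuples.
    num_steps = min(len(greedy_tokens), len(assisted_tokens))
    for step in range(num_steps):
        if greedy_tokens[step] != assisted_tokens[step]:
            # At the first diverging step the score distributions should still be
            # nearly identical: the mismatch should only come from the tiny
            # numerical differences introduced by the attention masking.
            torch.testing.assert_close(
                greedy_scores[step], assisted_scores[step], atol=atol, rtol=0.0
            )
            return
```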

@gante gante requested review from sgugger and ydshieh May 3, 2023 10:26
@@ -397,10 +397,6 @@ class RobertaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMi
)
fx_compatible = True

@unittest.skip(reason="Fix me @gante")
def test_assisted_greedy_search_matches_greedy_search(self):
@gante (Member Author) commented on this change:

(this one was skipped because it was flaky)

@HuggingFaceDocBuilderDev commented on May 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh (Collaborator) left a comment:

I know that we want to keep the PR CI clean, but I am not a super fan of seeing the daily CI report be flaky either. It has been clear that it's hard to focus on the daily CI report, although things got a bit better with a recent PR that adds a diff between 2 PRs.

OK for me, and we can discuss/find a way to deal with this flaky situation 🔥 .

@@ -1457,6 +1457,7 @@ def test_contrastive_generate_dict_outputs_use_cache(self):
for output in (output_contrastive, output_generate):
self._check_outputs(output, input_ids, model.config, use_cache=True)

@slow # TODO(Joao): remove this. Some models (e.g. data2vec, xcom, roberta) have an error rate between 1 and 10%.
@ydshieh (Collaborator) commented on this line on May 3, 2023:

Just wondering: since this test uses batch size 1 and a short sequence, could we run it 10 times and check the results against a ratio, similar to #22996?

(well, it depends on how fast one test run is, however)
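A minimal sketch of that repeat-and-check-ratio idea (the run count and allowed failure ratio below are illustrative assumptions, not values taken from #22996):

```python
def assert_mostly_passes(test_fn, num_runs=10, max_failure_ratio=0.1):
    # Re-run a mildly flaky test and only fail if failures are too frequent.
    failures = 0
    for _ in range(num_runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    assert failures / num_runs <= max_failure_ratio, (
        f"{failures}/{num_runs} runs failed (allowed ratio: {max_failure_ratio})"
    )
```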

@gante (Member Author) replied:

The test is quite fast -- running it on all models takes ~14 seconds on my machine. With a low flaky ratio (<1% on average), repeating the test would not be expensive.

Nevertheless, since I found that some specific models are the cause, I'd like to defer adding repetition until after I explore the issue :D

@sgugger (Collaborator) left a comment:

Does the error persist with non-random models? Maybe the test will be less flaky if done on a pretrained checkpoint (and since we are marking it as slow, it would be OK to use those).

@sgugger (Collaborator) left another comment:

Good for me to mark them as slow in any case, while investigating this further.

@gante (Member Author) commented on May 3, 2023:

> Maybe the test will be less flaky if done on a pretrained checkpoint

Definitely! However, I think there is a deeper problem here: the logits diverge far more than I'd expect on some models, and it's odd that those models rely on the same base code (roberta). After I finish preparing the release for assisted generation, I'll get back to sorting out the related bugs.

gante merged commit ce31e3c into huggingface:main on May 3, 2023
4 checks passed
gante deleted the assisted_flakyness branch on May 3, 2023, 13:24
qywu pushed a commit to qywu/transformers that referenced this pull request May 3, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023