Improve has_similar_generate_outputs assertions#44166
Conversation
|
run-slow: dia, gemma3n, glm_ocr, kosmos2, kyutai_speech_to_text, t5gemma, t5gemma2 |
|
This comment contains models: ["models/dia", "models/gemma3n", "models/glm_ocr", "models/kosmos2", "models/kyutai_speech_to_text", "models/t5gemma", "models/t5gemma2"] |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
The CI does not output useful info on this flaky test - `tests.models.olmo.test_modeling_olmo.OlmoModelTest testMethod=test_generate_with_static_cache`
and makes it harder to determine the root problem when not reproducible locally.
This patch improves this assertion, so we get full details when it fails.
Before (CI output):
```
AssertionError: False is not true
```
After (CI output would look like):
```
AssertionError: Generate outputs are not similar enough (atol=1e-05, rtol=1e-05).
Sequence mismatch: 3/20 tokens differ (first at position 12).
Batch index: 0
Token at mismatch — output_1: 1542, output_2: 8903
Score diff at first mismatch — max: 2.384186e-04, mean: 1.127943e-05
```
This will tell us immediately whether it's a tolerance issue (small diffs
suggesting we might just need a slightly larger atol), a completely different
generation path, or a specific batch element problem.
While I've changed all the callers in our tests, I kept the original function
intact for backward compact (even though it might be overkill)
38f9997 to
1209e55
Compare
Rocketknight1
left a comment
There was a problem hiding this comment.
One nit about the docstrings but this seems ready!
|
run-slow: dia, gemma3n, glm_ocr, kosmos2, kyutai_speech_to_text, t5gemma, t5gemma2 |
|
This comment contains models: ["models/dia", "models/gemma3n", "models/glm_ocr", "models/kosmos2", "models/kyutai_speech_to_text", "models/t5gemma", "models/t5gemma2"] |
|
[For maintainers] Suggested jobs to run (before merge) run-slow: dia, gemma3n, glm_ocr, kosmos2, kyutai_speech_to_text, t5gemma, t5gemma2 |
This patch improves the `has_similar_generate_outputs` assertion, so we get full details when it fails.
Before (CI output):
```
AssertionError: False is not true
```
After (CI output would look like):
```
AssertionError: Generate outputs are not similar enough (atol=1e-05, rtol=1e-05).
Sequence mismatch: 3/20 tokens differ (first at position 12).
Batch index: 0
Token at mismatch — output_1: 1542, output_2: 8903
Score diff at first mismatch — max: 2.384186e-04, mean: 1.127943e-05
```
This will tell us immediately whether it's a tolerance issue (small diffs
suggesting we might just need a slightly larger atol), a completely different
generation path, or a specific batch element problem.
The CI does not output useful info on this flaky test -
tests.models.olmo.test_modeling_olmo.OlmoModelTest testMethod=test_generate_with_static_cacheand makes it harder to determine the root problem when not reproducible locally.This patch improves this assertion, so we get full details when it fails.
Before (CI output):
After (CI output would look like):
This will tell us immediately whether it's a tolerance issue (small diffs suggesting we might just need a slightly larger atol), a completely different generation path, or a specific batch element problem.
While I've changed all the callers in our tests, I kept the original function intact for backward compact (even though it might be overkill)