Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker
left a comment
Let's update the doc to add justification!
ArthurZucker
left a comment
TY! Let's make it explicit in comments or docs with a justification maybe?
vasqu
left a comment
Similar problem as we had with `test_torch_compile_for_training`; can you take a look at `test_generate_compile_model_forward_fullgraph`?
Forcing `batched_mm` or changing the dtype should solve it (although we do compare outputs, so not sure if it would introduce flakiness).
docs/source/en/experts_interface.md
Outdated
| `"grouped_mm"` | Orders tokens by selected experts and uses `torch._grouped_mm` to project all tokens in a single grouped GEMM (requires PyTorch 2.9+). |

`batched_mm` is fastest for very small inputs and compilation speeds it up further. `grouped_mm` performs best for larger inputs.

On GPU:
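To make the `grouped_mm` row above concrete, here is a minimal NumPy sketch of the token-ordering trick it describes. This is an illustrative stand-in, not the actual `torch._grouped_mm` kernel: tokens are sorted so that all tokens routed to the same expert become contiguous, each contiguous group is projected with that expert's weight matrix (a real grouped GEMM does all groups in one kernel launch), and the results are scattered back to the original order. The function name and shapes are hypothetical.

```python
import numpy as np

def grouped_mm_reference(tokens, expert_ids, expert_weights):
    """Reference sketch of the token ordering behind a grouped GEMM.

    tokens:         (num_tokens, hidden)       activations routed to experts
    expert_ids:     (num_tokens,)              selected expert per token
    expert_weights: (num_experts, hidden, out) one projection per expert
    """
    # 1. Sort tokens so tokens of the same expert are contiguous.
    order = np.argsort(expert_ids, kind="stable")
    sorted_tokens = tokens[order]
    sorted_ids = expert_ids[order]

    # 2. Project each contiguous group with its expert's weights.
    #    A grouped GEMM kernel performs all these matmuls in one launch.
    out = np.empty((tokens.shape[0], expert_weights.shape[2]), dtype=tokens.dtype)
    start = 0
    for e in range(expert_weights.shape[0]):
        count = int((sorted_ids == e).sum())
        out[start:start + count] = sorted_tokens[start:start + count] @ expert_weights[e]
        start += count

    # 3. Scatter results back to the original token order.
    unsorted = np.empty_like(out)
    unsorted[order] = out
    return unsorted
```

The output matches projecting each token individually with its selected expert; the grouped layout just avoids launching one GEMM per expert.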
i think it'd be cleaner to add two separate columns to the table for GPU and CPU, and then you can add the relevant comments for each implementation. makes it easier to quickly scan as well!
aah makes sense! hope it won't get crowded when rendered
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Force-pushed from 8d082da to 14b8e0f
Force-pushed from 5524dea to 3134916
thanks @stevhliu I updated the table and left one note about the decode-stage optimization on GPU. @vasqu I switched to bf16 on CPU + `grouped_mm` + compile; imo it's better to test `grouped_mm` on CPU here because it's what a user will get by default. Switching to `batched_mm` would pass the tests but wouldn't catch errors in the default CPU path. wdyt?
vasqu
left a comment
Yes ok, let's move to bf16, but gotta keep an eye out for whether it does indeed produce flakiness / failing tests
stevhliu
left a comment
one last nit, otherwise lgtm! 😄
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43438&sha=1d01c7
What does this PR do?
Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.