Add collated reports job to Nvidia CI #40470
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
.github/workflows/self-scheduled.yml
Outdated
job: run_models_gpu
report_repo_id: ${{ inputs.report_repo_id }}
gpu_name: ${{ inputs.runner_type }}
machine_type: ${{ matrix.machine_type }}
Not really sure about this line.
It's single-gpu or multi-gpu, no?
That should be it, yes. But I'm not sure "matrix" works in this context.
Oh yeah, I see it now. Move the collated reports section inside of github/workflows/model_jobs.yml instead 👍
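For context on the "matrix" question in this thread: in GitHub Actions, the matrix context is only populated inside a job that declares strategy.matrix. Here is a minimal sketch of a job where matrix.machine_type would resolve. The job name and the reusable workflow path are hypothetical and not taken from this PR.

```yaml
# Sketch only: the job name and the reusable workflow path are hypothetical.
jobs:
  collated_reports:
    strategy:
      matrix:
        machine_type: [single-gpu, multi-gpu]
    uses: ./.github/workflows/hypothetical-collated-reports.yml
    with:
      job: run_models_gpu
      report_repo_id: ${{ inputs.report_repo_id }}
      gpu_name: ${{ inputs.runner_type }}
      # Resolves because this job declares strategy.matrix above.
      machine_type: ${{ matrix.machine_type }}
```

A job without its own strategy.matrix has nothing to read from, which is presumably why moving the section into a workflow that already runs per machine type was suggested.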
cc @ydshieh
Lgtm but I'll let the CI master make the final call
slack_report_channel: "#transformers-ci-past-future"
docker: huggingface/transformers-all-latest-torch-nightly-gpu
ci_event: Nightly CI
runner_type: "a10"
@ydshieh you had an idea for how to get the gpu name dynamically, right?
I am too lazy to do anything here, so just keep it a10
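For reference, one possible way to detect the GPU name at runtime instead of hardcoding it is sketched below. This is not necessarily the idea mentioned above. The step id and output name are hypothetical, and the sketch assumes nvidia-smi is available on the runner.

```yaml
# Sketch only: the step id and output name are hypothetical.
steps:
  - name: Detect GPU name
    id: gpu
    shell: bash
    run: |
      # Take the first GPU's name; multi-GPU runners print one line per GPU.
      name="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
      echo "gpu_name=${name}" >> "$GITHUB_OUTPUT"
  - name: Show detected GPU
    run: echo "Running on ${{ steps.gpu.outputs.gpu_name }}"
```

nvidia-smi reports the full product name, so it would likely need to be mapped to a short label such as a10 before being used as runner_type.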
BTW, this is not the workflow you want to compare. This workflow runs against the torch nightly build.
What you want to compare against is .github/workflows/self-scheduled-caller.yml, I guess.
I am not a big fan of having this for Nvidia runs (I might change my mind in the future, but not now). I thought this report was only for AMD, from when @ivarflakstad worked on it. However, I am fine with letting it run to see how it goes. Please keep that job stable (although I guess the whole workflow still works if that job fails). BTW, what are those runs at https://huggingface.co/datasets/optimum-amd/transformers_daily_ci/tree/main/2025-08-25/runs with small workflow run numbers? Which workflow uploaded them?
Before merging, please trigger a run (using a push event) with a small list of models to check that it works well. Don't hesitate to reach out to me if you need info on how to do that.
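As a rough illustration of the push-event approach, a workflow can be triggered temporarily from a test branch as sketched below. The branch name is hypothetical, and how the model list is shortened for such a run is not described in this thread, so it is left out.

```yaml
# Sketch only: the branch name is hypothetical; remove this trigger before merging.
on:
  push:
    branches:
      - my-collated-reports-test
```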
Just as a paper trail, here's a successful run with a push trigger:
Thanks!
What does this PR do?
This PR adds the new collated reports job to the Nvidia CI as well. It produces reports like this: https://huggingface.co/datasets/optimum-amd/transformers_daily_ci/blob/main/2025-08-25/runs/39-17221003312/ci_results_run_models_gpu/collated_reports_e68146f.json
This is required to easily compare test results between platforms.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.