
Conversation

ahadnagy (Contributor)

What does this PR do?

This PR adds the new collated reports job to the Nvidia CI as well, producing reports like this one: https://huggingface.co/datasets/optimum-amd/transformers_daily_ci/blob/main/2025-08-25/runs/39-17221003312/ci_results_run_models_gpu/collated_reports_e68146f.json

This is needed to easily compare test results between platforms.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

job: run_models_gpu
report_repo_id: ${{ inputs.report_repo_id }}
gpu_name: ${{ inputs.runner_type }}
machine_type: ${{ matrix.machine_type }}
ahadnagy (Contributor Author)

Not really sure about this line.

Member

It's single-gpu or multi-gpu, no?

ahadnagy (Contributor Author)

That should be it, yes. But I'm not sure "matrix" works in this context.

Member

Oh yeah, I see it now. Move the collated reports section inside .github/workflows/model_jobs.yml instead 👍
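
For context, a minimal sketch of what the collated reports step could look like once moved into .github/workflows/model_jobs.yml, where the matrix context is actually available. The action path, runner labels, and input declarations here are assumptions for illustration, not taken from this PR:

on:
  workflow_call:
    inputs:
      report_repo_id:
        type: string
        required: true
      runner_type:
        type: string
        required: true

jobs:
  run_models_gpu:
    strategy:
      matrix:
        machine_type: [single-gpu, multi-gpu]
    runs-on: [self-hosted, nvidia-gpu]   # illustrative runner labels
    steps:
      # ... existing model test steps would go here ...
      - name: Collated reports
        if: always()   # still upload a report when tests fail
        uses: ./.github/actions/collated-reports   # assumed local action path
        with:
          job: run_models_gpu
          report_repo_id: ${{ inputs.report_repo_id }}
          gpu_name: ${{ inputs.runner_type }}
          machine_type: ${{ matrix.machine_type }}   # matrix context resolves inside this job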

@Rocketknight1 (Member)

cc @ydshieh

@ivarflakstad (Member) left a comment

LGTM, but I'll let the CI master make the final call.

slack_report_channel: "#transformers-ci-past-future"
docker: huggingface/transformers-all-latest-torch-nightly-gpu
ci_event: Nightly CI
runner_type: "a10"
Member

@ydshieh you had an idea for how to get the gpu name dynamically, right?

ydshieh (Collaborator)

I'm too lazy to do anything here, so let's just keep it a10.
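
For reference, a minimal sketch of picking up the GPU name at runtime instead of hard-coding the runner type. The step id and action path are illustrative, not from this PR:

steps:
  - name: Detect GPU name
    id: gpu
    run: |
      # nvidia-smi prints one name per GPU; take the first one
      name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
      echo "name=${name}" >> "$GITHUB_OUTPUT"

  - name: Collated reports
    uses: ./.github/actions/collated-reports   # assumed local action path, as above
    with:
      gpu_name: ${{ steps.gpu.outputs.name }}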

ydshieh (Collaborator)

BTW, this is not the workflow you want to compare against; it runs against the torch nightly build.

What you want to compare against is .github/workflows/self-scheduled-caller.yml, I guess.


ydshieh (Collaborator) commented Aug 28, 2025

I am not a big fan of having this for Nvidia runs (I might change my mind in the future, but not now).

I thought this report was only for AMD when @ivarflakstad worked on it.

However, I am fine with having it run and seeing how it goes. Please keep that job working stably (although I guess the whole workflow still works even if that job fails).

BTW, what are those runs in

https://huggingface.co/datasets/optimum-amd/transformers_daily_ci/tree/main/2025-08-25/runs

with the small workflow run numbers (14-17197751843, 15-17208128961, etc.)? Which workflow uploaded them?

ydshieh (Collaborator) commented Aug 28, 2025

Before merging, please try to trigger a run (using a push event) to check that it works well, but with a small list of models.

Don't hesitate to reach out to me if you need info about how to do that.
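
A sketch of what such a push-triggered test run could look like in a caller workflow, assuming a temporary branch filter and a hypothetical input to shrink the model list; the actual mechanism in the transformers CI may differ:

on:
  push:
    branches:
      - test-collated-reports   # temporary test branch, removed before merge

jobs:
  model-ci:
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_models_gpu
      slack_report_channel: "#transformers-ci-dummy"            # hypothetical scratch channel
      ci_event: "Push CI (test)"
      report_repo_id: hf-internal-testing/transformers_daily_ci
      models_to_test: "bert,gpt2"                                # hypothetical input limiting the model list
    secrets: inherit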

ahadnagy (Contributor Author) commented Sep 2, 2025

Just as a paper trail, here's a successful run with a push trigger:
https://huggingface.co/datasets/hf-internal-testing/transformers_daily_ci/tree/main/2025-09-02/runs/1277-17398796375/ci_results_run_models_gpu

7510f4a

@ydshieh (Collaborator) left a comment

Thanks!

@ahadnagy merged commit 8c60a7c into huggingface:main on Sep 2, 2025 (14 checks passed).