Merged
5 changes: 4 additions & 1 deletion .github/workflows/check_failed_tests.yml
@@ -21,6 +21,9 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
type: string


env:
@@ -87,7 +90,7 @@ jobs:
- name: Update clone
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Get target commit
working-directory: /transformers/utils
5 changes: 4 additions & 1 deletion .github/workflows/model_jobs.yml
@@ -18,6 +18,9 @@ on:
docker:
required: true
type: string
commit_sha:
required: false
type: string
report_name_prefix:
required: false
default: run_models_gpu
@@ -70,7 +73,7 @@ jobs:

- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
37 changes: 13 additions & 24 deletions .github/workflows/self-nightly-caller.yml
@@ -1,43 +1,32 @@
name: Self-hosted runner (nightly-ci)
Collaborator Author

Renamed to be more descriptive.


name: Nvidia CI with nightly torch

on:
repository_dispatch:
schedule:
- cron: "17 2 * * *"
# triggered when the daily scheduled Nvidia CI is completed.
# This way, we can compare the results more easily.
workflow_run:
workflows: ["Nvidia CI"]
branches: ["main"]
types: [completed]
Comment on lines +5 to +10
Collaborator Author

This workflow now runs after the scheduled CI with stable torch finishes, so it can compare its results against that run at the end.

push:
branches:
- run_nightly_ci*
- run_ci_with_nightly_torch*

jobs:
build_nightly_ci_images:
name: Build Nightly CI Docker Images
if: (github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_nightly_ci'))
build_nightly_torch_ci_images:
name: Build CI Docker Images with nightly torch
uses: ./.github/workflows/build-nightly-ci-docker-images.yml
secrets: inherit

model-ci:
name: Model CI
needs: [build_nightly_ci_images]
needs: build_nightly_torch_ci_images
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-past-future"
runner: ci
docker: huggingface/transformers-all-latest-torch-nightly-gpu
ci_event: Nightly CI
secrets: inherit

deepspeed-ci:
name: DeepSpeed CI
needs: [build_nightly_ci_images]
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#transformers-ci-past-future"
runner: ci
# test deepspeed nightly build with the latest release torch
docker: huggingface/transformers-pytorch-deepspeed-latest-gpu
ci_event: Nightly CI
working-directory-prefix: /workspace
report_repo_id: hf-internal-testing/transformers_daily_ci_with_torch_nightly
Collaborator Author

This is a fix: without it, the workflow is invalid.

commit_sha: ${{ github.event.workflow_run.head_sha || github.sha }}
Collaborator Author

github.event.workflow_run.head_sha --> used when triggered via workflow_run

github.sha --> used when triggered via push or other event types

secrets: inherit
Collaborator Author

It would be better to move this job to another caller workflow file, similar to this one, because it runs with stable torch but nightly deepspeed.
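
The commit selection in self-nightly-caller.yml (commit_sha: ${{ github.event.workflow_run.head_sha || github.sha }}) can be pictured with a minimal Python sketch. It is for illustration only and is not part of the PR: the function name resolve_commit_sha and the dict-shaped event argument are hypothetical, and the SHAs in the usage note are made up. Only the workflow_run.head_sha field and the github.sha fallback come from the diff.

```python
from typing import Any


def resolve_commit_sha(event: dict[str, Any], github_sha: str) -> str:
    """Mimic `${{ github.event.workflow_run.head_sha || github.sha }}`."""
    # The `workflow_run` key is only present when the trigger is `workflow_run`.
    head_sha = (event.get("workflow_run") or {}).get("head_sha")
    # Like `||` in an Actions expression, fall back when the left side is empty.
    return head_sha or github_sha
```

With a workflow_run payload, resolve_commit_sha({"workflow_run": {"head_sha": "abc123"}}, "def456") picks the upstream run's commit; with a push payload that has no workflow_run key, it falls back to the second argument.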

11 changes: 8 additions & 3 deletions .github/workflows/self-scheduled-caller.yml
@@ -1,13 +1,12 @@
name: Self-hosted runner (scheduled)

name: Nvidia CI
Collaborator Author

Renamed to be more descriptive.


on:
repository_dispatch:
schedule:
- cron: "17 2 * * *"
push:
branches:
- run_scheduled_ci*
- run_nvidia_ci*
workflow_dispatch:
inputs:
prev_workflow_run_id:
@@ -54,6 +53,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
Collaborator Author

This caller is triggered by a push or schedule event, not via workflow_run.

secrets: inherit

torch-pipeline:
@@ -65,6 +65,7 @@ jobs:
docker: huggingface/transformers-pytorch-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

example-ci:
@@ -76,6 +77,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

trainer-fsdp-ci:
@@ -87,6 +89,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

deepspeed-ci:
@@ -99,6 +102,7 @@ jobs:
ci_event: Daily CI
working-directory-prefix: /workspace
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

quantization-ci:
@@ -110,4 +114,5 @@ jobs:
docker: huggingface/transformers-quantization-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
21 changes: 14 additions & 7 deletions .github/workflows/self-scheduled.yml
@@ -1,4 +1,4 @@
name: Self-hosted runner (scheduled)
name: Nvidia CI (job definitions)

# Note that each job's dependencies go into a corresponding docker file.
#
@@ -28,6 +28,9 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
Collaborator Author

Made this input optional (required: false); below we use

commit_sha: ${{ inputs.commit_sha || github.sha }}

The caller needs to make sure it passes this whenever necessary.

type: string
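
How the optional input and its fallback travel through the nested reusable workflows can be sketched in Python. This is an illustration only, not code from the PR: the functions self_scheduled and model_jobs are hypothetical stand-ins for self-scheduled.yml and model_jobs.yml, and the sketch assumes that an omitted string input arrives as an empty string, which the || operator treats as falsy.

```python
def model_jobs(commit_sha: str, github_sha: str) -> str:
    """Stand-in for model_jobs.yml: returns the ref that `git checkout` would receive."""
    # git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
    return commit_sha or github_sha


def self_scheduled(commit_sha: str, github_sha: str) -> str:
    """Stand-in for self-scheduled.yml: forwards the resolved SHA to the job workflow."""
    # commit_sha: ${{ inputs.commit_sha || github.sha }}
    resolved = commit_sha or github_sha
    return model_jobs(commit_sha=resolved, github_sha=github_sha)
```

A caller that omits commit_sha ends up on github.sha at every level, while a caller that passes one (as the nightly caller does) has that value checked out instead.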


env:
@@ -46,8 +49,8 @@ env:

jobs:
setup:
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu", "run_quantization_torch_gpu"]'), inputs.job)
name: Setup
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu", "run_quantization_torch_gpu"]'), inputs.job)
strategy:
matrix:
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
@@ -119,6 +122,7 @@ jobs:
slice_id: ${{ matrix.slice_id }}
runner_map: ${{ needs.setup.outputs.runner_map }}
docker: ${{ inputs.docker }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
secrets: inherit

run_trainer_and_fsdp_gpu:
@@ -137,6 +141,7 @@ jobs:
slice_id: ${{ matrix.slice_id }}
runner_map: ${{ needs.setup.outputs.runner_map }}
docker: ${{ inputs.docker }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
report_name_prefix: run_trainer_and_fsdp_gpu
secrets: inherit

@@ -155,7 +160,7 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@@ -223,7 +228,7 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@@ -292,7 +297,7 @@ jobs:
steps:
- name: Update clone
working-directory: ${{ inputs.working-directory-prefix }}/transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: ${{ inputs.working-directory-prefix }}/transformers
@@ -400,7 +405,7 @@ jobs:

- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@@ -464,6 +469,7 @@ jobs:
uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
Member

slick 👌


- name: Install transformers
run: pip install transformers
@@ -518,6 +524,7 @@ jobs:
quantization_matrix: ${{ needs.setup.outputs.quantization_matrix }}
ci_event: ${{ inputs.ci_event }}
report_repo_id: ${{ inputs.report_repo_id }}
commit_sha: ${{ inputs.commit_sha || github.sha }}

secrets: inherit

@@ -528,7 +535,7 @@ jobs:
uses: ./.github/workflows/check_failed_tests.yml
with:
docker: ${{ inputs.docker }}
start_sha: ${{ github.sha }}
start_sha: ${{ inputs.commit_sha || github.sha }}
job: ${{ inputs.job }}
slack_report_channel: ${{ inputs.slack_report_channel }}
ci_event: ${{ inputs.ci_event }}
10 changes: 9 additions & 1 deletion .github/workflows/slack-report.yml
@@ -24,6 +24,10 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
type: string


env:
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@@ -41,6 +45,10 @@ jobs:
echo "Setup status: ${{ inputs.setup_status }}"

- uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}

- uses: actions/download-artifact@v4

- name: Prepare some setup values
@@ -67,7 +75,7 @@ jobs:
SLACK_REPORT_CHANNEL: ${{ inputs.slack_report_channel }}
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
CI_EVENT: ${{ inputs.ci_event }}
CI_SHA: ${{ github.sha }}
CI_SHA: ${{ inputs.commit_sha || github.sha }}
CI_TEST_JOB: ${{ inputs.job }}
SETUP_STATUS: ${{ inputs.setup_status }}
REPORT_REPO_ID: ${{ inputs.report_repo_id }}
6 changes: 3 additions & 3 deletions utils/notification_service.py
@@ -669,7 +669,7 @@ def payload(self) -> str:
"text": {
"type": "mrkdwn",
# TODO: We should NOT assume it's always Nvidia CI, but it's the case at this moment.
"text": f"*There are {nb_new_failed_tests} failed tests unique to {'this run' if not is_amd_daily_ci_workflow else 'AMD'}*\n\n(compared to Nvidia CI: <https://github.com/huggingface/transformers/actions/runs/{prev_workflow_run_id}|{prev_workflow_run_id}>)",
"text": f"*There are {nb_new_failed_tests} failed tests unique to this run*\n\n(compared to{' Nvidia CI ' if is_scheduled_ci_run else ' '}run: <https://github.com/huggingface/transformers/actions/runs/{prev_workflow_run_id}|{prev_workflow_run_id}>)",
Collaborator Author

This makes the logic more accurate and less complex.

},
"accessory": {
"type": "button",
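
The effect of the updated f-string in the hunk above is easier to see in isolation. The following is a minimal sketch, not the code in notification_service.py: the function name comparison_text is hypothetical, and the run id used in the usage note is made up. The message wording and the URL pattern are copied from the diff.

```python
def comparison_text(nb_new_failed_tests: int, prev_workflow_run_id: str, is_scheduled_ci_run: bool) -> str:
    """Build the mrkdwn text the same way as the updated f-string in the diff."""
    run_url = f"https://github.com/huggingface/transformers/actions/runs/{prev_workflow_run_id}"
    # Only mention "Nvidia CI" for scheduled runs; otherwise keep a single space.
    label = " Nvidia CI " if is_scheduled_ci_run else " "
    return (
        f"*There are {nb_new_failed_tests} failed tests unique to this run*\n\n"
        f"(compared to{label}run: <{run_url}|{prev_workflow_run_id}>)"
    )
```

For a scheduled run, comparison_text(3, "123", True) ends with "(compared to Nvidia CI run: <...|123>)"; for any other run the same call with False gives "(compared to run: <...|123>)". The earlier AMD-specific branch is gone, so the headline always says "this run".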
@@ -1406,13 +1406,13 @@ def pop_default(l: list[Any], i: int, default: Any) -> Any:
is_scheduled_ci_run = os.environ.get("GITHUB_EVENT_NAME") == "schedule"
# For AMD workflow runs: the different AMD CI callers (MI210/MI250/MI300, etc.) are triggered by `workflow_run`
# event of `.github/workflows/self-scheduled-amd-caller.yml`.
if is_amd_daily_ci_workflow:
if os.environ.get("GITHUB_EVENT_NAME") == "workflow_run":
Collaborator Author

This is no longer only related to AMD.

# Get the path to the file on the runner that contains the full event webhook payload.
event_payload_path = os.environ.get("GITHUB_EVENT_PATH")
# Load the event payload
with open(event_payload_path) as fp:
event_payload = json.load(fp)
# The event that triggers the `workflow_run` event.
# The event that triggers the original `workflow_run`.
if "workflow_run" in event_payload:
is_scheduled_ci_run = event_payload["workflow_run"]["event"] == "schedule"
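
The detection logic in this hunk can be restated as a small standalone sketch. It is an illustration under assumptions, not the actual code in notification_service.py: the helper names load_event_payload and is_scheduled_ci_run (as a function) are hypothetical, and the event name and payload are passed in as arguments rather than read from os.environ so the logic can be exercised without a runner.

```python
import json
from typing import Any, Mapping, Optional


def load_event_payload(path: str) -> dict[str, Any]:
    """Read the webhook payload file that GITHUB_EVENT_PATH points to."""
    with open(path) as fp:
        return json.load(fp)


def is_scheduled_ci_run(event_name: Optional[str], event_payload: Optional[Mapping[str, Any]] = None) -> bool:
    """Return True if this run, or the run that triggered it, was started by `schedule`."""
    scheduled = event_name == "schedule"
    if event_name == "workflow_run" and event_payload and "workflow_run" in event_payload:
        # For `workflow_run`, look at the event that started the upstream run.
        scheduled = event_payload["workflow_run"]["event"] == "schedule"
    return scheduled
```

On a runner this could be called as is_scheduled_ci_run(os.environ.get("GITHUB_EVENT_NAME"), load_event_payload(os.environ["GITHUB_EVENT_PATH"])). Because the check keys on the event name instead of on an AMD flag, the nightly-torch workflow triggered by the completed "Nvidia CI" run is classified the same way as the AMD callers.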
