Update slow test actions #8381
Conversation
Thanks! I left some comments.
.github/workflows/nightly_tests.yml
@@ -59,7 +59,7 @@ jobs:
     runs-on: [single-gpu, nvidia-gpu, t4, ci]
     container:
       image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
Can we use --gpus all here?
Also, does it make sense to have the cache paths defined as env vars?
The cache path is passed to the run options of the container, while the env variables are accessed inside the container.
Don't think we can just pass the env variable to this command here.
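To illustrate the distinction, here is a minimal sketch of where each mechanism lives (the env block and the HF_HOME value are illustrative assumptions, not the exact contents of this workflow): the -v mount in options is handed to Docker on the runner host before the container starts, while env entries are only visible to steps running inside the container.

jobs:
  slow_tests:   # hypothetical job name for illustration
    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
      # This string is passed to "docker run" on the host, so a job-level
      # env var is not straightforwardly usable inside it.
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
    env:
      # Assumed example: a variable like this is only read by processes
      # running inside the container, e.g. the test suite.
      HF_HOME: /mnt/cache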
-      - name: Tailscale
-        uses: huggingface/tailscale-action@v1
-        with:
-          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
-          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
-          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
How would we know the IP address to SSH into without this step?
The SSH access to a particular runner is only available for a certain amount of time. Practically speaking, we might not have the time to actually debug anything while that access is still available (just due to having other obligations).
We also have a dedicated workflow that allows us to SSH into a runner machine, which is better for debugging since we can spin it up as needed instead of only when a job completes.
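For context, a rough sketch of what that on-demand workflow can look like, pieced together from the inputs and the Tailscale step visible in the diffs in this PR; the workflow name, the input descriptions, and the keep-alive step are illustrative assumptions rather than the actual file contents.

name: SSH into runner (illustrative sketch)
on:
  workflow_dispatch:
    inputs:
      runner_type:
        description: "Runner label to target (assumed description)"
        required: true
      docker_image:
        description: "Container image to start (assumed description)"
        required: true
jobs:
  ssh_runner:
    runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
      image: ${{ github.event.inputs.docker_image }}
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
    steps:
      # Joins the runner to the tailnet so it can be reached over SSH.
      - name: Tailscale
        uses: huggingface/tailscale-action@v1
        with:
          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
      # Assumed step: keep the job alive long enough to connect and debug.
      - name: Keep the runner alive
        run: sleep 7200

Because this workflow is dispatched manually, the SSH window opens when someone is actually ready to debug, rather than at the end of a scheduled test job.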
Makes sense.
@@ -25,7 +25,7 @@ jobs:
     runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
       image: ${{ github.event.inputs.docker_image }}
-      options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
Why not --gpus all?
These were all meant to run on single-GPU machines, so there's only one GPU available anyway.
But were the nightly tests also meant to run on a single GPU?
Disregard, sorry.
What does this PR do?
Syncs the container settings across the SSH-into-runner, nightly, and slow test actions so that they are the same. Now they all use the same shared memory size, GPU options, and diffusers-specific cache path, which should make it easier to reproduce CI issues.
Additionally, I think we can remove the Tailscale action from the slow test runner, since we now use the SSH-into-runner workflow for debugging purposes.
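For reference, the container settings the workflows converge on look roughly like this (taken from the updated lines in the diffs above; only the image differs, since the SSH workflow takes it from a workflow_dispatch input):

container:
  image: diffusers/diffusers-pytorch-cuda   # or ${{ github.event.inputs.docker_image }} in the SSH-into-runner workflow
  options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged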
Fixes # (issue)
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.