Update slow test actions #8381
Conversation
Thanks! I left some comments.
.github/workflows/nightly_tests.yml
@@ -59,7 +59,7 @@ jobs:
     runs-on: [single-gpu, nvidia-gpu, t4, ci]
     container:
       image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
Can we use --gpus all here?
Also, does it make sense to have the cache paths defined as env vars?
The cache path is passed to the run options of the container, while the env variables are accessed inside the container.
Don't think we can just pass the env variable to this command here.
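To illustrate the distinction, here is a minimal sketch of where each mechanism lives (the env block and the HF_HOME value are illustrative assumptions, not the exact contents of this workflow): the -v mount in options is handed to Docker on the runner host before the container starts, while env entries are only visible to steps running inside the container.

jobs:
  slow_tests:   # hypothetical job name for illustration
    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
      # This string is passed to "docker run" on the host, so a job-level
      # env var is not straightforwardly usable inside it.
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
    env:
      # Assumed example: a variable like this is only read by processes
      # running inside the container, e.g. the test suite.
      HF_HOME: /mnt/cache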
-      - name: Tailscale
-        uses: huggingface/tailscale-action@v1
-        with:
-          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
-          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
-          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
How would we know the IP address to SSH into without this step?
The SSH access to a particular runner is only available for a certain amount of time. Practically speaking, we might not have the time to actually debug anything while that access is still available (just due to having other obligations).
We also have a dedicated workflow that allows us to SSH into a runner machine, which is better for debugging since we can spin it up as needed instead of only when a job completes.
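For context, a rough sketch of what that on-demand workflow can look like, pieced together from the inputs and the Tailscale step visible in the diffs in this PR; the workflow name, the input descriptions, and the keep-alive step are illustrative assumptions rather than the actual file contents.

name: SSH into runner (illustrative sketch)
on:
  workflow_dispatch:
    inputs:
      runner_type:
        description: "Runner label to target (assumed description)"
        required: true
      docker_image:
        description: "Container image to start (assumed description)"
        required: true
jobs:
  ssh_runner:
    runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
      image: ${{ github.event.inputs.docker_image }}
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
    steps:
      # Joins the runner to the tailnet so it can be reached over SSH.
      - name: Tailscale
        uses: huggingface/tailscale-action@v1
        with:
          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
      # Assumed step: keep the job alive long enough to connect and debug.
      - name: Keep the runner alive
        run: sleep 7200

Because this workflow is dispatched manually, the SSH window opens when someone is actually ready to debug, rather than at the end of a scheduled test job.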
Makes sense.
@@ -25,7 +25,7 @@ jobs:
     runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
       image: ${{ github.event.inputs.docker_image }}
-      options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
Why not --gpus all?
These were all meant to run on single-GPU machines, so there's only one GPU available anyway.
But were the nightly tests also meant to run on a single GPU?
Disregard, sorry.
What does this PR do?
Syncs the container settings across the SSH-into-runner, nightly, and slow test actions so that they are the same. Now they all use the same shared memory size, GPU options, and diffusers-specific cache path, which should make it easier to reproduce CI issues.
Additionally, I think we can remove the Tailscale action from the slow test runner, since we now use the SSH-into-runner workflow for debugging purposes.
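For reference, the container settings the workflows converge on look roughly like this (taken from the updated lines in the diffs above; only the image differs, since the SSH workflow takes it from a workflow_dispatch input):

container:
  image: diffusers/diffusers-pytorch-cuda   # or ${{ github.event.inputs.docker_image }} in the SSH-into-runner workflow
  options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged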
Fixes # (issue)
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.