-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Add Github Actions workflow that runs on a self-hosted TPU VM runner. #13000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,59 @@ | ||||||||||
name: Cloud TPU nightly | ||||||||||
|
||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @GMNGeoffrey stealing your idea of starting a thread.
Huh. Well I think like the idea of having a separate user anyway (I actually meant to ask you about this with your setup, but you were OOO, and then I forgot :)). We don't need sudo access for this workflow (yet?), and we can always enable that later. But it does add complexity. So I'm leaning towards keeping as-is, but could easily be convinced otherwise. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less |
||||||||||
on: | ||||||||||
schedule: | ||||||||||
- cron: "0 14 * * *" # daily at 7am PST | ||||||||||
workflow_dispatch: # allows triggering the workflow run manually | ||||||||||
|
||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you want to set permissions for the GitHub token? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They're already set to read-only via the project settings. Should I also set it here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's sufficient, but it's best practice to also include them here, in case someone modifies the project settings later on. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it, done. I put There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. config-as-code is nice to have during code / audit reviews and incidence response, especially that it's part of the commit history... unlike the settings. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, fair |
||||||||||
# This should also be set to read-only in the project settings, but it's nice to | ||||||||||
# document and enforce the permissions here. | ||||||||||
permissions: | ||||||||||
contents: read | ||||||||||
|
||||||||||
jobs: | ||||||||||
cloud-tpu-test: | ||||||||||
runs-on: [self-hosted, tpu, v4-8] | ||||||||||
strategy: | ||||||||||
fail-fast: false # don't cancel all jobs on failure | ||||||||||
matrix: | ||||||||||
python-version: ["3.10"] # TODO(jakevdp): update to 3.11 when available. | ||||||||||
jaxlib-version: [latest, nightly] | ||||||||||
steps: | ||||||||||
# https://opensource.google/documentation/reference/github/services#actions | ||||||||||
# mandates using a specific commit for non-Google actions. | ||||||||||
- uses: actions/checkout@93ea575cb5d8a053eaa0ac8fa3b40d7e05a33cc8 # v3 | ||||||||||
- name: Set up Python ${{ matrix.python-version }} | ||||||||||
uses: actions/setup-python@13ae5bb136fac2878aff31522b9efb785519f984 # v4 | ||||||||||
with: | ||||||||||
python-version: ${{ matrix.python-version }} | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You could cache the pip installs here https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How does this work? Like where is it cached? Right now everything is cached since it's not an ephemeral runner, maybe that's bad... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The GitHub actions cache. It actually might not work very well because you're not running in Azure right next to their caches |
||||||||||
- name: Install JAX test requirements | ||||||||||
run: | | ||||||||||
pip install -r build/test-requirements.txt | ||||||||||
- name: Install JAX | ||||||||||
run: | | ||||||||||
pip uninstall -y jax jaxlib libtpu-nightly | ||||||||||
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then | ||||||||||
pip install .[tpu] \ | ||||||||||
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh unrelated to this PR, but I'd actually be interested in how you decided to set this up. We used to parse the GitHub releases page, but they changed the format, so now we make a hacky not-index and are looking for what the best long-term solution is: iree-org/iree#9532 I see you're using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We only use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah let's take this offline :-) |
||||||||||
|
||||||||||
elif [ "${{ matrix.jaxlib-version }}" == "nightly" ]; then | ||||||||||
pip install . | ||||||||||
pip install --pre jaxlib \ | ||||||||||
-f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html | ||||||||||
pip install libtpu-nightly \ | ||||||||||
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html | ||||||||||
|
||||||||||
else | ||||||||||
echo "Unknown jaxlib-version: ${{ matrix.jaxlib-version }}" | ||||||||||
exit 1 | ||||||||||
fi | ||||||||||
|
||||||||||
python3 -c 'import jax; print("jax version:", jax.__version__)' | ||||||||||
python3 -c 'import jaxlib; print("jaxlib version:", jaxlib.__version__)' | ||||||||||
python3 -c 'import jax; print("libtpu version:", | ||||||||||
jax.lib.xla_bridge.get_backend().platform_version)' | ||||||||||
|
||||||||||
- name: Run tests | ||||||||||
env: | ||||||||||
JAX_PLATFORMS: tpu,cpu | ||||||||||
run: python -m pytest --tb=short tests examples | ||||||||||
Comment on lines
+56
to
+59
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The brevity of this job is 👨🍳 💋 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I promise it'll get worse with time :) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you clone the whole jax repo onto the runner? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeop |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
#!/bin/bash | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. README file for all this might be nice. I guess you've got an internal google doc. Maybe link it from these files? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. It mostly just links to the doc for now, but maybe I'll move more of it to be public. |
||
|
||
# Copyright 2022 The JAX Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Heavily influenced by | ||
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think at this point, somewhat less :-) MIght want to tweak the comment There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done ("Heavily influenced by ...") |
||
|
||
set -eux | ||
|
||
if [ "$#" -ne 3 ]; then | ||
echo "Usage: setup_runner.sh <runner name> <tags> <github token>" | ||
fi | ||
|
||
runner_name="$1" | ||
runner_tags="$2" | ||
runner_token="$3" | ||
|
||
# Secret fourth argument for setting the repo URL. Useful for testing with forks. | ||
# - sets empty string as default to avoid unbound variable error from set -u | ||
jax_repo_url="${4-}" | ||
if [ -z "${jax_repo_url}" ]; then | ||
jax_repo_url="https://github.com/google/jax" | ||
fi | ||
|
||
# Create `runner` user. This user won't have sudo access unless you ssh into the | ||
# GCP VM as `runner` using gcloud. Don't do that! | ||
sudo useradd runner -m | ||
|
||
# Find the latest actions-runner download. The runner will automatically update | ||
# itself when new versions are released. Github requires that all self-hosted | ||
# runners are updated to the latest version within 30 days of release | ||
# (https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners). | ||
# Example URL: | ||
# https://github.com/actions/runner/releases/download/v2.298.2/actions-runner-linux-x64-2.298.2.tar.gz | ||
actions_runner_download_regexp='https://github.com/actions/runner/releases/'\ | ||
'download/v[0-9.]\+/actions-runner-linux-x64-[0-9.]\+\.tar\.gz' | ||
# Use `head -n 1` because there are multiple instances of the same URL | ||
actions_runner_download=$( | ||
curl -s -X GET 'https://api.github.com/repos/actions/runner/releases/latest' | | ||
grep -o "${actions_runner_download_regexp}" | | ||
head -n 1) | ||
echo "actions_runner_download: ${actions_runner_download}" | ||
|
||
# Run the rest of the setup as `runner`. | ||
# | ||
# Note that env vars in the heredoc will be expanded according to the _calling_ | ||
# environment, not the `runner` environment we're creating -- it acts like a | ||
# double-quoted string. This is how variables like $runner_name are inserted | ||
# without using sudo -E (which would cause the current environment to be | ||
# inherited). This also means we must be careful to escape any variables that | ||
# we'd like to evaluate in the `runner` environment, e.g. $HOME. | ||
sudo -i -u runner bash -ex <<EOF | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So I didn't like heredoc'ing this and created a separate script file. Since you've already cloned the whole repo, you could do that here also There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I could! And in fact the logic here is suuuuper subtle because heredoc replaces variables from the containing environment, just as a double-quoted string would. You can use In fact, I realize I had a bug due to this subtle behavior! That said, I'm not gonna take your suggestion because I like having everything in one file :P I'll be sure to tell you if/when I regret this decision and split it up so you can tell me "I told you so". I will add a comment and fix the bug though. |
||
|
||
cd ~/ | ||
|
||
git clone ${jax_repo_url} | ||
|
||
# Based on https://github.com/google/jax/settings/actions/runners/new | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That link 404s for non-admins. Might be worth noting that so people aren't confused (GitHub makes the questionable choice of conflating 404 and 403) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea, done |
||
# (will be 404 for github users with insufficient repo permissions) | ||
mkdir actions-runner && cd actions-runner | ||
curl -o actions-runner-linux-x64.tar.gz -L ${actions_runner_download} | ||
tar xzf ./actions-runner-linux-x64.tar.gz | ||
|
||
# Register the runner with Github | ||
./config.sh --unattended \ | ||
--url ${jax_repo_url} \ | ||
--labels ${runner_tags} \ | ||
--token ${runner_token} \ | ||
--name ${runner_name} | ||
|
||
# Setup pre-job hook | ||
cat ~/jax/.github/workflows/self_hosted_runner_utils/runner.env | envsubst >> ~/actions-runner/.env | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oo that's a neat trick. So envsubst allows you to write env variables in your config file and then they get filled in here (like you have $HOME)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup! |
||
|
||
# Setup Github Actions Runner to automatically start on reboot (e.g. due to TPU | ||
# VM maintenance events) | ||
echo "@reboot \${HOME}/jax/.github/workflows/self_hosted_runner_utils/start_github_runner.sh" | crontab - | ||
|
||
EOF |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
#!/bin/bash | ||
|
||
# Copyright 2022 The JAX Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# More or less copied from | ||
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config | ||
|
||
~/actions-runner/run.sh &> /tmp/actions-runner.`date +"%Y%m%d-%H%M"`.log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
#!/bin/bash | ||
|
||
# Copyright 2022 The JAX Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# More or less copied from | ||
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config | ||
|
||
set -euo pipefail | ||
|
||
ALLOWED_EVENTS=( | ||
"schedule" | ||
"workflow_dispatch" | ||
) | ||
|
||
# Tests if the first argument is contained in the array in the second argument. | ||
# Usage `is_contained "element" "${array[@]}"` | ||
is_contained() { | ||
local e; | ||
local match="$1" | ||
shift | ||
for e in "$@"; do | ||
if [[ "${e}" == "${match}" ]]; then | ||
return 0 | ||
fi | ||
done | ||
return 1 | ||
} | ||
|
||
if ! is_contained "${GITHUB_EVENT_NAME}" "${ALLOWED_EVENTS[@]}"; then | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add other sort of validation like branch names? The triggers don't accept arbitrary branches. The workflow_dispatch does but is limited to users with push, so low risk I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer to allow arbitrary branches actually, because it's useful for testing the workflow (since we can't use PRs). We could force devs to setup a fork with their own runner for testing, but given it's low risk to keep as-is, I'd prefer to keep it semi-convenient. |
||
echo "Event type '${GITHUB_EVENT_NAME}' is not allowed on this runner. Aborting workflow." | ||
# clean up any nefarious stuff we may have fetched in job setup. | ||
cd ~/actions-runner/_work | ||
rm -rfv _actions/ _temp/ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that I'm not totally sure that this is all the directories that the fetch step could have populated. In our case, we're using ephemeral runners, so that matters much less, but something to investigate. The whole thing where this fetching happens before the pre-job hook is totally weird to me: actions/runner#1951 |
||
exit 1 | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Here for threading]
You will find that getting notifications is way harder than it should be. This is our issue for it: iree-org/iree#9857. Would love to combine efforts on this though. No point reinventing solutions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gahh! I kinda knew about this but didn't realize how dire the situation was. @yashk2810 will be particularly unhappy. But yes, let's combine efforts, I would really love to figure out a workable solution for notifications.
As a potential workaround for email altogether, @jakevdp wrote this workflow that files an issue on failure: https://github.com/google/jax/blob/main/.github/workflows/upstream-nightly.yaml#L114
It has some fancy logic for reusing an existing open issue, so it doesn't end up spamming the issue tracker on repeated failures.
Alternatively, @alexalemi wrote this workflow for pinging Google Chat on new releases: https://github.com/google/jax/blob/main/.github/workflows/release-notification.yml
Presumably this could be reused to start a chat thread on CI failure. Everyone uses chat instead of email now right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...