Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions .github/workflows/cloud-tpu-ci-nightly.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
name: Cloud TPU nightly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Here for threading]

You will find that getting notifications is way harder than it should be. This is our issue for it: iree-org/iree#9857. Would love to combine efforts on this though. No point reinventing solutions here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gahh! I kinda knew about this but didn't realize how dire the situation was. @yashk2810 will be particularly unhappy. But yes, let's combine efforts, I would really love to figure out a workable solution for notifications.

As a potential workaround for email altogether, @jakevdp wrote this workflow that files an issue on failure: https://github.com/google/jax/blob/main/.github/workflows/upstream-nightly.yaml#L114
It has some fancy logic for reusing an existing open issue, so it doesn't end up spamming the issue tracker on repeated failures.

Alternatively, @alexalemi wrote this workflow for pinging Google Chat on new releases: https://github.com/google/jax/blob/main/.github/workflows/release-notification.yml
Presumably this could be reused to start a chat thread on CI failure. Everyone uses chat instead of email now right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GMNGeoffrey stealing your idea of starting a thread.

So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.

Huh. Well I think like the idea of having a separate user anyway (I actually meant to ask you about this with your setup, but you were OOO, and then I forgot :)). We don't need sudo access for this workflow (yet?), and we can always enable that later. But it does add complexity. So I'm leaning towards keeping as-is, but could easily be convinced otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less

on:
schedule:
- cron: "0 14 * * *" # daily at 7am PST
workflow_dispatch: # allows triggering the workflow run manually

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to set permissions for the GitHub token?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're already set to read-only via the project settings. Should I also set it here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sufficient, but it's best practice to also include them here, in case someone modifies the project settings later on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, done. I put contents: read, that seems to be sufficient (https://github.com/skye/jax/actions/runs/3365397735/jobs/5580821741). It's unclear to me if it even needs that, or if it gets the current checkout "for free".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config-as-code is nice to have during code / audit reviews and incidence response, especially that it's part of the commit history... unlike the settings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair

# This should also be set to read-only in the project settings, but it's nice to
# document and enforce the permissions here.
permissions:
contents: read

jobs:
cloud-tpu-test:
runs-on: [self-hosted, tpu, v4-8]
strategy:
fail-fast: false # don't cancel all jobs on failure
matrix:
python-version: ["3.10"] # TODO(jakevdp): update to 3.11 when available.
jaxlib-version: [latest, nightly]
steps:
# https://opensource.google/documentation/reference/github/services#actions
# mandates using a specific commit for non-Google actions.
- uses: actions/checkout@93ea575cb5d8a053eaa0ac8fa3b40d7e05a33cc8 # v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@13ae5bb136fac2878aff31522b9efb785519f984 # v4
with:
python-version: ${{ matrix.python-version }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could cache the pip installs here

https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages

Suggested change
python-version: ${{ matrix.python-version }}
python-version: ${{ matrix.python-version }}
cache: 'pip'
cache-dependency-path: build/test-requirements.txt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work? Like where is it cached? Right now everything is cached since it's not an ephemeral runner, maybe that's bad...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub actions cache. It actually might not work very well because you're not running in Azure right next to their caches

- name: Install JAX test requirements
run: |
pip install -r build/test-requirements.txt
- name: Install JAX
run: |
pip uninstall -y jax jaxlib libtpu-nightly
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then
pip install .[tpu] \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh unrelated to this PR, but I'd actually be interested in how you decided to set this up. We used to parse the GitHub releases page, but they changed the format, so now we make a hacky not-index and are looking for what the best long-term solution is: iree-org/iree#9532

I see you're using -f and --pre

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use --pre when installing jaxlib nightlies, so pip picks up dev versions (the nightlies are released as jaxlib dev versions). We generate the index file directly from the GCS bucket contents. We decided to do this because we host many large wheels so would quickly exceed pypi's size limits. Happy to answer more questions on this here or offline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's take this offline :-)


elif [ "${{ matrix.jaxlib-version }}" == "nightly" ]; then
pip install .
pip install --pre jaxlib \
-f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
pip install libtpu-nightly \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html

else
echo "Unknown jaxlib-version: ${{ matrix.jaxlib-version }}"
exit 1
fi

python3 -c 'import jax; print("jax version:", jax.__version__)'
python3 -c 'import jaxlib; print("jaxlib version:", jaxlib.__version__)'
python3 -c 'import jax; print("libtpu version:",
jax.lib.xla_bridge.get_backend().platform_version)'

- name: Run tests
env:
JAX_PLATFORMS: tpu,cpu
run: python -m pytest --tb=short tests examples
Comment on lines +56 to +59
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The brevity of this job is 👨‍🍳 💋

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I promise it'll get worse with time :)

1 change: 1 addition & 0 deletions .github/workflows/self_hosted_runner_utils/runner.env
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you clone the whole jax repo onto the runner?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeop

90 changes: 90 additions & 0 deletions .github/workflows/self_hosted_runner_utils/setup_runner.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README file for all this might be nice. I guess you've got an internal google doc. Maybe link it from these files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It mostly just links to the doc for now, but maybe I'll move more of it to be public.


# Copyright 2022 The JAX Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Heavily influenced by
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point, somewhat less :-) MIght want to tweak the comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ("Heavily influenced by ...")


set -eux

if [ "$#" -ne 3 ]; then
echo "Usage: setup_runner.sh <runner name> <tags> <github token>"
fi

runner_name="$1"
runner_tags="$2"
runner_token="$3"

# Secret fourth argument for setting the repo URL. Useful for testing with forks.
# - sets empty string as default to avoid unbound variable error from set -u
jax_repo_url="${4-}"
if [ -z "${jax_repo_url}" ]; then
jax_repo_url="https://github.com/google/jax"
fi

# Create `runner` user. This user won't have sudo access unless you ssh into the
# GCP VM as `runner` using gcloud. Don't do that!
sudo useradd runner -m

# Find the latest actions-runner download. The runner will automatically update
# itself when new versions are released. Github requires that all self-hosted
# runners are updated to the latest version within 30 days of release
# (https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners).
# Example URL:
# https://github.com/actions/runner/releases/download/v2.298.2/actions-runner-linux-x64-2.298.2.tar.gz
actions_runner_download_regexp='https://github.com/actions/runner/releases/'\
'download/v[0-9.]\+/actions-runner-linux-x64-[0-9.]\+\.tar\.gz'
# Use `head -n 1` because there are multiple instances of the same URL
actions_runner_download=$(
curl -s -X GET 'https://api.github.com/repos/actions/runner/releases/latest' |
grep -o "${actions_runner_download_regexp}" |
head -n 1)
echo "actions_runner_download: ${actions_runner_download}"

# Run the rest of the setup as `runner`.
#
# Note that env vars in the heredoc will be expanded according to the _calling_
# environment, not the `runner` environment we're creating -- it acts like a
# double-quoted string. This is how variables like $runner_name are inserted
# without using sudo -E (which would cause the current environment to be
# inherited). This also means we must be careful to escape any variables that
# we'd like to evaluate in the `runner` environment, e.g. $HOME.
sudo -i -u runner bash -ex <<EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I didn't like heredoc'ing this and created a separate script file. Since you've already cloned the whole repo, you could do that here also

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could! And in fact the logic here is suuuuper subtle because heredoc replaces variables from the containing environment, just as a double-quoted string would. You can use <<'EOF' to get single-quote-like behavior, but I'm actually relying on the double-quote behavior to insert values like $jax_repo_url. IMO this is confusing because it looks like variables should be expanded according to runner's env, not the containing one.

In fact, I realize I had a bug due to this subtle behavior! echo "@reboot $HOME/.../start_github_runner.sh" | crontab - is wrong, it should be echo "@reboot \$HOME/...". Otherwise it expands to /home/skyewm (or whoever's running the script) instead of /home/runner. This worked because I check out the whole repo in order to run this file.

That said, I'm not gonna take your suggestion because I like having everything in one file :P I'll be sure to tell you if/when I regret this decision and split it up so you can tell me "I told you so". I will add a comment and fix the bug though.


cd ~/

git clone ${jax_repo_url}

# Based on https://github.com/google/jax/settings/actions/runners/new
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That link 404s for non-admins. Might be worth noting that so people aren't confused (GitHub makes the questionable choice of conflating 404 and 403)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done

# (will be 404 for github users with insufficient repo permissions)
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64.tar.gz -L ${actions_runner_download}
tar xzf ./actions-runner-linux-x64.tar.gz

# Register the runner with Github
./config.sh --unattended \
--url ${jax_repo_url} \
--labels ${runner_tags} \
--token ${runner_token} \
--name ${runner_name}

# Setup pre-job hook
cat ~/jax/.github/workflows/self_hosted_runner_utils/runner.env | envsubst >> ~/actions-runner/.env
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oo that's a neat trick. So envsubst allows you to write env variables in your config file and then they get filled in here (like you have $HOME)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup!


# Setup Github Actions Runner to automatically start on reboot (e.g. due to TPU
# VM maintenance events)
echo "@reboot \${HOME}/jax/.github/workflows/self_hosted_runner_utils/start_github_runner.sh" | crontab -

EOF
20 changes: 20 additions & 0 deletions .github/workflows/self_hosted_runner_utils/start_github_runner.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/bin/bash

# Copyright 2022 The JAX Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# More or less copied from
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config

~/actions-runner/run.sh &> /tmp/actions-runner.`date +"%Y%m%d-%H%M"`.log
47 changes: 47 additions & 0 deletions .github/workflows/self_hosted_runner_utils/validate_job.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
#!/bin/bash

# Copyright 2022 The JAX Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# More or less copied from
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config

set -euo pipefail

ALLOWED_EVENTS=(
"schedule"
"workflow_dispatch"
)

# Tests if the first argument is contained in the array in the second argument.
# Usage `is_contained "element" "${array[@]}"`
is_contained() {
local e;
local match="$1"
shift
for e in "$@"; do
if [[ "${e}" == "${match}" ]]; then
return 0
fi
done
return 1
}

if ! is_contained "${GITHUB_EVENT_NAME}" "${ALLOWED_EVENTS[@]}"; then
Copy link

@laurentsimon laurentsimon Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add other sort of validation like branch names? The triggers don't accept arbitrary branches. The workflow_dispatch does but is limited to users with push, so low risk I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to allow arbitrary branches actually, because it's useful for testing the workflow (since we can't use PRs). We could force devs to setup a fork with their own runner for testing, but given it's low risk to keep as-is, I'd prefer to keep it semi-convenient.

echo "Event type '${GITHUB_EVENT_NAME}' is not allowed on this runner. Aborting workflow."
# clean up any nefarious stuff we may have fetched in job setup.
cd ~/actions-runner/_work
rm -rfv _actions/ _temp/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm not totally sure that this is all the directories that the fetch step could have populated. In our case, we're using ephemeral runners, so that matters much less, but something to investigate. The whole thing where this fetching happens before the pre-job hook is totally weird to me: actions/runner#1951

exit 1
fi