Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 21 additions & 2 deletions .github/workflows/test_e2e.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ env:
E2E_BRANCH: e2e-${{ github.event.pull_request.number || github.run_id }}-${{ github.run_attempt }}

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this change?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change updates the concurrency group to be scoped to this workflow instead of the workflow + the pr.

this way the actions should only run one time per workflow, and not per pr, so in the case multiple PR's are open, it should only run this workflow one at a time to avoid race conditions.

since this change adds a step to delete the staging repo, if the workflows ran in parallel then it could hypothetically have a race condition if one action deleted the repo after the other action uploaded and before it downloaded in the get_kernel call.

so this change should guard against that race condition

group: ${{ github.workflow }}
cancel-in-progress: true

jobs:
Expand All @@ -45,6 +45,23 @@ jobs:
env:
USER: runner

# Remove the existing test repo if it exists so we test repo creation in the upload step
- name: Delete existing test repo
run: |
cat > /tmp/delete_repos.py << 'PYEOF'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep it under scripts or workflows directory?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move it if you think thats a better location - however since the snippet is so small and only really expected to be used in this case I thought it made sense to inline.

lmk what you think!

from huggingface_hub import HfApi
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use delete_repo() without having to initialize the API?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I wasn't aware that was possible? can you share a snippet we can use to delete a repo in a more concise way? I was under the impression we'd need to create the api object pointing to the staging url.

Happy to make changes to a better approach!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import os

api = HfApi(
endpoint=os.environ["HF_ENDPOINT"]
)
repo_id = os.environ["E2E_REPO_ID"]
for repo_type in ["kernel", "model"]:
api.delete_repo(repo_id, repo_type=repo_type, missing_ok=True)
print(f"Deleted {repo_type} repo (or it did not exist)")
PYEOF
nix-shell -p python3Packages.huggingface-hub --run "python /tmp/delete_repos.py"

- name: Init kernel project
run: |
cd /tmp
Expand Down Expand Up @@ -121,6 +138,8 @@ jobs:
CUDA_TAG=$(echo "$VARIANT" | grep -oP 'cu\d+')
echo "Installing torch matching variant $VARIANT (CUDA tag: $CUDA_TAG)"
uv sync --all-extras --dev
# E2E validates this checkout, so use the local kernels-data binding instead of the released wheel from uv.lock.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was added to ensure the E2E test is using the latest kernel-data in the repo, instead of the version from the lock file.

I ran into an issue with a difference in the api used to load the metadata in a previous action run https://github.com/huggingface/kernels/actions/runs/25754879907/job/75640986832#step:5:36

and by reinstalling the kernels-data library after syncing, it make sure to use the latest changes (and not the one pinned in the uv.lock)

uv pip install --reinstall ../kernels-data/bindings/python
uv pip install --upgrade torch --index-url https://download.pytorch.org/whl/$CUDA_TAG
uv run --no-sync python -c "import torch; print(f'torch={torch.__version__}, cuda={torch.version.cuda}, cxx11_abi={torch.compiled_with_cxx11_abi()}')"

Expand All @@ -131,7 +150,7 @@ jobs:
import torch
from kernels import get_kernel

kernel = get_kernel('${{ env.E2E_REPO_ID }}', revision='${{ env.E2E_BRANCH }}')
kernel = get_kernel('${{ env.E2E_REPO_ID }}', revision='${{ env.E2E_BRANCH }}', trust_remote_code=True)

x = torch.randn(1024, 1024, dtype=torch.float32, device='cuda')
result = kernel.kernels_upload_test(x)
Expand Down
Loading