Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reapply [workflows] Split pr-code-format into two parts to make it more secure (#78215) #80495

Merged
merged 5 commits into from
Mar 22, 2024

Conversation

tstellar
Copy link
Collaborator

@tstellar tstellar commented Feb 2, 2024

Actions triggered by pull_request_target events have access to all repository secrets, so it is unsafe to use them when executing untrusted code. The pr-code-format workflow does not execute any untrusted code, but it passes untrused input into clang-format. An attacker could use this to exploit a flaw in clang-format and potentially gain access to the repository secrets.

By splitting the workflow, we can use the pull_request target which is more secure and isolate the issue write permissions in a separate job. The pull_request target also makes it easier to test changes to the code-format-helepr.py script, because the version of the script from the pull request will be used rather than the version of the script from main.

Fixes #77142

@llvmbot
Copy link
Collaborator

llvmbot commented Feb 2, 2024

@llvm/pr-subscribers-github-workflow

Author: Tom Stellard (tstellar)

Changes

After bc06cd5 we started pulling code-fromat-helper.py from the pull request branch, but if the pull request branch is not up-to-date, then the job will fail, because this workflow is incompatible with the old version of the script.

In order to fix this, we need to pull the script from the main branch like we were doing before to ensure that it always gets the latest version of the script.


Full diff: https://github.com/llvm/llvm-project/pull/80495.diff

1 Files Affected:

  • (modified) .github/workflows/pr-code-format.yml (+17-3)
diff --git a/.github/workflows/pr-code-format.yml b/.github/workflows/pr-code-format.yml
index 1475d872498d4..3aa40a29852c9 100644
--- a/.github/workflows/pr-code-format.yml
+++ b/.github/workflows/pr-code-format.yml
@@ -25,6 +25,20 @@ jobs:
           separator: ","
           skip_initial_fetch: true
 
+      # We need to pull the script from the main branch, so that we ensure
+      # we get a version of the script that supports the --wirte-comment-to-file
+      # option.
+      - name: Fetch code formatting utils
+        uses: actions/checkout@v4
+        with:
+          reository: ${{ github.repository }}
+          ref: ${{ github.base_ref }}
+          sparse-checkout: |
+            llvm/utils/git/requirements_formatting.txt
+            llvm/utils/git/code-format-helper.py
+          sparse-checkout-cone-mode: false
+          path: code-format-tools
+
       - name: "Listed files"
         env:
           CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
@@ -42,10 +56,10 @@ jobs:
         with:
           python-version: '3.11'
           cache: 'pip'
-          cache-dependency-path: 'llvm/utils/git/requirements_formatting.txt'
+          cache-dependency-path: 'code-format-tools/llvm/utils/git/requirements_formatting.txt'
 
       - name: Install python dependencies
-        run: pip install -r llvm/utils/git/requirements_formatting.txt
+        run: pip install -r code-format-tools/llvm/utils/git/requirements_formatting.txt
 
       - name: Run code formatter
         env:
@@ -58,7 +72,7 @@ jobs:
         # explicitly in code-format-helper.py and not have to diff starting at
         # the merge base.
         run: |
-          python ./llvm/utils/git/code-format-helper.py \
+          python ./code-format-tools/llvm/utils/git/code-format-helper.py \
             --write-comment-to-file \
             --token ${{ secrets.GITHUB_TOKEN }} \
             --issue-number $GITHUB_PR_NUMBER \

@boomanaiden154
Copy link
Contributor

Wouldn't any PR that has an updated version of .github/works/pr-code-format.yml (so is actually running the code formatting job) also have an updated version of the python script, assuming they were both modified in the same commit?

…re secure (llvm#78215)

Actions triggered by pull_request_target events have access to all
repository secrets, so it is unsafe to use them when executing untrusted
code. The pr-code-format workflow does not execute any untrusted code,
but it passes untrused input into clang-format. An attacker could use
this to exploit a flaw in clang-format and potentially gain access to
the repository secrets.

By splitting the workflow, we can use the pull_request target which is
more secure and isolate the issue write permissions in a separate job.
The pull_request target also makes it easier to test changes to the
code-format-helepr.py script, because the version of the script from the
pull request will be used rather than the version of the script from
main.

Fixes llvm#77142
@tstellar tstellar changed the title [workflows] Always pull code-format-helper.py from the main branch Reapply [workflows] Split pr-code-format into two parts to make it more secure (#78215) Feb 3, 2024
@tstellar tstellar requested a review from tru February 3, 2024 17:59
@tstellar
Copy link
Collaborator Author

tstellar commented Feb 3, 2024

Wouldn't any PR that has an updated version of .github/works/pr-code-format.yml (so is actually running the code formatting job) also have an updated version of the python script, assuming they were both modified in the same commit?

My guess is that since the older branches had the pull_request_target version of the workflow that caused GitHub Actions to pull the workflow file from main.

Copy link
Contributor

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the relevant line in the Github documentation:

For pull requests from a forked repository to the base repository, GitHub sends the pull_request, issue_comment, pull_request_review_comment, pull_request_review, and pull_request_target events to the base repository. No pull request events occur on the forked repository.

So it uses the workflow definition from the base branch, but then tries to use the script from the PR branch, where it doesn't exist in most cases. So pulling the scripts from the main branch should fix the issue.

LGTM, other than a couple minor nits.

.github/workflows/pr-code-format.yml Outdated Show resolved Hide resolved
github.event.workflow_run.event == 'pull_request'
steps:
- name: 'Download artifact'
uses: actions/download-artifact@6b208ae046db98c579e8a3aa621ab581ff575935 # v4.1.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this version different than the upload artifact action in the pull_request job? 4.3.0 vs 4.1.1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two actions are versioned differently.

@jyknight
Copy link
Member

jyknight commented Feb 6, 2024

OK, I looked into this in a bit more detail:
Firstly -- a pull_request run is always run in the context of the "merge" branch, not the PR head. The "merge" branch is auto-updated (if there are no file conflicts) to the latest revision of the target branch. So, a PR will generally automatically see a new .github/workflow file after it's committed -- and by default, we'd get all the rest of the new files too!

However, the issue is that we're explicitly instructing it to checkout the SHA of the "head" commit, not the "merge" commit.

In fact, I think this whole process gets a lot easier if we make use the auto-generated merge commit, because then we:

  1. Checkout the merge commit (which is the default behavior of "checkout" if we weren't passing with: ref: ${{ github.event.pull_request.head.sha }}) with fetch_depth=2. We don't need to deepen, because the 2 parents of the merge-commit are the exact two things we need to compare!
  2. For the diff between those two commit hashes: get the list of changed files, run the formatter on the diff, etc...we don't care about any intermediate commits -- just those two endpoints is all we need.

Another advantage: if we fix an issue in the formatting jobs, most PRs will get the new version immediately, without the author needing to rebase or merge the main branch themselves.

That said, I think it's fine to defer these fixes to a separate PR: it becomes a lot easier to test this when we're no longer using pull_request_target!

console.log(runInfo);


// Query to find the number of the pull request that triggered this job.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need all this? How can it end up with multiple associated pull requests? Or have a different baseRepository?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The associated pull requests are based off of the branch name, so if you create a pull request for a branch, close it, and then create another pull request with the same branch, then this query will return two associated pull requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, ok. Can you add comments to the code explaining that?

await comments.forEach(function (comment) {
if (comment.id) {
// Security check: Ensure that this comment was created by
// the github-actions bot, so a malisious input won't overwrite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: malicious

@boomanaiden154
Copy link
Contributor

Firstly -- a pull_request run is always run in the context of the "merge" branch, not the PR head. The "merge" branch is auto-updated (if there are no file conflicts) to the latest revision of the target branch. So, a PR will generally automatically see a new .github/workflow file after it's committed -- and by default, we'd get all the rest of the new files too!

Right, and it doesn't run if there are any merge conflicts (one reason why the current design is the way it is). My point was mainly about where the workflow definition is pulled from that gets run. For some reason I thought it might have been the fork branch, but that isn't the case (based on the documentation).

I agree that looking at the merge commit would make things a lot simpler. I think that's good to split out into a separate patch though so we can make sure that the current state in-tree is reliable after the split. That is another thing on my todo list.

@tstellar
Copy link
Collaborator Author

@jyknight @boomanaiden154 So it's OK to commit this as-is and then we switch to using the merge commit in a follow up patch?

# PR for security reasons as we're using pull_request_target. Checkout
# the target branch with the necessary files.
# We need to pull the script from the main branch, so that we ensure
# we get a version of the script that supports the --write-comment-to-file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we should probably reword this to say something about making sure the script is updated rather than specifically mentioning the --write-comment-to-file option, which I think would quickly become a dated comment.

@boomanaiden154
Copy link
Contributor

@jyknight @boomanaiden154 So it's OK to commit this as-is and then we switch to using the merge commit in a follow up patch?

I think it should be good. I would think any work being done to move this to checkout the default pull_request ref would either need to be done in this patch, or would be preferred to be done after this patch, so it should just land to make forward progress there.

This just keeps the current state of things and doesn't introduce any regressions while being more secure, other than the (somewhat) edge case where there are merge conflicts, so seems fine to land to me.

Copy link
Member

@jyknight jyknight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's fine to commit this as-is and do further cleanup in follow-on PRs.

console.log(runInfo);


// Query to find the number of the pull request that triggered this job.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, ok. Can you add comments to the code explaining that?

@tstellar tstellar merged commit 2120f57 into llvm:main Mar 22, 2024
4 of 5 checks passed
chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
…re secure (llvm#78215) (llvm#80495)

Actions triggered by pull_request_target events have access to all
repository secrets, so it is unsafe to use them when executing untrusted
code. The pr-code-format workflow does not execute any untrusted code,
but it passes untrused input into clang-format. An attacker could use
this to exploit a flaw in clang-format and potentially gain access to
the repository secrets.
    
By splitting the workflow, we can use the pull_request target which is
more secure and isolate the issue write permissions in a separate job.
The pull_request target also makes it easier to test changes to the
code-format-helepr.py script, because the version of the script from the
pull request will be used rather than the version of the script from
main.
    
Fixes llvm#77142
- name: Fetch code formatting utils
uses: actions/checkout@v4
with:
reository: ${{ github.repository }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reository -> repository.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f6c87be

Thanks for pointing this out!

DavidSpickett added a commit to DavidSpickett/llvm-project that referenced this pull request Apr 5, 2024
…nnot (llvm#81142)"

This reverts commit 124cd11.

The job originally failed because any workflow run on a PR runs
in a context that cannot write to the PR itself (otherwise people
could damage the repo using a workflow in a PR).

llvm#80495 recently added
a job that is purely for commenting on issues, to solve a similair
problem. This limits the damage the PR can do to adding a spam comment.
- name: 'Download artifact'
uses: actions/download-artifact@6b208ae046db98c579e8a3aa621ab581ff575935 # v4.1.1
with:
github-token: ${{ secrets.ISSUE_WRITE_DOWNLOAD_ARTIFACT }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tstellar, we use clang-format check in downstream repository and this split "breaks" some functionality. According to my understanding, we should have ISSUE_WRITE_DOWNLOAD_ARTIFACT secret available to GitHub Actions in our repository in order to have comment from clang-format action. Could you clarify what permissions should be granted to ISSUE_WRITE_DOWNLOAD_ARTIFACT secret, please?
BTW, why can't we use GITHUB_TOKEN secret?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The token needs actions:read permissions. You can't use 'GITHUB_TOKEN`, because it's only scoped for the current workflow run. See https://github.com/actions/download-artifact?tab=readme-ov-file#download-artifacts-from-other-workflow-runs-or-repositories

I do have a workaround that will allow us to download the artifact without a token, but that is part of a larger PR here. If you think it would be useful to avoid using the token, I can try to pull out this change into its own PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying!

I do have a workaround that will allow us to download the artifact without a token, but that is part of a larger PR here. If you think it would be useful to avoid using the token, I can try to pull out this change into its own PR.

Do you plan to use "Unprivileged Download Artifact" in issue-write? If no, we will use the token.

The only problem I have with introducing a new token is that it's kind of unexpected and requires some efforts to investigate why things are broken. If we can make it work out-of-box and keep the upstream project secure, it would be great.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to use "Unprivileged Download Artifact" in issue-write? If no, we will use the token.

Yeah, I was planning to to do this eventually. I'll start looking into this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your help with this. I really appreciate the efforts to make GitHub Actions CI useful for downstream forks. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split pr-code-format.yml into separate untrusted+trusted workflows
5 participants