New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Add Github Actions workflow for TLD updates PRs. #1166
Conversation
This commit adds a Github Actions workflow called `tld-update.yml` that uses `tools/patchnewgltds` to create pull requests from a fork to update the `public_suffix_list.dat` gTLD section automatically. It replaces a similar mechanism that used a separate Github repo[0], a bash script[1], and Travis CI[2] to accomplish the same. Travis has not been running the job reliably and the Github Action version is more self-contained, and improves on several other pain points (e.g. it won't open duplicate PRs). !! Important !! To operate correctly a secret env var named "BOT_PAT" must be configured with a Github personal access token for a bot user that has forked the repository to the same place as configured in `push-to-fork` in `tld-update.yml`. !! Important !! [0]: https://github.com/cpu/psl-autopull [1]: https://github.com/cpu/psl-autopull/blob/master/autopull [2]: https://travis-ci.com/github/cpu/psl-autopull
Also worth mentioning that I tested it successfully: cpu#1 I had to cheat slightly and comment out the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @cpu! Just double checking my understanding on a few bits, but I think we're good to merge (and I too welcome our GitHub Action overlords)
@cpu -
Many thanks for this and the manual interventions over the past couple of
weeks...
I just have a metaQ - will this still mean we're manually reviewing the PRs
and squash/merging ?
…-Jothan
On Mon, Dec 14, 2020 at 9:50 AM Daniel McCarney ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In .github/workflows/tld-update.yml
<#1166 (comment)>:
> + run: tools/patchnewgtlds
+
+ - name: Create pull-request
+ id: cpr
+ uses: ***@***.***
+ with:
+ token: "${{ secrets.BOT_PAT }}"
+ push-to-fork: tld-update-bot/list
+ commit-message: "util: gTLD data autopull updates for ${{ env.NOW }}"
+ title: "util: gTLD autopull updates for ${{ env.NOW }}"
+ body: "Public suffix list gTLD data updates from `tools/patchnewgtlds` for ${{ env.NOW }}."
+ committer: "GitHub ***@***.***>"
+ author: "GitHub ***@***.***>"
+ labels: tld-update
+ branch: psl-gtld-update
+ delete-branch: true
In the context of the discussion at zmap/zlint#515
<zmap/zlint#515> (comment) , does this become
easier if we change the default?
The situation in this repo is slightly different because I chose to keep
the separate bot account + to have the PRs opened from a fork. The branch
the bot creates will be on the fork repo, whereas in ZLint there is no bot
presently and the branch is created on the main repo.
And we'd also be affected by the lack of testing runs, right? Or does the
fact that we're still on Travis and have not yet migrated to GH Actions
make this largely moot until we do?
Similar answer here: using the separate bot account makes this repo's
implementation a little bit more complex but will avoid the problem related
to testing runs in the future if CI is moved from Travis.
You're correct that it's largely moot until the Travis bits have been
migrated - no matter how the PR is created those should continue to work
the same.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1166 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACQTJNZZ6O7IRWA6WAMYMDSUZF7JANCNFSM4UZTH6QQ>
.
|
@dnsguru That's correct - this side of the PR workflow will remain unchanged. |
Does this need merging, or other next steps? |
@dnsguru I was going to take a pass at @sleevi's nit about the date. I have some holiday time this week and should be able to push a commit shortly. Afterwards I believe it is ready to go. We will still need to coordinate on adding the @tld-update-bot's PAT to the repo configuration. @dnsguru Are you the best person to ask to do that part? Thanks! |
The `tld-update.yml` workflow needs the current date in a human readable form to include in commit messages, PR titles, and the PR body. Previously this was achieved in a clunky way with an env var echoed into `$GITHUB_ENV`. This commit replaces that logic with sleevi's cleaner recommendation: setting the date string as the output of a step and then referencing it later in the workflow.
@sleevi happy new year. How is this looking? Would like to merge this and we need to ensure it is automated as well. |
OK, I've added the |
Porting my comment from the first bot PR Upside: we know the PAT is correct and the action is executing correctly. Downside: it's revealed one bug. The previous automation had special case logic to discard these "header date update only" diffs. The new automation does not and so opened a PR that's a no-op :-/ @sleevi @dnsguru How would you like to proceed? I'm slammed this week with other commitments and likely won't have a chance to fix this bug for a few days. Do you want to revert 9d89164 until then so we don't get spammed with silly PRs? Other options: if you leave #1183 open the bot will just keep updating it with new timestamps/commits but never create a new PR. |
Disable this workflow until we can sort through #1166 (comment)
@sleevi Sounds good to me. I'll ping folks when I have a fix worked out. If anyone wants to poke at it before I get a chance I went back and figured out how the old version handled this and it's pretty hacky. The part of the script that checks for diffs is using |
Definitely want it to not make a pr unless entries change (not just file
header is different from previous)...
Header on ICANN json file changes daily as I believe they generate it
daily, changes or not.
…On Tue, Jan 12, 2021, 2:14 PM Daniel McCarney ***@***.***> wrote:
@sleevi <https://github.com/sleevi> Sounds good to me. I'll ping folks
when I have a fix worked out.
If anyone wants to poke at it before I get a chance I went back and
figured out how the old version handled this and it's pretty hacky. The part
of the script
<https://github.com/cpu/psl-autopull/blob/7ffe171735407ca471df0bdad07d30f392342359/autopull#L40>
that checks for diffs is using git diff --numstat and filtering on the
output that matches the "header only" update with the expectation those are
always a 1 line diff (vs content diffs that would be 2+).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1166 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACQTJMZYVIHDGXBMVEBNDLSZTCT5ANCNFSM4UZTH6QQ>
.
|
Haven't forgotten about picking this work up again and fixing this last bug. Hoping to get a chance to take a look this coming weekend 🤞 |
Take a peek at #1204 - It ended up being a little bit more involved than I was hoping but I'm happy with the outcome in that branch. |
== tools: improved newgtlds.go, removed replace-between. - d996dad Previously the `tools/newgtlds.go` utility was difficult to use with the new Github action based pull-request workflow because it _always_ updated the timestamp in the header comment in the gTLD section of `public_suffix_list.dat` when it was run, even if no gTLD data changed. This meant there was a diff in the workdir after the tool ran that would cause a PR to be opened by the action. See #1166 for more discussion/background. In the old travis version of the automation I used a crude shell pipeline to exclude the header comment when deciding if there was a diff or not. With the new action it's not possible to change how the diff status is determined (without re-inventing the pull request action in-repo). So instead we need to abandon the timestamp in the gTLD section header comment (that would be sad) or, we need the script to be idempotent when there's no true data change. To make the tooling smarter about when it makes changes I reworked `newgtlds.go` to be able to consume the existing `public_suffix_list.dat` and to split apart the data templating and the header comment templating. Now the tool can exclude the header comment and compare against the existing data to determine if there is a real data change or not. If there's no change in the data then the file is left untouched. If there is a change then the file is updated and only then is the header comment with the date stamp updated along with the new data. I've included unit tests that get 80% statement coverage of this new tooling and also tested it locally with success. Conveniently this rework also lets us remove the `tools/replace-between` Perl script reducing the number of languages in play for this process. == CI: re-enable github tld-update workflow - 7856839 Now that the tooling won't produce PRs without meaningful data updates we can re-enable the `tld-update.yml` workflow. == CI: run go unit tests in tld-update workflow - f6749a5 Before using the `tools/newgtlds.go` tooling to open a PR we should run the unit tests to make sure the code works the way we expect. This will also help catch any bitrot.
Summary
I wanted to describe the situation in detail but it made for a long PR description. Here's the important parts up front:
Background
Since ~August 2019 the gTLD section of
public_suffix_list.dat
has been kept up to date by a combination of thetools/newgtlds.go
command in this repo and automation I've been maintaining in a separate Github repo. To date the bot has opened 67 pull requests.The automation in
cpu/psl-autopull
is based on a fork of nbio/autopull that @weppos has been using to keep publicsuffix-go updated with this repo. It's implemented as a BASH script that is meant to be run by a CI provider (in our case, Travis CI). Its configuration is spread between the script and Travis CI environment variables.Unfortunately as of late the Travis CI platform has been letting us down. I've moved several other repos to Github Actions because the Travis CI open source build queue time was getting far too long. More importantly the "cron" build that is scheduled to run the autopull script once a day has not been running reliably. This is causing updates to arrive late (when I remember to kick the builds by hand...).
Many open source projects are replacing Travis with Github Actions and I think it makes sense here too.
Github Actions
Replacing Travis with Github Actions has several advantages:
Bot Account
This branch configures create-pull-request to open PRs from a fork using a Github personal access token (PAT) stored as a repository secret env var called
"BOT_PAT"
.This will need to be set up by a repository administrator by following the docs for creating an encrypted secret for a repository.
The PAT must correspond to the Github account that owns the fork of the repository configured in the
push-to-fork
setting of the workflow.I have configured the
push-to-fork
value totld-update-bot/list
on the assumption we'll be adding @tld-update-bot's PAT to theBOT_PAT
secret var. If the maintainers would prefer to create their own Github account please make sure to update this piece of config to point to the new account's fork. Otherwise we should exchange the @tld-update-bot secret PAT value out of band using whatever is least painful for you (GPG, encrypting to SSH keys, quantum key exchange, smoke signals).To be 100% explicit: in either case adding the PAT to the repo config as an encrypted env var does not grant any access to the repository to the bot. The bot opens PRs from a fork to avoid needing push access to this repository. No extra access should be configured.
I'll also note that it is possible to not have any bot account, PAT, or fork involved by using the Github Action's default token. This is how I implemented similar logic for another open source repo. It has one important caveat: without a dedicated bot github account the PRs opened by the workflow won't in turn run further workflows (e.g. unit tests, etc). I discuss some of this in detail in this issue. For this repository because the data format is more complex & not a Go datastructure I think it's important to maintain the separate bot account + full CI workflow potential for the automation PRs.
Future Work
It would be nice to move the repository CI from Travis to Github Actions as well. I suspect it will be a better experience overall. I would be willing to take this work on but thought it was best to put one toe in with this automation before revamping anything else.