Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Add Github Actions workflow for TLD updates PRs. #1166

Merged
merged 2 commits into from Jan 12, 2021

Conversation

cpu
Copy link
Contributor

@cpu cpu commented Dec 13, 2020

Summary

I wanted to describe the situation in detail but it made for a long PR description. Here's the important parts up front:

  • Travis & the old gTLD PR automation is effectively broken.
  • Let's move the gTLD automation into this repo, using Github actions.
  • If you agree, stamp this PR with an approval.
  • And we need to coordinate on adding the @tld-update-bot's Github "personal access token" as a secret env var in this repo's config. (Alternatively: maintainers can make their own account and plug a PAT in.)

Background

Since ~August 2019 the gTLD section of public_suffix_list.dat has been kept up to date by a combination of the tools/newgtlds.go command in this repo and automation I've been maintaining in a separate Github repo. To date the bot has opened 67 pull requests.

The automation in cpu/psl-autopull is based on a fork of nbio/autopull that @weppos has been using to keep publicsuffix-go updated with this repo. It's implemented as a BASH script that is meant to be run by a CI provider (in our case, Travis CI). Its configuration is spread between the script and Travis CI environment variables.

Unfortunately as of late the Travis CI platform has been letting us down. I've moved several other repos to Github Actions because the Travis CI open source build queue time was getting far too long. More importantly the "cron" build that is scheduled to run the autopull script once a day has not been running reliably. This is causing updates to arrive late (when I remember to kick the builds by hand...).

Many open source projects are replacing Travis with Github Actions and I think it makes sense here too.

Github Actions

Replacing Travis with Github Actions has several advantages:

  • The configuration is entirely in this repo, not spread across multiple repos & services.
  • It removes a dependency on an external service & their billing/product changes. We already depend on Github.
  • Better automation. The create-pull-request action is clever enough to not open duplicates. The old Travis autopull automation would create duplicate PRs if they weren't merged fast enough.
  • The Github actions cron schedule is much more flexible. This PR configures the workflow to run once an hour vs once a day with Travis.

Bot Account

This branch configures create-pull-request to open PRs from a fork using a Github personal access token (PAT) stored as a repository secret env var called "BOT_PAT".

This will need to be set up by a repository administrator by following the docs for creating an encrypted secret for a repository.

The PAT must correspond to the Github account that owns the fork of the repository configured in the push-to-fork setting of the workflow.

I have configured the push-to-fork value to tld-update-bot/list on the assumption we'll be adding @tld-update-bot's PAT to the BOT_PAT secret var. If the maintainers would prefer to create their own Github account please make sure to update this piece of config to point to the new account's fork. Otherwise we should exchange the @tld-update-bot secret PAT value out of band using whatever is least painful for you (GPG, encrypting to SSH keys, quantum key exchange, smoke signals).

To be 100% explicit: in either case adding the PAT to the repo config as an encrypted env var does not grant any access to the repository to the bot. The bot opens PRs from a fork to avoid needing push access to this repository. No extra access should be configured.

I'll also note that it is possible to not have any bot account, PAT, or fork involved by using the Github Action's default token. This is how I implemented similar logic for another open source repo. It has one important caveat: without a dedicated bot github account the PRs opened by the workflow won't in turn run further workflows (e.g. unit tests, etc). I discuss some of this in detail in this issue. For this repository because the data format is more complex & not a Go datastructure I think it's important to maintain the separate bot account + full CI workflow potential for the automation PRs.

Future Work

It would be nice to move the repository CI from Travis to Github Actions as well. I suspect it will be a better experience overall. I would be willing to take this work on but thought it was best to put one toe in with this automation before revamping anything else.

This commit adds a Github Actions workflow called `tld-update.yml` that
uses `tools/patchnewgltds` to create pull requests from a fork to update
the `public_suffix_list.dat` gTLD section automatically.

It replaces a similar mechanism that used a separate Github repo[0],
a bash script[1], and Travis CI[2] to accomplish the same. Travis has
not been running the job reliably and the Github Action version is more
self-contained, and improves on several other pain points (e.g. it won't
open duplicate PRs).

                      !! Important !!
To operate correctly a secret env var named "BOT_PAT" must be configured
with a Github personal access token for a bot user that has forked the
repository to the same place as configured in `push-to-fork` in
`tld-update.yml`.
                      !! Important !!

[0]: https://github.com/cpu/psl-autopull
[1]: https://github.com/cpu/psl-autopull/blob/master/autopull
[2]: https://travis-ci.com/github/cpu/psl-autopull
@cpu
Copy link
Contributor Author

cpu commented Dec 13, 2020

@dnsguru @weppos @sleevi Would appreciate your thoughts on this PR when you have a chance to review. I'll continue to kick the old automation until we've all had a chance to discuss what the best path forward is based on the description in this PR.

Thanks!

@cpu
Copy link
Contributor Author

cpu commented Dec 13, 2020

Also worth mentioning that I tested it successfully: cpu#1

I had to cheat slightly and comment out the push-to-fork setting and add @tld-update-bot as a contributor to my fork. Github won't let you fork a fork so otherwise the bot wasn't able to push push-to-fork to open an automated PR on my fork of the list.

Copy link
Contributor

@sleevi sleevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @cpu! Just double checking my understanding on a few bits, but I think we're good to merge (and I too welcome our GitHub Action overlords)

.github/workflows/tld-update.yml Outdated Show resolved Hide resolved
.github/workflows/tld-update.yml Show resolved Hide resolved
@dnsguru
Copy link
Member

dnsguru commented Dec 14, 2020 via email

@cpu
Copy link
Contributor Author

cpu commented Dec 14, 2020

I just have a metaQ - will this still mean we're manually reviewing the PRs and squash/merging ?

@dnsguru That's correct - this side of the PR workflow will remain unchanged.

@dnsguru
Copy link
Member

dnsguru commented Dec 21, 2020

Does this need merging, or other next steps?

@cpu
Copy link
Contributor Author

cpu commented Dec 21, 2020

Does this need merging, or other next steps?

@dnsguru I was going to take a pass at @sleevi's nit about the date. I have some holiday time this week and should be able to push a commit shortly.

Afterwards I believe it is ready to go. We will still need to coordinate on adding the @tld-update-bot's PAT to the repo configuration. @dnsguru Are you the best person to ask to do that part?

Thanks!

The `tld-update.yml` workflow needs the current date in a human readable
form to include in commit messages, PR titles, and the PR body.
Previously this was achieved in a clunky way with an env var echoed into
`$GITHUB_ENV`.

This commit replaces that logic with sleevi's cleaner recommendation:
setting the date string as the output of a step and then referencing it
later in the workflow.
@cpu
Copy link
Contributor Author

cpu commented Dec 21, 2020

I was going to take a pass at @sleevi's nit about the date. I have some holiday time this week and should be able to push a commit shortly.

All set: ff6c3ba.

@dnsguru
Copy link
Member

dnsguru commented Jan 8, 2021

@sleevi happy new year. How is this looking? Would like to merge this and we need to ensure it is automated as well.

@sleevi
Copy link
Contributor

sleevi commented Jan 12, 2021

OK, I've added the BOT_PAT secret, and am now going to try merging and hoping it works :)

@sleevi sleevi merged commit 9d89164 into publicsuffix:master Jan 12, 2021
@cpu cpu deleted the cpu-gh-actions-tld-update branch January 12, 2021 19:04
@cpu
Copy link
Contributor Author

cpu commented Jan 12, 2021

Porting my comment from the first bot PR

Upside: we know the PAT is correct and the action is executing correctly.

Downside: it's revealed one bug. The previous automation had special case logic to discard these "header date update only" diffs. The new automation does not and so opened a PR that's a no-op :-/

@sleevi @dnsguru How would you like to proceed? I'm slammed this week with other commitments and likely won't have a chance to fix this bug for a few days. Do you want to revert 9d89164 until then so we don't get spammed with silly PRs?

Other options: if you leave #1183 open the bot will just keep updating it with new timestamps/commits but never create a new PR.

sleevi added a commit that referenced this pull request Jan 12, 2021
Disable this workflow until we can sort through #1166 (comment)
@sleevi
Copy link
Contributor

sleevi commented Jan 12, 2021

@cpu I went ahead and just renamed the file in e69dbb9 so that it won't run :)

@cpu
Copy link
Contributor Author

cpu commented Jan 12, 2021

@sleevi Sounds good to me. I'll ping folks when I have a fix worked out.

If anyone wants to poke at it before I get a chance I went back and figured out how the old version handled this and it's pretty hacky. The part of the script that checks for diffs is using git diff --numstat and filtering on the output that matches the "header only" update with the expectation those are always a 1 line diff (vs content diffs that would be 2+).

@dnsguru
Copy link
Member

dnsguru commented Jan 12, 2021 via email

@cpu
Copy link
Contributor Author

cpu commented Feb 3, 2021

Haven't forgotten about picking this work up again and fixing this last bug. Hoping to get a chance to take a look this coming weekend 🤞

@cpu
Copy link
Contributor Author

cpu commented Feb 7, 2021

Hoping to get a chance to take a look this coming weekend

Take a peek at #1204 - It ended up being a little bit more involved than I was hoping but I'm happy with the outcome in that branch.

weppos pushed a commit that referenced this pull request Feb 18, 2021
== tools: improved newgtlds.go, removed replace-between. - 

d996dad

Previously the `tools/newgtlds.go` utility was difficult to use with the new Github action based pull-request workflow because it _always_ updated the timestamp in the header comment in the gTLD section of `public_suffix_list.dat` when it was run, even if no gTLD data changed. This meant there was a diff in the workdir after the tool ran that would cause a PR to be opened by the action. See #1166 for more discussion/background.

In the old travis version of the automation I used a crude shell pipeline to exclude the header comment when deciding if there was a diff or not. With the new action it's not possible to change how the diff status is determined (without re-inventing the pull request action in-repo). So instead we need to abandon the timestamp in the gTLD section header comment (that would be sad) or, we need the script to be idempotent when there's no true data change.

To make the tooling smarter about when it makes changes I reworked `newgtlds.go` to be able to consume the existing `public_suffix_list.dat` and to split apart the data templating and the header comment templating. Now the tool can exclude the header comment and compare against the existing data to determine if there is a real data change or not. If there's no change in the data then the file is left untouched. If there is a change then the file is updated and only then is the header comment with the date stamp updated along with the new data. I've included unit tests that get 80% statement coverage of this new tooling and also tested it locally with success. Conveniently this rework also lets us remove the `tools/replace-between` Perl script reducing the number of languages in play for this process.

== CI: re-enable github tld-update workflow - 

7856839

Now that the tooling won't produce PRs without meaningful data updates we can re-enable the `tld-update.yml` workflow.

== CI: run go unit tests in tld-update workflow - 

f6749a5

Before using the `tools/newgtlds.go` tooling to open a PR we should run the unit tests to make sure the code works the way we expect. This will also help catch any bitrot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants