
x/net/publicsuffix: automate periodic regeneration #15518

Open
yaslama opened this issue May 3, 2016 · 11 comments

@yaslama yaslama commented May 3, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    1.6.1
  2. What operating system and processor architecture are you using (go env)?
    linux/amd64
  3. What did you do?

I used golang.org/x/net/publicsuffix and it doesn't contain the latest version of the list.

  4. What did you expect to see?

An updated list.

  5. What did you see instead?

A list from 2016-03-01.

The list is changing more often than it used to. The trigger was Let's Encrypt, but people are beginning to understand the security implications beyond Let's Encrypt.

@bradfitz bradfitz changed the title Please update golang.org/x/net/publicsuffix/table.go from github.com/publicsuffix/list x/net/publicsuffix: update May 3, 2016

@bradfitz bradfitz commented May 3, 2016

@nigeltao, we should talk about this today.

@bradfitz bradfitz added this to the Unreleased milestone May 3, 2016
@bradfitz bradfitz self-assigned this May 3, 2016

@yaslama yaslama commented May 5, 2016

@bradfitz and @nigeltao, did you reach any conclusion about the update?
Thanks a lot.


@bradfitz bradfitz commented May 5, 2016

We talked about making some online serving mode where your program could fetch the latest data at runtime, periodically in the background, relieving us of the need to regularly rebuild this and litter the git repo's history with huge binary blob updates every week.

If somebody wants to write that mode, that'd be great.
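A minimal sketch of what such an online mode could look like. All names here (`fetchList`, `refreshLoop`, `current`) are hypothetical and not part of x/net/publicsuffix; this only illustrates the fetch-in-the-background, swap-atomically idea:

```go
// Hypothetical sketch: fetch the PSL periodically in the background and
// swap the new copy in atomically, keeping the old copy on error.
package main

import (
	"io"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

const listURL = "https://publicsuffix.org/list/public_suffix_list.dat"

// current holds the latest raw list; real code would hold a parsed structure.
var current atomic.Value // []byte

func fetchList() ([]byte, error) {
	resp, err := http.Get(listURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// refreshLoop re-fetches the list every interval. On error it logs and
// keeps serving the previous copy.
func refreshLoop(interval time.Duration) {
	for {
		if data, err := fetchList(); err != nil {
			log.Printf("publicsuffix refresh failed: %v", err)
		} else {
			current.Store(data)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Seed with a copy shipped in the binary, then refresh in the background.
	current.Store([]byte("// seed copy"))
	go refreshLoop(24 * time.Hour)
	// ... serve lookups using current.Load().([]byte)
}
```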


@weppos weppos commented May 26, 2016

FYI, to solve this problem I made available an alternative implementation that can parse a list dynamically: https://github.com/weppos/publicsuffix-go

There is an open ticket in the Let's Encrypt repo about using a different way to pull the PSL changes.
letsencrypt/boulder#1479


@yaslama yaslama commented May 26, 2016


@weppos weppos commented May 26, 2016

Indeed, wrong buffer. Thanks for pointing it out, I've updated the link.

@bradfitz bradfitz changed the title x/net/publicsuffix: update x/net/publicsuffix: automate periodic regeneration Apr 24, 2017
@bradfitz bradfitz removed their assignment Apr 24, 2017

@nigeltao nigeltao commented Jul 18, 2018

This came up again in a private conversation with a colleague.

As I understand the original request, "automate periodic regeneration", the idea is to subscribe to changes to the canonical https://publicsuffix.org/ data. When it changes, generate and commit the x/net/publicsuffix compiled form automatically. I don't know if there are any operational security or other administrative concerns about having a bot (or automated processes in general) committing code. I don't know if we already have bots pushing to go.googlesource.com. Perhaps @bradfitz would know?

To echo an earlier comment by @bradfitz, another cause for pushback is that every update causes non-trivial growth in the x/net git repository. Even if an individual upstream commit's diff is relatively small, the downstream x/net/publicsuffix commit's diff can be relatively large, since we pack the implicit tree structure in that text file into a dense index of nodes and a dense index of strings. Deleting one entry in the upstream text file will re-index every subsequent entry in the downstream form. As a data point, picking up only 7 weeks' worth of upstream changes:

$ git clone https://go.googlesource.com/net
Cloning into 'net'...
remote: Total 7752 (delta 5384), reused 7752 (delta 5384)
Receiving objects: 100% (7752/7752), 5.73 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (5384/5384), done.
$ du --bytes --summarize net
10841140	net
$ cd net/publicsuffix/
$ time go generate

real	0m4.048s
user	0m2.288s
sys	0m0.336s
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   table.go
	modified:   table_test.go

no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m blahblahblah
[master c7fbea3] blahblahblah
 2 files changed, 9743 insertions(+), 9749 deletions(-)
 rewrite publicsuffix/table.go (68%)
$ cd ../..
$ du --bytes --summarize net
10995803	net

The disk usage difference, 10995803 - 10841140, is 154663 bytes, or 1.4% of 10841140.
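The re-indexing effect can be shown with a toy encoding. This is not the real table.go format, only an illustration of why deleting one early entry shifts every later offset, so a small upstream diff becomes a large downstream one:

```go
// Toy dense-index packing: entries are concatenated into one string and
// addressed by numeric offsets, so removing an entry renumbers everything
// that follows it.
package main

import "fmt"

// pack concatenates entries and records each entry's starting offset.
func pack(entries []string) (text string, offsets []int) {
	for _, e := range entries {
		offsets = append(offsets, len(text))
		text += e
	}
	return text, offsets
}

func main() {
	_, before := pack([]string{"com", "co.uk", "org"})
	_, after := pack([]string{"com", "org"}) // "co.uk" deleted upstream
	fmt.Println(before) // [0 3 8]
	fmt.Println(after)  // [0 3] — "org" moved from offset 8 to 3
}
```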

One idea (not necessarily a good one, but I don't want to forget it) to cap git repo size concerns is to keep x/net/publicsuffix as the package name that callers should import, for back compat, but have its implementation delegate to x/publicsuffixvolume000 (yes, there's probably a better name), which is updated more frequently. Whenever that gets too big (for whatever the value of "big" is), roll it over to volume001, then volume002, etc. Over time, programmers who go get golang.org/x/net/publicsuffix might end up also cloning a moderately large git repo, but they shouldn't need to clone very large ones, nor should x/net grow too large with old PSL data, and they might be able to garbage collect no longer needed 'volumes'.

Back to solving the original problem, x/net/publicsuffix being stale, I don't know what e.g. Let's Encrypt does these days. Perhaps @weppos can say.

Again, as @bradfitz suggested above, we could add API where the caller brings their own copy of the PSL-as-a-text-file for the library to compile. Compilation used to take minutes, but now takes seconds, so that might be more feasible than it was a few years ago. We might have to cut the overall feature into multiple packages, so that people who just want the pre-compiled version don't have to import all the dependencies for dynamic generation, but if we want to do dynamic generation, that ought to be a solvable problem.
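A rough sketch of what a bring-your-own-list API could look like. All names here (`ParseList`, `List`, `PublicSuffix`) are hypothetical, and the sketch ignores wildcard and exception rules for brevity:

```go
// Hypothetical "caller brings their own PSL text file" API. A real
// implementation would compile the rules into a compact structure, as
// go generate does today, and handle "*." and "!" rules.
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// List is a parsed public suffix list.
type List struct {
	rules map[string]bool
}

// ParseList compiles the PSL-as-a-text-file supplied by the caller.
func ParseList(text string) *List {
	l := &List{rules: make(map[string]bool)}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "//") {
			continue // skip blanks and comments
		}
		l.rules[line] = true
	}
	return l
}

// PublicSuffix returns the longest matching rule for domain.
func (l *List) PublicSuffix(domain string) string {
	labels := strings.Split(domain, ".")
	for i := range labels {
		candidate := strings.Join(labels[i:], ".")
		if l.rules[candidate] {
			return candidate
		}
	}
	return labels[len(labels)-1] // default rule: "*"
}

func main() {
	l := ParseList("// comment\ncom\nco.uk\n")
	fmt.Println(l.PublicSuffix("example.co.uk")) // co.uk
}
```

Splitting this into its own package would let callers who only want the pre-compiled table avoid importing the dynamic-parsing code, as suggested above.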

To be clear, I don't have much spare time to work on this myself, but I'm writing this all down before I forget the conversation.


@weppos weppos commented Jul 18, 2018

Back to solving the original problem, x/net/publicsuffix being stale, I don't know what e.g. Let's Encrypt does these days. Perhaps @weppos can say.

They currently use my Go library
https://github.com/weppos/publicsuffix-go

I set up a process that does exactly what you described. It checks for changes, and automatically creates a PR for review.
https://github.com/weppos/publicsuffix-go/search?q=autopull&type=Issues

In my case, this is trivial because I don't generate a blob. I clean up and pre-process the list, and pre-compile a Go file for fast loading, but I don't go any further. As a result, diffs are small:
https://github.com/weppos/publicsuffix-go/pull/128/files

Furthermore, weppos/publicsuffix-go#128 was designed to allow loading an external list, with the exact intent of letting the consumer bypass the shipped version of the list. There are a couple of cases where this is a good idea: if the packaged list is outdated, or if you want to use a custom list (after all, the PSL is a format as well, so you can supply your own list).


@nigeltao nigeltao commented Jul 18, 2018

@weppos my colleague had some concerns about https://github.com/weppos/publicsuffix-go performance. Every List.Find call does a linear iteration over the entire list of rules, right? (Over 8000, by default?) Have any of your other users raised performance concerns? Is there any intention (an issue?) to build a fancier data structure (i.e. a tree)?


@weppos weppos commented Jul 18, 2018

@nigeltao the concern is absolutely valid, and the only reason I haven't flagged the lib as 1.x (it's still 0.x) is exactly that I want to change the lookup strategy.

My plan was to switch to the same hash-based lookup I implemented in the Ruby library (weppos/publicsuffix-ruby#133), which proved to be extremely efficient. I also ran tests and benchmarks with a trie-based lookup (weppos/publicsuffix-ruby#134), but there was little if any gain. I also did some research into using a DAWG/DAFSA, but it turns out that while it is extremely memory efficient (I tried to balance lookup performance against memory size), it works well only in cases where the list is not meant to be modified dynamically. In fact, it works well for the Chromium use case (the Chrome folks implemented their lookup with a DAFSA) but not for a more general-purpose library.

Peter from Amazon also reported another proposal that seems to provide further efficiency over the current version, which I still need to review: weppos/publicsuffix-ruby#143

All these versions provide good performance while still adopting a diff-efficient representation.

Long story short, I definitely want to use a more efficient data structure. To be fair, I haven't put much effort into it lately because nobody has really complained about performance. Still, it's on the roadmap. I should probably create a ticket on the repo and dump this info there (so it can also inspire contributions).
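The hash-based strategy mentioned above can be sketched roughly as follows (an interpretation of the idea, not the library's actual implementation): index the rules by their rightmost label, so a lookup only scans the small bucket of rules sharing the domain's TLD instead of all ~8000:

```go
// Sketch of a hash-based PSL lookup: bucket rules by their rightmost
// label, then linearly scan only that (typically tiny) bucket.
package main

import (
	"fmt"
	"strings"
)

// Index maps a rightmost label to the rules ending in it.
type Index map[string][]string

// Build groups rules by their last label.
func Build(rules []string) Index {
	idx := make(Index)
	for _, r := range rules {
		labels := strings.Split(r, ".")
		last := labels[len(labels)-1]
		idx[last] = append(idx[last], r)
	}
	return idx
}

// Find returns the longest rule matching domain, or "" if none matches.
func (idx Index) Find(domain string) string {
	labels := strings.Split(domain, ".")
	last := labels[len(labels)-1]
	best := ""
	for _, r := range idx[last] { // scans one bucket, not the whole list
		if (domain == r || strings.HasSuffix(domain, "."+r)) && len(r) > len(best) {
			best = r
		}
	}
	return best
}

func main() {
	idx := Build([]string{"uk", "co.uk", "com", "jp", "ac.jp"})
	fmt.Println(idx.Find("example.co.uk")) // co.uk
}
```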


@vanbroup vanbroup commented Aug 9, 2018

You might want to try https://github.com/globalsign/publicsuffix, which aims for performance close to https://golang.org/x/net/publicsuffix while adding the capability to update the PSL, similar to https://github.com/weppos/publicsuffix-go
