Better document the PSL roadmap and needs #671

Open
sleevi opened this issue May 25, 2018 · 25 comments

@sleevi
Contributor

sleevi commented May 25, 2018

If we look at where things historically were, the PSL emerged from three distinct needs:

  1. Registries, particularly ccTLD registries, had registration policies and designs different from how gTLDs were operated. The best example of this is .co.uk: the .uk namespace was separated into a set of 2LD groupings that were organized similarly to how the gTLD namespace was organized at the time, so .com / .net / .org translated to .co.uk / .net.uk / .org.uk.
  2. Registered domains that themselves served as domain registries. This included a spectrum of participants - looking at one of the earlier versions (circa 2010) shows this included Registrars that explicitly registered domains underneath their hierarchy (e.g. CentralNic with ar.com or gb.com - acting as ccTLDs within the .com space, .gb.net in .net, etc.).
  3. Hosting providers - The version from 2010 only had three, AFAICT - operaunite.com, appspot.com, and blogspot.com. This was a rather late addition - two of these were only added in 2010, operaunite.com slightly before then.

The registry data was almost entirely reported by the PSL maintainers, chasing down registry operators. Registered domains acting as domain registries were included largely due to CentralNic, a popular Registrar that also operated or partnered with several ccTLDs, and thus the data was incidentally picked up. The third case - hosting providers - was not really imagined in the PSL's creation, although it's come to dominate the number of changes to the PSL today.

The PSL has had some growing pains along the way - the opening of the gTLD space by ICANN meant that maintaining registry data was no longer something the PSL maintainers could do alone, because the sheer number of new registries prevented effective and ongoing maintenance. Registries started to be added by script, and the manual curation of existing records stopped receiving much dedicated time.

A number of dynamic DNS providers were added, which are in a similar-but-not-identical case as the second - there are generally no WHOIS services being provided, and registration policies are a bit ad-hoc, but both are aligned in that they provide vanity suffixes for registrants.

The growth of Internet services (and the centralization onto common platforms) has driven a significant amount of churn in the third case. New providers come up and old providers wither away, and the maintenance of that list is done almost exclusively based on self-reporting, with some basic automation before addition (the TXT records), since it's no longer possible to scale the investigative analysis that every PSL change previously got.

As the PSL itself has grown, consumers have had to dramatically alter how they consume the list - filtering out some use cases (such as the third), pushing for more information to be included for the first two use cases, or even rewriting the data structures used, going from static lists to hash lists to tries (compressed or full). Each big growth spurt of the PSL has forced some change for consumers.
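To make the data-structure point concrete, here is a minimal sketch of the trie approach, keyed on labels read right to left. It is an illustration only, not how any particular consumer implements it: the class name `SuffixTrie` and its method are hypothetical, and wildcard and exception rules are deliberately left out.

```python
from typing import Dict, Iterable, Optional


class SuffixTrie:
    """Hypothetical trie keyed on domain labels, read right to left (TLD first)."""

    def __init__(self, rules: Iterable[str]) -> None:
        self.root: Dict[str, dict] = {}
        for rule in rules:
            node = self.root
            for label in reversed(rule.lower().split(".")):
                node = node.setdefault(label, {})
            node[""] = {}  # the empty key marks the end of a complete rule

    def longest_suffix(self, domain: str) -> Optional[str]:
        """Return the longest stored rule that is a suffix of `domain`, if any."""
        labels = domain.lower().rstrip(".").split(".")
        node = self.root
        best = None
        for depth, label in enumerate(reversed(labels), start=1):
            if label not in node:
                break
            node = node[label]
            if "" in node:
                best = ".".join(labels[-depth:])
        return best
```

A lookup walks at most one node per label, so its cost depends on the depth of the name rather than on how many rules the list holds, which is presumably why consumers moved in this direction as the list grew.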

Similarly, the adoption of the third case has increased the rate of change in the PSL. While previously the first case could be largely met by a static list updated annually, supporting the second and third cases means that changes on the order of days are at times necessary for consumers, as otherwise domain holders can't use certain features or they don't work correctly.

The PSL is thus at an inflection point - supporting all of these use cases means that its pace of change and its growth rate are no longer sustainable for the use cases and consumers it supports, and every new use of it brings greater overall risk into the ecosystem.

We thus need to figure out a roadmap for how the PSL will be maintained and scale, what use cases it will consider and not consider, and if and how to wean existing consumers off it, in the search for better solutions.

@pzb

pzb commented May 25, 2018

In addition to your list of issues, I would add that the failure to rely upon * as the default rule has caused numerous issues. With the gTLD program, full TLDs are coming and going way more often than once a year. Because the PSL has included all TLDs, even if they are simply duplicative of the * rule, it is being used by programs in lieu of the root zone file to get a TLD list.
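As a rough illustration of why an explicit TLD entry is duplicative, the sketch below applies the published matching rules (longest match wins, exception rules prevail, and `*` applies when nothing matches) over a plain set of rule strings. The function name `public_suffix` and the sample rules are chosen for this example only.

```python
from typing import Set


def public_suffix(domain: str, rules: Set[str]) -> str:
    """Return the public suffix of `domain` under a set of PSL-style rules."""
    labels = domain.lower().rstrip(".").split(".")
    best_len = 1  # implicit "*" rule: with no match, the last label is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if "!" + candidate in rules:
            # Exception rule: the suffix is the rule minus its leftmost label.
            return ".".join(labels[i + 1:])
        parent = ".".join(labels[i + 1:])
        wildcard = "*." + parent if parent else None
        if candidate in rules or (wildcard is not None and wildcard in rules):
            best_len = max(best_len, len(labels) - i)
    return ".".join(labels[-best_len:])
```

With this logic, a name under a TLD that is absent from the rule set gets the same answer as it would if the TLD were listed, so a consumer that implements the default rule does not need the PSL to enumerate every TLD.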

@dnsguru dnsguru self-assigned this May 29, 2018
@dnsguru dnsguru removed their assignment Jul 26, 2019
@dnsguru dnsguru added the waiting-followup Blocked for need of follow-up label Jul 26, 2019
@dnsguru
Member

dnsguru commented Feb 15, 2020

@sleevi @weppos I haven't a sense that we have gotten anywhere on this (likely due to #dayjobs), but I have made some great headway with respect to how we engage with other entities in the ICANN and domain space.

I have worked with the ICANN Office of the CTO team on helping create a document to be distributed within the IANA to ccTLD and gTLD administrators to help elevate their awareness of the PSL, and we'll be presenting this at the ICANN 67 meeting in Cancun, Mexico in March of 2020.

I believe that this will help us with an improved quality of the requests that come in to the ICANN section at the top of the file.

Where I think we're suffering is the PRIVATE section and the increasing volume of requests that are hitting us. We'd proposed splitting the file at that horizon, and I think it is a good idea.

IF we did that, we need to prepare people for it. It seems to me the place that would get folks to notice would be in the header sections or adding a new comment line or two in the file itself.

@sleevi
Contributor Author

sleevi commented Feb 15, 2020 via email

@peterthomassen
Contributor

I suppose it would be better to not split the list, unless there is a demand by those who want to treat the sections differently (there is a CA use case given on the PSL web site). If there is no such demand, why bother?

@dnsguru
Member

dnsguru commented Feb 15, 2020

Jothan: It might be useful to focus on the problem you’d like to solve, rather than the solution.

I'll back off on the idea... that's just a bias for results within me fighting to help this project thrive.
@sleevi sounds like maybe splitting the file would not be something we would place in the roadmap. I had seen the scaling issue represented as a design concern, and candidly the PUBLIC section seems like it is where the majority of the expansion is occurring. On the one hand I see expressed that the file size is increasing and as I review the PR / Issues, the majority seem to be focused in the PRIVATE section. If this is less of an issue than I see, I really don't have a hill to defend or die on for this.

Both as a maintainer and as a consumer, I don’t believe there is any benefit to be had at all from splitting the file, and that it would do more harm than good.

Completely see this perspective. I would not want in any way to introduce disruption.

That said, I’m probably missing something important that you’re concerned about, and so I’d want to make sure we got that documented, before discussing a solution. It would probably be good to open up as a separate issue for the specific problem(s) you see, which we can reference here, so that the roadmap solution is “Solve Problem X” rather than “Do Thing Y”

I think the challenge here, for all of us as volunteers, is the #dayjobs vs time to invest in the architectural stuff and writing up documentation.

Clearly, though we have a mailing list and the ability to communicate via github or dm, we can discuss things, but I wonder if we might benefit from some form of ability to announce stuff like changes or proposals and/or poll the integrators/users about their biases.

@sleevi
Contributor Author

sleevi commented Feb 15, 2020

I had seen the scaling issue represented as a design concern, and candidly the PUBLIC section seems like it is where the majority of the expansion is occurring. On the one hand I see expressed that the file size is increasing and as I review the PR / Issues, the majority seem to be focused in the PRIVATE section. If this is less of an issue than I see, I really don't have a hill to defend or die on for this.

Right, every known consumer wants both, so splitting doesn’t solve any problems for consumers. It also doesn’t reduce the number of PRs, and having changes go to different files just increases complexity without compelling benefits (at least, AIUI; if there are overlooked benefits, we should nail them down)

Clearly, though we have a mailing list and the ability to communicate via github or dm, we can discuss things, but I wonder if we might benefit from some form of ability to announce stuff like changes or proposals and/or poll the integrators/users about their biases.

It seems like we have that already, as you mention? It’s not clear to me what would be missing in that?

@dnsguru
Member

dnsguru commented Feb 25, 2020

From my POV we closed the discussion on splitting the file into two sections - just using my leaf blower on the remnants of the chalk dust from the outline of that horse.

Moving on

...Announcements/Polls

It seems like we have that already, as you mention? It’s not clear to me what would be missing in that?

To answer that, let's journey back to the initial issue -

We thus need to figure out a roadmap for how the PSL will be maintained and scale, what use cases it will consider and not consider, and if and how to wean existing consumers off it, in the search for better solutions.

IF we embark on that type of roadmap dialog, should we not engage the integrators, users and consumers of the PSL? I assert that most of them blindly download the .dat file and are not on mailing lists or monitoring this on github.

I am not saying or recommending we do it, but it seems that the most effective manner to reach the largest number of PSL interested parties might be to tweak the file header to include announcements of some sort in a manner that would let us engage them w/o breaking stuff.

@dnsguru
Member

dnsguru commented Feb 26, 2020

I am closing a few lingering issue reports, and have caught some meta issues that I'll document in issues which we could incorporate into a roadmap concept and reference them in this Issue

@weppos
Member

weppos commented Feb 26, 2020

I agree with @sleevi that splitting would not solve the problem of the management of the private section. It may solve other problems, but I find myself in great agreement with @sleevi's statement

“Solve Problem X” rather than “Do Thing Y”

I strongly believe automating the submission and validation process is the key. I do have some proposals on how to make it happen leveraging a slightly revised version of the DNS validation we use today. I hope to be able to find the time to make a prototype.

In short terms, I'd like to:

  1. Adjust the current DNS-based validation to be self-referencing: the main blocker for an automated process is that the DNS entry we require today references a GitHub ticket. You need to have a ticket to add the DNS entry, and we often see people opening the PR with the changes, getting the ID, then adding the record. This is not practical from the automation POV. Ideally, the DNS entry should be self-referencing so that any automated tool can validate it.
  2. Build a tool that can be run, given a set of hostnames, and that will perform the necessary DNS-based validation (similar to what Let's Encrypt is doing today with the DNS challenge... although much simpler)
  3. Automate the validation, and possibly the submission
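A tool along the lines of step 2 could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the `_psl.` record prefix mirrors the TXT convention mentioned elsewhere in this thread, `expected_token` stands in for whatever self-referencing value ends up being chosen, and `lookup_txt` is an injected resolver function rather than a call into any particular DNS library.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical resolver signature: TXT record name in, list of TXT strings out.
# It is expected to raise LookupError when the name has no TXT records.
TxtLookup = Callable[[str], List[str]]


def validate_hostnames(
    hostnames: Iterable[str],
    expected_token: Callable[[str], str],
    lookup_txt: TxtLookup,
) -> Dict[str, bool]:
    """Check each hostname for a TXT record carrying its expected token."""
    results: Dict[str, bool] = {}
    for host in hostnames:
        name = "_psl." + host.lower().rstrip(".")  # assumed record location
        try:
            values = lookup_txt(name)
        except LookupError:
            values = []  # a missing record counts as "not validated"
        results[host] = expected_token(host) in {v.strip() for v in values}
    return results
```

Keeping the resolver and the token function as parameters means the same check could run from a CI job or locally by a submitter before opening a PR, and it can be exercised without touching the network.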

@dnsguru
Member

dnsguru commented Feb 26, 2020

I strongly believe automating the submission and validation process is the key. I do have some proposals on how to make it happen leveraging a slightly revised version of the DNS validation we use today.

Could this be leveraged for automation of removals at some point?

@dnsguru
Member

dnsguru commented Feb 26, 2020

Adjust the current DNS-based validation to be self-referencing

What would this look like? The LE process deals with a "token" they provide, which for all intents and purposes plays the same role as the # of the PR within the _PSL txt record. That number currently helps indicate to me that there is a tether to the PR.

@sleevi
Contributor Author

sleevi commented Feb 26, 2020 via email

@dnsguru
Member

dnsguru commented Feb 26, 2020

I am all for automating as much as we can, and leveraging the DNS infrastructure where possible for it.

Not trying to go too far down the road on being prescriptive but the RFC 8552 stuff that a zone admin might add could hold a txt record that matches the git user handle so we could know who's an authoritative rep

@peterthomassen
Contributor

The Git hash of the modified version of the PSL, for example. You could compute that prior to submitting the PR by making your modifications against the current HEAD.

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.
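As a sketch of the suffix-derived token described above: the exact layout of the hashed string (`psl:<version>:<suffix>`), the choice of SHA-256, and the function name `psl_token` are all assumptions made for this example, since the comment only says "a hash of the concatenation of psl: and the suffix, maybe with a version tag".

```python
import hashlib


def psl_token(suffix: str, version: str = "v1") -> str:
    """Derive a verification token from the candidate suffix alone."""
    normalized = suffix.lower().strip(".")
    # Assumed layout: fixed "psl:" prefix, a version tag, then the suffix.
    payload = f"psl:{version}:{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the token depends only on the suffix and the version tag, an admin could compute and publish it before any PR exists, and it would stay valid if the PR is later amended or resubmitted.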

@sleevi
Contributor Author

sleevi commented Feb 27, 2020

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.

I don’t think stability is necessarily a goal here. The goal is to be able to quickly authenticate a pull request, which is why the current method uses the PR number. It’s fairly common for a PR to modify multiple domains.

To be clear, it is not that someone needs to continually be updating this value. The primary objective is merely authenticating the PR.

@weppos
Member

weppos commented Feb 27, 2020

Could this be leveraged for automation of removals at some point?

Possibly, yes.

Regarding how this is going to work, I'm working to get a proposal out for feedback. I am trying to stay away from using anything connected to how we manage the list. In other words, using something related to Git will make the process strictly tied to how we use Git today, similar to the fact the DNS TXT today references the GitHub repo.

I am more inclined to find something that doesn't require any extra shared state besides the suggested PSL change and the hostname itself. That would be sufficient, in combination with the fact the authentication is ultimately whether the user can edit the DNS records or not.

@peterthomassen
Contributor

peterthomassen commented Mar 3, 2020

I suppose the objective is that such a string would never end up in the DNS, unless so intended by an authorized admin. That can also be achieved by using a hash of the concatenation of psl: and the suffix, maybe with a version tag. That way, the hash does not depend on the state of HEAD, but only on the suffix itself, which would appear more stable to me.

I don’t think stability is necessarily a goal here. The goal is to be able to quickly authenticate a pull request, which is why the current method uses the PR number. It’s fairly common for a PR to modify multiple domains.

True. In an earlier comment, it was said that the PR number should be replaced by something self-referencing. The question is what "self" should be: Should it identify the change, or should it identify the candidate public suffix at which the record is added?

The latter has the advantage that, if changes in a PR are required, that would not invalidate the verification records configured in the DNS prior to PR submission; they could be reused for a changed or even a completely new PR. That is what I meant with stability; I did not mean long-term stability for continued verification.

The same goal can be achieved by allowing the hash of any commit within the PR as a verification token. The invalidation problem upon PR changes can then be avoided by adding changes as new commits, squashing them at merge time. However, I think that's more complicated for users.
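The commit-hash variant could be checked roughly as follows. This is only a sketch under the assumption that the list of commit hashes for the PR has already been obtained by some other means; the function name is hypothetical.

```python
from typing import Iterable


def txt_matches_pr_commit(txt_values: Iterable[str], pr_commit_hashes: Iterable[str]) -> bool:
    """Accept the PR if any published TXT value equals any commit hash in the PR."""
    commits = {h.strip().lower() for h in pr_commit_hashes}
    return any(v.strip().lower() in commits for v in txt_values)
```

Under this scheme, a follow-up commit does not invalidate a record that names an earlier commit, as long as that earlier commit stays in the PR history until merge.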

I have no stakes in this, I simply proposed this because I thought it covers the requirements (as far as they are known to me) and is suitable to reach the goal with minimal friction.

@benaubin
Contributor

benaubin commented Apr 14, 2020

A significant majority of the entries in the PRIVATE section of the list are simply entries of a domain without any use of the list's special features or syntax. Inclusion on the list is simply used as a signal that subdomains are untrusted, mostly for cookie security.

A list entry is basically a reflection / descriptor of a domain's DNS configuration. To enable automatic updates to the list, each domain's entry could be stored inside a TXT record on the domain.

An automated system could automatically update the list by checking the TXT record. There would be no need for tokens - presence of the record would be enough authentication to indicate authorization.

The ability to manage DNS is enough to indicate intent to be on the list, as having the authorization to manage DNS is the authorization you need in order to manage DNS records of subdomains.

Further, I don't necessarily think there needs to be a central list of private domains. I can't think of any reasonable use case where the PRIVATE list is used for anything except lookups - there's no need for enumeration.

We could instead standardize a DNS record indicating the status of a domain as a "public suffix." I'm not 100% sure what to call it, but maybe something like PUDI (for "public unrestricted domain issuance") or maybe DTP ("domain trust policy"). Consumers of the current list would instead retrieve the record from DNS.

This wouldn't be a ton of overhead - DNS is very fast and designed for pretty much this use-case (at its essence, it's a distributed hosts list).

For example, on first connection to a domain, browsers could request and cache that record, and use its value to enforce cross-origin policies. DNS lookups add relatively low latency to requests (which already require a network connection), and the result is cacheable.
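A minimal sketch of the request-and-cache behavior described here, assuming a hypothetical `fetch` function that returns the record's value together with its TTL in seconds. No such record type exists today, so the names and the boolean value are illustrative only.

```python
import time
from typing import Callable, Dict, Tuple


class SuffixPolicyCache:
    """Cache per-domain "is this a public suffix?" answers until their TTL expires."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[bool, float]],  # hypothetical DNS query
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_public_suffix(self, domain: str) -> bool:
        now = self._clock()
        cached = self._entries.get(domain)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no network round trip
        value, ttl = self._fetch(domain)
        self._entries[domain] = (value, now + ttl)
        return value
```

Only the first lookup for a domain, and lookups after expiry, would pay for a DNS query. The sketch does not address what a client should do when the query fails or while it is offline.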

I can't think of a use-case where enumeration or offline lookups are required - and I think standardizing a DNS record would be a much more maintainable strategy.

Alternatively, using DNS records as a basis for generating the list would mean no complex authorization schemes. All that would be required would be submitting a domain to an automated system which compiles the list from the records. That said, thought would have to be put into anti-spam/abuse protection for the automated system.
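The compile step could be sketched as below. The `psl-entry=` record format, the function name, and the injected `lookup_txt` resolver are assumptions for illustration. The one safeguard shown, rejecting rules that are not at or below the submitting domain, is a minimal abuse check and not a complete answer to the spam concern.

```python
from typing import Callable, Iterable, List

PREFIX = "psl-entry="  # hypothetical TXT record format


def compile_private_section(
    domains: Iterable[str],
    lookup_txt: Callable[[str], List[str]],  # raises LookupError if no TXT records
) -> List[str]:
    """Build the list of rules from TXT records on the submitted domains."""
    entries = set()
    for domain in domains:
        domain = domain.lower().rstrip(".")
        try:
            values = lookup_txt(domain)
        except LookupError:
            continue  # record gone: the domain simply drops out of the list
        for value in values:
            if not value.startswith(PREFIX):
                continue  # unrelated TXT record (SPF, site verification, ...)
            rule = value[len(PREFIX):].strip().lower()
            bare = rule[2:] if rule.startswith("*.") else rule
            # A domain may only declare rules at or below itself.
            if bare == domain or bare.endswith("." + domain):
                entries.add(rule)
    return sorted(entries)
```

Re-running the compile step periodically would also handle removals, since a domain whose record disappears no longer contributes an entry.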

@sleevi
Contributor Author

sleevi commented Apr 14, 2020 via email

@benaubin
Contributor

Thanks for the links, @sleevi! The design of the internet is fascinating to me. Especially interested in HTTP State Tokens as an alternative to cookies.

Will definitely read more of the archives from the DBOUND list, but I'm glad more experienced people than myself have already considered that option.

Anyways, would it be possible to use similar dns records to automate maintenance of the list itself? That would at least answer the "when can we remove entries?" question.

However, I understand additions to the list are obviously much more difficult in order to prevent abuse. What if there was an automated system which charged a nominal fee ($25?) similar to domain registration? The funds could go to IETF or a similar non-polarizing public-benefit organization and be used to dissuade people from using the list to circumvent things such as Let's Encrypt's rate limiting and, with guidance, help a requestor better understand the economic impacts of addition. A manual review option could be available for projects that could not afford the fee.

Plus, money changing hands through conventional means almost always leads to auditable identity and accountability in cases of abuse.

@peterthomassen
Contributor

Consumers of the current list would instead retrieve the record from DNS.

While your point is the operation of the PSL through the DNS, I wanted to point out that consumers can already use the DNS for querying the PSL, see https://publicsuffix.zone/.

@sleevi
Contributor Author

sleevi commented Apr 14, 2020 via email

@benaubin
Contributor

@sleevi That makes a lot of sense. I admire the work y'all do.

Let me know if there's anything I can do to help.

@dnsguru
Member

dnsguru commented Apr 14, 2020 via email

@dnsguru
Member

dnsguru commented Oct 9, 2020

Updates: See
