Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement rules using a trie data structure. #97

Merged
merged 8 commits into from
Sep 1, 2017

Conversation

remusao
Copy link
Collaborator

@remusao remusao commented Jul 23, 2017

Hi there,

I spent some time this week-end implementing a different representation for the rules: replacing the regex by a more efficient Trie data structure. This changes a lot of code, so I would understand that it might require more work to make "production-ready". Also, comments are pretty sparse at the moment, but I would be glad to provide more where you think it makes sense!

I implemented some benchmark to have an idea of the performance gain, and it looks promising (although I did not register the memory consumption, I don't expect it to be unreasonable). Each run is calling the benchmarked function on 100 random domains and I'm using the benchmark npm package (I can attach the code somewhere as well if you want to have a look):

Current version of the code on master (v2.0.0)

tldjs#tldExists x 3,162 ops/sec ±1.74% (92 runs sampled)
tldjs#getDomain x 35.82 ops/sec ±1.65% (62 runs sampled)
tldjs#getSubdomain x 35.77 ops/sec ±1.60% (62 runs sampled)
tldjs#getPublicSuffix x 37.39 ops/sec ±1.13% (64 runs sampled)

New implementation based on the Trie data-structure

tldjs#tldExists x 3,265 ops/sec ±0.83% (92 runs sampled)
tldjs#getDomain x 2,030 ops/sec ±3.08% (84 runs sampled)
tldjs#getSubdomain x 2,037 ops/sec ±2.85% (85 runs sampled)
tldjs#getPublicSuffix x 2,195 ops/sec ±3.53% (83 runs sampled)

So the speed-up is around x60.

It now seems that cleanHostValue is the bottle-neck, since if we make it a no-op, the results become:

tldjs#tldExists x 57,589 ops/sec ±2.23% (92 runs sampled)
tldjs#getDomain x 8,414 ops/sec ±0.75% (94 runs sampled)
tldjs#getSubdomain x 8,257 ops/sec ±0.77% (94 runs sampled)
tldjs#getPublicSuffix x 10,857 ops/sec ±1.27% (92 runs sampled)

I added a flag (dirtyHost) on each of these methods to allow disabling the cleaning (if you know that your host value is already clean and you want maximum performances).

While implementing this new representation, I stumbled upon some cases for which I'm not really sure of the good behavior. I would love to have your feedback on them.

Result if the public suffix does not exist

The following test-case: expect(tld.getPublicSuffix('www.freedom.nsa')).to.be(null); seems to expect the wrong result. According to the specification (https://publicsuffix.org/list/). When no matching rule is found, then "the prevailing rule is "*".", which from what I understand means that we consider the latest label of the hostname to be the public suffix. The result for this test would then be nsa instead of null. What do you think?

Result of exceptions

The following test-case: expect(tld.getPublicSuffix('www.www.ck')).to.be('www.ck'); should probably expect ck instead of www.ck as a public suffix. According to the aforementioned specification, the following should be done when an exception is matched (here: !www.ck):

If the prevailing rule is a exception rule, modify it by removing the leftmost label.

I guess we should then remove www (the leftmost label), resulting in a public suffix of ck.

Result when not public suffix is found

What should be the result of getDomain when no valid public suffix is found? Should it be null as well, or the value passed as host to the function?

Ignoring trailing and leading dots

The specification says that trailing and leading dots found on rules should be ignored. Although there is no such case at the moment, maybe we should handle them and add a test-case? I left a TODO in the parser.

Thanks for reading, and I hope this work can be useful to the project,
Have a nice evening.

@thom4parisot
Copy link
Owner

thom4parisot commented Jul 24, 2017

Hey @remusao,

thanks for the hard work! Figures are impressive 🙂
I will try to have a close look at it tomorrow—I am viewing houses in London at the moment and some transcripts take the rest of my time at the moment.

When no matching rule is found, then "the prevailing rule is "*".", which from what I understand means that we consider the latest label of the hostname to be the public suffix. The result for this test would then be nsa instead of null. What do you think?

Yeah, I feel it would be nicer to respect what the public suffix spec says — I was feeling frustrated about the methods, what they were doing and for what in #25 (#54 maybe). It started from "using public-suffix.org to enforce a verification of known TLDs" to "implementing PublicSuffix for cookies and stuff while trying to address the former intention".

Likewise for the rest of your proposals — you have done a better job at reading the spec than me ;-)

@remusao
Copy link
Collaborator Author

remusao commented Jul 24, 2017

Hey @oncletom,

Thanks for the answer. Let me know if I can do anything to ease the review (more comments, a tl;dr about the implementation details, or more direct talk are possible!).

Regarding the issues you mention, I was thinking about using only the ICANN section of the file to get the domain (and use the PRIVATE section only to get the public suffix). Do you think it could work?

@thom4parisot
Copy link
Owner

@remusao ICANN is a good idea! Or IANA (cf. #40) too – I don't know which one validates gTLD first. It would be useful if the Public Suffix list does not encompass all the gTLDs or with a major delay.

I found a place to hopefully Sunday I should be able to put my head around your proposal. Will let you know if I have any question or else — if other contributor(s) want to have a peak at it, they are obviously welcome to do so 🙂

@remusao
Copy link
Collaborator Author

remusao commented Aug 8, 2017

@oncletom Thanks for your reply! I understand that time is precious and this is a big PR. If I can do anything to make it easier to review, let me know.

@ctavan
Copy link

ctavan commented Aug 9, 2017

@remusao since I'm very interested in this PR getting merged I tried to verify the two cases that you highlight in the issue description.

www.freedom.nsa

I think your reasoning is right, the public suffix should be nsa. To verify the reasoning I simply took the definitions from https://publicsuffix.org/list/ and highlighted the essential parts:

Definitions

  • [ … ]
  • A domain is said to match a rule if and only if all of the following conditions are met:
    • When the domain and rule are split into corresponding labels, that the domain contains as many or more labels than the rule.
    • Beginning with the right-most labels of both the domain and the rule, and continuing for all labels in the rule, one finds that for every pair, either they are identical, or that the label from the rule is "*".

Algorithm

  1. Match domain against all rules and take note of the matching ones.
  2. If no rules match, the prevailing rule is "*".
  3. If more than one rule matches, the prevailing rule is the one which is an exception rule.
  4. If there is no matching exception rule, the prevailing rule is the one with the most labels.
  5. If the prevailing rule is a exception rule, modify it by removing the leftmost label.
  6. The public suffix is the set of labels from the domain which match the labels of the prevailing rule, using the matching algorithm above.
  7. The registered or registrable domain is the public suffix plus one additional label.
prevailing rule = *

Domain: | www | freedom | nsa |
Rule:   |     |         |  *  |

public suffix = nsa

www.www.ck

I think your reasoning is again right (public suffix should be ck):

  1. Match domain against all rules and take note of the matching ones.

    • Matching rules:

      *.ck
      !www.ck
      
  2. If no rules match, the prevailing rule is "*".

  3. If more than one rule matches, the prevailing rule is the one which is an exception rule.

    • Matching rule:

      !www.ck
      
  4. If there is no matching exception rule, the prevailing rule is the one with the most labels.

  5. If the prevailing rule is a exception rule, modify it by removing the leftmost label.

    • Modified matching rule:

      ck
      
  6. The public suffix is the set of labels from the domain which match the labels of the prevailing rule, using the matching algorithm above.

  7. The registered or registrable domain is the public suffix plus one additional label.

prevailing rule = ck

Domain: | www | www | ck |
Rule:   |     |     | ck |

public suffix = ck
registered domain = www.ck

@luciancor
Copy link

nice work @remusao !

@thom4parisot
Copy link
Owner

@remusao I have been thinking about the clean host vs. dirty host. I agree there is a need for cleaning if it is relevant only. But by changing the interface we can no longer trust the flag which states if the value is clean or not.

In my opinion, the cleanHost function should return a frozen CleanHostString object. Maybe like this:

class CleanHostString extends String {
  constructor(value) {
    super(value);

    this.hostValue = value instanceof CleanHost ? String(value) : cleanHost(value);

    // we ensure the object becomes immutable
    Object.freeze(this);
  }

  // returns the value when the conversion to String primitive happens, with `String(obj)` 
  valueOf() {
    return this.hostValue;
  }
}

What do you think?

@remusao
Copy link
Collaborator Author

remusao commented Aug 31, 2017

@oncletom Yeah the flag might not be the ideal solution. What about just asking users to clean the url before giving it to functions of tldjs? So that we simplify the API and all functions expect an already clean host.

Another option could be, as you suggest, to wrap clean host in a different type and check from each function if we are given an already clean host, or if the cleaning needs to happen? But then it's not propagated to the caller and we still might have redundancy. Although I guess it's nice because we hide all the complexity from the user and it's more efficient than what we have now (but then users cannot chose to not have any pre-processing at all, or do it themselves, is it something we want to offer?).

Also, on a side note, I started working on a more efficient version of cleanHost to limit its impact on the overall performance of the library.

Thanks for your answer! If you think that the best option is to wrap the clean host in a different type, I can implement it in this PR.

Also, I see you make use of class in your example, and it made me wonder what version of Javascript this library should support? es5, es6, something else?

lib/domain.js Outdated
hostTld = extractTldFromHost(host);
rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1}));
rule = getCandidateRule(host, rules);
// Check if `host` ends with '.' followed by one host specified in validHosts.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the trailing dot should be cleaned or kept as part of the clean host value?

If cleaned, I guess the following block can move to the cleanHost function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that the cleanHost would actually check if the host is part of the validHosts and return this value directly right?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I mean the trailing dot thingy appears in various places in the codebase. If the trailing dots are removed as part of the host cleaning process, it simplifies a lot various verification steps. Don't you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is confusing, but it's not the same trailing dot that we are dealing with here. Basically we just check if one element of validHosts is a valid general domain of cleanHost. But it's not enough to check if hostname.endsWith(vhost), so we need to be sure that either hostname === vhost or the character before hostname.indexOf(vhost) is .. I will add some comment and create a helper function to make it clearer.

lib/domain.js Outdated
rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1}));
rule = getCandidateRule(host, rules);
// Check if `host` ends with '.' followed by one host specified in validHosts.
for (var i = 0; i < validHosts.length; i++) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we loop over the validHosts array, I'd rather use validHosts.some()/find() and interrupt the loop as soon as we find the relevant vhost.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop is already exited early when we find a match with return vhost. Am I missing something here?

Copy link
Owner

@thom4parisot thom4parisot Aug 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, you are right! I missed the return statement and assumed you were using the vhost afterwards.

lib/domain.js Outdated
host.replace(new RegExp(rule.getPattern()), function (m, d) {
domain = d;
});
if (suffix.length === cleanHost.length) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this condition refers to? If the suffix is equal to the cleanhost, then there is no domain to return? (I wonder why the check is made upon the length rather than the value of the variables).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, if the suffix has the same length of the host itself then they are equal (and we save a string comparison), so there is no domain to return (i.e.: cleanHost is just a valid public suffix).

lib/domain.js Outdated
// google.fr (length 9)
// suffix = fr (length 2)
// 5 = 9 - 2 - 1 (ignore the dot) - 1 (zero-based indexing)
var lastDotBeforeSuffixIndex = cleanHost.lastIndexOf('.', cleanHost.length - suffix.length - 2);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't hesitate to make it a function that returns null or a CleanHostString.

@thom4parisot
Copy link
Owner

thom4parisot commented Aug 31, 2017

Also, on a side note, I started working on a more efficient version of cleanHost to limit its impact on the overall performance of the library.

👍 for it, and sorry for being so slow at reviewing!

If you think that the best option is to wrap the clean host in a different type, I can implement it in this PR.

It is just a hunch — I don't know what is the performance impact going to be.
I feel it might help keeping a clean API and to be descriptive about the undertaken actions in the codebase.

Also, I see you make use of class in your example, and it made me wonder what version of Javascript this library should support? es5, es6, something else?

I have not thought much about it — I realised only after publishing the comment the codebase was in ES5 😄 If the trie implementation does not break anything in the API, I'd rather keep it as is and later do a major bump to move to ES2015, when Node 8 is out for example.

@remusao
Copy link
Collaborator Author

remusao commented Aug 31, 2017

@oncletom No problem for the delay, thank you so much for reviewing. I will address your comment and add some changes.

One thing that remains is what to do regarding the two failing tests, should I change them to adopt a behavior as described by the specification?

I have not thought much about it — I realised only after publishing the comment the codebase was in ES5 😄 If the trie implementation does not break anything in the API, I'd rather keep it as is and later do a major bump to move to ES2015, when Node 8 is out for example.

Makes sense, I will keep the code compatible (the trie should not break anything).

@thom4parisot
Copy link
Owner

One thing that remains is what to do regarding the two failing tests, should I change them to adopt a behavior as described by the specification?

Yes, please do proceed as is 🙂 it is totally fine to fix bugs along the way!

@remusao
Copy link
Collaborator Author

remusao commented Aug 31, 2017

@oncletom Actually, after thinking a bit about this cleanHost preprocessing, and if we're ok making a bold mode but being totally clear about it, it would make some sense to me that we only have functions working on hostnames, and provide a function to extract the hostname of any url (cleanHost). In a way it would be conceptually simpler and function would just do one thing. On the other hand, that puts more responsibility in the hands of the user of the library since they have to make sure they only give hostnames to the different functions.

Another option could be to offer two versions of each function: getDomain + getDomainFromHostname, getSubdomain + getSubdomainFromHost, etc. But it does not look as good as the first option IMHO.

The only thing I'm not really confident with, is that we don't have a good way to tell the caller that what was passed in is not a valid hostname and they should clean it first. Which might lead to a lot of silent failures (we would just return null). But then such check would also incur a cost.

I need to think a bit more about it.

@ctavan
Copy link

ctavan commented Aug 31, 2017

I remember at least one part of a production project I was involved with, where Object.freeze() had a considerable negative performance impact. We ended up removing the frozen objects in the end. So should you consider adding immutability using Object.freeze() I highly recommend to benchmark the results with and without Object.freeze().

References:

On the sanitization topic: As a long-time user of this library I often found the interface a bit confusing with respect to whether it required a semantically valid hostname or not. I would prefer a clear contract here and would tend towards dropping all semantic validation/sanitization and leaving this part to the user (or to a different module) such that tld.js would be exclusively concerned with validating and parsing semantically valid hostnames/domains.

@remusao
Copy link
Collaborator Author

remusao commented Aug 31, 2017

@ctavan I agree with the use of Object.freeze, it's extremely slow and we can do otherwise.

I also agree on your second point regarding the API. What do you think @oncletom?

@thom4parisot
Copy link
Owner

thom4parisot commented Aug 31, 2017

Yeah, Object.freeze used to be slow but I did not know it applied to Object.prototype. Thanks for sharing this insight 🙂 Object.freeze does not have to be used, especially if it induced a performance hit.

I think the trie implementation and an API design update are two separate decisions. They have different impacts and we do not have to try to solve them at once; and especially in one single pull request. I'd rather ship a good performance improvement and break the API later on, for good developer experience reasons.

@ctavan when you write

I often found the interface a bit confusing with respect to whether it required a semantically valid hostname or not.

This is very valuable and I would be happy if we could make emerge the pain points. Maybe it is just a documentation issue. Maybe it is an API design issue. And it is certainly a domain/concept issues to clear out.

If you are okay with it, let's discuss API and concepts improvements in a new issue while we are still around here :-)

if we're ok making a bold mode but being totally clear about it, it would make some sense to me that we only have functions working on hostnames, and provide a function to extract the hostname of any url (cleanHost)

A bold move okay but in another PR then, or in an another issue so as we can talk about the design before implementing it.

@remusao
Copy link
Collaborator Author

remusao commented Aug 31, 2017

@oncletom Makes sense. Let's preserve the API for this PR and discuss a possible evolution in another issue. I will clean-up/comment the code a bit and address the concerns between now and tomorrow. Thanks again for the review and discussion!

@remusao
Copy link
Collaborator Author

remusao commented Sep 1, 2017

@oncletom I just updated the PR:

  • Addressed PR comments
  • Cleaned-up a bit
  • Added extensive comments
  • Increased test coverage to 100%
  • Re-introduced the use of cleanHost in all functions (let's discuss how to proceed with this in another issue as you suggested).
  • Changed the expected result for the 2 tests discussed above (regarding spec compliance)

Let me know if there are other things to change before it's ready for merge.

@thom4parisot
Copy link
Owner

@remusao great! Will have a look at it more in depth within the next hour or so — and might merge and release in the continuity of it. Great work. Thank you very much for taking the time to do it, to be open about the various thoughts and remarks 👍

@remusao
Copy link
Collaborator Author

remusao commented Sep 1, 2017

@oncletom Thanks a lot! Happy to help and improve this library. Shall I create an issue to start the discussion around host cleaning?

@thom4parisot
Copy link
Owner

@remusao and I am thankful for your help 🙂

Shall I create an issue to start the discussion around host cleaning?

Yes, please go ahead 👍

@thom4parisot thom4parisot merged commit 807eeff into thom4parisot:master Sep 1, 2017
@thom4parisot
Copy link
Owner

@remusao also, could you create a separate PR with benchmark as a devDependency and ./bin/benchmark as a script to run the benchmarks? Would be nice to compare further experimentations 🙂

@remusao
Copy link
Collaborator Author

remusao commented Sep 1, 2017

@oncletom Thank you for merging! :D

I will create the issue soon + detail the different options we discussed already, as a starting point for the brainstorm!

could you create a separate PR with benchmark as a devDependency and ./bin/benchmark as a script to run the benchmarks? Would be nice to compare further experimentations 🙂

Will do! Talking about benchmark, I see you added the numbers to the README, but I think we need to correct them. On this benchmark, each op is actually processing 100 hostnames (clean and no unicode, to reduce the impact of cleanHost). So the corrected numbers should be (x100):

tldjs#tldExists x 326,500 ops/sec ±0.83% (92 runs sampled)
tldjs#getDomain x 203,000 ops/sec ±3.08% (84 runs sampled)
tldjs#getSubdomain x 203,700 ops/sec ±2.85% (85 runs sampled)
tldjs#getPublicSuffix x 219,500 ops/sec ±3.53% (83 runs sampled)

I will improve the benchmarks to be more representative of real cases when adding the benchmark binary.

@olegpisklov
Copy link

Hey guys,

Regarding this new implementation of the getPublicSuffix method - this is a breaking change since now it returns a tld instead of null. So I suppose the major version of the library should've been updated.

@thom4parisot
Copy link
Owner

Hey, @Nekimola thanks for sharing your concern. I actually thought of doing a major bump. The reasoning for not doing so was the behaviour change is a bug fix, as the past implementation was wrong. You were probably relying on a bug rather than on the expected behaviour.

What were you trying to do when expecting a null result?

@olegpisklov
Copy link

Thanks for reply, I was checking if a public suffix of a domain doesn't exist.

@thom4parisot
Copy link
Owner

thom4parisot commented Sep 5, 2017

I was checking if a public suffix of a domain doesn't exist

@Nekimola Well, if you are checking for a public suffix, according to the spec and the following tests (to be found here), you can see an unlisted TLD actually returns a public suffix.

// Unlisted TLD.
checkPublicSuffix('example', null);
checkPublicSuffix('example.example', 'example.example');
checkPublicSuffix('b.example.example', 'example.example');
checkPublicSuffix('a.b.example.example', 'example.example');

If you want to check if a TLD exists, I'd suggest you use #tldExists instead.

Let us know if more precisions are needed.

@olegpisklov
Copy link

Yeah, I'm using tldExists method now, thanks again.

@remusao remusao deleted the faster-rules branch January 10, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants