Implement rules using a trie data structure. #97

remusao · 2017-07-23T18:38:42Z

Hi there,

I spent some time this week-end implementing a different representation for the rules: replacing the regex by a more efficient Trie data structure. This changes a lot of code, so I would understand that it might require more work to make "production-ready". Also, comments are pretty sparse at the moment, but I would be glad to provide more where you think it makes sense!
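To make the idea concrete, here is a minimal sketch of a trie keyed on hostname labels, read right to left. It is not the code from this PR: the names insertRule, buildTrie and lookupSuffix and the three-rule list are hypothetical, and wildcard rules, exception rules and the default "*" rule are deliberately left out to keep it short.

```javascript
// Hypothetical sketch: a trie over hostname labels, stored right to left,
// so that the rule 'co.uk' becomes root -> 'uk' -> 'co'.
function newNode() {
  // Object.create(null) avoids accidental hits on Object.prototype keys.
  return { children: Object.create(null), isRule: false };
}

function insertRule(root, rule) {
  var labels = rule.split('.').reverse();
  var node = root;
  for (var i = 0; i < labels.length; i++) {
    if (!node.children[labels[i]]) {
      node.children[labels[i]] = newNode();
    }
    node = node.children[labels[i]];
  }
  node.isRule = true; // a rule ends at this node
}

function buildTrie(rules) {
  var root = newNode();
  rules.forEach(function (rule) { insertRule(root, rule); });
  return root;
}

// Walk the hostname's labels from the right and remember the longest rule seen.
// Returns null when no plain rule matches.
function lookupSuffix(root, hostname) {
  var labels = hostname.split('.');
  var node = root;
  var matched = 0;
  for (var i = labels.length - 1; i >= 0; i--) {
    node = node.children[labels[i]];
    if (!node) { break; }
    if (node.isRule) { matched = labels.length - i; }
  }
  return matched === 0 ? null : labels.slice(labels.length - matched).join('.');
}

var trie = buildTrie(['uk', 'co.uk', 'fr']);
console.log(lookupSuffix(trie, 'www.example.co.uk')); // co.uk
console.log(lookupSuffix(trie, 'google.fr'));         // fr
```

In this sketch a lookup visits at most one node per label of the hostname, so its cost does not grow with the number of rules, unlike testing one regex per candidate rule.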

I implemented some benchmarks to get an idea of the performance gain, and it looks promising (I did not measure memory consumption, but I don't expect it to be unreasonable). Each run calls the benchmarked function on 100 random domains, and I'm using the benchmark npm package (I can attach the code somewhere as well if you want to have a look):

Current version of the code on master (v2.0.0)

tldjs#tldExists x 3,162 ops/sec ±1.74% (92 runs sampled)
tldjs#getDomain x 35.82 ops/sec ±1.65% (62 runs sampled)
tldjs#getSubdomain x 35.77 ops/sec ±1.60% (62 runs sampled)
tldjs#getPublicSuffix x 37.39 ops/sec ±1.13% (64 runs sampled)

New implementation based on the Trie data-structure

tldjs#tldExists x 3,265 ops/sec ±0.83% (92 runs sampled)
tldjs#getDomain x 2,030 ops/sec ±3.08% (84 runs sampled)
tldjs#getSubdomain x 2,037 ops/sec ±2.85% (85 runs sampled)
tldjs#getPublicSuffix x 2,195 ops/sec ±3.53% (83 runs sampled)

So the speed-up is around 60x for getDomain, getSubdomain and getPublicSuffix (tldExists is essentially unchanged).

It now seems that cleanHostValue is the bottleneck: if we make it a no-op, the results become:

tldjs#tldExists x 57,589 ops/sec ±2.23% (92 runs sampled)
tldjs#getDomain x 8,414 ops/sec ±0.75% (94 runs sampled)
tldjs#getSubdomain x 8,257 ops/sec ±0.77% (94 runs sampled)
tldjs#getPublicSuffix x 10,857 ops/sec ±1.27% (92 runs sampled)

I added a flag (dirtyHost) on each of these methods to allow disabling the cleaning (if you know that your host value is already clean and you want maximum performance).

While implementing this new representation, I stumbled upon some cases for which I'm not really sure what the correct behavior is. I would love to have your feedback on them.

Result if the public suffix does not exist

The following test-case: expect(tld.getPublicSuffix('www.freedom.nsa')).to.be(null); seems to expect the wrong result. According to the specification (https://publicsuffix.org/list/), when no matching rule is found, "the prevailing rule is "*"", which from what I understand means that we consider the last label of the hostname to be the public suffix. The result for this test would then be nsa instead of null. What do you think?

Result of exceptions

The following test-case: expect(tld.getPublicSuffix('www.www.ck')).to.be('www.ck'); should probably expect ck instead of www.ck as a public suffix. According to the aforementioned specification, the following should be done when an exception is matched (here: !www.ck):

If the prevailing rule is a exception rule, modify it by removing the leftmost label.

I guess we should then remove www (the leftmost label), resulting in a public suffix of ck.

Result when no public suffix is found

What should be the result of getDomain when no valid public suffix is found? Should it be null as well, or the value passed as host to the function?

Ignoring trailing and leading dots

The specification says that trailing and leading dots found on rules should be ignored. Although there is no such case at the moment, maybe we should handle them and add a test-case? I left a TODO in the parser.

Thanks for reading, and I hope this work can be useful to the project,
Have a nice evening.

thom4parisot · 2017-07-24T14:03:37Z

Hey @remusao,

thanks for the hard work! Figures are impressive 🙂
I will try to have a close look at it tomorrow; I am viewing houses in London at the moment, and some transcripts take up the rest of my time.

When no matching rule is found, then "the prevailing rule is "*".", which from what I understand means that we consider the last label of the hostname to be the public suffix. The result for this test would then be nsa instead of null. What do you think?

Yeah, I feel it would be nicer to respect what the public suffix spec says — I was feeling frustrated about the methods, what they were doing and for what in #25 (#54 maybe). It started from "using public-suffix.org to enforce a verification of known TLDs" to "implementing PublicSuffix for cookies and stuff while trying to address the former intention".

Likewise for the rest of your proposals — you have done a better job at reading the spec than me ;-)

remusao · 2017-07-24T14:29:38Z

Hey @oncletom,

Thanks for the answer. Let me know if I can do anything to ease the review (more comments, a tl;dr about the implementation details, or more direct talk are possible!).

Regarding the issues you mention, I was thinking about using only the ICANN section of the file to get the domain (and use the PRIVATE section only to get the public suffix). Do you think it could work?

thom4parisot · 2017-08-03T14:05:11Z

@remusao ICANN is a good idea! Or IANA (cf. #40) too – I don't know which one validates gTLD first. It would be useful if the Public Suffix list does not encompass all the gTLDs or with a major delay.

I found a place, so hopefully by Sunday I should be able to get my head around your proposal. Will let you know if I have any questions; if other contributor(s) want to have a peek at it, they are obviously welcome to do so 🙂

remusao · 2017-08-08T13:32:19Z

@oncletom Thanks for your reply! I understand that time is precious and this is a big PR. If I can do anything to make it easier to review, let me know.

ctavan · 2017-08-09T14:03:35Z

@remusao since I'm very interested in this PR getting merged I tried to verify the two cases that you highlight in the issue description.

www.freedom.nsa

I think your reasoning is right, the public suffix should be nsa. To verify the reasoning I simply took the definitions from https://publicsuffix.org/list/ and highlighted the essential parts:

Definitions

[ … ]
A domain is said to match a rule if and only if all of the following conditions are met:
- When the domain and rule are split into corresponding labels, that the domain contains as many or more labels than the rule.
- Beginning with the right-most labels of both the domain and the rule, and continuing for all labels in the rule, one finds that for every pair, either they are identical, or that the label from the rule is "*".

Algorithm

Match domain against all rules and take note of the matching ones.
If no rules match, the prevailing rule is "*".
If more than one rule matches, the prevailing rule is the one which is an exception rule.
If there is no matching exception rule, the prevailing rule is the one with the most labels.
If the prevailing rule is a exception rule, modify it by removing the leftmost label.
The public suffix is the set of labels from the domain which match the labels of the prevailing rule, using the matching algorithm above.
The registered or registrable domain is the public suffix plus one additional label.

prevailing rule = *

Domain: | www | freedom | nsa |
Rule:   |     |         |  *  |

public suffix = nsa

www.www.ck

I think your reasoning is again right (public suffix should be ck):

Match domain against all rules and take note of the matching ones.
- Matching rules:
```
*.ck
!www.ck
```
~~If no rules match, the prevailing rule is "*".~~
If more than one rule matches, the prevailing rule is the one which is an exception rule.
- Matching rule:
```
!www.ck
```
~~If there is no matching exception rule, the prevailing rule is the one with the most labels.~~
If the prevailing rule is a exception rule, modify it by removing the leftmost label.
- Modified matching rule:
```
ck
```
The public suffix is the set of labels from the domain which match the labels of the prevailing rule, using the matching algorithm above.
The registered or registrable domain is the public suffix plus one additional label.

prevailing rule = ck

Domain: | www | www | ck |
Rule:   |     |     | ck |

public suffix = ck
registered domain = www.ck
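The two walkthroughs above can be checked with a small sketch of the quoted algorithm. This is an illustration only, assuming a linear scan over a tiny hand-written rule list rather than the trie from this PR; the names matches and publicSuffix are hypothetical.

```javascript
// Hypothetical sketch of the algorithm quoted above (linear scan, tiny rule list).

// A domain matches a rule when it has at least as many labels and, from the
// right, every pair of labels is identical or the rule label is "*".
function matches(domainLabels, ruleLabels) {
  if (domainLabels.length < ruleLabels.length) { return false; }
  for (var i = 1; i <= ruleLabels.length; i++) {
    var ruleLabel = ruleLabels[ruleLabels.length - i];
    var domainLabel = domainLabels[domainLabels.length - i];
    if (ruleLabel !== '*' && ruleLabel !== domainLabel) { return false; }
  }
  return true;
}

function publicSuffix(hostname, rules) {
  var domainLabels = hostname.split('.');
  var prevailing = ['*']; // "If no rules match, the prevailing rule is *"
  var best = 0;
  var exception = null;

  rules.forEach(function (rule) {
    var isException = rule.charAt(0) === '!';
    var ruleLabels = (isException ? rule.slice(1) : rule).split('.');
    if (!matches(domainLabels, ruleLabels)) { return; }
    if (isException) {
      exception = ruleLabels;
    } else if (ruleLabels.length > best) {
      best = ruleLabels.length; // otherwise the rule with the most labels wins
      prevailing = ruleLabels;
    }
  });

  // A matching exception rule prevails, minus its leftmost label.
  if (exception) { prevailing = exception.slice(1); }

  return domainLabels.slice(domainLabels.length - prevailing.length).join('.');
}

var rules = ['com', '*.ck', '!www.ck'];
console.log(publicSuffix('www.freedom.nsa', rules)); // nsa
console.log(publicSuffix('www.www.ck', rules));      // ck
console.log(publicSuffix('a.b.ck', rules));          // b.ck
```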

luciancor · 2017-08-22T16:58:19Z

nice work @remusao !

thom4parisot · 2017-08-31T16:22:01Z

@remusao I have been thinking about the clean host vs. dirty host. I agree that cleaning should happen only when it is relevant. But by changing the interface we can no longer trust the flag which states if the value is clean or not.

In my opinion, the cleanHost function should return a frozen CleanHostString object. Maybe like this:

class CleanHostString extends String {
  constructor(value) {
    super(value);

    // reuse the value if it is already clean, otherwise clean it once
    this.hostValue = value instanceof CleanHostString ? value.valueOf() : cleanHost(value);

    // we ensure the object becomes immutable
    Object.freeze(this);
  }

  // returns the clean value for `obj.valueOf()` and in expressions such as
  // `obj + ''` (note that `String(obj)` calls toString() instead)
  valueOf() {
    return this.hostValue;
  }
}

What do you think?

remusao · 2017-08-31T16:31:01Z

@oncletom Yeah the flag might not be the ideal solution. What about just asking users to clean the url before giving it to functions of tldjs? So that we simplify the API and all functions expect an already clean host.

Another option could be, as you suggest, to wrap clean host in a different type and check from each function if we are given an already clean host, or if the cleaning needs to happen? But then it's not propagated to the caller and we still might have redundancy. Although I guess it's nice because we hide all the complexity from the user and it's more efficient than what we have now (but then users cannot choose to not have any pre-processing at all, or do it themselves, is it something we want to offer?).
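As an illustration of this second option, here is a minimal ES5-style sketch. All names (CleanHost, cleanHostValue as written here, toCleanHost, lastLabel) are hypothetical, and the cleaning step is a stand-in: the library's real cleaning does considerably more than trimming and lower-casing.

```javascript
// Hypothetical sketch: mark cleaned values with a wrapper type so that
// functions can skip cleaning when they receive one.
function CleanHost(value) {
  this.value = value;
}

// Stand-in for the real cleaning step (assumption: the real one also
// extracts the hostname from URLs, among other things).
function cleanHostValue(input) {
  return String(input).trim().toLowerCase();
}

// Clean only when the input is not already wrapped.
function toCleanHost(input) {
  return input instanceof CleanHost ? input : new CleanHost(cleanHostValue(input));
}

// Example consumer: accepts either a raw string or a CleanHost.
function lastLabel(input) {
  var host = toCleanHost(input).value;
  return host.slice(host.lastIndexOf('.') + 1);
}

var cleaned = toCleanHost('  WWW.Example.COM ');
console.log(lastLabel(cleaned));              // com (no second cleaning pass)
console.log(lastLabel('  WWW.Example.COM ')); // com (cleaned on the fly)
```

A caller who wants the cleaning to happen only once would have to create the wrapper and pass it around, which is the propagation concern raised above.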

Also, on a side note, I started working on a more efficient version of cleanHost to limit its impact on the overall performance of the library.

Thanks for your answer! If you think that the best option is to wrap the clean host in a different type, I can implement it in this PR.

Also, I see you make use of class in your example, and it made me wonder what version of JavaScript this library should support? es5, es6, something else?

thom4parisot · 2017-08-31T16:24:57Z

lib/domain.js

-  hostTld = extractTldFromHost(host);
-  rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1}));
-  rule = getCandidateRule(host, rules);
+  // Check if `host` ends with '.' followed by one host specified in validHosts.


I wonder if the trailing dot should be cleaned or kept as part of the clean host value?

If cleaned, I guess the following block can move to the cleanHost function.

You mean that the cleanHost would actually check if the host is part of the validHosts and return this value directly right?

Hm, I mean the trailing dot thingy appears in various places in the codebase. If the trailing dots are removed as part of the host cleaning process, it simplifies various verification steps a lot. Don't you think?

I think the comment is confusing, but it's not the same trailing dot that we are dealing with here. Basically we just check if one element of validHosts is a valid general domain of cleanHost. But it's not enough to check if hostname.endsWith(vhost), so we need to be sure that either hostname === vhost or the character before hostname.indexOf(vhost) is a dot ('.'). I will add some comments and create a helper function to make it clearer.
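A helper of the kind described could look roughly like the sketch below. The name isSubdomainOrSame is hypothetical and this is not the helper that ended up in the PR; it avoids endsWith, in line with the backward-compatibility commits.

```javascript
// Hypothetical helper: true when `hostname` is `vhost` itself or a subdomain of it.
function isSubdomainOrSame(hostname, vhost) {
  if (hostname === vhost) { return true; }

  var start = hostname.length - vhost.length;
  // hostname must end with vhost AND the character just before it must be a dot,
  // otherwise 'notlocalhost' would wrongly match 'localhost'.
  return start > 0 &&
    hostname.lastIndexOf(vhost) === start &&
    hostname.charAt(start - 1) === '.';
}

console.log(isSubdomainOrSame('localhost', 'localhost'));     // true
console.log(isSubdomainOrSame('api.localhost', 'localhost')); // true
console.log(isSubdomainOrSame('notlocalhost', 'localhost'));  // false
```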

thom4parisot · 2017-08-31T16:29:23Z

lib/domain.js

-  rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1}));
-  rule = getCandidateRule(host, rules);
+  // Check if `host` ends with '.' followed by one host specified in validHosts.
+  for (var i = 0; i < validHosts.length; i++) {


If we loop over the validHosts array, I'd rather use validHosts.some()/find() and interrupt the loop as soon as we find the relevant vhost.

The loop is already exited early when we find a match with return vhost. Am I missing something here?

True, you are right! I missed the return statement and assumed you were using the vhost afterwards.

thom4parisot · 2017-08-31T16:30:50Z

lib/domain.js

-  host.replace(new RegExp(rule.getPattern()), function (m, d) {
-    domain = d;
-  });
+  if (suffix.length === cleanHost.length) {


What does this condition refer to? If the suffix is equal to the cleanhost, then there is no domain to return? (I wonder why the check is made upon the length rather than the value of the variables).

Yes exactly, the suffix is always the trailing part of the host, so if it has the same length as the host itself then they are equal (and we save a string comparison), and there is no domain to return (i.e. cleanHost is just a valid public suffix).

thom4parisot · 2017-08-31T16:53:24Z

lib/domain.js

+  // google.fr (length 9)
+  // suffix = fr (length 2)
+  // 5 = 9 - 2 - 1 (ignore the dot) - 1 (zero-based indexing)
+  var lastDotBeforeSuffixIndex = cleanHost.lastIndexOf('.', cleanHost.length - suffix.length - 2);


Don't hesitate to make it a function that returns null or a CleanHostString.
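For illustration, such a function might look like the sketch below, which also spells out the index arithmetic from the diff. The name domainFromSuffix is hypothetical, it returns a plain string rather than a CleanHostString, and it assumes that cleanHost really ends with suffix.

```javascript
// Hypothetical sketch: extract the registrable domain from a clean host,
// given an already computed public suffix.
function domainFromSuffix(cleanHost, suffix) {
  // Same length means the host *is* the suffix: there is no domain to return.
  if (suffix.length === cleanHost.length) { return null; }

  // For 'www.google.fr' (length 13) and suffix 'fr' (length 2):
  //   13 - 2 = 11 -> index of 'f', the first character of the suffix
  //   11 - 1 = 10 -> index of the dot right before the suffix
  //   10 - 1 =  9 -> last character of the label before that dot
  var from = cleanHost.length - suffix.length - 2;
  var lastDotBeforeSuffixIndex = cleanHost.lastIndexOf('.', from);

  // No earlier dot: the whole host is the domain (e.g. 'google.fr').
  if (lastDotBeforeSuffixIndex === -1) { return cleanHost; }

  return cleanHost.slice(lastDotBeforeSuffixIndex + 1);
}

console.log(domainFromSuffix('www.google.fr', 'fr')); // google.fr
console.log(domainFromSuffix('google.fr', 'fr'));     // google.fr
console.log(domainFromSuffix('fr', 'fr'));            // null
```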

thom4parisot · 2017-08-31T17:18:52Z

Also, on a side note, I started working on a more efficient version of cleanHost to limit its impact on the overall performance of the library.

👍 for it, and sorry for being so slow at reviewing!

If you think that the best option is to wrap the clean host in a different type, I can implement it in this PR.

It is just a hunch; I don't know what the performance impact is going to be.
I feel it might help keeping a clean API and to be descriptive about the undertaken actions in the codebase.

Also, I see you make use of class in your example, and it made me wonder what version of Javascript this library should support? es5, es6, something else?

I have not thought much about it — I realised only after publishing the comment the codebase was in ES5 😄 If the trie implementation does not break anything in the API, I'd rather keep it as is and later do a major bump to move to ES2015, when Node 8 is out for example.

remusao · 2017-08-31T17:28:52Z

@oncletom No problem for the delay, thank you so much for reviewing. I will address your comment and add some changes.

One thing that remains is what to do regarding the two failing tests, should I change them to adopt a behavior as described by the specification?

I have not thought much about it — I realised only after publishing the comment the codebase was in ES5 😄 If the trie implementation does not break anything in the API, I'd rather keep it as is and later do a major bump to move to ES2015, when Node 8 is out for example.

Makes sense, I will keep the code compatible (the trie should not break anything).

thom4parisot · 2017-08-31T18:01:59Z

One thing that remains is what to do regarding the two failing tests, should I change them to adopt a behavior as described by the specification?

Yes, please do proceed as is 🙂 it is totally fine to fix bugs along the way!

remusao · 2017-08-31T19:19:26Z

@oncletom Actually, after thinking a bit about this cleanHost preprocessing, and if we're ok making a bold move but being totally clear about it, it would make some sense to me that we only have functions working on hostnames, and provide a function to extract the hostname of any url (cleanHost). In a way it would be conceptually simpler and each function would just do one thing. On the other hand, that puts more responsibility in the hands of the user of the library since they have to make sure they only give hostnames to the different functions.

Another option could be to offer two versions of each function: getDomain + getDomainFromHostname, getSubdomain + getSubdomainFromHost, etc. But it does not look as good as the first option IMHO.

The only thing I'm not really confident with, is that we don't have a good way to tell the caller that what was passed in is not a valid hostname and they should clean it first. Which might lead to a lot of silent failures (we would just return null). But then such check would also incur a cost.

I need to think a bit more about it.

ctavan · 2017-08-31T19:21:12Z

I remember at least one part of a production project I was involved with, where Object.freeze() had a considerable negative performance impact. We ended up removing the frozen objects in the end. So should you consider adding immutability using Object.freeze() I highly recommend to benchmark the results with and without Object.freeze().

References:

On the sanitization topic: As a long-time user of this library I often found the interface a bit confusing with respect to whether it required a semantically valid hostname or not. I would prefer a clear contract here and would tend towards dropping all semantic validation/sanitization and leaving this part to the user (or to a different module) such that tld.js would be exclusively concerned with validating and parsing semantically valid hostnames/domains.

remusao · 2017-08-31T19:23:11Z

@ctavan I agree about Object.freeze: it's extremely slow and we can do without it.

I also agree on your second point regarding the API. What do you think @oncletom?
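Regarding the recommendation above to benchmark with and without Object.freeze(), a rough way to compare the two is sketched below. It is only an assumption-laden micro-benchmark: the names timeIt, plain and frozen are hypothetical, the timings depend heavily on the JavaScript engine and its version, and a proper comparison would use the benchmark package already used for this PR.

```javascript
// Hypothetical micro-benchmark: create many small objects, frozen or not,
// and report the elapsed time for each variant.
function plain(host) {
  return { host: host };
}

function frozen(host) {
  return Object.freeze({ host: host });
}

function timeIt(label, make, iterations) {
  var checksum = 0; // use the result so the loop cannot be skipped entirely
  var start = process.hrtime();
  for (var i = 0; i < iterations; i++) {
    checksum += make('example.com').host.length;
  }
  var diff = process.hrtime(start);
  var ms = diff[0] * 1e3 + diff[1] / 1e6;
  console.log(label + ': ' + ms.toFixed(1) + ' ms');
  return { ms: ms, checksum: checksum };
}

var ITERATIONS = 100000;
var plainRun = timeIt('plain objects', plain, ITERATIONS);
var frozenRun = timeIt('frozen objects', frozen, ITERATIONS);
```

Both variants do the same work apart from the freeze call, so any consistent gap between the two timings on a given engine can be attributed to Object.freeze().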

thom4parisot · 2017-08-31T20:50:58Z

Yeah, Object.freeze used to be slow but I did not know it applied to Object.prototype. Thanks for sharing this insight 🙂 Object.freeze does not have to be used, especially if it induces a performance hit.

I think the trie implementation and an API design update are two separate decisions. They have different impacts and we do not have to try to solve them at once; and especially in one single pull request. I'd rather ship a good performance improvement and break the API later on, for good developer experience reasons.

@ctavan when you write

I often found the interface a bit confusing with respect to whether it required a semantically valid hostname or not.

This is very valuable and I would be happy if we could bring the pain points to light. Maybe it is just a documentation issue. Maybe it is an API design issue. And it is certainly a domain/concept issue to clear up.

If you are okay with it, let's discuss API and concepts improvements in a new issue while we are still around here :-)

if we're ok making a bold move but being totally clear about it, it would make some sense to me that we only have functions working on hostnames, and provide a function to extract the hostname of any url (cleanHost)

A bold move, okay, but in another PR then, or in another issue so that we can talk about the design before implementing it.

remusao · 2017-08-31T21:24:03Z

@oncletom Makes sense. Let's preserve the API for this PR and discuss a possible evolution in another issue. I will clean-up/comment the code a bit and address the concerns between now and tomorrow. Thanks again for the review and discussion!

remusao · 2017-09-01T14:10:58Z

@oncletom I just updated the PR:

Addressed PR comments
Cleaned-up a bit
Added extensive comments
Increased test coverage to 100%
Re-introduced the use of cleanHost in all functions (let's discuss how to proceed with this in another issue as you suggested).
Changed the expected result for the 2 tests discussed above (regarding spec compliance)

Let me know if there are other things to change before it's ready for merge.

thom4parisot · 2017-09-01T16:01:28Z

@remusao great! Will have a look at it more in depth within the next hour or so — and might merge and release in the continuity of it. Great work. Thank you very much for taking the time to do it, to be open about the various thoughts and remarks 👍

remusao · 2017-09-01T16:15:24Z

@oncletom Thanks a lot! Happy to help and improve this library. Shall I create an issue to start the discussion around host cleaning?

thom4parisot · 2017-09-01T16:41:07Z

@remusao and I am thankful for your help 🙂

Shall I create an issue to start the discussion around host cleaning?

Yes, please go ahead 👍

thom4parisot · 2017-09-01T16:55:06Z

@remusao also, could you create a separate PR with benchmark as a devDependency and ./bin/benchmark as a script to run the benchmarks? Would be nice to compare further experimentations 🙂

remusao · 2017-09-01T18:54:19Z

@oncletom Thank you for merging! :D

I will create the issue soon + detail the different options we discussed already, as a starting point for the brainstorm!

could you create a separate PR with benchmark as a devDependency and ./bin/benchmark as a script to run the benchmarks? Would be nice to compare further experimentations 🙂

Will do! Talking about benchmark, I see you added the numbers to the README, but I think we need to correct them. On this benchmark, each op is actually processing 100 hostnames (clean and no unicode, to reduce the impact of cleanHost). So the corrected numbers should be (x100):

tldjs#tldExists x 326,500 ops/sec ±0.83% (92 runs sampled)
tldjs#getDomain x 203,000 ops/sec ±3.08% (84 runs sampled)
tldjs#getSubdomain x 203,700 ops/sec ±2.85% (85 runs sampled)
tldjs#getPublicSuffix x 219,500 ops/sec ±3.53% (83 runs sampled)

I will improve the benchmarks to be more representative of real cases when adding the benchmark binary.
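The correction is simple arithmetic, shown here as a small sketch using the trie figures reported earlier in this thread. The helper name perHostnameRate is hypothetical.

```javascript
// Hypothetical helper: convert "benchmark ops per second" into
// "hostnames processed per second" when each op handles a whole batch.
function perHostnameRate(opsPerSec, batchSize) {
  return opsPerSec * batchSize;
}

var BATCH_SIZE = 100; // each benchmark op processed 100 hostnames

// ops/sec reported for the trie implementation
var trieResults = {
  tldExists: 3265,
  getDomain: 2030,
  getSubdomain: 2037,
  getPublicSuffix: 2195
};

Object.keys(trieResults).forEach(function (name) {
  var rate = perHostnameRate(trieResults[name], BATCH_SIZE);
  console.log('tldjs#' + name + ': ' + rate + ' hostnames/sec');
});
```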

olegpisklov · 2017-09-04T16:07:01Z

Hey guys,

Regarding this new implementation of the getPublicSuffix method - this is a breaking change since now it returns a tld instead of null. So I suppose the major version of the library should've been updated.

thom4parisot · 2017-09-05T09:22:33Z

Hey, @Nekimola thanks for sharing your concern. I actually thought of doing a major bump. The reasoning for not doing so was that the behaviour change is a bug fix, as the past implementation was wrong. You were probably relying on a bug rather than on the expected behaviour.

What were you trying to do when expecting a null result?

olegpisklov · 2017-09-05T10:54:48Z

Thanks for the reply, I was checking if a public suffix of a domain doesn't exist.

thom4parisot · 2017-09-05T20:38:40Z

I was checking if a public suffix of a domain doesn't exist

@Nekimola Well, if you are checking for a public suffix, according to the spec and the following tests (to be found here), you can see an unlisted TLD actually returns a public suffix.

// Unlisted TLD.
checkPublicSuffix('example', null);
checkPublicSuffix('example.example', 'example.example');
checkPublicSuffix('b.example.example', 'example.example');
checkPublicSuffix('a.b.example.example', 'example.example');

If you want to check if a TLD exists, I'd suggest you use #tldExists instead.

Let us know if more details are needed.

olegpisklov · 2017-09-06T10:48:19Z

Yeah, I'm using tldExists method now, thanks again.

remusao requested a review from thom4parisot July 23, 2017 18:38

remusao force-pushed the faster-rules branch from 5179bf3 to d1a6ce3 Compare July 23, 2017 18:38

Implement rules using a trie data structure.

20c78e3

remusao force-pushed the faster-rules branch from d1a6ce3 to 20c78e3 Compare July 23, 2017 18:39

remusao added 6 commits July 23, 2017 21:29

Remove use of startsWith for backward compatibility

f4fe87f

trie lookup: check exceptions only if we have a normal match

f6fe7ac

Revert changes on package.json

a3e68ad

remove default function argument for backward compatibility

891790f

Remove use of endsWith for backward compatibility

61245bd

Fix optional host cleaning

c0a7904

thom4parisot reviewed Aug 31, 2017

Address PR comments and increase test coverage.

21a419f

remusao force-pushed the faster-rules branch from aaf0853 to 21a419f Compare September 1, 2017 14:04

thom4parisot approved these changes Sep 1, 2017

thom4parisot merged commit 807eeff into thom4parisot:master Sep 1, 2017

thom4parisot pushed a commit that referenced this pull request Sep 1, 2017

fix(readme): provide more accurate figures

02cb561

refs #97

This was referenced Sep 2, 2017

Discuss use of cleanHost by default, and API in general #99

Closed

Find a faster way to iterate over the ruleset #57

Closed

remusao deleted the faster-rules branch January 10, 2019 20:00
