-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement rules using a trie data structure. #97
Conversation
Hey @remusao, thanks for the hard work! Figures are impressive 🙂
Yeah, I feel it would be nicer to respect what the public suffix spec says — I was feeling frustrated about the methods, what they were doing and for what in #25 (#54 maybe). It started from "using public-suffix.org to enforce a verification of known TLDs" to "implementing PublicSuffix for cookies and stuff while trying to address the former intention". Likewise for the rest of your proposals — you have done a better job at reading the spec than me ;-) |
Hey @oncletom, Thanks for the answer. Let me know if I can do anything to ease the review (more comments, a tl;dr about the implementation details, or more direct talk are possible!). Regarding the issues you mention, I was thinking about using only the ICANN section of the file to get the domain (and use the PRIVATE section only to get the public suffix). Do you think it could work? |
@remusao ICANN is a good idea! Or IANA (cf. #40) too – I don't know which one validates gTLD first. It would be useful if the Public Suffix list does not encompass all the gTLDs or with a major delay. I found a place to hopefully Sunday I should be able to put my head around your proposal. Will let you know if I have any question or else — if other contributor(s) want to have a peak at it, they are obviously welcome to do so 🙂 |
@oncletom Thanks for your reply! I understand that time is precious and this is a big PR. If I can do anything to make it easier to review, let me know. |
@remusao since I'm very interested in this PR getting merged I tried to verify the two cases that you highlight in the issue description. www.freedom.nsaI think your reasoning is right, the public suffix should be Definitions
Algorithm
www.www.ckI think your reasoning is again right (public suffix should be
|
nice work @remusao ! |
@remusao I have been thinking about the clean host vs. dirty host. I agree there is a need for cleaning if it is relevant only. But by changing the interface we can no longer trust the flag which states if the value is clean or not. In my opinion, the class CleanHostString extends String {
constructor(value) {
super(value);
this.hostValue = value instanceof CleanHost ? String(value) : cleanHost(value);
// we ensure the object becomes immutable
Object.freeze(this);
}
// returns the value when the conversion to String primitive happens, with `String(obj)`
valueOf() {
return this.hostValue;
}
} What do you think? |
@oncletom Yeah the flag might not be the ideal solution. What about just asking users to clean the url before giving it to functions of tldjs? So that we simplify the API and all functions expect an already clean host. Another option could be, as you suggest, to wrap clean host in a different type and check from each function if we are given an already clean host, or if the cleaning needs to happen? But then it's not propagated to the caller and we still might have redundancy. Although I guess it's nice because we hide all the complexity from the user and it's more efficient than what we have now (but then users cannot chose to not have any pre-processing at all, or do it themselves, is it something we want to offer?). Also, on a side note, I started working on a more efficient version of Thanks for your answer! If you think that the best option is to wrap the clean host in a different type, I can implement it in this PR. Also, I see you make use of |
lib/domain.js
Outdated
hostTld = extractTldFromHost(host); | ||
rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1})); | ||
rule = getCandidateRule(host, rules); | ||
// Check if `host` ends with '.' followed by one host specified in validHosts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the trailing dot should be cleaned or kept as part of the clean host value?
If cleaned, I guess the following block can move to the cleanHost
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that the cleanHost
would actually check if the host is part of the validHosts
and return this value directly right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I mean the trailing dot thingy appears in various places in the codebase. If the trailing dots are removed as part of the host cleaning process, it simplifies a lot various verification steps. Don't you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is confusing, but it's not the same trailing dot that we are dealing with here. Basically we just check if one element of validHosts
is a valid general domain of cleanHost
. But it's not enough to check if hostname.endsWith(vhost)
, so we need to be sure that either hostname === vhost
or the character before hostname.indexOf(vhost)
is .
. I will add some comment and create a helper function to make it clearer.
lib/domain.js
Outdated
rules = getRulesForTld(allRules, hostTld, new Rule({"firstLevel": hostTld, "isHost": _validHosts.indexOf(hostTld) !== -1})); | ||
rule = getCandidateRule(host, rules); | ||
// Check if `host` ends with '.' followed by one host specified in validHosts. | ||
for (var i = 0; i < validHosts.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we loop over the validHosts
array, I'd rather use validHosts.some()/find()
and interrupt the loop as soon as we find the relevant vhost
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop is already exited early when we find a match with return vhost
. Am I missing something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, you are right! I missed the return
statement and assumed you were using the vhost
afterwards.
lib/domain.js
Outdated
host.replace(new RegExp(rule.getPattern()), function (m, d) { | ||
domain = d; | ||
}); | ||
if (suffix.length === cleanHost.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this condition refers to? If the suffix is equal to the cleanhost, then there is no domain to return? (I wonder why the check is made upon the length
rather than the value of the variables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, if the suffix has the same length of the host itself then they are equal (and we save a string comparison), so there is no domain to return (i.e.: cleanHost
is just a valid public suffix).
lib/domain.js
Outdated
// google.fr (length 9) | ||
// suffix = fr (length 2) | ||
// 5 = 9 - 2 - 1 (ignore the dot) - 1 (zero-based indexing) | ||
var lastDotBeforeSuffixIndex = cleanHost.lastIndexOf('.', cleanHost.length - suffix.length - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't hesitate to make it a function that returns null
or a CleanHostString
.
👍 for it, and sorry for being so slow at reviewing!
It is just a hunch — I don't know what is the performance impact going to be.
I have not thought much about it — I realised only after publishing the comment the codebase was in ES5 😄 If the trie implementation does not break anything in the API, I'd rather keep it as is and later do a major bump to move to ES2015, when Node 8 is out for example. |
@oncletom No problem for the delay, thank you so much for reviewing. I will address your comment and add some changes. One thing that remains is what to do regarding the two failing tests, should I change them to adopt a behavior as described by the specification?
Makes sense, I will keep the code compatible (the trie should not break anything). |
Yes, please do proceed as is 🙂 it is totally fine to fix bugs along the way! |
@oncletom Actually, after thinking a bit about this cleanHost preprocessing, and if we're ok making a bold mode but being totally clear about it, it would make some sense to me that we only have functions working on hostnames, and provide a function to extract the hostname of any url ( Another option could be to offer two versions of each function: The only thing I'm not really confident with, is that we don't have a good way to tell the caller that what was passed in is not a valid hostname and they should clean it first. Which might lead to a lot of silent failures (we would just return I need to think a bit more about it. |
I remember at least one part of a production project I was involved with, where References:
On the sanitization topic: As a long-time user of this library I often found the interface a bit confusing with respect to whether it required a semantically valid hostname or not. I would prefer a clear contract here and would tend towards dropping all semantic validation/sanitization and leaving this part to the user (or to a different module) such that |
Yeah, I think the trie implementation and an API design update are two separate decisions. They have different impacts and we do not have to try to solve them at once; and especially in one single pull request. I'd rather ship a good performance improvement and break the API later on, for good developer experience reasons. @ctavan when you write
This is very valuable and I would be happy if we could make emerge the pain points. Maybe it is just a documentation issue. Maybe it is an API design issue. And it is certainly a domain/concept issues to clear out. If you are okay with it, let's discuss API and concepts improvements in a new issue while we are still around here :-)
A bold move okay but in another PR then, or in an another issue so as we can talk about the design before implementing it. |
@oncletom Makes sense. Let's preserve the API for this PR and discuss a possible evolution in another issue. I will clean-up/comment the code a bit and address the concerns between now and tomorrow. Thanks again for the review and discussion! |
aaf0853
to
21a419f
Compare
@oncletom I just updated the PR:
Let me know if there are other things to change before it's ready for merge. |
@remusao great! Will have a look at it more in depth within the next hour or so — and might merge and release in the continuity of it. Great work. Thank you very much for taking the time to do it, to be open about the various thoughts and remarks 👍 |
@oncletom Thanks a lot! Happy to help and improve this library. Shall I create an issue to start the discussion around host cleaning? |
@remusao and I am thankful for your help 🙂
Yes, please go ahead 👍 |
@remusao also, could you create a separate PR with |
@oncletom Thank you for merging! :D I will create the issue soon + detail the different options we discussed already, as a starting point for the brainstorm!
Will do! Talking about benchmark, I see you added the numbers to the README, but I think we need to correct them. On this benchmark, each
I will improve the benchmarks to be more representative of real cases when adding the |
Hey guys, Regarding this new implementation of the |
Hey, @Nekimola thanks for sharing your concern. I actually thought of doing a major bump. The reasoning for not doing so was the behaviour change is a bug fix, as the past implementation was wrong. You were probably relying on a bug rather than on the expected behaviour. What were you trying to do when expecting a |
Thanks for reply, I was checking if a public suffix of a domain doesn't exist. |
@Nekimola Well, if you are checking for a public suffix, according to the spec and the following tests (to be found here), you can see an unlisted TLD actually returns a public suffix.
If you want to check if a TLD exists, I'd suggest you use Let us know if more precisions are needed. |
Yeah, I'm using |
Hi there,
I spent some time this week-end implementing a different representation for the rules: replacing the regex by a more efficient Trie data structure. This changes a lot of code, so I would understand that it might require more work to make "production-ready". Also, comments are pretty sparse at the moment, but I would be glad to provide more where you think it makes sense!
I implemented some benchmark to have an idea of the performance gain, and it looks promising (although I did not register the memory consumption, I don't expect it to be unreasonable). Each
run
is calling the benchmarked function on 100 random domains and I'm using thebenchmark
npm package (I can attach the code somewhere as well if you want to have a look):Current version of the code on master (v2.0.0)
New implementation based on the Trie data-structure
So the speed-up is around
x60
.It now seems that
cleanHostValue
is the bottle-neck, since if we make it a no-op, the results become:I added a flag (
dirtyHost
) on each of these methods to allow disabling the cleaning (if you know that your host value is already clean and you want maximum performances).While implementing this new representation, I stumbled upon some cases for which I'm not really sure of the good behavior. I would love to have your feedback on them.
Result if the public suffix does not exist
The following test-case:
expect(tld.getPublicSuffix('www.freedom.nsa')).to.be(null);
seems to expect the wrong result. According to the specification (https://publicsuffix.org/list/). When no matching rule is found, then "the prevailing rule is "*".", which from what I understand means that we consider the latest label of the hostname to be the public suffix. The result for this test would then bensa
instead ofnull
. What do you think?Result of exceptions
The following test-case:
expect(tld.getPublicSuffix('www.www.ck')).to.be('www.ck');
should probably expectck
instead ofwww.ck
as a public suffix. According to the aforementioned specification, the following should be done when an exception is matched (here:!www.ck
):I guess we should then remove
www
(the leftmost label), resulting in a public suffix ofck
.Result when not public suffix is found
What should be the result of
getDomain
when no valid public suffix is found? Should it benull
as well, or the value passed ashost
to the function?Ignoring trailing and leading dots
The specification says that trailing and leading dots found on rules should be ignored. Although there is no such case at the moment, maybe we should handle them and add a test-case? I left a
TODO
in the parser.Thanks for reading, and I hope this work can be useful to the project,
Have a nice evening.