Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API addition proposal + remove redundancy #103

Merged
merged 7 commits into from
Sep 10, 2017

Conversation

remusao
Copy link
Collaborator

@remusao remusao commented Sep 6, 2017

I played a bit with the idea of changing slightly the API #99 (while maintaining full backward compatibility) and I think I found some compromise.

  • Add a parse method in the public API:
> tldjs.parse('https://example.co.uk')
{ valid: true,
  hostname: 'foo.example.co.uk',
  suffix: 'co.uk',
  domain: 'example.co.uk',
  subdomain: 'foo' }
  • All the other methods are still available and keep their semantic
  • Add an additional option in factory to provide a custom method to extract a hostname from a url (by default it is cleanHostValue)
  • Add a very fast hostname validity check in cleanHostValue to only pay the price of parsing if the argument is not already a valid hostname
  • Change internal APIs to only accept a valid hostname + change arguments slightly to avoid re-computing public suffix, domain, etc. several times (index.js has some extra logic in parse to only extract the hostname once)

The result is a substantial increase in speed + no extra cost if the value given to any method of the public API is already a valid hostname.

Some timings (if processing only clean hostnames):

While interpreting the results, keep in mind that each "op" reported by the benchmark is processing 24 domains
tldjs#isValid x 355,821 ops/sec ±0.70% (94 runs sampled)
tldjs#extractHostname x 333,040 ops/sec ±0.42% (96 runs sampled)
tldjs#tldExists x 128,460 ops/sec ±0.87% (96 runs sampled)
tldjs#getDomain x 41,242 ops/sec ±1.80% (87 runs sampled)
tldjs#getSubdomain x 41,626 ops/sec ±1.06% (95 runs sampled)
tldjs#getPublicSuffix x 40,460 ops/sec ±1.57% (92 runs sampled)

That is roughly 1M hostnames processed per second (for getDomain, getSubdomain, getPublicSuffix).

Other timings (normal benchmark):

While interpreting the results, keep in mind that each "op" reported by the benchmark is processing 24 domains
tldjs#isValid x 497,786 ops/sec ±0.55% (89 runs sampled)
tldjs#extractHostname x 52,050 ops/sec ±1.06% (85 runs sampled)
tldjs#tldExists x 36,349 ops/sec ±1.02% (92 runs sampled)
tldjs#getDomain x 18,385 ops/sec ±0.68% (91 runs sampled)
tldjs#getSubdomain x 18,335 ops/sec ±1.32% (96 runs sampled)
tldjs#getPublicSuffix x 18,289 ops/sec ±0.77% (94 runs sampled)

Which is about 430k hostnames processed per second (but extractHostname is now the bottleneck).

If you think this is a reasonable change, we should close #100 (as it already implements the more strict isValid hostname validation), and I think I should stop optimizing the code :P It's probably more than fast enough now and if not, people can still plug their own fancy extractHostname function.

@ctavan
Copy link

ctavan commented Sep 7, 2017

@remusao I think this would be a very good improvement to the library!

Just one slight suggestion for more consistency: I think it should be publicSuffix instead of suffix in the result object of tldjs.parse().

Only then will the result of parse strictly map to the current api:

// tldjs.parse('https://example.co.uk')
{
  valid: true,  // tldjs.isValid()
  hostname: 'foo.example.co.uk',
  publicSuffix: 'co.uk', // tldjs.getPublicSuffix()
  domain: 'example.co.uk', // tldjs.getDomain()
  subdomain: 'foo',  // tldjs.getSubdomain()
}

@thom4parisot
Copy link
Owner

I support the publicSuffix thingy too.

Although I was wondering if it was sensible to have lazy properties instead, so as the computation would be done only on the used properties. Can be part of another pull request if it is relevant.

@remusao
Copy link
Collaborator Author

remusao commented Sep 8, 2017

I just updated to rename suffix into publicSuffix. Regarding the lazy property, I actually tried it first, and the overhead of this was actually higher than just computing everything at once. So I'm not sure it's worth it. Actually I think it makes sense, it's not so much code so I guess V8 is able to optimize this like hell and inline a lot of it.

@thom4parisot
Copy link
Owner

the overhead of this was actually higher than just computing everything at once

Well, that closes the case then 👍

test/tld.js Outdated
@@ -56,7 +79,7 @@ describe('tld.js', function () {
expect(tld.isValid('.com')).to.be(false);
});

it('should be falsy on dotless hostname', function () {
it.skip('should be falsy on dotless hostname', function () {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests would not pass otherwise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, a hostname without dots seems to be valid (according to the firefox parser). So localhost is a valid hostname, strictly speaking, but we won't find a valid public suffix for it, so it will be rejected. Unless we specify it in validHosts. So I disabled these tests, should I remove them altogether?

We can double check if we have enough test cases to cover the behavior of all the functions for such corner cases (getPublicSuffix, getDomain, etc.) and make sure we offer a consistent behavior.

index.js Outdated

return {
valid,
hostname,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, the only property I would add would be tldExists, derived from the same method. Especially as now, thanks to you, we can retrieve a public suffix while having an unknown TLD.

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can, this way we actually have an equivalent for each method of the API when using the parse function. I will add it.

@remusao
Copy link
Collaborator Author

remusao commented Sep 9, 2017

@oncletom I updated the PR with the following changes:

  • Add tldExists as part of the output of parse
  • Implement some kind of laziness in the implementation of parse to avoid paying the cost of all functions invocation. This works by having an optional step argument to the function, indicating if we want to stop early after a specific result has been computed (e.g.: tldExists, publicSuffix, etc.).

Current timings:

tldjs#isValid x 524,277 ops/sec ±0.56% (92 runs sampled)
tldjs#extractHostname x 53,515 ops/sec ±0.25% (94 runs sampled)
tldjs#tldExists x 33,019 ops/sec ±0.65% (95 runs sampled)
tldjs#getPublicSuffix x 19,411 ops/sec ±1.32% (91 runs sampled)
tldjs#getDomain x 19,041 ops/sec ±0.95% (93 runs sampled)
tldjs#getSubdomain x 18,960 ops/sec ±0.61% (94 runs sampled)
tldjs#parse x 16,834 ops/sec ±1.21% (96 runs sampled)

@remusao
Copy link
Collaborator Author

remusao commented Sep 9, 2017

Not sure what's going on with the installation on node in travis. Locally testling seems to be installed properly. Could this be a network error or something?

@thom4parisot
Copy link
Owner

Great work 👍

Not sure what's going on with the installation on node in travis. Locally testling seems to be installed properly. Could this be a network error or something?

I don't know, it seems to happen only with node@8 😞

@thom4parisot
Copy link
Owner

Issue seems to be related to Travis CI cache. Testling reports an issue which does not happen on the mocha side of things:

not ok 18 tld.js isValid method should detect valid hostname
  http://localhost:60900/__testling?show=true:11633:49
  callFn@http://localhost:60900/__testling/node_modules/mocha/mocha.js:4480:25
  run@http://localhost:60900/__testling/node_modules/mocha/mocha.js:4472:13
  runTest@http://localhost:60900/__testling/node_modules/mocha/mocha.js:4968:13
  http://localhost:60900/__testling/node_modules/mocha/mocha.js:5074:19
  next@http://localhost:60900/__testling/node_modules/mocha/mocha.js:4886:16
  http://localhost:60900/__testling/node_modules/mocha/mocha.js:4896:11
  next@http://localhost:60900/__testling/node_modules/mocha/mocha.js:4820:16
  http://localhost:60900/__testling/node_modules/mocha/mocha.js:4864:9
  timeslice@http://localhost:60900/__testling/node_modules/mocha/mocha.js:82:27

@thom4parisot
Copy link
Owner

And it seems to hang randomly on node@8. Always after the same test.

@remusao
Copy link
Collaborator Author

remusao commented Sep 9, 2017

@oncletom Have you been able to reproduce the hanging? Is it while running npm run test? Or testling? Is it using phantomjs on Travis?

If you can reproduce, can you try to isolate the test that is causing the issue? If that can help, we can isolate each test case in its own it block, and add the input in the title of the test directly.

@thom4parisot
Copy link
Owner

I have not been able to reproduce the random hanging. The failing case with node@4.

Trying to identify the issue at the moment.

@thom4parisot
Copy link
Owner

It is related to the use of 'a'.repeat(). I suspect tests are ran with PhantomJS, which does not have this ES method. Will check if Firefox can be used instead.

@remusao
Copy link
Collaborator Author

remusao commented Sep 10, 2017

Ok, if this is the case, we can just replace the repeat with something else:

function repeat(str, n) {
  var res = '';
  for (var i = 0; i < n; i += 1) {
    res += str;
  }
  return res;
}

@remusao
Copy link
Collaborator Author

remusao commented Sep 10, 2017

I pushed it to give it a try, we can revert if it does not work.

@remusao
Copy link
Collaborator Author

remusao commented Sep 10, 2017

Well, looks like it was that issue?

@thom4parisot
Copy link
Owner

Yeah and thinking about it it's weird, unless a different version of Phantom is in use depending on which version of Node is active…

@thom4parisot thom4parisot merged commit a679f55 into thom4parisot:master Sep 10, 2017
@remusao
Copy link
Collaborator Author

remusao commented Sep 10, 2017

Let's keep an eye on this! That's weird indeed... Thanks for merging.

thom4parisot pushed a commit that referenced this pull request Sep 10, 2017
@remusao remusao deleted the api-change branch January 10, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants