Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add URI Validation #457

Closed
wants to merge 20 commits into from
Closed

Add URI Validation #457

wants to merge 20 commits into from

Conversation

DavidTPate
Copy link
Contributor

Hi there! Just started using this library and really enjoying it. Came up with a use case where I need to validate some URIs and noticed there were a few Pulls that came up but they were missing functionality. I went ahead and created a new library isUri which implements the full RFC 3986 in a build-up to a regular expression which it then uses to validate the URIs.

I added accompanying tests (similar to how email is tested currently) and also the error message as needed. I included all of the URLs that I've gathered thus far for testing my library as part of the tests, if there are any additional I'd love to add them to both implementations.

… validating some URIs. Add the error text for URI validation.
@DavidTPate
Copy link
Contributor Author

If this is accepted this should take care of #376, #162, and most likely #405. If it's not accepted I'm sure each of the people who needed this functionality could pull in my new library to get similar functionality.

@hueniverse hueniverse added the feature New functionality or improvement label Nov 3, 2014
@hueniverse
Copy link
Contributor

My preference would be for a single, well documented regex we can just copy into joi, over adding another dependency. isemail is an exception given the complexity of that workflow.

@DavidTPate
Copy link
Contributor Author

Could you point me in the direction of what you are looking for in terms of documentation for a regular expression?

The reason why I made it into its own module is because I felt it was more easily manageable to break it up in the module based on how the RFC defines a URI since the end result is a > 2100 character Regular Expression.

@hueniverse
Copy link
Contributor

If you can condense it more so that it looks like the ABNF block, it would help readability. I generally just use references to the RFC instead of big blocks of test. You just need to play with it and see if you can come up with something more compact. It's just very long right now.

@DavidTPate
Copy link
Contributor Author

Gotcha. I can do that pretty easily, as I already have it in ABNF just spread out.

Yeah, I understand it is long the specification itself is quite specific (as it should be) and accounting for that specificity can get pretty hairy. Even though it is long it is still quite fast, here are some benchmark numbers that I have on it.

> Benchmarking complex-ipv4.js

  Testing URI "http://asdf:qw%20er@127.0.0.1:8000?asdf=12345&asda=fc%2F#bacon"

  isUri#test(uri) x 1,916,525 ops/sec ±0.22% (196 runs sampled)

> Benchmarking complex-ipv6.js

  Testing URI "http://asdf:qw%20er@[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:8000?asdf=12345&asda=fc%2F#bacon"

  isUri#test(uri) x 1,367,745 ops/sec ±0.18% (196 runs sampled)

> Benchmarking complex-ipvFuture.js

  Testing URI "http://asdf:qw%20er@[v1.09azAZ-._~!$&'()*+,;=:]:8000?asdf=12345&asda=fc%2F#bacon"

  isUri#test(uri) x 1,485,227 ops/sec ±0.19% (195 runs sampled)

> Benchmarking complex-uri.js

  Testing URI "http://asdf:qw%20er@localhost:8000?asdf=12345&asda=fc%2F#bacon"

  isUri#test(uri) x 2,100,513 ops/sec ±0.18% (197 runs sampled)

> Benchmarking simple-uri.js

  Testing URI "mailto:John.Doe@example.com"

  isUri#test(uri) x 2,314,868 ops/sec ±0.19% (195 runs sampled)

…oved the library. Added documentation for the regular expression and a few additional tests.
@DavidTPate
Copy link
Contributor Author

@Marsup and @hueniverse It took me a little bit to get to this, but I pulled in the regular expression and added some documentation to describe each of the blocks and what they do. The most complex piece of the Regex is the part dealing with IP addresses as the formats that are valid are more complex.

@Marsup
Copy link
Collaborator

Marsup commented Feb 15, 2015

Didn't @hueniverse have in mind a variable decomposition of that ?

@DavidTPate
Copy link
Contributor Author

It seems like I didn't understand correctly. If you take a look at the library I already put together I think it documents the pieces as you are describing it. Take the host for example, it is described in the same style as the ABNF:

/**
 * Host (http://tools.ietf.org/html/rfc3986#page-18)
 *
 * host        = IP-literal / IPv4address / reg-name
 *
 * @type {string}
 */
var host = '(' + IPLiteral + or + IPv4address + or + regName + ')';

Is that more what you are looking for just where all of that is pulled into Joi as opposed to using a separate library to manage it?

@Marsup
Copy link
Collaborator

Marsup commented Feb 22, 2015

@hueniverse can you check if that's what you had in mind ?

@Marsup Marsup self-assigned this Feb 22, 2015
@Marsup Marsup added this to the 6.0.0 milestone Feb 22, 2015
@hueniverse
Copy link
Contributor

Almost. The regex construction needs to move out into an internals value (no need to construct that every time). I also think that 99% of use cases care about a specific scheme or schemes like http and https. You usually don't want a mailto or ftp links in a form asking for a home page. So I think there should be an optional argument taking a single or list of schemes.

@Marsup
Copy link
Collaborator

Marsup commented Feb 22, 2015

V8 usually does regex caching but yeah that makes sense. I agree on the scheme part as well.
@DavidTPate care to make that last change and update docs ?

@DavidTPate
Copy link
Contributor Author

Thanks @hueniverse and @Marsup I'll pull in the checks with the accompanying ABNF descriptions. Providing the ability to specify schemes makes a lot of sense, I'll go ahead and add that as well. It will just mean that when someone specifies a scheme a new Regex will be compiled since it's being created dynamically. If you guys know something I don't about that just let me know :)

@hueniverse
Copy link
Contributor

Don't know about performance but splitting the uri string on : first might make things easier.

@DavidTPate
Copy link
Contributor Author

Thanks @hueniverse I'll do a benchmark test on it. During initial development Regex proved to be faster, but that was tested with a less complex Regex. I definitely want to go through and re-evaluate a little to see if I can simplify/gain processing velocity.

@devinus
Copy link

devinus commented Feb 26, 2015

This would be really useful. I'm currently trying to validate URL's sent to me in query parameters.

@@ -295,5 +316,181 @@ internals.String.prototype.trim = function () {
return obj;
};

function createUriRegex(optionalScheme) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please attach this to internals as well. Also we put a space after function names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return null;
}

return Errors.create('string.uri', null, state, options);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth having 2 messages depending on whether we have schemes so that the error is more explicit on the expectations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I added one which includes the scheme that was passed. So providing http and https? would yield something similar to: must be a valid uri with a scheme matching the http|https? pattern

…dd the create Uri Regex method to internals (and call it from there). Fix spacing with Function and parameters (add space)
…String (which will be automatically escaped), or an Array of String/RegExp which work the same way as schemes.
@DavidTPate
Copy link
Contributor Author

Went ahead and caught this up to master. I think I just have the documentation left on it. Other than that I really want to see Eran's comment on benchmarks. I did them during the original development and saw more performance from RegExp than from String parsing/splitting but I'm interested in seeing where they land now with the additional complexity in the RegExp.

@DavidTPate
Copy link
Contributor Author

I got the documentation setup and now I'm moving on to the benchmarking. I put together a simple project which runs the benchmarks on Travis for me with Node 0.10 (latest) and Node 0.11 (latest) through a few URI scenarios.

I'm going to go through and add some different ways that I can think of parsing this and see how they benchmark.

@DavidTPate
Copy link
Contributor Author

I went through a few iterations and was able to get a little more than double the performance by switching over the regular expression groups to non-capturing groups.

Here's an example of the difference (the first one uses capturing the groups, the second does not):

Testing URI "http://mail.google.com/"
createUriRegex()#test(uri)         x 1,534,151 ops/sec ±0.73% (96 runs sampled)
createUriRegexImproved()#test(uri) x 3,754,082 ops/sec ±0.99% (92 runs sampled)
createUriRegex('https?')#test(uri)         x 1,710,824 ops/sec ±0.41% (93 runs sampled)
createUriRegexImproved('https?')#test(uri) x 3,424,541 ops/sec ±1.20% (95 runs sampled)

Testing URI "http://asdf:qw%20er@mail.google.com"
createUriRegex()#test(uri)         x 1,672,742 ops/sec ±0.48% (96 runs sampled)
createUriRegexImproved()#test(uri) x 4,623,601 ops/sec ±1.15% (94 runs sampled)
createUriRegex('https?')#test(uri)         x 1,928,301 ops/sec ±0.64% (96 runs sampled)
createUriRegexImproved('https?')#test(uri) x 4,949,729 ops/sec ±0.62% (90 runs sampled)

Testing URI "http://asdf:qw%20er@mail.google.com:8000"
createUriRegex()#test(uri)         x 1,634,704 ops/sec ±0.64% (98 runs sampled)
createUriRegexImproved()#test(uri) x 4,438,129 ops/sec ±0.69% (93 runs sampled)
createUriRegex('https?')#test(uri)         x 1,870,456 ops/sec ±0.51% (96 runs sampled)
createUriRegexImproved('https?')#test(uri) x 4,504,974 ops/sec ±1.40% (92 runs sampled)

if (uriOptions.scheme) {
Hoek.assert(uriOptions.scheme instanceof RegExp || typeof uriOptions.scheme === 'string' || Array.isArray(uriOptions.scheme), 'scheme must be a RegExp, String, or Array');
// If someone wants to match HTTP or HTTPS for example then we need to support both RegExp and String so we don't escape their pattern unknowingly.
if (uriOptions.scheme instanceof RegExp) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put both of these cases into an array and have a single parameter pre-processing path ? Trying to keep it DRY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. this changed the error messages a little bit, so now they will mention "at position 0" when only a single String is passed, but I think that's okay.

@Marsup
Copy link
Collaborator

Marsup commented Mar 1, 2015

Didn't think of non-capturing groups, indeed makes sense, good job !
People are getting nervous about a release, I'm not going to wait very long for Eran to answer, as long as it works any performance improvement can make it into another 6.x.

@DavidTPate
Copy link
Contributor Author

All your feedback should be implemented now. I also added some Positive and Negative tests that I pulled from Node's URL parser tests to try to find anything not working like it should. But things seem good.

Marsup added a commit that referenced this pull request Mar 2, 2015
@Marsup Marsup closed this in e25435e Mar 2, 2015
@Marsup
Copy link
Collaborator

Marsup commented Mar 2, 2015

I've fixed a few things and squashed it, I'll release soon, thanks a lot !

@Starefossen
Copy link
Contributor

Good work guys! Really looking forward to joi v6 😄

@lock
Copy link

lock bot commented Jan 9, 2020

This thread has been automatically locked due to inactivity. Please open a new issue for related bugs or questions following the new issue template instructions.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature New functionality or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants