Conversation
There was a problem hiding this comment.
Looks good to me.
There are a number of spots where we're doing regex-ish logic with loops, which makes me wonder if there's a point where it starts to make sense to just use regexes. Based on this tiny benchmark I wrote for valid_identifier it doesn't seem like raw performance is a big distinguishing factor, though that doesn't say anything about crate bloat or other code organization questions, or for that matter, more complex regexes (which we should, of course, avoid).
https://gist.github.com/david-crespo/1ed0857caa91d37c0a8b9b47f0f124ed
loop time: [15.361 ns]
loop invalid time: [11.025 ns]
regex time: [23.126 ns]
regex invalid time: [15.915 ns]
| * configured. Each segment may be either a literal string or a variable (the | ||
| * latter indicated by being wrapped in braces. | ||
| * latter indicated by being wrapped in braces). Variables may consume a single | ||
| * /-delimited segment or several as defined by a regex (currently only `.*` is |
There was a problem hiding this comment.
There's a couple of things that make me uncomfortable here but these fears might be specious.
One is that we might be conflating "consumes more than one path segment" with "is matched by a regular expression". I could imagine someone wanting to use regular expressions to match just a segment, as in /instances/{instance_id:[0-9]+}/status. (I'm not really saying we should try to support that, so much as: if we're trying to leave the door open to regular expressions in path segments, it's not obvious to me now that this shouldn't be possible.)
Another is that I think there's some non-trivial work that goes into correctly separating the path segments in HTTP. This is the kind of stuff I'm thinking about
dropshot/dropshot/src/router.rs
Lines 474 to 502 in e0201cf
Okay, I mostly punted on that, but the goal was to ensure that there's one place where the correct code would go. These feel like the sort of things that are fine to get wrong for a while and then really hose you when you run into the case where they're wrong. I'm a little worried that if we allow variables to match multiple segments, then the consumer will try to interpret this path themselves, and it will become really easy to do something that appears to work in some cases but breaks when the input is percent-encoded or something like that. (If percent-encoding is a real issue, I could imagine this behavior resulting in a vulnerability where somebody thinks they're rejecting requests that contain
/.., but they're missing requests that contain '/%2e%2e', and this lets people read /etc/passwd or whatever. Again, this is potentially already broken today because all I have is a TODO, not a correct implementation, but it would suck for this problem to sprawl.) Does this go against the principle of "make it hard to do the wrong thing"?
We could potentially fix this by saying that instead of a String, you get an array of path segments or something like that.
Am I being too paranoid?
| }) | ||
| .unwrap_or_default(); | ||
|
|
||
| let visible = if let Some(true) = metadata.unpublished { |
There was a problem hiding this comment.
Is there a test case around visible/unpublished? (Sorry if I missed it!)
There was a problem hiding this comment.
There is. The openapi tests include an unpublished endpoint. We can tell they're successful because no change was required for the output json.
I wasn't sure about the visible vs. unpublished disparity, but I thought that unpublished was the right setting for endpoints, but it felt weird to carry the implicit negative into the struct; let me know if you have suggestions for clarity here.
This will match paths such as `/foo/bar` and `/foo/bar/baz.css`
We can use this to serve static assets such as css, html, and image
files.
Express uses this syntax: `/foo/bar/:rest(.*)`
Dropwizard / Jersey / JAX-RS does: `/foo/bar/{rest:.*}`
Actix does: `/foo/bar/{rest:.*}`
This also adds support for a tag (unpublished = true) in the #[endpoint]
macro to omit certain endpoints from the OpenAPI output. This is both
useful for non-API endpoints and necessary in that OpenAPI doesn't
support any type of multi-segment matching of routes.
modified map Deserializer to allow for compound values
|
Based on a discussion with @davepacheco, I changed this around quite a bit. Rather than receiving the path in a I observed that consumers could use types for their APIs that were impossible to serialize at runtime when the API endpoint is invoked. To address this, I added more stringent type checking. It's still at runtime, but it happens during This exposed areas where we were using types that could never work (e.g. There's still more work to be done here:
But this felt like a significant step forward and I wanted to unblock work that will consume the wildcard path matching. |
| /** a path segment for a variable */ | ||
| Varname(String), | ||
| VarnameSegment(String), | ||
| /** a path segment for a variable with a regex */ |
There was a problem hiding this comment.
IMO would be helpful to describe why there are two Strings here (what does each represent)?
| assert!( | ||
| pat == ".*", | ||
| "Only the pattern '.*' is currently supported" | ||
| ); | ||
| PathSegment::VarnameRegEx(var.to_string(), pat.to_string()) |
There was a problem hiding this comment.
Have you had a chance to look at BurntSushi's regex crate?
Although it probably wouldn't be necessary for a single pattern match, it might be a worthwhile investment before going much further down the "stringly-typed representation of regex patterns".
Also notable, a little mention in the documentation for RegexSet:
If one has hundreds or thousands of regexes to match repeatedly (like a URL router for a complex web application or a user agent matcher), then a regex set can realize huge performance gains.
Seems like a neat use-case :)
There was a problem hiding this comment.
Reading through more of this PR, it seems like we're doing a lot of special casing for "exactly the wildcard" - not just that this is the only permitted regex, but also things like checking that no segments exist after the wildcard.
Seems kinda like those special cases might be hard to undo later - as an example, is it possible to validate that the intersection of N regexes is empty for all possible input paths, to perform a similar collision detection?
There was a problem hiding this comment.
Yes, I'm familiar with the regex crate. I don't follow your second comment.
There was a problem hiding this comment.
ah, let me try to re-phrase.
Basically, when I see structures like this:
dropshot/dropshot/src/router.rs
Lines 94 to 101 in ee7fc26
(specifically the VariableRest part)
and checks like this:
dropshot/dropshot/src/router.rs
Lines 297 to 305 in ee7fc26
I interpret that check as: "we'd like to stop someone from registering both /foo/{bar:.*} and /foo/baz" (as covered by tests below) - but in a more general sense, it seems like we could say "we'd like to prevent callers from registering multiple routes that overlap with one another".
When considering regexes beyond just wildcards, this is where my original question came up - we can maintain checks like this, where we handle only "full wildcard", but making the jump to any other type of regex seems like it'll significantly alter this implementation.
I guess I'm asking what direction the ship is steering towards:
- Do we think pure wildcards will be the only supported regex for the foreseeable future?
- Or do we expect other regexes to be supported? (This led to my question about "finding the intersection between multiple regexes", which seemed like the generalized solution to the "do the routes overlap" question).
There was a problem hiding this comment.
I think I see. I'm going to change VarnameRegEx to VarnameWildcard to remove the false generalization.
There was a problem hiding this comment.
I posted my second comment before I saw your response. I don't anticipate other regexes being supported any time soon.
smklein
left a comment
There was a problem hiding this comment.
Looks good to me - to keep up with all other major changes, maybe worth a note in the CHANGELOG?
(I don't believe this is breaking, but it is a visible change)
|
FYI, apparently this PR is causing test failures in Omicron, and preventing rolling forward: Trivially repro'd in the Omicron repo by running I'll dig around a bit more, and see if I can root-cause |
|
... weird, toggling the |
|
Fixed by #122 |
This will match paths such as
/foo/barand/foo/bar/baz.cssWe can use this to serve static assets such as css, html, and image
files.
Express uses this syntax:
/foo/bar/:rest(.*)Dropwizard / Jersey / JAX-RS does:
/foo/bar/{rest:.*}Actix does:
/foo/bar/{rest:.*}This also adds support for a tag (unpublished = true) in the #[endpoint]
macro to omit certain endpoints from the OpenAPI output. This is both
useful for non-API endpoints and necessary in that OpenAPI doesn't
support any type of multi-segment matching of routes.