Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaping Rules #54

Closed
4 tasks done
tommyknows opened this issue Jul 17, 2023 · 3 comments
Closed
4 tasks done

Escaping Rules #54

tommyknows opened this issue Jul 17, 2023 · 3 comments

Comments

@tommyknows
Copy link
Contributor

tommyknows commented Jul 17, 2023

Hi again!

I've started wondering about the correct escaping-rules for all the different components of the purl. As far as I can tell, the current implementation doesn't actually match the spec in a couple of edge-cases, as it uses PathEncode instead of QueryEncode.

(I'd argue that using PathEncode is the right thing for namespace, name and version, but it creates a couple of difficulties that I've written down on an issue on the spec).

Here's the escapes that are required:

  • namespace: separate on / to get the individual segments, escape every segment with Path/QueryEscape. Currently this is implemented by the usage of JoinPath and then EscapedPath, I believe this is correct.
  • name: simply escape with Path/QueryEscape: fix: escape and unescape name #55
  • version: escape with Path/QueryEscape
  • qualifier: use QueryEscape

The current implementation calls EscapedPathon namespace, name and version, so technically implements the above points, except for one difference: the name isn't escaped by itself, meaning that if it contains a /, it will not be escaped.

Additionally, it won't "fully" escape versions, which could create ambiguities if the version contains an @, e.g. pkg:deb/debian/mypkg@v1.2.3@alpha-1. However, this question I think needs to be answered by the spec, see my linked comment. As a workaround, I guess we could query-escape it...

The name issue can be solved by PathEscapeing (and unescaping) it, I think I'll raise a PR to fix this.

It would be good to know however what to do with the version as well :)

@tommyknows
Copy link
Contributor Author

Coming back to this, the issue seems to be when other packageurl implementations come into play.

For example, the Go implementation produces a purl like this:

pkg:hex/git@github.com:PagerDuty%2Fid-utils-elixir.git@7e01eaec33e59cd2662c9461890ef75582346b5d

Note that according to URL standards, this is fine. The Go implementation also round-trips this correctly, too.

However, the encoding is technically not correct according to the purl spec, because that defines that

the '@' version separator must be encoded as %40 elsewhere

And so, different implementation like the Javascript one parse the above PURL "incorrectly".

I'm not sure if there's any other way than to simply QueryEscape everything, even if it's wrong from a URL-perspective...

At least for now, this would ensure compatibility with other implementations and make the Go library follow the spec as well.

Generally however, I think this is an issue with the PURL spec....but fixing that would likely be a breaking change...

@tommyknows
Copy link
Contributor Author

tommyknows commented Jul 18, 2023

Okay I think I'm going crazy.
The JS implementation uses encodeURIComponent and decodeURIComponent1.
I've tried to compare & round-trip these through Go's PathEscape vs QueryEscape...

fmt.Println(url.PathEscape("a+b c@d"))
// a+b%20c@d
fmt.Println(url.QueryEscape("a+b c@d"))
// a%2Bb+c%40d
// outputs from Go
let pathEscaped = "a+b%20c@d"
let queryEscaped = "a%2Bb+c%40d"
console.log(decodeURIComponent(pathEscaped))
// a+b c@d
console.log(decodeURI(pathEscaped))
// a+b c@d
console.log(decodeURIComponent(queryEscaped))
// a+b+c@d
console.log(decodeURI(queryEscaped))
// a%2Bb+c%40d

(Apparently, Go is RFC3986 compliant (except where it deviates for compatibility reasons), while JS is RFC 2396 compliant (which is obsolete)...)

The Go implementation seems to be more robust:

console.log(encodeURIComponent("a+b c@d"));
// a%2Bb%20c%40d
console.log(encodeURI("a+b c@d"))
// a+b%20c@d
fmt.Println(url.QueryUnescape("a%2Bb%20c%40d"))
// a+b c@d <nil>
fmt.Println(url.PathUnescape("a%2Bb%20c%40d"))
// a+b c@d <nil>

fmt.Println(url.QueryUnescape("a+b%20c@d"))
// a b c@d <nil>
fmt.Println(url.PathUnescape("a+b%20c@d"))
// a+b c@d <nil>

So:

  • To decode a purl is "easy", we simply need to match the encoding-function. E.g. PathUnescape if the component was encoded with PathEscape. That seems to make sense, as the + sign is encoded and decoded differently.
  • To encode a purl in Go (e.g. toString), I don't really know. Looking at the JS decoder, I'd argue we'd need to PathEncode the components, not QueryEncode them...

But that doesn't solve the parsing ambiguities.

Reading into this a bit more, the primary issue seems to be the " " (space). In a query, it can either be encoded as %20 (that's what JS does) or as + (what Go does). This is nicely explained in this SO question.

However, given the fact that JS doesn't actually decode + to spaces, I'd argue that to be compatible in Go, we'd need to replace the escaped spaces, which have been encoded as a +, with a %20 instead. This should be fairly trivial, and as far as I can see, should allow us to use QueryEscape as well.

Footnotes

  1. While it uses these functions, it then afterwards replaces some escaped characters again...

@tommyknows
Copy link
Contributor Author

I think with #58 now being merged, this should be all done, so I'll close this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant