Escaping Rules #54

tommyknows · 2023-07-17T14:51:26Z

Hi again!

I've started wondering about the correct escaping-rules for all the different components of the purl. As far as I can tell, the current implementation doesn't actually match the spec in a couple of edge-cases, as it uses PathEncode instead of QueryEncode.

(I'd argue that using PathEncode is the right thing for namespace, name and version, but it creates a couple of difficulties that I've written down on an issue on the spec).

Here's the escapes that are required:

namespace: separate on / to get the individual segments, escape every segment with Path/QueryEscape. Currently this is implemented by the usage of JoinPath and then EscapedPath, I believe this is correct.
name: simply escape with Path/QueryEscape: fix: escape and unescape name #55
version: escape with Path/QueryEscape
qualifier: use QueryEscape

The current implementation calls EscapedPathon namespace, name and version, so technically implements the above points, except for one difference: the name isn't escaped by itself, meaning that if it contains a /, it will not be escaped.

Additionally, it won't "fully" escape versions, which could create ambiguities if the version contains an @, e.g. pkg:deb/debian/mypkg@v1.2.3@alpha-1. However, this question I think needs to be answered by the spec, see my linked comment. As a workaround, I guess we could query-escape it...

The name issue can be solved by PathEscapeing (and unescaping) it, I think I'll raise a PR to fix this.

It would be good to know however what to do with the version as well :)

The text was updated successfully, but these errors were encountered:

tommyknows · 2023-07-18T10:09:08Z

Coming back to this, the issue seems to be when other packageurl implementations come into play.

For example, the Go implementation produces a purl like this:

pkg:hex/git@github.com:PagerDuty%2Fid-utils-elixir.git@7e01eaec33e59cd2662c9461890ef75582346b5d

Note that according to URL standards, this is fine. The Go implementation also round-trips this correctly, too.

However, the encoding is technically not correct according to the purl spec, because that defines that

the '@' version separator must be encoded as %40 elsewhere

And so, different implementation like the Javascript one parse the above PURL "incorrectly".

I'm not sure if there's any other way than to simply QueryEscape everything, even if it's wrong from a URL-perspective...

At least for now, this would ensure compatibility with other implementations and make the Go library follow the spec as well.

Generally however, I think this is an issue with the PURL spec....but fixing that would likely be a breaking change...

tommyknows · 2023-07-18T12:48:38Z

Okay I think I'm going crazy.
The JS implementation uses encodeURIComponent and decodeURIComponent¹.
I've tried to compare & round-trip these through Go's PathEscape vs QueryEscape...

fmt.Println(url.PathEscape("a+b c@d"))
// a+b%20c@d
fmt.Println(url.QueryEscape("a+b c@d"))
// a%2Bb+c%40d

// outputs from Go
let pathEscaped = "a+b%20c@d"
let queryEscaped = "a%2Bb+c%40d"
console.log(decodeURIComponent(pathEscaped))
// a+b c@d
console.log(decodeURI(pathEscaped))
// a+b c@d
console.log(decodeURIComponent(queryEscaped))
// a+b+c@d
console.log(decodeURI(queryEscaped))
// a%2Bb+c%40d

(Apparently, Go is RFC3986 compliant (except where it deviates for compatibility reasons), while JS is RFC 2396 compliant (which is obsolete)...)

The Go implementation seems to be more robust:

console.log(encodeURIComponent("a+b c@d"));
// a%2Bb%20c%40d
console.log(encodeURI("a+b c@d"))
// a+b%20c@d

fmt.Println(url.QueryUnescape("a%2Bb%20c%40d"))
// a+b c@d <nil>
fmt.Println(url.PathUnescape("a%2Bb%20c%40d"))
// a+b c@d <nil>

fmt.Println(url.QueryUnescape("a+b%20c@d"))
// a b c@d <nil>
fmt.Println(url.PathUnescape("a+b%20c@d"))
// a+b c@d <nil>

So:

To decode a purl is "easy", we simply need to match the encoding-function. E.g. PathUnescape if the component was encoded with PathEscape. That seems to make sense, as the + sign is encoded and decoded differently.
To encode a purl in Go (e.g. toString), I don't really know. Looking at the JS decoder, I'd argue we'd need to PathEncode the components, not QueryEncode them...

But that doesn't solve the parsing ambiguities.

Reading into this a bit more, the primary issue seems to be the " " (space). In a query, it can either be encoded as %20 (that's what JS does) or as + (what Go does). This is nicely explained in this SO question.

However, given the fact that JS doesn't actually decode + to spaces, I'd argue that to be compatible in Go, we'd need to replace the escaped spaces, which have been encoded as a +, with a %20 instead. This should be fairly trivial, and as far as I can see, should allow us to use QueryEscape as well.

While it uses these functions, it then afterwards replaces some escaped characters again... ↩

tommyknows · 2023-08-13T08:51:55Z

I think with #58 now being merged, this should be all done, so I'll close this issue!

tommyknows mentioned this issue Jul 17, 2023

fix: escape and unescape name #55

Merged

tommyknows mentioned this issue Jul 18, 2023

fix: escape everything with modified QueryEscape #58

Merged

tommyknows closed this as completed Aug 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escaping Rules #54

Escaping Rules #54

tommyknows commented Jul 17, 2023 •

edited

Loading

tommyknows commented Jul 18, 2023

tommyknows commented Jul 18, 2023 •

edited

Loading

tommyknows commented Aug 13, 2023

Escaping Rules #54

Escaping Rules #54

Comments

tommyknows commented Jul 17, 2023 • edited Loading

tommyknows commented Jul 18, 2023

tommyknows commented Jul 18, 2023 • edited Loading

Footnotes

tommyknows commented Aug 13, 2023

tommyknows commented Jul 17, 2023 •

edited

Loading

tommyknows commented Jul 18, 2023 •

edited

Loading