Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Characters are not percent-encoded in URIs #101

Closed
lostenderman opened this issue Dec 13, 2022 · 9 comments
Closed

Characters are not percent-encoded in URIs #101

lostenderman opened this issue Dec 13, 2022 · 9 comments
Labels

Comments

@lostenderman
Copy link
Owner

See https://spec.commonmark.org/0.30/#example-346

@Witiko
Copy link
Collaborator

Witiko commented Dec 19, 2022

The corresponding unit test is testfiles/CommonMark_0.30/code_spans/019.test:

%   ---RESULT--- "example": 346,
%   
%   <p><a href="http://foo.bar.%60baz">http://foo.bar.`baz</a>`</p>
%   
%   ---\RESULT---

<<<
<http://foo.bar.`baz>`
>>>
documentBegin
BEGIN link
- label: http://foo.bar.`baz
- URI: http://foo.bar.%60baz
- title: 
END link
documentEnd

Here is the result of running git checkout commonmark; cd tests; ./test.sh "testfiles/CommonMark_0.30/code_spans/019.test":

Testfile testfiles/CommonMark_0.30/code_spans/019.test
  Format templates/plain/
    Template templates/plain/input.tex.m4
      Command luatex                                   --interaction=nonstopmode  test.tex
*** test-expected.log	2022-12-19 11:33:44.731768482 +0100
--- test-actual.log	2022-12-19 11:33:50.761747898 +0100
***************
*** 1,7 ****
  documentBegin
  BEGIN link
  - label: http://foo.bar.`baz
! - URI: http://foo.bar.%60baz
  - title: 
  END link
  documentEnd
--- 1,7 ----
  documentBegin
  BEGIN link
  - label: http://foo.bar.`baz
! - URI: http://foo.bar.`baz
  - title: 
  END link
  documentEnd

The issue seems related to the writer->uri() metod, which is used by the link writer and the image writer to format URIs. At the moment, the method only replaces various special characters of TeX before typesetting. We should also replace characters not allowed in an URI using the percent encoding specified in RFC 3986. There are ready-made implementations of percent encoding available online. We should add a utility function util.uri_encode() and use it in the definition of the writer->uri() metod like this:

if options.hybrid then
  self.string = self.escape_minimal
    self.uri = function(s)
      s = util.uri_encode(s)
      s = self.escape_minimal(s)
      return s
    end
else
  self.string = self.escape
    self.uri = function(s)
      s = util.uri_encode(s)
      s = self.escape_uri(s)
      return s
    end
end

Note also that we (ab)use writer->uri() to encode keys in the conversion of YAML to TeX. We should add a corresponding method writer->key() for YAML keys that wouldn't apply percent-encoding. We should also perhaps rename the escaping functions to have clearer purpose and forbid the use of the low-level self.escape_*() methods in writer. For example:

< self.escape = util.escaper(self.escaped_chars, self.escaped_minimal_strings)
< self.escape_uri = util.escaper(self.escaped_uri_chars, self.escaped_minimal_strings)
< self.escape_minimal = util.escaper({}, self.escaped_minimal_strings)
<
< if options.hybrid then
<   self.string = self.escape_minimal
<   self.uri = function(s)
<     s = util.uri_encode(s)
<     s = self.escape_minimal(s)
<     return s
<   end
< else
<   self.string = self.escape
<   self.uri = function(s)
<     s = util.uri_encode(s)
<     s = self.escape_uri(s)
<     return s
<   end
< end
---
> local escape_typographic_text = util.escaper(self.escaped_chars, self.escaped_minimal_strings)
> local escape_programmatic_text = util.escaper(self.escaped_uri_chars, self.escaped_minimal_strings)
> local escape_hybrid_text = util.escaper({}, self.escaped_minimal_strings)
>
> if options.hybrid then
>   self.string = self.escape_hybrid_text
>   self.uri = function(s)
>     s = util.uri_encode(s)
>     s = self.escape_hybrid_text(s)
>     return s
>   end
>   self.key = self.escape_hybrid_text
> else
>   self.string = self.escape_typographic_material
>   self.uri = function(s)
>     s = util.uri_encode(s)
>     s = self.escape_programmatic_text(s)
>     return s
>   end
>   self.key = self.escape_programmatic_text
> end

We will need to replace any direct calls to writer->escape, writer->escape_uri and writer->escape_minimal with the corresponding semantic aliases.

@Witiko
Copy link
Collaborator

Witiko commented Dec 19, 2022

As for the issue title: It is percent-encoded rather than mapped to HTML entities.

@lostenderman
Copy link
Owner Author

CommonMark is not percent-encoding some characters in URIs.

Some of them are:

/ + : ? # ( ) *

Consider a direct link:

[](/+:?#()*)

What CommonMark expects:

documentBegin
BEGIN link
- label: 
- URI: /+:?#()*
- title: 
END link
documentEnd

How the URI is actually percent-encoded:

documentBegin
BEGIN link
- label: 
- URI: %2F%2B%3A%3F%23%28%29%2A
- title: 
END link
documentEnd

Do we update the tests to reflect to actual percent-encoding or skip the mentioned characters when encoding?

@Witiko
Copy link
Collaborator

Witiko commented Jan 18, 2023

@lostenderman Can you ask whether this is a bug or a feature at https://github.com/commonmark/commonmark-spec/issues?

@lostenderman
Copy link
Owner Author

I don't suppose it is a bug, seeing multiple discussions about it:

commonmark/commonmark-spec#334

Officially we want to remain neutral on escaping -- it's a renderer issue.
So I think the best approach to this would be for test/normalize.py to normalize href and src attributes somehow.

and

commonmark/commonmark-spec#270.

While the tests remain as HTML, we could deal with the issue by tweaking tests/normalize.py.
This seems an attractive option to me, especially in light of the following consideration. Suppose you're targeting LaTeX in a renderer. An image might then link to a local file. And in that case, you wouldn't want percent-encoding. So that's an argument that all URI encoding issues should be left up to the renderer, not handled in the parser (which is what the spec covers).

@lostenderman lostenderman changed the title Characters are not properly mapped to HTML entities in URIs Characters are not percent-encoded in URIs Jan 18, 2023
@Witiko
Copy link
Collaborator

Witiko commented Jan 18, 2023

If percent-encoding is considered optional and "up to the renderer", then I would insert raw Unicode codepoints in Lua and let the TeX renderer do the percent-encoding if it needs to.

@lostenderman
Copy link
Owner Author

lostenderman commented Feb 7, 2023

I am not sure I understand the proposed solution correctly.

What output should be produced in this example?

[a]

[a]: %#

How do

raw Unicode codepoints

fit into all of this?

@Witiko
Copy link
Collaborator

Witiko commented Feb 7, 2023

As per the quotes you listed in #101 (comment), the percent-encoding should be left up to the renderer. Therefore, it seems to me that we don't need to convert http://foo.bar.`baz to http://foo.bar.%60baz (here, ` is the raw Unicode codepoint) in the Lua parser of the Markdown package. The users can do that by redefining the corresponding TeX renderers to percent-encode the URLs if they need to.

@Witiko
Copy link
Collaborator

Witiko commented Feb 27, 2023

@lostenderman In Witiko#271 and Witiko#272, implemented changes suggested in #101 (comment). I suppose we can close this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants