Characters are not percent-encoded in URIs #101

lostenderman · 2022-12-13T19:04:30Z

See https://spec.commonmark.org/0.30/#example-346

Witiko · 2022-12-19T12:29:34Z

The corresponding unit test is testfiles/CommonMark_0.30/code_spans/019.test:

%   ---RESULT--- "example": 346,
%   
%   <p><a href="http://foo.bar.%60baz">http://foo.bar.`baz</a>`</p>
%   
%   ---\RESULT---

<<<
<http://foo.bar.`baz>`
>>>
documentBegin
BEGIN link
- label: http://foo.bar.`baz
- URI: http://foo.bar.%60baz
- title: 
END link
documentEnd

Here is the result of running git checkout commonmark; cd tests; ./test.sh "testfiles/CommonMark_0.30/code_spans/019.test":

Testfile testfiles/CommonMark_0.30/code_spans/019.test
  Format templates/plain/
    Template templates/plain/input.tex.m4
      Command luatex                                   --interaction=nonstopmode  test.tex
*** test-expected.log	2022-12-19 11:33:44.731768482 +0100
--- test-actual.log	2022-12-19 11:33:50.761747898 +0100
***************
*** 1,7 ****
  documentBegin
  BEGIN link
  - label: http://foo.bar.`baz
! - URI: http://foo.bar.%60baz
  - title: 
  END link
  documentEnd
--- 1,7 ----
  documentBegin
  BEGIN link
  - label: http://foo.bar.`baz
! - URI: http://foo.bar.`baz
  - title: 
  END link
  documentEnd

The issue seems related to the writer->uri() metod, which is used by the link writer and the image writer to format URIs. At the moment, the method only replaces various special characters of TeX before typesetting. We should also replace characters not allowed in an URI using the percent encoding specified in RFC 3986. There are ready-made implementations of percent encoding available online. We should add a utility function util.uri_encode() and use it in the definition of the writer->uri() metod like this:

if options.hybrid then
  self.string = self.escape_minimal
    self.uri = function(s)
      s = util.uri_encode(s)
      s = self.escape_minimal(s)
      return s
    end
else
  self.string = self.escape
    self.uri = function(s)
      s = util.uri_encode(s)
      s = self.escape_uri(s)
      return s
    end
end

Note also that we (ab)use writer->uri() to encode keys in the conversion of YAML to TeX. We should add a corresponding method writer->key() for YAML keys that wouldn't apply percent-encoding. We should also perhaps rename the escaping functions to have clearer purpose and forbid the use of the low-level self.escape_*() methods in writer. For example:

< self.escape = util.escaper(self.escaped_chars, self.escaped_minimal_strings)
< self.escape_uri = util.escaper(self.escaped_uri_chars, self.escaped_minimal_strings)
< self.escape_minimal = util.escaper({}, self.escaped_minimal_strings)
<
< if options.hybrid then
<   self.string = self.escape_minimal
<   self.uri = function(s)
<     s = util.uri_encode(s)
<     s = self.escape_minimal(s)
<     return s
<   end
< else
<   self.string = self.escape
<   self.uri = function(s)
<     s = util.uri_encode(s)
<     s = self.escape_uri(s)
<     return s
<   end
< end
---
> local escape_typographic_text = util.escaper(self.escaped_chars, self.escaped_minimal_strings)
> local escape_programmatic_text = util.escaper(self.escaped_uri_chars, self.escaped_minimal_strings)
> local escape_hybrid_text = util.escaper({}, self.escaped_minimal_strings)
>
> if options.hybrid then
>   self.string = self.escape_hybrid_text
>   self.uri = function(s)
>     s = util.uri_encode(s)
>     s = self.escape_hybrid_text(s)
>     return s
>   end
>   self.key = self.escape_hybrid_text
> else
>   self.string = self.escape_typographic_material
>   self.uri = function(s)
>     s = util.uri_encode(s)
>     s = self.escape_programmatic_text(s)
>     return s
>   end
>   self.key = self.escape_programmatic_text
> end

We will need to replace any direct calls to writer->escape, writer->escape_uri and writer->escape_minimal with the corresponding semantic aliases.

Witiko · 2022-12-19T12:32:30Z

As for the issue title: It is percent-encoded rather than mapped to HTML entities.

lostenderman · 2023-01-18T15:04:47Z

CommonMark is not percent-encoding some characters in URIs.

Some of them are:

/ + : ? # ( ) *

Consider a direct link:

[](/+:?#()*)

What CommonMark expects:

documentBegin
BEGIN link
- label: 
- URI: /+:?#()*
- title: 
END link
documentEnd

How the URI is actually percent-encoded:

documentBegin
BEGIN link
- label: 
- URI: %2F%2B%3A%3F%23%28%29%2A
- title: 
END link
documentEnd

Do we update the tests to reflect to actual percent-encoding or skip the mentioned characters when encoding?

Witiko · 2023-01-18T15:27:00Z

@lostenderman Can you ask whether this is a bug or a feature at https://github.com/commonmark/commonmark-spec/issues?

lostenderman · 2023-01-18T19:47:45Z

I don't suppose it is a bug, seeing multiple discussions about it:

commonmark/commonmark-spec#334

Officially we want to remain neutral on escaping -- it's a renderer issue.
So I think the best approach to this would be for test/normalize.py to normalize href and src attributes somehow.

and

commonmark/commonmark-spec#270.

While the tests remain as HTML, we could deal with the issue by tweaking tests/normalize.py.
This seems an attractive option to me, especially in light of the following consideration. Suppose you're targeting LaTeX in a renderer. An image might then link to a local file. And in that case, you wouldn't want percent-encoding. So that's an argument that all URI encoding issues should be left up to the renderer, not handled in the parser (which is what the spec covers).

Witiko · 2023-01-18T20:05:19Z

If percent-encoding is considered optional and "up to the renderer", then I would insert raw Unicode codepoints in Lua and let the TeX renderer do the percent-encoding if it needs to.

lostenderman · 2023-02-07T20:37:20Z

I am not sure I understand the proposed solution correctly.

What output should be produced in this example?

[a]

[a]: %#

How do

raw Unicode codepoints

fit into all of this?

Witiko · 2023-02-07T21:16:59Z

As per the quotes you listed in #101 (comment), the percent-encoding should be left up to the renderer. Therefore, it seems to me that we don't need to convert http://foo.bar.`baz to http://foo.bar.%60baz (here, ` is the raw Unicode codepoint) in the Lua parser of the Markdown package. The users can do that by redefining the corresponding TeX renderers to percent-encode the URLs if they need to.

Witiko · 2023-02-27T23:18:28Z

@lostenderman In Witiko#271 and Witiko#272, implemented changes suggested in #101 (comment). I suppose we can close this issue now.

lostenderman added the links label Dec 13, 2022

Witiko mentioned this issue Dec 31, 2022

Info strings should show the non-escaped variant #36

Closed

lostenderman changed the title ~~Characters are not properly mapped to HTML entities in URIs~~ Characters are not percent-encoded in URIs Jan 18, 2023

Witiko mentioned this issue Feb 27, 2023

Rename writer->encode_*() methods to clarify their purpose Witiko/markdown#271

Closed

lostenderman mentioned this issue Mar 13, 2023

Character escaping #137

Merged

lostenderman closed this as completed May 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Characters are not percent-encoded in URIs #101

Characters are not percent-encoded in URIs #101

lostenderman commented Dec 13, 2022

Witiko commented Dec 19, 2022 •

edited

Loading

Witiko commented Dec 19, 2022

lostenderman commented Jan 18, 2023

Witiko commented Jan 18, 2023

lostenderman commented Jan 18, 2023

Witiko commented Jan 18, 2023 •

edited

Loading

lostenderman commented Feb 7, 2023 •

edited

Loading

Witiko commented Feb 7, 2023 •

edited

Loading

Witiko commented Feb 27, 2023

Characters are not percent-encoded in URIs #101

Characters are not percent-encoded in URIs #101

Comments

lostenderman commented Dec 13, 2022

Witiko commented Dec 19, 2022 • edited Loading

Witiko commented Dec 19, 2022

lostenderman commented Jan 18, 2023

Witiko commented Jan 18, 2023

lostenderman commented Jan 18, 2023

Witiko commented Jan 18, 2023 • edited Loading

lostenderman commented Feb 7, 2023 • edited Loading

Witiko commented Feb 7, 2023 • edited Loading

Witiko commented Feb 27, 2023

Witiko commented Dec 19, 2022 •

edited

Loading

Witiko commented Jan 18, 2023 •

edited

Loading

lostenderman commented Feb 7, 2023 •

edited

Loading

Witiko commented Feb 7, 2023 •

edited

Loading