Inline markup incompatibilities #44

link2xt · 2018-08-11T22:53:30Z

I want to make the pandoc Muse writer generate lightweight inline markup instead of tags when possible. For that, I need to understand when lightweight markup can be used.

I have tested various inline markup examples with Emacs Muse, Amusewiki, Pandoc Muse, Emacs Org and Pandoc Org. Org mode markup is very close to Muse markup, so I hope it can help with resolving some ambiguities. Here are the results:

Source:      foo*bar*
Emacs Muse:  foo*bar*
Amusewiki:   foo*bar*
Pandoc Muse: foo*bar*
Emacs Org:   foo*bar*
Pandoc Org:  foo*bar*

Source:      *foo*bar
Emacs Muse:  *foo*bar
Amusewiki:   <em>foo</em>bar
Pandoc Muse: *foo*bar
Emacs Org:   *foo*bar
Pandoc Org:  *foo*bar

Source:      *foo*,bar
Emacs Muse:  <em>foo</em>,bar
Amusewiki:   <em>foo</em>,bar
Pandoc Muse: <em>foo</em>,bar
Emacs Org:   <b>foo</b>,bar
Pandoc Org:  <strong>foo</strong>,bar

Source:      foo,*bar*
Emacs Muse:  foo,*bar*
Amusewiki:   foo,<em>bar</em>
Pandoc Muse: foo,<em>bar</em>
Emacs Org:   foo,*bar*
Pandoc Org:  foo,*bar*

I think the second case should not be parsed as emphasis. There is no reason for asymmetry with the first case.

The third case is obviously correct, because we may want to emphasize a word before a comma. Normally there is a space after the comma, but there is no reason to check for it in the parser and make the parser more complicated. All parsers are consistent here, by the way.

As for the fourth case, I tend to believe it is a bug in both Text::Amuse and the Pandoc Muse reader. In Org mode it clearly works as intended, because Emacs Org mode has a configuration variable, org-emphasis-regexp-components, which specifies different allowed pre and post characters. It makes sense to check for punctuation only at the end, because even in right-to-left text commas come after the word and are followed by a space. From the Emacs Muse code it is not so clear whether this is intended or just happened to work this way, but there is no evidence of a bug, so I think making it compatible with both Org mode and Emacs Muse is the right thing to do.


link2xt · 2018-08-11T22:55:53Z

Not-so-related bug, but also the result of this series of tests: jgm/pandoc#4820

link2xt · 2018-08-12T17:43:15Z

From my understanding of Emacs Muse code, it allows line start, -, [, <, (, ', " and ` before the first "*" (or "=", or "_" which is underline and not supported by Amusewiki):

https://github.com/alexott/muse/blob/0bb5d3fa57bfc876bacec732a0c1d8f796942403/lisp/muse-publish.el#L122

Then it checks that the character after the corresponding "*" marker is not from the "w" class (letters and digits, basically):
https://github.com/alexott/muse/blob/0bb5d3fa57bfc876bacec732a0c1d8f796942403/lisp/muse-publish.el#L1082
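
The two checks described in this comment can be sketched in Python roughly as follows. This is only an illustration of the rule as summarised above, not a port of the Emacs Lisp code: the function names are hypothetical, whitespace before the opening marker is an added assumption (the list above does not mention it), and `str.isalnum()` is used as a stand-in for the Emacs "w" syntax class.

```python
# Characters accepted before an opening marker, as listed in the comment above.
ALLOWED_BEFORE = set("-[<('\"`")

def opening_allowed(prev):
    """prev is the character before the opening '*', or None at line start."""
    # Assumption: whitespace is also accepted before the opening marker.
    return prev is None or prev.isspace() or prev in ALLOWED_BEFORE

def closing_allowed(nxt):
    """nxt is the character after the closing '*', or None at line end."""
    # str.isalnum() approximates the Emacs "w" class (letters and digits).
    return nxt is None or not nxt.isalnum()

def is_emphasis(line):
    """Check a line that contains exactly one *...* pair."""
    start = line.index("*")
    end = line.rindex("*")
    prev = line[start - 1] if start > 0 else None
    nxt = line[end + 1] if end + 1 < len(line) else None
    return start < end and opening_allowed(prev) and closing_allowed(nxt)
```

Under these assumptions, the sketch agrees with the Emacs Muse results reported at the top of the thread: only `*foo*,bar` of the first four cases is treated as emphasis.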

link2xt · 2018-08-12T18:22:00Z

Here is another testcase:

Source:      *foo*0bar
Emacs Muse:  *foo*0bar
Amusewiki:   <em>foo</em>0bar
Pandoc Muse: *foo*0bar
Emacs Org:   *foo*0bar
Pandoc Org:  *foo*0bar

(just fixed it in pandoc: jgm/pandoc@81131ef)


melmothx · 2018-08-16T08:55:48Z

I'm willing to push this forward, but it's kind of a low-priority task, as these are mostly corner cases.

link2xt · 2018-08-16T09:50:10Z

@melmothx I see you added the test for *foo *bar* from the Org mode bug. Emacs Muse parses this as *foo *bar*, not <em>foo *bar</em>.

Amusewiki produces this completely wrong result:

<em>foo <em>bar</em>
</em>

In LaTeX:

\emph{foo \emph{bar}
}

melmothx · 2018-08-16T09:58:45Z

That's simply the result of the autocorrection when there are open tags left. That's incorrect/random input anyway. Determining what's the correct output is tricky and/or arbitrary.

Bottom line is: garbage in, garbage out. Just saying.

link2xt · 2018-08-16T10:19:29Z

@melmothx

> I'm willing to push this forward, but it's kind of a low-priority task, as these are mostly corner cases.

I am also not sure whether we need bug-for-bug compatibility. *foo *bar* is pretty much a corner case indeed, that is why I reduced this issue to some simple cases. In other words, I do not plan to make pandoc writer generate *foo *bar*, so this case can be ignored.

link2xt · 2018-08-16T10:36:16Z

Emacs Muse "parser" is simply a number of regexp rewrite rules with priorities, so trying to copy its behaviour is hard with a proper parser. Cases like *foo *bar* can still be defined by specifying whether the parser is greedy/backtracking.

Let's limit this issue to "what characters are allowed before and after *".


melmothx · 2018-08-16T13:03:04Z

@labdsf I've added a commit in a branch which fixes the case 2, which is clearly a Text::Amuse bug.

Regarding the 4th case, the more I look at it, the less I think it's a bug in your and my code.

For symmetry with *foo*,bar, bar,*foo* should work the same. The same way, in *foo*,bar the user is just forgetting a space. Also, if org-mode lets the user configure that, it's not a strong indicator.

link2xt · 2018-08-16T21:52:14Z

> For symmetry with *foo*,bar, bar,*foo* should work the same. The same way, in *foo*,bar the user is just forgetting a space.

In both left-to-right and right-to-left texts a comma is followed by a space, so there is no need for symmetry. Forgetting a space after a comma is a mistake.

I also thought about symmetry first, see my description of case 2. But looking at the code of both Org-mode and Emacs Muse I can tell there is intentional asymmetry in both of them.

> Also, if org-mode lets the user configure that, it's not a strong indicator.

It does not only have a configuration variable, it also has asymmetric defaults that allow punctuation only at the end, and only opening parentheses at the beginning. It is not directly related to Muse, though. I looked into Org-mode only because I didn't quickly find the relevant code in Emacs Muse.

Now that I have looked into Emacs Muse, I am also sure that the set of characters allowed before the opening * (rather than simply "any non-word character") is not the result of a random typo: (*foo* converts to (<em>foo</em>, but )*foo* is left as-is.

It still makes sense to intentionally break compatibility with Emacs Muse here and document it as an incompatibility. I do not like an arbitrary hardcoded set of characters allowed before *, one that does not even include "{". Something based on character classes is definitely better.

link2xt · 2018-08-31T17:22:56Z

Another incompatibility is that Emacs Muse allows whitespace before closing *. The difference is easy to see in the following example:

*foo * bar*

Emacs Muse and pandoc (mostly accidentally) interpret it as

<em>foo </em> bar*

Amusewiki interprets it as

<em>foo * bar</em>

It is another case where I agree with Amusewiki. The reason for the difference is probably that Text::Amuse replaces each * with a tag individually and has to look at the context, while Emacs Muse and pandoc just look for the matching *.
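
To make the difference concrete, here is a small Python sketch of the two matching strategies. The regular expressions and names are hypothetical and are not taken from Text::Amuse, Emacs Muse or pandoc; they only reproduce the two outputs shown above for this one input.

```python
import re

# Strategy A: the closing marker is simply the next '*' after the opening one.
NEXT_STAR = re.compile(r"\*(\S[^*]*)\*")

# Strategy B: the closing '*' must also be preceded by a non-space character,
# so a '*' that follows whitespace is skipped and matching continues.
NON_SPACE_BEFORE_CLOSE = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*")

def render(pattern, text):
    """Replace every match of the given pattern with an <em> element."""
    return pattern.sub(r"<em>\1</em>", text)

print(render(NEXT_STAR, "*foo * bar*"))
print(render(NON_SPACE_BEFORE_CLOSE, "*foo * bar*"))
```

Strategy A corresponds to the Emacs Muse and pandoc reading, strategy B to the Amusewiki one.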


melmothx · 2018-09-01T08:43:05Z

@link2xt Could you check this commit with the draft of the formalization, and see if we can agree on this?

link2xt · 2018-09-01T10:17:25Z

It is mostly OK: no whitespace on the inner side and no "word" character on the outer side. But I would replace "word" with "alphanumeric". "Word", as defined by the "\w" regexp, includes the underscore.

Example:

foo_*bar*_baz

foo-*bar*-baz

foo_*bar*-baz

foo-*bar*_baz

Emacs Muse:

<p>foo_*bar*_baz</p>

<p>foo-<em>bar</em>-baz</p>

<p>foo_*bar*-baz</p>

<p>foo-<em>bar</em>_baz</p>

Amusewiki:

<p>
foo_*bar*_baz
</p>

<p>
foo-<em>bar</em>-baz
</p>

<p>
foo_*bar*-baz
</p>

<p>
foo-<em>bar</em>_baz
</p>

Pandoc:

<p>foo_<em>bar</em>_baz</p>
<p>foo-<em>bar</em>-baz</p>
<p>foo_<em>bar</em>-baz</p>
<p>foo-<em>bar</em>_baz</p>

Why allow underscore after, especially if it is not allowed before?
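
The distinction between "word" and "alphanumeric" can be checked quickly in Python, whose `\w` also includes the underscore. The helper names below are hypothetical and serve only to show the two character classes side by side.

```python
import re

def is_word(ch):
    """True if ch matches the regexp class \\w (letters, digits, underscore)."""
    return re.fullmatch(r"\w", ch) is not None

def is_alnum(ch):
    """True if ch is a letter or a digit; the underscore is excluded."""
    return ch.isalnum()

for ch in ("a", "0", "_", "-"):
    print(repr(ch), is_word(ch), is_alnum(ch))
```

The underscore is the only one of these four characters on which the two classes disagree, which is why the examples above differ only around "_".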

melmothx · 2018-09-01T12:09:14Z

@link2xt Actually, in the branch we're talking about the outcome is

             [ 'foo_*bar*_baz' => 'foo_*bar*_baz' ],
             [ 'foo-*bar*-baz' => 'foo-<em>bar</em>-baz' ],
             [ 'foo_*bar*-baz' => 'foo_*bar*-baz' ],
             [ 'foo-*bar*_baz' => 'foo-*bar*_baz' ],

melmothx · 2018-09-01T12:16:21Z

@link2xt So, the specification would be:

Asterisk and equal symbols (<verbatim>*, **, *** =</verbatim>) are
interpreted as markup elements if they are paired (an opening one and
a closing one).

The opening one must be preceded by something which is not an
alphanumerical character (or at the beginning of the line) and
followed by something which is not a space.

The closing one must be preceded by something which is not a space,
and followed by something which is not an alphanumerical character (or
at the end of the line).
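
As a rough illustration, the rules above can be written as a single regular expression. The following Python sketch is an assumption-laden reading of the specification, not the Text::Amuse implementation: it handles only the single-asterisk marker (not `**`, `***` or `=`), works on one line at a time, and the names `ALNUM`, `EMPHASIS` and `to_html` are made up for the example.

```python
import re

# Letters and digits only; unlike \w, this class excludes the underscore.
ALNUM = r"[^\W_]"

EMPHASIS = re.compile(
    rf"(?<!{ALNUM})\*(?=\S)"   # opening: not after alphanumeric, not before space
    rf"(.+?)"                  # emphasized text, as short as possible
    rf"(?<=\S)\*(?!{ALNUM})"   # closing: not after space, not before alphanumeric
)

def to_html(line):
    """Convert single-asterisk emphasis in one line to <em> elements."""
    return EMPHASIS.sub(r"<em>\1</em>", line)

for sample in ("foo*bar*", "*foo*bar", "*foo*,bar", "foo,*bar*", "*foo * bar*"):
    print(sample, "=>", to_html(sample))
```

Because the underscore is not alphanumeric, this reading treats `foo_*bar*_baz` as emphasis, and a `*` that follows a space is never taken as a closing marker.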

link2xt · 2018-09-01T12:20:11Z

@melmothx
Ok, looks good. So you are going to replace "\w" with an equivalent of [[:alnum:]]?

melmothx · 2018-09-01T12:28:52Z

@link2xt yes, of course, according to that specification, which is going to be added to the manual.


melmothx · 2018-09-01T13:40:14Z

@link2xt Changes (and some additional tests) are in. Are we done here? Can I release and update the doc?

link2xt · 2018-09-01T14:06:56Z

Looks good, thanks.

link2xt · 2018-09-02T00:36:26Z

Next version of pandoc will output lightweight markup when it can: jgm/pandoc@6ea6011

melmothx · 2018-09-02T05:58:41Z

Excellent!


melmothx added a commit that referenced this issue Aug 16, 2018

Add test file for inline markup

c66f8a5

Relevant tests are failing. Reference: #44

melmothx added a commit that referenced this issue Aug 16, 2018

Do not consider asterisk inside a word an inline tag

8a81e5f

Reference #44

link2xt mentioned this issue Aug 31, 2018

Document the differences in allowed characters before lightweight markup melmothx/amusewiki-site#15

Merged

melmothx added a commit that referenced this issue Sep 1, 2018

Add more tests and add a comment with the proposed formalization

9b64281

Address: #44

melmothx added a commit that referenced this issue Sep 1, 2018

Use alphanumerical characters instead of \w and \W for inline markup

ffd3c12

Address: #44

melmothx closed this as completed in df13363 Sep 1, 2018

link2xt pushed a commit to jgm/pandoc that referenced this issue Oct 27, 2018

Muse reader: forbid whitespace after opening and before closing marku…

d28dca5

…p elements See melmothx/text-amuse#44 for discussion on these rules
