Inline markup incompatibilities #44

Closed
link2xt opened this issue Aug 11, 2018 · 21 comments

@link2xt
Contributor

link2xt commented Aug 11, 2018

I want to make pandoc Muse writer generate lightweight inline markup instead of tags when possible. For that I need to understand when it is possible to use it.

I have tested various inline markup examples with Emacs Muse, Amusewiki, Pandoc Muse, Emacs Org and Pandoc Org. Org mode markup is very close to Muse markup, so I hope it can help with resolving some ambiguities. Here are the results:

Source:      foo*bar*
Emacs Muse:  foo*bar*
Amusewiki:   foo*bar*
Pandoc Muse: foo*bar*
Emacs Org:   foo*bar*
Pandoc Org:  foo*bar*

Source:      *foo*bar
Emacs Muse:  *foo*bar
Amusewiki:   <em>foo</em>bar
Pandoc Muse: *foo*bar
Emacs Org:   *foo*bar
Pandoc Org:  *foo*bar

Source:      *foo*,bar
Emacs Muse:  <em>foo</em>,bar
Amusewiki:   <em>foo</em>,bar
Pandoc Muse: <em>foo</em>,bar
Emacs Org:   <b>foo</b>,bar
Pandoc Org:  <strong>foo</strong>,bar

Source:      foo,*bar*
Emacs Muse:  foo,*bar*
Amusewiki:   foo,<em>bar</em>
Pandoc Muse: foo,<em>bar</em>
Emacs Org:   foo,*bar*
Pandoc Org:  foo,*bar*

I think the second case should not be parsed as emphasis. There is no reason for asymmetry with the first case.

The third case is obviously correct, because we may want to emphasize a word before a comma. Normally there is a space after the comma, but there is no reason to check for it in the parser and make it more complicated. All parsers are consistent here, by the way.

As for the fourth case, I tend to believe it is a bug in both Text::Amuse and the Pandoc Muse reader. In Org mode it obviously works as intended, because Emacs Org mode has a configuration variable, org-emphasis-regexp-components, which specifies different allowed pre and post characters. It makes sense to check for punctuation only at the end, because even in right-to-left text commas come after the word and are followed by a space. From the Emacs Muse code it is not so clear whether this is intended or just happened to work this way, but there is no evidence of a bug, so I think making it more compatible with Org mode and Emacs Muse at the same time is the right thing to do.
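To make the disagreements easier to spot, the four results above can be put into a small table and filtered. This is only a sketch: the names `IMPLEMENTATIONS`, `RESULTS` and `disagreements` are made up for illustration, and each flag records only whether the implementation turned the asterisks into markup (Org produces bold rather than emphasis, which still counts as markup here).

```python
# Column order for the flags below, matching the result listings above.
IMPLEMENTATIONS = ["Emacs Muse", "Amusewiki", "Pandoc Muse", "Emacs Org", "Pandoc Org"]

# True means the asterisks were rendered as markup, False means left as-is.
RESULTS = {
    "foo*bar*":  [False, False, False, False, False],
    "*foo*bar":  [False, True,  False, False, False],
    "*foo*,bar": [True,  True,  True,  True,  True],
    "foo,*bar*": [False, True,  True,  False, False],
}


def disagreements(results):
    """Map each source without consensus to the implementations that produced markup."""
    out = {}
    for source, flags in results.items():
        if len(set(flags)) > 1:  # both True and False appear
            out[source] = [name for name, flag in zip(IMPLEMENTATIONS, flags) if flag]
    return out


for source, names in disagreements(RESULTS).items():
    print(f"{source!r}: markup only in {', '.join(names)}")
```

Only the second and fourth cases show up, which are the two discussed above.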

@link2xt
Contributor Author

link2xt commented Aug 11, 2018

Not-so-related bug, but also the result of this series of tests: jgm/pandoc#4820

@link2xt
Contributor Author

link2xt commented Aug 12, 2018

From my understanding of the Emacs Muse code, it allows line start, -, [, <, (, ', " and ` before the first "*" (or "=", or "_", which is underline and not supported by Amusewiki):

https://github.com/alexott/muse/blob/0bb5d3fa57bfc876bacec732a0c1d8f796942403/lisp/muse-publish.el#L122

Then it checks that the character after the corresponding "*" marker is not from the "w" class (letters and digits, basically):
https://github.com/alexott/muse/blob/0bb5d3fa57bfc876bacec732a0c1d8f796942403/lisp/muse-publish.el#L1082
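A rough Python sketch of the two checks, as I read them, might look like the following. It is not a port of the Emacs Lisp code: the function names are hypothetical, allowing whitespace before the opening "*" is an assumption added here (it is not in the list above), and `str.isalnum()` only approximates the "w" syntax class.

```python
# Characters listed above as allowed before the opening "*".
ALLOWED_BEFORE = set("-[<('\"`")


def may_open(line: str, i: int) -> bool:
    """True if the "*" at index i is in a position where emphasis may start."""
    if i == 0:
        return True
    prev = line[i - 1]
    # Whitespace is an assumption of this sketch, not part of the list above.
    return prev in ALLOWED_BEFORE or prev.isspace()


def muse_like_emphasis(line: str) -> str:
    """Rewrite *text* as <em>text</em> under the two checks described above."""
    out = []
    i = 0
    while i < len(line):
        if line[i] == "*" and may_open(line, i):
            # Next "*" with at least one character in between.
            j = line.find("*", i + 2)
            # The character after the closing "*" must not be a letter or digit.
            if j != -1 and (j + 1 == len(line) or not line[j + 1].isalnum()):
                out.append("<em>" + line[i + 1:j] + "</em>")
                i = j + 1
                continue
        out.append(line[i])
        i += 1
    return "".join(out)


for source in ["foo*bar*", "*foo*bar", "*foo*,bar", "foo,*bar*"]:
    print(source, "->", muse_like_emphasis(source))
```

For the four sources tested so far this gives the same results as the Emacs Muse rows above. It makes no attempt to reproduce the priority ordering of the real rewrite rules.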

@link2xt
Contributor Author

link2xt commented Aug 12, 2018

Here is another testcase:

Source:      *foo*0bar
Emacs Muse:  *foo*0bar
Amusewiki:   <em>foo</em>0bar
Pandoc Muse: *foo*0bar
Emacs Org:   *foo*0bar
Pandoc Org:  *foo*0bar

(just fixed it in pandoc: jgm/pandoc@81131ef)

melmothx added a commit that referenced this issue Aug 16, 2018
Relevant tests are failing.

Reference: #44
@melmothx
Owner

I'm willing to push this forward, but it's kind of a low-priority task, as these are mostly corner cases.

@link2xt
Contributor Author

link2xt commented Aug 16, 2018

@melmothx I see you added the test for *foo *bar* from the Org mode bug. Emacs Muse parses this as *foo *bar*, not <em>foo *bar</em>.

Amusewiki produces this completely wrong result:

<em>foo <em>bar</em>
</em>

In LaTeX:

\emph{foo \emph{bar}
}

@melmothx
Owner

That's simply the result of the autocorrection when there are open tags left. That's incorrect/random input anyway. Determining what's the correct output is tricky and/or arbitrary.

Bottom line is: garbage in, garbage out. Just saying.

@link2xt
Contributor Author

link2xt commented Aug 16, 2018

@melmothx

I'm willing to push this forward, but it's kind of a low-priority task, as these are mostly corner cases.

I am also not sure whether we need bug-for-bug compatibility. *foo *bar* is pretty much a corner case indeed, that is why I reduced this issue to some simple cases. In other words, I do not plan to make pandoc writer generate *foo *bar*, so this case can be ignored.

@link2xt
Contributor Author

link2xt commented Aug 16, 2018

The Emacs Muse "parser" is simply a set of regexp rewrite rules with priorities, so copying its behaviour with a proper parser is hard. Cases like *foo *bar* can still be defined by specifying whether the parser is greedy/backtracking.

Let's limit this issue to "what characters are allowed before and after *".

melmothx added a commit that referenced this issue Aug 16, 2018
@melmothx
Owner

melmothx commented Aug 16, 2018

@labdsf I've added a commit in a branch which fixes case 2, which is clearly a Text::Amuse bug.

Regarding the 4th case, the more I look at it, the less I think it's a bug in your and my code.

For symmetry with *foo*,bar, bar,*foo* should work the same. The same way in *foo*,bar the user is just forgetting a space. Also, if org-mode lets the user configure that, it's not a strong indicator.

@link2xt
Contributor Author

link2xt commented Aug 16, 2018

For symmetry with *foo*,bar, bar,*foo* should work the same. The same way in *foo*,bar the user is just forgetting a space.

In both left-to-right and right-to-left texts the comma is followed by a space, so there is no need for symmetry. Forgetting a space after a comma is a mistake.

I also thought about symmetry first, see my description of case 2. But looking at the code of both Org-mode and Emacs Muse I can tell there is intentional asymmetry in both of them.

Also, if org-mode lets the user configure that, it's not a strong indicator.

Not only does it have a configuration variable, it also has asymmetric defaults that allow punctuation only at the end and only opening parentheses at the beginning. It is not directly related to Muse, though. I looked into Org-mode only because I didn't quickly find the relevant code in Emacs Muse.

Now that I have looked into Emacs Muse, I am also sure that allowing a specific set of characters before the opening *, instead of simply "any non-word character", is not the result of some random typo: (*foo* converts to (<em>foo</em>, but )*foo* is left as-is.

It still makes sense to intentionally break compatibility with Emacs Muse here and document it as an incompatibility. I do not like an arbitrary hardcoded set of characters allowed before *, which does not even include "{". Something based on character classes is definitely better.

@link2xt
Contributor Author

link2xt commented Aug 31, 2018

Another incompatibility is that Emacs Muse allows whitespace before the closing *. The difference is easy to see in the following example:

*foo * bar*

Emacs Muse and pandoc (mostly accidentally) interpret it as

<em>foo </em> bar*

Amusewiki interprets it as

<em>foo * bar</em>

It is another case where I agree with Amusewiki. The reason for the difference is probably that Text::Amuse replaces each * with a tag individually and has to look at the context, while Emacs Muse and pandoc just look for the matching *.

@melmothx
Owner

melmothx commented Sep 1, 2018

@link2xt Could you check this commit with the draft of the formalization, and see if we can agree on this?

@link2xt
Contributor Author

link2xt commented Sep 1, 2018

It is mostly ok: no whitespace on the inner side and no "word" character on the outer side. But I would replace "word" with "alphanumeric", because "word", as defined by the "\w" regexp, includes the underscore.

Example:

foo_*bar*_baz

foo-*bar*-baz

foo_*bar*-baz

foo-*bar*_baz

Emacs Muse:

<p>foo_*bar*_baz</p>

<p>foo-<em>bar</em>-baz</p>

<p>foo_*bar*-baz</p>

<p>foo-<em>bar</em>_baz</p>

Amusewiki:

<p>
foo_*bar*_baz
</p>

<p>
foo-<em>bar</em>-baz
</p>

<p>
foo_*bar*-baz
</p>

<p>
foo-<em>bar</em>_baz
</p>

Pandoc:

<p>foo_<em>bar</em>_baz</p>
<p>foo-<em>bar</em>-baz</p>
<p>foo_<em>bar</em>-baz</p>
<p>foo-<em>bar</em>_baz</p>

Why allow underscore after, especially if it is not allowed before?
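The difference between "word" and "alphanumeric" is easy to check. The snippet below uses Python's `re` module purely as an illustration (Text::Amuse itself is Perl, where `\w` also matches the underscore); the helper name `classify` is made up for this example.

```python
import re


def classify(ch: str):
    """Return (matches \\w, is alphanumeric) for a single character."""
    return bool(re.fullmatch(r"\w", ch)), ch.isalnum()


for ch in ["a", "0", "_", "-"]:
    is_word, is_alnum = classify(ch)
    print(f"{ch!r}: \\w={is_word} alnum={is_alnum}")
```

The underscore is the only one of these four characters where the two tests differ, which is exactly the character the examples above turn on.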

@melmothx
Owner

melmothx commented Sep 1, 2018

@link2xt Actually, in the branch we're talking about the outcome is

             [ 'foo_*bar*_baz' => 'foo_*bar*_baz' ],
             [ 'foo-*bar*-baz' => 'foo-<em>bar</em>-baz' ],
             [ 'foo_*bar*-baz' => 'foo_*bar*-baz' ],
             [ 'foo-*bar*_baz' => 'foo-*bar*_baz' ],

@melmothx
Owner

melmothx commented Sep 1, 2018

@link2xt So, the specification would be:

Asterisk and equal symbols (<verbatim>*, **, *** =</verbatim>) are
interpreted as markup elements if they are paired (an opening one and
a closing one).

The opening one must be preceded by something which is not an
alphanumerical character (or at the beginning of the line) and
followed by something which is not a space.

The closing one must be preceded by something which is not a space,
and followed by something which is not an alphanumerical character (or
at the end of the line).
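Read literally, the specification above can be sketched as a single regular expression. This is an illustrative Python sketch, not the Text::Amuse implementation: the names `EMPHASIS` and `emphasize` are hypothetical, it handles only single-asterisk emphasis (not `**`, `***` or `=`), and it uses `[^\W_]` as the "alphanumerical character" class.

```python
import re

# [^\W_] matches a word character other than "_", i.e. an alphanumeric character.
EMPHASIS = re.compile(
    r"(?<![^\W_])"   # opening "*": not preceded by an alphanumeric (or at line start)
    r"\*(?=\S)"      # ... and followed by something which is not a space
    r"(.+?)"         # emphasized text, shortest candidate first
    r"(?<=\S)\*"     # closing "*": preceded by something which is not a space
    r"(?![^\W_])"    # ... and not followed by an alphanumeric (or at line end)
)


def emphasize(line: str) -> str:
    """Wrap every paired *...* span that satisfies the rules in <em> tags."""
    return EMPHASIS.sub(r"<em>\1</em>", line)


for source in ["*foo*bar", "foo,*bar*", "*foo * bar*", "foo_*bar*_baz"]:
    print(source, "->", emphasize(source))
```

Under this reading the underscore is not alphanumeric, so foo_*bar*_baz comes out as foo_<em>bar</em>_baz, and *foo * bar* becomes <em>foo * bar</em> because the first candidate closing asterisk is preceded by a space.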

@link2xt
Contributor Author

link2xt commented Sep 1, 2018

@melmothx
Ok, looks good. So you are going to replace "\w" with an equivalent of [[:alnum:]]?

@melmothx
Owner

melmothx commented Sep 1, 2018

@link2xt yes, of course, according to that specification, which is going to be added to the manual.

@melmothx
Owner

melmothx commented Sep 1, 2018

@link2xt Changes (and some additional tests) are in. Are we done here? Can I release and update the doc?

@link2xt
Contributor Author

link2xt commented Sep 1, 2018

Looks good, thanks.

@link2xt
Contributor Author

link2xt commented Sep 2, 2018

Next version of pandoc will output lightweight markup when it can: jgm/pandoc@6ea6011

@melmothx
Owner

melmothx commented Sep 2, 2018

Excellent!

link2xt pushed a commit to jgm/pandoc that referenced this issue Oct 27, 2018