Mozilla style HTML nested list gets extra bullet in Markdown #9187

Closed
rphair opened this issue Nov 14, 2023 · 9 comments

rphair commented Nov 14, 2023

At least two other test cases in the issue queue show that HTML generated by Mozilla applications (not strictly standards compliant, but recently supported by pandoc) still yields markdown with a different structure than the original HTML: specifically, an extra bullet where the indentation level deepens:

1 - #9161 (comment)

Most recently reported, and present in the latest release. Under stdout (markdown), note that the pandoc output for the 2nd level list item has two bullets in front of it:

- - a

2 - #8150

An earlier test case that still demonstrates the problem. In it, @jgm and @tarleb also acknowledge (beginning at #8150 (comment)) that the posted Mozilla syntax should be supported due to its "widespread" use.

But although the code has been changed so that pandoc now recognises this markup as a nested list, it still places a double bullet before the first more deeply nested item. Here's the current pandoc output from this same test case rendered as markdown: https://gist.github.com/rphair/0fc0e6a35389b039906d2490c872a2d6

Once the fix to #9161 is released, we should get this output also quoted in #8150 (comment), with tight spacing and without a double bullet in front of the item L3.1:

-   L1
-   L2
    -   L3.1
    -   L3.2

This problem can be verified by running the same test case as in #8150 (comment) - though the output looks different today after 0d7f80c fixed the bulk of the problem.

(I first found this issue on Linux in pandoc version 3.1.1, and it persists in the currently latest version, the 3.1.9 Debian package.)

rphair added the bug label Nov 14, 2023
Owner

jgm commented Nov 14, 2023

I'm a bit lost in your description of the issue. Can you post (inline) the HTML, the markdown pandoc currently produces, and the markdown you would expect?

Author

rphair commented Nov 14, 2023

input (same as #8150 (comment)):

<!-- file: nested-list1.html -->
<ul>
  <li>L1</li>
  <li>L2</li>
  <ul>
    <li>L3.1</li>
    <li>L3.2</li>
  </ul>
</ul>

current output (pandoc 3.1.9):

$ pandoc -f html -t markdown < nested-list1.html
-   L1

-   L2

-   -   L3.1
    -   L3.2

correct output:

-   L1
-   L2
    -   L3.1
    -   L3.2

Please note again the issue is only the extra - (bullet) in front of L3.1 (rendered here) since I believe the extraneous line spacing given to the tight HTML list is already corrected by 0f3211c.

Owner

jgm commented Nov 15, 2023

I find this suggested output surprising and would like to understand by what criterion it is judged to be correct.

The <ul> occurs after the </li> tag that closes the previous list item. So, it should not go in that list item, should it?

Author

rphair commented Nov 15, 2023

@jgm I included the context in my original post to explain how the Mozilla HTML-generating applications have always marked up nested lists. To my knowledge (around 12 years of observation at this point), HTML editors like Thunderbird (and the less popular SeaMonkey) all place nested lists as list items without enclosing them in <li> tags.

So the criterion for judging it "correct" is that the more deeply nested <ul> is a list item just like the <li> items above it, and is therefore at the same level... the items in the nested list are just indented more. The deeper nested list is not going "in" the preceding list item: it's going below it.

I don't know what to say about the 2nd bullet pandoc generates on the same line because I don't know why it's there. I would propose, since the W3C standard says that <ul>'s can only contain <li>'s, that when pandoc finds a <ul> appearing inside another <ul> it could be a cue not to produce that extra bullet in front of its first list item.
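One way to picture that cue: a list start tag whose nearest open list-related ancestor is itself a list rather than an <li>. Below is a minimal Python sketch of such a check, using only the standard library's html.parser. The names BareNestedListFinder and find_bare_nested_lists are made up for this illustration, and none of it reflects how pandoc's HTML reader is actually implemented.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol"}
TRACKED = LIST_TAGS | {"li"}


class BareNestedListFinder(HTMLParser):
    """Record every <ul>/<ol> that sits directly inside another list."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open ul/ol/li tags only
        self.hits = []       # (line, column) of each bare nested list

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED:
            return
        # A list whose nearest tracked ancestor is a list (not an <li>)
        # is the Mozilla-style "bare" nested list.
        if tag in LIST_TAGS and self.open_tags and self.open_tags[-1] in LIST_TAGS:
            self.hits.append(self.getpos())
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag, which also
            # copes with <li> elements whose end tag was omitted.
            while self.open_tags.pop() != tag:
                pass


def find_bare_nested_lists(html):
    finder = BareNestedListFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Run against the nested-list1.html input above, this should report a single position (the inner <ul>); a list nested inside an <li> is not reported.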

This problem can be easily reproduced today in the latest (and any) version of Thunderbird. Just now I've created a bulleted list that looks like this:


header

  • item 1
  • item 2
    • item 3a
    • item 3b
  • item 4

... which creates HTML exactly like this:

<p>header</p>
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <ul>
    <li>item 3a</li>
    <li>item 3b</li>
  </ul>
  <li>item 4</li>
</ul>

If it's not compelling that the ubiquitous Thunderbird generates this markup, it may be more satisfying to think of the Mozilla parallel standard as: "For brevity's sake, lists can also be items of other lists, and therefore we don't put them inside list item tags."

Regardless of how one interprets the markup, or what one thinks of the Mozilla standard, I am simply saying that the extra bullet shouldn't be there. Running the example above back through pandoc will put two bullets in front of item 3a and I can't actually understand the criterion for producing the first one... since it wouldn't appear even if the deeper list above had been enclosed in the W3C mandated <li> tags.

I hope that is enough to decide whether supporting Mozilla flavoured HTML is of appeal to this project. I appreciate your consideration so far in those 2 earlier issues, and we'd all be able to use pandoc on our Mozilla generated HTML after we can go the final step to eliminate that extra bullet... assuming you can do it without affecting pandoc's treatment of strictly standards compliant lists.

Owner

jgm commented Nov 15, 2023

> I don't know what to say about the 2nd bullet pandoc generates on the same line because I don't know why it's there.

We treat it like

<li><ul>...</ul></li>

(opening a new list item implicitly to hold the content).
So, you have a list item (the first -) whose content is a list.

I'm not too inclined to spend much more time trying to support invalid HTML. Why don't you ask Mozilla to fix their broken HTML instead?

Author

rphair commented Nov 15, 2023

Thanks @jgm for all the time you have spent so far. I saw that you asked the same question last year in #8150 (comment) and called it a "bug", but Mozilla's original and continued support for the abbreviated list style as acceptable HTML is a design decision: an error maybe, but not a casual one and not an oversight.

It's hard to find documentation on this issue because 1) neither Mozilla nor the Thunderbird developers who have continued the practice have advertised either the difference or their position about it, and 2) the search terms for it are too ambiguous to target the issue.

There are just lots of little reports of it like this one, hoping for various kinds of support for the unconventional list style: generally meeting with refusals for application support and comments like "your HTML is wrong" (which doesn't fix the operational problem for us or anyone else).

Mozilla won't correct the problem because all browsers originally supported the legacy abbreviated list style and have continued to ever since. (This fact in itself suggests a "feature incompleteness" for pandoc regardless of the W3C standard.)

I'm not implying you're obligated to support it but the benefit to the overall community would be huge. HTML and other open source document formats should be readable for a hundred years to come, and we need something to recondition commercially broken document standards so they can be preserved. It seems an ideal choice for pandoc to be that thing.

Note (also for other readers, please): I've tried to find or configure a pre-parser for pandoc that will clean up the Mozilla lists so the nested list tags are no longer mis-formatted. HTML Tidy doesn't correct it or even report any errors (which also suggests that the Mozilla list markup is considered acceptable, even if not "standard"). There may be HTML parsers or tools already designed for this, and I would happily take any reader's suggestion on how to use them in series with pandoc for this issue.

In any case I wanted to make this one last request to see if you can fix this long-standing, somewhat intractable problem at the destination: considering it will never be fixed at the source. This would be consistent with pandoc's apparent mission to harmonise different document types, often under non-ideal conditions, supporting a more open ecosystem of documentation.

Author

rphair commented Nov 15, 2023

> We treat it like <li><ul>...</ul></li>

This is why we're getting the extra bullet. If pandoc would instead read the non-compliant nested list by dropping the preceding closing </li> and then adding it back after the nested list:

<ul>
  <li>L1</li>
  <li>L2
  <ul>
    <li>L3.1</li>
    <li>L3.2</li>
  </ul>
  </li>
</ul>

... then we can see pandoc would produce the correct markdown:

-   L1
-   L2
    -   L3.1
    -   L3.2

Put another way: this would be fixed if bare nested lists (those not wrapped in <li>) were read as the final part of the preceding list item, rather than being wrapped in an extra <li> element, which produces the extra bullet.
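For anyone who needs a workaround in the meantime, the rewrite described above can be approximated as a pre-processing step in front of pandoc. The following is a rough Python sketch under stated assumptions, not pandoc's own fix: it uses only the standard library's html.parser, the names MozillaListFixer and fix_mozilla_lists are hypothetical, and it is aimed at the simple list shapes shown in this thread. Whenever a <ul> or <ol> directly follows a closing </li> (with only whitespace in between), that </li> is dropped and re-emitted after the nested list closes.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol"}


class MozillaListFixer(HTMLParser):
    """Re-emit HTML, moving bare nested lists into the preceding <li>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []        # output fragments, joined at the end
        self.pending = None  # index in self.out of a </li> followed only by whitespace so far
        self.owes_li = []    # one flag per open list: emit </li> after it closes?

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            adopted = self.pending is not None
            if adopted:
                self.out[self.pending] = ""  # drop the premature </li>
            self.owes_li.append(adopted)
        self.pending = None
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.pending = None
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        closed_item = tag == "li"
        if tag in LIST_TAGS and self.owes_li and self.owes_li.pop():
            self.out.append("</li>")  # close the item that adopted this list
            closed_item = True
        self.pending = len(self.out) - 1 if closed_item else None

    def handle_data(self, data):
        if data.strip():
            self.pending = None  # real text after </li>: nothing to adopt
        self.out.append(data)

    def handle_entityref(self, name):
        self.pending = None
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.pending = None
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def fix_mozilla_lists(html):
    fixer = MozillaListFixer()
    fixer.feed(html)
    fixer.close()
    return "".join(fixer.out)
```

The string returned by fix_mozilla_lists could then be fed to pandoc -f html -t markdown as usual. Lists that are already nested inside an <li> pass through unchanged.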

Owner

jgm commented Nov 15, 2023

Yes, I understand that.

jgm closed this as completed in 13e1b49 Nov 15, 2023
Owner

jgm commented Nov 15, 2023

OK, it should now be fixed.
