Proposal: preserve leading, trailing, and multiple interword spaces #10077

Open
me-kell opened this issue Aug 11, 2024 · 8 comments

Comments


me-kell commented Aug 11, 2024

Pandoc trims leading and trailing spaces and collapses runs of interword spaces into a single 'Space' element.

Although this approach works well for most formats like markdown or html, it might not be the best one for formats like docx or odt.

There are use cases where preserving leading, trailing, and interword spaces is necessary: e.g. interlinear glosses, representation of overlapping speech etc.

Using "Source Code" styles (i.e. CodeBlock) works well for unformatted text. But formatting (italics and/or bold) is ignored in CodeBlocks. Without the "Source Code" style the spaces are not preserved:

[image]

For such use cases, preserving leading, trailing, and interword spaces would be very useful.

One possible way to achieve this could be to parse the spaces in " one  two " (note the two spaces between the words) as [Space, Str "one", Space, Space, Str "two", Space].

For this not to break existing code and filters it could be done passing an option --preserve-spaces to pandoc or having an extension only for specific formats (like docx+preserve_spaces).

This proposal is related to Proposal: remove Space element in AST but different since it simply preserves leading, trailing, and interword spaces as Spaces.

And AFAICS the AST has no problems with it:

$ pandoc -f native -t json <(echo '[ Para [ Space, Str "one", Space, Space, Str "two", Space ] ]')
{
    "pandoc-api-version":[1,23,1],
    "meta":{},
    "blocks":[
        {"t":"Para","c":[
                {"t":"Space"},
                {"t":"Str","c":"one"},
                {"t":"Space"},
                {"t":"Space"},
                {"t":"Str","c":"two"},
                {"t":"Space"}
                ]
        }
    ]
}
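To make the proposed parsing concrete, here is a minimal Python sketch, not pandoc's actual reader code, that turns a string into the inline list shown above, with one Space node per space character. The function name `tokenize_preserving_spaces` is hypothetical; the JSON shape is copied from the output above.

```python
import json
import re


def tokenize_preserving_spaces(text):
    """Turn text into pandoc-style inline nodes, one Space node per space."""
    inlines = []
    # Each match is either a single space or a run of non-space characters.
    for token in re.findall(r" |[^ ]+", text):
        if token == " ":
            inlines.append({"t": "Space"})
        else:
            inlines.append({"t": "Str", "c": token})
    return inlines


# " one  two " has one leading, two interword, and one trailing space.
inlines = tokenize_preserving_spaces(" one  two ")
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": inlines}],
}
print(json.dumps(doc))
```

The printed JSON has the same structure as the output above, so it could be fed back to pandoc with `-f json`.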
Owner

jgm commented Aug 11, 2024

In cases like this you should probably use nonbreaking spaces (which will not be collapsed): a backslash followed by a space in pandoc markdown, or just insert a Unicode nonbreaking space.
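For plain-text input, this workaround can be automated before the text reaches pandoc. The following is a rough Python sketch under one assumed policy: only leading, trailing, and repeated spaces become U+00A0, while single interword spaces stay ordinary spaces. The function name `protect_spaces` is hypothetical.

```python
import re

NBSP = "\u00a0"  # Unicode NO-BREAK SPACE


def protect_spaces(line):
    """Replace leading, trailing, and repeated spaces with U+00A0."""
    # Runs of two or more spaces anywhere in the line.
    line = re.sub(r" {2,}", lambda m: NBSP * len(m.group()), line)
    # A single leading or trailing space that the rule above did not catch.
    line = re.sub(r"^ | $", NBSP, line)
    return line


print(repr(protect_spaces(" one  two ")))
```

Single spaces between words are left alone so that normal line breaking still works there.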

Author

me-kell commented Aug 11, 2024

Inserting nonbreaking spaces is a workaround when the input format is markdown. We also replace spaces with nonbreaking spaces using sed. But this is not possible with Word documents. There we do it with a Word macro, which means it has to be done on Windows.

In our case we convert journal articles to LaTeX. Those articles are mostly written in Word, and (even though we keep telling them) authors mostly do not understand the difference between spaces and nonbreaking spaces (or even tabs).

I have very basic knowledge of Haskell. But AFAICS implementing this would simply need to put one conditional in the code where the Space elements are collapsed. This could be done with an option --preserve-spaces.

Owner

jgm commented Aug 11, 2024

But AFAICS implementing this would simply need to put one conditional in the code where the Space elements are collapsed.

Well no, it's not that simple, as explained in the other issues you've found. For one thing, most of the writers will simply render the doubled Space as one:

% pandoc -t markdown -f native
[ Str "hi", Space, Space, Space, Space, Str "hi" ]
hi hi

Semantically, Space in the AST means "collapsible whitespace." And it is interpreted that way everywhere in the code base.
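As a toy model of what "collapsible whitespace" means for a writer (a sketch in Python, not pandoc's actual writer code), the hypothetical function below renders a list of Str and Space nodes so that any run of Space produces at most one space. Dropping leading and trailing Space nodes is an additional assumption of this sketch.

```python
def render_plain(inlines):
    """Render Str/Space nodes to text, treating Space as collapsible."""
    out = []
    pending_space = False
    for node in inlines:
        if node["t"] == "Space":
            # Remember that whitespace was seen; repeats change nothing.
            pending_space = True
        elif node["t"] == "Str":
            # Emit one space between words, never before the first word.
            if pending_space and out:
                out.append(" ")
            out.append(node["c"])
            pending_space = False
    return "".join(out)


inlines = [
    {"t": "Str", "c": "hi"},
    {"t": "Space"},
    {"t": "Space"},
    {"t": "Space"},
    {"t": "Space"},
    {"t": "Str", "c": "hi"},
]
print(render_plain(inlines))  # prints "hi hi"
```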

Author

me-kell commented Aug 11, 2024

Semantically, Space in the AST means "collapsible whitespace." And it is interpreted that way everywhere in the code base

That's a good point.

most of the writers will simply render the doubled Space as one

That's right: most writers collapse the double space. But not all (see textile, docx, rtf, xwiki, and zimwiki).

But my point here is not how writers handle the double space, but to have the possibility of handling it in a filter: either converting each Space to Str " " or concatenating it to the surrounding string, and leaving that choice to the filter's author.
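A filter of the first kind could look roughly like the Python sketch below, which walks a pandoc JSON document and replaces every Space node with Str " ". It assumes the reader has already emitted adjacent Space nodes, which is exactly what this proposal asks for; the function name `spaces_to_str` is hypothetical.

```python
def spaces_to_str(node):
    """Recursively replace every Space node with Str " " in a JSON AST."""
    if isinstance(node, list):
        return [spaces_to_str(child) for child in node]
    if isinstance(node, dict):
        # A Space node is an object with "t": "Space" and no content field.
        if node.get("t") == "Space" and "c" not in node:
            return {"t": "Str", "c": " "}
        return {key: spaces_to_str(value) for key, value in node.items()}
    # Strings, numbers, booleans, and null are returned unchanged.
    return node


doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Space"},
            {"t": "Str", "c": "one"},
            {"t": "Space"},
            {"t": "Space"},
            {"t": "Str", "c": "two"},
            {"t": "Space"},
        ]}
    ],
}
converted = spaces_to_str(doc)
print(converted["blocks"][0]["c"])
```

To use it as a JSON filter, the document would be read with `json.load(sys.stdin)` and written back with `json.dump`. How each writer then renders a Str that contains a space is still up to that writer.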

See also my comment in Proposal: preserve leading, trailing, and multiple interword space.

I think that the current handling of spaces in pandoc follows a markdown-ish document metamodel. IMHO this handling of spaces is not optimal for other document models (e.g. docx, odt etc.). And that's probably the reason why from time to time people keep asking to review this.

On the other hand, from a conceptual point of view, one would expect exactly the same document when doing the following:

pandoc input.docx -f docx -t docx -o output.docx

(The command above is simplistic and ignores the styles and the fact that pandoc uses a default word document template. Let's assume that both use the same template and have the same styles.)

Author

me-kell commented Aug 12, 2024

@jgm

Semantically, Space in the AST means "collapsible whitespace." And it is interpreted that way everywhere in the code base.

In different document formats (docx, markdown etc.) spaces (and tabs) have different semantic meanings.

If leading, trailing, and adjacent spaces are meaningful in a given document format (e.g. docx), why not let the specific reader build an AST with leading, trailing, and adjacent spaces and let the writer decide whether to collapse adjacent spaces or not? And have a pandoc AST builder that doesn't prevent specific readers from passing leading, trailing, and adjacent spaces through transparently.

For readers like docx, odt, rtf I would prefer to have " a b " represented as [ Str " a b "].
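That representation can be approximated after the fact by merging neighbouring Str and Space nodes. The Python sketch below is only an illustration on the JSON form of the AST: the name `merge_text_runs` is hypothetical, other inline types such as Emph are passed through untouched, and the function does not recurse into them.

```python
def merge_text_runs(inlines):
    """Merge adjacent Str and Space nodes into single Str nodes."""
    merged = []
    for node in inlines:
        if node["t"] == "Space":
            text = " "
        elif node["t"] == "Str":
            text = node["c"]
        else:
            # Any other inline (Emph, Strong, ...) ends the current run.
            merged.append(node)
            continue
        if merged and merged[-1]["t"] == "Str":
            # Extend the Str node created earlier in this run.
            merged[-1] = {"t": "Str", "c": merged[-1]["c"] + text}
        else:
            merged.append({"t": "Str", "c": text})
    return merged


inlines = [
    {"t": "Space"},
    {"t": "Str", "c": "a"},
    {"t": "Space"},
    {"t": "Str", "c": "b"},
    {"t": "Space"},
]
print(merge_text_runs(inlines))
```

For the input above the result is a single Str node whose content is " a b ".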

Owner

jgm commented Aug 15, 2024

In general, pandoc uses a simplified document model, which means that features that are semantically significant in some document formats are sometimes lost on conversion (as explained at the very beginning of the documentation).

That simplified document model is part of the reason pandoc can support so many formats without exploding complexity.

That said: if I were starting afresh, I'd probably not have a Space element. (See #7579.) I don't have one in the djot AST. I'd let the writers handle space collapsing as you suggest. It's just that making this change now would be quite a lot of work, and would potentially break a good deal of the infrastructure now built on pandoc. So, it may seem to you that you're making a small request, but it would actually be quite a large project to do this properly.

Author

me-kell commented Aug 15, 2024

@jgm Thank you very much for your explanations.

I've been looking at the relevant parts of the code. And even if I'm not proficient in Haskell, I realize how much work it is and how much the different aspects are intertwined in the code.

I very much agree with you: pandoc's success rests on its simplified document model.

There are other parts (readers, writers, and filters) that make it possible to handle most of what this simplified model does not cover. And I also agree that changing it would probably break existing code.

One last question (that probably does not belong here): is it possible to write a reader and/or writer in other languages than Haskell?

Owner

jgm commented Aug 15, 2024

is it possible to write a reader and/or writer in other languages than Haskell?

You can write a custom reader or writer in Lua:
https://pandoc.org/custom-readers.html
https://pandoc.org/custom-writers.html

These won't ever become part of pandoc, but they may be good enough for your own use.
