Proposal: preserve leading, trailing, and multiple interword spaces #10077
Comments
In cases like this you should probably use nonbreaking spaces (which will not be collapsed).
Inserting nonbreaking spaces is a workaround when the input format is markdown. We also replace spaces with nonbreaking spaces with In our case we convert journal articles to latex. Those articles are written mostly in Word and (even if we keep telling them) authors mostly do not understand the difference between spaces and nonbreaking spaces (or even tabs). I have very basic knowledge of Haskell. But AFAICS implementing this would simply need to put one conditional in the code where the Space elements are collapsed. This could be done with an option |
Well no, it's not that simple, as explained in the other issues you've found. For one thing, most of the writers will simply render the doubled Space as one:
Semantically, |
That's a good point.
That's right: most writers collapse the double space. But not all (see textile, docx, rtf, xwiki, and zimwiki). My point here, though, is not how writers handle the double space, but to have the possibility to handle them in a filter. Either converting them to See also my comment in Proposal: preserve leading, trailing, and multiple interword space. I think that the current handling of spaces in pandoc follows a markdown-ish document metamodel. IMHO this handling of spaces is not optimal for other document models (e.g. docx, odt etc.). And that's probably the reason why people keep asking to review this from time to time. On the other hand, from a conceptual point of view, one would expect exactly the same document when doing the following:
(The command above is simplistic and ignores the styles and the fact that pandoc uses a default word document template. Let's assume that both use the same template and have the same styles.)
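To make the filter idea above concrete, here is a minimal sketch in Python. It assumes a hypothetical situation in which a reader has left adjacent `Space` elements in the inline list (pandoc's readers do not currently do this), and it works on inlines in pandoc's JSON shape (`{"t": "Space"}`, `{"t": "Str", "c": "..."}`). The function name `protect_extra_spaces` is made up for illustration and is not part of pandoc.

```python
NBSP = "\u00a0"  # nonbreaking space, which writers do not collapse


def protect_extra_spaces(inlines):
    """Replace every Space that directly follows another Space
    with a Str holding a nonbreaking space.

    Assumption: `inlines` is a list of dicts in pandoc's JSON AST shape
    and may contain adjacent Space elements.
    """
    result = []
    previous_was_space = False
    for inline in inlines:
        is_space = inline.get("t") == "Space"
        if is_space and previous_was_space:
            # Second and later spaces in a run become nonbreaking.
            result.append({"t": "Str", "c": NBSP})
        else:
            result.append(inline)
        previous_was_space = is_space
    return result


sample = [
    {"t": "Str", "c": "one"},
    {"t": "Space"},
    {"t": "Space"},
    {"t": "Str", "c": "two"},
]
converted = protect_extra_spaces(sample)
```

With such a filter, the first space in a run stays a normal breakable `Space`, and only the extra ones are protected from collapsing.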
In different document formats (docx, markdown etc.) spaces (and tabs) have different semantic meanings. If leading, trailing, and adjacent spaces are meaningful in a given document format (e.g. docx), why not let the specific reader build an AST with leading, trailing, and adjacent spaces and let the writer decide whether to collapse adjacent spaces or not? And have a pandoc AST builder that doesn't prevent specific readers from getting transparent leading, trailing, and adjacent spaces. For readers like docx, odt, rtf I would prefer to have
In general, pandoc uses a simplified document model, which means that features that are semantically significant in some document formats are sometimes lost on conversion (as explained at the very beginning of the documentation). That simplified document model is part of the reason pandoc can support so many formats without exploding complexity. That said: if I were starting afresh, I'd probably not have a Space element. (See #7579.) I don't have one in the djot AST. I'd let the writers handle space collapsing as you suggest. It's just that making this change now would be quite a lot of work, and would potentially break a good deal of the infrastructure now built on pandoc. So, it may seem to you that you're making a small request, but it would actually be quite a large project to do this properly.
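As a rough illustration of what "letting the writers handle space collapsing" could mean, here is a toy Python sketch. It is not pandoc code: the tuple-based inline list and the `render` function are hypothetical stand-ins for an AST that keeps every space and a writer that decides what to do with them.

```python
def render(tokens, collapse=True):
    """Render a toy inline list to plain text.

    Assumption: tokens are ("Space",) or ("Str", text) tuples, and the
    reader has kept every space. The writer decides whether to collapse.
    """
    out = []
    for tok in tokens:
        if tok[0] == "Space":
            if collapse and out and out[-1] == " ":
                continue  # drop repeated spaces in collapsing mode
            out.append(" ")
        else:
            out.append(tok[1])
    text = "".join(out)
    # A collapsing writer also trims leading and trailing spaces.
    return text.strip(" ") if collapse else text


tokens = [("Space",), ("Str", "one"), ("Space",), ("Space",),
          ("Str", "two"), ("Space",)]
collapsed_text = render(tokens)                  # markdown-like writer
preserved_text = render(tokens, collapse=False)  # docx-like writer
```

In this sketch the same inline list yields `"one two"` for a collapsing writer and `" one  two "` for a preserving one.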
@jgm Thank you very much for your explanations. I've been looking at the relevant parts of the code. And even if I'm not proficient with Haskell, I realize how much work it is and how much the different aspects are intertwined in the code. I very much agree with you: pandoc's success is its simplified document model. There are other parts (Readers, Writers, and filters) which allow handling (most) parts that are not covered in this simplified model. And I also agree that changing that would probably break existing code. One last question (that probably does not belong here): is it possible to write a reader and/or writer in languages other than Haskell?
You can write a custom reader or writer in Lua: These won't ever become part of pandoc, but they may be good enough for your own use.
Pandoc trims leading and trailing spaces, removes double interword spaces, and treats them as 'Space'.
Although this approach works well for most formats like `markdown` or `html`, it might not be the best one for formats like `docx` or `odt`.
There are use cases where preserving leading, trailing, and interword spaces is necessary: e.g. interlinear glosses, representation of overlapping speech etc.
Using "Source Code" styles (i.e. CodeBlock) works well for unformatted text. But formatting (italics and/or bold) is ignored in CodeBlocks. Without the "Source Code" style the spaces are not preserved:
For such use cases preserving leading, trailing, and interword spaces would be very useful.
One possible way to achieve this could be to parse spaces like in `" one  two "` to `[Space, Str "one", Space, Space, Str "two", Space]`.
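The proposed parsing can be sketched with a small Python tokenizer. This is only a toy model of the idea, not pandoc's parser: the `tokenize` function, its `preserve_spaces` parameter, and the tuple representation are all hypothetical and chosen for illustration.

```python
import re


def tokenize(text, preserve_spaces=False):
    """Split text into a toy inline list of ("Space",) and ("Str", word).

    With preserve_spaces=True every single space becomes its own Space.
    Otherwise leading/trailing spaces are trimmed and runs are collapsed,
    mirroring the current behaviour described in this issue.
    """
    tokens = []
    for match in re.finditer(r" |[^ ]+", text):
        piece = match.group()
        tokens.append(("Space",) if piece == " " else ("Str", piece))
    if preserve_spaces:
        return tokens

    collapsed = []
    for tok in tokens:
        # Skip a Space at the start or directly after another Space.
        if tok == ("Space",) and (not collapsed or collapsed[-1] == ("Space",)):
            continue
        collapsed.append(tok)
    if collapsed and collapsed[-1] == ("Space",):
        collapsed.pop()  # trim the trailing Space
    return collapsed


preserved = tokenize(" one  two ", preserve_spaces=True)
default = tokenize(" one  two ")
```

Here `preserved` corresponds to the six-element list proposed above, while `default` keeps only a single `Space` between the two words.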
For this not to break existing code and filters it could be done by passing an option `--preserve-spaces` to pandoc or by having an extension only for specific formats (like `docx+preserve_spaces`).
This proposal is related to Proposal: remove Space element in AST but different since it simply preserves leading, trailing, and interword spaces as `Space`s.
And AFAICS the AST has no problems with it: