Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: remove Space element in AST #7579

Open
jgm opened this issue Sep 19, 2021 · 27 comments
Open

Proposal: remove Space element in AST #7579

jgm opened this issue Sep 19, 2021 · 27 comments

Comments

@jgm
Copy link
Owner

jgm commented Sep 19, 2021

This is a fairly radical change, so I'm posting this here for comment. The AST has a Space constructor for inlines, so we get:

% pandoc -t native
Hi there
[Para [Str "Hi",Space,Str "there"]]

The suggestion is to remove Space so that we get instead:

[Para [Str "Hi there"]]

I'm still unclear what this might break in the pandoc code base -- and it might also break things in third-party filters. It would likely make the code much faster, though, and remove some unnecessary complexity. It would also increase the readability of the native AST.

@jgm
Copy link
Owner Author

jgm commented Sep 19, 2021

The issue for filters could be partially mitigated if we provided a function that adds breaks [Str "hi there"] into [Str "hi",Str " ",Str "there"], and a pattern synonym for Space that matches Str " ". That would allow existing filters that depend on Space to be easily modified.

@jgm
Copy link
Owner Author

jgm commented Sep 19, 2021

Some background in #7124

@denismaier
Copy link
Contributor

denismaier commented Sep 19, 2021

First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.

@gwern
Copy link
Contributor

gwern commented Sep 19, 2021

First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.

Yes, it definitely would. I discussed this for my own Pandoc API search-and-replace utility; the existence of Space nodes (possibly arbitrarily present or absent) makes string search and replace extremely unreliable & difficult because it lifts strings to the AST list level. The only decent solution was to manually erase all Space nodes to put the AST into a canonical order where Str nodes contain the runs of text you expect them to contain. If there were no such nodes, I wouldn't have to do that and string munging becomes much more straightforward. It is still difficult to match any kind of mixture of formatting and text*, but no surprise there.

(It has always been opaque to me what the motivation behind having Space nodes and parsing strings into Space/Str lists was, since I haven't noticed any need for it in my own API use, nor it used elsewhere, and erasing them entirely hasn't led to any particular regression in my final HTML outputs. I noticed that the Str-erasing pass seemed to noticeably slow things down, so I would not be surprised if removing Str in general delivered performance benefits from much less AST churn & size.)

* eg with my smallcaps code. Imagine you want to rewrite "ABC-a" to Link _ ["ABC-a"] ("https://en.wikipedia.org/wiki/ABC-a" ""). With a straight search-and-replace of Str "...ABC-a...", you're fine; but what if a formatting pass has run before and so it's now actually [Str "...", Span "smallcaps-auto" _ [Str "ABC"], Str "-a..."]. Trouble.

@jgm
Copy link
Owner Author

jgm commented Sep 20, 2021

I think that I originally added Space to have something easy to split on for line wrapping (though the memory is hazy). In any case, we don't need it for that any more, since we use doclayout for that. I don't think there's any compelling reason to have it now, other than backwards compatibility. I remember that when I wrote the cheapskate parser, I originally had Space, then removed it, and things got much faster as a result, so I'd expect that here too.

There are certain kinds of filters that are made easier if you can assume that strings don't contain spaces, e.g. a filter that converts all instances of "AAA" into a particular link. These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.

@Delanii
Copy link

Delanii commented Sep 21, 2021

I have a small lua filter for inserting non-breakable spaces after specific prefixes, before dashes, etc ... This could simplify that filter workings to few regexes, opposed to checking for presence of Space elements. Also, would make it practically the same to filter that I am using for luaLaTeX compilations (if I am not running pandoc), which is IMHO great. It could even make pandoc AST representation more readable (or the opposite?)
With that being said, if this is implemented, it would be great to reference corresponding pandoc release number here in case anybody would be searching for that ...

@fiapps
Copy link

fiapps commented Sep 21, 2021

This will break some of my filters, but the change is a good thing in the long run, because it will make filters that look for a series of words easier to write. If the mitigation @jgm suggested was made available, I would probably use it.

@gwern
Copy link
Contributor

gwern commented Sep 21, 2021

These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.

If you mean that this filter could be written by literal pattern-matching on replace (Str "AAA") = Str "XYZ", I think that is unlikely to be a useful filter: you would still need to search within Str strings in order to insert the replacement, or else your users will be unpleasantly surprised that "foo AAA bar" replaces but, say, "foo AAA." or "(AAA foo" do not (because they no longer parse into Str "AAA" but the different literals Str "AAA." or Str "(AAA"). The case of exactly space/newline-separated "AAA" strings must be a rare one indeed, and as soon as you go beyond it, you're finding words inside longer strings & re-inventing regexps poorly. All the Space buys you in this case is an illusion that you implemented what you wanted.

@jgm
Copy link
Owner Author

jgm commented Sep 22, 2021

Started work on nospace branch.

@tarleb
Copy link
Collaborator

tarleb commented Sep 27, 2021 via email

@jgm
Copy link
Owner Author

jgm commented Sep 27, 2021

I assume multiple spaces would (usually) be kept verbatim in the new AST

Yes, that's the current plan.

@jgm
Copy link
Owner Author

jgm commented Sep 27, 2021

I've made some progress in converting the pandoc code base, but it's messy.

Note that in the writers, we currently assume that only Space is breakable, so space characters inside Str are not breakable. This assumption is used in various places, e.g. in the Markdown and Org writers, in code designed to avoid new block-level syntax being created inadvertently by line wrapping, e.g. when a hyphen or asterisk wraps and creates a list item. I assume it's going to be possible to change this code so that the bad-wrap detection is done in a different way, but I haven't studied it in detail.

Another thing that definitely gets more complicated is pattern matching to detect inline lists that start with space, and to remove this space. Again, this can be done without a Space element, but it's more complicated.

@jgm
Copy link
Owner Author

jgm commented Sep 28, 2021

I got things working enough to do some benchmarks with -f markdown-smart -t html (-smart because the abbreviation code is still not converted). This is after modifying the str parser in the markdown reader to parse larger spans of text, including spaces.

master:

   1,549,435,104 bytes allocated in the heap
     102,699,664 bytes copied during GC
      13,945,072 bytes maximum residency (10 sample(s))
          79,632 bytes maximum slop
              42 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       182 colls,     0 par    0.047s   0.048s     0.0003s    0.0017s
  Gen  1        10 colls,     0 par    0.054s   0.058s     0.0058s    0.0122s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.156s  (  0.171s elapsed)
  GC      time    0.102s  (  0.106s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time    0.258s  (  0.286s elapsed)

  Alloc rate    9,902,821,760 bytes per MUT second

  Productivity  60.6% of total user, 59.8% of total elapsed

nospace branch:

   1,264,393,400 bytes allocated in the heap
      75,768,304 bytes copied during GC
      10,495,536 bytes maximum residency (10 sample(s))
          77,232 bytes maximum slop
              35 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       144 colls,     0 par    0.036s   0.036s     0.0003s    0.0012s
  Gen  1        10 colls,     0 par    0.044s   0.048s     0.0048s    0.0117s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.133s  (  0.147s elapsed)
  GC      time    0.080s  (  0.084s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.213s  (  0.236s elapsed)

  Alloc rate    9,532,232,140 bytes per MUT second

  Productivity  62.2% of total user, 62.3% of total elapsed

The nospace branch uses significantly less memory (35 vs 42), does fewer GCs (144 vs 182), and finishes faster (0.213s vs 0.258s). However, the differences aren't huge. I'm still not completely convinced that the advantages in performance, or the advantages in facilitating filter writing, outweigh the inconvenience of a big AST change like this. As for filter writing, it's always possible to wrap your filter in a utility function that, first, removes spaces, then adds them again.

@jgm
Copy link
Owner Author

jgm commented Sep 28, 2021

Proof of concept:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import qualified Text.Pandoc.Builder as B
import qualified Data.Text as T
import Text.Pandoc.Walk (walk)
import Data.List (intersperse)

removeSpace :: [Inline] -> [Inline]
removeSpace = B.toList . mconcat . map go
 where
  go Space = B.str " "
  go x = B.singleton x

addSpace :: [Inline] -> [Inline]
addSpace = B.toList . mconcat . foldr go mempty
 where
  go (Str t) = (B.text t :)
  go x = (B.singleton x :)

test :: IO ()
test = do
  let doc = [Str "hello", Space, Str "there", Space, Emph [Str "friend"]]
  print $ walk (addSpace . map specialFilter . removeSpace) $ doc

specialFilter :: Inline -> Inline
specialFilter  (Str t) = Str $ T.replace "hello there" "hi ya" t
specialFilter x = x

@jgm
Copy link
Owner Author

jgm commented Oct 1, 2021

I assume multiple spaces would (usually) be kept verbatim in the new AST

@tarleb is there any reason they should be?

@tarleb
Copy link
Collaborator

tarleb commented Oct 1, 2021 via email

@jgm
Copy link
Owner Author

jgm commented Oct 1, 2021

Yes, that's something to keep in mind. We can certainly represent a doubled space in a Str element. However, the current behavior of Text.Pandoc.Builder.text is to collapse adjacent spaces, so if a reader relies on that, doubled spaces won't be preserved. Of course, the reader itself could be made sensitive to (sentence-ending punctuation + doubled space) and preserve it in this case (using B.str rather than B.text).
[EDIT: That's not quite right; the Monoid instance will also collapse spaces on append.]

@lewer
Copy link
Contributor

lewer commented May 13, 2022

Hello,
any progress on this? I currently replace Space with Str(" ") and then merge consecutive Str with a filter, but it's not really convenient and sometimes slow.

@jgm
Copy link
Owner Author

jgm commented May 13, 2022

See above. I'm not persuaded that the advantages outweigh the rather large drawbacks of an API change.

@gwern
Copy link
Contributor

gwern commented May 13, 2022

The API doesn't necessarily have to change. Perfect is the enemy of better: Space could be left in the AST type just as it is now, but deprecated and not generated by Pandoc, so filter writers could make the assumption of its absence (and querying for the existence of any Space node is presumably cheap, as well as Pandoc warning about use of deprecated Space nodes).

Such a required invariant is no worse than many parts of the status quo like using partial functions such as head/tail, or having fall-through AST pattern-matching, or not necessarily handling stuff wrapped in RawInline/Block elements, or the lack of Attr on many elements which ideally would have it etc.

And if an author does write a filter or relies on a dependency which re-introduces Space nodes, that is then their personal problem to deal with, and the burden of collapsing Space/Str nodes to pass on a clean AST to subsequent filters also their problem.

Thus far from the comments, far more filters are hindered than helped by the inconsistent confusion of Space/Str-with-whitespace-in-them nodes, and I think it's fair that the programming & performance burden be put on the few users who demand to use Space for whatever reason, and not on the many users who don't use it.

@jgm
Copy link
Owner Author

jgm commented May 14, 2022

By API change, I don't just mean removing the Space constructor -- I mean removing the expectation that Space will be used for spaces as it is now. There may be many filters and third-party tools that rely on this expectation (as well as many parts of the pandoc code, but at least that we control ourselves). So I'd predict (based on many past experiences) that the change would cause a lot of things to break. One can debate about where the "burden" should lie, but I think there's a presumption against making changes that break existing things.

@me-kell
Copy link

me-kell commented Aug 11, 2024

As far as I understand this proposal, it address (at least) four different and independent things:

  • how to parse spaces ' ': as Space or as Str " ".
  • whether to collapse adjacent spaces (both many Space to one Space and many Str " " to one Str " ")
  • whether to remove leading and trailing spaces or not (both Space and Str " ").
  • how to build strings with spaces: as [Str "one", Space, Str "two"] or as [Str "one two"] (or as [Str "one", Str " ", Str "two"] or as [Str "one two"]).

One approach could be to pass options to pandoc to configure the behavious above and let pandoc use the current behaviour as default if no option is given. That way existing code would not fail, and filter programers could stay with the existing default, allowing to programm new filters with the new behaviour when using the new options.

Those options could be named like the following:

  • --parse-space-as-string
  • --collapse-adjacent-spaces
  • --trim-spaces
  • --concatenate-spaces

@bpj
Copy link

bpj commented Aug 11, 2024

There are several things which I don’t understand here:

  • What would be the gain of replacing Space elements with Str " " if spaces are going to be a separate element anyway? In Lua elem.tag == 'Space' seems a lot better than elem.text == " ". I’m worried each time I have to type the string " " in any language, because that single space is hard to eye-parse, easily lost: omitted, replaced with multiple space characters, replaced with a newline if your editor isn’t all that smart.
  • The point of the Space element seems to be that writers may replace it with a line feed, which is legitimate. A space is semantically different from a(nother) string. I’m as lazy as anybody in Lua, inserting strings containing spaces into the AST, but those won’t line-break properly!
  • What is to gain by preserving multiple space characters? Most rebderers — notably HTML and all flavors of TeX — collapse them anyway. “Tabbing” with multiple space characters is very bad practice. There are dedicated characters like ‹ › U+2004 THREE-PER-EM SPACE aka   or   which you should use in those very rare cases where you legitimately need a wider space: ‹ › U+0020 SPACE is meant to be variable-width in proportional typesetting.
  • As far as I know Pandoc never produces Str " " so there is nothing to confuse. Possibly a string containing only ‹ › U+00A0 NO-BREAK SPACE might look similar in native or JSON output but it is not actually the same.

@me-kell
Copy link

me-kell commented Aug 12, 2024

@bpj

What would be the gain of replacing Space elements with Str " " if spaces are going to be a separate element anyway?

There is no gain at all. It was a proposal to not change too much code and retain spaces. Obviously is my proposal not that easy.

@me-kell
Copy link

me-kell commented Aug 12, 2024

@bpj

The point of the Space element seems to be that writers may replace it with a line feed, which is legitimate. A space is semantically different from a(nother) string. I’m as lazy as anybody in Lua, inserting strings containing spaces into the AST, but those won’t line-break properly!

What is to gain by preserving multiple space characters? Most rebderers — notably HTML and all flavors of TeX — collapse them anyway. “Tabbing” with multiple space characters is very bad practice. There are dedicated characters like ‹ › U+2004 THREE-PER-EM SPACE aka   or   which you should use in those very rare cases where you legitimately need a wider space: ‹ › U+0020 SPACE is meant to be variable-width in proportional typesetting.

"A space is semantically different from a(nother) string" is only true in some contexts. In a CodeBlock a space IS a string. Actually every char is a string in a CodeBlock.

"“Tabbing” with multiple space characters is very bad practice." is only true in some context. In CodeBlock and in some programming languages and file formats (e.g. Python, Occam, YAML among others) tabbing with multiple space characters is not only best practice it is mandatory.

In some disciplines (Linguistics, Pedagogy, Philosophy) there are writing conventions that recommend to use an monospaced font and use the space as string. In the same way programming languages use spaces in CodeBlocks.
There is though a feature that cannot be done with CodeBlocks: formating text with e.g. italics or bold. For such cases CodeBlocks are not useful.

For such use cases there is a need to preserve spaces.

@bpj
Copy link

bpj commented Aug 12, 2024

Of course Code and CodeBlock are different: they are, by definition, special fixed-width strings where renderers basically replace ‹ › U+0020 SPACE with ‹ › U+00A0 NO-BREAK SPACE. It is rather daft to suggest that that, or semantically significant indentation, invalidates the general point that space in regular inline text has special semantics in regular variable-width typography and typesetting, namely being potentially variable-width, potentially line-breaking divisions.1 Everything else, including “tabbing-by-spaces” — a feature of typewriter and plaintext “typography” which is fundamentally fixed-width — is a bad idea, follows from that. Note that justification fundamentally requires this “rubbery” nature of space, so it is most definitely a feature, something which Knuth explored at some length in The TeX book.

Note that if you insist on getting a particular number of ‹ › U+0020 SPACE characters in your output (outside of Code(Block)s) you can use a Raw inline of the appropriate type, like ` `{=html} although it is rather pointless: if you want a wider space in the rendered text you are, as I said, better off using no-break spaces of various kinds — in Pandoc Markdown usually through  ,   and the like. Anything else flies in the face of all rules of typography and typesetting.

Footnotes

  1. It is also rather daft to expect that everyone list all regular exceptions which follow from definitions everytime they mention a general rule. I realize why you might have the tendency to want to do that, but if one must always state every such thing explicitly one will never get to the point. Verbatim text goes under the general heading of “Exceptional structures” which are to be understood if you have done your homework. It took myself nearly thirty years to realize the rule of regular exceptions because it took that time for me to realize that most rule systems are ordered. It would be good if that order could always be stated. It is often not done because working out explicitly what all the rules are and then their proper ordering is a lot of work — for example most of phonologists’ discourse is about that — and the full working out is usually not needed for practical applications.

@me-kell
Copy link

me-kell commented Aug 12, 2024

@bpj Thanks for sharing your personal opinions.

BTW: It is not necessary to name other's opinions "daft".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants