Proposal: remove Space element in AST #7579
Comments
The issue for filters could be partially mitigated if we provided a function that adds breaks |
Some background in #7124 |
First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible. |
Yes, it definitely would. I discussed this for my own Pandoc API search-and-replace utility; the existence of (It has always been opaque to me what the motivation behind having * eg with my smallcaps code. Imagine you want to rewrite "ABC-a" to |
I think that I originally added Space to have something easy to split on for line wrapping (though the memory is hazy). In any case, we don't need it for that any more, since we use doclayout for that. I don't think there's any compelling reason to have it now, other than backwards compatibility. I remember that when I wrote the cheapskate parser, I originally had Space, then removed it, and things got much faster as a result, so I'd expect that here too. There are certain kinds of filters that are made easier if you can assume that strings don't contain spaces, e.g. a filter that converts all instances of "AAA" into a particular link. These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link. |
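The extra steps described above (find the word inside a longer string, split the string, insert the link) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Inline` here is a simplified, String-based stand-in rather than the real pandoc-types definition, `splitOn` and `linkifyAll` are hypothetical helper names, and the match is a plain substring match with no word-boundary check.

```haskell
import Data.List (intercalate, isPrefixOf)

-- Simplified, String-based stand-in for pandoc's Inline type (an assumption
-- made so the sketch compiles on its own; it is not the pandoc-types API).
data Inline
  = Str String
  | Emph [Inline]
  | Link [Inline] String -- link text, target URL
  deriving (Eq, Show)

-- Split a string around every occurrence of the needle (hypothetical helper).
-- An empty needle leaves the string unsplit.
splitOn :: String -> String -> [String]
splitOn needle s
  | null needle = [s]
  | otherwise = go "" s
  where
    go acc [] = [reverse acc]
    go acc rest@(c : cs)
      | needle `isPrefixOf` rest = reverse acc : go "" (drop (length needle) rest)
      | otherwise = go (c : acc) cs

-- Replace every occurrence of the needle inside Str elements with a link,
-- splitting the surrounding string. Existing links are left untouched.
linkifyAll :: String -> String -> [Inline] -> [Inline]
linkifyAll needle url = concatMap step
  where
    link = Link [Str needle] url
    step (Str s) =
      -- Put the link between the pieces, then drop empty leftover strings.
      filter (/= Str "") (intercalate [link] [[Str piece] | piece <- splitOn needle s])
    step (Emph xs) = [Emph (linkifyAll needle url xs)]
    step other = [other]
```

With Space elements in the AST, the same filter is a one-line match on `Str "AAA"`; without them, the string has to be split first, which is the additional work the comment refers to.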
I have a small Lua filter for inserting non-breaking spaces after specific prefixes, before dashes, etc. This change could reduce that filter to a few regexes, as opposed to checking for the presence of Space elements. It would also make it practically the same as the filter I use for LuaLaTeX compilations (when I am not running pandoc), which is IMHO great. It could even make the pandoc AST representation more readable (or the opposite?) |
This will break some of my filters, but the change is a good thing in the long run, because it will make filters that look for a series of words easier to write. If the mitigation @jgm suggested was made available, I would probably use it. |
If you mean that this filter could be written by literal pattern-matching on |
Started work on nospace branch. |
I'd like that change. I assume multiple spaces would (usually) be kept
verbatim in the new AST, i.e., `one two` would be parsed as
`Str "one two"` in most formats?
|
Yes, that's the current plan. |
I've made some progress in converting the pandoc code base, but it's messy. Note that in the writers, we currently assume that only Space is breakable, so space characters inside Str are not breakable. This assumption is used in various places, e.g. in the Markdown and Org writers, in code designed to avoid new block-level syntax being created inadvertently by line wrapping, e.g. when a hyphen or asterisk wraps and creates a list item. I assume it's going to be possible to change this code so that the bad-wrap detection is done in a different way, but I haven't studied it in detail. Another thing that definitely gets more complicated is pattern matching to detect inline lists that start with space, and to remove this space. Again, this can be done without a Space element, but it's more complicated. |
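To illustrate the last point, here is a minimal sketch of removing leading space from an inline list with and without a Space element. It assumes a simplified, String-based stand-in for `Inline` (not the real pandoc-types definition), and the function names `stripLeadingWithSpace` and `stripLeadingNoSpace` are hypothetical. Nested containers such as `Emph` are deliberately left untouched.

```haskell
import Data.Char (isSpace)

-- Simplified stand-in for pandoc's Inline type (an assumption for this
-- sketch only). Space is kept so both variants can be compared.
data Inline
  = Str String
  | Space
  | Emph [Inline]
  deriving (Eq, Show)

-- With a Space element, leading space is found by matching a constructor.
stripLeadingWithSpace :: [Inline] -> [Inline]
stripLeadingWithSpace (Space : rest) = stripLeadingWithSpace rest
stripLeadingWithSpace xs = xs

-- Without a Space element, the contents of the first Str must be inspected.
stripLeadingNoSpace :: [Inline] -> [Inline]
stripLeadingNoSpace (Str s : rest) =
  case dropWhile isSpace s of
    "" -> stripLeadingNoSpace rest -- the string was all whitespace; keep looking
    s' -> Str s' : rest
stripLeadingNoSpace xs = xs
```

The second version is still short, but it has to look inside the string and handle the case where a whole `Str` consists of whitespace, which a constructor match never needs to consider.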
I got things working enough to do some benchmarks with master:
nospace branch:
The |
Proof of concept:
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import qualified Text.Pandoc.Builder as B
import qualified Data.Text as T
import Text.Pandoc.Walk (walk)
import Data.List (intersperse)

-- Replace each Space with a literal space character; adjacent Str
-- elements are merged when the Inlines are concatenated.
removeSpace :: [Inline] -> [Inline]
removeSpace = B.toList . mconcat . map go
  where
    go Space = B.str " "
    go x = B.singleton x

-- Re-split Str elements on whitespace, reintroducing Space elements.
addSpace :: [Inline] -> [Inline]
addSpace = B.toList . mconcat . foldr go mempty
  where
    go (Str t) = (B.text t :)
    go x = (B.singleton x :)

test :: IO ()
test = do
  let doc = [Str "hello", Space, Str "there", Space, Emph [Str "friend"]]
  print $ walk (addSpace . map specialFilter . removeSpace) $ doc

-- Example filter that needs to match text spanning a space.
specialFilter :: Inline -> Inline
specialFilter (Str t) = Str $ T.replace "hello there" "hi ya" t
specialFilter x = x |
@tarleb is there any reason they should be? |
@tarleb is there any reason they should be?
The one example that I have in my head is sentences. People had been
asking for ways to mark up sentences, but it's difficult to distinguish
between sentence endings and abbreviations. Keeping any double spaces
after a period could allow users to write filters which split on sentences.
|
Yes, that's something to keep in mind. We can certainly represent a doubled space in a Str element. However, the current behavior of Text.Pandoc.Builder.text is to collapse adjacent spaces, so if a reader relies on that, doubled spaces won't be preserved. Of course, the reader itself could be made sensitive to (sentence-ending punctuation + doubled space) and preserve it in this case (using B.str rather than B.text). |
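A rough sketch of why preserving the doubled space matters for such a filter. It works on plain `String` values rather than pandoc types, and both `collapseSpaces` (standing in for the space-collapsing behaviour described above) and `splitSentences` are hypothetical helpers written for this illustration, not pandoc functions.

```haskell
-- Collapse every run of spaces into a single space (a stand-in for the
-- collapsing behaviour described above; not the pandoc implementation).
collapseSpaces :: String -> String
collapseSpaces (' ' : ' ' : rest) = collapseSpaces (' ' : rest)
collapseSpaces (c : rest) = c : collapseSpaces rest
collapseSpaces [] = []

-- Split after sentence-ending punctuation that is followed by at least
-- two spaces. A single space (as after an abbreviation) does not split.
splitSentences :: String -> [String]
splitSentences = go ""
  where
    go acc (p : ' ' : ' ' : rest)
      | p `elem` ".!?" = reverse (p : acc) : go "" (dropWhile (== ' ') rest)
    go acc (c : rest) = go (c : acc) rest
    go acc [] = [reverse acc | not (null acc)]
```

Applied to `"Dr. Smith left.  He was late."`, `splitSentences` can separate the two sentences without tripping over the abbreviation. Once the input has passed through `collapseSpaces`, that signal is gone and the text comes back as a single piece.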
Hello, |
See above. I'm not persuaded that the advantages outweigh the rather large drawbacks of an API change. |
The API doesn't necessarily have to change. Perfect is the enemy of better: Space could be left in the AST type just as it is now, but deprecated and not generated by Pandoc, so filter writers could make the assumption of its absence (and querying for the existence of any Space node is presumably cheap, as well as Pandoc warning about use of deprecated Space nodes). Such a required invariant is no worse than many parts of the status quo like using partial functions such as head/tail, or having fall-through AST pattern-matching, or not necessarily handling stuff wrapped in RawInline/Block elements, or the lack of Attr on many elements which ideally would have it etc. And if an author does write a filter or relies on a dependency which re-introduces Space nodes, that is then their personal problem to deal with, and the burden of collapsing Space/Str nodes to pass on a clean AST to subsequent filters also their problem. Thus far from the comments, far more filters are hindered than helped by the inconsistent confusion of Space/Str-with-whitespace-in-them nodes, and I think it's fair that the programming & performance burden be put on the few users who demand to use Space for whatever reason, and not on the many users who don't use it. |
By API change, I don't just mean removing the Space constructor -- I mean removing the expectation that Space will be used for spaces as it is now. There may be many filters and third-party tools that rely on this expectation (as well as many parts of the pandoc code, but at least that we control ourselves). So I'd predict (based on many past experiences) that the change would cause a lot of things to break. One can debate about where the "burden" should lie, but I think there's a presumption against making changes that break existing things. |
As far as I understand this proposal, it addresses (at least) four different and independent things:
One approach could be to pass options to pandoc to configure the behaviours above, and let pandoc use the current behaviour as the default if no option is given. That way existing code would not fail, and filter programmers could stay with the existing default while being able to write new filters against the new behaviour by using the new options. Those options could be named like the following:
|
There are several things which I don’t understand here:
|
There is no gain at all. It was a proposal to not change too much code and retain spaces. Obviously my proposal is not that easy. |
"A space is semantically different from a(nother) string" is only true in some contexts. In a CodeBlock a space IS a string. Actually every char is a string in a CodeBlock. "“Tabbing” with multiple space characters is very bad practice." is only true in some contexts. In CodeBlock and in some programming languages and file formats (e.g. Python, Occam, YAML among others) tabbing with multiple space characters is not only best practice, it is mandatory. In some disciplines (Linguistics, Pedagogy, Philosophy) there are writing conventions that recommend using a monospaced font and using the space as a string, in the same way programming languages use spaces in CodeBlocks. For such use cases there is a need to preserve spaces. |
Of course Code and CodeBlock are different: they are, by definition, special fixed-width strings where renderers basically replace ‹ › U+0020 SPACE with ‹ › U+00A0 NO-BREAK SPACE. It is rather daft to suggest that that, or semantically significant indentation, invalidates the general point that space in regular inline text has special semantics in regular variable-width typography and typesetting, namely being potentially variable-width, potentially line-breaking divisions. That everything else, including “tabbing-by-spaces” (a feature of typewriter and plaintext “typography”, which is fundamentally fixed-width), is a bad idea follows from that. Note that justification fundamentally requires this “rubbery” nature of space, so it is most definitely a feature, something which Knuth explored at some length in The TeX book. Note that if you insist on getting a particular number of ‹ › U+0020 SPACE characters in your output (outside of Code(Block)s) you can use a Raw inline of the appropriate type.
|
@bpj Thanks for sharing your personal opinions. BTW: It is not necessary to name other's opinions "daft". |
This is a fairly radical change, so I'm posting this here for comment. The AST has a Space constructor for inlines, so we get:
The suggestion is to remove Space so that we get instead:
I'm still unclear what this might break in the pandoc code base -- and it might also break things in third-party filters. It would likely make the code much faster, though, and remove some unnecessary complexity. It would also increase the readability of the native AST.