Unify defaults metadata and markdown metadata parsers #6328

lierdakil · 2020-05-01T00:29:12Z

This is prompted by lierdakil/pandoc-crossref#259.

Apparently, pandoc doesn't parse metadata fields in the defaults file the way it does for Markdown files; in particular, boolean fields are parsed as strings. I took a gander and, to my surprise, found that pandoc uses a completely different code path for parsing the defaults file's metadata field than for the YAML metadata block in Markdown.

Long story short, this is a first attempt at merging those code paths. This should hopefully behave a little more consistently (although I'll admit I quickly gave up on figuring out how to parse fields as Markdown while parsing defaults file -- I believe it's feasible, but likely more than a bit tricky)

P.S. This is more a request for comments than a bona fide pull request. The code perhaps needs a little polish. And I keep forgetting about the draft PR feature.

P.P.S. Sorry if this text is a little rambly or a bit incoherent, it's 4 AM and I'm slightly sleep-deprived.

jgm · 2020-05-01T03:37:23Z

This is intentional. The defaults file is supposed to be more or less equivalent to specifying things on the command line. When you specify --metadata, it gets treated as a string, not parsed as Markdown. It wouldn't in general make sense to parse it as Markdown -- maybe you're not converting from Markdown! Same reasoning for the different treatment in the defaults file.

lierdakil · 2020-05-01T03:42:02Z

Okay, ignore Markdown parsing part then, this patch doesn't do it anyway. The main issue here is booleans don't get parsed as booleans. And it's entirely possible to specify booleans in metadata on the CLI.

lierdakil · 2020-05-01T03:47:01Z

I guess the question is: do you want to have the same semantics in the Markdown reader metadata parser and defaults metadata parser? (modulo parsing strings as Markdown)

If not, I would argue you need a good reason for that, because different semantics will lead to much confusion.

jgm · 2020-05-01T05:17:54Z

If booleans can be specified using --metadata on the command line, then yes, it should be the same in the defaults file.

alerque · 2020-05-01T05:36:46Z

@lierdakil You can now convert existing PRs to Draft mode. The link to do so is under the "Reviewers" section in the sidebar.

jgm · 2020-05-16T22:53:11Z

@lierdakil - can you clarify the status of this draft? What work is still needed before it's ready to review?

lierdakil · 2020-05-17T12:29:41Z

So, okay, a bit of an explanation of how I'm going about this.

There's a function yamlMap in Text.Pandoc.Readers.Metadata, which takes a block parser and a YAML object and parses it into a MetaMap. This is specifically the parser that is used for parsing Markdown metadata blocks (albeit in the "normal" code path this happens through the proxy of yamlBsToMeta). The slight issue with this function is that it expects the block parser to return Blocks, which is less flexible than we would want here, so I have to amend it to instead expect a parser that returns a MetaValue straight away (whatever that MetaValue happens to be). This is relatively trivial, because yamlMap wraps the result of the block parser in MetaBlocks anyway.

Given this modification, I'm re-using yamlMap to parse the metadata YAML object in the defaults file. However, instead of the "blocks parser" that the function originally expects, I'm passing a dumb "parser" that just returns the input string wrapped as a MetaString.
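To make the idea concrete, here is a minimal sketch of a YAML walker parameterized by its leaf parser. It deliberately uses toy stand-in types (Yaml, MetaValue, yamlToMetaValue, plainLeaf and wordsLeaf are all hypothetical names invented for illustration), not pandoc's real yamlMap or its monad stack.

```haskell
-- Toy stand-ins for illustration only; these are NOT pandoc's real types.
data Yaml
  = YStr String
  | YBool Bool
  | YList [Yaml]
  | YMap [(String, Yaml)]

data MetaValue
  = MetaString String
  | MetaInlines [String]   -- crude stand-in for "parsed as Markdown"
  | MetaBool Bool
  | MetaList [MetaValue]
  | MetaMap [(String, MetaValue)]
  deriving (Eq, Show)

-- Walk a YAML tree. Booleans, lists and maps are handled here, once;
-- only the treatment of string leaves is delegated to the caller.
yamlToMetaValue :: Applicative f => (String -> f MetaValue) -> Yaml -> f MetaValue
yamlToMetaValue leaf y = case y of
  YStr s   -> leaf s
  YBool b  -> pure (MetaBool b)
  YList xs -> MetaList <$> traverse (yamlToMetaValue leaf) xs
  YMap kvs -> MetaMap <$> traverse (\(k, v) -> (,) k <$> yamlToMetaValue leaf v) kvs

-- "Dumb" leaf parser, as for the defaults file: keep the text verbatim.
plainLeaf :: String -> Either String MetaValue
plainLeaf = Right . MetaString

-- Stand-in for a format-aware leaf parser, as for a Markdown metadata block.
-- Splitting into words is only a placeholder for real Markdown parsing.
wordsLeaf :: String -> Either String MetaValue
wordsLeaf = Right . MetaInlines . words
```

Both callers share the same handling of booleans and nesting; they differ only in the function passed for string leaves.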

The original question I was implicitly asking with this PR is whether or not this all seems like a sound approach. On the positive side, this guarantees consistent handling of metadata in the defaults file and in-line as metadata blocks, with the notable exception of parsing string values as Markdown (in the current implementation, defaults file strings are treated as unformatted strings). On the negative side, this doesn't at all guarantee that anything would be consistent with the CLI (which uses a different code path altogether). Another possible negative is that the code ends up being somewhat convoluted, although I'm not sure if there's a better way to handle deep-ish monad stacks anyway.

As for what still needs to be done, the current implementation silently ignores any parsing errors reported by runPure. Frankly, with the current implementation, there isn't much that could go wrong in the first place, but to future-proof, I would argue it'd be good to handle possible errors anyway. The modification required is somewhat trivial as far as I can tell, but I didn't find the time to make it yet. Furthermore, the code needs a little bit of a stylistic clean-up, there are some useless hanging wheres in places, and variable names could perhaps use some work (for instance pBlocks isn't really a blocks-parser at this point, but a metadata-parser)

I'd like to encourage a quick look over the code as it is now (while the changeset is relatively small and contains little noise), to figure out if I should pursue this further, or if this approach should be abandoned in favour of another perhaps.

jgm · 2020-05-17T13:13:35Z

src/Text/Pandoc/App/Opt.hs

+  where
+    pMetaString = pure . MetaString <$> P.manyChar P.anyChar
+    runEverything :: P.ParserT Text P.ParserState PandocPure (P.F (M.Map Text MetaValue)) -> Meta
+    runEverything p =
+      either (const mempty) (Meta . flip P.runF def) . join . runPure $ P.readWithM p def ""
+yamlToMeta _ = mempty


I'm confused about this code. What's the point of runEverything here if we're just passing in pMetaString?

lierdakil · 2020-05-17

Basically: to make the type checker happy. yamlMap is several monad stacks deep before we get to the actual MetaMap, runEverything has to unwrap those.

FWIW, some of those layers need to be joined with the Parser in doOpt for the purposes of error handling, but I was being a little lazy with the proof-of-concept.

jgm · 2020-05-17T13:15:44Z

Quick review:

I am okay with the change in T.P.Readers.Metadata.

I don't quite understand some of the added complexity in T.P.App.Opt, as noted in the comment on the code.

Also, I thought that one of the aims here was to ensure that CLI metadata and (leaf nodes for) metadata in defaults files are parsed the same (particularly as regards boolean values), and I don't see that in this PR.

lierdakil · 2020-05-17T13:44:35Z

> Also, I thought that one of the aims here was to ensure that CLI metadata and (leaf nodes for) metadata in defaults files are parsed the same (particularly as regards boolean values), and I don't see that in this PR.

As I said above, this PR does not touch the CLI code path at all. The idea here is to handle metadata specified from the defaults file the same as metadata specified via the YAML metadata block (because both are YAML metadata, and discrepancies are confusing). The way I achieve the intended goal is to pass the whole YAML object containing the metadata in the defaults file to the parser in Text.Pandoc.Readers.Metadata. Booleans are handled by the yamlMap parser directly.

lierdakil · 2020-05-17T14:02:56Z

Basically, this is what I'm trying to point out: we don't have to reinvent the wheel with parsing metadata in the defaults file, we already have a parser for YAML metadata, which works in the context of the Markdown parser. We can use that one with minimal modification (at the cost of some arguably ugly monad stack unwrapping). That reduces effort duplication and gives us some UX consistency for free. If this sounds reasonable then I can tidy this PR up, and we can move on to review/merge and whatnot. Otherwise, we can abandon this angle and I can quickly slap together some ad-hoc measure to fix boolean parsing in defaults metadata and be done with that (this will lead to some code duplication, which irks me because DRY, but oh well)

Sorry if I'm failing to express my thoughts clearly.

P.S. I would argue using the same code to do basically the same job is highly preferable, but I acknowledge that it's a personal opinion, and that's why I'm asking if it sounds reasonable to you.

jgm · 2020-05-20T16:12:04Z

@brainchild0 - just a note that comments like the last one are motivation-killers in open-source projects. I haven't forgotten about this PR, but believe it or not, I have other work to do besides pandoc. I've indicated general agreement with the proposed change -- my last message asks for clarification only, which I happily received. The walls of text you have provided make it harder to review the PR; please keep your comments as succinct as possible.

brainchild0 · 2020-05-20T16:39:54Z

@jgm: Although the overall intention was different from what may have been construed, I understand the view you express. I removed the preceding comment, which I am interpreting as the target of your remark, being the only one you indicated explicitly. (Please be aware that my reading of the conversation, though perhaps inaccurate, was not one of agreement. It may be beside the point.)

lierdakil · 2020-05-25T14:28:06Z

@jgm, so, okay, sorry it took a while, I got busy with day job and didn't find much time to work on this.

Anyway, I've updated the PR. A few notes:

Error handling is somewhat anaemic, since PandocError ideally should be handled in IO, but we're not in IO -- and the only other option is to basically fail . show. Note that at present there are no errors to handle -- there is no code path in yamlToMeta that would throw a pure exception as far as I can tell, so this is more future-proofing than anything.
I took the liberty of cleaning up some code in T.P.R.Metadata. That includes renaming some bindings to hopefully avoid confusion in the future, and getting rid of some do-notation, one menacing-looking case tower, and unused language extensions.
While at it, I also removed references to the RelaxedPolyRec language extension everywhere I found it. Since I believe the current pandoc source can't be built with GHC 7 and below anyway, RelaxedPolyRec is literally impossible to turn off since at least GHC 7.6, and almost all mention of this extension has been removed from the GHC documentation, I'd say it's a good idea to remove this extension as obsolete.
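The "fail . show" option mentioned in the first note can be sketched as follows. This is a minimal illustration under stated assumptions: PureError stands in for PandocError, and Parser is a tiny hypothetical monad rather than the parser actually used in T.P.App.Opt.

```haskell
-- Hypothetical error type standing in for PandocError.
data PureError = PureError String
  deriving Show

-- A minimal parser-like monad whose only effect is failing with a message.
newtype Parser a = Parser { runParser :: Either String a }

instance Functor Parser where
  fmap f (Parser e) = Parser (fmap f e)

instance Applicative Parser where
  pure = Parser . Right
  Parser f <*> Parser x = Parser (f <*> x)

instance Monad Parser where
  Parser e >>= k = Parser (e >>= runParser . k)

instance MonadFail Parser where
  fail = Parser . Left

-- Re-raise a pure error inside any MonadFail by rendering it with 'show'.
-- The structured error is flattened to text, which is why this is "anaemic".
liftPure :: (MonadFail m, Show e) => Either e a -> m a
liftPure = either (fail . show) pure
```

The trade-off is visible in the type: the caller gets a failure message, but can no longer pattern-match on the original error value.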

lierdakil · 2020-06-28T14:27:18Z

This is a friendly reminder that this PR is ready for review. I'm afraid this will go out of sync with master if it marinates for much longer, and it does fix a real issue (report linked in the OP). No pressure, just in case it slipped through the cracks.

jgm · 2020-06-28T17:05:50Z

Sorry, I just forgot about this. I'll take a look now.

jgm · 2020-06-28T17:10:40Z

src/Text/Pandoc/Readers/MediaWiki.hs

@@ -1,6 +1,4 @@
 {-# LANGUAGE OverloadedStrings #-}
-{-# LANGUAGE RelaxedPolyRec    #-}


Can you split these deletions of RelaxedPolyRec into a separate PR, since they aren't related?

jgm · 2020-06-28T17:13:12Z

src/Text/Pandoc/Readers/Markdown.hs

        return $ runF meta defaultParserState
  parsed <- readWithM parser def{ stateOptions = opts } ""
  case parsed of
    Right result -> return result
    Left e       -> throwError e

-
+asBlocks :: Functor f => f (B.Many Block) -> f MetaValue
+asBlocks p = MetaBlocks . B.toList <$> p


asMetaBlocks might be a better name?

tarleb · 2020-06-28

Sidenote: this is the same as fmap toMetaValue, where the latter is from Text.Pandoc.Builder.

lierdakil · 2020-06-28

@tarleb, good point.

jgm · 2020-06-28T17:21:58Z

I think I'm fine with this change, though I have a nagging worry that there's something I'm overlooking. What more would be involved in making boolean values work in metadata in defaults files?

I'd like to make a pandoc release very soon, so we can support ghc 8.10, and it would be convenient to include this (since we'd be going to 2.10 anyway and this is an API change).

jgm · 2020-06-28T17:22:34Z

Btw it's helpful if all API changes are clearly indicated as such in the commit messages, so I don't forget to include them in the changelog.

lierdakil · 2020-06-28T20:24:17Z

Okay, I've removed RelaxedPolyRec stuff and asBlocks (in favour of fmap toMetaValue as per @tarleb's suggestion).

> What more would be involved in making boolean values work in metadata in defaults files?

As long as we're using the same code path in Markdown and defaults file metadata parsing, nothing. That's mostly the point: avoid duplicating

pandoc/src/Text/Pandoc/Readers/Metadata.hs

Lines 84 to 88 in f2b3377

    
    checkBoolean :: Text -> Maybe Bool
    checkBoolean t
      | t == T.pack "true" || t == T.pack "True" || t == T.pack "TRUE" = Just True
      | t == T.pack "false" || t == T.pack "False" || t == T.pack "FALSE" = Just False
      | otherwise = Nothing

and associated code.

Note to self, checkBoolean itself is already duplicated in

pandoc/src/Text/Pandoc/App/Opt.hs

Lines 500 to 508 in 34775b4

    
    readMetaValue :: String -> MetaValue
    readMetaValue s
      | s == "true"  = MetaBool True
      | s == "True"  = MetaBool True
      | s == "TRUE"  = MetaBool True
      | s == "false" = MetaBool False
      | s == "False" = MetaBool False
      | s == "FALSE" = MetaBool False
      | otherwise    = MetaString $ T.pack s

Might want to take a closer look.
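One way the duplication noted above could be removed is to define readMetaValue in terms of checkBoolean. The following is only a sketch under simplifying assumptions: it uses String instead of Text and a two-constructor toy MetaValue, so it is not the code that was merged.

```haskell
-- Toy MetaValue with only the two constructors needed here (an assumption;
-- pandoc's real type has more constructors and stores Text).
data MetaValue = MetaBool Bool | MetaString String
  deriving (Eq, Show)

-- Single source of truth for which spellings count as booleans.
checkBoolean :: String -> Maybe Bool
checkBoolean s
  | s `elem` ["true", "True", "TRUE"]    = Just True
  | s `elem` ["false", "False", "FALSE"] = Just False
  | otherwise                            = Nothing

-- CLI-style reader: a recognised boolean becomes MetaBool,
-- anything else is kept as a plain string.
readMetaValue :: String -> MetaValue
readMetaValue s = maybe (MetaString s) MetaBool (checkBoolean s)
```

With this shape, adding or removing an accepted spelling only requires touching checkBoolean.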

> Btw it's helpful if all API changes are clearly indicated as such in the commit messages, so I don't forget to include them in the changelog.

Uh... about that. Are there any public-facing API changes? Now that I look at it, Text.Pandoc.Readers.Metadata isn't exposed, and yamlBsToMeta isn't reexported directly. Other changes are local to modules, and I didn't touch any instances. So I guess there aren't technically any?

jgm · 2020-06-29T05:37:39Z

Sorry, I had thought the module was exposed. It isn't, you're right.

jgm · 2020-06-29T05:39:41Z

And sorry about my misunderstanding; I now see that Booleans are handled.

The branch has some conflicts that need to be resolved before merging.

lierdakil · 2020-06-29T14:25:21Z

Rebased.

jgm · 2020-06-29T19:39:06Z

Thanks!

lierdakil marked this pull request as draft May 6, 2020 06:26

jgm mentioned this pull request May 7, 2020

Boolean in metadata at defaults read as string #6349

Closed

jgm reviewed May 17, 2020

lierdakil force-pushed the defaults-meta-parse branch 4 times, most recently from 6585b96 to f916163, May 25, 2020 13:15

lierdakil marked this pull request as ready for review May 25, 2020 13:37

lierdakil marked this pull request as ready for review May 25, 2020 14:29

lierdakil force-pushed the defaults-meta-parse branch from f916163 to dd9840a, May 25, 2020 16:55

jgm reviewed Jun 28, 2020

lierdakil force-pushed the defaults-meta-parse branch from dd9840a to c6e019a, June 28, 2020 20:07

lierdakil force-pushed the defaults-meta-parse branch from c6e019a to 3f78c85, June 28, 2020 21:31

lierdakil mentioned this pull request Jun 28, 2020

Remove obsolete RelaxedPolyRec extension #6487

Merged

lierdakil added 3 commits June 29, 2020 17:06

Unify defaults and markdown metadata parsers

f26923b

Handle errors in yamlToMeta

34e54d3

Clean up T.P.R.Metadata

42e7f1e

lierdakil force-pushed the defaults-meta-parse branch from 3f78c85 to 42e7f1e, June 29, 2020 14:15

jgm merged commit f1a5295 into jgm:master Jun 29, 2020

lierdakil deleted the defaults-meta-parse branch June 29, 2020 22:54
