Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify defaults metadata and markdown metadata parsers #6328

Merged
merged 3 commits into from
Jun 29, 2020

Conversation

lierdakil
Copy link
Contributor

@lierdakil lierdakil commented May 1, 2020

This is prompted by lierdakil/pandoc-crossref#259.

Apparently, pandoc doesn't parse metatdata fields in the defaults file the way it does it for Markdown files, particularly boolean fields are parsed as strings. I took a gander, and to my surprise, found that pandoc uses completely different code path for parsing defaults file metadata field compared to yaml metadata block in Markdown.

Long story short, this is a first attempt at merging those code paths. This should hopefully behave a little more consistently (although I'll admit I quickly gave up on figuring out how to parse fields as Markdown while parsing defaults file -- I believe it's feasible, but likely more than a bit tricky)

P.S. This is more a request for comments than a bona fide pull request. The code perhaps needs a little polish. And I keep forgetting about the draft PR feature.

P.P.S. Sorry if this text is a little rambly or a bit incoherent, it's 4 AM and I'm slightly sleep-deprived.

@jgm
Copy link
Owner

jgm commented May 1, 2020

This is intentional. The defaults file is supposed to be more or less equivalent to specifying things on the command line. When you specify --metadata, it gets treated as a string, not parsed as Markdown. It wouldn't in general make sense to parse it as Markdown -- maybe you're not converting from Markdown! Same reasoning for the different treatment in the defaults file.

@lierdakil
Copy link
Contributor Author

Okay, ignore Markdown parsing part then, this patch doesn't do it anyway. The main issue here is booleans don't get parsed as booleans. And it's entirely possible to specify booleans in metadata on the CLI.

@lierdakil
Copy link
Contributor Author

I guess the question is: do you want to have the same semantics in the Markdown reader metadata parser and defaults metadata parser? (modulo parsing strings as Markdown)

If not, I would argue you need a good reason for that, because different semantics will lead to much confusion.

@jgm
Copy link
Owner

jgm commented May 1, 2020

If booleans can be specified using --metadata on the command line, then yes, it should be the same in the defaults file.

@alerque
Copy link
Contributor

alerque commented May 1, 2020

@lierdakil You can now convert existing PRs to Draft mode. The link to do so us under the "Reviewers" section in the sidebar.

@lierdakil lierdakil marked this pull request as draft May 6, 2020 06:26
@brainchild0

This comment has been minimized.

@jgm
Copy link
Owner

jgm commented May 16, 2020

@lierdakil - can you clarify the status of this draft? What work is still needed before it's ready to review?

@brainchild0

This comment has been minimized.

@lierdakil
Copy link
Contributor Author

lierdakil commented May 17, 2020

So, okay, a bit of an explanation of how I'm going about this.

There's a function yamlMap in Text.Pandoc.Readers.Metadata, which takes a block-parser and a YAML object and parses it into a MetaMap. This is specifically the parser that is used in parsing Markdown metadata blocks (albeit in the "normal" code path this happens through the proxy of yamlBsToMeta). The slight issue with this function is that it expects block parser to return Blocks, which is less flexible than we would want here, so I have to amend it to instead expect a parser that returns a MetaValue straight away (whatever that MetaValue happens to be). This is relatively trivial, because yamlMap parser wraps the result of block-parser in MetaBlocks anyway.

Given this modification, I'm re-using yamlMap to parse metadata YAML object in the defaults file. However, instead of the "blocks parser" that the function originally expects, I'm passing a dumb "parser" that just returns the input string wrapped as a MetaString.

The original question I was implicitly asking with this PR is whether or not this all seems like a sound approach. On the positive side, this guarantees consistent handling of metadata in the defaults file and in-line as metadata blocks, with the notable exception of parsing string values as Markdown (in the current implementation, defaults file strings are treated as unformatted strings). On the negative side, this doesn't at all guarantee that anything would be consistent with the CLI (which uses a different code path altogether). Another possible negative is that the code ends up being somewhat convoluted, although I'm not sure if there's a better way to handle deep-ish monad stacks anyway.

As for what still needs to be done, the current implementation silently ignores any parsing errors reported by runPure. Frankly, with the current implementation, there isn't much that could go wrong in the first place, but to future-proof, I would argue it'd be good to handle possible errors anyway. The modification required is somewhat trivial as far as I can tell, but I didn't find the time to make it yet. Furthermore, the code needs a little bit of a stylistic clean-up, there are some useless hanging wheres in places, and variable names could perhaps use some work (for instance pBlocks isn't really a blocks-parser at this point, but a metadata-parser)

I'd like to encourage a quick look over the code as it is now (while the changeset is relatively small and contains little noise), to figure out if I should pursue this further, or if this approach should be abandoned in favour of another perhaps.

Comment on lines 483 to 488
where
pMetaString = pure . MetaString <$> P.manyChar P.anyChar
runEverything :: P.ParserT Text P.ParserState PandocPure (P.F (M.Map Text MetaValue)) -> Meta
runEverything p =
either (const mempty) (Meta . flip P.runF def) . join . runPure $ P.readWithM p def ""
yamlToMeta _ = mempty
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this code. What's the point of runEverything here if we're just passing in pMetaString?

Copy link
Contributor Author

@lierdakil lierdakil May 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically: to make the type checker happy. yamlMap is several monad stacks deep before we get to the actual MetaMap, runEverything has to unwrap those.

FWIW, some of those layers need to be joined with the Parser in doOpt for the purposes of error handling, but I was being a little lazy with the proof-of-concept.

@jgm
Copy link
Owner

jgm commented May 17, 2020

Quick review:

I am okay with the change in T.P.Readers.Metadata.

I don't quite understand some of the added complexity in T.P.App.Opt, as noted in the comment on the code.

Also, I thought that one of the aims here was to ensure that CLI metadata and (leafs nodes for) metadata in defaults files are parsed the same (particularly as regards boolean values), and I don't see that in this PR.

@lierdakil
Copy link
Contributor Author

Also, I thought that one of the aims here was to ensure that CLI metadata and (leafs nodes for) metadata in defaults files are parsed the same (particularly as regards boolean values), and I don't see that in this PR.

As I said above, this PR does not touch the CLI code path at all. The idea here is to handle metadata specified from the defaults file the same as metadata specified via the YAML metadata block (because both are YAML metadata, and discrepancies are confusing). The way I achieve the intended goal is to pass the whole YAML object containing the metadata in the defaults file to the parser in Text.Pandoc.Readers.Metadata. Booleans are handled by the yamlMap parser directly.

@lierdakil
Copy link
Contributor Author

lierdakil commented May 17, 2020

Basically, this is what I'm trying to point out: we don't have to reinvent the wheel with parsing metadata in the defaults file, we already have a parser for YAML metadata, which works in the context of the Markdown parser. We can use that one with minimal modification (at the cost of some arguably ugly monad stack unwrapping). That reduces effort duplication and gives us some UX consistency for free. If this sounds reasonable then I can tidy this PR up, and we can move on to review/merge and whatnot. Otherwise, we can abandon this angle and I can quickly slap together some ad-hoc measure to fix boolean parsing in defaults metadata and be done with that (this will lead to some code duplication, which irks me because DRY, but oh well)

Sorry if I'm failing to express my thoughts clearly.

P.S. I would argue using the same code to do basically the same job is highly preferable, but I acknowledge that it's a personal opinion, and that's why I'm asking if it sounds reasonable to you.

@brainchild0

This comment has been minimized.

@jgm
Copy link
Owner

jgm commented May 20, 2020

@brainchild0 - just a note that comments like the last one are motivation-killers in open-source projects. I haven't forgotten about this PR, but believe it or not, I have other work to do besides pandoc. I've indicated general agreement with the proposed change -- my last message asks for clarification only, which I happily received. The walls of text you have provided make it harder to review the PR; please keep your comments as succinct as possible.

@brainchild0
Copy link

@jgm: Although the overall intention was different from what may have been construed, I understand the view you express. I removed the preceding comment, which I am interpreting as the target of your remark, being the only one you indicated explicitly. (Please be aware that my reading of the conversation, though perhaps inaccurate, was not one of agreement. It may be beside the point.)

@lierdakil lierdakil force-pushed the defaults-meta-parse branch 4 times, most recently from 6585b96 to f916163 Compare May 25, 2020 13:15
@lierdakil lierdakil marked this pull request as ready for review May 25, 2020 13:37
@lierdakil
Copy link
Contributor Author

@jgm, so, okay, sorry it took a while, I got busy with day job and didn't find much time to work on this.

Anyway, I've updated the PR. A few notes:

  • Error handling is somewhat anaemic, since PandocError ideally should be handled in IO, but we're not in IO -- and the only other option is to basically fail . show. Note that at present there are no errors to handle -- there is no code path in yamlToMeta that would throw a pure exception as far as I can tell, so this is more future-proofing than anything.
  • I took the liberty to clean up some code in T.P.R.Metadata. That includes renaming some bindings to hopefully avoid confusion in the future and getting rid of some do-notation, one menacingly looking case tower, and unused language extensions.
  • While at it, I also removed references to RelaxedPolyRec language extension everywhere I found it. Since I believe the current pandoc source can't be built with GHC 7 and below anyway, and RelaxedPolyRec is literally impossible to turn off since at least GHC 7.6, and almost all mention of this extension is removed from GHC documentation, I'd say it's a good idea to remove this extension as obsolete.

@lierdakil lierdakil marked this pull request as ready for review May 25, 2020 14:29
@lierdakil
Copy link
Contributor Author

This is a friendly reminder that this PR is ready for review. I'm afraid this will go out of sync with master if it marinates for much longer, and it does fix a real issue (report linked in the OP). No pressure, just in case it slipped through the cracks.

@jgm
Copy link
Owner

jgm commented Jun 28, 2020

Sorry, I just forgot about this. I'll take a look now.

@@ -1,6 +1,4 @@
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RelaxedPolyRec #-}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split these deletions of RelaxedPolyRec into a separate PR, since they aren't related?

return $ runF meta defaultParserState
parsed <- readWithM parser def{ stateOptions = opts } ""
case parsed of
Right result -> return result
Left e -> throwError e


asBlocks :: Functor f => f (B.Many Block) -> f MetaValue
asBlocks p = MetaBlocks . B.toList <$> p
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asMetaBlocks might be a better name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: this is the same as fmap toMetaValue, where the latter is from Text.Pandoc.Builder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarleb, good point.

@jgm
Copy link
Owner

jgm commented Jun 28, 2020

I think I'm fine with this change, though I have a nagging worry that there's something I'm overlooking. What more would be involved in making boolean values work in metadata in defaults files?

I'd like to make a pandoc release very soon, so we can support ghc 8.10, and it would be convenient to include this (since we'd be going to 2.10 anyway and this is an API change).

@jgm
Copy link
Owner

jgm commented Jun 28, 2020

Btw it's helpful if all API changes are clearly indicated as such in the commit messages, so I don't forget to include them in the changelog.

@lierdakil
Copy link
Contributor Author

Okay, I've removed RelaxedPolyRec stuff and asBlocks (in favour of fmap toMetaValue as per @tarleb's suggestion).

What more would be involved in making boolean values work in metadata in defaults files?

As long as we're using the same code path in Markdown and defaults file metadata parsing, nothing. That's mostly the point: avoid duplicating

checkBoolean :: Text -> Maybe Bool
checkBoolean t
| t == T.pack "true" || t == T.pack "True" || t == T.pack "TRUE" = Just True
| t == T.pack "false" || t == T.pack "False" || t == T.pack "FALSE" = Just False
| otherwise = Nothing
and associated code.

Note to self, checkBoolean itself is already duplicated in

readMetaValue :: String -> MetaValue
readMetaValue s
| s == "true" = MetaBool True
| s == "True" = MetaBool True
| s == "TRUE" = MetaBool True
| s == "false" = MetaBool False
| s == "False" = MetaBool False
| s == "FALSE" = MetaBool False
| otherwise = MetaString $ T.pack s
. Might want to take a closer look.

Btw it's helpful if all API changes are clearly indicated as such in the commit messages, so I don't forget to include them in the changelog.

Uh... about that. Are there any public-facing API changes? Now that I look at it, Text.Pandoc.Readers.Metadata isn't exposed, and yamlBsToMeta isn't reexported directly. Other changes are local to modules, and I didn't touch any instances. So I guess there aren't technically any?

@jgm
Copy link
Owner

jgm commented Jun 29, 2020

Sorry, I had thought the module was exposed. It isn't, you're right.

@jgm
Copy link
Owner

jgm commented Jun 29, 2020

And sorry about my misunderstanding; I now see that Booleans are handled.

The branch has some conflicts that need to be resolved before merging.

@lierdakil
Copy link
Contributor Author

Rebased.

@jgm jgm merged commit f1a5295 into jgm:master Jun 29, 2020
@jgm
Copy link
Owner

jgm commented Jun 29, 2020

Thanks!

@lierdakil lierdakil deleted the defaults-meta-parse branch June 29, 2020 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants