Metadata #35
Comments
Is the purpose of the metadata block to set variables in the standalone output doc template? (I'm thinking here of my rough understanding of how Pandoc works.) My understanding is that YAML is a rather complex format. What about TOML? |
I don't much like TOML for this purpose; it requires you to quote strings, and it makes it very inconvenient to represent e.g. an array of references. |
YAML has bad handling of anything including newlines. So "simplified subset of YAML" would not solve the issue TOML does solve. But I agree that quoting stuff is annoying. One thing that I am missing is to spread metadata across the document - for longer documents one loses context if you have to put all the metadata at the very beginning of the document. So my requirement would be to support multiple metadata blocks instead of just one. Btw. how about just "reusing" existing formatted blocks with a reserved format keyword (might be a "symbol" instead of Latin text)? Instead of |
I needed a way to load configuration into my Pandoc filters without needing to "revert" Pandoc metadata trees to "plain" data. I couldn't use an existing YAML parser and soon despaired about writing a parser for full YAML. What I did succeed in writing was a parser for a basic subset of flow-style YAML without unquoted strings and tags, so basically JSON plus YAML-style hex integers and single and double quoted strings, giving the main advantages of YAML over JSON (including getting rid of the odious surrogate pair escapes!) without the significant indentation. The lpeg/re grammar isn't terribly large (copied from my Moonscript file):
|
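As a rough illustration of the subset described above (JSON-like flow syntax plus hex integers and single- and double-quoted strings, with no unquoted strings and no tags), here is a minimal Python sketch. It is not the lpeg/re grammar from the comment, which did not survive in this thread; it swaps in a regex tokenizer and a small recursive-descent parser. The names `parse_flow`, `tokenize`, `parse_value` and `parse_sequence` are hypothetical, and the escape handling is deliberately simplified.

```python
import re

# Tokens of the assumed subset: punctuation, quoted strings,
# hex/decimal integers, and the bare words true/false/null.
TOKEN = re.compile(r"""\s*(?:
      (?P<punct>[{}\[\],:])
    | "(?P<dq>(?:[^"\\]|\\.)*)"
    | '(?P<sq>(?:[^']|'')*)'
    | (?P<hex>0x[0-9A-Fa-f]+)
    | (?P<int>-?\d+)
    | (?P<word>true|false|null)
    )""", re.VERBOSE)

ESCAPES = {"n": "\n", "t": "\t"}  # simplified: other escapes map to themselves
WORDS = {"true": True, "false": False, "null": None}


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_value(tokens, i):
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, text = tokens[i]
    if kind == "dq":
        unescape = lambda m: ESCAPES.get(m.group(1), m.group(1))
        return re.sub(r"\\(.)", unescape, text), i + 1
    if kind == "sq":
        # YAML-style single quotes: '' stands for one quote character
        return text.replace("''", "'"), i + 1
    if kind == "hex":
        return int(text, 16), i + 1
    if kind == "int":
        return int(text), i + 1
    if kind == "word":
        return WORDS[text], i + 1
    if text == "[":
        return parse_sequence(tokens, i + 1, "]", keyed=False)
    if text == "{":
        return parse_sequence(tokens, i + 1, "}", keyed=True)
    raise ValueError(f"unexpected token {text!r}")


def parse_sequence(tokens, i, closer, keyed):
    result = {} if keyed else []
    while True:
        if i < len(tokens) and tokens[i] == ("punct", closer):
            return result, i + 1
        if keyed:
            key, i = parse_value(tokens, i)
            if not isinstance(key, str) or tokens[i:i + 1] != [("punct", ":")]:
                raise ValueError("expected a quoted key followed by ':'")
            result[key], i = parse_value(tokens, i + 1)
        else:
            item, i = parse_value(tokens, i)
            result.append(item)
        if i < len(tokens) and tokens[i] == ("punct", ","):
            i += 1
        elif tokens[i:i + 1] != [("punct", closer)]:
            raise ValueError(f"expected ',' or {closer!r}")


def parse_flow(text):
    tokens = tokenize(text)
    value, end = parse_value(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing input")
    return value
```

Because every string is quoted and nesting is marked by brackets, the parser never has to reason about indentation, which is the main simplification over full YAML.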
reStructuredText does something interesting here. They re-use definition list syntax; when a definition list occurs right after the document title, it is interpreted as metadata (IIRC). Nice thing is that we already have a nice readable syntax for that. |
@jgm wrote:
Nice!
Can't say I like that, since it should be legal to place a definition list as the first thing in the document, so some delimiter (three or more of some punctuation character!) would seem in order. Are
Yes. Some questions: - Would multiple "definitions" become a list? - and a nested "definition list" a nested mapping? - Would values be verbatim or be parsed as markup? If the latter there should IMO 1. be a way to mark a value as a raw string, maybe
|
|
I think the |
my current (Makefile‐based) workflow involves
nested metadata is useful in my experience for namespacing, although

    foo:
      bar: etaoin
      baz: shrdlu

can usually be represented as

    foo-bar: etaoin
    foo-baz: shrdlu

supporting lists/arrays is more important, as they are more difficult to represent through alternate means |
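The namespacing point can be sketched in a few lines of Python. This is only an illustration, and the helper name `flatten` is hypothetical: nested mappings collapse into hyphenated keys without losing information, while a list has no equally natural flat spelling and is passed through untouched.

```python
def flatten(meta, prefix=""):
    """Collapse nested mappings into hyphen-joined keys; leave other values alone."""
    flat = {}
    for key, value in meta.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"foo": {"bar": "etaoin", "baz": "shrdlu"}})` yields the `foo-bar` / `foo-baz` form shown above, whereas a value such as `["Author One", "Author Two"]` still needs real list support in the metadata format.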
In case there is still doubt about the topic, I am highly in favor of document metadata being within the document. Is there some reason that the comment character could not be co-opted to serve as a docstring for metadata? Comment block at the start of the document can contain whatever syntax is chosen to define key:values. In Rust, a Anyway, big fan of the project, and I am waiting on the sidelines for the eventual release. |
Another option -- we already have syntax to associate arbitrary metadata with elements: attribute
|
One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table. Not that that matters. But I suggested a metadata format like this on markdown-discuss 15 years ago. However, I think it's important to consider what types of data will go into the metadata fields. Our attributes are just strings. But string content isn't adequate for metadata. E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists. |
My gut response here would be to leave these kinds of metadata to the processors. Eg,
and let the specific renderer interpret abstract as metadata, and pull title there as well. |
I love having a way to serialize data without a new bespoke syntax. One nice thing about Markdown documents that embed YAML/TOML in the preface is that I can easily read/export that format without a new parser. Lua tables (with nil) feels great. |
I like the idea about using attribute syntax a lot, but less so the idea that it be a Lua table. Would that mean that Lua escapes are legal in the string? I assume |
Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do
|
Heya, just my two cents 😅 I think it might be helpful to compare a representative "in the wild" Markdown front-matter. I feel YAML is certainly the most "readable", but this obviously comes with the unfortunate over-complexities for parsing.

YAML

    version: 1
    title: My Document
    author:
      - name: Author One
        affiliation: University of Somewhere
      - name: Author Two
        affiliation: University of Nowhere
    abstract: |
      This is my very,
      very, very, long abstract...
    toc: true
    format:
      html:
        # some comment ...
        code-fold: true
        html-math-method: katex
      pdf:
        geometry:
          - top=30mm
          - left=20mm

TOML

    version = 1
    title = "My Document"
    abstract = """This is my very,
    very, very, long abstract...
    """
    toc = true
    [[author]]
    name = "Author One"
    affiliation = "University of Somewhere"
    [[author]]
    name = "Author Two"
    affiliation = "University of Nowhere"
    [format.html]
    # some comment ...
    code-fold = true
    html-math-method = "katex"
    [format.pdf]
    geometry = [ "top=30mm", "left=20mm" ]

Lua Table

    {
      version = 1,
      title = "My Document",
      author = {
        {
          name = "Author One",
          affiliation = "University of Somewhere"
        },
        {
          name = "Author Two",
          affiliation = "University of Nowhere"
        }
      },
      abstract = [[
    This is my very,
    very, very, long abstract...
    ]] ,
      toc = true,
      format = {
        html = {
          -- some comment...
          ["code-fold"] = true,
          ["html-math-method"] = "katex"
        },
        pdf = {
          geometry = { "top=30mm", "left=20mm" }
        }
      }
    }

JSON (no comments allowed)

    {
      "version": 1,
      "title": "My Document",
      "abstract": "This is my very,\nvery, very, long abstract...\n",
      "toc": true,
      "author": [
        {
          "name": "Author One",
          "affiliation": "University of Somewhere"
        },
        {
          "name": "Author Two",
          "affiliation": "University of Nowhere"
        }
      ],
      "format": {
        "html": {
          "code-fold": true,
          "html-math-method": "katex"
        },
        "pdf": {
          "geometry": [
            "top=30mm",
            "left=20mm"
          ]
        }
      }
    } |
If leaning on an existing format, the chief benefit is being able to read/write document metadata without a bespoke parser. Is StrictYAML codified where this would be an option in other languages? Similar problem for JSON – I think supporting comments should be a goal, but most JSON parsers do not support a comment syntax. Perhaps JSON5 is standardized enough to be considered? Then again, djot is an entirely new format which already requires a custom parser, but it would be nice to get the metadata formatting for free. |
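The remark about comments is easy to check against one widely used parser. The sketch below uses Python's standard `json` module purely as an example of a strict JSON implementation; the helper name `parses_as_json` is hypothetical.

```python
import json


def parses_as_json(text):
    """Return True if the standard-library JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


plain = '{"toc": true}'
commented = '{\n  // some comment ...\n  "toc": true\n}'
```

Here `parses_as_json(plain)` is true and `parses_as_json(commented)` is false, so metadata written with comments would need either a pre-processing step or a JSON5-style parser.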
If the metadata is a lua table, would the parser be able to evaluate functions within it? Also, I just stumbled on this project a few days ago and love its potential and vision! |
|
Me neither, at least not by default. It might be somewhat less scary if executed in a custom environment insulated from the file system, but that might be severely limiting when you cannot load modules. An alternative might be a custom variable interpolation or even template system with limited capabilities. I have written such a processor for MoonScript/Lua but it uses Lpeg/re and as such is not appropriate for djot. Before Pandoc included lpeg/re in its Lua API I had written a parser in pure MoonScript/Lua but it was a lot of code: 700+ lines, a whole parser implementation of its own. With lpeg/re I'm down to about 300 lines not counting what is done by the lpeg/re modules, which still is at the upper bound for what I'm comfortable with inlining into a Pandoc filter. That includes a mechanism for pluggable functions and some default functions, which make up around a third of the code. I usually add around 20-60 lines of extra functions and variable data, and that's a MoonScript class, so I'm back at some 700 lines of Lua code, plus dependency on lpeg/re. |
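To make the "variable interpolation with limited capabilities" alternative concrete, here is a minimal sketch in Python rather than MoonScript/Lua. It is not the processor described in the comment; it assumes a `$name` placeholder syntax and relies on the standard library's `string.Template`, which substitutes values but cannot execute code. The function name `interpolate` is hypothetical.

```python
from string import Template


def interpolate(meta):
    """Expand $name placeholders in string values using the other string fields."""
    variables = {key: value for key, value in meta.items() if isinstance(value, str)}
    expanded = {}
    for key, value in meta.items():
        if isinstance(value, str):
            # safe_substitute leaves unknown placeholders in place instead of raising
            expanded[key] = Template(value).safe_substitute(variables)
        else:
            expanded[key] = value
    return expanded
```

The expansion is a single pass with no function calls, so a metadata block cannot touch the file system or load modules; the trade-off is that anything beyond plain substitution has to live in the consuming program.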
Leaving executable code as an extension makes sense. And if djot's parsers are moving away from lua as @dbready mentioned, embedding lua just to read metadata seems extraneous. I don't think any of the other common metadata formats allow for code execution natively, and they probably prevent this for good reason. If metadata code execution is left to the program, then you can just pass in code through the program's custom metadata field, like pandoc's |
There's also Hjson which looks like this:
It's basically json, but it doesn't require quoting keys and it has comments and nice multi-line strings. |
Another potential choice is NestedText. It's designed to be simple to parse yet still humanly readable (based on YAML). Here's an example:
It only has three types: dictionaries, lists, and strings. There's even a more simplified version. |
Trying to think more holistically, an eventual goal of this markup is that non-programmers could adopt it in various places: blogs, academic papers, forums, etc. In which case, using an existing JSON/YAML/TOML format is a disadvantage: for a layman, it becomes a bespoke “header metadata” format different from the rest of the djot markup. From the angle of minimizing language size, I am in favor of matklad’s suggestion to use the existing djot attribute syntax. Less for a user to learn and easier to implement a parser. |
If existing djot syntax is to be used, which I think is a good idea, it is best to use definition/(un)ordered list syntax so that hierarchical structures are possible, for example multiple authors as a bullet list and the name/affiliation/email of each as a definition list. |
I'm very much in favour of metadata in djot documents. In My two cents (and sort of mentioned elsewhere): I suspect native definition lists will do, possibly wrapped inside a
When using a designated div type (like |
I like this idea, however with one minor exception: with this proposal, the h1 level section header becomes the document title. I'd like to move this definition area to the top of the document with an explicit title field for the document title. With this move it is more natural to use h1 level sections to split your document in major parts.
…On Sun, Jan 15, 2023, 00:46 Alex Kladov ***@***.***> wrote:
I realized that there's a quite syntactically nice way to embed meta in existing djot:

    # My Document

    : author = Alex Kladov
    : highlight-code
    : highlight-theme = GitHub
    : abstract

      Lorem Ipsum Dolores
      Bla Bla Bla

You could say that : key = value isn't actually djot syntax, but it needn't be! If there's a filter which turns first definition list after title into meta, it can also split dt's on = into a key and a value
|
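The filter idea quoted above (take the first definition list after the title and split each term on `=`) might look roughly like this. The sketch is in Python and works on a hypothetical, already-parsed list of `(term, body)` pairs rather than on a real djot AST; treating a bare term with no body as a boolean flag is an assumption made here, not something stated in the thread.

```python
def definition_list_to_meta(items):
    """Turn (term, body) pairs from a definition list into a metadata dict."""
    meta = {}
    for term, body in items:
        key, sep, value = term.partition("=")
        key = key.strip()
        if sep:
            meta[key] = value.strip()   # ": key = value"
        elif body:
            meta[key] = body            # ": key" followed by a definition body
        else:
            meta[key] = True            # bare ": key" (assumed to be a flag)
    return meta


items = [
    ("author = Alex Kladov", ""),
    ("highlight-code", ""),
    ("highlight-theme = GitHub", ""),
    ("abstract", "Lorem Ipsum Dolores\nBla Bla Bla"),
]
```

Calling `definition_list_to_meta(items)` gives a flat mapping with `author`, `highlight-code`, `highlight-theme` and `abstract` keys; the abstract keeps its body text, which a real filter could leave as parsed djot blocks.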
There’s #130 which proposes dedicated syntax for document titles. Everyone except me seems to be in agreement that title should just be a metadata field, but I still just don’t see that personally :) _Obviously_ title is the element you start your doc with, both in the source code, and in stand-alone HTML (title goes to both |
reStructuredText has a convention that the first heading sets the title, and a "field list" after it is treated as metadata (IIRC): e.g.,
We don't exactly have a "field list" in djot, but perhaps we could/should steal the concept: |
I fail to see the advantage of this
(I hope I got my lorem generator to produce correct djot definition list syntax. You get my idea!) It may be more whitespace than some people like, but it uses existing djot syntax in an extensible way, which is key. Obviously lists-as-meta could (probably should) have some additional restrictions such as definitions/values either containing just a nested list or a string which is treated as a plain string rather than being parsed into a list of blocks/inlines, but it would be good if the structure as such uses the same basic syntax as regular lists. (I moved my thoughts on definition list syntax from here to a separate discussion in #193. I also wrote something on metadata vs. other data which doesn’t concern (meta) data structure as such in #192.) |
Here's how metadata might look with reST style "field lists" (https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#field-lists):
This allows formatted and even block-level content for the fields. It does not yet support structured fields (e.g. metadata in the form of lists or structured objects). In pandoc you can have:
which is quite useful. Of course we could model that as
or perhaps
Still, you lose some flexibility. E.g. in pandoc you can have
which in pandoc metadata is clearly a ListMeta and not a BlockMeta with a list as its contents. For the latter, you'd write
In the field list syntax,
doesn't distinguish the two meanings distinguished above. But perhaps the "repeated key" approach does:
Note 1: with the "repeated key" approach to forming list metadata, you would not be able to override an earlier metadata item with a later one, as you can in pandoc. Note 2: field lists would create an ambiguity with the symb syntax, making it impossible to express a paragraph beginning with a symb. That's probably bad. We could revisit symb syntax or find a different syntax for field lists. |
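The trade-off in Note 1 can be seen in a small Python sketch of the "repeated key" rule. The function name `collect_fields` and the `(key, value)` input shape are hypothetical; the point is only that once repetition means "append", a later field can no longer replace an earlier one.

```python
def collect_fields(pairs):
    """Build metadata from (key, value) pairs; a repeated key accumulates into a list."""
    meta = {}
    for key, value in pairs:
        if key not in meta:
            meta[key] = value
        elif isinstance(meta[key], list):
            meta[key].append(value)
        else:
            meta[key] = [meta[key], value]
    return meta
```

With this rule, two `author` fields become a two-element list, but two `title` fields also become a list instead of the second overriding the first. A single `author` field stays a plain value, so a one-element list cannot be expressed without extra syntax.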
All in all, I'm still liking a simplified YAML-ish syntax best. |
Let's do a small thought experiment. Please take it with a pinch of salt. Can I have metadata with the existing Djot reader/parser as-is without changes, and just simple tweaks to my Djot renderer to some format? What is a metadata item? Let's go for the simplest form: a (key, value) that can possibly be defined anywhere in my document, does not affect normal output, and can be collected so my renderer may do something useful with it. The value can contain formatted text, possibly spanning multiple paragraphs. Wait a second... I do already have a construct for that! It's called a "footnote reference"... Let's go for it, but distinguishing it from my regular footnote space: I could just "reserve" some keys by a mere naming convention... Let's use colons in those metadata key names, for instance...
Now my renderer just has to look in the footnote references for those keys-with-colons, and use their value (e.g. put that title in a running header, or whatever). Without needing YAML, etc. So... Problem solved! But wait again... I could still actually refer to those weird pseudo-footnotes in the flow of my text. Why not, no problem, and this might actually even be handy...
Ahem! Thinking further, Djot has this small loosely-defined things called "symbols" too... I don't really need emojis or whatever it was supposed to be... So let's assume my renderer could actually resolve these symbols using my metadata footnote references? .... And suddenly, I went beyond just having metadata support...
Nifty.
What could go wrong here? 🤣 One could argue that using footnote syntax for this stuff is bad semantics. Quite right, possibly... but this is a lightweight markup language, so heh, after all... And if one wanted really distinct markup for different things, it's no longer lightweight, and it does already exist... it's called XML 😁 |
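The renderer-side half of this thought experiment can be sketched as follows. The original examples did not survive in this thread, so the convention used here is an assumption: footnote labels that contain a colon, written like `:title:`, are treated as metadata, and a symbol in running text is resolved by looking up the same label. The names `split_footnotes` and `resolve_symbols` are hypothetical.

```python
import re

# Assumed symbol shape: a name wrapped in colons, e.g. ":title:"
SYMBOL = re.compile(r":[\w+-]+:")


def split_footnotes(footnotes):
    """Separate metadata entries (labels containing ':') from regular footnotes."""
    meta = {label: text for label, text in footnotes.items() if ":" in label}
    notes = {label: text for label, text in footnotes.items() if ":" not in label}
    return meta, notes


def resolve_symbols(text, meta):
    """Replace known symbols with their metadata value; leave unknown ones as-is."""
    return SYMBOL.sub(lambda m: meta.get(m.group(0), m.group(0)), text)


footnotes = {":title:": "My Document", "1": "A regular note."}
meta, notes = split_footnotes(footnotes)
```

With these definitions, `resolve_symbols("Welcome to :title: by :unknown:", meta)` substitutes the title and leaves `:unknown:` alone, and `notes` still holds the ordinary footnote for normal rendering.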
...
...
Not at all. The form:
is already overloaded, used by both Reference link definitions and Footnotes, with the latter effectively carving out a key namespace with all its keys prefixed with In a meta markup language I'm working on (Plain Text Style Sheets), I've a generalized notion of reference definitions (is there a better name?) which includes key-value definitions just as you described, supporting reference links, footnotes, metadata, and automatic substitutions/macros. References can also be defined for content elements, e.g. named anchors to headings or any block/inline span, table and figure references, important term introductions/definitions, citations, index entries, glossary definitions, hashtags. Recursive resolution is also supported. Author/reader-friendly namespaces, if necessary, are easily defined by a simple character prefix, e.g. |
I'd like to add my 2c after reading this thread, as I'm very interested in this functionality and plan to integrate it into one of my projects. As far as I understand, there are two (largely independent) aspects being discussed: the location of the metadata and its format. Location:
Format: various options are listed in #35 (comment) and #35 (comment). Most of the options are format-independent, so can be integrated with any of the proposed formats, but using footnote references would largely define the format as well. I listed some of the pros/cons for each of the options (although I'm sure the list can be extended). All locations require their content to be hidden (maybe with the exception of footnote references), so may not work well with processors that don't recognize the syntax. I find the option of using footnote references really interesting, but it's likely to suffer from difficulties expressing elements that require arrays or sub-elements (for example, multiple authors with names, email and affiliations). If there is a good way to address this, then I'd favor this option. The attribute syntax has similar advantages (and is likely a bit less verbose), but doesn't allow multi-line values. Using the meta element is probably the most flexible one, but would require a separate processing, depending on which format is selected. I'd prefer Lua tables (and there are easy ways to suppress function execution there if needed), but I can understand why other formats may need to be supported (instead or as well). (updated 7/30 to add attribute syntax) |
Nice summary @pkulchenko
The content of a "metadata" block remains to be specified, with use cases largely depending on the context -- it suffices to look at how scattered the use of such blocks is in existing Markdown solutions (static web site or blog generators all have their things, etc.; without clear namespacing... in some documents I saw In other terms, to @jgm's initial question ("Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?"), I am tempted to answer negatively to the first point (and thus positively to the second). |
(Yet) another option would be the raw blocks, they're already set aside for special treatment by the processor.
It does leave the exact choice of metadata format up to the application consuming the document, which is a bit of a shame, but since it explicitly states the format you could always rewrite it easily if needed. |
YAML has such overcomplicated parsing rules. I’d be happier with something simpler but based on YAML rather than full YAML compliance if going that route—since full compliance would likely involve reliance on an entire YAML parser library as a dependency. Personally I don’t like the ad hoc nature of reusing a code block versus something more first-class as it becomes trickier to understand that it’s special, such as for editors to suggest the block is foldable/concealable, etc., or for consistent metadata fetching. It would be ideal in many build systems to be able to call something like |
It won’t do to just assume that a raw block marked with I agree that a dependency on a full YAML library probably should be avoided, but it would be good to consider what makes YAML attractive:
Unfortunately this human-reader friendliness comes at the price of requiring syntax rules which often are not at all intuitive to human writers in order to accommodate the “computer reader”. So what is needed is a reasonable (assuming YAML is unreasonable) compromise between those human-friendly features and features which are “computer friendly”. However I believe that this dichotomy is a bit of a red herring: any format which is meant to be read and written by both humans and computers has to strike such a balance, including djot, which already leans heavily in the direction of human-friendliness. I have said it before: djot already can parse both key-value lists, namely definition lists, and bullet lists, so it makes sense to reuse djot list syntax for which the parsing facilities are already in place! The problem is that you probably won’t usually want metadata values (or keys) to be parsed into textual elements — emphasis, spans etc. One solution might be to mark “raw text” as raw blocks/spans with a format |
Interesting comments. I've spent some time trying different options and then looking at the generated html, json and AST. To me the attribute approach looks like a winner given how concise it is compared to some other options. I also like to think about it as a way to associate attributes with the document itself instead of specific elements. I'm interested in being able to support the following:
The approach with attributes checks all these boxes for me as shown in the following example:
I'd recommend using This approach allows adding and overwriting attributes (as shown above with This syntax is quite forgiving in terms of quotes being optional, but it does require brackets to be on the same line as some of the text. I'll try it with a few more scenarios and report back if I run into any difficulties with it. |
Just a quick remark: this is not true, thematic breaks can (and should) have real attributes, with nothing "meta" about them. In real books, no one uses a mere (full or not) rule in all circumstances. I am currently using, for instance (non exhaustively):
I.e. styling thematic breaks (here, to possibly obtain, respectively, a centered |
I should have been more explicit; what I meant was that in this case, the thematic break is only added to separate (document) attributes from the rest of the document, so it won't have any other attributes (as it wouldn't exist in the document otherwise). Associating document attributes with any other element would lump them together with all other attributes that may already exist for that element. |
The whole concept of ’front matter’ exists because Markdown, unlike most other document/media file formats, did not provide a native way to add metadata. It’s a hack & should be avoided, not replicated. |
The drawback of attribute syntax is that attribute values are just plain strings. Metadata like title and abstract often contain formatting, so it would be nice if they were regular djot syntax. |
Abstract, backstory, correction, epilogue, prologue These are elements that seem like they should be in the contains-formatting category (there are likely a few more). Most of the other elements would be inside something like |
Titles can contain emphasis (e.g. italicizing a title), superscripts and subscripts, and math, for example. |
Finally not too lazy to look at the |
HTML is not the only format in the world and Djot should not be restricted to its lowest common denominator. Many real-world document titles contain formatting that has to be mangled to fit into an HTML |
Ah, I see the argument you are making now.
|
The title is the most prominent piece of data about the document (if any). It should not come after other metadata, nor should it require a metadata block (inside metadata). In "An INI Critique of TOML" [0], the author differentiates between serialisation formats and configuration formats. Metadata in djot is not for serialising and sending data. This makes JSON, TOML, and similar formats unsuitable. Also, using an external/existing format adds the cost of new syntax, as well as more parsing code, or a library. Figuring out the right subset of YAML will still require additional syntax to remember, as well as a new parser. External formats also have the downside of not (natively) supporting djot markup. Also, trying to support multiple external formats sounds like way too much overhead. Overall, I think a metadata block using a definition list is the way to go. 0: https://github.com/madmurphy/libconfini/wiki/An-INI-critique-of-TOML |
Has there been any progress on this discussion? I feel that @pkulchenko's suggestion is the nicest one because it feels the most native to djot with attributes, simply making them available to the document by having either a blank line after them or we could imagine something like an optional placeholder document element like
I think this could present a start for document-level attributes aka metadata, and then as for @jgm's concerns about being able to use djot syntax in the title document attribute, there needs to be a clarification on whether there are other attributes or metadata that require this as well. |
sorry for the late chime in. reading the plethora of messages and opinions so far got me thinking.
Format No1 — the proposed default:
Format No2 — the extension:
i am using the
|
In case it adds anything useful to this discussion, I added a comment here #293 (comment) that references this issue. |
Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?
If so, what?
Do we need structured keys such as YAML provides? Would be nice to avoid the complexity of YAML, but otherwise YAML is nice for this. Maybe some simplified subset of YAML.