Metadata #35

Open
jgm opened this issue Jul 31, 2022 · 60 comments

@jgm
Owner

jgm commented Jul 31, 2022

Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?

If so, what?

Do we need structured keys such as YAML provides? Would be nice to avoid the complexity of YAML, but otherwise YAML is nice for this. Maybe some simplified subset of YAML.

@uvtc
Contributor

uvtc commented Aug 1, 2022

Is the purpose of the metadata block to set variables in the standalone output doc template? (I'm thinking here of my rough understanding of how Pandoc works.)

My understanding is that YAML is a rather complex format. What about TOML?

@jgm
Owner Author

jgm commented Aug 1, 2022

I don't much like TOML for this purpose; it requires you to quote strings, and it makes it very inconvenient to represent e.g. an array of references.

@dumblob

dumblob commented Aug 1, 2022

YAML has bad handling of anything including newlines. So "simplified subset of YAML" would not solve the issue TOML does solve. But I agree that quoting stuff is annoying.

One thing I am missing is a way to spread metadata across the document: in longer documents you lose context if all the metadata has to sit at the very beginning. So my requirement would be to support multiple metadata blocks instead of just one.

Btw. how about just "reusing" existing formatted blocks with a reserved format keyword (it might be a "symbol" instead of Latin text)? Instead of cpp one would perhaps use > or # or whatever, and done.

@bpj

bpj commented Aug 1, 2022

I needed a way to load configuration into my Pandoc filters without needing to "revert" Pandoc metadata trees to "plain" data. I couldn't use an existing YAML parser and soon despaired about writing a parser for full YAML. What I did succeed in writing was a parser for a basic subset of flow-style YAML without unquoted strings and tags, so basically JSON plus YAML-style hex integers and single and double quoted strings, giving the main advantages of YAML over JSON (including getting rid of the odious surrogate pair escapes!) without the significant indentation. The lpeg/re grammar isn't terribly large (copied from my Moonscript file):

yson_re = re.compile [===[ -- @start-re
  input <- value / not_a_value
  value <- (
      %s*
      ( string
      / number
      / array
      / object
      / 'true'  -> true
      / 'false' -> false
      / 'null'  -> null
      )
      %s*
    / %s* not_a_value
    )
  string <- ( single / double )
  single <- {| "'" ( { [^']+ } / "'" { "'" } )* "'" |} -> concat
  double <- {|
      '"' (
        { [^"\]+ }
      / { '\' ["\/bfnrt0aveN_LP %t] } -> esc
      / ( '\x' { %x^2 } -> hex_char
        / '\u' { %x^4 } -> hex_char
        / '\U' { %x^8 } -> hex_char
        )
      / bad_esc
      )* '"'
    |} -> concat
  number <- {
      '-'?
      ( '0x' %x+
      / ( '0' / [1-9] %d* )
        ( '.' %d+ )?
        ( [eE] [-+]? %d+ )?
      )
    } -> tonumber
  object <- {| '{' %s* '}' / '{' kv ( ',' kv )* '}' |} -> object
  kv <- {|
      ( %s* !string not_a_key )?
      %s* {:k: string :} %s* ( !':' bad )? ':'
      %s* {:v: value :} %s* ( ![,}] bad )?
      ( &',' !( ',' %s* string ) bad )?
    |}
  array <- {|
      '[' %s* ']'
    / '[' {| value |} ( ![],] bad )?
      ( ',' {| value |} ( ![],] bad )? )* ']'
    |} -> array
  bad_esc <- {|
      {:pos: {} :}
      {:msg: { '\' . } -> 'Unknown or invalid escape "%1"' :}
    |} => fail
  not_a_key <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected key (string) near "%1"' :}
    |} => fail
  not_a_value <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected value near "%1"' :}
    |} => fail
  bad <- {|
      %s* {:pos: {} :}
      {:msg: { %S } -> 'Unexpected "%1"' :}
    |} => fail
-- @stop-re ]===],
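
For readers who do not use lpeg, the kind of subset described above (JSON-like flow syntax plus single-quoted strings and hex integers) can be sketched in Python. This is not a port of the grammar above: the names `parse`, `tokenize` and the helper functions are hypothetical, the `\x`/`\u`/`\U` escapes are left out, and error messages carry no position information beyond the tokenizer offset.

```python
import re

# Hypothetical token pattern: quoted strings, decimal or hex numbers,
# the three literals, and punctuation. Leading whitespace is skipped.
_TOKEN = re.compile(r"""
    \s*(?:
        (?P<single>'(?:[^']|'')*')
      | (?P<double>"(?:[^"\\]|\\.)*")
      | (?P<number>-?(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE][-+]?\d+)?))
      | (?P<word>true|false|null)
      | (?P<punct>[\[\]{},:])
    )""", re.VERBOSE)

# Only a few escapes are mapped; any other escaped character stands for itself.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def tokenize(text):
    """Split the input into (kind, raw) pairs, ending with an ("end", "") marker."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"Unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("end", ""))
    return tokens


def _items(tokens, i, closer, parse_item):
    """Parse comma-separated items up to `closer`; return (items, next index)."""
    items = []
    if tokens[i] == ("punct", closer):
        return items, i + 1
    while True:
        item, i = parse_item(tokens, i)
        items.append(item)
        if tokens[i] == ("punct", closer):
            return items, i + 1
        if tokens[i] != ("punct", ","):
            raise ValueError(f"Expected ',' or '{closer}'")
        i += 1


def _pair(tokens, i):
    """Parse one `"key": value` entry; keys must be quoted strings."""
    if tokens[i][0] not in ("single", "double"):
        raise ValueError("Expected key (string)")
    key, i = _value(tokens, i)
    if tokens[i] != ("punct", ":"):
        raise ValueError("Expected ':' after key")
    value, i = _value(tokens, i + 1)
    return (key, value), i


def _value(tokens, i):
    kind, raw = tokens[i]
    if kind == "single":
        return raw[1:-1].replace("''", "'"), i + 1
    if kind == "double":
        unescape = lambda m: _ESCAPES.get(m.group(1), m.group(1))
        return re.sub(r"\\(.)", unescape, raw[1:-1]), i + 1
    if kind == "number":
        if "0x" in raw:
            return int(raw, 16), i + 1
        is_float = any(c in raw for c in ".eE")
        return (float(raw) if is_float else int(raw)), i + 1
    if kind == "word":
        return {"true": True, "false": False, "null": None}[raw], i + 1
    if (kind, raw) == ("punct", "["):
        return _items(tokens, i + 1, "]", _value)
    if (kind, raw) == ("punct", "{"):
        pairs, i = _items(tokens, i + 1, "}", _pair)
        return dict(pairs), i
    raise ValueError(f"Expected value near {raw!r}")


def parse(text):
    tokens = tokenize(text)
    value, i = _value(tokens, 0)
    if tokens[i][0] != "end":
        raise ValueError("Trailing input after value")
    return value
```

With these assumptions, `parse("{'n': 0x1F, 'title': 'It''s here'}")` yields a dictionary with the integer 31 and the string `It's here`, with no significant indentation involved.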

@jgm
Owner Author

jgm commented Aug 1, 2022

reStructuredText does something interesting here. They re-use definition list syntax; when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Nice thing is that we already have a nice readable syntax for that.

@bpj

bpj commented Aug 2, 2022

@jgm wrote:

reStructuredText does something interesting here. They re-use definition list syntax;

Nice!

when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Can't say I like that, since it should be legal to place a definition list as the first thing in the document, so some delimiter (three or more of some punctuation character!) would seem in order. Are ~~~ or +++ taken?

Nice thing is that we already have a nice readable syntax for that.

Yes. Some questions:

-   Would multiple "definitions" become a list?

-   and a nested "definition list" a nested mapping?

-   Would values be verbatim or be parsed as markup? If the latter there should IMO

    1.  be a way to mark a value as a raw string, maybe

        : raw

          `This is a simple raw string`{=}

        : more raw

          ```=
          This is a multi-line
          raw string
          ```

    2.  be a way to mark a nested definition list as an actual definition list in the value, maybe by giving it an attribute block, which may contain just a comment.

-   Would/could a bibliography (cf. Citations #32) be included in metadata? I think it should also use definition list syntax but have its own block delimiter (maybe @@@ if @ marks a reference as such).

-   Might it be possible to store values which look like numbers as numbers? In Lua terms val = tonumber(val) or val.

-   Might it be possible to have metadata contain booleans, and if so how would they be represented?
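
To make the last two questions concrete, here is a small Python sketch of the `tonumber(val) or val` idea, extended with one possible (purely assumed) convention for booleans: the bare words `true` and `false`. The function name `coerce_scalar` and the exact number patterns are illustrative choices, not anything djot defines.

```python
import re

# Deliberately narrow patterns, so that dates such as 2022-11-03 or words
# such as "inf" stay strings.
_INT = re.compile(r"-?\d+")
_FLOAT = re.compile(r"-?\d+\.\d+(?:[eE][-+]?\d+)?")


def coerce_scalar(text):
    """Return a bool or number if the value looks like one, else the string unchanged."""
    stripped = text.strip()
    if stripped in ("true", "false"):  # assumed boolean spelling
        return stripped == "true"
    if _INT.fullmatch(stripped):
        return int(stripped)
    if _FLOAT.fullmatch(stripped):
        return float(stripped)
    return text
```

A rule like this also shows why a raw-string marker would be needed: without one, a title that happens to be `42` or `true` could not be kept as text.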

@uvtc
Contributor

uvtc commented Aug 2, 2022

Are ~~~ or +++ taken?

~~~ currently works as a delimiter for code blocks.

@uvtc
Contributor

uvtc commented Aug 5, 2022

Are ~~~ or +++ taken?

I think the +++ would work well for metadata. It's a good punctuation character to use for a fence. It's not terribly pretty, but that's ok since metadata blocks are not terribly common, and should probably draw attention when they are present. And the + sign makes me think of something that's being added (here, the metadata).

@marrus-sh

marrus-sh commented Aug 14, 2022

my current (Makefile‐based) workflow involves cat-ing a number of YAML files onto the front of a Markdown document prior to it being read in by Pandoc. i’m not too attached to YAML as a format, but it would be nice to support append‐only solutions for providing metadata (i.e., ones which don’t require any processing of the file itself). this means:

  • the metadata can be included directly at the beginning or ending of the file (at least one of the two; ideally both)

  • a file which already has metadata can have more metadata appended, with conflicting terms resolved somehow

nested metadata is useful in my experience for namespacing, although

foo:
  bar: etaoin
  baz: shrdlu

can usually be represented as

foo-bar: etaoin
foo-baz: shrdlu

supporting lists/arrays is more important, as they are more difficult to represent through alternate means
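
As an illustration of the two points above, the following Python sketch flattens nested keys into the `foo-bar` form and merges several appended metadata blocks. The conflict policy shown (later scalars win, lists are concatenated) is only one assumed way to resolve conflicting terms, and the function names are hypothetical.

```python
def flatten(meta, prefix=""):
    """Turn {"foo": {"bar": 1}} into {"foo-bar": 1}; lists are kept as values."""
    flat = {}
    for key, value in meta.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def merge_appended(*blocks):
    """Merge metadata blocks in file order (assumed policy, see lead-in)."""
    merged = {}
    for block in blocks:
        for key, value in flatten(block).items():
            if isinstance(value, list) and isinstance(merged.get(key), list):
                merged[key] = merged[key] + value  # lists accumulate
            else:
                merged[key] = value  # later scalar overrides earlier one
    return merged
```

Lists survive flattening untouched, which matches the observation that they are the part that is hard to express through key naming alone.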

@dbready

dbready commented Nov 3, 2022

In case there is still doubt about the topic, I am highly in favor of document metadata being within the document.

Is there some reason that the comment character could not be co-opted to serve as a docstring for metadata? Comment block at the start of the document can contain whatever syntax is chosen to define key:values. In Rust, a // is a standard comment, but a /// notes a docstring, giving a cheap way to detect it. Then again, I believe many sins have been committed by utilizing comment blocks for data.

Anyway, big fan of the project, and I am waiting on the sidelines for the eventual release.

@matklad
Contributor

matklad commented Nov 3, 2022

Another option -- we already have syntax to associate arbitrary metadata with elements: attribute {.foo #bar baz="quux"} syntax. We just don't have a nice way to attach that to the document as a whole, but I think we can do something like "if the doc starts with attributes and they are followed by a blank line, the attributes belong to the document's node":

{
  author="matklad"
  date="2022-11-03"
}

# Consider using Djot for your next presentation
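
The rule proposed here ("attributes at the start of the document, followed by a blank line, belong to the document") is simple enough to sketch in Python. This only handles `key="value"` pairs, does not support braces inside values, and is not how any djot parser actually works; `leading_attributes` is a hypothetical name.

```python
import re

# key="value" pairs with backslash escapes inside the quotes
_ATTR = re.compile(r'([A-Za-z_][\w:-]*)\s*=\s*"((?:[^"\\]|\\.)*)"')
# a {...} block at the very start of the source, followed by a blank line
_LEADING = re.compile(r"\{([^{}]*)\}[ \t]*\n[ \t]*\n")


def leading_attributes(source):
    """Return (metadata, rest); metadata is empty if the rule does not apply."""
    m = _LEADING.match(source)
    if not m:
        return {}, source
    pairs = _ATTR.findall(m.group(1))
    meta = {key: re.sub(r"\\(.)", r"\1", value) for key, value in pairs}
    return meta, source[m.end():]
```

An attribute block that is directly followed by a heading, with no blank line in between, is left alone, so it can still attach to that heading.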

@jgm
Owner Author

jgm commented Nov 3, 2022

{
  author="matklad"
  date="2022-11-03"
}

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table. Not that that matters. But I suggested a metadata format like this on markdown-discuss 15 years ago.

However, I think it's important to consider what types of data will go into the metadata fields. Our attributes are just strings. But string content isn't adequate for metadata. E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

@matklad
Contributor

matklad commented Nov 3, 2022

E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

My gut response here would be to leave these kinds of metadata to the processors. Eg,

# Title With _Inlines_

::: abstract

some table or what not

::: 

and let the specific renderer interpret abstract as metadata, and pull the title in there as well.

@dbready

dbready commented Nov 3, 2022

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table

I love having a way to serialize data without a new bespoke syntax. One nice thing about Markdown documents that embed YAML/TOML in the preface is that I can easily read/export that format without a new parser. Lua tables (with nil) feels great.

@bpj

bpj commented Nov 3, 2022

I like the idea about using attribute syntax a lot, but less so the idea that it be a Lua table. Would that mean that Lua escapes are legal in the string? I assume \<punct> escapes are already legal in attributes, while Lua only supports \" \' \\, and what about \n and the like? In fact Lua table syntax isn't all that portable: you do need e.g. a JSON library to exchange data with other languages.

@jgm
Owner Author

jgm commented Nov 3, 2022

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

@chrisjsewell

Heya, just my two cents 😅

I think it might be helpful to compare a representative "in the wild" Markdown front-matter.

I feel YAML is certainly the most "readable", but this obviously comes with the unfortunate over-complexities for parsing.
Perhaps a subset of YAML would be nice, removing some of the more problematic features, as in https://hitchdev.com/strictyaml/features-removed/ 🤔

YAML

version: 1
title: My Document
author:
- name: Author One
  affiliation: University of Somewhere
- name: Author Two
  affiliation: University of Nowhere
abstract: |
  This is my very,
  very, very, long abstract...
toc: true
format: 
  html: 
    # some comment ...
    code-fold: true
    html-math-method: katex
  pdf: 
    geometry: 
    - top=30mm
    - left=20mm

TOML

version = 1
title = "My Document"
abstract = """This is my very,
very, very, long abstract...
"""
toc = true

[[author]]
name = "Author One"
affiliation = "University of Somewhere"

[[author]]
name = "Author Two"
affiliation = "University of Nowhere"

[format.html]
# some comment ...
code-fold = true
html-math-method = "katex"

[format.pdf]
geometry = [ "top=30mm", "left=20mm" ]

Lua Table

{
  version = 1,
  title = "My Document",
  author = {
    {
      name = "Author One",
      affiliation = "University of Somewhere"
    },
    {
      name = "Author Two",
      affiliation = "University of Nowhere"
    }
  },
  abstract = [[
This is my very,
very, very, long abstract...
]] ,
  toc = true,
  format = {
    html = {
      -- some comment...
      ["code-fold"] = true,
      ["html-math-method"] = "katex"
    },
    pdf = {
      geometry = { "top=30mm", "left=20mm" }
    }
  }
}

JSON

(no comments allowed)

{
  "version": 1,
  "title": "My Document",
  "abstract": "This is my very,\nvery, very, long abstract...\n",
  "toc": true,
  "author": [
    {
      "name": "Author One",
      "affiliation": "University of Somewhere"
    },
    {
      "name": "Author Two",
      "affiliation": "University of Nowhere"
    }
  ],
  "format": {
    "html": {
      "code-fold": true,
      "html-math-method": "katex"
    },
    "pdf": {
      "geometry": [
        "top=30mm",
        "left=20mm"
      ]
    }
  }
}

@dbready

dbready commented Nov 12, 2022

If leaning on an existing format, the chief benefit is being able to read/write document metadata without a bespoke parser. Is StrictYAML codified where this would be an option in other languages? Similar problem for JSON – I think supporting comments should be a goal, but most JSON parsers do not support a comment syntax. Perhaps JSON5 is standardized enough to be considered?

Then again, djot is an entirely new format which already requires a custom parser, but it would be nice to get the metadata formatting for free.

@mcookly

mcookly commented Dec 9, 2022

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

If the metadata is a lua table, would the parser be able to evaluate functions within it?
If so, this might be a great feature for things like datetime or time-based UUIDs.
I use markdown + YAML a lot for zettelkasten notes and academic writing (with pandoc);
functional metadata can really extend a text file's use cases.

Also, I just stumbled on this project a few days ago and love its potential and vision!
Keep up the awesome work!

@dbready

dbready commented Dec 9, 2022

  1. I do not like the idea of executable code in the document. Use cases of that nature seem more appropriate to an extension mechanism. If someone wants to embed a block of code in the front-matter and evaluate it, that should be possible, but not the default.
  2. While Lua is the current implementation and being discussed as a serialization format, I do not expect Lua semantics to carry through. That is, would a Python/Javascript/Rust djot parser have to embed Lua so as to properly render a document?

@bpj

bpj commented Dec 10, 2022

I do not like the idea of executable code in the document.

Me neither, at least not by default. It might be somewhat less scary if executed in a custom environment insulated from the file system, but that might be severely limiting when you cannot load modules. An alternative might be a custom variable interpolation or even template system with limited capabilities. I have written such a processor for MoonScript/Lua but it uses Lpeg/re and as such is not appropriate for djot. Before Pandoc included lpeg/re in its Lua API I had written a parser in pure MoonScript/Lua but it was a lot of code: 700+ lines, a whole parser implementation of its own. With lpeg/re I'm down to about 300 lines not counting what is done by the lpeg/re modules, which still is at the upper bound for what I'm comfortable with inlining into a Pandoc filter. That includes a mechanism for pluggable functions and some default functions, which make up around a third of the code. I usually add around 20-60 lines of extra functions and variable data, and that's a MoonScript class, so I'm back at some 700 lines of Lua code, plus dependency on lpeg/re.

@mcookly

mcookly commented Dec 10, 2022

Leaving executable code as an extension makes sense. And if djot's parsers are moving away from lua as @dbready mentioned, embedding lua just to read metadata seems extraneous. I don't think any of the other common metadata formats allow for code execution natively, and they probably prevent this for good reason.

If metadata code execution is left to the program, then you can just pass in code through the program's custom metadata field, like pandoc's header-includes. And if djot will be adding its own native serialization format, I assume it could allow passing in code blocks / inline code through the metadata. Either way, code is not directly executed when rendering the document.

@tmke8

tmke8 commented Jan 7, 2023

There's also Hjson which looks like this:

version: 1
title: My Document
abstract:
  '''
  This is my very,
  very, very, long abstract...

  '''
toc: true
author: [
  {
    name: Author One
    affiliation: University of Somewhere
  },
  {
    name: Author Two
    affiliation: University of Nowhere
  }
],
format: {
  # some comment ...
  html: {
    code-fold: true,
    html-math-method: katex
  },
  pdf: {
    geometry: [
      "top=30mm", "left=20mm"
    ]
  }
}

It's basically json, but it doesn't require quoting keys and it has comments and nice multi-line strings.

@mcookly

mcookly commented Jan 9, 2023

Another potential choice is NestedText. It's designed to be simple to parse yet still humanly readable (based on YAML). Here's an example:

version: 1
title: My Document
abstract:
  > This is my very,
  > very, very, long abstract...
toc: true
author:
  -
    name: Author One
    affiliation: University of Somewhere
  -
    name: Author Two
    affiliation: University of Nowhere
format:
  # Some comment ...
  html:
    code-fold: true
    html-math-method: katex
  pdf:
    geometry: [ "top=30mm", "left=20mm" ]

It only has three types: dictionaries, lists, and strings. There's even a more simplified version.

@dbready

dbready commented Jan 9, 2023

Trying to think more holistically, an eventual goal of this markup is that non-programmers could adopt it in various places: blogs, academic papers, forums, etc. In which case, using an existing JSON/YAML/TOML format is a disadvantage: for a layman, it becomes a bespoke “header metadata” format different from the rest of the djot markup.

From the angle of minimizing language size, I am in favor of matklad’s suggestion to use the existing djot attribute syntax. Less for a user to learn and easier to implement a parser.

@bpj

bpj commented Jan 9, 2023

If existing djot syntax is to be used, which I think is a good idea, it is best to use definition/(un)ordered list syntax so that hierarchical structures are possible, for example multiple authors as a bullet list and the name/affiliation/email of each as a definition list.

@ffel

ffel commented Jan 10, 2023

I'm very much in favour of metadata in djot documents. In pandoc I use title, author, date, and lang nearly everywhere. Often I add references local to one document (visited web pages).

My two cents (and sort of mentioned elsewhere): I suspect native definition lists will do, possibly wrapped inside a meta (or perhaps even djot?) div:

::: meta
title
:  Title of document
author
:  Author A
:  Author B
:::

When using a designated div type (like meta above) it will be possible to not only add a metadata block at the top of the document but also add meta data in later parts of the documents (perhaps, again, the citation information of a visited web site).
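
A rough Python sketch of how a processor might collect such blocks, assuming exactly the layout shown above (a term on its own line, each definition on a line starting with `:`) and assuming that several definitions for one term form a list. The function name `parse_meta_divs` is made up for this example, and the sketch works on raw lines rather than on a djot AST.

```python
def parse_meta_divs(source):
    """Collect term/definition pairs from every `::: meta` div in the source."""
    collected, term, inside = {}, None, False
    for line in source.splitlines():
        stripped = line.strip()
        if not inside:
            # accept both "::: meta" and ":::meta" as the opening fence
            inside = stripped.replace(" ", "") == ":::meta"
        elif stripped == ":::":
            inside, term = False, None
        elif stripped.startswith(":") and term is not None:
            collected.setdefault(term, []).append(stripped[1:].strip())
        elif stripped:
            term = stripped
    # a single definition becomes a plain string, several become a list
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}
```

Because the div can occur anywhere, later blocks simply add to what earlier ones defined, which covers the case of citation details placed next to the text that uses them.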

@ffel

ffel commented Jan 15, 2023 via email

@matklad
Contributor

matklad commented Jan 15, 2023

There’s #130 which proposes dedicated syntax for document titles. Everyone except me seems to be in agreement that title should just be a metadata field, but I still just don’t see that personally :) _Obviously_ title is the element you start your doc with, both in the source code, and in stand-alone HTML (title goes to both <title> and h1) :)

@jgm
Owner Author

jgm commented Jan 15, 2023

reStructuredText has a convention that the first heading sets the title, and a "field list" after it is treated as metadata (IIRC): e.g.,

===================
Pandoc User’s Guide
===================

:Author: John MacFarlane
:Date: August 22, 2022

We don't exactly have a "field list" in djot, but perhaps we could/should steal the concept:
https://docutils.sourceforge.io/0.4/docs/ref/rst/restructuredtext.html#field-lists

@bpj

bpj commented Jan 15, 2023

I fail to see the advantage of this :key = value over a regular definition list whose children (“definitions”) may be anything, not just one-line strings. You will run into the same extensibility problem as the INI file format and, eventually, similar clunky solutions (I foresee people doing things like foo.bar.bqz = quux) and letting renderers sort it out, which is no good. It’s better to build in the possibility of an hierarchical structure (and thus namespaces) from the start, and regular lists which can be nested is clearly the way to do it. It hopefully also will avoid the possible need to quote values which is a source of irritation in INI because in most INI variants you can’t do that, and multi-line values don’t become a problem. At one point I wrote a parser for INI with section syntax in pure Lua and it wasn’t fun because of the idiotic way hierarchical structures are expressed in that format, because it’s easy to confuse branches and leaves in that syntax; let’s not fall into that trap!

:::meta
: author

  - : name

      Libero Sint

    : email

      maxime@example.org

  - : name

      Officia Ut

    : email

      id@blanditiis.example.com

  - : name

      Neque Ea

    : email

      eum@reiciendis.example.com
:::

(I hope I got my lorem generator to produce correct djot definition list syntax. You get my idea!)

It may be more whitespace than some people like, but it uses existing djot syntax in an extensible way, which is key.

Obviously lists-as-meta could (probably should) have some additional restrictions such as definitions/values either containing just a nested list or a string which is treated as a plain string rather than being parsed into a list of blocks/inlines, but it would be good if the structure as such uses the same basic syntax as regular lists.

(I moved my thought on definition list syntax from here to separate discussion in #193. I also wrote something on metadata vs. other data which doesn’t concern (meta) data structure as such in #192.)

@jgm
Owner Author

jgm commented Jan 16, 2023

Here's how metadata might look with reST style "field lists" (https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#field-lists):

# My Document

:author: Alex Kladov
:highlight-code: true
:highlight-theme: GitHub
:abstract:
  Lorem Ipsum Dolores

  Bla Bla Bla

## First section

This allows formatted and even block-level content for the fields. It does not yet support structured fields (e.g. metadata in the form of lists or structured objects). In pandoc you can have:

---
author:
- name: Sam Smith
  institution: Cal Tech
- name: Julie Wang
  institution: UCLA
...

which is quite useful. Of course we could model that as

:author:
  - :name: Sam Smith
    :institution: Cal Tech
  - :name: Julie Wang
    :institution: UCLA

or perhaps

:author:
  :name: Sam Smith
  :institution: Cal Tech
:author:
  :name: Julie Wang
  :institution: UCLA

Still, you lose some flexibility. E.g. in pandoc you can have

---
author:
- John Smith
- Julie Wang
...

which in pandoc metadata is clearly a ListMeta and not a BlockMeta with a list as its contents. For the latter, you'd write

---
author: |
  - John Smith
  - Julie Wang
...

In the field list syntax,

:author:
  - John Smith
  - Julie Wang

doesn't distinguish the two meanings distinguished above. But perhaps the "repeated key" approach does:

:author: John Smith
:author: Julie Wang

Note 1: with the "repeated key" approach to forming list metadata, you would not be able to override an earlier metadata item with a later one, as you can in pandoc.

Note 2: field lists would create an ambiguity with the symb syntax, making it impossible to express a paragraph beginning with a symb. That's probably bad. We could revisit symb syntax or find a different syntax for field lists.
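
To illustrate the "repeated key" reading, here is a minimal Python sketch that collects top-level `:key: value` fields, joins indented continuation lines into the value, and turns a repeated key into a list. It is only an assumption about how such a field list could be interpreted; nested fields are not handled and `parse_field_list` is a hypothetical name.

```python
import re

_FIELD = re.compile(r":([\w-]+):[ \t]*(.*)")


def parse_field_list(lines):
    """Read fields from the start of `lines` until the first non-field line."""
    fields = []  # [key, text] pairs in document order
    for line in lines:
        m = _FIELD.match(line)
        if m:
            fields.append([m.group(1), m.group(2)])
        elif fields and (line.startswith(" ") or not line.strip()):
            fields[-1][1] += "\n" + line.strip()  # continuation or blank line
        else:
            break
    meta = {}
    for key, text in fields:
        text = text.strip()
        if key not in meta:
            meta[key] = text
        elif isinstance(meta[key], list):
            meta[key].append(text)
        else:
            meta[key] = [meta[key], text]  # second occurrence starts a list
    return meta
```

As Note 1 says, a later `:author:` extends the list here instead of overriding the earlier value.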

@jgm
Owner Author

jgm commented Jan 16, 2023

All in all, I'm still liking a simplified YAML-ish syntax best.

@Omikhleia

Omikhleia commented May 4, 2023

Let's do a small thought experiment. Please take it with a pinch of salt.

Can I have metadata with the existing Djot reader/parser as-is without changes, and just simple tweaks to my Djot renderer to some format?

What is a metadata item? Let's go for the simplest form: a (key, value) that can possibly be defined anywhere in my document, does not affect normal output, and can be collected so my renderer may do something useful with it. The value can contain formatted text, possibly spanning multiple paragraphs.

Wait a second... I do already have a construct for that! It's called a "footnote reference"... Let's go for it, but distinguishing it from my regular footnote space: I could just "reserve" some keys by a mere naming convention... Let's use colons in those metadata key names, for instance...

[^:author:]: John Smith{.smallcaps}
[^:title:]: The _Great_ Book

Now my renderer just has to look in the footnote references for those keys-with-colons, and use their value (e.g. put that title in a running header, or whatever).

Without needing YAML, etc.
I can even have (Djot) lists and all whatnot's there! My renderer just has to use the bits of the Djot AST that it needs. It's quite straightforward...

So... Problem solved!

But wait again... I could still actually refer to those weird pseudo-footnotes in the flow of my text. Why not, no problem, and this might actually even be handy...

As the author[^:author:] said....

Ahem! Thinking further, Djot has these small loosely-defined things called "symbols" too... I don't really need emojis or whatever they were supposed to be... So let's assume my renderer could actually resolve these symbols using my metadata footnote references?

.... And suddenly, I went beyond just having metadata support...
I also got templating with recursive variable substitution available... Saving myself the need for a pre-processing step in my workflow:

[^:author-firstname:]: John
[^:author-lastname:]: Smith{.smallcaps}
[^:author:]: :author-firstname: :author-lastname:

By the way, my name is :author:, pleased to meet you!

Nifty.

  • No change to the Djot reader/parser, and just a few reasonable changes to my Djot writer/converter (or a filter in-between)
  • No addition of specific keywords or weird formats...

What could go wrong here? 🤣

One could argue that using footnote syntax for this stuff is bad semantics. Quite right, possibly... but this is a lightweight markup language, so heh, after all... And if one wanted really distinct markup for different things, it's no longer lightweight, and it does already exist... it's called XML 😁
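
The thought experiment can be sketched in a few lines of Python working on the source text directly (a real filter would work on the Djot AST instead). The names `collect_metadata` and `resolve` are hypothetical, values are limited to a single line, and unknown or cyclic symbols are simply left untouched, which is an assumption rather than anything specified.

```python
import re

# [^:key:]: value  (footnote definitions whose label is wrapped in colons)
_DEFINITION = re.compile(r"^\[\^:([\w-]+):\]:[ \t]*(.*)$", re.MULTILINE)
# :key:  (symbol syntax)
_SYMBOL = re.compile(r":([\w-]+):")


def collect_metadata(source):
    """Map each reserved footnote label to its one-line value."""
    return {m.group(1): m.group(2).strip() for m in _DEFINITION.finditer(source)}


def resolve(text, meta, seen=()):
    """Replace :key: symbols with metadata values, recursively."""
    def substitute(match):
        key = match.group(1)
        if key not in meta or key in seen:
            return match.group(0)  # unknown or cyclic reference: keep as written
        return resolve(meta[key], meta, seen + (key,))
    return _SYMBOL.sub(substitute, text)
```

The `seen` tuple is what keeps the "recursive variable substitution" from looping forever if two keys refer to each other.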

@vassudanagunta
Contributor

@Omikhleia,

Wait a second... I do already have a construct for that! It's called a "footnote reference"... Let's go for it, but distinguishing it from my regular footnote space: I could just "reserve" some keys by a mere naming convention...

...

So let's assume my renderer could actually resolve these symbols using my metadata footnote references?

.... And suddenly, I went beyond just having metadata support... I also got templating with recursive variable substitution available... Saving myself the need for a pre-processing step in my workflow:

...

One could argue that using footnote syntax for this stuff is bad semantics.

Not at all. The form:

[key]: value

is already overloaded, used by both Reference link definitions and Footnotes, with the latter effectively carving out a key namespace with all its keys prefixed with ^.

In a meta markup language I'm working on (Plain Text Style Sheets), I've a generalized notion of reference definitions (is there a better name?) which includes key-value definitions just as you described, supporting reference links, footnotes, metadata, and automatic substitutions/macros. References can also be defined for content elements, e.g. named anchors to headings or any block/inline span, table and figure references, important term introductions/definitions, citations, index entries, glossary definitions, hashtags. Recursive resolution is also supported. Author/reader-friendly namespaces, if necessary, are easily defined by a simple character prefix, e.g. ^ for footnotes, # for hashtags, though I don't recommend too many namespaces as multiple definitions for the same base key will be confusing. Different ambiguity resolution rules are supported, e.g. first def wins (like CommonMark), scope-based (defined by section and page hierarchies) or strict/fail-fast on any name collision. I'd also like to make numbered list items automatically referenceable, e.g. for a link to "step 2" that also reflects any list item renumbering.

@bdarcus bdarcus mentioned this issue May 19, 2023
@pkulchenko

pkulchenko commented Jul 31, 2023

I'd like to add my 2c after reading this thread, as I'm very interested in this functionality and plan to integrate it into one of my projects. As far as I understand, there are two (largely independent) aspects being discussed: the location of the metadata and its format.

Location:

  • at the beginning of the file
    • supports appending
    • easy to write/process
  • after the first header (as a field list) Metadata #35 (comment)
    • consistent with some other approaches
    • -requires a header/title; some documents may not have one
  • "meta" element (or meta-lua, meta-json, etc.) Metadata #35 (comment)
    • not element-specific/dependent, so may appear multiple times in the document
    • largely reuses existing syntax
  • footnote reference Metadata #35 (comment)
    • supports appending
    • not element-specific/dependent, so may appear multiple times in the document
    • largely reuses existing syntax
    • can be used/referenced in the document itself
    • may be used for automated changes (for example, to add "updated" meta-field to the end of the document)
    • -may make it difficult to build a hierarchy (likely requires ^:element-subelement: syntax) or an array/table (requires repetitions of the same key). Repetitions may run into problems with subelements.
  • attribute syntax; those attributes that are not attached to an element or block (separated by an empty line) become document-level attributes
    • supports appending
    • not element-specific/dependent, so may appear multiple times in the document
    • largely reuses existing syntax
    • -may be difficult to build a hierarchy (similar to footnote references)
    • -doesn't allow multi-line values (for example, abstracts; although I'm not sure a meta-element is a good container for an abstract anyway)
  • additional file (just for the sake of completeness)
    • this can reuse the existing syntax, but requires a specific file extension (djmt?)

Format: various options are listed in #35 (comment) and #35 (comment). Most of the options are format-independent, so can be integrated with any of the proposed formats, but using footnote references would largely define the format as well.

I listed some of the pros/cons for each of the options (although I'm sure the list can be extended). All locations require their content to be hidden (maybe with the exception of footnote references), so may not work well with processors that don't recognize the syntax.

I find the option of using footnote references really interesting, but it's likely to suffer from difficulties expressing elements that require arrays or sub-elements (for example, multiple authors with names, email and affiliations). If there is a good way to address this, then I'd favor this option. The attribute syntax has similar advantages (and is likely a bit less verbose), but doesn't allow multi-line values.
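To make the hierarchy problem concrete, here is a minimal Python sketch of how flat keys might be folded into a structure. It assumes a hypothetical convention (not part of djot) in which `-` separates levels in a key and a repeated key turns its value into a list; `build_metadata` is an illustrative name only.

```python
def build_metadata(pairs):
    """Fold flat (key, value) pairs into nested dicts.

    Assumptions of this sketch: '-' separates hierarchy levels in a key,
    and a repeated key collects its values into a list.
    """
    meta = {}
    for key, value in pairs:
        *parents, leaf = key.split("-")
        node = meta
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{part!r} is used both as a value and as a parent")
        if leaf not in node:
            node[leaf] = value
        elif isinstance(node[leaf], list):
            node[leaf].append(value)
        else:
            node[leaf] = [node[leaf], value]
    return meta


pairs = [
    ("title", "Doc"),
    ("author-name", "Ann"),
    ("author-email", "ann@example.org"),
    ("author-name", "Bob"),
]
print(build_metadata(pairs))
```

With this input the two names end up in one list while the single email stays a plain string, so the link between an author and their email is lost. That is the kind of difficulty with repeated keys and subelements described above.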

Using the meta element is probably the most flexible one, but would require separate processing, depending on which format is selected. I'd prefer Lua tables (and there are easy ways to suppress function execution there if needed), but I can understand why other formats may need to be supported (instead or as well).

(updated 7/30 to add attribute syntax)

@Omikhleia

Nice summary @pkulchenko

additional file (just for the sake of completeness)

The content of a "metadata" block remains to be specified, with use cases largely depending on the context. It suffices to look at how scattered the use of such blocks is in existing Markdown solutions: static website and blog generators all have their own conventions, without clear namespacing. In some documents I saw sansfont, margin-xxx, etc., which is a huge conflation of styling paradigms and rendering options for specific tools.

In other words, to @jgm's initial question ("Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?"), I am tempted to answer negatively to the first point (and thus positively to the second).

@TheDecryptor

(Yet) another option would be the raw blocks; they're already set aside for special treatment by the processor.

``` =yaml
author: My Name Here
date: 2023-08-01
tags: [a, b, c]
```

# ...

It does leave the exact choice of metadata format up to the application consuming the document, which is a bit of a shame, but since it explicitly states the format you could always rewrite it easily if needed.

@toastal
Contributor

toastal commented Aug 1, 2023

YAML has such overcomplicated parsing rules. I’d be happier with something simpler but based on YAML rather than full YAML compliance if going that route—since full compliance would likely involve reliance on an entire YAML parser library as a dependency.

Personally I don’t like the ad hoc nature of reusing a code block versus something more first-class as it becomes trickier to understand that it’s special, such as for editors to suggest the block is foldable/concealable, etc., or for consistent metadata fetching. It would be ideal in many build systems to be able to call something like djot metadata --format json so outputs can be piped to other tools. If it could be in several inconsistent formats, this task becomes difficult.

@bpj

bpj commented Aug 1, 2023

It won’t do to just assume that a raw block marked with =yaml or =json or whatever is a metadata block. What if you are writing documentation for software which takes its configuration from files written in the format you have chosen as metadata format? At the very minimum you will need to use for example =meta-yaml, but overall it is much better to have a built-in metadata format in dedicated metadata blocks uniquely marked as such which the djot parser parses out of the box, and which is expressive enough from the get-go so that people aren’t tempted to come up with bespoke extensions or alternatives.
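As a rough illustration of the minimum variant, the Python sketch below collects only raw blocks whose format carries a `meta-` prefix and leaves ordinary `=yaml` blocks alone. It is a simplification and not how any djot implementation works: it only recognizes top-level fences, and `metadata_blocks` and the prefix default are hypothetical names.

```python
import re

# Opening fence of a raw block: three or more backticks or tildes, then "=format".
FENCE = re.compile(r"^(`{3,}|~{3,})\s*=(\S+)\s*$")


def is_closing(line, fence):
    """A closing fence repeats the fence character at least as many times."""
    stripped = line.strip()
    return len(stripped) >= len(fence) and set(stripped) == {fence[0]}


def metadata_blocks(source, prefix="meta-"):
    """Return (format, body) pairs for raw blocks marked with the prefix."""
    blocks = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        match = FENCE.match(lines[i])
        if not match:
            i += 1
            continue
        fence, fmt = match.group(1), match.group(2)
        body = []
        i += 1
        while i < len(lines) and not is_closing(lines[i], fence):
            body.append(lines[i])
            i += 1
        i += 1  # skip the closing fence
        if fmt.startswith(prefix):
            blocks.append((fmt[len(prefix):], "\n".join(body)))
    return blocks
```

A document containing both a `=meta-yaml` block and a `=yaml` block used as sample configuration would yield only the former. The consumer still has to bring its own YAML parser, which is part of the argument for a built-in format.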

I agree that a dependency on a full YAML library probably should be avoided, but it would be good to consider what makes YAML attractive:

  • It allows hierarchical data structures.
  • It is pleasant to the human reader because it doesn’t always require punctuation — quotes and brackets — to indicate structure.
  • To a human it pretty much reads like an intuitive mixture of bullet lists and enumerations as you would write them in plain text.

Unfortunately this human-reader friendliness comes at the price of requiring syntax rules which often are not at all intuitive to human writers in order to accommodate the “computer reader”. So what is needed is a reasonable (assuming YAML is unreasonable) compromise between those human-friendly features and features which are “computer friendly”.

However I believe that this dichotomy is a bit of a red herring: any format which is meant to be read and written by both humans and computers has to strike such a balance, including djot, which already leans heavily in the direction of human-friendliness. I have said it before: djot already can parse both key-value lists, namely definition lists, and bullet lists, so it makes sense to reuse djot list syntax for which the parsing facilities are already in place! The problem is that you probably won’t usually want metadata values (or keys) to be parsed into textual elements — emphasis, spans etc. One solution might be to mark “raw text” as raw blocks/spans with a format =text since it is probably unlikely that someone will come up with a code or markup format called “text”. The problem with this is that it means that what probably is the most common case will be specially marked. Perhaps the best solution to this is to simply not at all support markup inside metadata keys and values beyond the basic key-value/bullet item structure, which maybe can be handled by a parameter to the list parsing function(s)? If the metadata values are plain strings with any markup literally preserved the application using djot can pass individual metadata values to the djot parser as and when needed.
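A minimal sketch of that last idea in Python: the key/value structure is read from a definition list, but each value is kept as a literal string with its markup untouched. This is a simplification of djot's definition-list syntax (terms introduced by `: `, definitions indented by two spaces, no nesting), and `parse_definition_list` is an illustrative name, not an existing API.

```python
def parse_definition_list(source):
    """Map each term to its definition, kept as a plain, unparsed string."""
    meta = {}
    term = None
    for line in source.splitlines():
        if line.startswith(": "):
            term = line[2:].strip()
            meta[term] = []
        elif term is not None and (line.startswith("  ") or not line.strip()):
            if line.strip():
                meta[term].append(line.strip())
        else:
            term = None  # unindented text ends the list
    return {key: " ".join(parts) for key, parts in meta.items()}


source = """\
: title

  The _Iliad_ in translation

: author

  A. Reader
"""
print(parse_definition_list(source))
```

The title keeps its `_Iliad_` markers verbatim, so an application that wants formatting can hand just that one value to the djot parser, while everything else stays a plain string.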

@pkulchenko

Interesting comments. I've spent some time trying different options and then looking at the generated html, json and AST. To me the attribute approach looks like a winner given how concise it is compared to some other options. I also like to think about it as a way to associate attributes with the document itself instead of specific elements.

I'm interested in being able to support the following:

  • specify front matter
  • provide custom attributes to be used later in the page
  • provide document-level attributes to be used as metadata

The approach with attributes checks all these boxes for me as shown in the following example:

```
{attr1="bar and\
 baz"
 .clssy
 attr2=more}
{updated=20230801 attr2=less}
---

# title

{source="personal-experience"}
> More than three people on one
> bicycle is *not* recommended.
```

I'd recommend using --- as the first element to associate attributes with (as it looks like the existing front-matter syntax from jekyll), but it's actually optional for my proposal. I'd use the attributes from the very first element in the document with the exception of section (as associating the attributes with a header creates section/heading structure with attributes associated with the heading element). One advantage of using the thematic_break (---) is that it will get only meta attributes, whereas other elements may have their own attributes, but it's a minor consideration.

This approach allows adding and overwriting attributes (as shown above with attr2 getting less assigned instead of more) and possibly providing multi-string values (although it may require using \EOL escaping). All this information is already available in html, JSON and AST, so wouldn't require any additional processing and can accept any custom attributes.
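The add/overwrite behaviour can be sketched in a few lines of Python. This is only an approximation of djot's attribute grammar under stated assumptions: classes accumulate, a later key replaces an earlier one, braces inside quoted values are not handled, and whitespace in quoted values (including an escaped line break) is collapsed to single spaces. `merge_attributes` is a hypothetical helper, not part of any djot implementation.

```python
import re

# One attribute token: .class, #id, key="quoted value", or key=bare
TOKEN = re.compile(r'''
    \.(?P<cls>[\w-]+)
  | \#(?P<id>[\w-]+)
  | (?P<key>[\w:-]+)=(?:"(?P<quoted>(?:\\.|[^"\\])*)"|(?P<bare>[\w:-]+))
''', re.VERBOSE | re.DOTALL)


def merge_attributes(header):
    """Merge all {...} blocks in header; later keys override earlier ones."""
    meta = {"class": []}
    for block in re.findall(r"\{([^{}]*)\}", header):
        for match in TOKEN.finditer(block):
            if match.group("cls"):
                meta["class"].append(match.group("cls"))
            elif match.group("id"):
                meta["id"] = match.group("id")
            else:
                value = match.group("bare")
                if value is None:
                    # Drop backslash escapes, then collapse whitespace (sketch assumption).
                    value = re.sub(r"\\(.)", r"\1", match.group("quoted"), flags=re.DOTALL)
                    value = " ".join(value.split())
                meta[match.group("key")] = value
    return meta
```

Fed the two attribute blocks from the example above, this gives attr2 the value less, keeps clssy as a class, and joins the two lines of attr1 into one string.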

This syntax is quite forgiving in terms of quotes being optional, but it does require brackets to be on the same line as some of the text. I'll try it with a few more scenarios and report back if I run into any difficulties with it.

@Omikhleia

One advantage of using the thematic_break (---) is that it will get only meta attributes, whereas other elements may have their own attributes, but it's a minor consideration.

Just a quick remark: this is not true, thematic breaks can (and should) have real attributes, with nothing "meta" about them. In real books, no one uses a mere (full or not) rule in all circumstances. I am currently using, for instance (non exhaustively):

```
{ .dinkus }
---

{ .asterism }
---

{ .pagebreak .pendant type=floral }
---
```

I.e. styling thematic breaks (here, to possibly obtain, respectively, a centered * * *, or a floral pendant introducing a page break in print) while still preserving semantics (a thematic break indeed, so an hr-like rule or whatever is still an option for non-compliant renderers, or non-existing or non-supported styles). And though rare, there are cases when it had to occur at the start of a (sub)document. That is to say: overloading the existing thematic break with other considerations is likely the wrong approach. It has its own right to classes and attributes!

@pkulchenko

Just a quick remark: this is not true, thematic breaks can (and should) have real attributes, with nothing "meta" about them. In real books, no one uses a mere (full or not) rule in all circumstances.

I should have been more explicit; what I meant was that in this case, the thematic break is only added to separate (document) attributes from the rest of the document, so it won't have any other attributes (as it wouldn't exist in the document otherwise). Associating document attributes with any other element would lump them together with all other attributes that may already exist for that element.

@toastal
Contributor

toastal commented Aug 2, 2023

front matter

The whole concept of ’front matter’ exists because Markdown, unlike most other document/media file formats, did not provide a native way to add metadata. It’s a hack & should be avoided, not replicated.

@jgm
Owner Author

jgm commented Aug 5, 2023

The drawback of attribute syntax is that attribute values are just plain strings. Metadata like title and abstract often contain formatting, so it would be nice if they were regular djot syntax.

@toastal
Contributor

toastal commented Aug 6, 2023

Abstract, backstory, correction, epilogue, prologue

These are elements that seem like they should be in the contains-formatting category (there are likely a few more). Most of the other elements would be inside something like <meta name="x" content="y">. In these cases, formatting doesn’t make sense. I’m not sure <title> allows including elements.

@jgm
Owner Author

jgm commented Aug 6, 2023

Titles can contain emphasis (e.g. italicizing a title), superscripts and subscripts, and math, for example.

@toastal
Contributor

toastal commented Aug 6, 2023

Finally not too lazy to look at the <title> spec: its content model is text, not flow content, so it shouldn’t have any other elements inside it. Do you mean headlines?

@emilazy

emilazy commented Aug 6, 2023

HTML is not the only format in the world and Djot should not be restricted to its lowest common denominator. Many real-world document titles contain formatting that has to be mangled to fit into an HTML <title>.

@toastal
Contributor

toastal commented Aug 6, 2023 via email

@pranabekka

The title is the most prominent piece of data about the document (if any). It should not come after other metadata, nor should it require a metadata block (inside metadata).

In "An INI Critique of TOML" [0], the author differentiates between serialisation formats and configuration formats. Metadata in djot is not for serialising and sending data. This makes JSON, TOML, and similar formats unsuitable.

Also, using an external/existing format adds the cost of new syntax, as well as more parsing code, or a library. Figuring out the right subset of YAML will still require additional syntax to remember, as well as a new parser. External formats also have the downside of not (natively) supporting djot markup. Also, trying to support multiple external formats sounds like way too much overhead.

Overall, I think a metadata block using a definition list is the way to go.

0: https://github.com/madmurphy/libconfini/wiki/An-INI-critique-of-TOML

@nbonfils

Has there been any progress on this discussion?
I am currently considering djot for a knowledge base project (like a wiki), where metadata (tags, author, date, etc.) is a must.

I feel that @pkulchenko's suggestion is the nicest one because, by using attributes, it feels the most native to djot: the attributes become available to the document either by having a blank line after them or by attaching them to an optional placeholder element, such as the +++ suggested at the beginning of the thread. Like so:

```
{attr1="bar and\
 baz"
 .clssy
 attr2=more}
{updated=20230801 attr2=less}
+++

# title

{source="personal-experience"}
> More than three people on one
> bicycle is *not* recommended.
```

I think this could be a start for document-level attributes, a.k.a. metadata. As for @jgm's concern about allowing djot syntax in the title attribute, it needs to be clarified whether other attributes or metadata require this as well.
If yes, then some time needs to be spent on extending the attribute syntax.
If not, a special syntax for titles might work, perhaps expanding on my +++ placeholder proposal so that any text after it constitutes the title, like +++ My _emphasized_ document title.

@terefang

terefang commented Apr 23, 2024

sorry for the late chime in.

reading the plethora of messages and opinions so far got me thinking.

  • if one allows multiple meta-data formats like yaml, toml, json, etc – how many parser libraries does a "compliant" parser need to interface with to be actually compliant ?
  • having a concatenable format solution makes sense in the spirit of unix.
  • having a fenced solution may make sense so filter developers may roll their own to their own liking
  • there needs to be a default behavior that works as a reliable fallback that does not add anything new to the parser-syntax.

Format No1 — the proposed default:

```
# Pandoc User’s Guide

* :Author: John MacFarlane
* :Author: Johnny MacFarlane
* :Date: August 22, 2022

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus malesuada rhoncus lorem in fringilla.
```

  • if the document starts with a heading, it is used as the document's title
  • if the heading used for the document's title is followed by an item-list, it is interpreted as meta-data instead – item lists after a title are highly unlikely as normal document formatting.
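
a rough python sketch of these two rules, to show that the default needs no new parser syntax. assumptions that are not part of the proposal: values are always collected into lists so repeated keys like Author work, the title is stored under a "Title" key, and `parse_header_metadata` is just an illustrative name.

```python
import re

# One metadata item: "* :Key: value"
ITEM = re.compile(r"^\*\s+:([^:]+):\s*(.*)$")


def parse_header_metadata(source):
    """Return (metadata, body) following the two rules of Format No1."""
    lines = source.splitlines()
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1
    # Rule 1: a leading level-1 heading is the document title.
    if i >= len(lines) or not lines[i].startswith("# "):
        return {}, source
    meta = {"Title": [lines[i][2:].strip()]}
    i += 1
    while i < len(lines) and not lines[i].strip():
        i += 1
    # Rule 2: an item list directly after the title is metadata.
    while i < len(lines):
        match = ITEM.match(lines[i])
        if not match:
            break
        meta.setdefault(match.group(1).strip(), []).append(match.group(2).strip())
        i += 1
    body = "\n".join(lines[i:]).lstrip("\n")
    return meta, body
```

for the example above this would give one Title, two Author values and one Date, with the Lorem ipsum paragraph left as the body.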

Format No2 — the extension:

```
+++ [type]
* :Title: Pandoc User’s Guide - A Manual
* :Author: John MacFarlane
* :Author: Johnny MacFarlane
* :Date: August 22, 2022
+++

# Pandoc User’s Guide

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus malesuada rhoncus lorem in fringilla.
```

i am using the +++ fencing here but @jgm may decide otherwise.

  • if the first block of the document is a +++ fence, it is interpreted as meta-data instead
  • the +++ fence may optionally be followed by a type specifier
  • the only +++ fence type defined is: None – the empty type is the default, with a syntax like the above.
  • any other types are filter specific.

@uvtc
Contributor

uvtc commented Apr 24, 2024

In case it adds anything useful to this discussion, I added a comment here #293 (comment) that references this issue.
