| 1 | +--- |
| 2 | +title: Modular Metadata |
| 3 | +abstract: | |
| 4 | + Proposes a reserved `metadata` property on every OXA node that carries structured information about the node's content — title, authors, affiliations, funding, licenses, identifiers, and other descriptive data. Metadata is modular, referenceable, and composable: it propagates from parent to child, can be overridden at any level of the tree, and supports cross-references between metadata entries. This RFC establishes the principles and structural conventions for metadata; the specific field definitions are deferred to future RFCs. |
| 5 | +--- |
| 6 | + |
| 7 | +This RFC introduces a `metadata` property available on every OXA content node. The property provides a structured place for descriptive information — authorship, licensing, identifiers, titles, affiliations, funding, and similar concerns — that applies to the node and, by default, to all of its descendants. |
| 8 | + |
| 9 | +The design is motivated by modular scientific publishing, where individual components of a document (a figure panel, an embedded image, a chapter) may have distinct authorship, licensing, or provenance from the containing document. Rather than requiring all metadata to live at the document root, OXA treats metadata as a contextual property that flows down the tree and can be narrowed or replaced at any node. |
| 10 | + |
| 11 | +This RFC lays out the principles of the approach. It does not define the specific metadata fields (e.g. the shape of an author object or the license vocabulary) — those will be specified in subsequent RFCs that can draw on this structural foundation. |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +Scientific documents are not monolithic. A single article may contain: |
| 16 | + |
| 17 | +- **Figures** contributed by a collaborator who is not a document author |
| 18 | +- **Panels within a figure** created by a subset of the figure's authors |
| 19 | +- **Embedded images** sourced from external works with different licenses |
| 20 | +- **Chapters** in a collection, each written by different author groups |
| 21 | +- **Datasets** with their own DOIs, funders, and data-availability statements |
| 22 | + |
| 23 | +Current document formats handle this poorly. JATS places all metadata in a single `<front>` section at the document level; there is no standard mechanism for per-component authorship or licensing. LaTeX has no native metadata model at all. Pandoc's YAML frontmatter is document-level only. |
| 24 | + |
| 25 | +OXA needs metadata that is **modular** — it must be possible to describe _any_ node in the tree with its own metadata context, independent of the document root. |
| 26 | + |
| 27 | +## Proposal |
| 28 | + |
| 29 | +### The `metadata` Property |
| 30 | + |
| 31 | +Every OXA node MAY include a `metadata` property. This is a reserved key, distinct from `data` (the general extension bucket defined in RFC0002). Where `data` is an unstructured escape hatch for tool-specific or experimental fields, `metadata` is a structured, well-defined space for descriptive information about the node's content. |
| 32 | + |
| 33 | +```typescript |
| 34 | +interface Node { |
| 35 | + type: string; |
| 36 | + children?: Node[]; |
| 37 | + value?: string; |
| 38 | + data?: Record<string, unknown>; |
| 39 | + metadata?: Metadata; |
| 40 | +} |
| 41 | +``` |
| 42 | + |
| 43 | +The `Metadata` type will be defined in detail by subsequent RFCs. For the purposes of this RFC, it is an object that may contain fields such as: |
| 44 | + |
| 45 | +```typescript |
| 46 | +interface Metadata { |
| 47 | + title?: InlineContent[]; |
| 48 | + subtitle?: InlineContent[]; |
| 49 | + authors?: (AuthorData | MetadataReference)[]; |
| 50 | + license?: LicenseData | MetadataReference; |
| 51 | + identifiers?: Record<string, string>; |
| 52 | + affiliations?: (AffiliationData | MetadataReference)[]; |
| 53 | + funding?: (FundingData | MetadataReference)[]; |
| 54 | + // ... additional fields defined by future RFCs |
| 55 | +} |
| 56 | +``` |
| 57 | + |
| 58 | +### Metadata Context and Propagation |
| 59 | + |
| 60 | +The top-level node in a tree establishes the **metadata context** for all of its descendants. Children inherit the parent's metadata unless they provide their own `metadata` property, in which case the child's metadata becomes the new context for that subtree. |
| 61 | + |
| 62 | +This is analogous to how the JATS `<front>` section describes the document — but generalized to any node in the tree. |
| 63 | + |
| 64 | +```yaml |
| 65 | +type: Document |
| 66 | +metadata: |
| 67 | + title: |
| 68 | + - type: Text |
| 69 | + value: 'Seismic Observations of the 2024 Noto Peninsula Earthquake' |
| 70 | + authors: |
| 71 | + - identifier: rowan |
| 72 | + name: Rowan Cockett |
| 73 | + orcid: 0000-0002-7859-8394 |
| 74 | + - identifier: tracy |
| 75 | + name: Tracy K. Teal |
| 76 | + orcid: 0000-0002-9180-9598 |
| 77 | + license: |
| 78 | + id: CC-BY-4.0 |
| 79 | +children: |
| 80 | + - type: Heading |
| 81 | + level: 1 |
| 82 | + children: |
| 83 | + - type: Text |
| 84 | + value: 'Introduction' |
| 85 | + - type: Paragraph |
| 86 | + children: |
| 87 | + - type: Text |
| 88 | + value: 'This document demonstrates modular metadata...' |
| 89 | + - type: Image |
| 90 | + src: 'https://example.com/seismic-map.png' |
| 91 | + metadata: |
| 92 | + authors: |
| 93 | + - xref: '@rowan' |
| 94 | + roles: |
| 95 | + - Visualization |
| 96 | + license: |
| 97 | + id: CC-BY-4.0 |
| 98 | +``` |
| 99 | + |
| 100 | +In this example: |
| 101 | + |
| 102 | +- The `Document` node establishes authorship and licensing for the entire tree. |
| 103 | +- The `Image` node overrides the metadata context: it credits a specific author with a specific role, and declares its own license. The `Heading` and `Paragraph` nodes inherit the document-level metadata. |
| 104 | +- The image's author entry uses a cross-reference (`xref: '@rowan'`) to point back to the full author definition in the document metadata, rather than duplicating the data. |
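
The propagation rule described above can be sketched as a tree walk that carries the current metadata context downward. This is a minimal illustration, not a normative algorithm: it assumes the full-replacement semantics this RFC proposes, and the names `OxaNode`, `NodeMetadata`, and `walkWithContext` are hypothetical.

```typescript
// Hypothetical minimal shapes; the real Metadata fields are deferred to future RFCs.
interface NodeMetadata {
  authors?: unknown[];
  license?: { id: string };
}

interface OxaNode {
  type: string;
  children?: OxaNode[];
  metadata?: NodeMetadata;
}

// Visit every node together with the metadata context that applies to it.
// Assumption: a node's own `metadata` fully replaces the inherited context.
function walkWithContext(
  node: OxaNode,
  visit: (node: OxaNode, context: NodeMetadata | undefined) => void,
  inherited?: NodeMetadata,
): void {
  const context = node.metadata ?? inherited;
  visit(node, context);
  for (const child of node.children ?? []) {
    walkWithContext(child, visit, context);
  }
}

const doc: OxaNode = {
  type: 'Document',
  metadata: { authors: [{ identifier: 'rowan' }], license: { id: 'CC-BY-4.0' } },
  children: [
    { type: 'Paragraph' },
    { type: 'Image', metadata: { license: { id: 'CC-BY-SA-4.0' } } },
  ],
};

const summary: Record<string, { license: string | undefined; authorCount: number }> = {};
walkWithContext(doc, (node, context) => {
  summary[node.type] = {
    license: context?.license?.id,
    authorCount: context?.authors?.length ?? 0,
  };
});
```

Under these assumptions the `Paragraph` sees the document's authors and license, while the `Image` sees only what its own `metadata` declares. That is why a node that overrides the context should restate every field it still needs.
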
| 105 | + |
| 106 | +### Principles |
| 107 | + |
| 108 | +#### 1. Modular |
| 109 | + |
| 110 | +Metadata can be attached to any node. A figure, a panel within a figure, a chapter, an embedded dataset — any node that needs its own descriptive context can carry `metadata`. This supports modular science, where components are authored, licensed, and identified independently. |
| 111 | + |
| 112 | +**Example:** A figure composed of four panels, where panel (b) was created by a different research group: |
| 113 | + |
| 114 | +```yaml |
| 115 | +type: Figure |
| 116 | +metadata: |
| 117 | + authors: |
| 118 | + - xref: '@rowan' |
| 119 | + - xref: '@tracy' |
| 120 | +children: |
| 121 | + - type: Image |
| 122 | + src: 'panel-a.png' |
| 123 | + - type: Image |
| 124 | + src: 'panel-b.png' |
| 125 | + metadata: |
| 126 | + authors: |
| 127 | + - identifier: external-collab |
| 128 | + name: J. Martinez |
| 129 | + orcid: 0000-0001-2345-6789 |
| 130 | + affiliations: |
| 131 | + - name: Universidad Nacional |
| 132 | + license: |
| 133 | + id: CC-BY-SA-4.0 |
| 134 | + - type: Image |
| 135 | + src: 'panel-c.png' |
| 136 | + - type: Image |
| 137 | + src: 'panel-d.png' |
| 138 | +``` |
| 139 | + |
| 140 | +Panels (a), (c), and (d) inherit the figure-level metadata. Panel (b) has its own authorship and a different license. |
| 141 | + |
| 142 | +#### 2. Referenceable |
| 143 | + |
| 144 | +Metadata entries can be **defined once and referenced elsewhere** in the document. Authors, affiliations, funders, and other entities are given identifiers within the metadata and can be referenced using cross-reference (`xref`) syntax from other metadata sections or from inline content. |
| 145 | + |
| 146 | +**Example:** An author defined in the document metadata and referenced in an acknowledgements section: |
| 147 | + |
| 148 | +```yaml |
| 149 | +type: Document |
| 150 | +metadata: |
| 151 | + authors: |
| 152 | + - identifier: rowan |
| 153 | + name: Rowan Cockett |
| 154 | + orcid: 0000-0002-7859-8394 |
| 155 | +children: |
| 156 | + # ... document content ... |
| 157 | + - type: Paragraph |
| 158 | + children: |
| 159 | + - type: Text |
| 160 | + value: 'In this manuscript, ' |
| 161 | + - type: CrossReference |
| 162 | + xref: '@rowan' |
| 163 | + kind: Person |
| 164 | + children: |
| 165 | + - type: Text |
| 166 | + value: 'R. C.' |
| 167 | + - type: Text |
| 168 | + value: ' conceived the study and wrote the initial draft.' |
| 169 | +``` |
| 170 | + |
| 171 | +The `@` prefix distinguishes metadata references from content references: for example, `#fig1` points to a content node such as a figure, while `@rowan` points to a metadata entry such as an author. The exact cross-reference mechanics will be defined in a future RFC on cross-references. |
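
As a rough illustration of the prefix convention, a tool might split an `xref` string into a namespace and an identifier. The function name `classifyXref` and the handling of unprefixed values are assumptions made for this sketch; the actual syntax is left to the future cross-reference RFC.

```typescript
// Hypothetical result shape for a parsed cross-reference target.
interface ReferenceTarget {
  namespace: 'metadata' | 'content';
  identifier: string;
}

// Split an xref string into namespace and identifier.
// Assumption: `@` marks metadata, `#` marks content, and an unprefixed
// value is treated as a content identifier.
function classifyXref(xref: string): ReferenceTarget {
  const prefix = xref.charAt(0);
  if (prefix === '@') {
    return { namespace: 'metadata', identifier: xref.slice(1) };
  }
  if (prefix === '#') {
    return { namespace: 'content', identifier: xref.slice(1) };
  }
  return { namespace: 'content', identifier: xref };
}

const authorTarget = classifyXref('@rowan');
const figureTarget = classifyXref('#fig1');
```
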
| 172 | + |
| 173 | +#### 3. Composable |
| 174 | + |
| 175 | +Metadata references can be **composed** — a new node can reference existing metadata entries while adding or overriding specific fields. This avoids duplication and keeps the source of truth in one place. |
| 176 | + |
| 177 | +**Example:** An image that references an existing author but adds a role specific to this context: |
| 178 | + |
| 179 | +```yaml |
| 180 | +type: Image |
| 181 | +src: 'visualization.png' |
| 182 | +metadata: |
| 183 | + authors: |
| 184 | + - xref: '@rowan' |
| 185 | + roles: |
| 186 | + - Visualization |
| 187 | + - Software |
| 188 | +``` |
| 189 | + |
| 190 | +The image does not redefine Rowan's name, ORCID, or affiliations — it references the canonical entry and layers on context-specific roles. This composition pattern means that updating the author's ORCID in the document metadata automatically propagates to all references. |
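
One way to picture composition is as a lookup followed by a shallow overlay. The sketch below assumes a flat registry keyed by identifier and treats `roles` as the only context-specific field; `AuthorEntry`, `AuthorReference`, and `resolveAuthor` are hypothetical names, and the real author shape is deferred to a later RFC.

```typescript
// Hypothetical author shapes for illustration only.
interface AuthorEntry {
  identifier: string;
  name?: string;
  orcid?: string;
  roles?: string[];
}

interface AuthorReference {
  xref: string;
  roles?: string[];
}

// Resolve a reference against the canonical entries and layer on the
// context-specific roles without mutating the canonical entry.
function resolveAuthor(
  ref: AuthorReference,
  registry: Record<string, AuthorEntry>,
): AuthorEntry {
  const identifier = ref.xref.replace(/^@/, '');
  const canonical: AuthorEntry | undefined = registry[identifier];
  if (canonical === undefined) {
    throw new Error(`Unknown metadata reference: ${ref.xref}`);
  }
  return ref.roles
    ? { ...canonical, roles: ref.roles.slice() }
    : { ...canonical };
}

const registry: Record<string, AuthorEntry> = {
  rowan: { identifier: 'rowan', name: 'Rowan Cockett', orcid: '0000-0002-7859-8394' },
};

const resolved = resolveAuthor(
  { xref: '@rowan', roles: ['Visualization', 'Software'] },
  registry,
);
```

Because the resolved entry is built at lookup time, a change to the canonical ORCID is picked up by every reference the next time it is resolved.
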
| 191 | + |
| 192 | +### Metadata Identifiers |
| 193 | + |
| 194 | +All identifiable entries within metadata (authors, affiliations, funders, grants, venues, etc.) carry an `identifier` field. These identifiers: |
| 195 | + |
| 196 | +- MUST be unique across all metadata in the document (e.g. an author and an affiliation cannot share the same identifier) |
| 197 | +- Need NOT be unique across the content of the document — a section with identifier `csf` and a metadata affiliation with identifier `csf` occupy different namespaces (content and metadata respectively) |
| 198 | +- Are referenced using the `@` prefix in cross-references (e.g. `@rowan`, `@csf`) |
| 199 | + |
| 200 | +The `@` prefix is a convention proposed by this RFC to distinguish metadata references from content references. A future cross-reference RFC will formalize the full syntax, including how to disambiguate when content and metadata identifiers overlap. |
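
The uniqueness rule lends itself to a simple validation pass. The following sketch is only one possible check: it assumes identifiable entries appear directly as values (or array items) of metadata fields, so entries nested inside other entries are not scanned, and the names `TreeNode` and `findDuplicateIdentifiers` are hypothetical.

```typescript
type MetadataBag = Record<string, unknown>;

interface TreeNode {
  type: string;
  children?: TreeNode[];
  metadata?: MetadataBag;
}

// Return every metadata identifier that is defined more than once in the tree.
// Reference entries (those with only `xref`) carry no `identifier` and are skipped.
function findDuplicateIdentifiers(root: TreeNode): string[] {
  const counts: Record<string, number> = {};

  const visit = (node: TreeNode): void => {
    const metadata = node.metadata ?? {};
    for (const key of Object.keys(metadata)) {
      const value = metadata[key];
      const entries: unknown[] = Array.isArray(value) ? value : [value];
      for (const entry of entries) {
        if (typeof entry === 'object' && entry !== null) {
          const identifier = (entry as { identifier?: unknown }).identifier;
          if (typeof identifier === 'string') {
            counts[identifier] = (counts[identifier] ?? 0) + 1;
          }
        }
      }
    }
    for (const child of node.children ?? []) {
      visit(child);
    }
  };

  visit(root);
  return Object.keys(counts).filter((id) => (counts[id] ?? 0) > 1);
}

const tree: TreeNode = {
  type: 'Document',
  metadata: {
    authors: [{ identifier: 'csf', name: 'Example Author' }],
    affiliations: [{ identifier: 'csf', name: 'Example Institute' }],
  },
  children: [{ type: 'Section' }],
};

const duplicates = findDuplicateIdentifiers(tree);
```

Here the author and the affiliation share `csf`, so the check reports it. Identifiers on content nodes would never enter the count because only `metadata` is inspected.
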
| 201 | + |
| 202 | +### Title and Subtitle |
| 203 | + |
| 204 | +Titles and subtitles are included in `metadata` as inline content arrays (`InlineContent[]`), allowing rich formatting (e.g. math, emphasis, superscripts in titles). |
| 205 | + |
| 206 | +A node's metadata title and its content are distinct concepts. A figure may have a caption (in its `children`) that differs from the metadata title inherited from the image's original source. Both can coexist: |
| 207 | + |
| 208 | +- The **metadata title** describes the node for indexing, citation, and metadata propagation purposes |
| 209 | +- The **content title** (e.g. a caption, heading) is what appears in the rendered document |
| 210 | + |
| 211 | +This distinction is useful when embedding components from external sources. An image sourced from a different publication carries its original metadata title, but the containing figure may present it with a different caption in context. |
| 212 | + |
| 213 | +Titles in metadata must be traversable by tree algorithms so that transformations (e.g. resolving cross-references within a title) can reach them. Because `metadata.title` is an array of inline nodes (the same types that appear in `children`), existing tree walkers can be extended to traverse metadata content with minimal additional complexity. |
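
A walker extended in this way might look like the sketch below. It assumes that metadata titles are visited before `children` and that traversal is on by default, both of which this RFC leaves open; `ContentNode` and `visitAll` are hypothetical names.

```typescript
// Hypothetical node shape: metadata titles reuse the same node type as children.
interface ContentNode {
  type: string;
  value?: string;
  children?: ContentNode[];
  metadata?: {
    title?: ContentNode[];
    subtitle?: ContentNode[];
  };
}

// Depth-first walk that also descends into inline content held in metadata.
// Assumption: title and subtitle are visited before the node's children.
function visitAll(node: ContentNode, visit: (node: ContentNode) => void): void {
  visit(node);
  for (const inline of node.metadata?.title ?? []) {
    visitAll(inline, visit);
  }
  for (const inline of node.metadata?.subtitle ?? []) {
    visitAll(inline, visit);
  }
  for (const child of node.children ?? []) {
    visitAll(child, visit);
  }
}

const article: ContentNode = {
  type: 'Document',
  metadata: {
    title: [
      { type: 'Text', value: 'Observations of ' },
      { type: 'Emphasis', children: [{ type: 'Text', value: 'Noto' }] },
    ],
  },
  children: [
    { type: 'Paragraph', children: [{ type: 'Text', value: 'Body text.' }] },
  ],
};

const visitedTypes: string[] = [];
visitAll(article, (node) => {
  visitedTypes.push(node.type);
});
```
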
| 214 | + |
| 215 | +### Unknown and Experimental Metadata |
| 216 | + |
| 217 | +Metadata that does not fit a defined field SHOULD be placed in the node's `data` property (RFC0002), not in `metadata`. The `metadata` property is reserved for structured, well-defined fields specified by RFCs. This keeps `metadata` predictable for tooling while preserving `data` as the extension point for experimental or tool-specific information. |
| 218 | + |
| 219 | +## Relationship to JATS |
| 220 | + |
| 221 | +The document-level `metadata` is analogous to the JATS `<front>` element, which contains `<article-meta>` with title, authors, affiliations, funding, licenses, and identifiers. Future RFCs that define the specific metadata fields SHOULD aim for mostly lossless mapping to and from JATS `<front>`, with the understanding that some JATS elements may be omitted where open alternatives exist (e.g. preferring ROR over Ringold for organization identifiers, or ORCID over proprietary author IDs)[^jats-lossless]. |
| 222 | + |
| 223 | +[^jats-lossless]: There are elements of JATS that we may choose not to include in this metadata, for example, support for non-open identifiers that have open alternatives (e.g. Ringold). |
| 224 | + |
| 225 | +The key difference from JATS is that OXA metadata is not restricted to the document root. Any node can carry `metadata`, enabling per-component attribution and licensing that JATS does not natively support. |
| 226 | + |
| 227 | +| Concern | JATS | OXA | |
| 228 | +| ----------------- | --------------------------------------------------- | ---------------------------------- | |
| 229 | +| Metadata scope | Document-level only (`<front>`) | Any node in the tree | |
| 230 | +| Author per-figure | Not natively supported | `metadata.authors` on any node | |
| 231 | +| License per-asset | `<license>` in `<permissions>`, document-level only | `metadata.license` on any node | |
| 232 | +| Identifiers | `<article-id>`, fixed vocabulary | `metadata.identifiers`, extensible | |
| 233 | +| Extension | Custom XML elements or processing instructions | `data` property (RFC0002) | |
| 234 | + |
| 235 | +## Alternatives Considered |
| 236 | + |
| 237 | +### Backmatter Node |
| 238 | + |
| 239 | +We considered a `Backmatter` block-level node that would live exactly once as the last child of any tree and contain contributor definitions, affiliations, funding information, and supporting sections (data availability, acknowledgements, etc.). |
| 240 | + |
| 241 | +This approach was rejected because: |
| 242 | + |
| 243 | +- It conflates **metadata** (descriptive information about the content) with **content** (sections like acknowledgements that are part of the narrative). Acknowledgements are content that happens to appear at the end; they belong in the tree as regular nodes, not in a special container. |
| 244 | +- It does not support per-component metadata. A `Backmatter` on the document root cannot express that a specific image has different authorship. |
| 245 | +- It raises awkward questions about where to define new metadata entries that are first introduced mid-document (e.g. an author who only contributed one figure). With the `metadata` property approach, the author can be defined where they are first relevant — either on the document node (if they should be discoverable at the top level) or on the specific component. |
| 246 | + |
| 247 | +### Metadata on CrossReference Nodes |
| 248 | + |
| 249 | +An alternative for mid-document author definitions would be to allow `CrossReference` nodes to carry `metadata` that defines new entries inline: |
| 250 | + |
| 251 | +```yaml |
| 252 | +type: CrossReference |
| 253 | +metadata: |
| 254 | + authors: |
| 255 | + - identifier: someone |
| 256 | + name: 'A. Helpful Person' |
| 257 | +xref: '@someone' |
| 258 | +children: |
| 259 | + - type: Text |
| 260 | + value: 'Person' |
| 261 | +``` |
| 262 | + |
| 263 | +While this works mechanically, it adds complexity to cross-reference semantics — a `CrossReference` would sometimes _define_ metadata rather than just _reference_ it. The simpler approach is to define all metadata entries on the appropriate container node (typically the document root) and reference them from content. This keeps the definition site predictable and avoids special-casing `CrossReference` for metadata propagation. |
| 264 | + |
| 265 | +## Open Questions |
| 266 | + |
| 267 | +- **Identifier scoping:** Should metadata identifiers be required to start with a special prefix (e.g. `person:rowan`, `org:csf`), or is the `@` reference prefix sufficient to prevent conflicts with content identifiers? Starting without type prefixes keeps the syntax lighter, but may need revisiting if collision patterns emerge. |
| 268 | +- **Propagation semantics:** When a child node provides `metadata`, does it _replace_ the parent context entirely, or _merge_ with it? Full replacement is simpler and more predictable; merging risks ambiguity about which fields are inherited vs. overridden. This RFC proposes full replacement as the default — a child with `metadata` establishes a new, complete context for its subtree. |
| 269 | +- **Metadata field definitions:** The specific shapes of author, affiliation, funding, license, and identifier objects are intentionally deferred. Future RFCs should define these, drawing on JATS, schema.org, DataCite, and CRediT for established vocabularies. |
| 270 | +- **Tree traversal of metadata content:** Metadata fields like `title` contain inline node arrays. Should tree-walking algorithms traverse `metadata` by default, or require explicit opt-in? Traversing by default ensures transformations (e.g. resolving cross-references in titles) work transparently, but increases the surface area that algorithms must handle. |
| 271 | +- **Inline metadata references:** Can metadata entries be referenced freely from inline content (e.g. an author callout in the acknowledgements)? This RFC proposes yes — a `CrossReference` node with `xref: '@rowan'` can appear anywhere in the document. The rendering of such references (e.g. expanding to the author's full name, linking to their ORCID) is a renderer concern. |
| 272 | + |
| 273 | +## Implications |
| 274 | + |
| 275 | +If accepted, this RFC: |
| 276 | + |
| 277 | +- Reserves `metadata` as a property on all OXA nodes, alongside `type`, `children`, `value`, and `data` |
| 278 | +- Establishes metadata propagation as a core tree semantic: parent metadata applies to children unless overridden |
| 279 | +- Introduces the `@` prefix convention for metadata cross-references, to be formalized in a future cross-reference RFC |
| 280 | +- Provides the structural foundation for future RFCs to define specific metadata fields (authors, licenses, identifiers, etc.) |
| 281 | +- Enables per-component attribution and licensing, supporting modular scientific publishing |
| 282 | +- Maintains a clear separation between structured metadata (`metadata`) and unstructured extensions (`data`) |
| 283 | + |
| 284 | +## Decision |
| 285 | + |
| 286 | +Acceptance of this RFC establishes the `metadata` property as a reserved, structured extension point on every OXA node, enabling modular, referenceable, and composable metadata throughout the document tree. Subsequent RFCs will define the specific metadata vocabularies (authorship, licensing, funding, identifiers) within this framework. |