WIP: refining/clarifying data dir functionality #4379

vassudanagunta · 2018-02-05T00:07:39Z

[EDIT: The YAML specific issues previously raised herein were addressed by #4402, allowing me to narrow this issue.]

I started by looking into #4138, #3890, #4366, #4083 and #2441, but of course that led deeper into the rabbit hole of how Hugo is supposed to work or how it should work. I believe the analysis below is worth making and important. Even if the answer is to keep everything as it is, the clarifications I make should probably make it into the Hugo documentation. But the length of my attempt at clarification below is an indication that current behavior might be too complicated. Worth repeating a little more loudly:

The length of this writeup is an indication that current behavior might be too complicated.

I chose to do a PR so I could include code that demonstrates current behavior. If it is decided to change this behavior, I could amend this commit with such changes.

Without further ado…

current Hugo behavior

Hugo loads data files into a data tree rooted in the .Site.Data variable. It translates the relative filesystem paths of each file into relative tree paths to each file's data within the tree. The last node in the tree path corresponds to the filename. Let's call this the file's "tree insertion point".

One consequence of this is that file paths are indistinguishable from data. For example:

data/a.json
{
  "b": {"c" : "d"}
}

and

data/a/b.json
{
  "c": "d"
}

both produce

{
  "a": {
    "b": {
      "c": "d"
    }
  }
}
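
To make the equivalence concrete, here is a minimal sketch in Python (not Hugo's Go code). The helper names `tree_path` and `graft` are made up for illustration, and collision handling is deliberately left out:

```python
from pathlib import PurePosixPath

def tree_path(rel_path):
    """Turn a path relative to data/, e.g. 'a/b.json', into the tree path ('a', 'b')."""
    p = PurePosixPath(rel_path)
    return p.parent.parts + (p.stem,)

def graft(tree, rel_path, data):
    """Place a file's data at its tree insertion point (no collision handling)."""
    *parents, leaf = tree_path(rel_path)
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = data
    return tree

tree_1 = graft({}, "a.json", {"b": {"c": "d"}})
tree_2 = graft({}, "a/b.json", {"c": "d"})
print(tree_1 == tree_2)  # the two layouts cannot be told apart
```

Once the data is in the tree, nothing records whether `b` came from a filename or from a key inside a file.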

Another consequence is that given multiple data files, the data can overlap. When this happens, data is combined, merged, or discarded according to precedence rules. The current rules are as follows:

Files deeper in the folder hierarchy have precedence over shallower ones.
Files in the user data directory have precedence over those in the theme's data directory.
A data file containing anything other than a string map will be grafted onto the tree at the file's insertion point. But if higher precedence data already occupies that node, even if the node is simply part of the path to that higher precedence data, the lower precedence data is entirely discarded.
A data file containing a string map will be merged into the tree at the file's insertion point. This is done by inserting its individual map entries at the tree insertion point, following the above precedence rules when there is a collision. But if higher precedence data which is not a string map claims that node, the lower precedence data is entirely discarded.
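
The four rules above can be approximated with a simplified model. This is a sketch in Python, not Hugo's implementation. The function name `load`, the `(is_user, rel_path, data)` tuple layout and the sample `colors`/`tags` files are all made up for illustration. It also assumes that user-vs-theme is compared before depth, which the rules as listed do not settle:

```python
def load(files):
    """files: list of (is_user, rel_path, data) tuples. Returns the merged data tree."""
    # Process highest precedence first, so later files can only fill gaps.
    # Assumed ordering: user before theme, then deeper before shallower.
    ordered = sorted(files, key=lambda f: (not f[0], -f[1].count("/")))
    tree = {}
    for _, rel_path, data in ordered:
        *parents, leaf = rel_path.rsplit(".", 1)[0].split("/")
        node = tree
        for key in parents:
            node = node.setdefault(key, {})
            if not isinstance(node, dict):
                break  # path is blocked by higher precedence non-map data: discard
        else:
            existing = node.get(leaf)
            if existing is None:
                node[leaf] = data  # nothing there yet: graft the data
            elif isinstance(existing, dict) and isinstance(data, dict):
                for k, v in data.items():
                    existing.setdefault(k, v)  # shallow merge: only add missing keys
            # any other combination: the lower precedence data is discarded
    return tree

tree = load([
    (False, "colors.json", {"red": "#f00", "blue": "#00f"}),  # theme
    (True, "colors.json", {"red": "#e00"}),                   # user overrides "red"
    (True, "colors/green.json", ["#0f0"]),                    # deeper file, non-map data
    (False, "tags.json", ["a", "b"]),                         # theme list
    (True, "tags.json", ["c"]),                               # user list wins, no merge
])
print(tree)
```

In this model the theme's `red` is dropped, its `blue` is added, and the theme's `tags` list is discarded because the user's list already occupies that node.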

If you want to see the code behind this, it's all in this method. But it will be a lot easier if you just look at actual Hugo inputs and outputs in the next section.

current behavior, illustrated by actual Hugo results

I constructed a scenario for which the current behavior could make sense, but also included within it an example of how it potentially breaks down or becomes confusing. The demo data files shown below are embedded in the new demo test included in this PR, and the output also shown below is encoded as expected test output (the test passes).

First, the user uses a theme designed for a music-oriented website. The theme includes some music data that it uses for genre-specific page layouts:

File 1: <theme>/data/music/genres.json

{
  "rock": {"icon": "rock.png", "palette": "metal.css"},
  "jazz": {"icon": "jazz.png", "palette": "vibrant.css"},
  "soul": {"icon": "soul.png", "palette": "fervent.css"},
  "classical": {"icon": "classical.png", "palette": "elegant.css"}
}

The user takes advantage of the "user data has precedence" rule, overriding the icon for one of the theme defined genres and also adding three new genres:

File 2: data/music/genres.json

{
  "rock": {"icon": "rock2.png"},
  "funk": {"icon": "funk.png", "palette": "funky.css"},
  "hip-hop": {"icon": "hip-hop.png", "palette": "hip.css"},
  "blue-eyed-soul": {"icon": "soul.png", "palette": "fervent"}
}

The user then takes advantage of the "deeper data file has precedence" rule, adding a new field to one of the genres:

File 3: data/music/genres/blue-eyed-soul.json

{
  "parent genre": "soul"
}

The user then adds a data file for actual music that will be listed on the site. While it references the genre data (essentially via a foreign key), it is supposed to be a separate table of data:

File 4: data/music.json

{
  "Mother": {"artist": "Pink Floyd", "genre": "rock"},
  "Freddie's Dead": {"artist": "Curtis Mayfield", "genre": "funk"},
  "Son of a Preacher Man": {"artist": "Dusty Springfield", "genre": "blue-eyed soul"}
}

Here is the resulting data tree that Hugo makes available to templates via .Site.Data (shown as JSON):

{
  "music": {
    "Freddie's Dead": {
      "artist": "Curtis Mayfield",
      "genre": "funk"
    },
    "Mother": {
      "artist": "Pink Floyd",
      "genre": "rock"
    },
    "Son of a Preacher Man": {
      "artist": "Dusty Springfield",
      "genre": "blue-eyed soul"
    },
    "genres": {
      "blue-eyed-soul": {
        "parent genre": "soul"
      },
      "classical": {
        "icon": "classical.png",
        "palette": "elegant.css"
      },
      "funk": {
        "icon": "funk.png",
        "palette": "funky.css"
      },
      "hip-hop": {
        "icon": "hip-hop.png",
        "palette": "hip.css"
      },
      "jazz": {
        "icon": "jazz.png",
        "palette": "vibrant.css"
      },
      "rock": {
        "icon": "rock2.png"
      },
      "soul": {
        "icon": "soul.png",
        "palette": "fervent.css"
      }
    }
  }
}
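
To get a feel for the shape of this tree, here is a rough Python stand-in for what a template looping over .Site.Data.music would visit. The dict is an abbreviated copy of the tree above, and the `"artist" in value` check is only a heuristic chosen for this illustration:

```python
# Abbreviated copy of the "music" node from the tree above.
music = {
    "Freddie's Dead": {"artist": "Curtis Mayfield", "genre": "funk"},
    "Mother": {"artist": "Pink Floyd", "genre": "rock"},
    "Son of a Preacher Man": {"artist": "Dusty Springfield", "genre": "blue-eyed soul"},
    "genres": {
        "funk": {"icon": "funk.png", "palette": "funky.css"},
        "rock": {"icon": "rock2.png"},
    },
}

# Roughly what a loop over the music node would encounter.
for key, value in music.items():
    kind = "song" if "artist" in value else "not a song"
    print(f"{key}: {kind}")

# A consumer that only wants songs has to filter the genre table out itself.
songs = {k: v for k, v in music.items() if "artist" in v}
```

The `genres` entry sits alongside the song titles as if it were a fourth song.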

non-obvious consequences

The non-obvious consequences are:

Data that doesn't belong in the same set can get mingled together. Grafting data files at deeper nodes in the tree can result in a potentially useful override of data inserted at shallower nodes (e.g. File 3 and File 2 respectively). But the same behavior can also result in data that should be distinct getting mixed together. File 4's song titles are mixed up with the genre list sourced from the other files. It's not obvious that data from files named data/music/genres.json and data/music.json would be mingled this way. Imagine the confusion when a template ranges over .Site.Data.music.
Merging of mapped data is "shallow", with map entries at the root of the data file being inserted or rejected wholesale. There is no attempt to merge the values of two colliding keys. Thus two maps of 10 entries each, with one overlapping key, will result in 19 entries, and the values for that one overlapping key aren't merged. You can see this in how the rock genre data in File 2 replaces rather than merges with the info in File 1. Likewise the blue-eyed-soul genre data in File 3 replaces even the non-colliding leaf data in File 2. In both cases this is the opposite of what my imaginary user expected. Though Hugo emits useful warnings when this happens, I'm not sure that makes up for the complexity and potential for confusion:
```
WARN Data for key 'blue-eyed-soul' in path 'music/genres.json' is overridden in subfolder
WARN Data for key 'rock' in path 'music/genres.json' is overridden in subfolder
```
Hugo performance. It likely complicates any solution to Data Files eat memory #1065.
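
The shallow merge arithmetic can be checked with a small sketch. This is Python, not Hugo's code, and `shallow_merge` is a made-up helper that mirrors the behavior described above:

```python
def shallow_merge(higher, lower):
    """Keep every top-level entry of `higher`; add entries of `lower` only for missing keys."""
    merged = dict(higher)
    overridden = []
    for key, value in lower.items():
        if key in merged:
            overridden.append(key)  # the lower precedence entry is dropped wholesale
        else:
            merged[key] = value
    return merged, overridden

# Two 10-entry maps sharing exactly one key ("k9").
a = {f"k{i}": i for i in range(10)}      # k0..k9
b = {f"k{i}": i for i in range(9, 19)}   # k9..k18
merged, overridden = shallow_merge(a, b)
print(len(merged), overridden)

# File 2's "rock" entry against File 1's "rock" entry.
user = {"rock": {"icon": "rock2.png"}}
theme = {"rock": {"icon": "rock.png", "palette": "metal.css"}}
result, _ = shallow_merge(user, theme)
print(result)  # the theme's "palette" is gone
```

The overlapping key is where a warning like the ones above would be emitted.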

questions

Is the behavior above as intended? Or an unintended consequence? Is the use case I provide real and common enough to justify support? Does anyone rely on this behavior? Is there some other use case I missed?
Is the flexibility for power users worth the tradeoffs in complexity, and the risk of non-power users shooting themselves in the foot? Is it worth the tradeoffs in performance?

decisions

See non-obvious consequences above for definition of intermingling vs merging.

Data intermingling

1. Keep things just as they are. Users can avoid the complexity if they want to.
2. Require all data be inserted at leaf nodes in the tree. This prevents the unintentional mingling of different data sets; every data set is in its own sandbox. Users can still override theme data, as files can still target the same leaf node. It makes it easier to support other data types (such as non-string-keyed maps) in the future.
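
One possible reading of the leaf-node requirement is a check like the following. It is a Python sketch: `find_non_leaf_insertions` is a hypothetical helper, not something Hugo provides. It flags every file whose insertion point is an interior node on the path to another file's data:

```python
def find_non_leaf_insertions(rel_paths):
    """Return (file, deeper_file) pairs where file's insertion point is not a leaf."""
    def tree_path(p):
        # 'music/genres.json' -> ('music', 'genres')
        return tuple(p.rsplit(".", 1)[0].split("/"))

    paths = {p: tree_path(p) for p in rel_paths}
    conflicts = []
    for f, fp in paths.items():
        for g, gp in paths.items():
            # f's insertion point is a strict prefix of g's tree path
            if len(gp) > len(fp) and gp[:len(fp)] == fp:
                conflicts.append((f, g))
    return conflicts

files = ["music/genres.json", "music/genres/blue-eyed-soul.json", "music.json"]
for shallow, deep in find_non_leaf_insertions(files):
    print(f"{shallow} is not at a leaf: {deep} sits below it")
```

Under this reading, the demo scenario's File 4 would be rejected for sitting above the genre files, and File 2 would be rejected for sitting above File 3.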

My recommendation is to do #2. It makes Hugo data handling far easier for users to understand. By removing the unexpected intermingling of data, costly confusion is avoided and data integrity is improved. It will make a solution to #1065 far easier.

Data merging

1. Remove support for merging. No two files can have the same tree path. Leave merge semantics to the user, in the templates. The user can use arbitrary logic to figure out how one set of data overrides or gets merged with another.
2. Keep things as they are: shallow merging.
3. Support deep merges. This would essentially add inheritance semantics to the data, addressing the issues with the rock and blue-eyed-soul genres in the example.
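
For comparison, here is a sketch of what deep merge semantics could look like for the genre data in the example. It is written in Python, `deep_merge` is a hypothetical helper, and this is only one of several reasonable definitions of a deep merge:

```python
def deep_merge(higher, lower):
    """Recursively merge two values; on a leaf collision the higher precedence value wins."""
    if not (isinstance(higher, dict) and isinstance(lower, dict)):
        return higher
    merged = dict(lower)
    for key, value in higher.items():
        merged[key] = deep_merge(value, lower[key]) if key in lower else value
    return merged

# Relevant entries from File 1 (theme), File 2 (user) and File 3 (deeper user file).
theme_genres = {"rock": {"icon": "rock.png", "palette": "metal.css"}}
user_genres = {
    "rock": {"icon": "rock2.png"},
    "blue-eyed-soul": {"icon": "soul.png", "palette": "fervent"},
}
deeper = {"blue-eyed-soul": {"parent genre": "soul"}}  # File 3 at its insertion point

genres = deep_merge(deeper, deep_merge(user_genres, theme_genres))
print(genres["rock"])
print(genres["blue-eyed-soul"])
```

With these semantics rock keeps the theme's palette while taking the user's icon, and blue-eyed-soul gains the parent genre without losing its icon and palette.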

I lean toward #1. But I am unsure of current usage or its popularity. #2, as I stated in non-obvious consequences, can result in confusion, which limits its usefulness. My gut says do no merging or go all the way. But since the user can always use whatever merge logic they want in their templates, #1 makes the most sense.

bep · 2018-08-15T07:59:33Z

This should be opened as a regular issue and not a PR.

github-actions · 2022-02-03T01:50:26Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

vassudanagunta mentioned this pull request Feb 9, 2018

parser: Fix YAML maps key type #4138

Merged

bep added this to the v0.37 milestone Feb 9, 2018

This was referenced Feb 9, 2018

Consider YAML vs datadir vs config vs non-string keys #4393

Closed

JSON, YAML and TOML data equivalency, Better array data support #4402

Merged

Add data dir behavior illuminating test case

45bb822

vassudanagunta mentioned this pull request Feb 12, 2018

Add support for multiple contentDirs #3757

Closed

bep modified the milestones: v0.37, v0.38 Feb 21, 2018

bep modified the milestones: v0.38, v0.39 Mar 20, 2018

bep modified the milestones: v0.39, v0.40 Apr 9, 2018

bep modified the milestones: v0.40, v0.41 Apr 20, 2018

bep modified the milestones: v0.41, v0.42 May 4, 2018

bep modified the milestones: v0.42, v0.43 Jun 5, 2018

vassudanagunta mentioned this pull request Jun 19, 2018

GetPage, ref and relref improvements #4796

Closed

bep modified the milestones: v0.43, v0.44 Jun 30, 2018

bep modified the milestones: v0.44, v0.45, v0.46 Jul 10, 2018

bep modified the milestones: v0.46, v0.47, v0.48 Aug 3, 2018

bep closed this Aug 15, 2018

vassudanagunta mentioned this pull request Nov 2, 2018

Make WARN the new default log log level #5389

Merged

bep mentioned this pull request Nov 3, 2018

Deprecate overlapping /data merges (or something) #5397

Closed

github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2022
