support data sources #1003

Merged
merged 1 commit into from Oct 1, 2013

@liufengyun
Contributor

Data Source enables you to load data, serialized as YAML, from a directory in Jekyll's source: _data.

Here's an example to illustrate:

In my _data directory, I have members.yml, which contains the following:

- name: Ben Balter
  company: GitHub
  location: Washington D.C.
- name: Parker Moore
  location: "Ithaca, NY"

Jekyll loads this data (in this case, an Array of Hashes) into site.data.members (note that the namespace is based on the filename) which can be used in Liquid thusly:

{% for member in site.data.members %}
  <p>
    {{ member.name }} lives in {{ member.location }}
    {% if member.company %} and works for {{ member.company }}{% endif %}
  </p>
{% endfor %}

This allows us to input arbitrary data into our templates without the use of _config.yml and it is re-read each time a file changes when running jekyll with --watch.
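
For illustration, the shape of the data that ends up in site.data.members can be reproduced with Ruby's standard yaml library. This is only a sketch under the assumption that the file is parsed as ordinary YAML; it is not the loading code from this pull request, and the variable names are made up for the example.

```ruby
require 'yaml'

# Contents of the example _data/members.yml shown above.
members_yaml = <<-EOS
- name: Ben Balter
  company: GitHub
  location: Washington D.C.
- name: Parker Moore
  location: "Ithaca, NY"
EOS

# Parsing yields an Array of Hashes with String keys.
members = YAML.safe_load(members_yaml)

# Roughly the text the Liquid loop above produces for each member.
lines = members.map do |member|
  line = "#{member['name']} lives in #{member['location']}"
  line += " and works for #{member['company']}" if member['company']
  line
end

puts lines
```

The optional company key is simply absent from the second Hash, which is why the Liquid template guards it with an if.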

  • Implement
  • Tests
  • Figure out best namespace: site.data, or data?
@swanson
Contributor
swanson commented May 6, 2013

Would love to not have to shove stuff into _config.yml to achieve this.

Here's an example use case: https://github.com/IndyStartupLab/indystartuplab.org/blob/gh-pages/_config.yml - feed data into include blocks to avoid copy-pasting every time we add a member/project.

@RohitRox
RohitRox commented Jun 6, 2013

👍 desperately waiting for this

@bumpux
bumpux commented Jul 15, 2013

+1 This will get me out of current solution and living in Jekyll

@parkr
Member
parkr commented Jul 16, 2013

This is a pretty cool feature. I'm not sure about the security implications for this, however. @benbalter, perhaps you could elaborate?

@benbalter
Contributor

This functionality would be a great plugin, but I don't know that the use-case is widespread enough to warrant inclusion in core, at least not at the onset. Would 80% of users use this?

The ability to get global data from something other than _config.yml would be cool. (e.g., with choosealicense's _config.yml).

I think the more Jekyll way to do that would be more transparent. Perhaps a _data directory, that automatically parses any .yml file, and exposes it as site.[filename], without the need to clutter _config.yml with all sorts of needless settings that we could just as easily detect.

From a security standpoint, would love it to be limited to local files within the repo root, at least if safe mode is on. Also, I'd stick with YAML. That's Jekyll's input language. We use it for front matter, we use it for config. We should stick with it.

Can you give some examples of use cases where external files would be needed? If I needed external data, I'd personally much rather have a build script that pulls in the datafile and vendors it to the repo so that I can version it, have a backup if the datasource goes down, etc.

@swanson
Contributor
swanson commented Jul 17, 2013

+1 on "The ability to get global data from something other than _config.yml would be cool. (e.g., with choosealicense's _config.yml)" - more examples: https://github.com/sep/letsworkhappier.com/blob/gh-pages/_config.yml https://github.com/plusjade/jekyll-bootstrap/blob/master/_config.yml

I agree on the remote data/YAML - I think that use case is beyond the 80%.

@liufengyun
Contributor

@benbalter The idea of a _data directory and auto parsing any .yml sounds great. That's much cleaner without tedious settings in _config.yml.

I believe local yaml data files will be welcomed by most jekyll users, while remote data or non-yaml files may be very special use cases.

Though remote data & non-yaml files may be beyond the 80%, I'd like jekyll to support it at least in unsafe mode, so that jekyll is not closed, but open with many interesting possibilities. For example, use data directly from database to generate the site. And it's only about 15 lines of code to support the feature.

@parkr
Member
parkr commented Jul 17, 2013

I like the idea of a _data dir very much. Instead of directly on site, I might move it to site.data instead or a new variable called data altogether, point being that we should namespace this stuff.

I'd like to start first with just local data. An additional concern is where to put references to remote data. In the YAML files? In _config.yml? If we start with just local data, we can add in that complexity later.

@swanson
Contributor
swanson commented Jul 17, 2013

I haven't looked at the internals - but one annoying bit about putting stuff in _config.yml is that changes are not picked up with the watch flag and you have to manually restart the server. If possible, this should be avoided in a new _data directory setup.

@liufengyun
Contributor

@parkr I agree local yaml files in _data directory would be a good start.

Regarding the namespace issue, a new global variable data does not seem good; site.data is acceptable. But I think it's better to leave it to users: it's up to end users to avoid conflicts with reserved config vars. Because in most use cases a site only has one or two YAML files (several at most), it's easy to avoid naming conflicts.

Regarding remote data sources, I think they should be defined in _config.yml. For remote data sources (e.g. a database), it's impossible to watch for changes without restarting the server.

@parkr parkr referenced this pull request in github/choosealicense.com Jul 26, 2013
Open

I18N #68

@liufengyun
Contributor

@parkr @benbalter I've updated the pull request to only autoload yaml files under _data directory.

The jekyll engine will autoload all YAML files (ending with .yml or .yaml) under _data. If there's a file members.yml under the directory, then the user can access the contents of the file through site.members.
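
The autoloading described here can be sketched as a standalone function that maps each file's basename to its parsed contents. The function name load_data_dir, the temporary directory, and the use of YAML.safe_load on the file contents are assumptions made for this sketch, not the code under review.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical standalone loader: returns a Hash mapping each YAML file's
# basename (without extension) to its parsed contents.
def load_data_dir(base)
  return {} unless File.directory?(base)

  data = {}
  Dir.chdir(base) { Dir['*.{yaml,yml}'] }.sort.each do |entry|
    path = File.join(base, entry)
    next if File.directory?(path)
    key = File.basename(entry, '.*')
    # Parsing with YAML.safe_load is an assumption of this sketch.
    data[key] = YAML.safe_load(File.read(path))
  end
  data
end

data = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'members.yml'), "- name: Ben Balter\n- name: Parker Moore\n")
  File.write(File.join(dir, 'languages.yaml'), "[java, ruby]\n")
  File.write(File.join(dir, 'notes.txt'), "ignored\n")
  load_data_dir(dir)
end

p data.keys.sort
```

Files with other extensions, such as notes.txt here, are never matched by the glob and so never become keys.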

@parkr parkr commented on an outdated diff Aug 30, 2013
lib/jekyll/site.rb
@@ -187,6 +189,22 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return [] unless File.exists?(base)
+ entries = Dir.chdir(base) { Dir['*.{yaml, yml}'] }
@parkr
parkr Aug 30, 2013 Member

Is the space allowed there? Can you add a test for both yaml and yml?
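
The question about the space can be checked with a small experiment. The sketch below assumes that Ruby takes brace alternatives literally, so that ' yml' (with a leading space) is treated as a different extension from 'yml'; the file names are made up.

```ruby
require 'tmpdir'

with_space, without_space = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.yaml'), "[]\n")
  File.write(File.join(dir, 'b.yml'), "[]\n")

  Dir.chdir(dir) do
    # First pattern: alternatives 'yaml' and ' yml' (note the space).
    # Second pattern: alternatives 'yaml' and 'yml'.
    [Dir['*.{yaml, yml}'].sort, Dir['*.{yaml,yml}'].sort]
  end
end

p with_space
p without_space
```

If the assumption holds, the pattern with the space misses b.yml, which is the reason for requesting a test that covers both extensions.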

@parkr parkr commented on an outdated diff Aug 30, 2013
lib/jekyll/site.rb
@@ -187,6 +189,22 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return [] unless File.exists?(base)
@parkr
parkr Aug 30, 2013 Member

I'd probably check to make sure it's a directory too:

return [] unless File.directory?(base)
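
The difference between the two checks shows up when a plain file happens to be named _data. A minimal sketch, using a temporary directory as a stand-in for the site source (File.exist? is used here because File.exists? was later removed from Ruby):

```ruby
require 'tmpdir'

exists_as_file, is_directory = Dir.mktmpdir do |source|
  # Hypothetical mistake: a plain file named _data instead of a directory.
  base = File.join(source, '_data')
  File.write(base, "oops\n")
  [File.exist?(base), File.directory?(base)]
end

p exists_as_file
p is_directory
```

An existence check alone would let the method continue and then fail on Dir.chdir, whereas the directory check returns early.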
@parkr parkr commented on an outdated diff Aug 30, 2013
lib/jekyll/site.rb
@@ -187,6 +189,22 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return [] unless File.exists?(base)
+ entries = Dir.chdir(base) { Dir['*.{yaml, yml}'] }
+ entries.delete_if { |e| File.directory?(File.join(base, e)) }
+
+ entries.each do |entry|
+ path = File.join(self.source, dir, entry)
+ key = File.basename(entry, '.*')
+ @data[key] = YAML.safe_load_file(path)
@parkr
parkr Aug 30, 2013 Member

We tend to use self.data[key] for accessing attributes on the instance.

@parkr parkr commented on an outdated diff Aug 30, 2013
site/docs/structure.md
@@ -123,6 +125,21 @@ An overview of what each of these does:
</tr>
<tr>
<td>
+ <p><code>_data</code></p>
+ </td>
+ <td>
+ <p>
+
+ Well-formatted site data should be placed here. The jekyll engine will
+ autoload all yaml files(ends with <code>.yml</code> or <code>.yaml</code>)
@parkr
parkr Aug 30, 2013 Member

Whoops! Looks like there is a space missing between files and (

@parkr parkr commented on an outdated diff Aug 30, 2013
test/test_site.rb
@@ -335,5 +335,17 @@ def generate(site)
end
end
+ context 'data directory' do
+ should 'load yaml files' do
+
+ base = File.expand_path('../fixtures', __FILE__)
+ members = {'name' => 'members', 'type' => 'yaml', 'path' => File.join(base, 'members.yaml')}
+
+ site = Site.new(Jekyll.configuration)
+ site.process
+
+ assert_equal site.data['members'].size, 2
@parkr
parkr Aug 30, 2013 Member

Can you also check to make sure what is read in is proper?

It'd also be good to make sure that site.members from site_payload is right.

@parkr
Member
parkr commented Aug 30, 2013

I desperately want this feature. I've been using it (ostensibly) in theclassnotes.github.io and wrote a short rake task to join them into the _config.yml. It'd be amazing to have it built-in :)

Additionally, we should make sure they're re-read when the contents change.

@liufengyun
Contributor

@parkr I've just refined the code according to review.

I think reload will work without any problem: if any file changes in the source directory, site.process is called, which calls site.read, which finally calls site.read_data.

@parkr parkr and 1 other commented on an outdated diff Aug 31, 2013
features/data.feature
@@ -0,0 +1,13 @@
+Feature: Data
+ In order to use well-formatted data in my blog
+ As a blog's user
+ I want to use _data directory in my site
+
+ Scenario: read YAML files in _data directory
+ Given I have a _data directory
+ And I have a "_data/languages.yaml" file that contains "[java, ruby]"
@parkr
parkr Aug 31, 2013 Member

Would you mind also adding a yml/yaml thing here? Maybe have a second file in this scenario?

@liufengyun
liufengyun Aug 31, 2013 Contributor

I've just updated this feature.

@parkr
Member
parkr commented Aug 31, 2013

Could we support subdirectories? Should we support subdirectories?

@liufengyun
Contributor

I think it can satisfy 80% of the requirements without the complications of subdirectories.

Later, if there are concrete scenarios for subdirectories, we can add that support as well.

@parkr
Member
parkr commented Aug 31, 2013

Agreed! Let's skip subdirectories for now and just read in the YAML files in _data.

This PR LGTM. @mattr-?

@parkr parkr and 1 other commented on an outdated diff Aug 31, 2013
lib/jekyll/site.rb
@@ -266,7 +284,7 @@ def post_attr_hash(post_attr)
# "tags" - The Hash of tag values and Posts.
# See Site#post_attr_hash for type info.
def site_payload
- {"site" => self.config.merge({
+ {"site" => self.data.merge(self.config).merge({
@parkr
parkr Aug 31, 2013 Member

We should probably use deep_merge here. Maybe we can setup a new method which collects the data and configs?

@liufengyun
liufengyun Aug 31, 2013 Contributor

What's the point here for deep_merge? In my mind, if there's collision of keys, then it's abnormal usage.

@parkr
parkr Aug 31, 2013 Member

If I have a _config.yml that contains:

members:
- name: Ben
  username: benbalter
- name: Parker
  username: parkr

I'd want to deep-merge it with a _data/members.yml with the following contents:

- name: Ben Balter
- name: Parker Moore

To get the output:

site.data['members']
# => [
#  {"name" => "Ben Balter", "username" => "benbalter"},
#  {"name" => "Parker Moore", "username" => "parkr"}
#]
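
What a recursive merge adds over a plain Hash#merge can be sketched as follows. This is a minimal illustration, not Jekyll's own deep_merge, and the config and data values are made up. Note that it only recurses into nested Hashes; Arrays, such as the members lists above, are replaced rather than combined element by element, so the output shown above would need an extra rule for matching entries.

```ruby
# A minimal recursive merge for nested Hashes (a sketch, not Jekyll's code).
# Values that are not both Hashes, including Arrays, are replaced by the
# value from `other`.
def deep_merge(base, other)
  base.merge(other) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Hypothetical values standing in for _config.yml and a _data file.
config = { 'author' => { 'name' => 'Ben', 'username' => 'benbalter' } }
data   = { 'author' => { 'name' => 'Ben Balter' } }

merged  = deep_merge(config, data)
shallow = config.merge(data)

p merged
p shallow
```

The shallow merge drops the username key entirely, while the recursive version keeps it and only overrides name.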
@liufengyun
liufengyun Sep 1, 2013 Contributor

I doubt if there's real-world usage of the case above. Why define a single piece of data in two different places?

I think it's up to the end user to guarantee that keys in _data will not conflict with keys in _config.yml.

@parkr
parkr Sep 1, 2013 Member

I think we should enforce best-practices to some degree, but I really think helping the user out (maybe he or she is tired or just not well-focused that day) does a world of good.

@liufengyun
liufengyun Sep 2, 2013 Contributor

OK, I've changed it to deep_merge.

@parkr parkr and 1 other commented on an outdated diff Aug 31, 2013
features/data.feature
+ I want to use _data directory in my site
+
+ Scenario: autoload *.yaml files in _data directory
+ Given I have a _data directory
+ And I have a "_data/languages.yaml" file that contains "[java, ruby]"
+ And I have an "index.html" page that contains "{% for language in site.languages %}{{language}}{% endfor %}"
+ When I run jekyll
+ Then the "_site/index.html" file should exist
+ And I should see "java" in "_site/index.html"
+ And I should see "ruby" in "_site/index.html"
+
+ Scenario: autoload *.yml files in _data directory
+ Given I have a _data directory
+ And I have a "_data/members.yml" file with content:
+ """
+ - jack
@parkr
parkr Aug 31, 2013 Member

It'd be cool to make sure hashes work as well instead of just arrays. And arrays of hashes. (Mostly that liquid exposes them properly)

@liufengyun
liufengyun Sep 1, 2013 Contributor

OK, I've refined the feature to cover arrays, hashes and arrays of hashes.

@parkr
parkr Sep 1, 2013 Member

Thank you!

@parkr
parkr Sep 1, 2013 Member

Just want to be thorough :)

@parkr parkr and 1 other commented on an outdated diff Sep 1, 2013
lib/jekyll/site.rb
@@ -187,6 +189,22 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return [] unless File.directory?(base)
@parkr
parkr Sep 1, 2013 Member

It'd be great to print out a warning message if someone specified a file instead of a directory:

unless File.directory?(base)
  Jekyll.logger.warn "The data directive specified in the configuration does not exist or is not an accessible directory."
  return Array.new
end
@liufengyun
liufengyun Sep 2, 2013 Contributor

Good point, I've added the warning.

@parkr parkr and 1 other commented on an outdated diff Sep 1, 2013
lib/jekyll/site.rb
@@ -187,6 +189,22 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return [] unless File.directory?(base)
+ entries = Dir.chdir(base) { Dir['*.{yaml,yml}'] }
+ entries.delete_if { |e| File.directory?(File.join(base, e)) }
+
+ entries.each do |entry|
+ path = File.join(self.source, dir, entry)
+ key = File.basename(entry, '.*')
@parkr
parkr Sep 1, 2013 Member

We should probably do some more sanitation here. If I have the file hello dolly.yml then it should come in as hello_dolly.

def sanitize_filename(name)
  name.gsub(/[^\w\s_-]+/, '')
      .gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
      .gsub(/\s+/, '_')
end

Then use it:

key = sanitize_filename(File.basename(entry, '.*'))
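
Applied to a few file names, the suggested helper behaves as sketched below. The entries list is hypothetical and only there for illustration; the helper itself is copied from the suggestion above and, as written with leading dots, needs Ruby 1.9 or later.

```ruby
# The helper as suggested above (leading-dot chaining needs Ruby 1.9+).
def sanitize_filename(name)
  name.gsub(/[^\w\s_-]+/, '')
      .gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
      .gsub(/\s+/, '_')
end

# Hypothetical file names, used only for illustration.
entries = ['hello dolly.yml', 'members.yaml', "who's who.yml"]
keys = entries.map { |entry| sanitize_filename(File.basename(entry, '.*')) }

p keys
```

Characters other than word characters, whitespace, underscores and hyphens are dropped first, and remaining whitespace becomes underscores, so each key is usable with Liquid's dot syntax.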
@liufengyun
liufengyun Sep 2, 2013 Contributor

Thanks for the regex code, it saves me time:-)

@mattr- mattr- and 1 other commented on an outdated diff Sep 1, 2013
jekyll.gemspec
@@ -6,7 +6,7 @@ Gem::Specification.new do |s|
s.name = 'jekyll'
s.version = '1.1.2'
s.license = 'MIT'
- s.date = '2013-07-25'
+ s.date = '2013-08-30'
@mattr-
mattr- Sep 1, 2013 Member

No need to revise this part of the gemspec. We don't update the spec (outside of the file lists) until release time.

@liufengyun
liufengyun Sep 2, 2013 Contributor

I've reverted the date.

@mattr- mattr- and 1 other commented on an outdated diff Sep 3, 2013
lib/jekyll/site.rb
@@ -187,6 +189,27 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ unless File.directory?(base)
@mattr-
mattr- Sep 3, 2013 Member

This gives me a warning even if I don't have a _data directory. I don't want to see a warning if the directory doesn't exist.

@parkr
parkr Sep 3, 2013 Member

Ah, good point. That was my suggestion! You can remove it - my b.

@mattr-
Member
mattr- commented Sep 3, 2013

The way this is implemented currently, it feels like we're forcing this on the user. I don't agree with that approach. If I have files in a _data directory, I want to be able to use that data, but if I don't have any _data directory or if I have a _data directory but no files, then I don't want to see a change in the way my site is built.

Please make any adjustments necessary to ensure an existing site that hasn't been changed to support this feature is still built in the same way that it was before.

@parkr
Member
parkr commented Sep 3, 2013

@mattr- That was my fault - I asked for the warning message without considering this. I think without this warning message, the feature is perfect, though. Thoughts?

@liufengyun
Contributor

OK, now I've disabled the warning message.

@liufengyun liufengyun commented on an outdated diff Sep 3, 2013
lib/jekyll/site.rb
@@ -382,5 +401,11 @@ def limit_posts!
def site_cleaner
@site_cleaner ||= Cleaner.new(self)
end
+
+ def sanitize_filename(name)
+ name.gsub(/[^\w\s_-]+/, '')
+ .gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
+ .gsub(/\s+/, '_')
+ end
@liufengyun
liufengyun Sep 3, 2013 Contributor

Oops, 1.8.7 doesn't support this syntax.
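
One way around this is to put the dot at the end of each line instead of the start, which older parsers also read as a continued expression. A sketch of the same helper in that form (whether this is the exact change made in the pull request is not shown here):

```ruby
# Trailing-dot chaining: each line ends with '.', so the parser knows the
# expression continues. This form does not rely on Ruby 1.9's leading dots.
def sanitize_filename(name)
  name.gsub(/[^\w\s_-]+/, '').
    gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2').
    gsub(/\s+/, '_')
end

# Two spaces collapse to one before being turned into an underscore.
key = sanitize_filename('hello  dolly')
p key
```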

@benbalter
Contributor

Related: @rypan's jekyll-db is another great example of using Jekyll for data. Supporting cool efforts like that could be killer if this were part of core.

@liufengyun
Contributor

At chaos lab, we are also using posts for data: http://chaos-lab.com/toolbox/

With this feature implemented, we can safely move the data to yaml.

@parkr
Member
parkr commented Sep 3, 2013

Posts as a database is a really great idea but I think this feature is a bit more "visible" as it regards database-like work with Jekyll. jekyll-db is great!

@parkr parkr commented on an outdated diff Sep 3, 2013
lib/jekyll/site.rb
@@ -187,6 +189,23 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return unless File.directory?(base)
@parkr
parkr Sep 3, 2013 Member

Additionally, this directory cannot be a symlink if in safe mode.

@parkr parkr and 1 other commented on an outdated diff Sep 3, 2013
lib/jekyll/site.rb
@@ -187,6 +189,23 @@ def read_drafts(dir)
end
end
+ # Read and parse all yaml files under <source>/<dir>
+ #
+ # Returns nothing
+ def read_data(dir)
+ base = File.join(self.source, dir)
+ return unless File.directory?(base)
+
+ entries = Dir.chdir(base) { Dir['*.{yaml,yml}'] }
+ entries.delete_if { |e| File.directory?(File.join(base, e)) }
@parkr
parkr Sep 3, 2013 Member

The entries cannot be symlinks if in safe mode.

@liufengyun
liufengyun Sep 4, 2013 Contributor

OK, now symlinks are filtered in safe mode and tests passed.
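
The filtering described here can be sketched roughly as follows. The helper name data_entries and its safe parameter are assumptions for this example rather than the code in the pull request, and File.symlink may not be available on every platform.

```ruby
require 'tmpdir'

# Hypothetical helper: list YAML entries in `base`, dropping directories
# and, when `safe` is true, symlinks as well.
def data_entries(base, safe)
  entries = Dir.chdir(base) { Dir['*.{yaml,yml}'] }
  entries = entries.reject { |e| File.directory?(File.join(base, e)) }
  entries = entries.reject { |e| File.symlink?(File.join(base, e)) } if safe
  entries.sort
end

safe_list, unsafe_list = Dir.mktmpdir do |root|
  data_dir = File.join(root, '_data')
  Dir.mkdir(data_dir)
  File.write(File.join(data_dir, 'real.yml'), "- a\n")

  # A file outside _data, reachable only through a symlink.
  outside = File.join(root, 'outside.yml')
  File.write(outside, "- secret\n")
  File.symlink(outside, File.join(data_dir, 'linked.yml'))

  [data_entries(data_dir, true), data_entries(data_dir, false)]
end

p safe_list
p unsafe_list
```

In safe mode the symlinked entry is skipped, so a repository cannot use _data to pull in files from outside its own tree.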

@penibelst
Member

Is data from _data/records.yaml available under site.records or site.data.records?

@parkr
Member
parkr commented Sep 3, 2013

@penibelst Data from _data/records.yaml is available under site.records at the moment.

@penibelst
Member

@parkr Imagine you use _data/records.yaml for a while. One day you find a plugin, that implements a killer feature you want so bad. But the configuration for this plugin must sit in site.records too. Now you have a conflict.

@parkr
Member
parkr commented Sep 3, 2013

@penibelst A good supposition, but I'm not sure it'd be frequent enough to cause conflicts. I'd be happy to see it namespaced under site.data, but that could conflict as well ;)

@parkr
Member
parkr commented Sep 3, 2013

Thinking about it again, we should namespace under site.data the same way we namespace site.posts. @liufengyun @benbalter @mattr- what do you think?

@rebelzach

It's an exciting prospect to have the site.data hash available. This seems like a good way for plugins to make information available in Liquid. Currently the only way I've seen is to monkey patch the site_payload method.

@penibelst
Member

@parkr It must not be the word “data”. When it’s already taken, use another one (silo?). Remember the slogan

Simple
No more data-bases, comment moderation, or pesky updates to install—just your content.

@mattr-
Member
mattr- commented Sep 3, 2013

Data is data. Let's call things what they are rather than make up fancy names for them.


@troyswanson
Member

+1 for site.data.[file]. If the _posts folder populates the site.posts variable, then the _data folder should populate the site.data variable, right?

@rebelzach

I know you ditched the subdirectory discussion, but I like the idea of giving the user the choice of how the hierarchy is structured. So _data/authors.yml becomes site.authors and _data/events/venues.yml becomes site.events.venues. While redundant the user could still create _data/data/records.yml and use site.data.records
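
The proposed mapping from paths to nested keys could look something like the sketch below. Subdirectory support is not part of this pull request, so the helper nest_data and the sample values are purely hypothetical.

```ruby
# Hypothetical helper: store `value` in `tree` under keys derived from a
# path relative to _data, e.g. 'events/venues.yml' => tree['events']['venues'].
def nest_data(tree, relative_path, value)
  parts = relative_path.sub(/\.ya?ml\z/, '').split('/')
  leaf = parts.pop
  # Walk (and create) one nested Hash per directory component.
  node = parts.inject(tree) { |hash, part| hash[part] ||= {} }
  node[leaf] = value
  tree
end

site = {}
nest_data(site, 'authors.yml', ['ben', 'parker'])
nest_data(site, 'events/venues.yml', ['hall'])
nest_data(site, 'data/records.yml', [1, 2])

p site
```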

@parkr
Member
parkr commented Sep 3, 2013

@rebelzach We'll have to return to that way later. It's far more complicated.

Seems like namespacing under site.data will cause the fewest conflicts. Thus, if I have _data/venues.yml then I access it through site.data.venues.
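
The conflict being avoided can be sketched with two plain Hashes. The values are hypothetical: a plugin's settings in _config.yml and a data file that happens to share the name records.

```ruby
# Hypothetical values: a plugin's settings in _config.yml and a data file
# that happens to share the name "records".
config = { 'records' => { 'per_page' => 10 } }
data   = { 'records' => ['first', 'second'] }

# Flat: data and config share one namespace, so one of them is hidden.
# With data.merge(config), the config value wins and the file is lost.
flat = data.merge(config)

# Namespaced: file data lives under the 'data' key only.
namespaced = config.merge('data' => data)

p flat['records']
p namespaced['records']
p namespaced['data']['records']
```

With the namespace, both the plugin setting and the data file stay reachable, at the cost of the longer site.data.records path.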

@liufengyun
Contributor

It depends on how often we expect conflicts in the real world, and whether it's hard to resolve possible conflicts.

In most cases plugins use a name that's application-agnostic, but data file names are mostly application-specific. It's rare that they can conflict. So in my mind, conflicts are very rare, and even if one happens it's easy to avoid by renaming the files.

The overhead of site.data is, imagine we've a _data/projects.yml, we've to use site.data.projects instead of site.projects. The latter seems simpler and more straight-forward.

So I prefer to plug the data to site directly instead of site.data.

@imathis
Contributor
imathis commented Sep 4, 2013

For what it's worth I love the idea of enforcing a namespace based on the file name. This would be a great way for third-party plugins to handle configuration. "Just copy awesome_plugin.yml to the _data directory and change the defaults to be what you want." Then plugin authors could read settings from site.data.awesome_plugin. This would be far better than the current method of polluting globals in _config.yml. Yes I know it's possible for plugin authors to define namespaces in the _config.yml but it's harder to detect collisions in a yaml file, and obvious when you're dealing with the filesystem.

With the use case you've defined, _data is a great name for this, but it's not as great for plugins configuration. Still it's better than what we have.

@penibelst
Member

@liufengyun:

The overhead of site.data is, imagine we've a _data/projects.yml, we've to use site.data.projects instead of site.projects. The latter seems simpler and more straight-forward.

It’s not how short it is, it’s about readability and explicit logic. It is logically more straight-forward if _data/projects.yml becomes site.data.projects.

@liufengyun
Contributor

@penibelst I think there're three more reasons to prefer site over site.data:

  1. Consistency. If things in _config.yml are not loaded as site.config, why should things in _data/ be loaded as site.data?
  2. Semantics. Conceptually, posts, configurations, yml data files are all data, everything that site exposes is data. It's a little inelegant to pull out part of them and put them under site.data.
  3. Usage. It's possible to have a _data/rdiscount.yml or _data/kramdown.yml file that specifies rdiscount or kramdown settings. If we use site.data, it's impossible to do so.
@maul-esel
Contributor

Why not put it in data directly?

@liufengyun: Is it really possible to put configs in data files? You can retrieve them from liquid in the same way, but I'd think those settings would not be used internally. And if so, that would be an argument against putting them in site IMO, because it clutters configuration over multiple files.

@StevenBlack

@maul-esel is right: site.data fragments site configuration.

Moreover we haven't yet discussed the mutability of data.

For example, conceptually speaking, if it's site.data then page.data and post.data would need to follow the semantics and constraints of the cascade from site to page and post. More than likely we want page.data to mutate the site data store, not clobber it.

With a distinct namespace we're free to treat data separately outside the site -> page override. More than likely we won't want to override; we'll want to extend or mutate existing data, which could be substantial in volume.

@parkr
Member
parkr commented Sep 4, 2013

The argument that hasn't been tackled is more along the lines of reducing conflict. With site.data, we reduce conflict. If we decide that we want to support a feature that requires site.projects (for example), your _data/projects.yml will

I'm very, very strongly in favour of site.data.<filename>. It's:

  1. in the _data directory. it's the site's data. site.data makes way more sense semantically
  2. better for reducing conflicts
  3. easier for iteration
  4. not for internal/site configuration so its relation to _config.yml is irrelevant. keep jekyll configs in _config.yml - we would have specified _config/ if we wanted to allow for this instead

We can't introduce a new global, data, as that will also cause significant conflicts. Think of how many people would think to do this (hint: metric tons):

{% capture data %}{{post.author}} posted on {{post.date}} with categories {% for c in post.categories %}{{ c }} {% endfor %}{% endcapture %}
{{ data }}

CRAP now everything Jekyll read in for you from your _data dir has been overridden. site is reserved by Jekyll and has been for years. site.data isn't a thing unless a third-party plugin has monkey-patched the site_payload method and has added this particular key (not something more specific to the plugin, which would make sense).

@imathis
Contributor
imathis commented Sep 4, 2013

@parkr, right on. Fragmentation is a non issue. In fact having separate yaml files makes diffs more focused and meaningful.

I do wish there was a nice way to do something similar for plugin configuration, but perhaps it's better to encourage a name spacing configs as a best practice, rather than creating a new system for it. I could go either way on that.

@parkr
Member
parkr commented Sep 4, 2013

@imathis I could definitely see expanding this concept out to a _config/ directory, but that's a conversation for another time and place. :)

In terms of plugin configuration, I'd have each plugin provide a file to put in _data and access site.data.<plugin_name>.

@liufengyun
Contributor

@parkr It seems site.data wins most support, and that's a solution that I can accept as well. So I've just updated the pull request to use site.data.

Want to see the feature integrated and released ASAP:-)

@parkr
Member
parkr commented Sep 4, 2013

Thanks! This final implementation I'm good with. @mattr- and I are planning on releasing v1.2 tomorrow or Friday so if this gets merged before then, it'll be in the next release! If not, it'll have to wait until v1.3.

@mattr- if you get time today, please take a look!

@penibelst penibelst commented on the diff Sep 4, 2013
site/docs/structure.md
@@ -123,6 +125,21 @@ An overview of what each of these does:
</tr>
<tr>
<td>
+ <p><code>_data</code></p>
+ </td>
+ <td>
+ <p>
+
+ Well-formatted site data should be placed here. The jekyll engine will
+ autoload all yaml files (ends with <code>.yml</code> or <code>.yaml</code>)
+ in this directory. If there's a file <code>members.yml</code> under the directory,
+ then you can access contents of the file through <code>site.data.members</code>.
@penibelst
penibelst Sep 4, 2013 Member

Please change to <code>site.data.members</code>

@liufengyun
liufengyun Sep 5, 2013 Contributor

It's already <code>site.data.members</code> here, you mean something else ?

@penibelst
penibelst Sep 5, 2013 Member

I’ve seen <code>site.members</code>. My window wasn’t reloaded. Never mind.

@penibelst penibelst and 2 others commented on an outdated diff Sep 4, 2013
test/source/_data/members.yaml
@@ -0,0 +1,7 @@
+- name: jack
+ age: 27
+ blog: http://test.com/jack
@penibelst
penibelst Sep 4, 2013 Member

Please change to http://example.com/jack

@parkr
parkr Sep 4, 2013 Member

our tests don't have to follow that spec :)

@liufengyun
liufengyun Sep 5, 2013 Contributor

Thanks @penibelst . I've refined the test fixture.

@penibelst
penibelst Sep 5, 2013 Member

@parkr example.com is made for examples, test.com is owned by Test.com, Inc.

Patent 6,513,042 was issued to Test.com, Inc. This patent protects Test.com’s intellectual property in the area of authoring, distributing and selling tests on the Internet.

Do you really want to hear from their legal department? I would recommend using example.com (.org, .net) for every example or test domain name.

@penibelst penibelst commented on an outdated diff Sep 4, 2013
test/source/_data/members.yaml
@@ -0,0 +1,7 @@
+- name: jack
+ age: 27
+ blog: http://test.com/jack
+
+- name: john
+ age: 32
+ blog: http://test.com/john
@penibelst
penibelst Sep 4, 2013 Member

Please change to http://example.com/john

@penibelst
Member

@liufengyun Thank you very much.

@parkr site.data is a real big deal. Please write a nice release post about it. And maybe you start a new documentation page docs/data/.

@swanson
Contributor
swanson commented Sep 4, 2013

@liufengyun yes big props for this feature (and handling the mountain of back and forth going on this PR)

I, too, am very excited about this; it will be great for moving Jekyll from a "blog generator" to a "site generator"

@swanson
Contributor
swanson commented Sep 4, 2013

I'm happy to take a stab at writing a documentation page for docs/data - will send a separate PR

@maul-esel
Contributor

@parkr: I still think data would be a better namespace. If you're about to use the new feature, you might have to rename one or two of your custom variables, yes. But it's the same with data in _config.yml now.

I just think that site is already cluttered enough with site config and other site-wide vars on one side and jekyll-provided data on the other. Think about having to trace back where some data in site comes from, e.g. when maintaining or contributing to someone else's repo. You have to check whether it is

  1. Specified in the file itself, by means of overriding site (that one's unlikely, but possible),
  2. or provided by jekyll, like site.time,
  3. or in _config.yml (being either a real config for jekyll or just data for the site),
  4. or from some _data file

Whereas using a new data variable reduces it to cases 1 and 4 for data.[some_var] and 1-3 for site.[some_var].

@maul-esel
Contributor

Also, the loaded data overwrites the newly introduced config setting data in liquid, so this config could not be accessed from a page or post. With the possibility of a data key already existing, it would make sense to rename this config option to something more distinctive either way, like data_source or data_directory.

More importantly, if you really stay with site.data, be sure to make it as backwards-compatible as possible. Thus, do not overwrite a data key in _config.yml unless you really load valid files from _data. If the directory does not exist or is empty, site.data should still point to a possible key in _config.yml.

@swanson
Contributor
swanson commented Sep 5, 2013

@maul-esel could you post a link or more info about "newly introduced config setting data in liquid"? I couldn't find anything in the Liquid docs about it.

@parkr
Member
parkr commented Sep 5, 2013

Introducing a new top-level variable, data, with such a generic name cannot be done. site is where Jekyll puts everything but the page data (for the current page) and it's important to me not to pollute the top-level namespace with new features. If Jekyll read in my data, I generally expect it to be available to site.data, just like site.posts. It's about logical grouping to me. And why break a data Liquid tag plugin I may have been using (that may not be used for accessing data the way we have designed it)?

@liufengyun
Contributor

@maul-esel thanks for pointing it out.

  1. I've renamed the _data directory config from data to data_source.
  2. I've made the feature backward compatible if the user already uses site.data in _config.yml
@maul-esel
Contributor

@liufengyun: Thanks!

@parkr
Member
parkr commented Sep 26, 2013

I'd love to see this merged soon. @mattr-, what do you think of what we have so far?

@oswaldoacauan

OMG! This feature will be great!

@parkr
Member
parkr commented Sep 26, 2013

@defunkt – I think this feature would be very useful on Pages. Any thoughts?

@parkr parkr and 1 other commented on an outdated diff Sep 26, 2013
lib/jekyll/site.rb
@@ -266,13 +287,17 @@ def post_attr_hash(post_attr)
# "tags" - The Hash of tag values and Posts.
# See Site#post_attr_hash for type info.
def site_payload
+ # backward compatibility for possible usage of site.data
+ data_value = self.config['data'] || self.data
@parkr
parkr Sep 26, 2013 Member

This should be a separate method. :)

@liufengyun
liufengyun Sep 29, 2013 Contributor

Thanks @parkr , I've extracted the code to the method site_data.

@parkr parkr commented on an outdated diff Sep 26, 2013
lib/jekyll/site.rb
{"site" => self.config.merge({
"time" => self.time,
"posts" => self.posts.sort { |a, b| b <=> a },
"pages" => self.pages,
"html_pages" => self.pages.reject { |page| !page.html? },
"categories" => post_attr_hash('categories'),
- "tags" => post_attr_hash('tags')})}
+ "tags" => post_attr_hash('tags'),
+ "data" => data_value})}
@parkr
parkr Sep 26, 2013 Member

@benbalter I am starting to doubt my wish for site.data.<filename>. Do you think having the data namespaces under site makes more sense than just having data as the top-level namespace, i.e. data.<filename>?

@mattr- mattr- commented on the diff Oct 1, 2013
lib/jekyll/site.rb
@@ -252,6 +273,14 @@ def post_attr_hash(post_attr)
hash
end
+ # Prepare site data for site payload. The method maintains backward compatibility
+ # if the key 'data' is already used in _config.yml.
+ #
+ # Returns the Hash to be hooked to site.data.
+ def site_data
+ self.config['data'] || self.data
@mattr-
mattr- Oct 1, 2013 Member

Can we get a test for this bit of logic around the backward compatibility? If we're going to maintain backwards compatibility, I'd like to make sure we don't break it.

Not a blocker, so I think I'll go ahead and merge this, but it's something that would be nice to have for later. 😃

@mattr-
mattr- Oct 1, 2013 Member

Right you are! I totally missed that before. Thank you!
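
The fallback in site_data from the diff above can be sketched in isolation as follows. This is only an illustration under stated assumptions: the site_data function here is a hypothetical stand-alone stand-in that takes plain Hashes instead of a real Jekyll::Site, and the sample keys (motto, source) are made up.

```ruby
# Hypothetical stand-in for Site#site_data from the diff; plain Hashes
# replace the real Jekyll::Site object and its config.
def site_data(config, loaded_data)
  # A 'data' key in _config.yml takes priority; otherwise fall back to
  # whatever was read from the _data directory.
  config['data'] || loaded_data
end

# Data as it might look after reading _data/members.yml.
loaded = { 'members' => [{ 'name' => 'Ben Balter' }] }

legacy_config = { 'data' => { 'motto' => 'hello' } } # site already used site.data
plain_config  = { 'source' => '.' }                  # no 'data' key in _config.yml

with_legacy    = site_data(legacy_config, loaded)
without_legacy = site_data(plain_config, loaded)
```

Note that with a plain ||, a data key in _config.yml hides everything loaded from _data rather than being merged with it, which is the kind of behaviour a backward-compatibility test would want to pin down.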

@mattr-
Member
mattr- commented Oct 1, 2013

@liufengyun Could you rebase this one more time so I can merge it cleanly? Thanks! ❤️

@liufengyun
Contributor

@mattr- I've rebased to jekyll/master locally, but the master is failing https://travis-ci.org/mojombo/jekyll. Does it matter?

@liufengyun
Contributor

@mattr- I got it, you mean squash all my commits to a single commit? I'll do it right away.

@mattr-
Member
mattr- commented Oct 1, 2013

The test failures were already there from before so don't worry about those. As far as squashing goes, I don't have a preference one way or the other. Whatever you feel like doing. 😺

@liufengyun liufengyun Autoload yaml files under _data directory
The jekyll engine will autoload all yaml files (ending with .yml or .yaml)
under _data. If there's a file members.yml under the directory, then the user
can access the contents of the file through site.data.members.
760cbc7
@liufengyun
Contributor

It's done, @mattr-

@mattr- mattr- merged commit cb4d155 into jekyll:master Oct 1, 2013

1 check failed

default The Travis CI build failed
@mattr- mattr- added a commit that referenced this pull request Oct 1, 2013
@mattr- mattr- Update history to reflect merge of #1003 246825a
@mattr-
Member
mattr- commented Oct 1, 2013

🎉 Awesome! 🎉

Thank you so much for your work on this!

So happy to finally 🚢 this.

@liufengyun
Contributor

Thank you all for helping review and improve the pull request.

㊗️ Jekyll has finally entered the data era!

@parkr
Member
parkr commented Oct 1, 2013

I may or may not be 😢 with happiness. Great work @liufengyun!

@benbalter
Contributor

This is a game changer. Great stuff. 🍻 🍸 🍷 🌴 🎉 🎆

@cobyism
Member
cobyism commented Oct 8, 2013

💖

@localheinz
Contributor

@liufengyun

Has support for sub-directories been added yet?

@liufengyun
Contributor

@localheinz , sub-directories are not supported yet. Currently I think _data/ without sub-directory can satisfy most of the data requirements in site generation.

@localheinz
Contributor

@liufengyun

Was hoping now was already later.

@liufengyun
Contributor

@localheinz The _data feature was just officially released. Let's wait and see how it's received and used. If there's strong demand for sub-directory support in real-world usage, I think a pull request will be welcomed.

@benbalter
Contributor

A big 👍 for sub-folder support in the next point release if it's a light lift (and there's community adoption of the feature). I'd call it a core use case for any site with more than one data type.

If I have a site with just one data type, e.g., cars, it's fine. {% for car in site.data %}. I can safely assume anything in the _data folder is a car.

Now imagine I have cars and trucks, which I place in the _data folder. If sub-foldered, I could do {% for truck in site.data.trucks %} to iterate through trucks. Without sub-folders, it's likely more like {% for vehicle in site.data %}{% if vehicle.type == "truck" %}... (which also requires storing the type value in each vehicle, whereas before it was simply foldered).

Alternatively, could I name my yaml file trucks.pickup.yml now to have it parsed into site.data.trucks.pickup?

@liufengyun
Contributor

@benbalter I think there's a misunderstanding about the feature.

If you've a file trucks.yaml under _data/, then you can access it with {% for truck in site.data.trucks %}. No sub-directory required in this case. So you can have members.yaml, projects.yaml, products.yaml under _data, and access them respectively as site.data.members, site.data.projects and site.data.products.
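
A minimal sketch of this one-file-per-data-set mapping, in plain Ruby. It is not Jekyll's actual implementation: load_data_files is a hypothetical helper, the sample file contents are invented, and it does not attempt any sanitizing of file names. It only shows how each YAML file directly under a directory could end up under a key named after the file.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical helper: read every .yml/.yaml file directly under data_dir
# into a Hash keyed by the file name without its extension.
def load_data_files(data_dir)
  paths = Dir.glob(File.join(data_dir, '*.{yml,yaml}')).sort
  paths.each_with_object({}) do |path, data|
    key = File.basename(path, '.*') # "trucks.yaml" -> "trucks"
    data[key] = YAML.safe_load(File.read(path))
  end
end

# Build a throwaway directory standing in for _data (sample content is made up).
data = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'trucks.yaml'), "- name: Pickup\n- name: Dump truck\n")
  File.write(File.join(dir, 'members.yml'), "- name: Parker Moore\n  location: \"Ithaca, NY\"\n")
  load_data_files(dir)
end

# Rough equivalent of iterating site.data.trucks in a template.
truck_names = data['trucks'].map { |truck| truck['name'] }
```

Here data['trucks'] and data['members'] play the role of site.data.trucks and site.data.members, so each data type gets its own namespace without any sub-directory.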

Currently, if you put a file named trucks.pickup.yml under _data, then it's hooked to site.data.truckspickupyml. Dots and whitespace are removed.

@routelastresort

@benbalter: When will gh-pages support this feature? I just made a new site with 1.3, then realized that 1.2 (what the github-pages gem is at) had issues (rbenv shim version points to mine, bundle exec jekyll serve uses the gh-pages version, and production github.io as well). Obviously, my site renders fine with 1.3, but will I wait days/weeks/months for Github's version to catch up? Thanks, btw, hehe 👍

@swanson
Contributor
swanson commented Nov 7, 2013
@parkr
Member
parkr commented Nov 7, 2013

@routelastresort I was told by @benbalter that it's been pushed to the production servers.

@routelastresort

@parkr, @swanson, @benbalter - thanks! 👍 I noticed it when I pushed/visited today. You guys rock!

@localheinz
Contributor

❤️

@semireg
semireg commented Nov 15, 2013

Thank you. This has allowed me to design two-tier navigation without static front-matter. One step closer to dynamic front-matter.

https://gist.github.com/caylanlarson/7493380

semireg industries - labelscope

@Wolfr
Wolfr commented Dec 10, 2013

Just made a repo to illustrate some cases: https://github.com/Wolfr/jekyll-data-test

Feel free to fork or PR. I feel more concrete examples will help people new to Jekyll and/or YAML and/or Liquid make sense of it.

@TuckerWhitehouse TuckerWhitehouse referenced this pull request Oct 19, 2014
Closed

Remote Data #3015
