Autoload csv files from data directory #2761

Merged
parkr merged 4 commits into jekyll:master from theodi:csv-data on Aug 18, 2014

5 participants

Contributor

Floppy commented Aug 16, 2014

Sometimes it's simplest to store data in CSV format. This PR autoloads these files as well, just like JSON or YAML.

Member

parkr commented Aug 17, 2014

Oh goodness, I thought I did this! Thanks for the PR. Looks pretty good to me.

lib/jekyll/site.rb
- data[key] = SafeYAML.load_file(path)
+ case File.extname(path).downcase
+ when '.csv'
+ data[key] = CSV.read(path, headers: true).map(&:to_hash)

parkr Aug 17, 2014

Member

We follow the GitHub Ruby Style Guide, which dictates we use hash rockets:

data[key] = CSV.read(path, :headers => true).map(&:to_hash)
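For readers unfamiliar with the call under review, here is a minimal sketch of what it produces. The sample data and column names are made up for illustration, and `CSV.parse` stands in for `CSV.read` only so that the example needs no file on disk; both accept the same `:headers` option.

```ruby
require 'csv'

# Hypothetical contents of a data file such as _data/members.csv.
csv_text = "name,github\nAlice,alice\nBob,bob\n"

# Same options as the line under review, written with a hash rocket.
# CSV.parse takes a string; CSV.read does the same for a file path.
rows = CSV.parse(csv_text, :headers => true).map(&:to_hash)

# Each row is now a plain Hash keyed by the header cells.
rows.each { |row| puts "#{row['name']} (#{row['github']})" }
```

The result is an array of hashes, one per data row, which is the same shape a YAML or JSON list of objects would have.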

parkr Aug 17, 2014

Member

Additionally, what happens if no header is specified? /cc @benbalter

Floppy Aug 17, 2014

Contributor

Hashrocket added.

As for headers, if you didn't have headers in the CSV, there would be no way to do things like site.members.name (as there wouldn't be anything to say it was a name), so I think it's OK for Jekyll to support a very precise definition of CSV, i.e. comma-separated with a header row. That's what most people will want to use anyway. If there weren't a header row, you'd get junk data, but there's currently no simple way to be sure whether a CSV has a header or not, so we can't really throw an error.
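To make the "junk data" case concrete, here is a small sketch of what the same call does with a headerless file. The sample data is invented, and `CSV.parse` is again used instead of `CSV.read` purely to avoid touching the filesystem.

```ruby
require 'csv'

# Hypothetical headerless file: every line is data, including the first.
csv_text = "Alice,alice\nBob,bob\n"

rows = CSV.parse(csv_text, :headers => true).map(&:to_hash)

# The first record has been consumed as the header row, so only one row
# remains, and its keys are the values of another record.
p rows
```

No exception is raised: the parser has no way to know that "Alice" was meant as data rather than as a column name, which is why this cannot simply be turned into an error.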

@paulfitz paulfitz referenced this pull request in okfn/dataexplorer Aug 17, 2014

Closed

Edit github CSVs #155

Member

parkr commented Aug 17, 2014

> As for headers, if you didn't have headers in the CSV, there would be no way to do things like site.members.name (as there wouldn't be anything to say it was a name), so I think it's OK for Jekyll to support a very precise definition of CSV, i.e. comma-separated with a header row. That's what most people will want to use anyway. If there weren't a header row, you'd get junk data, but there's currently no simple way to be sure whether a CSV has a header or not, so we can't really throw an error.

I agree that we should enforce headers. I would really like a way to show some sort of error if no headers exist. Otherwise, we should add a huge warning to the docs, and the release notes should say that we support reading CSVs with headers in _data. How can we be clear about this?

@parkr parkr added the Feature label Aug 17, 2014

Contributor

Floppy commented Aug 17, 2014

Paging @ldodds and @pezholio. Do you guys think there's any reasonable way to detect a header row in a CSV? It seems it would always be very brittle, to me.

Member

parkr commented Aug 17, 2014

@benbalter may also have an idea. He works with this kind of data quite often.

Contributor

Floppy commented Aug 17, 2014

We've been building http://csvlint.io recently for CSV validation, and I'm 99% sure we don't have a reliable way to autodetect headers, so I expect it'll have to be a documentation thing. Anyway, we'll see what the others say first!

paulfitz Aug 18, 2014

I agree with @Floppy that detecting whether a CSV file has a header is unreliable in the general case. It works great on big juicy files with cells stuffed with numbers, dates, and the like, but it breaks your heart on important edge cases, including tables with few rows, or a table full of short strings.

I think it'd definitely be reasonable to treat the following cases as errors:

  • Blank cells in the alleged header.
  • Repeated cells in the alleged header.
  • Numeric-looking cells (integer, float) in the alleged header (this one is a bit less reasonable than the first two, but would catch a lot more headerless CSV files).

For anything that tries to be much smarter than that, it'd be great to have a configuration switch to turn it off when predictability is important.

Very happy user of the _data directory, thanks for including it, and CSV support of any kind would be total icing on the cake!
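The three checks listed in this comment could be sketched roughly as follows. `header_problems` is a hypothetical helper written for illustration; it is not part of Jekyll or csvlint, and the numeric pattern is only one possible reading of "numeric-looking".

```ruby
require 'csv'

# Returns a list of reasons why the first row of csv_text is unlikely to
# be a header row. An empty list means none of the three checks fired.
# Hypothetical helper, not an existing Jekyll or csvlint API.
def header_problems(csv_text)
  header = CSV.parse_line(csv_text) || []
  cells = header.map { |cell| cell.to_s.strip }
  non_blank = cells.reject(&:empty?)
  numeric = /\A[-+]?(\d+\.?\d*|\.\d+)\z/

  problems = []
  problems << 'blank cell in header' if cells.any?(&:empty?)
  problems << 'repeated cell in header' if non_blank.uniq.length != non_blank.length
  problems << 'numeric-looking cell in header' if non_blank.any? { |c| c.match?(numeric) }
  problems
end

puts header_problems("name,,email\nAlice,x,alice@example.com\n")
```

As the comment notes, the numeric check is the most debatable of the three: a legitimate header made of years, for instance, would be flagged, so a real implementation would probably want it to be optional.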

Member

parkr commented Aug 18, 2014

Great set of criteria. Thinking more about it now, this kind of validation would better serve the jekyll doctor command. We can print CSV files that violate any of the above. What do you think?
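As a rough idea of what such a report could look like outside Jekyll itself, the sketch below walks a data directory and lists the CSV files whose header row fails a caller-supplied check. `flag_csv_files` is a hypothetical helper and the `_data` path is an assumption; this is not how `jekyll doctor` is implemented.

```ruby
require 'csv'

# Hypothetical helper: passes the first row of every .csv file under
# data_dir to the block and returns the paths for which the block is true.
# Only lowercase ".csv" extensions are matched in this sketch.
def flag_csv_files(data_dir)
  Dir.glob(File.join(data_dir, '**', '*.csv')).sort.select do |path|
    header = CSV.foreach(path).first || []
    yield header
  end
end

# Example check: flag files whose first row contains a blank cell.
flagged = flag_csv_files('_data') do |header|
  header.any? { |cell| cell.to_s.strip.empty? }
end
flagged.each { |path| puts "Possible missing header row: #{path}" }
```

Keeping the check in a block means the reporting loop stays the same whichever of the proposed criteria ends up being applied.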

parkr added a commit that referenced this pull request Aug 18, 2014

@parkr parkr merged commit c4a2ac2 into jekyll:master Aug 18, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed

parkr added a commit that referenced this pull request Aug 18, 2014

Contributor

Floppy commented Aug 18, 2014

That could work. The core of csvlint.io is in a gem, https://github.com/theodi/csvlint.rb/. We could add the heuristic @paulfitz suggests to that, and integrate that check into jekyll doctor perhaps. It would then catch a whole bunch of CSV errors, which might be useful.

@Floppy Floppy referenced this pull request in theodi/csvlint.rb Aug 18, 2014

Open

Header error detection #96

@Floppy Floppy deleted the theodi:csv-data branch Aug 18, 2014

parkr added a commit that referenced this pull request Aug 26, 2014

01010000101001100 Nov 5, 2014

Thank you for shipping it with 2.4.0!

Love it <3

@jekyll jekyll locked and limited conversation to collaborators Feb 27, 2017
