Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoload csv files from data directory #2761

Merged
merged 4 commits into from
Aug 18, 2014
Merged

Autoload csv files from data directory #2761

merged 4 commits into from
Aug 18, 2014

Conversation

Floppy
Copy link
Contributor

@Floppy Floppy commented Aug 16, 2014

Sometimes it's simplest to store data in CSV format. This PR autoloads these files as well, just like JSON or YAML.

@parkr
Copy link
Member

parkr commented Aug 17, 2014

Oh goodness, I thought I did this! Thanks for the PR. Looks pretty good to me.

data[key] = SafeYAML.load_file(path)
case File.extname(path).downcase
when '.csv'
data[key] = CSV.read(path, headers: true).map(&:to_hash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We follow the GitHub Ruby Style Guide, which dictates we use hash rockets:

data[key] = CSV.read(path, :headers => true).map(&:to_hash)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, what happens if no header is specified? /cc @benbalter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashrocket added.

As for headers, if you didn't have headers in the CSV, there would be no way to do things like site.members.name (as there wouldn't be anything to say it was a name), so I think it's OK for Jekyll to support a very precise definitions of CSV, i.e. comma separated and includes header row. That's what most people will want to use anyway. If there wasn't a header row, you'd get junk data, but there's currently no simple way to be sure if a CSV has a header or not, so we can't really throw an error.

@parkr
Copy link
Member

parkr commented Aug 17, 2014

As for headers, if you didn't have headers in the CSV, there would be no way to do things like site.members.name (as there wouldn't be anything to say it was a name), so I think it's OK for Jekyll to support a very precise definitions of CSV, i.e. comma separated and includes header row. That's what most people will want to use anyway. If there wasn't a header row, you'd get junk data, but there's currently no simple way to be sure if a CSV has a header or not, so we can't really throw an error.

I agree that we should enforce headers. I would really like a way to show some sort of error if no headers exist. Or add a huuuge warning in the docs and the release notes should say support reading CSV's with headers in _data. How can we be clear about this?

@parkr parkr added the Feature label Aug 17, 2014
@Floppy
Copy link
Contributor Author

Floppy commented Aug 17, 2014

Paging @ldodds and @pezholio. Do you guys think there's any reasonable way to detect a header row in a CSV? It seems it would always be very brittle, to me.

@parkr
Copy link
Member

parkr commented Aug 17, 2014

@benbalter may also have an idea. He works with this kind of data quite often.

@Floppy
Copy link
Contributor Author

Floppy commented Aug 17, 2014

We've been building http://csvlint.io recently for CSV validation, and I'm 99% sure we don't have a reliable way to autodetect headers, so I expect it'll have to be a documentation thing. Anyway, we'll see what the others say first!

@paulfitz
Copy link

I agree with @Floppy that detecting whether a CSV file has a header is unreliable in the general case. It works great on big juicy files with cells stuffed with numbers, dates, and the like, but it breaks your heart on important edge cases, including tables with few rows, or a table full of short strings.

I think it'd definitely be reasonable to treat the following cases as errors:

  • Blank cells in the alleged header.
  • Repeated cells in the alleged header.
  • Numeric-looking cells (integer, float) in the alleged header (this one is a bit less reasonable than the first two, but would catch a lot more headerless CSV files).

Anything that tries to be much smarter than that, it'd be great to have a configuration switch to turn off for when predictability is important.

Very happy user of the _data directory, thanks for including it, and CSV support of any kind would be total icing on the cake!

@parkr
Copy link
Member

parkr commented Aug 18, 2014

Great set of criteria. Thinking more about it now, this kind of validation would better serve the jekyll doctor command. We can print CSV files that violate any of the above. What do you think?

parkr added a commit that referenced this pull request Aug 18, 2014
@parkr parkr merged commit c4a2ac2 into jekyll:master Aug 18, 2014
parkr added a commit that referenced this pull request Aug 18, 2014
@Floppy
Copy link
Contributor Author

Floppy commented Aug 18, 2014

That could work. The core of csvlint.io is in a gem, https://github.com/theodi/csvlint.rb/. We could add the heuristic @paulfitz suggests to that, and integrate that check into jekyll doctor perhaps. It would then catch a whole bunch of CSV errors, which might be useful.

@Floppy Floppy deleted the csv-data branch August 18, 2014 16:06
parkr added a commit that referenced this pull request Aug 26, 2014
@ghost
Copy link

ghost commented Nov 5, 2014

Thank you for shipping it with 2.4.0 !

Love it <3

@jekyll jekyll locked and limited conversation to collaborators Feb 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants