Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple sheets in one file #3

Open
sjackman opened this issue Sep 16, 2016 · 47 comments
Open

Multiple sheets in one file #3

sjackman opened this issue Sep 16, 2016 · 47 comments

Comments

@sjackman
Copy link

sjackman commented Sep 16, 2016

I'd like to have multiple sheets in one file to keep data and metadata together in one file. This suggests that we need a container format to keep multiple TSV files together in one file. We could use an off-the-shelf container format, such as tar or zip. I would like however for the container format to be a plain text format to enable editing the entire document in a text editor and checking it into version control. I'm going to some requirements below, and then a random assortment of formats that came to mind.

  1. A plain text container file format that contains multiple plain text files
  2. Can contain TSV, CSV and Markdown PSV (pipe-separated values)
  3. Can contain any plain text file, particularly Markdown (and RMarkdown)
  4. Each embedded plain text file has a name (like a sheet name or file name)
  5. Editable by a human with a plain text editor
  6. Aesthetically pleasing
@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

Tables separated by blank lines

See http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.

Example

# name: sheet1.tsv
A   B
1   X
2   Y

# name: sheet2.csv
C,D,E
3,4,5

# name: sheet3.md
| F | G |
|---|---|
| 0 | 1 |

Advantages

  1. Simple!
  2. miller (aka mlr) supports this file format, though without # comments and they all have to be the same type (all TSV for example).

Disadvantages

  1. It can't contain any file that contains a blank line (since a blank line is the file separator).
  2. Not all programs handle # comments in TSV/CSV files.

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

Tables in Markdown

See https://help.github.com/articles/organizing-information-with-tables/

Example

# Tables in *Markdown*

```tsv sheet1
A   B
1   X
2   Y
```

```csv sheet2
C,D,E
3,4,5
```

Table: A pandoc-crossref table caption {#tbl:sheet3}

| F | G |
|---|---|
| 0 | 1 |

Advantages

  1. The container format (Markdown) is already defined!
  2. Markdown editors already exist! Texts http://www.texts.io can edit Markdown tables.
  3. Need to teach programs like mlr, datamash, readr, RStudio et c to read Markdown files and extract tables

Disadvantages

  1. The container format is Markdown, but it can't really contain a Markdown file.

@jennybc
Copy link
Owner

jennybc commented Sep 16, 2016

Do you think it has to be one file? What if, instead, it were one directory? The way that Git repos or RStudio projects are just regular directories on your computer, with some other, mostly hidden stuff lying around to help Git and RStudio do their jobs. An R package is a special case that has even more specific structure about the files and directories. Maybe that's a model for a sanesheet?

A sanesheet-anticipating tool serves the different files to you via conventional tabs. And tabs have an interact / edit mode appropriate to the file type. As delimited file for the TSVs, as markdown for the notes. So I guess that suggests there has to be a .sanesheets file, like foo.Rproj or the entire .git directory, to facilitate coordinated actions across the whole set of files.

I think the point is that someone who wants to experience a sanesheet like a spreadsheet could. Well, if this mythical inpsector/editor existed! But the code-based analyst could experience it like a subdirectory of delimited files and some notes. And all could enjoy the benefit of version control.

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

MIME (Multipurpose Internet Mail Extensions)

See https://en.wikipedia.org/wiki/MIME

Example

MIME-Version: 1.0

Content-type: text/tab-separated-values; name=sheet1.tsv; boundary=END

A   B
1   X
2   Y
--END

Content-type: text/csv; name=sheet2.csv; boundary=END

C,D,E
3,4,5
--END

Content-type: text/markdown; name=sheet3.md; boundary=END

| F | G |
|---|---|
| 0 | 1 |
--END

Advantages

  1. The container format (MIME) is well defined, supported and implemented. brew install ripmime && ripmime -i example.sanesheet will extract three files from this one file.
  2. Can contain any arbitrary file, including binary files (like PNG files!) using base64 encoding.
  3. Need to teach programs like mlr, datamash, readr, RStudio et c to read MIME files.

Disadvantages

  1. Not as aesthetically attractive (as say Markdown), but not too shabby either.

@jennybc
Copy link
Owner

jennybc commented Sep 16, 2016

Multi-part MIME is pretty intriguing.

I've always wanted to interact with spreadsanesheets via my mail client 😉.

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

Directory of files

Example

❯❯❯ ls
sheet1.tsv sheet2.csv sheet3.md
❯❯❯ head *
==> sheet1.tsv <==
A   B
1   X
2   Y

==> sheet2.csv <==
C,D,E
3,4,5

==> sheet3.md <==
| F | G |
|---|---|
| 0 | 1 |

Advantages

  1. Already exists. It's pretty much the current system.
  2. Can contain any arbitrary file type.
  3. Can be tarred or zipped up to create a single file.

Disadvantages

  1. Files (like metadata and data) can easily get separated when someone e-mails the data sheet without including the metadata sheet.
  2. Emailing a directory to someone requires first zipping it up.

@sjackman
Copy link
Author

I like "Tables separated by blank lines" for its simplicity, but I believe it has the least support (of the options given) by programs in the wild.

I like "Tables in Markdown" for its aesthetic qualities and existing adoption. Markdown editors already exist, and I believe some already have support for editing tables. This option is probably the most well implemented by existing tools.

I like MIME because the format is simple, well defined and widely implemented. There's lots of existing code/libraries to read/write these files. It can contain binary files, like graphical plots.

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

Do you think it has to be one file? What if, instead, it were one directory?

A directory of files certainly has the lowest barrier to entry.

I'd like to bundle up multiple files into a single container file so that they're not easily separated, so that for examples when someone e-mails you the data sheet it doesn't get easily separated from the metadata sheet.

People are accustomed to a single spreadsheet file (XLS for example) that contains all the related sheets of one project. I'd like to give people an option that has this same behaviour, but with a plain text file format.

@sjackman
Copy link
Author

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

JSON (JavaScript Object Notation)

See http://www.json.org

Example

{
"sheet1": {
"A": [1, 2],
"B": ["X", "Y"]
},
"sheet2": {
"C": [3],
"D": [4],
"E": [5]
},
"sheet3": {
"F": [0],
"G": [1]
}
}

Advantages

  1. Well defined format that is intended for storing data.
  2. Can be easily read by R and Python.
  3. Has command-line tool support with jq and mlr.

Disadvantages

  1. Not terribly pretty.
  2. The table looks transposed, with one column per line.

@mr-c
Copy link

mr-c commented Sep 16, 2016

When you are ready for provenance, attribution, and extensible metadata then http://www.researchobject.org/ will be waiting for you :-)

@sjackman
Copy link
Author

sjackman commented Sep 16, 2016

A Research Object Bundle is a zip file of arbitrary files (like TSV files) that also contains a JSON file describing the provenance/attribution of those files.
https://researchobject.github.io/specifications/bundle/
I'm hoping for a plain-text format that can be committed to git. The unzipped zip file can of course be committed to git, in which case we have a "Directory of files" with an additional .ro/manifest.json file to describe the files.

@sjackman
Copy link
Author

@jennybc @hadley Does RStudio have a table editor to edit data frames and TSV files?

@jennybc
Copy link
Owner

jennybc commented Sep 16, 2016

No ... but maybe it could one day!

@sjackman
Copy link
Author

sjackman commented Sep 18, 2016

CSVY (CSV with YAML frontmatter)

See http://csvy.org

Example

---
name: sheet1.tsv
---
A   B
1   X
2   Y

---
name: sheet2.csv
---
C,D,E
3,4,5

---
name: sheet3.md
---
| F | G |
|---|---|
| 0 | 1 |

Advantages

  1. The format is already defined for a single sheet.
  2. Looks nice and is similar to the familiar Markdown with YAML frontmatter.
  3. Has two R implementations! https://cran.r-project.org/web/packages/csvy/csvy.pdf and https://cran.r-project.org/web/packages/rio/rio.pdf

Disadvantages

  1. YAML frontmatter would likely break existing TSV/CSV parsers. Can be stripped out with
sed '/^---$/,/^---$/d'

@sjackman
Copy link
Author

sjackman commented Sep 20, 2016

@jennybc Any thoughts/preferences on these file formats? My favourites are

  1. JSON for a machine-readable, well-defined and implemented format.
  2. CSVY for the most attractive human-writable/readable format, though I would use TSVY myself.

I really like the look of TSVY and its similarity to Markdown with YAML front matter. I quite like the idea of storing the code in RMarkdown with YAML frontmatter and storing the data in TSV with YAML frontmatter. I could even see concatenating the code and data files for the occasions where you may want to store both in a single file, and then I think you may have a real competitor for an Excel spreadsheet.

@jennybc
Copy link
Owner

jennybc commented Sep 20, 2016

I have JSON filed along with 😱 XML in my head. As in, if I don't have a nested or recursive structure, why would I go there? It's also not that human readable. For normal humans.

So I like these tab and comma delimited formats with some meta-data in a header, yes. For the bits that are data.

I think it is important to think about the motley assortment of things people park in a spreadsheet. I honestly think the neat packaging of disparate objects is part of what users like. I see a fair number of spreadsheets where the data worksheets really could be csv files. But then there's always the "README as worksheet" lurking at the front or the back.

@sjackman
Copy link
Author

sjackman commented Sep 20, 2016

Yes, I agree with needing a format to wrap up prose, code, data and report in a single file. Currently that lives in 3+ files: the prose and code in one RMarkdown file, the data in multiple TSV files, and the the rendered report in an HTML file. I'd like a format to stuff all that in a single text file. I prefer having multiple files when building a data analysis pipeline, but when sending a report to a collaborator, I prefer a single file.

@jennybc
Copy link
Owner

jennybc commented Sep 20, 2016

This is why I think your multipart MIME idea is not crazy. This is also why xlsx is a zip archive. Unless people can get comfortable with a directory, you have to shove it all into something 😕.

@sjackman
Copy link
Author

sjackman commented Sep 20, 2016

I like the MIME idea for storing files of different types all in a single text file: the RMarkdown, the data and the HTML. For the related but simpler problem of how to stuff multiple sheets (TSV tables) in a single file, I prefer the TSV tables separated by YAML frontmatter blocks.

@jennybc
Copy link
Owner

jennybc commented Sep 20, 2016

Do you think it's necessary to stuff multiple sheets in a single file? Why?

@sjackman
Copy link
Author

sjackman commented Sep 20, 2016

  1. So that double-clicking a single file opens up that file in some pretty app (that doesn't yet exist) showing all the sheets in tabs, giving a similar user experience to an Excel spreadsheet.
  2. So that data and metadata don't get separated when someone e-mails a TSV file of one but omits the other.

@sjackman
Copy link
Author

See this Twitter conversation thread:
https://twitter.com/BaCh_mira/status/777511003017867264

@BaCh_mira

Directories are problem when people/scripts forget to copy/ftp parts. Specially when /* contains more than they want.

@mike_schatz

tarball?

@BaCh_mira

Then program has to navigate tarball. Back to square one. Or have tar/untar copies: not cool w/ large files (100GiB)

@jennybc
Copy link
Owner

jennybc commented Sep 20, 2016

Oh yes I definitely think this whole bundle of files needs to be packaged in some way yes. I thought you meant that, within that main receptacle, you wanted to get all the TSV files into one file. I misunderstood.

This is why the MIME idea is interesting, because it already anticipates very disparate things, with a pre-existing vocabulary for declaring what things are, signalling where they begin/end, etc.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

In the case when you're bundling all different types of files (TSV, RMarkdown and HTML), I like the MIME format. The TSV in that MIME file can be just simple TSV without YAML blocks.

In the case when you're bundling just TSV files (no RMardkon, no HTML) I prefer one file of TSV tables separated by YAML blocks (no MIME).

If we only cared about bundling tables (sheets) into a single file, I'd prefer the TSV/YAML solution. If we want to tackle the whole enchilada, MIME is looking good.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

Success reading a multipart MIME sanesheet! So excite! Are we on to something?

eg.sanesheet

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=END

Title: An example sanesheet
--END
Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"

A   B
1   X
2   Y
--END
Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"

C   D   E
3   4   5
--END--

read_sanesheet.r

library(purrr)
library(readr)
library(webutils)

multipart <- parse_multipart(read_file("sanesheet.tsv.multipart"), boundary = "END")
sheets <- map(multipart, function(x) read_tsv(x$value))
sheets
$sheet1
  A B
1 1 X
2 2 Y

$sheet2
  C D E
1 3 4 5

read_sanesheet.sh

This sanesheet can also be extracted to one-file-per-sheet at the command line using ripmime.

❯❯❯ brew install ripmime
❯❯❯ ripmime -i eg.sanesheet -d eg
❯❯❯ head eg/*
==> eg/sheet1.tsv <==
A   B
1   X
2   Y

==> eg/sheet2.tsv <==
C   D   E
3   4   5

==> eg/textfile0 <==
Title: An example sanesheet

@sjackman
Copy link
Author

@jennybc What do you think of the name and file extension .tidysheet?

@hadley
Copy link

hadley commented Sep 21, 2016

I really think that multiple files is the only sane way to go. Would be better treat like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

It's difficult to e-mail a directory files. I don't have any inside scoop, but I think that's why Apple migrated .pages, .numbers et c from their directories of files (bundles) to flat files, and that's even with special OS support to make the directory look to the user like a single file. A directory of files will eventually need to be zipped up to send to someone. A more likely outcome is that someone e-mail the data sheet and just leave out the metadata file, and the two will be separated.

See this Twitter conversation thread:
https://twitter.com/BaCh_mira/status/777511003017867264
#3 (comment)

ODS and XLSX use a zip file of files as the file format. Michael @mr-c mentioned http://www.researchobject.org/, which is a zip of files. zip as a container format is alright, but it's binary and can't easily be edited with a text editor or committed to version control. MIME is a nice container format because it is plain text, and it faithfully represents a directory of files.

To woo/convert spreadsheet users, one key feature currently missing is the ability to store multiple sheets in a single file. I'd like the format of that file to be plain text.

@hadley
Copy link

hadley commented Sep 21, 2016

Yes, that's a downside, but I think the downsides of the other options (i.e. one massive opaque file that requires special software to read) are worse.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

Zip files are quite opaque, which is why I'm not a big fan of zip as a container format for plain text files. MIME is quite readable, as far as standard plain-text container formats go. What exactly do you mean by opaque? The contents of the MIME file example in #3 (comment) are pleasingly transparent in my opinion.

@hadley
Copy link

hadley commented Sep 21, 2016

They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

The structure of MIME also does not lead itself to good performance - if you have 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

People don't seem to have issues sharing RStudio projects?

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

ripmime for the shell and webutils::parse_multipart for R to name two.
rio::import can read a zip of TSV files. I'd be up for submitting a PR to add support for MIME as a container.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

The structure of MIME also does not lead itself to good performance - if you have 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

That's also true of a tar.gz of files. I'm not sure whether zip is indexed. You can get random access to individual files by extracting the MIME container, same as with tar.gz and zip of a directory.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

People don't seem to have issues sharing RStudio projects?

People with the technical ability to create RStudio projects don't have an issue sharing RStudio projects. I'm hoping to target spreadsheet users with two key features:

  1. You can double-click the file and open it in a graphical spreadsheet editor that would show one tab per sheet.
  2. You can e-mail this single file, without having to create an archive of a directory, so that its sheets don't get separated.

@hadley
Copy link

hadley commented Sep 21, 2016

If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ...

Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

@sjackman
Copy link
Author

Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

Gmail's file size attachment limit is 25 MB. That's good enough I would hazard for many typical Excel spreadsheets without being compressed (though XLSX is compressed). Larger files can be transferred via a link on Dropbox, same for any data file larger than the e-mail attachment limit.

@sjackman
Copy link
Author

sjackman commented Sep 21, 2016

If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project,

I don't feel that's onerous. Only two operations are needed: create a new blank sheet and delete a sheet. Good old copy-and-paste can be used to copy a sheet between two projects.

@sjackman
Copy link
Author

Texts (http://www.texts.io) can edit Markdown tables.

tables md

@sjackman
Copy link
Author

rio::import will be able to read multiple tables from an HTML file once gesistsa/rio#126 is resolved.

@danfowler
Copy link

@hadley: I really think that multiple files is the only sane way to go. Would be better treat like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

@sjackman Sorry to jump in so late, but Tabular Data Packages might be an option here. You can store multiple TSVs in a single directory and define a datapackage.json that provides high-level metadata (e.g. license info); a Table Schema that defines column descriptions, types, and constraints (csvy also uses this); and information about the CSV dialect. For serialization into one file, we have an open issue here: frictionlessdata/datapackage#132 . Version 1.0 coming super soon.

@Stephen-Gates
Copy link

You may be interested in this new project Data Curator. We're building on top of Data Packages, Comma Chameleon and other goodness. Development starting this week.

@sjackman
Copy link
Author

That's very exciting! Thanks for the heads up, Stephen. I'll be a willing beta tester, if you're looking for early outside users.

@Stephen-Gates
Copy link

Hope to go Beta at release 0.3.0 should be of some value at that stage for single data files. Everyone is very welcome to test and report issues.

@sjackman
Copy link
Author

Excellent. I don't know of a way to watch releases or completed milestones in GitHub. Could you suggest an issue, or create an issue to which I could subscribe, that would be updated/closed when 0.3.0 is released?

@Stephen-Gates
Copy link

@sjackman
Copy link
Author

sjackman commented Jul 24, 2017

I and I imagine other users don't use an RSS feed reader as part of my workflow. I'd prefer to use the GitHub web interface to watch notifications. Would you consider creating a low-volume locked issue that you comment on once per release, so that users like myself can subscribe to that issue?
See for example Linuxbrew/brew#1 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants