Multiple sheets in one file #3

sjackman · 2016-09-16T17:23:41Z

I'd like to have multiple sheets in one file to keep data and metadata together. This suggests that we need a container format to keep multiple TSV files together in one file. We could use an off-the-shelf container format, such as tar or zip. However, I would like the container format to be plain text, so that the entire document can be edited in a text editor and checked into version control. I'm going to list some requirements below, and then a random assortment of formats that came to mind.

A plain text container file format that contains multiple plain text files
Can contain TSV, CSV and Markdown PSV (pipe-separated values)
Can contain any plain text file, particularly Markdown (and RMarkdown)
Each embedded plain text file has a name (like a sheet name or file name)
Editable by a human with a plain text editor
Aesthetically pleasing

sjackman · 2016-09-16T17:34:36Z

Tables separated by blank lines

See http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.

Example

# name: sheet1.tsv
A   B
1   X
2   Y

# name: sheet2.csv
C,D,E
3,4,5

# name: sheet3.md
| F | G |
|---|---|
| 0 | 1 |

Advantages

Simple!
Miller (aka mlr) supports this file format, though without # comments, and all the tables have to be the same type (all TSV, for example).

Disadvantages

It can't contain any file that contains a blank line (since a blank line is the file separator).
Not all programs handle # comments in TSV/CSV files.
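
A minimal sketch of a reader for this layout, using only the Python standard library. It assumes that every table starts with a `# name:` comment line, that tables are separated by one or more blank lines, and that the delimiter can be guessed from the file extension. The function name `split_sheets` is hypothetical.

```python
import csv
import io
import re


def split_sheets(text):
    """Split blank-line-separated tables into {name: rows}."""
    sheets = {}
    # One or more blank lines separate the embedded tables.
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.splitlines()
        # Assumption: the first line of each block is "# name: <file name>".
        name = lines[0].split(":", 1)[1].strip()
        # Assumption: .csv means comma-delimited, anything else is tab-delimited.
        delimiter = "," if name.endswith(".csv") else "\t"
        body = io.StringIO("\n".join(lines[1:]))
        sheets[name] = list(csv.reader(body, delimiter=delimiter))
    return sheets


example = (
    "# name: sheet1.tsv\nA\tB\n1\tX\n2\tY\n"
    "\n"
    "# name: sheet2.csv\nC,D,E\n3,4,5\n"
)
sheets = split_sheets(example)
print(sheets["sheet1.tsv"])
print(sheets["sheet2.csv"])
```

The sketch shares the disadvantage listed above: an embedded file that itself contains a blank line would be split in two.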

sjackman · 2016-09-16T17:48:31Z

Tables in Markdown

See https://help.github.com/articles/organizing-information-with-tables/

Example

# Tables in *Markdown*

```tsv sheet1
A   B
1   X
2   Y
```

```csv sheet2
C,D,E
3,4,5
```

Table: A pandoc-crossref table caption {#tbl:sheet3}

| F | G |
|---|---|
| 0 | 1 |

Advantages

The container format (Markdown) is already defined!
Markdown editors already exist! Texts (http://www.texts.io) can edit Markdown tables.
Need to teach programs like mlr, datamash, readr, RStudio, etc. to read Markdown files and extract tables.

Disadvantages

The container format is Markdown, but it can't really contain a Markdown file.
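
A rough sketch of how a program could extract the fenced tables from such a Markdown file. It assumes the convention from the example above, where the opening fence is followed by the format and then the sheet name (for example `tsv sheet1`). It only handles fenced TSV and CSV blocks, not pipe tables, and `extract_sheets` is a hypothetical name.

```python
import csv
import io
import re

# Built at run time so that no literal fence appears inside this example.
FENCE = "`" * 3

# An opening fence, the format, the sheet name, the body, and a closing fence.
BLOCK = re.compile(
    r"^" + FENCE + r"(tsv|csv)[ \t]+(\S+)[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_sheets(markdown):
    """Return {sheet name: rows} for every fenced tsv/csv block."""
    sheets = {}
    for fmt, name, body in BLOCK.findall(markdown):
        delimiter = "\t" if fmt == "tsv" else ","
        sheets[name] = list(csv.reader(io.StringIO(body), delimiter=delimiter))
    return sheets


doc = "\n".join([
    "# Tables in *Markdown*",
    "",
    FENCE + "tsv sheet1",
    "A\tB",
    "1\tX",
    "2\tY",
    FENCE,
    "",
    FENCE + "csv sheet2",
    "C,D,E",
    "3,4,5",
    FENCE,
    "",
])
sheets = extract_sheets(doc)
print(sheets)
```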

jennybc · 2016-09-16T17:57:02Z

Do you think it has to be one file? What if, instead, it were one directory? The way that Git repos or RStudio projects are just regular directories on your computer, with some other, mostly hidden stuff lying around to help Git and RStudio do their jobs. An R package is a special case that has even more specific structure about the files and directories. Maybe that's a model for a sanesheet?

A sanesheet-anticipating tool serves the different files to you via conventional tabs. And tabs have an interact / edit mode appropriate to the file type. As delimited file for the TSVs, as markdown for the notes. So I guess that suggests there has to be a .sanesheets file, like foo.Rproj or the entire .git directory, to facilitate coordinated actions across the whole set of files.

I think the point is that someone who wants to experience a sanesheet like a spreadsheet could. Well, if this mythical inspector/editor existed! But the code-based analyst could experience it like a subdirectory of delimited files and some notes. And all could enjoy the benefit of version control.

sjackman · 2016-09-16T18:06:24Z

MIME (Multipurpose Internet Mail Extensions)

See https://en.wikipedia.org/wiki/MIME

Example

MIME-Version: 1.0

Content-type: text/tab-separated-values; name=sheet1.tsv; boundary=END

A   B
1   X
2   Y
--END

Content-type: text/csv; name=sheet2.csv; boundary=END

C,D,E
3,4,5
--END

Content-type: text/markdown; name=sheet3.md; boundary=END

| F | G |
|---|---|
| 0 | 1 |
--END

Advantages

The container format (MIME) is well defined, supported and implemented. brew install ripmime && ripmime -i example.sanesheet will extract three files from this one file.
Can contain any arbitrary file, including binary files (like PNG files!) using base64 encoding.
Need to teach programs like mlr, datamash, readr, RStudio, etc. to read MIME files.

Disadvantages

Not as aesthetically attractive (as say Markdown), but not too shabby either.
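
One way to generate a container like this, sketched with Python's standard `email` package. The function name `build_container` and the input layout (file name mapped to a text subtype and the content) are assumptions made for illustration. The generated text carries a few more headers per part than the hand-written example above.

```python
from email import message_from_string
from email.message import EmailMessage


def build_container(sheets, boundary="END"):
    """Serialize {file name: (text subtype, content)} as multipart/mixed text."""
    msg = EmailMessage()
    msg["MIME-Version"] = "1.0"
    msg.make_mixed(boundary=boundary)
    for filename, (subtype, content) in sheets.items():
        # Each sheet becomes one text/<subtype> part carrying its file name.
        msg.add_attachment(content, subtype=subtype, filename=filename)
    return msg.as_string()


container = build_container({
    "sheet1.tsv": ("tab-separated-values", "A\tB\n1\tX\n2\tY\n"),
    "sheet2.csv": ("csv", "C,D,E\n3,4,5\n"),
})
print(container)

# Round trip: parse the text again and list the embedded file names.
parts = message_from_string(container).get_payload()
print([part.get_filename() for part in parts])
```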

jennybc · 2016-09-16T18:10:30Z

Multi-part MIME is pretty intriguing.

I've always wanted to interact with ~~spread~~sanesheets via my mail client 😉.

sjackman · 2016-09-16T18:21:39Z

Directory of files

Example

❯❯❯ ls
sheet1.tsv sheet2.csv sheet3.md
❯❯❯ head *
==> sheet1.tsv <==
A   B
1   X
2   Y

==> sheet2.csv <==
C,D,E
3,4,5

==> sheet3.md <==
| F | G |
|---|---|
| 0 | 1 |

Advantages

Already exists. It's pretty much the current system.
Can contain any arbitrary file type.
Can be tarred or zipped up to create a single file.

Disadvantages

Files (like metadata and data) can easily get separated when someone e-mails the data sheet without including the metadata sheet.
Emailing a directory to someone requires first zipping it up.

sjackman · 2016-09-16T18:28:53Z

I like "Tables separated by blank lines" for its simplicity, but I believe it has the least support (of the options given) by programs in the wild.

I like "Tables in Markdown" for its aesthetic qualities and existing adoption. Markdown editors already exist, and I believe some already have support for editing tables. This option is probably the most well implemented by existing tools.

I like MIME because the format is simple, well defined and widely implemented. There's lots of existing code/libraries to read/write these files. It can contain binary files, like graphical plots.

sjackman · 2016-09-16T18:39:26Z

> Do you think it has to be one file? What if, instead, it were one directory?

A directory of files certainly has the lowest barrier to entry.

I'd like to bundle up multiple files into a single container file so that they're not easily separated, so that, for example, when someone e-mails you the data sheet it doesn't get separated from the metadata sheet.

People are accustomed to a single spreadsheet file (XLS for example) that contains all the related sheets of one project. I'd like to give people an option that has this same behaviour, but with a plain text file format.

sjackman · 2016-09-16T19:10:09Z

Twitter poll! https://twitter.com/sjackman/status/776860608629055488

sjackman · 2016-09-16T19:26:41Z

JSON (JavaScript Object Notation)

See http://www.json.org

Example

{
"sheet1": {
"A": [1, 2],
"B": ["X", "Y"]
},
"sheet2": {
"C": [3],
"D": [4],
"E": [5]
},
"sheet3": {
"F": [0],
"G": [1]
}
}

Advantages

Well defined format that is intended for storing data.
Can be easily read by R and Python.
Has command-line tool support with jq and mlr.

Disadvantages

Not terribly pretty.
The table looks transposed, with one column per line.
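
The transposed look is only a matter of presentation. As a small sketch, assuming every column of a sheet has the same length, the column-oriented JSON above can be turned back into rows with Python's standard library (`columns_to_rows` is a hypothetical helper):

```python
import json


def columns_to_rows(sheet):
    """Turn {column name: [values]} into a header row followed by data rows."""
    header = list(sheet)
    # zip(*columns) walks the columns in parallel, yielding one row at a time.
    rows = [list(row) for row in zip(*sheet.values())]
    return [header] + rows


doc = json.loads(
    '{"sheet1": {"A": [1, 2], "B": ["X", "Y"]},'
    ' "sheet2": {"C": [3], "D": [4], "E": [5]}}'
)
tables = {name: columns_to_rows(sheet) for name, sheet in doc.items()}
print(tables["sheet1"])
```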

mr-c · 2016-09-16T21:22:21Z

When you are ready for provenance, attribution, and extensible metadata then http://www.researchobject.org/ will be waiting for you :-)

sjackman · 2016-09-16T22:10:46Z

A Research Object Bundle is a zip file of arbitrary files (like TSV files) that also contains a JSON file describing the provenance/attribution of those files.
https://researchobject.github.io/specifications/bundle/
I'm hoping for a plain-text format that can be committed to git. The unzipped zip file can of course be committed to git, in which case we have a "Directory of files" with an additional .ro/manifest.json file to describe the files.

sjackman · 2016-09-16T22:53:25Z

@jennybc @hadley Does RStudio have a table editor to edit data frames and TSV files?

jennybc · 2016-09-16T23:27:03Z

No ... but maybe it could one day!

sjackman · 2016-09-18T17:33:52Z

CSVY (CSV with YAML frontmatter)

See http://csvy.org

Example

---
name: sheet1.tsv
---
A   B
1   X
2   Y

---
name: sheet2.csv
---
C,D,E
3,4,5

---
name: sheet3.md
---
| F | G |
|---|---|
| 0 | 1 |

Advantages

The format is already defined for a single sheet.
Looks nice and is similar to the familiar Markdown with YAML frontmatter.
Has two R implementations! https://cran.r-project.org/web/packages/csvy/csvy.pdf and https://cran.r-project.org/web/packages/rio/rio.pdf

Disadvantages

YAML frontmatter would likely break existing TSV/CSV parsers. Can be stripped out with

sed '/^---$/,/^---$/d'
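
Going the other way, here is a sketch of splitting the multi-sheet example above into named sheets. It assumes that a line consisting of exactly `---` never occurs inside the data, that blank lines between sheets can be ignored, and that the frontmatter holds only simple `key: value` pairs, so no YAML library is used. The function name is hypothetical.

```python
def split_frontmatter_sheets(text):
    """Return {name: body text} for sheets separated by frontmatter blocks."""
    sheets = {}
    name, body, in_header = None, [], False
    for line in text.splitlines():
        if line == "---":
            if not in_header:
                # A new header starts: store the sheet collected so far.
                if name is not None:
                    sheets[name] = "\n".join(body)
                name, body = None, []
            in_header = not in_header
        elif in_header:
            # Assumption: simple "key: value" pairs; only "name" is used here.
            key, _, value = line.partition(":")
            if key.strip() == "name":
                name = value.strip()
        elif line.strip():
            body.append(line)
    if name is not None:
        sheets[name] = "\n".join(body)
    return sheets


example = (
    "---\nname: sheet1.tsv\n---\nA\tB\n1\tX\n2\tY\n"
    "\n"
    "---\nname: sheet2.csv\n---\nC,D,E\n3,4,5\n"
)
sheets = split_frontmatter_sheets(example)
print(sheets)
```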

sjackman · 2016-09-20T17:45:03Z

@jennybc Any thoughts/preferences on these file formats? My favourites are

JSON for a machine-readable, well-defined and implemented format.
CSVY for the most attractive human-writable/readable format, though I would use TSVY myself.

I really like the look of TSVY and its similarity to Markdown with YAML front matter. I quite like the idea of storing the code in RMarkdown with YAML frontmatter and storing the data in TSV with YAML frontmatter. I could even see concatenating the code and data files for the occasions where you may want to store both in a single file, and then I think you may have a real competitor for an Excel spreadsheet.

jennybc · 2016-09-20T19:41:54Z

I have JSON filed along with 😱 XML in my head. As in, if I don't have a nested or recursive structure, why would I go there? It's also not that human readable. For normal humans.

So I like these tab and comma delimited formats with some meta-data in a header, yes. For the bits that are data.

I think it is important to think about the motley assortment of things people park in a spreadsheet. I honestly think the neat packaging of disparate objects is part of what users like. I see a fair number of spreadsheets where the data worksheets really could be csv files. But then there's always the "README as worksheet" lurking at the front or the back.

sjackman · 2016-09-20T21:17:11Z

Yes, I agree with needing a format to wrap up prose, code, data and report in a single file. Currently that lives in 3+ files: the prose and code in one RMarkdown file, the data in multiple TSV files, and the rendered report in an HTML file. I'd like a format to stuff all that in a single text file. I prefer having multiple files when building a data analysis pipeline, but when sending a report to a collaborator, I prefer a single file.

jennybc · 2016-09-20T21:21:49Z

This is why I think your multipart MIME idea is not crazy. This is also why xlsx is a zip archive. Unless people can get comfortable with a directory, you have to shove it all into something 😕.

sjackman · 2016-09-20T22:45:37Z

I like the MIME idea for storing files of different types all in a single text file: the RMarkdown, the data and the HTML. For the related but simpler problem of how to stuff multiple sheets (TSV tables) in a single file, I prefer the TSV tables separated by YAML frontmatter blocks.

jennybc · 2016-09-20T23:12:41Z

Do you think it's necessary to stuff multiple sheets in a single file? Why?

sjackman · 2016-09-20T23:20:01Z

So that double-clicking a single file opens up that file in some pretty app (that doesn't yet exist) showing all the sheets in tabs, giving a similar user experience to an Excel spreadsheet.
So that data and metadata don't get separated when someone e-mails a TSV file of one but omits the other.

sjackman · 2016-09-20T23:22:37Z

See this Twitter conversation thread:
https://twitter.com/BaCh_mira/status/777511003017867264

@BaCh_mira

Directories are problem when people/scripts forget to copy/ftp parts. Specially when /* contains more than they want.

@mike_schatz

tarball?

@BaCh_mira

Then program has to navigate tarball. Back to square one. Or have tar/untar copies: not cool w/ large files (100GiB)

jennybc · 2016-09-20T23:35:54Z

Oh yes I definitely think this whole bundle of files needs to be packaged in some way yes. I thought you meant that, within that main receptacle, you wanted to get all the TSV files into one file. I misunderstood.

This is why the MIME idea is interesting, because it already anticipates very disparate things, with a pre-existing vocabulary for declaring what things are, signalling where they begin/end, etc.

sjackman · 2016-09-21T00:33:09Z

In the case when you're bundling all different types of files (TSV, RMarkdown and HTML), I like the MIME format. The TSV in that MIME file can be just simple TSV without YAML blocks.

In the case when you're bundling just TSV files (no RMarkdown, no HTML) I prefer one file of TSV tables separated by YAML blocks (no MIME).

If we only cared about bundling tables (sheets) into a single file, I'd prefer the TSV/YAML solution. If we want to tackle the whole enchilada, MIME is looking good.

sjackman · 2016-09-21T05:48:35Z

Success reading a multipart MIME sanesheet! So excite! Are we on to something?

eg.sanesheet

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=END

Title: An example sanesheet
--END
Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"

A   B
1   X
2   Y
--END
Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"

C   D   E
3   4   5
--END--

read_sanesheet.r

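library(purrr)
library(readr)
library(webutils)

# Read the whole container and split it on the MIME boundary declared in its header.
multipart <- parse_multipart(read_file("eg.sanesheet"), boundary = "END")
# Parse the body of each part as TSV; the list is named by each part's name parameter.
sheets <- map(multipart, function(x) read_tsv(x$value))
sheets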

$sheet1
  A B
1 1 X
2 2 Y

$sheet2
  C D E
1 3 4 5

read_sanesheet.sh

This sanesheet can also be extracted to one-file-per-sheet at the command line using ripmime.

❯❯❯ brew install ripmime
❯❯❯ ripmime -i eg.sanesheet -d eg
❯❯❯ head eg/*
==> eg/sheet1.tsv <==
A   B
1   X
2   Y

==> eg/sheet2.tsv <==
C   D   E
3   4   5

==> eg/textfile0 <==
Title: An example sanesheet
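
For comparison, a sketch of reading the same container with Python's standard `email` package. It assumes every part is tab-separated and is named by the `name` parameter of its Content-Disposition header, as in eg.sanesheet above. `read_sanesheet` is a hypothetical function name.

```python
import csv
import io
from email import message_from_string


def read_sanesheet(text):
    """Parse a multipart/mixed text container into {sheet name: rows}."""
    msg = message_from_string(text)
    sheets = {}
    # For a multipart message, get_payload() returns the list of parts.
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        rows = csv.reader(io.StringIO(part.get_payload()), delimiter="\t")
        sheets[name] = list(rows)
    return sheets


example = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=END",
    "",
    "Title: An example sanesheet",
    "--END",
    'Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"',
    "",
    "A\tB",
    "1\tX",
    "2\tY",
    "--END",
    'Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"',
    "",
    "C\tD\tE",
    "3\t4\t5",
    "--END--",
    "",
])
sheets = read_sanesheet(example)
print(sheets)
```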

sjackman · 2016-09-21T07:10:08Z

@jennybc What do you think of the name and file extension .tidysheet?

hadley · 2016-09-21T12:32:28Z

I really think that multiple files is the only sane way to go. It would be better to treat it like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

sjackman · 2016-09-21T16:06:05Z

It's difficult to e-mail a directory of files. I don't have any inside scoop, but I think that's why Apple migrated .pages, .numbers, etc. from directories of files (bundles) to flat files, and that's even with special OS support to make the directory look to the user like a single file. A directory of files will eventually need to be zipped up to send to someone. A more likely outcome is that someone e-mails the data sheet and just leaves out the metadata file, and the two will be separated.

See this Twitter conversation thread:
https://twitter.com/BaCh_mira/status/777511003017867264
#3 (comment)

ODS and XLSX use a zip file of files as the file format. Michael @mr-c mentioned http://www.researchobject.org/, which is a zip of files. zip as a container format is alright, but it's binary and can't easily be edited with a text editor or committed to version control. MIME is a nice container format because it is plain text, and it faithfully represents a directory of files.

To woo/convert spreadsheet users, one key feature currently missing is the ability to store multiple sheets in a single file. I'd like the format of that file to be plain text.

hadley · 2016-09-21T17:28:46Z

Yes, that's a downside, but I think the downsides of the other options (i.e. one massive opaque file that requires special software to read) are worse.

sjackman · 2016-09-21T17:57:34Z

Zip files are quite opaque, which is why I'm not a big fan of zip as a container format for plain text files. MIME is quite readable, as far as standard plain-text container formats go. What exactly do you mean by opaque? The contents of the MIME file example in #3 (comment) are pleasingly transparent in my opinion.

hadley · 2016-09-21T18:07:05Z

They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

The structure of MIME also does not lend itself to good performance - if you have a 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

People don't seem to have issues sharing RStudio projects?

sjackman · 2016-09-21T18:33:38Z

> They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

ripmime for the shell and webutils::parse_multipart for R to name two.
rio::import can read a zip of TSV files. I'd be up for submitting a PR to add support for MIME as a container.

sjackman · 2016-09-21T18:34:35Z

> The structure of MIME also does not lend itself to good performance - if you have a 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

That's also true of a tar.gz of files. I'm not sure whether zip is indexed. You can get random access to individual files by extracting the MIME container, same as with tar.gz and zip of a directory.
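
To make the scanning cost concrete, here is a small sketch that builds an index of where each part begins. It assumes the boundary string is known in advance and appears on a line of its own. Building the index takes one pass over the whole file, which is the cost being discussed; after that, a reader can seek straight to a part. `index_parts` is a hypothetical helper.

```python
import io


def index_parts(stream, boundary=b"END"):
    """Return the byte offset at which each part begins (one linear pass)."""
    delimiter = b"--" + boundary
    offsets = []
    while True:
        line = stream.readline()
        if not line:
            break
        # A delimiter line opens a part; the closing "--END--" line is skipped.
        if line.rstrip(b"\r\n") == delimiter:
            offsets.append(stream.tell())
    return offsets


data = io.BytesIO(b"preamble\n--END\nfirst\n--END\nsecond\n--END--\n")
offsets = index_parts(data)

# Jump directly to the second part without re-reading the first.
data.seek(offsets[1])
print(data.readline())
```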

sjackman · 2016-09-21T18:38:44Z

> People don't seem to have issues sharing RStudio projects?

People with the technical ability to create RStudio projects don't have an issue sharing RStudio projects. I'm hoping to target spreadsheet users with two key features:

You can double-click the file and open it in a graphical spreadsheet editor that would show one tab per sheet.
You can e-mail this single file, without having to create an archive of a directory, so that its sheets don't get separated.

hadley · 2016-09-21T18:40:19Z

If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ...

Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

sjackman · 2016-09-21T18:46:21Z

> Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

Gmail's file size attachment limit is 25 MB. That's good enough I would hazard for many typical Excel spreadsheets without being compressed (though XLSX is compressed). Larger files can be transferred via a link on Dropbox, same for any data file larger than the e-mail attachment limit.

sjackman · 2016-09-21T18:47:30Z

> If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project,

I don't feel that's onerous. Only two operations are needed: create a new blank sheet and delete a sheet. Good old copy-and-paste can be used to copy a sheet between two projects.

sjackman · 2016-09-21T22:31:41Z

Texts (http://www.texts.io) can edit Markdown tables.

sjackman · 2016-09-22T18:21:41Z

rio::import will be able to read multiple tables from an HTML file once gesistsa/rio#126 is resolved.

danfowler · 2017-02-02T21:38:47Z

@hadley: I really think that multiple files is the only sane way to go. It would be better to treat it like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

@sjackman Sorry to jump in so late, but Tabular Data Packages might be an option here. You can store multiple TSVs in a single directory and define a datapackage.json that provides high-level metadata (e.g. license info); a Table Schema that defines column descriptions, types, and constraints (csvy also uses this); and information about the CSV dialect. For serialization into one file, we have an open issue here: frictionlessdata/datapackage#132 . Version 1.0 coming super soon.

R implementation: https://github.com/ropenscilabs/datapkg
Python library: https://github.com/frictionlessdata/datapackage-py

Stephen-Gates · 2017-07-24T07:52:35Z

You may be interested in this new project Data Curator. We're building on top of Data Packages, Comma Chameleon and other goodness. Development starting this week.

sjackman · 2017-07-24T08:30:51Z

That's very exciting! Thanks for the heads up, Stephen. I'll be a willing beta tester, if you're looking for early outside users.

Stephen-Gates · 2017-07-24T09:17:39Z

We hope to go Beta at release 0.3.0. It should be of some value at that stage for single data files. Everyone is very welcome to test and report issues.

sjackman · 2017-07-24T09:57:50Z

Excellent. I don't know of a way to watch releases or completed milestones in GitHub. Could you suggest an issue, or create an issue to which I could subscribe, that would be updated/closed when 0.3.0 is released?

Stephen-Gates · 2017-07-24T10:04:21Z

Watch releases with https://github.com/ODIQueensland/data-curator/releases.atom

sjackman · 2017-07-24T10:28:29Z

I, and I imagine other users, don't use an RSS feed reader as part of our workflow. I'd prefer to use the GitHub web interface to watch notifications. Would you consider creating a low-volume locked issue that you comment on once per release, so that users like myself can subscribe to that issue?
See for example Linuxbrew/brew#1 (comment)

sjackman mentioned this issue Sep 21, 2016

Multiple tables in one file leeper/csvy#5

Closed

Stephen-Gates mentioned this issue Jul 24, 2017

Data Curator News - subscribe to follow our progress qcif/data-curator#15

Open
