Skip to content

TSV2 Proposal

andychu edited this page Jul 23, 2021 · 43 revisions
Clone this wiki locally

Up: Structured Data in Oil

Now named QTT: Quoted, Typed Tables

QTT is an interchange format for tables that's a graceful upgrade of TSV. A TSV file is a valid QTT file.

It uses QSN (Quoted String Notation) as a building block.

Short Description

TSV2 is a textual interchange format for tables, like JSON is for nested data. It's most easily described as a TSV file with some nice properties:

  • The header row is required. It may or may not have type annotations.
  • Every row is exactly one line (unlike CSV and TSV).
  • It can represent any byte or character in a field. This is done with single quoted strings like 'foo\n' and 'nul byte: \x00 !' and '\u1234'. Most fields should not be quoted.

In canonical form, a TSV2 file is a valid UTF-8 text file. There may be reasons to use non-canonical forms (e.g. for compatibility with old TSV readers).

Example

NOTE: → are tab characters (written in HTML as →).

name →             age:int → checksum:bytes → ratio:float → child:boolean
'Homer Simpson' →  10 →      '\xab\xcd' →     0.12 →        false
Marge Simpson →    20 →      '\xef\xff' →     0.34 →        false
Bart Simpson →     30 →      '\xdd\xee' →     0.56789 →     true
'Lisa Simpson' →   40 →      '\xdd\xee' →     0.56789 →     true

The default column type is string.

This example is inconsistent to illustrate the principle that single quotes are optional in some cases but required in others. They're recommended when a field contains spaces, but required when a field contains newlines or tabs.

What consumes it?

  • Oil's xargs "dialect" (each Keyword in Oil)
    • It provides xargs with rows, rather than faking it with -n 4 for a row of 4 items.
      • xargs can similarly use the names. We could provide some alternative to the -I {} string.
  • awk "dialect" (See http://www.oilshell.org/blog/tags.html?tag=awk#awk)
    • It provides awk with named columns instead of $1 $2 $3.
    • It provides awk with types.

What emits it?

  • Scripts that scrape coreutils. The output of ps and ls can be massaged into Tables.
  • Oil's augmentation of the find command (TBD)
  • Oil's history mechanism (TODO: issue #320)
  • ... any other tool that can write text to stdout :)

Why not CSV?

A line is not a record in CSV, e.g.

"record
with
spaces",10

is a single record. This doesn't work well with line-oriented shell tools.

Also, CSV in practice tends to be ill-specified. (e.g. hardware outputs really broken CSV files.)

Why not TSV?

  • You can't have tabs in TSV fields.
  • If you have newlines, you no longer have the property that a record is a line (i.e. the same problem that CSV has).
  • TSV has no printable representation for NUL bytes like '\x00'.

Why Single Quotes instead of Double Quotes?

Because JSON uses double quotes, and JSON represents unicode strings. TSV2 represents byte strings, which can be utf-8 encoded.

Is it compatible with TSV?

Strictly speaking, no. Hence the "major version number bump". However:

  • A TSV2 reader can always read a TSV file correctly. Spaces are allowed in raw TSV2 fields, even though a more canonical representation is:
    • 'spaces and single quotes'<TAB>'another field', rather than
    • no single quotes<TAB>another field.
  • If a TSV reader consumes a TSV2 file, then it will "work", except some fields may have leading ' characters. This only happens for "unprintable" characters and tabs, so in the common case there are NO corrections necessary.

Types

Based on JSON primitive types, but less tied to JavaScript:

  • string (default)
  • bytes (can represent arbitrary bytes like invalid UTF-8)
  • "number" is split into int and float, because most languages have that distinction
  • boolean

What's the difference between string and bytes columns?

Nothing as far as the TSV2 format is concerned. string is an indication to the language's library to try to decode utf-8 and escape sequences into a string data structure, if applicable. (e.g. Python, Java, and JavaScript provide random code point access; Go and Rust use UTF-8 internally.)

Columns Other than String / Bytes Follow JSON Format

See the "regular expressions" at json.org. We use true, false, and null.

Required Validations

  • The file must have valid header.
    • Column names should match [a-zA-Z_][a-zA-Z0-9_]+. (Should we allow spaces? Language can convert to _ ?)
  • The number of columns in the TSV2 stream is determined the header. If any row has an invalid number of cells, a fatal error must be raised.
  • Invalid string syntax, e.g.
    • a field starts with ' but doesn't end with one.
    • invalid escape sequences
  • Invalid syntax in any other column:
    • int
    • float
    • boolean

Canonicalization

  • The TSV2 file is valid UTF-8.
    • NUL bytes are written '\x00' (only allow \xAB in bytes columns?)
  • 'fields with spaces have single quotes' -- this makes TSV readers less likely to work on TSV2 files, but that's OK. You have the option to emit non-canonical TSV2.
  • empty strings should be '' !

Relationship to Programming Languages

  • Python: if you do repr() and the string does not contain a single quote, Python will use single quotes. And then it will be a valid TSV2 string.
  • Oil: TODO
  • JavaScript: if it doesn't contain \xFF codes, it should work? Well there's also the "bell" character and such.

Links

Issues / Notes

  • String and Bytes columns can't contain null
    • int and boolean columns can't either
  • What about float columns? null could be useful for NA. It can be represented by IEEE floats.