A unixfsv2 spike, using schemas
This document is an exploration of what a hypothetical "unixfs v2" might look like, using schemas as a design language for describing the data structures.
This is not a final proposal, but rather a lens to explore both unixfsv2 concepts and schemas themselves at the same time, and how we expect them to contribute to each other. Several concepts in schemas, advanced layouts, their interactions, and their syntaxes will be trialled here; they are not final.
Without further ado...
Schema
type AnyFile union {
| File "f"
| Dir "d"
| Symlink "s"
} representation keyed
type Filename string
type Dir struct {
members {Filename:DirEnt}<HAMT>
}
type DirEnt struct {
content &AnyFile
attribs Attribs
}
type File struct {
content Bytes<Rabin>
}
type Symlink struct {
target String
}
advanced Rabin {
kind bytes
implementation "IPLD/experimental/rabin/v1"
}
advanced HAMT {
kind map
implementation "IPLD/experimental/HAMT/v1"
}
type Attribs struct {
mtime Int # Danger danger will robinson; int sizes will rear their ugly head.
posix Int # Can we just... do the standard 0777 mask packing here? I don't think anything else is *less* work.
sticky Bool (implicit: false)
setuid Bool (implicit: false)
setgid Bool (implicit: false)
uid Int # make another schema which deliberately excludes this. try to match this one (the one with more info) first.
gid Int # make another schema which deliberately excludes this. try to match this one (the one with more info) first.
}
# You should be able to imagine sending a fancy graphsync request with some structure like:
#
# {"schema": {{schemaCIDofUnixfs}}, "selector": {{selectorSpec}},
# "schemaPatch": {"./dir/transparently/through/to/file": {{schemaCIDofRabinLayout}},
# "startAt": {{rootBlockCID+path}} }
#
# This kind of fancy graphsync request can be used to make a query which does "the right thing" with
# all advanced layouts transparently; then when reaching the 'patch' area, perhaps the selector spec
# wants to use the seeded-psuedorandomness feature, so it can issue queries like this across many peers.
#
# Perhaps we won't actually need the 'patch' feature, because in practice if doing queries with lots
# of fancy sharded requests, prooobably you'll do the path *up to* that first, in a separate query,
# But either way, you can see how using schemas for signaling is useful here.
Interesting Highlights
Using various schemas as a feature!
Observe the comments on the Attribs.uid and Attribs.gid fields.
By using several different (but significantly overlapping) schemas to describe variations in which attributes we're storing, we can make unixfsv2 into an extensible yet still very well defined system.
Concretely: If the uid and gid elements are found present in the first Attribs map
processed in a tree of data, then we match the first schema (the one with more info),
and we can now expect all future Attribs maps to consistently include the same data.
If the uid or gid elements are found to be missing, we would flunk out of matching
the first schema, and try a second schema which does not include them... and again,
now expect all future Attribs maps to consistently not include these data.
This kind of deep consistency makes it easier to build applications, especially applications which may want to change their behavior or apply distinct transformations depending on which of the schemas matches.
Advanced layout usage sites are marked with angle brackets
E.g., on the line members {Filename:DirEnt}<HAMT>, it's the < and >
characters which denote that we'll be using an advanced layout here.
We might also consider using different characters for this denotation. (Angle brackets are reminiscent of generics in Java syntax; this may or may not be a helpful reminiscence.)
We might also consider a different approach entirely: ditching the ability to
declare advanced layouts on inline type terms at all, and instead using
something more like the representation keyword.
Such an approach would syntactically require a fully named type for any
structure that wants to use an advanced layout.
Given that the most common applications of advanced layouts are for map, list,
and byte kinds, all of which tend to be used via anonymous inline types,
this could result in a substantial increase in syntax volume
(e.g., in this example, it would've lead to two new named types).
(Note that the previous paragraph has only considered schema DSL syntax; perhaps the issue looks differently if considered via the IPLD-native "AST".)
Advanced layout usage sites are marked
See #130 -- specifically, the "signaling" problem -- for more discussion of this.
We're exploring the "explicit signaling" approach, and doing so using schemas as the vehicle, here.
Advanced layouts promising a kind?
The kind map and kind bytes markings in the advanced blocks are a hypothesis.
In the cases shown in this example, we definitely know that these are the kinds we expect the advanced layout algorithms in question to behave as. This may not always be true (e.g. in the encryption user story, not explored here).
The kind is also indicated at the usage site -- e.g. the { character
at the start of the usage site members {Filename:DirEnt}<HAMT> already
makes it quite clear that we're dealing with a map kind -- so perhaps
kind indicators in the advanced block is redundant. Or, perhaps by the
other side of the same coin, they're a good sanity-check landmark.
Advanced layout implementation strings
See #130 -- specifically, the "referencing" (or "rendezvous") problem -- for more discussion of this.
We're exploring with the "behavioral specification" approach here.
What are executables, anyway?
In this schema, the executablity bit is in the Attribs.posix bitfield...
as three distinct bools, once per user/group/everyone (the 0111 mask).
Another way to regard executablity which would often (often!) be valid is to have a single bool per file.
These are two different opinions application design might take in regarding unixish filesystems. Logically consistent arguments can be made for both. Perhaps this is another situation where two different schemas would be useful; we can support both, and do so clearly, if we think it's useful to do so.
Ugh, int sizes
mtime Int
In designing the IPLD Data Model itself, we still have some deferred decisions about how we want to regard integer precision (as well as what we expect, in practice, when client implementations in various languages hit their host language's natural limitations).
We should probably give particular attention to timestamps in such decisions. They're a very common user story.
Additionally, regardless of what precisely we decide to do about integer sizes, we should aim to have documentation with recommendations of best practices for the timestamp user story.
Links
The & syntax seen in content &AnyFile denotes that the AnyFile is
expected on the far side of a Link in that position.
Denotation of the position of a link is important because It Matters To The Hash, as well as mattering to the overall understanding of topology of the larger DAG; yet it's terse because it doesn't matter to the semantics of the design (much).
The single character sigil would seem to imply that there's no such thing as an &&Foo or &&&Foo.
Which indeed I hope there isn't: this author can't think of any situations
where that expresses a sane thing. (It's rare enough in programming with pointers;
it's an entirely separate thing to propose in a merkle-DAG, which has no mutablity
semantics in the first place, which removes even the 'rare' excuses.)
We haven't officially explored or documented this anywhere yet. Probably we should.
Did you notice we still don't have enough information to determine CIDs?
This schema does not specify which:
- CID version;
- multicodec;
- nor multihash
... to use when serializing any block data and constructing links.
It also doesn't specify any restrictions for those values when reading data.
That's... interesting. What does this imply for work that's still left to do in systems which build using these tools?
Do advanced layouts deserve parameters?
We could imagine having blocks with details such as this in the schema:
advanced Rabin {
kind bytes
implementation "IPLD/experimental/rabin/v1"
bitmatch 14
minchunksize 256
maxchunksize 262144
}
advanced HAMT {
kind map
implementation "IPLD/experimental/HAMT/v1"
bitwidth 14
}
Would this be better?
See the "Other" heading in #130 for another introduction of this question.
A separate, but co-located, question is whether such parameters (if we have them) should be serialized in the internal data of the advanced layout (e.g. typically in the 'root block' of that structure), or if they should be here in the schema. This author tends to think the answer may be "both"; it certainly seems they must be in the schema if schemas are to fulfill their purpose as a design language and provide sufficient information for it to be possible to make diverse implementations in new languages.
The full schema at the top of the document explores the choices of "no", and is thus ambiguous on that second question.
Some Errata
Some unixfs naming notes
AnyFile may be a clunky name. Suggestions welcome.
One possible alternative that we've already chosen to avoid here is "inode", because that term has a lot of associations with mutable filesystems (and currently MFS does already have such a concept).
Things not included for brevity
It's likely the real Unixfsv2 will include more members of AnyFile,
such as unix socket files, device node properties, etc.
These are omitted here for brevity; omission here is not meant to suggest that
such things don't belong in later more complete documents which are
on a path to becoming specs.
Remember, Selectors Are Coming
It should be noted that there have been previous discussions about whether or not, and to what degree, it is important to be able to "inline" small objects.
To the best of this author's understanding, much (all?) of the reason to wish for the ability to "inline" small objects is loading a link is cheap only when the content is local, and may become significantly high latency if the content requires retrieval over a network (especially if that retrieval requires a search for a peer who has the content!).
Selectors are a feature which make it possible to describe fetching whole graphs of linked data at once, rather than in an "N+1 query" fashion, and so they dramatically change the field of play here: possibly reducing the performance penalties of "unnecessary" blocks/links to infinitesimal.
We're expecting this feature to land "soon" (on the same timescale, and almost certainly before, any final unixfsv2 specs themselves), so it seems valid to design link placements with that improvement in mind.
inb4 timezones
mtime Int
Yes, the posix syscalls use raw integers for modification times on files.
No, it does not include timezones.
Yes, really.