Dilated File Transfer #23

meejah · 2022-06-30T22:15:52Z

Specification for a new file-transfer protocol based on Dilation.

piegamesde

Did a first quick initial pass on it. There is a lot of conceptual overlap with my transfer-v2 proposal, and I think a few features of the latter could easily be integrated in here.

dilated-file-transfer.md

piegamesde · 2022-07-02T09:24:25Z

dilated-file-transfer.md

+They are actually an encoded `Map` with `String` keys (to use `msgpack` lingo) and values as per the pseudo-code.
+
+All control-channel messages contain a integer "kind" field describing the sort of message it is.
+(That is, `"kind": 1` for example, not the single-byte tag used for subchannel messages)


I am strongly in favor of using strings instead of integers for all enums and other type discriminants, here and below.

Because integers are just another layer of indirection. You need to specify which integer means what, instead of giving things a declarative name. Many serialization frameworks (e.g. serde, gson) support automatic naming based on the identifier name in the code, so this reduces both programming effort and possibilities for mistakes. Furthermore, the space of strings is a lot larger than just single byte integers, so we don't need to do things like having to reserve IDs for purposes and stuff like that.

The only advantage of using integers I can see is size/efficiency, and I don't think this is a concern for us here.

Using a fixed-length field for this in subchannel messages does have advantages since the vast majority of data down those channels will simply be payload bytes of file data. So, you save an additional pass over that data and a second copy in memory.

No matter what, the spec still needs to define a mapping from "thing" to "meaning" whether "thing" is 0x02 or "payload-data".

Ideally we'd just get rid of this part entirely and base the meaning of records on their position (e.g. "first one is the offer", etc) which would allow writing the file-data blobs directly. The implications of this if a stream got 'out of sync' -- e.g. one side thinks "this is msgpack data" while other thinks "it's file data" could be interesting to consider -- but also, that shouldn't happen (unless e.g. metadata is wrong about the file size etc).

dilated-file-transfer.md

…ples

dilated-file-transfer.md

meejah · 2022-07-08T22:06:03Z

@psanford or @Jacalz maybe of interest?

Jacalz

I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.

meejah · 2022-07-09T16:40:24Z

thanks @Jacalz

I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.

I did look at that, and agree that Zstandard sounds like a good approach.

I left it for a future enhancement (likely as a "feature") for several reasons:

this is already pretty big (as @felinira noted)
some implementations may wish to not do it (or not implement it) -- hence "feature"
there's a bunch of subtle stuff (keep the context for entire directory-offer? spec defaults params, or sender decides, or?)
could be security implications (I'm 95% sure "not" for this use-case, compress-than-encrypt has had some spectacular failures e.g. CRIME)

I would like to have compression though. If others agree that's "not too much more complexity" I could spec that out as well in this draft (as the one and only "feature" for now)...?

Vote with emojis:

🎉 leave until later
🚀 : compression now

meejah · 2022-07-09T16:51:21Z

If compression is done as a "feature" it still allows implementations to "upgrade slowly".

Idea being, you can implement just "mode=send" and "mode=receive" for essentially feature-parity with classic transfer (this should mean minimal changes to UX etc). This still means multiple files, if the sender offers more than one, so it's not precisely the same...

Then, you can implement "mode=connect" (possibly with very different UX) for multiple, bi-directional transfers. Some tools may never implement this mode (e.g. wormhole receive will only ever do mode=receive).

(If we specify compression) Separate from both the above, you can implement compression if desired. It could also be "implemented" but not used in some situations (e.g. smaller computers) as it will be a "feature".

Perhaps the above should be a section in the spec?

piegamesde · 2022-07-09T16:56:50Z

In #1, I proposed a simple "format" option for compression. This is fairly simple, under the assumption that the compressed data can be decompressed without requiring any information about the compression settings (this is usually the case). It uses the usual feature intersection which makes it both opt-in and future proof for format additions.

The only real open question is how to deal with folders. Compressing folders as one stream is very likely to be more efficient (no new dictionary per file, deduplication across files), so we probably want that. The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

I propose to skip the individual file headers in directory transfers and just concatenate the (compressed) bytes. The order of the file is deterministic and dictated by the send offer. The checksum message might be modified so that it is one message with a list of checksums, one for each file.

meejah · 2022-07-09T17:00:29Z

The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

No, you can still separate the "compressed bytes for file 1" from the "compressed bytes for file 2" as zstandard allows a "flush" operation. I have tested this. Many compressions also offer their own container format (as does zstd) but you don't have to use these.'

...and yes, we'll want to use the same compression context for as much as feasible as that gives better compression (usually).

(This is one for the "there are subtle things" point and partly why I wanted to skip this until later .. I don't think we should include more than one offer per compression context, because some implementations may process each offer / subchannel in a thread and insisting on one compression-context wouldn't allow that).

piegamesde · 2022-07-09T17:11:39Z

I don't think we should include more than one offer per compression context

I agree, and actually I never thought about that before. To me, different transfers are independent in every sense of the word, and sharing a compression context would create interdependence between them.

In theory, one could make them share compression without making them dependent by providing some kind of pre-trained dictionary prior to the transfer. I don't know how common this is across compression formats, but at least zstd supports it. However, I'm against this for the same reason I rejected that approach for sending individual files: It adds quite a bit of complexity which I'd like to avoid.

Jacalz · 2022-07-11T20:15:43Z

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

https://blog.mbedded.ninja/programming/serialization-formats/a-comparison-of-serialization-formats/

piegamesde · 2022-07-11T20:25:57Z

The problem with ProtoBuf is that it does not have first-class support for self-describing messages: https://developers.google.com/protocol-buffers/docs/techniques#self-description

msgpack is conceptually a drop-in replacement from JSON, whereas trying to use ProtoBuf would be an entirely different beast. I haven't looked at ProtoBuf (& friends) a lot though, so I'd appreciate some analysis on how it could be integrated into Magic Wormhole.

meejah · 2022-07-11T22:09:16Z

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

As specified, we're only using JSON for the "already built" parts that currently must use it (e.g. version exchange). That has to allow open-ended contents (and anyway would be a new mailbox protocol too). For protocol messages related to just this application protocol the preliminary conclusion is "msgpack". CBOR2 is also similar (and has better performance, IIRC).

I would not consider ProtocolBuffers; it uses pre-compiled specs and generated code, but without the advantages of Cap'n'Proto (essentially its successor) or flatbuffers (more-actually the successor because it's Google too). It supports fewer languages than either.

For "no-parse" options, I'd say "flatbuffers" is probably the front-runner. It supports many more languages, is "no-parse" and doesn't bundle other concerns like cap-n-proto does. It still has the disadvantage of requiring generated code though.

We could use specified messages like this, but consider "features" that have optional fields: then we'd have to have different versions of the flatbuffer specs for each feature. On the one hand, those conceptually exist anyway -- but can be handled by "if" statements easily instead of completely separate flatbuffers specs. (e.g. consider the "thumbnail" example in the "protocol expansion" section -- with msgpack/cbor2 one can do something like "if thumbnails on, include thumbnail in the Offer message" whereas with flatbuffers you'd need OfferThumbnail, OfferNoThumbnail etc (multiplied by all features with optional fields).

dilated-file-transfer.md

vu3rdd · 2022-07-14T04:10:34Z

dilated-file-transfer.md

+```python
+class FileOffer:
+    filename: str   # filename (no path segments)
+    timestamp: int  # Unix timestamp (seconds since the epoch in GMT)


Is the timestamp the mtime from stat()? If so, may be we can add some text in that regard?

Yeah .. added some words near here to clarify that.

vu3rdd · 2022-07-14T05:27:42Z

dilated-file-transfer.md

+Peers MUST remember the stat information (at least size + modification-time) of each file in the offer.
+When sending file data peers MUST compare the current size + modification-time to that recorded when making the offer.


OK, so looks like the above timestamp value is indeed mtime.

vu3rdd · 2022-07-14T09:13:16Z

dilated-file-transfer.md

+    filename: str   # filename (no path segments)
+    timestamp: int  # Unix timestamp (seconds since the epoch in GMT)
+    bytes: int      # total number of bytes in the file
+```


Just thinking about file permissions here. Do we want to also send file permissions (the mode bits in stat()) or whether we should just rely on the user's umask for the file creation permissions and attempt to use a sane default like 0644 or something. The current file transfer protocol does not send file permissions. For directories, I think it is left to the clients on the permissions of the base directory. The python client currently uses the zipfile module to extract the zip and each file's permission stored in the extended bits in the file header for each file in the zip is used to restore the original permissions. Storing permission bits in extended bits is not really specified by zip spec, just that zipfile module uses those bits for this purpose.

Just thinking aloud here, perhaps a file/directory transfer program shouldn't bother about permissions etc and just leave it to the client's discretion..

It doesn't stop with permissions though. There are also access control lists (acl). It should support either nothing or everything. Maybe it could be an optional extension.

Permissions are also different for every platform. Windows doesn't use the POSIX permission model at all for example.

Yeah. This is kind of a complex topic and gets a little bit into the "philosophy" here -- that is, this is "a file-transfer protocol" at heart (and not, for example, a "filesystem synchronization protocol" such as rsync).

I did consider for example something like encoding "execute": True or similar (e.g. "some" of the permissions) but even that seems a bit open for abuse. On the flip side, if you're sending a bunch of shell scripts from one computer to another it might be boring/surprising to have to chmod +x *.sh after -- but ultimately safer.

And, as @felinira notes, different platforms have different permissions (and other attributes) available.

These same philosophic considerations come into play with the "symlink" question too -- where the classic protocol takes a more "file transfer" approach versus "synchronize filesystems" (by sending the contents of links).

Using that as guidance, it probably makes sense to not try and transmit permissions? I'm still a little torn about "execute" just from a convenience point of view, but ...overall I think pushing this to "possible future "feature"" seems best.

Thinking further, even something "simple" like transferring "group read permission": it's quite likely that the user on one computer is a different user (with different group-membership) on the other computer -- and, it could even be an OS / filesystem that doesn't even have the concept of "group".

So, yet more reasons to not try and transfer these sorts of attributes. Client implementations should be encouraged to select good defaults for their OS/etc conditions.

meejah · 2022-07-14T18:55:13Z

I have a very rough proof-of-concept implementation in the Python client and here post updated Automat-produced state-machine diagrams. These are obviously tied to a particular implementation but I think are still generic enough to be useful.

In case it's not obvious, a DilatedFileSender machine is made for each offer and a corresponding DilatedFileReceiver is created on the receiving side (in response to a subchannel open).

This only supports File transfers right now, not directories or any control messages (i.e. "text messages").

meejah · 2022-09-20T22:01:32Z

Checking back in here, although the PoC code doesn't support multiple files yet I think we should decide on magic-wormhole/magic-wormhole-mailbox-server#31 before finalizing this transfer spec (and, perhaps that ticket would really be better in this repository in hindsight).

magic-wormhole/magic-wormhole-mailbox-server#31

meejah added 25 commits June 13, 2022 13:36

prelim v2 sketch

aac7fc9

clarify

1f454ee

overall selection

efc0011

attempt at state diagram

58e32e5

tweak diagram

8240b1a

more tweaks

8e8985d

more version work

0e94de2

random notes

c8aaa0d

new writeup, incorporating extensive discussion from brian warner

bb917fe

clarify

75d1c05

words

7caca5b

basic -> core

545a050

paths are lists; cleanups/edits

efdf7ee

clean

12eacc9

clean

23e1e8a

dot diagram

dd4f11d

tweak diagram

8b3ffa7

editing tweaks

59ade4d

add philosophy

d90275c

grammar is fun

d15f61a

links

6f545bb

old

0ba5c9a

slight re-wording and spelling

0ffed00

clarify

83b8dae

old

b650bf8

piegamesde reviewed Jul 2, 2022

View reviewed changes

meejah added 4 commits July 4, 2022 12:19

q: compression?

63ed3b7

subtly change version declartions; add more version discussion / exam…

aaf49b0

…ples

decide: yes, we want send/receive modes

63b41b1

use strings for kind in control messages

8a57ad0

piegamesde reviewed Jul 5, 2022

View reviewed changes

dilated-file-transfer.md Show resolved Hide resolved

ewanas reviewed Jul 6, 2022

View reviewed changes

dilated-file-transfer.md Show resolved Hide resolved

ewanas reviewed Jul 6, 2022

View reviewed changes

dilated-file-transfer.md Outdated Show resolved Hide resolved

meejah added 2 commits July 8, 2022 01:55

add symlink discussion

6606347

further clarify data fragmentation etc

aa27570

Jacalz reviewed Jul 9, 2022

View reviewed changes

meejah added 5 commits July 13, 2022 16:53

verb tense

da44652

nah

24e5335

how to close the session

6057b14

re-word offer language to clarify modes

d1aae38

clarifications and minor re-wording

74de043

vu3rdd reviewed Jul 14, 2022

View reviewed changes

meejah added 2 commits July 14, 2022 12:41

control-message 'kind' is string

2d881ee

clarify the timestamp

0af29f4

This was referenced Jul 27, 2022

Dilated Transfer prototyping magic-wormhole/magic-wormhole#445

Draft

Feature Request: Support sending multiple times (reuse Wormhole code) magic-wormhole/magic-wormhole#230

Open

meejah mentioned this pull request Sep 12, 2022

Ending a session gracefully magic-wormhole/magic-wormhole-mailbox-server#31

Open

meejah mentioned this pull request Apr 26, 2023

Proposal: info-dict extension to the File-Transfer Protocol #39

Open

meejah mentioned this pull request Jun 4, 2024

Dilated File Transfer (transfer-v2) #40

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dilated File Transfer #23

Dilated File Transfer #23

meejah commented Jun 30, 2022

piegamesde left a comment

piegamesde Jul 2, 2022

meejah Jul 3, 2022

piegamesde Jul 3, 2022

meejah Jul 4, 2022

meejah commented Jul 8, 2022

Jacalz left a comment

meejah commented Jul 9, 2022 •

edited

Loading

meejah commented Jul 9, 2022

piegamesde commented Jul 9, 2022

meejah commented Jul 9, 2022 •

edited

Loading

piegamesde commented Jul 9, 2022

Jacalz commented Jul 11, 2022

piegamesde commented Jul 11, 2022

meejah commented Jul 11, 2022 •

edited

Loading

vu3rdd Jul 14, 2022

meejah Jul 14, 2022

vu3rdd Jul 14, 2022

vu3rdd Jul 14, 2022

felinira Jul 14, 2022 •

edited

Loading

meejah Jul 14, 2022

meejah Jul 15, 2022

meejah commented Jul 14, 2022

meejah commented Sep 20, 2022

		Peers MUST remember the stat information (at least size + modification-time) of each file in the offer.
		When sending file data peers MUST compare the current size + modification-time to that recorded when making the offer.

Dilated File Transfer #23

Are you sure you want to change the base?

Dilated File Transfer #23

Conversation

meejah commented Jun 30, 2022

piegamesde left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

meejah commented Jul 8, 2022

Jacalz left a comment

Choose a reason for hiding this comment

meejah commented Jul 9, 2022 • edited Loading

meejah commented Jul 9, 2022

piegamesde commented Jul 9, 2022

meejah commented Jul 9, 2022 • edited Loading

piegamesde commented Jul 9, 2022

Jacalz commented Jul 11, 2022

piegamesde commented Jul 11, 2022

meejah commented Jul 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felinira Jul 14, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

meejah commented Jul 14, 2022

meejah commented Sep 20, 2022

meejah commented Jul 9, 2022 •

edited

Loading

meejah commented Jul 9, 2022 •

edited

Loading

meejah commented Jul 11, 2022 •

edited

Loading

felinira Jul 14, 2022 •

edited

Loading