Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dilated File Transfer #23

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

meejah
Copy link
Member

@meejah meejah commented Jun 30, 2022

Specification for a new file-transfer protocol based on Dilation.

Copy link
Member

@piegamesde piegamesde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first quick initial pass on it. There is a lot of conceptual overlap with my transfer-v2 proposal, and I think a few features of the latter could easily be integrated in here.

dilated-file-transfer.md Outdated Show resolved Hide resolved
dilated-file-transfer.md Outdated Show resolved Hide resolved
They are actually an encoded `Map` with `String` keys (to use `msgpack` lingo) and values as per the pseudo-code.

All control-channel messages contain a integer "kind" field describing the sort of message it is.
(That is, `"kind": 1` for example, not the single-byte tag used for subchannel messages)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am strongly in favor of using strings instead of integers for all enums and other type discriminants, here and below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because integers are just another layer of indirection. You need to specify which integer means what, instead of giving things a declarative name. Many serialization frameworks (e.g. serde, gson) support automatic naming based on the identifier name in the code, so this reduces both programming effort and possibilities for mistakes. Furthermore, the space of strings is a lot larger than just single byte integers, so we don't need to do things like having to reserve IDs for purposes and stuff like that.

The only advantage of using integers I can see is size/efficiency, and I don't think this is a concern for us here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a fixed-length field for this in subchannel messages does have advantages since the vast majority of data down those channels will simply be payload bytes of file data. So, you save an additional pass over that data and a second copy in memory.

No matter what, the spec still needs to define a mapping from "thing" to "meaning" whether "thing" is 0x02 or "payload-data".

Ideally we'd just get rid of this part entirely and base the meaning of records on their position (e.g. "first one is the offer", etc) which would allow writing the file-data blobs directly. The implications of this if a stream got 'out of sync' -- e.g. one side thinks "this is msgpack data" while other thinks "it's file data" could be interesting to consider -- but also, that shouldn't happen (unless e.g. metadata is wrong about the file size etc).

dilated-file-transfer.md Show resolved Hide resolved
dilated-file-transfer.md Show resolved Hide resolved
dilated-file-transfer.md Outdated Show resolved Hide resolved
@meejah
Copy link
Member Author

meejah commented Jul 8, 2022

@psanford or @Jacalz maybe of interest?

Copy link

@Jacalz Jacalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.

@meejah
Copy link
Member Author

meejah commented Jul 9, 2022

thanks @Jacalz

I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.

I did look at that, and agree that Zstandard sounds like a good approach.

I left it for a future enhancement (likely as a "feature") for several reasons:

  • this is already pretty big (as @felinira noted)
  • some implementations may wish to not do it (or not implement it) -- hence "feature"
  • there's a bunch of subtle stuff (keep the context for entire directory-offer? spec defaults params, or sender decides, or?)
  • could be security implications (I'm 95% sure "not" for this use-case, compress-than-encrypt has had some spectacular failures e.g. CRIME)

I would like to have compression though. If others agree that's "not too much more complexity" I could spec that out as well in this draft (as the one and only "feature" for now)...?

Vote with emojis:

  • 🎉 leave until later
  • 🚀 : compression now

@meejah
Copy link
Member Author

meejah commented Jul 9, 2022

If compression is done as a "feature" it still allows implementations to "upgrade slowly".

Idea being, you can implement just "mode=send" and "mode=receive" for essentially feature-parity with classic transfer (this should mean minimal changes to UX etc). This still means multiple files, if the sender offers more than one, so it's not precisely the same...

Then, you can implement "mode=connect" (possibly with very different UX) for multiple, bi-directional transfers. Some tools may never implement this mode (e.g. wormhole receive will only ever do mode=receive).

(If we specify compression) Separate from both the above, you can implement compression if desired. It could also be "implemented" but not used in some situations (e.g. smaller computers) as it will be a "feature".

Perhaps the above should be a section in the spec?

@piegamesde
Copy link
Member

In #1, I proposed a simple "format" option for compression. This is fairly simple, under the assumption that the compressed data can be decompressed without requiring any information about the compression settings (this is usually the case). It uses the usual feature intersection which makes it both opt-in and future proof for format additions.

The only real open question is how to deal with folders. Compressing folders as one stream is very likely to be more efficient (no new dictionary per file, deduplication across files), so we probably want that. The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

I propose to skip the individual file headers in directory transfers and just concatenate the (compressed) bytes. The order of the file is deterministic and dictated by the send offer. The checksum message might be modified so that it is one message with a list of checksums, one for each file.

@meejah
Copy link
Member Author

meejah commented Jul 9, 2022

The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible.

No, you can still separate the "compressed bytes for file 1" from the "compressed bytes for file 2" as zstandard allows a "flush" operation. I have tested this. Many compressions also offer their own container format (as does zstd) but you don't have to use these.'

...and yes, we'll want to use the same compression context for as much as feasible as that gives better compression (usually).

(This is one for the "there are subtle things" point and partly why I wanted to skip this until later .. I don't think we should include more than one offer per compression context, because some implementations may process each offer / subchannel in a thread and insisting on one compression-context wouldn't allow that).

@piegamesde
Copy link
Member

I don't think we should include more than one offer per compression context

I agree, and actually I never thought about that before. To me, different transfers are independent in every sense of the word, and sharing a compression context would create interdependence between them.

In theory, one could make them share compression without making them dependent by providing some kind of pre-trained dictionary prior to the transfer. I don't know how common this is across compression formats, but at least zstd supports it. However, I'm against this for the same reason I rejected that approach for sending individual files: It adds quite a bit of complexity which I'd like to avoid.

@Jacalz
Copy link

Jacalz commented Jul 11, 2022

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

@piegamesde
Copy link
Member

The problem with ProtoBuf is that it does not have first-class support for self-describing messages: https://developers.google.com/protocol-buffers/docs/techniques#self-description

msgpack is conceptually a drop-in replacement from JSON, whereas trying to use ProtoBuf would be an entirely different beast. I haven't looked at ProtoBuf (& friends) a lot though, so I'd appreciate some analysis on how it could be integrated into Magic Wormhole.

@meejah
Copy link
Member Author

meejah commented Jul 11, 2022

Just a random side-note here. Should we consider using protobuf instead of json? It seems to have big benefits in terms of size and performance.*

As specified, we're only using JSON for the "already built" parts that currently must use it (e.g. version exchange). That has to allow open-ended contents (and anyway would be a new mailbox protocol too). For protocol messages related to just this application protocol the preliminary conclusion is "msgpack". CBOR2 is also similar (and has better performance, IIRC).

I would not consider ProtocolBuffers; it uses pre-compiled specs and generated code, but without the advantages of Cap'n'Proto (essentially its successor) or flatbuffers (more-actually the successor because it's Google too). It supports fewer languages than either.

For "no-parse" options, I'd say "flatbuffers" is probably the front-runner. It supports many more languages, is "no-parse" and doesn't bundle other concerns like cap-n-proto does. It still has the disadvantage of requiring generated code though.

We could use specified messages like this, but consider "features" that have optional fields: then we'd have to have different versions of the flatbuffer specs for each feature. On the one hand, those conceptually exist anyway -- but can be handled by "if" statements easily instead of completely separate flatbuffers specs. (e.g. consider the "thumbnail" example in the "protocol expansion" section -- with msgpack/cbor2 one can do something like "if thumbnails on, include thumbnail in the Offer message" whereas with flatbuffers you'd need OfferThumbnail, OfferNoThumbnail etc (multiplied by all features with optional fields).

dilated-file-transfer.md Outdated Show resolved Hide resolved
```python
class FileOffer:
filename: str # filename (no path segments)
timestamp: int # Unix timestamp (seconds since the epoch in GMT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timestamp the mtime from stat()? If so, may be we can add some text in that regard?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah .. added some words near here to clarify that.

Comment on lines +217 to +218
Peers MUST remember the stat information (at least size + modification-time) of each file in the offer.
When sending file data peers MUST compare the current size + modification-time to that recorded when making the offer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so looks like the above timestamp value is indeed mtime.

filename: str # filename (no path segments)
timestamp: int # Unix timestamp (seconds since the epoch in GMT)
bytes: int # total number of bytes in the file
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking about file permissions here. Do we want to also send file permissions (the mode bits in stat()) or whether we should just rely on the user's umask for the file creation permissions and attempt to use a sane default like 0644 or something. The current file transfer protocol does not send file permissions. For directories, I think it is left to the clients on the permissions of the base directory. The python client currently uses the zipfile module to extract the zip and each file's permission stored in the extended bits in the file header for each file in the zip is used to restore the original permissions. Storing permission bits in extended bits is not really specified by zip spec, just that zipfile module uses those bits for this purpose.

Just thinking aloud here, perhaps a file/directory transfer program shouldn't bother about permissions etc and just leave it to the client's discretion..

Copy link
Contributor

@felinira felinira Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't stop with permissions though. There are also access control lists (acl). It should support either nothing or everything. Maybe it could be an optional extension.

Permissions are also different for every platform. Windows doesn't use the POSIX permission model at all for example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This is kind of a complex topic and gets a little bit into the "philosophy" here -- that is, this is "a file-transfer protocol" at heart (and not, for example, a "filesystem synchronization protocol" such as rsync).

I did consider for example something like encoding "execute": True or similar (e.g. "some" of the permissions) but even that seems a bit open for abuse. On the flip side, if you're sending a bunch of shell scripts from one computer to another it might be boring/surprising to have to chmod +x *.sh after -- but ultimately safer.

And, as @felinira notes, different platforms have different permissions (and other attributes) available.

These same philosophic considerations come into play with the "symlink" question too -- where the classic protocol takes a more "file transfer" approach versus "synchronize filesystems" (by sending the contents of links).

Using that as guidance, it probably makes sense to not try and transmit permissions? I'm still a little torn about "execute" just from a convenience point of view, but ...overall I think pushing this to "possible future "feature"" seems best.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking further, even something "simple" like transferring "group read permission": it's quite likely that the user on one computer is a different user (with different group-membership) on the other computer -- and, it could even be an OS / filesystem that doesn't even have the concept of "group".

So, yet more reasons to not try and transfer these sorts of attributes. Client implementations should be encouraged to select good defaults for their OS/etc conditions.

@meejah
Copy link
Member Author

meejah commented Jul 14, 2022

I have a very rough proof-of-concept implementation in the Python client and here post updated Automat-produced state-machine diagrams. These are obviously tied to a particular implementation but I think are still generic enough to be useful.

In case it's not obvious, a DilatedFileSender machine is made for each offer and a corresponding DilatedFileReceiver is created on the receiving side (in response to a subchannel open).

This only supports File transfers right now, not directories or any control messages (i.e. "text messages").

wormhole dilatedfile DilatedFileTransfer m dot
wormhole dilatedfile DilatedFileSender m dot
wormhole dilatedfile DilatedFileReceiver m dot

@meejah
Copy link
Member Author

meejah commented Sep 20, 2022

Checking back in here, although the PoC code doesn't support multiple files yet I think we should decide on magic-wormhole/magic-wormhole-mailbox-server#31 before finalizing this transfer spec (and, perhaps that ticket would really be better in this repository in hindsight).

magic-wormhole/magic-wormhole-mailbox-server#31

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants