MSC3468: MXC to Hashes #3468

Open
wants to merge 4 commits into base: main

Conversation

@ShadowJonathan (Contributor) commented Nov 3, 2021

Rendered

Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>

Preview: https://pr3468--matrix-org-previews.netlify.app

@ShadowJonathan changed the title MXCXXXX: MXC to Hashes → MXC3468: MXC to Hashes on Nov 3, 2021
@ShadowJonathan changed the title MXC3468: MXC to Hashes → MSC3468: MXC to Hashes on Nov 3, 2021
@turt2live added labels: client-server (Client-Server API), kind:maintenance (MSC which clarifies/updates existing spec), proposal (A matrix spec change proposal), proposal-in-review on Nov 3, 2021

## Proposal

I propose for MXCs to be reworked into being a pointer to hashes.
Member

Hashes cause problems when we want to delete media: because media is referenced only by itself, without the context of an event, we need a unique identifier to allow users to delete their uploaded copy of the media. This also plays into terms-of-service concerns, where typically the user holds the intellectual property rights to their upload, which may not be the case with a shared identifier.

For further context, matrix-media-repo originally used hashes as identifiers but quickly moved away from that, both to maintain those intellectual property rights and to ensure that in the future people will be able to delete their own uploads.

Contributor Author

Hashes can be garbage-collected once no locally-known MXC points to them. In that respect they only serve as a performance detail, making sure multiple MXCs don't duplicate the same data on disk, and don't duplicate it in transit either.
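
A rough sketch of what I mean by garbage collection, with made-up table names purely for illustration: a blob only becomes collectable once no local MXC entry references its hash.

```python
# Hypothetical in-memory tables, for illustration only:
# mxc_aliases maps a local media ID to a content hash,
# blobs maps a content hash to the stored file.
mxc_aliases = {"abc123": "sha256:deadbeef", "def456": "sha256:deadbeef"}
blobs = {"sha256:deadbeef": "/data/media/deadbeef"}

def collect_garbage():
    """Drop any blob that no locally-known MXC still points to."""
    referenced = set(mxc_aliases.values())
    for content_hash in list(blobs):
        if content_hash not in referenced:
            del blobs[content_hash]  # in practice: also delete the file on disk

# Deleting one MXC does not free the blob while another alias remains.
del mxc_aliases["abc123"]
collect_garbage()
assert "sha256:deadbeef" in blobs
```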

I don't exactly understand what you mean by property rights in practical terms; what's to stop someone from downloading the media behind an MXC from another server and re-uploading it? In that respect, that is the exact same operation /clone performs, only /clone is more performant.

Users can still delete media IDs; it's just that if their media has been copied somewhere, the underlying hash may not be garbage-collected on other servers. I think that is indeed a problem when thinking about property rights and copyright. However, I think it practically makes no difference, as media is already copied, cached, and downloaded in exactly the same fashion across the federation. If anything, this would give more tools to police media, as media can then be banned by its hash on local servers, and shared moderation lists can then be used to propagate bans across multiple servers.

If you want to say that shared identifiers, like hashes, aren't an option because of copyright issues, then that's practically unenforceable, and thus moot: MXCs already act as such a 'shared identifier', with servers being able to query media from a proxy server by MXC and have it returned from that server's local cache.
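
To illustrate the moderation angle (the ban-list format here is made up, not part of the proposal): once media is keyed by hash, a server can refuse to serve or re-clone anything whose hash appears on a shared list.

```python
# Illustrative only: a set of banned content hashes, e.g. synced
# from a shared moderation list that several servers subscribe to.
banned_hashes = {"sha256:0123abcd"}

def may_serve(content_hash: str) -> bool:
    """Refuse to serve (or accept a clone of) media with a banned hash."""
    return content_hash not in banned_hashes

assert may_serve("sha256:deadbeef")
assert not may_serve("sha256:0123abcd")
```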


## Proposal

I propose for MXCs to be reworked into being a pointer to hashes.
Member

I don't really understand the purpose of the hash, especially on the client-server side. Can you explain why a client would care about the hash of a file?

Also, it seems to me like this MSC is proposing two independent things: exposing the hash of an mxc:// url, and allowing for cloning of media. I don't really see them as being related.

Contributor Author

This MSC formally transforms MXCs into aliases for content hashes, and the clone operation just copies the hash in a performant manner; I think one would not make sense without the other. (Why write an MSC only to change MXCs to being hash-based? What would be the purpose of that? And if I proposed a clone endpoint on its own, it would be of dubious utility without being as low-cost as it is here.)

Can you explain why a client would care about the hash of a file?

Clients might care about hashes for moderation, de-duplication, or debugging purposes.

A bot like mjolnir could submit a hash to a shared ban list as a "known" bad image, so that not even cloning can propagate it.

The clone endpoint then exists to let clients easily reference media under a new MXC when forwarding messages to new rooms, or, for whatever other reason, to preserve the underlying media for longer than the original URI would have.
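
Roughly the shape I have in mind for the clone call; the path and response field here are placeholders, not the proposal's final wording:

```python
import requests

# Hypothetical endpoint and field names, for illustration only.
resp = requests.post(
    "https://example.org/_matrix/media/v1/clone/example.com/abcdef",
    headers={"Authorization": "Bearer <access_token>"},
)
# The server copies the stored hash onto a fresh local media ID,
# without the client re-downloading or re-uploading the file itself.
print(resp.json())  # e.g. {"content_uri": "mxc://example.org/ghijkl"}
```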

Member

I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

Can you explain why a client would care about the hash of a file?

Clients might care about hashes for moderation, de-duplication, or debugging purposes. ...

So, it might be useful for clients to be able to easily get a hash for a file, but I don't think that they need to care about the hash used internally by the server.

Contributor Author

I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

The file is already cloned, but this is also about easy deduplication and resiliency: files don't have to be copied and uploaded/downloaded under a whole new "ID" every time they are cloned. That reduces federation traffic and the disk space needed, with the added benefit of resiliency (if the server the file was originally uploaded to goes down).

So, it might be useful for clients to be able to easily get a hash for a file, but I don't think that they need to care about the hash used internally by the server.

The "hash" here is also the identifier, so you're essentially asking the server "hey, that namespaced identifier? what is the underlying shared identifier for that?", in this case that hash, and it might be interesting to ban or tag that shared identifier for moderation purposes, and/or publish them onto public warn lists, for stuff like known abusive material (if it is entirely unmodified).


Also, utilising hashes could have the benefit of deduplicating even on download->re-upload, if there is no pre-processing on the client or server side. Say, for example, Alice uploads a file to server A, and Bob on server B downloads it and posts it to a public list. Now Charlie likes the file and uploads it to their own server C, in a room shared with servers A and B. A and B then only have to ask C what the file's hash is, and if it is one they already have locally, they can serve it from disk instead of downloading the file separately from C.
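
To make that flow concrete, here's a rough sketch of the check a receiving server could do before fetching; the endpoint paths and JSON field are hypothetical:

```python
import requests

def fetch_media(origin: str, media_id: str, local_blobs: dict) -> bytes:
    """Ask the origin for the hash first; only download if we don't have it."""
    base = f"https://{origin}/_matrix/media/v1"  # hypothetical paths
    content_hash = requests.get(f"{base}/hash/{origin}/{media_id}").json()["hash"]
    if content_hash in local_blobs:
        return local_blobs[content_hash]      # already cached: serve from disk
    data = requests.get(f"{base}/download/{origin}/{media_id}").content
    local_blobs[content_hash] = data          # cache under the shared identifier
    return data
```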

Member

I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

The file is already cloned, but this is also about easy deduplication and resiliency: files don't have to be copied and uploaded/downloaded under a whole new "ID" every time they are cloned. That reduces federation traffic and the disk space needed, with the added benefit of resiliency (if the server the file was originally uploaded to goes down).

Right, but if you're cloning a file that your server has already copied, then it doesn't have to refetch; still, nobody should care what method it's using to de-duplicate. That was what my second sentence was trying to get at.

Also, utilising hashes could have the benefit of deduplicating even on download->re-upload, if there is no pre-processing on the client or server side.

My main issue here is that in this MSC, you're saying that each file will be identified by one single hash, but we don't have to stick with one single hash per file, and it doesn't have to be tied to how the media repository does deduplication. I'm not saying that hashes are a bad idea; I'm saying that we don't need to have a concept of "what is the hash for this file?" Instead, we could just allow users/servers to ask, for example, "What is the sha256 hash for this file", or "What is the sha512 hash for this file", or whatever other algorithm the server supports, and the server that's being asked could calculate the hash on-demand, or it could store it in advance, or do whatever. The requester shouldn't really care how the server does it internally; all it really needs to care about is whether the server supports the hash algorithm we want.
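
Something like this on the serving side, where the algorithm is part of the request and the server is free to compute on demand or cache; hypothetical code, not spec text:

```python
import hashlib

SUPPORTED = {"sha256": hashlib.sha256, "sha512": hashlib.sha512}
hash_cache = {}  # (media_id, algorithm) -> hex digest

def hash_for_media(media_id: str, algorithm: str, read_file) -> str:
    """Return the requested hash for a stored file, computing it on demand.

    read_file is a callable that returns the file's bytes for a media ID.
    """
    if algorithm not in SUPPORTED:
        raise ValueError("unsupported hash algorithm")  # would map to an error response
    key = (media_id, algorithm)
    if key not in hash_cache:
        hash_cache[key] = SUPPORTED[algorithm](read_file(media_id)).hexdigest()
    return hash_cache[key]
```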

Contributor Author

Instead, we could just allow users/servers to ask, for example, "What is the sha256 hash for this file", or "What is the sha512 hash for this file", or whatever other algorithm the server supports.

I already tackle this in the proposal (see "Which hash?"), and propose multihash to be used (to have a single format with which all hashes can be auto-identified). Your suggestion to fetch algorithm-specific hashes instead of "the" hash is a good one, though.

but nobody should care what method it's using to de-duplicate.

Alright, good point, but I'm just saying that it's easiest for the server to simply copy the hash received from the other server onto a new local MXC.

However, in conjunction with the first response, I think I should relax that wording, as servers might "know about" a hash algorithm (let's say MD5) but be unable to resolve files with it, so they cannot download the remote file, hash it, and link an MXC with that hash. In the same vein, a server can only "trust" the remote server's hash to a degree (or it shouldn't trust it blindly, anyway).

So, what about the following: the server internally represents an MXC with one or multiple hashes, with which it resolves and verifies the file (implementations could have a "master" hash with which they can easily key files, but those are details). Other servers can request hashes (one "requested" type, and multiple "understood" types), and the server must return the requested hash type (plus the "understood" ones it has cached locally). The requesting server can then fetch the file via this hash+type and verify it locally.

The benefit of all of this is that it allows describing the "required-to-provide-on-demand" hashes (sha256 and sha512 for now), which an implementation must always support, while allowing implementations to experiment or work with additional hash types.

This is probably a bit of over-engineering, and I probably have to pare it down, but the basic idea of a file represented by multiple hashes, and an MXC linking to those multiple hashes, would be the core of the proposal.

This could potentially also (accidentally) address another problem: collisions, where a paranoid server might verify the file via multiple hash types before serving it to the user.

(I'm increasingly going off multihash while thinking about this, though; it's much easier, and more flexible, to re-invent it as a tuple of (str, str), i.e. (type_id, hash_hex).)
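
i.e. roughly this shape, purely as an illustration of the tuple idea:

```python
import hashlib

def hashes_for(data: bytes) -> list:
    """Represent a file as a list of (type_id, hash_hex) tuples."""
    return [
        ("sha256", hashlib.sha256(data).hexdigest()),
        ("sha512", hashlib.sha512(data).hexdigest()),
    ]

# An MXC would then link to all of these, and a paranoid server could
# verify a downloaded file against every hash type it understands.
print(hashes_for(b"example file contents"))
```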

Member

...but I'm just saying that it's easiest for the server to simply copy the hash received from the other server onto a new local MXC.

It may be, but I'd consider it an implementation detail.

So, what about the following: the server internally represents an MXC with one or multiple hashes, with which it resolves and verifies the file (implementations could have a "master" hash with which they can easily key files, but those are details). Other servers can request hashes (one "requested" type, and multiple "understood" types), and the server must return the requested hash type (plus the "understood" ones it has cached locally). The requesting server can then fetch the file via this hash+type and verify it locally.

I don't think we need an endpoint to get a file by its hash. I think all we need is an endpoint to get the hash for a file, given the hash algorithm. Then, when a client/server sees an mxc: that it doesn't have locally, it can ask the origin server "What's the sha256 hash for this mxc:?" (Or "What's the sha3 hash for this mxc:?" if it's decided that sha256 isn't good enough.) Then it receives the hash, and checks if it already has that, and if so, it can reference the file it already has. If not, then it can just fetch the file using the original mxc:. So that's just one new endpoint instead of two.

This avoids the problem of server A using internally sha256 for deduplication, but server B not trusting sha256 and wanting to use sha3 instead, since the internal hash algorithm isn't exposed at all.

Contributor Author

Alright, I think we've come to a conclusion then:

  • Remove the fetch-by-hash endpoint
  • Keep the current fetch-hash-for-mxc endpoint
  • Note the intended and enabled behaviour (internal deduplication, P2P offloading, vhost efficiency, clone efficiency) without mandating anything specific
  • Note down sha256 as a required hash, and sha3 as an optional hash, for now, with possible expansion in the future (with further MSCs)
    • though I think a few others could be added in this MSC directly, following some feedback

Member

Might as well add sha512 as an optional hash too. Or maybe required? I think any library that gives you sha256 will also give you sha512.

Member

This MSC conflicts with the ideas of #3911
