MSC3468: MXC to Hashes #3468

197 changes: 197 additions & 0 deletions proposals/3468-mxc-hash.md
Member:

This MSC conflicts with the ideas of #3911

@@ -0,0 +1,197 @@
# MSC3468: MXCs to Hashes

Currently, matrix media/content repositories work with an MXC-to-blob mapping, fetching the media
from the domain embedded in the MXC to present it to the user.

However, this becomes a problem when media retention, redaction, and resiliency come into play:
the singular MXC URI becomes a point of failure once the backing server retracts the URI, either
deliberately (the aforementioned redaction) or accidentally (via a server reset, or loss of the backing media).

This is in opposition to how MXCs are used in matrix today, much like Discord media URLs:
treated as immutable and always online, with links copied and reused across rooms.

## Proposal

I propose that MXCs be reworked into pointers to hashes.
Member:

Hashes cause problems when we want to delete media: because media is referenced only by itself, without the context of an event, we need a unique identifier to allow users to delete their uploaded copy of the media. This further plays into terms-of-service concerns, where typically the user holds the intellectual property rights to their upload, which may not be the case with a shared identifier.

For further context, matrix-media-repo originally used hashes as identifiers but quickly moved away from that, both to preserve those intellectual property rights and to ensure that people will be able to delete their own uploads in the future.

Contributor Author:

Hashes can be garbage-collected once no locally-known MXC points to them. In that respect, they only serve as a performance detail: they make sure multiple MXCs don't duplicate the same data on disk, and/or duplicate the same data in transit.

I don't exactly understand what you mean by property rights in practical terms: what's to stop someone from downloading the media behind an MXC from another server and re-uploading it? In that respect, /clone is the exact same operation, only more performant.

Users can still delete media IDs; it's only that, if their media has been copied somewhere, the underlying hash may not be garbage-collected on other servers. I think that's indeed a problem when thinking about property rights and copyright, but in practice it makes no difference, as media is already copied, cached, and downloaded in exactly the same fashion across the federation. If anything, this would give more tools to police media, as media can then be banned by its hash on local servers, and shared moderation lists can then propagate bans across multiple servers.

If the argument is that shared identifiers like hashes aren't an option because of copyright issues, then that's practically unenforceable, and thus moot: MXCs already act as such a "shared identifier", since servers can query media from a proxy server by an MXC and get back its local cache.

Member:

I don't really understand the purpose of the hash, especially on the client-server side. Can you explain why a client would care about the hash of a file?

Also, it seems to me like this MSC is proposing two independent things: exposing the hash of an mxc:// url, and allowing for cloning of media. I don't really see them as being related.

Contributor Author:

This MSC formally transforms MXCs into aliases to content hashes, and the clone operation just copies the hash in a performant manner; I think one would not make sense without the other. (Why have an MSC that only changes MXCs to be hash-based? What would be the purpose of that? And if I proposed a clone endpoint only, it would be of dubious utility without being as low-cost as it is here.)

> Can you explain why a client would care about the hash of a file?

Clients might care about hashes for moderation, de-duplication, or debugging purposes.

A bot like mjolnir could submit a hash to a shared ban list, marking it as a "known" bad image, so that not even cloning can propagate it.

The clone endpoint then exists to let clients easily reference media under a new MXC when forwarding messages to new rooms, or, for whatever other reason, to preserve the underlying media for longer than the original URI would have.

Member:

I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

> Can you explain why a client would care about the hash of a file?

> Clients might care about hashes for moderation, de-duplication, or debugging purposes. ...

So, it might be useful for clients to be able to easily get a hash for a file, but I don't think that they need to care about the hash used internally by the server.

Contributor Author:

> I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

The file is already cloned, but this is also about easy deduplication and resiliency: files don't have to be copied and uploaded/downloaded under a whole new "ID" every time they are cloned, which reduces federation traffic and disk space needed, with the added benefit of resiliency (if the server the file was originally uploaded from goes down).

> So, it might be useful for clients to be able to easily get a hash for a file, but I don't think that they need to care about the hash used internally by the server.

The "hash" here is also the identifier, so you're essentially asking the server "hey, that namespaced identifier? what is the underlying shared identifier for that?", in this case the hash. It might be interesting to ban or tag that shared identifier for moderation purposes, and/or publish it to public warn lists, for things like known abusive material (if it is entirely unmodified).


Also, utilising hashes could have the benefit of deduplicating even on download->reupload, if there is no pre-processing on the client or server side. Say, for example, Alice uploads a file on server A, Bob downloads it on server B and posts it to a public list, and Charlie likes the file and uploads it to their own server C, in a room shared with servers A and B. Now A and B only have to ask C what the file's hash is, and if it is one they already have locally, they can serve it from disk instead of downloading the file separately from C.

Member:

> I don't understand how the clone operation can just copy a hash. Surely if you want to clone a file from a remote server, you'd want to copy the whole file locally. And if you're cloning a local file, then I don't think that anyone should care what mechanism you're using internally to deduplicate.

> The file is already cloned, but this is also about easy deduplication and resiliency: files don't have to be copied and uploaded/downloaded under a whole new "ID" every time they are cloned, which reduces federation traffic and disk space needed, with the added benefit of resiliency (if the server the file was originally uploaded from goes down).

Right, but if you're cloning a file that your server has already copied, then it doesn't have to refetch, but nobody should care what method it's using to de-duplicate. That was what my second sentence was trying to get at.

> Also, utilising hashes could have the benefit of deduplicating even on download->reupload, if there is no pre-processing on the client or server side.

My main issue here is that in this MSC, you're saying that each file will be identified by one single hash, but we don't have to stick with one single hash per file, and it doesn't have to be tied to how the media repository does deduplication. I'm not saying that hashes are a bad idea; I'm saying that we don't need a concept of "what is the hash for this file?" Instead, we could just allow users/servers to ask, for example, "What is the sha256 hash for this file", or "What is the sha512 hash for this file", or whatever other algorithm the server supports, and the server that's being asked could calculate the hash on demand, or store it in advance, or do whatever. The requester shouldn't really care how the server does it internally; all it really needs to care about is whether the server supports the hash algorithm that we want.

Contributor Author:

> Instead, we could just allow users/servers to ask, for example, "What is the sha256 hash for this file", or "What is the sha512 hash for this file", or whatever other algorithm the server supports.

I already tackle this in the proposal (see "Which hash?"), and propose that multihash be used (so there is a single format with which all hashes can be auto-identified). Your suggestion to fetch algorithm-specific hashes instead of "the" hash is a good one, though.

> but nobody should care what method it's using to de-duplicate.

Alright, good point, but I'm just saying that it's easiest for the server to simply copy the hash received from the other server onto a new local MXC.

However, in conjunction with the first response, I think I should relax that wording: a server might "know about" a hash algorithm (let's say MD5) but be unable to resolve media by it, so it cannot download the remote file, hash it, and link an MXC with that hash. In the same vein, it cannot fully "trust" the remote server's hash (or it shouldn't, anyway).

So, what about the following: the server internally represents an MXC with one or multiple hashes, with which it resolves and verifies the file (implementations could have a "master" hash with which to easily key files, but those are details). Other servers can request hashes (one "requested" type, plus multiple "understood" types), and the server must return the requested hash type (plus any "understood" ones it has cached locally). The requesting server can then fetch the file via this hash+type, and verify it locally.

The benefit of all of this is that it allows describing the "required-to-provide-on-demand" hashes (sha256 and sha512 for now), which an implementation must always support, while still letting implementations experiment or work with additional hash types.

This is probably a bit of over-engineering, and I probably have to scale it down, but the basic idea (a file represented by multiple hashes, and an MXC linking to those hashes) would be the core of the proposal.

This could potentially also (accidentally) solve another problem: collisions. A paranoid server could verify the file against multiple hash types before serving it to the user.

(I'm increasingly moving away from multihash as I think about this, though; it's much easier and more extensible to reinvent it as a tuple of (str, str), i.e. (type_id, hash_hex).)
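
For concreteness, a rough sketch of what that tuple representation could look like (purely illustrative, not proposal text):

```python
import hashlib

# Rough sketch: represent a file's hashes as (type_id, hash_hex) tuples
# instead of a self-describing multihash.
def hash_tuples(blob: bytes) -> list[tuple[str, str]]:
    return [
        ("sha256", hashlib.sha256(blob).hexdigest()),
        ("sha512", hashlib.sha512(blob).hexdigest()),
    ]
```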

Member:

> ...but I'm just saying that it's easiest for the server to simply copy the hash received from the other server onto a new local MXC.

It may be, but I'd consider it an implementation detail.

> So, what about the following: the server internally represents an MXC with one or multiple hashes, with which it resolves and verifies the file (implementations could have a "master" hash with which to easily key files, but those are details). Other servers can request hashes (one "requested" type, plus multiple "understood" types), and the server must return the requested hash type (plus any "understood" ones it has cached locally). The requesting server can then fetch the file via this hash+type, and verify it locally.

I don't think we need an endpoint to get a file by its hash. I think all we need is an endpoint to get the hash for a file, given the hash algorithm. Then, when a client/server sees an mxc: that it doesn't have locally, it can ask the origin server "What's the sha256 hash for this mxc:?" (Or "What's the sha3 hash for this mxc:?" if it's decided that sha256 isn't good enough.) Then it receives the hash, and checks if it already has that, and if so, it can reference the file it already has. If not, then it can just fetch the file using the original mxc:. So that's just one new endpoint instead of two.

This avoids the problem of server A using internally sha256 for deduplication, but server B not trusting sha256 and wanting to use sha3 instead, since the internal hash algorithm isn't exposed at all.
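
As a rough sketch (the `get_hash`/`download` helpers are just placeholders, not proposed API):

```python
import hashlib

# Sketch of the single-new-endpoint flow: ask the origin for the hash of an
# mxc:// in a named algorithm, reuse a local copy if that hash is already
# known, otherwise fall back to fetching by the original mxc:// as today.
def resolve(local_by_hash: dict[str, bytes], origin, media_id: str) -> bytes:
    digest = origin.get_hash(media_id, algorithm="sha256")  # placeholder call
    if digest in local_by_hash:
        return local_by_hash[digest]
    blob = origin.download(media_id)                        # existing download path
    if hashlib.sha256(blob).hexdigest() != digest:          # sanity-check the reported hash
        raise ValueError("origin-reported hash does not match downloaded content")
    local_by_hash[digest] = blob
    return blob
```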

Contributor Author:

Alright, I think we've come to a conclusion then:

- Remove the fetch-by-hash endpoint
- Keep the current fetch-hash-for-mxc endpoint
- Note the intended and enabled behaviour (internal deduplication, P2P offloading, vhost efficiency, clone efficiency) without mandating anything specific
- Note down sha256 as a required hash, and sha3 as an optional hash, for now, with possible expansion in the future (with further MSCs)
  - Though I think a few others could be added in this MSC directly, following some feedback

Member:

Might as well add sha512 as an optional hash too. Or maybe required? I think any library that gives you sha256 will also give you sha512.


This gives the extra benefit of decoupling aliasing pointers (such as MXCs) from the underlying media.

Alongside this change, I also propose an additional client-side endpoint which can quickly "clone"
an MXC. This is done by having the server look up the MXC's hash,
and then create a new MXC also referencing that hash.

The client-server content API would expose a method for the client to retrieve the hash of a
particular MXC, alongside the aforementioned method to clone it.

The server-server content API would add dedicated methods for fetching the hash for an MXC, and for
fetching the media for a hash.

### Specification

#### Client-Server

This proposal would like to add the following two methods to CS:

```
POST _matrix/media/v4/clone/{serverName}/{mediaId}

Rate-limited: Yes
Authentication: Yes

Responses:
200: JSON (see below)
429: Ratelimited
503: Could not fetch remote MXC-to-hash mapping
```
200 response:
```json
{
  "m.clone.mxc": "mxc://local.server/media_id"
}
```
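
For illustration, a client could call this roughly as follows (a non-normative sketch; the homeserver URL, access token, and media ID are placeholders):

```python
import requests

HOMESERVER = "https://local.server"      # placeholder
ACCESS_TOKEN = "syt_example_token"       # placeholder

# Clone a remote MXC onto the local server and read back the new mxc:// URI
# from the proposed m.clone.mxc field.
resp = requests.post(
    f"{HOMESERVER}/_matrix/media/v4/clone/remote.server/some_media_id",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
new_mxc = resp.json()["m.clone.mxc"]     # e.g. "mxc://local.server/media_id"
```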

```
GET _matrix/media/v4/hash/{serverName}/{mediaId}

Rate-limited: Yes
Authentication: Yes

Responses:
200: JSON (see below)
429: Ratelimited
503: Could not fetch remote MXC-to-hash mapping
```

200 response:
```json5
{
  "m.mxc.hash": "1234567890abcdef" // hex-encoded hash
}
```
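
For illustration, a client could compare a copy it already holds against the reported hash (a non-normative sketch; placeholder values, and sha256 is assumed even though the exact algorithm is still open, see "Which hash?" below):

```python
import hashlib
import requests

HOMESERVER = "https://local.server"      # placeholder
ACCESS_TOKEN = "syt_example_token"       # placeholder

resp = requests.get(
    f"{HOMESERVER}/_matrix/media/v4/hash/remote.server/some_media_id",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
reported = resp.json()["m.mxc.hash"]

# Compare against a locally held copy of the same media (sha256 assumed).
with open("local_copy.bin", "rb") as f:
    local = hashlib.sha256(f.read()).hexdigest()

print("match" if local == reported else "mismatch")
```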

#### Server-Server

This proposal would like to add the following two endpoints to S2S:

```
GET _matrix/federation/v?/media/hash

Rate-limited: No
Authentication: Yes

Query parameters:
media_id: string, the local part of an MXC for which the hash is queried

Responses:
200: Pure-binary encoding of corresponding hash
404: Media ID does not exist
```

```
GET _matrix/media/v?/media/fetch/{hash}

Rate-limited: Yes
Authentication: Yes

Responses:
200: Blob of data corresponding to hash
404: Hash-media not found
429: Ratelimited
```
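
Taken together, a receiving server might use these two endpoints roughly as follows (a non-normative sketch: `federation_get` stands in for a real federation HTTP client, the hash is handled as hex for readability even though the hash endpoint above returns raw binary, and sha256 is assumed):

```python
import hashlib
from typing import Callable

# Illustrative S2S flow: ask the origin for the hash behind a media_id, serve a
# locally cached blob if that hash is already known, otherwise fetch the blob
# by hash and verify it before caching. `federation_get` is a hypothetical
# helper that performs an authenticated federation GET and returns the body.
def fetch_media(
    store: dict[str, bytes],
    media_id: str,
    federation_get: Callable[[str], bytes],
) -> bytes:
    digest = federation_get(f"/_matrix/federation/v?/media/hash?media_id={media_id}").hex()
    if digest in store:
        return store[digest]                 # deduplicated: no blob transfer needed
    blob = federation_get(f"/_matrix/media/v?/media/fetch/{digest}")
    if hashlib.sha256(blob).hexdigest() != digest:
        raise ValueError("hash mismatch: refusing to cache")
    store[digest] = blob
    return blob
```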

### "Which hash?"

*Note: this is an area for feedback; this will be removed in the final draft*

So far, the definition of "hash" has been vague. I think converging on a specific hash function now
could be a lock-in that limits future expansion.

So, I'd like to propose using [`multihash`](https://github.com/multiformats/multihash) for this
purpose; it provides a common, self-describing format for the hashes used.

For now, only a set series of hashes would be included (see
[here](https://github.com/multiformats/multicodec/blob/master/table.csv) for a full table), which
can be expanded or deprecated in subsequent matrix spec releases, without changing the format of
the hash, documenting checks to differentiate the hash types used, or reinventing multihash.

However, this is up for debate.
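
For illustration, a SHA-256 multihash is just the digest prefixed with two self-describing bytes (a sketch of the encoding, not a normative part of this proposal):

```python
import hashlib

def sha256_multihash(blob: bytes) -> bytes:
    digest = hashlib.sha256(blob).digest()
    # multihash layout: <varint fn code><varint digest length><digest>;
    # sha2-256 has code 0x12 and a 32-byte (0x20) digest, both of which fit
    # in a single varint byte.
    return bytes([0x12, 0x20]) + digest

print(sha256_multihash(b"example").hex())  # "1220" followed by the digest in hex
```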

## Motivation

This MSC wishes to unblock efforts for media retention and redaction;
- https://github.com/matrix-org/synapse/issues/6832
- https://github.com/matrix-org/matrix-doc/issues/701

With the addition of the `/clone` endpoint, any client wishing to preserve media can do so by simply
having it fetched/stored locally, reducing the linkrot effect that remote servers redacting media
could have.

This MSC also wishes to make matrix more flexible for diverse media delivery systems.

Mapping MXCs to hashes could allow the hashes themselves to become self-verifying keys in any
(centralized or distributed) KV store.

This, in turn, could prepare matrix better for P2P efforts.

This MSC also wishes to make matrix content delivery more resilient: apart from the mapping of an
MXC alias to a hash, a hash could be retrieved from anywhere and still be self-verifying,
considerably lessening the bus factor and allowing for better distributed load (see the first
"future extension" in the section below).

## Potential issues

This could incur a slight performance hit, as an extra RTT between servers is needed to fetch the
actual media after fetching the hash corresponding to that bit of media.

I think this is an acceptable tradeoff; an alternative would be to side-channel the hash in a
header on an endpoint that fetches directly from an MXC.

## Future extensions

*Note: this is free-form speculation, and serves to illustrate how future MSCs can extend the
behavior this MSC is enabling.*

A possible extension would be a server-server endpoint for requesting which content endpoints are
recommended for fetching hashes from.

(I.e. a server would ask `/media/endpoints`, and the server can respond with
`["https://common.caching.server", "https://matrix.org"]`, in decreasing order of priority)

This can be helpful when servers share a common "media server", as is the case today with
[matrix-media-repo](https://github.com/turt2live/matrix-media-repo), which "tricks" federation by
redirecting any request for media to itself. This future extension would formalize this process.

This would also help with "thundering herds", as servers can be redirected to multiple servers
from which to fetch the media for a hash.

(However, as-is, this could have security problems with DoS-ing, issues with cache invalidation
after redacting media, and possibly more. This is only to illustrate flexibility.)

Another possible extension could be to allow tapping natively into decentralized media stores, which
often key their data by hashes. This could make media P2P easier to implement and work with.

One last possible extension is to add `410` to every endpoint pertaining to fetching media; this could
help communicate to servers and clients that media has been deleted.

## Security considerations

A big part of this MSC's motivation is to unblock media redaction/retention efforts. However, that
does not mean this MSC should be blind to the struggle of containing unsavory media across
federation.

This MSC adds a `/clone` endpoint, by which a client, on any server, could easily "copy" media,
seemingly making containment efforts useless.

However, at a room level, and possibly a server level, hashes themselves could be banned. This can
be implementation-specific, or built into bots like mjolnir.
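
For illustration, such a ban could be as simple as a hash-set lookup before serving or cloning media (a non-normative sketch; how the list is populated, e.g. from shared moderation lists, is implementation-specific, and sha256 is assumed):

```python
import hashlib

# Hex-encoded sha256 digests of banned media, fed from local moderation tools
# or shared ban lists (illustrative only).
BANNED_HASHES: set[str] = set()

def is_banned(blob: bytes) -> bool:
    return hashlib.sha256(blob).hexdigest() in BANNED_HASHES
```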

## Unstable prefix

This MSC uses the unstable prefix `nl.automatia.msc3468`:

- `_matrix/media/nl.automatia.msc3468/clone/{serverName}/{mediaId}`
- `_matrix/media/nl.automatia.msc3468/hash/{serverName}/{mediaId}`
- `_matrix/federation/nl.automatia.msc3468/media/hash`
- `_matrix/media/nl.automatia.msc3468/media/fetch/{hash}`
- `nl.automatia.msc3468.clone.mxc`
- `nl.automatia.msc3468.mxc.hash`