
Store images and toots on IPFS to reduce inter-instance traffic and local caching requirements #21461

Open
gabrielbauman opened this issue Nov 23, 2022 · 39 comments
Labels: suggestion (Feature suggestion)

Comments

@gabrielbauman commented Nov 23, 2022

Pitch

In short, the idea is to cache user toots and attachments on IPFS.

  • Each Mastodon node would have a configured public IPFS gateway. This might be run by the Mastodon server operator or by someone else.
  • When a local user created a toot, Mastodon would store the toot and any attachments in IPFS using the configured gateway and record the content identifiers (CIDs) for each in its database.
  • When a remote server indicated support for IPFS, Mastodon would forward the IPFS CID for each resource instead of the content itself as part of the usual ActivityPub push.
  • When rendering the timeline on the remote Mastodon instance's client side (the remote user's browser), toots and attachments would be resolved from that instance's configured IPFS HTTP gateway by the end user's browser.

In this way, any server could render any toot and get any attachment by simply resolving the CID from any public IPFS gateway, which amounts to a simple HTTP call.
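
Roughly, the origin-side flow could look something like this (a minimal sketch in Python; the /api/v0/add RPC endpoint and the /ipfs/<cid> gateway path are standard Kubo/IPFS interfaces, but the hostnames and the Mastodon-side hooks are made up for illustration):

```python
import requests

IPFS_API = "http://127.0.0.1:5001"         # local Kubo RPC API (assumes the operator runs a node)
IPFS_GATEWAY = "https://ipfs.example.org"  # the instance's configured public gateway (hypothetical)

def add_attachment_to_ipfs(path: str) -> str:
    """Add a locally uploaded media file to IPFS and return its CID."""
    with open(path, "rb") as f:
        # Kubo accepts a multipart upload and returns JSON with the CID in the "Hash" field.
        resp = requests.post(f"{IPFS_API}/api/v0/add", files={"file": f})
    resp.raise_for_status()
    return resp.json()["Hash"]

def gateway_url(cid: str) -> str:
    """Build the HTTP URL a remote user's browser would use to resolve the CID."""
    return f"{IPFS_GATEWAY}/ipfs/{cid}"

# cid = add_attachment_to_ipfs("/tmp/photo.png")
# The ActivityPub push would then carry the CID (or gateway_url(cid)) instead of the bytes.
```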

Motivation

Currently, ActivityPub moves a lot of data around between instances. On a small instance with 30 users who follow some big accounts, I'm seeing about 1.5GB of daily server-to-server traffic. That's a lot, given that my users are pretty quiet.

A lot of this content is never seen by less-active users, but it's still transferred and cached because ActivityPub is push based. People don't scroll far enough to see all the stuff we cache for them, which means a lot of what is transferred is a waste.

By storing toots and attachments as IPFS resources,

  • ActivityPub S2S traffic would be much reduced in size; we'd be passing references to content instead of content.
  • Caching and propagation of toots and resources would be driven by actual user demand rather than simply who follows whom; users' browsers would pull content directly from the IPFS cache.
  • Toots and resources would be widely cached and available, even if the originating server vanished.
  • Old toots that weren't pinned on IPFS by the originating server would eventually become unavailable.
  • The originating Mastodon server would not be subject to waves of fetch requests for toots/attachments, as it would be if we passed references to content on the origin and resolved those on demand.

All of this would benefit server admins dealing with high bandwidth and caching requirements. It would benefit users by making toots widely resolvable regardless of whether the originating instance is up or down. And it could benefit IPFS by making each Mastodon node an IPFS node and making gateways more widespread; ideally, a public IPFS gateway would be installed adjacent to Mastodon as part of a standard install.

This would be implemented as an optional feature that would not break existing Mastodon instances.

@gabrielbauman added the suggestion (Feature suggestion) label on Nov 23, 2022
@gabrielbauman (Author) commented Nov 23, 2022

I hope I've been clear here, and please be gentle; I'm not an ActivityPub expert and my understanding of how Mastodon does things may be incorrect.

There are also some interesting possible side-effects to this scheme around toot ownership/signing and also server migration - if you change servers but prove ownership of your CIDs, everything just comes with you. CIDs would effectively become global, domain-agnostic identifiers for toots and attachments.

If there's interest, I'd be happy to contribute a branch implementing this as an optional feature.

@afontenot (Contributor)

Some potential downsides:

  • this would make reliably deleting anything "cached" impossible. IPFS hashes are immutable and as long as at least one instance kept the file alive, it couldn't be deleted. Not even through defederation. Now of course in theory someone could be capturing a copy of every public Mastodon post, but in practice having the canonical URL for each file / post be the user's own instance means that they have some measure of control over whether it remains on the Internet forever.
  • Suppose I follow you from another instance. You make a post on Mastodon and Mastodon sends the post to my instance. The post contains an image; the image (per the feature suggestion) would be an ipfs:// link. Mastodon would then have to do a lookup within IPFS and fetch the entire file so that it could proxy it on behalf of my web browser. Unfortunately, IPFS is in practice quite a lot slower than a HTTP GET request. This would slow federated posts down quite a bit.

I think the main thing in your suggestion that's providing the bulk of the traffic reduction is the idea of delaying fetching media contained in federated posts until a user actually wants to view the post. But why not just implement that without going to the trouble of doing it through IPFS? I think it sounds pretty reasonable! Of course this would also save local storage space. (Note: in practice trying to reduce traffic on Mastodon instances is undermined by having a single user viewing the federated timeline.)

On the other hand, if you assume that remote media is going to need to be cached locally, you could do deduplication without going to the trouble of IPFS. E.g. just hash the file.
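
For what it's worth, the "just hash the file" variant could be as small as keying the media cache by content digest; a rough sketch (the cache layout is made up for illustration):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/cache/mastodon-media")  # hypothetical cache location

def cache_media(data: bytes) -> Path:
    """Store media under its SHA-256 digest so identical files are only kept once."""
    digest = hashlib.sha256(data).hexdigest()
    target = CACHE_DIR / digest[:2] / digest
    if not target.exists():                      # already cached: storing again is a no-op
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```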

@victorb commented Nov 23, 2022

Was just gonna propose the same idea :)

I'd suggest splitting this into two parts though, one for general data like Toots and related things, and the other being the media cache.

The reason is that the complexity of implementing them is very different for the two parts, and the storage requirements are very different, at least on my personal instance.

My personal instance currently is using ~1.5GB for the actual DB, while ~32GB for the media cache. So for space saving purposes, getting the media cache into IPFS rather than stored directly on disk, would help a lot (not to mention GC would get simpler, just clear out all local objects and let them be re-cached on the fly next time).

Implementing the media cache first would probably also be simpler than the rest of the data, would be my guess at least.

@gabrielbauman (Author) commented Nov 23, 2022

Hi @afontenot, thanks for chiming in.

Some potential downsides:

  • this would make reliably deleting anything "cached" impossible. IPFS hashes are immutable and as long as at least one instance kept the file alive, it couldn't be deleted. Not even through defederation. Now of course in theory someone could be capturing a copy of every public Mastodon post, but in practice having the canonical URL for each file / post be the user's own instance means that they have some measure of control over whether it remains on the Internet forever.

This is true of any content on the Internet, particularly in the era of third-party CDNs like Cloudflare. Many Mastodon nodes are behind Cloudflare - mine included - and it reduces server bandwidth consumption by 80% or more. There's no way of knowing when CF might choose to prune content. Likewise, when people post to Instagram or Facebook or other heavily CDNed publishers, deleting content does not make it inaccessible.

Also, screenshots exist.

There are a few ways to make this a little bit more controllable; one might be using IPFS's IPNS service to "update" the "most recent" version of a cached resource, resulting in a "removed" payload.

  • Suppose I follow you from another instance. You make a post on Mastodon and Mastodon sends the post to my instance. The post contains an image; the image (per the feature suggestion) would be an ipfs:// link. Mastodon would then have to do a lookup within IPFS and fetch the entire file so that it could proxy it on behalf of my web browser. Unfortunately, IPFS is in practice quite a lot slower than a HTTP GET request. This would slow federated posts down quite a bit.

When a server received an ActivityPub push containing IPFS links, it could check IPFS to see if each IPFS image was resolvable on its configured IPFS gateway and force resolution if not. This would make the media available on that gateway.

I don't think this is a great idea though. Yes, the first request to IPFS for an image just added to IPFS would be slow, but future resolution requests would be much faster - the more people request the content, the quicker and more local IPFS content becomes. If we ensure that IPFS gateways used for mastodon connect to other mastodon-related IPFS nodes, we could speed up resolution by not waiting for propagation on the wider IPFS network.

And an IPFS gateway gives us an HTTPS API to work against that handles the ipfs:// operations. Keep in mind that the IPFS gateway could be run by a third party and shared by many Mastodon nodes, or Mastodon nodes could hit a round-robin of many IPFS gateways.
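
Concretely, the "check and force resolution" step could be as simple as asking the gateway for the CID (a sketch; the gateway hostname and the job wiring are hypothetical, while /ipfs/<cid> is the standard gateway path):

```python
import requests

GATEWAY = "https://ipfs.example.org"  # the instance's configured gateway (hypothetical)

def warm_gateway_cache(cid: str, timeout: float = 30.0) -> bool:
    """Ask the gateway for a CID; a successful fetch forces it to resolve and cache the content."""
    try:
        # A HEAD request is usually enough to trigger resolution without pulling the whole
        # body into this process; a GET would work as a fallback if a gateway ignores HEAD.
        resp = requests.head(f"{GATEWAY}/ipfs/{cid}", timeout=timeout, allow_redirects=True)
        return resp.ok
    except requests.RequestException:
        return False
```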

I think the main thing in your suggestion that's providing the bulk of the traffic reduction is the idea of delaying fetching media contained in federated posts until a user actually wants to view the post. But why not just implement that without going to the trouble of doing it through IPFS?

Yes, you could essentially send references to media and toots on origin servers via ActivityPub and have users grab them on demand, but then popular toots that hadn't been cached yet would result in waves of traffic to the origin, which would make ActivityPub push irrelevant. What we have now is better than that.

The actual idea here is to separate concerns. Let ActivityPub do notification, and let a caching protocol handle availability of resources.

@gabrielbauman (Author)

Hi @victorb!

I'd suggest splitting this into two parts though, one for general data like Toots and related things, and the other being the media cache.
Implementing the media cache first would probably also be simpler than the rest of the data, would be my guess at least.

Agreed that the media cache would be the easiest route here, and that's what my early experimentation has centered on. Simpler implementation, 90% of the benefit realized.

I do think there's room for making toots resolvable by any server on demand, or without talking to a server entirely, but that can wait.

@gabrielbauman (Author) commented Nov 23, 2022

There's also been some discussion of using WebTorrent instead of IPFS, which would actually allow web browsers to participate in content propagation, along the lines of what PeerTube does. It remains tangential to this issue, but I just wanted to acknowledge that there are other approaches on the table.

Also, some browsers (Brave) are including client-side IPFS nodes, which means the Mastodon webapp could resolve resources from the user's local machine...

@gabrielbauman (Author)

One other thing: Mastodon nodes receiving full-fat ActivityPub pushes could push media into IPFS, even if the origin didn't do it. Might be helpful during rollout of the feature.

@afontenot (Contributor)

So maybe I've misunderstood the suggestion here.

The way I'm looking at it: for a given instance, hold the number of posts its users want to view constant. For each post, either the content has to be cached (consuming space resources) or fetched (consuming bandwidth resources). IPFS would add global deduplication for files, although this could be implemented in other ways (e.g. the ActivityPub message could contain a hash). There's no way to avoid this fundamental tension between storage and network.

Rereading you, maybe you agree with this. But you think having Mastodon instances acting as a global CDN is a sufficient advantage to be worth implementing in its own right? That is, it's not the cache size or total Fediverse bandwidth you're concerned with, it's the distribution of that bandwidth between instances. Right now large instances see their bandwidth increase with every new instance that comes online, because as soon as any user on the new instance follows a user on the large instance, that's one more instance that posts have to be syndicated to. (This is indeed a problem I'm worried about, because it means that resource utilization for an instance with a constant number of users is not constant. Large instances may already be overpopulated under future growth even if they've closed registration!) So your solution to this is that instead of instances getting resources directly from the source instance, they request them on the IPFS network and get them from an instance that has already cached them. The total network utilization of the Fediverse doesn't change; the distribution of that utilization does.

Sounds reasonable enough to me. Though I wonder if very small instances can really handle their "share" of the bandwidth under this scheme. I think the practical blocker is going to be that maintainers are very unlikely to implement a protocol that has no plausible story for deletion. Cloudflare may cache resources indefinitely in theory, but in practice if I want to wipe my account at $INSTANCE, I can expect everything I've posted to be unresolvable within 24 hours. A few screenshots of my posts or saved content on individual computers doesn't change that, really, e.g. if my concern is "potential employer can look up things I've said".

To be clear I think the idea of a federated content syndication protocol for ActivityPub is a really good idea. Maybe even a necessary one if the network is going to be capable of reaching 100M+ users without major instances going down. But I don't know that IPFS specifically gets us there. The origin instance needs to remain authoritative for the content of the posts (which can be edited!), or the lack thereof (in case of deletion). You could implement something sort-of approximating the status quo by hiding IPFS links directly from Mastodon users, so that they become strictly non-authoritative resource identifiers used only inside ActivityPub messages. But I don't know if that will satisfy everyone. At bare minimum I think it's critical IPFS links be links to mere content, not containing any user identifiers.

For what it's worth, at this point I'm of the opinion that the "right" way to implement this would be to do an ActivityPub extension for providing alternative versions of attached content / media, and supporting servers could choose to prefer IPFS. It would have the additional advantage of being able to offer alternative image formats. Text should probably be left alone.

@gabrielbauman (Author) commented Nov 23, 2022

I appreciate you taking the time to parse the idea more closely.

From my perspective, ActivityStreams S2S appears to be a somewhat unfortunate blend of signaling protocol and content distribution protocol. I think those concerns need to be broken apart to reduce the load for smaller nodes. Pushing heavy content to Mastodon nodes that might need it, but in practice often don't, is wasteful and unnecessary.

Letting content distribution optimize based on end-user demand instead of "potential" demand would be a big win. Deduplication of resources being shared in multiple toots would be a big win. Decoupling resource caching from mastodon instances would be huge for smaller operators, who could use public IPFS gateways instead of running their own. And even in cases where a Mastodon operator ran her own IPFS gateway, the cache requirements might well be lower than what would have been preemptively stored by the existing system.

at this point I'm of the opinion that the "right" way to implement this would be to do an ActivityPub extension for providing alternative versions of attached content / media and supporting servers could choose to prefer IPFS.

Sounds about right. The extension should probably be as simple as representing an attachment to a toot as a "resolvable reference" - essentially a URL that returns the content - instead of attaching actual data. The HTTP scheme would be the only one we'd need to implement in this case, since IPFS gateways give CIDs HTTP URLs.
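
Purely as an illustration, the attachment could carry the canonical URL plus an optional alternate; none of this is an existing ActivityPub extension, and the field names, URLs, and placeholder CID are invented for the sketch:

```python
# Hypothetical attachment object: canonical origin URL plus an IPFS alternate.
attachment = {
    "type": "Document",
    "mediaType": "image/png",
    "url": "https://origin.example/media/photo.png",          # canonical, origin-hosted
    "alternates": [                                            # made-up extension field
        {
            "mediaType": "image/png",
            "href": "https://ipfs.example.org/ipfs/bafy...",   # gateway URL for the CID
            "rel": "ipfs",
        }
    ],
}
```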

We'd have to have some way of allowing servers to determine whether or not a remote Mastodon node wants resolvable references or prefers pushed content.

That way the foundations laid here wouldn't be IPFS-specific - maybe people would want to use a different host for their content or refer to content hosted elsewhere without hosting it locally.

We'd also need a hook and outbound queue for pushing locally uploaded media to an external store, whether by webdav or ssh or copying to a folder structure or posting it to an IPFS gateway.

@afontenot (Contributor)

From my perspective, ActivityStreams S2S appears to be a somewhat unfortunate blend of signaling protocol and content distribution protocol.

Agreed. Although I believe images are shared in AP just by a link back to the origin server. There's nothing forcing the receiver to fetch this additional content if it doesn't want to. Media doesn't ever get pushed directly, AFAIK?

Letting content distribution optimize based on end-user demand instead of "potential" demand would be a big win.

Yes, but the potential for this is limited (except on extremely small instances) by the presence of a firehose feed on Mastodon. Every single resource that the instance gets in its inbox needs to be fully rendered (including fetching remote images) if there's even one user viewing the "federated" feed. Maybe this indicates a need for a "low-resource" mode, where the firehose feed is turned off and remote resources aren't fetched ahead-of-time for users who haven't been active in 24 hours.

Decoupling resource caching from mastodon instances would be huge for smaller operators, who could use public IPFS gateways instead of running their own.

This is something that an operator could already choose to do, right? Just limit the cache size and refetch everything from the remote instance. This just reflects the fundamental tradeoff between network bandwidth and cache size I was talking about. If you don't cache a file and someone wants to view it, you've got to pull it over the network for them. Maybe some ops would prefer to tune more toward "smaller cache", but I don't think IPFS affects that either way. (Ignoring the deduplication part here.)

The HTTP scheme would be the only one we'd need to implement in this case, since IPFS gateways give CIDs HTTP URLs.

How could an IPFS-supporting instance know what to do with the link in this case? The "correct" gateway to specify would be the source instance's own gateway, but obviously for federated distribution to work the receiving instance needs to fetch it over IPFS directly if it supports that. You could try to parse the URL for /ipfs/<hash>, but that seems flaky.

We'd also need a hook and outbound queue for pushing locally uploaded media to an external store, whether by webdav or ssh or copying to a folder structure or posting it to an IPFS gateway.

A little confused by this. There are already mechanisms for pushing media to external hosts, e.g. some people host all their media content on S3 or Cloudflare's equivalent. As far as I know you can't "post" to an IPFS gateway (an HTTP gateway), as these gateways are read-only. It sounds like you want some method for automatically syndicating content to a network run by contributing instances, but as far as I know, IPFS has no built-in mechanism for this.

@SgtPooki

Related anecdote: https://better.boston/@crschmidt/109412294646370820

@afontenot (Contributor) commented Nov 27, 2022

Related anecdote: https://better.boston/@crschmidt/109412294646370820

Yeah, that's not great. Fixing the syndication problem described here wouldn't help with that, though. Maybe the preview card should be created on the origin server and sent by it to other instances as part of the ActivityPub message.

Edit: apparently the issue you described was previously discussed here: #4486 #12738

@volkriss

Quick note: IPFS does have a pubsub feature for broadcasting content that may be useful.

I know the feature was well into development, but I don't know what its status is now or even whether it proved feasible. I wanted to mention it because in theory it could apply to some of these use cases.

@fariparedes

this would make reliably deleting anything "cached" impossible. IPFS hashes are immutable and as long as at least one instance kept the file alive, it couldn't be deleted. Not even through defederation. Now of course in theory someone could be capturing a copy of every public Mastodon post, but in practice having the canonical URL for each file / post be the user's own instance means that they have some measure of control over whether it remains on the Internet forever.

While this is a valid privacy concern that I'm personally very sympathetic to, as previously mentioned, stuff that you post on the Internet almost immediately falls out of your control re: how long it is hosted and available. You can generally rely on archived data to rot eventually as it accumulates and older data is dropped to save on space, or for your data to simply not be important enough to archive (archive.org is probably not cataloguing every post on every Mastodon instance and is notoriously bad at fully preserving websites beyond text, for instance).

However, if I'm understanding the IPFS implementation correctly, stuff will eventually sunset out of it as well. Sure, by the nature of decentralization there's far less ability to remove stuff you don't want people to see anymore, but if you take something down, it shouldn't propagate through the IPFS network any further since nobody is trying to retrieve it anymore, and any nodes which have cached it will eventually be claimed by entropy the same as any centralized archive or cache. But I've only skimmed the documentation, so it's entirely possible I'm missing something like, say, IPFS nodes specifically sharing data with each other even when it's not requested, or a boost of a post that cached an image from your website for the preview causing that data to propagate after it has been taken down at the original source... if that makes any sense.

I think the main thing in your suggestion that's providing the bulk of the traffic reduction is the idea of delaying fetching media contained in federated posts until a user actually wants to view the post. But why not just implement that without going to the trouble of doing it through IPFS? I think it sounds pretty reasonable! Of course this would also save local storage space. (Note: in practice trying to reduce traffic on Mastodon instances is undermined by having a single user viewing the federated timeline.)

FWIW I don't think waiting a couple of seconds for a link preview to load in once the post is actually loaded on someone's feed is a big deal (which I believe was Gargron's concern with regard to this solution). Pop-in is a normal and accepted part of the web nowadays. Discord's media embeds take ages to load and nobody cares that much. If previews weren't served at all until specifically requested by the user, that would indeed defeat the point of having them. But just lazy-loading them seems like it would be workable, even though it wouldn't address the underlying issue, which is that Mastodon instances are duplicating all this effort because they don't communicate and, theoretically, can't even be trusted to communicate or to do so in an understandable way.

@afontenot (Contributor)

However, if I'm understanding IPFS implementation correctly, stuff will eventually sunset out of it as well. Sure, by nature of decentralization there's far less of an ability to remove stuff you don't want people to see anymore, but if you take something down, it shouldn't propagate through the IPFS network any further since nobody is trying to retrieve it anymore, and any nodes which have cached it will eventually be claimed by entropy the same as any centralized archive or cache.

The specific concern I had was about addressing. IPFS means that there's a unique (due to the hash size), permanent web address that corresponds to your content for anything you upload, forever. If I upload a photo of myself, that photo can be accessed at the URL. That's expected, all's well and good. But if I want that photo to go away, under IPFS I have no power over that. Any server anywhere in the world could decide to host that photo, and it would remain available at the same URL forever.

Under the current system, the canonical URL for any of my content is on my home server. That's what gets shared by default. If I ask my home server to delete it, it's gone; anyone who clicks the link will get a 404. They can go looking for a backup copy, but there might not be one, or it might not be publicly visible, or they might fail to find it through search, etc. With IPFS, anyone who wants to make sure my photo never goes away just has to pin it and the link will never 404.

@volkriss

The specific concern I had was about addressing. IPFS means that there's a unique (due to the hash size), permanent web address that corresponds to your content for anything you upload, forever. If I upload a photo of myself, that photo can be accessed at the URL.

The permanence of IPFS has been greatly exaggerated :)
IPFS does provide a unique address (not web address) for accessing a bit of content, but it absolutely does not guarantee that the content will be there when the address is requested. Many people miss that, and I know a lot of talk about IPFS gives that misleading image.

In this case IPFS is nothing but a caching proxy layer, and what you're describing has always been a feature or concern of such proxy systems. There's nothing new in IPFS in this regard. All caching proxies have always had that same feature or concern, whichever way you look at it, and we don't tear our hair out over it.

IPFS does have other issues that make it either right or wrong for this application, but the one you're worried about isn't really anything unique. I don't know if I'd say it's a solved problem, but it is one we've been comfortable working with.

@twilde commented Nov 28, 2022

In particular, it's worth noting that the situation described ("photo, once uploaded, can be accessed forever if pinned") is already the case with today's Mastodon media federation and caching model - once my server caches your photo, I can easily tell it to ignore any deletion requests and keep serving that photo forever.

@afontenot (Contributor)

it absolutely does not guarantee that the content will be there when the address is requested. Many people miss that, and I know a lot of talk about IPFS gives that misleading image.

In particular, it's worth noting that the situation described ("photo, once uploaded, can be accessed forever if pinned") is already the case with today's Mastodon media federation and caching model - once my server caches your photo, I can easily tell it to ignore any deletion requests and keep serving that photo forever.

Again, the issue is not "will this content live forever somewhere?" - the issue is "will this content live forever at a single canonical address that is the same address I originally made it available at?". That's an enormous difference, in my view. Imagine a world in which whenever one of my posts at mastodon.social 404ed, a search would automatically be performed in the caches of every other federated instance, and if even one of them had the data, it would be available at the mastodon.social URL. That's effectively what we're talking about here.

As I understand it, the whole point of IPFS is that there's a single canonical way to address a piece of content, and anyone can keep that content online for as long as they like just by pinning it. The status quo is that there are many ways of addressing a piece of content, and the user is in control over whether the canonical URL continues to point at that content. Or to put it differently, the issue isn't that I believe IPFS content never 404s, it's that the user who uploads the content has no control over whether it 404s at its canonical URL.

As I said above, it's not necessarily impossible to hide some aspects of this problem by burying IPFS inside ActivityPub as a caching layer, but I don't think it's no concern at all. And I suspect a lot of admins / maintainers / knowledgeable users will be wary of it.

@volkriss commented Nov 29, 2022

Again, the issue is not "will this content live forever somewhere?" - the issue is "will this content live forever at a single canonical address that is the same address I originally made it available at?".

And again, the answer to the question is no :)
Because that's not how IPFS works. IPFS does not make content live forever.

As I understand it, the whole point of IPFS is that there's a single canonical way to address a piece of content

Yes, lots of people misunderstand IPFS, and like I said above, given some of the outfits putting out misleading hype about the project, it's easy to see why people arrive at those misunderstandings.

No, IPFS has so many offerings, ranging from cryptographic and semantic handling of data through the active distribution of in-demand data, that canonical addresses are merely a means to other goals. Heck, I think IPFS even has multiple ways, not a single way, of addressing pieces of content.

But think about it: if you publish content with a URL, then the URL plus publication time is already a way of specifically addressing the content, even if you later pull the content offline. It's already a way of addressing content in caches. IPFS doesn't change anything there.

So again, I'm not even saying IPFS would work well for this application, but not for the reasons you're worried about. You're focusing on problems that exist regardless of IPFS, that IPFS wouldn't in the slightest bit change.

As a caching layer/CDN, IPFS is really nothing different from ones that have long been deployed, with no particularly new concerns. Heck, as I recall distributed web caches have even used hash addressing before to identify and trade information about the contents of their caches, so even the addressing isn't new!

@afontenot (Contributor) commented Nov 29, 2022

Again, the issue is not "will this content live forever somewhere?" - the issue is "will this content live forever at a single canonical address that is the same address I originally made it available at?".

And again, the answer to the question is no :) Because that's not how IPFS works. IPFS does not make content live forever.

I guess I'm not making myself clear enough. IPFS does not make content live forever at a given URL inherently. I understand that. IPFS takes it out of the hands of the user how long the content will live at the URL. That's the problem.

You're focusing on problems that exist regardless of IPFS, that IPFS wouldn't in the slightest bit change.

Before this change, the canonical URL for my selfie is https://myinstance.example/media/somepath.png. After switching to IPFS the canonical URL for my selfie is ipfs://<longhash>.png. Before the change, I can make my selfie unavailable by removing it (or asking my instance to do so). After the change I have zero power to make the file unavailable at the original URL. Before the change, if someone wants to locate my deleted selfie, they would have to dredge through caches on any instance they have access to, or see if someone screenshotted it. Not everyone who tries to find it will succeed. After the change, if someone wants my selfie to live forever, they simply pin it, and now every single person who looks will find it, simply by opening the original link to the file.

@volkriss

After switching to IPFS the canonical URL for my selfie is ipfs://<longhash>.png. Before the change, I can make my selfie unavailable by removing it (or asking my instance to do so).

That is absolutely not true.

And that seems to be a large part of your confusion.

When talking about caching not only is there no one canonical URL, but your removing your content from one access option doesn't necessarily remove it from another.

Again, this has nothing to do with IPFS at this point. These are just the standard, everyday issues of caching content to make it more available for people trying to access it, the same as they have always been.

Again, there's nothing new here. IPFS doesn't change anything with the things you are worried about. You're just plain wrong both about how IPFS functions and about how control of content has always worked on the larger internet.

I'm sorry to tell you, but you don't have any control over anybody else's content cache, and that holds no matter what system they might be using to cache that content.

You seem to have these pie in the sky notions about control over your own URLs, but that's just not how the internet works. It never has been.

@fariparedes commented Nov 29, 2022

I believe that what afontenot is trying to say is that when you access https://mywebsite.com/image.png, and everyone directly links to that, if you decide to take that image down, https://mywebsite.com/image.png will no longer resolve to your image. But if IPFS caches your image at ipfs://hash.png, and everyone automatically links to that because it's the decentralized cache, you can't take that down, and it will always resolve to your image as soon as you put the image up (until you delete the post and it falls out of cache, anyway). This is indeed a difference.

Whether or not that distinction is meaningful to people is another matter, especially given that the actual difference in terms of privacy between people archiving your image.png and IPFS continuing to cache your image that somebody pinned until that person falls off the network seems pretty minuscule. It's not one I can really speak to. But I don't think it's true to say that they don't understand how IPFS functions. They're just making a technical distinction which is not particularly practical.

@volkriss

But if IPFS caches your image at ipfs://hash.png, and everyone automatically links to that because it's the decentralized cache, you can't take that down, and it will always resolve to your image as soon as you put the image up (until you delete the post and it falls out of cache, anyway).

Except that this has nothing to do with IPFS since falling out of cache is the same for any caching system, whether IPFS or cached on an instance or anywhere else. His concern has nothing to do with IPFS. His concern has to do with caching in general.

He keeps talking about ownership of URLs and all of this other stuff that has nothing to do with how IPFS actually functions.

If he doesn't like caching, fine! But it's pretty darned frustrating that he's going in circles talking about IPFS if he's just generally opposed to the bog-standard notion that, yeah, sometimes it's nice to cache content because that really helps so much of the internet function so much more efficiently.

Caching is pretty standard operating procedure.
Don't put anything on the public internet that you wouldn't want to live in caches.
Once you shout into the public, you no longer have control of that content, no matter how much you might cling to your preferred URL.

@afontenot (Contributor)

@volkriss What, exactly, are your claims? Please make them as concrete as I have. Don't just call me wrong over and over without making claims that have the potential to be judged right or wrong themselves.

You seem to have these pie in the sky notions about control over your own URLs, but that's just not how the internet works. It never has been.

This is the sort of thing you've said a few times but I really just have no idea what you could mean. Trivially, of course, if a URL is https://domain.i.control.example/link, then I control the URL. If I delete some content hosted on my server that that URL points to, then that data will no longer be available at that URL. That's just trivially true, I think? I don't see how you could dispute that. Again, I've never claimed you can make assumptions about where your content will or won't be cached, or who will have downloaded a copy, or screenshots, or anything like that.

Likewise, it seems trivially true that if you are a member of a well-behaved instance, and you delete your content, then in a relatively short amount of time (let's just round it to 24 hours), no one who tries to access your content at the original URL will be able to do so. That is a measure of control, of user power. Do you dispute that users have control over their URLs in this sense?

Last, I'm claiming that if a user publishes some content at an IPFS URL and shares that with other instances, then absolutely nothing they can do at that point (short of brute force or legal threats) can guarantee that the same content won't be accessible at the same URL - 24 hours, 1 month, 1 year in the future. Because anyone in the world could pin that content and the user can never make that go away.

Once you shout into the public, you no longer have control of that content, no matter how much you might cling to your preferred URL.

This is one of those statements that's true in an extremely abstract sense but is completely false in reality. Sometimes links just 404 and you aren't able to find another copy, if one even exists. On the normal web a link I control (in the above sense) 404s when I want it to. On IPFS a link to content doesn't 404 unless everyone who pins or caches a copy deletes it.

Whether or not that distinction is meaningful to people is another matter, especially given that the actual difference in terms of privacy between people archiving your image.png and IPFS continuing to cache your image that somebody pinned until that person falls off the network seems pretty minuscule.

I think you've understood my point exactly, @fariparedes. I agree that this is not something that's going to matter to everyone, and in fact I've spoken positively in this thread about the idea of a federated CDN. But let me say something about why this probably matters to some people.

Suppose I post an embarrassing meme about my boss, think better of it, and delete it two minutes later. Currently, a few people might see it, but no one is especially likely to screenshot it, and if they do it's not like there's an enormous searchable archive of embarrassing screenshots out there. It might live in a few Mastodon caches for a few days but it's eventually going to disappear forever. Now, after the changeover to IPFS, making a post on Mastodon means distributing a permanent hash that references my content forever. Even after I delete it, even if all instances are well behaved, it will still be accessible at that hash for days until all the instances delete it. If anyone anywhere wants to keep it online at that hash, they can do that.

Again - I understand that someone could currently screenshot my post and put it on embarassingmastodonposts.example.org or what have you. But someone would have to go looking for something like that. Under the suggested IPFS scheme, however, the very link to my content that I share, the "canonical" reference, could forever point to the content I want gone. If someone shares that link, chances are non-nil that it's still going to be live by the time my boss sees it.

@volkriss

I really really don't want to belabor this responding to afontenot because I don't think it can go anywhere, but...

Yeah, you say something bad about your boss. CDNs and instances all pick it up and cache it and start serving it to other people, anybody who asks for it.

You delete your original content from your own web page.

But then you have no guarantee that the CDNs and instances will stop distributing what you said about your boss.

Notice the complete lack of reference to IPFS here. This has nothing to do with IPFS. It's just how the internet and caching works, and it has nothing to do with control over the original URL.

Your complaints, again, have nothing to do with IPFS. They have to do with a reality where so often things once said, and maybe broadcast, can't be unsaid.

All systems of caching, especially distributed systems, have this feature.

@afontenot (Contributor)

I really really don't want to belabor this responding to afontenot because I don't think it can go anywhere, but...

Yes, I agree that we've said about all that can be said here. I think I've made my points clear enough that anyone following this argument can decide on the merits themselves.

But then you have no guarantee that the CDNs and instances will stop distributing what you said about your boss.

The difference, of course, is that under the present system all anyone has is the canonical URL - some link to mastodon.social, for example. Armed with just that URL, there is no mechanism by which I can go around asking CDNs if they have a copy of the content. There are steps I could take to try to track something down, but it's hard work and I might easily fail.

In IPFS, the address is this mechanism. That's in part what IPFS is trying to achieve - persistence - by making sure that anyone can look up a piece of data so long as anyone has a copy. Finding a copy of a deleted post or media, if anyone has it, is as simple as clicking the link.

The only point I'm trying to make is about addresses providing easy access to cached (or pinned) items, not about the behavior of caches per se.

Your complaints, again, have nothing to do with IPFS. They have to do with a reality where so often things once said, and maybe broadcast, can't be unsaid.

As I see it, this misses the point. It's possible to have an expectation of privacy even if in theory someone could be recording me through my living room window with a telescope right now. This kind of abstract reasoning - "it's impossible to guarantee that no copy remains" → "it's okay to provide a mechanism by which anyone can guarantee that a copy remains and anyone can access that copy" - just doesn't work in practice, in my opinion.

@ineffyble (Member)

Related issues: #15545 #360 #7657 #6898 #477 #3715

@nileshtrivedi commented Dec 17, 2022

if a URL is https://domain.i.control.example/link, then I control the URL

Yes, and this enables certain attacks because not only can you remove the content, you can also tamper with it. @ineffyble demonstrated it today by customizing the shared image based on the instance that was requesting it.

Data can be location-addressed and content-addressed. The latter gives verifiability and availability - even if the original source disappears or malfunctions or misbehaves. The former gives us fast fetch as you don't have to find out which peer has a copy of it.

IPFS's multihash project has standardized many possible hash algorithms and schemes (including, but not restricted to, IPFS itself). What remains now is actually adopting it.

I think the simplest solution would be to change media URLs from https://instance.com/assets/filename.png to https://instance.com/assets/filename.{multihash}.png. Then the receiving instances can decide whether they want to fetch the file with a given multihash from instance.com or from somewhere else, and they can validate it in either case. Software without this ability can continue to work as is.
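
A sketch of the validation side under that scheme, using a plain SHA-256 hex digest in place of a full multihash (the URL layout is the hypothetical one above):

```python
import hashlib
import re
import requests

# e.g. https://instance.com/assets/filename.<sha256-hex>.png  (hypothetical layout)
HASH_IN_NAME = re.compile(r"\.([0-9a-f]{64})\.[a-z0-9]+$")

def fetch_and_verify(url: str) -> bytes:
    """Fetch media from any mirror and check it against the digest embedded in the URL."""
    match = HASH_IN_NAME.search(url)
    if not match:
        raise ValueError("no digest embedded in URL")
    expected = match.group(1)
    data = requests.get(url, timeout=30).content
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("content does not match embedded digest")
    return data
```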

@Phoenix616

But then you have no guarantee that the CDNs and instances will stop distributing what you said about your boss.

The difference, of course, is that under the present system all anyone has is the canonical URL - some link to mastodon.social, for example. Armed with just that URL, there is no mechanism by which I can go around asking CDNs if they have a copy of the content. There are steps I could take to try to track something down, but it's hard work and I might easily fail.

In IPFS, the address is this mechanism. That's in part what IPFS is trying to achieve - persistence - by making sure that anyone can look up a piece of data so long as anyone has a copy. Finding a copy of a deleted post or media, if anyone has it, is as simple as clicking the link.

The only point I'm trying to make is about addresses providing easy access to cached (or pinned) items, not about the behavior of caches per se.

Directly addressing a file via its hash is just one functionality of IPFS. As gabrielbauman mentioned earlier in the comments, one possible solution (and in my opinion the most straightforward one) is to not use pure IPFS CIDs (hashes) for sharing content with other instances but to use IPNS CIDs, which can be invalidated when deleting content. IPNS is a naming system built into IPFS which allows changing which hash a specific name points to, as long as you have the private key that created the name.

So when you delete content you just update the IPNS to point to a deleted notice/the empty file and all links on other instances will automatically stop working. (Unless they manually cached the file of course)
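
For what it's worth, that flow maps onto Kubo's name-publish RPC fairly directly (a sketch; /api/v0/name/publish is the real endpoint, but the key name and tombstone CID are placeholders, and each key would first have to be created with /api/v0/key/gen):

```python
import requests

IPFS_API = "http://127.0.0.1:5001"     # local Kubo node (assumption)
TOMBSTONE_CID = "bafy-placeholder"     # CID of an empty file / "removed" notice (placeholder)

def publish(key_name: str, cid: str) -> str:
    """Point the IPNS name owned by `key_name` at `cid`; returns the IPNS name."""
    resp = requests.post(
        f"{IPFS_API}/api/v0/name/publish",
        params={"arg": f"/ipfs/{cid}", "key": key_name},
    )
    resp.raise_for_status()
    return resp.json()["Name"]

# On upload:   ipns_name = publish("attachment-1234", real_cid)
# On deletion: publish("attachment-1234", TOMBSTONE_CID)
# Remote links to /ipns/<ipns_name> now resolve to the tombstone instead of the media.
```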

@gabrielbauman (Author)

I was laid off by the mother corp a few weeks back and dropped conversations while I got the interview ball rolling. Glad to see some great conversation here.

Regarding control over and deletion of cached resources, I'd just like to point out that when I post an image to my Mastodon instance right now, it's copied to other Mastodon instances. At that point, it's technically out of my control. I can delete the post, but there's no forcing other nodes to delete the content, which could well remain accessible and public for a long time.

If we did IPNS pointers to IPFS resources as @Phoenix616 and I mentioned, the originating Mastodon node would actually be able to quickly make posted content inaccessible at the gateway URL known by subscribed Mastodon instances. The content would then quickly age out of IPFS. It's as close to instant deletion as we can realistically get in a system that does caching.

So, where do we stand on this? Worth prototyping? I'm not familiar with this project's process or culture, don't know who to ping on this, and am just looking to see if an implementation would be welcomed.

@volkriss commented Jan 2, 2023

So, where do we stand on this? Worth prototyping? I'm not familiar with this project's process or culture, don't know who to ping on this, and am just looking to see if an implementation would be welcomed.

I'd LOVE to see someone experiment with this, so you'd hear nothing but support from me!

Like I [probably] said above, I have doubts about whether this IPFS integration would be practical in terms of things like performance, but I'd eagerly follow to see how well it works. If you have the time to work on it, I'd say by all means.

As for culture, I'm reminded that some instances have been heavily modified by their admins to add all sorts of functionality not present in the standard Mastodon. Perhaps it would be worthwhile to contact one of those admins to see what they think about best practices for trying new features? Maybe one would even be an official unofficial test instance for this sort of thing.

@amiyatulu commented Jan 10, 2023

Some potential downsides:

  • this would make reliably deleting anything "cached" impossible. IPFS hashes are immutable and as long as at least one instance kept the file alive, it couldn't be deleted. Not even through defederation. Now of course in theory someone could be capturing a copy of every public Mastodon post, but in practice having the canonical URL for each file / post be the user's own instance means that they have some measure of control over whether it remains on the Internet forever.

Regarding defederation, the Mastodon instance holds the CID, so if the CID is removed from the database, the content becomes effectively inaccessible, since hardly anyone can remember a CID hash. On the Internet we can hardly stop anyone from hosting anything, but we can stop providing links to the files through moderation. In current defederation, Mastodon doesn't stop the other server from hosting content; it just breaks the link. There are also other alternatives, like Storj, where files can be deleted.
For now, we could just provide separate URL-pasting links for images and videos, so that images and videos are rendered directly in the app rather than viewed by visiting the link. This would let journalists and news channels post large videos without overloading the server.

@holdenk commented Jan 16, 2023

For performance, Saturn seems to provide an interesting CDN for IPFS files (with HTTP GET).

@volkriss commented Jan 16, 2023 via email

@holdenk commented Jan 16, 2023

It's a decentralized gateway (it uses DNS load balancing to pick a gateway close to the user, like a traditional CDN).

@volkriss

It's a decentralized gateway (it uses DNS load balancing to pick a gateway close to the user, like a traditional CDN).

I really don't see the point of it for this use case, where the spike in access to content would drive distribution throughout the system using native IPFS features, without the need for complications like, in particular, digital currency/Filecoin.

In fact, by using gateways instead of IPFS nodes, the native IPFS distribution would be hampered.

Maybe as a web platform Saturn might offer some value over pure IPFS, but as the real-time caching layer being discussed here, it adds a layer of complexity and cost without much benefit.

@SethranKada

Wouldn't all this drama be solved by just encrypting the file before it's uploaded? Just send the decryption key along with the hash, and delete the key with mastodon's usual methods. That way, even if the file was archived, the owner can still effectively "delete" it. Once the file becomes impossible to open, it will naturally be dropped from ipfs.
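
A sketch of that idea using the `cryptography` package's Fernet recipe (the storage hooks are hypothetical; the point is that deleting the key is what "deletes" the pinned ciphertext):

```python
from cryptography.fernet import Fernet

def encrypt_for_ipfs(data: bytes) -> tuple[bytes, bytes]:
    """Encrypt media before adding it to IPFS; the key travels with the post, not the file."""
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(data)
    return key, ciphertext   # add/pin `ciphertext` to IPFS; store `key` alongside the post

def decrypt_from_ipfs(key: bytes, ciphertext: bytes) -> bytes:
    """Any instance that still holds the key can render the media."""
    return Fernet(key).decrypt(ciphertext)

# "Deleting" the post deletes the key; whoever pinned the ciphertext keeps only unreadable bytes.
```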

@volkriss

Wouldn't all this drama be solved by just encrypting the file before it's uploaded? Just send the decryption key along with the hash, and delete the key with mastodon's usual methods. That way, even if the file was archived, the owner can still effectively "delete" it. Once the file becomes impossible to open, it will naturally be dropped from ipfs.

I think this is a pretty good compromise! Send the key along with the metadata for the post, and the normal ActivityPub delete request would "delete" access to it as with any other content.

@j-adel commented Oct 20, 2023

I think this paper is very relevant to the discussion:

Woldu, Kifle, Nate Foss, and Matthew Pfe. "A Decentralized Secure Social Network." (2019)
