
Centralized seed services, inspired by WebSeeds #57

Open
mikeal opened this issue Nov 30, 2018 · 21 comments

Comments

@mikeal
Member

mikeal commented Nov 30, 2018

First, some background.

Bittorrent WebSeeds

Bittorrent has this great feature called "WebSeeds." It's rather simple: instead of having to spin up a cluster of bittorrent nodes to keep content alive, you can simply add an HTTP or FTP URL as a fallback location for the content.

Bittorrent clients try to pull content from other peers, but if none are available, nobody has a complete copy of the file, or the peers are just too slow, the client has the option of pulling the content from the centralized service.

This allows people to keep content up far more simply than they otherwise could if they had to manage a cluster of bittorrent nodes, without sacrificing the other benefits of bittorrent when the content is popular and the network around it is healthy.

Centralized Block and Graph services

The data-structures in IPLD offer some big benefits and upgrades compared to bittorrent. Instead of associating a specific URL with a specific file, we could have fallback services that are known to hold large caches of IPLD blocks.

These services could provide not just a fallback but also get us around certain performance penalties in IPFS/IPLD. For instance, take the simple case of pulling up a website for the first time using IPNS/IPFS:

  • IPNS resolves to a CID.
  • IPFS looks in the DHT for that CID and establishes a network.
  • As IPFS connects to peers it begins to pull through the graph to grab the content.

This is always slower for the first load than centralized solutions because it takes longer to establish the network to retrieve that content than it would to just connect to a central server.

However, if you could configure a block/graph service in IPFS it could look more like this:

  • IPNS resolves to a CID.
  • In parallel:
    • IPFS looks in the DHT for that CID and establishes a network.
    • Existing HTTP2 connections to block/graph services are queried for the CID.

Now you're already pulling in content while the network is being established, and you can continue to parallelize/optimize grabbing the content as the connections are made. This is the best of both worlds and could actually beat the performance of existing web fetches.
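The parallel flow above amounts to racing the sources and taking whichever answers first. A minimal sketch, where the source functions are hypothetical stand-ins rather than real IPFS APIs:

```javascript
// Sketch of the parallel lookup: race any number of block sources and
// resolve with the first one to return the block. The sources here are
// hypothetical stand-ins, not real IPFS APIs.
async function fetchBlock(cid, sources) {
  // Promise.any settles with the first fulfilled source, so an HTTP2
  // block service can answer while the DHT is still warming up.
  return Promise.any(sources.map((source) => source(cid)));
}

// Illustrative stand-ins: a slow "DHT" lookup and a fast block service.
const viaDHT = (cid) =>
  new Promise((resolve) => setTimeout(() => resolve(`dht:${cid}`), 100));
const viaBlockService = (cid) => Promise.resolve(`http:${cid}`);

fetchBlock('QmExample', [viaDHT, viaBlockService])
  .then((block) => console.log(block)); // the block service wins the race
```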

Applications that use IPFS could either configure a set of known services in the node or provide a list of seed services when they ask for specific pieces of content.

I'm thinking that there are two distinct sets of services:

  • Block Service (stores content by multihash)
  • Graph Service (stores content by CID, returns the Block data and meta information that includes whether or not the service contains the full graph referenced by the CID)
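The distinction between the two could look something like this; the field names are illustrative assumptions, not a settled spec:

```javascript
// A Block Service is keyed by multihash and returns only the raw bytes.
function blockServiceResponse(data) {
  return { data };
}

// A Graph Service is keyed by CID and additionally reports whether it
// holds the complete graph referenced by that CID.
function graphServiceResponse(data, hasFullGraph) {
  return { data, meta: { complete: hasFullGraph } };
}
```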

If something like this were available, it would make a lot of our infrastructure challenges much simpler. We could define a very simple REST API that is more or less compatible with S3, and people could literally just stick Cloudflare in front of it for global caching.
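The CDN-friendliness follows from content addressing: every URL maps to immutable bytes, so no cache invalidation is needed. A sketch of one possible URL layout (the path scheme is an assumption for illustration):

```javascript
// Hypothetical URL layout for an S3-compatible block service. Because
// responses are content-addressed and immutable, any HTTP cache
// (e.g. Cloudflare) can sit in front with no invalidation logic.
function blockURL(serviceBase, multihash) {
  // GET <base>/<multihash> returns the raw block bytes.
  return `${serviceBase.replace(/\/+$/, '')}/${multihash}`;
}
```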

Yes, this is a centralized service, but hosted IPFS clusters are also centralized services; they are just available in the DHT. As long as the data-structures are content-addressed, centralized services are no more than a caching layer and do about as much to "centralize" the data-structure as an offline cache does.

Thoughts? @daviddias @alanshaw @olizilla @vmx @eefahy

@vmx
Member

vmx commented Nov 30, 2018

Without thinking too hard about all implications, it sounds good to me.

@eefahy

eefahy commented Dec 1, 2018

If I understand the proposal correctly, once the request comes in we are essentially parallelizing the data fetch via IPFS and one of these backend services? That sounds pretty great to me. One concern is that this would deemphasize our need to make bitswap more performant and would complicate our metrics on bitswap performance on the gateway (which we haven't actually implemented yet).

@mikeal
Member Author

mikeal commented Dec 3, 2018

One concern would be that we deemphasize our need to make bitswap more performant and would complicate our metrics on bitswap performance on the gateway

Agreed on the concerns about performance, but I think the metrics we'd get from the block services could compensate for any lost gateway metrics. We could even estimate decent metrics for the parts of the p2p network we have a hard time measuring by looking at the difference between the "top of graph" pulled from the block service and the rest of the graph, which we can assume peers pull from the p2p network.

@Stebalien
Member

Stebalien commented Dec 4, 2018

Let's take a step back and consider where the real performance issues are. Given:

  1. IPNS resolves to a CID.
  2. In parallel:
    a. IPFS looks in the DHT for that CID and establishes a network.
    b. Existing HTTP2 connections to block/graph services are queried for the CID.

We hit the first snag in 1, not 2: IPNS, as it exists today, is really slow.

Really, it's slow for the exact same reason 2.a is slow: Peer-to-peer routing (well, as it exists today) is slow.

There are two things we can do here to improve the situation:

  1. Reduce the need for routing. For example, when we resolve an IPNS record, the record itself can include hints on where to find the content ("optimistic location addressing"). We can still look elsewhere, but at least we have a place to start. The record can even tell us where to look for newer IPNS records allowing us to cut IPNS lookups short (we currently spend quite a bit of time making sure we have the latest IPNS record). We can also aggressively cache location information.

  2. Introduce some minimal heterogeneous (i.e., some nodes do more than others) routing and use it optimistically (falling back on full p2p routing). Really, we can probably just have a caching DHT resolver (IIRC, js-ipfs really wants this and so does mobile).

Basically, we can use (2) to get the speed of a centralized service and (1) to reduce the load on (2). However, we'll still keep our fully p2p routing methods and continue to use them as-needed (giving us the best of both worlds).
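The "caching DHT resolver" idea in (2) could be sketched as a thin wrapper that answers provider lookups from a cache before falling back to a full DHT walk. `resolveViaDHT` is a hypothetical stand-in for the real routing call:

```javascript
// Sketch: cache provider records for a TTL, only hitting the slow p2p
// routing path on a miss or after expiry.
function cachingResolver(resolveViaDHT, ttlMs = 60000) {
  const cache = new Map();
  return async function resolve(cid) {
    const hit = cache.get(cid);
    if (hit && Date.now() - hit.at < ttlMs) return hit.providers;
    const providers = await resolveViaDHT(cid); // slow p2p path
    cache.set(cid, { providers, at: Date.now() });
    return providers;
  };
}
```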

Note, I didn't say centralized routing. We don't actually need a central authority because everything is signed/validated (although we do need to be careful about censorship).


Aside: WebSeeds.

Bittorrent uses webseeds for two reasons:

  1. Files served through bittorrent are usually served from some web server for non-bittorrent users anyways.
  2. Torrent files include enough information to verify blocks from webseeds without additional metadata.

This means that services offering torrent downloads don't need to run any software other than a webserver.

Running a special-purpose HTTP/2 REST server doesn't give us any advantage over bitswap/graphsync.

Now, we could, instead, build a filestore-like tool that:

  1. Chunks a file.
  2. Imports it into unixfs.
  3. Exports a file including (a) the DAG minus the leaf nodes and (b) the URL of the file. This file is equivalent to a torrent file with a webseed.

These could then be uploaded to a centralized service (i.e., a "tracker"). This way, users would resolve the CID to an "ipfs-torrent" file using the (trusted) tracker and then pull content from the web server.
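The export in step 3 could be sketched as a small manifest structure; the field names here are illustrative assumptions:

```javascript
// Sketch of the "ipfs-torrent" style export described above: the DAG
// minus its leaf nodes, plus the URL the leaf bytes can be fetched from.
function exportManifest(rootCid, innerNodes, fileURL) {
  return {
    root: rootCid,     // CID of the unixfs root
    dag: innerNodes,   // non-leaf nodes, enough to verify ranged leaf bytes
    webseed: fileURL,  // plain HTTP location of the original file
  };
}
```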

However, I'm not sure if this is really worth it (for now at least). Our motivation here was "faster content retrieval", not "making files available to IPFS without running IPFS".

We could define a very simple REST API that is more or less compatible with S3 and people could literally just stick Cloudflare in front of it for global caching.

IPFS is supposed to be able to operate in "proxy" mode. That is, bitswap should be able to forward wantlist requests. We don't currently support this but that would cover the caching case quite well.

Really, we desperately need this feature for mobile support.
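The proxy behavior could be as simple as the following sketch, where every function parameter is a hypothetical stand-in for the corresponding node operation:

```javascript
// Sketch of bitswap "proxy" mode: satisfy a want locally when possible,
// otherwise forward it upstream to another node.
async function handleWant(cid, { localHas, localGet, upstreamGet }) {
  if (await localHas(cid)) return localGet(cid);
  return upstreamGet(cid); // forward the wantlist entry upstream
}
```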

@mikeal
Member Author

mikeal commented Dec 4, 2018

In my initial post I conflated two very different threads of reasoning: reliability and performance. My bad. Going forward, I'll try to separate these more clearly.

Reliability

The primary purpose of WebSeeds is reliability. Most (maybe all) clients use them solely as a "peer of last resort." The same could apply to IPLD/IPFS. When WebSeeds were created it was already trivial to host files somewhere accessible via HTTP/HTTPS, and it has become almost as trivial to host arbitrary blocks on a service like S3 or one of its many competitors. It's a far more accessible solution for average developers today than running a cluster. That may change in the future, and we're certainly working on making it better, but in the meantime we have an incredibly large barrier to entry when using IPLD/IPFS for real-world applications.

Today, if you want your data to be reliably available you need to set up custom infrastructure with IPFS/IPLD, while with virtually any other approach to building applications you don't. To meet our short-term developer adoption goals we need to start making things easier on a timeline that lands within the year, not consistently back-burner them because they'll be solved at some point in the future.

Files served through bittorrent are usually served from some web server for non-bittorrent users anyways.

This may have been true for most uses of WebSeeds before Webtorrent, but I don't think it is true today. WebSeeds made Webtorrent far more useful to application and service builders because it gave them a single embed that was as bandwidth-efficient as possible but still reliable if they stored the file somewhere accessible. Many services built on Webtorrent wouldn't have been built if the only way to ensure reliability was to run a bittorrent cluster. This isn't a case where these files would be "served anyway"; the services simply would not exist were it not for WebSeeds.

Torrent files include enough information to verify blocks from webseeds without additional metadata.

While the delivery mechanism is different, this is also true of a BlockService + CID. Sure, you have to grab the data for that CID in order to work your way through the information needed to get the rest of the blocks and verify them, but if the delivery service for the data also provides the metadata in a verifiable way, I don't think it's materially different.

However, I'm not sure if this is really worth it (for now at least). Our motivation here was "faster content retrieval", not "making files available to IPFS without running IPFS".

This is precisely the point :)

Making files accessible to IPFS nodes without running a cluster service would greatly reduce the barriers developers currently have in keeping the files available.

Performance

We hit the first snag in 1, not 2: IPNS, as it exists today, is really slow.

I'm 100% supportive of improvements and changes to IPNS that would make it faster. I'd also love to reduce the need to wait for DNS propagation when the CID is changed.

Exports a file including (a) the DAG minus the leaf nodes and (b) the URL of the file. This file is equivalent to a torrent file with a webseed.

If you could produce a manifest of all the CIDs in the file, nested properly so that they could be verified as a connected graph as they are fetched, a block service could be just about as fast as a raw GET of a single file, assuming you had HTTP2 and could make all the block requests in parallel.

Keep in mind that, in the WebSeed case, the file is not always downloaded in a single GET. Often only parts of the file are unavailable in the network, so sections of the file are grabbed with range requests to fill in what the torrent network can't supply. This isn't all that different from atomic block gets and, with HTTP2, could be optimized quite a lot.
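The mapping from missing sections to range requests could be sketched like this; the chunk record shape (`offset`, `length`) is an assumption for illustration:

```javascript
// HTTP byte ranges are inclusive on both ends.
function rangeHeader(start, end) {
  return `bytes=${start}-${end}`;
}

// Given chunk records missing from the p2p network, produce the Range
// header value for each section to fetch over HTTP instead.
function rangesForMissing(chunks) {
  return chunks.map((c) => rangeHeader(c.offset, c.offset + c.length - 1));
}
```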

@Stebalien
Member

Stebalien commented Dec 4, 2018

I'd also love to reduce the need to wait for DNS propagation when the CID is changed.

?


This:

It's a far more accessible solution to average developers today than running a cluster.

Is in conflict with:

Instead of associating a specific URL with a specific file we could actually have fallback services which are known to have large caches of IPLD blocks.

The former is concerned with small developers who just want to store their data with Amazon. The latter sounds more like an "IPLD cache as a service" system (that could totally run an IPFS cluster with a click).

The former would likely be configured on a per-dapp basis (a dapp would configure a "backup" block exchange that just fetches blocks from S3). The latter would be a well-known centralized service which, IMO, doesn't really need to exist given bitswap.

Note: the former isn't going to help with things like page loading because we'd need to know which S3 bucket to use before we start loading the page.


Let's try to hone in on the problem a bit. What's the UX or DX flow you're trying to enable?

@mikeal
Member Author

mikeal commented Dec 5, 2018

The former would likely be configured on a per-dapp basis (a dapp would configure a "backup" block exchange that just fetches blocks from S3). The latter would be a well-known centralized service which, IMO, doesn't really need to exist given bitswap.

I should have been clearer. While this would open up the possibility of a large block cache, the primary use would be a developer configuring it for a single application. This could still be a very large amount of data in a multi-user application but, as you point out, is not a necessary solution for global caching outside the data in this specific application.

Let's try to hone in on the problem a bit. What's the UX or DX flow you're trying to enable?

I see the DX in an API like IPFS taking one of two forms:

  • ipfs.config('blockservice.myapp', URL)
  • ipfs.[namespace].get(key, value, {blockservice: URL})

This would only cover the read side of the equation. The write side can be handled in the application itself, as it will likely require some form of authentication. We should leave that to application developers for now; enabling it in IPFS itself would require re-thinking the guarantees a successful write gives you (did it finish the write to the remote block store?).

This will also open up another replication case (push to single peer) that we should let the ecosystem work on optimizing before we would include it in IPFS.
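The read-side fallback implied by either API form above could be sketched as follows; `local`, `http`, and the `{cid}` path layout are all hypothetical stand-ins, not real js-ipfs APIs:

```javascript
// Sketch: try the local store / bitswap first, then fall back to the
// configured block service over HTTP.
async function getWithBlockService(cid, { local, blockservice, http }) {
  try {
    return await local(cid);
  } catch (err) {
    if (!blockservice) throw err;       // no fallback configured
    return http(`${blockservice}/${cid}`); // peer of last resort
  }
}
```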

@Stebalien
Member

Stebalien commented Dec 5, 2018

So, what if, instead, we push features that better enable pinning services? A web-backed blockservice will work for individual apps but makes it much harder for apps to interoperate unless said blockservice also runs an IPFS node, making its blocks available over bitswap.

@mikeal
Member Author

mikeal commented Dec 5, 2018

So, what if we instead we push features that better enable pinning services?

We are doing this to some extent; it's why we have ipfs-cluster.

The problem is that we have adoption goals we are trying to reach on a much shorter timeline than we can expect ipfs-cluster to reach the state of adoption and availability we already get from existing cloud storage providers (S3, DigitalOcean, etc).

Adding block/graph services won't prohibit us from improving pinning services but waiting for pinning services to be widely available and easy to use will prohibit us from getting broader adoption in the next few years.

@Stebalien
Member

Stebalien commented Dec 5, 2018

Yeah, but if data gets siloed into per-app S3 buckets, we're not building dapps. We're going to have a very hard time convincing devs to use IPLD if we're not going to give them something substantially better than Firebase. Worse, I'm worried the system will end up depending on these per-app shared blockstores.

@mikeal
Member Author

mikeal commented Dec 5, 2018

Yeah, but if data gets siloed into per-app S3 buckets, we're not building dapps.

The data isn't completely siloed; any data in use by connected peers will be in the DHT. Essentially, one "mega-peer" is not in the DHT and not available to bitswap. This is already a better situation than we have with forked DHTs, where none of the connected peers for that app share data in the mainline DHT.

Sure, saying "you have to run a cluster to have reliable data" will make sure all the data used is available but it also means that many applications simply won't be written as a result, so the data from those applications is still never going to be accessible.

But I think there's a way we can still avoid these segmentation concerns with graph/block services. If we do not enable the second API I suggested and only enable the node-level configuration ipfs.config('blockservice.myapp', URL), we could find ways to use these nodes as proxies for all the data in the services and not just what is locally available on each node.

@daviddias
Member

daviddias commented Jan 29, 2019

Note for the readers: js-ipfs has supported S3-backed datastores for over a year now. Tutorial at:

@mikeal
Member Author

mikeal commented Jan 29, 2019

Additional note for readers: I detailed the gap between the current S3 storage backend and what this is proposing here protocol/pl-ipfs-team#24 (comment)

@eocarragain

eocarragain commented Jan 29, 2019

@mikeal that link seems to be to a private repo. Can you repeat the information here?

@mikeal
Member Author

mikeal commented Jan 29, 2019

@eocarragain apologies, here it is:


The problem with our current S3-backed storage is that it doesn't distinguish between read and write configuration. In the most common use cases you want:

  • A single provider writing data that is stored long term (the application provider).
  • Many people reading from long term (non-DHT discovered) storage.
  • Many people keeping local copies of the data in their own storage.
  • Bonus Points:
    • Content that is currently being stored by the peer network should be preferred over long-term storage.
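A configuration shape separating those roles could be sketched as follows; every field name here is an illustrative assumption, not an existing js-ipfs config key:

```javascript
// Sketch: one authenticated writer, many unauthenticated readers, with
// the peer network preferred over long-term storage.
const storageConfig = {
  write: { backend: 's3', bucket: 'myapp-blocks', auth: 'provider-only' },
  read: [
    { backend: 'peers' },                                    // prefer p2p
    { backend: 'http', url: 'https://blocks.example.com' },  // fallback
  ],
};
```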

@parkan

parkan commented Feb 1, 2019

also: S3 is not the sole backend type possible (Google Cloud Storage, DigitalOcean Storage, your own datacenter, etc)

using the S3 API has certain advantages, but the least common denominator for access is probably HTTP

@mikeal
Member Author

mikeal commented Feb 1, 2019

Yup, and all of these services have a similar enough HTTP interface that we could do the same thing across all of them if it’s read-only since only authentication differs between them.

@aschmahmann

aschmahmann commented Apr 25, 2019

While I see the utility of having a centralized service you could fall back on for getting content, it feels strange to be running these two systems in parallel and requiring end-user software to have preconfigured centralized services to use for every application (e.g. ipfs.config('blockservice.myapp', URL)).

An alternative that's close to this, but perhaps a little more IPFS friendly, would be to allow using Multiaddrs to indicate content routing over non-libp2p protocols (libp2p/notes#11). In particular, the flow from CID to data is:

Currently

  1. Search DHT for CID
  2. Receive a list of multiaddrs that have advertised that they have the CID
  3. Use a mutually supported exchange protocol (i.e. bitswap) to retrieve the data from the libp2p nodes at the advertised multiaddrs

With HTTP Support

  1. Search the DHT for CID
  2. Receive a list of multiaddrs that have advertised that they have the CID
  3. Use the multiaddrs to determine which protocol should be used to retrieve the data (i.e. ip4/1.2.3.4/tcp/4001/p2p/PeerID for libp2p protocols like bitswap, /dns4/mybucket.s3.amazonaws.com/tcp/80/http/DataFolder/DataBlock for downloading data over HTTP)

Yes, someone still needs to run an IPFS node that submits Provider records pointing at the HTTP addresses; however, the node no longer needs 24/7 uptime. Additionally, we could potentially allow anyone to publish these HTTP-based Provider records, allowing us to effectively draft content into IPFS that is currently only accessible via the centralized web.
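Step 3 amounts to dispatching on the multiaddr. A deliberately naive sketch (real code would use the multiaddr library rather than string matching):

```javascript
// Sketch: pick a retrieval protocol from a provider multiaddr string.
function transportFor(maddr) {
  if (maddr.includes('/http')) return 'http';   // fetch over plain HTTP
  if (maddr.includes('/p2p/')) return 'bitswap'; // libp2p exchange
  return 'unknown';
}
```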

@rvagg
Contributor

rvagg commented Apr 26, 2019

@aschmahmann:

the node no longer needs 24/7 up time. Additionally, we could potentially allow anyone to publish these HTTP-based Provider records allowing us to effectively draft content into IPFS that is currently only accessible via the centralized web

I don't think I know enough yet about discovery, but is there an attack vector here? Is it cheaper to poison the DHT with pointers to content that you don't host yourself and don't have to take responsibility for not existing than to advertise that you have a certain CID?

@aschmahmann

aschmahmann commented Apr 28, 2019

@rvagg I could be mistaken, but I don't think there is any new attack on the DHT node storing the provider record.

The DHT nodes never ask for any sort of "proof" that you have the data (e.g. probabilistically ask advertisers to send the data, ask for the data hashed with some nonce, etc.), so you could easily not have the data. This is on top of the issue that even if the provider node has the data there's no way to know they'll be able to/want to provide it at an acceptable upload speed. You're basically just trusting the advertising nodes off of their reputation (we could add various types of reputation systems going forward, but currently I think we trust all nodes equally).

However, one attack that we do open up if we allow any node to advertise on behalf of another party is a DoS attack on the other party. For example, if I know QmABC is really popular I could set up a DoS on http://InnocentBystander.com/index.html just by advertising /dns4/InnocentBystander.com/tcp/80/http/index.html/ for QmABC. However, if QmABC has other hosts (not an unreasonable assumption for popular content) then the requests will be more distributed and hopefully won't severely overwhelm InnocentBystander.com. We could (whether we allow arbitrary nodes to advertise on behalf of others or not) allow users to report unavailable/bad providers so they get evicted from the DHT - although deciding who to trust is essentially a reputation system problem.

It's also worth noting that if we are nonetheless still concerned about random advertisers then we could potentially require that the advertising nodes be "certified" by the non-libp2p service (e.g. signed DNS records with a field for the advertiser). However, I'm not currently convinced this is necessary.

@lidel
Contributor

lidel commented Jul 1, 2021

Related: ?format=block and ?format=car could effectively enable every public gateway to be used as a block/DAG archive provider.
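The gateway URLs this implies could be sketched as follows; the query parameter names come from the comment above, while the gateway host is an illustrative example:

```javascript
// Sketch: turn any public gateway into a block/CAR provider via a
// format query parameter.
function gatewayURL(gateway, cid, format) {
  return `${gateway.replace(/\/+$/, '')}/ipfs/${cid}?format=${format}`;
}
```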
