Hypercore DHT privacy enhancement #263

Closed
Winterhuman opened this issue Jan 25, 2022 · 2 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

Winterhuman commented Jan 25, 2022

This issue was opened after reading: https://discuss.ipfs.io/t/how-a-hypercore-p2p-innovation-could-bring-more-privacy-to-ipfs/13256/1

Right now, the DHT has PeerID -> CID mappings, which means a large group of nodes could scrape the DHT for all CIDs and then download them all to find out who has what content. This is a big loss in privacy. Hypercore, however, gets around this.

Instead of storing a PeerID -> CID mapping, Hypercore stores the equivalent of a PeerID -> hash-of-CID mapping: the CID is used to generate a second CID, called the Discovery Hash, which is published instead. A large group of nodes could still scrape the DHT for discovery hashes; however, without the original CID they can't request the content from the nodes they learn of.

Before:

  1. NodeA scrapes the DHT for random CIDs and finds out NodeB has QmFoo.
  2. NodeA connects to NodeB and requests QmFoo.
  3. NodeB gives NodeA the content behind QmFoo.
  4. NodeA now knows what content QmFoo corresponds to for all nodes.

After:

  1. NodeA scrapes the DHT for random discovery CIDs and finds out NodeB advertises QmBar.
  2. NodeA connects to NodeB and requests QmBar.
  3. NodeB rejects the request, stating it does not have QmBar, which is true: NodeB has QmFoo, whose discovery CID is QmBar. NodeA therefore can't download the content behind QmFoo without learning the CID QmFoo by other means.
  4. NodeA can't know what content QmBar, and thus QmFoo, corresponds to.

Overall, Discovery CIDs mean you must already have the CID of the content in order to download it, and CIDs are no longer public knowledge in the DHT.
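
The before/after flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hypercore or any IPFS implementation does it: `discovery_hash`, `Node`, and `DHT` are hypothetical names, a plain SHA-256 digest stands in for a real multihash/CID, and the discovery hash is assumed to be SHA-256 of that digest.

```python
import hashlib


def discovery_hash(multihash: bytes) -> bytes:
    """Hypothetical discovery hash: a hash of the content's own hash."""
    return hashlib.sha256(multihash).digest()


class Node:
    """Toy node that stores blocks keyed by their real content hash."""

    def __init__(self, peer_id: str):
        self.peer_id = peer_id
        self.blocks = {}  # content hash -> block bytes

    def add(self, data: bytes) -> bytes:
        # SHA-256 is a stand-in for a real multihash/CID.
        content_hash = hashlib.sha256(data).digest()
        self.blocks[content_hash] = data
        return content_hash

    def request(self, key: bytes):
        # Blocks are served only when asked for by their real content hash.
        return self.blocks.get(key)


class DHT:
    """Toy DHT: advertised key -> set of peer IDs providing it."""

    def __init__(self):
        self.providers = {}

    def provide(self, key: bytes, peer_id: str) -> None:
        self.providers.setdefault(key, set()).add(peer_id)

    def scrape(self) -> dict:
        # A scraper sees every advertised key and who advertised it.
        return dict(self.providers)


# NodeB holds QmFoo but advertises only its discovery hash, QmBar.
node_b = Node("NodeB")
qm_foo = node_b.add(b"some content")
qm_bar = discovery_hash(qm_foo)

dht = DHT()
dht.provide(qm_bar, node_b.peer_id)

# NodeA scrapes the DHT, learns QmBar, and asks NodeB for it.
scraped = dht.scrape()
rejected = node_b.request(qm_bar)   # NodeB has no block under QmBar
served = node_b.request(qm_foo)     # works only if QmFoo is already known
```

In this sketch the scraper learns that NodeB advertises something, but the request for QmBar comes back empty; only a requester who already knows QmFoo gets the block.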

@Winterhuman Winterhuman added the need/triage Needs initial labeling and prioritization label Jan 25, 2022
Contributor

aschmahmann commented Jan 25, 2022

Related:

This proposal has been kicking around for a few years, so IMO the main questions here are: what does this buy us, what will it cost us, and what are the tradeoffs of doing this in existing IPFS implementations?

TLDR: I think this is a good idea to do, but it doesn't buy as much as you might think it does and it comes with a bunch of extra burden, not just on implementers but also on the resource consumption of user nodes on the network.

  1. What does this buy us?
    • Users trying to understand what content is generally on the network (and served by whom) won't be able to scrape it by being DHT servers or generally listening to DHT traffic
      • Unfortunately, for many privacy scenarios (e.g. hosting content that's politically sensitive) this turns out to be a big distraction. If someone is trying to figure out who is hosting "bad file" X they can just do a query themselves to find out who has it. IIRC this behavior has been leveraged in the BitTorrent ecosystem by media companies filing DMCA takedown notices to ISPs about users hosting (or downloading) copyrighted content
  2. What will it cost us?
    • False sense of security (see above)
    • While this model works fine for the current state of the IPFS Public DHT, there are some tradeoffs under network expansion
      • For example, say you're trying to download some DAG. Right now the DHT advertisements are just the multihash of some block of content, which you might associate with some CID (i.e. the same block of data can get interpreted with different codecs) and DAG (e.g. did you want all of Wikipedia or just bafywikipedia/wiki/IPFS). You could get everyone who has advertised that block and just sort of figure out who has the data you want by querying them. On the other hand, if advertisements started carrying more information about the graph components they were advertising, you'd have to download all the advertisements just to sift through and figure out what you wanted
    • People will still end up wanting IPFS search engine tools that can scrape content people want to be crawled and found, which means broadcasting both types of records. That has some cost, but it also means other people broadcasting for crawling can expose you (e.g. I advertise that I have content that's safe for me to have but not for you to have; however, since some search engine has a list of tons of CIDs, an adversary can just compute hash(multihash) and find you again).
  3. What are some implementation tradeoffs/considerations?
    • Depending on how this was implemented it could require either a soft or a hard fork of the DHT
      • Hard fork: a small network initially, which is more attackable and, depending on who upgrades, a bunch of server load
      • Soft fork: lots of records are replicated twice and all DHT queries are duplicated as they search for both forms, waiting for more people to upgrade before dropping support for the old version
    • Bitswap
      • Many Bitswap clients currently ask their peers (for efficiency reasons) if they have the content they are looking for even if it's not advertised in the DHT.
      • For this measure to protect content retrievers (rather than providers) you'd need to either turn off/control this feature or add this kind of privacy in Bitswap as well by asking for hash(multihash) and getting back the block of data encrypted with the multihash as a key.
        • This doesn't have the same soft fork/hard fork problems because of multistream negotiation which is good
        • This would require more computation when sending/receiving blocks to do the encryption or turning off a useful feature (especially useful in offline scenarios)
        • This would require storing an extra index of every block you have keyed by hash(multihash) to respond to queries
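
Two of the points above lend themselves to a small sketch: the extra index keyed by hash(multihash) that a provider would need in order to answer queries, and the way anyone holding a list of known multihashes (a search engine crawl, say) can still locate providers by computing hash(multihash) themselves. Everything here is a hypothetical illustration: `h`, `PrivateProvider`, and `find_known_content` are made-up names, and SHA-256 stands in for both the multihash and the outer hash.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Stand-in hash used for both the multihash and hash(multihash)."""
    return hashlib.sha256(data).digest()


class PrivateProvider:
    """Toy provider that answers queries keyed by hash(multihash)."""

    def __init__(self, peer_id: str):
        self.peer_id = peer_id
        self.blocks = {}        # multihash -> block bytes
        self.by_discovery = {}  # hash(multihash) -> multihash (the extra index)

    def add(self, block: bytes) -> bytes:
        multihash = h(block)
        self.blocks[multihash] = block
        # Every stored block needs a second index entry to be findable.
        self.by_discovery[h(multihash)] = multihash
        return multihash

    def has(self, discovery_key: bytes) -> bool:
        return discovery_key in self.by_discovery


def find_known_content(advertised: dict, known_multihashes: list) -> dict:
    """Match a list of already-known multihashes against advertised keys.

    `advertised` maps hash(multihash) -> set of peer IDs, as a scraper
    would see it. Returns {multihash: peers} for every match.
    """
    hits = {}
    for multihash in known_multihashes:
        peers = advertised.get(h(multihash))
        if peers:
            hits[multihash] = set(peers)
    return hits


provider = PrivateProvider("NodeB")
sensitive = provider.add(b"sensitive file")
unrelated = h(b"something nobody advertises")

# What a scraper sees: only hash(multihash) -> peers.
advertised = {h(sensitive): {provider.peer_id}}

# An adversary who already knows the multihash finds the provider anyway.
hits = find_known_content(advertised, [sensitive, unrelated])
```

The hashing hides content from someone who has no idea what to look for, but it does nothing against a targeted lookup for a multihash the adversary already holds.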

Author

Winterhuman commented Jan 25, 2022

That makes sense to me. Discovery CIDs wouldn't actually solve the problem in the long term, and certain applications which require public CIDs would stop functioning. It's also a lot of effort for very little effect.

Having ACL in IPFS would probably yield the same result while being much more flexible. I'll close this issue then, thanks for explaining this in detail to me!
