RFM Proposal: Number of Client nodes across various networks and implementations #45

Open
yiannisbot opened this issue May 8, 2023 · 4 comments


@yiannisbot
Member

yiannisbot commented May 8, 2023

We currently capture the number of clients observed in the public IPFS DHT network and report it in our weekly reports (currently in this repo; see the example for Week 17) as well as at probelab.io: https://probelab.io/ipfsdht/#client-vs-server-node-estimate.

As per this discussion thread in Slack, this is great, but it only captures part of the story: it focuses on the public IPFS DHT only, which in turn means that it mostly covers Kubo. However, IPFS is more than the Kubo implementation and more than the public IPFS DHT. A request from @BigLep is to be able to "show the number of peer ids observed across various 'networks' and break out by implementation".

To do this, we'd need to identify data sources (i.e., how to collect the data) for different: i) IPFS implementations (e.g., Kubo, Helia, Iroh), and ii) networks that run IPFS nodes (e.g., the IPFS DHT, the Lotus DHT, cid.contact/IPNI, etc.). We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?).

I'm starting this issue to capture first what we want to target and then come up with data collection ideas (e.g., through measurement tools, logs etc.).

cc: @BigLep @dennis-tra

@dennis-tra
Contributor

I think this is a great initiative and would be super insightful!

> avoid double-counting a peer that participates in more than one network

I think the more common scenario would be that we might double-count peers across different data sources as opposed to a single peer participating in multiple networks.
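The double-counting across data sources mentioned here can be illustrated with a minimal sketch. It assumes each data source for one network yields a set of peer IDs; the function name `combine_sources`, the source names, and the placeholder peer IDs are hypothetical and not part of any existing tooling.

```python
def combine_sources(peers_by_source):
    """Combine peer IDs seen by several data sources for ONE network.

    peers_by_source: {source name: set of peer IDs seen by that source}
    Returns (unique_count, naive_sum), where naive_sum double-counts
    peers that more than one source observed.
    """
    unique = set().union(*peers_by_source.values())
    naive_sum = sum(len(peers) for peers in peers_by_source.values())
    return len(unique), naive_sum


# Hypothetical example: peerB is seen by both sources.
observed = {
    "bootstrappers": {"peerA", "peerB"},
    "crawler": {"peerB", "peerC"},
}
unique_count, naive_sum = combine_sources(observed)
print(unique_count, naive_sum)
```

The gap between the two numbers is the amount of double-counting that a simple per-source sum would introduce.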

@BigLep

BigLep commented May 9, 2023

Thanks for creating this @yiannisbot. I'm pasting in some of the relevant info from FIL slack, in case someone can't easily access it:


Concerning Number of Client vs Server Nodes in the DHT

  • Pros
    • Accuracy / comprehensiveness has gotten to a good state
    • Can be generated automatically
  • Cons (or things it's not covering)
    • Insight into implementation prevalence
    • Just focused on the DHT-using nodes

I think addressing the cons is pretty important given themes of the last year that:

  1. IPFS is intended to be more than Kubo. It would be great to show how that is actually going across various networks.
    • (For example, when I looked at the bootstrapper breakdown at the beginning of April, there was far higher prevalence of js-ipfs than I expected - screenshot below).
  2. IPFS usage is beyond the public DHT

Our current network size KPIs aren't helping drive home the message of the diversity of the IPFS project.

@BigLep

BigLep commented May 9, 2023

Here is a mock of what I'm thinking: https://docs.google.com/spreadsheets/d/1SHHPBZEsZvZ95skg8MgRNHoSaog6tJ-DpvuZePZZlJ4/edit#gid=0

image

Specifically, I think we need to think about our metric collection from "network probes". If implementations don't identify themselves, they get bucketed as unknown/other.

For example:

  • Banana DHT clients and servers: PL-run bootstrappers
  • Banana DHT servers: nebula crawler
  • cid.contact IPNI: server access logs (and nodes should share some form of peerid ideally - presumably obfuscated for privacy)
  • Filecoin DHT: nebula crawler or Max's Kademlia explorer
  • For DHTs, we identify implementations from Identify protocol (or whatever we're doing now)
  • For HTTP endpoints, we identify implementations by user-agent HTTP header.
  • Lassie, Kubo, etc. should be identifiable by both of these means.
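The bucketing idea above can be sketched roughly as follows. This assumes the agent string (the agent version reported via Identify, or the HTTP user-agent header) starts with an implementation name followed by `/`; the table `KNOWN_IMPLEMENTATIONS`, the function `classify_implementation`, and the sample strings are hypothetical and only illustrate the approach.

```python
from typing import Optional

# Hypothetical prefix-to-implementation table; extend as implementations appear.
KNOWN_IMPLEMENTATIONS = {
    "kubo": "Kubo",
    "go-ipfs": "Kubo",  # Kubo's former name
    "js-ipfs": "js-ipfs",
    "helia": "Helia",
    "iroh": "Iroh",
    "lassie": "Lassie",
}


def classify_implementation(agent: Optional[str]) -> str:
    """Map an agent string to an implementation bucket.

    Anything missing or unrecognized lands in "unknown/other".
    """
    if not agent:
        return "unknown/other"
    # Take the part before the first "/" and compare case-insensitively.
    name = agent.strip().split("/", 1)[0].lower()
    return KNOWN_IMPLEMENTATIONS.get(name, "unknown/other")


# Illustrative inputs only; real agent strings may differ.
for sample in ["kubo/0.20.0/abc123", "go-ipfs/0.8.0/", "Lassie/0.9.1", "", None]:
    print(repr(sample), "->", classify_implementation(sample))
```

Using the same function for both DHT and HTTP sources keeps the buckets comparable across "networks", at the cost of relying on implementations to identify themselves honestly.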

The graph above is for a single month. I could imagine showing that collection of bars grouped together for each month and then displaying multiple months along the x-axis.

@BigLep

BigLep commented May 9, 2023

> We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?)

I don't think this needs to be a priority currently. We can make it a caveat that nodes (peerIds) will participate in multiple "networks" and that, as a result, it is not accurate to say "the total number of unique IPFS peerIds is the sum of all the bars". For example, I think it's fine for a Kubo peerId to count towards "Banana DHT server", "Banana DHT client", and "cid.contact IPNI".

I do think we should deduplicate peerIds within a given "network". For example, a Kubo node that participates as a "Banana DHT client" every day for a month should only increase the count for that month by 1 (not 30).
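A minimal sketch of this per-network, per-month deduplication, assuming sightings arrive as `(day, network, peer_id)` records. The function name `monthly_unique_counts` and the record shape are assumptions for illustration, not an existing pipeline.

```python
from collections import defaultdict
from datetime import date


def monthly_unique_counts(sightings):
    """Count distinct peer IDs per (month, network).

    sightings: iterable of (day: date, network: str, peer_id: str)
    Returns {("YYYY-MM", network): number of distinct peer IDs}.
    """
    seen = defaultdict(set)
    for day, network, peer_id in sightings:
        seen[(day.strftime("%Y-%m"), network)].add(peer_id)
    return {key: len(peers) for key, peers in seen.items()}


# Hypothetical example: one peer seen daily as a DHT client in April 2023,
# and once at cid.contact in the same month.
sightings = [(date(2023, 4, d), "Banana DHT client", "peerA") for d in range(1, 31)]
sightings.append((date(2023, 4, 3), "cid.contact IPNI", "peerA"))

counts = monthly_unique_counts(sightings)
print(counts)
```

In this example each bar is 1 rather than 30, and the two bars sum to 2 even though only one peer exists, which is the caveat about not summing bars across "networks".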
