
[ProjectTracking]: Resolve network concerns for stateless validation #61

Open
Tracked by #46
Longarithm opened this issue Mar 19, 2024 · 3 comments

@Longarithm
Member

Longarithm commented Mar 19, 2024

Goals

Make sure that increased network usage and network latencies don't impact mainnet stability.

Initial analysis of the problem space

  • Latency: We have observed chunk production failures even when the state witness is small or empty, which could be due to network latency being too high. @Tayfun Elmas (tayfunelmas)'s PR #10824 should help determine whether that is actually the case. @Jan Ciołek (jancionear)'s proposal of switching to TIER1 could significantly help with latency in general, not only due to the use of direct connections but also because TIER2 routing only minimizes hop count, not latency. I also wonder whether we have observed anything indicating that the state witness is arriving, but arriving too late?
  • Throughput: Alex observed very poor throughput for certain pairs of locations. This is interrelated with latency, because TCP (and QUIC as well) adjusts its send rate based on observed round-trip time.
  • Loss: We also see significant packet loss (?) between certain pairs of locations. I think this is the part that has received the least discussion; could you share or point me to your findings, @Alex Logunov (Longarithm)? Packet loss would also have a significant impact on TCP send rate.
  • Bandwidth Usage: The lower bound on the bandwidth usage to distribute a payload of size S to N nodes is S * N, in both total egress and total ingress. Nodes pay for bandwidth, and it is not cheap. Due to the low diameter of the TIER2 network, I would expect that even today the bandwidth usage is only roughly 2 * S * N. As Robin pointed out, we can optimize it to roughly S * (N + sqrt(N)) with some kind of fanout. As I understand it, this is not an important direction at the moment, since monetary cost is currently not a blocker for mainnet, and there should only be roughly 2x room for improvement in this area from network efficiency.
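The bandwidth figures above can be sanity-checked numerically. A minimal sketch, assuming an illustrative payload of S = 2 MB and N = 100 recipient nodes (both assumed values, not measurements from the thread):

```python
import math

# Back-of-the-envelope egress estimates for distributing a payload of
# size s_mb to n nodes. All concrete figures here are assumptions.

def lower_bound_mb(s_mb: float, n: int) -> float:
    """Theoretical minimum total egress: every node must receive S once."""
    return s_mb * n

def tier2_today_mb(s_mb: float, n: int) -> float:
    """Rough current usage on the low-diameter TIER2 network: ~2 * S * N."""
    return 2 * s_mb * n

def fanout_mb(s_mb: float, n: int) -> float:
    """Approximate usage with a sqrt(N)-style fanout: ~S * (N + sqrt(N))."""
    return s_mb * (n + math.sqrt(n))

S, N = 2.0, 100  # assumed: 2 MB payload, 100 recipients
print(lower_bound_mb(S, N))  # 200.0
print(tier2_today_mb(S, N))  # 400.0
print(fanout_mb(S, N))       # 220.0
```

With these assumptions the fanout scheme (220 MB) sits close to the 200 MB lower bound, consistent with the claim that only roughly 2x improvement is available over today's ~2 * S * N.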

Issues

  • Statelessnet shard 5 stalls under high load: the only chunk producer's upload network queues overflow because of large state witnesses.
  • Localnet with 2 distant validators misses 25% of chunks, apparently because 0.6 s is not enough to endorse even an empty chunk.

Links to external documentation and discussions

https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/mainnet.20traffic/near/421904005

Estimated effort

TBD

Assumptions

Average state witness size is 2 MB and the worst case is 16 MB. So,

  • A chunk producer periodically has to send 200 MB worth of data (2 MB to each of ~100 validators) within 1 s on average, and 1.6 GB within 1 s in the worst case.
  • A chunk validator has to receive 2 MB/s on average and 16 MB/s in the worst case.
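The 200 MB figure implies roughly 100 chunk validators (200 MB / 2 MB); a sketch of the arithmetic, with the validator count as an assumption inferred from that ratio:

```python
avg_witness_mb = 2      # average state witness size (from the assumptions)
worst_witness_mb = 16   # worst-case state witness size
validators = 100        # assumed, implied by 200 MB / 2 MB

# The chunk producer pushes one full witness copy to every validator
# within roughly a 1-second window.
producer_avg_mb = avg_witness_mb * validators      # 200 MB
producer_worst_mb = worst_witness_mb * validators  # 1600 MB = 1.6 GB

# Each chunk validator only has to receive a single copy.
validator_avg_mbps = avg_witness_mb      # 2 MB/s
validator_worst_mbps = worst_witness_mb  # 16 MB/s

print(producer_avg_mb, producer_worst_mb)  # 200 1600
```

The asymmetry is the core problem: receive-side load stays modest while the producer's send-side burst scales linearly with the validator count.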

Pre-requisites

#46

Out of scope

NA

@walnut-the-cat

@saketh-are and @shreyan-gupta, is there a different GitHub issue we are using for network optimization? If not, let's update this one as we make progress.

@walnut-the-cat

Current plans (by @saketh-are )

  • Based on the current experiments, we can tune our P2P TCP connections to handle traffic bursts of size 2 MB without latency degradation. We can push this limit further, but doing so requires allocating more and more memory for send/recv buffers, which at some point becomes an issue when multiplied by the number of connections. Assume 2 MB for now.
  • We will use erasure coding to break the state witness into 50 parts, each of which is sent on a separate direct connection to a separate chunk validator. Each CV then forwards its part to all other CVs. This 2-hop strategy is what is done in prod for chunk distribution, which we know works reasonably well to deliver chunk data within the block production timeframe. The erasure coding redundancy adds 50% overhead; that is, if we can support a 2 MB message, the useful payload we actually deliver is 2 MB * 2/3 ≈ 1.33 MB.
  • So far we have tested handling of a burst of traffic on a single connection. It remains to be seen to what extent performance degrades when the chunk producer has to send on 50 different connections at once. I'll try to obtain some numbers on this today. Pending those results, an upper bound on the state witness size we may be able to handle after implementing the erasure coding currently looks like 2 MB * 50 * 2/3 ≈ 66 MB.
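Putting the plan's numbers together (2 MB burst per connection, 50 parts, rate-2/3 erasure code, all taken from the bullets above):

```python
burst_per_conn_mb = 2.0  # tested per-connection burst without latency degradation
parts = 50               # one erasure-coded part per chunk validator
code_rate = 2 / 3        # 1.5x redundancy => only 2/3 of sent bytes are payload

# Useful payload that fits within one connection's burst budget.
useful_per_conn_mb = burst_per_conn_mb * code_rate

# Upper bound on total state witness size across all 50 connections,
# assuming per-connection performance holds when sending on all at once.
max_witness_mb = burst_per_conn_mb * parts * code_rate

print(round(useful_per_conn_mb, 2))  # 1.33
print(round(max_witness_mb, 1))      # 66.7
```

Note the caveat from the last bullet: the ~66 MB bound only holds if sending on 50 connections simultaneously does not degrade the per-connection 2 MB figure.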

Benefit of erasure encoding (by @shreyan-gupta )

  • It significantly reduces the instantaneous network requirements for a single chunk producer. Without it, we needed to send 2 MB * 50 validators = 100 MB of data from one node to all other validators. With erasure coding, we need to send (2 MB * 1.5 overhead / 50 parts) * 50 validators of data, which is just ~3 MB.
  • It's "fairer" to validators, in the sense that random connection drops would not cause individual validators to miss the state witness. In the future this opens up the possibility of rewarding validators based on whether they generated an endorsement that gets included in the block.
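The reduction in the producer's instantaneous send volume can be sketched as follows (2 MB witness, 50 validators, 1.5x overhead, per the comment above):

```python
witness_mb = 2.0   # state witness size
validators = 50    # chunk validators, one erasure-coded part each
overhead = 1.5     # redundancy factor of the erasure code

# Without erasure coding: the producer sends the full witness to everyone.
without_ec_mb = witness_mb * validators  # 100 MB

# With erasure coding: one small part per validator; validators then
# re-share parts among themselves on the second hop.
part_mb = witness_mb * overhead / validators
with_ec_mb = part_mb * validators  # ~3 MB total from the producer

print(without_ec_mb)      # 100.0
print(round(with_ec_mb))  # 3
```

The total traffic across the network is unchanged to first order; the scheme moves the bulk of the fanout from the single producer onto the second hop, spread across all validators.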
