
Cross-Shard Congestion Control #539

Open
wants to merge 28 commits into base: master

Conversation

@jakmeier (Contributor, Author) commented Mar 22, 2024

And a first draft of "the story behind" is also available: https://github.com/near/nearcore/blob/master/docs/architecture/how/receipt-congestion.md

While the NEP focuses on specifying the proposed changes, the "story behind" document explains the thought process behind why these changes lead to the desired consequences.

@akhi3030 (Contributor) left a comment

Generally looks good. Some high-level thoughts.

A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

If so, then my question is about fairness and, relatedly, load balancing. The two cases that I am thinking of are:

  1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
  2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?

@jakmeier (Contributor, Author) commented Mar 25, 2024

> A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

Yes, that sounds exactly right.

> 1. Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?

We don't give any guarantees about fairness. We hope that backpressure measures reduce incoming transactions sharply enough that congestion resolves quickly and everyone can send again. But yes, the linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
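For illustration, here is a sketch of that interpolation in Rust. The NEP's pseudocode uses a `mix` helper for exactly this kind of interpolation; the constants below are illustrative placeholders, not the values chosen in the NEP:

```rust
/// Linear interpolation between `a` (at t = 0.0) and `b` (at t = 1.0).
fn mix(a: u64, b: u64, t: f64) -> u64 {
    (a as f64 * (1.0 - t) + b as f64 * t) as u64
}

/// Gas one sender may forward to a receiver per chunk, given the receiver's
/// advertised congestion in [0.0, 1.0]. Every sender evaluates the same
/// formula, so the receiver's free capacity is split evenly among senders.
fn outgoing_gas_limit(receiver_congestion: f64) -> u64 {
    const MAX_OUTGOING_GAS: u64 = 300_000_000_000_000_000; // 300 Pgas, illustrative
    const MIN_OUTGOING_GAS: u64 = 1_000_000_000_000_000; // 1 Pgas, illustrative
    mix(MAX_OUTGOING_GAS, MIN_OUTGOING_GAS, receiver_congestion)
}
```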

> 2. Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?

There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make.

@wacban left a comment

a few nits, typos and such

Co-authored-by: wacban <wacban@users.noreply.github.com>
@akhi3030 (Contributor)

Generally happy with your responses here. One other approach I have seen (and implemented in the past) to guarantee fairness is some sort of credit-based queuing. This lets a receiving entity decide at fine granularity how much of its queue it wants to dedicate to each sender. It is natural to use this mechanism to implement fair sharing, or arbitrary types of prioritisation as well (e.g. one shard is able to send 2x more than another). The drawback, of course, is more state tracking and a more complex implementation. So I'm happy with the proposed approach.
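For illustration, a credit-based scheme along these lines could look roughly as follows; this is a sketch with assumed names, not part of the proposal:

```rust
use std::collections::HashMap;

type ShardId = u64;

/// Sketch: the receiver grants each sender a gas budget ("credits") and
/// replenishes it as its own queue drains.
struct CreditQueue {
    credits: HashMap<ShardId, u64>,
}

impl CreditQueue {
    /// A sender may only forward a receipt while it has credit left for it.
    fn try_consume(&mut self, sender: ShardId, gas: u64) -> bool {
        let c = self.credits.entry(sender).or_insert(0);
        if *c >= gas {
            *c -= gas;
            true
        } else {
            false
        }
    }

    /// Grant credit to a sender, e.g. proportionally to per-sender weights,
    /// whenever queue space frees up. Unequal weights would implement the
    /// "2x more than another" style of prioritisation.
    fn grant(&mut self, sender: ShardId, gas: u64) {
        *self.credits.entry(sender).or_insert(0) += gas;
    }
}
```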

neps/nep-0539.md

We store the outgoing buffered receipts in the trie, similar to delayed receipts
but in their own separate column. Therefore we add a trie column
`BUFFERED_RECEIPT_OR_INDICES: u8 = 13;`. Then we read and write analogue to the


Minor detail: I know we use this pattern for DELAYED_RECEIPT_OR_INDICES, but it seems to be that way for historical reasons (see commit message here).

For this new queue it would be clearer to have separate BUFFERED_RECEIPT and BUFFERED_RECEIPT_INDICES columns.
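Concretely, the suggestion amounts to something like the following; the numeric values are illustrative placeholders, not final assignments:

```rust
// Instead of one combined column:
// const BUFFERED_RECEIPT_OR_INDICES: u8 = 13;
// use two dedicated columns, one for the queue indices and one for the
// receipts themselves (values illustrative):
const BUFFERED_RECEIPT_INDICES: u8 = 13;
const BUFFERED_RECEIPT: u8 = 14;
```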

@akhi3030 (Contributor)

Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible, i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard).

Does the proposed solution handle such scenarios? Is the filter operation defined going to apply to the receipts created above?

@jakmeier (Contributor, Author)

The filter operation is only applicable to transactions, not to receipts. Once receipts are created, we commit to execute them.

The described situation is indeed problematic. Of course, that's exactly what backpressure is for.

If shard 3 becomes congested, shards 1 and 2 can still create receipts for shard 3, but they are forced to keep them in their outgoing buffers before forwarding. This way, shard 3 is protected from additional inflow. Eventually, shards 1 and 2 may also become congested and the backpressure spreads further out to all shards trying to send something to them. In the extreme, all shards are congested and no new transactions are accepted anywhere.
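A minimal sketch of that forwarding decision; the names and the hard cutoff are illustrative, since the NEP interpolates a per-receiver gas limit rather than switching at a single threshold:

```rust
struct Receipt;

/// Sketch: receipts for a congested target shard are held back in the local
/// outgoing buffer instead of being forwarded with the chunk.
fn forward_or_buffer(
    receipt: Receipt,
    target_incoming_congestion: f64, // in [0.0, 1.0], from the target's header
    outgoing_buffer: &mut Vec<Receipt>,
    outgoing_receipts: &mut Vec<Receipt>,
) {
    const FULLY_CONGESTED: f64 = 1.0; // illustrative cutoff
    if target_incoming_congestion >= FULLY_CONGESTED {
        outgoing_buffer.push(receipt); // backpressure: hold the receipt back
    } else {
        outgoing_receipts.push(receipt); // forward as part of this chunk
    }
}
```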

Unfortunately, it is still not handled perfectly. We only apply backpressure based on incoming congestion, to avoid deadlocks. But if we are able to handle incoming receipts quickly, it is possible that shard 1 keeps filling its outgoing buffer for shard 2, growing it faster than it can forward the receipts in it. And because the incoming queue is always empty, it does not apply backpressure. (cc @wacban we should probably simulate with the latest changes that decouple incoming and outgoing congestion to see how bad this can become.)

@akhi3030 (Contributor)

I think I understand the high-level explanation. The drawback is that in the worst case, due to one shard not keeping up, the entire network may have to stop accepting new transactions. I am still happy with this solution and see it as a very good next step to build. Once built, I can imagine further refinements that address such cases as well.

@wacban commented Mar 27, 2024

@akhi3030

> This lets a receiving entity decide at fine granularity how much of its queue it wants to dedicate to each sender.

If I understand correctly, this could be implemented by splitting the delayed receipts queue into one queue per sending shard and then implementing some fair way to pull receipts from this set of queues. This makes sense, but I would rather keep this NEP in its current, simpler form and work on top of it in follow-ups. The good news is that, as far as I can tell, the current proposal should be easily extendable to what you're suggesting.
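As a sketch of that follow-up idea (assumed structure, not part of this NEP), the per-sender queues could be drained round-robin so that no single sender starves the others:

```rust
use std::collections::VecDeque;

struct Receipt;

/// Sketch: one delayed-receipts queue per sending shard, drained round-robin.
fn pull_fair(queues: &mut [VecDeque<Receipt>], budget: usize) -> Vec<Receipt> {
    let mut out = Vec::new();
    if queues.is_empty() {
        return out;
    }
    let mut i = 0;
    while out.len() < budget && queues.iter().any(|q| !q.is_empty()) {
        if let Some(receipt) = queues[i % queues.len()].pop_front() {
            out.push(receipt);
        }
        i += 1;
    }
    out
}
```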

> A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summary?

That is correct. Just to add a detail: each shard will advertise two numbers, one representing the fullness of the "outgoing queues" and one representing the fullness of the "incoming queue". Those two types of congestion are treated differently, which allows us to better adapt the measures to the specific workload that the network is under.

@akhi3030 (Contributor)

@wacban: perfect, sounds like a solid plan to me. I am always happy to build incrementally.

github-merge-queue bot pushed a commit to near/nearcore that referenced this pull request Mar 28, 2024
This document describes a few fundamental congestion control problems
and ideas to solve them.

The added page serves as a secondary document to the
[NEP](near/NEPs#539), summarising the thought
process behind the most important design decisions. But it is generally
applicable to congestion in Near Protocol's receipt execution system as
it works today. It can even serve as documentation for how congestion
can occur today.

The document includes 8 graphs generated using
[graphviz](https://graphviz.org/).
To regenerate after modifying the `*.dot` files, install the graphviz
toolbox (on systems with apt: `sudo apt install graphviz`) and then run
`dot -Tsvg img_name.dot > img_name.svg`.

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
- The formulas in the pseudo code were opposite to the description,
  fixing it by swapping incoming and general congestion.
- "General" congestion is a bad name. Changing it to "Memory" congestion.
- Add a sentence of motivation to the pseudo code snippets for extra explanation
- Add TODO for unbounded queue problem
jakmeier and others added 4 commits April 2, 2024 17:31
Co-authored-by: wacban <wacban@users.noreply.github.com>
No link to the actual reference implementation, yet.
Just some clarifying text and in-place code.
I think it's better to keep it simple. While it could be useful in the future
to look at guaranteed-to-be-burnt gas and attached gas separately for congestion,
our current strategy does not look at them separately.
@wacban commented Apr 5, 2024

I implemented the model of the strategy proposed in the NEP. I am now analysing different workloads to make sure that the strategy can handle them well. I will be sharing results and suggestions here as I progress.

AllToOne workload.

In this workload all shards send direct transactions to a single shard that becomes congested.

The strategy does a rather bad job at dealing with this workload, as the outgoing buffers grow in gas without a reasonable limit. The memory limit is never exceeded because the receipts are small, but the number of receipts and the gas backlog grow beyond acceptable values.

The reason is that the current proposal does not take the gas accumulated in outgoing buffers into account.

My suggestion would be to replace memory congestion with general_congestion as follows:

ShardChunkHeaderInnerV3 {
    // as is
    incoming_congestion: u16,
    // memory -> general
    general_congestion: u16,
}

// Same as in NEP
MAX_CONGESTION_MEMORY_CONSUMPTION = 500 MB
memory_consumption = 0
memory_consumption += sum([receipt.size() for receipt in delayed_receipts_queue])
memory_consumption += sum([receipt.size() for receipt in postponed_receipts_queue])
memory_consumption += sum([receipt.size() for receipt in outgoing_receipts_buffer])

memory_congestion = memory_consumption / MAX_CONGESTION_MEMORY_CONSUMPTION
memory_congestion = min(1.0, memory_congestion)

// New
// Similar to memory but summing up gas instead of size
MAX_CONGESTION_GAS_BACKLOG = 100 PG
gas_backlog = 0
gas_backlog += sum([receipt.gas() for receipt in delayed_receipts_queue])
gas_backlog += sum([receipt.gas() for receipt in postponed_receipts_queue])
gas_backlog += sum([receipt.gas() for receipt in outgoing_receipts_buffer])

gas_congestion = gas_backlog / MAX_CONGESTION_GAS_BACKLOG
gas_congestion = min(1.0, gas_congestion)

// New
general_congestion = max(memory_congestion, gas_congestion)

I implemented the suggestion in the model and the results are quite good - both the incoming queue and outgoing buffers display bounded, periodic behaviour.

In the picture below, each period is characterized by four phases:

  • phase 1 - rapid growth
    • incoming gas grows to 150 PG
    • outgoing gas grows to 100 PG
    • shards send plenty of load to the loaded shard (0)
  • phase 2 - incoming decline
    • incoming gas drops to 100 PG
    • outgoing gas stays at 100 PG
  • phase 3 - outgoing decline
    • incoming gas stays at 100 PG
    • outgoing gas drops to 0 PG
  • phase 4 - incoming decline
    • incoming gas drops to 50 PG - the threshold for accepting transactions
    • outgoing gas stays at 0 PG

We can probably smooth it out further by replacing the hard incoming congestion threshold with linear interpolation. It's not a priority right now so I'll leave it as is.

(Screenshot, 2024-04-05: simulated incoming and outgoing gas over time, showing the periodic four-phase pattern.)
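For reference, the smoothing mentioned above would replace the hard cutoff with a budget that shrinks linearly in the congestion level, e.g. (illustrative sketch, not part of the current proposal):

```rust
/// Current rule (sketch): accept transactions for a shard only while its
/// advertised congestion is below a hard threshold.
fn accept_tx_hard(target_congestion: f64, threshold: f64) -> bool {
    target_congestion < threshold
}

/// Possible smoothing (sketch): shrink the per-chunk gas budget for
/// transactions targeting the shard linearly with its congestion, instead
/// of cutting it off at a single point.
fn tx_budget_smooth(max_budget: u64, target_congestion: f64) -> u64 {
    (max_budget as f64 * (1.0 - target_congestion)) as u64
}
```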

@bowenwang1996 (Collaborator)

As a working group member, I nominate @robin-near and @akashin as subject matter experts to review this NEP.

described the two alternatives for efficiently calculating the congestion information and updated relevant sections
Added the description of how to integrate with resharding.
It should be based on general congestion, not only memory congestion.
github-merge-queue bot pushed a commit to near/nearcore that referenced this pull request May 6, 2024
This PR connects the changes of previous PRs
(mainly #11128, #11173, and  #11195)
to apply congestion control rules as defined in
[NEP-539](near/NEPs#539).

Builds with a stable protocol version are unaffected by
this PR, since `own_congestion_info` will return None
and the `ReceiptSinkV1` unconditionally forwards 
all receipts exactly as it is done today.

Most components combined in this PR are already tested
in unit tests. The new features added with this PR are
best tested with integration tests, which will follow in future PRs.

The PR has two commits; it makes sense to review each on its own.
@walnut-the-cat (Contributor)

@akashin and @robin-near , friendly reminder to share your review on this NEP

@victorchimakanu commented May 6, 2024

NEP Status (Updated by NEP Moderators)

Status: Review

SME reviews:

Protocol Work Group voting indications (❔ | 👍 | 👎 ):

@walnut-the-cat (Contributor)

@bowenwang1996, it seems @akashin is OOO this week; should we nominate someone else to expedite?

@akashin (Contributor) commented May 13, 2024

I skimmed the proposal and the overall approach looks good to me, great work!
I will aim to finish a more thorough review of each section by tomorrow.

neps/nep-0539.md
transactions to the congested shard when it is hit.

`MIN_TX_GAS = 20 Tgas` guarantees that we can still accept a decent amount of
new transactions to shards that are not congestions, even if the local shard

Typo: congestions -> congested

neps/nep-0539.md
restrictions. Additionally, the memory consumption of both receipt types can
also cause congestion.

This proposal replaces the local congestion control rules already in place.

Nit: I think it supersedes/extends the existing rules rather than fully replacing them (we still have the transaction pool limit and the delayed receipt queue limit).

@jakmeier (Contributor, Author)

True, I've added a few words to describe what we keep and what we replace.

neps/nep-0539.md
```

We replace the formula for the transaction limit to depend on the
`incoming_congestion` computed in the previous shard variable (between 0 and 1):

Do you mean computed in the previous chunk for this shard?

@jakmeier (Contributor, Author)

Yes. The sentence should be fixed now to make more sense.

@akashin (Contributor) commented May 14, 2024

Technical summary

Over the last few months, NEAR mainnet has been experiencing a considerable increase in usage, which led to regular congestion on shards 2, 3 and 5.

Usually, transactions on NEAR execute within a few blocks and the latency for the end users is on the order of a few seconds. However, during these periods of congestion, users regularly had to wait more than 15 minutes for their transaction to be processed and sometimes the wait time was over 1 hour. This is a considerable degradation in the user experience and it needs to be addressed to make the users of NEAR happy.

This NEP introduces a number of mechanisms to control the order in which transactions/receipts are admitted, exchanged between shards, and processed, as well as a concrete policy that is carefully tuned to balance end-user latency with overall system throughput and peak memory overhead.

Recommendation

I recommend approving this NEP.

Benefits

  • Establishes much tighter latency guarantees for admitted transactions (20 blocks, compared to 1000s of blocks in the past)
  • Actually deals with cross-shard congestion, which led to unbounded latencies in the past
  • A thorough evaluation of the proposed solution on a range of relevant workloads shows visible improvements in core congestion metrics

Concerns

| # | Concern | Resolution | Status |
| --- | --- | --- | --- |
| 1 | There is no hard limit on the size of the buffers for receipts | The soft limit has been shown to work effectively on simulated workloads. Moreover, this NEP strictly improves on the limit that we have today. | Resolved |
| 2 | User transactions will fail due to being rejected | Users already have to deal with this today and it needs to be addressed regardless of this NEP. | Resolved |

@robin-near (Contributor) left a comment

Added some comments on clarifying the content - I'll take another look once that is addressed. Also see jakmeier#3 for the typos and grammar nits that would introduce too much clutter if I left them as comments.

below) and modify how outgoing receipts are treated in the transaction
conversion step (3) and in the receipts execution step (4).

The new chunk execution then follows this order.

I have trouble following this process, because many variables (like gas_backlog, outgoing_gas, outgoing_gas_limit) are not previously well defined. Could we define these variables more carefully before this?

@jakmeier (Contributor, Author)

Sorry, there were too many rounds of renaming things and in the end it was rather inconsistent. I completely missed that, thanks for pointing it out. It should be fixed now.

# new
MIN_TX_GAS = 20 Tgas
MAX_TX_GAS = 500 Tgas
tx_limit = mix(MAX_TX_GAS, MIN_TX_GAS, incoming_congestion)

This uses "incoming_congestion" before introducing the concept. Should we add a section above that summarizes the relevant congestion fields? Otherwise reading this feels a bit like reverse engineering.

@jakmeier (Contributor, Author)

Yes that makes a lot of sense. I've done so now, please take another look.

Correct some typos, grammar issues, and clarify some text.
Address various comments by SME reviewers.

- Fix various grammar errors.
- Remove old names and use only the correct names for variables
- Start the specification section by introducing important concepts
@jakmeier (Contributor, Author)

Thanks a lot to @akashin and @robin-near for taking the time to read through our proposal and giving valuable feedback! I really appreciate your expertise to ensure we end up with the best possible solution to move congestion control one step forward.

Sorry about the subpar quality of the grammar, and the text in general. I thought we had the NEP cleaned up much better, otherwise I wouldn't have asked for SME reviews. I think we rushed a bit too much, as we wanted to get the NEP process started as soon as possible.

I have tried my best to fix it up now and added a new section about important concepts. Please, @robin-near, can you take another look? Let me know if something is still not well defined or not written clearly.

@jakmeier jakmeier requested a review from robin-near May 15, 2024 22:04
@jakmeier (Contributor, Author)

Oh and in the time since the last changes, we added "missed chunks congestion" as an additional indicator. I have added it to the concepts section and to the "Changes to chunk execution" section.

It's a bit of a last-minute change, not something we initially wanted to address. But for stateless validation, Near Protocol needs a way to limit incoming receipts even when chunks are missed. This NEP introduces all the tools required to solve that problem, so it seemed worth including. But if preferred by the working group, we could also separate it out as its own NEP that builds on top of congestion control.
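For illustration only, a missed-chunks indicator could be folded into the congestion level roughly like this; the scaling and names are assumptions, not the exact formula of the NEP:

```rust
/// Sketch: treat a run of missed chunks for a shard as congestion, so that
/// senders throttle receipts to shards that are currently not producing chunks.
fn missed_chunks_congestion(missed_chunks_count: u64) -> f64 {
    const MAX_MISSED: u64 = 10; // illustrative scale
    (missed_chunks_count as f64 / MAX_MISSED as f64).min(1.0)
}

/// The effective congestion can then be taken as the maximum over indicators.
fn effective_congestion(incoming: f64, outgoing: f64, missed: f64) -> f64 {
    incoming.max(outgoing).max(missed)
}
```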

@wacban, since you spear-headed and implemented this, can you please double-check that I got the details around missed chunk congestion right?

Labels: A-NEP (A NEAR Enhancement Proposal), S-review/needs-sme-review (a NEP in the REVIEW stage waiting for Subject Matter Expert review), WG-protocol (Protocol Standards Work Group should be accountable)
Projects: Status: REVIEW