New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cross-Shard Congestion Control #539
base: master
Are you sure you want to change the base?
Conversation
it's time to get first feedback by engineers outside the focus group
And a first draft of "the story behind" is also available: https://github.com/near/nearcore/blob/master/docs/architecture/how/receipt-congestion.md While the NEP focusses on specifying the proposed changes, the story behind explains our thought process why these changes lead to the desired consequences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. Some high level thoughts.
A summary of my understanding is that each shard is going to advertise how much queue space it has available and other shards will take that into account when constructing their chunks and accepting new transactions. Is that a fair summery?
If so, then my question is about fairness and relatedly load balancing. The two cases that I am thinking of are:
- Shard A is congested and shard B and C both have a ton of receipts for it. Assuming all shards are created equal, how do we make sure that the remaining queue space is shared fairly between B and C? Is it by relying on the linear interpolation?
- Shard A is congested and shard B has a ton of receipts for it and shard C has no receipts for it. How do we make sure that we are able to provide all the queue space to B and do not reserve any for C?
Yes, that sounds exactly right.
We don't give any guarantees about fairness. We hope that backpressure measures are reducing incoming transactions sharp enough that congestion resolves quickly and everyone can send again. But yes, linear interpolation of how much bandwidth (measured in gas) each shard can send per chunk should help in most practical scenarios, as the newly available space in the incoming queue of the congested shard is shared evenly across all sending shards.
There is only one big incoming queue, without accounting per shard. So in this example, shard B can fill it up entirely. Shard C will be sad when it wants to send a single receipt and sees the queue full. But I personally think it's a good trade-off to make. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few nits, typos and such
Co-authored-by: wacban <wacban@users.noreply.github.com>
Generally happy with your responses here. One other approach I have seen (and implemented in the past) to guarantee fairness is some sort of credit based queuing. This lets a receiving entity decide in fine grain how much of its queue it wants to dedicate to each sender. It is natural to use this mechanics to implement fair sharing or to arbitrary types of prioritisation as well (e.g. one shard is able to send 2x more than another). The drawback of course is more state tracking and complex implementation. So I'm happy with the proposed approach. |
neps/nep-0539.md
Outdated
|
||
We store the outgoing buffered receipts in the trie, similar to delayed receipts | ||
but in their own separate column. Therefore we add a trie column | ||
`BUFFERED_RECEIPT_OR_INDICES: u8 = 13;`. Then we read and write analogue to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor detail: I know we use this pattern for DELAYED_RECEIPT_OR_INDICES
, but it seems to be that way for historical reasons (see commit message here).
For this new queue it would be clearer to have separate BUFFERED_RECEIPT
and BUFFERED_RECEIPT_INDICES
columns.
Another question popped into my head earlier. AFAIU, creating a promise in NEAR is infallible i.e. contract A on shard 1 can always create a receipt for contract B on shard 2. Further, it is the case that without actually executing the receipt against contract A, we cannot know for sure whether or not it will call contract B. In the worst case, many different contracts on many different shards can all target the same contract (or a set of contracts on a shard). Does the proposed solution handle such scenarios? Is the filter operation defined going to apply to the receipts created above? |
The filter operation is only applicable to transactions, not to receipts. Once receipts are created, we commit to execute them. The described situation is indeed problematic. Of course, that's exactly what backpressure is for. If shard 3 becomes congested, shard 1 and 2 can still create receipts for shard 3 but they are forced to keep them in their outgoing buffer before forwarding. This way, shard 3 is protected from additional inflow. Eventually, shards 1 and 2 may also become congested and the backpressure spreads further out to all shards trying to send something to them. Eventually all shards are congested and no more new transactions anywhere are accepted. Unfortunately, it is still not handled perfectly. We only apply backpressure based on incoming congestion, to avoid deadlocks. But if we are able to handle incoming receipts quickly, it is possible shard 1 keeps filling its outgoing buffer for shard 2, growing it faster than it can forward receipts in it. But because the incoming queue is always empty, it does not apply backpressure. (cc @wacban we should probably simulate with the latest changes that decouple incoming and outgoing congested to see how bad this can become.) |
I think I understand the high level explanation. The drawback is that in the worst case, due to one shard not keeping up, it is possible that the entire network has to stop accepting new transactions. I am still happy with this solution and see this as a very good next step to build. Once built, I can imagine further refinements where we can address such cases as well. |
If I understand correctly this could be implemented by splitting the delayed receipts queue into one queue per sending shard and then implementing some fair way to pull receipts from this set of queues. This makes sense but I would rather keep this NEP in the current simpler form and work on top of it in follow ups. The good news is that as far as I can tell the current proposal should be easily extendable to what you're suggesting.
That is correct, just to add a detail to it, each shard will advertise two numbers, one representing the fullness of the "outgoing queues" and one representing the fullness of the "incoming queue". Those two types of congestion are treated differently which allows us to better adapt the measures to the specific workload that the network is under. |
@wacban: perfect, sounds like a solid plan to me. I am always happy to build incrementally. |
This document describes a few fundamental congestion control problems and ideas to solve them. The added page serves as secondary document to [NEP](near/NEPs#539) to summarise the thought process behind the most important design decisions. But it is generally applicable to congestion in Near Protocol's receipt execution system as it works today. It can even serve as documentation for how congestion can occur today. The document includes 8 graphs generated using [graphviz](https://graphviz.org/). To regenerate after modifying the `*.dot` files, install the graphviz toolbox (on systems with apt: `sudo apt install graphviz`) and then run `dot -Tsvg img_name.dot > img_name.svg`. --------- Co-authored-by: wacban <wacban@users.noreply.github.com>
- The formulas in the pseudo code were opposite to the description, fixing it by swapping incoming and general congestion. - "General" congestion is a bad name. Changing it to "Memory" congestion. - Add a sentence of motiviation to the pseudo code snippets for extra explanation - Add TODO for unbounded queue problem
Co-authored-by: wacban <wacban@users.noreply.github.com>
No link to the actual reference implementation, yet. Just some clarifying text and in-place code.
I think it's better to keep it simple. While it could be useful in the future to look at guaranteed to be burnt and attached gas separately for congestion, our current strategy does not look at it.
I implemented the model of the strategy proposed in the NEP. I am now analysing different workloads to make sure that the strategy can handle them well. I will be sharing results and suggestions here as I progress. AllToOne workload.In this workload all shards send direct transactions to a single shard that becomes congested. The strategy does a rather bad job at dealing with this workload as the outgoing buffers grow in gas without a reasonable limit. The memory limit is never exceeded because the receipts are small but the number and gas of receipts grows beyond acceptable values. The reason is that the current proposal does not take the gas accumulated in outgoing buffers into account. My suggestion would be to replace memory congestion with ShardChunkHeaderInnerV3 {
// as is
incoming_congestion: u16,
// memory -> general
general_congestion: u16,
}
I implemented the suggestion in the model and the results are quite good - both the incoming queue and outgoing buffers display bounded, periodic behaviour. In the picture below, each period is characterized by four phases:
We can probably smooth it out further by replacing the hard incoming congestion threshold with linear interpolation. It's not a priority right now so I'll leave it as is. |
As a working group member, I nominate @robin-near and @akashin as subject matter experts to review this NEP. |
described the two alternatives for efficiently calculating the congestion information and updated relevant sections
Added the description of how to integrate with resharding.
It should be based on general congestion, not only memory congestion.
update the preferred option and added state sync integration
This PR connects the changes of previous PRs (mainly #11128, #11173, and #11195) to apply congestion control rules as defined in [NEP-539](near/NEPs#539). Builds with a stable protocol version are unaffected by this PR, since `own_congestion_info` will return None and the `ReceiptSinkV1` unconditionally forwards all receipts exactly as it is done today. Most components combined in this PR are already tested in unit tests. The new features added with this PR are best tested with integration tests, which will follow in future PRs. The PR has two commits, it makes sense to review each on its own.
@akashin and @robin-near , friendly reminder to share your review on this NEP |
NEP Status (Updated by NEP Moderators)Status: Review SME reviews:
Protocol Work Group voting indications (❔ | 👍 | 👎 ):
|
@bowenwang1996 , it seems @akashin is OOO this week, should we nominate someone else to expedite? |
I skimmed the proposal and overall approach looks good to me, great work! |
neps/nep-0539.md
Outdated
transactions to the congested shard when it is hit. | ||
|
||
`MIN_TX_GAS = 20 Tgas` guarantees that we can still accept a decent amount of | ||
new transactions to shards that are not congestions, even if the local shard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: congestions -> congested
neps/nep-0539.md
Outdated
restrictions. Additionally, the memory consumption of both receipt types can | ||
also cause congestion. | ||
|
||
This proposal replaces the local congestion control rules already in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it supercedes/extends the existing rules rather than fully replacing them (we still have transaction pool limit and delayed receipt queue limit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I've added a few words to describe what we keep and what we replace.
neps/nep-0539.md
Outdated
``` | ||
|
||
We replace the formula for the transaction limit to depend on the | ||
`incoming_congestion` computed in the previous shard variable (between 0 and 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean computed in the previous chunk for this shard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The sentence should be fixed now to make more sense.
Technical summaryOver the last few months, NEAR mainnet has been experiencing a considerable increase in usage, which led to regular congestion on shards 2, 3 and 5. Usually, transactions on NEAR execute within a few blocks and the latency for the end users is on the order of a few seconds. However, during these periods of congestion, users regularly had to wait more than 15 minutes for their transaction to be processed and sometimes the wait time was over 1 hour. This is a considerable degradation in the user experience and it needs to be addressed to make the users of NEAR happy. This NEP introduces a number of mechanisms to control the order in which transactions/receipts are admitted, exchanged between shards and processed as well as a concrete policy that is carefully tuned to balance end-user latency with overall system throughput and peak memory overhead. RecommendationI recommend approving this NEP. Benefits
Concerns
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments on clarifying the content - I'll take another look once that is addressed. Also jakmeier#3 for the typos and grammar nits that would introduce too much clutter if I left them as comments.
below) and modify how outgoing receipts are treated in the transaction | ||
conversion step (3) and in the receipts execution step (4). | ||
|
||
The new chunk execution then follows this order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have trouble following this process, because many variables (like gas_backlog, outgoing_gas, outgoing_gas_limit) are not previously well defined. Could we define these variables more carefully before this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, there were too many rounds of renaming things and in the end it was rather inconsistent. I completely missed that, thanks for pointing it out. It should be fixed now.
# new | ||
MIN_TX_GAS = 20 Tgas | ||
MAX_TX_GAS = 500 Tgas | ||
tx_limit = mix(MAX_TX_GAS, MIN_TX_GAS, incoming_congestion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses "incoming_congestion" before introducing the concept. Should we add a section above that summarizes the relevant congestion fields? Otherwise reading this feels a bit like reverse engineering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes a lot of sense. I've done so now, please take another look.
Correct some typos, grammar issues, and clarify some text.
Address various comments by SME reviewers. - Fix various grammar errors. - Remove old names and use only the correct names for variables - Start the specification section by introducing important concepts
Thanks a lot to @akashin and @robin-near for taking the time to read through our proposal and giving valuable feedback! I really appreciate your expertise to ensure we end up with the best possible solution to move congestion control one step forward. Sorry about the subpart quality in the grammar, and just in general. I thought we had the NEP cleaned up much better, otherwise I wouldn't have asked for SME reviews. I think we rushed a bit too much then, as we wanted to get the NEP processed started as soon as possible. I have tried my best to fix it up now and added a new section about important concepts. Please, @robin-near, can you take another look? Let me know if something is still not well defined or not written clearly. |
Oh and in the time since the last changes, we added "missed chunks congestion" as an additional indicator. I have added it to the concepts section and to the "Changes to chunk execution" section. It's a bit of a last minute change, not something we initially wanted to address. But for stateless validation, Near Protocol needs a way to limit incoming receipts even when chunks are missed. This NEP introduces all the required tools to solve that problem, so it seemed worth it to include. But if preferred by the working group, we could also separate it out as its own NEP that builds on top of congestion control. @wacban, since you spear-headed and implemented this, can you please double-check that I got the details around missed chunk congestion right? |
Latest rendered view.