
Message loss when network rebalances #2756

Closed
bowenwang1996 opened this issue May 31, 2020 · 12 comments · Fixed by #2841
Assignees
Labels
A-network Area: Network C-bug Category: This is a bug

Comments

@bowenwang1996
Collaborator

In #2490, we introduced network balancing to avoid reaching the maximum number of peers. However, it comes at a cost: when a message is being routed, it can be lost because the connection between two nodes on the route is dropped while they rebalance their network connections. This makes the system slower overall. In some cases, the requester will re-request or request from other peers (for example, for partial encoded chunks). In other cases, the message is simply dropped. This can have some fairly undesirable consequences, such as block producers not getting enough approvals to produce their blocks. It also seems to allow an attacker to initiate an eclipse attack on a node, which can have a profound impact on the overall network topology.
It is not clear to me, however, how to best solve this issue. Some ideas:

  • When we send a message, allow the sender to specify how many times they want to send it, and send the copies through different routes.
  • When we rebalance network connections, minimize the decrease in the number of reachable peers.
  • When we send or route a message, instead of using round robin on the shortest paths, do something like round robin on the peers with the most routes to the destination, or pick the peer with the most connections to the destination among the shortest paths, or a combination of the two.
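The third idea above could be sketched roughly like this (a hypothetical, self-contained sketch, not the nearcore routing code): among the candidate next hops that lie on a shortest path, prefer the one with the most known routes to the destination.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Hypothetical sketch: among the candidate next hops that lie on a
/// shortest path, pick the peer with the most known routes to the
/// destination (ties broken deterministically by peer id).
fn pick_next_hop<'a>(
    shortest_path_peers: &[&'a str],
    routes_to_target: &HashMap<&str, usize>,
) -> Option<&'a str> {
    shortest_path_peers
        .iter()
        .copied()
        .max_by_key(|p| (routes_to_target.get(p).copied().unwrap_or(0), Reverse(*p)))
}
```

A real implementation would combine this with round robin so a single well-connected peer does not receive all the traffic.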
@mfornet
Member

mfornet commented May 31, 2020

It is not clear why routed messages get lost during rebalancing. If nodes are removing connections, it is expected that they already have a high number of connections, so they should still have some route to the target. Messages are only dropped when no route to the target is found or when a message has been routed through more than 100 hops.

How do you know that messages are being dropped? From logs?

@bowenwang1996
Collaborator Author

If nodes are removing connections, it is expected that they already have a high number of connections, so they should still have some route to the target

Is it guaranteed in any way? I didn't see it.

@mfornet
Member

mfornet commented May 31, 2020

If nodes are removing connections, it is expected that they already have a high number of connections, so they should still have some route to the target

Is it guaranteed in any way? I didn't see it.

What is guaranteed is that nodes will only drop connections if the number of active peers goes over some threshold (currently 35). This is a high enough number to be confident that the network should remain connected throughout the whole process.

Again, are you certain that messages are being dropped (did you see it in the logs)? If so, maybe the root cause is something else.

@mfornet
Member

mfornet commented Jun 1, 2020

@bowenwang1996 Before jumping into this issue we should investigate further how many messages are being dropped and the real cause of those drops.

I mean, they are certainly dropped if there is no route available (it is also possible that a message is being routed in a cycle and gets dropped once TTL reaches 0, but that is very unlikely unless there is a bug or very pathological behavior from peers in the network).
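The two drop conditions described above can be illustrated with a small sketch (hypothetical names and shapes, not the nearcore types): a routed message is discarded only when its hop budget is exhausted or when no next hop toward the target is known.

```rust
/// Hypothetical sketch of the two drop conditions: a routed message is
/// dropped only when its hop count (TTL) is exhausted or when no route
/// to the target is known.
struct RoutedMessage {
    ttl: u8,
}

enum RouteAction {
    Forward(String), // forward to this next-hop peer
    Drop,            // message is lost
}

fn route_step(msg: &mut RoutedMessage, next_hop: Option<String>) -> RouteAction {
    if msg.ttl == 0 {
        return RouteAction::Drop; // routed through too many hops
    }
    msg.ttl -= 1;
    match next_hop {
        Some(peer) => RouteAction::Forward(peer),
        None => RouteAction::Drop, // no route to target found
    }
}
```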

@bowenwang1996
Collaborator Author

@mfornet I agree. How about we add a counter to record when a message fails to be delivered, as well as its type?

@mfornet
Member

mfornet commented Jun 2, 2020

@mfornet I agree. How about we add a counter to record when a message fails to be delivered, as well as its type?

Sounds good, let's see if that helps us. I'll add it as a Prometheus counter.
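The bookkeeping behind such a metric is simple; a minimal std-only stand-in could look like the following (the actual change exposes it as a Prometheus counter, and the message-type names below are illustrative):

```rust
use std::collections::HashMap;

/// Minimal stand-in for the proposed metric: count dropped messages by
/// message type. The real change exposes this as a Prometheus counter;
/// this std-only version just illustrates the bookkeeping.
#[derive(Default)]
struct DroppedMessageCounter {
    by_type: HashMap<String, u64>,
}

impl DroppedMessageCounter {
    /// Record one dropped message of the given type.
    fn record(&mut self, msg_type: &str) {
        *self.by_type.entry(msg_type.to_string()).or_insert(0) += 1;
    }

    /// How many messages of this type have been dropped so far.
    fn count(&self, msg_type: &str) -> u64 {
        self.by_type.get(msg_type).copied().unwrap_or(0)
    }
}
```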

@mfornet
Member

mfornet commented Jun 6, 2020

@mfornet I agree. How about we add a counter to record when a message fails to be delivered, as well as its type?

@bowenwang1996 did this in #2804. Check the "Dropped messages" graph of one of the nodes here.

@bowenwang1996
Collaborator Author

@mfornet that's a lot of dropped messages. Is what's shown in the graph cumulative?

@mfornet
Member

mfornet commented Jun 6, 2020

@mfornet that's a lot of dropped messages. Is what's shown in the graph cumulative?

Yes, it is cumulative.

I see that ~10K messages were dropped in 20 minutes, and since then it has been dropping messages slowly (mostly PartialEncodedChunkResponse), ~2K messages in 10 hours.

@ilblackdragon ilblackdragon added the C-bug Category: This is a bug label Jun 7, 2020
@mfornet
Member

mfornet commented Jun 11, 2020

The high number of dropped messages is due to the small size of the route-back cache: records are removed from the cache faster than the responses arrive. There is an easy fix, which is simply increasing the size of the cache to account for this; however, as it is implemented right now, it can be easily abused. I'm working on a fix that I want to discuss with you:

/// Cache to store route-back messages.
///
/// The interface of the cache is similar to a regular HashMap:
/// elements can be inserted, fetched and removed.
///
/// Motivation behind the following (complex) design:
///
/// Since this cache is populated with messages that should be routed
/// back, it is important that elements are not removed unless there is
/// some confidence that the response is not going to arrive.
///
/// A naive cache can be easily abused: a malicious actor can send
/// void messages that should be routed back, replacing useful entries
/// in the cache with fake ones and causing most of the messages that
/// require a route back to be dropped.
///
/// Solution:
///
/// The cache will accept new elements until it is full. If inserting a
/// new element would exceed the capacity, the message to remove is
/// selected as follows:
///
/// 1. For every message, store the time it arrived.
/// 2. For every peer, store how many messages should be routed to it.
///
/// First, messages that have been in the cache longer than
/// EVICTED_TIMEOUT are removed. If no message was removed, the oldest
/// message from the peer with the most messages in the cache is removed.
///
/// Rationale:
///
/// - Old entries in the cache will eventually be removed (no memory leak).
/// - If the cache is not at full capacity, all new records will be stored.
/// - If a peer tries to abuse the system, it will be able to allocate at most
///     $capacity / number_of_active_connections$ entries.
///
pub struct RouteBackCache {
    /// Number of records currently in the cache.
    size: i64,
    /// Maximum number of records allowed in the cache.
    capacity: i64,
    /// Maximum time allowed before removing a record from the cache.
    evict_timeout: i64,
    /// Main map from message hash to the time it was created + target peer.
    /// Size: O(capacity)
    main: HashMap<CryptoHash, (i64, PeerId)>,
    /// Number of records allocated by each PeerId.
    /// The size is stored with a negative sign, to order PeerIds in decreasing order.
    /// Size: O(number of active connections)
    size_per_target: BTreeSet<(i64, PeerId)>,
    /// List of all hashes associated with each PeerId. Hashes within each PeerId
    /// are sorted by the time they arrived, from oldest to newest.
    /// Size: O(capacity)
    record_per_target: BTreeMap<PeerId, BTreeSet<(i64, CryptoHash)>>,
}
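A simplified, self-contained sketch of the eviction rule above (hashes and peer ids reduced to plain integers and strings; this is not the actual nearcore implementation, which keeps the auxiliary indices shown in the struct):

```rust
use std::collections::HashMap;

/// Simplified route-back cache: evict expired entries first; if nothing
/// is expired, evict the oldest entry of the peer with the most entries.
struct SimpleRouteBackCache {
    capacity: usize,
    evict_timeout: i64,
    /// message hash -> (arrival time, target peer)
    main: HashMap<u64, (i64, String)>,
}

impl SimpleRouteBackCache {
    fn new(capacity: usize, evict_timeout: i64) -> Self {
        Self { capacity, evict_timeout, main: HashMap::new() }
    }

    fn insert(&mut self, hash: u64, now: i64, target: &str) {
        if self.main.len() >= self.capacity {
            self.evict(now);
        }
        self.main.insert(hash, (now, target.to_string()));
    }

    fn evict(&mut self, now: i64) {
        // 1. Drop every entry older than evict_timeout.
        let before = self.main.len();
        let timeout = self.evict_timeout;
        self.main.retain(|_, (t, _)| now - *t <= timeout);
        if self.main.len() < before {
            return;
        }
        // 2. Nothing expired: drop the oldest entry from the peer that
        //    currently owns the most entries (ties broken by peer id).
        let mut per_peer: HashMap<String, usize> = HashMap::new();
        for (_, peer) in self.main.values() {
            *per_peer.entry(peer.clone()).or_insert(0) += 1;
        }
        if let Some(peer) = per_peer
            .into_iter()
            .max_by_key(|(p, n)| (*n, std::cmp::Reverse(p.clone())))
            .map(|(p, _)| p)
        {
            if let Some(victim) = self
                .main
                .iter()
                .filter(|(_, (_, p))| *p == peer)
                .min_by_key(|(_, (t, _))| *t)
                .map(|(h, _)| *h)
            {
                self.main.remove(&victim);
            }
        }
    }
}
```

The O(n) scans in `evict` are what the `size_per_target` and `record_per_target` indices in the real struct avoid.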

cc @bowenwang1996 @SkidanovAlex

@bowenwang1996
Collaborator Author

@mfornet is there a per-peer limit? Also, what is roughly the size of this cache?

@mfornet
Member

mfornet commented Jun 12, 2020

is there a per-peer limit?

Not explicitly. While the cache is not at max capacity, all messages will be stored; but if it is working at capacity, each peer will get at most capacity / active_peers entries (when perfectly balanced) as other peers start allocating new messages in the cache. Note that I remove the oldest messages from the peers that have allocated the largest number of records in the cache.

Also, what is roughly the size of this cache?

I modified the definition of RouteBackCache; it will use ~200 bytes per entry. If we use a capacity of 1e6 (as I suggest), that is roughly 190MB when the cache is full.

mfornet added a commit that referenced this issue Jun 12, 2020
fix #2756

Increased the route-back cache size to avoid dropping messages
that are supposed to be routed back. The problem was that when
some nodes are syncing, several chunk requests are made, and
requests arrive faster than responses, effectively invalidating
the cache; most of the responses are dropped, since the route-back
hash was already evicted from the cache.

Note: this is a vector of attack if we rely on route-back messages,
since a malicious actor can feed the cache with new information,
effectively invalidating our current cache.

Test plan
=========
Reproduced this scenario locally, and tested with and without the cache.
nearprotocol-bulldozer bot pushed a commit that referenced this issue Jun 18, 2020
Increased the route-back cache size to avoid dropping messages
that are supposed to be routed back. The problem was that when
some nodes are syncing, several chunk requests are made, and
requests arrive faster than responses, effectively invalidating
the cache; most of the responses are dropped, since the route-back
hash was already evicted from the cache.

This PR also introduces the cache strategy proposed [here](#2756 (comment))

Fix #2756 

Test plan
=========
- Reproduced a scenario where several routed messages were dropped locally, and tested with both the new cache and the old cache. The fix was the size increase.
- Unit tests to check that the new strategy works properly.