Gossip data to a peer without valid channel increases cpu usage #4

jhernandezb · 2023-01-28T18:47:40Z

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):
0.34.23

ABCI app (name for built-in, URL for self-written if it's publicly available):
https://github.com/public-awesome/stargaze

Environment:

OS ubuntu 20.04+

What happened:
The Stargaze mainnet currently has multiple reports of increased CPU usage without any meaningful change in our stack.

After digging a bit, we found that gossipDataRoutine, and specifically the gossipDataForCatchup method, was causing this increase in CPU usage.

In the following snippet, if SendEnvelopeShim fails, the routine immediately retries gossiping the same block part until the peer state changes (a different round, etc.). Each retry generates more work because it loads the block meta and the block part from disk again.

tendermint/consensus/reactor.go

Lines 698 to 710 in e0f68fe

    
    if p2p.SendEnvelopeShim(peer, p2p.Envelope{ //nolint: staticcheck
    	ChannelID: DataChannel,
    	Message: &tmcons.BlockPart{
    		Height: prs.Height, // Not our height, so it doesn't matter.
    		Round:  prs.Round,  // Not our height, so it doesn't matter.
    		Part:   *pp,
    	},
    }, logger) {
    	ps.SetHasProposalBlockPart(prs.Height, prs.Round, index)
    } else {
    	logger.Debug("Sending block part for catchup failed")
    }
    return

Adding a small sleep, as other error checks in this routine already do, fixes the problem. Our fork does this in public-awesome@da5a32f, which seemed to reduce the CPU usage:

    time.Sleep(conR.conS.config.PeerGossipSleepDuration)
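To illustrate the effect of that sleep, here is a minimal, self-contained sketch. It is not the Tendermint code: `catchupGossiper`, `simulate`, and the injected `send`/`sleep` functions are hypothetical stand-ins, and the 100ms value for `peerGossipSleepDuration` is an assumption. The sleep is injected so the sketch runs instantly and only accumulates the time a real node would spend waiting instead of re-reading from disk.

```go
package main

import (
	"fmt"
	"time"
)

// peerGossipSleepDuration mirrors the config value named in the issue.
// The 100ms figure is an assumption made for this sketch.
const peerGossipSleepDuration = 100 * time.Millisecond

// catchupGossiper is a hypothetical, stripped-down model of the catch-up path.
type catchupGossiper struct {
	send      func() bool         // stand-in for p2p.SendEnvelopeShim
	sleep     func(time.Duration) // injectable so the sketch does not really block
	diskLoads int                 // counts simulated block meta / block part loads
}

// gossipOnce performs one catch-up iteration and reports whether the send succeeded.
func (g *catchupGossiper) gossipOnce() bool {
	g.diskLoads++ // stands in for loading block meta and the block part from disk
	if g.send() {
		return true
	}
	// Back off before the caller retries, instead of retrying immediately.
	g.sleep(peerGossipSleepDuration)
	return false
}

// simulate lets the peer reject the first `failures` sends, then accept one.
// It returns the number of simulated disk loads and the total simulated wait.
func simulate(failures int) (loads int, waited time.Duration) {
	remaining := failures
	g := &catchupGossiper{
		send: func() bool {
			if remaining > 0 {
				remaining--
				return false
			}
			return true
		},
		sleep: func(d time.Duration) { waited += d },
	}
	for !g.gossipOnce() {
	}
	return g.diskLoads, waited
}

func main() {
	loads, waited := simulate(5)
	fmt.Printf("disk loads: %d, simulated wait: %s\n", loads, waited)
}
```

Without the back-off, the same five failures would happen back to back, and a peer that never accepts the message would keep the loop (and the disk reads) spinning at full speed.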

Currently there is no way to know from this method whether the peer can accept the packet, because hasChannel is a private method. Ideally we could first check something like peer.IsValid(), skip the disk load when the check fails, and execute the remaining logic only for valid peers.
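A rough sketch of that "check first, load later" idea follows. Everything in it is hypothetical: the `peer` type, its `hasChannel` method, `blockStore`, and `gossipPart` are illustration-only names, not the Tendermint API, and the `dataChannel` value is only assumed to match the consensus reactor's DataChannel.

```go
package main

import "fmt"

// dataChannel is assumed to correspond to the consensus reactor's DataChannel ID.
const dataChannel byte = 0x21

// peer is a hypothetical view of a remote peer: the channel IDs it advertises.
type peer struct {
	channels map[byte]bool
}

// hasChannel is the kind of public check the issue asks for (hypothetical).
func (p peer) hasChannel(id byte) bool { return p.channels[id] }

// blockStore is a stand-in for the on-disk block store; it only counts loads.
type blockStore struct{ loads int }

// loadPart simulates reading one block part from disk.
func (s *blockStore) loadPart(height int64, index int) []byte {
	s.loads++
	return []byte{byte(index)}
}

// gossipPart checks the peer before touching the store, so a peer without
// the data channel never triggers a disk read. It reports whether a part
// would have been sent.
func gossipPart(p peer, s *blockStore, height int64, index int) bool {
	if !p.hasChannel(dataChannel) {
		return false // skip the disk read entirely
	}
	part := s.loadPart(height, index)
	return len(part) > 0 // stand-in for a successful send
}

func main() {
	store := &blockStore{}
	invalid := peer{channels: map[byte]bool{}}
	valid := peer{channels: map[byte]bool{dataChannel: true}}

	fmt.Println(gossipPart(invalid, store, 10, 0), store.loads)
	fmt.Println(gossipPart(valid, store, 10, 0), store.loads)
}
```

The check does not replace the back-off, since a send can still fail for other reasons, but it would avoid the wasted disk reads for peers that can never receive the message.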

What you expected to happen:
Add a delay, or a check that prevents sending info to a peer in an invalid state.

Have you tried the latest version: yes/no
Yes

How to reproduce it (as minimally and precisely as possible):
It is hard to replicate the current network conditions, as there seem to be some invalid peers in the network causing this issue, but joining the network with a new node will reproduce it.

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

Config (you can paste only the changes you've made):

node command runtime flags:

Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs

Anything else we need to know:


sergio-mena · 2023-02-01T22:09:30Z

Addressed by cometbft/cometbft#241, cometbft/cometbft#244, and cometbft/cometbft#245

See issue informalsystems#4. Compare to their patch informalsystems/tendermint#245.

sergio-mena mentioned this issue Jan 30, 2023

add peer gossip sleep cometbft/cometbft#241

Merged

3 tasks

jhernandezb mentioned this issue Jan 31, 2023

Unnecessary allocations in ZerologWrapper cosmos/cosmos-sdk#14850

Closed

sergio-mena closed this as completed Feb 1, 2023

This was referenced Feb 6, 2023

chore: bump tm with p2p patch cosmos/gaia#2149

Closed

chore: bump tm with p2p patch cosmos/gaia#2150

Merged

chore: Bump tendermint and replace with informal fork cosmos/gaia#2151

Merged

JimLarson mentioned this issue Feb 8, 2023

Cherrypick informal gossip fix into our tendermint fork Agoric/agoric-sdk#6945

Closed

JimLarson added a commit to agoric-labs/tendermint that referenced this issue Feb 8, 2023

fix: cherrypick gossip fix

9f82c42

See issue informalsystems#4. Compare to their patch informalsystems/tendermint#245.

JimLarson mentioned this issue Feb 8, 2023

fix: cherrypick gossip fix agoric-labs/tendermint#35

Merged

ZaradarBH mentioned this issue Mar 15, 2023

V0.45.13 proposal classic-terra/core#176

Merged

8 tasks
