Lighthouse 4.0.0 and "context canceled" Error #7172

Closed

JP-Ellis opened this issue Mar 23, 2023 · 29 comments

Comments

@JP-Ellis

JP-Ellis commented Mar 23, 2023

System information

  • Erigon Version: erigon version 2.40.1-stable
    • Command
      /usr/bin/erigon \
        --datadir=/eth/erigon \
        --http.port 8545 \
        --http.api eth,erigon,engine,web3 \
        --authrpc.port 8551 \
        --authrpc.jwtsecret /eth/jwt.hex \
        --private.api.addr localhost:8552 \
        --private.api.ratelimit 65536 \
        --metrics \
        --metrics.port 6060 \
        --externalcl \
        --prune htc \
        --prune.r.before=11184524
      
  • OS: Linux (Linux 6.1.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 04 Jan 2023 16:28:15 +0000 x86_64 GNU/Linux )
  • Consensus Layer: Lighthouse
    • Version:
      Lighthouse v4.0.0-0616e01
      BLS library: blst-modern
      SHA256 hardware acceleration: true
      Allocator: jemalloc
      Specs: mainnet (true), minimal (false), gnosis (true)
      
    • Command:
      /usr/bin/lighthouse beacon_node \
        --datadir /var/lib/lighthouse \
        --staking \
        --metrics \
        --validator-monitor-auto \
        --execution-endpoint http://localhost:8551 \
        --execution-jwt /eth/jwt.hex \
        --builder http://localhost:18550
      

Behaviour

  • After a few hours, sync stalls completely with the error "context canceled".
  • Once the error occurs, Erigon still generates logs but otherwise appears unresponsive (sending a kill signal does not cause it to shut down, and I have to send all 10 kill signals to force a restart).
  • After a restart, Erigon syncs again with no issues until the error reappears at random.
  • This error only appeared after upgrading Lighthouse to v4.0.0. I have tested it with the v4.0.1-rc.0 hotfix as well, and the error remains. I suspect the Lighthouse upgrade is the root cause of the issue.

Logs are located in the following gist: JP-Ellis/30aeca74b7760ea603f709309051aefc

@yorickdowne

yorickdowne commented Mar 23, 2023

To be clear, this is with Lighthouse v4.0.0 or v4.0.1-rc.0, not Erigon v4.0.0.

Lighthouse complains that Erigon no longer responds on the Engine API.

We've seen this on post-Shapella Goerli and pre-Shapella mainnet.

It's very reproducible, so it should be fairly easy to debug.

@JP-Ellis JP-Ellis changed the title Erigon 4.0.0 and "context canceled" Error Lighthouse 4.0.0 and "context canceled" Error Mar 23, 2023
@JP-Ellis
Author

JP-Ellis commented Mar 23, 2023

@yorickdowne The versions are specified verbatim in the main message; I just messed up the title.

I will be upgrading to Lighthouse v4.0.1-rc.0 in light of the most recent release. If the same error appears, I will mention it in the original post.

I suspect that even if the error is with Lighthouse, Erigon should probably not become unresponsive and require a forced restart.

@yorickdowne

"This error only appeared after upgrading Erigon to v4.0.0" - that's why the clarification.

v4.0.1-rc.0 causes the same behavior with Erigon.

@JP-Ellis
Author

Entirely my bad, typo is fixed 👍

@yorickdowne

I can reproduce this on Goerli with Erigon source-compiled from devel and Lighthouse v4.0.1-rc.0

@yorickdowne

Debug logs attached. Failure around 11:49 UTC. The logs aren't very instructive, alas:
erigonstall.debug.log.tar.gz

@yorickdowne

Clarification: Erigon doesn't ignore SIGTERM; it is just unable to actually shut down.

execution_1  | [INFO] [03-25|11:58:36.421] Got interrupt, shutting down... 
execution_1  | [DBUG] [03-25|11:58:36.421] Error while executing stage              err="[2/15 Headers] server is stopping"
execution_1  | [DBUG] [03-25|11:58:36.421] Handling incoming message                stream=RecvMessage err="context canceled"
execution_1  | [INFO] [03-25|11:58:36.421] Exiting Engine... 
execution_1  | [DBUG] [03-25|11:58:36.421] rpcdaemon: the subscription to pending blocks channel was closed 
execution_1  | [INFO] [03-25|11:58:36.421] RPC server shutting down 
execution_1  | [INFO] [03-25|11:58:36.421] Exiting... 
execution_1  | [INFO] [03-25|11:58:36.421] RPC server shutting down 
execution_1  | [DBUG] [03-25|11:58:36.421] Handling incoming message                stream=RecvUploadMessage err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.421] Served                                   conn=172.18.0.3:33076 method=engine_forkchoiceUpdatedV2 reqid=1 t=31m1.996573279s err="context canceled"
execution_1  | [DBUG] [03-25|11:58:36.421] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [WARN] [03-25|11:58:36.421] NewPayload                               err="context canceled"
execution_1  | [DBUG] [03-25|11:58:36.421] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [EROR] [03-25|11:58:36.421] Staged Sync                              err="[2/15 Headers] server is stopping"
execution_1  | [WARN] [03-25|11:58:36.422] Served                                   conn=172.18.0.3:49054 method=engine_newPayloadV2 reqid=1 t=35m23.820669261s err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.421] Served                                   conn=172.18.0.3:57830 method=engine_forkchoiceUpdatedV2 reqid=1 t=31m8.415568952s err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.421] NewPayload                               err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.422] Served                                   conn=172.18.0.3:49376 method=engine_newPayloadV2 reqid=1 t=25m16.930276228s err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.421] Served                                   conn=172.18.0.3:57810 method=engine_forkchoiceUpdatedV2 reqid=1 t=31m11.261330227s err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.422] Failed to serve http endpoint            err="http: Server closed"
execution_1  | [WARN] [03-25|11:58:36.421] Served                                   conn=172.18.0.3:57804 method=engine_forkchoiceUpdatedV2 reqid=1 t=31m16.417928013s err="context canceled"
execution_1  | [INFO] [03-25|11:58:36.421] RPC server shutting down 
execution_1  | [WARN] [03-25|11:58:36.423] Failed to serve http endpoint            err="http: Server closed"
execution_1  | [WARN] [03-25|11:58:36.421] NewPayload                               err="context canceled"
execution_1  | [WARN] [03-25|11:58:36.423] Served                                   conn=172.18.0.3:41594 method=engine_newPayloadV2 reqid=1 t=35m33.861133645s err="context canceled"
execution_1  | [DBUG] [03-25|11:58:36.422] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [DBUG] [03-25|11:58:36.423] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [DBUG] [03-25|11:58:36.423] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [DBUG] [03-25|11:58:36.423] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [DBUG] [03-25|11:58:36.421] Handling incoming message                stream=RecvMessage err="context canceled"
execution_1  | [INFO] [03-25|11:58:36.422] HTTP endpoint closed                     url=[::]:8545
execution_1  | [DBUG] [03-25|11:58:36.423] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [DBUG] [03-25|11:58:36.424] Handling incoming message                stream=RecvUploadHeadersMessage err="querying BlockHeaders: context canceled"
execution_1  | [INFO] [03-25|11:58:36.424] Engine HTTP endpoint close               url=[::]:8551
execution_1  | [DBUG] [03-25|11:58:36.425] Error in DNS random node sync            tree=all.goerli.ethdisco.net err="lookup all.goerli.ethdisco.net on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
execution_1  | [WARN] [03-25|11:58:42.768] Already shutting down, interrupt more to panic. times=9

And this is where it remains until forcibly killed.
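
For what it's worth, the hang pattern in these logs is consistent with the usual Go failure mode in which a goroutine blocks on something other than the cancelled context, so the shutdown sequence starts but never finishes. A tiny, purely illustrative sketch of that mode (this is not Erigon's code):

// shutdown_hang.go — an illustrative sketch only: once shutdown begins, any
// goroutine that blocks on something other than the cancelled context keeps
// the process alive, matching the "logs keep coming but it never exits" symptom.
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPayload never selects on ctx.Done(), so cancelling the context
// does not unblock it.
func waitForPayload(ctx context.Context, payloads <-chan string) {
	p := <-payloads // blocks forever if no payload ever arrives
	fmt.Println("handled", p, "ctx err:", ctx.Err())
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	payloads := make(chan string) // nothing is ever sent
	done := make(chan struct{})

	go func() {
		waitForPayload(ctx, payloads)
		close(done)
	}()

	cancel() // simulate SIGTERM: callers now log "context canceled"...

	select {
	case <-done:
		fmt.Println("clean shutdown")
	case <-time.After(2 * time.Second):
		// ...but the blocked worker never returns, so shutdown never completes.
		fmt.Println("worker still blocked; only kill -9 would end the process")
	}
}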

@yorickdowne

Uploaded trace logs here: https://www.dropbox.com/s/qim77hurooysz6b/erigonstall.trace.log.tar.gz?dl=0

The issue occurs around 16:15.

@michaelsproul

Given that Erigon was OK with LH v3.5.1, I suspect this is related to the getPayloadBodiesByRange engine API call, which we recently added support for in Lighthouse.

@parithosh
Contributor

We came across the same issue once we updated to Lighthouse v4.0.1-rc0 and Erigon v2.40.1. We MITM'ed the Engine API and captured the following exchange:

Mar 27 10:40:21.664 2023 REQUEST
{
  "jsonrpc": "2.0",
  "method": "engine_exchangeCapabilities",
  "params": [
    [
      "engine_newPayloadV1",
      "engine_newPayloadV2",
      "engine_getPayloadV1",
      "engine_getPayloadV2",
      "engine_forkchoiceUpdatedV1",
      "engine_forkchoiceUpdatedV2",
      "engine_getPayloadBodiesByHashV1",
      "engine_getPayloadBodiesByRangeV1",
      "engine_exchangeTransitionConfigurationV1"
    ]
  ],
  "id": 1
}

Mar 27 10:40:21.664 2023 RESPONSE
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": [
    "engine_forkchoiceUpdatedV1",
    "engine_forkchoiceUpdatedV2",
    "engine_newPayloadV1",
    "engine_newPayloadV2",
    "engine_getPayloadV1",
    "engine_getPayloadV2",
    "engine_exchangeTransitionConfigurationV1",
    "engine_getPayloadBodiesByHashV1",
    "engine_getPayloadBodiesByRangeV1"
  ]
}

It basically looped over that continuously.

Updating to lighthouse v4.0.1 and erigon to v2.41.0 (latest of both clients) seems to fix the issue.
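
For anyone who wants to watch this exchange without a full MITM setup, here is a rough, hypothetical probe (not part of Erigon or Lighthouse) that signs the Engine API JWT and sends the same engine_exchangeCapabilities request by hand. The JWT path and port are taken from the original report; the github.com/golang-jwt/jwt/v4 dependency is just one convenient choice.

// capability_probe.go — a sketch for manually querying engine_exchangeCapabilities
// on an authenticated Engine API endpoint.
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/golang-jwt/jwt/v4"
)

func main() {
	// The Engine API expects a JWT signed with the shared hex-encoded secret.
	raw, err := os.ReadFile("/eth/jwt.hex")
	if err != nil {
		panic(err)
	}
	secret, err := hex.DecodeString(strings.TrimPrefix(strings.TrimSpace(string(raw)), "0x"))
	if err != nil {
		panic(err)
	}
	token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{"iat": time.Now().Unix()})
	signed, err := token.SignedString(secret)
	if err != nil {
		panic(err)
	}

	// Advertise the same capability list Lighthouse v4.0.x sends (see above).
	body := `{"jsonrpc":"2.0","id":1,"method":"engine_exchangeCapabilities","params":[[
	  "engine_newPayloadV1","engine_newPayloadV2","engine_getPayloadV1","engine_getPayloadV2",
	  "engine_forkchoiceUpdatedV1","engine_forkchoiceUpdatedV2","engine_getPayloadBodiesByHashV1",
	  "engine_getPayloadBodiesByRangeV1","engine_exchangeTransitionConfigurationV1"]]}`

	req, _ := http.NewRequest("POST", "http://localhost:8551", bytes.NewBufferString(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+signed)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // prints the capability list the EL reports back
}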

@yorickdowne

yorickdowne commented Mar 27, 2023

We have a user report that this is not fixed with Erigon v2.41.0. My own test confirms that.

@geomad

geomad commented Mar 27, 2023

Maybe related:

Syncing gets stuck for me as well. I have tried lots of versions and am now running 2.38, which seems a bit more stable for me, but the problem is still there.
The issue happens when running with the lightclient (as a separate service, but maybe that's because of the lightclient issue?) and when running with a multiplexed consensus client (https://github.com/TennisBowling/executionbackup-rust). I have not tried running with a dedicated CL, but both setups were running successfully before without an issue.
The node was getting stuck every few hours with 2.40, and now got stuck after a day+ with 2.38.
Here are some logs while running with execution-backup on 2.38:

[INFO] [03-27|20:39:01.273] RPC Daemon notified of new headers       from=16919663 to=16919664 hash=0x90cbba08825364451c2df22da3f46724b85bfb0810632717af6b68e88eb64a1b header sending=13.355µs log sending=211ns
[INFO] [03-27|20:39:01.306] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:12.183] [2/15 Headers] Handling new payload      height=16919665 hash=0xe8527a44aeccc475e4b10d6a05bda4ddd784496f45e68787a92dd161908acc61
[INFO] [03-27|20:39:12.472] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:12.472] [2/15 Headers] Fork choice update: flushing in-memory state (built by previous newPayload)
[INFO] [03-27|20:39:12.766] Commit cycle                             in=89.925449ms
[INFO] [03-27|20:39:12.766] Timings (slower than 50ms)               Headers=68ms
[INFO] [03-27|20:39:12.766] Tables                                   PlainState=90.5GB AccountChangeSet=185.3GB StorageChangeSet=282.9GB BlockTransaction=10.6GB TransactionLog=550.6GB FreeList=147.1MB ReclaimableSpace=147.1GB
[INFO] [03-27|20:39:12.766] RPC Daemon notified of new headers       from=16919664 to=16919665 hash=0xe8527a44aeccc475e4b10d6a05bda4ddd784496f45e68787a92dd161908acc61 header sending=10.62µs log sending=180ns
[INFO] [03-27|20:39:13.597] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:24.881] [2/15 Headers] Handling new payload      height=16919666 hash=0x9d20bd54c82d0d6fc3e2a82cf9f3cd2557fcc08ec0cdb3ff3a2dd9bb867c03e5
[INFO] [03-27|20:39:28.066] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:37.011] [2/15 Headers] Handling new payload      height=16919667 hash=0x3ff4d203096ed1bd162dd9b02171a0a25a7673c2ce59b7f4efede60c1717314c
[INFO] [03-27|20:39:37.698] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:37.698] [2/15 Headers] Fork choice update: flushing in-memory state (built by previous newPayload)
[INFO] [03-27|20:39:40.130] Commit cycle                             in=523.189442ms
[INFO] [03-27|20:39:40.130] Timings (slower than 50ms)               Headers=218ms CallTraces=256ms AccountHistoryIndex=106ms StorageHistoryIndex=872ms LogIndex=406ms
[INFO] [03-27|20:39:40.130] Tables                                   PlainState=90.5GB AccountChangeSet=185.3GB StorageChangeSet=282.9GB BlockTransaction=10.6GB TransactionLog=550.6GB FreeList=147.1MB ReclaimableSpace=147.1GB
[INFO] [03-27|20:39:40.130] RPC Daemon notified of new headers       from=16919665 to=16919667 hash=0x3ff4d203096ed1bd162dd9b02171a0a25a7673c2ce59b7f4efede60c1717314c header sending=14.608µs log sending=271ns
[INFO] [03-27|20:39:40.805] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:50.071] [2/15 Headers] Downloading PoS headers... height=unknown hash=0x29fab9bee18c97d28a71b7fcc512ad6faa213e7d0614750fa9029078d8050657 requestId=26969
[INFO] [03-27|20:39:50.080] [2/15 Headers] Handling new payload      height=16919668 hash=0x29fab9bee18c97d28a71b7fcc512ad6faa213e7d0614750fa9029078d8050657
[INFO] [03-27|20:39:57.891] PoS headers verified and saved           requestId=26969 fork head=0x3ff4d203096ed1bd162dd9b02171a0a25a7673c2ce59b7f4efede60c1717314c
[INFO] [03-27|20:39:57.923] [2/15 Headers] Waiting for Consensus Layer...
[INFO] [03-27|20:39:57.923] [2/15 Headers] Fork choice update: flushing in-memory state (built by previous newPayload)
[INFO] [03-27|20:39:58.952] [txpool] stat                            block=16919667 pending=501 baseFee=10463 queued=11832 alloc=2.6GB sys=6.0GB
[INFO] [03-27|20:40:01.246] Commit cycle                             in=201.76072ms
[INFO] [03-27|20:40:01.246] Timings (slower than 50ms)               Headers=716ms CallTraces=1.108s AccountHistoryIndex=291ms StorageHistoryIndex=282ms LogIndex=646ms TxLookup=74ms
[INFO] [03-27|20:40:01.246] Tables                                   PlainState=90.5GB AccountChangeSet=185.3GB StorageChangeSet=282.9GB BlockTransaction=10.6GB TransactionLog=550.6GB FreeList=147.1MB ReclaimableSpace=147.1GB
[INFO] [03-27|20:40:01.246] RPC Daemon notified of new headers       from=16919667 to=16919668 hash=0x29fab9bee18c97d28a71b7fcc512ad6faa213e7d0614750fa9029078d8050657 header sending=10.4µs log sending=190ns
[INFO] [03-27|20:40:57.276] [p2p] GoodPeers                          eth66=3 eth67=7 eth68=9
[INFO] [03-27|20:40:57.606] [txpool] stat                            block=16919668 pending=1135 baseFee=10450 queued=11866 alloc=2.9GB sys=6.0GB
[INFO] [03-27|20:41:57.606] [txpool] stat                            block=16919668 pending=1803 baseFee=10455 queued=11949 alloc=3.1GB sys=6.0GB
[WARN] [03-27|20:42:08.361] Served                                   conn=127.0.0.1:51176 method=engine_getPayloadBodiesByRangeV1 reqid=1 t=18.265µs err="invalid argument 0: json: cannot unmarshal string into Go value of type uint64"
[WARN] [03-27|20:42:08.520] Served                                   conn=127.0.0.1:51176 method=engine_getPayloadBodiesByRangeV1 reqid=1 t=23.654µs err="invalid argument 0: json: cannot unmarshal string into Go value of type uint64"
[INFO] [03-27|20:42:57.276] [p2p] GoodPeers                          eth66=3 eth67=7 eth68=9
[INFO] [03-27|20:42:57.606] [txpool] stat                            block=16919668 pending=2415 baseFee=10508 queued=12087 alloc=2.9GB sys=6.0GB
[INFO] [03-27|20:43:57.607] [txpool] stat                            block=16919668 pending=3132 baseFee=10515 queued=12255 alloc=3.1GB sys=6.0GB
[INFO] [03-27|20:44:57.276] [p2p] GoodPeers                          eth67=7 eth66=3 eth68=9
[INFO] [03-27|20:45:03.977] [txpool] stat                            block=16919668 pending=3618 baseFee=10532 queued=12396 alloc=2.9GB sys=6.0GB
[INFO] [03-27|20:45:57.605] [txpool] stat                            block=16919668 pending=4268 baseFee=10558 queued=12590 alloc=3.1GB sys=6.0GB
[INFO] [03-27|20:46:57.276] [p2p] GoodPeers                          eth68=9 eth67=6 eth66=3
[INFO] [03-27|20:46:57.607] [txpool] stat                            block=16919668 pending=4777 baseFee=10569 queued=12758 alloc=2.8GB sys=6.0GB
[INFO] [03-27|20:47:57.606] [txpool] stat                            block=16919668 pending=5291 baseFee=10584 queued=12949 alloc=3.1GB sys=6.0GB

Note: When Erigon gets stuck, systemctl restart doesn't seem able to completely kill it, and I have to kill -9.
Note 2: Trying the latest devel now.
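
The two engine_getPayloadBodiesByRangeV1 warnings above look telling: the Engine API spec encodes the start and count parameters as hex QUANTITY strings, which encoding/json cannot decode into a plain uint64. A standalone illustration of that mismatch (illustrative only, not Erigon's actual handler; it assumes go-ethereum's hexutil package):

package main

import (
	"encoding/json"
	"fmt"

	"github.com/ethereum/go-ethereum/common/hexutil"
)

func main() {
	params := []byte(`["0x1022c71", "0x20"]`) // start=16919665, count=32 per the spec

	var plain []uint64
	if err := json.Unmarshal(params, &plain); err != nil {
		// Reproduces: "json: cannot unmarshal string into Go value of type uint64"
		fmt.Println("plain uint64:", err)
	}

	var quantities []hexutil.Uint64
	if err := json.Unmarshal(params, &quantities); err != nil {
		fmt.Println("hexutil.Uint64:", err)
	} else {
		fmt.Printf("decoded start=%d count=%d\n", uint64(quantities[0]), uint64(quantities[1]))
	}
}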

@mwpastore

mwpastore commented Mar 27, 2023

We have a user report that this is not fixed with Erigon v2.41.0. My own test confirms that.

Can confirm. Erigon 2.41.0 and Lighthouse 4.0.1.

Now testing: erigon/v2.41.0-dev-540af96e/linux-amd64/go1.20.2

EDIT: That didn't take long. Crashed with "context canceled" after a few minutes. Testing with this patch now:

diff --git a/cmd/rpcdaemon/commands/engine_api.go b/cmd/rpcdaemon/commands/engine_api.go
index e581bcb54..3a3d294dd 100644
--- a/cmd/rpcdaemon/commands/engine_api.go
+++ b/cmd/rpcdaemon/commands/engine_api.go
@@ -409,7 +409,6 @@ var ourCapabilities = []string{
        "engine_getPayloadV2",
        "engine_exchangeTransitionConfigurationV1",
        "engine_getPayloadBodiesByHashV1",
-       "engine_getPayloadBodiesByRangeV1",
 }

 func (e *EngineImpl) ExchangeCapabilities(fromCl []string) []string {
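
If removing the advertisement helps, it is presumably because the list returned by ExchangeCapabilities is what the consensus client uses to decide which optional methods to call: with engine_getPayloadBodiesByRangeV1 no longer advertised, Lighthouse should stop issuing the calls that appear to trigger the hang, without the handler itself being touched.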

@jt9897253
Contributor

Same issue here, also with erigon 2.41.0 and lighthouse 4.0.1. The patch from @mwpastore appears to improve the situation.

@banteg
Contributor

banteg commented Mar 27, 2023

Observing the same issue, trying out the patch.

@emilianobonassi

same here

@cryptocifer

I encountered the same issue with Prysm as well. I also captured a pprof profile while it was hanging:

profile.tar.gz

@yorickdowne

Or a similar issue. Prysm doesn't use getPayloadBodiesByRangeV1, according to Nishant. The symptoms are similar: context canceled, and Erigon no longer responds on RPC/Engine.

Either this is the same issue as with Lighthouse and it has nothing to do with ByRange, or it's a separate issue.

@cleot

cleot commented Mar 28, 2023

Served                                   conn=127.0.0.1:53420 method=engine_getPayloadBodiesByRangeV1 reqid=1 t=10.001134724s err="context canceled"

Erigon v2.41.0
Lighthouse v4.0.1

@McSim85

McSim85 commented Mar 28, 2023

Seems related to #7198.

@mwpastore

Still up and running with the above patch after about 20 hours.

@yorickdowne

We're being asked to test #7199

@mwpastore

On it.

@ethDreamer

@mwpastore were you able to test this?

@mwpastore

@ethDreamer It's running now but it's only been about an hour.

@geomad

geomad commented Mar 28, 2023

Someone said he tested #7199 before it got merged and was still having issues. Testing it now myself as well.

@ethDreamer

thanks - keep us updated :)

@yorickdowne

Time to restart these tests on v2.42

@yorickdowne

2.42 stayed stable overnight for 12 hours on 9 servers. As far as I can tell, this is fixed.

calmbeing pushed a commit to calmbeing/bsc-erigon that referenced this issue Apr 24, 2023
gladcow pushed a commit to Ankr-network/erigon that referenced this issue May 4, 2023