
erigon eats 100gb+ of memory when tracing a certain tx #4637

Closed
banteg opened this issue Jul 5, 2022 · 24 comments

Comments

@banteg
Contributor

banteg commented Jul 5, 2022

System information

Erigon version: erigon version 2022.07.1-alpha-09776394

OS & Version: Linux

Commit hash: 0977639

Expected behaviour

an rpc call returns a trace

Actual behaviour

erigon gobbles up 100 GB+ of memory and gets killed by the system

Steps to reproduce the behaviour

run debug_traceTransaction against any of these txs:

  • 0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94
  • 0xd770356649f1e60e7342713d483bd8946f967e544db639bd056dfccc8d534d8e
  • 0x9ef7a35012286fef17da12624aa124ebc785d9e7621e1fd538550d1209eb9f7d

Backtrace

not available, erigon gets killed by the system

@mandrigin
Contributor

@banteg is there anything more recent that shows this or similar behaviour? I have a pruned node, so I can't check that far back in history. Or do similar transactions that aren't half a year old trace just fine?

@MysticRyuujin
Contributor

This is also affecting the latest stable/beta/deprecated release.

{"jsonrpc":"2.0","id":1,"method":"debug_traceTransaction","params":["0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94"]}

@AlexeyAkhunov
Contributor

This is because of this PR: #2779

@mandrigin
Contributor

ah, okay, then we probably need to think about adding some kind of pagination/limitation for these traces, or some binary response

@mandrigin
Contributor

I also wonder if another JSON serialization lib could work there; the stock JSON marshalling isn't the most frugal code

@banteg
Contributor Author

banteg commented Jul 5, 2022

@banteg is there anything more recent that shows this or similar behaviour? I have a pruned node, so I can't check that far back in history. Or do similar transactions that aren't half a year old trace just fine?

no, my dataset consisted of 11,000 transactions and only these three had this behavior

@darkhorse-spb

I'm also having this issue with stable release, tx 0x42b8205ed4c9d9de39340999c05327543f422b4ca881ae5910d56b3ad62d19c6

@mandrigin
Contributor

Okay, what we can try is changing the JSON serialization library in an experimental branch; then @banteg @darkhorse-spb, could you test it on your machines and see if that helps at all?

@AskAlexSharov
Collaborator

@mandrigin debug_traceTransaction is already using the jsoniter.Stream serialization lib, and it should stream (in the no-batch and no-websocket cases); it probably doesn't because I disabled streaming in ./rpc/handler.go handleMsg to fix the broken JSON format produced when an error occurs.

It's impossible to stream JSON and still return an error if the error happens in the middle of streaming, because JSON is not a streaming-friendly format.
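A minimal illustration of that constraint, in Go (hypothetical names and response shape, not Erigon's actual code): once part of the result envelope has been flushed to the client, a later error has no valid place to go.

```go
// Hypothetical sketch, not Erigon code: the "result" prefix is already on the
// wire when the error occurs, so the response can no longer become a proper
// JSON-RPC error object; the client just gets truncated/invalid JSON.
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

func streamTrace(w io.Writer) error {
	fmt.Fprint(w, `{"jsonrpc":"2.0","id":1,"result":[`) // already flushed to the client
	for i := 0; i < 3; i++ {
		if i > 0 {
			fmt.Fprint(w, ",")
		}
		fmt.Fprintf(w, `{"step":%d}`, i)
		if i == 2 {
			return errors.New("trace failed mid-stream") // nowhere valid to report this in the JSON
		}
	}
	fmt.Fprint(w, `]}`)
	return nil
}

func main() {
	if err := streamTrace(os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "\nerror:", err)
	}
}
```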

@mandrigin
Contributor

I also have a weird idea of using ETL to first dump everything to binary files, check for errors, and then stream the results.

@mandrigin
Contributor

but the question is also, what eats all this RAM?

@banteg can I ask you to run Erigon with the built-in rpc daemon and with --pprof, and then, when it begins eating RAM (maybe at 60 or 80 GB), run curl http://127.0.0.1:6060/debug/pprof/heap > heap.out and attach that file here? Then I can look at the profile too.
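(For reference, a heap profile captured this way can be inspected locally with the standard Go tooling, e.g. go tool pprof heap.out, or go tool pprof -http=:8081 heap.out for the web UI.)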

@AskAlexSharov
Collaborator

We decided to re-enable the streaming feature by default:

#4647 Erigon has enabled JSON streaming for some heavy endpoints (like trace_*). It's a tradeoff: it greatly reduces the amount of RAM (in some cases from 30 GB to 30 MB), but it produces invalid JSON if an error happens in the middle of streaming (because JSON is not a streaming-friendly format).

We decided that the value of this streaming outweighs handling the rare "error happens in the middle" corner case. But we added a flag, --rpc.streaming.disable, for users who prefer to pay for correctness or compatibility.

@mandrigin
Contributor

@banteg @darkhorse-spb can you check in the current devel version and see if it helped?

@tjayrush
Contributor

tjayrush commented Jul 6, 2022

but it produces invalid JSON if an error happens in the middle of streaming (because JSON is not a streaming-friendly format)

Is it Go code? We ran into the same issue with TrueBlocks. We stream our data too.

We were able to get around it using a defer call that closes any open JSON objects or arrays. It's not perfect -- it doesn't work that well with nested objects, but it works for simple arrays and simple objects, for example. If any sub-routine returns an error, the defer simply closes the array.

If the program crashes, and a subroutine never returns, it doesn't work, but the program crashed, so something isn't working anyway.
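A minimal sketch of the defer approach described above (hypothetical names, not TrueBlocks' actual code): the deferred footer runs even when the routine returns an error, so the streamed array is always terminated.

```go
// A sketch of the defer-close idea (hypothetical names, not TrueBlocks' code):
// the deferred footer always closes the array and the object, even when the
// routine bails out with an error partway through.
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

func streamItems(w io.Writer) (err error) {
	fmt.Fprint(w, `{"data":[`)
	defer func() {
		fmt.Fprintln(w, `]}`) // runs on the error path too, so the JSON is terminated
	}()

	for i := 0; i < 5; i++ {
		if i > 0 {
			fmt.Fprint(w, ",")
		}
		fmt.Fprintf(w, `{"n":%d}`, i)
		if i == 2 {
			return errors.New("something went wrong mid-stream")
		}
	}
	return nil
}

func main() {
	// Prints {"data":[{"n":0},{"n":1},{"n":2}]} and reports the error out-of-band.
	if err := streamItems(os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```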

@AskAlexSharov
Collaborator

Then the user will not see the error message at all.

@tjayrush
Contributor

tjayrush commented Jul 6, 2022

We attach the error as another field in the object in the defer method. Not perfectly compliant JSON, but it works. (Perfectly compliant JSON, if it returns an error, should return empty data -- but that's not possible since you've already streamed the data.)
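A sketch of that variant, under the same assumptions as above (hypothetical names): the deferred footer appends the error as an extra field before closing, so the output still parses even though data has already been streamed.

```go
// Same shape as the previous sketch, but the deferred footer also appends the
// error as an extra field, so the streamed output still parses as JSON.
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

func streamWithErrorField(w io.Writer) (err error) {
	fmt.Fprint(w, `{"data":[`)
	defer func() {
		fmt.Fprint(w, `]`)
		if err != nil {
			fmt.Fprintf(w, `,"error":%q`, err.Error())
		}
		fmt.Fprintln(w, `}`)
	}()

	fmt.Fprint(w, `{"n":0}`)
	return errors.New("failed after streaming began")
}

func main() {
	// Prints {"data":[{"n":0}],"error":"failed after streaming began"}
	_ = streamWithErrorField(os.Stdout)
}
```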

@AskAlexSharov
Collaborator

@tjayrush it may even work in many client libs. Do you have an open-source example?

@tjayrush
Contributor

tjayrush commented Jul 6, 2022

I'm almost embarrassed to show it. It's super hacky, but here's an example: https://github.com/TrueBlocks/trueblocks-core/blob/feature/new-unchained-index-2.0/src/apps/chifra/internal/chunks/handle_addresses.go#L66. The RenderFooter routine, which closes an array and an object (everything our API delivers has the same shape), gets called even if an error happens. We deliver the error on standard error many levels above this code, so it just closes the JSON object and returns the error (or nil if there is no error).

@AskAlexSharov
Collaborator

Thanks, will try tomorrow.

@mandrigin
Contributor

@AskAlexSharov do you want to keep this one around?

@AskAlexSharov
Collaborator

It's fixed - streaming is enabled. But we need to add this approach as well: #4637 (comment)

@mandrigin
Contributor

Okay, @nanevardanyan will take a look at the error handling then.

@banteg
Contributor Author

banteg commented Jul 18, 2022

Seems fixed on erigon's side, but clients would need to consider streaming too. One of the traces I reported yields a 66.5 GB response. Here is a small script which shows both the compressed and uncompressed size of the response.

https://gist.github.com/banteg/98dbccbf6e2a3f997199a1b16eb93c5a
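For reference, here is a rough Go sketch of the same idea (the RPC URL and the exact behaviour of the linked script are assumptions): stream the debug_traceTransaction response and report its raw and gzip-compressed size without buffering it all in memory.

```go
// Rough sketch (RPC URL is an assumption; the linked gist may differ): stream
// the debug_traceTransaction response and report its uncompressed and
// gzip-compressed size without holding the whole body in memory.
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// countWriter counts the bytes written to it and discards them.
type countWriter struct{ n int64 }

func (c *countWriter) Write(p []byte) (int, error) { c.n += int64(len(p)); return len(p), nil }

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: tracesize <tx-hash>")
		os.Exit(1)
	}
	body := fmt.Sprintf(`{"jsonrpc":"2.0","id":1,"method":"debug_traceTransaction","params":[%q]}`, os.Args[1])

	resp, err := http.Post("http://127.0.0.1:8545", "application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	raw := &countWriter{}
	compressed := &countWriter{}
	gz := gzip.NewWriter(compressed)

	// Tee the response through both counters as it streams in.
	if _, err := io.Copy(io.MultiWriter(raw, gz), resp.Body); err != nil {
		panic(err)
	}
	gz.Close()

	fmt.Printf("uncompressed: %d bytes, gzip-compressed: %d bytes\n", raw.n, compressed.n)
}
```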

@banteg
Contributor Author

banteg commented Jul 18, 2022

Reran with my dataset. You can clearly see the outliers I found earlier:

(attached plots: trace-size, elapsed-size, gas-size)

here are response sizes:

0x9ef7a35012286fef17da12624aa124ebc785d9e7621e1fd538550d1209eb9f7d = 41.4 GB (2.2 GB compressed)
0xd770356649f1e60e7342713d483bd8946f967e544db639bd056dfccc8d534d8e = 43.9 GB (2.4 GB compressed)
0x2428a69601105c365b9fe9d2f30688b91710b6a43bc6d2026344674ae7ffcac3 = 50.4 GB (2.9 GB compressed)
0xb9e6b6f275212824215e8f50818f12b37b7ca4c2e0b943785357c35b23743b94 = 71.5 GB (3.5 GB compressed)

all other traces are under 4 GB.
