
Opening database takes a very long time #3065

Closed
bowenwang1996 opened this issue Jul 31, 2020 · 10 comments · Fixed by #3123
Labels: C-bug Category: This is a bug · P-critical Priority: critical

Comments

@bowenwang1996 (Collaborator)

After the hard fork on testnet yesterday, we observed that restarting a node can take more than half an hour just to open the database. To reproduce, initialize and start a testnet node with

neard --home=<home> init --chain-id="testnet" --genesis=genesis.json
neard --home=<home> run

where genesis.json can be found at https://s3-us-west-1.amazonaws.com/build.nearprotocol.com/nearcore-deploy/testnet/genesis.json.

After the node starts, you should see that it is "Waiting for peers". Now shut down the node, restart it with neard --home=<home> run, and observe that it gets stuck.

bowenwang1996 added the C-bug (Category: This is a bug) and P-critical (Priority: critical) labels on Jul 31, 2020
frol (Collaborator) commented Aug 3, 2020

My $0.02:

  • strace reports tons of pread64 calls against the same .sst file; that file is roughly 263 MiB (276170959 bytes), while the rest of the .sst files are usually quite small (on the order of kilobytes).

  • pread64 reads 3566667 bytes from the same offset 272281385 over and over again (the reads succeed: strace reports 3566667 bytes read each time, and the file size (276170959) is larger than 272281385 + 3566667 = 275848052).

  • after 30-35 minutes it manages to reach "INFO stats: Server listening ...":

    Aug 03 12:16:35.054  INFO near: Opening store database at "/home/frol/.near/testnet/data"    
    Aug 03 12:51:37.397  INFO stats: Server listening at ed25519:612SsfzbL8UWShMRpu3j2cmENbAkXLHL4NiM2d4SPL6S@0.0.0.0:24567
    

frol (Collaborator) commented Aug 3, 2020

There is a similar report in Ceph: https://tracker.ceph.com/issues/21092

It seems to be related to the RocksDB cache size.
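For context, here is a minimal sketch of how a block cache is typically configured with the rust-rocksdb crate. The 512 MiB capacity and the overall setup are illustrative assumptions, not nearcore's actual settings, and API details (for example, whether Cache::new_lru_cache returns a Result) differ between crate versions:

    use rocksdb::{BlockBasedOptions, Cache, Options, DB};

    fn open_with_block_cache(path: &str) -> Result<DB, rocksdb::Error> {
        // Shared LRU block cache; 512 MiB is an arbitrary example capacity.
        let cache = Cache::new_lru_cache(512 * 1024 * 1024);

        let mut block_opts = BlockBasedOptions::default();
        block_opts.set_block_cache(&cache);

        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.set_block_based_table_factory(&block_opts);

        DB::open(&opts, path)
    }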

frol (Collaborator) commented Aug 3, 2020

@ailisp Was it you who tweaked RocksDB before? Maybe you have an idea of where we are missing the cache?

ailisp (Member) commented Aug 3, 2020

pread64 reads 3566667 bytes from the same offset 272281385 over and over again (the reads succeed: strace reports 3566667 bytes read each time, and the file size (276170959) is larger than 272281385 + 3566667 = 275848052).

Great finding! This doesn't sound like something that can be resolved by a smaller write-ahead-log (.log) size or a smaller data-file (.sst) size. Rather, it looks like the node exited before the sst file was properly updated and flushed.

after 30-35 minutes it manages to reach "INFO stats: Server listening ...":

RocksDB might have a 30-minute timeout on loading an sst file and give up to avoid an infinite loop like the one above. We might want to give up earlier to speed this up.
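If the hypothesis is that the process exits before memtables are flushed, one possible mitigation is an explicit, blocking flush on shutdown. This is only a sketch against the rust-rocksdb crate; whether nearcore's shutdown path actually lacks such a flush is an assumption here:

    use rocksdb::{FlushOptions, DB};

    fn shut_down(db: DB) -> Result<(), rocksdb::Error> {
        // Write memtables out to SST files and wait for the flush to finish,
        // so the next open does not have to replay a large write-ahead log.
        let mut flush_opts = FlushOptions::default();
        flush_opts.set_wait(true);
        db.flush_opt(&flush_opts)?;
        Ok(())
        // `db` is dropped here, which closes the database.
    }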

frol (Collaborator) commented Aug 3, 2020

@ailisp Can you take it over or help @mikhailOK with this?

P.S. How critical is having a corrupted .sst file? (I just want to learn, so I can better understand how to provide useful debug information)

ailisp (Member) commented Aug 3, 2020

P.S. How critical is having a corrupted .sst file? (I just want to learn, so I can better understand how to provide useful debug information)

I think in this case it's already corrupt; it shouldn't retry for 30 minutes. In general I'm not sure. From how RocksDB works, it is supposed to keep the write-ahead-log file until it has been fully flushed to an sst file. So if the last sst file isn't fully written, the log file should still exist and it should be okay to delete the last sst file and reconstruct it from the write-ahead-log file.
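For reference, RocksDB's WAL recovery behaviour is configurable at open time. A minimal sketch using the rust-rocksdb crate, illustrating the general mechanism rather than anything the node is known to do:

    use rocksdb::{DBRecoveryMode, Options, DB};

    fn open_tolerating_truncated_wal(path: &str) -> Result<DB, rocksdb::Error> {
        let mut opts = Options::default();
        // Drop a partially written record at the tail of the WAL instead of
        // failing the open; stricter modes such as AbsoluteConsistency would
        // refuse to open a database with a truncated log.
        opts.set_wal_recovery_mode(DBRecoveryMode::TolerateCorruptedTailRecords);
        DB::open(&opts, path)
    }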

@ailisp Can you take it over or help @mikhailOK with this?

I will help @mikhailOK.

frol (Collaborator) commented Aug 3, 2020

So if the last sst file isn't fully written, the log file should still exist and it should be okay to delete the last sst file and reconstruct it from the write-ahead-log file.

I tried removing it, but then it complained that there is no such .sst file 😄

ailisp (Member) commented Aug 3, 2020

I tried removing it, but then it complained that there is no such .sst file 😄

That's bad; RocksDB probably thought the last log had already been applied to the sst file and deleted the log :(

Or, I think more likely (given that the last sst file is much bigger than the old ones), RocksDB is appending into the latest sst file and compacting it into several smaller ones once it grows past some limit. If that is the case, we have to use RocksDB's own functions to fix the sst file by undoing the latest log.
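For the record, RocksDB does expose a repair facility, which the rust-rocksdb crate wraps as DB::repair. A minimal sketch; repairing in place is a last resort and can lose data that only lived in corrupted files:

    use rocksdb::{Options, DB};

    fn repair_store(path: &str) -> Result<(), rocksdb::Error> {
        // Best-effort reconstruction of the database from whatever SST and WAL
        // files are still readable.
        let opts = Options::default();
        DB::repair(&opts, path)
    }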

ailisp (Member) commented Aug 3, 2020

Is it possible the sst file got corrupted because we didn't do this?

[screenshot attached]

frol (Collaborator) commented Aug 3, 2020

Or, I think more likely (given that the last sst file is much bigger than the old ones)

In fact, it was not the last sst file; it is usually the first, or close to the first, based on the numbering in the file name.

mikhailOK added a commit that referenced this issue Aug 10, 2020
Use rocksdb merge operator for ColState. No longer need to atomically
read + write on update.

Fixes #3065

Test plan
---------
sanity test
manually check performance
mikhailOK added a commit that referenced this issue Aug 12, 2020
Use rocksdb merge operator for ColState. No longer need to atomically
read + write on update.

Fixes #3065

Test plan
---------
sanity test
manually check performance
manually check store version upgrade
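For readers unfamiliar with merge operators: instead of an atomic read-modify-write (get, combine, put), the writer issues a single merge record and RocksDB folds the operands together during reads and compaction. Below is a minimal sketch with the rust-rocksdb crate; the concatenation semantics are illustrative only (the actual ColState merge in #3123 has its own semantics), and the callback signature and operand iteration differ between crate versions:

    use rocksdb::{MergeOperands, Options, DB};

    // Example merge function: append every operand to the existing value.
    fn concat_merge(
        _key: &[u8],
        existing: Option<&[u8]>,
        operands: &MergeOperands,
    ) -> Option<Vec<u8>> {
        let mut result = existing.map(|v| v.to_vec()).unwrap_or_default();
        for op in operands.iter() {
            result.extend_from_slice(op);
        }
        Some(result)
    }

    fn open_with_merge(path: &str) -> Result<DB, rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        // Register the merge operator; writes then use db.merge(key, operand)
        // instead of get + put.
        opts.set_merge_operator_associative("concat", concat_merge);
        DB::open(&opts, path)
    }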