
Slot rollback takes longer than syncing the db #397

Closed
PapaSwer opened this issue Nov 5, 2020 · 14 comments

PapaSwer commented Nov 5, 2020

After starting cardano-db-sync:

[db-sync-node:Info:31] [2020-11-05 05:02:14.10 UTC] Starting chainSyncClient
[db-sync-node:Info:31] [2020-11-05 05:02:14.15 UTC] Cardano.Db tip is at slot 7066517, block 4618318
[db-sync-node:Info:36] [2020-11-05 05:02:14.15 UTC] Running DB thread
[db-sync-node:Info:36] [2020-11-05 05:02:17.60 UTC] Rolling back to slot 6640000, hash dc3a7980681d38a4bbcbc1bec5f4e6f74d755848ee07889857ef83e78ca6765b
[db-sync-node:Info:36] [2020-11-05 05:02:18.03 UTC] Deleting slots numbered: [7066517..6640025]
[db-sync-node:Info:39] [2020-11-05 05:02:34.74 UTC] getHistoryInterpreter: acquired
-------------------FREEZE-------------------------------

Even after 24 hours, no progress has been made.

erikd commented Nov 5, 2020

If you look at top or similar, you will see that both postgres and db-sync are working really hard.

Whenever the number of slots to roll back is large (like you have there) it will be slow due to #256 .

The solution (while #256 remains unfixed) is to drop and recreate the database. That will take significantly less than 24 hours.
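
For reference, a minimal sketch of that workaround. The database name (cexplorer), service name, and postgres privileges here are assumptions; adjust them to your own deployment.

```bash
# Stop db-sync first so nothing holds connections to the database
# (the systemd service name is an assumption; stop your docker-compose
# service instead if that is how you run it).
sudo systemctl stop cardano-db-sync

# Drop and recreate an empty database; on the next start db-sync applies its
# schema migrations again and then resyncs from the chain.
psql -U postgres -c 'DROP DATABASE IF EXISTS cexplorer;'
psql -U postgres -c 'CREATE DATABASE cexplorer;'

# Start db-sync again and let it sync from scratch.
sudo systemctl start cardano-db-sync
```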

rcmorano commented Nov 5, 2020

The problem for me is that it never catches up to the tip, because it falls into one of these rollbacks no matter how many times I've tried a sync from scratch :(

erikd commented Nov 5, 2020

The only way this rollback could happen while syncing is if db-sync or the node it is connected to is being stopped and restarted (and even then it is unlikely).

If there is a problem, I need to be able to recreate it. Unfortunately, I have synced from scratch at least 50 times and I have not seen anything like this even once.

rcmorano commented Nov 5, 2020

> The only way this rollback could happen while syncing is if db-sync or the node it is connected to is being stopped and restarted (and even then it is unlikely).
>
> If there is a problem, I need to be able to recreate it. Unfortunately, I have synced from scratch at least 50 times and I have not seen anything like this even once.

Maybe also because of some postgres crash/hang-up? I've been trying to sync on very low-resource instances, and also on this "weird" setup where I share the socket through TCP with socat (as I told you about on some other issue). So yeah, in my case postgres/cardano-node probably hung for too long on some query and restarted... A myriad of bad things could have happened, but somehow version 5.0.2 managed to recover from those issues, while on 5.0.3+ I haven't been able to reach the tip :(

I've upgraded my instances to catch up to the tip and will then downgrade. Crossing fingers :D
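
For context, a rough sketch of the socket-over-TCP setup described above; the host name, port, and socket paths are illustrative assumptions, not the actual configuration used here.

```bash
# On the host running cardano-node: expose the node's unix socket on TCP port 3002.
socat TCP-LISTEN:3002,reuseaddr,fork UNIX-CONNECT:/opt/cardano/node.socket

# On the host running db-sync: recreate a local unix socket that tunnels to the
# node host, and point db-sync's socket path at it.
socat UNIX-LISTEN:/opt/cardano/db-node.socket,fork TCP:node-host:3002
```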

erikd commented Nov 6, 2020

> Maybe also because of some postgres crash/hang-up?

Yes, that is possible.

> I've been trying to sync on very low-resource instances and also on this "weird" setup where I share the socket through TCP with socat

That could easily be the problem. Also, if you do this, you MUST say so in the ticket.

rcmorano commented Nov 6, 2020

> That could easily be the problem. Also, if you do this, you MUST say so in the ticket.

I did on a different one (#304), but IMHO it's not very relevant to this one, since it's the cause of the rollback happening, not of the rollback being slow.

erikd commented Nov 9, 2020

I deal with a large number of tickets and it is easy to forget minor details. If there is a special circumstance with a ticket, that circumstance MUST be noted in every ticket where it is relevant.

Secondly, if db-sync is running in a resource-constrained environment, problems that only occur in that environment are highly likely to be caused by it, and any further problems are likely due to the first one.

Finally, for any problem to be fixed, I must be provided with sufficient information to recreate the problem. In most cases, if I cannot reproduce the problem, I cannot fix the problem.

xdzurman commented Nov 10, 2020

I can confirm that this happens whenever I need to restart db-sync 6.0.0 or start it from a copied postgres db. Notice that it wants to roll back to slot 0.

[db-sync-node:Info:72] [2020-11-10 10:05:59.84 UTC] Starting chainSyncClient
[db-sync-node:Info:72] [2020-11-10 10:05:59.86 UTC] Cardano.Db tip is at slot 13436216, block 4932157
[db-sync-node:Info:76] [2020-11-10 10:05:59.86 UTC] Running DB thread
[db-sync-node:Info:76] [2020-11-10 10:05:59.92 UTC] Rolling back to slot 0, hash f0f7892b5c333cffc4b3c4344de48af4cc63f55e44936196f365a9ef2244134f
[db-sync-node:Info:79] [2020-11-10 10:05:59.94 UTC] getHistoryInterpreter: acquired
[db-sync-node:Info:76] [2020-11-10 10:08:02.65 UTC] Deleting slots numbered: [13436216..1]

Db-sync 5.0.3's output looks like the following:
[db-sync-node:Info:79] [2020-11-10 10:02:10.00 UTC] Starting chainSyncClient
[db-sync-node:Info:79] [2020-11-10 10:02:10.49 UTC] Cardano.Db tip is at slot 13432357, block 4931976
[db-sync-node:Info:84] [2020-11-10 10:02:10.49 UTC] Running DB thread
[db-sync-node:Info:84] [2020-11-10 10:02:10.52 UTC] Rolling back to slot 13432357, hash 8ed30798ffe1fb981f7b4876b136d727c7cf71b136894101457232ad82fc8237
[db-sync-node:Info:84] [2020-11-10 10:02:10.52 UTC] Deleting blocks numbered: []
[db-sync-node:Info:87] [2020-11-10 10:02:10.54 UTC] getHistoryInterpreter: acquired
[db-sync-node:Info:84] [2020-11-10 10:02:17.90 UTC] insertShelleyBlock: epoch 228, slot 13436133, block 4932153, hash ee5692ae029858c895d71d254a8b35078c46437eb44d4c77dc3844d2fdc105a1
[db-sync-node:Info:84] [2020-11-10 10:02:21.84 UTC] insertShelleyBlock: epoch 228, slot 13436136, block 4932154, hash 46ad2af4eb675e3a9f6054d0998020c9827234ddb7b7460c7469502ad834473e
...
You can see that it does not roll back to slot 0 and starts inserting blocks right away.
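
(For anyone comparing these logs against the database itself, the reported tip can be cross-checked with a query like the one below; the database name cexplorer is an assumption, and the block table is the standard db-sync schema.)

```bash
# Cross-check the "Cardano.Db tip is at slot X, block Y" log line against the
# database contents. Adjust the connection options to your setup.
psql -U postgres -d cexplorer \
  -c 'SELECT max(slot_no) AS tip_slot, max(block_no) AS tip_block FROM block;'
```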

What's the problem, @erikd?

erikd commented Nov 10, 2020

> whenever I need to restart db-sync 6.0.0 or start it from a copied postgres db

I totally admit that it will happen in the latter case.

For the former case, I have personally restarted it (with the existing database and compatible state dir) hundreds of times without a rollback.

The important distinction is that there is a directory of state information (as specified by --state-dir). If that data is missing or from an incompatible db-sync version, it will try to roll back the database.
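
For reference, a sketch of an invocation with an explicit state directory; the paths and config file name are illustrative assumptions, so adjust them to your deployment.

```bash
# If --state-dir points at a missing or version-incompatible directory,
# db-sync falls back to rolling the database back as described above.
cardano-db-sync \
  --config /etc/cardano/db-sync-config.yaml \
  --socket-path /run/cardano/node.socket \
  --schema-dir /usr/local/share/cardano-db-sync/schema \
  --state-dir /var/lib/cardano-db-sync/ledger-state
```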

xdzurman commented Nov 10, 2020

I'm starting it with docker-compose similarly to https://github.com/input-output-hk/cardano-rest/blob/master/docker-compose.yml#L40; how do you specify the state dir?

Edit: found the answer at 6187081

erikd commented Nov 11, 2020

I have a PR which adds better logging. It's not a solution, but it should be more informative.

sjlee1125 commented Nov 12, 2020

I had the same issue after upgrading to 6.0.0 (I used the docker-compose.yaml in the cardano-rest repository).

The only difference between the cardano-db-sync and cardano-rest compose files is 6187081#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3

After adding that and recreating postgres, the problem disappeared.

erikd commented Nov 12, 2020

@sjlee1125 if something is missing from the cardano-rest docker specification, you should raise a ticket on that repo to make sure it gets fixed.

Mercurial commented

Mine took 10 hours to roll back, then I just stopped and resynced because that is way faster.

erikd pushed a commit that referenced this issue Jan 5, 2021
erikd closed this as completed in 1227676 on Jan 5, 2021