Save control_file to disk on compute disconnect #3836

Open
1 of 3 tasks
petuhovskiy opened this issue Mar 16, 2023 · 1 comment
Labels
c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments


petuhovskiy commented Mar 16, 2023

Issue #3317 was about a lagging truncate_lsn. It was mostly fixed, but it can still be reproduced with safekeeper restarts:

It can probably still be reproduced by restarting safekeepers before peer_horizon_lsn (truncate_lsn) is flushed to disk, so we should try to flush state to disk more often. To do that, we can trigger a flush when a timeline becomes inactive (compute disconnects), on graceful shutdown, etc.
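A minimal sketch of the idea, in Python for brevity rather than the safekeeper's own code. The `Timeline` class, its method names, and the JSON file format are all hypothetical; only `peer_horizon_lsn` and the flush triggers (compute disconnect, graceful shutdown) come from the comment above.

```python
import json
import os
import tempfile


class Timeline:
    """Hypothetical stand-in for a safekeeper timeline, not the real type."""

    def __init__(self, control_file_path):
        self.control_file_path = control_file_path
        self.peer_horizon_lsn = 0        # in-memory value, advances during normal work
        self.disk_peer_horizon_lsn = 0   # value last persisted to the control file
        self.active_computes = 0

    def save_control_file(self):
        """Persist state atomically: write a temp file, fsync it, then rename."""
        if self.peer_horizon_lsn == self.disk_peer_horizon_lsn:
            return False  # nothing new to persist, skip the disk write
        directory = os.path.dirname(self.control_file_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"peer_horizon_lsn": self.peer_horizon_lsn}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.control_file_path)
        self.disk_peer_horizon_lsn = self.peer_horizon_lsn
        return True

    def on_compute_connect(self):
        self.active_computes += 1

    def on_compute_disconnect(self):
        self.active_computes -= 1
        if self.active_computes == 0:
            # The timeline just became inactive: flush before a restart can lose state.
            self.save_control_file()

    def on_graceful_shutdown(self):
        self.save_control_file()
```

With hooks like these, a restart right after the last compute disconnects would find an up-to-date `peer_horizon_lsn` on disk instead of a stale one.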

Before the fixes, there were many timelines with MAX(backup_lsn) - MIN(disk_peer_horizon_lsn) around 16MB, and that triggered an S3 download. We can try to get MIN(flush_lsn) - MIN(disk_peer_horizon_lsn) close to zero, which should not be hard to do with additional flushes.

Originally posted by @petuhovskiy in #3317 (comment)

We should flush control_file to disk more aggressively:

Metrics that should improve after the fix (values as of issue creation):

  • MAX(backup_lsn) - MIN(disk_peer_horizon_lsn): max=15223760, avg=3511177 (67 timelines in total where backup_lsn > disk_peer_horizon_lsn)
  • MIN(flush_lsn) - MIN(disk_peer_horizon_lsn): max=16771592, avg=64808
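
For reference, the two lag metrics can be computed per timeline roughly as follows. This sketch assumes that MAX and MIN are taken across the safekeepers serving one timeline; the `SafekeeperState` type and the sample LSN values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SafekeeperState:
    """Hypothetical per-safekeeper view of one timeline (LSNs as byte offsets)."""
    backup_lsn: int
    flush_lsn: int
    disk_peer_horizon_lsn: int


def timeline_lags(safekeepers):
    """Return (backup_lag, flush_lag) in bytes for one timeline."""
    min_disk_horizon = min(s.disk_peer_horizon_lsn for s in safekeepers)
    # MAX(backup_lsn) - MIN(disk_peer_horizon_lsn)
    backup_lag = max(s.backup_lsn for s in safekeepers) - min_disk_horizon
    # MIN(flush_lsn) - MIN(disk_peer_horizon_lsn)
    flush_lag = min(s.flush_lsn for s in safekeepers) - min_disk_horizon
    return backup_lag, flush_lag


# Made-up values: the second safekeeper has a stale horizon on disk.
states = [
    SafekeeperState(backup_lsn=1000, flush_lsn=1200, disk_peer_horizon_lsn=900),
    SafekeeperState(backup_lsn=1100, flush_lsn=1150, disk_peer_horizon_lsn=400),
    SafekeeperState(backup_lsn=1050, flush_lsn=1200, disk_peer_horizon_lsn=950),
]
backup_lag, flush_lag = timeline_lags(states)
```

A single safekeeper with a stale on-disk horizon is enough to inflate both values, which is why more frequent control file flushes should pull them toward zero.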
petuhovskiy commented

As an alternative, we can just set the control file save interval to 10 seconds. It shouldn't affect disk performance, and safekeepers will always have a fresh (~10 seconds old) control file on disk.
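
A rough sketch of the interval-based alternative, again in Python for brevity. `PeriodicSaver` and `CONTROL_FILE_SAVE_INTERVAL_SECS` are hypothetical names, and `save_fn` stands for whatever routine actually persists the control file; only the 10-second value comes from the comment.

```python
import threading

# Interval proposed in the comment; the constant name is an assumption.
CONTROL_FILE_SAVE_INTERVAL_SECS = 10.0


class PeriodicSaver:
    """Calls save_fn every `interval` seconds on a background thread."""

    def __init__(self, save_fn, interval=CONTROL_FILE_SAVE_INTERVAL_SECS):
        self._save_fn = save_fn
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._save_fn()

    def stop(self):
        self._stop.set()
        self._thread.join()
        # One final save so a graceful shutdown never leaves stale state behind.
        self._save_fn()
```

Compared with event-driven flushes, this bounds staleness by the interval regardless of which code path advanced the state, at the cost of one small write per timeline per interval if `save_fn` does not skip unchanged state.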
