There was an issue #3317 about lagging truncate_lsn. It was mostly fixed, but it can still be reproduced with safekeeper restarts:
It probably can still be reproduced by restarting safekeepers before peer_horizon_lsn (truncate_lsn) is flushed to disk; we should try to flush state to disk more often. To do that, we can trigger flushing when a timeline becomes inactive (compute disconnects), on graceful shutdown, etc.
Before the fixes there were many timelines with MAX(backup_lsn) - MIN(disk_peer_horizon_lsn) around 16MB, and that triggered S3 downloads. We can try to get MIN(flush_lsn) - MIN(disk_peer_horizon_lsn) close to zero; that should not be hard to do with additional flushes.
As an alternative, we can just set the control file save interval to 10 seconds. It shouldn't affect disk performance, and safekeepers would then always have a reasonably fresh (~10 seconds old) control file on disk.
Originally posted by @petuhovskiy in #3317 (comment)
We should flush control_file to disk more aggressively: when a timeline becomes inactive (compute disconnects), on graceful shutdown, and/or on a short periodic save interval; see the sketch below.
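As a rough illustration of those triggers, here is a minimal sketch using plain tokio. The names (ControlFile, persist_if_dirty, the channels) are made up for the example and are not the actual safekeeper types:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, watch, Mutex};
use tokio::time::{interval, Duration};

/// Hypothetical in-memory copy of the persistent safekeeper state.
struct ControlFile {
    peer_horizon_lsn: u64,
    /// Set whenever the in-memory state advances past what is on disk.
    dirty: bool,
}

impl ControlFile {
    /// Write the state to disk if needed (serialization and fsync omitted here).
    fn persist_if_dirty(&mut self) -> std::io::Result<()> {
        if self.dirty {
            println!("persisting control file, peer_horizon_lsn={}", self.peer_horizon_lsn);
            self.dirty = false;
        }
        Ok(())
    }
}

/// Flush the control file on a periodic timer, when a timeline is deactivated
/// (compute disconnects), and once more on graceful shutdown.
async fn control_file_flusher(
    state: Arc<Mutex<ControlFile>>,
    mut deactivated: mpsc::Receiver<()>,
    mut shutdown: watch::Receiver<bool>,
) {
    let mut ticker = interval(Duration::from_secs(10)); // the ~10 s save interval alternative
    loop {
        tokio::select! {
            _ = ticker.tick() => {}              // periodic flush
            Some(()) = deactivated.recv() => {}  // timeline became inactive
            _ = shutdown.changed() => {
                let _ = state.lock().await.persist_if_dirty(); // final flush
                return;
            }
        }
        if let Err(e) = state.lock().await.persist_if_dirty() {
            eprintln!("control file flush failed: {e}");
        }
    }
}

#[tokio::main]
async fn main() {
    let state = Arc::new(Mutex::new(ControlFile { peer_horizon_lsn: 0, dirty: true }));
    let (deact_tx, deact_rx) = mpsc::channel(8);
    let (shutdown_tx, shutdown_rx) = watch::channel(false);
    let flusher = tokio::spawn(control_file_flusher(state, deact_rx, shutdown_rx));

    // Simulate a compute disconnect, then a graceful shutdown.
    deact_tx.send(()).await.unwrap();
    shutdown_tx.send(true).unwrap();
    flusher.await.unwrap();
}
```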
Metrics that should be improved after the fix (values as of the moment of issue creation):
- MAX(backup_lsn) - MIN(disk_peer_horizon_lsn) – max=15223760, avg=3511177 (67 timelines in total where backup_lsn > disk_peer_horizon_lsn)
- MIN(flush_lsn) - MIN(disk_peer_horizon_lsn) – max=16771592, avg=64808
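For reference, these gaps are plain LSN arithmetic over per-timeline values. The snippet below shows one plausible reading (backup_lsn maximum and disk_peer_horizon_lsn / flush_lsn minima taken per timeline across its safekeepers), using a made-up TimelineLsns struct rather than the real metrics pipeline:

```rust
/// Hypothetical per-timeline LSN positions aggregated across safekeepers.
struct TimelineLsns {
    /// Highest LSN backed up to S3 (max across safekeepers).
    max_backup_lsn: u64,
    /// Lowest peer_horizon_lsn actually written to a control file on disk (min across safekeepers).
    min_disk_peer_horizon_lsn: u64,
    /// Lowest WAL flush position (min across safekeepers).
    min_flush_lsn: u64,
}

/// max and avg of `gap(t)` over all timelines where the gap is positive.
fn gap_stats(timelines: &[TimelineLsns], gap: impl Fn(&TimelineLsns) -> u64) -> (u64, u64) {
    let gaps: Vec<u64> = timelines.iter().map(gap).filter(|&g| g > 0).collect();
    if gaps.is_empty() {
        return (0, 0);
    }
    let max = *gaps.iter().max().unwrap();
    let avg = gaps.iter().sum::<u64>() / gaps.len() as u64;
    (max, avg)
}

fn main() {
    let timelines = vec![
        TimelineLsns { max_backup_lsn: 0x2000_0000, min_disk_peer_horizon_lsn: 0x1F00_0000, min_flush_lsn: 0x2000_0100 },
        TimelineLsns { max_backup_lsn: 0x1000_0000, min_disk_peer_horizon_lsn: 0x1000_0000, min_flush_lsn: 0x1000_0000 },
    ];
    // MAX(backup_lsn) - MIN(disk_peer_horizon_lsn): WAL that a restarted
    // safekeeper might have to fetch back from S3.
    let backup_gap = gap_stats(&timelines, |t| t.max_backup_lsn.saturating_sub(t.min_disk_peer_horizon_lsn));
    // MIN(flush_lsn) - MIN(disk_peer_horizon_lsn): how stale the on-disk
    // control file is relative to WAL already flushed everywhere.
    let flush_gap = gap_stats(&timelines, |t| t.min_flush_lsn.saturating_sub(t.min_disk_peer_horizon_lsn));
    println!("backup gap: max={} avg={}", backup_gap.0, backup_gap.1);
    println!("flush gap:  max={} avg={}", flush_gap.0, flush_gap.1);
}
```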