Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bboltcachestorage: mitigate corrupt boltdb cache after panic #4981

Merged
merged 1 commit into from
May 31, 2024

Conversation

jsternberg
Copy link
Collaborator

@jsternberg jsternberg commented May 31, 2024

There are some reports that the nosync configuration of the boltdb can
cause panics on restarts due to corruption of the database. Mitigate by
panic recovery until there is a better solution.

// fallbackOpenDB performs the panic recovery, database backup, and
// opening of the new database file when a panic happens from safeOpenDB.
func fallbackOpenDB(dbPath string, db **bolt.DB, err *error) {
r := recover()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only call recover() inside inline defer. It shouldn't be possible to reach it by calling a regular function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the recover portion to an inline defer. The fallbackOpenDB is now just printing the log message, performing the move, and opening the new database.

}

// initStore initializes the store with the required buckets.
func initStore(db *bolt.DB) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this moved? Doesn't seem to have an effect on recover/defer logic and called only once

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thought it would make things easier to read while I was modifying things. I can revert it if you prefer but I thought it would be easier to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a long function (yet) and only has one caller so I don't think there is a need to slice it. There is also no scope difference as "new" and "init" mean the same thing, just with a different word.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this back. It's not the primary purpose of this PR and the initialization isn't really that complicated anyway.

// safeOpenDB opens a bolt database and recovers from panic that
// can be caused by a corrupted database file.
func safeOpenDB(dbPath string) (db *bolt.DB, err error) {
defer fallbackOpenDB(dbPath, &db, &err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we are already adding a fallback then it should cover all the possible cases where boltdb finds a corrupted library that it can't open. Fallback is cheap and can't happen for regular reasons. Don't want to miss out something and wait for another release cycle.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified this to also apply to any error that occurs when os.Stat otherwise returns without error. So if stat fails for some reason (generally the file does not exist or is not readable), then we'll perform the fallback.

return openDB(dbPath)
}

// fallbackOpenDB performs the panic recovery, database backup, and
Copy link
Member

@tonistiigi tonistiigi May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"performs database backup and opening of the new database file. Called after the first db.Open has failed on reading the file."

There are some reports that the nosync configuration of the boltdb can
cause panics on restarts due to corruption of the database. Mitigate by
panic recovery until there is a better solution.

Co-authored-by: Tonis Tiigi <tonistiigi@gmail.com>
Signed-off-by: Jonathan A. Sternberg <jonathan.sternberg@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants