Stale cleanup race condition #252
Conversation
Refactor of `S3::FileRepository` to avoid several closely-related race conditions:

- prevent `get_factory()` from yielding a factory that was mid-deletion by the stale watcher, which could crash the plugin because the file no longer existed on disk. This is solved by marking a factory's prefix wrapper as deleted while the stale watcher has exclusive access to it, and checking the deletion status before yielding exclusive access to a prefix wrapper's factory.
- eliminate `get_factory()`'s non-atomic `Concurrent::Map#fetch_or_store`, which could cause multiple factories to be created for a single prefix; only one of them would be retained, and bytes written to the other(s) would be lost.
- introduce `each_factory`, which _avoids_ creating new factories or yielding deleted ones.
- refactor `each_files` to use the new `each_factory`, to avoid yielding files whose factories have been deleted.

Additionally, `S3#rotate_if_needed` was migrated to the now-safer `S3::FileRepository#each_factory`, which _avoids_ initializing new factories (and therefore avoids creating empty files on disk after the existing ones have been stale-reaped).
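The mark-deleted-under-lock pattern described above can be sketched as follows. This is a hypothetical minimal model, not the plugin's actual `FileRepository` code: a prefix wrapper owns a reentrant lock and a deleted flag; the stale watcher sets the flag while it holds the lock, and readers re-check the flag inside the same exclusive section before using the factory.

```ruby
require 'monitor'

# Hypothetical model of the prefix wrapper: a factory guarded by a
# reentrant Monitor, plus a flag the stale watcher sets on deletion.
class PrefixedValue
  def initialize(factory)
    @factory = factory
    @lock = Monitor.new
    @deleted = false
  end

  def with_lock
    @lock.synchronize { yield @factory }
  end

  def delete!
    @deleted = true
  end

  def deleted?
    @deleted
  end
end

wrapper = PrefixedValue.new(:factory)

# Stale watcher: mark the wrapper deleted while holding exclusive access.
wrapper.with_lock { |_| wrapper.delete! }

# Reader: the deletion check happens inside the exclusive section, so a
# wrapper reaped by the watcher is never yielded to the caller.
yielded = nil
wrapper.with_lock { |factory| yielded = factory unless wrapper.deleted? }
```

Because both the watcher and the reader take the same lock, there is no window in which a reader observes a wrapper that is "deleted but still usable".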
```ruby
# for stale detection, marking it as deleted before releasing the lock
# and causing it to become deleted from the map.
prefixed_factory.with_lock do |_|
  if prefixed_factory.stale?
```
We also try to lock the `prefixed_factory` in `stale?`; isn't that a deadlock?
We already have the (reentrant) lock, so incrementing our holds on it briefly won't ever have lock contention.
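The reentrancy point can be demonstrated in isolation. This is an illustrative sketch (the locks and nesting here stand in for `with_lock` and `stale?`, they are not the plugin's code): Ruby's `Monitor` lets the thread that already holds it acquire it again, while a plain `Mutex` rejects the same nesting.

```ruby
require 'monitor'

# Nested acquisition on a Monitor: the outer hold models with_lock,
# the inner hold models stale? taking the same lock. No deadlock,
# because Monitor is reentrant for the owning thread.
monitor_depth = 0
lock = Monitor.new
lock.synchronize do
  monitor_depth += 1
  lock.synchronize do
    monitor_depth += 1
  end
end

# A non-reentrant Mutex raises ThreadError on the same nesting.
mutex_reentry_failed = begin
  m = Mutex.new
  m.synchronize { m.synchronize {} }
  false
rescue ThreadError
  true
end
```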
```ruby
prefixed_factory.with_lock do |_|
  if prefixed_factory.stale?
    prefixed_factory.delete! # mark deleted to prevent reuse
    nil # cause deletion
```
What is the purpose of the return value here?
It is part of the `Concurrent::Map#compute_if_present` contract: the result of the block will be stored, or, if `nil`, will cause the value to be deleted.
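That contract can be illustrated with a toy stand-in. The helper below is illustrative only, not the real concurrent-ruby implementation: the block runs only when the key is present, a non-nil result replaces the stored value, and a `nil` result removes the key, which is exactly how the stale watcher's `nil` return drops the reaped prefix from the map.

```ruby
# Toy stand-in for the Concurrent::Map#compute_if_present contract
# (hypothetical helper; single-threaded, so no locking shown here).
def compute_if_present(map, key)
  return nil unless map.key?(key)
  result = yield map[key]
  result.nil? ? map.delete(key) : (map[key] = result)
  result
end

map = { "date=2024-01-01/" => :factory }

# Returning nil (as the stale-watcher block does) deletes the entry.
compute_if_present(map, "date=2024-01-01/") { |_| nil }

# The block is never invoked for an absent key.
missing = compute_if_present(map, "no-such-prefix") { |_| :unused }
```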
```ruby
prefix_val&.with_lock do |factory|
  # intentional local-jump to ensure deletion detection
  # is done inside the exclusive access.
  return yield(factory) unless prefix_val.deleted?
```
I was thinking of using a global lock for the worst case, but I think this is a better approach and it eliminates the scenario you mentioned. Thanks!
LGTM!
LGTM!
This is largely a forward-port of logstash-plugins/logstash-output-s3#252, with some minor changes to deal with the integrated plugin's use of the java-native `ConcurrentHashMap` in place of the stand-alone plugin's ruby-native `Concurrent::Map`.

Refactor of `S3::FileRepository` to avoid several closely-related race conditions:

- prevent `get_factory()` from yielding a factory that was mid-deletion by the stale watcher, which could crash the plugin because the file no longer existed on disk. This is solved by marking a factory's prefix wrapper as deleted while the stale watcher has exclusive access to it, and checking the deletion status before yielding exclusive access to a prefix wrapper's factory.
- introduce `each_factory`, which _avoids_ creating new factories or yielding deleted ones.
- refactor `each_files` to use the new `each_factory`, to avoid yielding files whose factories have been deleted.
- void-return methods now explicitly emit `nil` to prevent accidental leaks of synchronization-required resources.

Additionally, `S3#rotate_if_needed` was migrated to the now-safer `S3::FileRepository#each_factory`, which _avoids_ initializing new factories (and therefore avoids creating empty files on disk after the existing ones have been stale-reaped).
Alternative to #251 that catches an additional race-condition:

- replaces non-atomic `Concurrent::Map#fetch_or_store` with atomic `Concurrent::Map#computeIfAbsent`
- marks a `FileRepository::PrefixedValue` as deleted while we have exclusive access to it, and avoids using a marked-deleted `FileRepository::PrefixedValue`
- uses `FileRepository::PrefixedValue#with_lock` throughout
- makes the lock a reentrant `Monitor`

This will need to be forward-ported to the integrated version of this plugin -> logstash-integration-aws
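The `fetch_or_store` hazard behind the first bullet can be sketched with a toy reimplementation. This is an illustration of the race, not the real `Concurrent::Map`: because the value is built *before* the store, two racing threads can each create a factory for the same prefix, yet only one survives in the map, so bytes written through the losing factory would be lost.

```ruby
require 'monitor'

# Toy, deliberately racy fetch_or_store: the expensive value (here, a
# stand-in for a factory that opens a file on disk) is created outside
# the critical section, so both threads pay the side effect even though
# only one factory wins the ||= store.
map  = {}
lock = Monitor.new
created = []

fetch_or_store = lambda do |key|
  factory = Object.new   # side effect happens even for the losing thread
  created << factory
  lock.synchronize { map[key] ||= factory }
end

2.times.map { Thread.new { fetch_or_store.call("prefix-a") } }.each(&:join)
```

An atomic `compute_if_absent` (or Java's `computeIfAbsent`) closes this window by running the factory-building block while holding the map's internal lock for that key, so at most one factory is ever created per prefix.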