Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale cleanup race condition #252

Merged
merged 2 commits into from
Dec 22, 2022

Conversation

yaauie
Copy link
Contributor

@yaauie yaauie commented Dec 21, 2022

Alternative to #251 that catches an additional race-condition

  • Fixes several closely-related race conditions that could cause plugin crashes or data-loss
    • race condition in initializing a prefix could cause one or more local temp files to be abandoned and only recovered after next pipeline start (replace Concurrent::Map#fetch_or_store with atomic Concurrent::Map#computeIfAbsent)
    • race condition in stale watcher could cause the plugin to crash when working with a stale (empty) file that had been deleted (mark FileRepository::PrefixedValue as deleted while we have exclusive access to it, and avoid using a marked-deleted FileRepository::PrefixedValue throughout)
    • race condition in stale watcher could cause a non-empty file to be deleted if bytes were written to it after it was detected as stale (check-and-delete within a single exclusive access, which requires that FileRepository::PrefixedValue#with_lock use reentrant Monitor)

This will need to be forward-ported to the integrated version of this plugin -> logstash-integration-aws

Refactor of `S3::FileRepository` to avoid several closely-related race
conditions:

 - prevent `get_factory()` from yielding a factory that was
   mid-deletion by the stale watcher, which could cause the plugin
   crash due to the file no longer existing on disk. This is solved
   by marking a factory's prefix wrapper as deleted while the stale
   watcher has exclusive access to it, and checking for deletion
   status before yielding exclusive access to a prefix wrapper's factory.
 - eliminates `get_factory()`'s non-atomic `Concurrent::Map#fetch_or_store`,
   which could cause multiple factories to be created for a single
   prefix, only one of which would be retained and bytes written to
   the other(s) would be lost.
 - introduce `each_factory`, which _avoids_ creating new factories
   or yielding deleted ones.
 - refactor `each_files` to use new `each_factory` to avoid yielding
   files whose factories have been deleted.

Additionally, `S3#rotate_if_needed` was migrated to use the now-safer
`S3::FileRepository#each_factory` that _avoids_ initializing new
factories (and therefore avoids creating empty files on disk after
the existing ones had been stale-reaped).
# for stale detection, marking it as deleted before releasing the lock
# and causing it to become deleted from the map.
prefixed_factory.with_lock do |_|
if prefixed_factory.stale?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are trying to lock the prefixed_factory in stale? as well, isn't deadlock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the (reentrant) lock, so incrementing our holds on it briefly won't ever have lock contention.

prefixed_factory.with_lock do |_|
if prefixed_factory.stale?
prefixed_factory.delete! # mark deleted to prevent reuse
nil # cause deletion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I know what is the purpose of returning value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a part of the Concurrent::Map#compute_if_present contract; the result of the block will be stored, or if nil will cause the value to be deleted.

prefix_val&.with_lock do |factory|
# intentional local-jump to ensure deletion detection
# is done inside the exclusive access.
return yield(factory) unless prefix_val.deleted?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to use global lock for the works case but I think this is better approach and eliminates the scenario you mentioned. Thanks!

Copy link
Contributor

@mashhurs mashhurs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@mashhurs mashhurs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yaauie yaauie merged commit a185659 into logstash-plugins:main Dec 22, 2022
yaauie added a commit to yaauie/logstash-integration-aws that referenced this pull request Dec 23, 2022
This is largely a forward-port of logstash-plugins/logstash-output-s3#252 with
some minor changes to deal with the integrated plugin's usage of the
java-native `ConcurrentHashMap` in place of the stand-alone plugin's
ruby-native `Concurrent::Map`.

Refactor of `S3::FileRepository` to avoid several closely-related race
conditions:

 - prevent `get_factory()` from yielding a factory that was
   mid-deletion by the stale watcher, which could cause the plugin
   crash due to the file no longer existing on disk. This is solved
   by marking a factory's prefix wrapper as deleted while the stale
   watcher has exclusive access to it, and checking for deletion
   status before yielding exclusive access to a prefix wrapper's factory.
 - introduce `each_factory`, which _avoids_ creating new factories
   or yielding deleted ones.
 - refactor `each_files` to use new `each_factory` to avoid yielding
   files whose factories have been deleted.
 - void-return methods now explicitly emit `nil` to prevent accidental
   leaks of synchronization-required resources.

Additionally, `S3#rotate_if_needed` was migrated to use the now-safer
`S3::FileRepository#each_factory` that _avoids_ initializing new
factories (and therefore avoids creating empty files on disk after
the existing ones had been stale-reaped).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants