Stale cleanup race condition #252

yaauie · 2022-12-21T23:49:06Z

Alternative to #251 that catches an additional race-condition

Fixes several closely-related race conditions that could cause plugin crashes or data loss:
- a race condition in initializing a prefix could cause one or more local temp files to be abandoned and only recovered after the next pipeline start (fix: replace `Concurrent::Map#fetch_or_store` with the atomic `Concurrent::Map#compute_if_absent`)
- a race condition in the stale watcher could cause the plugin to crash when working with a stale (empty) file that had been deleted (fix: mark the `FileRepository::PrefixedValue` as deleted while we have exclusive access to it, and avoid using a marked-deleted `FileRepository::PrefixedValue` throughout)
- a race condition in the stale watcher could cause a non-empty file to be deleted if bytes were written to it after it was detected as stale (fix: check-and-delete within a single exclusive access, which requires that `FileRepository::PrefixedValue#with_lock` use a reentrant `Monitor`)

This will need to be forward-ported to the integrated version of this plugin -> logstash-integration-aws

Refactor of `S3::FileRepository` to avoid several closely-related race conditions:
- prevent `get_factory()` from yielding a factory that was mid-deletion by the stale watcher, which could cause the plugin to crash because the file no longer existed on disk. This is solved by marking a factory's prefix wrapper as deleted while the stale watcher has exclusive access to it, and checking for deletion status before yielding exclusive access to a prefix wrapper's factory.
- eliminate `get_factory()`'s non-atomic `Concurrent::Map#fetch_or_store`, which could cause multiple factories to be created for a single prefix; only one of them would be retained, and bytes written to the other(s) would be lost.
- introduce `each_factory`, which _avoids_ creating new factories or yielding deleted ones.
- refactor `each_files` to use the new `each_factory` to avoid yielding files whose factories have been deleted.

Additionally, `S3#rotate_if_needed` was migrated to use the now-safer `S3::FileRepository#each_factory`, which _avoids_ initializing new factories (and therefore avoids creating empty files on disk after the existing ones had been stale-reaped).

mashhurs · 2022-12-22T05:03:09Z

lib/logstash/outputs/s3/file_repository.rb

+            # for stale detection, marking it as deleted before releasing the lock
+            # and causing it to become deleted from the map.
+            prefixed_factory.with_lock do |_|
+              if prefixed_factory.stale?


We are also trying to lock the prefixed_factory inside `stale?`; isn't that a deadlock?

We already have the (reentrant) lock, so incrementing our holds on it briefly won't ever have lock contention.
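
To illustrate the reentrancy point, here is a minimal sketch using only Ruby's standard library. `PrefixedValueSketch` is a hypothetical stand-in, not the plugin's `FileRepository::PrefixedValue`; it assumes, as the description says, that `with_lock` is backed by a `Monitor` and that `stale?` takes the same lock.

```ruby
require 'monitor'

# Hypothetical stand-in for the prefix wrapper; only locking is modelled.
class PrefixedValueSketch
  def initialize(stale)
    @lock = Monitor.new # reentrant: the owning thread may re-enter it
    @stale = stale
  end

  def with_lock
    @lock.synchronize { yield self }
  end

  # Takes the lock itself, so it can be called with or without holding it.
  def stale?
    with_lock { |_| @stale }
  end
end

value = PrefixedValueSketch.new(true)

# Nested acquisition from the same thread: no deadlock with a Monitor.
nested_result = value.with_lock { |v| v.stale? }

# For contrast, a plain Mutex is not reentrant and raises ThreadError
# when the owning thread tries to lock it again.
mutex = Mutex.new
mutex_error = begin
  mutex.synchronize { mutex.synchronize { :unreachable } }
rescue ThreadError => e
  e.class
end
```

With a `Mutex` in place of the `Monitor`, the nested `stale?` call would fail in the same way as the contrast case, which is presumably why the description calls out the reentrant `Monitor` as a requirement.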

mashhurs · 2022-12-22T05:07:01Z

lib/logstash/outputs/s3/file_repository.rb

+            prefixed_factory.with_lock do |_|
+              if prefixed_factory.stale?
+                prefixed_factory.delete! # mark deleted to prevent reuse
+                nil # cause deletion


What is the purpose of the return value here?

It is a part of the Concurrent::Map#compute_if_present contract; the result of the block will be stored, or if nil will cause the value to be deleted.
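
The contract can be sketched without the concurrent-ruby gem. `TinyMapSketch` below is a hypothetical, deliberately simplified stand-in that imitates only the block contract described above (store the block's result, or remove the entry when the result is `nil`); it is not how `Concurrent::Map` is implemented.

```ruby
# Hypothetical stand-in that imitates only the compute_if_present contract.
class TinyMapSketch
  def initialize
    @lock = Mutex.new
    @store = {}
  end

  def put(key, value)
    @lock.synchronize { @store[key] = value }
  end

  def key?(key)
    @lock.synchronize { @store.key?(key) }
  end

  # Runs the block only when the key is present. A nil result removes
  # the entry; any other result replaces the stored value.
  def compute_if_present(key)
    @lock.synchronize do
      return nil unless @store.key?(key)

      result = yield(@store[key])
      if result.nil?
        @store.delete(key)
      else
        @store[key] = result
      end
      result
    end
  end
end

map = TinyMapSketch.new
map.put('logs/', :stale)
map.put('metrics/', :fresh)

# Sweep: returning nil drops the stale entry, returning the value keeps it.
sweep = ->(key) { map.compute_if_present(key) { |v| v == :stale ? nil : v } }
sweep.call('logs/')
sweep.call('metrics/')

stale_kept = map.key?('logs/')
fresh_kept = map.key?('metrics/')
```

In the diff above, the `nil` on the stale branch plays the same role: it is the signal that removes the prefix from the map while the sweeper still holds exclusive access to it.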

mashhurs · 2022-12-22T05:09:03Z

lib/logstash/outputs/s3/file_repository.rb

+          prefix_val&.with_lock do |factory|
+            # intentional local-jump to ensure deletion detection
+            # is done inside the exclusive access.
+            return yield(factory) unless prefix_val.deleted?


I was thinking of using a global lock for the worst case, but I think this is a better approach and it eliminates the scenario you mentioned. Thanks!
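
A minimal sketch of the per-prefix pattern discussed here, using only the standard library. `PrefixWrapperSketch` and `with_live_factory` are hypothetical names chosen for illustration; the sketch assumes the deletion flag is only read and written while the wrapper's reentrant lock is held.

```ruby
require 'monitor'

# Hypothetical stand-in for a prefix wrapper with a deletion marker.
class PrefixWrapperSketch
  def initialize(factory)
    @lock = Monitor.new
    @factory = factory
    @deleted = false
  end

  def with_lock
    @lock.synchronize { yield @factory }
  end

  def delete!
    with_lock { |_| @deleted = true }
  end

  def deleted?
    with_lock { |_| @deleted }
  end
end

# Yields the factory only if the wrapper exists and has not been marked
# deleted; the check happens inside the exclusive section.
def with_live_factory(prefix_val)
  prefix_val&.with_lock do |factory|
    # Local jump: returns from the method while still holding the lock.
    return yield(factory) unless prefix_val.deleted?
  end
  nil # wrapper missing or already reaped
end

live = PrefixWrapperSketch.new(:live_factory)
reaped = PrefixWrapperSketch.new(:reaped_factory)
reaped.delete!

live_result = with_live_factory(live) { |f| f }
reaped_result = with_live_factory(reaped) { |f| f }
missing_result = with_live_factory(nil) { |f| f }
```

Because each wrapper carries its own lock, writers to different prefixes do not contend with each other, which a single global lock would not allow.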

mashhurs

LGTM!

This is largely a forward-port of logstash-plugins/logstash-output-s3#252, with some minor changes to deal with the integrated plugin's usage of the java-native `ConcurrentHashMap` in place of the stand-alone plugin's ruby-native `Concurrent::Map`.

Refactor of `S3::FileRepository` to avoid several closely-related race conditions:
- prevent `get_factory()` from yielding a factory that was mid-deletion by the stale watcher, which could cause the plugin to crash because the file no longer existed on disk. This is solved by marking a factory's prefix wrapper as deleted while the stale watcher has exclusive access to it, and checking for deletion status before yielding exclusive access to a prefix wrapper's factory.
- introduce `each_factory`, which _avoids_ creating new factories or yielding deleted ones.
- refactor `each_files` to use the new `each_factory` to avoid yielding files whose factories have been deleted.
- void-return methods now explicitly emit `nil` to prevent accidental leaks of synchronization-required resources.

Additionally, `S3#rotate_if_needed` was migrated to use the now-safer `S3::FileRepository#each_factory`, which _avoids_ initializing new factories (and therefore avoids creating empty files on disk after the existing ones had been stale-reaped).

yaauie added 2 commits December 21, 2022 03:10

race: stop stale sweeper before iterating over factories

3dc04f3

mashhurs reviewed Dec 22, 2022

mashhurs approved these changes Dec 22, 2022

mashhurs mentioned this pull request Dec 22, 2022

Fixes the no such file or directory issue. #251

Closed

yaauie merged commit a185659 into logstash-plugins:main Dec 22, 2022

yaauie mentioned this pull request Dec 23, 2022

s3 output: resolve stale-detection races logstash-plugins/logstash-integration-aws#19

Merged
