
Fix retry indefinitely in termination process #129

Merged
merged 7 commits into from
Mar 4, 2022

Conversation

kaisecheng
Contributor

@kaisecheng kaisecheng commented Feb 22, 2022

The plugin blocks the shutdown process when the URL points to an invalid port: a retryable error that is retried indefinitely.

This commit adds a check of the pipeline shutdown state in order to quit the retry loop.
This feature requires a logstash-core that exposes shutdown_requested: elastic/logstash#13811

Fixed: #123

Contributor

@yaauie yaauie left a comment


👍🏼 to enabling graceful shutdowns in plugins.

If we are already requiring changes in LS core to make this work, I would rather we implement all of the complex bits of this functionality over in LS core and merely hook into it from this plugin in as lightweight a manner as possible. That way other plugins that need this functionality don't also have to implement the reach into the execution context.

If we were to implement LogStash::Outputs::Base#pipeline_shutdown_requested? (note the leading pipeline_ and trailing ?) in LS core, we could hook into it here, with support for older Logstash versions, by defining:

  def pipeline_shutdown_requested?
    return super if defined?(super) # since LS 8.1.0
    nil # falsy, unknown
  end

We should also discuss whether 2 attempts during a shutdown is sufficient for this particular plugin, and whether that should be configurable.
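A minimal sketch of how an output's retry loop could consult such a hook (the class, the flag, and the attempt limit below are hypothetical stand-ins for illustration, not the plugin's actual implementation):

```ruby
# Hypothetical sketch of a retry loop that bails out once a pipeline
# shutdown is requested. Nothing here is the plugin's real code.
class SketchOutput
  def initialize(shutdown_flag)
    @shutdown_flag = shutdown_flag # callable standing in for the LS-core state
    @attempts = 0
  end

  # Stand-in for LogStash::Outputs::Base#pipeline_shutdown_requested?;
  # would return nil (unknown) when the hook is unavailable.
  def pipeline_shutdown_requested?
    @shutdown_flag.call
  end

  # Retries indefinitely under normal operation, but stops after
  # `max_attempts_during_shutdown` once a shutdown has been requested.
  def send_with_retries(max_attempts_during_shutdown: 2)
    loop do
      @attempts += 1
      return :sent if try_send
      return :aborted if pipeline_shutdown_requested? &&
                         @attempts >= max_attempts_during_shutdown
    end
  end

  def try_send
    false # simulate a persistently refused connection (e.g. wrong port)
  end
end
```

With the flag returning true, send_with_retries gives up after the attempt limit instead of looping forever.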

lib/logstash/outputs/http.rb: review comments (outdated, resolved)
@kaisecheng
Contributor Author

@yaauie thanks for the review. Moving the status check to the output base class makes a lot of sense, especially as it involves a lot of nil checking.

We should also discuss whether 2 attempts during a shutdown is sufficient for this particular plugin, and whether that should be configurable.

I tried to find a balance between a reasonable shutdown time and the number of retries. The first try is counted as 0, so "2" actually means a total of three attempts, which takes around 15 seconds in the connection-refused case. I took the Kubernetes terminationGracePeriodSeconds as a reference: it kills the pod after 30 seconds (the default value) if the shutdown process has not finished by then. Changing the attempts to "3" would come pretty close to 30 seconds.
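For reference, the back-of-envelope arithmetic behind such an estimate can be sketched as follows; the per-attempt cost and the backoff intervals are illustrative assumptions, not measured values from the plugin:

```ruby
# Illustrative shutdown-delay estimate: each attempt burns some time on the
# refused connection, and a backoff sleep separates consecutive attempts.
# All numbers are assumptions chosen for illustration only.
def shutdown_delay(attempts, attempt_cost, backoffs)
  attempts * attempt_cost + backoffs.first(attempts - 1).sum
end

shutdown_delay(3, 3.0, [2.0, 4.0]) # => 15.0 seconds, near the figure above
shutdown_delay(2, 3.0, [2.0, 4.0]) # => 8.0 seconds
```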

Regarding making it configurable: users who have enabled the persistent queue have less of a concern, as the events will be retried on the next start, so it is memory-queue users who may want such a config. It would limit retries of retryable errors in the shutdown scenario. One concern is that automatic_retries could be confused with shutdown_retries; we would need better docs to explain the difference and this particular scenario. Overall it is a nice-to-have feature for memory-queue users.

@yaauie
Contributor

yaauie commented Feb 28, 2022

@kaisecheng with the current implementation, it is going to be very difficult to ensure that we bail before some arbitrary cutoff, especially when using the default non-batched behaviour and the cleverly randomized exponential backoff. A batch with the default 125 events, for example, can take 125*timeout to burn through its first attempt of each event, and an output that is already in a retry loop before the shutdown is requested can block on Queue#pop for up to 60s before we even get an event with which to run this new logic.

I think that this PR prevents us from blocking indefinitely, which is a good starting point and worth merging. I don't see a trivial way to change the existing implementation into one that is easily interruptible for shutdowns (e.g., finish or fail within X seconds of a shutdown being requested), without re-implementing a threadsafe queue whose elements retain a quarantine timestamp and whose pop method had new behaviour that is less blocking. But alas, that is much larger in scope and would certainly add risk.
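A less-blocking pop of the kind described above could be approximated by polling with a short non-blocking pop, so a shutdown request is observed within the poll interval. This is a rough sketch, not the quarantine-timestamp queue yaauie describes, and the method and flag names are illustrative:

```ruby
# Sketch: instead of blocking on Queue#pop for up to 60s, poll with a
# non-blocking pop so a shutdown flag is noticed within `poll_interval`.
def pop_or_shutdown(queue, shutdown_requested, poll_interval: 0.05)
  loop do
    return :shutdown if shutdown_requested.call
    begin
      return queue.pop(true) # non-blocking; raises ThreadError when empty
    rescue ThreadError
      sleep poll_interval
    end
  end
end
```

The trade-off is busy-waiting at `poll_interval` granularity in exchange for bounded shutdown latency.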

Co-authored-by: Ry Biesemeyer <yaauie@users.noreply.github.com>
@kaisecheng
Contributor Author

Indeed, the duration is affected by many factors: the type of failure, timeout settings, the number of events... and this PR is not aiming to guarantee a shutdown within X seconds. Mostly I am thinking of the connection-refused case, when the config points to a wrong URL: how many times it should retry, how many times it has retried before a complete shutdown, and how long the shutdown takes before users consider it a "bug" for taking too long. I think two attempts is too short, as it could finish in about 2 seconds, so three attempts seem to be a reasonable compromise.

…tp into fix_stall_termination

# Conflicts:
#	CHANGELOG.md
@kaisecheng kaisecheng requested a review from yaauie March 3, 2022 14:53
@kaisecheng kaisecheng merged commit 6dca44e into logstash-plugins:main Mar 4, 2022
Successfully merging this pull request may close these issues.

http output plugin stalls Logstash pipeline termination