Fix a possible race where Thread.interrupted was not properly cleared #131

andrewvc · 2017-12-14T03:18:57Z

This should keep perf close enough. It's a little slower (roughly 9% more wall-clock time in the run below), but worth it for the consistency.

Old perf as measured with time:
54.57 real 216.58 user 3.51 sys
New perf as measured with time:
59.58 real 238.70 user 4.21 sys

Test config:

# MULTIPLE grok filters
input {
    generator {
        type => foo
        message => "random message, la la la"
        count => 1000000
    }
}

filter {
    grok {
        match => {
            "message" => "^1 foo 1 bar$"
        }
    }
    grok {
        match => {
            "message" => "^2 foo 2 bar$"
        }
    }
    grok {
        match => {
            "message" => "^3 foo 3 bar$"
        }
    }
    grok {
        match => {
            "message" => "^4 foo 4 bar$"
        }
    }
    grok {
        match => {
            "message" => "^5 foo 5 bar$"
        }
    }
    grok {
        match => {
            "message" => "^6 foo 6 bar$"
        }
    }
    grok {
        match => {
            "message" => "^7 foo 7 bar$"
        }
    }
    grok {
        match => {
            "message" => "^8 foo 8 bar$"
        }
    }
    grok {
        match => {
            "message" => "^9 foo 9 bar$"
        }
    }
    grok {
        match => {
            "message" => "^10 foo 10 bar$"
        }
    }
    grok {
        match => {
            "message" => "^11 foo 11 bar$"
        }
    }
    grok {
        match => {
            "message" => "^12 foo 12 bar$"
        }
    }
    grok {
        match => {
            "message" => "^13 foo 13 bar$"
        }
    }
    grok {
        match => {
            "message" => "^14 foo 14 bar$"
        }
    }
    grok {
        match => {
            "message" => "^15 foo 15 bar$"
        }
    }
    grok {
        match => {
            "message" => "^16 foo 16 bar$"
        }
    }
    grok {
        match => {
            "message" => "^17 foo 17 bar$"
        }
    }
    grok {
        match => {
            "message" => "^18 foo 18 bar$"
        }
    }
    grok {
        match => {
            "message" => "^19 foo 19 bar$"
        }
    }
    grok {
        match => {
            "message" => "^20 foo 20 bar$"
        }
    }

    metrics {
        meter => "events"
        add_tag => "metric"
    }
}

output {
    if "metric" in [tags] {
        stdout {
            codec => line {
                format => "rate_1m: %{[events][rate_1m]}, rate_5m: %{[events][rate_5m]}"
            }
        }
    }
}

colinsurprenant · 2017-12-14T16:49:31Z

lib/logstash/filters/grok/timeout_enforcer.rb

+    rescue java.lang.InterruptedException => e
+      # NOOP, we don't expect these, but maybe some interruptible thing could be
+      # added to grok besides regexps
+      @logger.debug("Unexpected interruptible caught during grok. This isn't a problem most likely")


Since the grok has likely been interrupted, shouldn't we either log at a higher level (warn or error) or bubble up the exception?

Hmmm, good point, yes, it should be a warn.

colinsurprenant · 2017-12-14T17:16:49Z

lib/logstash/filters/grok/timeout_enforcer.rb

@@ -72,31 +68,16 @@ def start_thread_groking(thread)
    @threads_to_start_time.put(thread, java.lang.System.nanoTime)


One confusing thing here and in the grok_till_timeout method: the current thread is passed in as the thread parameter, but it is also referenced implicitly through the static java.lang.Thread.xxx methods.
AFAIU both refer to the same thread, so for clarity's sake, shouldn't we settle on a single notation for the thread reference?

    java.lang.Thread.interrupted
    @threads_to_start_time.put(java.lang.Thread.currentThread(), java.lang.System.nanoTime)

or

    thread.interrupted
    @threads_to_start_time.put(thread, java.lang.System.nanoTime)

?

+1 on standardizing on the instance

OK, this is now done

colinsurprenant · 2017-12-14T17:23:22Z

Left some minor notes, the thread interruption handling logic seems good.

yaauie · 2017-12-14T17:53:33Z

lib/logstash/filters/grok/timeout_enforcer.rb

-      end
+      # If the regexp finished, but interrupt was called after, we'll want to
+      # clear the interrupted status anyway
+      @threads_to_start_time.remove(thread)


Even though this is class-internal, this change introduces a mixed abstraction into this method: we register the thread via start_thread_groking but unregister it by reaching directly into an ivar. If we can keep this method at a single level of abstraction and keep using stop_thread_groking (which this change also removes) to unregister, I believe it will provide greater long-term clarity.

I've opted to remove start_thread_groking and just inline its one call. stop_thread_groking would only be called from one spot, since there isn't a single way to stop: in one spot we must call compute, in the other remove.

WDYT of how it looks now?

eh, in an ideal world we'd have better encapsulation (e.g., both grok_till_timeout and cancel_time_out! reach into the @threads_to_start_time ivar directly to muck with its internals), but at least this method isn't mixing abstractions.

EDIT: what better encapsulation? this is literally in a class that only encapsulates the timeout enforcement logic. Clearly I need more coffee.
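
The race this hunk addresses can be sketched in plain Java. This is a minimal illustration, not the plugin's code: the class `GuardedCallSketch`, the method `runGuarded`, and the `startTimes` map are hypothetical stand-ins for grok_till_timeout and @threads_to_start_time. The point it tries to show is the ordering in the cleanup step: unregister the thread first, then clear the interrupted status, so that an interrupt delivered just as the work finishes does not leak into whatever the thread does next.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical names throughout; this only mirrors the shape of the change.
class GuardedCallSketch {
    // Stand-in for @threads_to_start_time: thread -> start time in nanoseconds.
    static final ConcurrentHashMap<Thread, Long> startTimes = new ConcurrentHashMap<>();

    static <T> T runGuarded(Supplier<T> work) {
        Thread current = Thread.currentThread();
        startTimes.put(current, System.nanoTime());
        try {
            return work.get();
        } finally {
            // Unregister first so a timeout sweep can no longer find this thread...
            startTimes.remove(current);
            // ...then clear any interrupt that arrived after the work finished.
            Thread.interrupted();
        }
    }

    public static void main(String[] args) {
        String result = runGuarded(() -> {
            // Simulate an interrupt that lands just as the work completes.
            Thread.currentThread().interrupt();
            return "done";
        });
        System.out.println(result + ", interrupted afterwards: "
                + Thread.currentThread().isInterrupted());
    }
}
```

The sketch assumes the interrupting side only ever interrupts threads it finds in the map, which is what makes remove-then-clear sufficient.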

yaauie · 2017-12-14T18:00:22Z

lib/logstash/filters/grok/timeout_enforcer.rb

+      # If the regexp finished, but interrupt was called after, we'll want to
+      # clear the interrupted status anyway
+      @threads_to_start_time.remove(thread)
+      java.lang.Thread.interrupted


should we be sending java.lang.Thread#isInterrupted() to our thread? It's odd that we both use a reference for the current thread and also rely on static methods that target the current thread.

isInterrupted() does not clear the interrupted state; interrupted() does. In other words, if the thread is in an interrupted state, calling interrupted() twice returns true, then false.

I agree we should use thread.interrupted() for clarity/consistency ... but practically this has no impact, since thread is the current thread too.

I'm +1 on using the local thread object for clarity
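
For reference, the difference between the two calls can be checked with a small Java snippet. The class and method names here are made up for illustration. Note that in Java `interrupted()` is a static method that always acts on the current thread, while `isInterrupted()` is an instance method that only reads the flag.

```java
import java.util.Arrays;

// Illustrative only: names are hypothetical, not from the plugin.
class InterruptFlagDemo {
    // Interrupts the current thread, then records what each call reports.
    static boolean[] observe() {
        Thread current = Thread.currentThread();
        current.interrupt();                     // set the interrupted status
        boolean peek = current.isInterrupted();  // reads the status, leaves it set
        boolean first = Thread.interrupted();    // reads the status and clears it
        boolean second = Thread.interrupted();   // nothing left to clear
        return new boolean[] { peek, first, second };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(observe())); // [true, true, false]
    }
}
```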

yaauie · 2017-12-14T18:16:24Z

lib/logstash/filters/grok/timeout_enforcer.rb

+      @threads_to_start_time.compute(thread) do |thread, start_time|
+        if start_time && start_time < now && now - start_time > @timeout_nanos
+          thread.interrupt
+          nil # Delete the key


A little unsure about the JRuby/Java boundary here: nil is not null, and while I can find documentation that the coercion happens from Java to Ruby, I'm not finding anything explicit about the other direction.

$ irb
irb(main):001:0> require "java"
=> false
irb(main):002:0> h = java.util.concurrent.ConcurrentHashMap.new
=> {}
irb(main):003:0> h.inspect
=> "{}"
irb(main):004:0> h.put("a", 1)
=> nil
irb(main):005:0> h.inspect
=> "{\"a\"=>1}"
irb(main):006:0> h.compute("a") {|v| nil}
=> nil
irb(main):007:0> h.inspect
=> "{}"

Thanks for confirming this behavior colin!
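
On the Java side, the behavior the irb session relies on is the documented contract of `ConcurrentHashMap.compute`: when the remapping function returns null, the mapping is removed. A minimal Java sketch of the same experiment follows; the class name is hypothetical, and it says nothing about how JRuby converts nil, which is what the session above confirms.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: mirrors the irb experiment in plain Java.
class ComputeRemovalDemo {
    static ConcurrentHashMap<String, Integer> run() {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.compute("a", (key, value) -> null);      // null result deletes "a"
        map.compute("b", (key, value) -> value + 1); // non-null result replaces the value
        return map;
    }

    public static void main(String[] args) {
        System.out.println(run()); // {b=3}
    }
}
```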

colinsurprenant · 2017-12-14T20:13:17Z

nit: I noticed that @cancel_mutex is useless now, we could remove it.

colinsurprenant · 2017-12-14T20:56:33Z

@andrewvc @jordansissel following up on our conversation about ConcurrentHashMap forEach behaviour in this situation:

    @threads_to_start_time.forEach do |thread, start_time|
      @threads_to_start_time.compute(thread) do |thread, start_time|
      ...
      end
    end

From what I can read in https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html#Weakly

Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators and Spliterators provide weakly consistent rather than fast-fail traversal:

they may proceed concurrently with other operations
they will never throw ConcurrentModificationException
they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.

So I am not sure about the behaviour of forEach, but I suspect it is also weakly consistent. Instead of @threads_to_start_time.compute we could therefore simply use @threads_to_start_time.computeIfPresent, which would account for the possibility of the thread being removed from the collection while we are in the forEach loop.

andrewvc · 2017-12-14T20:59:32Z

@colinsurprenant makes sense to move to computeIfPresent, I'll make that improvement.
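
A rough Java sketch of the sweep with computeIfPresent, under stated assumptions: `TimeoutSweepSketch`, `startTimes`, and `sweep` are hypothetical names, and `now` is passed in so the example is deterministic. With plain compute, the function would still be invoked (with a null value) for a key that was removed mid-traversal, and returning a non-null value would re-insert it; computeIfPresent skips such keys entirely.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a timeout sweep; not the plugin's implementation.
class TimeoutSweepSketch {
    final ConcurrentHashMap<Thread, Long> startTimes = new ConcurrentHashMap<>();
    final long timeoutNanos;

    TimeoutSweepSketch(long timeoutNanos) {
        this.timeoutNanos = timeoutNanos;
    }

    // Interrupts and unregisters every thread that has run longer than the timeout.
    // Returns how many threads were interrupted.
    int sweep(long now) {
        int[] interrupted = {0};
        startTimes.forEach((thread, ignored) ->
            // Only runs the function if the thread is still registered.
            startTimes.computeIfPresent(thread, (t, start) -> {
                if (now - start > timeoutNanos) {
                    t.interrupt();
                    interrupted[0]++;
                    return null; // delete the key
                }
                return start;    // keep the entry unchanged
            }));
        return interrupted[0];
    }

    public static void main(String[] args) {
        TimeoutSweepSketch sketch = new TimeoutSweepSketch(100L);
        sketch.startTimes.put(Thread.currentThread(), 0L);
        int count = sketch.sweep(1_000L);
        boolean wasInterrupted = Thread.interrupted(); // also clears the status
        System.out.println(count + " interrupted, flag seen: " + wasInterrupted);
    }
}
```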

andrewvc · 2017-12-15T15:12:43Z

I changed this in a few ways:

Inlined the start_thread_groking method for simplicity (it was called in one spot)
Removed the warning for non-grok interrupted errors; we'll just report those as regexp interrupted errors, since this is a boundary condition that occurs only when regexps take too long
Switched from compute to computeIfPresent
Removed an overly conservative call to thread.interrupted before we start grokking, which is unnecessary

colinsurprenant · 2017-12-15T15:20:46Z

@andrewvc nit: just noticed that @cancel_mutex is still defined but unused

logstash-filter-grok/lib/logstash/filters/grok/timeout_enforcer.rb

Line 12 in eb88860

@cancel_mutex = Mutex.new

andrewvc · 2017-12-15T15:27:40Z

@colinsurprenant just removed that extra line, good catch!

colinsurprenant · 2017-12-15T15:32:54Z

Another observation: this is not part of the change set, but it might be a good idea to modify. The @running bool is used across threads to control termination; I'd suggest we make it an AtomicBoolean to make it explicitly thread-safe.

logstash-filter-grok/lib/logstash/filters/grok/timeout_enforcer.rb

Line 31 in eb88860

@running = true

colinsurprenant · 2017-12-15T15:36:28Z

@andrewvc ^^ something like https://github.com/elastic/logstash/blob/a0aa92980e74ec9ea00a058fccf648d60c9482fa/logstash-core/lib/logstash/inputs/base.rb#L61

colinsurprenant · 2017-12-15T15:42:06Z

@andrewvc I leave it to you to decide for @running since I believe this will not have practical impact.
LGTM!
Really good job on this!!

andrewvc · 2017-12-15T21:58:05Z

@colinsurprenant moved @running to an atomic boolean, good catch. Apparently it didn't cause a problem before, but it definitely wasn't right.
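
As an illustration of the pattern being suggested, here is a small Java sketch of a worker loop controlled by an AtomicBoolean. The class `RunningFlagSketch` and its methods are hypothetical and only show the general idea: writes to the flag from one thread are guaranteed to be visible to the polling thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; the plugin's enforcer loop differs in its details.
class RunningFlagSketch {
    private final AtomicBoolean running = new AtomicBoolean(true);

    // Worker polls the flag and exits once it flips to false.
    private final Thread worker = new Thread(() -> {
        while (running.get()) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                return;
            }
        }
    });

    void start() {
        worker.start();
    }

    // Flips the flag and waits up to waitMillis; returns true if the worker exited.
    boolean stop(long waitMillis) {
        running.set(false);
        try {
            worker.join(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        RunningFlagSketch sketch = new RunningFlagSketch();
        sketch.start();
        System.out.println("stopped: " + sketch.stop(1_000L));
    }
}
```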

yaauie

LGTM

elasticsearch-bot · 2017-12-18T02:33:23Z

Andrew Cholakian merged this into the following branches!

Branch	Commits
master	`82ec779`

PhaedrusTheGreek · 2017-12-18T21:43:48Z

@andrewvc, is this fix available in LS 5.5.2?

It seems I only have 3.4.4, and if I understand correctly, this fix is in 4.0.1?

$  bin/logstash-plugin update logstash-filter-grok
Updating logstash-filter-grok
Updated logstash-filter-grok 3.4.2 to 3.4.4

PhaedrusTheGreek · 2017-12-18T21:52:41Z

Just confirmed I was able to upgrade in LS 5.6.3

$ bin/logstash-plugin update logstash-filter-grok
Updating logstash-filter-grok
Updated logstash-filter-grok 4.0.0 to 4.0.1

andrewvc added the bug label Dec 14, 2017

andrewvc force-pushed the stronger-thread-safety branch from ee752a6 to 43a01d0 Compare December 14, 2017 03:20

elasticsearch-bot self-assigned this Dec 14, 2017

andrewvc force-pushed the stronger-thread-safety branch from 43a01d0 to 3edfa23 Compare December 14, 2017 13:45

colinsurprenant reviewed Dec 14, 2017

yaauie reviewed Dec 14, 2017

jordansissel unassigned elasticsearch-bot Dec 14, 2017

andrewvc force-pushed the stronger-thread-safety branch from 2b5cae9 to eb88860 Compare December 15, 2017 15:11

andrewvc force-pushed the stronger-thread-safety branch from eb88860 to d213869 Compare December 15, 2017 15:27

Fix a possible race where Thread.interrupted was not properly cleared

f57ed2a

This should keep perf even

andrewvc force-pushed the stronger-thread-safety branch from d213869 to f57ed2a Compare December 15, 2017 21:54

yaauie approved these changes Dec 15, 2017

elasticsearch-bot closed this in 82ec779 Dec 18, 2017

andrewvc mentioned this pull request Dec 18, 2017

Ensure RubyThread interrupts are Cleared #130

Closed

colinsurprenant mentioned this pull request Jul 20, 2019

[meta] kv and grok filters timeout handling elastic/logstash#10976

Closed
