
Deadlock detector hack for Kafka driver instability #1087

Merged
vprithvi merged 11 commits into master from seppuku on Oct 9, 2018
Conversation

vprithvi (Contributor) commented Sep 26, 2018

Signed-off-by: Prithvi Raj <p.r@uber.com>

Which problem is this PR solving?

Short description of the changes

  • Adds a deadlock watcher per partition that monitors the message consumption rate each minute. If the rate drops to zero, it triggers a rebalance by sending a signal to close the partition. If the partition close is unsuccessful, the instance kills itself. (A sketch of the pattern follows below.)
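
For illustration, a minimal Go sketch of this pattern; the names (`watcher`, `startWatcher`) are made up here rather than taken from the PR, whose real implementation lives in cmd/ingester/app/consumer/deadlock_detector.go:

```go
package consumer

import (
	"sync/atomic"
	"time"
)

// watcher monitors the consumption rate of one partition.
type watcher struct {
	msgConsumed    uint64        // bumped by the consumer on every message
	closePartition chan struct{} // asks the consumer loop to close the partition
}

// startWatcher spawns the per-partition check loop.
func startWatcher(interval time.Duration) *watcher {
	w := &watcher{closePartition: make(chan struct{})}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			// Reset the counter and inspect what accumulated last interval.
			if atomic.SwapUint64(&w.msgConsumed, 0) == 0 {
				select {
				case w.closePartition <- struct{}{}:
					// Partition close requested; a rebalance should follow.
				default:
					// The close signal is blocked: assume a deadlock and kill
					// the instance so it can be restarted from the outside.
					panic("no messages consumed in the last interval")
				}
			}
		}
	}()
	return w
}

// incrementMsgCount is called for every message consumed.
func (w *watcher) incrementMsgCount() {
	atomic.AddUint64(&w.msgConsumed, 1)
}
```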

Signed-off-by: Prithvi Raj <p.r@uber.com>
codecov bot commented Sep 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a429d78).
The diff coverage is 100%.

@@           Coverage Diff            @@
##             master   #1087   +/-   ##
========================================
  Coverage          ?    100%           
========================================
  Files             ?     141           
  Lines             ?    6723           
  Branches          ?       0           
========================================
  Hits              ?    6723           
  Misses            ?       0           
  Partials          ?       0
Impacted Files Coverage Δ
cmd/ingester/app/consumer/consumer.go 100% <100%> (ø)
cmd/ingester/app/consumer/deadlock_detector.go 100% <100%> (ø)
cmd/ingester/app/consumer/consumer_metrics.go 100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ghost assigned black-adder on Oct 1, 2018
cmd/ingester/app/consumer/consumer.go (resolved thread)
buf := make([]byte, 1<<20)
// runtime.Stack with all=true captures the stacks of every goroutine,
// not just the current one (see the discussion below).
logger.Panic("No messages processed in the last check interval",
	zap.Int32("partition", partition),
	zap.String("stack", string(buf[:runtime.Stack(buf, true)])))
Member:

in HotROD the zap logger automatically prints the stack. What's different here?

Contributor Author:

zap only prints the stack of the current goroutine; this prints all goroutines.

func (s *seppukuFactory) startMonitoringForPartition(partition int32) *seppukuWorker {
var msgConsumed uint64
w := &seppukuWorker{
msgConsumed: &msgConsumed,
Member:

no reason to allocate a pointer; you can still use the address of a struct field with atomic.

Contributor Author:

True, I had assumed that the pointer allocation had similar overhead to a field allocation. I didn't want to use & on every access.
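
Concretely, the suggestion is a plain uint64 field whose address is taken at each call site; a sketch with assumed names, not the PR's code:

```go
import "sync/atomic"

type worker struct {
	msgConsumed uint64 // was msgConsumed *uint64; a plain field works with atomic
}

func (w *worker) incr() {
	// The field's address is taken at the call site; no separate allocation.
	// Caveat: 64-bit atomics require 64-bit alignment, which is guaranteed
	// here because the field is first in an allocated struct.
	atomic.AddUint64(&w.msgConsumed, 1)
}
```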

case w.closePartition <- struct{}{}:
s.logger.Warn("Signalling partition close due to inactivity", zap.Int32("partition", partition))
default:
// If closePartition is blocked, attempt seppuku
Member:

// If closePartition is blocked, the consumer may have deadlocked -> kill the process

}

func (w *seppukuWorker) close() {
w.ticker.Stop()
Member:

this could inadvertently kill the process after a rebalance, but I'm not sure how to avoid it

Contributor Author:

Could you elaborate how?

Member:

actually, because Ticker.Stop() does not close the channel, the race condition I was thinking about won't happen on rebalance, but you will leak the goroutine

Contributor Author:

This is true, I can use a separate channel to close this if you feel strongly about it

Member:

yes, let's fix this, a goroutine leak is not good
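
A sketch of the agreed fix, extending the hypothetical watcher from the PR description above: a done channel gives close() a way to actually stop the goroutine, since Ticker.Stop() does not close ticker.C:

```go
import "time"

type watcher struct {
	msgConsumed    uint64
	closePartition chan struct{}
	done           chan struct{} // closed to stop the check-loop goroutine
}

func startWatcher(interval time.Duration) *watcher {
	w := &watcher{
		closePartition: make(chan struct{}),
		done:           make(chan struct{}),
	}
	go func() {
		// The ticker is local to the goroutine and always stopped when it exits.
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-w.done:
				return // close() was called: exit instead of leaking
			case <-ticker.C:
				// ...zero-rate check as in the earlier sketch...
			}
		}
	}()
	return w
}

// close stops the watcher. Ticker.Stop() alone does not close ticker.C,
// so without done the goroutine would block on the ticker forever.
func (w *watcher) close() {
	close(w.done)
}
```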

Signed-off-by: Prithvi Raj <p.r@uber.com>
Signed-off-by: Prithvi Raj <p.r@uber.com>
@@ -52,12 +56,15 @@ type consumerState struct {

// New is a constructor for a Consumer
func New(params Params) (*Consumer, error) {
deadlockDetectorFactory := newDeadlockDetectorFactory(params.Factory, params.Logger, time.Minute)
Member:

separate issue: s/params.Factory/params.MetricsFactory/

Contributor Author:

I'll address this separately - it'll only add noise to this PR


msgProcessor.Process(&saramaMessageWrapper{msg})
case <-deadlockDetector.getClosePartition():
Member:

s/getClosePartition/closePartitionChannel/

}

func (c *Consumer) closePartition(partitionConsumer sc.PartitionConsumer) {
c.logger.Info("Closing partition consumer", zap.Int32("partition", partitionConsumer.Partition()))
partitionConsumer.Close() // blocks until messages channel is drained
c.newPartitionMetrics(partitionConsumer.Partition()).closeCounter.Inc(1)
Member:

why do we call new here? If it internally caches the metrics, then it shouldn't be named newPartitionMetrics.

closeCounter metrics.Counter
}

func (c *Consumer) getNamespace(partition int32) metrics.Factory {
Member:

s/getNamespace/metricsFactoryForPartition/

Member:

NB: we don't use "get" in Go.

msgMetrics.offsetGauge.Update(msg.Offset)
msgMetrics.lagGauge.Update(pc.HighWaterMarkOffset() - msg.Offset - 1)
deadlockDetector.incrementMsgCount()
c.allPartitionDeadlockDetector.incrementMsgCount()
Member:

I don't follow what the purpose of this allPartitionDeadlockDetector is. It looks like it's only used to increment this counter; why do we even need it? We can always sum the time series to get the total count.

var msgConsumed uint64
w := &deadlockDetector{
msgConsumed: &msgConsumed,
ticker: time.NewTicker(s.interval),
Member:

nit: you do not need to leak the ticker. Create it inside the goroutine as a local variable with a deferred Stop().

Contributor Author:

👍

@@ -42,6 +43,9 @@ type Consumer struct {
internalConsumer consumer.Consumer
processorFactory ProcessorFactory

deadlockDetectorFactory deadlockDetectorFactory
allPartitionDeadlockDetector *deadlockDetector
Member:

nit: allPartitions...

Contributor Author:

👍

@@ -42,6 +43,9 @@ type Consumer struct {
internalConsumer consumer.Consumer
processorFactory ProcessorFactory

deadlockDetectorFactory deadlockDetectorFactory
Member:

I think it would be cleaner and easier to understand if you had a top-level deadlockDetector, which can create partitionDeadlockDetector as needed (implementation-wise, the former may contain the latter for pId=-1). So the factory is only used once to create the top-level detector, and the factory does not need to be stored in the Consumer.

Contributor Author:

Doesn't this mean that the top-level deadlockDetector also has the responsibilities of the factory? (That being said, I think it might be a cleaner design.)

Member:

it does, in a way, but there's nothing wrong with that, especially considering that per-partition creation happens at runtime and many times, whereas the top-level factory would be used only once on startup and not needed afterwards.

Separating the top-level detector from the individual detectors will also allow a clean separation of implementation details, e.g. where some features are not used.

And best of all, you'll be able to move a lot of methods into detectors, away from Consumer.
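
A rough sketch of the proposed split, with assumed rather than final names; the top-level detector takes over the factory's job and hands out per-partition detectors:

```go
package consumer

import (
	"time"

	"github.com/uber/jaeger-lib/metrics"
	"go.uber.org/zap"
)

// deadlockDetector is created once at startup and owns the shared config,
// so the Consumer no longer needs to store a factory.
type deadlockDetector struct {
	metricsFactory metrics.Factory
	logger         *zap.Logger
	interval       time.Duration
}

// startMonitoringForPartition spawns a watcher for one partition.
func (d *deadlockDetector) startMonitoringForPartition(partition int32) *partitionDeadlockDetector {
	// ...spawn the ticker goroutine from the earlier sketches...
	return &partitionDeadlockDetector{
		closePartition: make(chan struct{}),
		done:           make(chan struct{}),
	}
}

// partitionDeadlockDetector is the per-partition worker (pId = -1 could
// represent the all-partitions case, as suggested above).
type partitionDeadlockDetector struct {
	msgConsumed    uint64
	closePartition chan struct{}
	done           chan struct{}
}
```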

Contributor Author:

👍

msgConsumed *uint64
ticker *time.Ticker
logger *zap.Logger
closePartition chan struct{}
Member:

I assume this does not apply to all-partitions detector?

Contributor Author:

It does not

Signed-off-by: Prithvi Raj <p.r@uber.com>
f := c.namespace(partition)
return partitionMetrics{
closeCounter: f.Counter("partition-close", nil),
startCounter: f.Counter("partition-start", nil)}
Member:

This still bothers me. Some metrics factories are not happy if you try to create a metric with the same name twice. So if we re-acquire the same partition, this could cause a panic, e.g. if someone is using expvar-based metrics (unless we implemented protection in the factory, which I had to do for Prometheus).

Contributor Author:

I added a test to jaeger-lib/metrics which shows that calling Counter multiple times with the same tags does not panic for expvar, prometheus, and tally.
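
A minimal form of such a test, for illustration only, reusing the Counter signature and local metrics factory that already appear in this PR (the actual test lives in jaeger-lib/metrics and also covers prometheus and tally):

```go
package metrics_test

import (
	"testing"

	"github.com/uber/jaeger-lib/metrics"
)

func TestCounterIsReentrant(t *testing.T) {
	f := metrics.NewLocalFactory(0)
	tags := map[string]string{"partition": "1"}
	// Acquiring the same counter twice with the same name and tags must
	// return usable counters both times rather than panic.
	f.Counter("partition-close", tags).Inc(1)
	f.Counter("partition-close", tags).Inc(1)
}
```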

@@ -210,3 +216,29 @@ func TestSaramaConsumerWrapper_start_Errors(t *testing.T) {

t.Fail()
}

func TestHandleClosePartition(t *testing.T) {
localFactory := metrics.NewLocalFactory(0)
Member:

s/localFactory/metricsFactory/

done chan struct{}
}

func newDeadlockDetector(factory metrics.Factory, logger *zap.Logger, interval time.Duration) deadlockDetector {
Member:

is there some pattern you're following from elsewhere in the code? Referring to a metrics factory as merely "factory" is very poor naming.

Contributor Author:

Ah, this is a result of some laziness and accepting the IDE-generated name. I'll update these.

cmd/ingester/app/consumer/deadlock_detector.go (outdated; resolved thread)
cmd/ingester/app/consumer/deadlock_detector.go (outdated; resolved thread)

func newDeadlockDetector(factory metrics.Factory, logger *zap.Logger, interval time.Duration) deadlockDetector {
panicFunc := func(partition int32) {
factory.Counter("deadlockdetector.panic-issued", map[string]string{"partition": strconv.Itoa(int(partition))}).Inc(1)
Member:

this assumes that the factory is reentrant for the same metric name, which is not guaranteed

cmd/ingester/app/consumer/consumer.go (resolved thread)
@yurishkuro yurishkuro changed the title Suicide hack for Kafka driver instability Deadlock detector hack for Kafka driver instability Oct 6, 2018
Signed-off-by: Prithvi Raj <p.r@uber.com>
@@ -30,8 +30,17 @@ type errMetrics struct {
errCounter metrics.Counter
}

type partitionMetrics struct {
Member:

nit: we should clean up these metrics. All 3 structs are per-partition; could we not combine them into one? Then we won't have all these small functions on the Consumer.

Contributor Author:

I agree - I'll do it as a separate commit

cmd/ingester/app/consumer/deadlock_detector.go (outdated; resolved thread)
Signed-off-by: Prithvi Raj <p.r@uber.com>
Signed-off-by: Prithvi Raj <p.r@uber.com>
Signed-off-by: Prithvi Raj <p.r@uber.com>
Signed-off-by: Prithvi Raj <p.r@uber.com>
vprithvi merged commit 7105fa9 into master on Oct 9, 2018
ghost removed the review label on Oct 9, 2018
vprithvi deleted the seppuku branch on October 9, 2018 15:55