
[Worker] Check if worker allocator is terminated in static allocation mode #3105

Merged
merged 2 commits into from
Jan 2, 2024

Conversation

TomerShor
Copy link
Contributor

Follow-up to #3092, where we block worker allocation if the workers are terminated:

  • Add a termination check to the static worker allocator mode. It was missing from [Processor] Block worker allocation if worker allocator is terminated #3092 because the static allocator allocates all workers on startup, instead of per event.
  • Rename termination-state-related methods to "drain state".
  • In Kafka's drainOnRebalance, close the readyForRebalanceChan after sending a value on it. This helps in cases where maxWaitHandlerDuringRebalance times out before the drain handler finishes, and the channel would otherwise be closed before the drain goroutine sends a value on it.

@@ -360,6 +359,7 @@ func (k *kafka) drainOnRebalance(session sarama.ConsumerGroupSession,

 	wg.Wait()
 	readyForRebalanceChan <- true
+	close(readyForRebalanceChan)
Contributor

In that case, if we run out of waiting time (in the select statement), we will never close the channel, because writing to the channel is a blocking operation. So I would leave it as is.

Contributor Author

The goroutine will still run even after the timeout has passed and the select no longer waits on this channel.

The issue is that if the timeout has passed, the function exits and closes the readyForRebalanceChan. The goroutine continues to run and tries to write to the closed channel.
My thought was to not close the channel before the goroutine is done.

Contributor

@TomerShor Yes, but if the goroutine is still running when the timeout has passed, this change will result in the channel never being closed. We won't read from readyForRebalanceChan, leaving a zombie goroutine blocked indefinitely.

Currently, it's possible to attempt writing to a closed channel, causing a panic that makes the goroutine exit. To address this properly and prevent the panic, we should notify the goroutine from the main function body that the timeout has passed and there's no need to write anything to the channel. But as far as I'm concerned, panicking in this goroutine is not a big issue.

Contributor Author

@rokatyy We already have a recover for that case in this goroutine 🤦
I will revert this change.

Contributor

@rokatyy rokatyy left a comment

Overall looks good, nice catch with SignalTermination!
One question about closing the channel, and then we're done.

pkg/processor/worker/allocator.go (outdated; resolved)
@@ -116,13 +119,18 @@ func (s *singleton) SignalDraining() error {
 }

 func (s *singleton) SignalTermination() error {
-	return s.worker.Drain()
+	s.isTerminated = true
Contributor

👍

@TomerShor TomerShor requested a review from rokatyy January 2, 2024 10:20
@TomerShor TomerShor merged commit 53ea9ed into nuclio:development Jan 2, 2024
11 checks passed
@TomerShor TomerShor deleted the kafka-drain-termination branch January 2, 2024 11:48
2 participants