Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cassandra] Add Configuration.Close() to ensure TLS cert watcher is closed #4515

Merged
merged 2 commits into from Jun 20, 2023

Conversation

kennyaz
Copy link
Contributor

@kennyaz kennyaz commented Jun 7, 2023

Which problem is this PR solving?
Resolves #4514

Short description of the changes
Implement a Close Function in Cassandra to allow us to close the leaky-go routine.

@kennyaz kennyaz requested a review from a team as a code owner June 7, 2023 19:22
@kennyaz kennyaz requested a review from vprithvi June 7, 2023 19:22
yurishkuro
yurishkuro previously approved these changes Jun 7, 2023
Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yurishkuro yurishkuro enabled auto-merge (squash) June 7, 2023 19:51
auto-merge was automatically disabled June 8, 2023 02:23

Head branch was pushed to by a user without write access

@kennyaz
Copy link
Contributor Author

kennyaz commented Jun 8, 2023

The first push failed because of the test case in TestProxyBuilder in build_test.go file is calling fswatcher.go Close function twice. Added protection so that if the channel is already closed, it will not try to attempt to close the channel again.

close(w.stop)
w.wg.Wait()
}
w.isClosed = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unprotected bool can be a source of race condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let me add the mutex lock to prevent the race conditions.

@kennyaz kennyaz requested a review from yurishkuro June 8, 2023 15:35
yurishkuro
yurishkuro previously approved these changes Jun 10, 2023
@yurishkuro
Copy link
Member

please make sure all commits are signed (easiest is to squash into one)

@codecov
Copy link

codecov bot commented Jun 11, 2023

Codecov Report

Patch coverage has no change and project coverage change: +0.01 🎉

Comparison is base (4a6f118) 97.03% compared to head (c4c3fd2) 97.04%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4515      +/-   ##
==========================================
+ Coverage   97.03%   97.04%   +0.01%     
==========================================
  Files         301      301              
  Lines       17817    17817              
==========================================
+ Hits        17289    17291       +2     
+ Misses        424      422       -2     
  Partials      104      104              
Flag Coverage Δ
unittests 97.04% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kennyaz
Copy link
Contributor Author

kennyaz commented Jun 12, 2023

I squashed and signed all my commits. Can you approve it again?

func (w *FSWatcher) Close() error {
if w.isClosed.CompareAndSwap(false, true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking closer at this and I am not convinced this is all needed. When we call w.watcher.Close(), it closes the Events and Errors channels, so our goroutine will exit naturally. Yes, there is a race condition if you're checking for its exit immediately, but that cannot be completely avoided, even with these changes the goroutine may go to sleep after L143 lowers the semaphore.

If you observed that the goroutine is still running for a long time after close, there may be something else going on.

We can still use the waitgroup introduced here, but without the channel/bool:

func (w *FSWatcher) Close() error {
	// watcher.Close() takes care of multiple calls by returning "already closed" error
	if err := w.watcher.Close(); err != nil {
		return err
	}
	w.wg.Wait()
	return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. let me send out a new commit with these changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate how you determined that the GR was still running? Can it be reproduced with a unit test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you are right, the original code should work fine without any issues. I dig deeper into why this issue will happen in the first place. The main reason is that when you create a new Cassandra Cluster on L149, the code calls c.TLS.Config(). The c.TLS.Config call newCertWatcher which then calls watchCertPair(). watchCertPair() call fswatcher.new() which creates a new leaky go routine. After that in Cassandra, there is no method or function to Close the goroutine. In short, the code is working correctly, it is just that when creating the newCluster, the Close function for fswatcher was never called or the method was never implemented to close that go Routine.
https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go#L149
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/options.go#L99
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L69
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L97
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L106

Copy link
Contributor Author

@kennyaz kennyaz Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose that in https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go we implement a function called Close that will allow us to close that goroutine.

func (c *Configuration) Close() error { return c.TLS.Close() }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Signed-off-by: kennyaz <115052215+kennyaz@users.noreply.github.com>
@@ -163,6 +163,10 @@ func (c *Configuration) NewCluster(logger *zap.Logger) (*gocql.ClusterConfig, er
return cluster, nil
}

func (c *Configuration) Close() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function did not exist before, nothing is calling it, so just adding it here is not going to address the issue in the ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I looked at the code base, the leaky go routine only happens when new Cassandra Cluster is created directly with no option to close. So, the leaky-go routine is happening on our side, not on your end. I am thinking about changing a ticket from bug to feature request so that users who created the new Cassandra Cluster will have the option to close the fswatcher leaky go routine.

@kennyaz kennyaz requested a review from yurishkuro June 20, 2023 14:37
@yurishkuro yurishkuro changed the title FixGoRoutine [cassandra] Add Configuration.Close() to ensure TLS cert watcher is closed Jun 20, 2023
@yurishkuro yurishkuro enabled auto-merge (squash) June 20, 2023 16:48
@yurishkuro yurishkuro merged commit 050c632 into jaegertracing:main Jun 20, 2023
31 checks passed
kjschnei001 pushed a commit to kjschnei001/jaeger that referenced this pull request Jun 29, 2023
…losed (jaegertracing#4515)

Which problem is this PR solving?
Resolves jaegertracing#4514

Short description of the changes
Implement a Close Function in Cassandra to allow us to close the
leaky-go routine.

Signed-off-by: kennyaz <115052215+kennyaz@users.noreply.github.com>
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
Signed-off-by: KevinSchneider <kevin.schneider@target.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Leaky Go Routine in Jaeger Fswatcher.go
2 participants