Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in keepSafe #369

Merged
merged 1 commit into from
Jun 19, 2019
Merged

Fix deadlock in keepSafe #369

merged 1 commit into from
Jun 19, 2019

Conversation

robert-milan
Copy link
Contributor

No description provided.

@robert-milan robert-milan changed the title Fix deadlock in keepSafe [WIP] Fix deadlock in keepSafe Jun 17, 2019
@robert-milan robert-milan changed the title [WIP] Fix deadlock in keepSafe Fix deadlock in keepSafe Jun 17, 2019
@@ -64,8 +65,6 @@ func (k *keepSafe) GetAll() [][]byte {
}

func (k *keepSafe) Stop() {
k.Lock()
close(k.closed)
k.closed = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to do un-synchronized writes and reads on the same variable? i believe with go's memory model, the answer is no.
i would keep the lock around the write (or if you want to be really fancy, do an atomic operation on an int, but that seems overkill)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not safe heh. I was thinking atomic. We can't keep the write locks around it, that is what is causing the deadlock. The other goroutine will never acquire the lock, because goroutine 1 won't release it, because it is waiting for gourtine 2 to return and decrement the waitgroup.

I guess we could just restrict it to the write though, good point.

Copy link
Contributor

@Dieterbe Dieterbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. but you can get rid of the select and just for range over the ticker channel directly

@woodsaj
Copy link
Member

woodsaj commented Jun 17, 2019

couldnt the race here be more easily solved by just removing the locks from Stop()?
I dont see any reason why the lock was originally being acquired before closing the k.closed chan and calling wg.Wait(). Stop() should only ever be called once.

@robert-milan
Copy link
Contributor Author

couldnt the race here be more easily solved by just removing the locks from Stop()?
I dont see any reason why the lock was originally being acquired before closing the k.closed chan and calling wg.Wait(). Stop() should only ever be called once.

It should not be called twice, but I'm not as familiar with this code. The current change is safer though, otherwise if our assumption of Stop() never being called twice is wrong then we will just get a panic.

Copy link
Member

@woodsaj woodsaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really happy with this approach, i think we need to just use a simple channel that is closed when stop is called.
The main problems i have are:

  1. using a "shutdown" chan is the most common way of dealing with "plugins" like this, and is how we do this in just about every other piece of code we write. It is completely acceptable for a panic to arise if "Stop()" is called multiple times (the panic will be due to calling close() on an already closed channel). There is also an expectation that "plugins" wont be used after stop() has been called.

  2. This approach results in an additional delay of up to k.periodKeep between when "Stop()" is called and when the "keepClean()" goroutine exits. A quick look over the code suggests that this would be a pretty bad thing.
    https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination/destination.go#L295

https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination/destination.go#L304

the call to conn.clearRedo() calls c.keepSafe.Stop().

With the approach used in this PR, a delay of up to 10seconds will be added when connections drop and and reconnect.

@robert-milan
Copy link
Contributor Author

I am not really happy with this approach, i think we need to just use a simple channel that is closed when stop is called.
The main problems i have are:

  1. using a "shutdown" chan is the most common way of dealing with "plugins" like this, and is how we do this in just about every other piece of code we write. It is completely acceptable for a panic to arise if "Stop()" is called multiple times (the panic will be due to calling close() on an already closed channel). There is also an expectation that "plugins" wont be used after stop() has been called.

As long as we are fine with panics I can change it back to using a channel

  1. This approach results in an additional delay of up to k.periodKeep between when "Stop()" is called and when the "keepClean()" goroutine exits. A quick look over the code suggests that this would be a pretty bad thing.

Yes, I did think about that. I wasn't sure if it would be a problem or not.

https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination/destination.go#L295

https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination/destination.go#L304

the call to conn.clearRedo() calls c.keepSafe.Stop().

With the approach used in this PR, a delay of up to 10seconds will be added when connections drop and and reconnect.

Fair enough

@robert-milan robert-milan merged commit 44034df into master Jun 19, 2019
@Dieterbe Dieterbe deleted the fix-deadlock branch June 19, 2019 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants