Fix restore functionality hangs forever #4918

Merged on Nov 30, 2015 (1 commit)

@oiooj oiooj commented Nov 26, 2015

If the listener returns errors.New("listener closing"), the accept loop in the serveExecListener function never returns, and the program hangs at https://github.com/influxdb/influxdb/blob/master/meta/store.go#L567

Fix #4806

Ref: https://github.com/influxdb/influxdb/blob/master/meta/store.go#L739

  • CHANGELOG.md updated
  • Rebased/mergeable
  • Tests pass
  • Sign CLA

@oiooj
Contributor Author

oiooj commented Nov 26, 2015

@pauldix @otoolep

@benbjohnson
Contributor

lgtm, although we should probably change the Accept() error check to call Temporary() on the error instead of doing string comparisons.
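
To illustrate the difference between the two checks, here is a minimal Go sketch. It is not the code from this pull request: `fakeTempErr`, `matchesByString`, and `isTemporary` are hypothetical names introduced for the example, and the "listener closing" message is taken from the PR description. Note that `Temporary` on `net.Error` has been marked deprecated since Go 1.18, although it still compiles.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// errListenerClosing mirrors the message quoted in the PR description.
var errListenerClosing = errors.New("listener closing")

// fakeTempErr is a hypothetical error type used only for this illustration.
// It satisfies net.Error and reports itself as temporary.
type fakeTempErr struct{}

func (fakeTempErr) Error() string   { return "simulated transient accept failure" }
func (fakeTempErr) Timeout() bool   { return false }
func (fakeTempErr) Temporary() bool { return true }

// matchesByString is the fragile check: it depends on the exact wording
// of the error message.
func matchesByString(err error) bool {
	return strings.Contains(err.Error(), "connection closed")
}

// isTemporary asks the error itself whether retrying makes sense.
func isTemporary(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Temporary()
}

func main() {
	for _, err := range []error{errListenerClosing, fakeTempErr{}} {
		fmt.Printf("%q: string match=%v, temporary=%v\n",
			err, matchesByString(err), isTemporary(err))
	}
}
```

With the string check, an error whose text differs from the expected wording is never recognized as a shutdown, which is one way a loop can keep spinning. With the behavioral check, anything that is not explicitly temporary can be treated as a reason to stop.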

@otoolep
Contributor

otoolep commented Nov 26, 2015

I agree with @benbjohnson

@oiooj -- thanks, you have found a real bug in our code. However, the better fix is as shown here:

https://github.com/influxdb/influxdb/blob/master/services/opentsdb/service.go#L211

Would you mind following this pattern and fixing the bug that way? We shouldn't have coded string-compares.

@oiooj
Contributor Author

oiooj commented Nov 26, 2015

@otoolep fixed

@@ -735,12 +734,20 @@ func (s *Store) serveExecListener() {
 		// Accept next TCP connection.
 		var err error
 		conn, err := s.ExecListener.Accept()
 		if err != nil {
-			if strings.Contains(err.Error(), "connection closed") {
+			if opErr, ok := err.(*net.OpError); ok && opErr.Temporary() {
Contributor

This doesn't look right to me.

If the error is not temporary, then simply return. Otherwise continue in the loop. Won't that work?

Also, why are you sleeping for a second? Is that debug code?

Contributor Author

@otoolep If the error is temporary, the loop continues. On any other error, the function returns. If err == nil, the connection is handled.
There is no need to sleep for a second, so I have deleted that.

Contributor

OK, I see where you are going.

This still seems a little bit awkward to me. Let me explain.

If the error returned from Accept is temporary, then keep looping. However, if the error is not temporary, then return. The way other code (code that is shutting down the system) signals that the function should return is simply by closing the listener. That will result in a non-temporary error, so polling closing isn't really necessary.

Perhaps this wouldn't work with the ExecListener for some reason? (I have not looked at that code yet).
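
The loop shape described above can be sketched as follows. This is only an illustration under stated assumptions, not the Store code: `scriptedListener`, `tempErr`, `serve`, `runScript`, and `endedByClose` are hypothetical names, and the listener is a fake that replays a fixed sequence of Accept results so the behavior is deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// tempErr is a hypothetical temporary error used only for this demonstration.
type tempErr struct{}

func (tempErr) Error() string   { return "simulated temporary accept failure" }
func (tempErr) Timeout() bool   { return false }
func (tempErr) Temporary() bool { return true }

// step is one scripted Accept result.
type step struct {
	conn net.Conn
	err  error
}

// scriptedListener is a hypothetical net.Listener that replays fixed results
// and then behaves like a closed listener.
type scriptedListener struct{ steps []step }

func (l *scriptedListener) Accept() (net.Conn, error) {
	if len(l.steps) == 0 {
		return nil, net.ErrClosed
	}
	s := l.steps[0]
	l.steps = l.steps[1:]
	return s.conn, s.err
}
func (l *scriptedListener) Close() error   { l.steps = nil; return nil }
func (l *scriptedListener) Addr() net.Addr { return nil }

// serve keeps accepting while errors are temporary and returns on the first
// non-temporary error. It reports how many connections it handled.
func serve(ln net.Listener) (int, error) {
	handled := 0
	for {
		conn, err := ln.Accept()
		if err != nil {
			var ne net.Error
			if errors.As(err, &ne) && ne.Temporary() {
				continue // transient failure: try again
			}
			return handled, err // includes the error seen after Close
		}
		conn.Close() // a real server would hand the connection to a handler
		handled++
	}
}

// endedByClose reports whether the loop stopped because the listener closed.
func endedByClose(err error) bool { return errors.Is(err, net.ErrClosed) }

// runScript feeds serve a temporary error, one connection, another temporary
// error, and then the closed-listener error.
func runScript() (int, error) {
	c1, c2 := net.Pipe()
	defer c2.Close()
	ln := &scriptedListener{steps: []step{{err: tempErr{}}, {conn: c1}, {err: tempErr{}}}}
	return serve(ln)
}

func main() {
	n, err := runScript()
	fmt.Println("handled:", n, "stopped by close:", endedByClose(err))
}
```

The sketch retries immediately on a temporary error. A production loop might add a short backoff, but that is a separate design choice and not something this thread settles.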

s.Logger.Printf("exec listener temporary accept error: %s", err)

select {
case <-s.closing:
Contributor

I would expect that if a signal is sent on closing then the ExecListener would be closed from somewhere else and a non-temporary error would result. I might be wrong though.
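
A small sketch of that expectation, assuming a standard TCP listener on loopback: closing the listener from another goroutine is enough to end the accept loop, so no separate closing channel is polled. The `server` type and its `start`, `serve`, and `Close` methods are hypothetical stand-ins, not the actual Store or ExecListener code, which may behave differently.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// server is a hypothetical stand-in for the component that owns the listener.
type server struct {
	ln      net.Listener
	wg      sync.WaitGroup
	exitErr error // error that ended the accept loop; read only after Close
}

// start opens a loopback listener on an ephemeral port and runs the loop.
func start() (*server, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	s := &server{ln: ln}
	s.wg.Add(1)
	go s.serve()
	return s, nil
}

// serve accepts until a non-temporary error occurs.
func (s *server) serve() {
	defer s.wg.Done()
	for {
		conn, err := s.ln.Accept()
		if err != nil {
			if ne, ok := err.(net.Error); ok && ne.Temporary() {
				continue
			}
			s.exitErr = err
			return
		}
		conn.Close()
	}
}

// Close shuts the listener, which unblocks Accept with a non-temporary
// error, and then waits for the loop goroutine to finish.
func (s *server) Close() error {
	err := s.ln.Close()
	s.wg.Wait()
	return err
}

func main() {
	s, err := start()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	if err := s.Close(); err != nil {
		fmt.Println("close failed:", err)
		return
	}
	fmt.Println("accept loop exited with:", s.exitErr)
}
```

Because `Close` waits on the WaitGroup, it returns only after the loop has actually stopped, which is the property a clean shutdown relies on.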

@oiooj
Contributor Author

oiooj commented Nov 29, 2015

@otoolep Yes, you're right.

Now, if I send a signal to shut down the system, everything works normally.

[run] 2015/11/29 11:17:23 Signal received, initializing clean shutdown...
[run] 2015/11/29 11:17:23 Waiting for clean shutdown...
[metastore] 2015/11/29 11:17:23 RPC listener accept error and closed: network connection closed
[metastore] 2015/11/29 11:17:23 exec listener accept error and closed: network connection closed
[copier] 2015/11/29 11:17:23 copier listener closed
[cluster] 2015/11/29 11:17:23 cluster service accept error: network connection closed
[snapshot] 2015/11/29 11:17:23 snapshot listener closed
[shard-precreation] 2015/11/29 11:17:23 Precreation service terminating
[registration] 2015/11/29 11:17:23 registration service terminating
[continuous_querier] 2015/11/29 11:17:23 continuous query service terminating
[retention] 2015/11/29 11:17:23 retention policy enforcement terminating
[monitor] 2015/11/29 11:17:23 shutting down monitor system
[monitor] 2015/11/29 11:17:23 terminating storage of statistics
[handoff] 2015/11/29 11:17:23 shutting down hh service
[subscriber] 2015/11/29 11:17:23 closed service
[run] 2015/11/29 11:17:23 server shutdown completed

This matches the behavior of the cluster service.

@otoolep
Contributor

otoolep commented Nov 30, 2015

Thanks @oiooj -- +1 from me.

@benbjohnson -- would you mind re-reviewing?

@benbjohnson
Contributor

lgtm

otoolep added a commit that referenced this pull request Nov 30, 2015
Fix restore functionality hangs forever
@otoolep otoolep merged commit abf8fb7 into influxdata:master Nov 30, 2015