Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket loop thread abort_on_exception (was: Thread Safety) #4

Closed
ostinelli opened this issue Apr 30, 2016 · 27 comments
Closed

socket loop thread abort_on_exception (was: Thread Safety) #4

ostinelli opened this issue Apr 30, 2016 · 27 comments

Comments

@ostinelli
Copy link
Owner

ostinelli commented Apr 30, 2016

It appears that NetHttp2 might not be thread-safe.

To try it out, run this concurrent test multiple times, you will probably encounter the error HTTP2::Error::HandshakeError raised inside of the dummy server.

Full failure looks like:

Failures:

  1) Sending sync requests sends multiple GET requests concurrently
     Got 0 failures and 2 other errors:

     1.1) Failure/Error: conn << data

          HTTP2::Error::HandshakeError:
            HTTP2::Error::HandshakeError
          # /Users/roberto/.rvm/gems/ruby-2.3.0@net-http2/gems/http-2-0.8.1/lib/http/2/connection.rb:175:in `receive'
          # ./spec/support/dummy_server.rb:107:in `handle'
          # ./spec/support/dummy_server.rb:48:in `block (3 levels) in listen'

     1.2) Failure/Error: conn << data

          HTTP2::Error::HandshakeError:
            HTTP2::Error::HandshakeError
          # /Users/roberto/.rvm/gems/ruby-2.3.0@net-http2/gems/http-2-0.8.1/lib/http/2/connection.rb:175:in `receive'
          # ./spec/support/dummy_server.rb:107:in `handle'
          # ./spec/support/dummy_server.rb:48:in `block (3 levels) in listen'

Finished in 0.00471 seconds (files took 0.09018 seconds to load)
1 example, 1 failure

Investigation is needed, to see if this is an issue on the dummy server (who cares then) or on the client.

@ostinelli
Copy link
Owner Author

I found why it wasn't thread safe.

The first call to a server will start the underlying socket thread, responsible to listen on a socket. However, due to concurrency, two socket threads were actually started.

Solved by adding a mutex on thread creation.

@phoet
Copy link

phoet commented Dec 22, 2016

i believe there are more thread safety issues. unfortunately it's not so simple to reproduce them...

if you look into the current implementation of ensure_open in the client:

    def ensure_open
      @mutex.synchronize do

        return if @socket_thread

        main_thread = Thread.current
        @socket     = new_socket

        @socket_thread = Thread.new do
          begin
            socket_loop

          rescue EOFError
            # socket closed
            init_vars
            main_thread.raise SocketError.new 'Socket was remotely closed'

          rescue Exception => e
            # error on socket
            init_vars
            main_thread.raise e
          end
        end
      end
    end

this method creates a new watch-dog-thread to check if the connection is alive. when something fails an error is raised on the main thread.

as far as i can see this can result in the following scenario:

main thread
   ||
   || processor A
   || create client -> watch-dog-thread
   || client.call              ||
   ||                          ||
   || processor B              ||
   || do something different   ||
   || do something different   ||
   || do something different   ||
   || do something different   ||
   || <---------raise -------- || CONNECTION-TIMEOUT
   XX

in this scenario the main thread, or say the application will exit with an unhandled timeout error.

i would argue that main_thread.raise should not be used here.

@ostinelli
Copy link
Owner Author

Yes, this is expected behavior. It is up to the main application to isolate errors if needed, but NetHttp2 will raise them when encountered.

@phoet
Copy link

phoet commented Jan 9, 2017

@ostinelli well, isolation is the actual problem here. how would you go about isolation? processor B has no knowledge of any watch-dog-thread and just dies when the background thread raises an error.

@ostinelli
Copy link
Owner Author

ostinelli commented Jan 9, 2017

This is in the nature of bi-directional async sockets though. How would you go about it?

BTW main usage of this gem is https://github.com/ostinelli/apnotic, example usage can be seen there.

@phoet
Copy link

phoet commented Jan 9, 2017

yeah, and my comment is referring to this issue.

it might be more reasonable to use set abort_on_exception via a configuration option per thread or even just use report_on_exception.

@ostinelli
Copy link
Owner Author

ostinelli commented Jan 9, 2017

yeah, and my comment is referring to this issue.

i've replied in there with how Apnotic has been designed and the solution works. If your concern is related to NetHttp2 then we can talk about it.

it might be more reasonable to use set abort_on_exception via a configuration option per thread or even just use report_on_exception.

It is not enough to just make it configurable. If an error happens in the socket thread and it is not raised in the main one, then the library needs to automatically act upon the error. This raises another series of questions discussed elsewhere (i.e. here), for instance what do we do with in-the-midst operations (should they be retried? but what if the server actually got them and they are non-idempotent operations? etc).

@ostinelli ostinelli reopened this Jan 9, 2017
@phoet
Copy link

phoet commented Jan 9, 2017

i understand that the issue is not trivial. all i want is a way to make sure that the background thread is not allowed to crash the application randomly. if you can think of another way...

@ostinelli
Copy link
Owner Author

I understand your concern, that's why I've reopened this issue. If you have suggestions please let me know.

@phoet
Copy link

phoet commented Feb 23, 2017

@ostinelli
Copy link
Owner Author

ostinelli commented Feb 24, 2017

Thank you @phoet, I'm pretty familiar with this. The idea though is how to enable async operations.

I've come to the conclusion that the best option might be to refactor to have an :on_error callback (or similar) which will be called with the encountered error, then it will be up to the programmer to decide what to do (raise error, retry, etc). I have still to find time to do the refactoring though.

@adamcooke
Copy link

I think something like the error callback would be good. If no callback is defined, simply fall back to raising the exception on the main thread (backwards compatibility and all that).

client = NetHttp2::Client.new("http://nghttp2.org")
client.on(:error) do |exception|
  # Do something
end

Is there a huge amount of refactoring involved here? I don't see any other callback architecture on the Client class so something similar to the on/emit on Client. I'll do a quick POC PR.

@adamcooke
Copy link

@phoet
Copy link

phoet commented Feb 24, 2017

i'd do a minor release and not do a backwards compatible change. the code could issue a warning when no callback is given. imo thread raise should be gone

@ostinelli
Copy link
Owner Author

ostinelli commented Feb 24, 2017

@phoet this is a 0.x version. I won't care about backwards compatibility, that's what 0.x stands for. If I can easily make it, I will.

@adamcooke will look at your suggestion asap. Thank you.

@ostinelli
Copy link
Owner Author

ostinelli commented Feb 24, 2017

@adamcooke I saw your suggestion, couple of issues there:

  • You don't know which request originated the error.
  • We still need to raise errors in the main thread if errors are raised in the callbacks.

Any suggestions welcome.

@adamcooke
Copy link

adamcooke commented Feb 24, 2017

You don't know the request with the current situation either, though, do you? It's entirely possible that no request caused the error too. If the socket disconnects (my actual case) when no request is in progress.

@ostinelli
Copy link
Owner Author

ostinelli commented Feb 24, 2017

Indeed. Will have something hopefully by tomorrow or next week.

@ostinelli
Copy link
Owner Author

@adamcooke and @phoet can you try out the branch https://github.com/ostinelli/net-http2/tree/error_callback and see if that covers your needs?

@ostinelli ostinelli changed the title Thread Safety abort_on_exception in threads (was: Thread Safety) Feb 27, 2017
@ostinelli ostinelli changed the title abort_on_exception in threads (was: Thread Safety) socket loop thread abort_on_exception (was: Thread Safety) Feb 27, 2017
@adamcooke
Copy link

This looks good to me and solves my issue (although I will need to a way to access the client object on an Apnotic Connection instance but I can monkey patch in).

@ostinelli
Copy link
Owner Author

ostinelli commented Feb 27, 2017

No worries I will add this callback on Apnotic too. But if you can try it out (even with a monkey patch for now) to see if this is ok for you before I merge it in, it would be great.

@adamcooke
Copy link

All looked good in development. I'm now running this live and will report back in a few hours 👍

@adamcooke
Copy link

All running fine for me now. I'd say it was good to merge.

@ostinelli
Copy link
Owner Author

Merged and published 0.15.0. Tonight I will add the Apnotic part of it.
Thank you for your help.

@augustosamame
Copy link

Hello guys. I think I am being bit by this issue too. My current setup is apnotic 1.0.1 and net_http2 0.14.1
My problem is that after a few hours of sidekiq and apnotic working fine I get this error in the sidekiq logs:

Socket was remotely closed
/home/deploy/rails_app/shared/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:104:in `rescue in block (2 levels) in ensure_open'
/home/deploy/rails_app/shared/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:98:in `block (2 levels) in ensure_open'

and my sidekiq process comes crashing down, all threads at a time.
It needs to be restarted manually to work again.

Your discussion seems to be something of a similar issue.

I would appreciate it if you can confirm that this is fixed by upgrading to net-http2 0.15.0 version or if I have to add some extra code (error callback?). You mention adding the "Apnotic part of it" too.

@ostinelli
Copy link
Owner Author

Hello @augustosamame, you are commenting on a resolved issue asking if it would resolve your issue. I'd recommend that you follow the readme instructions and report if the latest version does not solve what you are experiencing.

@phoet
Copy link

phoet commented May 1, 2017

@augustosamame i think you can simply upgrade, the additional errorhandling in the callback is optional. on top of that there is not much that you can do to recover from such errors except for maybe logging them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants