Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent SSL Handshakes cause IO threads to stay busy for long time #7020

Closed
ghost opened this issue Jul 25, 2017 · 27 comments
Closed

Concurrent SSL Handshakes cause IO threads to stay busy for long time #7020

ghost opened this issue Jul 25, 2017 · 27 comments

Comments

@ghost
Copy link

ghost commented Jul 25, 2017

Expected behavior

Existing connections should not be impacted and should continue to exchange messages if new connections are added

Actual behavior

If there are a few thousand connections to the server, and then we get a burst of say another few thousand connections, these new SSL handshakes keep the worker threads busy for a long time, which causes the existing connections to start timing out.

Steps to reproduce

  1. Compile and build the reproducer code using
    mvn clean package assembly:single

  2. Start the tcp server using this command:
    java -jar tcp-server-1.0.0-SNAPSHOT-jar-with-dependencies.jar
    This starts a TCP server with 8 IO threads and 1 acceptor thread on port 8000. For customizing, look at the class TcpServer

  3. Open mission control and monitor the Connection Count using the TcpServer.getConnectionCount managed attribute. This should be 0 to start with

  4. On a second machine, trigger the load using the tcp client jar.
    java -jar tcp-client-1.0.0-SNAPSHOT-jar-with-dependencies.jar
    This will trigger 5000 concurrent handshakes to the server. These connections are kept open and the client keeps sending requests to the server (1 at a time). These values can also be customized using the LoadRunner class

  5. Monitor for a couple of minutes, the connection count reported would be 5000

  6. On a third machine, repeat step 4 and monitor the connection count. You will see the connection count to rise to 10000, and after a couple of minutes the clients start timing out (graph attached)

Minimal yet complete reproducer code (or URL to code)

https://github.com/anshuman-osc/netty.git

Netty version

4.1.6.Final, 4.1.12.Final

JVM version (e.g. java -version)

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

OS version (e.g. uname -a)

Tested on 2 Linux distributions:
Linux 4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Linux 3.0.101-63-default #1 SMP Tue Jun 23 16:02:31 UTC 2015 (4b89d0c) x86_64 x86_64 x86_64 GNU/Linux
io-threads-blocked-counts

missioncontrol-graph
io-threads-blocked-counts

Stack overflow thread

https://stackoverflow.com/questions/44751058/multiple-worker-event-loop-groups

@Scottmitch
Copy link
Member

ChannelInboundHandlers will execute on the EventLoop thread by default. Here are general recommendations to work through this performance issue:

  • If the handshakes are generating too much load you can offload processing to another thread.
  • Use netty-tcnative and OpenSSL to reduce the overhead of SSL and handshakes
  • Upgrade your version of Netty (4.1.6 is old and much has changed since then).

@ghost
Copy link
Author

ghost commented Jul 26, 2017

Hi @Scottmitch

  • If the handshakes are generating too much load you can offload processing to another thread.

Is there an example on how to offload just SSL to another thread ?

  • Upgrade your version of Netty (4.1.6 is old and much has changed since then).

I tried with 4.1.12 and the issue is there as well

  • Use netty-tcnative and OpenSSL to reduce the overhead of SSL and handshakes.

Please see the SO thread where @normanmaurer agreed that the behaviour is unexpected. Also, if you watch Norman's presentation on youtube, he was able to achieve 120K requests per second with the Java SSL. Here, I understand that requests per second is quite different to concurrent connections (or handshakes), however I think something as less as 5-10K handshakes should not degrade performance to this extent.

@normanmaurer
Copy link
Member

@anshuman-osc I did not say its unexpected I just said I would be interested to see a profiler snapshot ;) Can you share the snapshot ?

@ghost
Copy link
Author

ghost commented Jul 27, 2017

Hi @normanmaurer
Thanks for replying. I have attached the snapshots from MissionControl which shows the CPU/Memory Thread states and Connection count. Apart from these what exactly are you interested in, and I can try and get them.

Thanks

@Scottmitch
Copy link
Member

java version "1.8.0_65"

If you are using Java for SSL I would also recommend updating your version of java.

I have attached the snapshots from MissionControl

Where are these attached? Can you provide a link?

@ghost
Copy link
Author

ghost commented Jul 27, 2017

@Scottmitch
Copy link
Member

I can see the screenshots. What we would like is the actual profile snapshot files themselves to understand what is going on in more detail.

@ghost
Copy link
Author

ghost commented Jul 27, 2017

HI @Scottmitch
I have attached the flight recording.zip from Mission Control. You can open it using Mission Control and take a look.
Does this work ?

Thanks

@rkapsi
Copy link
Member

rkapsi commented Jul 28, 2017

+1 on this ticket.

We have theorized about this attack vector and it'd be great to be able to offload the handshaking to a different thread pool that doesn't interfere with the normal worker threads.

@ghost
Copy link
Author

ghost commented Aug 23, 2017

Hi @Scottmitch , @normanmaurer

Did the flight recording help?

Thanks

@Scottmitch
Copy link
Member

sorry I have not had cycles to investigate this yet

@normanmaurer
Copy link
Member

normanmaurer commented Aug 23, 2017 via email

@rkapsi
Copy link
Member

rkapsi commented Oct 6, 2017

In my mind... Split SslHandler into two classes. Let there be a new SslHandshakeHandler and some class that does everything but the handshake portion. Wrap these two into a codec style class for the convenient path and give people to use the handlers separately.

Then use ChannelPipeline#add(EventExecutorGroup, SslHandshakeHandler) if you want the handshakes to happen on a different thread than the default.

That is obviously an API breaking change. For compatibility this could be baked into SslHandler which could take an optional configuration parameter in the form of an executor.

Just throwing it out there.

@normanmaurer
Copy link
Member

normanmaurer commented Oct 6, 2017 via email

@johnou
Copy link
Contributor

johnou commented Nov 13, 2017

So I took a peek, looks like we have javax.net.ssl.SSLEngine#getDelegatedTask which is "implemented" here io.netty.handler.ssl.ReferenceCountedOpenSslEngine#getDelegatedTask..

    @Override
    public final Runnable getDelegatedTask() {
        // Currently, we do not delegate SSL computation tasks
        // TODO: in the future, possibly create tasks to do encrypt / decrypt async
        return null;
    }

The Java implementation uses this exclusively for the handshake SSL computation which I think would work well enough to mitigate the DoS attack vector @rkapsi and I are concerned about (large amounts of SSL handshakes which in turn would slow down the event loop and potentially cause connection drops / latency).

@johnou
Copy link
Contributor

johnou commented Nov 13, 2017

@normanmaurer if you don't mind I can try and add this to our scaling project?

@normanmaurer
Copy link
Member

@johnou yep thats the way to go... I just had no time yet and its not trivial to make it work with the state-machine. Feel free to try tho :)

@johnou
Copy link
Contributor

johnou commented May 10, 2018

@normanmaurer I started by pulling out all the handshake related code from ReferenceCountedOpenSslEngine into it's own class then realised that runDelegatedTasks will anyway block until the delegated task returns, even if we use another thread pool that would still block the event loop.. did you have something else in mind?

@johnou
Copy link
Contributor

johnou commented May 10, 2018

Wouldn't it make more sense modifying io.netty.handler.ssl.SslHandler#handshake to execute in another thread pool and then set the promise allowing the connection to proceed without blocking the event loop.

@johnou
Copy link
Contributor

johnou commented May 10, 2018

Or better yet.. why not just create a second event loop and when handshake completes, move the channel over.

    @Override
    public void channelActive(ChannelHandlerContext context) throws Exception {
        context.pipeline().get(SslHandler.class).handshakeFuture().addListener(sslFuture -> {
            if (sslFuture.isSuccess()) {
                context.channel().deregister().addListener(future -> {
                    if (future.isSuccess()) {
                        fastEventLoop.register(context.channel());
                    } else {
                        // log bad stuff
                    }
                });
            }
        });
        super.channelActive(context);
    }

@johnou
Copy link
Contributor

johnou commented May 10, 2018

@anshuman-osc would you please try with another event loop and see if that helps?

@ghost
Copy link
Author

ghost commented May 10, 2018 via email

@johnou
Copy link
Contributor

johnou commented May 10, 2018

@normanmaurer we heavily rely on registering/deregistering channels, would be a nightmare to lose this feature (after implementing the below solution I recall we observed a nice drop in CPU), consider that 60 people are in a chat room, when an incoming message needs to be sent to all players the remote message comes in (from outside the event loop) then needs to written to every single channel waking up each individual event loop (multiple times), what I've done is taken all the event loops and put them in a helper class which uses deterministic routing..

    public EventLoop getEventLoop(int hashCode) {
        return eventLoops[(Math.abs(hashCode)) % eventLoops.length];
    }

when a player enters a room we can take the hash code of that actor / room id and move the client to the correct event loop..

final EventLoop newEventLoop = eventLoopHelper.getEventLoop(hashCode);
        if (clientSession.getChannel().eventLoop() != newEventLoop) {
            clientSession.getChannel().deregister().addListener(future -> {
                if (future.isSuccess()) {
                    newEventLoop.register(clientSession.getChannel());
                } else {
                    logger.warn("Failed moving player #" + clientSession.getPlayerId() +
                            " to new event loop", future.cause());
                }
            });
        }

now that we know that every single client is on the same event loop for room x when processing a remote message to broadcast we can use the event loop to send it like this..

return new SpaceStreamProxyObject<T>(this, streamActorRef,
                eventLoopHelper.getEventLoop(actorIdentity.hashCode()));
    @Override
    public Task<Void> onNext(final T data, final StreamSequenceToken sequenceToken) {
        return Task.done().whenCompleteAsync((aVoid, throwable) -> {
            // invoke onNext from the correct event loop thread
        }, executor).thenCompose(x -> onNext0(data));
    }

and the message is finally sent to all interested clients on the correct event loop thread avoiding un-necessary wakeups 🎉

@luengnat
Copy link

SSL handshake is really the bottleneck, especially on Oracle JSSE as it processes a lot of algorithm constraint and generates a lot of garbage and CPU intensive.

normanmaurer added a commit that referenced this issue Feb 6, 2019
…en processing TLS / SSL via the SslHandler.

Motivation:

The SSLEngine does provide a way to signal to the caller that it may need to execute a blocking / long-running task which then can be offloaded to an Executor to ensure the I/O thread is not blocked. Currently how we handle this in SslHandler is not really optimal as while we offload to the Executor we still block the I/O Thread.

Modifications:

- Correctly support offloading the task to the Executor while suspending processing of SSL in the I/O Thread
- Add new methods to SslContext to specify the Executor when creating a SslHandler
- Remove @deprecated annotations from SslHandler constructor that takes an Executor
- Adjust tests to also run with the Executor to ensure all works as expected.

Result:

Be able to offload long running tasks to an Executor when using SslHandler. Partly fixes #7862 and #7020.
normanmaurer added a commit that referenced this issue Feb 8, 2019
…en processing TLS / SSL via the SslHandler.

Motivation:

The SSLEngine does provide a way to signal to the caller that it may need to execute a blocking / long-running task which then can be offloaded to an Executor to ensure the I/O thread is not blocked. Currently how we handle this in SslHandler is not really optimal as while we offload to the Executor we still block the I/O Thread.

Modifications:

- Correctly support offloading the task to the Executor while suspending processing of SSL in the I/O Thread
- Add new methods to SslContext to specify the Executor when creating a SslHandler
- Remove @deprecated annotations from SslHandler constructor that takes an Executor
- Adjust tests to also run with the Executor to ensure all works as expected.

Result:

Be able to offload long running tasks to an Executor when using SslHandler. Partly fixes #7862 and #7020.
normanmaurer added a commit that referenced this issue Feb 11, 2019
#8847)

Motivation:

The SSLEngine does provide a way to signal to the caller that it may need to execute a blocking / long-running task which then can be offloaded to an Executor to ensure the I/O thread is not blocked. Currently how we handle this in SslHandler is not really optimal as while we offload to the Executor we still block the I/O Thread.

Modifications:

- Correctly support offloading the task to the Executor while suspending processing of SSL in the I/O Thread
- Add new methods to SslContext to specify the Executor when creating a SslHandler
- Remove @deprecated annotations from SslHandler constructor that takes an Executor
- Adjust tests to also run with the Executor to ensure all works as expected.

Result:

Be able to offload long running tasks to an Executor when using SslHandler. Partly fixes #7862 and #7020.
normanmaurer added a commit that referenced this issue Feb 11, 2019
#8847)

Motivation:

The SSLEngine does provide a way to signal to the caller that it may need to execute a blocking / long-running task which then can be offloaded to an Executor to ensure the I/O thread is not blocked. Currently how we handle this in SslHandler is not really optimal as while we offload to the Executor we still block the I/O Thread.

Modifications:

- Correctly support offloading the task to the Executor while suspending processing of SSL in the I/O Thread
- Add new methods to SslContext to specify the Executor when creating a SslHandler
- Remove @deprecated annotations from SslHandler constructor that takes an Executor
- Adjust tests to also run with the Executor to ensure all works as expected.

Result:

Be able to offload long running tasks to an Executor when using SslHandler. Partly fixes #7862 and #7020.
@mpfau
Copy link

mpfau commented Feb 23, 2019

@normanmaurer
Copy link
Member

As of today we support offloading handshake stuff from the EventLoop by specify an Executor when creating the SslHandler. Closing this

@johnou
Copy link
Contributor

johnou commented Oct 11, 2019

-Dio.netty.handler.ssl.openssl.useTasks=true must be passed into the JVM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants