Possible race condition under load #89

Closed
anjackson opened this issue Oct 6, 2020 · 8 comments

@anjackson
Contributor

anjackson commented Oct 6, 2020

I've been running some load tests on OutbackCDX, and as indicated in this comment, when I run 1000 threads (each running the same GET 100 times) I start seeing odd errors:

Premature end of chunk coded message body: closing chunk expected
Socket closed

It's rock solid at 750 threads, but at 1000 it goes wonky. The same client works fine at 1000 threads when running against an NGINX instance configured to respond as if it were OutbackCDX.

So, this seems like it might be a subtle race condition under load in OutbackCDX itself?

EDIT: Sorry, I should have mentioned that this happens irrespective of the number of threads OutbackCDX is configured to use (as long as it's plenty!) and doesn't seem to be related to ulimits (which manifest themselves differently).
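
For reference, the client side of the test is roughly the sketch below. This is not the actual harness: it assumes Apache HttpClient (which is what the stack traces further down come from), and the query URL and pool sizing are illustrative only.

// Rough sketch of the load generated: N threads, each issuing the same GET
// 100 times against a shared pooled client. Illustrative, not the real harness.
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class OutbackCdxLoadTest {
    public static void main(String[] args) throws Exception {
        int threads = 1000; // errors start appearing around this level
        String url = "http://localhost:9090/fc?limit=100&matchType=exact&sort=reverse"
                + "&url=http%3A%2F%2Facid.matkelly.com%2F";
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(threads);
        cm.setDefaultMaxPerRoute(threads);
        CloseableHttpClient client = HttpClients.custom().setConnectionManager(cm).build();

        Runnable task = () -> {
            for (int i = 0; i < 100; i++) {
                try (CloseableHttpResponse resp = client.execute(new HttpGet(url))) {
                    EntityUtils.consume(resp.getEntity()); // read the chunked body fully
                } catch (Exception e) {
                    System.err.println("GET failed: " + e);
                }
            }
        };
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) { workers[i] = new Thread(task); workers[i].start(); }
        for (Thread t : workers) t.join();
        client.close();
    }
}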

@anjackson
Contributor Author

Perhaps just an exception getting swallowed, maybe here.
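
Something along these lines is the pattern I mean: the IOException from the client disconnecting gets caught and dropped inside the code that streams the response body, so nothing shows up in the server logs. A made-up sketch, not OutbackCDX's actual code:

// Sketch of a swallowed exception while streaming a chunked response body.
// If the catch block stays empty, the client just sees the connection die
// mid-chunk and the server logs nothing.
import java.io.IOException;
import java.io.Writer;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

class StreamingResponse {
    private static final Logger log = Logger.getLogger("outbackcdx");

    static Consumer<Writer> body(Iterable<String> lines) {
        return writer -> {
            try {
                for (String line : lines) {
                    writer.write(line);
                    writer.write('\n');
                }
                writer.flush();
            } catch (IOException e) {
                // Logging here (rather than ignoring) at least makes the
                // server-side cause of a truncated response visible.
                log.log(Level.WARNING, "error streaming response body", e);
            }
        };
    }
}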

@ato
Member

ato commented Oct 7, 2020

Ah, yes, that sounds like a swallowed exception. galgeek recently added some more logging to WbCdxApi to help with that.

@anjackson
Contributor Author

Yeah, I dunno what's going on. Some clumsy handling of error conditions in my client was confusing things. Having tidied that up, I only get Socket closed errors:

SEVERE: GET http://localhost:9090/fc?limit=100&matchType=exact&sort=reverse&url=http%3A%2F%2Facid.matkelly.com%2F failed with error Socket closed
java.net.SocketException: Socket closed
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:150)
	at java.net.SocketInputStream.read(SocketInputStream.java:121)
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:136)
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:152)
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:270)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:161)
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:153)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at uk.bl.wap.util.OutbackCDXClient.getCDXLine(OutbackCDXClient.java:268)

Having added some logging to OutbackCDX, I can see:

java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at outbackcdx.NanoHTTPD$CountingOutputStream.write(NanoHTTPD.java:1092)
	at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
	at outbackcdx.NanoHTTPD$ChunkingOutputStream.write(NanoHTTPD.java:702)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)

	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
	at outbackcdx.WbCdxApi.lambda$queryIndex$0(WbCdxApi.java:71)
	at outbackcdx.NanoHTTPD$Response.send(NanoHTTPD.java:539)

	at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

(warning: mixture of threads in that log)

So the error seems to be consistent with the client going away (broken pipe).
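
If that's what is happening, the usual server-side response is to treat a broken pipe during the write as "client went away" and close quietly, rather than surfacing it as a failure. A sketch of that distinction, not NanoHTTPD's actual handling:

// Sketch: tolerate "client went away" during the response write, but let
// other I/O failures propagate as genuine errors.
import java.io.IOException;
import java.io.OutputStream;
import java.net.SocketException;

class ResponseSender {
    void send(OutputStream out, byte[] chunk) throws IOException {
        try {
            out.write(chunk);
            out.flush();
        } catch (SocketException e) {
            // Broken pipe / connection reset: the client closed its end first,
            // so there is nothing useful left to send. Close our side quietly.
            try { out.close(); } catch (IOException ignored) { }
        }
    }
}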

@anjackson
Contributor Author

I note my NGINX comparison would not have exercised the chunked transfer decoding, for example, so it may not be a fair test. It's also possible I'm exhausting ephemeral ports on this machine, but it's an odd way for that error to show up.

@ato
Member

ato commented Oct 7, 2020

Yeah, that's consistent with the client closing the connection, although it could obviously be doing that in response to something the server does, like invalid chunked encoding. It can't be ephemeral port exhaustion, because the connection was already established (unless the client library responds to being unable to open new connections by closing random existing ones or something).

One thing you could try is running OutbackCDX with the undocumented and experimental '-u' command-line option. This makes it use Undertow instead of NanoHTTPD as the HTTP server. Chances are Undertow has more overhead, but it's more battle-tested and probably has better error handling.
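
For example (only the -u flag is the relevant part here; the jar name and the usual -d/-p options for index directory and port are just the normal invocation, adjust to your setup):

java -jar outbackcdx.jar -u -d /path/to/index -p 9090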

@anjackson
Contributor Author

Oof, well, to give you an idea of how it's going, I tried upping the threads some more and got this!

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010c33ad1c, pid=36661, tid=111619
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b124) (build 1.8.0-ea-b124)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b66 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libnet.dylib+0x9d1c]  Java_java_net_SocketInputStream_socketRead0+0x23d

So perhaps there are deeper problems.

Thanks for the tip about Undertow. It does seem slightly happier, reporting some errors on the back-end but not in the client (?!).

java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.xnio.nio.NioSocketConduit.write(NioSocketConduit.java:164)
	at io.undertow.server.protocol.http.HttpResponseConduit.write(HttpResponseConduit.java:609)
	at io.undertow.conduits.ChunkedStreamSinkConduit.flush(ChunkedStreamSinkConduit.java:270)
	at org.xnio.conduits.ConduitStreamSinkChannel.flush(ConduitStreamSinkChannel.java:162)
	at io.undertow.channels.DetachableStreamSinkChannel.flush(DetachableStreamSinkChannel.java:119)
	at org.xnio.channels.Channels.flushBlocking(Channels.java:63)
	at io.undertow.io.UndertowOutputStream.close(UndertowOutputStream.java:348)
	at outbackcdx.UWeb$UServer.sendResponse(UWeb.java:71)
	at outbackcdx.UWeb$UServer.dispatch(UWeb.java:42)
	at io.undertow.server.Connectors.executeRootHandler(Connectors.java:376)
	at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

and then later on:

Oct 07, 2020 11:54:01 PM io.undertow.server.Connectors executeRootHandler
ERROR: UT005071: Undertow request failed HttpServerExchange{ GET /fc}
java.lang.IllegalStateException: UT000002: The response has already been started
	at io.undertow.server.HttpServerExchange.setStatusCode(HttpServerExchange.java:1406)
	at outbackcdx.UWeb$UServer.sendResponse(UWeb.java:54)
	at outbackcdx.UWeb$UServer.dispatch(UWeb.java:49)
	at io.undertow.server.Connectors.executeRootHandler(Connectors.java:376)
	at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
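
I think that second one just means the error path tried to set a status code after the handler had already started streaming the chunked body for that exchange, so the original failure gets masked by a secondary UT000002. A guard along these lines would avoid the secondary error (a sketch against Undertow's HttpServerExchange API, not what UWeb.java actually does):

// Sketch: only switch to an error status if the response hasn't started yet;
// otherwise just end the exchange instead of throwing UT000002 on top of the
// original failure.
import io.undertow.server.HttpServerExchange;
import io.undertow.util.Headers;

class ErrorGuard {
    static void sendError(HttpServerExchange exchange, int status, String message) {
        if (exchange.isResponseStarted()) {
            // Too late to change the status line or headers.
            exchange.endExchange();
            return;
        }
        exchange.setStatusCode(status);
        exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "text/plain");
        exchange.getResponseSender().send(message);
    }
}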

I upped the number of threads for OutbackCDX and these errors no longer showed up. Could be coincidence.

Very odd speed profile. The first column is the parallel thread count; the third is seconds per thread (where each thread has the same workload). It gets faster up to 750 threads, then there's a big slowdown at 1000. Not clear why.

10	1620	162.0	0	1000
50	818	16.36	0	5000
100	1209	12.09	0	10000
250	2286	9.144	0	25000
500	4664	9.328	0	50000
750	6665	8.886666666666667	0	75000
1000	63796	63.796	0	100000

I think the problem is likely not with OutbackCDX, unless maybe there's something subtle going on with keepalive and chunked encoding. I'll do some more experiments on a different platform.

@anjackson
Contributor Author

Using Undertow, through Docker:

10	1522	152.2	0	1000
50	1890	37.8	0	5000
100	3621	36.21	0	10000
250	8615	34.46	0	25000
500	18705	37.41	0	50000
750	27317	36.422666666666665	0	75000
1000	86674	86.674	0	100000

No errors!

Then back to NanoHTTPD, still under Docker:

10	5633	563.3	0	1000
50	5720	114.4	0	5000
100	8543	85.43	0	10000
250	19600	78.4	0	25000
500	37367	74.734	0	50000
750	47824	63.76533333333333	0	75000
...LOTS OF SOCKET CLOSED BLARFING...
1000	71197	71.197	0	100000

So, while the underlying reasons are unclear, Undertow appears to be faster in general and more stable in particular, so I'll try using Undertow mode in production.

@ato
Member

ato commented Oct 8, 2020

Good to know. Maybe I'll make Undertow mode the default then and deprecate NanoHTTPD.
