Repeated "Could not process key for channel" IllegalStateException for unix sockets #4865

Closed
mpuncel opened this issue May 11, 2020 · 10 comments

Comments

@mpuncel

mpuncel commented May 11, 2020

Jetty version
9.4.28.v20200408

Java version
11

OS type/version
Centos7

Description
I'm not sure if this is a Jetty issue or a jnr-unixsocket issue, but I'd appreciate any directional help on figuring this one out.

Under load, we sometimes see exceptions like the following when using unix sockets (we've never seen this with a TCP port):

2020-05-11T14:10:34,520  WARN [jetty-unix-thread-184] [SpeleoTraceId=zBAkHZMVSHHEg] io.ManagedSelector - Could not process key for channel jnr.unixsocket.UnixSocketChannel@34b89bcd
java.lang.IllegalStateException: key=jnr.enxio.channels.PollSelectionKey@28fa4db9, att=null, iOps=0, rOps=0
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:547)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:401)

This exception will be thrown over and over again for the same channel/key combination.

I'm guessing that select() keeps returning the same key, but that key doesn't have a usable attachment, so we keep revisiting the same key over and over in processSelected().

I'd appreciate any pointers on how to track this down, thanks! I'm able to reproduce this reliably by sending a bunch of concurrent requests to an HTTP endpoint that sleeps for a few seconds.
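
For context, the selector-processing pattern the stack trace points at looks roughly like the sketch below. This is a simplified, illustrative sketch of the general java.nio pattern, not Jetty's actual ManagedSelector code; the class and method names are made up.

import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

// Simplified, hypothetical sketch of the selector-processing pattern implied
// by the stack trace above; not Jetty's real code.
final class SelectorLoopSketch
{
    static void processSelected(Selector selector)
    {
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext())
        {
            SelectionKey key = it.next();
            it.remove();
            if (!key.isValid())
                continue;
            Object attachment = key.attachment();
            if (attachment == null)
            {
                // The key was selected but nothing is attached to it. Unless
                // the key is cancelled (or its channel closed), select() can
                // keep returning it and the same failure repeats on every pass.
                throw new IllegalStateException("key=" + key + ", att=null" +
                    ", iOps=" + key.interestOps() + ", rOps=" + key.readyOps());
            }
            // ... otherwise hand the attached endpoint over for fill/flush ...
        }
    }
}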

@sbordet
Contributor

sbordet commented May 11, 2020

@mpuncel if you can reproduce easily, can you please enable DEBUG logging and attach here the log file?

@gregw
Copy link
Contributor

gregw commented May 12, 2020

Note that with the 9.4.29 release that is just about to come out, we have probably improved the failure behaviour in this case as a result of some other changes. On exception handling we used to close only if there was an endpoint attachment; now, if there is no endpoint, we close the channel. Hopefully this will be enough to remove the key from the selector, so once something has gone bad, it doesn't keep going bad. If that is not enough to avoid looping on that key, then we should explicitly cancel the key.
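
A rough sketch of the defensive handling described above (illustrative only, this is not the actual 9.4.29 change): if processing a selected key fails and no endpoint is attached, close the channel and cancel the key so the selector stops re-selecting it.

import java.io.IOException;
import java.nio.channels.SelectionKey;

// Illustrative sketch of the failure handling described above; not the
// actual Jetty 9.4.29 patch, just the general idea.
final class SelectedKeyFailureSketch
{
    static void onProcessingFailure(SelectionKey key)
    {
        Object attachment = key.attachment();
        if (attachment instanceof AutoCloseable)
        {
            // An endpoint is attached: close it and let its normal close path
            // clean up the key and channel.
            try { ((AutoCloseable)attachment).close(); } catch (Exception ignored) {}
        }
        else
        {
            // No endpoint attached: close the channel so the key is removed
            // from the selector, and cancel explicitly in case closing alone
            // is not enough to stop it from being re-selected.
            try { key.channel().close(); } catch (IOException ignored) {}
            key.cancel();
        }
    }
}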

@gregw
Contributor

gregw commented May 12, 2020

Also thinking about the root cause of this issue... if it was a race, where the key was being selected before the attachment was attached, then I would think the attachment would eventually be set and the exceptions would stop. The fact that they don't stop suggests that it is not a race, but a lost attachment???

@mpuncel
Author

mpuncel commented May 12, 2020

if you can reproduce easily, can you please enable DEBUG logging and attach here the log file?

Yep, will do! I'll reply to this issue when I have them.

The fact that they don't stop suggests that it is not a race, but a lost attachment???

Yeah, it does seem that way, although it could be that enough keys get into a bad state that the exception spam itself prevents (or greatly slows down) the keys transitioning back to a good state. When this happens it usually happens for a lot of keys at once, and it seems to be related to the number of concurrent connections/requests Jetty is processing.

@mpuncel
Author

mpuncel commented May 13, 2020

I have some logs now and am trying to pare them down into something shareable. Do you have any suggestions for what we want to see? Should I just filter for lines that contain the relevant key or channel ID from the exception message?

@mpuncel
Author

mpuncel commented May 13, 2020

Here's a go at producing some logs. This is the first instance of the exception:

2020-05-13T21:13:55,004  WARN [jetty-unix-thread-175] io.ManagedSelector - Could not process key for channel jnr.unixsocket.UnixSocketChannel@4cd79ce1
java.lang.IllegalStateException: key=jnr.enxio.channels.PollSelectionKey@4ec756c0, att=null, iOps=0, rOps=0
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:547)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:401)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

The exception happened 370 times

$ grep 'Could not process key for channel jnr.unixsocket.UnixSocketChannel@4cd79ce1' exemplar.log.active | wc -l
     370

The first timestamp is 2020-05-13T21:13:52,752 and the last is 2020-05-13T21:13:56,572, so it took about 4 seconds to "recover". Under a production workload I've seen the same exception logged for over a minute (I just found a 67-second run), so I think it can often take longer to recover, possibly depending on how busy the server is.

Before each exception I see this line, confirming the attachment is null

2020-05-13T21:13:55,004 DEBUG [jetty-unix-thread-175] [SpeleoTraceId=zZRfWFjjAWPBe] io.ManagedSelector - selected 0 jnr.enxio.channels.PollSelectionKey@4ec756c0 null

Eventually I see it get an attachment, suggesting it's a race after all

2020-05-13T21:13:56,775 DEBUG [jetty-unix-thread-189] io.ManagedSelector - selected 0 jnr.enxio.channels.PollSelectionKey@4ec756c0 UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=FI,flush=-,to=37/30000}{io=0/1,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,ageck=0}

after that, I don't see the exception again for that same PollSelectionKey.

Backing up a bit to track that UnixSocketEndPoint@5db2cc31, here are the first 10 lines where it appears:

2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.IdleTimeout - UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}-><null> idle timeout check, elapsed: 0 ms, remaining: 30000 ms
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] server.HttpChannel - new HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0} -> UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}-><null>,null,HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0}
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] server.HttpConnection - New HTTP Connection HttpConnection@4c31b120::UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}-><null>
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.AbstractEndPoint - onOpen UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.AbstractConnection - onOpen HttpConnection@4c31b120::UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.AbstractConnection - fillInterested HttpConnection@4c31b120::UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=-,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.FillInterest - interested FillInterest@7e4d9f17{AC.ReadCB@4c31b120{HttpConnection@4c31b120::UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=FI,flush=-,to=0/30000}{io=0/0,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}}}
2020-05-13T21:13:56,737 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.ChannelEndPoint - changeInterests p=false 0->1 for UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=FI,flush=-,to=0/30000}{io=0/1,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}
2020-05-13T21:13:56,738 DEBUG [jetty-unix-thread-224] [SpeleoTraceId=zgYcJQEMfJkQP] io.ManagedSelector - Created UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=FI,flush=-,to=0/30000}{io=0/1,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0}
2020-05-13T21:13:56,775 DEBUG [jetty-unix-thread-189] io.ManagedSelector - selected 0 jnr.enxio.channels.PollSelectionKey@4ec756c0 UnixSocketEndPoint@5db2cc31{null<->null,OPEN,fill=FI,flush=-,to=37/30000}{io=0/1,kio=0,kro=0}->HttpConnection@4c31b120[p=HttpParser{s=START,0 of -1},g=HttpGenerator@7293fd3{s=START}]=>HttpChannelOverHttp@73f15d9c{s=HttpChannelState@62e84875{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=0,c=false/false,a=IDLE,uri=null,age=0} 

Let me know if there's anything else I should look for
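
For what it's worth, the pattern above (a key selected with att=null for a while, then the same key showing up with an endpoint attached) is consistent with a window between registering a channel with a selector and attaching the endpoint to the resulting key. A hypothetical sketch of that window, with made-up names (this is not Jetty's or jnr-unixsocket's actual code):

import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical illustration of a register-then-attach window; the flow and
// names are invented for clarity, not taken from Jetty or jnr-unixsocket.
final class RegisterAttachWindowSketch
{
    static void accept(Selector selector, SelectableChannel channel, Object endPoint) throws Exception
    {
        channel.configureBlocking(false);

        // Registering without an attachment yields a key whose attachment is null.
        SelectionKey key = channel.register(selector, SelectionKey.OP_READ);

        // If the selector thread selects this key before the attach below runs
        // (or before the attachment becomes visible to it), processing sees
        // att=null, matching the "selected 0 ... null" lines in the logs.
        key.attach(endPoint);

        // Registering with the attachment in the same call,
        // channel.register(selector, ops, endPoint), closes this window.
    }
}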

@sbordet
Contributor

sbordet commented May 14, 2020

@mpuncel if you cannot redact and share the full logs (keeping all the lines), then your filtering approach is correct.

We want to track every possible object related to that SelectionKey to try to understand what happened.

Track PollSelectionKey@4ec756c0, UnixSocketEndPoint@5db2cc31, HttpConnection@4c31b120, HttpChannelOverHttp@73f15d9c, and so on.

The attachment disappearing, then reappearing, and then finally disappearing does indeed seem related to an issue in JNR, as we don't update the attachment once we have set it.

Bear in mind that support for unix sockets in Jetty is highly experimental.
You are probably better off using the network loopback interface if you want performance and stability.

FTR, native unix-domain socket support could arrive in Java 15+ via JEP 380 (https://openjdk.java.net/jeps/380).
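
For reference, the JDK-native unix-domain socket API from JEP 380 looks roughly like the sketch below. It needs a JDK that ships JEP 380 (Java 16 in practice), is independent of Jetty 9.4.x and JNR, and the socket path is just an example.

import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the JDK-native unix-domain socket API from JEP 380; requires a
// JDK that includes it (Java 16 in practice). Shown for reference only; this
// is not something Jetty 9.4.x / jnr-unixsocket uses.
public class UnixSocketJep380Sketch
{
    public static void main(String[] args) throws Exception
    {
        Path socketPath = Path.of("/tmp/example.sock"); // example path
        Files.deleteIfExists(socketPath);
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);

        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX))
        {
            server.bind(address);
            try (SocketChannel client = SocketChannel.open(address);
                 SocketChannel accepted = server.accept())
            {
                // These channels are SelectableChannels, so they plug into the
                // same Selector-based machinery as regular TCP channels.
            }
        }
        finally
        {
            Files.deleteIfExists(socketPath);
        }
    }
}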

@mpuncel
Author

mpuncel commented May 18, 2020

Thanks for the info!

@gregw mentioned earlier that 9.4.29 might mitigate the issue; do you know when that release will go out?

@sbordet
Contributor

sbordet commented May 19, 2020

@mpuncel I believe @gregw was referring to #4798 and we plan to release 9.4.29 this week.

@janbartel
Contributor

@mpuncel 9.4.29 and even 9.4.30 have been released. If either one of those solved the problem, please close the issue; otherwise please provide updated info.
