Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCK Challenge: connectionLostFor1500msTest expects reconnect after server responds with 204 #1007

Closed
andymc12 opened this issue Aug 9, 2021 · 20 comments

Comments

@andymc12
Copy link
Contributor

andymc12 commented Aug 9, 2021

Specification:
Jakarta RESTful Web Services 3.0 (EE 9.0/9.1) Full Platform TCK

Challenged test(s):
com.sun.ts.tests.jaxrs.jaxrs21.ee.sse.sseeventsource.JAXRSClient#connectionLostForDefault500msTest

TCK version:
9.0.0/9.1.0

Tested implementation:
Open Liberty using RESTEasy

Description:

The test in question is intended to test that a client will reconnect after an initial connection is terminated by the server via sink.close(). A default SSE client should be expected to attempt reconnect in 500 milliseconds and then receive an event in order to pass the test. The problem here is that the Javadoc for SseEventSource says that when a client receives a 204 No Content response, then it should not attempt to reconnect.

204 - This indicates that server has no events to send. Only onComplete is invoked.

The result of the server terminating the connection with the client is a 204 since it sends no content - therefore the proper behavior of the client in this case should be to not reconnect.

The reason that this has not been discovered until now is that RESTEasy did not properly implement the 204 behavior as documented in the javadoc. This behavior is not tested in the TCK. As such, prior to recent changes in the Open Liberty fork of RESTEasy, the test passed because the client did reconnect on receiving a 204.

I propose:
(1) That this test be excluded for now.
(2) That we add a new test for the proper 204 behavior (ensuring that the client does not reconnect), and
(3) That we update the connectionLostFor1500msTest so that the initial connection returns an event before terminating the connection. This will ensure that the response code for the initial request will be 200, and that the client should attempt to reconnect after waiting 500ms.

@scottmarlow
Copy link

@spericas as mentioned in the email I sent , regarding the TCK process, please handle this TCK challenge in the best way for the Jakarta RESTful Web Services spec team. The outcome should be that you should accept or reject the challenge to resolve it in the next two weeks (or less). As also mentioned in the email, you have a few options to handle these challenges (committers vote, lazy consensus or something else that follows Eclipse development process).

I'm going to start encouraging that we follow the TCK process for challenges, so that all Spec teams are aware of the different ways that challenges can be handled. Hope this helps!

@jansupol
Copy link
Contributor

jansupol commented Aug 10, 2021

The tests tries to simulate the 3x reconnect, so it calls 3 times

 @Produces(MediaType.SERVER_SENT_EVENTS)
 public void sseLost(@Context SseEventSink sink, @Context Sse sse) {
        sink.close();
}

What I see as the major "issue" with the test is what should sink.close() do. I understand RestEasy responds with 204, but to me it is not clear why. @andymc12 Can you please answer this? The client calls try (SseEventSource source = SseEventSource.target(target).build()) { to hit the SSE endpoint and it is hard to distinguish the response codes within SseEventSource. Do I miss something?

@andymc12
Copy link
Contributor Author

@jansupol - thanks for the question and for reviewing the challenge.

What I see as the major "issue" with the test is what should sink.close() do. I understand RestEasy responds with 204, but to me it is not clear why. @andymc12 Can you please answer this?

I didn't write the RESTEasy code, so I cannot say for certain why it behaves that way. But I think I understand the design decision. The spec says that SSE should behave similar to asynchronous resource methods. An async response that returned no non-null entity would result in a 204. In this test case, there is no non-null entity (or event) sent back to the client, so I think 204 makes sense.

Further, I couldn't figure out a way for an SSE resource method to return a 204 otherwise... because it uses the async approach, it's not possible to return a Response.noContent().build() from the resource method, and there is no AsyncResponse that could be used to return null. So I think closing the sink without sending an event may be the only way to send a 204. It might be possible to use filters to return a 204, but that's not as feasible as:

if (anyEventsForThisClient(sink)) {
    sendEvents(sink);
}
sink.close(); // close with 204 if no events or 200 if there are

The client calls try (SseEventSource source = SseEventSource.target(target).build()) { to hit the SSE endpoint and it is hard to distinguish the response codes within SseEventSource. Do I miss something?

No, that is right. It's not easy to tell the response code when using SseEventSource. You could use a client filter to log the response code or a tool like tcptunnel. My colleagues and I were able to determine what was happening by investigating why our function tests were failing when the server explicitly returned a 204 - RESTEasy was still reconnecting. Then we modified RESTEasy's code to not reconnect on 204s, and that's when this TCK test started failing.

@spericas
Copy link
Contributor

@andymc12 If I understand it correctly, a recent change to RESTEasy has resulted in new logic to detect the immediate closing of the connection, resulting in a 204 instead a 200. It seems this new logic would wait until it sees the first connection event (a close in this case) to decide which status code to return.

Now, this new logic seems both clever and problematic. How long do you wait for that connection event to arrive? Waiting long enough could cause the client code to time out. What it was likely doing before, and what other implementations are doing, is immediately returning a 200, which is not as clever but certainly less problematic.

Perhaps the behavior is different to what I'm describing here. If so, please help us understand it to better evaluate the challenge in question.

@andymc12
Copy link
Contributor Author

@spericas

If I understand it correctly, a recent change to RESTEasy has resulted in new logic to detect the immediate closing of the connection, resulting in a 204 instead a 200. It seems this new logic would wait until it sees the first connection event (a close in this case) to decide which status code to return.

Sorry, no. The logic to return a 204 has been there since JAX-RS 2.1, afaict. @jamezp may be able to confirm.

Now, this new logic seems both clever and problematic. How long do you wait for that connection event to arrive? Waiting long enough could cause the client code to time out. What it was likely doing before, and what other implementations are doing, is immediately returning a 200, which is not as clever but certainly less problematic.

This should probably be considered a side discussion to the challenge, but you raise a good point. In the event of a client timeout, wouldn't the client just reconnect (assuming it hadn't already closed it's SseEventSource)? In that case, the client reconnects (which the server had hoped to avoid) but only because the server didn't send any event or close the sink within the client's read timeout. In any case, we may want to clarify how a server should send a 204 in the spec/javadocs.

But I agree with you about the other implementations immediately returning a 200. I believe that is why we were able to pass the 2.1 TCK with Apache CXF.

Perhaps the behavior is different to what I'm describing here. If so, please help us understand it to better evaluate the challenge in question.

The new logic that I added was to handle the 204 on the client side. So far, I have only made this change in Liberty's fork of RESTEasy, but I would like to contribute it back to the main RESTEasy repo. I added this new logic so that we would conform to the javadoc mentioned in the initial description.

Hope this helps! Thanks for investigating.

@spericas
Copy link
Contributor

@spericas

If I understand it correctly, a recent change to RESTEasy has resulted in new logic to detect the immediate closing of the connection, resulting in a 204 instead a 200. It seems this new logic would wait until it sees the first connection event (a close in this case) to decide which status code to return.

Sorry, no. The logic to return a 204 has been there since JAX-RS 2.1, afaict. @jamezp may be able to confirm.

OK, thanks for the clarification. So the improper handling of a 204 response on the client side resulted on a pass before.

Now, this new logic seems both clever and problematic. How long do you wait for that connection event to arrive? Waiting long enough could cause the client code to time out. What it was likely doing before, and what other implementations are doing, is immediately returning a 200, which is not as clever but certainly less problematic.

This should probably be considered a side discussion to the challenge, but you raise a good point. In the event of a client timeout, wouldn't the client just reconnect (assuming it hadn't already closed it's SseEventSource)? In that case, the client reconnects (which the server had hoped to avoid) but only because the server didn't send any event or close the sink within the client's read timeout. In any case, we may want to clarify how a server should send a 204 in the spec/javadocs.

I'm not sure it's a side discussion to be honest. We are asking the community to invalidate/modify a TCK test because of what appears to be a specific behavior in one of the implementations. If we all agree the behavior is desirable, we should probably update the spec and formalize it; if not, then the test should be honored.

But I agree with you about the other implementations immediately returning a 200. I believe that is why we were able to pass the 2.1 TCK with Apache CXF.

Jersey also returns 200 immediately.

Perhaps the behavior is different to what I'm describing here. If so, please help us understand it to better evaluate the challenge in question.

The new logic that I added was to handle the 204 on the client side. So far, I have only made this change in Liberty's fork of RESTEasy, but I would like to contribute it back to the main RESTEasy repo. I added this new logic so that we would conform to the javadoc mentioned in the initial description.

The 204 handling on the client side makes perfect sense to me; it just happens to expose some server-side behavior that is worth discussing in the context of this challenge.

@jamezp
Copy link
Contributor

jamezp commented Aug 11, 2021

Sorry, no. The logic to return a 204 has been there since JAX-RS 2.1, afaict. @jamezp may be able to confirm.

I'm fairly new to RESTEasy so I'm still learning, but I can confirm that 202 204 has been getting returned on null responses since 2008 so it's been a while :)

@andymc12
Copy link
Contributor Author

Now, this new logic seems both clever and problematic. How long do you wait for that connection event to arrive? Waiting long enough could cause the client code to time out. What it was likely doing before, and what other implementations are doing, is immediately returning a 200, which is not as clever but certainly less problematic.

This should probably be considered a side discussion to the challenge, but you raise a good point. In the event of a client timeout, wouldn't the client just reconnect (assuming it hadn't already closed it's SseEventSource)? In that case, the client reconnects (which the server had hoped to avoid) but only because the server didn't send any event or close the sink within the client's read timeout. In any case, we may want to clarify how a server should send a 204 in the spec/javadocs.

I'm not sure it's a side discussion to be honest. We are asking the community to invalidate/modify a TCK test because of what appears to be a specific behavior in one of the implementations. If we all agree the behavior is desirable, we should probably update the spec and formalize it; if not, then the test should be honored.

I view it as a side discussion because the intent of the test case is to verify client-side behavior (reconnecting after receiving a non-204 response code). Ideally, the test case would be written using some sort of mocked server (like MockServer or WireMock, etc.) that supplies exactly the server responses we want rather than depending on the vendor implementation to supply both sides of the test. Maybe that is something we can improve on in 3.1 and beyond.

If the server returned a 200, then RESTEasy's client implementation would pass - which is the spirit of the test. The test description doesn't mention testing that the server should return a 200 - just that the client should reconnect after 500ms. So RESTEasy's client is operating correctly as far as this test is concerned. The only downside to excluding it is that if a new vendor certifies with this broken - where their client does not reconnect on a 200 response where the connection is lost. All of the "usual suspects" already pass this test.

To really fix this situation, it sounds like we would need to add to my original suggestions:

I propose:
(1) That this test be excluded for now.
(2) That we add a new test for the proper 204 behavior (ensuring that the client does not reconnect), and
(3) That we update the connectionLostFor1500msTest so that the initial connection returns an event before terminating the connection. This will ensure that the response code for the initial request will be 200, and that the client should attempt to reconnect after waiting 500ms.

(4) Add a clarification and server side test for what is the behavior of closing the sink without sending any events (200 or 204, or something else).

Personally, I wish that we could add and update tests to the TCK after the release. That wouldn't help with (4) since that would be new behavior, but it would help with (2) and (3) - and would prevent any newcomers from getting away with a broken implementation.

Thanks again!

@spericas
Copy link
Contributor

Now, this new logic seems both clever and problematic. How long do you wait for that connection event to arrive? Waiting long enough could cause the client code to time out. What it was likely doing before, and what other implementations are doing, is immediately returning a 200, which is not as clever but certainly less problematic.

This should probably be considered a side discussion to the challenge, but you raise a good point. In the event of a client timeout, wouldn't the client just reconnect (assuming it hadn't already closed it's SseEventSource)? In that case, the client reconnects (which the server had hoped to avoid) but only because the server didn't send any event or close the sink within the client's read timeout. In any case, we may want to clarify how a server should send a 204 in the spec/javadocs.

I'm not sure it's a side discussion to be honest. We are asking the community to invalidate/modify a TCK test because of what appears to be a specific behavior in one of the implementations. If we all agree the behavior is desirable, we should probably update the spec and formalize it; if not, then the test should be honored.

I view it as a side discussion because the intent of the test case is to verify client-side behavior (reconnecting after receiving a non-204 response code). Ideally, the test case would be written using some sort of mocked server (like MockServer or WireMock, etc.) that supplies exactly the server responses we want rather than depending on the vendor implementation to supply both sides of the test. Maybe that is something we can improve on in 3.1 and beyond.

Clearly, it is not the intent of the test, but, client and server APIs are all part of the same spec: there are no bonus points for excellent compatibility of one part and not the other. If an incompatibility is found indirectly, it is still an incompatibility IMO.

It appears RESTEasy's behavior is intentional, in that, there's special code to handle this scenario that isn't mandated by the Jakarta REST spec --of course, it is not forbidden either but is arguably less natural than returning 200.

If the server returned a 200, then RESTEasy's client implementation would pass - which is the spirit of the test. The test description doesn't mention testing that the server should return a 200 - just that the client should reconnect after 500ms. So RESTEasy's client is operating correctly as far as this test is concerned. The only downside to excluding it is that if a new vendor certifies with this broken - where their client does not reconnect on a 200 response where the connection is lost. All of the "usual suspects" already pass this test.

To really fix this situation, it sounds like we would need to add to my original suggestions:

I propose:
(1) That this test be excluded for now.
(2) That we add a new test for the proper 204 behavior (ensuring that the client does not reconnect), and
(3) That we update the connectionLostFor1500msTest so that the initial connection returns an event before terminating the connection. This will ensure that the response code for the initial request will be 200, and that the client should attempt to reconnect after waiting 500ms.

(4) Add a clarification and server side test for what is the behavior of closing the sink without sending any events (200 or 204, or something else).

I'm not in favor of (1), it just sets a bad precedence. (3) seems reasonable and very much in the spirit of the test in question. Together with (3), we should add a new test for the 200 vs. 204 behavior and decide which one to enforce.

@andymc12
Copy link
Contributor Author

Are we allowed to change the TCK tests post-release? If so, let's do it. My understanding was that that was not allowed, and that exclusion was the only way until the next release, but I'd prefer to keep the tests included - just fixed.

@jansupol
Copy link
Contributor

So far the easiest "fix" for the test seems to me to add a ContainerResponseFilter that would make sure to return HTTP 200. I am not sure whether it is possible to release Jakarta REST 3.0.2 TCK with a fix, or we need to exclude the test.

Unrelated to Jakarta REST (at the moment), there is an SSE requirement in MP REST Client, and the MP REST Client SSE tests do not even expect the reconnect. That could be exactly because the testing implementation used HTTP 204.

@spericas
Copy link
Contributor

Are we allowed to change the TCK tests post-release? If so, let's do it. My understanding was that that was not allowed, and that exclusion was the only way until the next release, but I'd prefer to keep the tests included - just fixed.

I'm not certain about the process myself. Perhaps @scottmarlow can help us here? We may have to exclude it and fix it for the next release.

@spericas
Copy link
Contributor

@andymc12 Is there any setting or way by which RESTEasy can return 200 instead of 204? I'm still puzzled a bit as to how this works exactly: there must be some sort of timeout to wait for the first event and decide between 200 or 204.

@andymc12
Copy link
Contributor Author

@spericas I'm looking into that. Since this is not GA in Open Liberty, we could probably work around this by returning a 200 instead of a 204, though that won't help RESTEasy users outside of Open Liberty (like Wildfly, Quarkus, standalone, etc.) since it will be a breaking change for them.

I think the way it works is like an async server request. The server does not respond with anything until either (1) the sink is closed (204), (2) an event is sent (200), (3) an error has occurred (less sure on this... 500 or 503) - and maybe other actions trigger the response to be sent.

If I can workaround the issue, I'll let you know. Thanks.

@andymc12
Copy link
Contributor Author

I am able to work around the issue with the changes in OpenLiberty/open-liberty#18172 - that said, I really like the original RESTEasy behavior better. The changes that I made to the client side basically behave differently when a 200 is returned with no events (reconnect) than it does when a 200 is returned with 1 or more events (don't reconnect) - that seems wrong to me.

RESTEasy's server behavior of returning a 204 when there are no events seems more intuitive (though not currently defined in the spec/javadoc), and it's behavior of reconnecting after all 200 responses are closed seems more consistent with the javadoc. Note that RESTEasy's client doesn't currently handle 204s correctly, but that is being remedied under RESTEASY-2985.

Hope this helps

@spericas
Copy link
Contributor

@andymc12 That's good news. We can certainly discuss this in a future release. Will you close this issue then?

@andymc12
Copy link
Contributor Author

Will do - thanks!

@scottmarlow
Copy link

@spericas should this issue be marked as rejected since the test will not be excluded now?

I would like to update https://github.com/eclipse-ee4j/jakartaee-tck/projects/2 to mark this challenge as either accepted or rejected.

Thanks!

@spericas
Copy link
Contributor

@spericas should this issue be marked as rejected since the test will not be excluded now?

I would like to update https://github.com/eclipse-ee4j/jakartaee-tck/projects/2 to mark this challenge as either accepted or rejected.

Thanks!

Yes, please mark it as rejected.

@scottmarlow
Copy link

@andymc12 @spericas could one of you please mark this challenge as rejected as I do not have permission to make that change. Thanks!

jamezp added a commit to jamezp/resteasy that referenced this issue Oct 9, 2023
…received. Note that an SseEventSink.close() in RESTEasy will result in a 204. In other implementations, this results in a 200. Add a way to override this with a system property or configuration parameter. See jakartaee/rest#1007 for details.

https://issues.redhat.com/browse/RESTEASY-2985
Signed-off-by: James R. Perkins <jperkins@redhat.com>
jamezp added a commit to jamezp/resteasy that referenced this issue Oct 18, 2023
…received. Note that an SseEventSink.close() in RESTEasy will result in a 204. In other implementations, this results in a 200. Add a way to override this with a system property or configuration parameter. See jakartaee/rest#1007 for details.

https://issues.redhat.com/browse/RESTEASY-2985
Signed-off-by: James R. Perkins <jperkins@redhat.com>
jamezp added a commit to jamezp/resteasy that referenced this issue Oct 18, 2023
…received. Note that an SseEventSink.close() in RESTEasy will result in a 204. In other implementations, this results in a 200. Add a way to override this with a system property or configuration parameter. See jakartaee/rest#1007 for details.

https://issues.redhat.com/browse/RESTEASY-2985
Signed-off-by: James R. Perkins <jperkins@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants