IMap GetOperation in 3.5 EA2-Snapshot throws IllegalStateException (wrong partition); does not occur in 3.4.2 #5341
Comments
@pveentjer, or anyone else: I tried this many times this afternoon, and it threw the exception shown above each time, while hz 3.4.2 did not. I didn't test with the EA release, only with the latest snapshot artifact.
I haven't had an extensive look at the stack trace yet, but to reproduce your problem on our side it would be great if you could give us some more information and maybe a code snippet :)
@Donnerbart, thanks for taking a look and for the advice. Judging from the stack trace, I think hz 3.5 changed its internal invocation handling compared to hz 3.4.2. In addition, here is the procedure I followed:
I think something fishy is going on when another invocation is made from within an operation. This should not be a problem for the same partition, but it is not allowed for a different one because it can cause deadlocks. Apparently this isn't detected correctly, so the illegal call is allowed through and is only detected once it is executed. The top call is a queue PollOperation, and this triggers an internal map.get invocation. I'll have a closer look at why it isn't detected immediately.
I think I found the cause: we have a method that checks whether an invocation can be made from the calling thread and whether the operation can also be executed on the calling thread.
In this case we check whether the calling thread is the owner of that partition, and apparently partitions 225 and 157 are mapped to the same thread. However, instead of offloading the inner operation to the correct OperationRunner (every partition has its own), I guess we are using the OperationRunner of the outer operation, and there we have a hard check that the partitions must be the same. So there is a discrepancy between the two checks.
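The discrepancy between the two checks can be illustrated with a small self-contained model. This is only a sketch, not Hazelcast's code: the class `PartitionCheckSketch`, its method names, the modulo-based partition-to-thread mapping, and the thread count of 4 are assumptions made for illustration. Only the partition ids 225 and 157 come from the report.

```java
// Hypothetical model, NOT Hazelcast source. The modulo mapping and all names are assumptions.
final class PartitionCheckSketch {

    // Assumed mapping from a partition id to the index of the partition thread that owns it.
    static int threadIndex(int partitionId, int threadCount) {
        return partitionId % threadCount;
    }

    // Thread-level check: do both partitions belong to the same partition thread?
    static boolean sameThread(int callerPartitionId, int targetPartitionId, int threadCount) {
        return threadIndex(callerPartitionId, threadCount) == threadIndex(targetPartitionId, threadCount);
    }

    // Runner-level check: a per-partition runner only accepts operations for its own partition.
    static boolean runnerAccepts(int runnerPartitionId, int operationPartitionId) {
        return runnerPartitionId == operationPartitionId;
    }

    public static void main(String[] args) {
        int threadCount = 4;   // assumed value, chosen so that both partitions share a thread
        int outer = 225;       // partition of the outer operation (queue poll)
        int inner = 157;       // partition of the inner operation (map get)

        System.out.println("same thread: " + sameThread(outer, inner, threadCount));
        System.out.println("outer runner accepts inner op: " + runnerAccepts(outer, inner));
    }
}
```

With these assumed values the thread-level check passes while the runner-level check fails, which matches the situation described: the call is allowed based on the thread, then rejected by the outer operation's runner.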
I managed to reproduce it:
@pveentjer, thanks for taking a look, for pinpointing the location, and for reproducing it. I wonder why it doesn't occur in 3.4.x. Has a lot changed in 3.5? If I am using it wrong, I will modify my code.
The 3.5 release has an OperationRunnerImpl for every partition, something that is not available in the 3.4 release.
@pveentjer, thanks for the feedback. I hope it can be fixed as expected.
The problem was that the system didn't deal correctly with an inner call between 2 different partitions that are mapped to the same thread. The error was that the OperationRunnerImpl for the inner operation was not obtained using the partition id of the inner operation, but by accessing OperationThread.currentOperationRunner. So the inner operation would run on the outer OperationRunner, and then you get the exception. Tests have also been added for local and remote calls that exercise this behaviour. 1 change is in ClassicOperationScheduler and 6 changes are in the tests.
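A minimal sketch of the difference between the two lookups follows. It is a simplified model under stated assumptions, not the actual patch: `RunnerLookupSketch`, `Runner`, `runnerFor`, `runInnerBuggy`, and `runInnerFixed` are hypothetical names, and the real OperationRunnerImpl and ClassicOperationScheduler are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the bug and the fix; these are NOT Hazelcast's classes.
final class RunnerLookupSketch {

    // Stand-in for a per-partition operation runner with a hard partition check.
    static final class Runner {
        final int partitionId;

        Runner(int partitionId) {
            this.partitionId = partitionId;
        }

        void run(int operationPartitionId) {
            if (operationPartitionId != partitionId) {
                throw new IllegalStateException("Wrong partition: operation for partition "
                        + operationPartitionId + " cannot run on runner for partition " + partitionId);
            }
        }
    }

    private final Map<Integer, Runner> runners = new HashMap<>();

    // Runner of the operation currently executing on this (single, modelled) thread.
    private Runner currentRunner;

    // One runner per partition, created on first use.
    Runner runnerFor(int partitionId) {
        return runners.computeIfAbsent(partitionId, id -> new Runner(id));
    }

    // Marks the outer operation as running on its own partition's runner.
    void startOuter(int outerPartitionId) {
        currentRunner = runnerFor(outerPartitionId);
    }

    // Buggy lookup: reuses the runner of the outer operation for the inner one.
    void runInnerBuggy(int innerPartitionId) {
        currentRunner.run(innerPartitionId);
    }

    // Fixed lookup: resolves the runner from the inner operation's own partition id.
    void runInnerFixed(int innerPartitionId) {
        runnerFor(innerPartitionId).run(innerPartitionId);
    }
}
```

In this model, calling `startOuter(225)` followed by `runInnerBuggy(157)` throws an IllegalStateException, while `runInnerFixed(157)` completes normally because the runner is looked up by the inner operation's partition id.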
Fixed:
@pveentjer, thanks for the quick fix. I hope it will be merged into master soon.
hz team, after upgrading to the latest 3.5 EA2 snapshot I get the exception below, while everything works fine in 3.4.2. Here is my stack trace; I hope someone can help me locate the problem.