ISPN-14115 Potential deadlock in index writer threads under heavy load #10292
Conversation
to express the idea that this is true if and only if we're replacing the type of the entity
At this stage we know that the remove and the add are performed on different indexes, so we can execute them concurrently
@yrodiere please review it when you have time
LGTM, see comments below.
// We don't need to wait for a possible removeFromIndexes operation,
// since if it exists, the oldValue is UNKNOWN or replacedWithADifferentEntity is true,
// which implies that the delete and the add operations are related to different indexes
Interesting. Did you get confirmation that this was indeed the case for the person who reported the problem?
From my tests, and from looking at the code (see in particular commit 3c8c832), it seems so.
I asked the original issue reporter on the original issue Jira page.
So given this change, it makes me think that the `removeFromIndexes` operation cannot complete successfully until the `updateIndexes` invocation has occurred. Can we not just invoke `updateIndexes` first and then call `removeFromIndexes`?
> it makes me think that the `removeFromIndexes` operation cannot complete successfully until the `updateIndexes` invocation has occurred

I'm not sure why you think that?
The two operations are unrelated since they apply to two different indexes.
They literally affect two different files on the filesystem.
They can execute in any order, so I'm not sure how what you're suggesting would change anything.
It makes me think that because having the second index update wait on the first causes neither to complete. If they were truly independent operations, with no interdependency, it shouldn't matter in what order they wait on each other. It just seems fishy to me.
Before the change, the only operation submitted to the indexer is `removeFromIndexes`. The other operation is never submitted, because `removeFromIndexes` never finishes. If that hangs by itself, how would adding additional operations cause it to not hang? The only way is if the operations were interdependent. That is why I am saying something must be going on between the two operations; either that, or this change wouldn't fix the issue.
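To make the "interdependency" being discussed concrete: chaining the second operation via `thenCompose` means the update cannot even be submitted until the delete's future completes. This is a standalone sketch (not Infinispan's actual indexing code; the list and names are illustrative):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class ThenComposeDependency {
    public static void main(String[] args) {
        List<String> order = new CopyOnWriteArrayList<>();
        // Chained with thenCompose: "update" is only submitted after "delete"
        // completes. If "delete" never completes, "update" never even starts.
        CompletableFuture<Void> chained =
                CompletableFuture.runAsync(() -> order.add("delete"))
                        .thenCompose(ignored ->
                                CompletableFuture.runAsync(() -> order.add("update")));
        chained.join();
        System.out.println(order); // delete always precedes update
    }
}
```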
The thread dumps provided in the original issue support the theory that the problem has to do with the queue of operations, not the execution of operations themselves.
My theory (explained in the JIRA ticket) was that this happens:

- Client thread: needs to delete an existing value of type A to replace it with a value of the same type.
- Client thread: submits `delete some entity of type A` to the queue for type A.
- Background thread for type A: picks `delete some entity of type A` from the queue.
- Background thread for type A: executes `delete some entity of type A`.
- Background thread for type A: marks the future as complete.
- Background thread for type A: executes `thenCompose`.
- Background thread for type A: adds `update some entity of type A` to the queue for type A... but the queue is full, because someone else added many other operations! (Let's assume the server is under heavy load.)
- Background thread for type A: blocks, waiting for a free slot in the queue. But this background thread is the only thread able to consume the queue!
- Deadlock: the background thread is waiting for an operation that only itself can perform.

If that's actually the problem, then no, the order of operations is not the problem. What matters is that we're introducing an artificial dependency between the two operations via `thenCompose`.
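The core mechanism of that scenario, stripped down: a single consumer thread trying to produce into its own full, bounded queue. This sketch stands in for the per-type indexing queue (the queue and capacity are illustrative, not Infinispan's actual code); it uses `offer` with a timeout so it prints the result instead of hanging, whereas a blocking `put` here would wait forever:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class SelfDeadlockSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the per-type queue: bounded, with a single consumer thread.
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1);
        queue.put(() -> {}); // heavy load: someone else already filled the queue

        // The single background thread now tries to enqueue its own follow-up
        // operation. With put() this would block forever, because only this
        // same thread drains the queue. offer() with a timeout shows why:
        boolean enqueued = queue.offer(() -> {}, 100, TimeUnit.MILLISECONDS);
        System.out.println("follow-up enqueued: " + enqueued);
    }
}
```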
That being said, you're right that there's something suspicious: if, as Fabio suggested, the two operations are related to two different entity types, then they should be submitted to two different queues (I just checked the Hibernate Search code: they really should). So my theory wouldn't hold if Fabio is right.
In that case, I suppose it would be a slightly more complicated scenario with two background threads:
- Client thread 1: needs to delete an existing entry of type A to replace it with a value of type B.
- Client thread 1: submits `delete some entity of type A` to the queue for type A.
- Background thread for type A: picks `delete some entity of type A` from the queue for type A.
- Background thread for type A: starts executing `delete some entity of type A`.
- Client thread 2: needs to delete an existing entry of type B to replace it with a value of type A.
- Client thread 2: submits `delete some entity of type B` to the queue for type B.
- Background thread for type B: picks `delete some entity of type B` from the queue for type B.
- Background thread for type B: starts executing `delete some entity of type B`.
- Background thread for type A: finishes executing `delete some entity of type A`.
- Background thread for type A: marks the future as complete.
- Background thread for type A: executes `thenCompose`.
- Background thread for type A: adds `update some entity of type B` to the queue for type B... but the queue is full, because someone else added many other operations! (Let's assume the server is under heavy load.)
- Background thread for type A: blocks, waiting for a free slot in the queue for type B.
- Background thread for type B: finishes executing `delete some entity of type B`.
- Background thread for type B: marks the future as complete.
- Background thread for type B: executes `thenCompose`.
- Background thread for type B: adds `update some entity of type A` to the queue for type A... but that queue is full as well.
- Background thread for type B: blocks, waiting for a free slot in the queue for type A.
- Deadlock: background thread for type B is waiting for an operation that only the background thread for type A can perform, and vice-versa.
While complex, I believe this scenario is possible, and gets more likely as an Infinispan node gets hammered with many requests that change the type of values.
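The two-thread scenario above boils down to a circular wait across two bounded queues. This standalone sketch (illustrative names and capacities, not Infinispan's code) reproduces the shape of it, again using `offer` with a timeout so the demo terminates where the real blocking `put` would deadlock:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CrossQueueDeadlockSketch {
    public static void main(String[] args) throws Exception {
        // Stand-ins for the per-type queues, both already full under heavy load.
        BlockingQueue<Runnable> queueA = new ArrayBlockingQueue<>(1);
        BlockingQueue<Runnable> queueB = new ArrayBlockingQueue<>(1);
        queueA.put(() -> {});
        queueB.put(() -> {});

        // Thread A tries to enqueue its follow-up on queue B, and thread B on
        // queue A. With blocking put() both would wait forever: each queue's
        // only consumer is stuck producing into the other queue.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Boolean> aToB = pool.submit(
                () -> queueB.offer(() -> {}, 100, TimeUnit.MILLISECONDS));
        Future<Boolean> bToA = pool.submit(
                () -> queueA.offer(() -> {}, 100, TimeUnit.MILLISECONDS));

        System.out.println("A enqueued on B: " + aToB.get());
        System.out.println("B enqueued on A: " + bToA.get());
        pool.shutdown();
    }
}
```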
And the fix is simple: since operations applied to types A and B require different resources (different indexes), we don't need to use `thenCompose`, which introduces an artificial (and unnecessary) inter-dependency. We can just execute the operations in parallel, and then there's no chance of deadlock.
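The parallel shape of the fix can be sketched like this (a standalone illustration, not the actual patch; the executors stand in for the per-index background threads):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelIndexOps {
    public static void main(String[] args) {
        // Stand-ins for the background threads of two different indexes.
        ExecutorService indexA = Executors.newSingleThreadExecutor();
        ExecutorService indexB = Executors.newSingleThreadExecutor();

        // Independent operations on different indexes: no thenCompose chaining,
        // so neither background thread ever waits on the other's queue.
        CompletableFuture<Void> remove = CompletableFuture.runAsync(
                () -> System.out.println("remove from index A"), indexA);
        CompletableFuture<Void> update = CompletableFuture.runAsync(
                () -> System.out.println("update index B"), indexB);

        // Only the caller waits for both; the two threads proceed independently.
        CompletableFuture.allOf(remove, update).join();
        System.out.println("done");
        indexA.shutdown();
        indexB.shutdown();
    }
}
```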
@fax4ever, you confirm that you weren't able to reproduce the problem when using only one type? And that your patch did fix the problem exhibited in your test:
> @fax4ever, you confirm that you weren't able to reproduce the problem when using only one type? And that your patch did fix the problem exhibited in your test

Yes, I confirm, and yes, it did.
```diff
@@ -159,6 +160,21 @@ public void testOverwriteIndexedValue() {
       StaticTestingErrorHandler.assertAllGood(cache);
    }

+   @Test
+   public void testOverwriteIndexedValue_heavyLoad() {
```
I didn't check this locally, so I'll trust that you checked this failed before your patch.
Actually, it does not fail; it hangs.
Thanks for the review @yrodiere!
Merged, thanks
Thank you!
https://issues.redhat.com/browse/ISPN-14115
Without the commit ac8ab84, the reproducer hangs when running this script on my local machine:

```shell
#!/bin/bash
mvn clean install -pl query -am -DskipTests
mvn clean install -pl query
```

Let me know if it is the same for you.