
ISPN-14115 Potential deadlock in index writer threads under heavy load #10292

Merged
3 commits merged on Sep 8, 2022

Conversation

@fax4ever (Contributor) commented Sep 7, 2022

https://issues.redhat.com/browse/ISPN-14115

Without commit ac8ab84, the reproducer hangs when I run the following script on my local machine:

#!/bin/bash

# Build the query module and the modules it depends on, skipping tests
mvn clean install -pl query -am -DskipTests
# Build the query module again with tests enabled (this runs the reproducer)
mvn clean install -pl query

Let me know if it is the same for you.

  • to express the idea that this is true if and only if we're replacing the type of the entity
  • At this stage we know that the remove and the add are performed on different indexes, so we can execute them concurrently
@fax4ever (Contributor, Author) commented Sep 7, 2022

@yrodiere please review it when you have time

@yrodiere (Contributor) left a comment

LGTM, see comments below.

Comment on lines +432 to +434
// We don't need to wait for a possible removeFromIndexes operation,
// since if it exists, the oldValue is UNKNOWN or replacedWithADifferentEntity is true,
// which implies that the delete and the add operations are related to different indexes
Contributor:

Interesting. Did you get confirmation that this was indeed the case for the person who reported the problem?

@fax4ever (Contributor, Author) commented Sep 7, 2022

From my test and from looking at the code (in particular, see commit 3c8c832), it seems so.
I asked the original reporter on the Jira issue.

Member:

So, given this change, it makes me think that the removeFromIndexes operation cannot complete successfully until the updateIndexes invocation has occurred. Can we not just invoke updateIndexes first and then call removeFromIndexes?

Contributor:

it makes me think that the removeFromIndexes operation cannot complete successfully until the updateIndexes invocation has occurred

I'm not sure why you think that?

The two operations are unrelated since they apply to two different indexes.
They literally affect two different files on the filesystem.

They can execute in any order, so I'm not sure how what you're suggesting would change anything.

Member:

It makes me think that because having the second index update wait on the first causes neither of them to complete. If they were truly independent operations, not interdependent on each other, it shouldn't matter in what order they wait on each other. It just seems fishy to me.

Member:

Before the change, the only operation submitted to the indexer is removeFromIndexes. The other operation is never submitted, because removeFromIndexes never finishes. If that hangs by itself, how would adding additional operations cause it not to hang? The only way is if the operations were interdependent. That is why I am saying something must be going on between the two operations; either that, or this change wouldn't fix the issue.

@yrodiere (Contributor) commented Sep 9, 2022

The thread dumps provided in the original issue support the theory that the problem has to do with the queue of operations, not the execution of operations themselves.

My theory (explained in the JIRA ticket) was that this happens:

  • Client thread: needs to delete an existing value of type A to replace it with a value of the same type
  • Client thread: submits delete some entity of type A to the queue for type A.
  • Background thread for type A: picks delete some entity of type A from the queue
  • Background thread for type A: executes delete some entity of type A
  • Background thread for type A: marks the future as complete
  • Background thread for type A: executes thenCompose
  • Background thread for type A: adds the update some entity of type A to the queue for type A... but the queue is full, because someone else added many other operations! (let's assume the server is under heavy load)
  • Background thread for type A: blocks, waiting for a free slot in the queue. But this background thread is the only thread able to consume the queue!
  • Deadlock: the background thread is waiting for an operation that only itself can perform.

If that's actually what happens, then no, the order of operations is not the problem. What matters is that we're introducing an artificial dependency between the two operations via thenCompose.
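
To make that scenario concrete, here is a minimal standalone sketch in plain Java (hypothetical class, queue size, and thread names; this is not the actual Infinispan or Hibernate Search code). In practice it will almost always hang, which is the point: the only thread able to drain the bounded queue blocks while trying to enqueue the follow-up operation onto that same, full queue.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch, not the actual Infinispan/Hibernate Search code: one bounded queue
// with a single consumer thread, where the consumer itself tries to enqueue the follow-up
// operation (the thenCompose continuation) onto its own queue while that queue is full.
public class SelfDeadlockSketch {

   public static void main(String[] args) throws InterruptedException {
      BlockingQueue<Runnable> queueForTypeA = new ArrayBlockingQueue<>(1); // bounded, like the index queue

      // The "delete some entity of type A" operation; its continuation enqueues the
      // follow-up "update" onto the SAME queue, and it runs on the background thread.
      queueForTypeA.put(() -> {
         try {
            // Runs on the background thread. If the queue is full here (very likely, since
            // the main thread below keeps refilling it), this put() blocks forever: the only
            // consumer of queueForTypeA is the very thread executing this code.
            queueForTypeA.put(() -> System.out.println("update some entity of type A"));
         } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
         }
      });

      // Single background thread draining the queue, like the per-index writer thread.
      Thread backgroundA = new Thread(() -> {
         try {
            while (true) {
               queueForTypeA.take().run();
            }
         } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
         }
      }, "background-type-A");
      backgroundA.start();

      // Heavy load: other clients keep the queue full, so the nested put() above gets stuck.
      while (true) {
         queueForTypeA.put(() -> { /* some unrelated indexing operation */ });
      }
   }
}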

That being said, you're right that there's something suspicious: if, like Fabio suggested, the two operations are related to two different entity types, then they should be submitted to two different queues (I just checked the Hibernate Search code: they really should). So my theory wouldn't hold if Fabio is right.

In that case, I suppose it would be a slightly more complicated scenario with two background threads:

  • Client thread 1: needs to delete an existing entry of type A to replace it with a value of type B
  • Client thread 1: submits delete some entity of type A to the queue for type A
  • Background thread for type A: picks delete some entity of type A from the queue for type A
  • Background thread for type A: starts executing delete some entity of type A
  • Client thread 2: needs to delete an existing entry of type B to replace it with a value of type A
  • Client thread 2: submits delete some entity of type B to the queue for type B
  • Background thread for type B: picks delete some entity of type B from the queue for type B
  • Background thread for type B: starts executing delete some entity of type B
  • Background thread for type A: finishes executing delete some entity of type A
  • Background thread for type A: marks the future as complete
  • Background thread for type A: executes thenCompose
  • Background thread for type A: adds the update some entity of type B to the queue for type B... but the queue is full, because someone else added many other operations! (let's assume the server is under heavy load)
  • Background thread for type A: blocks, waiting for a free slot in the queue for type B.
  • Background thread for type B: finishes executing delete some entity of type B
  • Background thread for type B: marks the future as complete
  • Background thread for type B: executes thenCompose
  • Background thread for type B: adds update some entity of type A to the queue for type A... but the queue is full, because someone else added many other operations! (let's assume the server is under heavy load)
  • Background thread for type B: blocks, waiting for a free slot in the queue for type A.
  • Deadlock: background thread for type B is waiting for an operation that only the background thread for type A can perform, and vice versa.

While complex, I believe this scenario is possible, and gets more likely as an Infinispan node gets hammered with many requests that change the type of values.
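
The previous sketch can be extended to this two-queue variant (again with hypothetical names and queue sizes, not the real code): each background thread's thenCompose-style continuation enqueues the follow-up operation onto the other type's queue, and under heavy load both consumers end up blocked on a put() that only the other consumer could unblock.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the two-queue scenario, not the actual code.
public class CrossQueueDeadlockSketch {

   public static void main(String[] args) throws InterruptedException {
      BlockingQueue<Runnable> queueA = new ArrayBlockingQueue<>(1);
      BlockingQueue<Runnable> queueB = new ArrayBlockingQueue<>(1);

      // delete(A) whose continuation enqueues update(B), and delete(B) whose continuation enqueues update(A).
      queueA.put(continuationThatEnqueues(queueB, "update some entity of type B"));
      queueB.put(continuationThatEnqueues(queueA, "update some entity of type A"));

      startConsumer("background-type-A", queueA);
      startConsumer("background-type-B", queueB);

      // Heavy load: other clients keep both queues full, so the cross puts above get stuck.
      while (true) {
         queueA.put(() -> { /* unrelated operation for type A */ });
         queueB.put(() -> { /* unrelated operation for type B */ });
      }
   }

   // Returns the "delete" operation whose continuation submits the follow-up to the other queue.
   private static Runnable continuationThatEnqueues(BlockingQueue<Runnable> otherQueue, String followUp) {
      return () -> {
         try {
            // Blocks if the other queue is full; its only consumer may itself be blocked
            // trying to enqueue onto our queue, so the two background threads deadlock.
            otherQueue.put(() -> System.out.println(followUp));
         } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
         }
      };
   }

   // Single background thread draining one queue, like the per-index writer thread.
   private static void startConsumer(String name, BlockingQueue<Runnable> queue) {
      Thread thread = new Thread(() -> {
         try {
            while (true) {
               queue.take().run();
            }
         } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
         }
      }, name);
      thread.start();
   }
}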

And the fix is simple: since operations applied to type A and B require different resources (different indexes), we don't need to use thenCompose, which introduces an artificial (and unnecessary) inter-dependency. We can just execute the operations in parallel, and then there's no chance of deadlock.
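
A minimal sketch of that idea (the method names removeFromIndexes and updateIndexes stand in for the real calls, and the signatures are assumptions, not the actual patch): start both operations independently and combine them with allOf, instead of chaining the add onto the delete with thenCompose.

import java.util.concurrent.CompletableFuture;

// Sketch of the idea behind the fix, with placeholder methods and assumed signatures.
final class ParallelIndexingSketch {

   static CompletableFuture<Void> replaceWithDifferentType(Object oldValue, Object newValue) {
      // Before the fix: an artificial dependency that can deadlock under heavy load.
      // return removeFromIndexes(oldValue).thenCompose(ignored -> updateIndexes(newValue));

      // After the fix: two independent operations on two independent indexes run concurrently.
      CompletableFuture<Void> remove = removeFromIndexes(oldValue);
      CompletableFuture<Void> update = updateIndexes(newValue);
      return CompletableFuture.allOf(remove, update);
   }

   // Placeholders for the real indexing calls.
   static CompletableFuture<Void> removeFromIndexes(Object oldValue) {
      return CompletableFuture.completedFuture(null);
   }

   static CompletableFuture<Void> updateIndexes(Object newValue) {
      return CompletableFuture.completedFuture(null);
   }
}

With allOf, a slow or blocked delete no longer prevents the add from being submitted to its own queue, and vice versa.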

@fax4ever , you confirm that you weren't able to reproduce the problem when using only one type? And that your patch did fix the problem exhibited in your test:

@fax4ever (Contributor, Author):

@fax4ever , you confirm that you weren't able to reproduce the problem when using only one type? And that your patch did fix the problem exhibited in your test

Yes, I confirm, and yes, the patch did fix it.

@@ -159,6 +160,21 @@ public void testOverwriteIndexedValue() {
StaticTestingErrorHandler.assertAllGood(cache);
}

@Test
public void testOverwriteIndexedValue_heavyLoad() {
Contributor:

I didn't check this locally, so I'll trust that you checked this failed before your patch.

@fax4ever (Contributor, Author):

Actually, it does not fail; it hangs.

@fax4ever (Contributor, Author) commented Sep 7, 2022

LGTM, see comments below.

Thanks for the review @yrodiere!

@tristantarrant (Member)

Merged, thanks

@fax4ever (Contributor, Author) commented Sep 8, 2022

Thank you!

@fax4ever fax4ever deleted the ISPN-14115 branch September 8, 2022 16:52