
Store event counters and client application timestamps per node to improve performance of replication #1819

Closed
fredreichbier opened this issue Aug 22, 2019 · 4 comments · Fixed by #1849

Comments

@fredreichbier
Member

commented Aug 22, 2019

In replicated setups like Galera, we sometimes have problems with multiple nodes trying to write to the same database row at the same time. This happens especially in case of the clientapplication table, which records the last seen date of the client application and is modified in each /validate/check request. If counting event handlers are defined, the same happens for the eventcounter tables. In case of Galera, multiple nodes trying to write to the same row might cause deadlocks (see #1268).

We could try to mitigate the issue by having each privacyIDEA node write counters only to its own private table row. Take eventcounter as an example: we could add a node column to the table which holds the name of the privacyIDEA node that the counter belongs to. Assume that we have three privacyIDEA nodes (A, B and C), each connected to its own Galera node.

Then, A updates an event counter like this: UPDATE eventcounter SET counter_value = counter_value + 1 WHERE counter_name = 'mycounter' AND node = 'A'

And B updates an event counter like this: UPDATE eventcounter SET counter_value = counter_value + 1 WHERE counter_name = 'mycounter' AND node = 'B'

C updates an event counter like this: UPDATE eventcounter SET counter_value = counter_value + 1 WHERE counter_name = 'mycounter' AND node = 'C'

Then, when retrieving the event counter value with a given name, we would need to sum up all rows of all nodes: SELECT SUM(counter_value) FROM eventcounter WHERE counter_name = 'mycounter'.

This way, each Galera node would only try to write to its own private row of eventcounter, which would reduce the number of deadlocks.

However, this would only work if each privacyIDEA node has its own Galera node. It wouldn't work, for example, if the privacyIDEA nodes connect to a load balancer that selects the Galera node to use.
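
A rough sketch of what the per-node counters could look like on the database side (MariaDB/MySQL syntax; the constraint name and the upsert statement are only illustrative, not the final implementation):

-- Sketch only: add a per-node column so that each privacyIDEA node owns exactly
-- one row per counter name.
ALTER TABLE eventcounter ADD COLUMN node VARCHAR(255) NOT NULL DEFAULT '';
ALTER TABLE eventcounter ADD CONSTRAINT evctr_name_node UNIQUE (counter_name, node);

-- Node A increments its own row, creating it on first use
-- (MariaDB/MySQL upsert syntax; the actual implementation may differ).
INSERT INTO eventcounter (counter_name, node, counter_value)
VALUES ('mycounter', 'A', 1)
ON DUPLICATE KEY UPDATE counter_value = counter_value + 1;

-- Reading the counter sums up the per-node rows.
SELECT SUM(counter_value) FROM eventcounter WHERE counter_name = 'mycounter';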

@cornelinux

Member

commented Aug 30, 2019

I think it would even work if there are several privacyIDEA application servers all writing to the same database, since each application server has its own node specifier:

nodeA, nodeB and nodeC writing to one database would result in each node updating its own row.
Or am I missing something?

I like this idea a lot and I think we should address this in version 3.2.

@fredreichbier

Member Author

commented Sep 2, 2019

Do you mean the scenario that we have no database-level replication in place, but three privacyIDEA servers nodeA, nodeB and nodeC all writing to the same database db1?
I also think this should work, though I'm not sure how much of a performance improvement we would get in this case -- but I guess it can't hurt :)

@fredreichbier fredreichbier self-assigned this Sep 2, 2019

@fredreichbier fredreichbier added this to To do in privacyIDEA 3.2 via automation Sep 2, 2019

@fredreichbier fredreichbier moved this from To do to In progress in privacyIDEA 3.2 Sep 2, 2019

fredreichbier added a commit that referenced this issue Sep 2, 2019
fredreichbier added a commit that referenced this issue Sep 2, 2019
@fredreichbier

Member Author

commented Sep 5, 2019

I performed some tests of the code in #1833, which implements private rows for event counters.

Setup

  • Galera cluster with 3 database nodes h1, h2, h3
  • Two privacyIDEA nodes pinode1, pinode2
  • pinode1 connects to h1
  • pinode2 connects to h3
  • Three post-event handlers counting the total number of /validate/check requests, the number of successful /validate/check requests and the number of failed /validate/check requests, respectively
  • 1000 tokens assigned to users from an LDAP resolver

Test

  • 1000 successful /validate/check requests to pinode1, performed by 4 concurrent threads
  • 1000 successful /validate/check requests to pinode2, performed by 4 concurrent threads

Results on current master

-> pinode1:
Transactions:		         973 hits
Availability:		       97.30 %
Elapsed time:		      319.45 secs
Data transferred:	        0.80 MB
Response time:		        1.30 secs
Transaction rate:	        3.05 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        3.96
Successful transactions:         973
Failed transactions:	          27
Longest transaction:	        2.72
Shortest transaction:	        0.26

-> pinode2:
Transactions:		         971 hits
Availability:		       97.10 %
Elapsed time:		      321.62 secs
Data transferred:	        0.80 MB
Response time:		        1.31 secs
Transaction rate:	        3.02 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        3.96
Successful transactions:         971
Failed transactions:	          29
Longest transaction:	        2.86
Shortest transaction:	        0.20

In total, we have 56 failed requests due to deadlocks (some of them caused by a deadlock in the eventcounter table, some by a deadlock in the clientapplication table).

Results with #1833

-> pinode1:
Transactions:		         992 hits
Availability:		       99.20 %
Elapsed time:		      339.57 secs
Data transferred:	        0.80 MB
Response time:		        1.34 secs
Transaction rate:	        2.92 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        3.92
Successful transactions:         992
Failed transactions:	           8
Longest transaction:	        2.83
Shortest transaction:	        0.17
 
-> pinode2:
Transactions:		         997 hits
Availability:		       99.70 %
Elapsed time:		      345.88 secs
Data transferred:	        0.80 MB
Response time:		        1.37 secs
Transaction rate:	        2.88 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        3.94
Successful transactions:         997
Failed transactions:	           3
Longest transaction:	        3.77
Shortest transaction:	        0.13

In total, we have 11 deadlocks, probably due to the clientapplication table.
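
One way to confirm which table is involved is to look at the latest detected deadlock on one of the Galera nodes (assuming MariaDB/InnoDB; this is just how I would check it, not part of the change itself):

-- The "LATEST DETECTED DEADLOCK" section of the output lists the conflicting
-- statements and rows, which identifies the affected table.
SHOW ENGINE INNODB STATUS;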

Summary

With #1833, we have far fewer deadlocks (11 instead of 56), but request handling seems to be slightly slower. :)

So it's questionable whether we should merge this.

fredreichbier added a commit that referenced this issue Sep 5, 2019
fredreichbier added a commit that referenced this issue Sep 9, 2019
Store clientapplication "last seen" info per privacyIDEA node
This also adds a corresponding migration script and tests.

Working on #1819
@fredreichbier

Member Author

commented Sep 9, 2019

As #1833 has been merged, I opened #1849 to address the clientapplication table. With these changes, no deadlocks occur anymore in the setup described above. I'll do some more detailed performance tests this week.
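
For reference, the clientapplication change follows the same pattern as the event counters; a rough sketch (the column names and the 'PAM' client type are assumptions for illustration, not taken from #1849):

-- Sketch only: each privacyIDEA node keeps its own "last seen" row per client,
-- so concurrent /validate/check requests on different nodes touch different rows.
UPDATE clientapplication
SET lastseen = NOW()
WHERE ip = '192.0.2.10' AND clienttype = 'PAM' AND node = 'pinode1';

-- When showing client application information, the most recent timestamp
-- across all nodes is what matters.
SELECT ip, clienttype, MAX(lastseen) AS lastseen
FROM clientapplication
GROUP BY ip, clienttype;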

privacyIDEA 3.2 automation moved this from In progress to Done Sep 11, 2019
