-
Notifications
You must be signed in to change notification settings - Fork 35
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #1311 from open-zaak/docs/notification-performance
document notification performance test
- Loading branch information
Showing
2 changed files
with
155 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,3 +13,4 @@ Performance is measured using different types of scenarios. | |
profiling | ||
scenarios | ||
apachebench | ||
notifications |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
.. _performance_notifications: | ||
|
||
Notifications performance | ||
========================= | ||
|
||
Open Zaak sends notifications using `Celery`_ as an asynchronous task queue. Notification delivery | ||
failure to the Notifications API/service may happen, which is why we leverage "autoretry" for | ||
these failed tasks. | ||
|
||
The reliability and performance of sending notifications was analyzed and the results are reported | ||
below. In this test case, we observed the behaviour of notifications triggered by "case create" | ||
events. | ||
|
||
|
||
Sending notifications under burst load | ||
-------------------------------------- | ||
|
||
We measure the notifications median time and failure rate under an increasing | ||
amount of concurrent users. The goal is to observe the reliability of notification | ||
delivery and delay between notification scheduling and it actually arriving at the | ||
Notifications API. | ||
|
||
**Test specifications:** | ||
|
||
* 1 celery worker | ||
* 3 processes for the worker | ||
* no waiting time between requests for 1 user | ||
* all users send requests to create zaak (``POST /api/v1/zaken``) | ||
* time under load is 1 minute | ||
|
||
.. csv-table:: Notification performance results per users | ||
:header-rows: 1 | ||
|
||
Users,Request Count,Requests/s,Failure Count,Median Response Time (ms),Notifications failure,Median Notification Time (ms) | ||
10,1430,24.183,0,390,0,88.572 | ||
20,1479,25.034,0,740,0,83.922 | ||
40,1429,24.189,0,1200,0,86.557 | ||
|
||
The data suggests no reliability issues, even with only a single Celery worker, as the failure | ||
rates are 0 for both the API operation and notification delivery. | ||
|
||
Note that the worker containers can scale horizontally and we recommend deploying at | ||
least two worker containers, ideally distributed over different hardware instances for | ||
high-available set-ups. | ||
|
||
.. _Celery: https://docs.celeryq.dev/en/stable/ | ||
|
||
|
||
Automatic retry for notifications | ||
--------------------------------- | ||
|
||
By default, sending notifications has automatic retry behaviour, i.e. if the notification | ||
publishing task has failed, it will automatically be scheduled/tried again until the maximum | ||
retry limit has been reached. | ||
|
||
Autoretry explanation and configuration | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In **Configuratie > Notificatiescomponentconfiguratie** admin page the autoretry settings | ||
can be configured: | ||
|
||
* **Notification delivery max retries**: the maximum number of retries the task queue | ||
will do if sending a notification failed. Default is ``5``. | ||
* **Notification delivery retry backoff**: a boolean or a number. If this option is set to | ||
``True``, autoretries will be delayed following the rules of exponential backoff. If | ||
this option is set to a number, it is used as a delay factor. Default is ``3``. | ||
* **Notification delivery retry backoff max**: an integer, specifying number of seconds. | ||
If ``Notification delivery retry backoff`` is enabled, this option will set a maximum | ||
delay in seconds between task autoretries. Default is ``48`` seconds. | ||
|
||
With the assumption that the requests are done immediately we can model the notification | ||
tasks schedule with the default configurations: | ||
|
||
1. At 0s the request to create a zaak is made, the notification task is scheduled, picked up | ||
by worker and failed | ||
2. At 3s with 3s delay the first retry happens (1s * ``Notification delivery retry backoff``) | ||
3. At 9s with 6s delay - the second retry | ||
4. At 21s with 12s delay - the third retry | ||
5. At 45s with 24s delay - the fourth retry | ||
6. At 1m33s with 48s delay - the fifth retry, which is the last one. | ||
|
||
So if the Notification API is up after 1 min of downtime the default configuration can handle it | ||
automatically. | ||
|
||
Autoretry test with default configuration | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In this scenario, we measure how autoretry behaves with increasing downtime durations of | ||
the notification service and with the default autoretry configuration. | ||
|
||
**Test specifications:** | ||
|
||
* 1 celery worker | ||
* 3 processes for the worker | ||
* 10 concurrent users | ||
* no waiting time between requests for 1 user | ||
* all users send requests to create zaak (``POST /api/v1/zaken``) | ||
* requests are sent for a total duration of 1 min | ||
* auto retry settings: default (5 maximum attempts per task) | ||
|
||
.. csv-table:: Notification autoretries per downtime | ||
:header-rows: 1 | ||
|
||
Notification downtime (s),Request Count,Failure Count,Notifications failed,Notifications retried,Notifications processed | ||
10,1486,0,0,919,2114 | ||
60,1478,0,0,3729,5195 | ||
300,1509,0,1276,6380,7908 | ||
|
||
* ``Notifications retried`` includes all notifications that were automatically retried. | ||
The minimum number equals the number of failed notifications. If all notifications are eventually failed | ||
``Notifications retried`` = ``Notifications failed`` * ``maximum retries per task``. | ||
* ``Notifications processed`` includes failed, retried and succeeded notifications. | ||
|
||
According to the result if the notification server is down for less than 1 min, then all notifications | ||
would be eventually send. If it's expected that the notification service is down for longer time | ||
the autoretry configuration should be adjusted in the admin. | ||
|
||
Autoretry test with adjusted configuration | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In the last case in the test run, when the notification server was down for 300 seconds, | ||
all notifications have failed. Let's change the autoretry configuration, so the notifications | ||
will be retrying during 5 minutes: | ||
|
||
* **Notification delivery max retries** is changed to ``7``. | ||
* **Notification delivery retry backoff max** is changed to ``192``. | ||
|
||
With this configuration we expect the following task schedule: | ||
|
||
7. At 189s with 96s delay - the 6th retry | ||
8. At 381s with 192s delay - the 7th retry, which is now the last one. | ||
|
||
**Test specifications:** | ||
|
||
* 1 celery worker | ||
* 3 processes for the worker | ||
* 10 concurrent users | ||
* no waiting time between requests for 1 user | ||
* all users send requests to create zaak (``POST /api/v1/zaken``) | ||
* requests are sent for a total duration of 1 min | ||
* auto retry settings: ``7`` | ||
|
||
.. csv-table:: Notification autoretries per downtime | ||
:header-rows: 1 | ||
|
||
Notification downtime (s),Request Count,Failure Count,Notifications failed,Notifications retried,Notifications processed | ||
10,1328,0,0,639,2002 | ||
60,1335,0,0,3846,5209 | ||
300,1262,0,0,7427,8717 | ||
600,1393,0,1181,8267,9687 | ||
|
||
The adjusted autoretry configuration resulted in 0 failed notifications for 5 min downtime with the tradeoff of | ||
the increased amount of the retried ones. However the adjusted settings were not efficient for the 10 min downtime. | ||
Therefore we advice to take into account the statistics of server downtimes before adjusting autoretry settings. |