Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of getAllAfter queries into DTOs [8] #9320

Closed
4 tasks done
Tracked by #7372
StefanKock opened this issue May 24, 2022 · 7 comments · Fixed by #10282 or #10351
Closed
4 tasks done
Tracked by #7372

Improve performance of getAllAfter queries into DTOs [8] #9320

StefanKock opened this issue May 24, 2022 · 7 comments · Fixed by #10282 or #10351
Assignees
Labels
backend Affects the web backend change A change of an existing feature (ticket type) code-quality Reliability, maintainability, reusability, or testability general Not directly concerned with any particular functional section of the application performance Issues that are meant to increase the performance of the application refactoring Technical refactoring of an existing feature

Comments

@StefanKock
Copy link
Contributor

StefanKock commented May 24, 2022

Problem Description

As shown by the following analysis, many getAllAfter methods show an inperformant pattern:

  1. The "Entity"Service.getAllAfter method takes some seconds. As shown in Improve performance of PersonService.getAllAfter method [3] #8946 (comment), this can be improved by initially fetching only the ids (reduced distinct effort) and using a dedicated index with appropriate sorting.
  2. "Entity"Service.inJurisdictionOrOwned per each entity used for pseudonymization seems to be inperformant. For Cases it took ~330ms per entity, for Persons ~0,3ms per entity with in clause (PersonService.getInJurisdictionIDs).
Analysis

Dataset:

  • 1225343 persons
  • 85677 cases
  • 114782 contacts
  • 236583 tasks
  • 2600890 immunizations
  • 1991 vaccinations
  • 15693 samples
  • 11846 events
  • 534 eventparticipants

The following measurements were taken from backend logs (EJB methods) and the Postgres logs.
Observations:

  • the actual SQL queries are executed in few seconds while the EJB methods take minutes to complete
  • a relevant contribution to runtime comes from inJurisdictionOrOwned methods, here mainly the number of calls
  • batch size is relevant. Some requests did not complete within reasonable time for a batch size of 10000, these were measured for batch size 1000.

Requests:

  • http://localhost:6080/sormas-rest/persons/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 3 min 5 sec
    server: 2 min 58 sec
    image
    SQL queries:
    duration: 948.716 ms
    duration: 2297.275 ms
    duration: 4012.220 ms
    duration: 4393.538 ms
    duration: 696.277 ms
    ~ 12.5 sec
    Note: SQL query times are inaccurate, either due to an error in the analysis or changes introduced later, see later analysis of this method.

  • http://localhost:6080/sormas-rest/cases/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 8 min 21 sec
    server: 8 min 18 sec
    image
    SQL queries:
    duration: 787.011 ms
    duration: 449.616 ms
    duration: 146.224 ms
    duration: 178.515 ms
    duration: 154.552 ms
    duration: 136.836 ms
    duration: 142.597 ms
    duration: 143.688 ms
    duration: 150.584 ms
    duration: 152.298 ms
    duration: 133.814 ms
    duration: 147.687 ms
    ~ 3 sec

  • http://localhost:6080/sormas-rest/contacts/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 35 sec
    server: 35 sec
    image
    SQL queries:
    duration: 151.413 ms
    duration: 143.237 ms
    duration: 152.661 ms
    duration: 144.054 ms
    duration: 145.374 ms
    duration: 155.861 ms
    duration: 177.212 ms
    duration: 153.647 ms
    duration: 201.516 ms
    duration: 187.382 ms
    ~ 1.6 sec

  • http://localhost:6080/sormas-rest/tasks/all/1637090372005/1000/NO_LAST_SYNCED_UUID: 1 min 27 sec
    server: 1 min 24 sec
    image
    SQL queries:
    duration: 466.097 ms
    duration: 154.093 ms
    duration: 132.150 ms
    duration: 137.751 ms
    duration: 140.757 ms
    duration: 135.526 ms
    duration: 136.612 ms
    duration: 126.399 ms
    duration: 124.695 ms
    duration: 127.888 ms
    duration: 133.940 ms
    ~1.9 sec

  • http://localhost:6080/sormas-rest/samples/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 27 sec
    server: 27 sec
    image
    SQL queries:
    duration: 129.587 ms

  • http://localhost:6080/sormas-rest/immunizations/all/1637090372005/1000/NO_LAST_SYNCED_UUID (1min 50 sec)
    server: 1 min 48 sec
    image
    SQL queries:
    duration: 6985.128 ms
    duration: 106.608 ms
    ~ 7 sec

Proposed Change

  • 1. Rewrite AdoServiceWithUserFilter.getAllAfter to first fetch the needed ids (see pattern in PersonService.getAllAfter, then fetch the entities by id with IN-clause (use BaseAdoService.getByIds).
  • 2. Add indices for sorting.
  • 3. Use the pattern of PersonService.getInJurisdictionIDs to query by ids with IN clause also for other entities where "Entity"Service.inJurisdictionOrOwned is currently running one query per Entity.

Acceptance Criteria

  • An analysis before/after shows the performance improvement (also mentioning the amount of existing and queried entities). -> comment

Implementation Details

  • All getAllAfter and getInJurisdictionIDs methods avoid parameter limit exception with IterableHelper.executeBatched batching.

  • Remove or adapt not with superclass aligned getAllAfter implementations in:

    • CampaignService.getAllAfter
    • CampaignFormMetaService.getAllAfter
    • ContactService.getAllAfter (active contacts filter)
    • EventParticipantService.getAllAfter (not archived filter)
    • EventService.getAllAfter (configured EventUserFilterCriteria)
    • ImmunizationService.getAllAfter
    • BaseTravelEntryService.getAllAfter
  • Remove or adapt not with superclass aligned implementations parallel to getAllAfter:

Additional Information

  • Sibling to Improve performance of PersonService.getAllAfter method [3] #8946
  • Batch jurisdiction processing for Persons was done in Pseudonymization of sensitive data [3+2] #2130 (commit 305b169).
  • I tried to use CREATE INDEX CONCURRENTLY but that does not work within a transaction as it is used in the automatic db migration. If deployments get stuck because of INDEX creation (not sure what the likelyhood is), they can be created with CREATE INDEX CONCURRENTLY IF NOT EXISTS ... before on running instance.
  • This ticket improves the performance of the following functions:
    1. Sync for android-app (getAllAfter methods) for most to all entities.
    2. List (getIndexList) for AdditionalTest, PathogenTest
    3. List (getByUuids, getBy"Reference"/"criteria") for Case, Sample, Task, AdditionalTest, PathogenTest, Prescription, Treatment, Vaccination, Visit.

Performance Analysis

Preliminary analysis of performance before and after the changes introduced in PR #10282

(tested on dafd47d)

The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000.
A substantial improvement can be observed for all entities with long response times:

cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04

An exception is pathogen test with no improvement / slightly worse performance:

pathogentests (1000) .........00:01:15 -> 00:01:26

The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).

Analysis data:
performance_9320.zip

Detailed analysis

Full list of REST calls
Performance of REST calls (.../<entities>/all/) before the refacoring:

actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01

... and after:

actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00

Execution details for selected entities

Cases
before:
image

after:
image

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.

Contacts
before:
image

after:
image

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.

Immunizations
before:
image

after:
image

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue

Tasks
before:
image

after:
image

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.

Analysis for different batch sizes

A analysis with different batch sizes after completion of this task shows:

  • request durations are substantially improved (see times above)
  • there is little overhead (long running SQL queries etc.) for small batch sizes: for batch size 1, no entity exceeds a response time of 1second
  • for larger batch sizes, response time is approximately linear to the batch size
  • this is consistent across all entities (response time for 1000 pathogentests is down to 15 sec now)

Response times for selected entities:

cases
(1) ....................00:00:01
(10) ...................00:00:03
(100) ..................00:00:08
(500) ..................00:00:36
(1000) .................00:01:11
(5000) .................00:05:52

persons
(1) ..................00:00:01
(10) .................00:00:00
(100) ................00:00:01
(500) ................00:00:04
(1000) ...............00:00:07
(5000) ...............00:00:33

samples
(1) ..................00:00:01
(10) .................00:00:01
(100) ................00:00:02
(500) ................00:00:10
(1000) ...............00:00:17
(5000) ...............00:01:10

Analysis data:
differentBatchSizes.zip

@StefanKock StefanKock added backend Affects the web backend change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application general Not directly concerned with any particular functional section of the application labels May 24, 2022
@StefanKock StefanKock added refactoring Technical refactoring of an existing feature code-quality Reliability, maintainability, reusability, or testability labels May 24, 2022
@ClimedoHealthCodingTests

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since} No matter the size of the response e.g. even if only 1 task will be returned matching the time criteria. No response even for 5 minutes. The highest possible timespan that we got working on some instances is last 4 hours. Even that one takes a couple of minutes to respond. Let us know if you need more info around the issue and keep us posted about the outcome here.

@StefanKock
Copy link
Contributor Author

StefanKock commented May 31, 2022

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since}

Identifying the changed entities is what we want to improve. That includes also tasks.

@ClimedoHealthCodingTests

Perfect! Are there already some concrete plans when can this one be expected to make its way to production?

@StefanKock StefanKock changed the title Improve performance of getAllAfter queries into DTOs Improve performance of getAllAfter queries into DTOs [8] Jul 27, 2022
@StefanKock StefanKock added this to Backlog in SORMAS Team 2 - DEV - Iteration Backlog via automation Aug 31, 2022
@StefanKock StefanKock self-assigned this Sep 5, 2022
@StefanKock StefanKock moved this from Backlog to In Progress in SORMAS Team 2 - DEV - Iteration Backlog Sep 5, 2022
@StefanKock
Copy link
Contributor Author

StefanKock commented Sep 6, 2022

Discussed current status and next steps with @HolgerReiseVSys and @syntakker. Results:

  1. Update feature branch to current development branch
  2. @syntakker will start with performance tests on cases, contacts, immunizations, tasks, samples as prechecks
  3. Integrate CaseService.getAllActiveCasesAfter as all the other sibling methods.
  4. Refactor BaseAdoService.getAll to getList to be reusable with other filters, make it protected or leave public?
  5. Indizes for all entities that implement AdoServiceWithUserFilter

StefanKock added a commit that referenced this issue Sep 6, 2022
- Removed superfluous includeExtendedChangeDateFilters parameter
StefanKock added a commit that referenced this issue Sep 6, 2022
- Removed redundant getAll(Predicate, Integer) method
StefanKock added a commit that referenced this issue Sep 7, 2022
- Sorting matches BaseAdoService.getList queries
StefanKock added a commit that referenced this issue Sep 21, 2022
- Removed redundant getAll(Predicate, Integer) method
StefanKock added a commit that referenced this issue Sep 21, 2022
- Sorting matches BaseAdoService.getList queries
StefanKock added a commit that referenced this issue Sep 21, 2022
- Unified methods in CoreAdoServices for Facade:
  - getInJurisdictionIds: List<Long>
  - inJurisdictionOrOwned: boolean
 - inJurisdictionOrOwned: Predicate
- Made jurisdiction check and convertToDto resilient against null
entites
StefanKock added a commit that referenced this issue Sep 21, 2022
- new: BaseAdoService.getIdList with batching
StefanKock added a commit that referenced this issue Sep 21, 2022
Applied toPseudonymizedDtos and createPseudonymizer pattern in
corresponding Facades
StefanKock added a commit that referenced this issue Sep 21, 2022
- Adjusted to match patterns for CoreAdo, added VisitQueryContext
StefanKock added a commit that referenced this issue Sep 21, 2022
- Case dependent entities: SurveillanceReport, ClinicalVisit,
Prescription, Treatment
- Sample dependent entities: AdditionalTest, PathogenTest
- Immunization dependent entities: Vaccination
StefanKock added a commit that referenced this issue Sep 21, 2022
- Removed duplicate ContactJurisdictionFlagsDto
- Simplified jurisdiction check where only Contact itself is relevant
StefanKock added a commit that referenced this issue Sep 21, 2022
- Removed obsolete method BaseAdoService.getBatchedQueryResults
- Removed duplicate method CaseService.createActiveCasesFilter
- Fixed naming of a Case unit test
StefanKock added a commit that referenced this issue Sep 21, 2022
…getAllAfter_methods

#9320 improve getAllAfter methods and batch pseudonymization
SORMAS Team 2 - DEV - Iteration Backlog automation moved this from Review to Testing Sep 21, 2022
@StefanKock
Copy link
Contributor Author

Changes on performance tooling still open

@StefanKock StefanKock reopened this Sep 21, 2022
SORMAS Team 2 - DEV - Iteration Backlog automation moved this from Testing to In Progress Sep 21, 2022
@StefanKock StefanKock moved this from In Progress to Review in SORMAS Team 2 - DEV - Iteration Backlog Sep 21, 2022
syntakker added a commit that referenced this issue Sep 21, 2022
syntakker added a commit that referenced this issue Sep 27, 2022
syntakker added a commit that referenced this issue Sep 27, 2022
StefanKock pushed a commit that referenced this issue Sep 27, 2022
* Performance analysis: process directories #9320
* Default log level TRACE for cargo server #9320
* Automate REST performance analysis #9320
* update TROUBLESHOOTING.md #9320
SORMAS Team 2 - DEV - Iteration Backlog automation moved this from Review to Testing Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Affects the web backend change A change of an existing feature (ticket type) code-quality Reliability, maintainability, reusability, or testability general Not directly concerned with any particular functional section of the application performance Issues that are meant to increase the performance of the application refactoring Technical refactoring of an existing feature
Projects
None yet
3 participants