A weakness in our handling of FHIR history operations has been reported.
Specifically, on a server with a very large number of resources, if the history operation is executed by many clients (e.g. 200) concurrently, the server becomes unresponsive and ultimately consumes a large amount of disk and becomes unstable.
Our investigation has revealed that the root cause is a SELECT COUNT(...) query that executes at the start of all _history operations. Essentially, anytime a _history is executed, the server executes 2 SQL statements (statements here are approximate):
A SELECT COUNT(pid) FROM hfj_res_ver is performed to supply a value for Bundle.total
A SELECT * FROM hfj_res_ver ORDER BY updated DESC LIMIT 100 is performed to supply the contents
The second query is executed against an index and is very fast. The first query by its nature requires a full index scan and is slow. Executing it 200 concurrent times quickly overwhelms the database and leads to timeouts, exceptions, and eventually instability.
The proposed fix is as follows:
A new DaoConfig setting is added. This setting introduces a "history count mode" with 3 options:
Cached. This is the new default: A loading cache will be used for history counts, meaning that counts are stored in RAM for up to one minute, and the loading cache blocks all but one client thread per JVM from actually performing the count. This effectively throttles access to the database.
Not cached. This is the status quo and does exhibit the weakness described here, but may be appropriate in scenarios where users are trusted and accuracy is always required.
No count. This setting avoids the count query entirely, saving time and avoiding this weakness at the expense of not including any total in the response.
A huge thanks to Zachary Minneker at Security Innovation who discovered and submitted a responsible disclosure of this issue.
This issue will be resolved for the upcoming 5.4.0 release. A CVE number is forthcoming.
The text was updated successfully, but these errors were encountered:
A weakness in our handling of FHIR history operations has been reported.
Specifically, on a server with a very large number of resources, if the history operation is executed by many clients (e.g. 200) concurrently, the server becomes unresponsive and ultimately consumes a large amount of disk and becomes unstable.
Our investigation has revealed that the root cause is a
SELECT COUNT(...)query that executes at the start of all_historyoperations. Essentially, anytime a_historyis executed, the server executes 2 SQL statements (statements here are approximate):SELECT COUNT(pid) FROM hfj_res_veris performed to supply a value forBundle.totalSELECT * FROM hfj_res_ver ORDER BY updated DESC LIMIT 100is performed to supply the contentsThe second query is executed against an index and is very fast. The first query by its nature requires a full index scan and is slow. Executing it 200 concurrent times quickly overwhelms the database and leads to timeouts, exceptions, and eventually instability.
The proposed fix is as follows:
A huge thanks to Zachary Minneker at Security Innovation who discovered and submitted a responsible disclosure of this issue.
This issue will be resolved for the upcoming 5.4.0 release. A CVE number is forthcoming.
The text was updated successfully, but these errors were encountered: