Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test out the performance of batch-export using a sizeable test database #39

Closed
kimaina opened this issue Oct 19, 2020 · 29 comments
Closed
Assignees
Labels
question Further information is requested

Comments

@kimaina
Copy link
Collaborator

kimaina commented Oct 19, 2020

Test out the performance of batch-export using a sizeable test database within AMPATH. Metrics to report

  1. The test server specs,
  2. The size of the test data including summary stats
  3. The time it takes to complete batch-export
  4. Any other data that might seem informative
@kimaina kimaina added the question Further information is requested label Oct 19, 2020
@kimaina kimaina self-assigned this Oct 22, 2020
@kimaina
Copy link
Collaborator Author

kimaina commented Oct 26, 2020

I tried running the pipeline over the weekend, unfortunately, I'm experiencing some internal server error calling FHIR/ws/fhir2/R3/{Resource}. It is quite surprising the non-bulk endpoint works fine FHIR/ws/fhir2/R3/{Resource}/1

Using FHIR version 1.0.0-SNAPSHOT and OMR Version: 2.1.2

@bashir2 @ibacher @mozzy11 @jecihjoy @corneliouzbett

Type Exception Report

Message Servlet execution threw an exception

Description The server encountered an unexpected condition that prevented it from fulfilling the request.

Exception

javax.servlet.ServletException: Servlet execution threw an exception
	org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.openmrs.module.fhir2.web.filter.ForwardingFilter.doFilter(ForwardingFilter.java:60)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.fhir2.web.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:70)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.spa.filter.SpaFilter.doFilter(SpaFilter.java:54)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.web.filter.ModuleFilter.doFilter(ModuleFilter.java:54)
	org.openmrs.web.filter.OpenmrsFilter.doFilterInternal(OpenmrsFilter.java:108)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.springframework.orm.hibernate4.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:150)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
Root Cause

java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/JsonNode
	ca.uhn.fhir.parser.JsonParser.createJsonWriter(JsonParser.java:166)
	ca.uhn.fhir.parser.JsonParser.doEncodeResourceToWriter(JsonParser.java:183)
	ca.uhn.fhir.parser.BaseParser.encodeResourceToWriter(BaseParser.java:419)
	ca.uhn.fhir.parser.BaseParser.encodeResourceToWriter(BaseParser.java:403)
	ca.uhn.fhir.rest.server.RestfulServerUtils.streamResponseAsResource(RestfulServerUtils.java:925)
	ca.uhn.fhir.rest.server.RestfulResponse.streamResponseAsResource(RestfulResponse.java:87)
	ca.uhn.fhir.rest.server.interceptor.ExceptionHandlingInterceptor.handleException(ExceptionHandlingInterceptor.java:103)
	ca.uhn.fhir.rest.server.interceptor.ExceptionHandlingInterceptor.handleException(ExceptionHandlingInterceptor.java:63)
	ca.uhn.fhir.rest.server.RestfulServer.handleRequest(RestfulServer.java:1118)
	ca.uhn.fhir.rest.server.RestfulServer.doGet(RestfulServer.java:363)
	ca.uhn.fhir.rest.server.RestfulServer.service(RestfulServer.java:1722)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
	org.openmrs.module.web.ModuleServlet.service(ModuleServlet.java:70)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
	org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.openmrs.module.fhir2.web.filter.ForwardingFilter.doFilter(ForwardingFilter.java:60)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.fhir2.web.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:70)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.spa.filter.SpaFilter.doFilter(SpaFilter.java:54)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.web.filter.ModuleFilter.doFilter(ModuleFilter.java:54)
	org.openmrs.web.filter.OpenmrsFilter.doFilterInternal(OpenmrsFilter.java:108)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.springframework.orm.hibernate4.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:150)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
Root Cause

java.lang.ClassNotFoundException: com.fasterxml.jackson.databind.JsonNode
	org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1955)
	org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1798)
	org.openmrs.module.ModuleClassLoader.loadClass(ModuleClassLoader.java:571)
	java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	ca.uhn.fhir.parser.JsonParser.createJsonWriter(JsonParser.java:166)
	ca.uhn.fhir.parser.JsonParser.doEncodeResourceToWriter(JsonParser.java:183)
	ca.uhn.fhir.parser.BaseParser.encodeResourceToWriter(BaseParser.java:419)
	ca.uhn.fhir.parser.BaseParser.encodeResourceToWriter(BaseParser.java:403)
	ca.uhn.fhir.rest.server.RestfulServerUtils.streamResponseAsResource(RestfulServerUtils.java:925)
	ca.uhn.fhir.rest.server.RestfulResponse.streamResponseAsResource(RestfulResponse.java:87)
	ca.uhn.fhir.rest.server.interceptor.ExceptionHandlingInterceptor.handleException(ExceptionHandlingInterceptor.java:103)
	ca.uhn.fhir.rest.server.interceptor.ExceptionHandlingInterceptor.handleException(ExceptionHandlingInterceptor.java:63)
	ca.uhn.fhir.rest.server.RestfulServer.handleRequest(RestfulServer.java:1118)
	ca.uhn.fhir.rest.server.RestfulServer.doGet(RestfulServer.java:363)
	ca.uhn.fhir.rest.server.RestfulServer.service(RestfulServer.java:1722)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
	org.openmrs.module.web.ModuleServlet.service(ModuleServlet.java:70)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
	org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
	org.openmrs.module.fhir2.web.filter.ForwardingFilter.doFilter(ForwardingFilter.java:60)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.fhir2.web.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:70)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.spa.filter.SpaFilter.doFilter(SpaFilter.java:54)
	org.openmrs.module.web.filter.ModuleFilterChain.doFilter(ModuleFilterChain.java:70)
	org.openmrs.module.web.filter.ModuleFilter.doFilter(ModuleFilter.java:54)
	org.openmrs.web.filter.OpenmrsFilter.doFilterInternal(OpenmrsFilter.java:108)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.springframework.orm.hibernate4.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:150)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.openmrs.web.filter.StartupFilter.doFilter(StartupFilter.java:105)
	org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
	org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
Note The full stack trace of the root cause is available in the server logs.

Apache Tomcat/7.0.94```

@mozzy11
Copy link
Collaborator

mozzy11 commented Oct 26, 2020

@kimaina , are you trying to mean the pipeline cant query R3 Resources from the OpenMRS server ?? if i got you right ??

@kimaina
Copy link
Collaborator Author

kimaina commented Oct 26, 2020

Hi @mozzy11 Thanks for following up! Running locally using refApp works fine for both (R4 and R3) but against prod does not!

@mozzy11
Copy link
Collaborator

mozzy11 commented Oct 26, 2020

is the production server Updated with the latest build of the Fhir module ??

@kimaina
Copy link
Collaborator Author

kimaina commented Oct 26, 2020

is the production server Updated with the latest build of the Fhir module ??

We are currently using FHIR2 version 1.0.0-SNAPSHOT (deployed a few wks ago) and OMR Version: 2.1.2

@mozzy11
Copy link
Collaborator

mozzy11 commented Oct 26, 2020

@kimaina , the latest version is 1.2.0-SNAPSHOT . we have made quite many bug fixes since 1.0.0-SNAPSHOT and released 1.0.0 , 1.1.0 .
can you atleast deploy 1.1.0 ?? its more stable . https://addons.openmrs.org/show/org.openmrs.module.openmrs-fhir2-module

@kimaina
Copy link
Collaborator Author

kimaina commented Oct 26, 2020

@kimaina , the latest version is 1.2.0-SNAPSHOT . we have made quite many bug fixes since 1.0.0-SNAPSHOT and released 1.0.0 , 1.1.0 .
can you atleast deploy 1.1.0 ?? its more stable . https://addons.openmrs.org/show/org.openmrs.module.openmrs-fhir2-module

Thanks, @mozzy11 will keep you posted on this!

@mozzy11
Copy link
Collaborator

mozzy11 commented Oct 26, 2020

sure ,

@ibacher
Copy link
Collaborator

ibacher commented Oct 26, 2020

This commit in particular is likely relevant to this. Jackson2 was added to openmrs-core in 2.1.3.

@kimaina
Copy link
Collaborator Author

kimaina commented Oct 30, 2020

Thanks @ibacher

@kimaina
Copy link
Collaborator Author

kimaina commented Oct 30, 2020

Updates on this task, This pr solves the error
ERROR - SqlExceptionHelper.logExceptions(146) |2020-10-28 19:46:32,583| Value '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp when trying to hit batch/search FHIR endpoint. This will be deployed and we should have some preliminary results on this by next week. Thank you @corneliouzbett for working on this

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 2, 2020

A few more updates on this:

We are currently hitting timeouts (504) for endpoints with large volume of data e.g: /ws/fhir2/R3/Encounter?_count=1 or Patient or Obs. Need to do further investigations!

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

Upon further investigation, hitting an endpoint like /ws/fhir2/R3/Observation?_count=1 sends this query.

select this_.uuid as y0_ from obs this_ where this_.uuid in (select this_.uuid as y0_ from obs this_ where this_.voided=0) group by this_.uuid

which makes the API call to timeout when you have a large dataset (with or without setting _count=1)

I tried running the query against the reporting MySQL server (containing over 300M obs), and it took more than 10 minutes, and had to terminate the query.

Given that we are setting _count to 1, we expect the above query to have a limit clause, but then I remembered that the total is required to be calculated and included in the response JSON bundle, as shown below.

    "resourceType": "Bundle",
    "id": "9a1b7f9a-86b5-47e0-9e81-1aa935bdba2a",
    "meta": {
        "lastUpdated": "2020-11-03T12:32:01.658+00:00"
    },
    "type": "searchset",
    "total": 281710,
    "link": [
        {
            "relation": "self",
            "url": "http://localhost:8099/openmrs/ws/fhir2/R4/Observation?_count=1"
        },
        {
            "relation": "next",
            "url": "http://localhost:8099/openmrs/ws/fhir2/R4?_getpages=7702cb19-c890-4aaa-bc50-e52ab7312eed&_getpagesoffset=1&_count=1&_bundletype=searchset"
        }
    ],

maybe I'm missing something! @ibacher @mozzy11 @bashir2 ?

@ibacher
Copy link
Collaborator

ibacher commented Nov 3, 2020

Nope, that’s exactly right. Note that the FHIR _count really refers to the number of results per query page rather than anything else. There is an optional _total parameter that we currently don’t support, but which could be used to turn the calculation of the total count off. That would require some redesign of our paging provider to properly support. The issue is that what we’re currently doing is basically requesting the full list of UUIDs matching a query and then paging across that list, with the idea that that’s relatively performant and provides a consistent paging view at the expense of occasionally reporting data that no longer matches a query (if the underlying data were modified).

It strikes me that to best serve the bulk workflow, what we really want is some means of indicating that a given search request only ever wants a single page of data, that way we can eliminate some overhead (I’m open to suggestions about how we represent this).

Now the other part is (somehow) efficiently getting the total number of results so that the total number of requests (i.e., how many total pages) can be worked out. For this, I’m a bit stuck. For this particular query, we could speed things up by having an index on the voided and uuid columns, which has the advantage of basically precalculating the results (I’m assuming that if you look at the EXPLAIN output for the query on your live database, you’ll see it’s doing a full table scan). That works for the situation where we’re extracting all of the Obs, but not if we’re doing any additional filtering (a fundamental limitation of MySQL is that indexes are not composable, i.e., it cannot take advantage of the index I previously described if there are other filtering criteria).

But what I’d really want to push for this use case is implementing the FHIR Bulk Data IG described here. It’s a more complicated song-and-dance for both the client and server (the request is submitted asynchronously and then polled until complete; the resulting data needs to be separately fetched as an ndjson file—on this point, we can be flexible), but it does allow us to do this kind of heavy-duty querying without worrying about web request timeouts. It also gives us a space to come up with some less-naïve handling for bulk data requests (i.e., we can cache a number of lookups that happen as part of the translation process, etc.). This also could make it sensible to create appropriate database-level indexes for the necessary queries (I appreciate that adding indexes to the AMPATH database is going to be a painful process, but it’s the easiest way to make queries more efficient).

@bashir2
Copy link
Collaborator

bashir2 commented Nov 3, 2020

Thanks @kimaina for the analysis. For the query, I understand it is Hibernate who composes it but it seems the group by clause is redundant. Can you try the inner query (select this_.uuid from obs where voided=0) on your DB and see if it is any faster. Also please try the same with just counting (select count(this_.uuid) from obs where voided=0).

Note that with the initial query we not only need the total but also the list of uuids where for subsequent calls to the fhir2 module handles paging on that list automatically for us; when I was implementing the batch pipeline, that seemed the fastest/easiest way to implement the prototype but of course not necessarily the best way. Two other approaches that can help us are (i) manage the paging and list of uuids inside the pipeline and (ii) even consider doing #5 by which the pipeline can work directly off of MySQL with no dependence on OpenMRS service.

@ibacher, I agree with your suggestion around the bulk export but I think the right place for this is to be implemented inside the fhir2 module (is that what you meant?) and IIRC, we discussed this idea a few times in the FHIR Squad but we did not prioritize it. I would go even one step further and say instead of ndjson, we can even implement a binary serialization like Avro, Protocol Buffer, or even Parquet which will make the communication much faster and the pipeline code simpler.

@ibacher
Copy link
Collaborator

ibacher commented Nov 3, 2020

is that what you meant?

Yes, that functionality definitely belongs in the FHIR2 module. It's just being able to handle the FHIR asynchronous requests part has some (small) impact on the client (i.e., it's not a simple request -> response flow). Also, we need to actually implement it.

I would go even one step further and say instead of ndjson, we can even implement a binary serialization like Avro, Protocol Buffer, or even Parquet which will make the communication much faster and the pipeline code simpler.

I would very much like to explore exporting large datasets via a non-JSON format, though with Parquet, we're likely to run up against the same limitations (R3 only unless we build it).

it seems the group by clause is redundant

In this particular case it is; however, as a general rule, it necessary to avoid duplicate UUIDs when there are joined tables. I haven't really experimented with this, but I assumed the query optimiser could determine that the group by was redundant in this case.

Two other approaches that can help us are (i) manage the paging and list of uuids inside the pipeline and (ii) even consider doing #5 by which the pipeline can work directly off of MySQL with no dependence on OpenMRS service.

I would suggest that (i) is likely to be less performant, though it would work work. This is largely because of the overhead of needing to do a HTTP request-per-resource and (read-only) database transaction per-resource, but it might be interesting to compare the performance as an actual experiment. (ii) might be worth exploring too, although that does put a lot of work on the analytics engine that properly belongs with the rest of the FHIR work. A better way to approach that might be to see if we generalise the translation layer to rely less directly on OpenMRS objects.

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

Thank you @bashir2 and @ibacher for these great ideas and insights

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

Now the other part is (somehow) efficiently getting the total number of results so that the total number of requests (i.e., how many total pages) can be worked out. For this, I’m a bit stuck. For this particular query, we could speed things up by having an index on the voided and uuid columns, which has the advantage of basically precalculating the results (I’m assuming that if you look at the EXPLAIN output for the query on your live database, you’ll see it’s doing a full table scan). That works for the situation where we’re extracting all of the Obs, but not if we’re doing any additional filtering (a fundamental limitation of MySQL is that indexes are not composable, i.e., it cannot take advantage of the index I previously described if there are other filtering criteria).

Screenshot 2020-11-03 at 6 40 26 PM

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

(select this_.uuid from obs where voided=0)

This is extremely fast, executes in < 0.1 secs, possibly due to the first index

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

(select count(this_.uuid) from obs where voided=0).

This took more than 2 mins, canceled it!

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

The issue is that what we’re currently doing is basically requesting the full list of UUIDs matching a query and then paging across that list, with the idea that that’s relatively performant and provides a consistent paging view

can this be achieved with a simple query that does take into account the index? e.g see the above query

(select this_.uuid from obs where voided=0)
This is extremely fast, executes in < 0.1 secs, possibly due to the first index

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

On indexing it is also possible to force a particular index https://dev.mysql.com/doc/refman/5.6/en/index-hints.html
though I'm not sure how this can be done using hibernate

@bashir2
Copy link
Collaborator

bashir2 commented Nov 3, 2020

Okay just to wrap up what we discussed: If other simple solutions do not work, let's go with approach (i), i.e., we grab the list of uuids in the pipeline code directly from MySQL and manage the paging inside the pipeline code. We leave the FHIR conversion to OpenMRS but each time, request multiple uuids to be returned as a bundle as @ibacher suggested. Thanks @kimaina for volunteering to implement this; please create a ticket for this once you try other simple tweaks and are convinced this is necessary to be implemented.

@bashir2
Copy link
Collaborator

bashir2 commented Nov 3, 2020

(select this_.uuid from obs where voided=0)

This is extremely fast, executes in < 0.1 secs, possibly due to the first index

My guess is that this is fast only because you are grabbing a limited number of uuids. If you stream all uuids into an app (e.g., our batch pipeline) it should not be faster than the count query.

@ibacher
Copy link
Collaborator

ibacher commented Nov 3, 2020

Also for documentation: my initial reaction to @kimaina above findings is to try out the version of the FHIR module in this branch and see if that gets performance to a quasi-acceptable level. That branch should change this query:

select this_.uuid as y0_
from obs this_
where this_.uuid in (
    select this_.uuid as y0_ 
    from obs this_
    where this_.voided=0)
group by this_.uuid

To this:

select this_.uuid
from obs this_
where this_.voided=0

My guess is that this is fast only because you are grabbing a limited number of uuids.

Yeah, if that query is only asking for a subset of uuids, then we'd need to look elsewhere for a fix.

I just realised that the index this is relying on voided_status_of_observation looks to be an AMPATH-specific index, although I'd actually be supportive of getting a similar index into core if anyone wants to take that on.

@kimaina
Copy link
Collaborator Author

kimaina commented Nov 3, 2020

My guess is that this is fast only because you are grabbing a limited number of uuids. If you stream all uuids into an app (e.g., our batch pipeline) it should not be faster than the count query.

I agree, streaming in 300 million UUID would still make the API call to timeout

@bashir2
Copy link
Collaborator

bashir2 commented Jan 26, 2021

@kimaina can we repeat this experiment with the merged JDBC based batch pipeline against AMPATH data again? If everything is fine we can close this; I am curious if you can reproduce issue #92.

@kimaina
Copy link
Collaborator Author

kimaina commented Feb 9, 2021

Hi @bashir2 thank you for testing this out! I have been able to reproduce #92 will go ahead and try to figure out

@bashir2
Copy link
Collaborator

bashir2 commented Dec 6, 2022

Closing this until someone else starts evaluating the batch pipeline for OpenMRS.
For the record, with HAPI as the source, we have done some performance evaluations which are documented here and here.

@bashir2 bashir2 closed this as completed Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
Development

No branches or pull requests

4 participants