This repository has been archived by the owner on Nov 9, 2022. It is now read-only.

Matching is too slow for 100K+ documents #317

Closed
anthony-gray opened this issue May 22, 2019 · 2 comments

@anthony-gray

anthony-gray commented May 22, 2019

Matching just 1,000 documents (excluding merging) takes about 1 min 11 s, whereas it should take about 111 ms. Matching needs to scale to 1 million+ documents. Adding threads does not solve the problem; this is a design issue.

The code should use the results of cts:value-tuples to determine the URIs (via cts:uris) that correspond to matches, instead of running cts:search against the provided URIs. For example, pass cts:path-reference("memberId") and cts:and-not-query(cts:collection-query("mdm-content"), cts:collection-query("deleted")) to cts:value-tuples, then use the result to produce one query per value, such as cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "1"))), cts:collection-query("deleted")) and cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "2"))), cts:collection-query("deleted")). Each of these queries is passed to cts:uris to find the matching URIs.

I tested matching performance by calling proc:process-match-and-merge-with-options($uris, $mergeOptions, cts:true-query()) and stopping before the merge-impl:save-merge-models-by-uri call in that function, simply returning the consolidated matches.
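To make the difference between the two designs concrete, here is a minimal in-memory sketch in Python. It is an analogy only, not MarkLogic code and not the library's implementation: the `DOCS` data, `in_scope`, `match_per_document`, and `match_by_value` are all hypothetical names invented for illustration. The dictionary pass in `match_by_value` stands in for reading distinct values from a range index (the role cts:value-tuples plays above) and then fetching the URIs for each value (the role of cts:uris).

```python
from collections import defaultdict

# Hypothetical stand-in for the database: URI -> (collections, memberId value).
DOCS = {
    "/a.xml": ({"mdm-content"}, "1"),
    "/b.xml": ({"mdm-content"}, "1"),
    "/c.xml": ({"mdm-content"}, "2"),
    "/d.xml": ({"mdm-content", "deleted"}, "1"),
    "/e.xml": ({"other"}, "2"),
}


def in_scope(collections):
    """Analogue of: in "mdm-content" and not in "deleted"."""
    return "mdm-content" in collections and "deleted" not in collections


def match_per_document(docs):
    """One full scan per input document, mirroring a search per provided URI."""
    matches = {}
    for uri, (cols, member_id) in docs.items():
        if not in_scope(cols):
            continue
        matches[uri] = sorted(
            other
            for other, (other_cols, other_id) in docs.items()
            if other != uri and in_scope(other_cols) and other_id == member_id
        )
    return matches


def match_by_value(docs):
    """Single pass: group in-scope URIs by their memberId value."""
    groups = defaultdict(list)
    for uri, (cols, member_id) in docs.items():
        if in_scope(cols):
            groups[member_id].append(uri)
    return {value: sorted(uris) for value, uris in groups.items()}
```

Under these assumptions, `match_per_document` does work proportional to the square of the document count, while `match_by_value` touches each document once and yields one group of URIs per distinct value. Whether the same ratio holds in the real system depends on index configuration and data distribution.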

@ryanjdew
Contributor

The 1.3.1 release will allow index references to be associated with properties in the match options, so that range indexes can be used in the match queries.

Also, some additional minor optimizations were made, but significant optimization will require outside orchestration. Data Hub 5 provides orchestration for batching the process and we have it on our roadmap to adjust the logic to both avoid duplicate work and avoid deadlocks for best performance.

@anthony-gray
Author

I will have to test this. I am very skeptical that you can achieve fast, memory-efficient matching performance for 1 million+ documents without using cts:value-tuples. You say "significant optimization will require outside orchestration." I can't post our code, but I achieved the ~111 ms I mentioned using a single-threaded, configurable method of a little more than 100 lines of XQuery. This method handles configurable fuzzy matching and any combination of and/or conditions for matching, and it doesn't require the caller to pass the actual URIs to be matched, only the conditions and collections for matching. All of this is to say that outside orchestration is not required, and that highly configurable, highly performant, scalable matching for millions of documents is attainable.
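The code described above was not posted, so the following is not that method. It is only a small Python sketch of one way that and/or match conditions can be evaluated over groups of URIs that share a value, without passing in the URIs to be matched. The `RECORDS` data, the condition format, and the functions `pairs_sharing` and `evaluate` are all hypothetical, and fuzzy matching is left out.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: URI -> property values.
RECORDS = {
    "/1.xml": {"memberId": "1", "email": "a@x.org", "zip": "10001"},
    "/2.xml": {"memberId": "1", "email": "b@x.org", "zip": "10002"},
    "/3.xml": {"memberId": "2", "email": "a@x.org", "zip": "10001"},
    "/4.xml": {"memberId": "3", "email": "c@x.org", "zip": "10001"},
}


def pairs_sharing(records, prop):
    """Return the set of URI pairs whose value for `prop` is identical."""
    groups = defaultdict(list)
    for uri, props in records.items():
        if prop in props:
            groups[props[prop]].append(uri)
    return {
        pair
        for uris in groups.values()
        for pair in combinations(sorted(uris), 2)
    }


def evaluate(records, condition):
    """Evaluate a condition tree.

    A condition is either a property name (exact match on that property)
    or a tuple ("and" | "or", [sub-conditions]).
    """
    if isinstance(condition, str):
        return pairs_sharing(records, condition)
    op, subconditions = condition
    results = [evaluate(records, sub) for sub in subconditions]
    if op == "and":
        return set.intersection(*results)
    if op == "or":
        return set.union(*results)
    raise ValueError(f"unknown operator: {op}")


# Match on memberId, or on email and zip together.
CONDITION = ("or", ["memberId", ("and", ["email", "zip"])])
MATCHES = sorted(evaluate(RECORDS, CONDITION))
```

In this sketch the caller supplies only the conditions and the record set, and "and" and "or" reduce to set intersection and union over candidate pairs. A database-backed version would presumably take the per-value groups from range indexes rather than from a Python dictionary.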
